Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .

John suggests "the goal of CIF being compatible with general-purpose text tools"

This is possibly the crux of the matter.
Unless a general-purpose text tool is capable of determining the text encoding system, it ain't going to be much use
for a CIF that was encoded on a different system and uses non-ASCII chars.
By extending the character set beyond ASCII, we have to accept that not all general-purpose text tools are going to
be applicable as CIF editors/viewers.

John asks: "So (everyone), within your domain, do you then favor addressing the encoding question administratively?"

Whatever is decided upon regarding encoding, I would hope, at least in the short term, that IUCr journals would
attempt to accommodate a variety of encodings even if they were illegal according to the spec, then process the CIF
accordingly and notify the author that their encoding wasn't actually valid (then point them at software
that would help them in the future). This would only be possible if the encodings are unambiguously identifiable.
So you may say that adding a declaration would help in these cases. But I must reiterate that I do not believe we
can rely on manually edited encoding declarations in the CIF world.
Indeed, they cannot be relied upon in HTML. For example, I have a program that creates PDFs from HTML.
It uses HTML encoding declarations to determine the encoding; if the declaration is absent it attempts to
identify Unicode by BOM. When it fails to render the HTML correctly it is
because the declaration is either incorrect or missing.
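
To make that declaration-then-BOM order concrete, here is a rough Python sketch. It is not the code of the program mentioned above; the function name guess_html_encoding, the regular expression, and the 1024-byte search window are all assumptions made for illustration.

```python
import codecs
import re

# Byte-order marks, longest first so that the UTF-32 LE mark (FF FE 00 00)
# is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical pattern: picks up both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def guess_html_encoding(data: bytes):
    """Return (encoding, source); source is 'declaration', 'bom' or 'unknown'."""
    # Step 1: trust an in-document declaration if one is found near the start.
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "declaration"
    # Step 2: no declaration, so fall back to looking for a BOM.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "bom"
    # Step 3: nothing to go on.
    return None, "unknown"
```

Note that step 1 takes the declaration at face value: a UTF-8 file that declares ISO-8859-1 is reported as ISO-8859-1, and a file with neither declaration nor BOM comes back as unknown. Those are exactly the two failure cases described above.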

This may seem irrelevant when you are processing 'inhouse' CIFs using 'inhouse' systems where you know about and control the encoding, but when you are offering
automated services such as printCIF, which many authors use along with checkCIF as a means of validating their CIF prior to
submission, then the chances of failure (and thus disgruntled authors) increase with the number of encoding systems that
are allowed. Better to have *inherently* identifiable encodings and software support to implement them?
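
As a rough illustration of what "inherently identifiable" could mean in software, here is a Python sketch. It assumes, purely for the sake of example, that the recognisable set is ASCII, UTF-8 and BOM-prefixed UTF-16; that set is not taken from any spec draft, and the function name identify_inherently is made up.

```python
import codecs


def identify_inherently(data: bytes):
    """Return an encoding name if the bytes identify themselves, else None.

    Assumed set for this sketch: ASCII, UTF-8 (with or without BOM) and
    UTF-16 with a BOM. UTF-32 and code-page encodings are not considered.
    """
    if data.startswith(codecs.BOM_UTF8):
        candidate = "utf-8-sig"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidate = "utf-16"
    else:
        candidate = None

    if candidate is not None:
        # A BOM is only a hint; confirm that the rest decodes cleanly.
        try:
            data.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            return None

    # No BOM: pure 7-bit content is unambiguous.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # UTF-8 has a strict byte structure, so a successful strict decode is
    # strong evidence that the content really is UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None  # e.g. a code-page encoding; cannot be told apart


# A UTF-8 CIF fragment is recognised; the same text in Latin-1 is not.
print(identify_inherently("_cell_length_a 5.43 Å".encode("utf-8")))
print(identify_inherently("_cell_length_a 5.43 Å".encode("latin-1")))
```

The check is not watertight: a few code-page byte sequences happen to be valid UTF-8 as well (the Latin-1 bytes for "Ã©" are one), so this is strong evidence rather than proof. The reverse direction has no equivalent test at all, because almost any byte string is a legal Latin-1 or Windows-1252 document.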

This is my view from the domain I am familiar with.



From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 28 June, 2010 15:47:43
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .. .

On Friday, June 25, 2010 6:17 PM, SIMON WESTRIP wrote:

[Herb Bernstein wrote:]
>>  We don't need everybody to be doing the same thing.  We need everybody
>>to be able to send everybody else their information in a form in which
>>other people can correctly understand what they have been sent.
>I totally agree with this - which is why I have advocated that the standard should be totally unambiguous and
>at the same time be as accessible as possible. I believe that I have expressed before an acceptance that we
>may have to adopt a certain degree of heuristic encoding determination in order to accommodate user practice;
>I do not shy away from this. I am, however, seeking a way to avoid, if possible, the ambiguity that code-page based
>encodings present.

As Herb has steadfastly maintained, there is no such way.  In particular, only by controlling the encoding process can you avoid having to deal with every conceivable text encoding scheme.  Given the specification and history of CIF1, and the goal of CIF being compatible with general-purpose text tools, it is unreasonable to believe that standardizing one or a few official text encoding schemes for CIF2 will provide an effective control on CIF2 encoding.  (That's important, so let's discuss it further if there is disagreement.)  That leaves everyone, in practice, having to deal with every conceivable encoding.

That does not mean that anyone must deal *equally* with every encoding, however.  That would be impossible, unless you count manipulations that are insensitive to the encoding.  The current spec draft provides little guidance here: its requirement for UTF-8 could be taken to mean that CIF2 processors must reject otherwise-conformant CIFs encoded via some other scheme, but few here seem to anticipate that their software will actually be that strict.

So (everyone), within your domain, do you then favor addressing the encoding question administratively?  For example, you might as a matter of policy reject CIFs encoded via schemes outside some chosen set you can reliably recognize.  That would be entirely reasonable, but it does not rely on or benefit from any particular encoding requirement in the standard.

Or do you instead favor adapting as best you can to whatever you receive?  That might benefit from having a standardized mechanism for communicating encoding information along with a CIF, and at worst it would be no worse off for there being such a standard mechanism.

Or do you have another alternative?  If so, how does it benefit in practice from the standard designating one or a few allowed text encodings, or how is it harmed by a standardized mechanism for communicating encoding information?


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
ddlm-group mailing list
