Discussion List Archives


Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>To my mind, the encoding of plain CIF files remains an open issue.  I
>do not view the mechanisms for managing file encoding that are
>provided by current OSs to be sufficiently robust, widespread or
>consistent that we can rely on developers or text editors respecting
>them [...].

I agree that the encoding of plain CIF files remains an open issue.

I confess I find your concerns there somewhat vague, especially to the extent that they apply within the confines of a single machine.  Do your concerns extend to that level?  If so, can you provide an example or two of what you fear might go wrong in that context?

As Herb recently wrote, "Multiple encodings are a fact of life when working with text."  CIF2 looks like text, it feels like text, and despite some exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot pretend that we're dealing with anything other than text.  We need to accept, therefore, that no matter what we do, authors and programmers will need to account for multiple encodings, one way or another.  The format specification cannot relieve either group of that responsibility.

That doesn't necessarily mean, however, that CIF must follow the XML model of being self-defining with regard to text encoding.  Given CIF's various uses, we gain little of practical value in this area by defining CIF2 as UTF-8 only, and perhaps equally little by defining required decorations for expressing arbitrary encodings.  Moreover, the best reading of CIF1 is that it relies on the *local* text conventions, whatever they may be, which is quite a different thing from handling all text conventions that might conceivably be employed.

With that being the case, I don't think it needful for CIF2 in any given environment to endorse foreign encoding conventions other than UTF-8.  CIF2 reasonably could endorse UTF-16 as well, though, as that cannot be confused with any ASCII-compatible encoding.  Allowing UTF-16 would open up useful possibilities both for imgCIF and for future uses not yet conceived.  Additionally, since CIF is text I still think it important for CIF2 to endorse the default text conventions of its operating environment.

Could we agree on those three as allowed encodings?  Consider how various parties might deal with the unavoidable encoding issue, given that combination of supported alternatives and no extra support from the spec.  Here are some of the more reasonable alternatives I see:

1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:

        Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The responsibility to perform any needed transcoding is on the other party.  This is just as it might be with UTF-8-only.

        Option b) in addition to supporting UTF-8 and/or UTF-16, support other encodings by allowing users to explicitly specify them as part of the submission/retrieval process.  The processor / repository would either ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16].  This also is just as it might be with UTF-8 only.

2. Programs and Libraries:

        Option a) On input, detect encoding by checking first for UTF-16, assuming UTF-8 if not UTF-16, and falling back to default text conventions if a UTF-8 decoding error is encountered.  On output, encode as directed by the user (among the two/three options), defaulting to the input encoding when that is available and feasible.  These would be desirable behaviors even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.

        Option b) Require input and produce output according to a fixed set of conventions (whether local text conventions or UTF-8/16).  The program user is responsible for any needed transcoding.  This would be sufficient for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those differ, however, in which text conventions would be assumed.

3. Users/Authors:
3.1. Creating / editing CIFs
        No change from current practice is needed, but users might choose to store CIFs in UTF-8[/16] form.  This is just as it would likely be under UTF-8 only.

3.2. Transferring CIFs
        Unless an alternative agreement on encoding can be reached by some means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

3.3. Receiving CIFs
        The receiver may reasonably demand that the CIF be provided in UTF-8[/16] form.  He should *expect* that form unless some alternative agreement is established.  Any desired transcoding from UTF-8[/16] to an alternative encoding is the user's responsibility.  Again, this is not significantly different from the UTF-8 only case.
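
As a rough illustration of the input side of Option 2a above, here is a minimal Python sketch.  The function name decode_cif and its parameters are hypothetical, and recognizing UTF-16 by its byte-order mark is an assumption made for the sketch; the proposal itself says only that UTF-16 is checked first.

```python
import codecs
import locale
from typing import Optional, Tuple


def decode_cif(data: bytes, local_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode CIF bytes, returning (text, name of the encoding used).

    Hypothetical helper: tries UTF-16, then UTF-8, then the local default.
    """
    if local_encoding is None:
        # The platform's default text convention, as reported by Python.
        local_encoding = locale.getpreferredencoding(False)

    # Step 1: UTF-16, recognized here by a byte-order mark (an assumption).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"

    # Step 2: assume UTF-8 ("utf-8-sig" also accepts an optional UTF-8 BOM).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: fall back to the local text conventions.
    return data.decode(local_encoding), local_encoding


if __name__ == "__main__":
    sample = "_cell.length_a 5.4\u00c5\n"
    for raw in (sample.encode("utf-16"), sample.encode("utf-8"),
                sample.encode("latin-1")):
        text, used = decode_cif(raw, local_encoding="latin-1")
        print(used, repr(text))
```

The fallback in step 3 works only because text containing non-ASCII characters in a typical single-byte local encoding is rarely valid UTF-8; pure ASCII input decodes identically under either assumption.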

A driving force in many of those cases is the well-understood (especially here!) fact that different systems cannot be relied upon to share text conventions, thus leaving UTF-8[/16] as the only available general-purpose medium of exchange.  At the same time, local conventions are not forbidden from use where they can be relied upon -- most notably, within the same computer.  Even if end-users, as a group, do not appreciate those details, we can ensure via the spec that CIF2 implementers do.  That's sufficient.
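
For the transfer case (3.2), the transferor's obligation reduces to one transcoding step before the file leaves the local machine.  A minimal sketch follows; the name transcode_for_exchange is hypothetical, and the caller is assumed to know which local encoding the file was written in.

```python
def transcode_for_exchange(data: bytes, source_encoding: str) -> bytes:
    """Re-encode CIF bytes from a known local encoding to UTF-8.

    Hypothetical helper: decoding is strict, so bytes that are not valid
    in source_encoding raise UnicodeDecodeError instead of being
    silently replaced.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    local_bytes = "_cell.length_a 5.4\u00c5\n".encode("latin-1")
    print(transcode_for_exchange(local_bytes, "latin-1"))
```

Failing loudly on undecodable input seems the safer choice here, since a silently mangled character in a data file is harder to notice than a refused transfer.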

So, if pretty much all my expected behavior under UTF-8[/16]+local is the same as it would be under UTF-8-only, then why prefer the former?  Because under UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas under UTF-8 only, a significant proportion is not.  If the standard adequately covers these behaviors then we can expect more uniform support.  Moreover, this bears directly on community acceptance of the spec.  If flouting the spec with respect to encoding becomes common, then the spec will have failed, at least in that area.  Having failed in one area, it is more likely to fail in others.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


cif2-encoding mailing list
