Discussion List Archives

Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .

On Wednesday, June 23, 2010 5:33 AM, Brian McMahon wrote:


>Expecting every CIF application to be robustly able to handle every
>conceivable - or even every reasonable - encoding is (what's the
>word?) "optimistic", and places a heavy burden on application

I thought you were an optimist?  :-)

Indeed, I agree that such an expectation would be optimistic in the extreme, and I don't think anyone has been advocating such a requirement.

>Consider instead the approach of defining the CIF standard as a
>text file and using UTF-8 for a "canonical" description of low-level
>representations. Supply a set of such canonical CIFs in the
>next-generation trip test suite. Require a "compliant" CIF
>application to handle the trip tests with the canonical encoding.
>Permit - indeed encourage - applications developers to accommodate
>other encodings to the extent they can easily do with their standard
>text-processing libraries/utilities/tools. Encourage or perhaps
>commission a "canonicalisation" suite for use in contexts where
>an application cannot natively handle a submitted encoding.


>This isn't a radical new suggestion; it seems to me to encapsulate
>many of the points of common ground around which we're still
>negotiating our points of principle or philosophy, but I would hope it
>can help us to move forward.

That satisfactorily captures the key points I have been pursuing.  With only a bit of tweaking, the "CIF Interchange Format" proposal I floated would serve this end nicely.  Alternatively, the same end could be reached by couching the requirement in terms of a "canonical" encoding, more along the lines of Brian's text above:

1. In "TERMINOLOGY", insert a new first paragraph:
Reference to characters means numeric code points in the Unicode code space.  Where Unicode has assigned 'abstract characters' to specific code points, those code points may sometimes be referred to by the Unicode-assigned name or a colloquial equivalent.  Otherwise, they are referred to according to Unicode convention, U+[[x]x]xxxx, where [[x]x]xxxx is the four- to six-digit hexadecimal representation of the code point value.

2. Change the heading "CHANGE 2 - NEW (ENCODING)" to "CHANGE 2 - NEW (CHARACTER SET)".

3. Replace the first paragraph in the CHANGE 2 section with:
CIF2 files are variable-length Unicode text files, but for historical reasons will have a maximum record length of 2048 characters.  As described in detail below, CIF2 imposes restrictions on the characters allowed in data names, block codes, and save frame codes, and it disregards the Unicode-defined separating and delimiting functions of all but a few characters.

4. Change the format of the explicit included character set to use Unicode convention.  (A few weeks ago I provided James a proposed draft update that does this.)

5. Delete all remaining appearances of the text "UTF-8" in that section and those following, without replacement (the definition of "character(s)" obviates these).

6. Add a new section at the end:

Many alternative encodings are available for recording and exchanging Unicode text (such as CIF2 data) via byte-oriented media.  This specification does not forbid the use of any particular encoding for storing and exchanging CIF2 data, but UTF-8 is the canonical encoding for CIF2.  All CIF2 readers conformant with this specification are prepared to accept CIF2 input encoded in UTF-8.  They may in addition accept CIF2 input encoded via other schemes, but they are not required to do so.  CIF2 writers may produce output in any encoding, but they are strongly encouraged to use UTF-8 unless environment- or purpose-specific circumstances direct otherwise.

As used with CIF2, UTF-8 encoding includes an optional initial UTF-8-encoded byte-order mark (character U+FEFF).  Such a code is accepted and ignored if present, but it is considered part of the encoding, not part of the encoded CIF2 data.

Reasoning: A canonical encoding is chosen to standardize one means of exchanging CIF data without data corruption or loss.  UTF-8 in particular is chosen because of its widespread and growing acceptance and implementation, its coverage of the entire Unicode code space, and its congruence with 7-bit ASCII over the entire ASCII range.
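
For illustration only, here is a minimal Python sketch of what the reader-side behaviour proposed above might look like: strict UTF-8 decoding, an optional leading byte-order mark that is accepted and dropped, code points reported in U+xxxx form, and a check against the 2048-character record limit.  The function names are hypothetical and are not part of any CIF2 draft.  Treating a "record" as a line ended by LF, CR, or CR LF is also an assumption of this sketch, not something the proposed text specifies.

```python
import codecs
import re

# Maximum record length in characters, taken from the proposed CHANGE 2 text.
MAX_RECORD_LENGTH = 2048

# Assumption of this sketch: a record is a line ended by CR LF, CR, or LF.
# Other Unicode separators are deliberately not treated as line ends.
_LINE_END = re.compile(r"\r\n|\r|\n")


def decode_cif2_utf8(data):
    """Decode UTF-8 bytes, dropping at most one leading byte-order mark.

    The mark is treated as part of the encoding, so it never reaches the
    decoded text. Malformed UTF-8 raises UnicodeDecodeError.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


def code_point_label(ch):
    """Format a single character as U+xxxx with four to six hex digits."""
    return "U+{:04X}".format(ord(ch))


def overlong_records(text, limit=MAX_RECORD_LENGTH):
    """Return the 1-based numbers of records longer than `limit` characters."""
    records = _LINE_END.split(text)
    return [n for n, rec in enumerate(records, start=1) if len(rec) > limit]
```

Because UTF-8 coincides with 7-bit ASCII over the ASCII range, a pure-ASCII CIF passes through `decode_cif2_utf8` unchanged.  A reader that also wished to accept other encodings could convert them to text by other means and share everything downstream of the decoding step.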


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer

ddlm-group mailing list

International Union of Crystallography