Re: [ddlm-group] options/text vs binary/end-of-line

Hello All,

I apologize for my recent silence, resulting from my vacationing away from ready access to an Internet connection (a blessing to me, and perhaps to you as well!).

I am all in favor of polling the communities of interested authors and developers for input on the text vs. binary matter, and I am glad that such a poll is underway.  The way James couched it in terms of reliable transfer vs. text conventions got me spinning in a new direction, however, and a bit back to an older one:

The vast majority of the CIF 1 and 2 specifications concern the text content of CIFs, and those matters are orthogonal to the question of mechanisms for storing and exchanging CIF text.  It would be entirely feasible, and possibly useful, to separate the formal specifications for these matters.  In particular, I think the DDLm specifications depend little, if at all, on the storage and exchange side (although *implementations* will need to be concerned with that side, of course), so DDLm discussion could move forward based on the CIF text specification while the storage and exchange details are hammered out.  My personal interest in such a division is more philosophical, though, rooted in a long-cultivated instinct for separation of concerns.

Additional comments inline below:

On Wednesday, June 30, 2010 11:25 PM, James Hester wrote:
>  The two polar positions are then:
>(a) Reliable transfer and archiving of information are the top priority

For clarity: by "reliable" I think we mean the receiving party having high confidence that they are interpreting the content in the way the sending party intended.  Is that accurate?

>(b) Being able to process CIF using text conventions is the top priority

One of the things that occurred to me is that these are polar opposites only when the scope of the discussion is limited to which text encodings should be allowed by the CIF specification.  If we expand the scope, as James does in his next comments, then we can consider alternatives that meet both goals.

>If we consider CIF as text as the overriding priority:
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF?  How far are we prepared to compromise?
>If we consider reliable exchange of information as the top priority:
>2. How do we then make CIFs sufficiently accessible to text-based tools?  How far are we prepared to compromise?
>It would be useful if we came up with suitable answers to (1) and (2) to clarify what our alternatives are.  We can then seek a compromise after finding a balance between the loss of textual convenience from emphasizing reliability with the loss of reliability gained by textual convenience (Simple!).  I will be proposing a poll of the international CIF-using community in a separate email to aid us in our deliberations.

Whether we treat it as a separate matter or as integral to CIF itself, choice of encoding is not the only, nor even the most effective possible approach to reliable storage and exchange of CIF.  Hashing is the industry standard technique for reliable data exchange, and it could be applied with good effect to storage and exchange of CIFs.  Typically a file is hashed as a byte stream, but for CIF, if the Unicode code point stream were hashed instead then that would provide for an excellent check on whether the CIF was decoded correctly by the receiver.  The hash could be included in the CIF itself; for example, it could be put in a structured comment at the end of the file.
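To make the distinction between hashing bytes and hashing code points concrete, here is a minimal Python sketch.  The function name `code_point_hash` and the sample text are hypothetical, and both the choice of MD5 and the fixed 3-byte big-endian serialization of each code point are assumptions made to match Scheme B further down; any stable serialization would serve for illustration.

```python
import hashlib

def code_point_hash(text: str) -> str:
    """Hash the Unicode code points of `text`, not its encoded bytes."""
    h = hashlib.md5()
    for ch in text:
        # Assumed serialization: 3 bytes per code point, big-endian.
        # Three bytes suffice because code points never exceed 0x10FFFF.
        h.update(ord(ch).to_bytes(3, "big"))
    return h.hexdigest()

# Hypothetical CIF fragment containing two non-ASCII characters.
text = "_cell_length_a 5.43 # \u00c5ngstr\u00f6m\n"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

# A byte-level hash changes whenever the file is transcoded ...
byte_hashes_differ = (
    hashlib.md5(utf8_bytes).hexdigest() != hashlib.md5(utf16_bytes).hexdigest()
)
# ... whereas the code-point hash survives a correct decode of either form ...
code_point_hashes_agree = (
    code_point_hash(utf8_bytes.decode("utf-8"))
    == code_point_hash(utf16_bytes.decode("utf-16"))
)
# ... and exposes a wrong one (UTF-8 bytes misread as ISO-8859-1).
wrong_decoding_detected = (
    code_point_hash(utf8_bytes.decode("latin-1")) != code_point_hash(text)
)
```

The point of the sketch is only that the hash is a property of the decoded text, so it stays valid across lossless transcoding and fails when the receiver guesses the encoding wrongly.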

The difficulty with hashing is, of course, that it requires an extra layer of software to compute and check hashes.  Additionally, a hash is invalidated when the subject file is modified, though it would be possible to implement the hashing algorithm such that it is insensitive to some or all semantically-meaningless modifications.  Also, the hashing algorithm on the sender's side needs to know the encoding.

On the plus side, a hash of the Unicode code points would allow a receiver to test the content against *all* encodings available to them, thus enabling them to decode the content correctly without any a priori knowledge of the encoding.  That includes distinguishing among the many encodings that are supersets of ASCII.  Also, hashing as I suggest would provide the same ability to detect ordinary transmission errors that the more common uses of hashing do.  Moreover, such a hash would be insensitive to harmless encoding mismatches, where the two encodings being confused agree on all the characters actually present in the CIF (example: ISO-8859-x vs. UTF-8 or ISO-8859-y, where the CIF contains only ASCII characters).
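As a rough illustration of the trial-decoding idea, the following Python sketch tries a list of candidate encodings and keeps those whose decoded text reproduces the sender's hash.  The names `code_point_hash` and `matching_encodings`, the candidate list, and the 3-byte MD5 serialization are all assumptions for the sake of the example, and the hash is passed in directly rather than parsed out of the CIF.

```python
import hashlib

def code_point_hash(text: str) -> str:
    """MD5 over code points, each serialized as 3 big-endian bytes (assumed)."""
    h = hashlib.md5()
    for ch in text:
        h.update(ord(ch).to_bytes(3, "big"))
    return h.hexdigest()

def matching_encodings(raw: bytes, expected_hash: str, candidates):
    """Return the candidates under which `raw` decodes to the hashed text."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes invalid in this encoding, or codec unavailable
        if code_point_hash(decoded) == expected_hash:
            matches.append(name)
    return matches

candidates = ["utf-8", "iso-8859-5", "latin-1", "cp1252"]

# Sender writes the file in ISO-8859-1; receiver knows only the hash.
original = "_cell_length_a 5.43 # in \u00c5\n"
found = matching_encodings(
    original.encode("latin-1"), code_point_hash(original), candidates
)

# An ASCII-only CIF matches every ASCII-superset candidate.
ascii_only = "_cell_length_a 5.43\n"
found_ascii = matching_encodings(
    ascii_only.encode("ascii"), code_point_hash(ascii_only), candidates
)
```

In the first case the UTF-8 attempt fails outright and ISO-8859-5 decodes to the wrong character, so only `latin-1` and `cp1252` survive; those two agree on every character present, which is exactly the harmless kind of mismatch described above.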

Supposing that hashing tools are provided to authors, I don't see the extra work for them that hashing would require as being qualitatively different from what would be required to ensure UTF-8 encoding, except for those few authors for whom UTF-8 is the system default.  Hashing could be implemented on top of locale-specific text conventions, and the reliability it would afford surpasses that provided by standardizing on UTF-8, in that it would detect transmission errors and provide for correct detection and verification of any encoding.  Furthermore, to the extent that recognizable encodings seem sufficient to some, it is not necessary to require hashes for CIFs encoded via a recognizable encoding.

Hashing solves both problems simultaneously: CIFs could be handled according to local text conventions and even freely transcoded (provided the transcoding is lossless), yet also reliably archived and exchanged.  The cost would be the need for an additional piece of software involved in the archiving and exchange processes.

>Being clearly in the camp that values reliability before text, I propose the following scheme as a scheme for ensuring reliability while attempting to keep the file as accessible as possible to text tools, and invite others to propose a scheme that satisfies those that value text as the highest priority:
>Scheme A:
>1. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
>2. Any encoding specifying the mapping from bytes to Unicode code points may be used, provided that:
>    1. The encoding is specified in an international standard
>    2. The encoding is distinguishable from all other standard encodings at the binary level.  This requirement may be satisfied by an initial 'signature', provided that this signature is specified in the relevant international standard as being mandatory
>    3. The encoding is supported across the range of platforms likely to use CIF2.  "Support" includes:
>        1. Availability of text input and output functions using this encoding across a range of programming languages
>        2. Availability of applications on the platform to manipulate text in this encoding, most importantly text editors but also tools such as search
>    4. The encoding is coincident with US-ASCII for codepoints <= 127.  This requirement may be dropped in the future if CIF2 becomes the dominant form of CIF file.
>Requirement (2.4) is there for backwards compatibility with the de-facto practice of CIF1.  I submit that currently only UTF-8 meets these requirements for encoding ((2.2) in only a probabilistic sense).
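
A brief Python sketch of why UTF-8 satisfies (2.2) only in a probabilistic sense may help.  The helper names `looks_like_utf8` and `is_pure_ascii` are hypothetical; the approach assumes that "distinguishable at the binary level" is tested by strict trial decoding, which is only one possible reading.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if `raw` is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

def is_pure_ascii(raw: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the content is US-ASCII."""
    return all(b < 0x80 for b in raw)

# Typical non-ASCII text in a single-byte encoding is rejected as UTF-8 ...
rejects_latin1 = not looks_like_utf8("5.43 \u00c5".encode("latin-1"))

# ... but acceptance proves nothing: these two bytes are valid UTF-8 for
# U+00E5 and equally valid ISO-8859-1 for the pair U+00C3 U+00A5.
ambiguous = b"\xc3\xa5"
accepted_as_utf8 = looks_like_utf8(ambiguous)
also_valid_latin1 = ambiguous.decode("latin-1") == "\u00c3\u00a5"
```

Well-formedness as UTF-8 is therefore strong evidence rather than proof, and for pure-ASCII content the question does not arise at all.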

Based on my commentary above, I offer this alternative scheme:

Scheme B:
1. This scheme provides for reliable archiving and exchange of CIF text.  Although it depends in some cases on metadata embedded in the CIF text, presence of such metadata is not a well-formedness constraint on the text itself.
2. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
3. Any text encoding may be used.  It is encouraged, but not required, that the name of the encoding be specified via an "encoding tag" appended to the CIF2 magic code.  If present, such a tag is advisory-only in nature: consumers must be prepared for it to be incorrect.
4. Archiving or exchange of CIF text complies with this scheme if the CIF text contains a correct content hash:
    a) The hash value is computed by applying the MD5 algorithm to the Unicode code point values of the CIF text, in the order they appear, excluding all code points of CIF comments and all other CIF whitespace appearing outside data values or separating List or Table elements.
    b) The code point stream is converted to a byte stream for input to the hash function by interpreting each code point as a 24-bit integer, appearing on the byte stream in order from most-significant to least-significant byte.
    c) The hash value is expressed in the CIF itself as a structured comment of the form:
       where the Xs represent the hexadecimal digits of the computed hash value.
    d) The hash comment may appear anywhere in the CIF that a comment may appear, but conventionally it is at the end of the CIF.
5. Archiving or exchange of CIF text that does not contain a content hash complies with this scheme if
    a) the text encoding is specified in an international standard and is distinguishable from all other encodings at the binary level, or
    b) the text encoding is coincident with US-ASCII for all code points appearing in the CIF.
For the purposes of (5a), distinguishing encodings may rely on the characteristics of CIF, such as the allowed character set and the required CIF version comment, and also on the actual CIF text (such as for recognition of UTF-8 by its encoding of non-ASCII characters).
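
Steps (4a) and (4b) of Scheme B can be sketched in Python roughly as follows.  This is a deliberately simplified illustration, not a conforming implementation: the names `significant_code_points` and `scheme_b_hash` are hypothetical, the tokenizer recognizes only unquoted tokens and single-line quoted strings (semicolon-delimited text fields, triple-quoted strings, Lists and Tables are not handled), and keeping the quote delimiters in the hashed stream is my own assumption.

```python
import hashlib

def significant_code_points(cif_text: str):
    """Yield code points outside comments and inter-token whitespace (4a).

    Assumption: simplified tokenizer, see the caveats in the lead-in.
    """
    i, n = 0, len(cif_text)
    at_token_start = True
    while i < n:
        ch = cif_text[i]
        if ch in " \t\r\n":
            at_token_start = True          # whitespace between tokens: skip
            i += 1
        elif ch == "#" and at_token_start:
            while i < n and cif_text[i] not in "\r\n":
                i += 1                     # comment: skip to end of line
        elif ch in "'\"" and at_token_start:
            end = cif_text.find(ch, i + 1)  # closing quote (simplified rule)
            end = n - 1 if end == -1 else end
            for c in cif_text[i:end + 1]:
                yield ord(c)               # keep whitespace inside the value
            i = end + 1
            at_token_start = False
        else:
            yield ord(ch)
            at_token_start = False
            i += 1

def scheme_b_hash(cif_text: str) -> str:
    """MD5 over the significant code points, 3 big-endian bytes each (4b)."""
    h = hashlib.md5()
    for cp in significant_code_points(cif_text):
        h.update(cp.to_bytes(3, "big"))
    return h.hexdigest()
```

Under these assumptions, two CIFs that differ only in comments, line terminators, or spacing between tokens hash identically, while a change inside a quoted data value changes the hash.  Because the hash comment is itself a comment, it is excluded from the computation and can be added or updated without invalidating its own value.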

Best Regards,

John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
