Re: [ddlm-group] options/text vs binary/end-of-line

First, to reassure those that have tired of the encoding discussion, I do not think that this particular thread need delay discussion of other facets of the DDLm project.  In fact, it would probably be wise at this stage for the interested parties to branch off into a 'ddlm-encodings-group' and bash out a scheme along the lines that John B has proposed.  Perhaps Brian could set that up for us? Pending that branching off:

John B proposes an enhanced scheme which uses an embedded hash value as a check of encoding integrity (see his email below for the details).   This hash value would need to be computed by tools separate from a user's normal text editing workflow.  I might cheekily comment that, if the user needs to run another program to convert their 'non-CIF2' text file to a 'CIF2' text file, why not just run an encoding converter instead of a hash value inserting program?  But perhaps there are legitimate reasons, so below I analyse John's scheme B, which I think is the best proposal yet for handling multiple encodings.

The weak points that I see in John B's scheme are:
(1) Information will not be correctly transferred if the hasher uses the wrong encoding for calculating the hash, and the recipient uses the same wrong encoding.  The recipient is likely to use the encoding suggested by the creator, so the probability of this type of failure occurring is essentially the probability of the CIF writer instructing the hash calculator to use the wrong encoding.  Other mistakes by the CIF writer (forgetting to add a hash, leaving an old hash in the file) are likely to simply result in rejection, which I don't see as a failure.
(2) In order to read the hash value, the encoding of the file needs to be known (!)
(3) The recipient doesn't know whether a hash value is present until they have parsed the entire file.
(4) It assumes that all recipients will be able to handle all encodings.
(5) Intermediate files may be left lying around the user's system which are neither CIF2/UTF-8 nor CIF2/hashed but are in some sense CIF2 files.
 
A strong point:
(6) The user must run a CIF-aware program to produce the hash value, so there is an opportunity to hide complexity inside the program (or just convert to UTF-8...)

We can reduce the likelihood of (1) by producing interactive CIF-hash calculators that present the file text to the user in the nominated encoding scheme for checking before the hash is calculated, with intelligent choice of file contents to find non-ASCII code points.
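As a rough illustration of what such an interactive check might show, the sketch below decodes the file bytes under the nominated encoding and collects only the lines that contain non-ASCII code points, together with their Unicode names.  The function name `non_ascii_preview` and the `limit` parameter are hypothetical; nothing here is part of any proposed scheme.

```python
import unicodedata


def non_ascii_preview(raw: bytes, encoding: str, limit: int = 10):
    """Return (line number, line, character names) for lines with non-ASCII text.

    Hypothetical helper: it decodes the bytes with the encoding the user
    nominated, so a wrong nomination shows up as unexpected characters.
    """
    text = raw.decode(encoding)
    previews = []
    for number, line in enumerate(text.splitlines(), start=1):
        odd = [ch for ch in line if ord(ch) > 127]
        if odd:
            names = [unicodedata.name(ch, "UNNAMED") for ch in odd]
            previews.append((number, line, names))
            if len(previews) >= limit:
                break
    return previews
```

A user who nominates Latin-1 for a file that is really UTF-8 would see a degree sign reported as two characters rather than one, which is the kind of visual cue that could catch a wrong nomination before the hash is calculated.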

We can reduce the impact of the remaining issues with the following adjusted Scheme B (Scheme C).  I would find something like Scheme C acceptable.  Relevant changes:

(i) Mandate putting the hash comment (if necessary) on the very first line of the file, using ASCII encoding for each character.  Most text editors would find such mixed encoding a challenge, but as hashing must be done programmatically I don't see an issue.  Likewise, before any further text processing is attempted, the file should be put through a hash checker, which would output a file ready for the local environment (without a hash comment at the top).  Note that the hash comment effectively replaces the CIF2.0 magic number, reducing potential for confusion.  Note that a non-UTF-8 file without hash comment should not have the CIF2.0 magic number.  This change addresses points (2) and (3) above.
(ii) State the encoding scheme as part of the hash line, inserted as part of the hash calculation.  In this way, at least the hasher's choice of encoding scheme is known, rather than allowing the further possible errors arising from hasher and user having different ideas of the encoding.  This addresses point (1).
(iii) Restrict possible encodings to internationally recognised ones with well-specified Unicode mappings.  This addresses point (4).

Scheme C (adapted from Scheme B):
1. This scheme provides for reliable archiving and exchange of CIF text.  Although it depends in some cases on metadata embedded in the CIF text, presence of such metadata is not a well-formedness constraint on the text itself.
2. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
3. Any internationally-recognised text encoding with a well-specified mapping to Unicode code points may be used.  The name of the encoding must be specified via an "encoding tag" prepended to the hash value computed using that encoding (see below).
4. Archiving or exchange of CIF text complies with this scheme if the CIF text contains a correct content hash:
   a) The hash value is computed by applying the MD5 algorithm to the Unicode code point values of the CIF text, in the order they appear, excluding all code points of CIF comments and all other CIF whitespace appearing outside data values or separating List or Table elements.
   b) The code point stream is converted to a byte stream for input to the hash function by interpreting each code point as a 24-bit integer, appearing on the byte stream in order from most-significant to least-significant byte.
   c) The hash value is expressed in the CIF itself as a structured comment of the form:
       #\#cif_content_hash_md5-YYYYY: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      where the Xs represent the hexadecimal digits of the computed hash value and the Ys represent the standard encoding name used to compute the hash.
   d) The hash comment must appear as the first line of a file.
5. Archiving or exchange of CIF text that does not contain a content hash complies with this scheme if
   a) the text encoding is specified in an international standard and is distinguishable from all other encodings at the binary level, or
   b) the text encoding is coincident with US-ASCII for all code points appearing in the CIF.
For the purposes of (5a), distinguishing encodings may rely on the characteristics of CIF, such as the allowed character set and the required CIF version comment, and also on the actual CIF text (such as for recognition of UTF-8 by its encoding of non-ASCII characters).
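
A minimal sketch of a Scheme C hash checker follows, under several assumptions that are mine rather than part of the scheme: the first line ends with a line feed, the encoding name uses only letters, digits, '.', '_' and '-', and the encoding name is one that Python's codec registry recognises.  Removal of comments and insignificant whitespace (item 4a) needs a CIF tokenizer, so it is represented only by a hypothetical `significant` hook that defaults to hashing the whole decoded body.

```python
import hashlib
import re

# Hash comment layout from item 4(c); the character set allowed in the
# encoding name is an assumption of this sketch.
HASH_LINE = re.compile(
    rb"#\\#cif_content_hash_md5-([A-Za-z0-9._-]+): ([0-9A-Fa-f]{32})"
)


def code_point_md5(text: str) -> str:
    """MD5 over each code point as a 24-bit big-endian integer (items 4a, 4b)."""
    data = b"".join(ord(ch).to_bytes(3, "big") for ch in text)
    return hashlib.md5(data).hexdigest()


def check_hashed_cif(raw: bytes, significant=lambda text: text) -> str:
    """Verify the first-line hash comment and return the decoded body.

    `significant` is a hypothetical hook that should drop comments and
    insignificant whitespace; the identity default is a simplification.
    """
    first_line, _, body = raw.partition(b"\n")
    match = HASH_LINE.fullmatch(first_line.rstrip(b"\r"))
    if match is None:
        raise ValueError("first line is not a hash comment")
    encoding = match.group(1).decode("ascii")
    expected = match.group(2).decode("ascii").lower()
    text = body.decode(encoding)  # may raise LookupError or UnicodeDecodeError
    if code_point_md5(significant(text)) != expected:
        raise ValueError("content hash mismatch")
    return text
```

The returned text no longer carries the hash line, which matches the idea in (i) that the checker outputs a file ready for the local environment.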


On Sat, Jul 10, 2010 at 2:38 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Hello All,

I apologize for my recent silence, resulting from my vacationing away from ready access to an Internet connection (a blessing to me, and perhaps to you as well!).

I am all in favor of polling the communities of interested authors and developers for input on the text vs. binary matter, and I am glad that such a poll is underway.  The way James couched it in terms of reliable transfer vs. text conventions got me spinning in a new direction, however, and a bit back to an older one:

The vast majority of the CIF 1 and 2 specifications concern the text content of CIFs, and those matters are orthogonal to the question of mechanisms for storing and exchanging CIF text.  It would be entirely feasible, and possibly useful, to separate the formal specifications for these matters.  In particular, I think the DDLm specifications depend little, if at all, on the storage and exchange side (although *implementations* will need to be concerned with that side, of course), so DDLm discussion could move forward based on the CIF text specification while the storage and exchange details are hammered out.  My personal interest in such a division is more philosophical, though, rooted in a long-cultivated instinct for separation of concerns.

Additional comments inline below:

On Wednesday, June 30, 2010 11:25 PM, James Hester wrote:
[...]
>  The two polar positions are then:
>
>(a) Reliable transfer and archiving of information are the top priority

For clarity: by "reliable" I think we mean the receiving party having high confidence that they are interpreting the content in the way the sending party intended.  Is that accurate?

Yes, where sending and receiving may also include 'saving' and 'retrieving'.
 
>(b) Being able to process CIF using text conventions is the top priority

One of the things that occurred to me is that these are polar opposites only when the scope of the discussion is limited to which text encodings should be allowed by the CIF specification.  If we expand the scope, as James does in his next comments, then we can consider alternatives that meet both goals.

>If we consider CIF as text as the overriding priority:
>
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF?  How far are we prepared to compromise?
>
>If we consider reliable exchange of information as the top priority:
>
>2. How do we then make CIFs sufficiently accessible to text-based tools?  How far are we prepared to compromise?
>
>It would be useful if we came up with suitable answers to (1) and (2) to clarify what our alternatives are.  We can then seek a compromise after finding a balance between the loss of textual convenience from emphasizing reliability and the loss of reliability from emphasizing textual convenience (Simple!).  I will be proposing a poll of the international CIF-using community in a separate email to aid us in our deliberations.

Whether we treat it as a separate matter or as integral to CIF itself, choice of encoding is not the only, nor even the most effective possible approach to reliable storage and exchange of CIF.  Hashing is the industry standard technique for reliable data exchange, and it could be applied with good effect to storage and exchange of CIFs.  Typically a file is hashed as a byte stream, but for CIF, if the Unicode code point stream were hashed instead then that would provide for an excellent check on whether the CIF was decoded correctly by the receiver.  The hash could be included in the CIF itself; for example, it could be put in a structured comment at the end of the file.

The difficulty with hashing is, of course, that it requires an extra layer of software to compute and check hashes.  Additionally, a hash is invalidated when the subject file is modified, though it would be possible to implement the hashing algorithm such that it is insensitive to some or all semantically-meaningless modifications.  Also, the hashing algorithm on the sender's side needs to know the encoding.

On the plus side, a hash of the Unicode code points would allow a receiver to test the content against *all* encodings available to them, thus enabling them to decode the content correctly without any a priori knowledge of the encoding.  That includes distinguishing among the many encodings that are supersets of ASCII.  Also, hashing as I suggest would provide the same ability to detect ordinary transmission errors that the more common uses of hashing do.  Moreover, such a hash would be insensitive to harmless encoding mismatches, where the two encodings being confused agree on all the characters actually present in the CIF (example: ISO-8859-x vs. UTF-8 or ISO-8859-y, where the CIF contains only ASCII characters).
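
To make the idea concrete, here is a small sketch of a receiver trying candidate encodings against a known hash.  It assumes MD5 over code points serialised as 3-byte big-endian integers, and for simplicity it hashes the whole decoded text rather than excluding comments and whitespace; the function names are hypothetical.

```python
import hashlib


def code_point_md5(text: str) -> str:
    """MD5 over each code point written as a 3-byte big-endian integer."""
    data = b"".join(ord(ch).to_bytes(3, "big") for ch in text)
    return hashlib.md5(data).hexdigest()


def find_encoding(raw: bytes, expected_hash: str, candidates):
    """Return the first candidate encoding whose decoding matches the hash.

    Hypothetical helper: returns None when no candidate reproduces the hash,
    which also covers ordinary transmission damage.
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes are invalid in this encoding, or it is unknown
        if code_point_md5(text) == expected_hash.lower():
            return name
    return None
```

For a CIF containing only ASCII characters, any ASCII-superset candidate reproduces the hash, so the first one tried is returned; that is the harmless mismatch case described above.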

Supposing that hashing tools are provided to authors, I don't see the extra work for them that hashing would require as being qualitatively different from what would be required to ensure UTF-8 encoding, except for those few authors for whom UTF-8 is the system default.  Hashing could be implemented on top of locale-specific text conventions, and the reliability it would afford surpasses that provided by standardizing on UTF-8, in that it would detect transmission errors and provide for correct detection and verification of any encoding.  Furthermore, to the extent that recognizable encodings seem sufficient to some, it is not necessary to require hashes for CIFs encoded via a recognizable encoding.

Hashing solves both problems simultaneously: CIFs could be handled according to local text conventions and even freely transcoded (provided the transcoding is lossless), yet also reliably archived and exchanged.  The cost would be the need for an additional piece of software involved in the archiving and exchange processes.

>Being clearly in the camp that values reliability before text, I propose the following scheme for ensuring reliability while attempting to keep the file as accessible as possible to text tools, and invite others to propose a scheme that satisfies those that value text as the highest priority:
>
>Scheme A:
>1. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
>2. Any encoding specifying the mapping from bytes to Unicode code points may be used, provided that:
>    1. The encoding is specified in an international standard
>    2. The encoding is distinguishable from all other standard encodings at the binary level.  This requirement may be satisfied by an initial 'signature', provided that this signature is specified in the relevant international standard as being mandatory
>    3. The encoding is supported across the range of platforms likely to use CIF2.  "Support" includes:
>        1. Availability of text input and output functions using this encoding across a range of programming languages
>        2. Availability of applications on the platform to manipulate text in this encoding, most importantly text editors but also tools such as search
>    4. The encoding is coincident with US-ASCII for codepoints <= 127.  This requirement may be dropped in the future if CIF2 becomes the dominant form of CIF file.
>Requirement (2.4) is there for backwards compatibility with the de-facto practice of CIF1.  I submit that currently only UTF-8 meets these requirements for encoding ((2.2) in only a probabilistic sense).

Based on my commentary above, I offer this alternative scheme:

Scheme B:
1. This scheme provides for reliable archiving and exchange of CIF text.  Although it depends in some cases on metadata embedded in the CIF text, presence of such metadata is not a well-formedness constraint on the text itself.
2. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.
3. Any text encoding may be used.  It is encouraged, but not required, that the name of the encoding be specified via an "encoding tag" appended to the CIF2 magic code.  If present, such a tag is advisory-only in nature: consumers must be prepared for it to be incorrect.
4. Archiving or exchange of CIF text complies with this scheme if the CIF text contains a correct content hash:
   a) The hash value is computed by applying the MD5 algorithm to the Unicode code point values of the CIF text, in the order they appear, excluding all code points of CIF comments and all other CIF whitespace appearing outside data values or separating List or Table elements.
   b) The code point stream is converted to a byte stream for input to the hash function by interpreting each code point as a 24-bit integer, appearing on the byte stream in order from most-significant to least-significant byte.
   c) The hash value is expressed in the CIF itself as a structured comment of the form:
       #\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      where the Xs represent the hexadecimal digits of the computed hash value.
   d) The hash comment may appear anywhere in the CIF that a comment may appear, but conventionally it is at the end of the CIF.
5. Archiving or exchange of CIF text that does not contain a content hash complies with this scheme if
   a) the text encoding is specified in an international standard and is distinguishable from all other encodings at the binary level, or
   b) the text encoding is coincident with US-ASCII for all code points appearing in the CIF.
For the purposes of (5a), distinguishing encodings may rely on the characteristics of CIF, such as the allowed character set and the required CIF version comment, and also on the actual CIF text (such as for recognition of UTF-8 by its encoding of non-ASCII characters).
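
For illustration only, the hashing side of Scheme B might look something like the sketch below.  It assumes the caller has already extracted the significant text (comments and insignificant whitespace removed per item 4a), since doing that properly needs a CIF tokenizer; `append_hash_comment` and its parameters are hypothetical names.

```python
import hashlib


def code_point_md5(text: str) -> str:
    """Item 4(b): each code point becomes a 24-bit big-endian integer."""
    data = b"".join(ord(ch).to_bytes(3, "big") for ch in text)
    return hashlib.md5(data).hexdigest()


def append_hash_comment(cif_text: str, significant_text: str) -> str:
    """Append the structured comment of item 4(c) at the end of the CIF.

    `significant_text` is assumed to be the CIF content with comments and
    insignificant whitespace already removed by some other tool.
    """
    comment = "#\\#content_hash_md5:" + code_point_md5(significant_text)
    separator = "" if cif_text.endswith("\n") else "\n"
    return cif_text + separator + comment + "\n"
```

Because the hash is taken over code points rather than bytes, the same comment remains valid after a lossless transcoding of the file, for example from UTF-8 to UTF-16.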

Best Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital




Email Disclaimer:  www.stjude.org/emaildisclaimer

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148