Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .. .. .. .

Dear Colleagues,

   I agree that MD5 checksums are very useful in ensuring the reliable
transmission of files.  That is why we use them for imgCIF binary sections.

   Unfortunately, most people simply ignore MD5 checksums on read unless
some intervening software checks the checksums for them, and do not write
them unless some intervening software writes them automatically.

   All of this -- the UTF-8 issue, the DDLm parsing issues, this hash
code -- implies a need for a great deal of robust supporting software if
CIF2 and DDLm are to become a reality.  We need to settle our specs and
get down to coding.


At 6:04 PM -0700 7/15/10, SIMON WESTRIP wrote:
>  >"...Scheme B provides for reliable exchange and archiving; it is 
>not intended to be an integral part of the CIF format..."
>However worthy such schemes may be, if they are not to be "part of 
>the CIF format" (whatever it may turn out to be) perhaps
>discussion of such here overly complicates/confuses the matter? I 
>note the desire to pursue this elsewhere and am in favour of 
>exploring methods of exchange and archiving (indeed I'm resisting 
>the temptation/distraction to follow-up on some of these ideas, e.g. 
>I like the idea of 'zipping a CIF' with all related data files, with 
>dictionary-defined links to the related data); however, inevitably a 
>CIF will have to be a CIF, whether it be 'text' with a declared 
>encoding or 'binary' with unambiguous byte order, or ...
>Forgive me if this sounds rather simplistic - I appreciate that 
>exchange and archiving is fundamental (especially
>once we move beyond ASCII) - but to the average CIF user (or me at 
>least:-), proposing software-dependent encryption schemes seems far 
>more of a restriction than any of the other options this group has 
>been discussing. Indeed, presented with scheme B or C,
>I would be very tempted just to say "I'll work with UTF-8 - far less 
>complicated" :-)
>
>From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>Sent: Thursday, 15 July, 2010 22:57:29
>Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. 
>.. .. .. .. .. .. .. .. .. .. .
>
>On Wednesday, July 14, 2010 12:46 AM, James Hester wrote:
>>First, to reassure those that have tired of the encoding 
>>discussion, I do not think that this particular thread need delay 
>>discussion of other facets of the DDLm project.
>>   In fact, it would probably be wise at this stage for the 
>>interested parties to branch off into a 'ddlm-encodings-group' and 
>>bash out a scheme along the lines that John B has proposed. 
>>Perhaps Brian could set that up for us?
>That would be fine with me, and initially I deferred this response 
>pending Brian's answer.  Inasmuch as Brian is a very busy person, 
>however, and seems often to need some time to respond to this group, 
>I'm going ahead.
>>  Pending that branching off:
>>John B proposes an enhanced scheme which uses an embedded hash 
>>value as a check of encoding integrity (see his email below for the 
>>details).  This hash value would need to be computed by tools 
>>separate from a user's normal text editing workflow.  I might 
>>cheekily comment that, if the user needs to run another program to 
>>convert their 'non-CIF2' text file to a 'CIF2' text file, why not 
>>just run an encoding converter instead of a hash value inserting 
>>program?  But perhaps there are legitimate reasons, so below I 
>>analyse John's scheme B, which I think is the best proposal yet for 
>>handling multiple encodings.
>Thanks, both for the (qualified) praise of the scheme and for the analysis.
>>The weak points that I see in John B's scheme are:
>>(1) Information will not be correctly transferred if the hasher 
>>uses the wrong encoding for calculating the hash, and the recipient 
>>uses the same wrong encoding.  The recipient is likely to use the 
>>encoding suggested by the creator, so the probability of this type 
>>of failure occurring is essentially the probability of the CIF 
>>writer instructing the hash calculator to use the wrong encoding. 
>>Other mistakes by the CIF writer (forgetting to add a hash, leaving 
>>an old hash in the file) are likely to simply result in rejection, 
>>which I don't see as a failure.
>This is a valid criticism, but in practice I think it can be 
>significantly mitigated by good design of the hashing program (such 
>as reliance by default on the environmental default encoding, and 
>detecting probable encoding mismatches).  In the case of a 
>CIF-specific editor, there is no need for any separate step and no 
>chance of encoding mismatch.  James has an additional suggestion in 
>his (ii) below.
>>(2) In order to read the hash value, the encoding of the file needs 
>>to be known (!)
>Yes and no.  In many cases, either the encoding can be determined 
>from the content (even without a correct encoding tag) or it can be 
>determined well enough to parse the file to find the hash (most 
>ASCII supersets).  Nevertheless, something along the lines of 
>James's (ii) below can do better.
>>(3) The recipient doesn't know if a hash value is present until 
>>they have parsed the entire file
>This is correct.  The recipient also cannot *use* the hash without 
>parsing the entire file, however, so it doesn't make a lot of 
>difference.  Nevertheless, it would be possible to provide a hint at 
>the beginning of the file, so that parsers that wanted to avoid the 
>overhead of the hash computation could do so.
>>(4) Assumption that all recipients will be able to handle all encodings
>There is no such assumption.  Rather, there is an acknowledgement 
>that some systems may be unable to handle some CIFs.  That is 
>already the case with CIF1, and it is not completely resolved by 
>standardizing on UTF-8 (i.e. scheme A).
>>  (5) Potential for intermediate files to be lying around the users' 
>>system which are neither CIF2/UTF-8 nor CIF2/hashed but are in some 
>>sense CIF2 files.
>This is intentional.  Scheme B provides for reliable exchange and 
>archiving; it is not intended to be an integral part of the CIF 
>format.  It would serve more as a gateway protocol, used when people 
>transmit CIF text or deposit it in a local archive.  For all other 
>purposes, there is no need to make users decorate their CIFs with 
>hashes, nor to prevent them from treating CIFs as ordinary text 
>files, complying with local conventions.  Or at any rate, that's the 
>direction from which the scheme is proposed.
>>A strong point:
>>(6): user must run a CIF-aware program to produce the hash value, 
>>so there is an opportunity to hide complexity inside the program 
>>(or just convert to UTF-8...)
>Just converting to UTF-8 for archiving and exchange would be scheme 
>D.  Or perhaps scheme 0, as it has come up before in a couple of 
>different forms.  It is distinct from scheme A in that it applies 
>only to archiving and exchange,  not generally to the CIF format. 
>Note that scheme B does not require UTF-8 (or UTF-16 or UTF-32) CIFs 
>to carry a hash, so converting to UTF-8 for storage and exchange in 
>fact is a special case of scheme B.
>>We can reduce the likelihood of (1) by producing interactive 
>>CIF-hash calculators that present the file text to the user in the 
>>nominated encoding scheme for checking before the hash is 
>>calculated, with intelligent choice of file contents to find 
>>non-ASCII code points.
>Indeed so.
>>We can reduce the impact of the remaining issues with the following 
>>adjusted Scheme B (Scheme C).  I would find something like Scheme C 
>>acceptable.  Relevant changes:
>>(i) mandate putting the hash comment (if necessary) on the very 
>>first line of the file, using ASCII encoding for each character. 
>>Most text editors would find such mixed encoding a challenge, but 
>>as hashing must be done programmatically I don't see an issue. 
>>Likewise, before any further text processing is attempted, the file 
>>should be put through a hash checker, which would output a file 
>>ready for the local environment (without a hash check at the top). 
>>Note that the hash comment effectively replaces the CIF2.0 magic 
>>number, reducing potential for confusion.  Note that a non-UTF-8 
>>file without hash comment should not have the CIF2.0 magic number. 
>>This change addresses points (2) and (3) above
>I can't accept that, for two main reasons:
>a) An important aspect of the scheme is that a file that complies 
>with it can be handled as an ordinary text file, at least in an 
>environment that correctly autodetects the encoding or that happens 
>to assume the correct encoding (e.g. because it is the environmental 
>default).  Most particularly, it can still be handled as a text file 
>on the system where it was generated.
>b) Another important aspect of the scheme is that the text it carries 
>is compliant with the CIF2 text specifications, which require the magic 
>code.
>In addition,
>c) For encoding autodetection, it is of great advantage to have a 
>known character sequence at the beginning of the file.  Although 
>having instead two alternatives would not break autodetection, it 
>would make autodetection more complicated.  Also,
>d) as a practical matter, it is most convenient for a program adding 
>the hash to write it at the end.  A program checking the hash can't 
>be significantly bothered by such placement because it isn't 
>ready to use the hash until it reaches the end.
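Point (c) above can be illustrated with a small Python sketch.  It
assumes that the known leading character sequence is the CIF2 version
comment `#\#CIF_2.0`, and it only separates the UTF-16/UTF-32 variants
from "something ASCII-compatible"; the function name, the labels and the
candidate list are hypothetical, not part of any proposed scheme.

```python
# Assumed magic code: the CIF2 version comment at the start of the file.
MAGIC = "#\\#CIF_2.0"

# (codec used to encode the magic code, label reported to the caller).
# A match on the plain byte form cannot tell UTF-8 from other ASCII
# supersets (e.g. ISO-8859-1), hence the generic label.
CANDIDATES = [
    ("utf-32-be", "utf-32-be"),
    ("utf-32-le", "utf-32-le"),
    ("utf-16-be", "utf-16-be"),
    ("utf-16-le", "utf-16-le"),
    ("utf-8", "ascii-compatible"),
]


def detect_by_magic(raw):
    """Guess the encoding family from the leading bytes of a CIF.

    Returns a label from CANDIDATES, or None if the magic code is not
    found at the start of the byte stream (with or without a BOM).
    """
    for codec, label in CANDIDATES:
        magic = MAGIC.encode(codec)
        bom = "\ufeff".encode(codec)
        if raw.startswith(magic) or raw.startswith(bom + magic):
            return label
    return None
```

With a single known leading sequence the check is one pass over a short
candidate list; allowing a second alternative first line would roughly
double the patterns to test, which is the complication point (c) refers
to.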
>As discussed above, I don't think (2) is a compelling problem in practice.
>If (3) is an issue of significant concern, then I could agree to 
>yield on (d) by putting the content hash on the same line as the 
>magic code.  I don't see that being much advantage, however, given 
>the need to parse the entire file anyway before the hash is useful. 
>See also below.
>>(ii) state the encoding scheme as part of the hash line, inserted 
>>as part of the hash calculation.  In this way, at least the 
>>hasher's choice of encoding scheme is known, rather than allowing 
>>the further possible errors arising from hasher and user having 
>>different ideas of the encoding.  Addresses point (1)
>Scheme B already provides for an encoding tag as a hint about what 
>encoding was used (B.3), and I think it reasonable to expect a 
>hasher to create and/or rewrite that tag as necessary.  I would be 
>willing to make it a requirement that if present, the tag must 
>correctly indicate the encoding.  That would also address (3) to 
>some extent by serving as a hint that a content hash follows 
>(somewhere in the file).
>>(iii) restrict possible encodings to internationally recognised 
>>ones with well-specified Unicode mappings.  This addresses point (4)
>I don't see the need for this, and to some extent I think it could 
>be harmful.  For example, if Herb sees a use for a scheme of this 
>sort in conjunction with imgCIF (unknown at this point whether he 
>does), then he might want to be able to specify an encoding specific 
>to imgCIF, such as one that provides for multiple text segments, 
>each with its own character encoding.  To the extent that imgCIF is 
>an international standard, perhaps that could still satisfy the 
>restriction, but I don't think that was the intended meaning of 
>"internationally recognised".
>As for "well-specified Unicode mappings", I think maybe I'm missing 
>something.  CIF text is already limited to Unicode characters, and 
>any encoding that can serve for a particular piece of CIF text must 
>map at least the characters actually present in the text.  What 
>encodings or scenarios would be excluded, then, by that aspect of 
>this suggestion?
>I offer a few additional comments about scheme B:
>A) Because it does not require a hash for UTF-8 (including pure 
>US-ASCII; see B.5), scheme B is a superset of scheme A.
>B) Scheme B does not use quite the same language as scheme A with 
>respect to detectable encodings.  As a result, it supports (without 
>tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 
>variants.  This is intentional.
>C) Scheme B is not aimed at ensuring that every conceivable receiver 
>be able to interpret every scheme-B-compliant CIF.  Instead, it 
>provides receivers the ability to *judge* whether they can interpret 
>particular CIFs, and afterwards to *verify* that they have done so 
>correctly.  Ensuring that receivers can interpret CIFs is thus a 
>responsibility of the sender / archive maintainer, possibly in 
>cooperation with the receiver / retriever.
>D) I reiterate that the scheme's self-description as being "for 
>archiving and exchange of CIF text" is intentional and meaningful. 
>It is not intended to require that CIFs carry hashes or encoding 
>tags when used for other purposes.  The scheme is positioned for use 
>on the front end of an archive ingestion system or as part of a 
>sending-side agent for CIF exchange, though it need not be 
>restricted to such scenarios.
>E) Furthermore, scheme B does not interfere with the ability of any 
>party to transcode CIF text if a different encoding is more suitable 
>for their purposes.  I would expect many archivers to perform such 
>transcoding as a matter of course, though none are obligated to do so.
>In light of James' comments and suggestions, then, I offer scheme 
>B'.  It differs from scheme B at point 3, by requiring, in those 
>cases where the encoding cannot reliably be autodetected, that the 
>correct encoding name be written in an encoding tag at the beginning 
>of the file.  To my knowledge, all cases of interest either succumb 
>to autodetection or are sufficiently congruent with US-ASCII (or 
>otherwise sufficiently decodable, given known initial characters) to 
>allow the encoding tag to be read before the exact encoding is 
>known.  I emphasize that although the encoding tag being incorrect 
>makes a CIF non-compliant with this scheme, that does not prevent 
>the correct encoding being discovered via the content hash, by 
>iteration over all available schemes (provided that the correct 
>scheme is available).
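The recovery-by-iteration idea in the paragraph above might look
roughly like the following Python sketch.  It assumes the hash is MD5
over 24-bit big-endian code points, as in the scheme text, and that the
caller supplies a hypothetical `filter_text` function that removes
comments and the excluded whitespace (this needs a real CIF parser,
which is not shown).  All names here are illustrative.

```python
import hashlib


def content_hash(text):
    """MD5 over the code points of text, each as a 24-bit big-endian integer."""
    md5 = hashlib.md5()
    for ch in text:
        md5.update(ord(ch).to_bytes(3, "big"))
    return md5.hexdigest()


def discover_encoding(raw, expected_hash, candidates, filter_text=lambda t: t):
    """Return the first candidate encoding whose decoding matches the hash.

    raw           -- the file content as bytes
    expected_hash -- hex digest taken from the hash comment
    candidates    -- codec names to try, in order of preference
    filter_text   -- hypothetical hook that strips comments and excluded
                     whitespace; the identity default is for demonstration only
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # not decodable this way, or codec not available
        if content_hash(filter_text(text)) == expected_hash.lower():
            return name
    return None
```

Note that a match confirms the decoded text rather than a unique
encoding name: two encodings that map every byte in the file to the same
characters will both pass, and the first one tried is returned.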
>Scheme B':
>1. This scheme provides for reliable archiving and exchange of CIF 
>text.  Although it depends in some cases on metadata embedded in the 
>CIF text, presence of such metadata is not a well-formedness 
>constraint on the text itself.
>2. For the purposes of storage and transfer, CIF files must be 
>treated by all file handling protocols as streams of bytes.
>3. Any text encoding may be used.  If the encoding does not comply 
>with either (5a) or (5b) below, then its name must be given via an 
>encoding tag following the magic code, on the same line.  Otherwise, 
>an encoding tag is optional, but if present then it must correctly 
>name the encoding.
>4. Archiving or exchange of CIF text complies with this scheme if 
>the CIF text contains a correct content hash:
>   a) The hash value is computed by applying the MD5 algorithm to the 
>Unicode code point values of the CIF text, in the order they appear, 
>excluding all code points of CIF comments and all other CIF 
>whitespace appearing outside data values or separating List or Table 
>   b) The code point stream is converted to a byte stream for input 
>to the hash function by interpreting each code point as a 24-bit 
>integer, appearing on the byte stream in order from most-significant 
>to least-significant byte.
>   c) The hash value is expressed in the CIF itself as a structured 
>comment of the form:
>       where the Xs represent the hexadecimal digits of the computed 
>hash value.
>   d) The hash comment may appear anywhere in the CIF that a comment 
>may appear, but conventionally it is at the end of the CIF.
>5. Archiving or exchange of CIF text that does not contain a content 
>hash complies with this scheme if
>   a) the text encoding is specified in an international standard and 
>is distinguishable from all other encodings at the binary level, or
>   b) the text encoding is coincident with US-ASCII for all code 
>points appearing in the CIF.
>For the purposes of (5a), distinguishing encodings may rely on the 
>characteristics of CIF, such as the allowed character set and the 
>required CIF version comment, and also on the actual CIF text (such 
>as for recognition of UTF-8 by its encoding of non-ASCII characters).
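Steps 4(a) and 4(b) of the scheme above can be sketched in Python as
follows.  The sketch assumes its input has already had CIF comments and
the excluded whitespace removed (that step needs a CIF parser and is not
shown), and the function names are hypothetical.  The scheme text does
not say whether the hexadecimal digits are upper or lower case; the
sketch simply returns what `hashlib` produces (lower case).

```python
import hashlib


def hash_code_points(code_points):
    """MD5 of a code point sequence per steps 4(a)-(b).

    Each code point is fed to the hash as a 24-bit integer, most
    significant byte first (3 bytes per code point).
    """
    md5 = hashlib.md5()
    for cp in code_points:
        md5.update(cp.to_bytes(3, "big"))
    return md5.hexdigest()


def hash_filtered_text(filtered_text):
    """Hash text that has ALREADY been stripped of comments and excluded
    whitespace (assumption: done elsewhere by a CIF-aware parser)."""
    return hash_code_points(ord(ch) for ch in filtered_text)
```

Because the hash is taken over code points rather than over the stored
bytes, the same CIF text gives the same value whether it is stored as
UTF-8, UTF-16 or a legacy encoding, provided it is decoded correctly.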
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

ddlm-group mailing list
