[Cif2-encoding] Fundamental source of disagreement

  • To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
  • Subject: [Cif2-encoding] Fundamental source of disagreement
  • From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
  • Date: Tue, 10 Aug 2010 06:42:26 -0400 (EDT)
  • In-Reply-To: <AANLkTinZ4KNsnREOOU6sVFdGYR_aQHcjdWr_ko648NGm@mail.gmail.com>
  • References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><AANLkTimLmnpS-HHP9en-zwUDeVKtbHSUJa36tUCOlQtL@mail.gmail.com><826180.50656.qm@web87010.mail.ird.yahoo.com><563298.52532.qm@web87005.mail.ird.yahoo.com><520427.68014.qm@web87001.mail.ird.yahoo.com><a06240800c84ac1b696bf@192.168.2.104><614241.93385.qm@web87016.mail.ird.yahoo.com><alpine.BSF.2.00.1006251827270.70846@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local><33483.93964.qm@web87012.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local><AANLkTilqKa_vZJEmfjEtd_MzKhH1CijEIglJzWpFQrrC@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikTee4PicHKjnnbAdipegyELQ6UWLXz9Zm08aVL@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local><AANLkTinZ4KNsnREOOU6sVFdGYR_aQHcjdWr_ko648NGm@mail.gmail.com>
With all due respect to James and others who adhere to the view that:

"There is no such thing as 'optional' for an information interchange 
standard."

I believe this is the fundamental source of our disagreement on
the direction for CIF2.

Optional features are common in almost all current successful standards
for information interchange, including HTML4, XMF and CIF1.  As a
practical matter, one tries to have strict writers and liberal readers
for interchange standards to encourage migration to as common a
convention as possible.  Even so, if we are too strict in our rules
for what is and is not a proper CIF, we will probably encourage
the growth of multiple unofficial, unmanaged and non-interchangeable
CIF2 dialects.

As for John's hashing scheme, I suspect some variation of it will find 
significant use in major archives, just as associating MD5 checksums
with tarballs does for many software distributors, but that we also
will need some easier-to-generate-and-transfer _optional_ encoding
hint schemes, such as the accented "o's".  One simple way to handle
it would be:

   1.  Put some variant of the accented "o's" into the _optional_
magic number; and
   2.  Adopt the tarball approach to MD5 checksums by having it not
in the header but in a separate file, simply generating it from
a canonical UTF8 representation of the CIF2 file.

The accented o's are easy to carry along as an encoding hint, and
if you get the encoding hint right, then you will easily be able
to generate a canonical UTF8 file to validate the MD5 checksum against
if you wish for a critical file transfer, e.g. to an archive or a journal.
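
Step 2 above could be sketched roughly as follows. This is only an
illustration, not an agreed procedure: the helper names (canonical_utf8,
md5_of_cif, write_md5_sidecar) are hypothetical, and "canonical UTF8" is
assumed here to mean nothing more than the decoded text re-encoded as
UTF-8 without a byte-order mark, since no exact canonical form has been
defined in this thread.

```python
import hashlib
from pathlib import Path


def canonical_utf8(raw: bytes, encoding_hint: str) -> bytes:
    """Decode with the hinted encoding and re-encode as UTF-8.

    Assumption: "canonical" only means UTF-8 with no byte-order mark;
    line-ending and Unicode normalization are deliberately left out.
    """
    text = raw.decode(encoding_hint)  # raises UnicodeDecodeError on a bad hint
    return text.lstrip("\ufeff").encode("utf-8")


def md5_of_cif(path: Path, encoding_hint: str) -> str:
    """MD5 hex digest of the canonical UTF-8 form of the file."""
    return hashlib.md5(canonical_utf8(path.read_bytes(), encoding_hint)).hexdigest()


def write_md5_sidecar(path: Path, encoding_hint: str) -> Path:
    """Write '<name>.md5' next to the CIF, in the style used for tarballs."""
    sidecar = path.with_name(path.name + ".md5")
    line = f"{md5_of_cif(path, encoding_hint)}  {path.name}\n"
    sidecar.write_text(line, encoding="utf-8")
    return sidecar
```

Because the digest is taken over the UTF-8 form rather than the bytes on
disk, the same CIF text saved as Latin-1 on one machine and as UTF-8 on
another yields the same checksum, provided the encoding hint is right in
each case.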

Regards,
    Herbert




=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 10 Aug 2010, James Hester wrote:

> I had not fully appreciated that Scheme B is intended to be applied only at the
> moment of transfer or archiving, and envisions users normally saving files in their
> preferred encoding with no hash codes or encoding hints required (I will call the
> inclusion of such hints and hashes as 'decoration').  A direct result of allowing
> undecorated files to reside on disk is that CIF software producers will need to
> write software that will function with arbitrary encodings with no decoration to
> help them, as that is the form that users' files will most often be in.
> Furthermore, given the ease with which files can be transferred between users
> (email attachment, saved in shared, network-mounted directory, drag and drop onto
> USB stick etc.) it is unlikely that Scheme B or anything involving extra effort
> would be applied unless the recipient demanded it.  And given how many times that
> file might have changed hands across borders and operating systems within a single
> group collaboration, there would only be a qualified guarantee that the character
> to binary mapping has not been mangled en route, making any scheme applied
> subsequently rather pointless.
> 
> We would thus go from a situation where we had a single, reliable and sometimes
> slightly inconvenient encoding (UTF8), to one where a CIF processor should be
> prepared for any given CIF file to be one of a wide range of encodings which need
> to be guessed. This CIF processor is thus forced to autodetect (possibly
> unreliably) or interact with the user.  This looks a lot like a step backward to
> me.
> 
> I would much prefer a scheme which did not compromise reliability in such a
> significant way.  My previous (somewhat clunky) attempts to adjust Scheme B were
> directed at trying to force any file with the CIF2.0 magic number to be either
> decorated or UTF-8, meaning that software has a reasonably high confidence in file
> integrity.
> 
> An alternative way of thinking about this is that CIF files also act as the
> mechanism of information transfer between software programs.  This may be less
> pronounced in mmCIF environments, but in small molecule work a large number of
> programs work with CIFs and pass structural information around using CIF files. 
> Therefore, the act of writing a CIF file is essentially already an act of
> information transfer: when a separate program is asked to input that CIF, the
> information has been transferred, even if that software is running on the same
> computer. 
> 
> Now, moving on to the detailed contours of Scheme B and addressing the particular
> points that John and I have been discussing.  My original criticisms are the ones
> preceded by numerals.
>
>       >(1) Information will not be correctly transferred if the hasher uses
>       the wrong encoding for calculating the hash, and the recipient uses the
>       same wrong encoding.  The recipient is likely to use the encoding
>       suggested by the creator, so the probability of this type of failure
>       occurring is essentially the probability of the CIF writer instructing
>       the hash calculator to use the wrong encoding.  Other mistakes by the
>       CIF writer (forgetting to add a hash, leaving an old hash in the file)
>       are likely to simply result in rejection, which I don't see as a
>       failure.
>
>       This is a valid criticism, but in practice I think it can be
>       significantly mitigated by good design of the hashing program (such as
>       reliance by default on the environmental default encoding, and
>       detecting probable encoding mismatches).  In the case of a CIF-specific
>       editor, there is no need for any separate step and no chance of
>       encoding mismatch.  James has an additional suggestion in his (ii)
>       below.
>
>       >(2) In order to read the hash value, the encoding of the file needs to
>       be known (!)
>
>       Yes and no.  In many cases, either the encoding can be determined from
>       the content (even without a correct encoding tag) or it can be
>       determined well enough to parse the file to find the hash (most ASCII
>       supersets).  Nevertheless, something along the lines of James's (ii)
>       below can do better.
> 
> 
> If we restrict the allowed encodings to those for which the ASCII codepoints can be
> autodetected assuming CIF2 layout (in particular the first line) I think that would
> be sufficiently robust.
>
>       >(3) The recipient doesn't know if a hash value is present until they
>       have parsed the entire file
>
>       This is correct.  The recipient also cannot *use* the hash without
>       parsing the entire file, however, so it doesn't make a lot of
>       difference.  Nevertheless, it would be possible to provide a hint at
>       the beginning of the file, so that parsers that wanted to avoid the
>       overhead of the hash computation could do so.
> 
> 
> The point of having the hash at the front is so that a parsing program can
> immediately reject an undecorated, non UTF-8 file, or alternatively branch based on
> how reliable the encoding hint is thought to be.  For example, if a hash is
> present, there is a somewhat stronger guarantee that the encoding hint has been
> checked or detected by a program rather than manually inserted. 
>
>       >(4) Assumption that all recipients will be able to handle all
>       encodings
>
>       There is no such assumption.  Rather, there is an acknowledgement that
>       some systems may be unable to handle some CIFs.  That is already the
>       case with CIF1, and it is not completely resolved by standardizing on
>       UTF-8 (i.e. scheme A).
> 
> 
> There is no such thing as 'optional' for an information interchange standard.  A
> file that conforms to the standard must be readable by parsers written according to
> the standard. If reading a standard-conformant file might fail or (worse) the file
> might be misinterpreted, information cannot always reliably be exchanged using this
> standard, so that optional behaviour needs to be either discarded, or made
> mandatory. There is thus no point in including optional behaviour in the standard.
> So: if the standard allows files to be written in encoding XYZ, then all readers
> should be able to read files written in encoding XYZ.  I view the CIF1 stance of
> allowing any encoding as a mistake, but a benign one, as in the case of CIF1 ASCII
> was so entrenched that it was the de facto standard for the characters appearing in
> CIF1 files.  In short, we have to specify a limited set of acceptable encodings.
> 
>
>       > (5) Potential for intermediate files to be lying around the users'
>       system which are neither CIF2/UTF-8 nor CIF2/hashed but are in some
>       sense CIF2 files.
>
>       This is intentional.  Scheme B provides for reliable exchange and
>       archiving; it is not intended to be an integral part of the CIF format.
>        It would serve more as a gateway protocol, used when people transmit
>       CIF text or deposit it in a local archive.  For all other purposes,
>       there is no need to make users decorate their CIFs with hashes, nor to
>       prevent them from treating CIFs as ordinary text files, complying with
>       local conventions.  Or at any rate, that's the direction from which the
>       scheme is proposed.
> 
> I have addressed this above.
>  
>       >A strong point:
>       >(6): user must run a CIF-aware program to produce the hash value, so
>       there is an opportunity to hide complexity inside the program (or just
>       convert to UTF-8...)
>
>       Just converting to UTF-8 for archiving and exchange would be scheme D.
>        Or perhaps scheme 0, as it has come up before in a couple of different
>       forms.  It is distinct from scheme A in that it applies only to
>       archiving and exchange,  not generally to the CIF format.  Note that
>       scheme B does not require UTF-8 (or UTF-16 or UTF-32) CIFs to carry a
>       hash, so converting to UTF-8 for storage and exchange in fact is a
>       special case of scheme B.
> 
> 
> Indeed.
>
>       >We can reduce the likelihood of (1) by producing interactive CIF-hash
>       calculators that present the file text to the user in the nominated
>       encoding scheme for checking before the hash is calculated, with
>       intelligent choice of file contents to find non-ASCII code points.
>
>       Indeed so.
>
>       >We can reduce the impact of the remaining issues with the following
>       adjusted Scheme B (Scheme C).  I would find something like Scheme C
>       acceptable.  Relevant changes:
>       >
>       >(i) mandate putting the hash comment (if necessary) on the very first
>       line of the file, using ASCII encoding for each character.  Most text
>       editors would find such mixed encoding a challenge, but as hashing must
>       be done programmatically I don't see an issue.  Likewise, before any
>       further text processing is attempted, the file should be put through a
>       hash checker, which would output a file ready for the local environment
>       (without a hash check at the top).  Note that the hash comment
>       effectively replaces the CIF2.0 magic number, reducing potential for
>       confusion.  Note that a non-UTF-8 file without hash comment should not
>       have the CIF2.0 magic number. This change addresses points (2) and (3)
>       above
>        
>
>       I can't accept that, for two main reasons:
>
>       a) An important aspect of the scheme is that a file that complies with
>       it can be handled as an ordinary text file, at least in an environment
>       that correctly autodetects the encoding or that happens to assume the
>       correct encoding (e.g. because it is the environmental default).  Most
>       particularly, it can still be handled as a text file on the system
>       where it was generated.
>
>       b) Another important aspect of the scheme is that the texts it carries are
>       compliant with the CIF2 text specifications, which require the magic
>       code.
>
>       In addition,
>
>       c) For encoding autodetection, it is of great advantage to have a known
>       character sequence at the beginning of the file.  Although having
>       instead two alternatives would not break autodetection, it would make
>       autodetection more complicated.  Also,
>
>       d) as a practical matter, it is most convenient for a program adding
>       the hash to write it at the end.  A program checking the hash can't be
>       significantly bothered by such placement because it isn't ready to
>       use the hash until it reaches the end.
> 
>
>       As discussed above, I don't think (2) is a compelling problem in
>       practice.
>
>       If (3) is an issue of significant concern, then I could agree to yield
>       on (d) by putting the content hash on the same line as the magic code.
>        I don't see that being much advantage, however, given the need to
>       parse the entire file anyway before the hash is useful.  See also
>       below.
> 
> 
> I would be pleased if you would agree to mandate the hash at the beginning.
>
>       >(ii) state the encoding scheme as part of the hash line, inserted as
>       part of the hash calculation.  In this way, at least the hasher's
>       choice of encoding scheme is known, rather than allowing the further
>       possible errors arising from hasher and user having different ideas of
>       the encoding.  Addresses point (1)
>
>       Scheme B already provides for an encoding tag as a hint about what
>       encoding was used (B.3), and I think it reasonable to expect a hasher
>       to create and/or rewrite that tag as necessary.  I would be willing to
>       make it a requirement that if present, the tag must correctly indicate
>       the encoding.  That would also address (3) to some extent by serving as
>       a hint that a content hash follows (somewhere in the file).
> 
> 
> Good.  I don't think the presence of the encoding hint by itself is enough to serve
> as a guarantee that a hash will be found, as there will be users who simply insert
> that by themselves, being unaware that it is supposed to be machine generated.
>
>       >(iii) restrict possible encodings to internationally recognised ones
>       with well-specified Unicode mappings.  This addresses point (4)
>
>       I don't see the need for this, and to some extent I think it could be
>       harmful.  For example, if Herb sees a use for a scheme of this sort in
>       conjunction with imgCIF (unknown at this point whether he does), then
>       he might want to be able to specify an encoding specific to imgCIF,
>       such as one that provides for multiple text segments, each with its own
>       character encoding.  To the extent that imgCIF is an international
>       standard, perhaps that could still satisfy the restriction, but I don't
>       think that was the intended meaning of "internationally recognised".
>
>       As for "well-specified Unicode mappings", I think maybe I'm missing
>       something.  CIF text is already limited to Unicode characters, and any
>       encoding that can serve for a particular piece of CIF text must map at
>       least the characters actually present in the text.  What encodings or
>       scenarios would be excluded, then, by that aspect of this suggestion?
> 
> 
> My intention was to make sure that not only the particular user who created the
> file knew this mapping, but that the mapping was publicly available.  Certainly
> only Unicode encodable code points will appear, but the recipient needs to be able
> to recover the mapping from the file bytes to Unicode without relying on e.g. files
> that will be supplied on request by someone whose email address no longer works.
> 
>
>       I offer a few additional comments about scheme B:
>
>       A) Because it does not require a hash for UTF-8 (including pure
>       US-ASCII; see B.5), scheme B is a superset of scheme A.
>
>       B) Scheme B does not use quite the same language as scheme A with
>       respect to detectable encodings.  As a result, it supports (without
>       tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32
>       variants.  This is intentional.
> 
> 
> I am concerned that the vast majority of users based in English speaking countries
> (and many non English speaking countries) will be quite annoyed if they have to
> deal with UTF-16/32 CIF2 files that are no longer accessible to the simple
> ASCII-based tools and software that they are used to.  Because of this, allowing
> undecorated UTF16/32 would be far more disruptive than forcing people to use UTF8
> only. Thus my stipulation on maintaining compatibility with ASCII for undecorated
> files.
>  
>
>       C) Scheme B is not aimed at ensuring that every conceivable receiver be
>       able to interpret every scheme-B-compliant CIF.  Instead, it provides
>       receivers the ability to *judge* whether they can interpret particular
>       CIFs, and afterwards to *verify* that they have done so correctly.
>        Ensuring that receivers can interpret CIFs is thus a responsibility of
>       the sender / archive maintainer, possibly in cooperation with the
>       receiver / retriever.
> 
> 
> As I've said before, I don't see the paradigm of live negotiation between senders
> and receivers as very useful, as it fails to account for CIFs being passed between
> different software (via reading/writing to a file system), or CIFs where the
> creator is no longer around, or technically unsophisticated senders where, for
> example, the software has produced an undecorated CIF in some native encoding and
> the sender has absolutely no idea why the receiver (if they even have contact with
> the receiver!) can't read the file properly.   I prefer to see the standard that we
> set as a substitute for live negotiation, so leaving things up to the users is in
> that sense an abrogation of our responsibility.
>
>       D) I reiterate that the scheme's self-description as being "for
>       archiving and exchange of CIF text" is intentional and meaningful.  It
>       is not intended to require that CIFs carry hashes or encoding tags when
>       used for other purposes.  The scheme is positioned for use on the front
>       end of an archive ingestion system or as part of a sending-side agent
>       for CIF exchange, though it need not be restricted to such scenarios.
>
>       E) Furthermore, scheme B does not interfere with the ability of any
>       party to transcode CIF text if a different encoding is more suitable
>       for their purposes.  I would expect many archivers to perform such
>       transcoding as a matter of course, though none are obligated to do so.
> 
>
>       In light of James' comments and suggestions, then, I offer scheme B'.
>        It differs from scheme B at point 3, by requiring, in those cases
>       where the encoding cannot reliably be autodetected, that the correct
>       encoding name be written in an encoding tag at the beginning of the
>       file.  To my knowledge, all cases of interest either succumb to
>       autodetection or are sufficiently congruent with US-ASCII (or otherwise
>       sufficiently decodable, given known initial characters) to allow the
>       encoding tag to be read before the exact encoding is known.  I
>       emphasize that although the encoding tag being incorrect makes a CIF
>       non-compliant with this scheme, that does not prevent the correct
>       encoding being discovered via the content hash, by iteration over all
>       available schemes (provided that the correct scheme is available).
>
>       Scheme B':
>       1. This scheme provides for reliable archiving and exchange of CIF
>       text.  Although it depends in some cases on metadata embedded in the
>       CIF text, presence of such metadata is not a well-formedness constraint
>       on the text itself.
>       2. For the purposes of storage and transfer, CIF files must be treated
>       by all file handling protocols as streams of bytes.
>       3. Any text encoding may be used.  If the encoding does not comply with
>       either (5a) or (5b) below, then its name must be given via an encoding
>       tag following the magic code, on the same line.  Otherwise, an encoding
>       tag is optional, but if present then it must correctly name the
>       encoding.
>       4. Archiving or exchange of CIF text complies with this scheme if the
>       CIF text contains a correct content hash:
>         a) The hash value is computed by applying the MD5 algorithm to the
>       Unicode code point values of the CIF text, in the order they appear,
>       excluding all code points of CIF comments and all other CIF whitespace
>       appearing outside data values or separating List or Table elements.
>         b) The code point stream is converted to a byte stream for input to
>       the hash function by interpreting each code point as a 24-bit integer,
>       appearing on the byte stream in order from most-significant to
>       least-significant byte.
>         c) The hash value is expressed in the CIF itself as a structured
>       comment of the form:
>             #\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>            where the Xs represent the hexadecimal digits of the computed hash
>       value.
>         d) The hash comment may appear anywhere in the CIF that a comment may
>       appear, but conventionally it is at the end of the CIF.
>       5. Archiving or exchange of CIF text that does not contain a content
>       hash complies with this scheme if
>         a) the text encoding is specified in an international standard and is
>       distinguishable from all other encodings at the binary level, or
>         b) the text encoding is coincident with US-ASCII for all code points
>       appearing in the CIF.
>       For the purposes of (5a), distinguishing encodings may rely on the
>       characteristics of CIF, such as the allowed character set and the
>       required CIF version comment, and also on the actual CIF text (such as
>       for recognition of UTF-8 by its encoding of non-ASCII characters).
> 
>
>       Regards,
>
>       John
>       --
>       John C. Bollinger, Ph.D.
>       Department of Structural Biology
>       St. Jude Children's Research Hospital
> 
> 
> 
>
>       Email Disclaimer:  www.stjude.org/emaildisclaimer
>
>       _______________________________________________
>       ddlm-group mailing list
>       ddlm-group@iucr.org
>       http://scripts.iucr.org/mailman/listinfo/ddlm-group
> 
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>
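
For reference, the content hash described in points 4(a) to 4(c) of
Scheme B' in the quoted message might be computed along these lines.
This is a sketch under stated assumptions: the function names are
hypothetical, the input is assumed to have already had comments and
insignificant whitespace removed as 4(a) requires (which needs a real
CIF2 parser and is not attempted here), and the case of the hexadecimal
digits is not specified in the scheme, so the lowercase output of
hashlib is used as-is.

```python
import hashlib


def content_hash_md5(filtered_text: str) -> str:
    """MD5 over Unicode code points, each fed as a 24-bit big-endian integer.

    Assumption: the caller has already dropped CIF comments and the
    whitespace excluded by point 4(a).
    """
    md5 = hashlib.md5()
    for ch in filtered_text:
        # Point 4(b): most-significant byte first, three bytes per code point.
        md5.update(ord(ch).to_bytes(3, "big"))
    return md5.hexdigest()


def hash_comment(filtered_text: str) -> str:
    """Format the structured comment of point 4(c)."""
    return "#\\#content_hash_md5:" + content_hash_md5(filtered_text)
```

Since the hash is taken over code points rather than stored bytes, it is
independent of the encoding used on disk; a receiver who decodes with the
wrong encoding obtains different code points and hence a mismatching hash.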
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding