Re: [Cif2-encoding] [ddlm-group] options/text vsbinary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .

  • To: "cif2-encoding@xxxxxxxx" <cif2-encoding@xxxxxxxx>
  • Subject: Re: [Cif2-encoding] [ddlm-group] options/text vsbinary/end-of-line . .. .. .. .. .. .. .. .. .. .. .. .. .. .
  • From: "Bollinger, John C" <John.Bollinger@xxxxxxxxxx>
  • Date: Wed, 11 Aug 2010 11:30:28 -0500

On Monday, August 09, 2010 10:20 PM, James Hester wrote:

>I had not fully appreciated that Scheme B is intended to be applied only at the moment of transfer or archiving, and envisions users normally saving files in their preferred encoding with no hash codes or encoding hints required (I will call the inclusion of such hints and hashes 'decoration').

"Envisions users normally [...]" is a bit stronger than my position or the intended orientation of Scheme B.  "Accommodates" would be my choice of wording.

>  A direct result of allowing undecorated files to reside on disk is that CIF software producers will need to write software that will function with arbitrary encodings with no decoration to help them, as that is the form that users' files will most often be in.

The standard can do no more to prevent users from storing undecorated CIFs than it can to prevent users from storing CIF text encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding.  More generally, all the standard can do is define the characteristics of a conformant CIF -- it can never prevent CIF-like but non-conformant files from being created, used, exchanged, or archived as if they were conformant CIFs.  Regardless of the standard's ultimate position on this issue, software authors will have to be guided by practical considerations and by the real-world requirements placed on their programs.  In particular, they will have to decide whether to accept "CIF" input that in fact violates the standard in various ways, and / or they will have to decide which optional CIF behaviors they will support.  As such, I don't see a significant distinction between the alternatives before us as regards the difficulty, complexity, or requirements of CIF2 software.

Furthermore, no formulation of CIF is inherently reliable or unreliable, because reliability (in this sense) is a characteristic of data transfer, not of data themselves.  Scheme B targets the activities that require reliability assurance, and disregards those that don't.  In a practical sense, this isn't any different from scheme A, because it is only when the encoding is potentially uncertain -- to wit, in the context of data transfer -- that either scheme need be applied (see also below).  I suppose I would be willing to make scheme B a general requirement of the CIF format, but I don't see any advantage there over the current formulation.  The actual behavior of people and the practical requirements on CIF software would not appreciably change.

>  Furthermore, given the ease with which files can be transferred between users (email attachment, saved in shared, network-mounted directory, drag and drop onto USB stick etc.) it is unlikely that Scheme B or anything involving extra effort would be applied unless the recipient demanded it.

For hand-created or hand-edited CIFs, I agree.  CIFs manipulated via a CIF2-compliant editor could be relied upon to conform to scheme B, however, provided that is standardized.  But the same applies to scheme A, given that few operating environments default to UTF-8 for text.

>  And given how many times that file might have changed hands across borders and operating systems within a single group collaboration, there would only be a qualified guarantee that the character to binary mapping has not been mangled en route, making any scheme applied subsequently rather pointless.

That also does not distinguish among the alternatives before us.  I appreciate the desire for an absolute guarantee of reliability, but none is available.  Qualified guarantees are the best we can achieve (and that's a technical assessment, not an aphorism).

>We would thus go from a situation where we had a single, reliable and sometimes slightly inconvenient encoding (UTF8), to one where a CIF processor should be prepared for any given CIF file to be one of a wide range of encodings which need to be guessed.

Under scheme A or the present draft text, we have "a single, reliable [...] encoding" only in the sense that the standard *specifies* that that encoding be used.  So far, however, I see little will to produce or use processors that are restricted to UTF-8, and I have every expectation that authors will continue to produce CIFs in various encodings regardless of the standard's ultimate stance.  Yes, it might be nice if everyone and every system converged on UTF-8 for text encoding, but CIF2 cannot force that to happen, not even among crystallographers.

In practice, then, we really have a situation where the practical / useful CIF2 processor must be prepared to handle a variety of encodings (details dependent on system requirements), which may need to be guessed, with no standard mechanism for helping the processor make that determination or for allowing it to check its guess.  Scheme B improves that situation by standardizing a general reliability assurance mechanism, which otherwise would be missing.  In view of the practical situation, I see no down side at all.  A CIF processor working with scheme B is *more* able, not less.
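
To make "allowing it to check its guess" concrete, here is a minimal Python sketch of the idea, not the Scheme B procedure itself. It assumes the hash is taken over the decoded characters (re-encoded as UTF-8 with line ends normalized) rather than over the stored bytes, and it uses SHA-256; both choices are assumptions for illustration, as are the function names. How the digest would be carried in the file is deliberately left out.

```python
import hashlib


def text_digest(text: str) -> str:
    """Digest of the characters, independent of the on-disk encoding.

    Assumption: line ends are normalized and the text is re-encoded as
    UTF-8 before hashing. The thread does not fix these details.
    """
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def check_guess(raw: bytes, guessed_encoding: str, expected_digest: str) -> bool:
    """Return True only if decoding with the guess reproduces the digest."""
    try:
        text = raw.decode(guessed_encoding)
    except (UnicodeDecodeError, LookupError):
        # Bytes invalid for this encoding, or encoding name unknown.
        return False
    return text_digest(text) == expected_digest
```

A wrong guess either fails to decode or decodes to different characters, so in both cases the check reports a mismatch instead of silently accepting garbled text.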


>I would much prefer a scheme which did not compromise reliability in such a significant way.

There is no such compromise, because in practice, we're not starting from a reliable position.

>My previous (somewhat clunky) attempts to adjust Scheme B were directed at trying to force any file with the CIF2.0 magic number to be either decorated or UTF-8, meaning that software has a reasonably high confidence in file integrity.
>An alternative way of thinking about this is that CIF files also act as the mechanism of information transfer between software programs.  [... W]hen a separate program is asked to input that CIF, the information has been transferred, even if that software is running on the same computer.

So in that sense, one could argue that Scheme B already applies to all CIFs, its assertion to the contrary notwithstanding.  Honestly, though, I don't think debating semantic details of terms such as "data transfer" is useful because in practice, and independent of scheme A, B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to choose what form of reliability assurance to accept or demand, if any.

>Now, moving on to the detailed contours of Scheme B and addressing the particular points that John and I have been discussing.  My original criticisms are the ones preceded by numerals.

(Quote-indentation levels have been adjusted in what follows.)


>>>(2) In order to read the hash value, the encoding of the file needs to be known (!)
>>Yes and no.  In many cases, either the encoding can be determined from the content (even without a correct encoding tag) or it can be determined well enough to parse the file to find the hash (most ASCII supersets).  Nevertheless, something along the lines of James's (ii) below can do better.
>If we restrict the allowed encodings to those for which the ASCII codepoints can be autodetected assuming CIF2 layout (in particular the first line) I think that would be sufficiently robust.

I'm glad we can agree on something.  :-)
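
As a rough illustration of autodetecting the ASCII code points from the first line, the following Python sketch compares the start of the file with the CIF2 magic code `#\#CIF_2.0` as it would appear in a few byte layouts. The function name, the labels it returns, and the set of layouts tried are assumptions made for the example, not a list of encodings the standard would permit.

```python
import codecs

MAGIC = "#\\#CIF_2.0"

# UTF-32 BOMs are tested first because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
BOMS = (
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
)

# Label -> codec used to render the magic code in that byte layout.
LAYOUTS = {
    "ascii-compatible": "ascii",   # UTF-8, ISO-8859-x, Shift-JIS, ...
    "utf-16-le": "utf-16-le",
    "utf-16-be": "utf-16-be",
    "utf-32-le": "utf-32-le",
    "utf-32-be": "utf-32-be",
}


def sniff_layout(raw: bytes):
    """Guess how ASCII code points are laid out, or return None."""
    for bom in BOMS:
        if raw.startswith(bom):
            raw = raw[len(bom):]   # skip an optional byte-order mark
            break
    for label, codec in LAYOUTS.items():
        if raw.startswith(MAGIC.encode(codec)):
            return label
    return None
```

An "ascii-compatible" result only says the header (and any hint or hash in it) can be read; it does not identify the actual encoding. A file in an EBCDIC code page such as CCSID 500 matches none of the layouts and yields `None`.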

>>>(3) The recipient doesn't know if a hash value is present until they have parsed the entire file
>>This is correct.  The recipient also cannot *use* the hash without parsing the entire file, however, so it doesn't make a lot of difference.  Nevertheless, it would be possible to provide a hint at the beginning of the file, so that parsers that wanted to avoid the overhead of the hash computation could do so.
>The point of having the hash at the front is so that a parsing program can immediately reject an undecorated, non-UTF-8 file, or alternatively branch based on how reliable the encoding hint is thought to be.  For example, if a hash is present, there is a somewhat stronger guarantee that the encoding hint has been checked or detected by a program rather than manually inserted.

I can see some advantage to that, offsetting the added complication for CIF writers.  I agree in principle to put the computed hash at the front of the file.
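
The branching James describes might look like the Python sketch below. Note that the `#hash=` marker on the second line is invented purely for illustration; nothing in this thread fixes a syntax for the hash or the encoding hint. The sketch also assumes an ASCII-compatible encoding, so that the header can be inspected as raw bytes.

```python
MAGIC = b"#\\#CIF_2.0"
HASH_MARK = b"#hash="   # hypothetical marker, not from any CIF2 draft


def triage(raw: bytes) -> str:
    """Decide how to treat a file from its first two lines.

    Returns one of: "not-cif2", "decorated", "utf-8", "reject".
    """
    lines = raw.split(b"\n", 2)
    if not lines[0].startswith(MAGIC):
        return "not-cif2"
    second = lines[1] if len(lines) > 1 else b""
    if second.startswith(HASH_MARK):
        # Decorated: the caller should go on to verify the hash.
        return "decorated"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecorated and not valid UTF-8: nothing to check against.
        return "reject"
    return "utf-8"
```

A reader that prefers to be lenient could replace the "reject" branch with a guess, at the cost of having no way to confirm it.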

>>>(4) Assumption that all recipients will be able to handle all encodings
>>There is no such assumption.  Rather, there is an acknowledgement that some systems may be unable to handle some CIFs.  That is already the case with CIF1, and it is not completely resolved by standardizing on UTF-8 (i.e. scheme A).
>There is no such thing as 'optional' for an information interchange standard.  A file that conforms to the standard must be readable by parsers written according to the standard. If reading a standard-conformant file might fail or (worse) the file might be misinterpreted, information cannot always reliably be exchanged using this standard, so that optional behaviour needs to be either discarded, or made mandatory. There is thus no point in including optional behaviour in the standard. So: if the standard allows files to be written in encoding XYZ, then all readers should be able to read files written in encoding XYZ.  I view the CIF1 stance of allowing any encoding as a mistake, but a benign one, as in the case of CIF1, ASCII was so entrenched that it was the de facto standard for the characters appearing in CIF1 files.  In short, we have to specify a limited set of acceptable encodings.

As Herb astutely observed, those assertions reflect a fundamental source of our disagreement.  I think we can all agree that a standard that permits conforming software to misinterpret conforming data is undesirable.
Surely we can also agree that an information interchange standard does not serve its purpose if it does not support information being successfully interchanged.  It does not follow, however, that the artifacts by which any two parties realize an information interchange must be interpretable by all other conceivable parties, nor does it follow that that would be a supremely advantageous characteristic if it were achievable.  It also does not follow that recognizable failure of any particular attempt at interchange must at all costs be avoided, or that a data interchange standard must take no account of its usage context.

Optional and alternative behaviors are not fundamentally incompatible with a data interchange standard, as XML and HTML demonstrate.  Or consider the extreme variability of CIF text content: whether a particular CIF is suitable for a particular purpose depends intimately on exactly which data are present in it, and even to some extent on which data names are used to present them, even though ALL are optional as far as the format is concerned.  If I say 'This CIF is unsuitable for my present purpose because it does not contain _symmetry_space_group_name_H-M', that does not mean the CIF standard is broken.  Yet, it is not qualitatively different for me to say 'This CIF is unsuitable because it is encoded in CCSID 500' despite CIF2 (hypothetically) permitting arbitrary encodings.
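
A receiver's "unsuitable because it is encoded in CCSID 500" decision can be sketched in a few lines of Python. The set of supported encodings and the function name are assumptions for the example; `codecs.lookup` is used only to normalize spelling variants of an encoding name.

```python
import codecs

# Encodings this hypothetical receiver has chosen to support.
SUPPORTED = {codecs.lookup(name).name for name in ("utf-8", "iso-8859-15")}


def can_interpret(declared_encoding: str) -> bool:
    """Judge, before parsing, whether a declared encoding is acceptable."""
    try:
        canonical = codecs.lookup(declared_encoding).name
    except LookupError:
        # Unknown name: the receiver cannot even attempt to decode.
        return False
    return canonical in SUPPORTED
```

Here `can_interpret("cp500")` is false even though Python itself can decode CCSID 500: the file is declined as a matter of the receiver's policy, in the same way a CIF lacking a required data name would be, and neither case means the format is broken.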


>>>(iii) restrict possible encodings to internationally recognised ones with well-specified Unicode mappings.  This addresses point (4)
>>I don't see the need for this, and to some extent I think it could be harmful.  For example, if Herb sees a use for a scheme of this sort in conjunction with imgCIF (unknown at this point whether he does), then he might want to be able to specify an encoding specific to imgCIF, such as one that provides for multiple text segments, each with its own character encoding.  To the extent that imgCIF is an international standard, perhaps that could still satisfy the restriction, but I don't think that was the intended meaning of "internationally recognised".
>>As for "well-specified Unicode mappings", I think maybe I'm missing something.  CIF text is already limited to Unicode characters, and any encoding that can serve for a particular piece of CIF text must map at least the characters actually present in the text.  What encodings or scenarios would be excluded, then, by that aspect of this suggestion?
>My intention was to make sure that not only the particular user who created the file knew this mapping, but that the mapping was publicly available.  Certainly only Unicode encodable code points will appear, but the recipient needs to be able to recover the mapping from the file bytes to Unicode without relying on e.g. files that will be supplied on request by someone whose email address no longer works.

This issue is relevant only to the parties among whom a particular CIF is exchanged.  The standard would not particularly assist those parties by restricting the permitted encodings, because they can safely ignore such restrictions if they mutually agree to do so (whether statically or dynamically), and they (specifically, the CIF originator) must anyway comply with them if no such agreement is implicit or can be reached.

>>I offer a few additional comments about scheme B:


>>B) Scheme B does not use quite the same language as scheme A with respect to detectable encodings.  As a result, it supports (without tagging or hashing) not just UTF-8, but also all UTF-16 and UTF-32 variants.  This is intentional.
>I am concerned that the vast majority of users based in English-speaking countries (and many non-English-speaking countries) will be quite annoyed if they have to deal with UTF-16/32 CIF2 files that are no longer accessible to the simple ASCII-based tools and software that they are used to.  Because of this, allowing undecorated UTF-16/32 would be far more disruptive than forcing people to use UTF-8 only. Thus my stipulation on maintaining compatibility with ASCII for undecorated files.

Supporting UTF-16/32 without tagging or hashing is not a key provision of scheme B, and I could live without it, but I don't think that would significantly change the likelihood of a user unexpectedly encountering undecorated UTF-16/32 CIFs.  It would change only whether such files were technically CIF-conformant, which doesn't much matter to the user on the spot.  In any case, it is not the lack of decoration that is the basic problem here.
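
For readers unfamiliar with why UTF-16/32 files defeat ASCII-based tools, the short Python sketch below mimics a byte-oriented search of the kind `grep` performs. The sample CIF fragment and the function name are made up for the illustration.

```python
SAMPLE = "#\\#CIF_2.0\ndata_example\n_cell_length_a 5.43\n"


def ascii_tool_finds(raw: bytes, needle: bytes = b"_cell_length_a") -> bool:
    """Mimic a simple byte-oriented search for an ASCII data name."""
    return needle in raw


for codec in ("utf-8", "latin-1", "utf-16-le", "utf-32-be"):
    found = ascii_tool_finds(SAMPLE.encode(codec))
    print(f"{codec:10s} data name found: {found}")
```

The search succeeds for the ASCII-compatible encodings and fails for UTF-16 and UTF-32, where each character is padded with zero bytes. That failure occurs whether or not the file is decorated and whether or not the standard blesses it.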

>>C) Scheme B is not aimed at ensuring that every conceivable receiver be able to interpret every scheme-B-compliant CIF.  Instead, it provides receivers the ability to *judge* whether they can interpret particular CIFs, and afterwards to *verify* that they have done so correctly.  Ensuring that receivers can interpret CIFs is thus a responsibility of the sender / archive maintainer, possibly in cooperation with the receiver / retriever.
>As I've said before, I don't see the paradigm of live negotiation between senders and receivers as very useful, as it fails to account for CIFs being passed between different software (via reading/writing to a file system), or CIFs where the creator is no longer around, or technically unsophisticated senders where, for example, the software has produced an undecorated CIF in some native encoding and the sender has absolutely no idea why the receiver (if they even have contact with the receiver!) can't read the file properly.   I prefer to see the standard that we set as a substitute for live negotiation, so leaving things up to the users is in that sense an abrogation of our responsibility.

That scenario will undoubtedly occur occasionally regardless of the outcome of this discussion.  If it is our responsibility to avoid it at all costs then we are doomed to fail in that regard.  Software *will* under some circumstances produce undecorated, non-UTF-8 "CIFs" because that is sometimes convenient, efficient, and appropriate for the program's purpose.

I think, though, those comments reflect a bit of a misconception.  The overall purpose of CIF supporting multiple encodings would be to allow specific CIFs to be better adapted for specific purposes.  Such purposes include, but are not limited to

* exchanging data with general-purpose program(s) on the same system
* exchanging data with crystallography program(s) on the same system
* supporting performance or storage objectives of specific programs or systems
* efficiently supporting problem or data domains in which Latin text is a minority of the content (e.g. imgCIF)
* storing data in a personal archive
* exchanging data with known third parties
* publishing data to a general audience

*Few, if any, of those uses would be likely to involve live negotiation.*  That's why I assigned primary responsibility for selecting encodings to the entity providing the CIF.  I probably should not even have mentioned cooperation of the receiver; I did so more because it is conceivable than because it is likely.

Under any scheme I can imagine, some CIFs will not be well suited to some purposes.  I want to avoid the situation that *no* conformant CIF can be well suited to some reasonable purposes.  I am willing to forgo the result that *every* conformant CIF is suited to certain other, also reasonable purposes.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

