[Cif2-encoding] The discussion so far

Thanks, Brian, for creating this list.


Since no one else has had the combination of time, energy, and inclination to do so, I’ll open with a summary of the CIF 2.0 character encoding discussion as it currently stands at its original location on the DDLm group.  Specifically, the previous summary and the discussion that proceeded from it can be found at http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00744.html and http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00690.html, though some of the most recent messages seem not to be available at present via the web interface.


The controversy derives from CIF 2.0's expansion of the permitted character set to all of Unicode.  It is magnified by CIF 1.0's explicit self-description as an encoding-independent text format, and by the accumulated body of CIF software and author practices that rely on that text orientation.  There has been considerable debate about what it would mean for CIF 2.0 to be a text format vs. a binary format, and about the relative advantages and disadvantages of each.  Among the points covered were:


1) A 'text' format implies that CIF content may comply with local, locale-specific conventions for electronic text representation, including details such as line termination conventions and, especially, character encoding.  Such files are suitable input for general-purpose text tools such as text editors, text extraction utilities, and text indexers.  Alternatively, a conformant text CIF might be expressed according to some other convention suitable for a particular application or foreign environment.  Because conventions differ, correctly archiving text or moving it between environments requires accounting for the text conventions in use, and may involve conversions such as line terminator changes and transcoding.  This is the CIF 1.0 position, though CIF1's restricted character set significantly reduces the impact of character encoding considerations relative to CIF2.
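As a rough illustration of the kind of conversion point (1) refers to, the following Python sketch transcodes a byte sequence from one character encoding to another and rewrites its line terminators.  The function name and parameters are hypothetical, chosen only for this example; nothing here is taken from any CIF specification or existing CIF tool.

```python
def convert_text_conventions(data: bytes,
                             src_encoding: str,
                             dst_encoding: str,
                             dst_newline: str = "\n") -> bytes:
    """Transcode `data` and rewrite its line terminators.

    Hypothetical helper for illustration only.  The caller must already
    know `src_encoding`; that is exactly the information a plain text
    file does not carry with it.
    """
    # Decode the bytes into abstract characters (Unicode code points).
    text = data.decode(src_encoding)
    # Normalize CRLF and bare CR to LF before applying the target convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if dst_newline != "\n":
        text = text.replace("\n", dst_newline)
    # Serialize the characters again under the target encoding.
    return text.encode(dst_encoding)


# A Latin-1 file with CRLF line ends, converted to UTF-8 with LF line ends.
latin1_bytes = b"_x '\xc5'\r\n"
utf8_bytes = convert_text_conventions(latin1_bytes, "latin-1", "utf-8")
```

The conversion is only as reliable as the caller's knowledge of the source encoding, which is where the reliability concerns discussed below come from.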


2) A 'binary' format is anything else, but in this context, the key characteristic of binary-CIF2 proposals is that they add to the text specification a specification for text serialization to byte-oriented media, such as disks and networks.  In particular, one strongly advocated position in the CIF 2.0 standardization discussion is that CIF 2.0 should require serialization of the underlying CIF text according to the UTF-8 character encoding scheme.  This would be a text-like binary format, in that some text tools can handle UTF-8 encoded text (sometimes requiring a little persuasion), and therefore could be used to read, modify, and write binary-CIF2 files.


3) The many specific issues and arguments that have been raised mostly fall into one or both of two general areas:


3a) reliability, by which we mean that a CIF consumer should have justifiably high confidence that he is interpreting CIF data in the way the CIF producer intended, and


3b) usability, by which we mean that human authors, and to a lesser extent, software, should be able to manipulate CIF2 files as they are accustomed to manipulating CIF1 files, e.g. using the default configurations of their systems’ text editors.  In addition to general usability, arguments in this category include some appealing to respect for scientists of many nationalities, and similar ones appealing to freedom / liberty.


Furthermore, a few of the points have appealed to


3c) practicality, by which we mean that CIF stakeholders should be able to use the specification effectively.  This aspect is subordinate to reliability and usability, but it cuts across both.  For the most part, it relates to the ability to develop software and practices that address the likely real-world usage (and misusage) of the standard.


4) The group as a whole appears to have agreed that UTF-8 is a highly suitable encoding for CIF2.  It can encode the entire Unicode code space, it is a superset of ASCII, it can be recognized heuristically with low probability of error, and it is widely implemented.  These characteristics yield high reliability at the cost of some usability.  The debate is not about whether UTF-8 should be used, but rather about whether the standard should forbid use of other encodings.  Essentially, this is a recasting of the text vs. binary debate.
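The claim in point (4) that UTF-8 can be recognized heuristically can be sketched in Python as below.  The function name is hypothetical, and the check shown (strict decoding) is only one simple form such a heuristic might take, not a procedure agreed by the group.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only.  A True result means the
    bytes are consistent with UTF-8, not that the author used UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Non-ASCII text in a single-byte encoding rarely forms valid UTF-8
# multi-byte sequences, so it is usually rejected.
as_utf8 = "Å".encode("utf-8")      # b'\xc3\x85'
as_latin1 = "Å".encode("latin-1")  # b'\xc5'
```

Pure ASCII input always passes, since UTF-8 is a superset of ASCII; the heuristic only becomes informative once non-ASCII characters are present.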


5) Inasmuch as consensus on the issues described above has not yet been reached and does not appear likely, the group has issued a call for comments from a wider group of stakeholders.  No results of that call have yet been reported back to the group.


6) In the interim, the discussion has moved toward finding middle ground.  In particular, James Hester asked:


>If we consider CIF as text as the overriding priority:
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF?  How far are we prepared to compromise?
>If we consider reliable exchange of information as the top priority:
>2. How do we then make CIFs sufficiently accessible to text-based tools?  How far are we prepared to compromise?

A short series of proposed schemes for CIF exchange and storage proceeded from that call:


7) In response to question (2), James offered a scheme (A) that primarily would relax the explicit specification of UTF-8 into a set of characteristics that a CIF encoding would need to satisfy.  His characteristics would need to be further pared down or relaxed to permit encodings other than UTF-8 in practice.  This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.


8) In response to question (1), John Bollinger offered a scheme (B) that retains CIF2's character as a text format, and relies on labeling, when wanted or needed, to convey text conventions, and on hashing to provide verification and reliability.  This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.
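The details of scheme (B) are not given in this message, so the following Python sketch should not be read as that scheme.  It only illustrates the general idea of using a hash for verification: the digest is computed over the decoded characters rather than the raw bytes, so that it survives transcoding and line terminator changes.  The function name, the choice of SHA-256, and the normalization steps are all assumptions made for this example.

```python
import hashlib


def text_digest(data: bytes, encoding: str) -> str:
    """Hash the characters of a text file, independent of its serialization.

    Hypothetical helper for illustration only; not the hashing procedure
    of scheme (B), which is described in a separate message.
    """
    # Recover the abstract characters using the stated (labeled) encoding.
    text = data.decode(encoding)
    # Assumption: line terminators are normalized to LF before hashing.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: UTF-8 is used purely as the canonical form for hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same text under two different local conventions.
digest_latin1 = text_digest(b"_x '\xc5'\r\n", "latin-1")
digest_utf16 = text_digest("_x 'Å'\n".encode("utf-16"), "utf-16")
```

If a consumer decodes the file with the wrong encoding, the recomputed digest will generally differ from the one supplied by the producer, which is what makes the mismatch detectable.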


A limited amount of additional discussion proceeded from these proposed exchange and archiving schemes, and that’s where we currently stand.








John C. Bollinger, Ph.D.

Department of Structural Biology

St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
cif2-encoding mailing list
