Re: [Cif2-encoding] The discussion so far
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] The discussion so far
- From: James Hester <jamesrhester@xxxxxxxxx>
- Date: Thu, 5 Aug 2010 11:44:13 +1000
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local>
- References: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local>
Thanks, Brian, for creating this list.
Since no one else has had the combination of time, energy, and inclination to do so, I'll open with a summary of the CIF 2.0 character encoding discussion as it currently stands at its original location on the DDLm group. Specifically, the previous summary and the discussion that proceeded from it can be found at http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00744.html and http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00690.html, though some of the most recent messages seem not to be available at present via the web interface.
The controversy derives from CIF 2.0's expansion of the character set to all of Unicode. It is magnified by CIF 1.0's explicit self-description as an encoding-independent text format, and by the accumulated body of CIF software and author practices that rely on that text orientation. There has been considerable debate about what it would mean for CIF 2.0 to be a text format vs. a binary format, and about the relative advantages and disadvantages of each. Among the points covered were:
1) A 'text' format implies that CIF content may comply with local, locale-specific conventions for electronic text representation, including details such as line termination conventions and, especially, character encoding. Such files are suitable input for general-purpose text tools such as text editors, text extraction utilities, and text indexers. Alternatively, a conformant text CIF might be expressed according to some other convention suitable for a particular application or foreign environment. Because conventions differ, correctly archiving text or moving it between environments requires accounting for the text conventions in use, and may involve conversions such as line terminator changes and transcoding. This is the CIF 1.0 position, though CIF1's restricted character set significantly reduces the impact of character encoding considerations relative to CIF2.
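To make the conversions mentioned in point 1 concrete, here is a minimal Python sketch of moving text between two environments by transcoding and changing line terminators. The function name, the sample data value, and the choice of Latin-1 as the source convention are illustrative assumptions, not anything taken from the CIF specifications or from the discussion itself.

```python
def convert_text_conventions(data: bytes, src_encoding: str,
                             dst_encoding: str, newline: str = "\n") -> bytes:
    """Re-express text bytes under a different encoding and line terminator.

    Hypothetical helper for illustration only.
    """
    # Decode using the convention the file was written with.
    text = data.decode(src_encoding)
    # Normalize CRLF and lone CR to LF before applying the target terminator.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if newline != "\n":
        text = text.replace("\n", newline)
    # Encode using the convention of the receiving environment.
    return text.encode(dst_encoding)


# Illustrative sample: a line written as Latin-1 with CRLF line endings.
latin1_crlf = "_chemical_name_common 'caf\u00e9'\r\n".encode("latin-1")
utf8_lf = convert_text_conventions(latin1_crlf, "latin-1", "utf-8")
print(utf8_lf)
```

The sketch only works because the source encoding is supplied by the caller; nothing in the bytes themselves states it, which is the point about having to account for the text conventions in use.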
2) A 'binary' format is anything else, but in this context, the key characteristic of binary-CIF2 proposals is that they add to the text specification a specification for text serialization to byte-oriented media, such as disks and networks. In particular, one strongly advocated position in the CIF 2.0 standardization discussion is that CIF 2.0 should require serialization of the underlying CIF text according to the UTF-8 character encoding scheme. This would be a text-like binary format, in that some text tools can handle UTF-8 encoded text (sometimes requiring a little persuasion), and therefore could be used to read, modify, and write binary-CIF2 files.
3) The many specific issues and arguments that have been raised mostly fall into one or both of two general areas:
3a) reliability, by which we mean that a CIF consumer should have justifiable high confidence that he is interpreting CIF data in the way the CIF producer intended, and
3b) usability, by which we mean that human authors, and to a lesser extent, software, should be able to manipulate CIF2 files as they are accustomed to manipulating CIF1 files, e.g. using the default configurations of their systems' text editors. In addition to general usability, arguments in this category include some appealing to respect for scientists of many nationalities, and similar ones appealing to freedom / liberty.
Furthermore, a few of the points have appealed to
3c) practicality, by which we mean that CIF stakeholders should be able to use the specification effectively. This aspect is subordinate to reliability and usability, but it cuts across both. For the most part, it relates to the ability to develop software and practices that address the likely real-world usage (and misusage) of the standard.
4) The group as a whole appears to have agreed that UTF-8 is a highly suitable encoding for CIF2. It can encode the entire Unicode code space, it is a superset of ASCII, it can be recognized heuristically with low probability of error, and it is widely implemented. These characteristics yield high reliability at the cost of some usability. The debate is not about whether UTF-8 should be used, but rather about whether the standard should forbid use of other encodings. Essentially, this is a recasting of the text vs. binary debate.
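As an illustration of the heuristic recognition mentioned in point 4, the following Python sketch treats a byte sequence as UTF-8 if it decodes strictly without error. The function name and sample strings are assumptions made for illustration; this is one simple way such a check might be written, not a procedure proposed in the discussion.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding rejects any malformed multi-byte sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample_utf8 = "\u03b2-alanine".encode("utf-8")   # Greek beta, two bytes in UTF-8
sample_latin1 = "caf\u00e9".encode("latin-1")    # e-acute as the single byte 0xE9

print(looks_like_utf8(sample_utf8))    # True
print(looks_like_utf8(sample_latin1))  # False
```

Pure ASCII input also passes this check, which is consistent with UTF-8 being a superset of ASCII. The check is a heuristic rather than a proof: text in another encoding can happen to form valid UTF-8 sequences, though this is unlikely once the text contains more than a few non-ASCII characters.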
5) Inasmuch as consensus on the issues described above has not yet been reached and does not appear likely, the group has issued a call for comments from a wider group of stakeholders. No results of that call have yet been reported back to the group.
6) In the interim, the discussion has moved toward finding middle ground. In particular, James Hester asked:
>If we consider CIF as text as the overriding priority:
>
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF? How far are we prepared to compromise?
>
>If we consider reliable exchange of information as the top priority:
>
>2. How do we then make CIFs sufficiently accessible to text-based tools? How far are we prepared to compromise?
A short series of proposed schemes for CIF exchange and storage proceeded from those questions:
7) In response to question (2), James offered a scheme (A) that primarily would relax the explicit specification of UTF-8 into a set of characteristics that a CIF encoding would need to satisfy. His characteristics would need to be further pared down or relaxed to permit encodings other than UTF-8 in practice. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.
8) In response to question (1), John Bollinger offered a scheme (B) that retains the text character of CIF2, relies on labeling, when wanted or needed, to convey text conventions, and relies on hashing to provide verification and reliability. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.
A limited amount of additional discussion proceeded from these proposed exchange and archiving schemes, and that's where we currently stand.
Regards,
John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
Email Disclaimer: www.stjude.org/emaildisclaimer
_______________________________________________
cif2-encoding mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
- References:
- [Cif2-encoding] The discussion so far (Bollinger, John C)