Re: [Cif2-encoding] The discussion so far

To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
Subject: Re: [Cif2-encoding] The discussion so far
From: James Hester <jamesrhester@xxxxxxxxx>
Date: Thu, 5 Aug 2010 11:44:13 +1000
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA5416659DED7B@SJMEMXMBS11.stjude.sjcrh.local>

Thanks, John, for this summary. I think it is a fair description of the current state of our deliberations. I am currently drafting a response to your previous email, which I hope to be able to send fairly soon.

On Wed, Aug 4, 2010 at 12:58 AM, Bollinger, John C <[email protected]> wrote:

Thanks, Brian, for creating this list.

Since no one else has had the combination of time, energy, and inclination to do so, I'll open with a summary of the state of the CIF 2.0 character encoding discussion so far, as it currently stands at its original location on the DDLm group. Specifically, the previous summary and the discussion that proceeded from it can be found at http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00744.html and http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00690.html, though some of the most recent messages seem not to be available at present via the web interface.

The controversy derives from CIF 2.0's expansion of its character set to all of Unicode. It is magnified by CIF 1.0's explicit self-description as an encoding-independent text format, and by the accumulated body of CIF software and author practices that rely on that text orientation. There has been considerable debate about what it would mean for CIF 2.0 to be a text format vs. a binary format, and the relative advantages and disadvantages of each. Among the points covered were:

1) A 'text' format implies that CIF content may comply with local, locale-specific conventions for electronic text representation, including details such as line termination conventions and, especially, character encoding. Such files are suitable input for general-purpose text tools such as text editors, text extraction utilities, and text indexers. Alternatively, a conformant text CIF might be expressed according to some other convention suitable for a particular application or foreign environment. Because conventions differ, correctly archiving text or moving it between environments requires accounting for the text conventions in use, and may involve conversions such as line terminator changes and transcoding. This is the CIF 1.0 position, though CIF1's restricted character set significantly reduces the impact of character encoding considerations relative to CIF2.

2) A 'binary' format is anything else, but in this context, the key characteristic of binary-CIF2 proposals is that they add to the text specification a specification for text serialization to byte-oriented media, such as disks and networks. In particular, one strongly advocated position in the CIF 2.0 standardization discussion is that CIF 2.0 should require serialization of the underlying CIF text according to the UTF-8 character encoding scheme. This would be a text-like binary format, in that some text tools can handle UTF-8 encoded text (sometimes requiring a little persuasion), and therefore could be used to read, modify, and write binary-CIF2 files.

3) The many specific issues and arguments that have been raised mostly fall into one or both of two general areas:

3a) reliability, by which we mean that a CIF consumer should have justifiable high confidence that he is interpreting CIF data in the way the CIF producer intended, and

3b) usability, by which we mean that human authors, and to a lesser extent, software, should be able to manipulate CIF2 files as they are accustomed to manipulating CIF1 files, e.g. using the default configurations of their systems' text editors. In addition to general usability, arguments in this category include some appealing to respect for scientists of many nationalities, and similar ones appealing to freedom / liberty.

Furthermore, a few of the points have appealed to

3c) practicality, by which we mean that CIF stakeholders should be able to use the specification effectively. This aspect is subordinate to reliability and usability, but it cuts across both. For the most part, it relates to the ability to develop software and practices that address the likely real-world usage (and misusage) of the standard.

4) The group as a whole appears to have agreed that UTF-8 is a highly suitable encoding for CIF2. It can encode the entire Unicode code space, it is a superset of ASCII, it can be recognized heuristically with low probability of error, and it is widely implemented. These characteristics yield high reliability at the cost of some usability. The debate is not about whether UTF-8 should be used, but rather about whether the standard should forbid use of other encodings. Essentially, this is a recasting of the text vs. binary debate.

5) Inasmuch as consensus on the issues described above has not yet been reached and does not appear likely, the group has issued a call for comments from a wider group of stakeholders. No results of that call have yet been reported back to the group.

6) In the interim, the discussion has moved toward finding middle ground. In particular, James Hester asked:

>If we consider CIF as text as the overriding priority:
>
>1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF? How far are we prepared to compromise?
>
>If we consider reliable exchange of information as the top priority:
>
>2. How do we then make CIFs sufficiently accessible to text-based tools? How far are we prepared to compromise?

A short series of proposed schemes for CIF exchange and storage proceeded from that call:

7) In response to question (2), James offered a scheme (A) that primarily would relax the explicit specification of UTF-8 into a set of characteristics that a CIF encoding would need to satisfy. His characteristics would need to be further pared down or relaxed to permit encodings other than UTF-8 in practice. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

8) In response to question (1), John Bollinger offered a scheme (B) that retains text character for CIF2, and relies on labeling when wanted or needed to convey text conventions, and on hashing to provide verification and reliability. This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

A limited amount of additional discussion proceeded from these proposed exchange and archiving schemes, and that's where we currently stand.
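
To make points (1) and (4) concrete, here is a minimal Python sketch of the two operations they mention: recognizing UTF-8 heuristically, and moving text between conventions by transcoding and changing line terminators. It is not part of either proposed scheme. The function names and the sample line are hypothetical, and "decodes as strict UTF-8" is only one possible form of the heuristic.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: True if the bytes form a valid UTF-8 sequence.

    Pure ASCII always passes, because ASCII is a subset of UTF-8.
    Text in most other 8-bit encodings fails as soon as it contains
    a non-ASCII character, which is why the error rate is low.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert text from a known local convention to UTF-8 with LF line ends.

    The caller must already know source_encoding; it cannot be
    recovered reliably from the bytes alone.
    """
    text = data.decode(source_encoding)
    # Normalize CRLF and bare CR line terminators to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# Hypothetical sample line with two non-ASCII characters and a CRLF terminator.
sample = "_cell_length_a 5.43 # \u00c5ngstr\u00f6m\r\n"
latin1_bytes = sample.encode("latin-1")  # one local text convention
utf8_bytes = sample.encode("utf-8")      # the proposed CIF2 serialization

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
print(transcode_to_utf8(latin1_bytes, "latin-1"))
```

The second function only works because the source encoding is passed in explicitly. That is the reliability concern in a nutshell: without a label or a fixed encoding, a consumer has to guess it.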

Regards,

John

--

John C. Bollinger, Ph.D.

Department of Structural Biology

St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

_______________________________________________
cif2-encoding mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/cif2-encoding

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
