Re: [ddlm-group] options/text vs binary/end-of-line

On Wed, Jun 30, 2010 at 1:06 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:


As part of extending the character set beyond ASCII, we abandon the premise that CIF is a text format, though under some circumstances it may still be possible to manipulate CIFs with tools designed for text.

Alternatively, I have been advocating essentially this: by extending the character set beyond ASCII, we magnify the importance of exchanging and storing CIFs according to text conventions, including correctly communicating encodings as necessary and transcoding as appropriate.

I would agree that this distills the essence of our discussion.  The two polar positions are then:

(a) Reliable transfer and archiving of information are the top priority
(b) Being able to process CIF using text conventions is the top priority

If we consider CIF as text as the overriding priority:

1. How do we then make exchanging and storing files according to text conventions sufficiently reliable for the purposes of CIF?  How far are we prepared to compromise?

If we consider reliable exchange of information as the top priority:

2. How do we then make CIFs sufficiently accessible to text-based tools?  How far are we prepared to compromise?

It would be useful if we came up with suitable answers to (1) and (2) to clarify what our alternatives are.  We can then seek a compromise by balancing the loss of textual convenience that comes from emphasizing reliability against the loss of reliability that comes from emphasizing textual convenience (Simple!).  I will be proposing a poll of the international CIF-using community in a separate email to aid us in our deliberations.

Being clearly in the camp that values reliability before text, I propose the following scheme for ensuring reliability while attempting to keep the file as accessible as possible to text tools, and invite others to propose a scheme that satisfies those who value text as the highest priority:

Scheme A:
  1. For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes. 
  2. Any encoding specifying the mapping from bytes to Unicode code points may be used, provided that:
    1. The encoding is specified in an international standard
    2. The encoding is distinguishable from all other standard encodings at the binary level.  This requirement may be satisfied by an initial 'signature', provided that this signature is specified in the relevant international standard as being mandatory
    3. The encoding is supported across the range of platforms likely to use CIF2.  "Support" includes:
      1. Availability of text input and output functions using this encoding across a range of programming languages
      2. Availability of applications on the platform to manipulate text in this encoding, most importantly text editors but also tools such as search
    4. The encoding is coincident with US-ASCII for codepoints <= 127.  This requirement may be dropped in the future if CIF2 becomes the dominant form of CIF file.
Requirement (2.4) is there for backwards compatibility with the de facto practice of CIF1.  I submit that currently only UTF-8 meets these requirements for encoding (and (2.2) only in a probabilistic sense).
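
To make requirements (2.2) and (2.4) concrete, here is a minimal Python sketch, not a proposed implementation. The function names decode_cif_bytes and is_ascii_coincident are hypothetical, and accepting an optional leading byte order mark is an assumption made for illustration, not part of Scheme A. The first function treats the file as a stream of bytes and accepts it only if it decodes strictly as UTF-8; the second checks whether a named encoding agrees with US-ASCII for codepoints <= 127.

```python
import codecs


def decode_cif_bytes(raw: bytes) -> str:
    """Decode a CIF byte stream as UTF-8, raising UnicodeDecodeError otherwise."""
    # Assumption for this sketch: tolerate an optional UTF-8 byte order mark.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Strict decoding rejects byte sequences that are not well-formed UTF-8.
    return raw.decode("utf-8", errors="strict")


def is_ascii_coincident(encoding: str) -> bool:
    """Check (2.4): every single byte 0..127 decodes to the same code point."""
    try:
        return all(bytes([b]).decode(encoding) == chr(b) for b in range(128))
    except UnicodeDecodeError:
        # Encodings such as UTF-16 cannot decode a lone byte at all.
        return False
```

Text saved in a single-byte encoding usually fails the strict decode as soon as it contains a non-ASCII character, which is what lets UTF-8 be told apart from other encodings in practice. The check is only probabilistic: the two bytes 0xC3 0xA9 are a valid Latin-1 string and also valid UTF-8, so a file of that kind would be accepted and silently misread.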
