Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line



On Monday, June 28, 2010 6:00 PM, SIMON WESTRIP wrote:

>John suggests "the goal of CIF being compatible with general-purpose text tools"
>
>This is possibly the crux of the matter.

It is right at the heart of the matter, I agree, and it comes with an historical impetus.  As I composed these comments, I distilled what I think are the essences of the two main positions into two short statements that capture, for me, the alternatives before us.  Please forgive the somewhat didactic discussion leading up to these, and skip straight to the *** if you wish to ignore my long-windedness altogether.

>Unless a general-purpose text tool is capable of the determining text encoding system, it ain't going to be much use
>for a CIF that was encoded on a different system and uses non-ASCII chars?

Forgive me if I am reading too much into the question, but I think it highlights a central difference of understanding: some parties to this discussion seem to hold that text vs. binary is an inherent characteristic of a file, but I maintain that a stream of bytes divorced from any explicit or implicit metadata about its encoding is binary, not text.  This complication of electronic text handling is not new, but it has assumed much more prominence as internationalization issues have gained importance.
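A minimal Python sketch of this point, under the assumption that the bytes arrive with no encoding information at all. The helper name `interpretations` and the sample string are hypothetical, chosen only for illustration; the same byte stream yields different text, or no text, depending on the encoding the reader assumes.

```python
def interpretations(raw: bytes, encodings):
    """Decode the same bytes under several assumed encodings.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    out = {}
    for enc in encodings:
        try:
            out[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            out[enc] = None
    return out


# Hypothetical sample: a value containing the angstrom sign, written as UTF-8.
raw = "1.54 Å".encode("utf-8")

# Without metadata, each of these readings is equally "legitimate".
readings = interpretations(raw, ["ascii", "latin-1", "utf-8"])
```

Only the UTF-8 reading recovers the intended text; the ASCII reading fails outright, and the Latin-1 reading succeeds silently with the wrong characters.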

Implicit encoding metadata commonly takes the form of the text in question being encoded according to the default scheme for the system or tool.  It could, in one sense, also take the form of a requirement in the format specification, but that is meaningful only for tools specific to the format, which rather moots the text vs. binary question.  It could also take the form of local policy, such as "all CIFs in this archive are encoded in CESU-8," which would be useful to tools configured for the relevant environment (e.g. a web server).

Explicit metadata can be carried by the file itself or conveyed out-of-band.  XML's encoding attribute is an example of the former, and HTTP's content-type header is an example of the latter.  These are useful only to certain tools, specific to a particular format, environment, or exchange mechanism.
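The two kinds of explicit metadata mentioned above can be sketched in Python as follows. Both function names are hypothetical, and the XML helper is a simplification: it assumes the declaration itself is in an ASCII-compatible encoding and does not handle UTF-16 or byte-order marks, which a conforming XML parser must.

```python
import re
from email.message import Message

# Matches an XML declaration at the start of the data and captures the
# value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb"<\?xml[^>]*?encoding=[\"']([A-Za-z][A-Za-z0-9._\-]*)[\"']"
)


def xml_declared_encoding(data: bytes):
    """In-band metadata: the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def charset_from_content_type(header_value: str):
    """Out-of-band metadata: the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lower-cased, or None if absent.
    return msg.get_content_charset()


in_band = xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
out_of_band = charset_from_content_type("text/plain; charset=ISO-8859-1")
```

Neither helper is of any use to a general-purpose text tool, which is the point made above: the metadata exists, but only format-aware or protocol-aware software knows where to look for it.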

One of the upshots of all this is that transcoding must in general be a routine aspect of text file exchange, as that can make explicit encoding metadata implicit.  As Simon has shown, transcoding is not automatic in many contexts, so it may require extra work on the receiving end.  To the extent that there is a current assumption and practice of CIFs being stored and forwarded byte-for-byte as received (i.e. without transcoding or explicit metadata), CIF is already being treated as a binary format.  In a sense, perhaps, it is being treated simultaneously as several distinct binary formats.
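A minimal sketch of transcoding on the receiving end, assuming the sender's encoding has been communicated somehow and that the local convention is UTF-8. The function name `transcode` and both encoding choices are assumptions made for illustration.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Re-encode received bytes from the sender's encoding to the local one.

    Decoding uses the explicitly communicated source encoding; once the
    result is stored in the local default, that metadata becomes implicit.
    """
    text = raw.decode(source_encoding)
    return text.encode(target_encoding)


# Hypothetical exchange: the sender wrote Latin-1, the receiver stores UTF-8.
received = "1.54 Å".encode("latin-1")
stored = transcode(received, "latin-1")
```

Storing `received` unchanged instead would be the byte-for-byte forwarding described above, leaving every later reader to guess the encoding.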


***

>By extending the character set beyond ASCII, we have to accept that not all general-purpose text tools are going to
>be applicable as CIF editors/viewers.

That's a valid perspective, but I would sharpen it: as part of extending the character set beyond ASCII, we abandon the premise that CIF is a text format, though under some circumstances it may still be possible to manipulate CIFs with tools designed for text.

Alternatively, I have been advocating essentially this: by extending the character set beyond ASCII, we magnify the importance of exchanging and storing CIFs according to text conventions, including correctly communicating encodings as necessary and transcoding as appropriate.

I hope the latter position adequately encompasses Herb's view as well.  Each position carries additional baggage, which I have omitted to focus on the essential ideas.  If wider comment is sought, then I submit that these alternatives provide a suitable basis for soliciting such.


Whichever position prevails, I should like to see something substantially similar to the corresponding position statement above inserted into the spec.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital




_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
