Discussion List Archives

Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

On Tuesday, June 22, 2010 11:06 AM, SIMON WESTRIP wrote:
>CIF may currently be handled with multiple encodings, but as it's restricted to ASCII, the
>encoding issue hasn't really been relevant - most code pages include the ASCII code points?

It is a common feature of many encodings to be congruent with 7-bit ASCII over its range, but that is not universal.  UTF-16 and UTF-32, for example, are not congruent with ASCII anywhere.  Neither is EBCDIC.  Shift-JIS is mostly congruent with ASCII, but varies at two code points.
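
As an illustration of the point above, here is a minimal Python sketch that encodes the ASCII letter "A" under several encodings and compares the resulting bytes. The choice of cp037 as a representative EBCDIC code page is an assumption made for the example; other EBCDIC variants exist.

```python
# Encode one ASCII character under several encodings and compare the bytes.
# cp037 stands in for EBCDIC here (an assumption; it is only one variant).
ENCODINGS = ["ascii", "utf-8", "latin-1", "utf-16-le", "utf-32-le", "cp037"]


def encoded_forms(ch: str) -> dict:
    """Return the byte sequence for `ch` under each encoding in ENCODINGS."""
    return {name: ch.encode(name) for name in ENCODINGS}


forms = encoded_forms("A")
for name, data in forms.items():
    verdict = "same as ASCII" if data == forms["ascii"] else "differs"
    print(f"{name:10} {data.hex():10} {verdict}")
```

UTF-8 and Latin-1 produce the same single byte as ASCII, while the UTF-16, UTF-32 and EBCDIC forms do not.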

>If CIF2 is also to allow multiple encodings, it is quite possible that a basic text editor will not render the content
>appropriately for anything outside the ASCII range if it is unable to determine the encoding (it may not even attempt to
>determine the encoding - my Linux text editors aren't very good at autodetection - I don't know about Windows Notepad,
>but last time I looked it couldn't even interpret Linux line endings appropriately).

Indeed.  I believe some basic text editors will assume that any file presented to them uses the host's default encoding.  In many cases that is not UTF-8, so selecting UTF-8 as the only CIF encoding does not promote CIF interoperability with those particular programs.
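
To make the failure mode concrete, here is a small Python sketch of what happens when UTF-8 bytes are read under a different assumed default. Using cp1252 as the host default is purely an assumption for the example.

```python
# A program writes one non-ASCII character as UTF-8...
original = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE, often used for angstrom
utf8_bytes = original.encode("utf-8")

# ...and another program reads the bytes assuming its host default encoding.
# cp1252 is a stand-in for such a default (an assumption for this sketch).
misread = utf8_bytes.decode("cp1252")

print(repr(original), "->", utf8_bytes.hex(), "->", repr(misread))
```

The single character comes back as two unrelated characters, so the text is no longer what the author wrote.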

>In the absence of a BOM, the only solution is to use an heuristic approach to determine the encoding?

Not necessarily.  If the data are delivered via web form or other HTTP-based method, for example, then the HTTP protocol provides support for specifying the encoding.  Similarly, if the file is delivered as part of a MIME multipart message, then the content type specified by its MIME headers can express the encoding.
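
As a rough sketch of how a receiver might read such a declaration, the Python standard library can extract the charset parameter from a Content-Type header value. The helper name `declared_charset` and the header values are hypothetical, chosen only for illustration.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical header values, as they might arrive over HTTP or in a MIME part.
print(declared_charset("text/plain; charset=ISO-8859-1"))
print(declared_charset("text/plain"))
```

When the parameter is missing the helper returns None, which is the case where a BOM or a heuristic would still be needed.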

>Such heuristics would also have to be applied in order to process the CIF (which I'd already decided I will have to do
>because of the likelihood of receiving non-UTF8 CIF2's)

Were I in your shoes, I would plan to transcode non-UTF-8 CIFs to UTF-8 upon receipt, as part of the verification process.  I would store only the UTF-8 version; thereafter, no worries.  One of the advantages of defining CIF2 as an encoding-independent text format would be that doing as I describe would preserve the original *CIF* data (i.e. the text) with 100% fidelity, even though it might not preserve the exact byte stream.
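
A minimal Python sketch of that transcode-on-receipt step follows, assuming the source encoding is already known (declared or detected). The function name `transcode_to_utf8` and the sample text are hypothetical.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` using `source_encoding` and re-encode the text as UTF-8.

    Decoding is strict, so invalid input raises UnicodeDecodeError, which
    can serve as part of verification on receipt.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical incoming file content, delivered as UTF-16 (little-endian).
received = "_example_tag 'caf\u00e9'\n".encode("utf-16-le")
stored = transcode_to_utf8(received, "utf-16-le")

print(stored != received)                                       # bytes differ
print(stored.decode("utf-8") == received.decode("utf-16-le"))  # text is equal
```

The stored bytes differ from the received bytes, but decoding either one yields the same text.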

>So I still believe that as a *standard* we should specify UTF8.
>However, that does not mean that we cannot be tolerant of other encodings?
>If a system exists that processes all its CIFs in a different encoding, I see no reason for it to change -
>only when the CIF is to be made publicly available should it be converted to UTF-8.
>Likewise, if such a system is capable of handling current CIFs, surely it will manage UTF-8 CIFs with
>little overhead? After all, CIF2 is going to be different from CIF1.

This nicely captures my point about the CIF data format vs. CIF storage and interchange.  UTF-8 can very easily be a standard for CIF interchange -- perhaps the only standard -- without conflating that with the CIF data format.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
