Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .


On Wednesday, June 23, 2010 9:47 AM, Herbert J. Bernstein wrote:

>If we impose a non-text canonical UTF-8 encoding that does not contain an
>internal encoding signature, and that file is transmitted as text and
>not binary from a machine for which, say, ASCII with code pages for, say,
>western europe, is the native encoding, and the transmission converts
>the UTF-8 characters as if they were accented characters in Latin-1,
>then what is received may appear plausible at the receiving end, just
>wrong.

Surely that is a general issue with exchanging encoded text.  It is not caused by designating a canonical encoding, and it would not be solved either by declining to designate a canonical encoding or by mandating UTF-8 as the only allowed encoding.
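The failure mode described in the quoted text can be reproduced in a few lines. This is only an illustrative sketch: the sample word is an arbitrary choice, not taken from the thread, and Python's `latin-1` codec stands in for whatever conversion a text-mode transfer might apply.

```python
# Illustration only: the sample text is an arbitrary assumption.
original = "Schrödinger"

# Bytes as a UTF-8 writer would produce them.
utf8_bytes = original.encode("utf-8")

# A receiver that assumes Latin-1 decodes the same bytes without any error,
# because every byte value is a valid Latin-1 character.
misread = utf8_bytes.decode("latin-1")

# The damage is reversible only if the receiver knows what happened.
recovered = misread.encode("latin-1").decode("utf-8")

print(misread)    # plausible-looking but wrong text
print(recovered)  # the original text again
```

No exception is raised at any point, which is why the result can "appear plausible at the receiving end, just wrong".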

>Therefore, I would suggest that we be very careful to make such a
>canonical UTF-8 cif self identifying, by including not only a BOM,
>but by adding some text in the range of #x128-#x254 to the magic
>number to help in detecting such unintended transmission conversions.

It would definitely ease encoding detection / correction if the magic number contained non-ASCII characters.  Doing so, however, will either require CIF2 to be a hybrid binary/text format or effectively restrict CIF to encodings that support the chosen characters.  (Or am I missing something?)  I disfavor the former, and I think the latter is a serious restriction indeed.

>In addition, I would suggest that, just as the first line of an XML
>document specifies its encoding in plain text, that we add the same
>information to our magic number.

I have been giving some consideration to exactly that possibility.  It works for all encodings that are supersets of ASCII.  Other encodings would need to be detected some other way (e.g. byte-order mark, analysis of the encoded magic number), but they are not at such risk of encoding confusion.
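For encodings that are not supersets of ASCII, byte-order-mark detection might look roughly like the sketch below. The function name `sniff_bom` and the returned labels are hypothetical; only the BOM constants from Python's standard `codecs` module are real. UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) starts with the
# UTF-16-LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if absent.

    Hypothetical helper for illustration; not part of any CIF specification.
    """
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A file with no BOM returns `None` here, and would then fall through to whatever other detection is chosen, such as a plain-text encoding declaration.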

The signature of a CIF2 might then be something like these:

#\#CIF_2.0
#\#CIF_2.0:UTF-8
#\#CIF_2.0:KOI8-R
#\#CIF_2.0:ISO-8859-1

where the first two mean the same thing.  If we do choose to not require UTF-8 then I favor this approach.
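A reader could recover the declared encoding from such a signature along the following lines. This is a sketch under assumptions: the function name `declared_encoding`, the set of characters allowed in an encoding name, and the tolerance for a leading UTF-8 BOM and trailing blanks are all choices made here for illustration, since the proposal above fixes only the general shape of the line.

```python
import codecs
import re

# Signature shape from the examples above: "#\#CIF_2.0", optionally followed
# by ":" and an encoding name. The name's allowed characters are an assumption.
_SIGNATURE = re.compile(rb"#\\#CIF_2\.0(?::([A-Za-z0-9._\-]+))?[ \t]*")


def declared_encoding(data: bytes):
    """Return the encoding named on the first line, or None if the line
    is not a CIF2 signature. A bare signature is taken to mean UTF-8."""
    # Tolerate a UTF-8 byte-order mark before the signature (assumption).
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Isolate the first line under any of the common line-ending conventions.
    first_line = re.split(rb"\r\n|\r|\n", data, maxsplit=1)[0]
    match = _SIGNATURE.fullmatch(first_line)
    if match is None:
        return None
    name = match.group(1)
    return name.decode("ascii") if name else "UTF-8"


if __name__ == "__main__":
    print(declared_encoding(b"#\\#CIF_2.0:KOI8-R\ndata_example\n"))
```

Because the signature is matched as raw bytes, this works only for encodings that are supersets of ASCII, which is the case the declaration is meant to cover.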


John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital




