Discussion List Archives

Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .

All that is required to avoid the trap of unintended text transformations
of UTF-8 as if it were, say, Latin-1, is to add any string from the
Latin-1 Supplement block of the Unicode BMP.  I would suggest
    :#x00F2#x00F3#x00F4#x00F5#x00F6:
which as UTF-8 would be the byte sequence

:#x00c3#x00b2#x00c3#x00b3#x00c3#x00b4#x00c3#x00b5#x00c3#x00b6:

which would come out as five accented lower-case o's running through the
full set of accents if transmitted correctly, but as capital A-tildes
alternating with SUPERSCRIPT TWO, SUPERSCRIPT THREE, ACUTE ACCENT,
MICRO SIGN, PILCROW SIGN in the most likely mis-transmission of a UTF-8
file as a Latin-1 file.
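
To make the byte-level picture concrete, here is a small Python sketch
(not part of the proposal itself) that encodes the five suggested code
points as UTF-8 and then decodes those bytes as Latin-1, which is the
mis-transmission described above. The variable names are only
illustrative.

```python
import unicodedata

# The five suggested code points: o with grave, acute, circumflex, tilde, diaeresis
tc = "\u00f2\u00f3\u00f4\u00f5\u00f6"

# Correct transmission: the UTF-8 bytes of the check string
utf8_bytes = tc.encode("utf-8")

# Mis-transmission: the same bytes interpreted as if they were Latin-1
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(" "))
for ch in garbled:
    # Show each received character with its Unicode name
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Every second character of the garbled string is LATIN CAPITAL LETTER A
WITH TILDE, and the characters in between are the five symbols listed
above.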

Let us call the code-point sequence #x00F2#x00F3#x00F4#x00F5#x00F6
the transmission check <tc>.  Then the proposed magic number would be

#\#CIF_2.0:<encoding>:<tc>:

Both the encoding and the tc would be optional, but highly recommended.
This might not allow fully automated decoding, but it would at least
provide a decent error check for many of the most common cases that
cause trouble, and would actually give us an edge over the XML
convention (which only gives the encoding) in terms of reliability.
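
As a rough sketch of how a reader might use such a first line, the
following Python function splits it on colons, takes the declared
encoding (defaulting to UTF-8 when it is absent), and compares the
received <tc> bytes with what that encoding should have produced. The
function name, the verdict strings, the default to UTF-8, and the
assumption that the encoding name is plain ASCII are all mine, not part
of any agreed CIF2 syntax.

```python
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"   # the transmission check <tc>
PREFIX = b"#\\#CIF_2.0"                  # the bytes  #\#CIF_2.0

def check_magic(first_line: bytes):
    """Return (encoding, verdict) for a first line; hypothetical helper."""
    line = first_line.rstrip(b"\r\n")
    if not line.startswith(PREFIX):
        return None, "not CIF2"
    fields = line[len(PREFIX):].split(b":")
    if fields[0] != b"":
        # Something other than ':' or end of line follows the prefix
        return None, "not CIF2"
    try:
        # Both <encoding> and <tc> are optional; assume UTF-8 if undeclared
        name = fields[1] if len(fields) > 1 else b""
        encoding = name.decode("ascii") if name else "UTF-8"
        expected = TC.encode(encoding)
    except (LookupError, UnicodeError):
        # Unknown encoding name, or one that cannot represent <tc>
        return None, "uncheckable"
    received = fields[2] if len(fields) > 2 else b""
    if not received:
        return encoding, "unchecked"
    if received == expected:
        return encoding, "ok"
    # UTF-8 bytes read as Latin-1 and then re-encoded as UTF-8
    if received == TC.encode("utf-8").decode("latin-1").encode("utf-8"):
        return encoding, "double-encoded"
    return encoding, "damaged"
```

For example, a line `#\#CIF_2.0:UTF-8:<tc>:` that arrives intact gives
the verdict "ok", while the same line after a UTF-8-as-Latin-1 text
conversion gives "double-encoded". An encoding that cannot represent
the <tc> characters at all (KOI8-R, for instance) ends up as
"uncheckable" in this sketch.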


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 23 Jun 2010, Bollinger, John C wrote:

>
> On Wednesday, June 23, 2010 9:47 AM, Herbert J. Bernstein wrote:
>
>> If we impose a non-text canonical UTF-8 encoding that does not contain an
>> internal encoding signature, and that file is transmitted as text and
>> not binary from a machine for which, say, ASCII with code pages for, say,
>> Western Europe, is the native encoding, and the transmission converts
>> the UTF-8 characters as if they were accented characters in Latin-1,
>> then what is received may appear plausible at the receiving end, just
>> wrong.
>
> Surely that is a general issue with exchanging encoded text.  It is not caused by designating a canonical encoding, and it would not be solved either by declining to designate a canonical encoding or by mandating UTF-8 as the only allowed encoding.
>
>> Therefore, I would suggest that we be very careful to make such a
>> canonical UTF-8 cif self identifying, by including not only a BOM,
>> but by adding some text in the range of #x128-#x254 to the magic
>> number to help in detecting such unintended transmission conversions.
>
> It would definitely ease encoding detection / correction if the magic number contained non-ASCII characters.  Doing so, however, either will require CIF2 to be a hybrid binary/text format, or will effectively restrict CIF to be used only with encodings that support the chosen characters.  (Or am I missing something?)  I disfavor the former, and I think the latter is a serious restriction indeed.
>
>> In addition, I would suggest that, just as the first line of an XML
>> document specifies its encoding in plain text, that we add the same
>> information to our magic number.
>
> I have been giving some consideration to exactly that possibility.  It works for all encodings that are supersets of ASCII.  Other encodings would need to be detected some other way (e.g. byte-order mark, analysis of the encoded magic number), but they are not at such risk of encoding confusion.
>
> The signature of a CIF2 might then be something like these:
>
> #\#CIF_2.0
> #\#CIF_2.0:UTF-8
> #\#CIF_2.0:KOI8-R
> #\#CIF_2.0:ISO-8859-1
>
> where the first two mean the same thing.  If we do choose to not require UTF-8 then I favor this approach.
>
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
