
Re: [ddlm-group] options/text vs binary/end-of-line

All that is required to avoid the trap of unintended text transformations 
from UTF-8 as if it were, say, Latin 1, is to add any string from the 
Latin 1 supplement of the Unicode BMP.  I would suggest

    òóôõö

which as UTF-8 would be the byte sequence

    C3 B2 C3 B3 C3 B4 C3 B5 C3 B6

which would come out as 5 accented lower case o's running through the
full set of accents if transmitted correctly, but as
capital A-tildes alternating with SUPERSCRIPT TWO, SUPERSCRIPT THREE,
ACUTE ACCENT, MICRO SIGN, PILCROW SIGN in the most likely mis-transmission
of a UTF-8 file as a Latin-1 file.
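
A minimal Python sketch of that mis-transmission, assuming only the five
code points U+00F2 through U+00F6 named in this message.  It encodes them
as UTF-8, decodes the bytes as if they were Latin-1, and prints the
Unicode name of each character a receiver would then see.

```python
import unicodedata

# The five code points named in this message: U+00F2 through U+00F6.
tc = "\u00f2\u00f3\u00f4\u00f5\u00f6"

# Correct transmission: the string as UTF-8 bytes (two bytes per character).
utf8_bytes = tc.encode("utf-8")

# Simulated mis-transmission: the same bytes interpreted as Latin-1.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(" "))
for ch in garbled:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each original character becomes two: a capital A with tilde (from the
lead byte C3) followed by one of the five symbols listed above.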

Let us call that code-point sequence, #x00F2#x00F3#x00F4#x00F5#x00F6,
the transmission check <tc>.  Then the proposed magic number would be


Both the encoding and the tc would be optional, but highly recommended.
This might not allow fully automated decoding, but it would at least
provide a decent error check for many of the most common cases that
cause trouble, and would actually give us an edge over the XML
convention (which only gives the encoding) in terms of reliability.
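
As a rough illustration of the error check, here is a hedged Python
sketch.  The function name check_transmission and its three return
values are hypothetical, and because the exact layout of the magic
number is not reproduced here, the sketch simply looks for <tc>
anywhere in the already-decoded first line.

```python
# Transmission check <tc>: code points U+00F2 through U+00F6.
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"

# What <tc> looks like when its UTF-8 bytes were read as Latin-1.
TC_AS_LATIN1 = TC.encode("utf-8").decode("latin-1")


def check_transmission(first_line: str) -> str:
    """Classify a decoded first line by the state of its <tc> string.

    Hypothetical helper: returns "ok" if <tc> survived intact,
    "utf8-read-as-latin1" if it shows the characteristic garbling,
    and "unknown" if neither form is present.
    """
    if TC in first_line:
        return "ok"
    if TC_AS_LATIN1 in first_line:
        return "utf8-read-as-latin1"
    return "unknown"
```

A line carrying no <tc> at all yields "unknown", which matches the
point that the check is optional and cannot by itself make decoding
fully automatic.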

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Wed, 23 Jun 2010, Bollinger, John C wrote:

> On Wednesday, June 23, 2010 9:47 AM, Herbert J. Bernstein wrote:
>> If we impose a non-text canonical UTF-8 encoding that does not contain an
>> internal encoding signature, and that file is transmitted as text and
>> not binary from a machine for which, say, ASCII with code pages for, say,
>> western europe, is the native encoding, and the transmission converts
>> the UTF-8 characters as if they were accented characters in Latin-1,
>> then what is received may appear plausible at the receiving end, just
>> wrong.
> Surely that is a general issue with exchanging encoded text.  It is not caused by designating a canonical encoding, and it would not be solved either by declining to designate a canonical encoding or by mandating UTF-8 as the only allowed encoding.
>> Therefore, I would suggest that we be very careful to make such a
>> canonical UTF-8 cif self identifying, by including not only a BOM,
>> but by adding some text in the range of #x128-#x254 to the magic
>> number to help in detecting such unintended transmission conversions.
> It would definitely ease encoding detection / correction if the magic number contained non-ASCII characters.  Doing so, however, either will require CIF2 to be a hybrid binary/text format, or will effectively restrict CIF to be used only with encodings that support the chosen characters.  (Or am I missing something?)  I disfavor the former, and I think the latter is a serious restriction indeed.
>> In addition, I would suggest that, just as the first line of an XML
>> document specifies its encoding in plain text, that we add the same
>> information to our magic number.
> I have been giving some consideration to exactly that possibility.  It works for all encodings that are supersets of ASCII.  Other encodings would need to be detected some other way (e.g. byte-order mark, analysis of the encoded magic number), but they are not at such risk of encoding confusion.
> The signature of a CIF2 might then be something like these:
> #\#CIF_2.0
> #\#CIF_2.0:UTF-8
> #\#CIF_2.0:KOI8-R
> #\#CIF_2.0:ISO-8859-1
> where the first two mean the same thing.  If we do choose to not require UTF-8 then I favor this approach.
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group