
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. ...

Dear Colleagues,

   John is mistaken about the value of the transmission check for most
code pages.  Most of the Cyrillic code pages have quite distinctive
printable characters for these character values, and none of them would
return the transmission check as the intended characters.  At worst you
would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7).  Try it with
a few encodings and you will see it works pretty well.  I chose the
accented o's to land in a region of the code tables that is well
populated, so for most encodings you would really see the problem,
but even the case of :: is pretty clear.
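
For anyone who wants to try this, here is a minimal sketch in Python.  It
assumes the file was written as UTF-8 and is then read back under a wrong
8-bit encoding.  The helper names (decode_as, tc_survives) and the list of
encodings are illustrative choices only, not part of any proposal; the
7-bit KOI variants are left out because, as far as I know, the Python
standard library has no codec for them.

```python
# Transmission check <tc>: code points U+00F2..U+00F6 (the five accented o's).
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"
TC_UTF8 = TC.encode("utf-8")  # bytes c3 b2 c3 b3 c3 b4 c3 b5 c3 b6


def decode_as(encoding):
    """Read the UTF-8 bytes of <tc> as if they were in `encoding`."""
    # errors="replace" keeps going on undefined byte values instead of raising.
    return TC_UTF8.decode(encoding, errors="replace")


def tc_survives(encoding):
    """True only if <tc> comes back as the intended characters."""
    return decode_as(encoding) == TC


# Hypothetical sample of encodings; extend the tuple to try others.
for enc in ("utf-8", "latin-1", "koi8-r", "cp1251", "iso8859-5"):
    print(f"{enc:10s} {decode_as(enc)!r:30s} survives={tc_survives(enc)}")
```

Only the utf-8 row should report the check as surviving; every other row
shows ten unrelated characters in place of the five accented o's.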


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Wed, 23 Jun 2010, Bollinger, John C wrote:

> On Wednesday, June 23, 2010 1:36 PM, Herbert J. Bernstein wrote:
>> All that is required to avoid the trap of unintended text transformations
>> from UTF-8 as if it were, say, Latin 1, is to add any string from the
>> Latin 1 supplement of the Unicode BMP.  I would suggest
>>    :#x00F2#x00F3#x00F4#x00F5#x00F6:
>> which as utf8 would be
>>    :#x00c3#x00b2#x00c3#x00b3#x00c3#x00b4#x00c3#x00b5#x00c3#x00b6:
>> which would come out as 5 accented lower case o's running through the
>> full set of accents if transmitted correctly, but as
>> capital A-tildes alternating with SUPERSCRIPT TWO, SUPERSCRIPT THREE,
>> ACUTE ACCENT, MICRO SIGN, PILCROW SIGN in the most likely mis-transmission
>> of a UTF8 file as a Latin-1 file.
> And similarly, it would come out as a different sequence of characters if the stream were misinterpreted according to a different wrong encoding.  So far so good.  That's fine when the true encoding is UTF-8, UTF-16, or any other in which characters U+00F2 - U+00F6 are representable.  It is inapplicable, however, when the true encoding is any of the many in which those characters are not representable, such as KOI8-R, many of the ISO-8859-x series, and as I understand it, most or all of the encodings specific to east Asian text (which generally do, whether formally or informally, incorporate ASCII as a subset, and are thus potentially suitable for CIF).
>> Let us call the code-point sequence #x00F2#x00F3#x00F4#x00F5#x00F6
>> the transmission check <tc>.  Then the proposed magic number would be
>> #\#CIF_2.0:<encoding>:<tc>:
>> Both the encoding and the tc would be optional, but highly recommended.
>> This might not allow fully automated decoding, but it would at least
>> provide a decent error check for many of the most common cases that
>> cause trouble, and would actually give us an edge over the XML
>> convention (which only gives the encoding) in terms of reliability.
> I am not opposed to the transmission check idea, but if something more generally applicable could be found then I would prefer it.  Nevertheless, in conjunction with UTF-8 as a canonical CIF2 representation, the transmission check would have wide applicability, especially in those areas where encoding mismatches are most likely to occur.
> Regards,
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group