Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. ...

Dear Colleagues,

   John is mistaken about the value of the transmission check for most
code pages.  Most of the Cyrillic code pages have quite distinctive
printable characters for these character values, and none of them would
return the transmission check as the intended characters.  At worst you
would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7).  Try it with
a few encodings and you will see it works pretty well.  I chose the
accented o's to get in a region of the code tables that is well
populated, so for most encodings you would really see the problem,
but even the case of :: is pretty clear.
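
   A minimal Python sketch of that experiment, assuming the file is
written as UTF-8 and then misread under some other encoding.  The helper
names seen_as and check_passes are illustrative only and are not part of
any CIF proposal; the encodings listed are just a sample of those that
Python's standard codecs support.

```python
# Transmission check <tc>: U+00F2..U+00F6, the five accented lower-case o's.
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"
MARKER = ":" + TC + ":"


def seen_as(encoding: str) -> str:
    """Return what a reader sees if UTF-8 bytes of ':<tc>:' are decoded as `encoding`."""
    return MARKER.encode("utf-8").decode(encoding, errors="replace")


def check_passes(encoding: str) -> bool:
    """True only if the marker survives the round trip unchanged."""
    return seen_as(encoding) == MARKER


# ascii() escapes non-ASCII characters so the output does not depend on the terminal.
for name in ("utf-8", "latin-1", "koi8-r", "cp1251", "iso8859-5"):
    print(name, ascii(seen_as(name)), "ok" if check_passes(name) else "MISMATCH")
```

Only the utf-8 row should report ok; every other row shows a visibly
different character sequence, which is the point of the check.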

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 23 Jun 2010, Bollinger, John C wrote:

>
> On Wednesday, June 23, 2010 1:36 PM, Herbert J. Bernstein wrote:
>
>> All that is required to avoid the trap of unintended text transformations
>> from UTF-8 as if it were, say, Latin 1, is to add any string from the
>> Latin 1 supplement of the Unicode BMP.  I would suggest
>>    :#x00F2#x00F3#x00F4#x00F5#x00F6:
>> which as utf8 would be
>>
>> :#x00c3#x00b2#x00c3#x00b3#x00c3#x00b4#x00c3#x00b5#x00c3#x00b6:
>>
>> which would come out as 5 accented lower case o's running through the
>> full set of accents if transmitted correctly, but as
>> capital A-tildes alternating with SUPERSCRIPT TWO, SUPERSCRIPT THREE,
>> ACUTE ACCENT, MICRO SIGN, PILCROW SIGN in the most likely mis-transmission
>> of a UTF8 file as a Latin-1 file.
>
> And similarly, it would come out as a different sequence of characters if the stream were misinterpreted according to a different wrong encoding.  So far so good.  That's fine when the true encoding is UTF-8, UTF-16, or any other in which characters U+00F2 - U+00F6 are representable.  It is inapplicable, however, when the true encoding is any of the many in which those characters are not representable, such as KOI8-R, many of the ISO-8859-x series, and as I understand it, most or all of the encodings specific to east Asian text (which generally do, whether formally or informally, incorporate ASCII as a subset, and are thus potentially suitable for CIF).
>
>> Let us call the code-point sequence #x00F2#x00F3#x00F4#x00F5#x00F6
>> the transmission check <tc>.  Then the proposed magic number would be
>>
>> #\#CIF_2.0:<encoding>:<tc>:
>>
>> Both the encoding and the tc would be optional, but highly recommended.
>> This might not allow fully automated decoding, but it would at least
>> provide a decent error check for many of the most common cases that
>> cause trouble, and would actually give us an edge over the XML
>> convention (which only gives the encoding) in terms of reliability.
>
> I am not opposed to the transmission check idea, but if something more generally applicable could be found then I would prefer it.  Nevertheless, in conjunction with UTF-8 as a canonical CIF2 representation, the transmission check would have wide applicability, especially in those areas where encoding mismatches are most likely to occur.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
