Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .

The solution I propose is tuned to files that were intentionally created
as UTF-8 CIF2 files but, because they were created on a system with
a different default encoding, were transferred as if they were in
that different encoding with automatic conversion to the default
encoding on the receiving end.  There is no problem if the CIF had been
created in the default encoding for the originating system.  It would
just be received in the default encoding for the receiving system
which could reliably convert the file to UTF-8 or whatever other
encoding it wishes to use.  The case I raised is particularly tricky
because the sender has no way to see their own error.  The file
looks like and is a valid UTF-8 file at their end.  If the receiver
sends it back by the reverse path, the copy that comes back will
compare perfectly.  Without the transmission check for the receiver
to use, the result of enforcing a UTF-8 canonical encoding on
machines that are not fully UTF-8 aware is to produce an undetectable
and therefore uncorrectable error.
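
A minimal sketch of this failure mode, not taken from the original message. It assumes a sender whose default encoding is Latin-1 and a receiver whose default is UTF-8; the sample text and the helper name transfer_as_text are hypothetical. The transfer decodes the UTF-8 bytes as if they were Latin-1 and re-encodes them for the receiver, so the received copy is damaged, yet the reverse path restores the original bytes exactly.

```python
# Hypothetical sample line; the accented o's are U+00F2..U+00F6.
original_text = "data_example :\u00f2\u00f3\u00f4\u00f5\u00f6:"


def transfer_as_text(raw: bytes, source_default: str, target_default: str) -> bytes:
    """Model a text-mode transfer that converts between default encodings."""
    return raw.decode(source_default).encode(target_default)


# The file on the sender's disk is valid UTF-8.
sent = original_text.encode("utf-8")

# Assumed defaults: the sender's system thinks text is Latin-1,
# the receiver's system stores text as UTF-8.
received = transfer_as_text(sent, "latin-1", "utf-8")

# The receiver sends the file back by the reverse path.
returned = transfer_as_text(received, "utf-8", "latin-1")

print("received differs from sent:", received != sent)
print("returned equals sent:      ", returned == sent)
print("what the receiver sees:    ", received.decode("utf-8"))
```

The sender's byte-for-byte comparison of the returned copy succeeds even though the receiver never saw the intended characters.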

Users of KOI8-R just use other code pages for text that goes beyond
that code page, just as I use CP1251 when I am working on a code-page
based system and need to send Cyrillic.  It is a major nuisance having
to switch code pages -- that is why Unicode is a better idea, and why
a user on a KOI8-R or CP1251-based system is highly likely to use
special editors to work with Unicode, and to get trapped by precisely
the case I described when they try to send that UTF-8 file as a text
file.  They really need the tc field.
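
As a rough illustration of how a receiver might use the tc field, the sketch below is built on assumptions, not on the proposal's exact wording. The check string (U+00F2 through U+00F6 in that order), the header layout, and the function name tc_survived are all hypothetical. It decodes the same UTF-8 header bytes under several encodings and reports whether the expected check string is still present.

```python
# Assumed check string: the accented o's U+00F2..U+00F6, in that order.
EXPECTED_TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"


def tc_survived(header: str) -> bool:
    """Return True if the expected check string appears between colons."""
    return f":{EXPECTED_TC}:" in header


# Hypothetical header layout: a magic number followed by :<tc>:
header_bytes = f"#\\#CIF_2.0 :{EXPECTED_TC}:".encode("utf-8")

# Decode the same bytes as the receiver would under each encoding.
results = {
    enc: tc_survived(header_bytes.decode(enc, errors="replace"))
    for enc in ("utf-8", "koi8_r", "cp1251", "latin-1")
}

for enc, ok in results.items():
    print(f"{enc:8s} check string intact: {ok}")
```

Only the UTF-8 interpretation leaves the check string intact; under the other three encodings the header visibly no longer matches.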

If you really want to stick to code-pages with chunks of text in
multiple encodings, you are likely to end up using multi-part
MIME with different parts in different encodings.  Then we
might want to look at multiple transmission check strings.




At 5:11 PM -0500 6/23/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 3:00 PM, Herbert J. Bernstein wrote:
>
>>    John is mistaken about the value of the transmission check for most
>>code pages.  Most of the Cyrillic code pages have quite distinctive
>>printable characters for these character values, and none of them would
>>return the transmission check as the intended characters.  At worst you
>>would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7)  Try it with
>>a few encodings and you will see it works pretty well.  I chose the
>>accented o's to get in a region of the code tables that are well
>>populated, so for most encodings you would really see the problem,
>>but even the case of :: is pretty clear.
>
>I think you misunderstand my point, but perhaps I'm just not getting 
>it.  Here's what I think I understand about the transmission check 
>proposal: the idea is to
>
>1) include a fixed, well-known code at a recognizable position in 
>the magic number.
>
>2) The code will be constructed using a sufficient number of chosen 
>non-ASCII characters such that
>
>3) decoding the byte stream according to the wrong encoding will 
>result in a transmission check string that differs from the 
>expected, well-known one.
>
>Have I got it?
>
>Supposing that I understand the proposal, I quite agree that if a 
>CIF encoded in UTF-8 and bearing the proposed transmission check 
>signature were misinterpreted as being encoded in pretty much any 
>other encoding, then the TC would reveal the mistake.  And that 
>would be valuable.
>
>The proposed scheme and specific signature would also work for a CIF 
>encoded in ISO-8859-1, ISO-8859-15, and perhaps a couple others of 
>the ISO-8859 family, except that it would not distinguish these one 
>from another.  As I understand it, however, the same procedure 
>could not be used for CIFs using most other encodings.  At best, 
>different TC codes would be required for most encodings.  For 
>example, KOI8-R does not have a way to express *any* of the 
>characters having Unicode code points U+00F2 - U+00F6, so CIF text 
>encoded in KOI8-R simply could not include the TC signature at all. 
>There are no codes for the requisite characters.  That does not stop 
>detection of UTF-8 misinterpreted as KOI8-R; rather, it makes it 
>impossible to even attempt the reverse.
>
>I am unaware of a single non-ASCII character that is representable 
>in substantially all encodings we might reasonably expect to be used 
>for CIF text.  A different TC signature could be chosen to use for 
>KOI8-R, ISO-8859-various, Shift-JIS, etc.  Each would provide a 
>decent check that a CIF encoded in the corresponding form was not 
>misinterpreted as being in some other form, provided that the 
>intended encoding were also known (for looking up the appropriate TC 
>code).  That's doable, but it's suddenly a lot more complicated, 
>requiring a lot of bookkeeping.
>
>As I said, I do not oppose the transmission check idea.  If UTF-8 
>were indeed designated the canonical encoding for CIF among many 
>alternatives, then I would expect TC to be applicable to a large 
>number of the important cases.  I doubt there is a simple 
>alternative that is significantly more general, but that's what I 
>would prefer if such could be found.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group
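
The point in the quoted message about which encodings can carry the check characters at all can be tested with a short sketch. The check string below (U+00F2 through U+00F6) and the helper name can_encode are assumptions made for illustration; the codec names are those of Python's standard library.

```python
# Assumed check string: the characters U+00F2..U+00F6.
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("utf-8", "latin-1", "iso8859-15", "koi8_r", "cp1251", "shift_jis"):
    print(f"{enc:11s} can express the check string: {can_encode(TC, enc)}")

# ISO-8859-1 and ISO-8859-15 encode these characters to identical bytes,
# so the check string cannot tell the two apart.
same_bytes = TC.encode("latin-1") == TC.encode("iso8859-15")
print("latin-1 and iso8859-15 bytes identical:", same_bytes)
```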


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
