Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Wed, 23 Jun 2010 19:13:03 -0400
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA541661229527@SJMEMXMBS11.stjude.sjcrh.local>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><AANLkTilolZk4SzLF8mzqOz4EagFJcEHDKOAblGMnoqpW@mail.gmail.com><alpine.BSF.2.00.1006212120510.91069@epsilon.pair.com><AANLkTiklvzlKquqlRQIrpPGZjJfuRzLqiv2E6Stcq6wd@mail.gmail.com><alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com><AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com><alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54166122951E@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTikih0j6-vyLDPMOqcTkoiK545yE28y4fU9JTUa2@mail.gmail.com><20100623103310.GD15883@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA541661229521@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229523@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229526@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231550410.30894@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229527@SJMEMXMBS11.stjude.sjcrh.local>
The solution I propose is tuned to files that were intentionally created as UTF-8 CIF2 files but, because they were created on a system with a different default encoding, were transferred as if they were in that different encoding, with automatic conversion to the default encoding on the receiving end. There is no problem if the CIF had been created in the default encoding for the originating system. It would just be received in the default encoding for the receiving system, which could reliably convert the file to UTF-8 or whatever other encoding it wishes to use.

The case I raised is particularly tricky because the sender has no way to see their own error. The file looks like, and is, a valid UTF-8 file at their end. If the receiver sends it back by the reverse path, the copy that comes back will compare perfectly. Without the transmission check for the receiver to use, the result of enforcing a UTF-8 canonical encoding on machines that are not fully UTF-8 aware is to produce an undetectable and therefore uncorrectable error.

Users of KOI8-R just use other code pages for text that goes beyond that code page, just as I use CP1251 when I am working on a code-page based system and need to send Cyrillic. It is a major nuisance having to switch code pages -- that is why Unicode is a better idea, and why a user on a KOI8-R or CP1251-based system is virtually certain to use special editors to work with Unicode, and get trapped precisely by the case I suggested when they try to send that UTF-8 file as a text file. They really need the tc field.

If you really want to stick to code pages with chunks of text in multiple encodings, you are likely to end up using multi-part MIME with different parts in different encodings. Then we might want to look at multiple transmission check strings.

At 5:11 PM -0500 6/23/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 3:00 PM, Herbert J. Bernstein wrote:
>
>> John is mistaken about the value of the transmission check for most
>>code pages. Most of the Cyrillic code pages have quite distinctive
>>printable characters for these character values, and none of them would
>>return the transmission check as the intended characters. At worst you
>>would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7). Try it with
>>a few encodings and you will see it works pretty well. I chose the
>>accented o's to get in a region of the code tables that are well
>>populated, so for most encodings you would really see the problem,
>>but even the case of :: is pretty clear.
>
>I think you misunderstand my point, but perhaps I'm just not getting
>it. Here's what I think I understand about the transmission check
>proposal: the idea is to
>
>1) include a fixed, well-known code at a recognizable position in
>the magic number.
>
>2) The code will be constructed using a sufficient number of chosen
>non-ASCII characters such that
>
>3) decoding the byte stream according to the wrong encoding will
>result in a transmission check string that differs from the
>expected, well-known one.
>
>Have I got it?
>
>Supposing that I understand the proposal, I quite agree that if a
>CIF encoded in UTF-8 and bearing the proposed transmission check
>signature were misinterpreted as being encoded in pretty much any
>other encoding, then the TC would reveal the mistake. And that
>would be valuable.
>
>The proposed scheme and specific signature would also work for a CIF
>encoded in ISO-8859-1, ISO-8859-15, and perhaps a couple others of
>the ISO-8859 family, except that it would not distinguish these one
>from another. As I understand it, however, the same procedure
>could not be used for CIFs using most other encodings. At best,
>different TC codes would be required for most encodings. For
>example, KOI8-R does not have a way to express *any* of the
>characters having Unicode code points U+00F2 - U+00F6, so CIF text
>encoded in KOI8-R simply could not include the TC signature at all.
>There are no codes for the requisite characters. That does not stop
>detection of UTF-8 misinterpreted as KOI8-R; rather, it makes it
>impossible to even attempt the reverse.
>
>I am unaware of a single non-ASCII character that is representable
>in substantially all encodings we might reasonably expect to be used
>for CIF text. A different TC signature could be chosen to use for
>KOI8-R, ISO-8859-various, Shift-JIS, etc. Each would provide a
>decent check that a CIF encoded in the corresponding form was not
>misinterpreted as being in some other form, provided that the
>intended encoding were also known (for looking up the appropriate TC
>code). That's doable, but it's suddenly a lot more complicated,
>requiring a lot of bookkeeping.
>
>As I said, I do not oppose the transmission check idea. If UTF-8
>were indeed designated the canonical encoding for CIF among many
>alternatives, then I would expect TC to be applicable to a large
>number of the important cases. I doubt there is a simple
>alternative that is significantly more general, but that's what I
>would prefer if such could be found.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer: www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
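
For anyone who wants to "try it with a few encodings" as suggested above, here is a minimal Python sketch of the check. It is only an illustration under stated assumptions: the TC string is taken to be the five accented o's U+00F2 - U+00F6 between colons, and it is assumed to sit right after the CIF2 magic code on the first line. Neither the exact signature nor its position is settled in this thread, and the names TC, HEADER, tc_survives and can_carry_tc are made up for the example.

```python
# Hypothetical transmission-check (TC) string: the accented o's
# U+00F2..U+00F6 between colons. The real signature is not fixed here.
TC = ":" + "".join(chr(cp) for cp in range(0xF2, 0xF7)) + ":"

# Assumed placement: directly after the CIF2 magic code on the first line.
HEADER = "#\\#CIF_2.0 " + TC + "\n"


def tc_survives(data: bytes, assumed_encoding: str) -> bool:
    """Decode data with assumed_encoding and report whether the TC is intact."""
    try:
        text = data.decode(assumed_encoding)
    except UnicodeDecodeError:
        # Bytes that are not even valid in the assumed encoding also fail.
        return False
    return TC in text


def can_carry_tc(encoding: str) -> bool:
    """Report whether the TC characters can be expressed in encoding at all."""
    try:
        TC.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    utf8_bytes = HEADER.encode("utf-8")
    # A UTF-8 file read under the wrong assumption should fail the check.
    for enc in ("utf-8", "koi8_r", "cp1251", "latin-1"):
        print(enc, "reads TC correctly:", tc_survives(utf8_bytes, enc))
    # Which encodings could carry this particular signature themselves?
    for enc in ("latin-1", "iso8859_15", "koi8_r"):
        print(enc, "can carry TC:", can_carry_tc(enc))
```

The first loop covers the UTF-8 case discussed in the message; the second covers John's point that this particular signature cannot be written in KOI8-R at all, while ISO-8859-1 and ISO-8859-15 can carry it but cannot be told apart by it.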