Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line

Dear James,

   Brian just sent a message explaining how the IUCr saw the conversions
happen in the past.  There is nothing miraculous or crazy about it.
It goes back to the earliest days of ARPANET when the telnet and ftp
protocols were first defined, so that the heterogeneous collection
of character encodings for text could be used in an interoperable way.
It was very useful, at times being used internally within a single
machine (doing an ftp from that machine to itself) to provide conversion
of line terminators in the days before we had nice unix utilities
for the purpose.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 28 Jun 2010, James Hester wrote:

> Hi Herbert: you again suggest that files are 'converted' during
> transmission.  I really need a concrete demonstration of this
> quasi-miraculous (or quasi-crazy) behaviour, or some confirmation from
> another quarter.
>
> I would note that your example KOI-8 user could continue to use their
> favourite editor and codepage, then run an encoding conversion tool
> over the resulting file to convert KOI-8 to UTF-8.  While such a step
> is relatively easy, it would be somewhat more complicated if the
> internal encoding declaration had to be hand-edited as well after this
> conversion, this time really in a UTF-8 compliant editor.  Why not
> just say 'UTF-8 only' and point people in the direction of conversion
> tools, so that they don't need to stray from their favourite editors?
> Your Mac tool (Cyclone?) sounded very capable for these purposes.
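
The conversion step James describes might look like the following in Python (the file paths and the KOI8-R source encoding are illustrative assumptions, not anything prescribed by the thread):

```python
# Illustrative sketch: re-encode a CIF written in KOI8-R as UTF-8,
# so the author never has to leave their favourite KOI8-R editor.

def convert_to_utf8(src_path, dst_path, src_encoding="koi8_r"):
    """Read a text file in src_encoding and write it back out as UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```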
>
> On Thu, Jun 24, 2010 at 9:13 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> The solution I propose is tuned to files that were intentionally created
>> as UTF-8 CIF2 files but, because they were created on a system with
>> a different default encoding, were transferred as if they were in
>> that different encoding with automatic conversion to the default
>> encoding on the receiving end.  There is no problem if the CIF had been
>> created in the default encoding for the originating system.  It would
>> just be received in the default encoding for the receiving system
>> which could reliably convert the file to UTF-8 or whatever other
>> encoding it wishes to use.  The case I raised is particularly tricky
>> because the sender has no way to see their own error.  The file
>> looks like and is a valid UTF-8 file at their end.  If the receiver
>> sends it back by the reverse path, the copy that comes back will
>> compare perfectly.  Without the transmission check for the receiver
>> to use, the result of enforcing a UTF-8 canonical encoding on
>> machines that are not fully UTF-8 aware is to produce an undetectable
>> and therefore uncorrectable error.
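
The trap described above can be reproduced in a few lines of Python (CP1251 stands in here for the receiving system's default encoding, and the sample text is an arbitrary assumption):

```python
# Sketch of the undetectable round-trip error: a valid UTF-8 file is
# misread as CP1251 in transit and "converted" to UTF-8 on arrival.
original = "rôle".encode("utf-8")        # sender's valid UTF-8 bytes

# Receiving end decodes as its default (CP1251), stores as UTF-8: mojibake.
received = original.decode("cp1251").encode("utf-8")
assert received != original              # the receiver's copy is corrupted

# Sending it back by the reverse path undoes the damage exactly, so the
# sender's comparison succeeds and the error goes unnoticed.
returned = received.decode("utf-8").encode("cp1251")
assert returned == original
```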
>>
>> Users of KOI8-R just use other code pages for text that goes beyond
>> that code page, just as I use CP1251 when I am working on a code-page
>> based system and need to send cyrillic.  It is a major nuisance having
>> to switch code pages -- that is why Unicode is a better idea, and why
>> a user on a KOI8-R or CP1251-based system is virtually certain to use
>> special editors to work with unicode,
>> and get trapped precisely by the case I suggested when they try
>> to send that UTF-8 file as a text file.  They really need the tc
>> field.
>>
>> If you really want to stick to code-pages with chunks of text in
>> multiple encodings, you are likely to end up using multi-part
>> MIME with different parts in different encodings.  Then we
>> might want to look at multiple transmission check strings.
>>
>>
>>
>>
>> At 5:11 PM -0500 6/23/10, Bollinger, John C wrote:
>>> On Wednesday, June 23, 2010 3:00 PM, Herbert J. Bernstein wrote:
>>>
>>>>    John is mistaken about the value of the transmission check for most
>>>> code pages.  Most of the Cyrillic code pages have quite distinctive
>>>> printable characters for these character values, and none of them would
>>>> return the transmission check as the intended characters.  At worst you
>>>> would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7).  Try it with
>>>> a few encodings and you will see it works pretty well.  I chose the
>>>> accented o's to get in a region of the code tables that are well
>>>> populated, so for most encodings you would really see the problem,
>>>> but even the case of :: is pretty clear.
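
Herbert's "try it with a few encodings" experiment is easy to reproduce; a minimal sketch, using the accented o's discussed in this thread as the check string (the colon delimiters are an assumption):

```python
# Check string built from the accented o's (U+00F2 - U+00F6).
tc = ":\u00f2\u00f3\u00f4\u00f5\u00f6:"
raw = tc.encode("utf-8")

# Decoding the UTF-8 byte stream under KOI8-R yields distinctive
# characters instead of the expected string, exposing the mix-up.
assert raw.decode("koi8_r") != tc
assert raw.decode("utf-8") == tc
```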
>>>
>>> I think you misunderstand my point, but perhaps I'm just not getting
>>> it.  Here's what I think I understand about the transmission check
>>> proposal: the idea is to
>>>
>>> 1) include a fixed, well-known code at a recognizable position in
>>> the magic number.
>>>
>>> 2) The code will be constructed using a sufficient number of chosen
>>> non-ASCII characters such that
>>>
>>> 3) decoding the byte stream according to the wrong encoding will
>>> result in a transmission check string that differs from the
>>> expected, well-known one.
>>>
>>> Have I got it?
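
A minimal sketch of steps 1-3 as summarized above (the offset, delimiters, and check string here are all hypothetical illustrations, not anything actually specified for CIF):

```python
# Hypothetical transmission check: a fixed, well-known string of chosen
# non-ASCII characters stored at a known byte position in the header.
EXPECTED_TC = ":\u00f2\u00f3\u00f4\u00f5\u00f6:"  # made-up check string
TC_OFFSET = 10                                     # made-up position

def tc_matches(raw: bytes, encoding: str) -> bool:
    """True if the bytes at the TC position decode, under the given
    encoding, to the expected well-known check string."""
    length = len(EXPECTED_TC.encode("utf-8"))      # TC stored as UTF-8
    try:
        candidate = raw[TC_OFFSET:TC_OFFSET + length].decode(encoding)
    except UnicodeDecodeError:
        return False
    return candidate == EXPECTED_TC
```

Decoding under the wrong encoding yields a different string (or fails outright), which is exactly the signal the proposal relies on.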
>>>
>>> Supposing that I understand the proposal, I quite agree that if a
>>> CIF encoded in UTF-8 and bearing the proposed transmission check
>>> signature were misinterpreted as being encoded in pretty much any
>>> other encoding, then the TC would reveal the mistake.  And that
>>> would be valuable.
>>>
>>> The proposed scheme and specific signature would also work for a CIF
>>> encoded in ISO-8859-1, ISO-8859-15, and perhaps a couple others of
>>> the ISO-8859 family, except that it would not distinguish these one
>>> from another.  As I understand it, however, the same procedure
>>> could not be used for CIFs using most other encodings.  At best,
>>> different TC codes would be required for most encodings.  For
>>> example, KOI8-R does not have a way to express *any* of the
>>> characters having Unicode code points U+00F2 - U+00F6, so CIF text
>>> encoded in KOI8-R simply could not include the TC signature at all.
>>> There are no codes for the requisite characters.  That does not stop
>>> detection of UTF-8 misinterpreted as KOI8-R; rather, it makes it
>>> impossible to even attempt the reverse.
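
John's KOI8-R observation is easy to confirm directly; a quick sketch:

```python
# KOI8-R has no code points for the accented o's U+00F2 - U+00F6, so a
# TC signature built from them cannot be written in that encoding at
# all, while single-byte Latin encodings like ISO-8859-1 handle it fine.
def representable(text: str, encoding: str) -> bool:
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

assert not representable("\u00f2\u00f3\u00f4\u00f5\u00f6", "koi8_r")
assert representable("\u00f2\u00f3\u00f4\u00f5\u00f6", "iso-8859-1")
```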
>>>
>>> I am unaware of a single non-ASCII character that is representable
>>> in substantially all encodings we might reasonably expect to be used
>>> for CIF text.  A different TC signature could be chosen to use for
>>> KOI8-R, ISO-8859-various, Shift-JIS, etc.  Each would provide a
>>> decent check that a CIF encoded in the corresponding form was not
>>> misinterpreted as being in some other form, provided that the
>>> intended encoding were also known (for looking up the appropriate TC
>>> code).  That's doable, but it's suddenly a lot more complicated,
>>> requiring a lot of bookkeeping.
>>>
>>> As I said, I do not oppose the transmission check idea.  If UTF-8
>>> were indeed designated the canonical encoding for CIF among many
>>> alternatives, then I would expect TC to be applicable to a large
>>> number of the important cases.  I doubt there is a simple
>>> alternative that is significantly more general, but that's what I
>>> would prefer if such could be found.
>>>
>>>
>>> Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
>>> St. Jude Children's Research Hospital
>>>
>>>
>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>>
>> --
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>> =====================================================
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
