On Wednesday, June 23, 2010 3:00 PM, Herbert J. Bernstein wrote:

>   John is mistaken about the value of the transmission check for most
>code pages.  Most of the Cyrillic code pages have quite distinctive
>printable characters for these character values, and none of them would
>return the transmission check as the intended characters.  At worst you
>would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7).  Try it with
>a few encodings and you will see it works pretty well.  I chose the
>accented o's to get in a region of the code tables that are well
>populated, so for most encodings you would really see the problem,
>but even the case of :: is pretty clear.

I think you misunderstand my point, but perhaps I'm just not getting it.  Here's what I think I understand about the transmission check proposal.  The idea is to

1) include a fixed, well-known code at a recognizable position in the magic number,

2) construct that code from a sufficient number of chosen non-ASCII characters, such that

3) decoding the byte stream according to the wrong encoding yields a transmission check string that differs from the expected, well-known one.

Have I got it?

Supposing that I understand the proposal, I quite agree that if a CIF encoded in UTF-8 and bearing the proposed transmission check signature were misinterpreted as being encoded in pretty much any other encoding, then the TC would reveal the mistake.  And that would be valuable.
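
To make that case concrete, here is a minimal sketch in Python.  It assumes the signature consists of the five characters U+00F2 - U+00F6 (the "accented o's"); the function name and the list of wrong encodings are mine, chosen only for illustration.

```python
# Assumed TC signature: the five characters U+00F2..U+00F6.
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"

def survives(tc: str, actual: str, assumed: str) -> bool:
    """Encode tc with the actual encoding, decode with the assumed one,
    and report whether the signature comes back unchanged."""
    data = tc.encode(actual)
    # errors="replace" keeps undecodable bytes from raising an exception.
    return data.decode(assumed, errors="replace") == tc

# A few encodings a reader might wrongly assume for a UTF-8 file.
wrong_guesses = ["iso-8859-1", "koi8_r", "cp1251", "shift_jis"]
results = {enc: survives(TC, "utf-8", enc) for enc in wrong_guesses}

for enc, ok in results.items():
    print(enc, "signature intact" if ok else "signature mangled")
```

For each of these wrong guesses the decoded signature differs from the expected one, which is exactly the detection I agree is valuable.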

The proposed scheme and specific signature would also work for a CIF encoded in ISO-8859-1, ISO-8859-15, and perhaps a couple of others of the ISO-8859 family, except that it would not distinguish these from one another.  As I understand it, however, the same procedure could not be used for CIFs in most other encodings.  At best, different TC codes would be required for most encodings.  For example, KOI8-R does not have a way to express *any* of the characters having Unicode code points U+00F2 - U+00F6, so CIF text encoded in KOI8-R simply could not include the TC signature at all.  There are no codes for the requisite characters.  That does not stop detection of UTF-8 misinterpreted as KOI8-R; rather, it makes it impossible to even attempt the reverse.
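
Both halves of that can be checked with a short Python sketch, again assuming the signature is U+00F2 - U+00F6.  The helper name can_encode is mine.

```python
# Assumed TC signature: the five characters U+00F2..U+00F6.
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"

def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text has a code in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ISO-8859-1 and ISO-8859-15 produce identical bytes for the signature,
# so the check cannot tell one from the other.
same_bytes = TC.encode("iso-8859-1") == TC.encode("iso-8859-15")

# KOI8-R has no code for any of the five characters.
koi8_per_char = [can_encode(ch, "koi8_r") for ch in TC]

print("ISO-8859-1 and -15 agree:", same_bytes)
print("KOI8-R can encode each character:", koi8_per_char)
```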

I am unaware of a single non-ASCII character that is representable in substantially all encodings we might reasonably expect to be used for CIF text.  A different TC signature could be chosen to use for KOI8-R, ISO-8859-various, Shift-JIS, etc.  Each would provide a decent check that a CIF encoded in the corresponding form was not misinterpreted as being in some other form, provided that the intended encoding were also known (for looking up the appropriate TC code).  That's doable, but it's suddenly a lot more complicated, requiring a lot of bookkeeping.
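
The bookkeeping I have in mind would look something like the following Python sketch.  The table and its contents are purely hypothetical: nobody has proposed the KOI8-R signature shown, and I assume the :<tc>: framing from the proposal.

```python
# Hypothetical per-encoding signature table; the entries are illustrative only.
LATIN_TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"        # U+00F2..U+00F6
CYRILLIC_TC = "\u0436\u0437\u0438\u0439\u043a"     # made-up KOI8-R signature

TC_BY_ENCODING = {
    "utf-8": LATIN_TC,
    "iso-8859-1": LATIN_TC,
    "koi8_r": CYRILLIC_TC,
}

def check(data: bytes, declared: str) -> bool:
    """Decode data as the declared encoding and look for that
    encoding's own signature, framed as :<tc>:."""
    expected = TC_BY_ENCODING.get(declared)
    if expected is None:
        return False  # no table entry, so no check is possible
    text = data.decode(declared, errors="replace")
    return f":{expected}:" in text

header = f":{CYRILLIC_TC}:".encode("koi8_r")
print(check(header, "koi8_r"))       # declared encoding matches
print(check(header, "iso-8859-1"))   # wrong declared encoding
```

Note that the check only means something when the declared encoding is available to select the right table entry, which is the extra requirement I am pointing at.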

As I said, I do not oppose the transmission check idea.  If UTF-8 were indeed designated the canonical encoding for CIF among many alternatives, then I would expect TC to be applicable to a large number of the important cases.  I doubt there is a simple alternative that is significantly more general, but that's what I would prefer if such could be found.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

