
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .. .. .

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .. .. .
From: "Bollinger, John C" <[email protected]>
Date: Mon, 28 Jun 2010 11:08:23 -0500

On Sunday, June 27, 2010 8:23 AM, Herbert J. Bernstein wrote:
> The trick is to put some reasonable selection of accented characters into the check string. Most of the non-accented roman characters are common to a very wide range of encodings. It happens that accented lower case o's work fairly well for detecting a lot of the most common encodings. If you also explicitly state the intended encoding, the chances of a misidentification are probably as low as you are going to get. Nothing is perfect, but the combination of an encoding field and a transmission check is, I think, well worth considering.

I believe I have found the way I was looking for to enable the transmission check signature to distinguish among all conceivable extensions of ASCII. The key is that the signature must include not only the encoded characters, but also an ASCII representation of the Unicode code points of those characters (important: not their code points in some other character set / code page). The processor then decodes the characters according to whatever it thinks is the proper encoding, obtaining, perhaps indirectly, their Unicode code points (which may be numerically greater than 255). It compares them, numerically, to the expected code point sequence, and the check is successful if they all match. In principle, it could try multiple encodings to attempt to find the right one.
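The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `check_signature` and `find_encoding` are hypothetical, and Python's built-in codecs stand in for whatever decoding machinery a real CIF processor would use.

```python
from typing import Iterable, Optional, Sequence


def check_signature(expected: Sequence[int], encoded: bytes, encoding: str) -> bool:
    """Decode the signature bytes and compare code points numerically.

    `expected` holds the Unicode code points given in ASCII in the header;
    `encoded` holds the raw signature bytes as they appear in the file.
    """
    try:
        decoded = encoded.decode(encoding)
    except (UnicodeError, LookupError):
        # Bytes invalid in this encoding, or encoding unknown to Python.
        return False
    return [ord(ch) for ch in decoded] == list(expected)


def find_encoding(expected: Sequence[int], encoded: bytes,
                  candidates: Iterable[str]) -> Optional[str]:
    """Try several encodings and return the first one that passes the check."""
    for name in candidates:
        if check_signature(expected, encoded, name):
            return name
    return None


points = [0xF2, 0xF3, 0xF4, 0xF5, 0xF6]
raw = bytes.fromhex("c3b2c3b3c3b4c3b5c3b6")
print(check_signature(points, raw, "utf-8"))
print(find_encoding(points, raw, ["ascii", "latin-1", "utf-8"]))
```

Note that `find_encoding` stops at the first candidate that passes, so the order of the candidate list matters when one candidate is a superset of another.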

This has the added advantage that it is unnecessary to determine a TC character sequence in advance for any particular encoding. Finding suitable choices would be a process amenable to computation. All encodings whose character repertoires are subsets of Unicode's are compatible with this approach in principle, for in the worst case the signature could contain the entire character repertoire representable by the encoding. I don't anticipate that more than a small number of characters would be needed in any particular signature, however.
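As one illustration of how such a computation might go, here is a greedy sketch that picks characters whose encoded bytes decode differently (or not at all) under each rival encoding. The names `distinguishes` and `find_signature`, the choice of rival encodings, and the candidate range are all assumptions made for the example. It tests characters one at a time, so for multi-byte encodings the assembled signature should still be verified as a whole.

```python
from typing import Iterable, Set, Tuple


def distinguishes(ch: str, target: str, rival: str) -> bool:
    """True if `ch`, encoded in `target`, is not read back as `ch` under `rival`."""
    data = ch.encode(target)
    try:
        return data.decode(rival) != ch
    except UnicodeDecodeError:
        return True  # the rival cannot decode these bytes at all


def find_signature(target: str, rivals: Iterable[str],
                   candidates: Iterable[str]) -> Tuple[str, Set[str]]:
    """Greedily choose characters until every rival is told apart.

    Returns the chosen characters and the set of rivals that could not be
    distinguished (for example, supersets of the target over the candidates).
    """
    remaining = set(rivals)
    chosen = []
    for ch in candidates:
        if not remaining:
            break
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            continue  # character not representable in the target encoding
        hits = {r for r in remaining if distinguishes(ch, target, r)}
        if hits:
            chosen.append(ch)
            remaining -= hits
    return "".join(chosen), remaining


# Hypothetical scenario: the file is Latin-1; rule out two other 8-bit encodings.
candidates = (chr(cp) for cp in range(0xA1, 0x100))
signature, unresolved = find_signature("latin-1", ["iso8859-15", "cp437"], candidates)
print([format(ord(ch), "X") for ch in signature], unresolved)
```

Any rival left in the second return value agrees with the target on every candidate tried, which is the superset situation in which misidentification is harmless for those characters.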

For example, the magic number with encoding tag and augmented TC signature might look something like this (using Herb's suggested signature):

#\#CIF_2.0:UTF-8:F2,F3,F4,F5,F6;<c3><b2><c3><b3><c3><b4><c3><b5><c3><b6>

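where each <xx> stands for a single byte having the given hexadecimal value, and all other characters are literal. Here F2 through F6 are the Unicode code points of ò, ó, ô, õ, and ö, and each byte pair that follows the semicolon is the UTF-8 encoding of one of them. There is no need to restrict the signature to characters in the Latin-1 range.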
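To make the layout concrete, the following sketch builds and parses a line of that shape. It assumes the field separators shown in the example (`:` between fields, `,` between code points, `;` before the encoded characters); the helper names `build_magic` and `parse_magic` are hypothetical, and nothing here is a settled part of any CIF specification.

```python
from typing import List, Tuple

MAGIC = "#\\#CIF_2.0"  # the literal characters #\#CIF_2.0


def build_magic(encoding: str, chars: str) -> bytes:
    """Assemble magic number, encoding tag, code points, and encoded characters."""
    points = ",".join(format(ord(ch), "X") for ch in chars)
    header = f"{MAGIC}:{encoding}:{points};".encode("ascii")
    return header + chars.encode(encoding)


def parse_magic(line: bytes) -> Tuple[str, List[int], bytes]:
    """Split a magic line into (encoding tag, expected code points, raw bytes)."""
    header, sep, encoded = line.partition(b";")
    if not sep:
        raise ValueError("no ';' separating header from signature bytes")
    # The header is pure ASCII, so it can be read before the encoding is known.
    magic, encoding, points = header.decode("ascii").split(":")
    if magic != MAGIC:
        raise ValueError("not a CIF 2.0 magic number")
    return encoding, [int(p, 16) for p in points.split(",")], encoded


line = build_magic("UTF-8", "\u00f2\u00f3\u00f4\u00f5\u00f6")
encoding, expected, raw = parse_magic(line)
decoded = raw.decode(encoding)
print([ord(ch) for ch in decoded] == expected)
```

The sketch does not strip a trailing end-of-line; a real reader would need to remove it before decoding the signature bytes.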

This approach, with a suitable choice of signature (which need not be known in advance by the receiver), can in principle distinguish any chosen encoding from *all* others that are not strict supersets. Superset encodings do not present a risk of erroneous decoding, so that's not an important limitation for our purpose.

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer