Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .. .. .


On Sunday, June 27, 2010 8:23 AM, Herbert J. Bernstein wrote:
>   The trick is to put some reasonable selection of accented characters into the check string.  Most of the non-accented roman characters are common to a very wide range of encodings.  It happens that accented lower case o's work fairly well for detecting a lot of the most common encodings.  If you also explicitly state the intended encoding, the chances of a misidentification are probably as low as you are going to get.  Nothing is perfect, but the combination of an encoding field and a transmission check is, I think, well worth considering.

I believe I have found the way I was looking for to enable the transmission check signature to distinguish among all conceivable extensions of ASCII.  The key is that the signature must include not only the encoded characters, but also an ASCII representation of the Unicode code points of those characters (important: not their code points in some other character set / code page).  The processor then decodes the characters according to whatever it thinks is the proper encoding, obtaining, perhaps indirectly, their Unicode code points (which may be numerically greater than 255).  It compares them, numerically, to the expected code point sequence, and the check is successful if they all match.  In principle, it could try multiple encodings to attempt to find the right one.
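
A minimal sketch of the receiving side of this check, in Python. The function names `signature_matches` and `guess_encoding` are hypothetical, and the sketch assumes the ASCII list of code points has already been parsed into integers and that Python's built-in codecs stand in for "whatever the processor thinks is the proper encoding".

```python
def signature_matches(expected_code_points, encoded, encoding):
    """Decode `encoded` with `encoding` and compare the resulting
    Unicode code points, numerically, with the expected sequence."""
    try:
        decoded = encoded.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Bytes are invalid in this encoding, or the codec is unknown.
        return False
    return [ord(ch) for ch in decoded] == list(expected_code_points)


def guess_encoding(expected_code_points, encoded, candidates):
    """Try several candidate encodings in order (hypothetical helper);
    return the first one that passes the check, or None."""
    for name in candidates:
        if signature_matches(expected_code_points, encoded, name):
            return name
    return None
```

With the expected code points F2 through F6 and the ten bytes c3 b2 c3 b3 c3 b4 c3 b5 c3 b6, the check passes for UTF-8 and fails for ISO-8859-1, which decodes the same bytes into ten characters rather than five.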

This has the added advantage that no transmission check (TC) character sequence needs to be fixed in advance for any particular encoding.  Finding suitable choices would be a process amenable to computation.  All encodings whose character repertoires are subsets of Unicode's are compatible with this approach in principle, for in the worst case the signature could contain the entire character repertoire representable by the encoding.  I don't anticipate that more than a small number of characters would be needed in any particular signature, however.
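
One way such a computation could go is a greedy search, sketched here in Python. This is an illustration under stated assumptions, not a proposed algorithm: the names `confusable` and `pick_signature` are hypothetical, the candidate characters and rival encodings are supplied by the caller, and every encoding name is assumed to be a codec Python knows.

```python
def confusable(text, target, rivals):
    """Return the rival encodings that decode the target-encoded bytes
    of `text` back to exactly the same code points."""
    data = text.encode(target)
    same = []
    for rival in rivals:
        try:
            if data.decode(rival) == text:
                same.append(rival)
        except UnicodeDecodeError:
            pass  # the rival rejects the bytes, so it is distinguished
    return same


def pick_signature(candidates, target, rivals):
    """Greedily add characters that shrink the set of rivals still
    indistinguishable from `target`.  Returns (signature, remaining)."""
    signature = ""
    remaining = confusable(signature, target, rivals)
    for ch in candidates:
        if not remaining:
            break
        trial = signature + ch
        try:
            still = confusable(trial, target, remaining)
        except UnicodeEncodeError:
            continue  # the target encoding cannot represent this character
        if len(still) < len(remaining):
            signature, remaining = trial, still
    return signature, remaining
```

Any encoding left in `remaining` decodes every candidate that the target can encode identically to the target. For example, with ISO-8859-1 as the target and the candidates "òó¤", the currency sign separates ISO-8859-15 but not Windows-1252, which agrees with ISO-8859-1 on all three characters.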

For example, the magic number with encoding tag and augmented TC signature might look something like this (using Herb's suggested signature):

#\#CIF_2.0:UTF-8:F2,F3,F4,F5,F6;<c3><b2><c3><b3><c3><b4><c3><b5><c3><b6>

where <hexadecimal number> is meant to represent a byte having the given numeric value, and all other characters are literal.  There is no need to restrict the signature to characters in the Latin-1 range.
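
For the sending side, a line of this shape could be generated as sketched below in Python. The layout (colon-separated fields, comma-separated uppercase hexadecimal code points, a semicolon before the encoded characters) simply follows the example above, which is itself only "something like" a final syntax; the function name `build_header` is hypothetical.

```python
def build_header(encoding, signature, magic="#\\#CIF_2.0"):
    """Build the magic-number line as bytes: an ASCII part giving the
    encoding tag and the Unicode code points of the signature, then the
    signature characters themselves in the declared encoding."""
    # ASCII representation of the Unicode code points, e.g. "F2,F3".
    code_points = ",".join(format(ord(ch), "X") for ch in signature)
    ascii_part = "{}:{}:{};".format(magic, encoding, code_points)
    # Only the signature characters are written in the declared encoding.
    return ascii_part.encode("ascii") + signature.encode(encoding)


# The signature from the example above: o with grave, acute,
# circumflex, tilde and diaeresis (U+00F2 through U+00F6).
line = build_header("UTF-8", "\u00f2\u00f3\u00f4\u00f5\u00f6")
```

Nothing in the sketch limits the signature to the Latin-1 range: passing the euro sign (U+20AC) yields the code point field `20AC` followed by the UTF-8 bytes e2 82 ac.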

This approach, with a suitable choice of signature (which need not be known in advance by the receiver), can in principle distinguish any chosen encoding from *all* others that are not strict supersets.  Superset encodings do not present a risk of erroneous decoding, so that's not an important limitation for our purpose.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


Email Disclaimer:  www.stjude.org/emaildisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
