Discussion List Archives

Re: [Cif2-encoding] A new(?) compromise position. .


On Wednesday, September 29, 2010 12:00 PM, Herbert J. Bernstein wrote:
>   I know from long and painful experience that files with just a few accented characters are very, very difficult to clearly identify, and can look like valid UTF8 files.  UTF8 is _not_ self-identifying without the BOM.

UTF-8 is not deterministically self-identifying either with or without a BOM.  It is always conceivable that the supposed BOM bytes are intended as ordinary encoded characters in some other encoding.  On the other hand, I don't know of any other character encoding in which a well-formed CIF could begin with the bytes of a UTF-8 BOM.  Perhaps that makes a BOM sufficient for our purposes.
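To make the BOM point concrete, here is a minimal sketch in Python (chosen only for illustration). The function name and the sample bytes are hypothetical. It shows the BOM check itself, and also that the same three bytes decode without complaint as ordinary characters in ISO-8859-1, which is why the check is a convention rather than a proof.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte stream begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample inputs, with and without a leading BOM.
with_bom = codecs.BOM_UTF8 + b"# sample\n"
without_bom = b"# sample\n"

# The same three bytes are also three legal characters in ISO-8859-1,
# so their presence alone cannot prove that the stream is UTF-8.
bom_read_as_latin1 = codecs.BOM_UTF8.decode("latin-1")

print(starts_with_utf8_bom(with_bom), starts_with_utf8_bom(without_bom))
print(repr(bom_read_as_latin1))
```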

>   The case that really convinced me that there was a problem was a French document with a lower case e with an accent acute on the E.  I nearly missed a misencoding of a Mac native file that, because it was being misread as a capital E in a UTF8 file, showed the accent as grave.

I wonder whether you and James are thinking along different lines.  The error you describe indeed sounds tricky to catch by eye, but isolated non-ASCII characters should never pose a problem for computer detection of UTF-8 vs. single-byte encodings.  For ASCII-compatible, non-UTF-8 encoded text to be simultaneously valid UTF-8 requires that non-ASCII character codes occur only in particular 2-, 3-, and 4-byte patterns.  That can never be the case for isolated non-ASCII characters, thus even one such isolated character is enough to let a decoder determine that its input is not valid UTF-8.
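As an illustration of the isolated-character argument, the following Python sketch encodes a short French phrase (my own sample text, not taken from any real file) in three encodings and asks a strict UTF-8 decoder whether each byte stream is valid. The helper name is hypothetical. In the single-byte encodings the accented letter is one non-ASCII byte flanked by ASCII, so strict decoding must fail.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the whole byte stream is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # Python decodes strictly unless told otherwise
    except UnicodeDecodeError:
        return False
    return True


text = "un café noir"  # the accented letter sits between ASCII characters

as_utf8 = text.encode("utf-8")           # é becomes the two bytes C3 A9
as_latin1 = text.encode("latin-1")       # é becomes the single byte E9
as_mac_roman = text.encode("mac_roman")  # é becomes the single byte 8E

for label, data in [("utf-8", as_utf8),
                    ("latin-1", as_latin1),
                    ("mac_roman", as_mac_roman)]:
    print(label, is_valid_utf8(data))
```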

That does not, however, prevent a UTF-8 decoder from attempting to recover in some way from a decoding error.  Most that I have dealt with do so by default, often with no message to the caller.  An application developer relying on his decoder to catch invalid UTF-8 must therefore ensure that it is able to signal decoding errors to its caller and is configured to do so.
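The difference between a decoder that recovers silently and one that signals the error can be sketched in Python. Python's own `bytes.decode` happens to be strict by default, so the lenient behaviour has to be requested with `errors="replace"`; other libraries may default the other way. The sample bytes and the function name are hypothetical.

```python
data = "un café noir".encode("latin-1")  # not valid UTF-8: lone byte E9

# Lenient mode: the bad byte is silently replaced by U+FFFD and decoding
# carries on, so the caller never learns that the input was not UTF-8.
lenient = data.decode("utf-8", errors="replace")


def decode_strictly(raw: bytes) -> str:
    """Decode as UTF-8, turning any malformed input into a reported error."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        raise ValueError(f"input is not UTF-8 (bad byte at offset {exc.start})") from exc


print(repr(lenient))
try:
    decode_strictly(data)
except ValueError as err:
    print(err)
```

Only the strict path gives the application a chance to reject the file or to try another encoding.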

Alternatively, are you confident that in this example you have laid the blame in the right place?  I can only speculate, but I note that a double decoding (UTF-8 decoding followed by interpreting the result as Mac OS Roman) would have exactly the result you describe.  Likewise, a confusion between Mac OS Roman (assumed) and ISO-8859-1 (actual) would do the same.  I'm having trouble, however, figuring out how an error recovery mechanism might have this result, and if the character in question was between ASCII characters (highly likely for French) then the result cannot have arisen from an error-free, yet wrong UTF-8 decoding.
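Both alternative explanations can be reproduced in a few lines of Python, using the standard `latin-1` and `mac_roman` codecs. This is only a sketch of the two hypothesized mix-ups, not a reconstruction of what actually happened to the file in question. The key fact is that byte E9 is é in ISO-8859-1 but È in Mac OS Roman.

```python
original = "é"  # lower case e with acute accent, U+00E9

# Scenario 1: the bytes are really ISO-8859-1, but the reader assumes
# Mac OS Roman.
latin1_read_as_mac = original.encode("latin-1").decode("mac_roman")

# Scenario 2: double decoding. The UTF-8 bytes are decoded correctly first,
# then the resulting code points are treated as byte values and decoded a
# second time as Mac OS Roman.
decoded_once = original.encode("utf-8").decode("utf-8")
double_decoded = bytes(ord(ch) for ch in decoded_once).decode("mac_roman")

print(latin1_read_as_mac, double_decoded)
```

In both scenarios the lower case e with acute accent comes out as a capital E with grave accent.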

>   There are simply too many cases like that in which a file written in a non-UTF8 encoding looks like something reasonable, but wrong, to say that UTF without the BOM is self-identifying.

I have no strong objection at the moment to requiring a BOM to identify UTF-8 CIFs.  Nevertheless, I'm not yet persuaded that the risk of mis-identifying a byte stream as UTF-8 is so significant.  As far as I can tell, the worst possible case is when the true encoding is ASCII-compatible (but not UTF-8) and among the input are exactly two encoded non-ASCII characters, adjacent to each other.  Assuming equal probability of all non-ASCII characters, the likelihood of the byte stream being valid UTF-8 is around 12%.  If there are two such pairs or one triple, then the likelihood drops to under 3%.  It goes down rapidly from there with additional non-ASCII characters (to zero if any occur isolated).
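These figures can be checked by brute force. The Python sketch below enumerates every run of two or three bytes drawn from 0x80 to 0xFF and counts how many decode as strict UTF-8. It assumes, as above, that all non-ASCII byte values are equally likely and that each run is bounded by ASCII on both sides, so each run must be valid on its own and separate runs are independent. The function name is hypothetical, and the three-byte enumeration covers about two million cases, so it may take a few seconds.

```python
from itertools import product

NON_ASCII = range(0x80, 0x100)  # the 128 byte values outside ASCII


def valid_fraction(run_length: int) -> float:
    """Fraction of non-ASCII byte runs of this length that are valid UTF-8."""
    total = 0
    valid = 0
    for combo in product(NON_ASCII, repeat=run_length):
        total += 1
        try:
            bytes(combo).decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            continue
        valid += 1
    return valid / total


one_pair = valid_fraction(2)    # exactly two adjacent non-ASCII bytes
one_triple = valid_fraction(3)  # exactly three adjacent non-ASCII bytes
two_pairs = one_pair ** 2       # two separate pairs, treated as independent

print(f"one pair:   {one_pair:.2%}")
print(f"two pairs:  {two_pairs:.2%}")
print(f"one triple: {one_triple:.2%}")
```

A single isolated byte gives `valid_fraction(1) == 0.0`, which is the zero-probability case noted above.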

The likelihood of such an input being presented in the first place must also be factored in, including whatever influence may be exerted by the fact that the file would not be valid CIF (on account of using non-ASCII characters but not being encoded in UTF-8 or UTF-16).  It's hard to gauge the actual risk, but I'm with James in estimating it to be very low.

UTF-16 is different.  Because the first character of a well-formed CIF (ignoring any BOM) must be from the ASCII subset, and because CIF does not allow character U+0000, it is always possible to distinguish a well-formed CIF encoded in UTF-16 from a well-formed CIF encoded in any ASCII-compatible encoding, EBCDIC, or most, if not all, other encodings.  Even BE vs. LE can be readily distinguished for CIF, but note that UTF-16 without a byte-order mark is BE by definition.  (Refer to Unicode 5.2, section 3.10, definition 98.)
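The UTF-16 argument translates into a very small test on the first two bytes. The Python sketch below is an assumption-laden illustration rather than a complete detector: the function name and sample text are hypothetical, any BOM is assumed to have been handled already, and the input is assumed to be a well-formed CIF. Because the first character is ASCII, its UTF-16 code unit has one zero byte and one non-zero ASCII byte. Because U+0000 is not allowed, an ASCII-compatible encoding cannot produce a zero byte at all.

```python
from typing import Optional


def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Guess the byte order of BOM-less UTF-16, or return None if not UTF-16.

    Relies on the first character being ASCII and on U+0000 being forbidden.
    """
    if len(data) < 2:
        return None
    first, second = data[0], data[1]
    if first == 0x00 and 0x00 < second < 0x80:
        return "utf-16-be"  # high-order zero byte comes first
    if second == 0x00 and 0x00 < first < 0x80:
        return "utf-16-le"  # low-order ASCII byte comes first
    return None             # no zero byte: not UTF-16 under these assumptions


sample = "# sample\n"  # hypothetical text starting with an ASCII character
print(sniff_bomless_utf16(sample.encode("utf-16-be")))
print(sniff_bomless_utf16(sample.encode("utf-16-le")))
print(sniff_bomless_utf16(sample.encode("ascii")))
```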

With that said, I also have no strong objection at the moment to requiring a BOM to identify UTF-16 CIFs, though I don't see much advantage to it.

John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

cif2-encoding mailing list
