Discussion List Archives

Re: [ddlm-group] UTF-8 BOM

Herbert Bernstein wrote:
>Let me see if I understand this correctly -- a user takes 2 perfectly good
>CIF2 files, edits each to clean up, say, some comments to keep straight where
>one begins and one ends, using a well-designed modern text editor that
>happens to put a BOM at the start of each file, concatenates the two files
>with cat to ship them into the IUCr, and suddenly they have a syntax error
>caused by a character that they cannot see!!!
>To me this seems pointless when it is trivial for software to recognize the
>character and handle it sensibly.

And that is my principal rationale for preferring that embedded U+FEFF
be recognized as CIF whitespace.  With that approach, the
concatenation of two well-formed CIF2 files is always a well-formed
CIF2 file, regardless of the presence or absence of BOMs in the
original files.  Note, too, that such concatenation cannot produce a
mixed-encoding file because files encoded in UTF-16[BE|LE],
UTF-32[BE|LE], or any other encoding that can be distinguished from
UTF-8 are not well-formed CIF2 files to begin with.  The file concatenation
scenario thus does not provide a use case for the CIF2 *specification*
to recognize embedded U+FEFF as an encoding marker.
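
To make the concatenation scenario concrete, here is a minimal Python sketch. It is only an illustration: the file contents are toy stand-ins, the names (concatenate, decode_cif_text, split_tokens, WHITESPACE) are hypothetical, and the tokenizer ignores comments and delimited strings entirely. It shows that byte-wise concatenation of two BOM-prefixed UTF-8 files leaves one embedded U+FEFF, and that counting U+FEFF as whitespace makes that character harmless.

```python
import codecs

# Toy stand-ins for two CIF2 files, each saved by an editor that adds a BOM.
file_a = codecs.BOM_UTF8 + "data_a\n_x 1\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "data_b\n_y 2\n".encode("utf-8")


def concatenate(*chunks):
    """Byte-wise concatenation, which is what `cat` does."""
    return b"".join(chunks)


def decode_cif_text(raw):
    """Decode as UTF-8 and drop a leading BOM only; embedded ones remain."""
    text = raw.decode("utf-8")
    return text[1:] if text.startswith("\ufeff") else text


# Assumption for this sketch: U+FEFF joins the ordinary whitespace characters.
WHITESPACE = {" ", "\t", "\r", "\n", "\ufeff"}


def split_tokens(text):
    """Split on whitespace; comments and delimited strings are not handled."""
    tokens, current = [], []
    for ch in text:
        if ch in WHITESPACE:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


text = decode_cif_text(concatenate(file_a, file_b))
tokens = split_tokens(text)
```

Here text still contains the second file's BOM as an embedded U+FEFF, yet tokens holds the same six tokens that would result had neither file carried a BOM.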

On the other hand, I again feel compelled to distinguish program
behaviors from the CIF2 format specification.  None of the above would
prevent a CIF processor from recognizing and handling CIF-like
character streams encoded via schemes other than UTF-8, nor from
recognizing embedded U+FEFF code sequences in various encodings as
encoding switches, thereby handling mixed-encoding files.  Indeed,
such a program or library would be invaluable for correcting
encoding-related errors.  That does not, however, mean that such files
must be considered well-formed CIF2, no matter how likely they may (or
may not) be to arise.

James Hester wrote:
> I would be happy to call an embedded BOM a syntax error.

In light of the possibility of U+FEFF appearing in a data value (for
example, from cutting text from a Unicode manuscript and pasting it
into a CIF), I need to refine my earlier blanket alternative of
treating embedded U+FEFF as a syntax error.  I now think it would be
OK to treat U+FEFF as a syntax error *provided* that it appears
outside a delimited string.  That's still not my preference, though,
and I feel confident that Herb will still disagree.


John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]

ddlm-group mailing list
International Union of Crystallography