Re: [ddlm-group] UTF-8 BOM

Herbert Bernstein wrote:
>Let me see if I understand this correctly -- a user takes 2 perfectly good
>CIF2 files, edits each to clean up, say, some comments to keep straight where
>one begins and one ends, using a well-designed modern text editor that
>happens to put a BOM at the start of each file, concatenates the two files
>with cat to ship them into the IUCr, and suddenly they have a syntax error
>caused by a character that they cannot see!!!
>To me this seems pointless when it is trivial for software to recognize the
>character and handle it sensibly.

And that is my principal rationale for preferring that embedded U+FEFF
be recognized as CIF whitespace.  With that approach, the
concatenation of two well-formed CIF2 files is always a well-formed
CIF2 file, regardless of the presence or absence of BOMs in the
original files.  Note, too, that such concatenation cannot produce a
mixed-encoding file, because files encoded in UTF-16[BE|LE],
UTF-32[BE|LE], or any other encoding that can be distinguished from
UTF-8 are not well-formed CIF2 files to begin with.  The file
concatenation scenario thus does not provide a use case for the CIF2
*specification* to recognize embedded U+FEFF as an encoding marker.
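
To make the concatenation scenario concrete, here is a minimal Python
sketch.  The two fragments are simplified, hypothetical CIF-like text
(no magic code, no comments), and split_tokens is an illustrative
helper, not part of any CIF library.  It shows only that byte-level
concatenation leaves an embedded U+FEFF in the result, and that
treating that character as whitespace recovers the expected tokens.

```python
import codecs

# Two hypothetical, simplified CIF-like fragments, each saved by an
# editor that prepends a UTF-8 BOM (bytes EF BB BF).
file_a = codecs.BOM_UTF8 + "data_a\n_cell.length_a 5.0\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "data_b\n_cell.length_a 6.0\n".encode("utf-8")

# Byte-level concatenation, as "cat a.cif b.cif" would perform it.
# The plain "utf-8" codec keeps BOMs as U+FEFF characters.
combined = (file_a + file_b).decode("utf-8")


def split_tokens(text, extra_whitespace="\ufeff"):
    """Split on whitespace, additionally treating U+FEFF as whitespace.

    Illustrative only: a real CIF2 lexer must also handle quoted
    strings, text fields, and comments.
    """
    for ch in extra_whitespace:
        text = text.replace(ch, " ")
    return text.split()


naive_tokens = combined.split()        # U+FEFF stays glued to tokens
tolerant_tokens = split_tokens(combined)
```

With the naive split, the first token of each original file carries an
invisible U+FEFF prefix; with the tolerant split, both data block
headers come out clean.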

On the other hand, I again feel compelled to distinguish program
behaviors from the CIF2 format specification.  None of the above would
prevent a CIF processor from recognizing and handling CIF-like
character streams encoded via schemes other than UTF-8, nor from
recognizing embedded U+FEFF code sequences in various encodings as
encoding switches, thereby handling mixed-encoding files.  Indeed,
such a program or library would be invaluable for correcting
encoding-related errors.  That does not, however, mean that such files
must be considered well-formed CIF2, no matter how likely they may (or
may not) be to arise.
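
As a rough sketch of what such a tolerant processor might do as a
first step, the following Python function guesses an encoding from a
leading byte-order mark.  The name sniff_bom and its return convention
are assumptions made for illustration; this is program behavior only,
and says nothing about what counts as well-formed CIF2.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length) for a leading BOM.

    Returns (None, 0) when the bytes do not start with a known BOM.
    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A recovery tool could use the result to decode the input and re-encode
it as UTF-8 before handing it to a strict CIF2 parser.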

James Hester wrote:
> I would be happy to call an embedded BOM a syntax error.

In light of the possibility of U+FEFF appearing in a data value (for
example, from cutting text from a Unicode manuscript and pasting it
into a CIF), I need to refine my earlier blanket alternative of
treating embedded U+FEFF as a syntax error.  I now think it would be
OK to treat U+FEFF as a syntax error *provided* that it appears
outside a delimited string.  That's still not my preference, though,
and I feel confident that Herb will still disagree.
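
The refined alternative can be sketched in Python as follows.  This is
a deliberately simplified illustration: find_stray_feff is a
hypothetical helper that recognizes only single-line '...' and "..."
strings, ignores triple-quoted strings, semicolon-delimited text
fields, and comments, and assumes that a U+FEFF at offset 0 is an
initial BOM rather than an embedded one.

```python
def find_stray_feff(text):
    """Return offsets of U+FEFF that lie outside quoted strings.

    Simplified sketch: tracks only '...' and "..." delimiters and
    skips a leading BOM at offset 0.
    """
    offsets = []
    quote = None  # the currently open delimiter, if any
    for i, ch in enumerate(text):
        if quote is not None:
            if ch == quote:
                quote = None      # closing delimiter
        elif ch in "'\"":
            quote = ch            # opening delimiter
        elif ch == "\ufeff" and i != 0:
            offsets.append(i)     # embedded U+FEFF outside a string
    return offsets


sample = "\ufeffdata_a\n_name 'x\ufeffy'\n\ufeffdata_b\n"
stray = find_stray_feff(sample)
```

In the sample, the U+FEFF pasted into the quoted value is left alone,
while the one introduced by concatenation is reported.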


John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]

Email Disclaimer:  www.stjude.org/emaildisclaimer
