Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Bollinger, John C" <[email protected]>
Date: Tue, 18 May 2010 11:57:35 -0500

Herbert Bernstein wrote:
>Let me see if I understand this correctly -- a user takes 2 perfectly good
>CIF2 files, edits each to clean up, say, some comments to keep straight where
>one begins and one ends, using a well-designed modern text editor that
>happens to put a BOM at the start of each file, concatenates the two files
>with cat to ship them into the IUCr, and suddenly they have a syntax error
>caused by a character that they cannot see!!!
>
>To me this seems pointless when it is trivial for software to recognize the
>character and handle it sensibly.

And that is my principal rationale for preferring that embedded U+FEFF
be recognized as CIF whitespace.  With that approach, the
concatenation of two well-formed CIF2 files is always a well-formed
CIF2 file, regardless of the presence or absence of BOMs in the
original files.  Note, too, that such concatenation cannot produce a
mixed-encoding file because files encoded in UTF-16[BE|LE],
UTF-32[BE|LE], or any other encoding that can be distinguished from
UTF-8 are not well-formed CIF2 files to begin with.  The file concatenation
scenario thus does not provide a use case for the CIF2 *specification*
to recognize embedded U+FEFF as an encoding marker.
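
To make the concatenation scenario concrete, here is a minimal Python
sketch.  It is an illustration only, not part of any CIF2
specification or parser.  The helper names (concat_utf8_files,
embedded_bom_offsets, bom_as_whitespace) are hypothetical, and the
sketch ignores quoting entirely, so it would also rewrite a U+FEFF
that sits inside a data value.

```python
import codecs

BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 as the bytes EF BB BF


def concat_utf8_files(*chunks: bytes) -> str:
    """Mimic `cat a b`: join the raw bytes, then decode as plain UTF-8.

    The "utf-8" codec (unlike "utf-8-sig") keeps every BOM as U+FEFF.
    """
    return b"".join(chunks).decode("utf-8")


def embedded_bom_offsets(text: str) -> list:
    """Return the offsets of every U+FEFF other than a leading one."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


def bom_as_whitespace(text: str) -> str:
    """Hypothetical policy: drop a leading BOM, turn embedded ones into spaces."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, " ")


# Two files, each saved by an editor that prepends a UTF-8 BOM.
file_a = codecs.BOM_UTF8 + b"data_a\n_x 1\n"
file_b = codecs.BOM_UTF8 + b"data_b\n_y 2\n"

joined = concat_utf8_files(file_a, file_b)
cleaned = bom_as_whitespace(joined)
```

Under this policy the second file's BOM becomes an ordinary space
before data_b, so the concatenation still parses as two data blocks.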

On the other hand, I again feel compelled to distinguish program
behaviors from the CIF2 format specification.  None of the above would
prevent a CIF processor from recognizing and handling CIF-like
character streams encoded via schemes other than UTF-8, nor from
recognizing embedded U+FEFF code sequences in various encodings as
encoding switches, thereby handling mixed-encoding files.  Indeed,
such a program or library would be invaluable for correcting
encoding-related errors.  That does not, however, mean that such files
must be considered well-formed CIF2, no matter how likely they may (or
may not) be to arise.
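
As a sketch of the kind of program behavior meant here, the following
Python fragment guesses an encoding from a leading BOM and decodes
accordingly.  The function name sniff_and_decode and the fallback to
UTF-8 are assumptions made for illustration.  It handles only a BOM at
the very start of the input; treating embedded BOMs as encoding
switches would need considerably more logic.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_and_decode(data: bytes):
    """Return (encoding_name, text), using a leading BOM if one is present.

    Without a BOM the input is assumed to be UTF-8 (an assumption of
    this sketch, matching the encoding CIF2 requires).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    return "utf-8", data.decode("utf-8")
```

A tool built this way could accept a UTF-16 "CIF-like" stream and
re-emit it as UTF-8, without the format itself having to call the
input well-formed CIF2.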

James Hester wrote:
> I would be happy to call an embedded BOM a syntax error.

In light of the possibility of U+FEFF appearing in a data value (for
example, from cutting text from a Unicode manuscript and pasting it
into a CIF), I need to refine my earlier blanket alternative of
treating embedded U+FEFF as a syntax error.  I now think it would be
OK to treat U+FEFF as a syntax error *provided* that it appears
outside a delimited string.  That's still not my preference, though,
and I feel confident that Herb will still disagree.
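
The refined alternative could be sketched roughly as follows.  This
is a deliberately simplified Python scanner, not a CIF2 lexer: it
knows only single- and double-quoted strings confined to one line,
treats any quote character seen outside a string as an opening
delimiter, and ignores triple-quoted strings, semicolon-delimited text
fields, and comments.  The name bom_errors_outside_strings is
hypothetical.

```python
BOM = "\ufeff"  # U+FEFF


def bom_errors_outside_strings(text: str) -> list:
    """Return offsets of U+FEFF found outside quoted strings.

    A BOM at offset 0 is treated as a legitimate leading BOM and is
    never reported.
    """
    errors = []
    quote = None  # the active delimiter, or None when outside a string
    for i, ch in enumerate(text):
        if quote is not None:
            # Inside a string: only the matching delimiter (or, in this
            # simplified model, a newline) ends it; U+FEFF is data here.
            if ch == quote or ch == "\n":
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == BOM and i > 0:
            errors.append(i)
    return errors


sample = "\ufeffdata_a\n_name 'caf\ufeffe'\n\ufeffdata_b\n"
offsets = bom_errors_outside_strings(sample)
```

In the sample, the U+FEFF pasted into the quoted value is left alone,
while the one introduced by concatenation (before data_b) is reported.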

Regards,

John
--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
[email protected]
(901) 595-3166 [office]
www.stjude.org

Email Disclaimer:  www.stjude.org/emaildisclaimer

_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group
