
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Tue, 18 May 2010 13:46:49 -0400 (EDT)
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337D9@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local>

Allow me to clarify my position, so there is no misunderstanding:

I believe that we will be dealing with a world with at least UTF-8
and UCS-2/UTF-16 encodings for many years to come.  I have no
objection to CIF2 being specified solely in terms of UTF-8 for
simplicity and consistency, but if we are to write software that
people can use, we must have a reasonable position with respect
to the encodings people use, and that means that, at the very
least, we need to accept and process UTF-8 BOMs as harmless
additional text.  Some of us will also be supporting UCS-2/UTF-16
directly in our applications.  I don't mind if other applications
support only UTF-8, but as long as we have Java and web browsers we
are going to encounter UCS-2/UTF-16, so we should do something
sensible when a UCS-2/UTF-16 BOM pops up: either do the translation
internally or, if a particular application does not wish to handle
UCS-2/UTF-16, issue a polite warning suggesting the use of an
external translator.

BOMs will almost always appear in modern UCS-2/UTF-16 files, and when
those files are converted to UTF-8 that will give us yet another
source of UTF-8 BOMs.  I believe the sensible thing to do is to
recognize BOMs.
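
As a rough illustration of the behavior described above, here is a
minimal Python sketch, not a definitive implementation.  The names
sniff_bom and read_cif_text and the translate flag are hypothetical
and come from no existing CIF library.  It assumes the whole file is
available as bytes, looks only at a BOM at the very start of the
data, skips a leading UTF-8 BOM as harmless, and for a UTF-16 or
UTF-32 BOM either translates internally or declines with a polite
message.  BOMs embedded later in the data are not handled here.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 entries are checked first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length) for a leading BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_cif_text(data: bytes, translate: bool = True) -> str:
    """Decode CIF-like bytes, tolerating a leading BOM.

    No BOM or a UTF-8 BOM: decode as UTF-8 and drop the BOM.
    Any other BOM: decode internally if translate is True, otherwise
    raise ValueError with a message pointing at an external converter.
    """
    encoding, skip = sniff_bom(data)
    if encoding is None or encoding == "utf-8":
        return data[skip:].decode("utf-8")
    if translate:
        return data[skip:].decode(encoding)
    raise ValueError(
        f"Input appears to be encoded as {encoding}; please convert it "
        "to UTF-8 with an external translator and try again."
    )
```

An application that supports only UTF-8 would call
read_cif_text(data, translate=False) and show the message to the
user; one that supports UCS-2/UTF-16 directly would keep the default.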

Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 18 May 2010, Bollinger, John C wrote:

> Herbert Bernstein wrote:
>> Let me see if I understand this correctly -- a user takes 2 perfectly good
>> CIF2 files, edits each to clean up, say, some comments to keep straight where
>> one begins and one ends, using a well-designed modern text editor that
>> happens to put a BOM at the start of each file, concatenates the two files
>> with cat to ship them into the IUCr, and suddenly they have a syntax error
>> caused by a character that they cannot see!!!
>>
>> To me this seems pointless when it is trivial for software to recognize the
>> character and handle it sensibly.
>
> And that is my principal rationale for preferring that embedded U+FEFF be recognized as CIF whitespace.  With that approach, the concatenation of two well-formed CIF2 files is always a well-formed CIF2 file, regardless of the presence or absence of BOMs in the original files.  Note, too, that such concatenation cannot produce a mixed-encoding file because files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other encoding that can be distinguished from UTF-8 are not well-formed CIF2 files to start.  The file concatenation scenario thus does not provide a use case for the CIF2 *specification* to recognize embedded U+FEFF as an encoding marker.
>
> On the other hand, I again feel compelled to distinguish program behaviors from the CIF2 format specification.  None of the above would prevent a CIF processor from recognizing and handling CIF-like character streams encoded via schemes other than UTF-8, nor from recognizing embedded U+FEFF code sequences in various encodings as encoding switches, thereby handling mixed-encoding files.  Indeed, such a program or library would be invaluable for correcting encoding-related errors.  That does not, however, mean that such files must be considered well-formed CIF2, no matter how likely they may (or may not) be to arise.
>
>
> James Hester wrote:
>> I would be happy to call an embedded BOM a syntax error.
>
> In light of the possibility of U+FEFF appearing in a data value (for example, from cutting text from a Unicode manuscript and pasting it into a CIF), I need to refine my earlier blanket alternative of treating embedded U+FEFF as a syntax error.  I now think it would be ok to treat U+FEFF as a syntax error *provided* that it appears outside a delimited string.  That's still not my preference, though, and I feel confident that Herb will still disagree.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Computing and X-Ray Scientist
> Department of Structural Biology
> St. Jude Children's Research Hospital
> John.Bollinger@StJude.org
> (901) 595-3166 [office]
> www.stjude.org
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
