
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Herbert J. Bernstein" <[email protected]>
Date: Mon, 10 May 2010 13:21:05 -0400 (EDT)
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local>

Dear Colleagues,

   Inasmuch as we have adopted Unicode, we really should conform to
the Unicode conventions.  It is fine for UTF-8 to be our default when
there is no BOM, but if there is a BOM we should process it.  The
minimum to do with any BOM is:

   1.  Accept it at any point in a character stream.
   2.  Check it against the BOMs for the encodings that we are able
to process on that system (the minimum would be the UTF-8 BOM).
   3.  If the BOM conforms to an encoding that that particular
system can accept, continue processing in the encoding
selected.
   4.  If the BOM does not conform to an encoding that that particular
system can accept, declare an error and stop or issue a warning and
try to continue in UTF-8.  Declaring an error is safest.  Trying to muddle
through may be necessary.
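
As a rough illustration of steps 2 to 4, here is a minimal Python sketch.
It only looks at a BOM at the very start of the byte stream, so it does
not cover step 1 (a BOM at any point).  The names detect_bom, decode_cif,
UnsupportedEncodingError and the "supported" and "strict" parameters are
hypothetical, made up for this example; only the codecs.BOM_* constants
come from the Python standard library.

```python
import codecs
import warnings

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# same two bytes as the UTF-16 LE BOM (FF FE).
KNOWN_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


class UnsupportedEncodingError(ValueError):
    """Hypothetical error: BOM names an encoding this system cannot read."""


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no leading BOM."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_cif(data, supported=("utf-8",), strict=True):
    """Decode CIF bytes, honouring a leading BOM (assumed policy, see text)."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        # No BOM: UTF-8 is the default.
        return data.decode("utf-8")
    if encoding in supported:
        # Step 3: continue in the encoding the BOM selects.
        return data[skip:].decode(encoding)
    if strict:
        # Step 4, safest choice: declare an error and stop.
        raise UnsupportedEncodingError("unsupported encoding: " + encoding)
    # Step 4, muddling through: warn and try to continue in UTF-8.
    warnings.warn("BOM indicates " + encoding + "; trying UTF-8 anyway")
    return data[skip:].decode("utf-8", errors="replace")
```

A system that also handles UTF-16 would pass, for example,
supported=("utf-8", "utf-16-le", "utf-16-be").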

Rejecting a valid UTF-8 CIF simply because it went through a modern
editor and gained a UTF-8 BOM does not seem reasonable.

On writing, the approach should be that a CIF writer can do one of the
following:

   1.  Write a stream with no BOMs, in which case the intended
encoding is UTF-8; or
   2.  Write a stream starting with a UTF-8 BOM, in which case the
intended encoding is UTF-8; or
   3.  Write a stream starting with the BOM for some other encoding,
in which case the intended encoding is something other than UTF-8
and the file should not be identified as a standard UTF-8 CIF,
but as something else.
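
The three writing options could look something like the following Python
sketch.  The function name encode_cif and its parameters are hypothetical,
and the choice of UTF-16 as the only "other" encoding is an assumption
made to keep the example short.

```python
import codecs

# Assumed set of non-UTF-8 encodings this hypothetical writer supports.
OTHER_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def encode_cif(text, encoding="utf-8", write_bom=False):
    """Encode CIF text following one of the three writing options."""
    if encoding == "utf-8":
        payload = text.encode("utf-8")
        # Option 1 (default): no BOM.  Option 2: leading UTF-8 BOM.
        return codecs.BOM_UTF8 + payload if write_bom else payload
    if encoding not in OTHER_BOMS:
        raise ValueError("no BOM known for encoding: " + encoding)
    # Option 3: another encoding, always announced by its BOM.  The
    # "-le"/"-be" codecs do not emit a BOM themselves, so add it here.
    return OTHER_BOMS[encoding] + text.encode(encoding)
```

With these defaults, encode_cif(text) produces a plain UTF-8 stream, and
a BOM is written only when asked for or when the encoding is not UTF-8.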

I, for one, intend to read and write both UTF-8 and UTF-16, which
covers most modern Unicode uses, but I have no objection to
UTF-8 being the CIF standard for normal file interchange.  It is
simply a practical reality that big- and little-endian UCS-2 and
UTF-16 are widely used, and need at least some CIF support.
In order to conform to the current spec, I'll make the writing
of BOMs a non-default option for a UTF-8 file, but I agree
with John Bollinger that we should do something sensible with
files that come with a BOM.

Regards,
   Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Mon, 10 May 2010, Bollinger, John C wrote:

> I realize that earlier there was an extended discussion on this group 
> about identification and / or declaration of character encodings, 
> including the topic of using a byte-order mark to identify some 
> encodings.  Rest assured that I do not wish to reopen that discussion. 
> I do, however, want to raise a related question: whether it is 
> acceptable for a CIF2 processor to accept and ignore a UTF-8 BOM 
> sequence (bytes 0xEF 0xBB 0xBF, the UTF-8 encoding of character U+FEFF) 
> at the beginning of a CIF.
>
> Some text editors that support UTF-8 are known to ensure that 
> UTF-8-encoded files they write start with this sequence.  Inasmuch as it 
> seems a goal of this group to continue to support users editing CIFs 
> with general-purpose text editors, it therefore seems wise to me that an 
> initial BOM sequence be considered ignorable metadata in CIF2.  The 
> alternative is for it to be an error, with the confusing result that 
> editing some CIF2-compliant CIFs with some programs will corrupt the 
> resulting file, whereas *either* using a different text editor or 
> editing a different CIF (for example, one that contains no non-ASCII 
> characters) works fine.
>
> This suggested behavior would not require a CIF2 lexical scanner to 
> decode the BOM byte sequence to the corresponding character.  A scanner 
> operating directly on the raw byte stream can recognize and handle the 
> literal byte sequence almost as easily as one operating on the 
> corresponding decoded character stream could recognize and handle the 
> decoded character.
>
> Best Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>