
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Tue, 11 May 2010 22:07:04 -0400 (EDT)
In-Reply-To: <AANLkTinygegHONBx7eGMssx3KsenrdNx-4OTR73USneS@mail.gmail.com>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com><AANLkTinygegHONBx7eGMssx3KsenrdNx-4OTR73USneS@mail.gmail.com>

Dear Colleagues,

   It is good that we are moving forward enough to accept a UTF-8 BOM
at the front of a UTF-8 file.  To avoid confusion about what we are
agreeing to, I would suggest carefully reading:

   http://unicode.org/faq/utf_bom.html

Compliance is not just a matter of dealing with Notepad, but also BBEdit
in terms of writing, and an increasing number of editors in terms
of reading that assume Latin1 if there is no BOM, making it easy
to mess up text fields with things that look right for the user,
but which will be read as very different characters in a journal
editorial office.
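
   As a rough illustration of that failure mode, here is a minimal
Python sketch.  The text value is hypothetical and stands in for any
non-ASCII content of a text field; the only assumption is that one side
treats BOM-less bytes as Latin1 while the other treats them as UTF-8.

```python
# Hypothetical text-field value: "Jos" followed by e-acute (U+00E9).
text = "Jos\u00e9"

# What a UTF-8 writer puts on disk when it writes no BOM.
utf8_bytes = text.encode("utf-8")

# What a reader sees if it assumes Latin1 for BOM-less input:
# the two-byte UTF-8 sequence becomes two separate characters.
misread = utf8_bytes.decode("latin-1")

# What a reader sees if it assumes UTF-8 instead.
correct = utf8_bytes.decode("utf-8")

print(repr(text), repr(misread), repr(correct))
```

The same bytes therefore yield different characters depending on the
assumed encoding, which is the mismatch a leading BOM helps to avoid.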

   The prudent thing would be to always write the BOM and to accept
files with or without it.
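
   A minimal sketch of that policy in Python, for illustration only:
the function names write_with_bom and read_with_or_without_bom are
hypothetical, and the sketch handles only the UTF-8 BOM at the very
start of the data, not BOMs for other encodings.

```python
import codecs


def write_with_bom(text: str) -> bytes:
    """Encode text as UTF-8, always prepending the UTF-8 BOM (EF BB BF)."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


def read_with_or_without_bom(data: bytes) -> str:
    """Decode UTF-8 data, skipping one leading UTF-8 BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


# Hypothetical file content beginning with the CIF2 magic string.
content = "#\\#CIF2.0\ndata_example\n"

with_bom = write_with_bom(content)
without_bom = content.encode("utf-8")

# Both forms decode to the same text.
print(read_with_or_without_bom(with_bom) == read_with_or_without_bom(without_bom))
```

Working on the raw bytes keeps the check independent of whatever
decoder is used afterwards.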

   Regards,
     Herbert


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 12 May 2010, James Hester wrote:

> I agree with John B that we can allow 0xEF 0xBB 0xBF '#' '\' '#' 'C' 'I' 'F' '2' '.' '0' in addition to
> '#\#CIF2.0' as alternative acceptable 'magic numbers' at the beginning of CIF2.0 files. If I understand the
> situation correctly, we are forced to do this only because Windows Notepad will prepend the BOM characters to
> any file with UTF8 encoding.
> 
> Other possible uses of the BOM were discussed at length previously and I remain unconvinced of the need to
> include those uses in the syntax standard, for the reasons given in that previous discussion.
> 
> On Tue, May 11, 2010 at 3:21 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>       Dear Colleagues,
>
>         Inasmuch as we have adopted Unicode we really should conform to
>       the Unicode conventions.  It is fine for UTF-8 to be our default when
>       there is no BOM, but if there is a BOM we should process it.  The
>       minimum to do with any BOM is:
>
>         1.  Accept it at any point in a character stream.
>         2.  Check it against the BOMs for the codes that we are able
>       to process on that system (minimum would be the UTF-8 BOM).
>         3.  If the BOM conforms to an encoding that that particular
>       system can accept, continue processing in the encoding
>       selected.
>         4.  If the BOM does not conform to an encoding that that particular
>       system can accept, declare an error and stop or issue a warning and
>       try to continue in UTF-8.  Declaring an error is safest.  Trying to muddle
>       through may be necessary.
>
>       Rejecting a valid UTF-8 CIF simply because it went through a modern
>       editor and gained a UTF-8 BOM does not seem reasonable.
>
>       On writing, the approach should be that a CIF writer can do one of the
>       following:
>
>         1.  Write a stream with no BOMs, in which case the intended
>       encoding is UTF-8; or
>         2.  Write a stream starting with a UTF-8 BOM, in which case the
>       intended encoding is UTF-8; or
>         3.  Write a stream starting with the BOM for some other encoding,
>       in which case the intended encoding is something other than UTF-8
>       and the file should not be identified as a standard UTF-8 CIF,
>       but as something else.
>
>       I, for one, intend to read and write both UTF-8 and UTF-16, which
>       covers most modern Unicode uses, but I have no objection to
>       UTF-8 being the CIF standard for normal file interchange.  It is
>       simply a practical reality that big- and little-endian UCS-2 and
>       UTF-16 are widely used, and need at least some CIF support.
>       In order to conform to the current spec, I'll make the writing
>       of BOMs a non-default option for a UTF-8 file, but I agree
>       with John Bollinger that we should do something sensible with
>       files that come with a BOM.
>
>       Regards,
>         Herbert
>       =====================================================
>         Herbert J. Bernstein, Professor of Computer Science
>           Dowling College, Kramer Science Center, KSC 121
>                Idle Hour Blvd, Oakdale, NY, 11769
>
>                         +1-631-244-3035
>                         yaya@dowling.edu
>       =====================================================
> 
> On Mon, 10 May 2010, Bollinger, John C wrote:
> 
> > I realize that earlier there was an extended discussion on this group
> > about identification and / or declaration of character encodings,
> > including the topic of using a byte-order mark to identify some
> > encodings.  Rest assured that I do not wish to reopen that discussion.
> > I do, however, want to raise a related question: whether it is
> > acceptable for a CIF2 processor to accept and ignore a UTF-8 BOM
> > sequence (bytes 0xEF 0xBB 0xBF, the UTF-8 encoding of character U+FEFF)
> > at the beginning of a CIF.
> >
> > Some text editors that support UTF-8 are known to ensure that
> > UTF-8-encoded files they write start with this sequence.  Inasmuch as it
> > seems a goal of this group to continue to support users editing CIFs
> > with general-purpose text editors, it therefore seems wise to me that an
> > initial BOM sequence be considered ignorable metadata in CIF2.  The
> > alternative is for it to be an error, with the confusing result that
> > editing some CIF2-compliant CIFs with some programs will corrupt the
> > resulting file, whereas *either* using a different text editor or
> > editing a different CIF (for example, one that contains no non-ASCII
> > characters) works fine.
> >
> > This suggested behavior would not require a CIF2 lexical scanner to
> > decode the BOM byte sequence to the corresponding character.  A scanner
> > operating directly on the raw byte stream can recognize and handle the
> > literal byte sequence almost as easily as one operating on the
> > corresponding decoded character stream could recognize and handle the
> > decoded character.
> >
> > Best Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> >
> >
> >
> > Email Disclaimer:  www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>
