Re: [ddlm-group] UTF-8 BOM

I agree with John B that we can allow 0xEF 0xBB 0xBF '#' '\' '#' 'C' 'I' 'F' '2' '.' '0' in addition to '#\#CIF2.0' as an alternative acceptable 'magic number' at the beginning of CIF2.0 files. If I understand the situation correctly, we are forced to do this only because Windows Notepad will prepend the BOM bytes to any file it saves with UTF-8 encoding.
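
To make the two acceptable forms concrete, here is a minimal Python sketch of the check on the raw byte stream. The names UTF8_BOM, MAGIC and has_cif2_magic are illustrative only and are not taken from any existing CIF library; the sketch looks only at the leading bytes and does not validate anything that follows the magic number.

```python
UTF8_BOM = b"\xef\xbb\xbf"   # UTF-8 encoding of U+FEFF
MAGIC = b"#\\#CIF2.0"        # the bytes of '#\#CIF2.0'


def has_cif2_magic(data: bytes) -> bool:
    """Return True if data starts with the CIF2.0 magic number,
    optionally preceded by a single UTF-8 BOM (hypothetical helper)."""
    if data.startswith(UTF8_BOM):
        # Treat a leading BOM as ignorable and look at what follows it.
        data = data[len(UTF8_BOM):]
    return data.startswith(MAGIC)
```

Under this sketch both b"#\\#CIF2.0\n" and the same bytes prefixed with 0xEF 0xBB 0xBF are accepted, while a file starting with any other BOM is not.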

Other possible uses of the BOM were discussed at length previously and I remain unconvinced of the need to include those uses in the syntax standard, for the reasons given in that previous discussion.

On Tue, May 11, 2010 at 3:21 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear Colleagues,

  Inasmuch as we have adopted Unicode we really should conform to
the Unicode conventions.  It is fine for UTF-8 to be our default when
there is no BOM, but if there is a BOM we should process it.  The
minimum to do with any BOM is:

  1.  Accept it at any point in a character stream.
  2.  Check it against the BOMs for the codes that we are able
to process on that system (minimum would be utf-8 bom).
  3.  If the BOM conforms to an encoding that that particular
system can accept, continue processing in the encoding
selected.
  4.  If the BOM does not conform to an encoding that that particular
system can accept, declare an error and stop or issue a warning and
try to continue in utf-8.  Declaring an error is safest.  Trying to muddle
through may be necessary.
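
As a rough illustration of steps 2 to 4, the following Python sketch inspects the start of a byte stream and picks an encoding. It is a sketch under stated assumptions, not a reference implementation: the function select_encoding, the exception UnsupportedEncodingError and the choice of which encodings count as "supported" are all hypothetical, and it only handles a BOM at the start of the stream (step 1, a BOM at an arbitrary point, is not covered). The BOM constants come from Python's standard codecs module.

```python
import codecs
import warnings
from typing import Tuple

# Encodings this hypothetical reader can process (assumption for illustration).
SUPPORTED_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]
# BOMs we recognise but cannot process. They are checked first because the
# UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
UNSUPPORTED_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
]


class UnsupportedEncodingError(ValueError):
    """Raised when a BOM names an encoding this reader cannot handle."""


def select_encoding(data: bytes, strict: bool = True) -> Tuple[str, int]:
    """Return (encoding, number of BOM bytes to skip) for a byte stream."""
    for bom, name in UNSUPPORTED_BOMS:
        if data.startswith(bom):
            if strict:
                # Safest option: declare an error and stop.
                raise UnsupportedEncodingError("cannot process " + name)
            # Muddle through: warn and try to continue in UTF-8.
            warnings.warn("unsupported BOM for " + name + "; trying utf-8")
            return "utf-8", 0
    for bom, name in SUPPORTED_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM at all: UTF-8 is the default.
    return "utf-8", 0
```

A caller would then decode data[skip:] with the returned encoding. Note that a UTF-16-LE stream whose first character after the BOM is U+0000 would be mistaken for UTF-32-LE by this ordering; that case should not arise for a CIF, but it is a limitation of the sketch.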

Rejecting a valid UTF-8 CIF simply because it went through a modern
editor and gained a UTF-8 BOM does not seem reasonable.

On writing, the approach should be that a CIF writer can do one of the
following:

  1.  Write a stream with no BOMs, in which case the intended
encoding is UTF-8; or
  2.  Write a stream starting with a UTF-8 BOM, in which case the
intended encoding is UTF-8; or
  3.  Write a stream starting with the BOM for some other encoding,
in which case the intended encoding is something other than UTF-8
and the file should not be identified as a standard UTF-8 CIF,
but as something else.
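
The three writing options might look like this in Python. This is only a sketch: encode_cif and its parameters are hypothetical names, the set of non-UTF-8 encodings offered is an assumption, and a file produced under option 3 would not be a standard UTF-8 CIF.

```python
import codecs

# BOMs for the non-UTF-8 encodings this hypothetical writer offers.
_OTHER_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def encode_cif(text: str, encoding: str = "utf-8", write_bom: bool = False) -> bytes:
    """Encode CIF text following one of the three writing options."""
    if encoding == "utf-8":
        # Option 1 (default): no BOM. Option 2: leading UTF-8 BOM on request.
        prefix = codecs.BOM_UTF8 if write_bom else b""
        return prefix + text.encode("utf-8")
    if encoding in _OTHER_BOMS:
        # Option 3: another encoding, always announced by its BOM.
        return _OTHER_BOMS[encoding] + text.encode(encoding)
    raise ValueError("unsupported output encoding: " + encoding)
```

Writing the BOM is off by default for UTF-8 here, and always on for the other encodings, since without it a reader would assume UTF-8.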

I, for one, intend to read and write both UTF-8 and UTF-16, which
covers most modern Unicode uses, but I have no objection to
UTF-8 being the CIF standard for normal file interchange.  It is
simply a practical reality that big- and little-endian UCS-2 and
UTF-16 are widely used, and need at least some CIF support.
In order to conform to the current spec, I'll make the writing
of BOMs a non-default option for a UTF-8 file, but I agree
with John Bollinger that we should do something sensible with
files that come with a BOM.

Regards,
  Herbert
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Mon, 10 May 2010, Bollinger, John C wrote:

> I realize that earlier there was an extended discussion on this group
> about identification and / or declaration of character encodings,
> including the topic of using a byte-order mark to identify some
> encodings.  Rest assured that I do not wish to reopen that discussion.
> I do, however, want to raise a related question: whether it is
> acceptable for a CIF2 processor to accept and ignore a UTF-8 BOM
> sequence (bytes 0xEF 0xBB 0xBF, the UTF-8 encoding of character U+FEFF)
> at the beginning of a CIF.
>
> Some text editors that support UTF-8 are known to ensure that
> UTF-8-encoded files they write start with this sequence.  Inasmuch as it
> seems a goal of this group to continue to support users editing CIFs
> with general-purpose text editors, it therefore seems wise to me that an
> initial BOM sequence be considered ignorable metadata in CIF2.  The
> alternative is for it to be an error, with the confusing result that
> editing some CIF2-compliant CIFs with some programs will corrupt the
> resulting file, whereas *either* using a different text editor or
> editing a different CIF (for example, one that contains no non-ASCII
> characters) works fine.
>
> This suggested behavior would not require a CIF2 lexical scanner to
> decode the BOM byte sequence to the corresponding character.  A scanner
> operating directly on the raw byte stream can recognize and handle the
> literal byte sequence almost as easily as one operating on the
> corresponding decoded character stream could recognize and handle the
> decoded character.
>
> Best Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
