
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Mon, 14 Jun 2010 11:26:22 -0400 (EDT)
In-Reply-To: <20100614142541.GA356@emerald.iucr.org>
References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><20100614142541.GA356@emerald.iucr.org>

Dear Colleagues,

   With all due respect to those who are uncomfortable with the BOM,
the same FAQ, two lines earlier, says, "Yes, UTF-8 can contain a BOM. 
However, it makes no difference as to the endianness of the byte stream. 
UTF-8 always has the same byte order. An initial BOM is only used as a 
signature -- an indication that an otherwise unmarked text file is in UTF-8. 
Note that some recipients of UTF-8 encoded data do not expect a BOM."
It then follows with the quote in Brian's message.

   The point is that UTF-8 allows for BOMs, but that there are applications
that are not UTF-8 aware, and those applications get tripped up by
BOMs because they are not UTF-8 aware.  As the same FAQ explains in
the prior question, "a BOM can be used as a signature no matter how the 
Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc."
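
   A minimal sketch of what "tripped up" can look like in practice,
using Python purely for illustration (the sample text and variable names
are assumptions, and the magic string is copied from Brian's message as
written).  A decoder that is not told to expect the signature passes
U+FEFF through as the first character, so a check for a leading "#" fails:

```python
import codecs

# Hypothetical sample content; magic string as it appears in Brian's message.
text = "#\\#CIF2.0\ndata_example\n"

# Prefix the UTF-8 signature, as some text editors do on save.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# The UTF-8 signature is U+FEFF encoded as three bytes.
print(with_bom[:3].hex())          # efbbbf

# Plain "utf-8" decoding keeps the signature as a leading U+FEFF character.
plain = with_bom.decode("utf-8")
print(plain[0] == "\ufeff")        # True
print(plain.startswith("#"))       # False

# "utf-8-sig" drops one leading signature if it is present.
aware = with_bom.decode("utf-8-sig")
print(aware.startswith("#"))       # True
```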

   We cannot hold back the tide.  It is perfectly reasonable to discourage
the writing of UTF-8 BOMs from pure CIF applications that only write
UTF-8 files, but we do nobody a service by requiring it to be an error
to encounter a BOM on reading.

   I would suggest the following protocol as a pragmatic compromise:

   1.  A fully compliant CIF writer should not prefix a UTF-8 CIF with
a BOM, nor embed a BOM within a UTF-8 CIF, and should write only
valid UTF-8 characters within a UTF-8 CIF.  The .cif file extension
should not be used for files that do not conform to this restriction.
   2.  A fully compliant CIF reader need not recognize an initial BOM and 
may leave such chores to external editors and filters, and a fully 
compliant CIF reader does not need to recognize character encodings
other than UTF-8.  However, it is not a violation of the CIF standard for 
CIF readers to accept and process BOMs, nor is it a violation of the
CIF standard for a CIF reader to accept character encodings other than
UTF-8, nor is it required that any error or warning messages be
issued by CIF readers that process encodings other than UTF-8, provided
that the information in the CIF is a valid representation of Unicode
characters and the file does not use the ".cif" file extension.  The .cbf 
file extension or a file extension other than .cif should be used 
for files that use BOMs or encodings other than UTF-8.
   3.  A fully compliant CIF reader need not process non-".cif" files.

That way those who wish to can be fully compliant without dealing
with BOMs, and those of us who have to deal with a wider range of 
encodings will not have to issue pointless warnings and error messages,
while a ".cif" file will still conform to Brian's protocol.
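
   The writer and reader rules above could be sketched roughly as
follows.  This is only an illustration under stated assumptions: the
function names write_cif_bytes and read_cif_bytes and the accept_bom
flag are hypothetical, not part of any CIF library, and the sketch
handles UTF-8 only, leaving the other encodings of point 2 aside.

```python
import codecs

BOM = "\ufeff"

def write_cif_bytes(text: str) -> bytes:
    """Point 1: no leading or embedded BOM, and only valid UTF-8 out."""
    if BOM in text:
        raise ValueError("U+FEFF must not appear in a .cif file")
    # Strict encoding raises UnicodeEncodeError on unencodable code points.
    return text.encode("utf-8")

def read_cif_bytes(data: bytes, accept_bom: bool = False) -> str:
    """Point 2: a reader may, but need not, accept an initial BOM."""
    if data.startswith(codecs.BOM_UTF8):
        if not accept_bom:
            raise ValueError("initial UTF-8 BOM not accepted by this reader")
        data = data[len(codecs.BOM_UTF8):]
    # Strict decoding raises UnicodeDecodeError on bytes that are not UTF-8.
    # An embedded U+FEFF is passed through; its handling is left open here.
    return data.decode("utf-8")

raw = write_cif_bytes("data_example\n")
print(read_cif_bytes(raw) == "data_example\n")
print(read_cif_bytes(codecs.BOM_UTF8 + raw, accept_bom=True) == "data_example\n")
```

Both styles of reader are permitted by the proposal: one that refuses
the BOM (accept_bom=False) and one that quietly strips it.  A leading
UTF-16 BOM (bytes FE FF) is rejected by the strict decode step in either
mode, since those bytes are not valid UTF-8.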

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 14 Jun 2010, Brian McMahon wrote:

> I'm coming to this late, I fear, but I would prefer that the spec
> be kept as simple as possible. I note the following comments in
> the Unicode FAQ document referenced by John B
> (http://www.unicode.org/faq/utf_bom.html):
>
>    "Where UTF-8 is used transparently in 8-bit environments, the use
>    of a BOM will interfere with any protocol or file format that expects
>    specific ASCII characters at the beginning, such as the use of "#!"
>    of at the beginning of Unix shell scripts."
>
>    "In the absence of a protocol supporting its use as a BOM and when
>    not at the beginning of a text stream, U+FEFF should normally not
>    occur."
>
> I suggest the CIF specification deprecate the use of U+FEFF so that
> *any* occurrence of it be treated formally as an error. However, a
> note should acknowledge that U+FEFF is permitted according to the
> Unicode standard at the start of a data stream, and that therefore a
> CIF reading application may at its discretion accept U+FEFF followed
> by #\#CIF2.0 as a valid magic number at the start of a file.
>
> The idea is that any fully-conformant CIF writer will never write an
> initial UTF-8 BOM, and so any software designed to handle only fully
> conformant CIFs will not be troubled by it. Of course the world does
> contain CIFs created other than by fully-conformant CIF writers. To
> an extent the community should decide for itself how best to attempt
> to handle deviations from full conformance. It would help, perhaps, if
> those of us writing CIF readers would document specific practices that
> the software takes to accommodate such deviations. Ideally, such
> software should have a verbose logging mode that can be activated
> whenever surprising behaviour in reading CIFs is encountered by
> the user.
>
> Notice that naive concatenation of CIFs will remain a bad idea for
> all sorts of reasons - beyond the purely syntactic issues, one will
> get multiple "data_TOZ" declarations for example. Undoubtedly this
> will continue to happen, but perhaps increasing the number of
> occasions when blindly concatenating files triggers software errors
> will help to raise awareness and/or the use of better software tools.
>
> Regards
> Brian
>
> On Mon, May 24, 2010 at 04:26:40PM +1000, James Hester wrote:
>> To run through the alternatives and some of the arguments so far:
>>
>> (i) treating an embedded BOM as an ordinary character runs against the
>> Unicode recommendations.  If we wish our standard to be respected, I think
>> we should at least respect other standards and the thinking that has gone
>> into them
>>
>> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
>> but means that a non-ASCII character now has syntactic meaning in the CIF.
>> I think this would be completely inconsistent on our part, as an invisible
>> character (when displayed) can actually be used to delimit strings.  This is
>> my least preferred solution, as it goes against the human-readability
>> expected of CIFs
>>
>> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
>> serious problem'.
>>
>> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
>> files are naively concatenated
>>
>> I think the only viable alternatives are to choose (iii) or (iv).
>>
>> So: why exactly is ignoring a BOM a problem?  If the embedded BOM is the
>> leading BOM from a UTF16 file that has been naively concatenated, it will
>> have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is not
>> acceptable UTF8, leading to a decoding error from the UTF8 decoding step.
>> The subsequent bytes will be UTF16, which should cause a decoding failure in
>> any case.   So I deduce that we are simply discussing how to treat a UTF8
>> BOM, which can only find its way into a CIF file by naive concatenation of
>> UTF8-encoded files written by certain programs.
>>
>> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
>> see that it is indicative of any problems beyond misguided choice of text
>> editor.
>>
>> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
>> and treating all other BOMs as syntax errors.  Individual applications may
>> wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP
>> (and translating to the correct character) on the understanding that if this
>> occurs outside a delimited string it will cause a syntax error.
>>
>> James
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
