Re: [ddlm-group] UTF-8 BOM

I'm coming to this late, I fear, but I would prefer that the spec
be kept as simple as possible. I note the following comments in
the Unicode FAQ document referenced by John B
(http://www.unicode.org/faq/utf_bom.html):

    "Where UTF-8 is used transparently in 8-bit environments, the use
    of a BOM will interfere with any protocol or file format that expects
    specific ASCII characters at the beginning, such as the use of "#!"
    at the beginning of Unix shell scripts."

    "In the absence of a protocol supporting its use as a BOM and when
    not at the beginning of a text stream, U+FEFF should normally not
    occur."

I suggest the CIF specification deprecate the use of U+FEFF so that
*any* occurrence of it be treated formally as an error. However, a
note should acknowledge that U+FEFF is permitted according to the
Unicode standard at the start of a data stream, and that therefore a
CIF reading application may at its discretion accept U+FEFF followed
by #\#CIF_2.0 as a valid magic number at the start of a file.
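
As a rough sketch of what such a discretionary check might look like in a
reader (Python; the function name, the "lenient" flag and the return
convention are purely illustrative assumptions, and the magic code is
written here as #\#CIF_2.0):

```python
import codecs

CIF2_MAGIC = b"#\\#CIF_2.0"   # the bytes  #\#CIF_2.0
UTF8_BOM = codecs.BOM_UTF8     # b"\xef\xbb\xbf", i.e. U+FEFF encoded as UTF-8


def check_magic(data: bytes, lenient: bool = True):
    """Inspect the first bytes of a file.

    Returns (is_cif2, had_bom). In strict mode a leading BOM is treated
    as an error, so the file is not recognised as CIF2. In lenient mode
    the BOM is skipped before the magic code is tested.
    """
    had_bom = data.startswith(UTF8_BOM)
    if had_bom:
        if not lenient:
            return (False, True)
        data = data[len(UTF8_BOM):]
    return (data.startswith(CIF2_MAGIC), had_bom)
```

A reader could report the had_bom flag in its verbose log, so that the user
learns the file was accepted despite not being fully conformant.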

The idea is that any fully-conformant CIF writer will never write an
initial UTF-8 BOM, and so any software designed to handle only fully
conformant CIFs will not be troubled by it. Of course the world does
contain CIFs created other than by fully-conformant CIF writers. To
an extent the community should decide for itself how best to attempt
to handle deviations from full conformance. It would help, perhaps, if
those of us writing CIF readers would document specific practices that
the software takes to accommodate such deviations. Ideally, such
software should have a verbose logging mode that can be activated
whenever surprising behaviour in reading CIFs is encountered by
the user.

Notice that naive concatenation of CIFs will remain a bad idea for
all sorts of reasons - beyond the purely syntactic issues, one will
get multiple "data_TOZ" declarations for example. Undoubtedly this
will continue to happen, but perhaps increasing the number of
occasions when blindly concatenating files triggers software errors
will help to raise awareness and/or the use of better software tools.
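
To illustrate the duplicate-declaration problem, here is a minimal Python
sketch that lists data block names occurring more than once in a piece of
text. It is not a CIF parser: it assumes that block headers start at the
beginning of a line, it does not skip text fields or quoted strings, and it
compares names case-insensitively. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches "data_<name>" at the start of a line (a simplifying assumption).
DATA_HEADER = re.compile(r"^data_(\S+)", re.IGNORECASE | re.MULTILINE)


def duplicate_block_names(text: str) -> list:
    """Return the sorted, lower-cased block names that appear more than once."""
    counts = Counter(m.group(1).lower() for m in DATA_HEADER.finditer(text))
    return sorted(name for name, n in counts.items() if n > 1)
```

Run over two files that each declare data_TOZ and have been blindly
concatenated, this reports the name "toz" once.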

Regards
Brian

On Mon, May 24, 2010 at 04:26:40PM +1000, James Hester wrote:
> To run through the alternatives and some of the arguments so far:
> 
> (i) treating an embedded BOM as an ordinary character runs against the
> Unicode recommendations.  If we wish our standard to be respected, I think
> we should at least respect other standards and the thinking that has gone
> into them
> 
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
> but means that a non-ASCII character now has syntactic meaning in the CIF.
> I think this would be completely inconsistent on our part, as an invisible
> character (when displayed) can actually be used to delimit strings.  This is
> my least preferred solution, as it goes against the human-readability
> expected of CIFs
> 
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
> serious problem'.
> 
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
> files are naively concatenated
> 
> I think the only viable alternatives are to choose (iii) or (iv).
> 
> So: why exactly is ignoring a BOM a problem?  If the embedded BOM is the
> leading BOM from a UTF16 file that has been naively concatenated, it will
> have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is not
> acceptable UTF8, leading to a decoding error from the UTF8 decoding step.
> The subsequent bytes will be UTF16, which should cause a decoding failure in
> any case.   So I deduce that we are simply discussing how to treat a UTF8
> BOM, which can only find its way into a CIF file by naive concatenation of
> UTF8-encoded files written by certain programs.
> 
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
> see that it is indicative of any problems beyond misguided choice of text
> editor.
> 
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
> and treating all other BOMs as syntax errors.  Individual applications may
> wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP
> (and translating to the correct character) on the understanding that if this
> occurs outside a delimited string it will cause a syntax error.
> 
> James
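
The decoding argument in the quoted message can be checked with a short
Python sketch. It assumes the reader decodes the whole input strictly as
UTF-8 and then drops any U+FEFF it finds; the function name is
hypothetical, and this is only one way of realising the proposal.

```python
import codecs


def decode_ignoring_utf8_bom(data: bytes) -> str:
    """Decode bytes as strict UTF-8 and remove every U+FEFF.

    The bytes 0xFE and 0xFF never occur in valid UTF-8, so a UTF-16 BOM
    anywhere in the input raises UnicodeDecodeError here. A UTF-8 BOM
    (0xEF 0xBB 0xBF) decodes to U+FEFF and is silently removed.
    """
    text = data.decode("utf-8")
    return text.replace("\ufeff", "")
```

Two UTF-8 files that each begin with a BOM therefore decode cleanly after
concatenation, while an appended UTF-16 file fails at the decoding step
before the CIF syntax is ever examined.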