Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: Brian McMahon <bm@iucr.org>
- Date: Mon, 14 Jun 2010 15:25:41 +0100
- In-Reply-To: <AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
- References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
I'm coming to this late, I fear, but I would prefer that the spec be kept as simple as possible. I note the following comments in the Unicode FAQ document referenced by John B (http://www.unicode.org/faq/utf_bom.html):

"Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts."

"In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur."

I suggest the CIF specification deprecate the use of U+FEFF so that *any* occurrence of it be treated formally as an error. However, a note should acknowledge that U+FEFF is permitted according to the Unicode standard at the start of a data stream, and that therefore a CIF reading application may at its discretion accept U+FEFF followed by #\#CIF2.0 as a valid magic number at the start of a file.

The idea is that any fully-conformant CIF writer will never write an initial UTF-8 BOM, and so any software designed to handle only fully conformant CIFs will not be troubled by it.

Of course the world does contain CIFs created other than by fully-conformant CIF writers. To an extent the community should decide for itself how best to attempt to handle deviations from full conformance. It would help, perhaps, if those of us writing CIF readers would document the specific measures our software takes to accommodate such deviations. Ideally, such software should have a verbose logging mode that can be activated whenever the user encounters surprising behaviour in reading CIFs.

Notice that naive concatenation of CIFs will remain a bad idea for all sorts of reasons - beyond the purely syntactic issues, one will get multiple "data_TOZ" declarations for example.
Undoubtedly this will continue to happen, but perhaps increasing the number of occasions when blindly concatenating files triggers software errors will help to raise awareness and/or the use of better software tools.

Regards
Brian

On Mon, May 24, 2010 at 04:26:40PM +1000, James Hester wrote:
> To run through the alternatives and some of the arguments so far:
>
> (i) treating an embedded BOM as an ordinary character runs against the Unicode recommendations. If we wish our standard to be respected, I think we should at least respect other standards and the thinking that has gone into them
>
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard, but means that a non-ASCII character now has syntactic meaning in the CIF. I think this would be completely inconsistent on our part, as an invisible character (when displayed) can actually be used to delimit strings. This is my least preferred solution, as it goes against the human-readability expected of CIFs
>
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a serious problem'.
>
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2 files are naively concatenated
>
> I think the only viable alternatives are to choose (iii) or (iv).
>
> So: why exactly is ignoring a BOM a problem? If the embedded BOM is the leading BOM from a UTF16 file that has been naively concatenated, it will have bytes 0xFE 0xFF. This byte sequence (and the reverse) is not acceptable UTF8, leading to a decoding error from the UTF8 decoding step. The subsequent bytes will be UTF16, which should cause a decoding failure in any case. So I deduce that we are simply discussing how to treat a UTF8 BOM, which can only find its way into a CIF file by naive concatenation of UTF8-encoded files written by certain programs.
>
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't see that it is indicative of any problems beyond misguided choice of text editor.
>
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream, and treating all other BOMs as syntax errors. Individual applications may wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP (and translating to the correct character) on the understanding that if this occurs outside a delimited string it will cause a syntax error.
>
> James

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
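
The reader behaviour proposed in the message (every U+FEFF is formally an error, but a reader may at its discretion accept one leading U+FEFF directly followed by the magic number) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `decode_cif2` and its `lenient` and `magic` parameters are hypothetical, and the magic string is spelled exactly as in the message, which may not match the spelling in the published CIF2 specification.

```python
import codecs

BOM = "\ufeff"
# Assumption: magic string spelled as in the message above; a released
# specification may spell it differently, so it is left as a parameter.
MAGIC = "#\\#CIF2.0"


def decode_cif2(data: bytes, lenient: bool = False, magic: str = MAGIC) -> str:
    """Decode a CIF2 byte stream, treating any U+FEFF as an error.

    With lenient=True, a single leading U+FEFF that is immediately
    followed by the magic string is accepted and stripped. Every other
    U+FEFF still raises ValueError.
    """
    # Strict UTF-8 decoding rejects the UTF-16 BOM bytes 0xFE and 0xFF,
    # so only a UTF-8 encoded U+FEFF can survive this step.
    text = data.decode("utf-8")  # raises UnicodeDecodeError on bad bytes
    if lenient and text.startswith(BOM + magic):
        text = text[1:]  # drop the tolerated leading BOM
    pos = text.find(BOM)
    if pos != -1:
        raise ValueError(f"U+FEFF found at character offset {pos}")
    return text


# Usage: a file saved with a UTF-8 BOM by a non-conformant writer.
sample = codecs.BOM_UTF8 + b"#\\#CIF2.0\ndata_TOZ\n"
print(decode_cif2(sample, lenient=True).splitlines()[0])
```

In strict mode (the default here) the same input raises `ValueError`, which matches the idea that software written only for fully conformant CIFs need not handle the BOM. A naively concatenated pair of BOM-prefixed files fails in both modes, because the second BOM is no longer at the start of the stream.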