Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Mon, 24 May 2010 15:13:35 -0500
- In-Reply-To: <AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
On Monday, May 24, 2010 1:27 AM, James Hester wrote:

>To run through the alternatives and some of the arguments so far:
>
>(i) treating an embedded BOM as an ordinary character runs against
>the Unicode recommendations. If we wish our standard to be
>respected, I think we should at least respect other standards and
>the thinking that has gone into them
>
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic
>meaning in the CIF. I think this would be completely inconsistent
>on our part, as an invisible character (when displayed) can actually
>be used to delimit strings. This is my least preferred solution, as
>it goes against the human-readability expected of CIFs
>
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off
>to a serious problem'.
>
>(iv) treating embedded BOMs as syntax errors will cause issues when
>CIF2 files are naively concatenated
>
>I think the only viable alternatives are to choose (iii) or (iv).
>
>So: why exactly is ignoring a BOM a problem? If the embedded BOM is
>the leading BOM from a UTF16 file that has been naively concatenated,
>it will have bytes 0xFE 0xFF. This byte sequence (and the reverse) is
>not acceptable UTF8, leading to a decoding error from the UTF8
>decoding step. The subsequent bytes will be UTF16, which should cause
>a decoding failure in any case. So I deduce that we are simply
>discussing how to treat a UTF8 BOM, which can only find its way into a
>CIF file by naive concatenation of UTF8-encoded files written by
>certain programs.

I generally agree with that summary and analysis, though I observe that
a U+FEFF character may intentionally be embedded in a data value to
serve its (deprecated) role as a ZWNBSP. It might arise from
transferring text from an existing manuscript into a CIF, such as an
author may do while preparing a new, CIF-formatted manuscript. I can
come up with other scenarios leading to an embedded U+FEFF that don't
involve directly concatenating files, though so far they all seem
far-fetched.

>If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I
>don't see that it is indicative of any problems beyond misguided choice
>of text editor.

There are cases where ignoring an embedded BOM would change the
syntactic interpretation of the CIF, generally when it is neither
preceded nor followed by whitespace. That might occur, for example,
when naively appending a CIF2 CIF, with BOM and required version
comment, to the end of a CIF with no trailing newline. If the BOM is
ignored then the last token of the first CIF and the first token of the
second are pasted together, which might not result in a syntax error.
Of course, this is a potential problem with concatenating CIFs without
BOMs, too.

There are nastier possible results from silently stripping embedded
U+FEFF, some owing to its legality in data names, and some to a few
other tricks I have brewing in the back of my head. None of them are
likely to occur accidentally, though.

>So I would advocate ignoring (and removing) UTF8-BOMs in the input
>stream, and treating all other BOMs as syntax errors. Individual
>applications may wish to give users the option of interpreting U+FEFF
>as the deprecated ZWNBP (and translating to the correct character) on
>the understanding that if this occurs outside a delimited string it
>will cause a syntax error.

I am not at all comfortable with allowing parsers to strip or
substitute U+FEFF embedded in data values, much less requiring that
they do so: a data protocol should faithfully deliver the data
entrusted to it, or else complain. I don't much like the idea of
stripping or substituting U+FEFF elsewhere, for that matter, but I
could live with that. Requiring U+FEFF to be altered in some contexts
but not in others would present some practical challenges, to be sure,
but not insurmountable ones.

If that is unpalatable, though, and if treating embedded U+FEFF as
whitespace is unacceptable, then we're left with two options: treating
it as an ordinary character, which no one seems to like much, and
treating it as an error, which has had mixed reviews.

>
>James
>
>PS am I the only one who thinks it unlikely that Wordpad users would
>choose to use 'cat' to join file fragments together?

No, you're not. But Wordpad is not the only editor of note, and its
users are not the only people who might end up concatenating CIFs
edited with it. Personally, though, I tend to ascribe sufficient
technical acumen to 'cat' users to understand why there's a potential
problem and to have some idea how to tackle it.

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
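
The two concatenation scenarios discussed in this message can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the CIF fragments are made up, the helper name `decodes_as_utf8` is hypothetical, and splitting a line on whitespace stands in for a real CIF tokenizer, which it is not.

```python
import codecs

BOM = "\ufeff"  # U+FEFF; its UTF-8 encoding is the bytes EF BB BF


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: a UTF-16 file (with its FE FF BOM) appended to a UTF-8 file.
# The fragments below are invented for illustration.
utf8_part = "data_one\n_cell.length_a 5.43\n".encode("utf-8")
utf16_part = codecs.BOM_UTF16_BE + "data_two\n".encode("utf-16-be")
mixed_ok = decodes_as_utf8(utf8_part + utf16_part)

# Case 2: a UTF-8 file with a leading BOM appended to a file that has
# no trailing newline. Plain "utf-8" decoding keeps the embedded BOM
# in the decoded text as the character U+FEFF.
first = "data_one\n_cell.length_a 5.43"
second = BOM + "#\\#CIF_2.0\ndata_two\n"
joined = (first.encode("utf-8") + second.encode("utf-8")).decode("utf-8")

# Policy A: silently strip the embedded BOM, then split the affected
# line on whitespace (a crude stand-in for tokenizing).
stripped_tokens = joined.replace(BOM, "").splitlines()[1].split()

# Policy B: treat the embedded BOM as whitespace instead.
spaced_tokens = joined.replace(BOM, " ").splitlines()[1].split()

print(mixed_ok)
print(stripped_tokens)
print(spaced_tokens)
```

In case 1 the strict decoder rejects the mixed stream, which matches the observation that only a UTF-8 BOM can survive decoding. In case 2, stripping the BOM leaves the numeric value and the version comment of the second file joined into one whitespace-delimited chunk, while treating the BOM as whitespace keeps them apart. Whether a real parser would accept the joined chunk as a single data value depends on the CIF2 grammar, which this sketch does not implement.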