Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Tue, 18 May 2010 11:57:35 -0500
- Accept-Language: en-US
- In-Reply-To: <alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337D9@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com>
Herbert Bernstein wrote:
>Let me see if I understand this correctly -- a user takes 2 perfectly good
>CIF2 files, edits each to clean up, say, some comments to keep straight where
>one begins and one ends, using a well-designed modern text editor that
>happens to put a BOM at the start of each file, concatenates the two files
>with cat to ship them into the IUCr, and suddenly they have a syntax error
>caused by a character that they cannot see!!!
>
>To me this seems pointless when it is trivial for software to recognize the
>character and handle it sensibly.

And that is my principal rationale for preferring that embedded U+FEFF be recognized as CIF whitespace. With that approach, the concatenation of two well-formed CIF2 files is always a well-formed CIF2 file, regardless of the presence or absence of BOMs in the original files.

Note, too, that such concatenation cannot produce a mixed-encoding file because files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other encoding that can be distinguished from UTF-8 are not well-formed CIF2 files to start. The file concatenation scenario thus does not provide a use case for the CIF2 *specification* to recognize embedded U+FEFF as an encoding marker.

On the other hand, I again feel compelled to distinguish program behaviors from the CIF2 format specification. None of the above would prevent a CIF processor from recognizing and handling CIF-like character streams encoded via schemes other than UTF-8, nor from recognizing embedded U+FEFF code sequences in various encodings as encoding switches, thereby handling mixed-encoding files. Indeed, such a program or library would be invaluable for correcting encoding-related errors. That does not, however, mean that such files must be considered well-formed CIF2, no matter how likely they may (or may not) be to arise.

James Hester wrote:
> I would be happy to call an embedded BOM a syntax error.

In light of the possibility of U+FEFF appearing in a data value (for example, from cutting text from a Unicode manuscript and pasting it into a CIF), I need to refine my earlier blanket alternative of treating embedded U+FEFF as a syntax error. I now think it would be ok to treat U+FEFF as a syntax error *provided* that it appears outside a delimited string. That's still not my preference, though, and I feel confident that Herb will still disagree.

Regards,

John
--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
John.Bollinger@StJude.org
(901) 595-3166 [office]
www.stjude.org

Email Disclaimer: www.stjude.org/emaildisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
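
The two treatments of embedded U+FEFF weighed in this message can be sketched roughly in Python. This is only an illustration under loud assumptions: the names `concatenate` and `handle_embedded_bom` and the policy labels are hypothetical, and a "delimited string" is simplified here to text between a matching pair of `'` or `"` characters, which is not the actual CIF2 grammar.

```python
BOM = "\ufeff"

def concatenate(parts):
    """Join UTF-8 byte strings the way `cat` would, then decode.

    The plain "utf-8" codec does not strip BOMs, so the BOM of every
    part survives as U+FEFF in the decoded text.
    """
    return b"".join(parts).decode("utf-8")

def handle_embedded_bom(text, policy="whitespace"):
    """Apply one of the two policies to U+FEFF outside delimited strings.

    policy="whitespace": replace it with a space (the preferred option).
    policy="error":      raise ValueError (the refined alternative).
    U+FEFF inside a delimited string is left untouched in both cases.
    ASSUMPTION: strings are delimited by a matching ' or " with no
    escapes; real CIF2 quoting rules are more involved.
    """
    if text.startswith(BOM):
        text = text[1:]  # a single leading BOM is always acceptable
    out = []
    quote = None  # the currently open delimiter, if any
    for offset, ch in enumerate(text):
        if quote is not None:
            out.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif ch == BOM:
            if policy == "error":
                # offset counts from the start of the BOM-stripped text
                raise ValueError(
                    f"embedded U+FEFF outside a delimited string at offset {offset}"
                )
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

For example, with `a = b"\xef\xbb\xbfdata_a\n_x 1\n"` and `b = b"\xef\xbb\xbfdata_b\n_y 'p\xef\xbb\xbfq'\n"`, calling `handle_embedded_bom(concatenate([a, b]))` turns the BOM between the two data blocks into a space and keeps the one inside the quoted value, whereas `policy="error"` rejects the same input.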
- Follow-Ups:
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- References:
- [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Joe Krahn)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (James Hester)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)