Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Tue, 11 May 2010 11:28:30 -0500
- Accept-Language: en-US
- In-Reply-To: <alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com>
Dear Colleagues,

I think CIF processor behavior such as Herb describes would be outstanding, and I commend Herb for his dedication to providing such capable and robust software. I do disagree about one of his specific points, however:

> The minimum to do with any BOM is:
[...]
> 1. Accept it at any point in a character stream.

It would be both unconventional and programmatically inconvenient to give special significance to U+FEFF anywhere other than at the very beginning of a file. The Unicode Consortium in fact addresses this exact question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6. Although the Consortium's comments do allow for protocol-specific support for accepting U+FEFF as a BOM other than at the beginning of the stream, I see little advantage to adding such a complication to the CIF2 specifications.

This all expands the scope of the topic far beyond what I had intended, however. I think it is perhaps useful to recognize at this point, therefore, that the CIF2 language specification and the behavior of CIF2 processors are separate questions.

This group has already decided that files compliant with CIF 2.0 are encoded in UTF-8, period. I do not want to reopen that debate. On the other hand, that in no way prevents CIF processors from -- as an extension -- recognizing and handling putative CIFs that violate the spec by employing character encodings different from UTF-8. That sort of thing is generally heralded as beneficial for ease of use, and it is consistent with the good design principle of being relaxed about inputs but strict about outputs. (And in that vein I would hope that any CIF 2.0 writer's normal behavior would be to encode in UTF-8.)

My suggestion is slightly different, as I hope this restatement will show: *in light of the fact that spec-compliant CIF2 files are encoded in UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM to be spec-compliant (subject to the compliance of the rest of the contents).
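As a rough illustration of that suggestion, the sketch below drops a UTF-8 BOM only when it is the very first thing in the byte stream and gives U+FEFF no special treatment anywhere else. The function name `decode_cif2_bytes` is hypothetical, and whether an interior U+FEFF should be kept as an ordinary character or rejected is an assumption here, not something the spec discussion has settled.

```python
import codecs


def decode_cif2_bytes(data: bytes) -> str:
    """Decode UTF-8 input, tolerating one BOM at the very start only.

    Hypothetical helper for illustration; not part of any CIF library.
    """
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Strict decoding: anything that is not valid UTF-8 raises
    # UnicodeDecodeError. Any later U+FEFF is passed through unchanged
    # (an assumption; a stricter processor might flag it instead).
    return data.decode("utf-8")
```

Python's built-in `utf-8-sig` codec behaves similarly on decode, so the explicit check is shown here only to make the "start of file only" rule visible.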
Like Herb, I intend that my parsers will accept such CIFs whether they strictly comply with the spec or not, but the question is whether accepting such files should be a compliance requirement or an extension. Either way, I think it will be valuable to document this decision in the spec, if only to draw attention to the issue.

Best Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group