Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Wed, 12 May 2010 18:34:14 -0400
- In-Reply-To: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA54165DF337D9@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com>
In general, CIF does not directly deal with encoding. It should be possible to allow a low-level I/O library to deal with all encoding issues. Therefore, supporting non-standard stream interpretation should be avoided.

It is a good practical idea to allow mid-stream BOMs by interpreting them as character 0xFEFF and allowing that character as whitespace, with a warning. It should not be a required feature, because it is non-standard, and only exists in UTF-8 for backwards-compatibility. Eventually, conforming I/O libraries will interpret them as invalid. Ideally, text concatenation software will become BOM-aware.

Interpreting mid-stream BOMs and allowing mixed encodings is a major hack, and impractical for systems that deal with encoding at the I/O level. It is reasonable to allow it as a non-standard extension, but a parser should always give a warning so that people realize that such files are likely to be broken elsewhere.

In summary:

Standard CIF2 needs to support standard UTF-8 BOMs at the beginning of a file. Anything else should be considered a non-standard extension. For practical reasons, CIF2 parsers should be encouraged but not required to allow mid-stream UTF-8 0xFEFF as whitespace.

Joe

Herbert J. Bernstein wrote:
> Dear Colleagues,
>
> While it is certainly prudent to tell people to either write a pure
> UTF-8 file with no BOM or to prefix it with a BOM, and that is
> how a compliant CIF writer should work, it is not practical to
> insist that CIF readers should reject embedded BOMs. Indeed, the
> URL cited by John does not tell you they are illegal, but that
> you should treat them as a zero width non-breaking space.
>
> The reason we cannot insist on readers demanding that BOMs occur
> at the beginning is that users may concatenate whole CIFs or
> build one CIF out of fragments of text, and this will very likely
> result in embedded BOMs and possibly switches in encodings. If
> we fail to handle the BOMs we are much more likely to garble
> such files. I strongly recommend the approach in my prior
> message -- recognize BOMs at all times.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Tue, 11 May 2010, Bollinger, John C wrote:
>
>> Dear Colleagues,
>>
>> I think CIF processor behavior such as Herb describes would be
>> outstanding, and I commend Herb for his dedication to providing such
>> capable and robust software. I do disagree about one of his specific
>> points, however:
>>
>>> The
>>> minimum to do with any BOM is:
>> [...]
>>
>>> 1. Accept it at any point in a character stream.
>> It would be both unconventional and programmatically inconvenient to
>> give special significance to U+FEFF anywhere other than at the very
>> beginning of a file. The Unicode Consortium in fact addresses this exact
>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>> Although the Unicode Consortium's comments do allow for protocol-specific
>> support for accepting U+FEFF as a BOM other than at the beginning of the
>> stream, I see little advantage to adding such a complication to the CIF2
>> specifications.
>>
>> This all expands the scope of the topic far beyond what I had intended,
>> however. I think it is perhaps useful to recognize at this point,
>> therefore, that the CIF2 language specification and the behavior of CIF2
>> processors are separate questions. This group has already decided that
>> files compliant with CIF 2.0 are encoded in UTF-8, period. I do not want
>> to reopen that debate. On the other hand, that in no way prevents CIF
>> processors from -- as an extension -- recognizing and handling putative
>> CIFs that violate the spec by employing character encodings different
>> from UTF-8. That sort of thing is generally heralded as beneficial for
>> ease of use, and it is consistent with the good design principle of being
>> relaxed about inputs but strict about outputs. (And in that vein I would
>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>> UTF-8.)
>>
>> My suggestion is slightly different, as I hope this restatement will
>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>> to be spec-compliant (subject to the compliance of the rest of the
>> contents). Like Herb, I intend that my parsers will accept such CIFs
>> whether they strictly comply with the spec or not, but the question is
>> whether accepting such files should be a compliance requirement or an
>> extension. Either way, I think it will be valuable to document this
>> decision in the spec, if only to draw attention to the issue.
>>
>>
>> Best Regards,
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
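
For illustration only, the policy summarised at the top of this message (accept a UTF-8 BOM at the start of a file; optionally treat a mid-stream 0xFEFF as whitespace, always with a warning) might be sketched in Python as follows. The function name `decode_cif2` and the `allow_embedded_bom` flag are hypothetical and come from no CIF library or from the CIF2 specification; this is a sketch under those assumptions, not a proposed reference implementation.

```python
import warnings

BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 as the bytes EF BB BF


def decode_cif2(data: bytes, allow_embedded_bom: bool = True) -> str:
    """Decode CIF2 bytes as UTF-8 (hypothetical helper, not a real API).

    A BOM at the very start of the file is accepted and dropped.
    A mid-stream U+FEFF is a non-standard extension: if allowed, it is
    replaced by a space and a warning is issued; otherwise it is an error.
    """
    # Strict UTF-8 decoding; the plain "utf-8" codec keeps a leading BOM
    # as U+FEFF, so it can be handled explicitly below.
    text = data.decode("utf-8")

    # Standard case: a single BOM at the beginning of the file.
    if text.startswith(BOM):
        text = text[1:]

    # Non-standard case: U+FEFF elsewhere, e.g. from concatenated files.
    if BOM in text:
        if not allow_embedded_bom:
            raise ValueError("embedded U+FEFF found after start of file")
        warnings.warn(
            "embedded U+FEFF treated as whitespace; "
            "this file is likely to break in other software"
        )
        text = text.replace(BOM, " ")

    return text
```

Keeping the extension behind a flag reflects the "encouraged but not required" position: a strict reader passes `allow_embedded_bom=False`, while a lenient one still warns the user every time.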