Re: [ddlm-group] [SPAM] ASSP UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Thu, 13 May 2010 12:34:45 -0400
- In-Reply-To: <4BEB2CE6.3060900@niehs.nih.gov>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1005101301340.99142@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA54165DF337D9@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov>
Dear colleagues,

I did some experiments with iconv, a robust text encoding conversion
tool. It will silently skip over an embedded BOM for any UTF encoding.
So, this seems to be a reasonable behavior even though it is
non-standard.

From internet searches, there does not seem to be a consensus on whether
UTF-8 should include a BOM. I suspect that the dislike of UTF-8 BOM
comes from people who only work on systems where UTF-8 is the default.

CIF1 does not allow "extended" ASCII encoding for characters 128-255, so
there is no potential for misinterpreting valid CIF files. Preventing
such misinterpretation is the main purpose of a UTF-8 BOM. So arguments
for or against BOMs in CIF files both seem reasonable.

CIF parsers should tolerate leading and embedded UTF-8 BOMs, as well as
files with no BOM. For writing CIF2, it is probably best to make BOMs
optional, because the parser needs to tolerate both variations on input.

Krahn, Joe (NIH/NIEHS) [C] wrote:
> In general, CIF does not directly deal with encoding. It should be
> possible to allow a low-level I/O library to deal with all encoding
> issues. Therefore, supporting non-standard stream interpretation should
> be avoided.
>
> It is a good practical idea to allow mid-stream BOMs by interpreting
> them as character 0xFEFF, and allow it as whitespace, with a warning. It
> should not be a required feature, because it is non-standard, and only
> exists in UTF-8 for backwards-compatibility. Eventually, conforming I/O
> libraries will interpret them as invalid. Ideally, text concatenation
> software will become BOM-aware.
>
> Interpreting mid-stream BOMs and allowing mixed encodings is a major hack,
> and impractical for systems that deal with encoding at the I/O level. It
> is reasonable to allow it as a non-standard extension, but should
> always give a warning so that people realize that such files are likely
> to be broken elsewhere.
>
> In summary: Standard CIF2 needs to support standard UTF-8 BOMs at the
> beginning of a file.
> Anything else should be considered a non-standard
> extension. For practical reasons, CIF2 parsers should be encouraged but
> not required to allow mid-stream UTF-8 0xFEFF as whitespace.
>
> Joe
>
> Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>> While it is certainly prudent to tell people to either write a pure
>> UTF-8 file with no BOM or to prefix it with a BOM, and that is
>> how a compliant CIF writer should work, it is not practical to
>> insist that CIF readers should reject embedded BOMs. Indeed, the
>> URL cited by John does not tell you they are illegal, but that
>> you should treat them as a zero width non-breaking space.
>>
>> The reason we cannot insist on readers demanding that BOMs occur
>> at the beginning is that users may concatenate whole CIFs or
>> build one CIF out of fragments of text, and this will very likely
>> result in embedded BOMs and possibly switches in encodings. If
>> we fail to handle the BOMs we are much more likely to garble
>> such files. I strongly recommend the approach in my prior
>> message -- recognize BOMs at all times.
>>
>> Regards,
>> Herbert
>>
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 11 May 2010, Bollinger, John C wrote:
>>
>>> Dear Colleagues,
>>>
>>> I think CIF processor behavior such as Herb describes would be
>>> outstanding, and I commend Herb for his dedication to providing such
>>> capable and robust software. I do disagree about one of his specific
>>> points, however:
>>>
>>>> The
>>>> minimum to do with any BOM is:
>>> [...]
>>>
>>>> 1. Accept it at any point in a character stream.
>>> It would be both unconventional and programmatically inconvenient to
>>> give special significance to U+FEFF anywhere other than at the very
>>> beginning of a file. The Unicode Consortium in fact addresses this exact
>>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>>> Although the Unicode Consortium's comments do allow for protocol-specific
>>> support for accepting U+FEFF as a BOM other than at the beginning of the
>>> stream, I see little advantage to adding such a complication to the CIF2
>>> specifications.
>>>
>>> This all expands the scope of the topic far beyond what I had intended,
>>> however. I think it is perhaps useful to recognize at this point,
>>> therefore, that the CIF2 language specification and the behavior of CIF2
>>> processors are separate questions. This group has already decided that
>>> files compliant with CIF 2.0 are encoded in UTF-8, period. I do not want
>>> to reopen that debate. On the other hand, that in no way prevents CIF
>>> processors from -- as an extension -- recognizing and handling putative
>>> CIFs that violate the spec by employing character encodings different
>>> from UTF-8. That sort of thing is generally heralded as beneficial for
>>> ease of use, and it is consistent with the good design principle of being
>>> relaxed about inputs but strict about outputs. (And in that vein I would
>>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>>> UTF-8.)
>>>
>>> My suggestion is slightly different, as I hope this restatement will
>>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>>> to be spec-compliant (subject to the compliance of the rest of the
>>> contents). Like Herb, I intend that my parsers will accept such CIFs
>>> whether they strictly comply with the spec or not, but the question is
>>> whether accepting such files should be a compliance requirement or an
>>> extension.
>>> Either way, I think it will be valuable to document this
>>> decision in the spec, if only to draw attention to the issue.
>>>
>>>
>>> Best Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
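
For illustration only, the tolerant reader behaviour discussed in this thread (accept a leading UTF-8 BOM, accept files with no BOM, and treat an embedded U+FEFF as whitespace with a warning) might be sketched roughly as below. This is a minimal sketch in Python, not taken from any poster's software or from the CIF2 specification; the function name decode_cif_text is hypothetical, and replacing an embedded U+FEFF with a single space is an assumption about what "allow it as whitespace" could mean in practice.

```python
import warnings

BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF


def decode_cif_text(data: bytes) -> str:
    """Decode UTF-8 bytes, tolerating leading and embedded BOMs.

    Hypothetical helper for illustration; not part of any CIF library.
    """
    # The standard "utf-8-sig" codec drops one leading BOM if present
    # and otherwise behaves like plain UTF-8.
    text = data.decode("utf-8-sig")
    if BOM in text:
        # Any U+FEFF left over was embedded mid-stream: non-standard input.
        warnings.warn(
            "embedded U+FEFF found; treating it as whitespace (non-standard)"
        )
        text = text.replace(BOM, " ")
    return text
```

A writer following the same thread would simply emit UTF-8, with or without one leading BOM, and never an embedded one.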