Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Mon, 10 May 2010 13:21:05 -0400 (EDT)
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local>
Dear Colleagues,

Inasmuch as we have adopted Unicode, we really should conform to the
Unicode conventions. It is fine for UTF-8 to be our default when there
is no BOM, but if there is a BOM we should process it. The minimum to
do with any BOM is:

1. Accept it at any point in a character stream.
2. Check it against the BOMs for the encodings that we are able to
   process on that system (the minimum would be the UTF-8 BOM).
3. If the BOM conforms to an encoding that that particular system can
   accept, continue processing in the encoding selected.
4. If the BOM does not conform to an encoding that that particular
   system can accept, either declare an error and stop, or issue a
   warning and try to continue in UTF-8.

Declaring an error is safest. Trying to muddle through may be
necessary. Rejecting a valid UTF-8 CIF simply because it went through
a modern editor and gained a UTF-8 BOM does not seem reasonable.

On writing, the approach should be that a CIF writer can do one of the
following:

1. Write a stream with no BOMs, in which case the intended encoding
   is UTF-8; or
2. Write a stream starting with a UTF-8 BOM, in which case the
   intended encoding is UTF-8; or
3. Write a stream starting with the BOM for some other encoding, in
   which case the intended encoding is something other than UTF-8 and
   the file should not be identified as a standard UTF-8 CIF, but as
   something else.

I, for one, intend to read and write both UTF-8 and UTF-16, which
covers most modern Unicode uses, but I have no objection to UTF-8
being the CIF standard for normal file interchange. It is simply a
practical reality that big- and little-endian UCS-2 and UTF-16 are
widely used, and need at least some CIF support.

In order to conform to the current spec, I'll make the writing of BOMs
a non-default option for a UTF-8 file, but I agree with John Bollinger
that we should do something sensible with files that come with a BOM.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Mon, 10 May 2010, Bollinger, John C wrote:

> I realize that earlier there was an extended discussion on this group
> about identification and / or declaration of character encodings,
> including the topic of using a byte-order mark to identify some
> encodings. Rest assured that I do not wish to reopen that discussion.
> I do, however, want to raise a related question: whether it is
> acceptable for a CIF2 processor to accept and ignore a UTF-8 BOM
> sequence (bytes 0xEF 0xBB 0xBF, the UTF-8 encoding of character U+FEFF)
> at the beginning of a CIF.
>
> Some text editors that support UTF-8 are known to ensure that
> UTF-8-encoded files they write start with this sequence. Inasmuch as it
> seems a goal of this group to continue to support users editing CIFs
> with general-purpose text editors, it therefore seems wise to me that an
> initial BOM sequence be considered ignorable metadata in CIF2. The
> alternative is for it to be an error, with the confusing result that
> editing some CIF2-compliant CIFs with some programs will corrupt the
> resulting file, whereas *either* using a different text editor or
> editing a different CIF (for example, one that contains no non-ASCII
> characters) works fine.
>
> This suggested behavior would not require a CIF2 lexical scanner to
> decode the BOM byte sequence to the corresponding character. A scanner
> operating directly on the raw byte stream can recognize and handle the
> literal byte sequence almost as easily as one operating on the
> corresponding decoded character stream could recognize and handle the
> decoded character.
>
> Best Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
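
As an illustration only, the reading and writing rules proposed in the
reply above could be sketched roughly as follows in Python. This is not
code from any CIF library: the names sniff_bom, decode_cif, encode_cif
and UnsupportedEncodingError are hypothetical, and the sketch assumes
that only a leading BOM is inspected (it does not attempt to handle a
BOM appearing later in the stream) and that the whole file is already
in memory as bytes.

```python
import codecs
import warnings

# Known BOMs, longest first so that the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16 LE BOM (FF FE).
KNOWN_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


class UnsupportedEncodingError(ValueError):
    """Hypothetical error for a BOM naming an encoding we cannot read."""


def sniff_bom(data):
    """Return (encoding, bom_length); ("utf-8", 0) when there is no BOM."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_cif(data, supported=("utf-8",), strict=True):
    """Decode CIF bytes, honouring a leading BOM where possible."""
    encoding, skip = sniff_bom(data)
    if encoding in supported:
        return data[skip:].decode(encoding)
    if strict:
        # The safest choice: declare an error and stop.
        raise UnsupportedEncodingError("unsupported BOM for " + encoding)
    # Muddle through: warn, then try to continue in UTF-8.
    warnings.warn("unsupported BOM for " + encoding + "; trying UTF-8")
    return data[skip:].decode("utf-8", errors="replace")


def encode_cif(text, write_bom=False):
    """Encode as UTF-8; the BOM is a non-default option."""
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if write_bom else body
```

With these assumptions, decode_cif(b"\xef\xbb\xbfdata_x") yields
"data_x", the same result as for the BOM-less bytes, while a UTF-16
file is rejected unless "utf-16-le" or "utf-16-be" is listed in
supported.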
- Follow-Ups:
- Re: [ddlm-group] UTF-8 BOM (James Hester)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- References:
- [ddlm-group] UTF-8 BOM (Bollinger, John C)