Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Tue, 1 Jun 2010 05:26:12 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>
Dear Colleagues,

A UCS-2 message embedded in an email message normally carries a BOM, but
that begs the question -- it is normal practice to switch encodings
mid-stream, and, theory and abstractions aside, we are definitely going
to encounter embedded BOM-based and, for that matter, MIME-based switches
in encodings in the course of processing one stream of information. If
one prefers to call such a multi-mode stream a CBF rather than a CIF, so
be it, but such streams still have to be processed.

Regards,
  Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 [email protected]
=====================================================

On Tue, 1 Jun 2010, James Hester wrote:

> Hi Herbert and others,
>
> As far as I can tell, BOMs have no semantic or parsing significance in
> the context of an email message, which was my point. Encoding is
> switched using MIME headers, as you mention, not using BOMs. So, I
> don't see that either email or web standards offer support for the
> idea of using a BOM to switch encoding. While I appreciate that being
> restricted to UTF-8 places some restrictions on imgCIF, it is
> considerably better than the situation that a lot of email still finds
> itself in, of being restricted to US-ASCII, and imgCBF is still
> available as an alternative.
>
> So I would repeat my suggestion of
>
> (1) ignoring UTF8 BOM where it is likely to be the result of
>     concatenation (approximately, this means amongst whitespace)
> (2) raising a syntax error if the byte sequence could be either BOM
>     or ZWNBSP (approximately, this means inside any
>     dataname/value/datablock name/save frame name)
> (3) any other type of BOM remains a syntax error as it is not UTF8
>
> I will be calling for a vote in a week or so, after giving everyone a
> bit more of a chance to make their voice heard.
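James's three rules, as quoted above, can be read as a small filter that runs before CIF2 parsing proper. The following is only a minimal Python sketch of that reading, not code from the thread: the names `filter_boms` and `CifSyntaxError` are hypothetical, and the test "at least one neighbour is whitespace or a stream boundary" is an assumed approximation of "amongst whitespace" (it lets through the concatenation case, where a BOM follows a newline and precedes the next file's first character).

```python
BOM = "\ufeff"          # U+FEFF; encoded in UTF-8 as EF BB BF
WHITESPACE = " \t\r\n"  # assumed whitespace set for this sketch


class CifSyntaxError(ValueError):
    """Hypothetical error type used only in this sketch."""


def filter_boms(raw: bytes) -> str:
    """Decode raw as strict UTF-8 and apply the three BOM rules."""
    # Rule 3: UTF-16 BOMs (FE FF / FF FE) are never valid UTF-8, so a
    # strict decode already rejects them.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"not valid UTF-8 at byte {exc.start}") from exc

    out = []
    for i, ch in enumerate(text):
        if ch != BOM:
            out.append(ch)
            continue
        prev_ws = i == 0 or text[i - 1] in WHITESPACE
        next_ws = i == len(text) - 1 or text[i + 1] in WHITESPACE
        if prev_ws or next_ws:
            continue  # Rule 1: likely a concatenation artifact, drop it
        # Rule 2: inside a token it could be BOM or ZWNBSP, so refuse
        raise CifSyntaxError(f"U+FEFF inside a token at character {i}")
    return "".join(out)


if __name__ == "__main__":
    joined = b"data_a\n" + b"\xef\xbb\xbf" + b"data_b\n"
    print(repr(filter_boms(joined)))
```

Under this sketch a U+FEFF inside a delimited string that happens to sit next to a space is silently dropped as well; a real parser would have to make that decision with knowledge of the token it is in.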
>
> On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <[email protected]> wrote:
>> The extension we use is cbf, so the extension is not an issue.
>> A cbf might be a true ascii cif, or an imgCIF file with true
>> binary sections with or without compression or a UCS-2 file with
>> or without binutf sections with or without compression.
>>
>> Clearly the cleanest case for binutf is when the entire file
>> starts out as UCS-2 and just continues that way, but because
>> the logic of imgCIF permits any mixture of the various types
>> of binary sections with any type of headers, there is no reason
>> to declare an error because of changes from, say, straight ASCII
>> to UCS-2 and back.
>>
>> The most common place in which you will find a similar disdain for
>> requiring BOMs as the first glyph is in email messages because
>> a modern, multi-part email message is actually a concatenation
>> of multiple files of arbitrary types and encodings. Now you could
>> make the argument that the email message is just a container for
>> those files and that each file carries its BOM at the front of
>> that (sub)file, and you would be right, but that is exactly how imgCIF
>> ends up in the same situation -- it is a container for multiple
>> headers and binary images and each binary image may be in a
>> different encoding (with different compression as well). This
>> flexibility is not an accident -- it was a major intentional change
>> in imgCIF in 1997 from Andy Hammersley's original model of one ASCII
>> header and one binary image to a more CIF-like, order independent,
>> approach of allowing an arbitrary mixture of multiple headers and
>> multiple binary images.
>>
>> From a programming point of view, once you live in a world of
>> multiple encodings, recognizing a BOM at the start of a file is
>> no different from recognizing it anywhere in a file.
>>
>> In addition to email, another place in which changes of encoding
>> occur, albeit with a meta tag or Content-Type header rather than
>> with a BOM, is in web pages: for a page being displayed from frames,
>> a browser application has to be prepared to switch encodings on
>> every frame.
>>
>> I understand how uncomfortable people can be with such flexibility
>> -- changing encodings mid-stream -- so just as we use the cbf
>> extension for all imgCIF files that are not pure ASCII right now,
>> I will use .cbf for CIF2 files that switch BOMs midstream, but I
>> will allow for switches in BOMs midstream.
>>
>> Have you considered using .cf2 as the extension for CIF2 files?
>> In light of the decision to make CIF2 a maximally disruptive
>> change from CIF, confusion between CIF and CIF2 files would seem to
>> me a much more serious cause for concern than dealing with an
>> embedded BOM, which can, after all, be much more easily dealt with
>> automatically than the CIF2 changes.
>>
>> Regards,
>>   Herbert
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  [email protected]
>> =====================================================
>>
>> On Wed, 26 May 2010, James Hester wrote:
>>
>>> Dear Herbert,
>>>
>>> I don't believe the technique of using a BOM to switch encodings
>>> mid-stream is widely supported either within this group, by Unicode
>>> decoding/encoding libraries, or by standards documents. For example,
>>> do any browsers support switching the encoding of a webpage halfway
>>> through? I think not. I'd be happy to hear of a counterexample to
>>> this assertion, but assuming that such switching is not likely to be
>>> supported, I'd like to hear what you think of the following
>>> comments:
>>>
>>> Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally
>>> the same as compressing or otherwise transforming the original file.
>>> Therefore, the notion of a 'UCS2-encoded CIF2 file' is no more
>>> contrary to the current CIF2 standard than the notion of a 'gzipped
>>> CIF2 file'. Both files require some operation to transform them to
>>> a CIF2 file. Both files will lack the required magic number at the
>>> front, and will cause CIF2 parsers to fail dismally. I would
>>> propose that, if you need UCS2 for efficiency or storage reasons,
>>> you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
>>> and make it clear external to the file contents that they will need
>>> to be transformed from ucs2 to utf-8 before being fed to
>>> standards-compliant CIF2 tools. My main concern with this approach
>>> is that we avoid confusion between a CIF2 file and a (re)encoded
>>> CIF2 file, because as soon as a CIF reader or writer is unsure about
>>> what they are reading or writing, the effectiveness of the standard
>>> is degraded.
>>>
>>> I appreciate that this is not ideal from your point of view, and
>>> that you'd like to be able to specify the encoding within the file
>>> itself. For the same reasons as discussed last year, I don't like
>>> that approach.
>>>
>>> I do not understand your argument about an internal UCS BOM being
>>> not that much of a big deal because the program logic is not
>>> complicated. Ease of programming is not really the issue here. If
>>> a file is a standards-compliant CIF2 file, it must not cause a
>>> syntax error when read by a standards-compliant CIF2 reader
>>> (especially for a data transfer protocol!!). If a UCS2 BOM is
>>> allowed in a CIF2 file, then *all* readers must be able to accept
>>> and understand it identically.
>>>
>>> James.
>>>
>>> On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
>>> <[email protected]> wrote:
>>>> Dear Colleagues,
>>>>
>>>> James has said:
>>>>
>>>>> So: why exactly is ignoring a BOM a problem? If the embedded
>>>>> BOM is the leading BOM from a UTF16 file that has been naively
>>>>> concatenated, it will have bytes 0xFE 0xFF. This byte sequence
>>>>> (and the reverse) is not acceptable UTF8, leading to a decoding
>>>>> error from the UTF8 decoding step. The subsequent bytes will be
>>>>> UTF16, which should cause a decoding failure in any case. So I
>>>>> deduce that we are simply discussing how to treat a UTF8 BOM,
>>>>> which can only find its way into a CIF file by naive
>>>>> concatenation of UTF8-encoded files written by certain programs.
>>>>>
>>>>> If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>>>>> OK, as I don't see that it is indicative of any problems beyond
>>>>> misguided choice of text editor.
>>>>>
>>>>> So I would advocate ignoring (and removing) UTF8-BOMs in the
>>>>> input stream, and treating all other BOMs as syntax errors.
>>>>> Individual applications may wish to give users the option of
>>>>> interpreting U+FEFF as the deprecated ZWNBSP (and translating to
>>>>> the correct character) on the understanding that if this occurs
>>>>> outside a delimited string it will cause a syntax error.
>>>>
>>>> I propose something slightly different, which will amount to what
>>>> James is proposing for applications that wish to handle only UTF8,
>>>> but which will be essential for applications that have to work
>>>> with a wider range of encodings (e.g. imgCIF applications).
>>>>
>>>> There are three highly likely BOMs that may be encountered at any
>>>> point in a byte stream in a Unicode world:
>>>>
>>>>   The UTF-8 BOM:                EF BB BF
>>>>   The UTF-16 big-endian BOM:    FE FF
>>>>   The UTF-16 little-endian BOM: FF FE
>>>>
>>>> For a UTF-8 application, the sequence EF BB BF is, as James
>>>> suggests, simply something to accept and ignore, with processing
>>>> continuing normally without comment. Again, as James suggests,
>>>> for a UTF-8-only application the other 2 BOMs are invalid
>>>> characters to treat as an error.
>>>>
>>>> However, for an application able to work with a wider range of
>>>> encodings, the other two BOMs are just what it needs to decide how
>>>> to handle the remainder of the stream.
>>>>
>>>> Now that we have settled the case-sensitivity issue in a
>>>> normalized unicode context, the recognition of BOMs in this manner
>>>> imposes no particular additional burden on applications. All
>>>> applications will have to have utilities to assemble UTF-8
>>>> character sequences into Unicode code points either as 16 bit, or,
>>>> better, 32 bit integers, so this is just a perfectly normal and in
>>>> most cases already coded branch point in that logic. If the
>>>> application wishes to only be UTF-8 aware, it can chop off the
>>>> branch that would decode UCS-2/UTF-16 streams. For what I have to
>>>> do in my applications, I will simply accept the output of that
>>>> branch -- in terms of code points for text I won't be able to tell
>>>> the difference among the three possible streams of encoded
>>>> characters, and for the UCS-2/UTF-16 bin-utf binary data I have to
>>>> handle for imgCIF, things will work. Certainly, for interchange
>>>> with applications that only handle UTF-8, I will write the 50%
>>>> expanded UTF-8 encodings of the same binaries, but for performance
>>>> limited data collections, I will write out UCS-2/UTF-16 files.
>>>>
>>>> Nobody is hurt by what I am proposing and CIF2 will see wider
>>>> application this way. Alternatively, if the needs of imgCIF make
>>>> it unacceptable for it to be labelled CIF, we can always go back
>>>> to calling it imgNCIF (N for "not") as we had to in 1997 until we
>>>> called a truce and decided to accept the realities of modern
>>>> macromolecular data acquisition.
>>>>
>>>> Regards,
>>>>   Herbert
>>>>
>>>> =====================================================
>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>>                  +1-631-244-3035
>>>>                  [email protected]
>>>> =====================================================
>>>>
>>>> On Mon, 24 May 2010, James Hester wrote:
>>>>
>>>>> To run through the alternatives and some of the arguments so far:
>>>>>
>>>>> (i) treating an embedded BOM as an ordinary character runs
>>>>> against the Unicode recommendations. If we wish our standard to
>>>>> be respected, I think we should at least respect other standards
>>>>> and the thinking that has gone into them
>>>>>
>>>>> (ii) treating an embedded BOM as whitespace is OK with the
>>>>> Unicode standard, but means that a non-ASCII character now has
>>>>> syntactic meaning in the CIF. I think this would be completely
>>>>> inconsistent on our part, as an invisible character (when
>>>>> displayed) can actually be used to delimit strings. This is my
>>>>> least preferred solution, as it goes against the
>>>>> human-readability expected of CIFs
>>>>>
>>>>> (iii) ignoring embedded BOMs is bad because they can be a 'tip
>>>>> off to a serious problem'.
>>>>>
>>>>> (iv) treating embedded BOMs as syntax errors will cause issues
>>>>> when CIF2 files are naively concatenated
>>>>>
>>>>> I think the only viable alternatives are to choose (iii) or (iv).
>>>>>
>>>>> So: why exactly is ignoring a BOM a problem? If the embedded
>>>>> BOM is the leading BOM from a UTF16 file that has been naively
>>>>> concatenated, it will have bytes 0xFE 0xFF. This byte sequence
>>>>> (and the reverse) is not acceptable UTF8, leading to a decoding
>>>>> error from the UTF8 decoding step. The subsequent bytes will be
>>>>> UTF16, which should cause a decoding failure in any case. So I
>>>>> deduce that we are simply discussing how to treat a UTF8 BOM,
>>>>> which can only find its way into a CIF file by naive
>>>>> concatenation of UTF8-encoded files written by certain programs.
>>>>>
>>>>> If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>>>>> OK, as I don't see that it is indicative of any problems beyond
>>>>> misguided choice of text editor.
>>>>>
>>>>> So I would advocate ignoring (and removing) UTF8-BOMs in the
>>>>> input stream, and treating all other BOMs as syntax errors.
>>>>> Individual applications may wish to give users the option of
>>>>> interpreting U+FEFF as the deprecated ZWNBSP (and translating to
>>>>> the correct character) on the understanding that if this occurs
>>>>> outside a delimited string it will cause a syntax error.
>>>>>
>>>>> James
>>>>>
>>>>> PS am I the only one who thinks it unlikely that Wordpad users
>>>>> would choose to use 'cat' to join file fragments together?
>>>>>
>>>>> On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
>>>>> <[email protected]> wrote:
>>>>>> Allow me to clarify my position, so there is no
>>>>>> misunderstanding:
>>>>>>
>>>>>> I believe that we will be dealing with a world with at least
>>>>>> UTF-8 and UCS-2/UTF-16 encodings for many years to come. I
>>>>>> have no objection to CIF2 being specified solely in terms of
>>>>>> UTF-8 for simplicity and consistency, but if we are to write
>>>>>> software that people can use, we must have a reasonable
>>>>>> position with respect to the encodings people use, and that
>>>>>> means that, at the very least, we need to accept and process
>>>>>> UTF-8 BOMs as harmless additional text. Some of us will also
>>>>>> be supporting UCS-2/UTF-16 directly in our applications. I
>>>>>> don't mind if other applications are only going to support
>>>>>> UTF-8, but inasmuch as, as long as we have java and web
>>>>>> browsers, we are going to encounter UCS-2/UTF-16, we should do
>>>>>> something sensible when a UCS-2/UTF-16 BOM pops up, either
>>>>>> doing the internal translation if we so choose, or, if that is
>>>>>> not handled by a particular application, issuing a polite
>>>>>> warning suggesting the use of an external translator.
>>>>>>
>>>>>> BOMs will almost always appear in modern UCS-2/UTF-16 files,
>>>>>> and when they are converted to UTF-8 that will give us yet
>>>>>> another source of UTF-8 BOMs. I believe the sensible thing to
>>>>>> do is to recognize BOMs.
>>>>>>
>>>>>> Regards,
>>>>>>   Herbert
>>>>>> =====================================================
>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>
>>>>>>                  +1-631-244-3035
>>>>>>                  [email protected]
>>>>>> =====================================================
>>>>>>
>>>>>> On Tue, 18 May 2010, Bollinger, John C wrote:
>>>>>>
>>>>>>> Herbert Bernstein wrote:
>>>>>>>> Let me see if I understand this correctly -- a user takes 2
>>>>>>>> perfectly good CIF2 files, edits each to clean up, say, some
>>>>>>>> comments to keep straight where one begins and one ends,
>>>>>>>> using a well-designed modern text editor that happens to put
>>>>>>>> a BOM at the start of each file, concatenates the two files
>>>>>>>> with cat to ship them into the IUCr, and suddenly they have a
>>>>>>>> syntax error caused by a character that they cannot see!!!
>>>>>>>>
>>>>>>>> To me this seems pointless when it is trivial for software to
>>>>>>>> recognize the character and handle it sensibly.
>>>>>>>
>>>>>>> And that is my principal rationale for preferring that
>>>>>>> embedded U+FEFF be recognized as CIF whitespace. With that
>>>>>>> approach, the concatenation of two well-formed CIF2 files is
>>>>>>> always a well-formed CIF2 file, regardless of the presence or
>>>>>>> absence of BOMs in the original files. Note, too, that such
>>>>>>> concatenation cannot produce a mixed-encoding file because
>>>>>>> files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other
>>>>>>> encoding that can be distinguished from UTF-8 are not
>>>>>>> well-formed CIF2 files to start. The file concatenation
>>>>>>> scenario thus does not provide a use case for the CIF2
>>>>>>> *specification* to recognize embedded U+FEFF as an encoding
>>>>>>> marker.
>>>>>>>
>>>>>>> On the other hand, I again feel compelled to distinguish
>>>>>>> program behaviors from the CIF2 format specification. None of
>>>>>>> the above would prevent a CIF processor from recognizing and
>>>>>>> handling CIF-like character streams encoded via schemes other
>>>>>>> than UTF-8, nor from recognizing embedded U+FEFF code
>>>>>>> sequences in various encodings as encoding switches, thereby
>>>>>>> handling mixed-encoding files. Indeed, such a program or
>>>>>>> library would be invaluable for correcting encoding-related
>>>>>>> errors. That does not, however, mean that such files must be
>>>>>>> considered well-formed CIF2, no matter how likely they may (or
>>>>>>> may not) be to arise.
>>>>>>>
>>>>>>> James Hester wrote:
>>>>>>>> I would be happy to call an embedded BOM a syntax error.
>>>>>>>
>>>>>>> In light of the possibility of U+FEFF appearing in a data
>>>>>>> value (for example, from cutting text from a Unicode
>>>>>>> manuscript and pasting it into a CIF), I need to refine my
>>>>>>> earlier blanket alternative of treating embedded U+FEFF as a
>>>>>>> syntax error. I now think it would be ok to treat U+FEFF as a
>>>>>>> syntax error *provided* that it appears outside a delimited
>>>>>>> string. That's still not my preference, though, and I feel
>>>>>>> confident that Herb will still disagree.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> John
>>>>>>> --
>>>>>>> John C. Bollinger, Ph.D.
>>>>>>> Computing and X-Ray Scientist
>>>>>>> Department of Structural Biology
>>>>>>> St. Jude Children's Research Hospital
>>>>>>> [email protected]
>>>>>>> (901) 595-3166 [office]
>>>>>>> www.stjude.org
>>>>>>>
>>>>>>> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
_______________________________________________ ddlm-group mailing list [email protected] http://scripts.iucr.org/mailman/listinfo/ddlm-group
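
The "branch point" that Herbert's 24 May message describes, where a decoder looks for one of the three listed BOMs at any position and lets it select the encoding for what follows, might look roughly like the Python sketch below. This is an illustration under stated assumptions rather than imgCIF or CBFlib code: `bom_at` and `decode_switching` are hypothetical names, the stream is assumed to start in UTF-8, UTF-16 segments are scanned on two-byte boundaries, and switching from UTF-16 back to UTF-8 is deliberately left out because the thread does not say how such a switch would be told apart from ordinary UTF-16 code units.

```python
import codecs

# The three BOMs listed in the thread, mapped to Python codec names.
BOMS = {
    codecs.BOM_UTF8: "utf-8",          # EF BB BF
    codecs.BOM_UTF16_BE: "utf-16-be",  # FE FF
    codecs.BOM_UTF16_LE: "utf-16-le",  # FF FE
}


def bom_at(data: bytes, pos: int):
    """Return (encoding, bom_length) if a listed BOM starts at pos, else None."""
    for bom, encoding in BOMS.items():
        if data.startswith(bom, pos):
            return encoding, len(bom)
    return None


def decode_switching(data: bytes) -> str:
    """Decode a stream whose encoding may change at embedded BOMs."""
    encoding = "utf-8"  # assumed default until a BOM says otherwise
    pos = seg_start = 0
    out = []
    while pos < len(data):
        hit = bom_at(data, pos)
        # Assumption: once in UTF-16, only the two UTF-16 BOMs are honoured.
        if hit and not (encoding != "utf-8" and hit[0] == "utf-8"):
            out.append(data[seg_start:pos].decode(encoding))
            encoding, length = hit
            pos += length
            seg_start = pos
        else:
            # UTF-8 is scanned bytewise, UTF-16 one code unit at a time.
            pos += 1 if encoding == "utf-8" else 2
    out.append(data[seg_start:].decode(encoding))
    return "".join(out)


if __name__ == "__main__":
    mixed = b"data_a\n" + codecs.BOM_UTF16_LE + "data_b\n".encode("utf-16-le")
    print(repr(decode_switching(mixed)))
```

A UTF-8-only reader following James's proposal would instead reject the same mixed stream at the FF FE bytes, which is exactly the point of disagreement in the thread.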