Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 1 Jun 2010 05:26:12 -0400 (EDT)
- In-Reply-To: <AANLkTik5apeKoo9ZboWP7D6ynhIHB45S9Wp5Pg856Bl0@mail.gmail.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><alpine.BSF.2.00.1005240815140.84443@epsilon.pair.com><AANLkTilPuEqipc7oRAFGrS8nI1ae_M154SuGdQcp3pYF@mail.gmail.com><alpine.BSF.2.00.1005260604070.73005@epsilon.pair.com><AANLkTik5apeKoo9ZboWP7D6ynhIHB45S9Wp5Pg856Bl0@mail.gmail.com>
Dear Colleagues,

A UCS-2 message embedded in an email message normally carries a BOM, but that begs the question -- it is normal practice to switch encodings mid-stream and, theory and abstractions aside, we are definitely going to encounter embedded BOMs and, for that matter, MIME-based switches in encoding in the course of processing one stream of information. If one prefers to call such a multi-mode stream a CBF rather than a CIF, so be it, but such streams still have to be processed.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Tue, 1 Jun 2010, James Hester wrote:

> Hi Herbert and others,
>
> As far as I can tell, BOMs have no semantic or parsing significance in the context of an email message, which was my point. Encoding is switched using MIME headers, as you mention, not using BOMs. So I don't see that either email or web standards offer support for the idea of using a BOM to switch encoding. While I appreciate that being restricted to UTF-8 places some restrictions on imgCIF, it is considerably better than the situation that a lot of email still finds itself in, of being restricted to US-ASCII, and imgCBF is still available as an alternative.
>
> So I would repeat my suggestion of
>
> (1) ignoring UTF8 BOM where it is likely to be the result of concatenation (approximately, this means amongst whitespace)
> (2) raising a syntax error if the byte sequence could be either BOM or ZWNBSP (approximately, this means inside any dataname/value/datablock name/save frame name)
> (3) any other type of BOM remains a syntax error as it is not UTF8
>
> I will be calling for a vote in a week or so, after giving everyone a bit more of a chance to make their voice heard.
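
A minimal Python sketch of how the three rules in James's suggestion might be applied to a byte stream. The names decode_cif2 and CifSyntaxError are hypothetical, and "amongst whitespace" is approximated here as "at the start of the input or directly after a whitespace character", which is an assumption rather than anything agreed in the thread. A real parser would decide rule (2) from its token state, not from the neighbouring character.

```python
WHITESPACE = " \t\r\n"
BOM = "\ufeff"


class CifSyntaxError(ValueError):
    """Hypothetical error type used only in this sketch."""


def decode_cif2(data: bytes) -> str:
    # Rule (3): the bytes 0xFE and 0xFF never occur in valid UTF-8, so a
    # UTF-16 BOM (FE FF or FF FE) makes strict UTF-8 decoding fail.
    try:
        text = data.decode("utf-8")  # plain "utf-8" keeps U+FEFF in the text
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"not UTF-8 at byte {exc.start}") from exc

    kept = []
    for i, ch in enumerate(text):
        if ch != BOM:
            kept.append(ch)
        elif i == 0 or text[i - 1] in WHITESPACE:
            # Rule (1): assumed to be left over from concatenation; drop it.
            continue
        else:
            # Rule (2): could be a BOM or a ZWNBSP inside a token.
            raise CifSyntaxError(f"U+FEFF inside a token at character {i}")
    return "".join(kept)
```

With this sketch, two BOM-prefixed files joined with cat decode cleanly, while a U+FEFF in the middle of a data name, or a UTF-16 BOM anywhere, is reported as an error.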
>
> On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> The extension we use is cbf, so the extension is not an issue. A cbf might be a true ASCII CIF, or an imgCIF file with true binary sections with or without compression, or a UCS-2 file with or without binutf sections with or without compression.
>
> Clearly the cleanest case for binutf is when the entire file starts out as UCS-2 and just continues that way, but because the logic of imgCIF permits any mixture of the various types of binary sections with any type of headers, there is no reason to declare an error because of changes from, say, straight ASCII to UCS-2 and back.
>
> The most common place in which you will find a similar disdain for requiring BOMs as the first glyph is in email messages, because a modern, multi-part email message is actually a concatenation of multiple files of arbitrary types and encodings. Now you could make the argument that the email message is just a container for those files and that each file carries its BOM at the front of that (sub)file, and you would be right, but that is exactly how imgCIF ends up in the same situation -- it is a container for multiple headers and binary images, and each binary image may be in a different encoding (with different compression as well). This flexibility is not an accident -- it was a major intentional change in imgCIF in 1997 from Andy Hammersley's original model of one ASCII header and one binary image to a more CIF-like, order-independent approach of allowing an arbitrary mixture of multiple headers and multiple binary images.
>
> From a programming point of view, once you live in a world of multiple encodings, recognizing a BOM at the start of a file is no different from recognizing it anywhere in a file.
>
> In addition to email, another place in which changes of encoding occur, albeit with a meta tag or Content-Type header rather than with a BOM, is in web pages: for a page displayed from frames, a browser has to be prepared to switch encodings on every frame.
>
> I understand how uncomfortable people can be with such flexibility -- changing encodings mid-stream -- so just as we use the cbf extension for all imgCIF files that are not pure ASCII right now, I will use .cbf for CIF2 files that switch BOMs midstream, but I will allow for switches in BOMs midstream.
>
> Have you considered using .cf2 as the extension for CIF2 files? In light of the decision to make CIF2 a maximally disruptive change from CIF, confusion between CIF and CIF2 files would seem to me a much more serious cause for concern than dealing with an embedded BOM which can, after all, be much more easily dealt with automatically than the CIF2 changes.
>
> Regards,
> Herbert
>
> On Wed, 26 May 2010, James Hester wrote:
> Dear Herbert,
>
> I don't believe the technique of using a BOM to switch encodings mid-stream is widely supported either within this group, by Unicode decoding/encoding libraries, or by standards documents. For example, do any browsers support switching the encoding of a webpage halfway through? I think not. I'd be happy to hear of a counterexample to this assertion, but assuming that such switching is not likely to be supported, I'd like to hear what you think of the following comments:
>
> Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same as compressing or otherwise transforming the original file.
> Therefore, the notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current CIF2 standard than the notion of a 'gzipped CIF2 file'. Both files require some operation to transform them to a CIF2 file. Both files will lack the required magic number at the front, and will cause CIF2 parsers to fail dismally. I would propose that, if you need UCS2 for efficiency or storage reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2) and make it clear external to the file contents that they will need to be transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2 tools. My main concern with this approach is that we avoid confusion between a CIF2 file and a (re)encoded CIF2 file, because as soon as a CIF reader or writer is unsure about what they are reading or writing, the effectiveness of the standard is degraded.
>
> I appreciate that this is not ideal from your point of view, and that you'd like to be able to specify the encoding within the file itself. For the same reasons as discussed last year, I don't like that approach.
>
> I do not understand your argument about an internal UCS BOM being not that much of a big deal because the program logic is not complicated. Ease of programming is not really the issue here. If a file is a standards-compliant CIF2 file, it must not cause a syntax error when read by a standards-compliant CIF2 reader (especially for a data transfer protocol!!). If a UCS2 BOM is allowed in a CIF2 file, then *all* readers must be able to accept and understand it identically.
>
> James.
>
> On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> James has said:
>
>     So: why exactly is ignoring a BOM a problem? If the embedded BOM is the leading BOM from a UTF16 file that has been naively concatenated, it will have bytes 0xFE 0xFF.
>     This byte sequence (and the reverse) is not acceptable UTF8, leading to a decoding error from the UTF8 decoding step. The subsequent bytes will be UTF16, which should cause a decoding failure in any case. So I deduce that we are simply discussing how to treat a UTF8 BOM, which can only find its way into a CIF file by naive concatenation of UTF8-encoded files written by certain programs.
>
>     If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't see that it is indicative of any problems beyond misguided choice of text editor.
>
>     So I would advocate ignoring (and removing) UTF8-BOMs in the input stream, and treating all other BOMs as syntax errors. Individual applications may wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP (and translating to the correct character) on the understanding that if this occurs outside a delimited string it will cause a syntax error.
>
> I propose something slightly different, which will amount to what James is proposing for applications that wish to handle only UTF8, but which will be essential for applications that have to work with a wider range of encodings (e.g. imgCIF applications).
>
> There are three highly likely BOMs that may be encountered at any point in a byte stream in a Unicode world:
>
>     The UTF-8 BOM:                 EF BB BF
>     The UTF-16 big-endian BOM:     FE FF
>     The UTF-16 little-endian BOM:  FF FE
>
> For a UTF-8 application, the sequence EF BB BF is, as James suggests, simply something to accept and ignore, with processing continuing normally without comment. Again, as James suggests, for a UTF-8-only application the other 2 BOMs are invalid characters to treat as an error.
>
> However, for an application able to work with a wider range of encodings, the other two BOMs are just what it needs to decide how to handle the remainder of the stream.
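
A small Python sketch of the branch point described above, assuming only the three BOMs listed in the message. The function names bom_at and decode_segment and the utf8_only flag are illustrative, not taken from any existing CIF or imgCIF library. UTF-32 BOMs are deliberately not handled, and the sketch looks for a BOM only at a position the caller supplies rather than scanning a whole stream.

```python
import codecs

# The three BOMs named in the message, paired with the codec that reads
# the bytes following each one.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def bom_at(data: bytes, pos: int = 0):
    """Return (encoding, bom_length) if a known BOM starts at pos, else None."""
    for marker, encoding in BOMS:
        if data.startswith(marker, pos):
            return encoding, len(marker)
    return None


def decode_segment(data: bytes, utf8_only: bool = True) -> str:
    """Decode one segment, using a leading BOM (if any) to pick the codec."""
    found = bom_at(data)
    if found is None:
        return data.decode("utf-8")
    encoding, length = found
    if utf8_only and encoding != "utf-8":
        # A UTF-8-only application treats the UTF-16 BOMs as an error.
        raise ValueError(f"unsupported encoding announced by BOM: {encoding}")
    return data[length:].decode(encoding)
```

A UTF-8-only reader would call decode_segment(data) and see an error for either UTF-16 BOM; a reader that accepts more encodings would pass utf8_only=False and receive the same code points whichever of the three encodings was used.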
>
> Now that we have settled the case-sensitivity issue in a normalized Unicode context, the recognition of BOMs in this manner imposes no particular additional burden on applications. All applications will have to have utilities to assemble UTF-8 character sequences into Unicode code points either as 16 bit, or, better, 32 bit integers, so this is just a perfectly normal and in most cases already coded branch point in that logic. If the application wishes to only be UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16 streams. For what I have to do in my applications, I will simply accept the output of that branch -- in terms of code points for text I won't be able to tell the difference among the three possible streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary data I have to handle for imgCIF, things will work. Certainly, for interchange with applications that only handle UTF-8, I will write the 50% expanded UTF-8 encodings of the same binaries, but for performance limited data collections, I will write out UCS-2/UTF-16 files.
>
> Nobody is hurt by what I am proposing and CIF2 will see wider application this way. Alternatively, if the needs of imgCIF are unacceptable to be labelled CIF, we can always go back to calling it imgNCIF (N for "not") as we had to in 1997 until we called a truce and decided to accept the realities of modern macromolecular data acquisition.
>
> Regards,
> Herbert
>
> On Mon, 24 May 2010, James Hester wrote:
> To run through the alternatives and some of the arguments so far:
>
> (i) treating an embedded BOM as an ordinary character runs against the Unicode recommendations. If we wish our standard to be respected, I think we should at least respect other standards and the thinking that has gone into them
>
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard, but means that a non-ASCII character now has syntactic meaning in the CIF. I think this would be completely inconsistent on our part, as an invisible character (when displayed) can actually be used to delimit strings. This is my least preferred solution, as it goes against the human-readability expected of CIFs
>
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a serious problem'.
>
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2 files are naively concatenated
>
> I think the only viable alternatives are to choose (iii) or (iv).
>
> So: why exactly is ignoring a BOM a problem? If the embedded BOM is the leading BOM from a UTF16 file that has been naively concatenated, it will have bytes 0xFE 0xFF. This byte sequence (and the reverse) is not acceptable UTF8, leading to a decoding error from the UTF8 decoding step. The subsequent bytes will be UTF16, which should cause a decoding failure in any case. So I deduce that we are simply discussing how to treat a UTF8 BOM, which can only find its way into a CIF file by naive concatenation of UTF8-encoded files written by certain programs.
>
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't see that it is indicative of any problems beyond misguided choice of text editor.
>
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream, and treating all other BOMs as syntax errors. Individual applications may wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP (and translating to the correct character) on the understanding that if this occurs outside a delimited string it will cause a syntax error.
>
> James
>
> PS am I the only one who thinks it unlikely that Wordpad users would choose to use 'cat' to join file fragments together?
>
> On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Allow me to clarify my position, so there is no misunderstanding:
>
> I believe that we will be dealing with a world with at least UTF-8 and UCS-2/UTF-16 encodings for many years to come. I have no objection to CIF2 being specified solely in terms of UTF-8 for simplicity and consistency, but if we are to write software that people can use, we must have a reasonable position with respect to the encodings people use, and that means that, at the very least, we need to accept and process UTF-8 BOMs as harmless additional text. Some of us will also be supporting UCS-2/UTF-16 directly in our applications. I don't mind if other applications are only going to support UTF-8, but inasmuch as, as long as we have Java and web browsers, we are going to encounter UCS-2/UTF-16, we should do something sensible when a UCS-2/UTF-16 BOM pops up: either doing the internal translation if we so choose or, if a particular application does not wish to handle UCS-2/UTF-16, issuing a polite warning suggesting the use of an external translator.
>
> BOMs will almost always appear in modern UCS-2/UTF-16 files, and when they are converted to UTF-8 that will give us yet another source of UTF-8 BOMs. I believe the sensible thing to do is to recognize BOMs.
>
> Regards,
> Herbert
>
> On Tue, 18 May 2010, Bollinger, John C wrote:
>
> > Herbert Bernstein wrote:
> >> Let me see if I understand this correctly -- a user takes 2 perfectly good CIF2 files, edits each to clean up, say, some comments to keep straight where one begins and one ends, using a well-designed modern text editor that happens to put a BOM at the start of each file, concatenates the two files with cat to ship them into the IUCr, and suddenly they have a syntax error caused by a character that they cannot see!!!
> >>
> >> To me this seems pointless when it is trivial for software to recognize the character and handle it sensibly.
> >
> > And that is my principal rationale for preferring that embedded U+FEFF be recognized as CIF whitespace. With that approach, the concatenation of two well-formed CIF2 files is always a well-formed CIF2 file, regardless of the presence or absence of BOMs in the original files. Note, too, that such concatenation cannot produce a mixed-encoding file because files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other encoding that can be distinguished from UTF-8 are not well-formed CIF2 files to start. The file concatenation scenario thus does not provide a use case for the CIF2 *specification* to recognize embedded U+FEFF as an encoding marker.
> >
> > On the other hand, I again feel compelled to distinguish program behaviors from the CIF2 format specification. None of the above would prevent a CIF processor from recognizing and handling CIF-like character streams encoded via schemes other than UTF-8, nor from recognizing embedded U+FEFF code sequences in various encodings as encoding switches, thereby handling mixed-encoding files. Indeed, such a program or library would be invaluable for correcting encoding-related errors. That does not, however, mean that such files must be considered well-formed CIF2, no matter how likely they may (or may not) be to arise.
> >
> >
> > James Hester wrote:
> >> I would be happy to call an embedded BOM a syntax error.
> >
> > In light of the possibility of U+FEFF appearing in a data value (for example, from cutting text from a Unicode manuscript and pasting it into a CIF), I need to refine my earlier blanket alternative of treating embedded U+FEFF as a syntax error. I now think it would be ok to treat U+FEFF as a syntax error *provided* that it appears outside a delimited string. That's still not my preference, though, and I feel confident that Herb will still disagree.
> >
> >
> > Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Computing and X-Ray Scientist
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> > John.Bollinger@StJude.org
> > (901) 595-3166 [office]
> > www.stjude.org
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group