
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Herbert J. Bernstein" <[email protected]>
Date: Tue, 1 Jun 2010 05:26:12 -0400 (EDT)
In-Reply-To: <[email protected]>

Dear Colleagues,

   A UCS-2 message embedded in an email message normally carries
a BOM, but that begs the question -- it is normal practice to
switch encodings mid-stream, and, theory and abstractions aside,
we are definitely going to encounter embedded BOMs and, for that
matter, MIME-based switches in encoding in the course of
processing one stream of information.  If one prefers to call
such a multi-mode stream a CBF rather than a CIF, so be it,
but such streams still have to be processed.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Tue, 1 Jun 2010, James Hester wrote:

> Hi Herbert and others,
> 
> As far as I can tell, BOMs have no semantic or parsing significance in the context of an email message, which was my point.  Encoding is switched
> using MIME headers, as you mention, not using BOMs.  So, I don't see that either email or web standards offer support for the idea of using a BOM
> to switch encoding.  While I appreciate that being restricted to UTF-8 places some restrictions on imgCIF, it is considerably better than the
> situation that a lot of email still finds itself in, of being restricted to US-ASCII, and imgCBF is still available as an alternative.
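
The MIME mechanism referred to here can be illustrated with a minimal sketch using Python's standard `email` package. The message contents and the helper names `build_mixed_charset_message` and `decode_text_parts` are invented for illustration; the point is only that each part is decoded according to the charset declared in its own Content-Type header, with no BOM involved.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mixed_charset_message() -> bytes:
    """Build a multipart message whose text parts use different charsets."""
    msg = MIMEMultipart("mixed")
    msg["Subject"] = "charset demo"
    msg.attach(MIMEText("plain ASCII header text", "plain", "us-ascii"))
    msg.attach(MIMEText("\u00c5ngstr\u00f6m", "plain", "utf-8"))
    msg.attach(MIMEText("caf\u00e9", "plain", "iso-8859-1"))
    return msg.as_bytes()


def decode_text_parts(raw: bytes) -> list[tuple[str, str]]:
    """Return (charset, decoded text) for every text part of the message."""
    parts = []
    for part in message_from_bytes(raw).walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container itself
        # The charset comes from the part's own Content-Type header.
        charset = part.get_content_charset() or "us-ascii"
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        parts.append((charset, payload.decode(charset)))
    return parts


if __name__ == "__main__":
    for charset, text in decode_text_parts(build_mixed_charset_message()):
        print(charset, repr(text))
```
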
> 
> So I would repeat my suggestion of
> 
> (1) ignoring a UTF-8 BOM where it is likely to be the result of concatenation (approximately, this means amongst whitespace)
> (2) raising a syntax error if the byte sequence could be either a BOM or a ZWNBSP (approximately, this means inside any dataname/value/datablock
> name/save frame name)
> (3) treating any other type of BOM as a syntax error, as it is not UTF-8
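
The three rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not a CIF2 parser: "amongst whitespace" is approximated as "has whitespace or a stream boundary on at least one side", the whitespace set is taken to be space, tab, CR and LF, and the names `CifSyntaxError` and `strip_concatenation_boms` are hypothetical.

```python
BOM = "\ufeff"
CIF_WHITESPACE = " \t\r\n"  # assumed whitespace set for this sketch


class CifSyntaxError(ValueError):
    """Hypothetical error type standing in for a CIF2 syntax error."""


def strip_concatenation_boms(raw: bytes) -> str:
    """Apply rules (1)-(3) to a byte stream and return the cleaned text."""
    try:
        # Rule (3): UTF-16 BOM bytes (FE FF / FF FE) are never valid UTF-8,
        # so strict decoding rejects them.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"not valid UTF-8: {exc}") from exc

    out = []
    for i, ch in enumerate(text):
        if ch != BOM:
            out.append(ch)
            continue
        before = text[i - 1] if i > 0 else None
        after = text[i + 1] if i + 1 < len(text) else None
        amongst_whitespace = (
            before is None or before in CIF_WHITESPACE
            or after is None or after in CIF_WHITESPACE
        )
        if not amongst_whitespace:
            # Rule (2): inside a token it could be a BOM or a ZWNBSP.
            raise CifSyntaxError(f"U+FEFF inside a token at offset {i}")
        # Rule (1): likely a concatenation artifact, so drop it silently.
    return "".join(out)
```

With this sketch, two BOM-prefixed files joined with `cat` decode to clean text, while a U+FEFF in the middle of a data name is rejected.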
> 
> I will be calling for a vote in a week or so, after giving everyone a bit more of a chance to make their voice heard.
> 
> On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <[email protected]> wrote:
>       The extension we use is cbf, so the extension is not an issue.
>       A cbf might be a true ASCII CIF, or an imgCIF file with true
>       binary sections with or without compression, or a UCS-2 file with
>       or without binutf sections with or without compression.
>
>       Clearly the cleanest case for binutf is when the entire file
>       starts out as UCS-2 and just continues that way, but because
>       the logic of imgCIF permits any mixture of the various types
>       of binary sections with any type of headers, there is no reason
>       to declare an error because of changes from, say, straight ASCII
>       to UCS-2 and back.
>
>       The most common place in which you will find a similar disdain for
>       requiring BOMs as the first glyph is in email messages, because
>       a modern, multi-part email message is actually a concatenation
>       of multiple files of arbitrary types and encodings.  Now you could
>       make the argument that the email message is just a container for
>       those files and that each file carries its BOM at the front of
>       that (sub)file, and you would be right, but that is exactly how imgCIF
>       ends up in the same situation -- it is a container for multiple
>       headers and binary images and each binary image may be in a
>       different encoding (with different compression as well).  This
>       flexibility is not an accident -- it was a major intentional change
>       in imgCIF in 1997 from Andy Hammersley's original model of one ASCII
>       header and one binary image to a more CIF-like, order-independent
>       approach of allowing an arbitrary mixture of multiple headers and
>       multiple binary images.
>
>       From a programming point of view, once you live in a world of
>       multiple encodings, recognizing a BOM at the start of a file is
>       no different from recognizing it anywhere in a file.
>
>       In addition to email, another place in which encodings change,
>       albeit with a meta tag or Content-Type header rather than with a
>       BOM, is in web pages: in a page being displayed from frames, a
>       browser application has to be prepared to switch encodings on
>       every frame.
>
>       I understand how uncomfortable people can be with such flexibility
>       -- changing encodings mid-stream -- so just as we use the cbf
>       extension for all imgCIF files that are not pure ASCII right now,
>       I will use .cbf for CIF2 files that switch BOMs midstream, but I
>       will allow for switches in BOMs midstream.
>
>       Have you considered using .cf2 as the extension for CIF2 files?
>       In light of the decision to make CIF2 a maximally disruptive
>       change from CIF, confusion between CIF and CIF2 files would seem
>       to me a much more serious cause for concern than dealing with an
>       embedded BOM, which can, after all, be much more easily dealt
>       with automatically than the CIF2 changes.
>
>       Regards,
>         Herbert
> 
> On Wed, 26 May 2010, James Hester wrote:
>
>       Dear Herbert,
>
>       I don't believe the technique of using a BOM to switch encodings mid-stream
>       is widely supported either within this group, by Unicode decoding/encoding
>       libraries, or by standards documents.  For example, do any browsers support
>       switching the encoding of a webpage halfway through?  I think not. I'd be
>       happy to hear of a counterexample to this assertion, but assuming that such
>       switching is not likely to be supported, I'd like to hear what you think of
>       the following comments:
>
>       Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same
>       as compressing or otherwise transforming the original file.  Therefore, the
>       notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current CIF2
>       standard than the notion of a 'gzipped CIF2 file'.  Both files require some
>       operation to transform them to a CIF2 file.  Both files will lack the
>       required magic number at the front, and will cause CIF2 parsers to fail
>       dismally.  I would propose that, if you need UCS2 for efficiency or storage
>       reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
>       and make it clear external to the file contents that they will need to be
>       transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2
>       tools.  My main concern with this approach is that we avoid confusion
>       between a CIF2 file and an (re)encoded CIF2 file, because as soon as a CIF
>       reader or writer is unsure about what they are reading or writing, the
>       effectiveness of the standard is degraded.
>
>       I appreciate that this is not ideal from your point of view, and that you'd
>       like to be able to specify the encoding within the file itself.  For the
>       same reasons as discussed last year, I don't like that approach.
>
>       I do not understand your argument about an internal UCS BOM being not that
>       much of a big deal because the program logic is not complicated.  Ease of
>       programming is not really the issue here.  If a file is a
>       standards-compliant CIF2 file, it must not cause a syntax error when read by
>       a standards-compliant CIF2 reader (especially for a data transfer
>       protocol!!).  If a UCS2 BOM is allowed in a CIF2 file, then *all* readers
>       must be able to accept and understand it identically.
>
>       James.
>
>       On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
>       <[email protected]> wrote:
>       Dear Colleagues,
>
>       James has said:
>
>            So: why exactly is ignoring a BOM a problem?  If the embedded
>            BOM is the leading BOM from a UTF16 file that has been naively
>            concatenated, it will have bytes 0xFE 0xFF.  This byte sequence
>            (and the reverse) is not acceptable UTF8, leading to a decoding
>            error from the UTF8 decoding step.  The subsequent bytes will
>            be UTF16, which should cause a decoding failure in any case.
>            So I deduce that we are simply discussing how to treat a UTF8
>            BOM, which can only find its way into a CIF file by naive
>            concatenation of UTF8-encoded files written by certain programs.
>
>            If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>            OK, as I don't see that it is indicative of any problems beyond
>            misguided choice of text editor.
>
>            So I would advocate ignoring (and removing) UTF8-BOMs in the
>            input stream, and treating all other BOMs as syntax errors.
>            Individual applications may wish to give users the option of
>            interpreting U+FEFF as the deprecated ZWNBSP (and translating
>            to the correct character) on the understanding that if this
>            occurs outside a delimited string it will cause a syntax error.
> 
>
>       I propose something slightly different, which will amount to what
>       James is proposing for applications that wish to handle only UTF-8,
>       but which will be essential for applications that have to work with
>       a wider range of encodings (e.g. imgCIF applications).
>
>       There are three highly likely BOMs that may be encountered at any
>       point in a byte stream in a Unicode world:
>
>       The UTF-8 BOM:                 EF BB BF
>       The UTF-16 big-endian BOM:     FE FF
>       The UTF-16 little-endian BOM:  FF FE
>
>       For a UTF-8 application, the sequence EF BB BF is, as James
>       suggests, simply something to accept and ignore, with processing
>       continuing normally without comment.  Again, as James suggests, for
>       a UTF-8-only application the other two BOMs are invalid characters
>       to treat as an error.
>
>       However, for an application able to work with a wider range of
>       encodings, the other two BOMs are just what it needs to decide how
>       to handle the remainder of the stream.
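
The branch point described above can be sketched in a few lines. The BOM constants come from Python's standard `codecs` module; the function names `sniff_bom` and `decode_segment` and the `utf8_only` switch are hypothetical, and the sketch assumes the caller already knows where a segment starts. UTF-32 BOMs, which are not among the three listed above, are deliberately not handled.

```python
import codecs

# The three BOMs listed above, paired with the codec each one selects.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes, default: str = "utf-8") -> tuple[str, int]:
    """Return (encoding, BOM length) for the start of a segment."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return default, 0


def decode_segment(data: bytes, utf8_only: bool = False) -> str:
    """Decode one segment, honouring its leading BOM if there is one."""
    encoding, skip = sniff_bom(data)
    if utf8_only and encoding != "utf-8":
        # A UTF-8-only application "chops off" the UTF-16 branch.
        raise ValueError(f"{encoding} BOM found, but only UTF-8 is supported")
    return data[skip:].decode(encoding)
```

For text content, all three branches yield the same code points; only an application with `utf8_only=True` sees the UTF-16 BOMs as errors.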
>
>       Now that we have settled the case-sensitivity issue in a normalized
>       Unicode context, the recognition of BOMs in this manner imposes no
>       particular additional burden on applications.  All applications will
>       have to have utilities to assemble UTF-8 character sequences into
>       Unicode code points either as 16-bit or, better, 32-bit integers,
>       so this is just a perfectly normal and in most cases already coded
>       branch point in that logic.  If the application wishes to be only
>       UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16
>       streams.  For what I have to do in my applications, I will simply
>       accept the output of that branch -- in terms of code points for text
>       I won't be able to tell the difference among the three possible
>       streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary
>       data I have to handle for imgCIF, things will work.  Certainly, for
>       interchange with applications that only handle UTF-8, I will write
>       the 50% expanded UTF-8 encodings of the same binaries, but for
>       performance-limited data collections, I will write out UCS-2/UTF-16
>       files.
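
The "50% expanded" figure is what one would expect if the binary payload is carried by code points in the range U+0800 to U+FFFF, each of which takes two bytes in UCS-2/UTF-16 but three bytes in UTF-8. Whether bin-utf actually restricts itself to that range is an assumption here, not something stated in this thread; the helper name `utf8_to_utf16_ratio` is likewise invented for the sketch.

```python
def utf8_to_utf16_ratio(text: str) -> float:
    """Size of the UTF-8 encoding relative to the UTF-16 encoding (no BOMs)."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))


# 256 code points from U+0800 upward: 3 bytes each in UTF-8, 2 in UTF-16.
payload_like = "".join(chr(cp) for cp in range(0x0800, 0x0900))

# Plain ASCII text goes the other way: 1 byte in UTF-8, 2 in UTF-16.
ascii_text = "data_example"

if __name__ == "__main__":
    print(utf8_to_utf16_ratio(payload_like))
    print(utf8_to_utf16_ratio(ascii_text))
```
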
>
>       Nobody is hurt by what I am proposing and CIF2 will see wider
>       application this way.  Alternatively, if it is unacceptable for
>       something that meets the needs of imgCIF to be labelled CIF, we
>       can always go back to calling it imgNCIF (N for "not") as we had
>       to in 1997 until we called a truce and decided to accept the
>       realities of modern macromolecular data acquisition.
>
>       Regards,
>         Herbert
>
>
>       On Mon, 24 May 2010, James Hester wrote:
>
>            To run through the alternatives and some of the arguments so far:
>
>            (i) treating an embedded BOM as an ordinary character runs
>            against the Unicode recommendations.  If we wish our standard
>            to be respected, I think we should at least respect other
>            standards and the thinking that has gone into them
>
>            (ii) treating an embedded BOM as whitespace is OK with the
>            Unicode standard, but means that a non-ASCII character now has
>            syntactic meaning in the CIF.  I think this would be completely
>            inconsistent on our part, as an invisible character (when
>            displayed) can actually be used to delimit strings.  This is my
>            least preferred solution, as it goes against the
>            human-readability expected of CIFs
>
>            (iii) ignoring embedded BOMs is bad because they can be a
>            'tip off to a serious problem'.
>
>            (iv) treating embedded BOMs as syntax errors will cause issues
>            when CIF2 files are naively concatenated
>
>            I think the only viable alternatives are to choose (iii) or (iv).
>
>            So: why exactly is ignoring a BOM a problem?  If the embedded
>            BOM is the leading BOM from a UTF16 file that has been naively
>            concatenated, it will have bytes 0xFE 0xFF.  This byte sequence
>            (and the reverse) is not acceptable UTF8, leading to a decoding
>            error from the UTF8 decoding step.  The subsequent bytes will
>            be UTF16, which should cause a decoding failure in any case.
>            So I deduce that we are simply discussing how to treat a UTF8
>            BOM, which can only find its way into a CIF file by naive
>            concatenation of UTF8-encoded files written by certain programs.
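
The byte-level claims in the paragraph above can be checked with a short sketch using only Python's built-in codecs; the helper name `utf8_decodes` is invented here. One caveat it brings out: the UTF-16 BOM bytes do fail strict UTF-8 decoding, but UTF-16 text in the ASCII range is itself byte-wise valid UTF-8 (it decodes to characters interleaved with NULs), so for such text the failure would have to come from a later character-set check rather than from the decoder.

```python
import codecs


def utf8_decodes(data: bytes) -> bool:
    """True if the bytes pass strict UTF-8 decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The UTF-16 BOMs contain 0xFE and 0xFF, which never occur in valid UTF-8.
utf16_boms_rejected = (
    not utf8_decodes(codecs.BOM_UTF16_BE)
    and not utf8_decodes(codecs.BOM_UTF16_LE)
)

# The UTF-8 BOM decodes to the single code point U+FEFF.
utf8_bom_text = codecs.BOM_UTF8.decode("utf-8")

# ASCII-range text encoded as UTF-16 is still byte-wise valid UTF-8;
# it decodes to NUL-interleaved characters rather than raising.
nul_interleaved = "data_x".encode("utf-16-be").decode("utf-8")
```
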
>
>            If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>            OK, as I don't see that it is indicative of any problems beyond
>            misguided choice of text editor.
>
>            So I would advocate ignoring (and removing) UTF8-BOMs in the
>            input stream, and treating all other BOMs as syntax errors.
>            Individual applications may wish to give users the option of
>            interpreting U+FEFF as the deprecated ZWNBSP (and translating
>            to the correct character) on the understanding that if this
>            occurs outside a delimited string it will cause a syntax error.
>
>            James
>
>            PS am I the only one who thinks it unlikely that Wordpad users
>            would choose to use 'cat' to join file fragments together?
>
>            On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
>            <[email protected]> wrote:
>                 Allow me to clarify my position, so there is no
>                 misunderstanding:
>
>                 I believe that we will be dealing with a world with at
>                 least UTF-8 and UCS-2/UTF-16 encodings for many years to
>                 come.  I have no objection to CIF2 being specified solely
>                 in terms of UTF-8 for simplicity and consistency, but if
>                 we are to write software that people can use, we must have
>                 a reasonable position with respect to the encodings people
>                 use, and that means that, at the very least, we need to
>                 accept and process UTF-8 BOMs as harmless additional text.
>                 Some of us will also be supporting UCS-2/UTF-16 directly
>                 in our applications.  I don't mind if other applications
>                 are only going to support UTF-8, but since we are going to
>                 encounter UCS-2/UTF-16 for as long as we have Java and web
>                 browsers, we should do something sensible when a
>                 UCS-2/UTF-16 BOM pops up: either doing the internal
>                 translation if we so choose or, if a particular
>                 application does not wish to handle UCS-2/UTF-16, issuing
>                 a polite warning suggesting the use of an external
>                 translator.
>
>                 BOMs will almost always appear in modern UCS-2/UTF-16
>                 files, and when they are converted to UTF-8 that will give
>                 us yet another source of UTF-8 BOMs.  I believe the
>                 sensible thing to do is to recognize BOMs.
>
>                 Regards,
>                   Herbert
>
>            On Tue, 18 May 2010, Bollinger, John C wrote:
>
>            > Herbert Bernstein wrote:
>            >> Let me see if I understand this correctly -- a user takes 2
>            >> perfectly good CIF2 files, edits each to clean up, say, some
>            >> comments to keep straight where one begins and one ends,
>            >> using a well-designed modern text editor that happens to put
>            >> a BOM at the start of each file, concatenates the two files
>            >> with cat to ship them into the IUCr, and suddenly they have
>            >> a syntax error caused by a character that they cannot see!!!
>            >>
>            >> To me this seems pointless when it is trivial for software
>            >> to recognize the character and handle it sensibly.
>            >
>            > And that is my principal rationale for preferring that
>            > embedded U+FEFF be recognized as CIF whitespace.  With that
>            > approach, the concatenation of two well-formed CIF2 files is
>            > always a well-formed CIF2 file, regardless of the presence or
>            > absence of BOMs in the original files.  Note, too, that such
>            > concatenation cannot produce a mixed-encoding file because
>            > files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other
>            > encoding that can be distinguished from UTF-8 are not
>            > well-formed CIF2 files to start.  The file concatenation
>            > scenario thus does not provide a use case for the CIF2
>            > *specification* to recognize embedded U+FEFF as an encoding
>            > marker.
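
A toy illustration of the whitespace approach described above, assuming a deliberately naive tokenizer that only splits on whitespace (no quoted strings, comments or text fields) and that treats U+FEFF as one more whitespace character. The data names and the function name `tokens` are made up for the example.

```python
import re

# Whitespace for this sketch: space, tab, CR, LF, plus U+FEFF.
CIF_WS_WITH_BOM = re.compile("[ \t\r\n\ufeff]+")


def tokens(text: str) -> list[str]:
    """Split text into whitespace-separated tokens, dropping empty strings."""
    return [tok for tok in CIF_WS_WITH_BOM.split(text) if tok]


# Two files, each written by an editor that prepends a BOM.
file_a = "\ufeffdata_a\n_cell.length_a 5.0\n"
file_b = "\ufeffdata_b\n_cell.length_a 6.0\n"

# Concatenation tokenizes exactly as if no BOMs had been present.
concatenated = tokens(file_a + file_b)

# The flip side: an invisible U+FEFF inside a name now splits it in two.
split_name = tokens("_cell\ufeff.length_a 5.0")
```

The last line shows why treating U+FEFF as whitespace gives an invisible character syntactic meaning: a data name containing it silently becomes two tokens.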
>            >
>            > On the other hand, I again feel compelled to distinguish
>            > program behaviors from the CIF2 format specification.  None
>            > of the above would prevent a CIF processor from recognizing
>            > and handling CIF-like character streams encoded via schemes
>            > other than UTF-8, nor from recognizing embedded U+FEFF code
>            > sequences in various encodings as encoding switches, thereby
>            > handling mixed-encoding files.  Indeed, such a program or
>            > library would be invaluable for correcting encoding-related
>            > errors.  That does not, however, mean that such files must be
>            > considered well-formed CIF2, no matter how likely they may
>            > (or may not) be to arise.
>            >
>            >
>            > James Hester wrote:
>            >> I would be happy to call an embedded BOM a syntax error.
>            >
>            > In light of the possibility of U+FEFF appearing in a data
>            > value (for example, from cutting text from a Unicode
>            > manuscript and pasting it into a CIF), I need to refine my
>            > earlier blanket alternative of treating embedded U+FEFF as a
>            > syntax error.  I now think it would be ok to treat U+FEFF as
>            > a syntax error *provided* that it appears outside a delimited
>            > string.  That's still not my preference, though, and I feel
>            > confident that Herb will still disagree.
>            >
>            >
>            > Regards,
>            >
>            > John
>            > --
>            > John C. Bollinger, Ph.D.
>            > Computing and X-Ray Scientist
>            > Department of Structural Biology
>            > St. Jude Children's Research Hospital
>            > [email protected]
>            > (901) 595-3166 [office]
>            > www.stjude.org
>            >
>            > Email Disclaimer:  www.stjude.org/emaildisclaimer
>            >
>            > _______________________________________________
>            > ddlm-group mailing list
>            > [email protected]
>            > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>            --
>            T +61 (02) 9717 9907
>            F +61 (02) 9717 3145
>            M +61 (04) 0249 4148

