Re: [ddlm-group] UTF-8 BOM

Dear Colleagues,

   A UCS-2 message embedded in an email message normally carries
a BOM, but that begs the question -- it is normal practice to
switch encodings mid-stream, and, theory and abstractions aside,
we are definitely going to encounter embedded BOMs and, for that
matter, MIME-based switches in encodings in the course of
processing one stream of information.  If one prefers to call
such a multi-mode stream a CBF rather than calling it a CIF,
so be it, but such streams still have to be processed.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 1 Jun 2010, James Hester wrote:

> Hi Herbert and others,
> 
> As far as I can tell, BOMs have no semantic or parsing significance in the context of an email message, which was my point.  Encoding is switched
> using mime headers, as you mention, not using BOMs.  So, I don't see that either email or web standards offer support for the idea of using a BOM
> to switch encoding.  While I appreciate that being restricted to UTF-8 places some restrictions on imgCIF, it is considerably better than the
> situation that a lot of email still finds itself in, of being restricted to US-ASCII, and imgCBF is still available as an alternative.
> 
> So I would repeat my suggestion of
> 
> (1) ignoring a UTF-8 BOM where it is likely to be the result of concatenation (approximately, this means amongst whitespace)
> (2) raising a syntax error if the byte sequence could be either a BOM or a ZWNBSP (approximately, this means inside any
> dataname/value/datablock name/save frame name)
> (3) any other type of BOM remains a syntax error, as it is not UTF-8
> 
> I will be calling for a vote in a week or so, after giving everyone a bit more of a chance to make their voice heard.
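
The three rules above can be sketched in code.  This is only one possible
reading, not an agreed algorithm: the names drop_utf8_boms and
CifSyntaxError are hypothetical, CIF whitespace is assumed to be space,
tab, CR and LF, and "amongst whitespace" is interpreted as "at least one
neighbour is whitespace or the edge of the stream".

```python
BOM = "\ufeff"
CIF_WHITESPACE = " \t\r\n"  # assumption: the whitespace set used for rule (1)


class CifSyntaxError(ValueError):
    """Hypothetical error type for this sketch."""


def drop_utf8_boms(raw: bytes) -> str:
    # Rule (3): FE FF / FF FE are not valid UTF-8, so decoding fails.
    try:
        text = raw.decode("utf-8")  # plain "utf-8" keeps U+FEFF in the text
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"not valid UTF-8 at byte {exc.start}") from exc

    out = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch != BOM:
            out.append(ch)
            continue
        # Find the nearest following character that is not itself a BOM.
        j = i + 1
        while j < n and text[j] == BOM:
            j += 1
        before_ok = not out or out[-1] in CIF_WHITESPACE
        after_ok = j == n or text[j] in CIF_WHITESPACE
        if before_ok or after_ok:
            continue  # Rule (1): looks like concatenation debris, ignore it
        # Rule (2): U+FEFF with token characters on both sides
        raise CifSyntaxError(f"U+FEFF inside a token at character {i}")
    return "".join(out)
```

Under this reading, a BOM left between two concatenated files is dropped
silently, while one in the middle of a dataname is reported.
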
> 
> On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>       The extension we use is cbf, so the extension is not an issue.
>       A cbf might be a true ascii cif, or an imgCIF file with true
>       binary sections with or without compression or a UCS-2 file with
>       or without binutf sections with or without compression.
>
>       Clearly the cleanest case for binutf is when the entire file
>       starts out as UCS-2 and just continues that way, but because
>       the logic of imgCIF permits any mixture of the various types
>       of binary sections with any type of headers, there is no reason
>       to declare an error because of changes from, say, straight ASCII
>       to UCS-2 and back.
>
>       The most common place in which you will find a similar disdain for
>       requiring BOMs as the first glyph is in email messages because
>       a modern, multi-part email message is actually a concatenation
>       of multiple files of arbitrary types and encodings.  Now you could
>       make the argument that the email message is just a container for
>       those files and that each file carries its BOM at the front of
>       that (sub)file, and you would be right, but that is exactly how imgCIF
>       ends up in the same situation -- it is a container for multiple
>       headers and binary images and each binary image may be in a
>       different encoding (with different compression as well).  This
>       flexibility is not an accident -- it was a major intentional change
>       in imgCIF in 1997 from Andy Hammersley's original model of one ASCII header
>       and one binary image to a more CIF-like, order independent, approach
>       of allowing an arbitrary mixture of multiple headers and multiple
>       binary images.
>
>        From a programming point of view, once you live in a world of
>       multiple encodings, recognizing a BOM at the start of a file is
>       no different from recognizing it anywhere in a file.
>
>        In addition to email, another place in which changes of encoding
>       occur, albeit with a meta tag or Content-Type header rather than
>       with a BOM, is in web pages: in a page being displayed from frames,
>       a browser application has to be prepared to switch encodings on
>       every frame.
>
>        I understand how uncomfortable people can be with such flexibility
>       -- changing encodings mid-stream -- so just as we use the cbf
>       extension for all imgCIF files that are not pure ASCII right now,
>       I will use .cbf for CIF2 files that switch BOMs midstream, but I
>       will allow for switches in BOMs midstream.
>
>        Have you considered using .cf2 as the extension for CIF2 files?
>       In light of the decision to make CIF2 a maximally disruptive
>       change from CIF, confusion between CIF and CIF2 files would seem
>       to me a much more serious cause for concern than dealing with an
>       embedded BOM, which can, after all, be much more easily dealt with
>       automatically than the CIF2 changes.
>
>        Regards,
>          Herbert
> 
> On Wed, 26 May 2010, James Hester wrote:
>
>       Dear Herbert,
>
>       I don't believe the technique of using a BOM to switch encodings mid-stream
>       is widely supported either within this group, by Unicode decoding/encoding
>       libraries, or by standards documents.  For example, do any browsers support
>       switching the encoding of a webpage halfway through?  I think not. I'd be
>       happy to hear of a counterexample to this assertion, but assuming that such
>       switching is not likely to be supported, I'd like to hear what you think of
>       the following comments:
>
>       Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same
>       as compressing or otherwise transforming the original file.  Therefore, the
>       notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current CIF2
>       standard than the notion of a 'gzipped CIF2 file'.  Both files require some
>       operation to transform them to a CIF2 file.  Both files will lack the
>       required magic number at the front, and will cause CIF2 parsers to fail
>       dismally.  I would propose that, if you need UCS2 for efficiency or storage
>       reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
>       and make it clear external to the file contents that they will need to be
>       transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2
>       tools.  My main concern with this approach is that we avoid confusion
>       between a CIF2 file and an (re)encoded CIF2 file, because as soon as a CIF
>       reader or writer is unsure about what they are reading or writing, the
>       effectiveness of the standard is degraded.
>
>       I appreciate that this is not ideal from your point of view, and that you'd
>       like to be able to specify the encoding within the file itself.  For the
>       same reasons as discussed last year, I don't like that approach.
>
>       I do not understand your argument about an internal UCS BOM being not that
>       much of a big deal because the program logic is not complicated.  Ease of
>       programming is not really the issue here.  If a file is a
>       standards-compliant CIF2 file, it must not cause a syntax error when read by
>       a standards-compliant CIF2 reader (especially for a data transfer
>       protocol!!).  If a UCS2 BOM is allowed in a CIF2 file, then *all* readers
>       must be able to accept and understand it identically.
>
>       James.
>
>       On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
>       <yaya@bernstein-plus-sons.com> wrote:
>            Dear Colleagues,
>
>            James has said:
>
>                  So: why exactly is ignoring a BOM a problem?  If the embedded
>                  BOM is the leading BOM from a UTF16 file that has been naively
>                  concatenated, it will have bytes 0xFE 0xFF.  This byte sequence
>                  (and the reverse) is not acceptable UTF8, leading to a decoding
>                  error from the UTF8 decoding step.  The subsequent bytes will be
>                  UTF16, which should cause a decoding failure in any case.  So I
>                  deduce that we are simply discussing how to treat a UTF8 BOM,
>                  which can only find its way into a CIF file by naive
>                  concatenation of UTF8-encoded files written by certain programs.
>
>                  If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>                  OK, as I don't see that it is indicative of any problems
>                  beyond misguided choice of text editor.
>
>                  So I would advocate ignoring (and removing) UTF8-BOMs in the
>                  input stream, and treating all other BOMs as syntax errors.
>                  Individual applications may wish to give users the option of
>                  interpreting U+FEFF as the deprecated ZWNBSP (and translating
>                  to the correct character) on the understanding that if this
>                  occurs outside a delimited string it will cause a syntax
>                  error.
> 
>
>       I propose something slightly different, which will amount to what James
>       is proposing for applications that wish to handle only UTF8, but which
>       will be essential for applications that have to work with a wider range
>       of encodings (e.g. imgCIF applications).
>
>       There are three highly likely BOMs that may be encountered at any point
>       in a byte stream in a Unicode world:
>
>       The UTF-8 BOM:  EF BB BF
>       The UTF-16 big-endian BOM:  FE FF
>       The UTF-16 little-endian BOM:  FF FE
>
>       For a UTF-8 application, the sequence EF BB BF is, as James suggests,
>       simply something to accept and ignore, with processing continuing
>       normally without comment.  Again, as James suggests, for UTF-8-only
>       applications the other 2 BOMs are invalid characters to treat as an
>       error.
>
>       However, for an application able to work with a wider range of encodings,
>       the other two BOMs are just what it needs to decide how to handle the
>       remainder of the stream.
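
As a rough illustration of using a BOM to decide how to handle the
remainder of a stream, here is a minimal sketch.  The function names
sniff_bom and decode_after_bom and the utf8_only flag are hypothetical;
only the three BOMs listed above are recognised (note that FF FE is also
the start of the UTF-32 little-endian BOM, which this sketch does not
try to distinguish).

```python
# The three BOMs listed above, paired with Python codec names.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def sniff_bom(data: bytes, pos: int = 0):
    """Return (encoding, bom_length) if a known BOM starts at pos, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom, pos):
            return encoding, len(bom)
    return None


def decode_after_bom(data: bytes, pos: int = 0, default: str = "utf-8",
                     utf8_only: bool = False) -> str:
    """Decode data[pos:] using the encoding announced by a BOM at pos, if any."""
    hit = sniff_bom(data, pos)
    if hit is None:
        return data[pos:].decode(default)
    encoding, width = hit
    if utf8_only and encoding != "utf-8":
        # A UTF-8-only application "chops off" the UTF-16 branch.
        raise ValueError(f"{encoding} BOM found but only UTF-8 is supported")
    return data[pos + width:].decode(encoding)
```

A UTF-8-only reader would call decode_after_bom(..., utf8_only=True); a
reader that accepts the wider range of encodings would leave the flag off.
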
>
>       Now that we have settled the case-sensitivity issue in a normalized
>       unicode context, the recognition of BOMs in this manner imposes no
>       particular additional burden on applications.  All applications will
>       have to have utilities to assemble UTF-8 character sequences into
>       Unicode code points either as 16 bit, or, better, 32 bit integers,
>       so this is just a perfectly normal and in most cases already coded
>       branch point in that logic.  If the application wishes to only be
>       UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16
>       streams.  For what I have to do in my applications, I will simply
>       accept the output of that branch -- in terms of code points for text
>       I won't be able to tell the difference among the three possible
>       streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary
>       data I have to handle for imgCIF, things will work.  Certainly, for
>       interchange with applications that only handle UTF-8, I will write
>       the 50% expanded UTF-8 encodings of the same binaries, but for
>       performance limited data collections, I will write out UCS-2/UTF-16
>       files.
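
The "50% expanded" figure can be checked with a little arithmetic, under
the assumption (not spelled out above) that the binary data is carried
as code points in the range U+0800 to U+FFFF, each of which takes 2 bytes
in UCS-2/UTF-16 and 3 bytes in UTF-8.  The function name encoded_sizes
and the sample payload are purely illustrative and are not the actual
bin-utf mapping.

```python
def encoded_sizes(text: str):
    """Return (UTF-16 size, UTF-8 size) in bytes, without any BOM."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


# Illustrative payload: 1000 code points starting at U+4E00, all of them
# in the Basic Multilingual Plane and above U+07FF.
payload = "".join(chr(0x4E00 + i) for i in range(1000))
utf16_bytes, utf8_bytes = encoded_sizes(payload)
expansion = utf8_bytes / utf16_bytes  # 3000 / 2000 = 1.5
```

Code points below U+0800 would take fewer UTF-8 bytes, so the real
overhead depends on how the binary values are mapped.
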
>
>        Nobody is hurt by what I am proposing and CIF2 will see wider
>       application this way.  Alternatively, if it is unacceptable for a
>       format that meets the needs of imgCIF to be labelled CIF, we can
>       always go back to calling it imgNCIF (N for "not") as we had to in
>       1997 until we called a truce and decided to accept the realities of
>       modern macromolecular data acquisition.
>
>        Regards,
>          Herbert
>
>
>       On Mon, 24 May 2010, James Hester wrote:
>
>            To run through the alternatives and some of the arguments so far:
>
>            (i) treating an embedded BOM as an ordinary character runs against
>            the Unicode recommendations.  If we wish our standard to be
>            respected, I think we should at least respect other standards and
>            the thinking that has gone into them
>
>            (ii) treating an embedded BOM as whitespace is OK with the Unicode
>            standard, but means that a non-ASCII character now has syntactic
>            meaning in the CIF.  I think this would be completely inconsistent
>            on our part, as an invisible character (when displayed) can
>            actually be used to delimit strings.  This is my least preferred
>            solution, as it goes against the human-readability expected of CIFs
>
>            (iii) ignoring embedded BOMs is bad because they can be a 'tip off
>            to a serious problem'.
>
>            (iv) treating embedded BOMs as syntax errors will cause issues when
>            CIF2 files are naively concatenated
>
>            I think the only viable alternatives are to choose (iii) or (iv).
>
>            So: why exactly is ignoring a BOM a problem?  If the embedded BOM
>            is the leading BOM from a UTF16 file that has been naively
>            concatenated, it will have bytes 0xFE 0xFF.  This byte sequence
>            (and the reverse) is not acceptable UTF8, leading to a decoding
>            error from the UTF8 decoding step.  The subsequent bytes will be
>            UTF16, which should cause a decoding failure in any case.  So I
>            deduce that we are simply discussing how to treat a UTF8 BOM, which
>            can only find its way into a CIF file by naive concatenation of
>            UTF8-encoded files written by certain programs.
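
The decoding behaviour described above can be checked directly with a
strict UTF-8 decoder.  This is a small sketch; the helper name
utf8_error_offset and the sample file contents are hypothetical.  One
caveat worth noting: the UTF-16 bytes of pure ASCII text are themselves
valid UTF-8 (ASCII bytes interleaved with NULs), so in that case it is
the BOM bytes, not the text that follows them, that trigger the failure.

```python
def utf8_error_offset(raw: bytes):
    """Return the byte offset of the first UTF-8 decoding error, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


first = b"data_a\n"  # a 7-byte UTF-8 file

# Naive concatenation with a UTF-16 file that starts with its BOM.
with_utf16_be = first + b"\xfe\xff" + "data_b\n".encode("utf-16-be")
with_utf16_le = first + b"\xff\xfe" + "data_b\n".encode("utf-16-le")

# Naive concatenation with a UTF-8 file that starts with a UTF-8 BOM.
with_utf8_bom = first + b"\xef\xbb\xbf" + b"data_b\n"
```

Here utf8_error_offset reports offset 7 (the first BOM byte) for both
UTF-16 cases and None for the UTF-8 BOM case, which decodes cleanly with
an embedded U+FEFF.
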
>
>            If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK,
>            as I don't see that it is indicative of any problems beyond
>            misguided choice of text editor.
>
>            So I would advocate ignoring (and removing) UTF8-BOMs in the input
>            stream, and treating all other BOMs as syntax errors.  Individual
>            applications may wish to give users the option of interpreting
>            U+FEFF as the deprecated ZWNBSP (and translating to the correct
>            character) on the understanding that if this occurs outside a
>            delimited string it will cause a syntax error.
>
>            James
>
>            PS am I the only one who thinks it unlikely that Wordpad users
>            would choose to use 'cat' to join file fragments together?
>
>            On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
>            <yaya@bernstein-plus-sons.com> wrote:
>                 Allow me to clarify my position, so there is no
>                 misunderstanding:
>
>                 I believe that we will be dealing with a world with at least
>                 UTF-8 and UCS-2/UTF-16 encodings for many years to come.  I
>                 have no objection to CIF2 being specified solely in terms of
>                 UTF-8 for simplicity and consistency, but if we are to write
>                 software that people can use, we must have a reasonable
>                 position with respect to the encodings people use, and that
>                 means that, at the very least, we need to accept and process
>                 UTF-8 BOMs as harmless additional text.  Some of us will also
>                 be supporting UCS-2/UTF-16 directly in our applications.  I
>                 don't mind if other applications are only going to support
>                 UTF-8, but inasmuch as, as long as we have Java and web
>                 browsers, we are going to encounter UCS-2/UTF-16, we should do
>                 something sensible when a UCS-2/UTF-16 BOM pops up: either
>                 doing the internal translation if we so choose, or, if that is
>                 not handled by a particular application, issuing a polite
>                 warning suggesting the use of an external translator.
>
>                 BOMs will almost always appear in modern UCS-2/UTF-16 files,
>                 and when they are converted to UTF-8 that will give us yet
>                 another source of UTF-8 BOMs.  I believe the sensible thing to
>                 do is to recognize BOMs.
>
>                 Regards,
>                     Herbert
>
>            On Tue, 18 May 2010, Bollinger, John C wrote:
>
>            > Herbert Bernstein wrote:
>            >> Let me see if I understand this correctly -- a user takes 2
>            >> perfectly good CIF2 files, edits each to clean up, say, some
>            >> comments to keep straight where one begins and one ends, using
>            >> a well-designed modern text editor that happens to put a BOM at
>            >> the start of each file, concatenates the two files with cat to
>            >> ship them into the IUCr, and suddenly they have a syntax error
>            >> caused by a character that they cannot see!!!
>            >>
>            >> To me this seems pointless when it is trivial for software to
>            >> recognize the character and handle it sensibly.
>            >
>            > And that is my principal rationale for preferring that embedded
>            > U+FEFF be recognized as CIF whitespace.  With that approach, the
>            > concatenation of two well-formed CIF2 files is always a
>            > well-formed CIF2 file, regardless of the presence or absence of
>            > BOMs in the original files.  Note, too, that such concatenation
>            > cannot produce a mixed-encoding file because files encoded in
>            > UTF-16[BE|LE], UTF-32[BE|LE], or any other encoding that can be
>            > distinguished from UTF-8 are not well-formed CIF2 files to start.
>            > The file concatenation scenario thus does not provide a use case
>            > for the CIF2 *specification* to recognize embedded U+FEFF as an
>            > encoding marker.
>            >
>            > On the other hand, I again feel compelled to distinguish program
>            > behaviors from the CIF2 format specification.  None of the above
>            > would prevent a CIF processor from recognizing and handling
>            > CIF-like character streams encoded via schemes other than UTF-8,
>            > nor from recognizing embedded U+FEFF code sequences in various
>            > encodings as encoding switches, thereby handling mixed-encoding
>            > files.  Indeed, such a program or library would be invaluable for
>            > correcting encoding-related errors.  That does not, however, mean
>            > that such files must be considered well-formed CIF2, no matter
>            > how likely they may (or may not) be to arise.
>            >
>            >
>            > James Hester wrote:
>            >> I would be happy to call an embedded BOM a syntax error.
>            >
>            > In light of the possibility of U+FEFF appearing in a data value
>            > (for example, from cutting text from a Unicode manuscript and
>            > pasting it into a CIF), I need to refine my earlier blanket
>            > alternative of treating embedded U+FEFF as a syntax error.  I now
>            > think it would be ok to treat U+FEFF as a syntax error *provided*
>            > that it appears outside a delimited string.  That's still not my
>            > preference, though, and I feel confident that Herb will still
>            > disagree.
>            >
>            >
>            > Regards,
>            >
>            > John
>            > --
>            > John C. Bollinger, Ph.D.
>            > Computing and X-Ray Scientist
>            > Department of Structural Biology
>            > St. Jude Children's Research Hospital
>            > John.Bollinger@StJude.org
>            > (901) 595-3166 [office]
>            > www.stjude.org
>            >
>            >
>            >
>            > Email Disclaimer:  www.stjude.org/emaildisclaimer
>            >
>            > _______________________________________________
>            > ddlm-group mailing list
>            > ddlm-group@iucr.org
>            > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>            >
> 
> 
> 
>
>            --
>            T +61 (02) 9717 9907
>            F +61 (02) 9717 3145
>            M +61 (04) 0249 4148
> 
>
> 
> 
> 
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
