
Re: [ddlm-group] UTF-8 BOM

P.S.  There seems to be some confusion in the discussion on this point
between the input byte stream and the sequence of Unicode code points.
If we have a sequence of Unicode code points presented as a UTF-8
encoded input byte stream and we encounter the byte sequence FE FF or
the byte sequence FF FE, neither one can be part of a validly encoded
code point.  The lead bytes of well-formed UTF-8 sequences in the
input byte stream are:

   00-7F  (single-byte sequences, i.e. ASCII)
   C2-DF  (lead bytes of two-byte sequences)
   E0-EF  (lead bytes of three-byte sequences)
   F0-F4  (lead bytes of four-byte sequences)

with continuation bytes in the range 80-BF.  The bytes C0-C1 and
F5-FD are excluded as ill-formed, and FE and FF can never appear
anywhere in well-formed UTF-8.

So, if your UTF-8-only aware application sees a UCS-2/UTF-16
BOM, there will be no confusion: you will report an error.
And if a UTF-8/UCS-2/UTF-16 aware application sees a UCS-2/UTF-16
BOM, it will know to use the correct encoding.
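The byte-level distinction can be sketched as follows; this is a hedged illustration in Python, and the function name classify_bom is mine, not part of any CIF toolkit:

```python
# A minimal sketch of classifying the three common BOMs at the
# raw-byte level (names are mine, not from any CIF library).
def classify_bom(data: bytes) -> str:
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    return "none"

# FE and FF can never begin a well-formed UTF-8 sequence, so a strict
# UTF-8 decoder reports an error instead of silently misreading a
# UTF-16 BOM:
try:
    b"\xfe\xff".decode("utf-8")
except UnicodeDecodeError:
    print("pure UTF-8 application: report the error here")
```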

Now consider the other case, of a UCS-2/UTF-16 aware application
that encounters a UTF-8 BOM.  It will see it either as

U+EFBB U+BFxx or
U+BBEF U+xxBF

depending on the endianness it is working in.  The first case is
no problem -- U+EFBB lies in the Private Use Area, with no assigned
character, so the UTF-8 BOM can be unambiguously recognized in a
big-endian UCS-2/UTF-16 input byte stream.  But for a little-endian
UCS-2/UTF-16 input byte stream, U+BBEF is an assigned character (a
Hangul syllable), so the UTF-8 BOM would not be recognized as such.

Therefore, for my applications, once you are in UTF-16-LE, if
you wish to switch back to UTF-8, you first need to switch
to UTF-16-BE and then to UTF-8.
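The asymmetry between the two byte orders can be checked directly. A small illustration using Python's standard unicodedata module; the trailing padding byte 0x00 is an arbitrary choice of mine:

```python
import unicodedata

# The UTF-8 BOM bytes followed by an arbitrary padding byte, decoded
# both ways as UTF-16.
data = b"\xef\xbb\xbf\x00"
be = data.decode("utf-16-be")  # code units 0xEFBB, 0xBF00
le = data.decode("utf-16-le")  # code units 0xBBEF, 0x00BF

# Big-endian: U+EFBB lies in the Private Use Area, so it has no name in
# the Unicode Character Database and a reader can treat it as suspect.
print(unicodedata.name(be[0], "<unassigned / private use>"))

# Little-endian: U+BBEF is an assigned Hangul syllable, indistinguishable
# from ordinary text, so the UTF-8 BOM cannot be flagged reliably.
print(unicodedata.name(le[0]))  # an assigned HANGUL SYLLABLE name
```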

Summary:

UTF-8 BOM encountered in a UTF-8 encoded input byte stream:
no problem -- just ignore it.

UTF-16-BE/LE BOM encountered in a UTF-8 encoded input byte stream:
no problem -- just declare an error if you are in a pure UTF-8
application, or switch encodings if you are in an application
with that capability.

UTF-8 BOM encountered in a UTF-16-BE encoded input byte stream:
no problem -- just declare an error if you are in a pure UTF-16-BE
application, or switch encodings if you are in an application
with that capability.

UTF-8 BOM encountered in a UTF-16-LE encoded input byte stream:
possible problem -- you may wish to issue a warning so that, if
the user intended a switch, they could switch to UTF-16-BE
before switching to UTF-8.

Note that, if you are going to stick to pure UTF-8 applications,
you have no problems with any of this.
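The four summary cases can be collected into one dispatch routine. This is a hypothetical sketch; the function name, action strings, and the multi_encoding flag are all mine, not drawn from any existing CIF implementation:

```python
def dispatch(data: bytes, current: str, multi_encoding: bool) -> tuple[str, str]:
    """Return (action, encoding) for a BOM seen at the current position.

    `current` is the encoding currently in force; `multi_encoding` says
    whether this application can switch encodings mid-stream.
    """
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 BOM
        if current == "utf-8":
            return ("ignore", current)    # case 1: harmless, skip it
        if current == "utf-16-be":
            # case 3: U+EFBB is unassigned, so detection is unambiguous
            return ("switch", "utf-8") if multi_encoding else ("error", current)
        # case 4 (utf-16-le): the bytes could be ordinary Hangul text,
        # so the best we can do is warn
        return ("warn", current)
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):  # UTF-16 BOMs
        if current == "utf-8":
            # case 2: invalid UTF-8, so either an error or a switch
            enc = "utf-16-be" if data[:2] == b"\xfe\xff" else "utf-16-le"
            return ("switch", enc) if multi_encoding else ("error", current)
    return ("none", current)
```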
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 26 May 2010, Herbert J. Bernstein wrote:

> The extension we use is cbf, so the extension is not an issue.
> A cbf might be a true ascii cif, or an imgCIF file with true
> binary sections with or without compression or a UCS-2 file with
> or without binutf sections with or without compression.
>
> Clearly the cleanest case for binutf is when the entire file
> starts out as UCS-2 and just continues that way, but because
> the logic of imgCIF permits any mixture of the various types
> of binary sections with any type of headers, there is no reason
> to declare an error because of changes from, say, straight ASCII
> to UCS-2 and back.
>
> The most common place in which you will find a similar disdain for
> requiring BOMs as the first glyph is in email messages because
> a modern, multi-part email message is actually a concatenation
> of multiple files of arbitrary types and encodings.  Now you could
> make the argument that the email message is just a container for
> those files and that each file carries its BOM at the front of
> that (sub)file, and you would be right, but that is exactly how imgCIF
> ends up in the same situation -- it is a container for multiple
> headers and binary images and each binary image may be in a
> different encoding (with different compression as well).  This
> flexibility is not an accident -- it was a major intentional change
> in imgCIF in 1997 from Andy Hammersley's original model of one ASCII header
> and one binary image to a more CIF-like, order independent, approach
> of allowing an arbitrary mixture of multiple headers and multiple
> binary images.
>
>  From a programming point of view, once you live in a world of
> multiple encodings, recognizing a BOM at the start of a file is
> no different from recognizing it anywhere in a file.
>
>  In addition to email, another place in which changes of encoding
> occur, albeit signalled by a meta tag or Content-Type header rather
> than by a BOM, is in web pages: for a page displayed using frames,
> a browser application has to be prepared to switch encodings on
> every frame.
>
>  I understand how uncomfortable people can be with such flexibility
> -- changing encodings mid-stream -- so just as we use the cbf
> extension for all imgCIF files that are not pure ASCII right now,
> I will use .cbf for CIF2 files that switch BOMs midstream, but I
> will allow for switches in BOMs midstream.
>
>  Have you considered using .cf2 as the extension for CIF2 files?
> In light of the decision to make CIF2 a maximally disruptive
> change from CIF, confusion between CIF and CIF2 files would seem to me
> a much more serious cause for concern than dealing with an embedded BOM
> which can, after all, be much more easily dealt with automatically than
> the CIF2 changes.
>
>  Regards,
>    Herbert
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Wed, 26 May 2010, James Hester wrote:
>
>> Dear Herbert,
>> 
>> I don't believe the technique of using a BOM to switch encodings mid-stream
>> is widely supported either within this group, by Unicode decoding/encoding
>> libraries, or by standards documents.  For example, do any browsers support
>> switching the encoding of a webpage halfway through?  I think not. I'd be
>> happy to hear of a counterexample to this assertion, but assuming that such
>> switching is not likely to be supported, I'd like to hear what you think of
>> the following comments:
>> 
>> Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same
>> as compressing or otherwise transforming the original file.  Therefore, the
>> notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current 
>> CIF2
>> standard than the notion of a 'gzipped CIF2 file'.  Both files require some
>> operation to transform them to a CIF2 file.  Both files will lack the
>> required magic number at the front, and will cause CIF2 parsers to fail
>> dismally.  I would propose that, if you need UCS2 for efficiency or storage
>> reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
>> and make it clear external to the file contents that they will need to be
>> transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2
>> tools.  My main concern with this approach is that we avoid confusion
>> between a CIF2 file and an (re)encoded CIF2 file, because as soon as a CIF
>> reader or writer is unsure about what they are reading or writing, the
>> effectiveness of the standard is degraded.
>> 
>> I appreciate that this is not ideal from your point of view, and that you'd
>> like to be able to specify the encoding within the file itself.  For the
>> same reasons as discussed last year, I don't like that approach.
>> 
>> I do not understand your argument about an internal UCS BOM being not that
>> much of a big deal because the program logic is not complicated.  Ease of
>> programming is not really the issue here.  If a file is a
>> standards-compliant CIF2 file, it must not cause a syntax error when read 
>> by
>> a standards-compliant CIF2 reader (especially for a data transfer
>> protocol!!).  If a UCS2 BOM is allowed in a CIF2 file, then *all* readers
>> must be able to accept and understand it identically.
>> 
>> James.
>> 
>> On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>       Dear Colleagues,
>>
>>       James has said:
>>
>>             So: why exactly is ignoring a BOM a problem?  If the
>>             embedded BOM is the
>>             leading BOM from a UTF16 file that has been naively
>>             concatenated, it will
>>             have bytes 0xFE 0xFF.  This byte sequence (and the
>>             reverse) is not
>>             acceptable UTF8, leading to a decoding error from
>>             the UTF8 decoding step. 
>>             The subsequent bytes will be UTF16, which should
>>             cause a decoding failure in
>>             any case.   So I deduce that we are simply
>>             discussing how to treat a UTF8
>>             BOM, which can only find its way into a CIF file by
>>             naive concatenation of
>>             UTF8-encoded files written by certain programs.
>>
>>             If the embedded BOM is a UTF-8 BOM, then ignoring it
>>             would be OK, as I don't
>>             see that it is indicative of any problems beyond
>>             misguided choice of text
>>             editor.
>>
>>             So I would advocate ignoring (and removing)
>>             UTF8-BOMs in the input stream,
>>             and treating all other BOMs as syntax errors. 
>>             Individual applications may
>>             wish to give users the option of interpreting U+FEFF
>>             as the deprecated ZWNBSP
>>             (and translating to the correct character) on the
>>             understanding that if this
>>             occurs outside a delimited string it will cause a
>>             syntax error.
>> 
>> 
>> I propose something slightly different, which will amount to what
>> James
>> is proposing for applications that wish to handle only UTF8, but which
>> will be essential for applications that have to work with a wider
>> range
>> of encodings (e.g. imgCIF applications).
>> 
>> There are three highly likely BOMs that may be encountered at any
>> point
>> in a byte stream in a Unicode world:
>> 
>> The UTF-8 BOM:  EF BB BF
>> The UTF-16 big-endian BOM:  FE FF
>> The UTF-16 little-endian BOM FF FE
>> 
>> For a UTF-8 application, the sequence EF BB BF is, as James
>> suggests,
>> simply something to accept and ignore, with processing continuing
>> normally without comment.  Again, as James suggests, for a UTF-8-only
>> application the other 2 BOMs are invalid characters to treat as an
>> error.
>> 
>> However, for an application able to work with a wider range of
>> encodings,
>> the other two BOMs are just what it needs to decide how to handle the
>> remainder of the stream.
>> 
>> Now that we have settled the case-sensitivity issue in a normalized
>> unicode context, the recognition of BOMs in this manner imposes no
>> particular additional burden on applications.  All applications will
>> have to have utilities to assemble UTF-8 character sequences into
>> Unicode code points either as 16 bit, or, better, 32 bit integers,
>> so this is just a perfectly normal and in most cases already coded
>> branch point in that logic.  If the application wishes to only be
>> UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16
>> streams.  For what I have to do in my applications, I will simply
>> accept the output of that branch -- in terms of code points for text
>> I won't be able to tell the difference among the three possible
>> streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary
>> data I have to handle for imgCIF, things will work.  Certainly, for
>> interchange with applications that only handle UTF-8, I will write
>> the 50% expanded UTF-8 encodings of the same binaries, but for
>> performance limited data collections, I will write out UCS-2/UTF-16
>> files.
>> 
>>  Nobody is hurt by what I am proposing and CIF2 will see wider
>> application this way.  Alternatively, if the needs of imgCIF are
>> unacceptable to be labelled CIF, we can always go back to
>> calling it imgNCIF (N for "not") as we had to in 1997 until we
>> called a truce and decided to accept the realities of modern
>> macromolecular data acquisition.
>> 
>>  Regards,
>>    Herbert
>> 
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>> 
>>                 +1-631-244-3035
>>                 yaya@dowling.edu
>> =====================================================
>> 
>> On Mon, 24 May 2010, James Hester wrote:
>>
>>       To run through the alternatives and some of the arguments
>>       so far:
>>
>>       (i) treating an embedded BOM as an ordinary character runs
>>       against the
>>       Unicode recommendations.  If we wish our standard to be
>>       respected, I think
>>       we should at least respect other standards and the
>>       thinking that has gone
>>       into them
>>
>>       (ii) treating an embedded BOM as whitespace is OK with the
>>       Unicode standard,
>>       but means that a non-ASCII character now has syntactic
>>       meaning in the CIF. 
>>       I think this would be completely inconsistent on our part,
>>       as an invisible
>>       character (when displayed) can actually be used to delimit
>>       strings.  This is
>>       my least preferred solution, as it goes against the
>>       human-readability
>>       expected of CIFs
>>
>>       (iii) ignoring embedded BOMs is bad because they can be a
>>       'tip off to a
>>       serious problem'.
>>
>>       (iv) treating embedded BOMs as syntax errors will cause
>>       issues when CIF2
>>       files are naively concatenated
>>
>>       I think the only viable alternatives are to choose (iii)
>>       or (iv).
>>
>>       So: why exactly is ignoring a BOM a problem?  If the
>>       embedded BOM is the
>>       leading BOM from a UTF16 file that has been naively
>>       concatenated, it will
>>       have bytes 0xFE 0xFF.  This byte sequence (and the
>>       reverse) is not
>>       acceptable UTF8, leading to a decoding error from the UTF8
>>       decoding step. 
>>       The subsequent bytes will be UTF16, which should cause a
>>       decoding failure in
>>       any case.   So I deduce that we are simply discussing how
>>       to treat a UTF8
>>       BOM, which can only find its way into a CIF file by naive
>>       concatenation of
>>       UTF8-encoded files written by certain programs.
>>
>>       If the embedded BOM is a UTF-8 BOM, then ignoring it would
>>       be OK, as I don't
>>       see that it is indicative of any problems beyond misguided
>>       choice of text
>>       editor.
>>
>>       So I would advocate ignoring (and removing) UTF8-BOMs in
>>       the input stream,
>>       and treating all other BOMs as syntax errors.  Individual
>>       applications may
>>       wish to give users the option of interpreting U+FEFF as
>>       the deprecated ZWNBSP
>>       (and translating to the correct character) on the
>>       understanding that if this
>>       occurs outside a delimited string it will cause a syntax
>>       error.
>>
>>       James
>>
>>       PS am I the only one who thinks it unlikely that Wordpad
>>       users would choose
>>       to use 'cat' to join file fragments together?
>>
>>       On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
>>       <yaya@bernstein-plus-sons.com> wrote:
>>            Allow me to clarify my position, so there is no
>>            misunderstanding:
>>
>>            I believe that we will be dealing with a world with
>>       at least
>>            UTF-8
>>            and UCS-2/UTF-16 encodings for many years to come.  I
>>       have no
>>            objection to CIF2 being specified solely in terms of
>>       UTF-8 for
>>            simplicity and consistency, but if we are to write
>>       software that
>>            people can use, we must have a reasonable position
>>       with respect
>>            to the encodings people use, and that means that, at
>>       the very
>>            least, we need to accept and process UTF-8 BOMs as
>>       harmless
>>            additional text.  Some of us will also be supporting
>>            UCS-2/UTF-16
>>            directly in our applications.  I don't mind if other
>>            applications
>>            are only going to support UTF-8, but inasmuch as, as
>>       long as
>>            we have java and web browsers, we are going to
>>       encounter
>>            UCS-2/UTF-16,
>>            we should do something sensible when a UCS-2/UTF-16
>>       BOM pops up,
>>            either doing the internal translation if we so
>>       choose, or, if
>>            that
>>            is not handled by a particular application, issuing a
>>       polite
>>            warning
>>            suggesting the use of an external translator if the
>>       application
>>            does
>>            not wish to handle UCS-2/UTF-16.
>>
>>            BOMS will almost always appear in modern UCS-2/UTF-16
>>       files, and
>>            when
>>            they are converted to UTF-8 that will give us yet
>>       another source
>>            of
>>            UTF-8 BOMs.  I believe the sensible thing to do is to
>>       recognize
>>            BOMs.
>>
>>            Regards,
>>                Herbert
>>            =====================================================
>>             Herbert J. Bernstein, Professor of Computer Science
>>               Dowling College, Kramer Science Center, KSC 121
>>                    Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                             +1-631-244-3035
>>                             yaya@dowling.edu
>>            =====================================================
>>
>>       On Tue, 18 May 2010, Bollinger, John C wrote:
>>
>>       > Herbert Bernstein wrote:
>>       >> Let me see if I understand this correctly -- a user
>>       takes 2
>>       perfectly good
>>       >> CIF2 files, edits each to clean up, say, some comments
>>       to keep
>>       straight where
>>       >> one begins and one ends, using a well-designed modern
>>       text editor
>>       that
>>       >> happens to put a BOM at the start of each file,
>>       concatenates the
>>       two files
>>       >> with cat to ship them into the IUCr, and suddenly they
>>       have a
>>       syntax error
>>       >> caused by a character that they cannot see!!!
>>       >>
>>       >> To me this seems pointless when it is trivial for
>>       software to
>>       recognize the
>>       >> character and handle it sensibly.
>>       >
>>       > And that is my principal rationale for preferring that
>>       embedded
>>       U+FEFF be recognized as CIF whitespace.  With that
>>       approach, the
>>       concatenation of two well-formed CIF2 files is always a
>>       well-formed
>>       CIF2 file, regardless of the presence or absence of BOMs
>>       in the
>>       original files.  Note, too, that such concatenation cannot
>>       produce a
>>       mixed-encoding file because files encoded in
>>       UTF-16[BE|LE],
>>       UTF-32[BE|LE], or any other encoding that can be
>>       distinguished from
>>       UTF-8 are not well-formed CIF2 files to start.  The file
>>       concatenation
>>       scenario thus does not provide a use case for the CIF2
>>       *specification*
>>       to recognize embedded U+FEFF as an encoding marker.
>>       >
>>       > On the other hand, I again feel compelled to distinguish
>>       program
>>       behaviors from the CIF2 format specification.  None of the
>>       above would
>>       prevent a CIF processor from recognizing and handling
>>       CIF-like
>>       character streams encoded via schemes other than UTF-8,
>>       nor from
>>       recognizing embedded U+FEFF code sequences in various
>>       encodings as
>>       encoding switches, thereby handling mixed-encoding files.
>>        Indeed,
>>       such a program or library would be invaluable for
>>       correcting
>>       encoding-related errors.  That does not, however, mean
>>       that such files
>>       must be considered well-formed CIF2, no matter how likely
>>       they may (or
>>       may not) be to arise.
>>       >
>>       >
>>       > James Hester wrote:
>>       >> I would be happy to call an embedded BOM a syntax
>>       error.
>>       >
>>       > In light of the possibility of U+FEFF appearing in a
>>       data value (for
>>       example, from cutting text from a Unicode manuscript and
>>       pasting it
>>       into a CIF), I need to refine my earlier blanket
>>       alternative of
>>       treating embedded U+FEFF as a syntax error.  I now think
>>       it would be
>>       ok to treat U+FEFF as a syntax error *provided* that it
>>       appears
>>       outside a delimited string.  That's still not my
>>       preference, though,
>>       and I feel confident that Herb will still disagree.
>>       >
>>       >
>>       > Regards,
>>       >
>>       > John
>>       > --
>>       > John C. Bollinger, Ph.D.
>>       > Computing and X-Ray Scientist
>>       > Department of Structural Biology
>>       > St. Jude Children's Research Hospital
>>       > John.Bollinger@StJude.org
>>       > (901) 595-3166 [office]
>>       > www.stjude.org
>>       >
>>       >
>>       >
>>       > Email Disclaimer:  www.stjude.org/emaildisclaimer
>>       >
>> 
>> 
>> 
>>
>>       --
>>       T +61 (02) 9717 9907
>>       F +61 (02) 9717 3145
>>       M +61 (04) 0249 4148
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> 
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
