Discussion List Archives

Re: [ddlm-group] UTF-8 BOM

Hi Herbert and others,

As far as I can tell, BOMs have no semantic or parsing significance in the context of an email message, which was my point.  Encoding is switched using MIME headers, as you mention, not using BOMs.  So I don't see that either email or web standards offer support for the idea of using a BOM to switch encoding.  While I appreciate that being limited to UTF-8 places some restrictions on imgCIF, it is considerably better than the situation that a lot of email still finds itself in, of being restricted to US-ASCII, and imgCBF is still available as an alternative.
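
To illustrate the point about MIME headers, here is a minimal sketch (not taken from any CIF tool) that uses Python's standard `email` package. The message text, the boundary string "XX" and the helper name `decode_parts` are made up for the example. Each part is decoded according to the charset declared in its own Content-Type header, and no BOM is consulted anywhere.

```python
import base64
from email import message_from_string

# Hypothetical two-part message: one US-ASCII part, one UTF-16LE part (base64).
utf16_body = base64.b64encode("caf\u00e9 in UTF-16\n".encode("utf-16-le")).decode("ascii")

RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "plain ASCII part\n"
    "--XX\n"
    'Content-Type: text/plain; charset="utf-16le"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    f"{utf16_body}\n"
    "--XX--\n"
)


def decode_parts(raw):
    """Return (charset, text) for every leaf part, using only the MIME headers."""
    msg = message_from_string(raw)
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # the container itself carries no text
        charset = part.get_content_charset() or "us-ascii"
        payload = part.get_payload(decode=True)  # undoes base64, yields bytes
        texts.append((charset, payload.decode(charset)))
    return texts


for charset, text in decode_parts(RAW):
    print(charset, repr(text))
```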

So I would repeat my suggestion of

(1) ignoring UTF8 BOM where it is likely to be the result of concatenation (approximately, this means amongst whitespace)
(2) raising a syntax error if the byte sequence could be either BOM or ZWNBSP (approximately, this means inside any dataname/value/datablock name/save frame name)
(3) any other type of BOM remains a syntax error as it is not UTF8
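
The three rules above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: "amongst whitespace" is approximated as "at least one neighbouring character is whitespace, or the BOM sits at the start or end of the stream", and the names `CifSyntaxError`, `CIF_WHITESPACE` and `strip_embedded_boms` are hypothetical, not part of any CIF2 library.

```python
CIF_WHITESPACE = " \t\r\n"  # assumed whitespace set for this sketch


class CifSyntaxError(ValueError):
    """Hypothetical error type used only in this example."""


def strip_embedded_boms(raw: bytes) -> str:
    # Rule (3): UTF-16/UTF-32 BOMs are not valid UTF-8, so strict decoding fails.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"input is not UTF-8 at byte {exc.start}") from exc

    out = []
    for i, ch in enumerate(text):
        if ch != "\ufeff":
            out.append(ch)
            continue
        # Treat start and end of stream as if they were whitespace.
        before = text[i - 1] if i > 0 else "\n"
        after = text[i + 1] if i + 1 < len(text) else "\n"
        if before in CIF_WHITESPACE or after in CIF_WHITESPACE:
            continue  # Rule (1): likely a concatenation artefact, drop it
        # Rule (2): inside a token it could be BOM or ZWNBSP, so refuse to guess.
        raise CifSyntaxError(f"U+FEFF inside a token at character {i}")
    return "".join(out)
```

With this approximation, `b"data_a\n" + b"\xef\xbb\xbf" + b"data_b\n"` comes back as `"data_a\ndata_b\n"`, while a U+FEFF in the middle of a dataname raises the error.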

I will be calling for a vote in a week or so, after giving everyone a bit more of a chance to make their voice heard.

On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
The extension we use is cbf, so the extension is not an issue.
A cbf might be a true ascii cif, or an imgCIF file with true
binary sections with or without compression or a UCS-2 file with
or without binutf sections with or without compression.

Clearly the cleanest case for binutf is when the entire file
starts out as UCS-2 and just continues that way, but because
the logic of imgCIF permits any mixture of the various types
of binary sections with any type of headers, there is no reason
to declare an error because of changes from, say, straight ASCII
to UCS-2 and back.

The most common place in which you will find a similar disdain for
requiring BOMs as the first glyph is in email messages because
a modern, multi-part email message is actually a concatenation
of multiple files of arbitrary types and encodings.  Now you could
make the argument that the email message is just a container for
those files and that each file carries its BOM at the front of
that (sub)file, and you would be right, but that is exactly how imgCIF
ends up in the same situation -- it is a container for multiple
headers and binary images and each binary image may be in a
different encoding (with different compression as well).  This
flexibility is not an accident -- it was a major intentional change
in imgCIF in 1997 from Andy Hammersley's original model of one ASCII header
and one binary image to a more CIF-like, order-independent approach
of allowing an arbitrary mixture of multiple headers and multiple
binary images.

 From a programming point of view, once you live in a world of
multiple encodings, recognizing a BOM at the start of a file is
no different from recognizing it anywhere in a file.

 In addition to email, another place in which changes of encoding occur,
albeit with a meta tag or Content-Type header rather than with a BOM, is
in web pages: when a page is displayed from frames, a browser application
has to be prepared to switch encodings on every frame.

 I understand how uncomfortable people can be with such flexibility
-- changing encodings mid-stream -- so just as we use the cbf
extension for all imgCIF files that are not pure ASCII right now,
I will use .cbf for CIF2 files that switch BOMs midstream, but I
will allow for switches in BOMs midstream.

 Have you considered using .cf2 as the extension for CIF2 files?
In light of the decision to make CIF2 a maximally disruptive
change from CIF, confusion between CIF and CIF2 files would seem to me
a much more serious cause for concern than dealing with an embedded BOM,
which can, after all, be dealt with automatically much more easily than
the CIF2 changes.


 Regards,
   Herbert
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Wed, 26 May 2010, James Hester wrote:

Dear Herbert,

I don't believe the technique of using a BOM to switch encodings mid-stream
is widely supported either within this group, by Unicode decoding/encoding
libraries, or by standards documents.  For example, do any browsers support
switching the encoding of a webpage halfway through?  I think not. I'd be
happy to hear of a counterexample to this assertion, but assuming that such
switching is not likely to be supported, I'd like to hear what you think of
the following comments:

Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same
as compressing or otherwise transforming the original file.  Therefore, the
notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current CIF2
standard than the notion of a 'gzipped CIF2 file'.  Both files require some
operation to transform them to a CIF2 file.  Both files will lack the
required magic number at the front, and will cause CIF2 parsers to fail
dismally.  I would propose that, if you need UCS2 for efficiency or storage
reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
and make it clear external to the file contents that they will need to be
transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2
tools.  My main concern with this approach is that we avoid confusion
between a CIF2 file and a (re)encoded CIF2 file, because as soon as a CIF
reader or writer is unsure about what they are reading or writing, the
effectiveness of the standard is degraded.
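
As a rough sketch of the "transform first, then parse" idea: the code below assumes the CIF2 magic code is the byte string `#\#CIF_2.0` (my understanding of the draft; adjust if the specification differs), and the helper names `looks_like_cif2` and `utf16_to_utf8` are invented for the example.

```python
# Assumed magic code at the very start of a CIF2 file (see lead-in).
CIF2_MAGIC = b"#\\#CIF_2.0"


def looks_like_cif2(raw: bytes) -> bool:
    """True if the byte stream starts with the assumed CIF2 magic code."""
    return raw.startswith(CIF2_MAGIC)


def utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode a UTF-16 stream as UTF-8.

    Python's "utf-16" codec reads a leading BOM to pick the byte order and
    does not pass that BOM through to the decoded text.
    """
    return raw.decode("utf-16").encode("utf-8")


original = "#\\#CIF_2.0\ndata_image001\n"
as_utf16 = original.encode("utf-16")         # what an image001.cif.ucs2 might hold

print(looks_like_cif2(as_utf16))                 # the re-encoded file fails the check
print(looks_like_cif2(utf16_to_utf8(as_utf16)))  # the transformed file passes it
```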

I appreciate that this is not ideal from your point of view, and that you'd
like to be able to specify the encoding within the file itself.  For the
same reasons as discussed last year, I don't like that approach.

I do not understand your argument about an internal UCS BOM being not that
much of a big deal because the program logic is not complicated.  Ease of
programming is not really the issue here.  If a file is a
standards-compliant CIF2 file, it must not cause a syntax error when read by
a standards-compliant CIF2 reader (especially for a data transfer
protocol!!).  If a UCS2 BOM is allowed in a CIF2 file, then *all* readers
must be able to accept and understand it identically.

James.

On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
     Dear Colleagues,

     James has said:

           So: why exactly is ignoring a BOM a problem?  If the embedded BOM
           is the leading BOM from a UTF16 file that has been naively
           concatenated, it will have bytes 0xFE 0xFF.  This byte sequence
           (and the reverse) is not acceptable UTF8, leading to a decoding
           error from the UTF8 decoding step.  The subsequent bytes will be
           UTF16, which should cause a decoding failure in any case.  So I
           deduce that we are simply discussing how to treat a UTF8 BOM,
           which can only find its way into a CIF file by naive
           concatenation of UTF8-encoded files written by certain programs.

           If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK,
           as I don't see that it is indicative of any problems beyond
           misguided choice of text editor.

           So I would advocate ignoring (and removing) UTF8-BOMs in the input
           stream, and treating all other BOMs as syntax errors.  Individual
           applications may wish to give users the option of interpreting
           U+FEFF as the deprecated ZWNBSP (and translating to the correct
           character) on the understanding that if this occurs outside a
           delimited string it will cause a syntax error.


I propose something slightly different, which will amount to what James
is proposing for applications that wish to handle only UTF8, but which
will be essential for applications that have to work with a wider range
of encodings (e.g. imgCIF applications).

There are three highly likely BOMs that may be encountered at any point
in a byte stream in a Unicode world:

The UTF-8 BOM:  EF BB BF
The UTF-16 big-endian BOM:  FE FF
The UTF-16 little-endian BOM:  FF FE
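
As a quick check, the three byte sequences listed above are all encodings of the single code point U+FEFF. The following sketch derives them with Python's standard codecs; the helper name `bom_table` is made up for the example.

```python
import codecs

BOM_CHAR = "\ufeff"  # the one code point behind all three byte sequences


def bom_table():
    """Map each encoding name to the hex bytes of U+FEFF in that encoding."""
    table = {}
    for enc in ("utf-8", "utf-16-be", "utf-16-le"):
        raw = BOM_CHAR.encode(enc)
        table[enc] = " ".join(f"{b:02X}" for b in raw)
    return table


# The standard library exposes the same byte strings as constants.
assert BOM_CHAR.encode("utf-8") == codecs.BOM_UTF8
assert BOM_CHAR.encode("utf-16-be") == codecs.BOM_UTF16_BE
assert BOM_CHAR.encode("utf-16-le") == codecs.BOM_UTF16_LE

for enc, hexbytes in bom_table().items():
    print(f"{enc:10s} {hexbytes}")
```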

For a UTF-8 application, the sequence EF BB BF is, as James suggests,
simply something to accept and ignore, with processing continuing
normally without comment.  Again, as James suggests, for a UTF-8-only
application the other two BOMs are invalid characters to treat as an
error.

However, for an application able to work with a wider range of encodings,
the other two BOMs are just what it needs to decide how to handle the
remainder of the stream.
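
A minimal sketch of such a multi-encoding reader is shown below. It is an illustration only, not the imgCIF/CBFlib implementation, and the names `bom_at` and `decode_mixed` are invented here. It assumes that UTF-16 segments are aligned on 2-byte units and that any BOM byte pattern found at an aligned position really is a BOM. That second assumption is not always safe: in UTF-16LE, for instance, the bytes EF BB followed by a BF byte can also arise from ordinary text, so a real reader would need a stricter rule.

```python
import codecs

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def bom_at(data: bytes, pos: int):
    """Return (bom_bytes, encoding) if a known BOM starts at pos, else None."""
    for bom, enc in BOMS:
        if data.startswith(bom, pos):
            return bom, enc
    return None


def decode_mixed(data: bytes, default: str = "utf-8") -> str:
    """Decode a stream, switching encoding at every BOM and dropping the BOM."""
    out = []
    enc = default
    pos = start = 0
    while pos < len(data):
        hit = bom_at(data, pos)
        if hit:
            out.append(data[start:pos].decode(enc))  # finish current segment
            bom, enc = hit
            pos += len(bom)
            start = pos
        else:
            # Step by one code unit of the current encoding to stay aligned.
            pos += 2 if enc.startswith("utf-16") else 1
    out.append(data[start:].decode(enc))
    return "".join(out)


stream = (b"abc" + codecs.BOM_UTF16_LE + "d\u00e9".encode("utf-16-le")
          + codecs.BOM_UTF8 + b"ghi")
print(decode_mixed(stream))
```

A UTF-8-only application would keep just the first entry of `BOMS` and treat the other two as errors, which is the branch being "chopped off" in the description that follows.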

Now that we have settled the case-sensitivity issue in a normalized
Unicode context, the recognition of BOMs in this manner imposes no
particular additional burden on applications.  All applications will
have to have utilities to assemble UTF-8 character sequences into
Unicode code points either as 16 bit, or, better, 32 bit integers,
so this is just a perfectly normal and in most cases already coded
branch point in that logic.  If the application wishes to only be
UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16
streams.  For what I have to do in my applications, I will simply
accept the output of that branch -- in terms of code points for text
I won't be able to tell the difference among the three possible
streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary
data I have to handle for imgCIF, things will work.  Certainly, for
interchange with applications that only handle UTF-8, I will write
the 50% expanded UTF-8 encodings of the same binaries, but for
performance limited data collections, I will write out UCS-2/UTF-16
files.
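
The "50% expanded" figure can be checked with a small calculation, assuming the binary data is carried in code points from U+0800 to U+FFFF (an assumption about the bin-utf scheme made for this sketch). Each such code point takes 2 bytes in UTF-16 and 3 bytes in UTF-8. The sample character and the helper name `encoded_sizes` are arbitrary.

```python
def encoded_sizes(text: str):
    """Return (UTF-16 size, UTF-8 size) in bytes, without any BOM."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


# 1000 code points in the range U+0800..U+FFFF (U+4E00 chosen arbitrarily).
sample = "\u4e00" * 1000
utf16_size, utf8_size = encoded_sizes(sample)
expansion = utf8_size / utf16_size - 1.0  # fractional growth when using UTF-8

print(utf16_size, utf8_size, f"{expansion:.0%}")
```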

 Nobody is hurt by what I am proposing and CIF2 will see wider
application this way.  Alternatively, if the needs of imgCIF are
unacceptable to be labelled CIF, we can always go back to
calling it imgNCIF (N for "not") as we had to in 1997 until we
called a truce and decided to accept the realities of modern
macromolecular data acquisition.

 Regards,
   Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Mon, 24 May 2010, James Hester wrote:

     To run through the alternatives and some of the arguments so far:

     (i) treating an embedded BOM as an ordinary character runs against the
     Unicode recommendations.  If we wish our standard to be respected, I
     think we should at least respect other standards and the thinking that
     has gone into them

     (ii) treating an embedded BOM as whitespace is OK with the Unicode
     standard, but means that a non-ASCII character now has syntactic
     meaning in the CIF.  I think this would be completely inconsistent on
     our part, as an invisible character (when displayed) can actually be
     used to delimit strings.  This is my least preferred solution, as it
     goes against the human-readability expected of CIFs

     (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
     serious problem'.

     (iv) treating embedded BOMs as syntax errors will cause issues when
     CIF2 files are naively concatenated

     I think the only viable alternatives are to choose (iii) or (iv).

     So: why exactly is ignoring a BOM a problem?  If the embedded BOM is
     the leading BOM from a UTF16 file that has been naively concatenated,
     it will have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is
     not acceptable UTF8, leading to a decoding error from the UTF8 decoding
     step.  The subsequent bytes will be UTF16, which should cause a
     decoding failure in any case.  So I deduce that we are simply
     discussing how to treat a UTF8 BOM, which can only find its way into a
     CIF file by naive concatenation of UTF8-encoded files written by
     certain programs.

     If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I
     don't see that it is indicative of any problems beyond misguided choice
     of text editor.

     So I would advocate ignoring (and removing) UTF8-BOMs in the input
     stream, and treating all other BOMs as syntax errors.  Individual
     applications may wish to give users the option of interpreting U+FEFF
     as the deprecated ZWNBSP (and translating to the correct character) on
     the understanding that if this occurs outside a delimited string it
     will cause a syntax error.
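
The decoding argument above can be tried out directly. The sketch below uses Python's strict UTF-8 decoder as a stand-in for "the UTF8 decoding step"; the file contents and the helper name `utf8_decode_outcome` are invented for the example.

```python
def utf8_decode_outcome(raw: bytes):
    """Return ("ok", text) or ("error", byte offset of the first bad byte)."""
    try:
        return ("ok", raw.decode("utf-8"))
    except UnicodeDecodeError as exc:
        return ("error", exc.start)


utf8_file = b"data_a\n"

# Naive concatenation with a UTF-16 file: its BOM (FE FF or FF FE) is rejected.
with_utf16 = utf8_file + "data_b\n".encode("utf-16")
print(utf8_decode_outcome(with_utf16))

# Naive concatenation with a BOM-prefixed UTF-8 file: decodes, U+FEFF embedded.
with_utf8_bom = utf8_file + "data_b\n".encode("utf-8-sig")
print(utf8_decode_outcome(with_utf8_bom))
```

One caveat: UTF-16 text that has no BOM and contains only ASCII characters is, byte for byte, valid UTF-8 with NUL characters interleaved, so in that case the failure would have to come from a check on the allowed character set rather than from the decoder itself.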

     James

     PS am I the only one who thinks it unlikely that Wordpad
     users would choose
     to use 'cat' to join file fragments together?

     On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
     <yaya@bernstein-plus-sons.com> wrote:
          Allow me to clarify my position, so there is no misunderstanding:

          I believe that we will be dealing with a world with at least UTF-8
          and UCS-2/UTF-16 encodings for many years to come.  I have no
          objection to CIF2 being specified solely in terms of UTF-8 for
          simplicity and consistency, but if we are to write software that
          people can use, we must have a reasonable position with respect
          to the encodings people use, and that means that, at the very
          least, we need to accept and process UTF-8 BOMs as harmless
          additional text.  Some of us will also be supporting UCS-2/UTF-16
          directly in our applications.  I don't mind if other applications
          are only going to support UTF-8, but inasmuch as, as long as
          we have java and web browsers, we are going to encounter
          UCS-2/UTF-16, we should do something sensible when a UCS-2/UTF-16
          BOM pops up, either doing the internal translation if we so
          choose, or, if that is not handled by a particular application,
          issuing a polite warning suggesting the use of an external
          translator if the application does not wish to handle
          UCS-2/UTF-16.

          BOMs will almost always appear in modern UCS-2/UTF-16 files, and
          when they are converted to UTF-8 that will give us yet another
          source of UTF-8 BOMs.  I believe the sensible thing to do is to
          recognize BOMs.

          Regards,
              Herbert
          =====================================================
           Herbert J. Bernstein, Professor of Computer Science
             Dowling College, Kramer Science Center, KSC 121
                  Idle Hour Blvd, Oakdale, NY, 11769

                           +1-631-244-3035
                           yaya@dowling.edu
          =====================================================

     On Tue, 18 May 2010, Bollinger, John C wrote:

     > Herbert Bernstein wrote:
     >> Let me see if I understand this correctly -- a user takes 2 perfectly
     >> good CIF2 files, edits each to clean up, say, some comments to keep
     >> straight where one begins and one ends, using a well-designed modern
     >> text editor that happens to put a BOM at the start of each file,
     >> concatenates the two files with cat to ship them into the IUCr, and
     >> suddenly they have a syntax error caused by a character that they
     >> cannot see!!!
     >>
     >> To me this seems pointless when it is trivial for software to
     >> recognize the character and handle it sensibly.
     >
     > And that is my principal rationale for preferring that embedded
     > U+FEFF be recognized as CIF whitespace.  With that approach, the
     > concatenation of two well-formed CIF2 files is always a well-formed
     > CIF2 file, regardless of the presence or absence of BOMs in the
     > original files.  Note, too, that such concatenation cannot produce a
     > mixed-encoding file because files encoded in UTF-16[BE|LE],
     > UTF-32[BE|LE], or any other encoding that can be distinguished from
     > UTF-8 are not well-formed CIF2 files to start.  The file concatenation
     > scenario thus does not provide a use case for the CIF2 *specification*
     > to recognize embedded U+FEFF as an encoding marker.
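
The concatenation scenario can be reproduced in a few lines. This is a sketch only: the file contents are made up, the tokenizer is a bare whitespace split rather than a CIF2 lexer, and the helper name `tokens_with_bom_as_whitespace` is hypothetical. It shows that Python's "utf-8-sig" codec strips just the leading BOM, and that counting U+FEFF as whitespace makes the joined stream split cleanly.

```python
import codecs
import re

# Two small files, each saved by an editor that prepends a UTF-8 BOM.
file_a = codecs.BOM_UTF8 + b"data_a\n_x 1\n"
file_b = codecs.BOM_UTF8 + b"data_b\n_y 2\n"

joined = file_a + file_b              # what `cat a.cif b.cif` would produce
text = joined.decode("utf-8-sig")     # removes only the BOM at the very start

naive_tokens = text.split()           # U+FEFF is not whitespace to str.split()


def tokens_with_bom_as_whitespace(s: str):
    """Split on runs of whitespace, counting U+FEFF as whitespace."""
    return [tok for tok in re.split(r"[\s\ufeff]+", s) if tok]


print(naive_tokens)
print(tokens_with_bom_as_whitespace(text))
```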
     >
     > On the other hand, I again feel compelled to distinguish program
     > behaviors from the CIF2 format specification.  None of the above would
     > prevent a CIF processor from recognizing and handling CIF-like
     > character streams encoded via schemes other than UTF-8, nor from
     > recognizing embedded U+FEFF code sequences in various encodings as
     > encoding switches, thereby handling mixed-encoding files.  Indeed,
     > such a program or library would be invaluable for correcting
     > encoding-related errors.  That does not, however, mean that such files
     > must be considered well-formed CIF2, no matter how likely they may (or
     > may not) be to arise.
     >
     >
     > James Hester wrote:
     >> I would be happy to call an embedded BOM a syntax error.
     >
     > In light of the possibility of U+FEFF appearing in a data value (for
     > example, from cutting text from a Unicode manuscript and pasting it
     > into a CIF), I need to refine my earlier blanket alternative of
     > treating embedded U+FEFF as a syntax error.  I now think it would be
     > ok to treat U+FEFF as a syntax error *provided* that it appears
     > outside a delimited string.  That's still not my preference, though,
     > and I feel confident that Herb will still disagree.
     >
     >
     > Regards,
     >
     > John
     > --
     > John C. Bollinger, Ph.D.
     > Computing and X-Ray Scientist
     > Department of Structural Biology
     > St. Jude Children's Research Hospital
     > John.Bollinger@StJude.org
     > (901) 595-3166 [office]
     > www.stjude.org
     >
     >
     >
     > Email Disclaimer:  www.stjude.org/emaildisclaimer
     >
     > _______________________________________________
     > ddlm-group mailing list
     > ddlm-group@iucr.org
     > http://scripts.iucr.org/mailman/listinfo/ddlm-group
     >
     _______________________________________________
     ddlm-group mailing list
     ddlm-group@iucr.org
     http://scripts.iucr.org/mailman/listinfo/ddlm-group




     --
     T +61 (02) 9717 9907
     F +61 (02) 9717 3145
     M +61 (04) 0249 4148

