Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Tue, 1 Jun 2010 05:26:12 -0400 (EDT)
Dear Colleagues,
A UCS-2 message embedded in an email message normally carries
a BOM, but that begs the question -- it is normal practice to
switch encodings mid-stream, and, theory and abstractions aside,
we are definitely going to encounter embedded BOMs and, for that
matter, MIME-based switches in encodings in the course of
processing one stream of information. If one prefers to call
such a multi-mode stream a CBF rather than calling
it a CIF, so be it, but such streams still have to be processed.
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Tue, 1 Jun 2010, James Hester wrote:
> Hi Herbert and others,
>
> As far as I can tell, BOMs have no semantic or parsing significance in the context of an email message, which was my point. Encoding is switched
> using MIME headers, as you mention, not using BOMs. So I don't see that either email or web standards offer support for the idea of using a BOM
> to switch encoding. While I appreciate that being restricted to UTF-8 places some restrictions on imgCIF, it is considerably better than the
> situation that a lot of email still finds itself in, of being restricted to US-ASCII, and imgCBF is still available as an alternative.
>
> So I would repeat my suggestion of
>
> (1) ignoring a UTF-8 BOM where it is likely to be the result of concatenation (approximately, this means amongst whitespace)
> (2) raising a syntax error if the byte sequence could be either a BOM or a ZWNBSP (approximately, this means inside any dataname/value/datablock
> name/save frame name)
> (3) treating any other type of BOM as a syntax error, as it is not UTF-8
>
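The three rules above could be sketched in Python roughly as follows. This is only an illustration, not part of any CIF2 specification: the names `CifSyntaxError` and `strip_utf8_boms` are hypothetical, and "amongst whitespace" is approximated here as "at either end of the stream, or with whitespace on at least one side".

```python
BOM = "\ufeff"  # U+FEFF; its UTF-8 encoding is the byte sequence EF BB BF


class CifSyntaxError(ValueError):
    """Hypothetical error type for this sketch."""


def strip_utf8_boms(raw: bytes) -> str:
    """Decode raw as UTF-8 and drop BOMs that look like concatenation debris."""
    try:
        # Rule (3): UTF-16 BOM bytes (0xFE, 0xFF) can never occur in UTF-8,
        # so strict decoding rejects them.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifSyntaxError(f"not valid UTF-8 at byte {exc.start}") from exc

    kept = []
    for i, ch in enumerate(text):
        if ch != BOM:
            kept.append(ch)
            continue
        space_before = i == 0 or text[i - 1].isspace()
        space_after = i == len(text) - 1 or text[i + 1].isspace()
        if space_before or space_after:
            continue  # Rule (1): assumed to be a leftover leading BOM; ignore it
        # Rule (2): inside a token it could equally be a ZWNBSP, so refuse to guess.
        raise CifSyntaxError(f"ambiguous U+FEFF inside a token at character {i}")
    return "".join(kept)
```

With this reading, two BOM-prefixed files joined end to end decode cleanly, while a U+FEFF in the middle of a data name is rejected.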
> I will be calling for a vote in a week or so, after giving everyone a bit more of a chance to make their voice heard.
>
> On Wed, May 26, 2010 at 8:35 PM, Herbert J. Bernstein <[email protected]> wrote:
> The extension we use is cbf, so the extension is not an issue.
> A cbf might be a true ASCII CIF, or an imgCIF file with true
> binary sections with or without compression, or a UCS-2 file with
> or without binutf sections with or without compression.
>
> Clearly the cleanest case for binutf is when the entire file
> starts out as UCS-2 and just continues that way, but because
> the logic of imgCIF permits any mixture of the various types
> of binary sections with any type of headers, there is no reason
> to declare an error because of changes from, say, straight ASCII
> to UCS-2 and back.
>
> The most common place in which you will find a similar disdain for
> requiring BOMs as the first glyph is in email messages, because
> a modern, multi-part email message is actually a concatenation
> of multiple files of arbitrary types and encodings. Now you could
> make the argument that the email message is just a container for
> those files and that each file carries its BOM at the front of
> that (sub)file, and you would be right, but that is exactly how imgCIF
> ends up in the same situation -- it is a container for multiple
> headers and binary images, and each binary image may be in a
> different encoding (with different compression as well). This
> flexibility is not an accident -- it was a major intentional change
> in imgCIF in 1997 from Andy Hammersley's original model of one ASCII header
> and one binary image to a more CIF-like, order-independent approach
> of allowing an arbitrary mixture of multiple headers and multiple
> binary images.
>
> From a programming point of view, once you live in a world of
> multiple encodings, recognizing a BOM at the start of a file is
> no different from recognizing it anywhere in a file.
>
> In addition to email, another place in which encodings change,
> albeit with a meta tag or Content-Type header rather than with a BOM,
> is in web pages: for a page being displayed from frames, a browser
> application has to be prepared to switch encodings on every frame.
>
> I understand how uncomfortable people can be with such flexibility
> -- changing encodings mid-stream -- so just as we use the cbf
> extension for all imgCIF files that are not pure ASCII right now,
> I will use .cbf for CIF2 files that switch BOMs midstream, but I
> will allow for switches in BOMs midstream.
>
> Have you considered using .cf2 as the extension for CIF2 files?
> In light of the decision to make CIF2 a maximally disruptive
> change from CIF, confusion between CIF and CIF2 files would seem to me
> a much more serious cause for concern than dealing with an embedded BOM,
> which can, after all, be dealt with automatically much more easily
> than the CIF2 changes.
>
> Regards,
>    Herbert
>
> On Wed, 26 May 2010, James Hester wrote:
>
> Dear Herbert,
>
> I don't believe the technique of using a BOM to switch encodings mid-stream
> is widely supported either within this group, by Unicode decoding/encoding
> libraries, or by standards documents. For example, do any browsers support
> switching the encoding of a webpage halfway through? I think not. I'd be
> happy to hear of a counterexample to this assertion, but assuming that such
> switching is not likely to be supported, I'd like to hear what you think of
> the following comments:
>
> Encoding a CIF2 file in UCS2 or UCS4 seems to me to be notionally the same
> as compressing or otherwise transforming the original file. Therefore, the
> notion of a 'UCS2-encoded CIF2 file' is no more contrary to the current CIF2
> standard than the notion of a 'gzipped CIF2 file'. Both files require some
> operation to transform them to a CIF2 file. Both files will lack the
> required magic number at the front, and will cause CIF2 parsers to fail
> dismally. I would propose that, if you need UCS2 for efficiency or storage
> reasons, you save files with a non 'CIF' extension (e.g. image001.cif.ucs2)
> and make it clear external to the file contents that they will need to be
> transformed from ucs2 to utf-8 before being fed to standards-compliant CIF2
> tools. My main concern with this approach is that we avoid confusion
> between a CIF2 file and an (re)encoded CIF2 file, because as soon as a CIF
> reader or writer is unsure about what they are reading or writing, the
> effectiveness of the standard is degraded.
>
> I appreciate that this is not ideal from your point of view, and that you'd
> like to be able to specify the encoding within the file itself. For the
> same reasons as discussed last year, I don't like that approach.
>
> I do not understand your argument about an internal UCS BOM being not that
> much of a big deal because the program logic is not complicated. Ease of
> programming is not really the issue here. If a file is a
> standards-compliant CIF2 file, it must not cause a syntax error when read by
> a standards-compliant CIF2 reader (especially for a data transfer
> protocol!!). If a UCS2 BOM is allowed in a CIF2 file, then *all* readers
> must be able to accept and understand it identically.
>
> James.
>
> On Mon, May 24, 2010 at 11:11 PM, Herbert J. Bernstein
> <[email protected]> wrote:
>      Dear Colleagues,
>
>      James has said:
>
>            So: why exactly is ignoring a BOM a problem? If the embedded
>            BOM is the leading BOM from a UTF16 file that has been naively
>            concatenated, it will have bytes 0xFE 0xFF. This byte sequence
>            (and the reverse) is not acceptable UTF8, leading to a decoding
>            error from the UTF8 decoding step. The subsequent bytes will be
>            UTF16, which should cause a decoding failure in any case. So I
>            deduce that we are simply discussing how to treat a UTF8 BOM,
>            which can only find its way into a CIF file by naive
>            concatenation of UTF8-encoded files written by certain programs.
>
>            If the embedded BOM is a UTF-8 BOM, then ignoring it would be
>            OK, as I don't see that it is indicative of any problems beyond
>            misguided choice of text editor.
>
>            So I would advocate ignoring (and removing) UTF8-BOMs in the
>            input stream, and treating all other BOMs as syntax errors.
>            Individual applications may wish to give users the option of
>            interpreting U+FEFF as the deprecated ZWNBSP (and translating to
>            the correct character) on the understanding that if this occurs
>            outside a delimited string it will cause a syntax error.
>
>
> I propose something slightly different, which will amount to what James
> is proposing for applications that wish to handle only UTF8, but which
> will be essential for applications that have to work with a wider range
> of encodings (e.g. imgCIF applications).
>
> There are three highly likely BOMs that may be encountered at any point
> in a byte stream in a Unicode world:
>
> The UTF-8 BOM:                EF BB BF
> The UTF-16 big-endian BOM:    FE FF
> The UTF-16 little-endian BOM: FF FE
>
> For a UTF-8 application, the sequence EF BB BF is, as James suggests,
> simply something to accept and ignore, with processing continuing
> normally without comment. Again, as James suggests, for UTF-8-only
> applications the other 2 BOMs are invalid characters to treat as an
> error.
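A minimal sketch of the branch point described here, using the BOM constants from Python's standard `codecs` module. The helper names `bom_at` and `handle_bom`, and the returned labels, are hypothetical and only illustrate the two policies (a UTF-8-only reader versus a reader that also accepts UTF-16).

```python
import codecs

# The three byte order marks listed above, paired with Python codec names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def bom_at(data: bytes, pos: int = 0):
    """Return (encoding, bom_length) if a known BOM starts at pos, else None."""
    for mark, encoding in BOMS:
        if data.startswith(mark, pos):
            return encoding, len(mark)
    return None


def handle_bom(data: bytes, pos: int, utf8_only: bool = True) -> str:
    """Decide what a reader does with the bytes at pos (illustrative policy)."""
    found = bom_at(data, pos)
    if found is None:
        return "no-bom"
    encoding, _ = found
    if encoding == "utf-8":
        return "ignore"  # accepted silently by both kinds of reader
    # A UTF-8-only reader rejects UTF-16 BOMs; a wider reader switches decoder.
    return "error" if utf8_only else f"switch:{encoding}"
```

For example, `handle_bom(b"\xff\xfe", 0)` reports an error for a UTF-8-only reader, while passing `utf8_only=False` reports a switch to `utf-16-le`.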
>
> However, for an application able to work with a wider range of encodings,
> the other two BOMs are just what it needs to decide how to handle the
> remainder of the stream.
>
> Now that we have settled the case-sensitivity issue in a normalized
> Unicode context, the recognition of BOMs in this manner imposes no
> particular additional burden on applications. All applications will
> have to have utilities to assemble UTF-8 character sequences into
> Unicode code points either as 16 bit, or, better, 32 bit integers,
> so this is just a perfectly normal and in most cases already coded
> branch point in that logic. If the application wishes to only be
> UTF-8 aware, it can chop off the branch that would decode UCS-2/UTF-16
> streams. For what I have to do in my applications, I will simply
> accept the output of that branch -- in terms of code points for text
> I won't be able to tell the difference among the three possible
> streams of encoded characters, and for the UCS-2/UTF-16 bin-utf binary
> data I have to handle for imgCIF, things will work. Certainly, for
> interchange with applications that only handle UTF-8, I will write
> the 50% expanded UTF-8 encodings of the same binaries, but for
> performance limited data collections, I will write out UCS-2/UTF-16
> files.
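The "50% expanded" figure follows from how the two encodings size a code point: anything in the range U+0800..U+FFFF takes two bytes in UTF-16 and three in UTF-8. A quick check in Python, assuming (the thread does not confirm this) that the binary payload maps onto code points in that range:

```python
# Hypothetical payload: 1000 code points, all in the range U+0800..U+FFFF.
# The sample character U+4E2D is an arbitrary choice for illustration.
payload = "\u4e2d" * 1000

as_utf16 = payload.encode("utf-16-be")  # two bytes per code point, no BOM added
as_utf8 = payload.encode("utf-8")       # three bytes per code point

expansion = len(as_utf8) / len(as_utf16) - 1
print(len(as_utf16), len(as_utf8), f"{expansion:.0%}")  # prints: 2000 3000 50%
```

Code points below U+0800 would behave differently, so the ratio depends on how the binary data are packed.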
>
> Nobody is hurt by what I am proposing and CIF2 will see wider
> application this way. Alternatively, if the needs of imgCIF make it
> unacceptable to label it CIF, we can always go back to
> calling it imgNCIF (N for "not") as we had to in 1997 until we
> called a truce and decided to accept the realities of modern
> macromolecular data acquisition.
>
> Regards,
>    Herbert
>
>
> On Mon, 24 May 2010, James Hester wrote:
>
>      To run through the alternatives and some of the arguments so far:
>
>      (i) treating an embedded BOM as an ordinary character runs against the
>      Unicode recommendations. If we wish our standard to be respected, I think
>      we should at least respect other standards and the thinking that has gone
>      into them
>
>      (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
>      but means that a non-ASCII character now has syntactic meaning in the CIF.
>      I think this would be completely inconsistent on our part, as an invisible
>      character (when displayed) can actually be used to delimit strings. This is
>      my least preferred solution, as it goes against the human-readability
>      expected of CIFs
>
>      (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
>      serious problem'.
>
>      (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
>      files are naively concatenated
>
>      I think the only viable alternatives are to choose (iii) or (iv).
>
>      So: why exactly is ignoring a BOM a problem? If the embedded BOM is the
>      leading BOM from a UTF16 file that has been naively concatenated, it will
>      have bytes 0xFE 0xFF. This byte sequence (and the reverse) is not
>      acceptable UTF8, leading to a decoding error from the UTF8 decoding step.
>      The subsequent bytes will be UTF16, which should cause a decoding failure in
>      any case. So I deduce that we are simply discussing how to treat a UTF8
>      BOM, which can only find its way into a CIF file by naive concatenation of
>      UTF8-encoded files written by certain programs.
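The claim that a UTF-16 BOM cannot survive a UTF-8 decoding step can be checked directly, since the bytes 0xFE and 0xFF never occur in well-formed UTF-8. The sketch below uses Python's strict `utf-8` codec; the helper name `utf8_error_offset` and the sample file contents are made up for illustration.

```python
import codecs


def utf8_error_offset(raw: bytes):
    """Return the byte offset where strict UTF-8 decoding fails, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


first = "data_a\n".encode("utf-8")  # a well-formed UTF-8 fragment

offsets = {}
for bom, codec in ((codecs.BOM_UTF16_BE, "utf-16-be"),
                   (codecs.BOM_UTF16_LE, "utf-16-le")):
    second = bom + "data_b\n".encode(codec)  # a UTF-16 file with its leading BOM
    # Naive concatenation: decoding fails at the first byte of the BOM.
    offsets[codec] = utf8_error_offset(first + second)

print(offsets, len(first))
```

In both byte orders the reported offset equals `len(first)`, that is, the exact point where the second file was appended.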
>
>      If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
>      see that it is indicative of any problems beyond misguided choice of text
>      editor.
>
>      So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
>      and treating all other BOMs as syntax errors. Individual applications may
>      wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP
>      (and translating to the correct character) on the understanding that if this
>      occurs outside a delimited string it will cause a syntax error.
>
>      James
>
>      PS am I the only one who thinks it unlikely that Wordpad users would choose
>      to use 'cat' to join file fragments together?
>
>      On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
>      <[email protected]> wrote:
>           Allow me to clarify my position, so there is no
>           misunderstanding:
>
>           I believe that we will be dealing with a world with at least
>           UTF-8 and UCS-2/UTF-16 encodings for many years to come. I have no
>           objection to CIF2 being specified solely in terms of UTF-8 for
>           simplicity and consistency, but if we are to write software that
>           people can use, we must have a reasonable position with respect
>           to the encodings people use, and that means that, at the very
>           least, we need to accept and process UTF-8 BOMs as harmless
>           additional text. Some of us will also be supporting UCS-2/UTF-16
>           directly in our applications. I don't mind if other applications
>           are only going to support UTF-8, but since, as long as we have
>           Java and web browsers, we are going to encounter UCS-2/UTF-16,
>           we should do something sensible when a UCS-2/UTF-16 BOM pops up:
>           either doing the internal translation if we so choose or, if a
>           particular application does not wish to handle UCS-2/UTF-16,
>           issuing a polite warning suggesting the use of an external
>           translator.
>
>           BOMs will almost always appear in modern UCS-2/UTF-16 files, and
>           when they are converted to UTF-8 that will give us yet another
>           source of UTF-8 BOMs. I believe the sensible thing to do is to
>           recognize BOMs.
>
>           Regards,
>              Herbert
>
>      On Tue, 18 May 2010, Bollinger, John C wrote:
>
>      > Herbert Bernstein wrote:
>      >> Let me see if I understand this correctly -- a user takes 2 perfectly good
>      >> CIF2 files, edits each to clean up, say, some comments to keep straight where
>      >> one begins and one ends, using a well-designed modern text editor that
>      >> happens to put a BOM at the start of each file, concatenates the two files
>      >> with cat to ship them into the IUCr, and suddenly they have a syntax error
>      >> caused by a character that they cannot see!!!
>      >>
>      >> To me this seems pointless when it is trivial for software to recognize the
>      >> character and handle it sensibly.
>      >
>      > And that is my principal rationale for preferring that embedded
>      > U+FEFF be recognized as CIF whitespace. With that approach, the
>      > concatenation of two well-formed CIF2 files is always a well-formed
>      > CIF2 file, regardless of the presence or absence of BOMs in the
>      > original files. Note, too, that such concatenation cannot produce a
>      > mixed-encoding file because files encoded in UTF-16[BE|LE],
>      > UTF-32[BE|LE], or any other encoding that can be distinguished from
>      > UTF-8 are not well-formed CIF2 files to start. The file concatenation
>      > scenario thus does not provide a use case for the CIF2 *specification*
>      > to recognize embedded U+FEFF as an encoding marker.
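The concatenation scenario, and the effect of treating embedded U+FEFF as whitespace, can be reproduced in a few lines of Python. This is a sketch only: the two fragments are made-up stand-ins for real CIF2 files, and the regular expression is a crude substitute for a real CIF tokenizer.

```python
import re

# Two hypothetical fragments, each saved by an editor that prepends a UTF-8 BOM
# (Python's "utf-8-sig" codec writes EF BB BF before the text).
part_a = "data_a\n_x 1\n".encode("utf-8-sig")
part_b = "data_b\n_y 2\n".encode("utf-8-sig")

# Byte-level concatenation, as `cat` would do it. Decoding with "utf-8-sig"
# strips only the leading BOM, so the second file's BOM stays in the text.
joined = (part_a + part_b).decode("utf-8-sig")
embedded = joined.count("\ufeff")

# Treating U+FEFF like whitespace when splitting tokens keeps the result usable.
tokens = [t for t in re.split(r"[\s\ufeff]+", joined) if t]
print(embedded, tokens)
```

Here one embedded U+FEFF remains after decoding, and the token list comes out the same as if neither file had carried a BOM.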
>      >
>      > On the other hand, I again feel compelled to distinguish program
>      > behaviors from the CIF2 format specification. None of the above would
>      > prevent a CIF processor from recognizing and handling CIF-like
>      > character streams encoded via schemes other than UTF-8, nor from
>      > recognizing embedded U+FEFF code sequences in various encodings as
>      > encoding switches, thereby handling mixed-encoding files. Indeed,
>      > such a program or library would be invaluable for correcting
>      > encoding-related errors. That does not, however, mean that such files
>      > must be considered well-formed CIF2, no matter how likely they may (or
>      > may not) be to arise.
>      >
>      >
>      > James Hester wrote:
>      >> I would be happy to call an embedded BOM a syntax error.
>      >
>      > In light of the possibility of U+FEFF appearing in a data value (for
>      > example, from cutting text from a Unicode manuscript and pasting it
>      > into a CIF), I need to refine my earlier blanket alternative of
>      > treating embedded U+FEFF as a syntax error. I now think it would be
>      > ok to treat U+FEFF as a syntax error *provided* that it appears
>      > outside a delimited string. That's still not my preference, though,
>      > and I feel confident that Herb will still disagree.
>      >
>      >
>      > Regards,
>      >
>      > John
>      > --
>      > John C. Bollinger, Ph.D.
>      > Computing and X-Ray Scientist
>      > Department of Structural Biology
>      > St. Jude Children's Research Hospital
>      > [email protected]
>      > (901) 595-3166 [office]
>      > www.stjude.org
>      >
>      > Email Disclaimer: www.stjude.org/emaildisclaimer
>      >
>      > _______________________________________________
>      > ddlm-group mailing list
>      > [email protected]
>      > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> � � �>
> � � �_______________________________________________
> � � �ddlm-group mailing list
> � � �[email protected]
> � � �http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>
>
>
>      --
>      T +61 (02) 9717 9907
>      F +61 (02) 9717 3145
>      M +61 (04) 0249 4148

