
Re: [ddlm-group] UTF-8 BOM

Dear Herbert:

I'm puzzled by your opinion that 0xFEFF is not a valid Unicode code
point, as you were an active participant in recent discussions where
we were talking about the appearance of 0xFEFF in UTF8 CIF2 files.
Anyway:

The third message in this thread, from John Bollinger, discusses the
treatment of 0xFEFF, with reference to the Unicode standard.  I
recommend that you read that message, and particularly note the phrase
from the Unicode FAQ: "For backwards compatibility it should be
treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of
the content of the file or string."  In a nutshell, it is up to us to
decide how to treat 0xFEFF in the decoded data stream, and you have
contributed to that particular discussion in the message below.  So:
voting on the treatment of 0xFEFF in the datastream is appropriate,
and from your contribution I interpret a desire to ignore it (Option
2(d))?
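
To make the code-point question concrete, here is a minimal Python
sketch. It uses only the standard unicodedata module to show how U+FEFF
and U+FFFE are classified, and then applies one of three treatments to
U+FEFF in already-decoded text. The function names and the policy
labels ("keep", "ignore", "error") are illustrative assumptions, not
the wording of the voting options.

```python
import unicodedata


def describe(cp):
    """Return a one-line summary of how the Unicode database classifies cp."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name>")
    return "U+%04X %s (category %s)" % (cp, name, unicodedata.category(ch))


def handle_feff(text, policy):
    """Apply a hypothetical policy to U+FEFF in already-decoded text."""
    if policy == "keep":      # treat it as content (ZWNBSP)
        return text
    if policy == "ignore":    # silently drop every occurrence
        return text.replace("\ufeff", "")
    if policy == "error":     # reject the stream
        if "\ufeff" in text:
            raise ValueError("U+FEFF at index %d" % text.index("\ufeff"))
        return text
    raise ValueError("unknown policy: %r" % policy)


print(describe(0xFEFF))  # an assigned format character (category Cf)
print(describe(0xFFFE))  # an unassigned noncharacter (category Cn)
```

The point of the sketch is only that U+FEFF reaches the application as
an ordinary decoded character unless something is told to act on it,
whereas U+FFFE has no character assignment at all.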

Note also that http://www.unicode.org/faq/utf_bom.html states:

"A byte order mark (BOM) consists of the character code U+FEFF at the
beginning of a data stream, where it can be used as a signature
defining the byte order and encoding form, primarily of unmarked
plaintext files. Under some higher level protocols, use of a BOM may
be mandatory (or prohibited) in the Unicode data stream defined in
that protocol."

That is, a BOM is only recognised by the Unicode standard at the start
of a data stream. Voting option 3(b) is there because of your advocacy
of using a UCS2 BOM as an encoding switch inside a data stream: the
Unicode standard *does not* mandate that, so we should state it
explicitly if that is what we want.
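
The start-of-stream rule can be seen with Python's standard codecs.
This is only a sketch, assuming the input really is UTF-8;
decode_cif_bytes is a hypothetical helper name. The "utf-8-sig" codec
drops a single leading BOM and leaves any later U+FEFF in the decoded
text, where it is ordinary content for the application to handle.

```python
import codecs
from typing import Tuple


def decode_cif_bytes(raw: bytes) -> Tuple[str, bool]:
    """Decode UTF-8 bytes and report whether a leading BOM was present."""
    had_bom = raw.startswith(codecs.BOM_UTF8)  # the bytes EF BB BF
    # "utf-8-sig" removes at most one BOM, and only at the very start.
    text = raw.decode("utf-8-sig")
    return text, had_bom


body = "#\\#CIF2.0\ndata_example\n_note 'a\ufeffb'\n"
text, had_bom = decode_cif_bytes(codecs.BOM_UTF8 + body.encode("utf-8"))

print(had_bom)                         # the leading BOM was detected
print(text.startswith("#\\#CIF2.0"))   # the magic number is now first
print("\ufeff" in text)                # the interior U+FEFF is still there
```

Decoding with plain "utf-8" instead would leave U+FEFF as the first
character of the text, which is the case the magic number convention
has to deal with.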

On Fri, Jun 18, 2010 at 11:30 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Sorry, you are mistaken.  What the code chart says is:
>
> Special
> FEFF    ZERO WIDTH NO-BREAK SPACE
>         = BYTE ORDER MARK (BOM), ZWNBSP
>         may be used to detect byte order by contrast
>         with the noncharacter code point FFFE
>         use as an indication of non-breaking is
>         deprecated; see 2060 instead
>         -> 200B zero width space
>         -> 2060 word joiner
>         -> FFFE <not a character>
>
> So, under the latest version of unicode, the use you are describing
> is deprecated. The unicode consortium has the character back to what it
> originally was -- the BOM, which is not a character, and I intend
> to process it that way, not in the very odd way that some people followed
> for a few recent Unicode versions that made no sense and has now been
> deprecated.
>
> In theory there could be old unicode UTF-8 files somehow with stray FEFF
> characters in them as code points, but inasmuch as CIF2 is new, we are
> all spared the puzzlement of dealing with this non-problem of dealing
> with a noncharacter which became a strange character and is now again
> a noncharacter.
>
> In addition, if you read the bizarre discussions on FEFF when people
> were trying to use it as a code point instead of just stopping it
> at the text processing level, you will see that the only thing they
> could do with it was throw it away (that is what a zero width no-break
> space means).
>
> The _only_ fully compliant use for FEFF in the current standard is
> as a BOM, not as a valid code point, so it is not really an issue
> for CIF2 any more than FFFF or FFFE are, none of which should
> be delivered as code points in text processing.
>
> The proposition you have proposed is a false trichotomy.
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Fri, 18 Jun 2010, James Hester wrote:
>
>> I suggest you look again (perhaps you found 0xFFFE instead?).  Unicode
>> Hexadecimal code point 0xFEFF is Zero Width Non-Breaking Space
>> (ZWNBSP).  Previous recent emails have discussed this at some length.
>>
>> On Fri, Jun 18, 2010 at 10:55 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>>
>>> Dear Colleagues,
>>>
>>>  As I said, I reject the false trichotomy presented, and vote to reject
>>> this binary approach to CIF2.  Asking what should be done if the
>>> Unicode code point 0xFEFF is encountered in the text stream.  FFFE is
>>> not a Unicode text character (I just checked the latest Unicode standard,
>>> and it is still not a character, explicitly called a "noncharacter") so
>>> a properly functioning text system simply will not deliver it as text
>>> to an application, just as in older ASCII-based systems, characters such
>>> as NUL and SYN are stripped before delivery of text to an application.
>>>
>>>  Regards,
>>>    Herbert
>>>
>>> On Fri, 18 Jun 2010, James Hester wrote:
>>>
>>>> Herbert and Simon: regardless of your concerns about what encodings
>>>> should be acceptable for CIF2, I would invite you to vote on the
>>>> treatment of Unicode code point 0xFEFF when encountered in the decoded
>>>> text stream.  If you think an initial BOM should not be part of the
>>>> decoded text, then you are deciding how to treat code point 0xFEFF as
>>>> the first character in a CIF2 file, and the only consistent stance
>>>> would be that such a file is non-conformant, as the magic number
>>>> convention is violated.
>>>>
>>>> On Thu, Jun 17, 2010 at 9:21 PM, SIMON WESTRIP
>>>> <simonwestrip@btinternet.com> wrote:
>>>>>
>>>>> Dear all
>>>>>
>>>>> I've been watching this thread with the viewpoint that whatever is
>>>>> decided for the spec, I am going to have to be aware that CIFs may
>>>>> contain mixed encoding or encoding that isn't as specified. We meet
>>>>> this situation elsewhere, especially with text uploaded from web
>>>>> forms.
>>>>>
>>>>> So I quite like Herbert's latest description and would prefer to hold
>>>>> back from voting until I've considered this in more detail.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Simon
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>>>> To: Group finalising DDLm and associated dictionaries
>>>>> <ddlm-group@iucr.org>
>>>>> Sent: Wednesday, 16 June, 2010 12:29:41
>>>>> Subject: Re: [ddlm-group] UTF-8 BOM
>>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>>   As I said in my last message, I am proposing that we do what
>>>>> most of the world really does with unicode -- treat a CIF2 as
>>>>> a text file in which the information presented is a sequence
>>>>> of valid printable unicode code points no matter what the
>>>>> encoding.
>>>>>
>>>>>   For convenience in interchange, I am proposing that all
>>>>> CIF2 processing software working on systems that provide
>>>>> support for UTF-8 must provide support for that particular
>>>>> encoding, but if someone happens to be working in a system
>>>>> that only supports a UTF-7 or a UTF-16 or an old code-page-based
>>>>> encoding then I see no reason to declare what they produce
>>>>> erroneous in any way -- just a reason to require that they
>>>>> clearly identify the encoding used so that one of the
>>>>> many reliable encoding conversion programs that are available
>>>>> may be passed over their file when it needs to be handled
>>>>> in the preferred encoding.  I happen to use cyclone on my
>>>>> mac for that purpose.
>>>>>
>>>>>   The use of a BOM is just a quick, simple way to clearly
>>>>> specify an encoding if the file encoding a text file
>>>>> is a unicode file, but it really is not part of the text
>>>>> itself.
>>>>>
>>>>>   I, the strong proponent of supporting binary with CIF,
>>>>> am proposing that we return to the original approach
>>>>> to CIF -- that it really is a text file, not a binary file.
>>>>> I do so precisely to help me support the handling of
>>>>> binary with CIF.
>>>>>
>>>>>   Regards,
>>>>>     Herbert
>>>>>
>>>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>>>
>>>>>> Dear Herbert,
>>>>>>
>>>>>> Would you mind enlarging a little on what you are responding to here,
>>>>>> as I don't follow your thinking.
>>>>>> Perhaps I was not clear: I am not in favour of allowing a variety of
>>>>>> encodings to be included within the CIF2 standard.  I am advocating
>>>>>> UTF8 only.  Is this what you are responding to, or are you discussing
>>>>>> the suggestion of allowing a variety of encodings?
>>>>>>
>>>>>> On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>
>>>>>>> Dear Colleagues,
>>>>>>>
>>>>>>>  This is quite a disruptive change.  Until now CIF has always had
>>>>>>> machine-dependent encoding changes assumed.  I am in favor of
>>>>>>> working the entire world towards a common representation of text,
>>>>>>> and the use of multiple Unicode representations supported on
>>>>>>> current systems is going to be a large positive step.  I think
>>>>>>> it is a little premature (by about 10 years) to assume a
>>>>>>> world of UTF-8 purity.  We ain't there yet.
>>>>>>>
>>>>>>>  You are essentially making CIF2 into a binary format instead
>>>>>>> of a text format.  That is a truly disruptive change.  I think
>>>>>>> it is a serious mistake that will discourage use of CIF as an
>>>>>>> interchange format, not encourage it.
>>>>>>>
>>>>>>>  Regards,
>>>>>>>    Herbert
>>>>>>>
>>>>>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>>>>>
>>>>>>>> My concern with opening up the suite of possible CIF encodings is
>>>>>>>> that we need to maintain a guarantee that any CIF2-conformant writer
>>>>>>>> will produce files that any CIF2-conformant reader can read.  As we
>>>>>>>> are a data transfer and archiving standard, this is a core guarantee
>>>>>>>> that we make, so we cannot specify optional behaviour.  Note that we
>>>>>>>> are not restricted to someone transferring files between computers
>>>>>>>> at a single point in time, when some negotiation of encoding
>>>>>>>> protocol could take place; we may be talking about a third party
>>>>>>>> retrieving a file archived some years ago by someone else in the
>>>>>>>> local university repository.
>>>>>>>>
>>>>>>>> What people are and have always been free to do is to encapsulate
>>>>>>>> and encode CIFs in whatever way they wish, as long as the result is
>>>>>>>> not touted as being 'CIF2 conformant'.  The optional UTF8 BOM that
>>>>>>>> we have more or less agreed to is purely in deference to
>>>>>>>> poorly-written text editors, rather than an encoding signature as
>>>>>>>> such.
>>>>>>>>
>>>>>>>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>>>>>>>> <John.Bollinger@stjude.org> wrote:
>>>>>>>> On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>>>>>>>
>>>>>>>>> I'm coming to this late, I fear, but I would prefer that the spec
>>>>>>>>> be kept as simple as possible. I note the following comments in
>>>>>>>>> the Unicode FAQ document referenced by John B
>>>>>>>>> (http://www.unicode.org/faq/utf_bom.html):
>>>>>>>>>
>>>>>>>>>    "Where UTF-8 is used transparently in 8-bit environments, the use
>>>>>>>>>    of a BOM will interfere with any protocol or file format that
>>>>>>>>>    expects specific ASCII characters at the beginning, such as the
>>>>>>>>>    use of "#!" of at the beginning of Unix shell scripts."
>>>>>>>>
>>>>>>>> Well yes, but that applies to protocols defined in terms of 8-bit,
>>>>>>>> ASCII-derived character sets ("8-bit environments").  It does not
>>>>>>>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>>>>>>>>  Of course, neither does it require that BOMs be accepted or
>>>>>>>> recognized in Unicode environments.
>>>>>>>>
>>>>>>>>>    "In the absence of a protocol supporting its use as a BOM and when
>>>>>>>>>    not at the beginning of a text stream, U+FEFF should normally not
>>>>>>>>>    occur."
>>>>>>>>
>>>>>>>> I'm disappointed that you truncated the quote there.  It continues
>>>>>>>> with "For backwards compatibility it should be treated as ZERO WIDTH
>>>>>>>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>>>>>>>> file or string."  It goes on to advocate using U+2060 instead, and
>>>>>>>> (in the interest of full disclosure) it closes by commenting that a
>>>>>>>> language or protocol can specify that U+FEFF is unsupported in the
>>>>>>>> middle of a file.
>>>>>>>>
>>>>>>>>> I suggest the CIF specification deprecate the use of U+FEFF so that
>>>>>>>>> *any* occurrence of it be treated formally as an error. However, a
>>>>>>>>> note should acknowledge that U+FEFF is permitted according to the
>>>>>>>>> Unicode standard at the start of a data stream, and that therefore
>>>>>>>>> a CIF reading application may at its discretion accept U+FEFF
>>>>>>>>> followed by #\#CIF2.0 as a valid magic number at the start of a file.
>>>>>>>>
>>>>>>>> I don't see what is gained by forbidding U+FEFF from appearing
>>>>>>>> inside data values, where one might arrive via any number of
>>>>>>>> innocent means.  As it currently stands, the draft permits this.  It
>>>>>>>> is somewhat problematic to allow it at the beginning or end of a
>>>>>>>> whitespace-delimited value, but U+FEFF is by no means the only
>>>>>>>> character that is allowed but problematic at such a position.
>>>>>>>>
>>>>>>>> On the other hand, it is viable to specify that CIF itself does not
>>>>>>>> (directly) include a BOM.  That's where we started.  (Pedantic note:
>>>>>>>> "initial BOM" is redundant.  As the term is used in relation to
>>>>>>>> Unicode, a BOM necessarily appears at the beginning of a data
>>>>>>>> stream; anywhere else, U+FEFF is just U+FEFF.)  If CIF does not
>>>>>>>> formally allow a BOM then an otherwise well-formed CIF stream headed
>>>>>>>> by a BOM would then need to be interpreted either
>>>>>>>>
>>>>>>>> 1) as an unrecognized file, or
>>>>>>>>
>>>>>>>> 2) as an ill-formed CIF, or
>>>>>>>>
>>>>>>>> 3) as a well-formed CIF (any version) encapsulated in another
>>>>>>>> protocol.  Such "another protocol" does not need to be the concern
>>>>>>>> of CIF.
>>>>>>>>
>>>>>>>>> The idea is that any fully-conformant CIF writer will never write
>>>>>>>>> an initial UTF-8 BOM, and so any software designed to handle only
>>>>>>>>> fully conformant CIFs will not be troubled by it.
>>>>>>>>
>>>>>>>> I could live with that.  I can't imagine writing a CIF processor
>>>>>>>> limited to that mode of operation, nor would I want to use one, but
>>>>>>>> I can handle CIF's formal scope being limited in that way.
>>>>>>>>
>>>>>>>> In that case, however, let's carry it to the logical conclusion.
>>>>>>>> Rather than put one particular encoding detail outside CIF's scope,
>>>>>>>> why not put character encoding out of scope altogether?  CIF can
>>>>>>>> easily be defined simply in terms of "Unicode characters".  Perhaps
>>>>>>>> instead of anointing UTF-8 as the One True Encoding for CIF, it
>>>>>>>> would be better to make encoding an entirely separate concern.
>>>>>>>>
>>>>>>>> Practically speaking, you're going to have that anyway.  Even
>>>>>>>> disregarding imgCIF, does anyone really expect never to hear "it's a
>>>>>>>> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone
>>>>>>>> really think they need the authority of the CIF specification to
>>>>>>>> require that CIFs be delivered to them in a particular encoding?
>>>>>>>> How is that qualitatively different from requiring particular CIF
>>>>>>>> content, as most programs do?
>>>>>>>>
>>>>>>>>> Of course the world does contain CIFs created other than by
>>>>>>>>> fully-conformant CIF writers. To an extent the community should
>>>>>>>>> decide for itself how best to attempt to handle deviations from
>>>>>>>>> full conformance. It would help, perhaps, if those of us writing
>>>>>>>>> CIF readers would document specific practices that the software
>>>>>>>>> takes to accommodate such deviations. Ideally, such software should
>>>>>>>>> have a verbose logging mode that can be activated whenever
>>>>>>>>> surprising behaviour in reading CIFs is encountered by the user.
>>>>>>>>
>>>>>>>> I think it's exceedingly optimistic to expect "the community" to
>>>>>>>> arrive at and abide by a single, consistent set of best practices.
>>>>>>>> The best you can hope for is that a small number of organizations
>>>>>>>> and/or programs will exert enough influence to establish their own
>>>>>>>> de facto standards.
>>>>>>>>
>>>>>>>> We can exert some influence there, however.  Either the CIF spec or
>>>>>>>> a companion spec could establish conformance requirements for CIF
>>>>>>>> *processors*, including, for example, the ability to diagnose
>>>>>>>> particular malformations.  The XML spec does this, as do some
>>>>>>>> programming language specs.
>>>>>>>>
>>>>>>>> Such a document could also establish, perhaps, that CIF processors
>>>>>>>> must be able to accept the UTF-8 encoding, and maybe even that they
>>>>>>>> must assume UTF-8 by default.  That would establish the baseline and
>>>>>>>> a guaranteed interoperability mode that we would otherwise lose by
>>>>>>>> pushing character encoding outside the format specification.
>>>>>>>>
>>>>>>>>> Notice that naive concatenation of CIFs will remain a bad idea for
>>>>>>>>> all sorts of reasons - beyond the purely syntactic issues, one will
>>>>>>>>> get multiple "data_TOZ" declarations for example. Undoubtedly this
>>>>>>>>> will continue to happen, but perhaps increasing the number of
>>>>>>>>> occasions when blindly concatenating files triggers software errors
>>>>>>>>> will help to raise awareness and/or the use of better software
>>>>>>>>> tools.
>>>>>>>>
>>>>>>>> You are preaching to the choir with that as far as I am concerned.
>>>>>>>> It has never been altogether safe or reliable to assemble CIFs by
>>>>>>>> concatenation of fragments or complete CIFs, and I don't see why
>>>>>>>> CIF2 needs to make special accommodation for behavior that was never
>>>>>>>> correct in the first place.  No matter what treatment is chosen for
>>>>>>>> U+FEFF, people who exercise due care will still be able to assemble
>>>>>>>> well-formed CIF2 files from fragments, even by using 'cat' if they
>>>>>>>> do so shrewdly.
>>>>>>>>
>>>>>>>> John
>>>>>>>> --
>>>>>>>> John C. Bollinger, Ph.D.
>>>>>>>> Department of Structural Biology
>>>>>>>> St. Jude Children's Research Hospital
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ddlm-group mailing list
>>>>>>>> ddlm-group@iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> T +61 (02) 9717 9907
>>>>>>>> F +61 (02) 9717 3145
>>>>>>>> M +61 (04) 0249 4148



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148