
Re: [ddlm-group] UTF-8 BOM

I suggest you look again (perhaps you found 0xFFFE instead?).  Unicode
Hexadecimal code point 0xFEFF is Zero Width Non-Breaking Space
(ZWNBSP).  Previous recent emails have discussed this at some length.
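For anyone who wants to check the two code points locally, here is a minimal sketch using only the Python standard library. The helper name `describe` and the sample byte string are illustrative assumptions, not anything taken from the CIF2 draft. Note that Python reports the formal Unicode character name, ZERO WIDTH NO-BREAK SPACE, for U+FEFF.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return code point, Unicode name (if any) and general category."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} [{unicodedata.category(ch)}]"

# U+FEFF is an assigned format character (general category Cf).
print(describe("\ufeff"))
# U+FFFE is a noncharacter: no name, general category Cn (unassigned).
print(describe("\ufffe"))

# In UTF-16 little-endian, U+FEFF is stored as the bytes FF FE, which is
# one way the two values can be confused.
utf16le_bom = "\ufeff".encode("utf-16-le")
print(utf16le_bom.hex())

# A UTF-8 stream that starts with a BOM followed by the CIF2 magic number.
raw = b"\xef\xbb\xbf" + b"#\\#CIF2.0\n"
decoded = raw.decode("utf-8")          # keeps U+FEFF as the first character
stripped = raw.decode("utf-8-sig")     # drops one leading U+FEFF, if present
print(decoded[0] == "\ufeff", stripped.startswith("#\\#CIF2.0"))
```

The last two lines illustrate the point under vote: a plain UTF-8 decode delivers U+FEFF to the application as the first character of the text, so whether it counts as part of the CIF2 text stream is something the specification has to decide.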

On Fri, Jun 18, 2010 at 10:55 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
>  As I said, I reject the false trichotomy presented, and vote to reject
> this binary approach to CIF2.  As for what should be done if the
> Unicode code point 0xFEFF is encountered in the text stream:  FFFE is
> not a Unicode text character (I just checked the latest Unicode standard,
> and it is still not a character, explicitly called a "noncharacter") so
> a properly functioning text system simply will not deliver it as text
> to an application, just as in older ASCII-based systems, characters such as
> NUL and SYN are stripped before delivery of text to an application.
>
>  Regards,
>    Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Fri, 18 Jun 2010, James Hester wrote:
>
>> Herbert and Simon: regardless of your concerns about what encodings
>> should be acceptable for CIF2, I would invite you to vote on the
>> treatment of Unicode code point 0xFEFF when encountered in the decoded
>> text stream.  If you think an initial BOM should not be part of the
>> decoded text, then you are deciding how to treat code point 0xFEFF as
>> the first character in a CIF2 file, and the only consistent stance
>> would be that such a file is non-conformant, as the magic number
>> convention is violated.
>>
>> On Thu, Jun 17, 2010 at 9:21 PM, SIMON WESTRIP
>> <simonwestrip@btinternet.com> wrote:
>>>
>>> Dear all
>>>
>>> I've been watching this thread with the viewpoint that whatever is
>>> decided for the spec, I am going to have to be aware that CIFs may
>>> contain mixed encoding or encoding that isn't as specified. We meet this
>>> situation elsewhere, especially with text uploaded from web forms.
>>>
>>> So I quite like Herbert's latest description and would prefer to hold
>>> back from voting until I've considered this in more detail.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries
>>> <ddlm-group@iucr.org>
>>> Sent: Wednesday, 16 June, 2010 12:29:41
>>> Subject: Re: [ddlm-group] UTF-8 BOM
>>>
>>> Dear Colleagues,
>>>
>>>   As I said in my last message, I am proposing that we do what
>>> most of the world really does with Unicode -- treat a CIF2 as
>>> a text file in which the information presented is a sequence
>>> of valid printable Unicode code points no matter what the
>>> encoding.
>>>
>>>   For convenience in interchange, I am proposing that all
>>> CIF2 processing software working on systems that provide
>>> support for UTF-8 must provide support for that particular
>>> encoding, but if someone happens to be working in a system
>>> that only supports a UTF-7 or a UTF-16 or an old code-page-based
>>> encoding then I see no reason to declare what they produce
>>> erroneous in any way -- just a reason to require that they
>>> clearly identify the encoding used so that one of the
>>> many reliable encoding conversion programs that are available
>>> may be passed over their file when it needs to be handled
>>> in the preferred encoding.  I happen to use cyclone on my
>>> mac for that purpose.
>>>
>>>   The use of a BOM is just a quick, simple way to clearly
>>> specify an encoding if the file holding the text
>>> is a Unicode file, but it really is not part of the text
>>> itself.
>>>
>>>   I, the strong proponent of supporting binary with CIF,
>>> am proposing that we return to the original approach
>>> to CIF -- that it really is a text file, not a binary file.
>>> I do so precisely to help me support the handling of
>>> binary with CIF.
>>>
>>>   Regards,
>>>     Herbert
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                   +1-631-244-3035
>>>                   yaya@dowling.edu
>>> =====================================================
>>>
>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>
>>>> Dear Herbert,
>>>>
>>>> Would you mind enlarging a little on what you are responding to here,
>>>> as I don't follow your thinking.
>>>> Perhaps I was not clear: I am not in favour of allowing a variety of
>>>> encodings to be included within the CIF2 standard.  I am advocating
>>>> UTF8 only.  Is this what you are responding to, or are you discussing
>>>> the suggestion of allowing a variety of encodings?
>>>>
>>>> On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>>  This is quite a disruptive change.  Until now CIF has always had
>>>>> machine-dependent encoding changes assumed.  I am in favor of
>>>>> working the entire world towards a common representation of text,
>>>>> and the use of multiple Unicode representations supported on
>>>>> current systems is going to be a large positive step.  I think
>>>>> it is a little premature (by about 10 years) to assume a
>>>>> world of UTF-8 purity.  We ain't there yet.
>>>>>
>>>>>  You are essentially making CIF2 into a binary format instead
>>>>> of a text format.  That is a truly disruptive change.  I think
>>>>> it is a serious mistake that will discourage use of CIF as an
>>>>> interchange format, not encourage it.
>>>>>
>>>>>  Regards,
>>>>>    Herbert
>>>>>
>>>>> =====================================================
>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>
>>>>>                 +1-631-244-3035
>>>>>                 yaya@dowling.edu
>>>>> =====================================================
>>>>>
>>>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>>>
>>>>>> My concern with opening up the suite of possible CIF encodings is that
>>>>>> we need to maintain a guarantee that any CIF2-conformant writer will
>>>>>> produce files that any CIF2-conformant reader can read.  As we are a
>>>>>> data transfer and archiving standard, this is a core guarantee that we
>>>>>> make, so we cannot specify optional behaviour.  Note that we are not
>>>>>> restricted to someone transferring files between computers at a single
>>>>>> point in time, when some negotiation of encoding protocol could take
>>>>>> place; we may be talking about a third party retrieving a file archived
>>>>>> some years ago by someone else in the local university repository.
>>>>>>
>>>>>> What people are and have always been free to do is to encapsulate and
>>>>>> encode CIFs in whatever way they wish, as long as the result is not
>>>>>> touted as being 'CIF2 conformant'.  The optional UTF8 BOM that we have
>>>>>> more or less agreed to is purely in deference to poorly-written text
>>>>>> editors, rather than an encoding signature as such.
>>>>>>
>>>>>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>>>>>> <John.Bollinger@stjude.org> wrote:
>>>>>>      On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>>>>>
>>>>>>      >I'm coming to this late, I fear, but I would prefer that the spec
>>>>>>      >be kept as simple as possible. I note the following comments in
>>>>>>      >the Unicode FAQ document referenced by John B
>>>>>>      >(http://www.unicode.org/faq/utf_bom.html):
>>>>>>      >
>>>>>>      >    "Where UTF-8 is used transparently in 8-bit environments, the use
>>>>>>      >    of a BOM will interfere with any protocol or file format that expects
>>>>>>      >    specific ASCII characters at the beginning, such as the use of "#!"
>>>>>>      >    at the beginning of Unix shell scripts."
>>>>>>
>>>>>> Well yes, but that applies to protocols defined in terms of 8-bit,
>>>>>> ASCII-derived character sets ("8-bit environments").  It does not
>>>>>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>>>>>>  Of course, neither does it require that BOMs be accepted or
>>>>>> recognized in Unicode environments.
>>>>>>
>>>>>>>    "In the absence of a protocol supporting its use as a BOM and when
>>>>>>>    not at the beginning of a text stream, U+FEFF should normally not
>>>>>>>    occur."
>>>>>>
>>>>>> I'm disappointed that you truncated the quote there.  It continues
>>>>>> with "For backwards compatibility it should be treated as ZERO WIDTH
>>>>>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>>>>>> file or string."  It goes on to advocate using U+2060 instead, and (in
>>>>>> the interest of full disclosure) it closes by commenting that a
>>>>>> language or protocol can specify that U+FEFF is unsupported in the
>>>>>> middle of a file.
>>>>>>
>>>>>>> I suggest the CIF specification deprecate the use of U+FEFF so that
>>>>>>> *any* occurrence of it be treated formally as an error. However, a
>>>>>>> note should acknowledge that U+FEFF is permitted according to the
>>>>>>> Unicode standard at the start of a data stream, and that therefore a
>>>>>>> CIF reading application may at its discretion accept U+FEFF followed
>>>>>>> by #\#CIF2.0 as a valid magic number at the start of a file.
>>>>>>
>>>>>> I don't see what is gained by forbidding U+FEFF from appearing inside
>>>>>> data values, where one might arrive via any number of innocent means.
>>>>>>  As it currently stands, the draft permits this.  It is somewhat
>>>>>> problematic to allow it at the beginning or end of a
>>>>>> whitespace-delimited value, but U+FEFF is by no means the only
>>>>>> character that is allowed but problematic at such a position.
>>>>>>
>>>>>> On the other hand, it is viable to specify that CIF itself does not
>>>>>> (directly) include a BOM.  That's where we started.  (Pedantic note:
>>>>>> "initial BOM" is redundant.  As the term is used in relation to
>>>>>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>>>>>> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
>>>>>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>>>>>> then need to be interpreted either
>>>>>>
>>>>>> 1) as an unrecognized file, or
>>>>>>
>>>>>> 2) as an ill-formed CIF, or
>>>>>>
>>>>>> 3) as a well-formed CIF (any version) encapsulated in another
>>>>>> protocol.  Such "another protocol" does not need to be the concern of
>>>>>> CIF.
>>>>>>
>>>>>>> The idea is that any fully-conformant CIF writer will never write an
>>>>>>> initial UTF-8 BOM, and so any software designed to handle only fully
>>>>>>> conformant CIFs will not be troubled by it.
>>>>>>
>>>>>> I could live with that.  I can't imagine writing a CIF processor
>>>>>> limited to that mode of operation, nor would I want to use one, but I
>>>>>> can handle CIF's formal scope being limited in that way.
>>>>>>
>>>>>> In that case, however, let's carry it to the logical conclusion.
>>>>>>  Rather than put one particular encoding detail outside CIF's scope,
>>>>>> why not put character encoding out of scope altogether?  CIF can
>>>>>> easily be defined simply in terms of "Unicode characters".  Perhaps
>>>>>> instead of anointing UTF-8 as the One True Encoding for CIF, it would
>>>>>> be better to make encoding an entirely separate concern.
>>>>>>
>>>>>> Practically speaking, you're going to have that anyway.  Even
>>>>>> disregarding imgCIF, does anyone really expect never to hear "it's a
>>>>>> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
>>>>>> think they need the authority of the CIF specification to require that
>>>>>> CIFs be delivered to them in a particular encoding?  How is that
>>>>>> qualitatively different from requiring particular CIF content, as most
>>>>>> programs do?
>>>>>>
>>>>>>> Of course the world does contain CIFs created other than by
>>>>>>> fully-conformant CIF writers. To an extent the community should
>>>>>>> decide for itself how best to attempt to handle deviations from full
>>>>>>> conformance. It would help, perhaps, if those of us writing CIF
>>>>>>> readers would document specific practices that the software takes
>>>>>>> to accommodate such deviations. Ideally, such software should have
>>>>>>> a verbose logging mode that can be activated whenever surprising
>>>>>>> behaviour in reading CIFs is encountered by the user.
>>>>>>
>>>>>> I think it's exceedingly optimistic to expect "the community" to
>>>>>> arrive at and abide by a single, consistent set of best practices.
>>>>>>  The best you can hope for is that a small number of organizations and
>>>>>> / or programs will exert enough influence to establish their own de
>>>>>> facto standards.
>>>>>>
>>>>>> We can exert some influence there, however.  Either the CIF spec or a
>>>>>> companion spec could establish conformance requirements for CIF
>>>>>> *processors*, including, for example, the ability to diagnose
>>>>>> particular malformations.  The XML spec does this, as do some
>>>>>> programming language specs.
>>>>>>
>>>>>> Such a document could also establish, perhaps, that CIF processors
>>>>>> must be able to accept the UTF-8 encoding, and maybe even that they
>>>>>> must assume UTF-8 by default.  That would establish the baseline and a
>>>>>> guaranteed interoperability mode that we would otherwise lose by
>>>>>> pushing character encoding outside the format specification.
>>>>>>
>>>>>>> Notice that naive concatenation of CIFs will remain a bad idea for
>>>>>>> all sorts of reasons - beyond the purely syntactic issues, one will
>>>>>>> get multiple "data_TOZ" declarations for example. Undoubtedly this
>>>>>>> will continue to happen, but perhaps increasing the number of
>>>>>>> occasions when blindly concatenating files triggers software errors
>>>>>>> will help to raise awareness and/or the use of better software tools.
>>>>>>
>>>>>> You are preaching to the choir with that as far as I am concerned.  It
>>>>>> has never been altogether safe or reliable to assemble CIFs by
>>>>>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>>>>>> needs to make special accommodation for behavior that was never
>>>>>> correct in the first place.  No matter what treatment is chosen for
>>>>>> U+FEFF, people who exercise due care will still be able to assemble
>>>>>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>>>>>> so shrewdly.
>>>>>>
>>>>>> John
>>>>>> --
>>>>>> John C. Bollinger, Ph.D.
>>>>>> Department of Structural Biology
>>>>>> St. Jude Children's Research Hospital
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>>>>
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> ddlm-group@iucr.org
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group

