Re: [ddlm-group] UTF-8 BOM

Herbert and Simon: regardless of your concerns about what encodings
should be acceptable for CIF2, I would invite you to vote on the
treatment of Unicode code point 0xFEFF when encountered in the decoded
text stream.  If you think an initial BOM should not be part of the
decoded text, then you are deciding how to treat code point 0xFEFF as
the first character in a CIF2 file, and the only consistent stance
would be that such a file is non-conformant, as the magic number
convention is violated.
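
A minimal sketch of that stance in Python, assuming the text has already
been decoded to Unicode characters.  The name classify_start and the
strip_bom switch are hypothetical, and the magic string is spelled the
way it is quoted later in this thread; substitute whatever the final
spec says.

```python
BOM = "\ufeff"            # Unicode code point U+FEFF
MAGIC = "#\\#CIF2.0"      # assumption: spelling as quoted in this thread


def classify_start(text, magic=MAGIC, strip_bom=False):
    """Classify the start of an already-decoded CIF2 text stream.

    strip_bom=False treats a leading U+FEFF as part of the decoded text,
    so it violates the magic number convention.  strip_bom=True models
    the alternative view that the BOM is not part of the text.
    """
    if text.startswith(BOM):
        if not strip_bom:
            return "non-conformant: U+FEFF precedes the magic number"
        text = text[len(BOM):]
    if text.startswith(magic):
        return "conformant start"
    return "non-conformant: magic number missing"
```

With strip_bom=False a file that begins U+FEFF followed by the magic
number is rejected; with strip_bom=True the same file is accepted,
which is the choice the vote is meant to settle.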

On Thu, Jun 17, 2010 at 9:21 PM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
> Dear all
> I've been watching this thread with the viewpoint that whatever is decided
> for the spec, I am going to have to be aware that CIFs may contain mixed
> encoding or encoding that isn't as specified. We meet this situation
> elsewhere, especially with text uploaded from web forms.
> So I quite like Herbert's latest description and would prefer to hold back
> from voting until I've considered this in more detail.
> Cheers
> Simon
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Wednesday, 16 June, 2010 12:29:41
> Subject: Re: [ddlm-group] UTF-8 BOM
> Dear Colleagues,
>   As I said in my last message, I am proposing that we do what
> most of the world really does with unicode -- treat a CIF2 as
> a text file in which the information presented is a sequence
> of valid printable unicode code points no matter what the
> encoding.
>   For convenience in interchange, I am proposing that all
> CIF2 processing software working on systems that provide
> support for UTF-8 must provide support for that particular
> encoding, but if someone happens to be working in a system
> that only supports a UTF-7 or a UTF-16 or an old code-page-based
> encoding then I see no reason to declare what they produce
> erroneous in any way -- just a reason to require that they
> clearly identify the encoding used so that one of the
> many reliable encoding conversion programs that are available
> may be passed over their file when it needs to be handled
> in the preferred encoding.  I happen to use cyclone on my
> mac for that purpose.
>   The use of a BOM is just a quick, simple way to clearly
> specify an encoding if the file encoding the text
> is a unicode file, but it really is not part of the text
> itself.
>   I, the strong proponent of supporting binary with CIF,
> am proposing that we return to the original approach
> to CIF -- that it really is a text file, not a binary file.
> I do so precisely to help me support the handling of
> binary with CIF.
>   Regards,
>     Herbert
> On Wed, 16 Jun 2010, James Hester wrote:
>> Dear Herbert,
>> Would you mind enlarging a little on what you are responding to here,
>> as I don't follow your thinking.
>> Perhaps I was not clear: I am not in favour of allowing a variety of
>> encodings to be included within the CIF2 standard.  I am advocating
>> UTF8 only.  Is this what you are responding to, or are you discussing
>> the suggestion of allowing a variety of encodings?
>> On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>> Dear Colleagues,
>>>  This is quite a disruptive change.  Until now CIF has always had
>>> machine-dependent encoding changes assumed.  I am in favor of
>>> working the entire world towards a common representation of text,
>>> and the use of multiple Unicode representations supported on
>>> current systems is going to be a large positive step.  I think
>>> it is a little premature (by about 10 years) to assume a
>>> world of UTF-8 purity.  We ain't there yet.
>>>  You are essentially making CIF2 into a binary format instead
>>> of a text format.  That is a truly disruptive change.  I think
>>> it is a serious mistake that will discourage use of CIF as an
>>> interchange format, not encourage it.
>>>  Regards,
>>>    Herbert
>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>> My concern with opening up the suite of possible CIF encodings is that we
>>>> need to maintain a guarantee that any CIF2-conformant writer will produce
>>>> files that any CIF2-conformant reader can read.  As we are a data transfer
>>>> and archiving standard, this is a core guarantee that we make, so we cannot
>>>> specify optional behaviour.  Note that we are not restricted to someone
>>>> transferring files between computers at a single point in time, when some
>>>> negotiation of encoding protocol could take place; we may be talking about a
>>>> third party retrieving a file archived some years ago by someone else in the
>>>> local university repository.
>>>> What people are and have always been free to do is to encapsulate and encode
>>>> CIFs in whatever way they wish, as long as the result is not touted as being
>>>> 'CIF2 conformant'.  The optional UTF8 BOM that we have more or less agreed
>>>> to is purely in deference to poorly-written text editors, rather than an
>>>> encoding signature as such.
>>>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>>>> <John.Bollinger@stjude.org> wrote:
>>>>      On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>>>      >I'm coming to this late, I fear, but I would prefer that the spec
>>>>      >be kept as simple as possible. I note the following comments in
>>>>      >the Unicode FAQ document referenced by John B
>>>>      >(http://www.unicode.org/faq/utf_bom.html):
>>>>      >
>>>>      >    "Where UTF-8 is used transparently in 8-bit environments, the use
>>>>      >    of a BOM will interfere with any protocol or file format that expects
>>>>      >    specific ASCII characters at the beginning, such as the use of "#!"
>>>>      >    at the beginning of Unix shell scripts."
>>>> Well yes, but that applies to protocols defined in terms of 8-bit,
>>>> ASCII-derived character sets ("8-bit environments").  It does not
>>>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>>>>  Of course, neither does it require that BOMs be accepted or
>>>> recognized in Unicode environments.
>>>>>    "In the absence of a protocol supporting its use as a BOM and when
>>>>>    not at the beginning of a text stream, U+FEFF should normally not
>>>>>    occur."
>>>> I'm disappointed that you truncated the quote there.  It continues
>>>> with "For backwards compatibility it should be treated as ZERO WIDTH
>>>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>>>> file or string."  It goes on to advocate using U+2060 instead, and (in
>>>> the interest of full disclosure) it closes by commenting that a
>>>> language or protocol can specify that U+FEFF is unsupported in the
>>>> middle of a file.
>>>>> I suggest the CIF specification deprecate the use of U+FEFF so that
>>>>> *any* occurrence of it be treated formally as an error. However, a
>>>>> note should acknowledge that U+FEFF is permitted according to the
>>>>> Unicode standard at the start of a data stream, and that therefore a
>>>>> CIF reading application may at its discretion accept U+FEFF followed
>>>>> by #\#CIF2.0 as a valid magic number at the start of a file.
>>>> I don't see what is gained by forbidding U+FEFF from appearing inside
>>>> data values, where one might arrive via any number of innocent means.
>>>>  As it currently stands, the draft permits this.  It is somewhat
>>>> problematic to allow it at the beginning or end of a
>>>> whitespace-delimited value, but U+FEFF is by no means the only
>>>> character that is allowed but problematic at such a position.
>>>> On the other hand, it is viable to specify that CIF itself does not
>>>> (directly) include a BOM.  That's where we started.  (Pedantic note:
>>>> "initial BOM" is redundant.  As the term is used in relation to
>>>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>>>> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
>>>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>>>> then need to be interpreted either
>>>> 1) as an unrecognized file, or
>>>> 2) as an ill-formed CIF, or
>>>> 3) as a well-formed CIF (any version) encapsulated in another
>>>> protocol.  Such "another protocol" does not need to be the concern of
>>>> CIF.
>>>>> The idea is that any fully-conformant CIF writer will never write an
>>>>> initial UTF-8 BOM, and so any software designed to handle only fully
>>>>> conformant CIFs will not be troubled by it.
>>>> I could live with that.  I can't imagine writing a CIF processor
>>>> limited to that mode of operation, nor would I want to use one, but I
>>>> can handle CIF's formal scope being limited in that way.
>>>> In that case, however, let's carry it to the logical conclusion.
>>>>  Rather than put one particular encoding detail outside CIF's scope,
>>>> why not put character encoding out of scope altogether?  CIF can
>>>> easily be defined simply in terms of "Unicode characters".  Perhaps
>>>> instead of anointing UTF-8 as the One True Encoding for CIF, it would
>>>> be better to make encoding an entirely separate concern.
>>>> Practically speaking, you're going to have that anyway.  Even
>>>> disregarding imgCIF, does anyone really expect never to hear "it's a
>>>> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
>>>> think they need the authority of the CIF specification to require that
>>>> CIFs be delivered to them in a particular encoding?  How is that
>>>> qualitatively different from requiring particular CIF content, as most
>>>> programs do?
>>>>>                                             Of course the world does
>>>>> contain CIFs created other than by fully-conformant CIF writers. To
>>>>> an extent the community should decide for itself how best to attempt
>>>>> to handle deviations from full conformance. It would help, perhaps, if
>>>>> those of us writing CIF readers would document specific practices that
>>>>> the software takes to accommodate such deviations. Ideally, such
>>>>> software should have a verbose logging mode that can be activated
>>>>> whenever surprising behaviour in reading CIFs is encountered by
>>>>> the user.
>>>> I think it's exceedingly optimistic to expect "the community" to
>>>> arrive at and abide by a single, consistent set of best practices.
>>>>  The best you can hope for is that a small number of organizations and
>>>> / or programs will exert enough influence to establish their own de
>>>> facto standards.
>>>> We can exert some influence there, however.  Either the CIF spec or a
>>>> companion spec could establish conformance requirements for CIF
>>>> *processors*, including, for example, the ability to diagnose
>>>> particular malformations.  The XML spec does this, as do some
>>>> programming language specs.
>>>> Such a document could also establish, perhaps, that CIF processors
>>>> must be able to accept the UTF-8 encoding, and maybe even that they
>>>> must assume UTF-8 by default.  That would establish the baseline and a
>>>> guaranteed interoperability mode that we would otherwise lose by
>>>> pushing character encoding outside the format specification.
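
A rough Python sketch of such a baseline, assuming a processor that
receives raw bytes plus an optional, externally supplied encoding
label.  The name decode_cif_bytes and its declared_encoding parameter
are hypothetical; only the standard codecs module is used.  It is not
a proposal for spec wording, just an illustration of "accept UTF-8,
and assume it by default".

```python
import codecs


def decode_cif_bytes(data, declared_encoding=None):
    """Decode raw bytes to text, assuming UTF-8 unless told otherwise."""
    if declared_encoding is not None:
        # The producer clearly identified the encoding, so trust it.
        # (No BOM handling is attempted on this path.)
        return data.decode(declared_encoding)
    if data.startswith(codecs.BOM_UTF8):
        # Drop the UTF-8 signature so it is not part of the text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and
        # removes it.  UTF-32 is deliberately not handled here.
        return data.decode("utf-16")
    # Baseline interoperability mode: UTF-8 by default.
    return data.decode("utf-8")
```

Bytes that are not valid UTF-8 and carry no label raise
UnicodeDecodeError here, which is one way a processor could diagnose
that particular malformation.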
>>>>> Notice that naive concatenation of CIFs will remain a bad idea for
>>>>> all sorts of reasons - beyond the purely syntactic issues, one will
>>>>> get multiple "data_TOZ" declarations for example. Undoubtedly this
>>>>> will continue to happen, but perhaps increasing the number of
>>>>> occasions when blindly concatenating files triggers software errors
>>>>> will help to raise awareness and/or the use of better software tools.
>>>> You are preaching to the choir with that as far as I am concerned.  It
>>>> has never been altogether safe or reliable to assemble CIFs by
>>>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>>>> needs to make special accommodation for behavior that was never
>>>> correct in the first place.  No matter what treatment is chosen for
>>>> U+FEFF, people who exercise due care will still be able to assemble
>>>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>>>> so shrewdly.
>>>> John
