Re: [ddlm-group] UTF-8 BOM

My concern with opening up the suite of possible CIF encodings is that we need to maintain a guarantee that any CIF2-conformant writer will produce files that any CIF2-conformant reader can read.  As we are a data transfer and archiving standard, this is a core guarantee that we make, so we cannot specify optional behaviour.  Note that we are not restricted to someone transferring files between computers at a single point in time, when some negotiation of encoding protocol could take place; we may be talking about a third party retrieving a file archived some years ago by someone else in the local university repository.

What people are, and always have been, free to do is to encapsulate and encode CIFs in whatever way they wish, as long as the result is not touted as being 'CIF2 conformant'.  The optional UTF-8 BOM that we have more or less agreed to is purely in deference to poorly-written text editors, rather than an encoding signature as such.

On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:

>I'm coming to this late, I fear, but I would prefer that the spec
>be kept as simple as possible. I note the following comments in
>the Unicode FAQ document referenced by John B
>(http://www.unicode.org/faq/utf_bom.html):
>
>    "Where UTF-8 is used transparently in 8-bit environments, the use
>    of a BOM will interfere with any protocol or file format that expects
>    specific ASCII characters at the beginning, such as the use of "#!"
>    at the beginning of Unix shell scripts."

Well yes, but that applies to protocols defined in terms of 8-bit, ASCII-derived character sets ("8-bit environments").  It does not argue for BOMs to be forbidden in Unicode environments such as CIF2.  Of course, neither does it require that BOMs be accepted or recognized in Unicode environments.

>    "In the absence of a protocol supporting its use as a BOM and when
>    not at the beginning of a text stream, U+FEFF should normally not
>    occur."

I'm disappointed that you truncated the quote there.  It continues with "For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string."  It goes on to advocate using U+2060 instead, and (in the interest of full disclosure) it closes by commenting that a language or protocol can specify that U+FEFF is unsupported in the middle of a file.

>I suggest the CIF specification deprecate the use of U+FEFF so that
>*any* occurrence of it be treated formally as an error. However, a
>note should acknowledge that U+FEFF is permitted according to the
>Unicode standard at the start of a data stream, and that therefore a
>CIF reading application may at its discretion accept U+FEFF followed
>by #\#CIF2.0 as a valid magic number at the start of a file.

I don't see what is gained by forbidding U+FEFF from appearing inside data values, where one might arrive via any number of innocent means.  As it currently stands, the draft permits this.  It is somewhat problematic to allow it at the beginning or end of a whitespace-delimited value, but U+FEFF is by no means the only character that is allowed but problematic at such a position.

On the other hand, it is viable to specify that CIF itself does not (directly) include a BOM.  That's where we started.  (Pedantic note: "initial BOM" is redundant.  As the term is used in relation to Unicode, a BOM necessarily appears at the beginning of a data stream; anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow a BOM, then an otherwise well-formed CIF stream headed by a BOM would need to be interpreted either

1) as an unrecognized file, or

2) as an ill-formed CIF, or

3) as a well-formed CIF (any version) encapsulated in another protocol.  Such "another protocol" does not need to be the concern of CIF.
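
To make the three readings concrete, here is a minimal sketch in Python of a reader that is configured for one of them.  The function name classify_stream and the policy labels are hypothetical, and the magic code is simply copied as written in the quoted suggestion above; substitute whatever the final specification prescribes.  It also assumes that recognition rests on the magic code alone, which would not cover CIF1 files that lack one.

```python
import codecs

# Magic code as written in the quoted message; an assumption, not the spec.
MAGIC = b"#\\#CIF2.0"

POLICIES = ("unrecognized", "ill-formed", "encapsulated")


def classify_stream(data: bytes, bom_policy: str = "unrecognized") -> str:
    """Classify a byte stream under one of the three readings of a leading BOM.

    bom_policy selects reading 1 ("unrecognized"), 2 ("ill-formed"),
    or 3 ("encapsulated").  All names here are illustrative only.
    """
    if bom_policy not in POLICIES:
        raise ValueError(f"unknown policy: {bom_policy!r}")

    if not data.startswith(codecs.BOM_UTF8):
        # No BOM: the stream is judged on its own first bytes.
        return "cif" if data.startswith(MAGIC) else "unrecognized"

    if bom_policy == "unrecognized":
        # Reading 1: the first bytes are not the magic code, so give up.
        return "unrecognized"

    # Readings 2 and 3 both look past the BOM at what follows.
    body = data[len(codecs.BOM_UTF8):]
    if not body.startswith(MAGIC):
        return "unrecognized"
    if bom_policy == "ill-formed":
        return "ill-formed cif"
    # Reading 3: the BOM belongs to some outer protocol wrapping a CIF.
    return "encapsulated cif"
```

The only difference between readings 2 and 3 in this sketch is the label that comes back; what a program then does with an "ill-formed" versus an "encapsulated" CIF is a separate decision.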

>The idea is that any fully-conformant CIF writer will never write an
>initial UTF-8 BOM, and so any software designed to handle only fully
>conformant CIFs will not be troubled by it.

I could live with that.  I can't imagine writing a CIF processor limited to that mode of operation, nor would I want to use one, but I can handle CIF's formal scope being limited in that way.

In that case, however, let's carry it to the logical conclusion.  Rather than put one particular encoding detail outside CIF's scope, why not put character encoding out of scope altogether?  CIF can easily be defined simply in terms of "Unicode characters".  Perhaps instead of anointing UTF-8 as the One True Encoding for CIF, it would be better to make encoding an entirely separate concern.

Practically speaking, you're going to have that anyway.  Even disregarding imgCIF, does anyone really expect never to hear "it's a CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really think they need the authority of the CIF specification to require that CIFs be delivered to them in a particular encoding?  How is that qualitatively different from requiring particular CIF content, as most programs do?

>                                             Of course the world does
>contain CIFs created other than by fully-conformant CIF writers. To
>an extent the community should decide for itself how best to attempt
>to handle deviations from full conformance. It would help, perhaps, if
>those of us writing CIF readers would document specific practices that
>the software takes to accommodate such deviations. Ideally, such
>software should have a verbose logging mode that can be activated
>whenever surprising behaviour in reading CIFs is encountered by
>the user.

I think it's exceedingly optimistic to expect "the community" to arrive at and abide by a single, consistent set of best practices.  The best you can hope for is that a small number of organizations and / or programs will exert enough influence to establish their own de facto standards.

We can exert some influence there, however.  Either the CIF spec or a companion spec could establish conformance requirements for CIF *processors*, including, for example, the ability to diagnose particular malformations.  The XML spec does this, as do some programming language specs.

Such a document could also establish, perhaps, that CIF processors must be able to accept the UTF-8 encoding, and maybe even that they must assume UTF-8 by default.  That would establish the baseline and a guaranteed interoperability mode that we would otherwise lose by pushing character encoding outside the format specification.
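
As a rough illustration of that baseline, the following Python sketch decodes a byte stream as UTF-8 unless some other encoding has been agreed out of band.  The function name decode_cif and its encoding parameter are hypothetical; nothing here is drawn from any draft of the specification.

```python
from typing import Optional


def decode_cif(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode CIF bytes to Unicode text.

    UTF-8 is assumed when no encoding is supplied, which is the
    guaranteed interoperability mode described above.  Decoding is
    strict, so bytes that are invalid in the chosen encoding raise
    UnicodeDecodeError instead of being silently replaced.
    """
    return data.decode(encoding or "utf-8", errors="strict")
```

A caller who knows the file arrived in some other encoding passes its name explicitly, for example decode_cif(raw, "latin-1"); everyone else gets UTF-8 without having to ask.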

>Notice that naive concatenation of CIFs will remain a bad idea for
>all sorts of reasons - beyond the purely syntactic issues, one will
>get multiple "data_TOZ" declarations for example. Undoubtedly this
>will continue to happen, but perhaps increasing the number of
>occasions when blindly concatenating files triggers software errors
>will help to raise awareness and/or the use of better software tools.

You are preaching to the choir with that as far as I am concerned.  It has never been altogether safe or reliable to assemble CIFs by concatenation of fragments or complete CIFs, and I don't see why CIF2 needs to make special accommodation for behavior that was never correct in the first place.  No matter what treatment is chosen for U+FEFF, people who exercise due care will still be able to assemble well-formed CIF2 files from fragments, even by using 'cat' if they do so shrewdly.
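
For what it's worth, the U+FEFF part of "due care" is small.  The Python sketch below joins UTF-8 encoded fragments while dropping any BOM at the head of each one, so that none ends up in the middle of the result.  The function name join_fragments is hypothetical, and the sketch deliberately does nothing about the other hazards mentioned above, such as duplicate data block names.

```python
import codecs
from typing import Iterable


def join_fragments(fragments: Iterable[bytes]) -> bytes:
    """Concatenate UTF-8 fragments, stripping a leading BOM from each.

    A newline is appended to any non-empty fragment that lacks one, so
    the last token of one fragment cannot run into the first token of
    the next.  No check is made that the result is a well-formed CIF.
    """
    pieces = []
    for frag in fragments:
        if frag.startswith(codecs.BOM_UTF8):
            frag = frag[len(codecs.BOM_UTF8):]
        if frag and not frag.endswith(b"\n"):
            frag += b"\n"
        pieces.append(frag)
    return b"".join(pieces)
```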

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148