Re: [ddlm-group] UTF-8 BOM

On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:

>I'm coming to this late, I fear, but I would prefer that the spec
>be kept as simple as possible. I note the following comments in
>the Unicode FAQ document referenced by John B
>(http://www.unicode.org/faq/utf_bom.html):
>
>    "Where UTF-8 is used transparently in 8-bit environments, the use
>    of a BOM will interfere with any protocol or file format that expects
>    specific ASCII characters at the beginning, such as the use of "#!"
>    at the beginning of Unix shell scripts."

Well yes, but that applies to protocols defined in terms of 8-bit,
ASCII-derived character sets ("8-bit environments").  It does not
argue for BOMs to be forbidden in Unicode environments such as CIF2.
Of course, neither does it require that BOMs be accepted or recognized
in Unicode environments.

>    "In the absence of a protocol supporting its use as a BOM and when
>    not at the beginning of a text stream, U+FEFF should normally not
>    occur."

I'm disappointed that you truncated the quote there.  It continues
with "For backwards compatibility it should be treated as ZERO WIDTH
NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
file or string."  It goes on to advocate using U+2060 instead, and (in
the interest of full disclosure) it closes by commenting that a
language or protocol can specify that U+FEFF is unsupported in the
middle of a file.

>I suggest the CIF specification deprecate the use of U+FEFF so that
>*any* occurrence of it be treated formally as an error. However, a
>note should acknowledge that U+FEFF is permitted according to the
>Unicode standard at the start of a data stream, and that therefore a
>CIF reading application may at its discretion accept U+FEFF followed
>by #\#CIF2.0 as a valid magic number at the start of a file.

I don't see what is gained by forbidding U+FEFF from appearing inside
data values, where one might arrive via any number of innocent means.
As it currently stands, the draft permits this.  It is somewhat
problematic to allow it at the beginning or end of a
whitespace-delimited value, but U+FEFF is by no means the only
character that is allowed but problematic at such a position.
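
A minimal sketch of why U+FEFF is awkward at the edge of a
whitespace-delimited value, assuming a tokenizer that splits on
Unicode whitespace.  U+FEFF is a format character rather than
whitespace, so it stays attached to the value.  The helper names and
the sample data name are illustrative only, not taken from the draft.

```python
ZWNBSP = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE


def split_values(line: str) -> list[str]:
    """Split a line on Unicode whitespace, as a simple tokenizer might."""
    return line.split()


def has_edge_feff(token: str) -> bool:
    """Report whether a token begins or ends with U+FEFF."""
    return token.startswith(ZWNBSP) or token.endswith(ZWNBSP)


# Illustrative line: a data name followed by a value that happens to
# carry a leading U+FEFF (invisible in most editors).
line = "_cell.length_a " + ZWNBSP + "5.4321"
tokens = split_values(line)

# The U+FEFF is not consumed as a separator; it is part of the value.
suspicious = [t for t in tokens if has_edge_feff(t)]
```

A reader that wanted to diagnose such values could flag every token
for which has_edge_feff() is true instead of rejecting the file.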

On the other hand, it is viable to specify that CIF itself does not
(directly) include a BOM.  That's where we started.  (Pedantic note:
"initial BOM" is redundant.  As the term is used in relation to
Unicode, a BOM necessarily appears at the beginning of a data stream;
anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
a BOM, then an otherwise well-formed CIF stream headed by a BOM would
need to be interpreted either

1) as an unrecognized file, or

2) as an ill-formed CIF, or

3) as a well-formed CIF (any version) encapsulated in another protocol.
Such "another protocol" does not need to be the concern of CIF.
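
The three readings above can be sketched as a policy switch in a
reader.  This is only an illustration: the function name, the policy
labels, and the returned strings are hypothetical, and the MAGIC
constant is an assumption (written here as #\#CIF_2.0).

```python
import codecs

# Assumption: the CIF2 magic code, as bytes.  Adjust to match the spec
# revision being targeted.
MAGIC = b"#\\#CIF_2.0"

POLICIES = ("unrecognized", "ill-formed", "encapsulated")


def classify(raw: bytes, policy: str = "encapsulated") -> str:
    """Classify the start of a byte stream under one BOM policy.

    policy selects how a UTF-8 BOM followed by the magic code is read:
      "unrecognized"  -> not a CIF at all            (option 1)
      "ill-formed"    -> a CIF with a syntax error   (option 2)
      "encapsulated"  -> a CIF wrapped in some other
                         protocol that adds the BOM  (option 3)
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if raw.startswith(MAGIC):
        return "cif2"
    if raw.startswith(codecs.BOM_UTF8 + MAGIC):
        return {
            "unrecognized": "unrecognized",
            "ill-formed": "ill-formed cif2",
            "encapsulated": "cif2 in BOM wrapper",
        }[policy]
    return "unrecognized"
```

Under the "encapsulated" policy the caller would strip the three BOM
bytes and hand the remainder to the CIF parser proper.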

>The idea is that any fully-conformant CIF writer will never write an
>initial UTF-8 BOM, and so any software designed to handle only fully
>conformant CIFs will not be troubled by it.

I could live with that.  I can't imagine writing a CIF processor
limited to that mode of operation, nor would I want to use one, but I
can handle CIF's formal scope being limited in that way.

In that case, however, let's carry it to the logical conclusion.
Rather than put one particular encoding detail outside CIF's scope,
why not put character encoding out of scope altogether?  CIF can
easily be defined simply in terms of "Unicode characters".  Perhaps
instead of anointing UTF-8 as the One True Encoding for CIF, it would
be better to make encoding an entirely separate concern.

Practically speaking, you're going to have that anyway.  Even
disregarding imgCIF, does anyone really expect never to hear "it's a
CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone
really think they need the authority of the CIF specification to
require that CIFs be delivered to them in a particular encoding?  How
is that qualitatively different from requiring particular CIF content,
as most programs do?


>                                             Of course the world does
>contain CIFs created other than by fully-conformant CIF writers. To
>an extent the community should decide for itself how best to attempt
>to handle deviations from full conformance. It would help, perhaps, if
>those of us writing CIF readers would document specific practices that
>the software takes to accommodate such deviations. Ideally, such
>software should have a verbose logging mode that can be activated
>whenever surprising behaviour in reading CIFs is encountered by
>the user.

I think it's exceedingly optimistic to expect "the community" to
arrive at and abide by a single, consistent set of best practices.
The best you can hope for is that a small number of organizations
and/or programs will exert enough influence to establish their own de
facto standards.

We can exert some influence there, however.  Either the CIF spec or a
companion spec could establish conformance requirements for CIF
*processors*, including, for example, the ability to diagnose
particular malformations.  The XML spec does this, as do some
programming language specs.

Such a document could also establish, perhaps, that CIF processors
must be able to accept the UTF-8 encoding, and maybe even that they
must assume UTF-8 by default.  That would establish the baseline and a
guaranteed interoperability mode that we would otherwise lose by
pushing character encoding outside the format specification.
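
As a rough sketch of that split, assuming a processor that keeps
decoding and parsing as separate layers: the decoder defaults to
UTF-8 but accepts any encoding the caller names, and the parsing
layer sees only Unicode characters.  Both function names are
hypothetical, and block_names() is a toy stand-in for a real parser
(it ignores text fields, for instance).

```python
def decode_cif(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn bytes into Unicode characters; UTF-8 unless told otherwise.

    Strict error handling means undecodable input is reported rather
    than silently repaired.
    """
    return raw.decode(encoding, errors="strict")


def block_names(text: str) -> list[str]:
    """Toy stand-in for a parser: list data block names.

    It works purely on characters and knows nothing about how they
    were encoded on disk.
    """
    names = []
    for line in text.splitlines():
        if line.lower().startswith("data_"):
            names.append(line.split()[0][len("data_"):])
    return names


source = "data_TOZ\n_example.note 'caf\u00e9'\n"

# The same characters delivered in two encodings parse identically.
from_utf8 = block_names(decode_cif(source.encode("utf-8")))
from_utf16 = block_names(decode_cif(source.encode("utf-16"), "utf-16"))
```

The default argument is where a processor conformance document could
pin down "assume UTF-8 unless told otherwise".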


>Notice that naive concatenation of CIFs will remain a bad idea for
>all sorts of reasons - beyond the purely syntactic issues, one will
>get multiple "data_TOZ" declarations for example. Undoubtedly this
>will continue to happen, but perhaps increasing the number of
>occasions when blindly concatenating files triggers software errors
>will help to raise awareness and/or the use of better software tools.

You are preaching to the choir with that as far as I am concerned.  It
has never been altogether safe or reliable to assemble CIFs by
concatenation of fragments or complete CIFs, and I don't see why CIF2
needs to make special accommodation for behavior that was never
correct in the first place.  No matter what treatment is chosen for
U+FEFF, people who exercise due care will still be able to assemble
well-formed CIF2 files from fragments, even by using 'cat' if they do
so shrewdly.
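
A small sketch of the difference between blind and careful
concatenation, assuming UTF-8 input files that may each begin with a
BOM.  The function names are illustrative.  Note that stripping BOMs
does nothing about the duplicate data block names mentioned in the
quoted text; that needs a separate check.

```python
import codecs


def naive_cat(parts: list[bytes]) -> bytes:
    """Join files byte for byte, as plain 'cat' would."""
    return b"".join(parts)


def careful_cat(parts: list[bytes]) -> bytes:
    """Join files after dropping a leading UTF-8 BOM from each one."""
    bom = codecs.BOM_UTF8
    return b"".join(p[len(bom):] if p.startswith(bom) else p for p in parts)


def duplicate_block_names(text: str) -> list[str]:
    """List data block headers that occur more than once (case-folded)."""
    seen, dups = set(), []
    for line in text.splitlines():
        if line.lower().startswith("data_"):
            name = line.split()[0].lower()
            if name in seen:
                dups.append(name)
            seen.add(name)
    return dups


a = codecs.BOM_UTF8 + b"data_TOZ\n_x 1\n"
b = codecs.BOM_UTF8 + b"data_TOZ\n_x 2\n"

# Blind concatenation leaves the second BOM as U+FEFF in mid-stream.
naive_text = naive_cat([a, b]).decode("utf-8")
# Careful concatenation removes it, but both blocks are still "data_TOZ".
careful_text = careful_cat([a, b]).decode("utf-8")
dups = duplicate_block_names(careful_text)
```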

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital