Discussion List Archives

Re: [ddlm-group] UTF-8 BOM

Dear Colleagues,

   This is quite a disruptive change.  Until now, CIF has always
assumed that text encodings change from one machine to another.  I am
in favor of moving the entire world towards a common representation of
text, and supporting the multiple Unicode representations in use on
current systems would be a large positive step.  However, I think
it is a little premature (by about 10 years) to assume a
world of UTF-8 purity.  We ain't there yet.

   You are essentially making CIF2 into a binary format instead
of a text format.  That is a truly disruptive change.  I think
it is a serious mistake that will discourage use of CIF as an
interchange format, not encourage it.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 16 Jun 2010, James Hester wrote:

> My concern with opening up the suite of possible CIF encodings is that we
> need to maintain a guarantee that any CIF2-conformant writer will produce
> files that any CIF2-conformant reader can read.  As we are a data transfer
> and archiving standard, this is a core guarantee that we make, so we cannot
> specify optional behaviour.  Note that we are not restricted to someone
> transferring files between computers at a single point in time, when some
> negotiation of encoding protocol could take place; we may be talking about a
> third party retrieving a file archived some years ago by someone else in the
> local university repository.
> 
> What people are and have always been free to do is to encapsulate and encode
> CIFs in whatever way they wish, as long as the result is not touted as being
> 'CIF2 conformant'.  The optional UTF8 BOM that we have more or less agreed
> to is purely in deference to poorly-written text editors, rather than an
> encoding signature as such.
> 
> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
> <John.Bollinger@stjude.org> wrote:
>       On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>
>       >I'm coming to this late, I fear, but I would prefer that the spec
>       >be kept as simple as possible. I note the following comments in
>       >the Unicode FAQ document referenced by John B
>       >(http://www.unicode.org/faq/utf_bom.html):
>       >
>       >    "Where UTF-8 is used transparently in 8-bit environments, the use
>       >    of a BOM will interfere with any protocol or file format that expects
>       >    specific ASCII characters at the beginning, such as the use of "#!"
>       >    of at the beginning of Unix shell scripts."
> 
> Well yes, but that applies to protocols defined in terms of 8-bit,
> ASCII-derived character sets ("8-bit environments").  It does not
> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>  Of course, neither does it require that BOMs be accepted or
> recognized in Unicode environments.
> 
> >    "In the absence of a protocol supporting its use as a BOM and when
> >    not at the beginning of a text stream, U+FEFF should normally not
> >    occur."
> 
> I'm disappointed that you truncated the quote there.  It continues
> with "For backwards compatibility it should be treated as ZERO WIDTH
> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
> file or string."  It goes on to advocate using U+2060 instead, and (in
> the interest of full disclosure) it closes by commenting that a
> language or protocol can specify that U+FEFF is unsupported in the
> middle of a file.
> 
> >I suggest the CIF specification deprecate the use of U+FEFF so that
> >*any* occurrence of it be treated formally as an error. However, a
> >note should acknowledge that U+FEFF is permitted according to the
> >Unicode standard at the start of a data stream, and that therefore a
> >CIF reading application may at its discretion accept U+FEFF followed
> >by #\#CIF2.0 as a valid magic number at the start of a file.
> 
> I don't see what is gained by forbidding U+FEFF from appearing inside
> data values, where one might arrive via any number of innocent means.
>  As it currently stands, the draft permits this.  It is somewhat
> problematic to allow it at the beginning or end of a
> whitespace-delimited value, but U+FEFF is by no means the only
> character that is allowed but problematic at such a position.
> 
> On the other hand, it is viable to specify that CIF itself does not
> (directly) include a BOM.  That's where we started.  (Pedantic note:
> "initial BOM" is redundant.  As the term is used in relation to
> Unicode, a BOM necessarily appears at the beginning of a data stream;
> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
> a BOM then an otherwise well-formed CIF stream headed by a BOM would
> then need to be interpreted either
> 
> 1) as an unrecognized file, or
> 
> 2) as an ill-formed CIF, or
> 
> 3) as a well-formed CIF (any version) encapsulated in another
> protocol.  Such "another protocol" does not need to be the concern of
> CIF.
> 
> >The idea is that any fully-conformant CIF writer will never write an
> >initial UTF-8 BOM, and so any software designed to handle only fully
> >conformant CIFs will not be troubled by it.
> 
> I could live with that.  I can't imagine writing a CIF processor
> limited to that mode of operation, nor would I want to use one, but I
> can handle CIF's formal scope being limited in that way.
> 
> In that case, however, let's carry it to the logical conclusion.
>  Rather than put one particular encoding detail outside CIF's scope,
> why not put character encoding out of scope altogether?  CIF can
> easily be defined simply in terms of "Unicode characters".  Perhaps
> instead of anointing UTF-8 as the One True Encoding for CIF, it would
> be better to make encoding an entirely separate concern.
> 
> Practically speaking, you're going to have that anyway.  Even
> disregarding imgCIF, does anyone really expect never to hear "it's a
> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
> think they need the authority of the CIF specification to require that
> CIFs be delivered to them in a particular encoding?  How is that
> qualitatively different from requiring particular CIF content, as most
> programs do?
> 
> >                                             Of course the world does
> >contain CIFs created other than by fully-conformant CIF writers. To
> >an extent the community should decide for itself how best to attempt
> >to handle deviations from full conformance. It would help, perhaps, if
> >those of us writing CIF readers would document specific practices that
> >the software takes to accommodate such deviations. Ideally, such
> >software should have a verbose logging mode that can be activated
> >whenever surprising behaviour in reading CIFs is encountered by
> >the user.
> 
> I think it's exceedingly optimistic to expect "the community" to
> arrive at and abide by a single, consistent set of best practices.
>  The best you can hope for is that a small number of organizations and
> / or programs will exert enough influence to establish their own de
> facto standards.
> 
> We can exert some influence there, however.  Either the CIF spec or a
> companion spec could establish conformance requirements for CIF
> *processors*, including, for example, the ability to diagnose
> particular malformations.  The XML spec does this, as do some
> programming language specs.
> 
> Such a document could also establish, perhaps, that CIF processors
> must be able to accept the UTF-8 encoding, and maybe even that they
> must assume UTF-8 by default.  That would establish the baseline and a
> guaranteed interoperability mode that we would otherwise lose by
> pushing character encoding outside the format specification.
> 
> >Notice that naive concatenation of CIFs will remain a bad idea for
> >all sorts of reasons - beyond the purely syntactic issues, one will
> >get multiple "data_TOZ" declarations for example. Undoubtedly this
> >will continue to happen, but perhaps increasing the number of
> >occasions when blindly concatenating files triggers software errors
> >will help to raise awareness and/or the use of better software tools.
> 
> You are preaching to the choir with that as far as I am concerned.  It
> has never been altogether safe or reliable to assemble CIFs by
> concatenation of fragments or complete CIFs, and I don't see why CIF2
> needs to make special accommodation for behavior that was never
> correct in the first place.  No matter what treatment is chosen for
> U+FEFF, people who exercise due care will still be able to assemble
> well-formed CIF2 files from fragments, even by using 'cat' if they do
> so shrewdly.
> 
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> 
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
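
Editorial sketch (not part of the original message): the thread above weighs
two treatments of a stream headed by a UTF-8 BOM, namely a lenient reader that
accepts it and a strict reader that treats it as outside CIF proper, and notes
that U+FEFF elsewhere in the file is ordinary content.  A minimal Python
illustration follows.  The names decode_cif_bytes, accept_bom and
join_fragments are hypothetical, chosen for this example only; nothing here is
taken from the CIF2 draft or from any existing CIF library.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def decode_cif_bytes(data: bytes, accept_bom: bool = True) -> str:
    """Decode a byte stream as UTF-8 text, handling a leading BOM.

    accept_bom=True  : lenient reader, silently drops one leading BOM.
    accept_bom=False : strict reader, rejects a stream headed by a BOM.
    A U+FEFF anywhere else is left in place as ordinary content.
    (Hypothetical helper for illustration only.)
    """
    if data.startswith(UTF8_BOM):
        if not accept_bom:
            raise ValueError("stream begins with a UTF-8 BOM")
        data = data[len(UTF8_BOM):]
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    return data.decode("utf-8")


def join_fragments(fragments):
    """Assemble text from byte fragments, dropping each fragment's leading BOM.

    This addresses only the BOM problem with concatenation; duplicate
    data block names and other issues still need separate checks.
    """
    return "".join(decode_cif_bytes(f) for f in fragments)
```

With accept_bom=False the function plays the role of a reader limited to
fully conformant input; which of the thread's three interpretations the
resulting error maps to is left to the caller.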

International Union of Crystallography
