
Re: [ddlm-group] UTF-8 BOM

Dear Colleagues,

   This is quite a disruptive change.  Until now, CIF has always
assumed machine-dependent encoding changes.  I am in favor of
working the entire world towards a common representation of text,
and the use of multiple Unicode representations supported on
current systems is going to be a large positive step.  I think
it is a little premature (by about 10 years) to assume a
world of UTF-8 purity.  We ain't there yet.

   You are essentially making CIF2 into a binary format instead
of a text format.  That is a truly disruptive change.  I think
it is a serious mistake that will discourage use of CIF as an
interchange format, not encourage it.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 16 Jun 2010, James Hester wrote:

> My concern with opening up the suite of possible CIF encodings is that we
> need to maintain a guarantee that any CIF2-conformant writer will produce
> files that any CIF2-conformant reader can read.  As we are a data transfer
> and archiving standard, this is a core guarantee that we make, so we cannot
> specify optional behaviour.  Note that we are not restricted to someone
> transferring files between computers at a single point in time, when some
> negotiation of encoding protocol could take place; we may be talking about a
> third party retrieving a file archived some years ago by someone else in the
> local university repository.
> 
> What people are and have always been free to do is to encapsulate and encode
> CIFs in whatever way they wish, as long as the result is not touted as being
> 'CIF2 conformant'.  The optional UTF-8 BOM that we have more or less agreed
> to is purely in deference to poorly-written text editors, rather than an
> encoding signature as such.
> 
> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
> <John.Bollinger@stjude.org> wrote:
>       On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>
>       >I'm coming to this late, I fear, but I would prefer that the spec
>       >be kept as simple as possible. I note the following comments in
>       >the Unicode FAQ document referenced by John B
>       >(http://www.unicode.org/faq/utf_bom.html):
>       >
>       >    "Where UTF-8 is used transparently in 8-bit environments, the use
>       >    of a BOM will interfere with any protocol or file format that expects
>       >    specific ASCII characters at the beginning, such as the use of "#!"
>       >    at the beginning of Unix shell scripts."
> 
> Well yes, but that applies to protocols defined in terms of 8-bit,
> ASCII-derived character sets ("8-bit environments").  It does not
> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>  Of course, neither does it require that BOMs be accepted or
> recognized in Unicode environments.
> 
> >    "In the absence of a protocol supporting its use as a BOM and when
> >    not at the beginning of a text stream, U+FEFF should normally not
> >    occur."
> 
> I'm disappointed that you truncated the quote there.  It continues
> with "For backwards compatibility it should be treated as ZERO WIDTH
> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
> file or string."  It goes on to advocate using U+2060 instead, and (in
> the interest of full disclosure) it closes by commenting that a
> language or protocol can specify that U+FEFF is unsupported in the
> middle of a file.
> 
> >I suggest the CIF specification deprecate the use of U+FEFF so that
> >*any* occurrence of it be treated formally as an error. However, a
> >note should acknowledge that U+FEFF is permitted according to the
> >Unicode standard at the start of a data stream, and that therefore a
> >CIF reading application may at its discretion accept U+FEFF followed
> >by #\#CIF2.0 as a valid magic number at the start of a file.
> 
> I don't see what is gained by forbidding U+FEFF from appearing inside
> data values, where one might arrive via any number of innocent means.
>  As it currently stands, the draft permits this.  It is somewhat
> problematic to allow it at the beginning or end of a
> whitespace-delimited value, but U+FEFF is by no means the only
> character that is allowed but problematic at such a position.
> 
> On the other hand, it is viable to specify that CIF itself does not
> (directly) include a BOM.  That's where we started.  (Pedantic note:
> "initial BOM" is redundant.  As the term is used in relation to
> Unicode, a BOM necessarily appears at the beginning of a data stream;
> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
> a BOM then an otherwise well-formed CIF stream headed by a BOM would
> then need to be interpreted either
> 
> 1) as an unrecognized file, or
> 
> 2) as an ill-formed CIF, or
> 
> 3) as a well-formed CIF (any version) encapsulated in another
> protocol.  Such "another protocol" does not need to be the concern of
> CIF.
> 
> >The idea is that any fully-conformant CIF writer will never write an
> >initial UTF-8 BOM, and so any software designed to handle only fully
> >conformant CIFs will not be troubled by it.
> 
> I could live with that.  I can't imagine writing a CIF processor
> limited to that mode of operation, nor would I want to use one, but I
> can handle CIF's formal scope being limited in that way.
> 
> In that case, however, let's carry it to the logical conclusion.
>  Rather than put one particular encoding detail outside CIF's scope,
> why not put character encoding out of scope altogether?  CIF can
> easily be defined simply in terms of "Unicode characters".  Perhaps
> instead of anointing UTF-8 as the One True Encoding for CIF, it would
> be better to make encoding an entirely separate concern.
> 
> Practically speaking, you're going to have that anyway.  Even
> disregarding imgCIF, does anyone really expect never to hear "it's a
> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
> think they need the authority of the CIF specification to require that
> CIFs be delivered to them in a particular encoding?  How is that
> qualitatively different from requiring particular CIF content, as most
> programs do?
> 
> >                                             Of course the world does
> >contain CIFs created other than by fully-conformant CIF writers. To
> >an extent the community should decide for itself how best to attempt
> >to handle deviations from full conformance. It would help, perhaps, if
> >those of us writing CIF readers would document specific practices that
> >the software takes to accommodate such deviations. Ideally, such
> >software should have a verbose logging mode that can be activated
> >whenever surprising behaviour in reading CIFs is encountered by
> >the user.
> 
> I think it's exceedingly optimistic to expect "the community" to
> arrive at and abide by a single, consistent set of best practices.
>  The best you can hope for is that a small number of organizations and
> / or programs will exert enough influence to establish their own de
> facto standards.
> 
> We can exert some influence there, however.  Either the CIF spec or a
> companion spec could establish conformance requirements for CIF
> *processors*, including, for example, the ability to diagnose
> particular malformations.  The XML spec does this, as do some
> programming language specs.
> 
> Such a document could also establish, perhaps, that CIF processors
> must be able to accept the UTF-8 encoding, and maybe even that they
> must assume UTF-8 by default.  That would establish the baseline and a
> guaranteed interoperability mode that we would otherwise lose by
> pushing character encoding outside the format specification.
> 
> >Notice that naive concatenation of CIFs will remain a bad idea for
> >all sorts of reasons - beyond the purely syntactic issues, one will
> >get multiple "data_TOZ" declarations for example. Undoubtedly this
> >will continue to happen, but perhaps increasing the number of
> >occasions when blindly concatenating files triggers software errors
> >will help to raise awareness and/or the use of better software tools.
> 
> You are preaching to the choir with that as far as I am concerned.  It
> has never been altogether safe or reliable to assemble CIFs by
> concatenation of fragments or complete CIFs, and I don't see why CIF2
> needs to make special accommodation for behavior that was never
> correct in the first place.  No matter what treatment is chosen for
> U+FEFF, people who exercise due care will still be able to assemble
> well-formed CIF2 files from fragments, even by using 'cat' if they do
> so shrewdly.
> 
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> 
> 
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>
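The reader behaviour debated in the quoted thread (accept a leading
UTF-8 BOM at the reader's discretion, but leave any later U+FEFF in the
content) can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not anything taken from the CIF2
draft: the magic string is copied verbatim from the thread, and the
names CIF2_MAGIC, classify_cif_start, decode_cif2 and accept_bom are
hypothetical.

```python
import codecs

# Assumption: magic number copied verbatim from the thread ("#\#CIF2.0").
CIF2_MAGIC = b"#\\#CIF2.0"


def classify_cif_start(data: bytes) -> str:
    """Classify the first bytes of a candidate CIF2 byte stream.

    Returns "cif2" (magic number at offset 0), "cif2-with-bom"
    (UTF-8 BOM immediately followed by the magic number), or
    "unrecognized" (anything else).
    """
    if data.startswith(CIF2_MAGIC):
        return "cif2"
    if data.startswith(codecs.BOM_UTF8 + CIF2_MAGIC):
        return "cif2-with-bom"
    return "unrecognized"


def decode_cif2(data: bytes, accept_bom: bool = True) -> str:
    """Decode a CIF2 byte stream as UTF-8.

    accept_bom is a hypothetical switch: a lenient reader strips a
    leading BOM, a strict reader rejects it as outside the format.
    """
    kind = classify_cif_start(data)
    if kind == "unrecognized":
        raise ValueError("no CIF2 magic number at start of stream")
    if kind == "cif2-with-bom":
        if not accept_bom:
            raise ValueError("leading UTF-8 BOM rejected in strict mode")
        data = data[len(codecs.BOM_UTF8):]
    # Plain "utf-8" rather than "utf-8-sig": only the BOM handled above
    # is dropped, so a U+FEFF inside a data value stays in the content.
    return data.decode("utf-8")
```

With accept_bom=False the sketch behaves like a reader limited to
streams with no BOM; with the default it treats the BOM as an
encapsulation detail and hands the same text to the parser either way.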
