Re: [ddlm-group] UTF-8 BOM

To aid Simon and perhaps others in their consideration of allowing
multiple encodings, I restate below my opinions, some of which were
aired in the thread starting at
http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00068.html (see
that thread for others' opinions as well, of course).

1. I do not see any reason to specify reading and writing multiple
encodings.  We are not writing a web browser or word processor or
operating system.  Multiple possible encodings add complexity to the
standard, make extra work for implementors, and provide more
opportunities for mistransfer.  UTF8 has clear advantages as a Unicode
encoding, for example the accessibility of Latin text to common
text-handling tools.  These advantages are lost or diluted if UTF8 is
merely one of several optional encodings.

2. If Simon is concerned that he will see multiple encodings anyway,
then at least when a file is required to be UTF8, one that arrives in
some other encoding will be detected as an error very quickly.

3. There is no such thing as "optional" behaviour for a CIF.  A
conformant reader must be able to process any file written by a
conformant writer that is distant in time and space.  There is *no*
guaranteed opportunity to negotiate encodings.  Explicitly specifying
the encoding in the standard removes the need for this negotiation.

4. Of course CIF2 files (or CIF1 files for that matter) can be contained
in an encoding envelope, tar file, HDF5 file and so on.  Other standards
deal with such encapsulation and any negotiation involved (I think so
far Herbert would agree), and we have no reason to be concerned with
them in the syntax document.
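
As an illustration of point 2, here is a minimal sketch (in Python, not
part of any CIF specification) of a reader that accepts only UTF8.  The
function names decode_cif2_bytes and is_utf8 are hypothetical, and
skipping a single leading UTF8 BOM is an assumption based on the
optional BOM discussed in this thread.  Text in a legacy 8-bit encoding,
or in UTF-16 with a BOM, will typically fail strict decoding at the
first offending byte.  Some inputs (pure ASCII, for example) are valid
in several encodings at once, so this is a quick check rather than a
proof of the encoding.

```python
import codecs


def decode_cif2_bytes(raw: bytes) -> str:
    """Decode raw file bytes as strict UTF-8 (hypothetical helper).

    A single leading UTF-8 BOM is skipped; this is an assumption taken
    from the optional BOM discussed in the thread, not a spec requirement.
    Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8", errors="strict")


def is_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly under the rule above."""
    try:
        decode_cif2_bytes(raw)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    text = "_cell_length_a 5.43 # \u00c5"  # ends with U+00C5 (A with ring)
    print(is_utf8(text.encode("utf-8")))                    # well-formed UTF-8
    print(is_utf8(codecs.BOM_UTF8 + text.encode("utf-8")))  # leading BOM skipped
    print(is_utf8(text.encode("latin-1")))                  # lone 0xC5 byte fails
    print(is_utf8(text.encode("utf-16")))                   # UTF-16 BOM bytes fail
```

The design choice here is to fail loudly instead of guessing: the
reader never tries a second encoding, so a mis-encoded file is reported
at read time rather than silently producing wrong characters.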

On Wed, Jun 16, 2010 at 9:29 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
>  As I said in my last message, I am proposing that we do what
> most of the world really does with Unicode -- treat a CIF2 as
> a text file in which the information presented is a sequence
> of valid printable Unicode code points no matter what the
> encoding.
>
>  For convenience in interchange, I am proposing that all
> CIF2 processing software working on systems that provide
> support for UTF-8 must provide support for that particular
> encoding, but if someone happens to be working in a system
> that only supports a UTF-7 or a UTF-16 or an old code-page-based
> encoding then I see no reason to declare what they produce
> erroneous in any way -- just a reason to require that they
> clearly identify the encoding used so that one of the
> many reliable encoding conversion programs that are available
> may be passed over their file when it needs to be handled
> in the preferred encoding.  I happen to use cyclone on my
> mac for that purpose.
>
>  The use of a BOM is just a quick, simple way to clearly
> specify the encoding when the file holding the text
> is a Unicode file, but it really is not part of the text
> itself.
>
>  I, the strong proponent of supporting binary with CIF,
> am proposing that we return to the original approach
> to CIF -- that it really is a text file, not a binary file.
> I do so precisely to help me support the handling of
> binary with CIF.
>
>  Regards,
>    Herbert
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Wed, 16 Jun 2010, James Hester wrote:
>
>> Dear Herbert,
>>
>> Would you mind enlarging a little on what you are responding to here,
>> as I don't follow your thinking.
>> Perhaps I was not clear: I am not in favour of allowing a variety of
>> encodings to be included within the CIF2 standard.  I am advocating
>> UTF8 only.  Is this what you are responding to, or are you discussing
>> the suggestion of allowing a variety of encodings?
>>
>> On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>>
>>> Dear Colleagues,
>>>
>>>  This is quite a disruptive change.  Until now CIF has always had
>>> machine-dependent encoding changes assumed.  I am in favor of
>>> working the entire world towards a common representation of text,
>>> and the use of multiple Unicode representations supported on
>>> current systems is going to be a large positive step.  I think
>>> it is a little premature (by about 10 years) to assume a
>>> world of UTF-8 purity.  We ain't there yet.
>>>
>>>  You are essentially making CIF2 into a binary format instead
>>> of a text format.  That is a truly disruptive change.  I think
>>> it is a serious mistake that will discourage use of CIF as an
>>> interchange format, not encourage it.
>>>
>>>  Regards,
>>>    Herbert
>>>
>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>
>>>> My concern with opening up the suite of possible CIF encodings is that
>>>> we need to maintain a guarantee that any CIF2-conformant writer will
>>>> produce files that any CIF2-conformant reader can read.  As we are a
>>>> data transfer and archiving standard, this is a core guarantee that we
>>>> make, so we cannot specify optional behaviour.  Note that we are not
>>>> restricted to someone transferring files between computers at a single
>>>> point in time, when some negotiation of encoding protocol could take
>>>> place; we may be talking about a third party retrieving a file archived
>>>> some years ago by someone else in the local university repository.
>>>>
>>>> What people are and have always been free to do is to encapsulate and
>>>> encode CIFs in whatever way they wish, as long as the result is not
>>>> touted as being 'CIF2 conformant'.  The optional UTF8 BOM that we have
>>>> more or less agreed to is purely in deference to poorly-written text
>>>> editors, rather than an encoding signature as such.
>>>>
>>>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>>>> <John.Bollinger@stjude.org> wrote:
>>>>      On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>>>
>>>>      >I'm coming to this late, I fear, but I would prefer that the spec
>>>>      >be kept as simple as possible. I note the following comments in
>>>>      >the Unicode FAQ document referenced by John B
>>>>      >(http://www.unicode.org/faq/utf_bom.html):
>>>>      >
>>>>      >    "Where UTF-8 is used transparently in 8-bit environments, the use
>>>>      >    of a BOM will interfere with any protocol or file format that expects
>>>>      >    specific ASCII characters at the beginning, such as the use of "#!"
>>>>      >    of at the beginning of Unix shell scripts."
>>>>
>>>> Well yes, but that applies to protocols defined in terms of 8-bit,
>>>> ASCII-derived character sets ("8-bit environments").  It does not
>>>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>>>>  Of course, neither does it require that BOMs be accepted or
>>>> recognized in Unicode environments.
>>>>
>>>>>    "In the absence of a protocol supporting its use as a BOM and when
>>>>>    not at the beginning of a text stream, U+FEFF should normally not
>>>>>    occur."
>>>>
>>>> I'm disappointed that you truncated the quote there.  It continues
>>>> with "For backwards compatibility it should be treated as ZERO WIDTH
>>>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>>>> file or string."  It goes on to advocate using U+2060 instead, and (in
>>>> the interest of full disclosure) it closes by commenting that a
>>>> language or protocol can specify that U+FEFF is unsupported in the
>>>> middle of a file.
>>>>
>>>>> I suggest the CIF specification deprecate the use of U+FEFF so that
>>>>> *any* occurrence of it be treated formally as an error. However, a
>>>>> note should acknowledge that U+FEFF is permitted according to the
>>>>> Unicode standard at the start of a data stream, and that therefore a
>>>>> CIF reading application may at its discretion accept U+FEFF followed
>>>>> by #\#CIF2.0 as a valid magic number at the start of a file.
>>>>
>>>> I don't see what is gained by forbidding U+FEFF from appearing inside
>>>> data values, where one might arrive via any number of innocent means.
>>>>  As it currently stands, the draft permits this.  It is somewhat
>>>> problematic to allow it at the beginning or end of a
>>>> whitespace-delimited value, but U+FEFF is by no means the only
>>>> character that is allowed but problematic at such a position.
>>>>
>>>> On the other hand, it is viable to specify that CIF itself does not
>>>> (directly) include a BOM.  That's where we started.  (Pedantic note:
>>>> "initial BOM" is redundant.  As the term is used in relation to
>>>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>>>> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
>>>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>>>> then need to be interpreted either
>>>>
>>>> 1) as an unrecognized file, or
>>>>
>>>> 2) as an ill-formed CIF, or
>>>>
>>>> 3) as a well-formed CIF (any version) encapsulated in another
>>>> protocol.  Such "another protocol" does not need to be the concern of
>>>> CIF.
>>>>
>>>>> The idea is that any fully-conformant CIF writer will never write an
>>>>> initial UTF-8 BOM, and so any software designed to handle only fully
>>>>> conformant CIFs will not be troubled by it.
>>>>
>>>> I could live with that.  I can't imagine writing a CIF processor
>>>> limited to that mode of operation, nor would I want to use one, but I
>>>> can handle CIF's formal scope being limited in that way.
>>>>
>>>> In that case, however, let's carry it to the logical conclusion.
>>>>  Rather than put one particular encoding detail outside CIF's scope,
>>>> why not put character encoding out of scope altogether?  CIF can
>>>> easily be defined simply in terms of "Unicode characters".  Perhaps
>>>> instead of anointing UTF-8 as the One True Encoding for CIF, it would
>>>> be better to make encoding an entirely separate concern.
>>>>
>>>> Practically speaking, you're going to have that anyway.  Even
>>>> disregarding imgCIF, does anyone really expect never to hear "it's a
>>>> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
>>>> think they need the authority of the CIF specification to require that
>>>> CIFs be delivered to them in a particular encoding?  How is that
>>>> qualitatively different from requiring particular CIF content, as most
>>>> programs do?
>>>>
>>>>>                                             Of course the world does
>>>>> contain CIFs created other than by fully-conformant CIF writers. To
>>>>> an extent the community should decide for itself how best to attempt
>>>>> to handle deviations from full conformance. It would help, perhaps, if
>>>>> those of us writing CIF readers would document specific practices that
>>>>> the software takes to accommodate such deviations. Ideally, such
>>>>> software should have a verbose logging mode that can be activated
>>>>> whenever surprising behaviour in reading CIFs is encountered by
>>>>> the user.
>>>>
>>>> I think it's exceedingly optimistic to expect "the community" to
>>>> arrive at and abide by a single, consistent set of best practices.
>>>>  The best you can hope for is that a small number of organizations and
>>>> / or programs will exert enough influence to establish their own de
>>>> facto standards.
>>>>
>>>> We can exert some influence there, however.  Either the CIF spec or a
>>>> companion spec could establish conformance requirements for CIF
>>>> *processors*, including, for example, the ability to diagnose
>>>> particular malformations.  The XML spec does this, as do some
>>>> programming language specs.
>>>>
>>>> Such a document could also establish, perhaps, that CIF processors
>>>> must be able to accept the UTF-8 encoding, and maybe even that they
>>>> must assume UTF-8 by default.  That would establish the baseline and a
>>>> guaranteed interoperability mode that we would otherwise lose by
>>>> pushing character encoding outside the format specification.
>>>>
>>>>> Notice that naive concatenation of CIFs will remain a bad idea for
>>>>> all sorts of reasons - beyond the purely syntactic issues, one will
>>>>> get multiple "data_TOZ" declarations for example. Undoubtedly this
>>>>> will continue to happen, but perhaps increasing the number of
>>>>> occasions when blindly concatenating files triggers software errors
>>>>> will help to raise awareness and/or the use of better software tools.
>>>>
>>>> You are preaching to the choir with that as far as I am concerned.  It
>>>> has never been altogether safe or reliable to assemble CIFs by
>>>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>>>> needs to make special accommodation for behavior that was never
>>>> correct in the first place.  No matter what treatment is chosen for
>>>> U+FEFF, people who exercise due care will still be able to assemble
>>>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>>>> so shrewdly.
>>>>
>>>> John
>>>> --
>>>> John C. Bollinger, Ph.D.
>>>> Department of Structural Biology
>>>> St. Jude Children's Research Hospital
>>>>
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> T +61 (02) 9717 9907
>>>> F +61 (02) 9717 3145
>>>> M +61 (04) 0249 4148
>>>>
>>>
>>
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

