
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 BOM
From: James Hester <jamesrhester@gmail.com>
Date: Wed, 16 Jun 2010 13:54:32 +1000

Dear Herbert,

Would you mind enlarging a little on what you are responding to here?
I don't follow your thinking.  Perhaps I was not clear: I am not in
favour of allowing a variety of encodings to be included within the
CIF2 standard.  I am advocating UTF8 only.  Is this what you are
responding to, or are you discussing the suggestion of allowing a
variety of encodings?
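
To make the UTF8-only position concrete, here is a minimal sketch of
how a reader might behave: tolerate a single leading UTF-8 BOM, then
decode strictly as UTF-8 and check for the magic number.  The function
name, the allow_bom switch and the exact magic string (copied from
Brian's message quoted below) are assumptions for illustration only,
not anything the draft specifies.

```python
import codecs

# Magic string as quoted in this thread; its exact spelling is an assumption.
MAGIC = "#\\#CIF2.0"


def decode_cif2(raw: bytes, allow_bom: bool = True) -> str:
    """Decode bytes as strict UTF-8, optionally tolerating one leading BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        if not allow_bom:
            raise ValueError("leading UTF-8 BOM not accepted")
        # Drop the BOM so it never reaches the CIF parser proper.
        raw = raw[len(codecs.BOM_UTF8):]
    # Strict decoding: raises UnicodeDecodeError on bytes that are not UTF-8.
    text = raw.decode("utf-8")
    if not text.startswith(MAGIC):
        raise ValueError("missing CIF2 magic number")
    return text
```

A file written in any other encoding then fails at the decode step
instead of being silently misread, which is the guarantee I am after.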

On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> This is quite a disruptive change.  Until now CIF has always had
> machine-dependent encoding changes assumed.  I am in favor of
> working the entire world towards a common representation of text,
> and the use of multiple Unicode representations supported on
> current systems is going to be a large positive step.  I think
> it is a little premature (by about 10 years) to assume a
> world of UTF-8 purity.  We ain't there yet.
>
> You are essentially making CIF2 into a binary format instead
> of a text format.  That is a truly disruptive change.  I think
> it is a serious mistake that will discourage use of CIF as an
> interchange format, not encourage it.
>
> Regards,
>   Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
>
> On Wed, 16 Jun 2010, James Hester wrote:
>
>> My concern with opening up the suite of possible CIF encodings is that we
>> need to maintain a guarantee that any CIF2-conformant writer will produce
>> files that any CIF2-conformant reader can read.  As we are a data transfer
>> and archiving standard, this is a core guarantee that we make, so we cannot
>> specify optional behaviour.  Note that we are not restricted to someone
>> transferring files between computers at a single point in time, when some
>> negotiation of encoding protocol could take place; we may be talking about a
>> third party retrieving a file archived some years ago by someone else in the
>> local university repository.
>>
>> What people are and have always been free to do is to encapsulate and encode
>> CIFs in whatever way they wish, as long as the result is not touted as being
>> 'CIF2 conformant'.  The optional UTF8 BOM that we have more or less agreed
>> to is purely in deference to poorly-written text editors, rather than an
>> encoding signature as such.
>>
>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>> <John.Bollinger@stjude.org> wrote:
>> On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>
>> >I'm coming to this late, I fear, but I would prefer that the spec
>> >be kept as simple as possible. I note the following comments in
>> >the Unicode FAQ document referenced by John B
>> >(http://www.unicode.org/faq/utf_bom.html):
>> >
>> >    "Where UTF-8 is used transparently in 8-bit environments, the use
>> >    of a BOM will interfere with any protocol or file format that expects
>> >    specific ASCII characters at the beginning, such as the use of "#!"
>> >    of at the beginning of Unix shell scripts."
>>
>> Well yes, but that applies to protocols defined in terms of 8-bit,
>> ASCII-derived character sets ("8-bit environments").  It does not
>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>> Of course, neither does it require that BOMs be accepted or
>> recognized in Unicode environments.
>>
>> >    "In the absence of a protocol supporting its use as a BOM and when
>> >    not at the beginning of a text stream, U+FEFF should normally not
>> >    occur."
>>
>> I'm disappointed that you truncated the quote there.  It continues
>> with "For backwards compatibility it should be treated as ZERO WIDTH
>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>> file or string."  It goes on to advocate using U+2060 instead, and (in
>> the interest of full disclosure) it closes by commenting that a
>> language or protocol can specify that U+FEFF is unsupported in the
>> middle of a file.
>>
>> >I suggest the CIF specification deprecate the use of U+FEFF so that
>> >*any* occurrence of it be treated formally as an error. However, a
>> >note should acknowledge that U+FEFF is permitted according to the
>> >Unicode standard at the start of a data stream, and that therefore a
>> >CIF reading application may at its discretion accept U+FEFF followed
>> >by #\#CIF2.0 as a valid magic number at the start of a file.
>>
>> I don't see what is gained by forbidding U+FEFF from appearing inside
>> data values, where one might arrive via any number of innocent means.
>> As it currently stands, the draft permits this.  It is somewhat
>> problematic to allow it at the beginning or end of a
>> whitespace-delimited value, but U+FEFF is by no means the only
>> character that is allowed but problematic at such a position.
>>
>> On the other hand, it is viable to specify that CIF itself does not
>> (directly) include a BOM.  That's where we started.  (Pedantic note:
>> "initial BOM" is redundant.  As the term is used in relation to
>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>> anywhere else, U+FEFF is just U+FEFF.)  If CIF does not formally allow
>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>> then need to be interpreted either
>>
>> 1) as an unrecognized file, or
>>
>> 2) as an ill-formed CIF, or
>>
>> 3) as a well-formed CIF (any version) encapsulated in another
>> protocol.  Such "another protocol" does not need to be the concern of
>> CIF.
>>
>> >The idea is that any fully-conformant CIF writer will never write an
>> >initial UTF-8 BOM, and so any software designed to handle only fully
>> >conformant CIFs will not be troubled by it.
>>
>> I could live with that.  I can't imagine writing a CIF processor
>> limited to that mode of operation, nor would I want to use one, but I
>> can handle CIF's formal scope being limited in that way.
>>
>> In that case, however, let's carry it to the logical conclusion.
>> Rather than put one particular encoding detail outside CIF's scope,
>> why not put character encoding out of scope altogether?  CIF can
>> easily be defined simply in terms of "Unicode characters".  Perhaps
>> instead of anointing UTF-8 as the One True Encoding for CIF, it would
>> be better to make encoding an entirely separate concern.
>>
>> Practically speaking, you're going to have that anyway.  Even
>> disregarding imgCIF, does anyone really expect never to hear "it's a
>> CIF, except encoded in <FOO-13> instead of UTF-8"?  Does anyone really
>> think they need the authority of the CIF specification to require that
>> CIFs be delivered to them in a particular encoding?  How is that
>> qualitatively different from requiring particular CIF content, as most
>> programs do?
>>
>> >Of course the world does contain CIFs created other than by
>> >fully-conformant CIF writers. To an extent the community should
>> >decide for itself how best to attempt to handle deviations from full
>> >conformance. It would help, perhaps, if those of us writing CIF
>> >readers would document specific practices that the software takes to
>> >accommodate such deviations. Ideally, such software should have a
>> >verbose logging mode that can be activated whenever surprising
>> >behaviour in reading CIFs is encountered by the user.
>>
>> I think it's exceedingly optimistic to expect "the community" to
>> arrive at and abide by a single, consistent set of best practices.
>> The best you can hope for is that a small number of organizations and
>> / or programs will exert enough influence to establish their own de
>> facto standards.
>>
>> We can exert some influence there, however.  Either the CIF spec or a
>> companion spec could establish conformance requirements for CIF
>> *processors*, including, for example, the ability to diagnose
>> particular malformations.  The XML spec does this, as do some
>> programming language specs.
>>
>> Such a document could also establish, perhaps, that CIF processors
>> must be able to accept the UTF-8 encoding, and maybe even that they
>> must assume UTF-8 by default.  That would establish the baseline and a
>> guaranteed interoperability mode that we would otherwise lose by
>> pushing character encoding outside the format specification.
>>
>> >Notice that naive concatenation of CIFs will remain a bad idea for
>> >all sorts of reasons - beyond the purely syntactic issues, one will
>> >get multiple "data_TOZ" declarations for example. Undoubtedly this
>> >will continue to happen, but perhaps increasing the number of
>> >occasions when blindly concatenating files triggers software errors
>> >will help to raise awareness and/or the use of better software tools.
>>
>> You are preaching to the choir with that as far as I am concerned.  It
>> has never been altogether safe or reliable to assemble CIFs by
>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>> needs to make special accommodation for behavior that was never
>> correct in the first place.  No matter what treatment is chosen for
>> U+FEFF, people who exercise due care will still be able to assemble
>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>> so shrewdly.
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>>
>>
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>>
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>>
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
