
Re: [ddlm-group] [SPAM] ASSP UTF-8 BOM

Dear Colleagues,

Brian got me thinking about this again:

On Monday, May 24, 2010 1:27 AM, James Hester wrote:
>To run through the alternatives and some of the arguments so far:
>(i) treating an embedded BOM as an ordinary character runs against the
>Unicode recommendations.  If we wish our standard to be respected, I think
>we should at least respect other standards and the thinking that has gone
>into them
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic meaning
>in the CIF.  I think this would be completely inconsistent on our part,
>as an invisible character (when displayed) can actually be used to
>delimit strings.  This is my least preferred solution, as it goes
>against the human-readability expected of CIFs.
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off to a serious problem'.
>(iv) treating embedded BOMs as syntax errors will cause issues when CIF2 files are naively concatenated
>I think the only viable alternatives are to choose (iii) or (iv).

I initially passed over it, but I now think the argument against (i)
is flawed.  Unicode recommends that embedded U+FEFF, if allowed, be
treated as a zero-width non-breaking space (which is its original
documented function).  One might equivalently say that it should be
treated the same as U+2060, its designated replacement for that role.
But as far as CIF is concerned, U+2060 has no special significance
whatever, therefore it is as ordinary as ordinary can be.  Treating
U+FEFF as an ordinary (i.e. having no special significance to CIF)
character is therefore perfectly consistent with Unicode.
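
As a quick illustration of that point, here is a minimal sketch using Python's standard unicodedata module (chosen only for convenience; nothing in CIF depends on it, and the helper name describe is my own). It shows that U+FEFF and U+2060 share the same general category, Cf (format), and that neither is classed as whitespace by Python's str.isspace():

```python
import unicodedata

BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark
WORD_JOINER = "\u2060"  # the designated replacement for the ZWNBSP role


def describe(ch):
    """Return (Unicode name, general category, str.isspace()) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())


if __name__ == "__main__":
    for ch in (BOM, WORD_JOINER):
        print(f"U+{ord(ch):04X}", describe(ch))
```

Whether CIF should attach any significance of its own to either character is of course a separate question; the sketch only shows how the Unicode character database classifies them.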

As I have already written, I am strongly opposed to both (iii) and
(iv) if they apply to U+FEFF appearing in data values.  Inasmuch as it
could be ambiguous whether some appearances of U+FEFF are in data
values, I don't think either of these options is a good choice.
Furthermore, the argument I just rejected against (i) is in fact valid
against (iii): if embedded U+FEFF is allowed, then it should be
treated as a ZWNBSP (with or without any special significance to CIF),
not ignored.
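
To make the four alternatives concrete, here is a rough sketch of how each might treat U+FEFF appearing after the start of a file. The function name handle_embedded_bom and the policy labels are hypothetical, invented for this illustration. I am also assuming that a single leading U+FEFF is an encoding signature and is dropped in every case, and I model "whitespace" crudely as replacement by an ASCII space. None of this is proposed specification text.

```python
BOM = "\ufeff"


def handle_embedded_bom(text, policy):
    """Hypothetical helper: apply one of options (i)-(iv) to embedded U+FEFF."""
    # Assumption: one leading U+FEFF is an encoding signature, not content.
    body = text[1:] if text.startswith(BOM) else text
    if policy == "ordinary":      # option (i): no special significance to CIF
        return body
    if policy == "whitespace":    # option (ii): acts like CIF whitespace
        return body.replace(BOM, " ")
    if policy == "ignore":        # option (iii): silently discarded
        return body.replace(BOM, "")
    if policy == "error":         # option (iv): the file is rejected
        if BOM in body:
            raise ValueError(f"embedded U+FEFF at offset {body.index(BOM)}")
        return body
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # Two files, each with its own signature, naively concatenated.
    joined = BOM + "data_a\n" + BOM + "data_b\n"
    for policy in ("ordinary", "whitespace", "ignore"):
        print(policy, repr(handle_embedded_bom(joined, policy)))
```

With the naively concatenated input, "error" raises, while only "ordinary" leaves the second U+FEFF in place for whatever consumes the data values.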

I rather like (ii), but I would be satisfied with (i).


Also, is human readability, such as James cites against option (ii),
really a significant concern to this group? I have at least two
issues in that area, but I had not planned to raise them because of
the apparent hope and perception that CIF2 is largely done.

John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
