Discussion List Archives


Re: [ddlm-group] [SPAM] ASSP UTF-8 BOM

Given John's arguments for (i), I think I can also live with option (i) (0xFEFF is an ordinary character). 

I would also suggest adding 0xFEFF to the list of characters not allowed in non-delimited data values, and not allowing it in datanames, data block names, or save frame names.  Disallowing it in these tokens is a conservative choice, as we can remove some or all of these restrictions at a later date without invalidating existing files.

Note that option (i) in conjunction with this additional suggestion would render 0xFEFF a syntax error everywhere except in comments or a delimited data value. 
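
To make the effect of this suggestion concrete, here is a minimal sketch in Python. The token-kind names and the function bom_is_error are hypothetical, chosen only for illustration; they are not taken from the CIF2 draft or from any existing parser. The sketch assumes a tokenizer has already classified each token.

```python
BOM = "\ufeff"  # U+FEFF, written 0xFEFF in the message above

# Hypothetical token kinds (illustrative names, not from the CIF2 draft).
BOM_TOLERANT_KINDS = {"comment", "delimited_value"}
BOM_FORBIDDEN_KINDS = {
    "dataname",
    "datablock_name",
    "saveframe_name",
    "undelimited_value",
}


def bom_is_error(kind: str, text: str) -> bool:
    """Return True if U+FEFF in this token would be a syntax error
    under option (i) plus the additional restrictions suggested above."""
    if kind not in BOM_TOLERANT_KINDS | BOM_FORBIDDEN_KINDS:
        raise ValueError(f"unknown token kind: {kind!r}")
    return kind in BOM_FORBIDDEN_KINDS and BOM in text
```

For example, bom_is_error("dataname", "_cell\ufefflength") is True, while bom_is_error("comment", "# note\ufeff") is False. The sketch does not address a U+FEFF appearing between tokens; under option (i) that character would presumably begin a non-delimited value and so fall under the same restriction.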

On Tue, Jun 15, 2010 at 7:58 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Dear Colleagues,

Brian got me thinking about this again:

On Monday, May 24, 2010 1:27 AM, James Hester wrote:
>To run through the alternatives and some of the arguments so far:
>(i) treating an embedded BOM as an ordinary character runs against the
>Unicode recommendations.  If we wish our standard to be respected, I think
>we should at least respect other standards and the thinking that has gone
>into them
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic meaning
>in the CIF.  I think this would be completely inconsistent on our part,
>as an invisible character (when displayed) can actually be used to
>delimit strings.  This is my least preferred solution, as it goes
>against the human-readability expected of CIFs.
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off to a serious problem'.
>(iv) treating embedded BOMs as syntax errors will cause issues when CIF2 files are naively concatenated
>I think the only viable alternatives are to choose (iii) or (iv).
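
The concatenation concern in (iv) can be illustrated with a short Python sketch. The file contents below are invented placeholders, not real CIF2 data; the point is only that joining two UTF-8 files that each start with a BOM leaves the second BOM embedded in the middle of the result.

```python
import codecs

# Two made-up files, each beginning with a UTF-8 byte order mark.
file_a = codecs.BOM_UTF8 + "data_a\n_x 1\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "data_b\n_y 2\n".encode("utf-8")

# Naive concatenation, then decoding with a codec that strips
# only a BOM at the very start of the stream.
combined = (file_a + file_b).decode("utf-8-sig")

# Positions of any U+FEFF characters that survive decoding.
embedded = [i for i, ch in enumerate(combined) if ch == "\ufeff"]
```

Here the leading BOM is removed by the decoder, but embedded holds one position: the character immediately before "data_b".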

I initially passed over it, but I now think the argument against (i) is flawed.  Unicode recommends that embedded U+FEFF, if allowed, be treated as a zero-width non-breaking space (which is its original documented function).  One might equivalently say that it should be treated the same as U+2060, its designated replacement for that role.  But as far as CIF is concerned, U+2060 has no special significance whatever, therefore it is as ordinary as ordinary can be.  Treating U+FEFF as an ordinary (i.e. having no special significance to CIF) character is therefore perfectly consistent with Unicode recommendations.
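
As a quick check of how the two characters are classified, the following sketch uses Python's standard unicodedata module. The helper name describe is just an illustration; the reported values depend on the Unicode database shipped with the Python version in use.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\ufeff", "\u2060"):
    print(*describe(ch))
```

Both characters are reported in general category Cf (format), with the names ZERO WIDTH NO-BREAK SPACE and WORD JOINER respectively, so neither is classified as whitespace by category.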

As I have already written, I am strongly opposed to both (iii) and (iv) if they apply to U+FEFF appearing in data values.  Inasmuch as it could be ambiguous whether some appearances of U+FEFF are in data values, I don't think either of these options is a good choice.  Furthermore, the argument I just rejected against (i) is in fact valid against (iii): if embedded U+FEFF is allowed, then it should be treated as a ZWNBSP (with or without any special significance to CIF), not ignored.

I rather like (ii), but I would be satisfied with (i).


Also, is human readability, such as James cites against option (ii), really a significant concern to this group? I have at least two issues in that area, but I had not planned to raise them because of the apparent hope and perception that CIF2 is largely done.

John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer

