
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 BOM
From: James Hester <[email protected]>
Date: Tue, 15 Jun 2010 17:13:05 +1000
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA54165DF3381F@SJMEMXMBS11.stjude.sjcrh.local>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF3381F@SJMEMXMBS11.stjude.sjcrh.local>

Given John's arguments for (i), I think I can also live with option (i) (0xFEFF is an ordinary character).

In addition, I would suggest adding 0xFEFF to the list of characters not allowed in non-delimited datavalues, and not allowing it in datanames, datablock names, and save frame names. Disallowing it in these tokens is a conservative choice, as we can remove some or all of these restrictions at a later date without invalidating existing files.

Note that option (i) in conjunction with this additional suggestion would render 0xFEFF a syntax error everywhere except in comments or a delimited data value.
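
As a rough illustration of the rule proposed above, the following sketch checks whether U+FEFF may appear in a given kind of token. The token-kind names and the function `bom_allowed` are hypothetical, invented for this example; they are not taken from the CIF2 draft or from any existing parser. The sketch covers only an embedded U+FEFF, not a BOM at the very start of a file.

```python
# Hypothetical sketch: option (i) plus the extra restriction suggested above.
# The token-kind names are illustrative assumptions, not CIF2 terminology.

BOM = "\ufeff"

# Token kinds in which U+FEFF would remain legal under this proposal.
BOM_PERMITTED = {"comment", "delimited_value"}

# Token kinds in which U+FEFF would be a syntax error.
BOM_FORBIDDEN = {
    "non_delimited_value",
    "dataname",
    "datablock_name",
    "save_frame_name",
}


def bom_allowed(token_kind: str, text: str) -> bool:
    """Return True if `text` is acceptable for `token_kind` as far as U+FEFF goes."""
    if token_kind not in BOM_PERMITTED | BOM_FORBIDDEN:
        raise ValueError(f"unknown token kind: {token_kind!r}")
    if BOM not in text:
        # No U+FEFF present, so this rule has nothing to object to.
        return True
    return token_kind in BOM_PERMITTED
```

Because the forbidden set is explicit, relaxing the restriction later would only mean moving a token kind from one set to the other, which is the sense in which the choice is conservative.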

On Tue, Jun 15, 2010 at 7:58 AM, Bollinger, John C <[email protected]> wrote:

Dear Colleagues,

Brian got me thinking about this again:

On Monday, May 24, 2010 1:27 AM, James Hester wrote:

>To run through the alternatives and some of the arguments so far:
>
>(i) treating an embedded BOM as an ordinary character runs against the
>Unicode recommendations. If we wish our standard to be respected, I think
>we should at least respect other standards and the thinking that has gone
>into them
>
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic meaning
>in the CIF. I think this would be completely inconsistent on our part,
>as an invisible character (when displayed) can actually be used to
>delimit strings. This is my least preferred solution, as it goes
>against the human-readability expected of CIFs.
>
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off to a serious problem'.
>
>(iv) treating embedded BOMs as syntax errors will cause issues when CIF2 files are naively concatenated
>
>I think the only viable alternatives are to choose (iii) or (iv).
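
To make the concatenation concern in (iv) concrete, here is a minimal sketch of how an embedded U+FEFF can arise. The file contents are made-up placeholders rather than real CIF2 data, and `concatenate_naively` is a hypothetical helper written for this example.

```python
# Sketch: naive byte-level concatenation of two UTF-8 files that each
# start with a BOM leaves the second BOM embedded in the result.

def concatenate_naively(parts: list[bytes]) -> str:
    """Join raw file contents and decode, stripping only a leading BOM."""
    combined = b"".join(parts)
    # The utf-8-sig codec removes a BOM only at the very start of the input.
    return combined.decode("utf-8-sig")


# Placeholder contents; "utf-8-sig" prepends the UTF-8 BOM when encoding.
first = "data_a\n_x 1\n".encode("utf-8-sig")
second = "data_b\n_y 2\n".encode("utf-8-sig")

text = concatenate_naively([first, second])

# Position of the BOM that survived from the start of the second file.
embedded_at = text.find("\ufeff")
```

Here the surviving U+FEFF sits immediately before `data_b`, which is the case the four options above would treat differently.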

I initially passed over it, but I now think the argument against (i) is flawed. Unicode recommends that embedded U+FEFF, if allowed, be treated as a zero-width non-breaking space (which is its original documented function). One might equivalently say that it should be treated the same as U+2060, its designated replacement for that role. But as far as CIF is concerned, U+2060 has no special significance whatever, therefore it is as ordinary as ordinary can be. Treating U+FEFF as an ordinary (i.e. having no special significance to CIF) character is therefore perfectly consistent with Unicode recommendations.

As I have already written, I am strongly opposed to both (iii) and (iv) if they apply to U+FEFF appearing in data values. Inasmuch as it could be ambiguous whether some appearances of U+FEFF are in data values, I don't think either of these options is a good choice. Furthermore, the argument I just rejected against (i) is in fact valid against (iii): if embedded U+FEFF is allowed, then it should be treated as a ZWNBSP (with or without any special significance to CIF), not ignored.

I rather like (ii), but I would be satisfied with (i).

------

Also, is human readability, such as James cites against option (ii), really a significant concern to this group? I have at least two issues in that area, but I had not planned to raise them because of the apparent hope and perception that CIF2 is largely done.

John
--
John C. Bollinger, Ph.D.

Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer

_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

