Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: James Hester <jamesrhester@gmail.com>
- Date: Wed, 26 May 2010 13:45:42 +1000
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA54165DF337EC@SJMEMXMBS11.stjude.sjcrh.local>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA54165DF337EC@SJMEMXMBS11.stjude.sjcrh.local>
So, I now propose the following:
(i) U+FEFF is ignored and removed from the datastream whenever encountered outside a token
(ii) U+FEFF joins the list of excluded characters in datanames, datavalues, save frame names and datablock names
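As an illustration only, the two rules above might be sketched as follows. This is not a CIF2 parser: it assumes tokens are plain whitespace-delimited chunks (no quoted strings or text fields), and it reads "outside a token" as "at the leading or trailing edge of such a chunk". The name `tokenize` and that reading are assumptions made for this example, not part of the proposal.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark / deprecated ZWNBSP


def tokenize(text: str) -> list[str]:
    """Split text into whitespace-delimited tokens under the two proposed rules.

    Assumption for this sketch: a U+FEFF at the edge of a chunk counts as
    "outside a token" and is dropped (rule i); a U+FEFF left in the interior
    of a chunk is an excluded character and is rejected (rule ii).
    """
    tokens = []
    for chunk in text.split():  # str.split() does not treat U+FEFF as whitespace
        token = chunk.strip(BOM)  # rule (i): ignore and remove at token edges
        if BOM in token:  # rule (ii): not allowed inside a token
            raise ValueError(f"U+FEFF inside token {token!r}")
        if token:  # a chunk that was only U+FEFF disappears entirely
            tokens.append(token)
    return tokens
```

With this reading, a BOM left behind at the start of a concatenated file vanishes, while a data name such as `_na<U+FEFF>me` is reported as an error.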
On Tue, May 25, 2010 at 6:13 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
On Monday, May 24, 2010 1:27 AM, James Hester wrote:
>To run through the alternatives and some of the arguments so far:
>
>(i) treating an embedded BOM as an ordinary character runs against
>the Unicode recommendations. If we wish our standard to be
>respected, I think we should at least respect other standards and
>the thinking that has gone into them
>
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic
>meaning in the CIF. I think this would be completely inconsistent
>on our part, as an invisible character (when displayed) can actually
>be used to delimit strings. This is my least preferred solution, as
>it goes against the human-readability expected of CIFs
>
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off
>to a serious problem'.
>
>(iv) treating embedded BOMs as syntax errors will cause issues when
>CIF2 files are naively concatenated
>
>I think the only viable alternatives are to choose (iii) or (iv).
>
>So: why exactly is ignoring a BOM a problem? If the embedded BOM is
>the leading BOM from a UTF16 file that has been naively concatenated,
>it will have bytes 0xFE 0xFF. This byte sequence (and the reverse) is
>not acceptable UTF8, leading to a decoding error from the UTF8
>decoding step. The subsequent bytes will be UTF16, which should cause
>a decoding failure in any case. So I deduce that we are simply
>discussing how to treat a UTF8 BOM, which can only find its way into a
>CIF file by naive concatenation of UTF8-encoded files written by
>certain programs.
I generally agree with that summary and analysis, though I observe that
a U+FEFF character may intentionally be embedded in a data value to serve
its (deprecated) role as a ZWNBSP. It might arise from transferring text
from an existing manuscript into a CIF, such as an author may do while
preparing a new, CIF-formatted manuscript.
I can come up with other scenarios leading to an embedded U+FEFF that
don't involve directly concatenating files, though so far they all
seem far-fetched.
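The decoding argument in the quoted message can be checked directly. The following is a small sketch using only Python's standard `codecs` constants; the helper name `decodes_as_utf8` and the sample file contents are made up for the example.

```python
import codecs


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A UTF-8 file with a second file naively appended to it.
utf8_part = "data_first\n_a 1\n".encode("utf-8")

# Case 1: the appended file is UTF-16 with its leading BOM (bytes 0xFE 0xFF).
utf16_part = codecs.BOM_UTF16_BE + "data_second\n".encode("utf-16-be")
utf16_ok = decodes_as_utf8(utf8_part + utf16_part)

# Case 2: the appended file is UTF-8 with a leading UTF-8 BOM (0xEF 0xBB 0xBF).
utf8_bom_part = codecs.BOM_UTF8 + "data_second\n".encode("utf-8")
utf8_ok = decodes_as_utf8(utf8_part + utf8_bom_part)

# In case 2 the BOM survives decoding as an embedded U+FEFF character.
embedded = (utf8_part + utf8_bom_part).decode("utf-8")
```

Here `utf16_ok` is false, because 0xFE and 0xFF never occur in well-formed UTF-8, while `utf8_ok` is true and `embedded` contains U+FEFF at the join. That second case is the only one the decoder passes through to the CIF parser.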
>If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I
>don't see that it is indicative of any problems beyond misguided choice
>of text editor.
There are cases where ignoring an embedded BOM would change the syntactic
interpretation of the CIF, generally when it is neither preceded nor
followed by whitespace. That might occur, for example, when naively
appending a CIF2 CIF, with BOM and required version comment, to the end
of a CIF with no trailing newline. If the BOM is ignored then the last
token of the first CIF and the first token of the second are pasted
together, which might not result in a syntax error. Of course, this is
a potential problem with concatenating CIFs without BOMs, too.
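A minimal sketch of that scenario, assuming whitespace-delimited tokens and using made-up file contents (only the `#\#CIF_2.0` version comment comes from CIF2 itself):

```python
BOM = "\ufeff"

# First CIF ends without a trailing newline; the second starts with a BOM
# followed by the CIF2 version comment.
first = "data_a\n_x 1"
second = BOM + "#\\#CIF_2.0\ndata_b\n_y 2\n"
joined = first + second

# Silently dropping the BOM glues the last token of the first file to the
# first token of the second.
tokens_if_ignored = joined.replace(BOM, "").split()

# For comparison only: treating the BOM as a separator keeps them apart.
tokens_if_separator = joined.replace(BOM, " ").split()
```

In `tokens_if_ignored` the value `1` and the version comment become the single token `1#\#CIF_2.0`, whereas `tokens_if_separator` keeps `1` and `#\#CIF_2.0` distinct. The same pasting happens with no BOM at all, as noted above.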
There are nastier possible results from silently stripping embedded
U+FEFF, some owing to its legality in data names, and a few other tricks
I have brewing in the back of my head. None of them are likely to
occur accidentally, though.
>So I would advocate ignoring (and removing) UTF8-BOMs in the input
>stream, and treating all other BOMs as syntax errors. Individual
>applications may wish to give users the option of interpreting U+FEFF
>as the deprecated ZWNBP (and translating to the correct character) on
>the understanding that if this occurs outside a delimited string it
>will cause a syntax error.
I am not at all comfortable with allowing parsers to strip or
substitute U+FEFF embedded in data values, much less requiring that
they do so: a data protocol should faithfully deliver the data
entrusted to it, or else complain. I don't much like the idea of
stripping or substituting U+FEFF elsewhere, for that matter, but I
could live with that.
Requiring U+FEFF to be altered in some contexts but not in
others would present some practical challenges, to be sure, but not
insurmountable ones. If that is unpalatable, though, and if treating
embedded U+FEFF as whitespace is unacceptable, then we're left with
treating it as an ordinary character, which no one seems to like
much, and treating it as an error, which has had mixed reviews.
>
>James
>
>PS am I the only one who thinks it unlikely that Wordpad users would
>choose to use 'cat' to join file fragments together?
No, you're not. But Wordpad is not the only editor of note, and its users
are not the only people who might end up concatenating CIFs edited
with it. Personally, though, I tend to ascribe sufficient technical
acumen to 'cat' users to understand why there's a potential problem
and to have some idea how to tackle it.
John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148