
Re: [ddlm-group] UTF-8 BOM

What I think this discussion boils down to is that an embedded U+FEFF could be either an accidental UTF8 BOM, or a (deliberate, deprecated) ZWNBSP.  A stray UTF8 BOM is the only reasonable interpretation outside a token, so we can safely choose to ignore and drop U+FEFF in this case. Inside most tokens an automated procedure *cannot* choose the correct alternative, so the interests of fidelity in data transfer dictate that such characters cause a syntax error.   That is, we explicitly exclude U+FEFF from the list of acceptable characters that can be found inside a token.

So, I now propose the following:
(i) U+FEFF is ignored and removed from the datastream whenever encountered outside a token
(ii) U+FEFF joins the list of excluded characters in datanames, datavalues, save frame names and datablock names
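
To make the two rules concrete, here is a minimal Python sketch. It rests on assumptions that are mine, not part of the proposal: tokens are taken to be whitespace-delimited only (quoted strings and text fields are not handled), a U+FEFF counts as "inside a token" only when it is neither preceded nor followed by whitespace, and the names strip_stray_boms and CifSyntaxError are hypothetical.

```python
BOM = "\ufeff"
WHITESPACE = " \t\r\n"


class CifSyntaxError(ValueError):
    """Hypothetical error type used only in this sketch."""


def strip_stray_boms(text):
    """Drop U+FEFF outside tokens; raise if it sits inside a token.

    Assumption: tokens are whitespace-delimited, so a U+FEFF is
    'inside a token' when both of its neighbours are non-whitespace.
    """
    out = []
    in_token = False  # True when the last kept character was non-whitespace
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch != BOM:
            out.append(ch)
            in_token = ch not in WHITESPACE
            i += 1
            continue
        # Treat a run of consecutive U+FEFF characters as one unit.
        j = i
        while j < n and text[j] == BOM:
            j += 1
        followed_by_token_char = j < n and text[j] not in WHITESPACE
        if in_token and followed_by_token_char:
            # Rule (ii): excluded character inside a token.
            raise CifSyntaxError(f"U+FEFF inside a token at offset {i}")
        # Rule (i): outside a token, so drop it silently.
        i = j
    return "".join(out)
```

A real parser would make this decision in its lexer, where it already knows whether it is inside a delimited string; the character-level pass above is only meant to show the intended behaviour.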

On Tue, May 25, 2010 at 6:13 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
On Monday, May 24, 2010 1:27 AM, James Hester wrote:
>To run through the alternatives and some of the arguments so far:
>(i) treating an embedded BOM as an ordinary character runs against
>the Unicode recommendations.  If we wish our standard to be
>respected, I think we should at least respect other standards and
>the thinking that has gone into them
>(ii) treating an embedded BOM as whitespace is OK with the Unicode
>standard, but means that a non-ASCII character now has syntactic
>meaning in the CIF.  I think this would be completely inconsistent
>on our part, as an invisible character (when displayed) can actually
>be used to delimit strings.  This is my least preferred solution, as
>it goes against the human-readability expected of CIFs
>(iii) ignoring embedded BOMs is bad because they can be a 'tip off
>to a serious problem'.
>(iv) treating embedded BOMs as syntax errors will cause issues when
>CIF2 files are naively concatenated
>I think the only viable alternatives are to choose (iii) or (iv).
>So: why exactly is ignoring a BOM a problem?  If the embedded BOM is
>the leading BOM from a UTF16 file that has been naively concatenated,
>it will have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is
>not acceptable UTF8, leading to a decoding error from the UTF8
>decoding step.  The subsequent bytes will be UTF16, which should cause
>a decoding failure in any case.   So I deduce that we are simply
>discussing how to treat a UTF8 BOM, which can only find its way into a
>CIF file by naive concatenation of UTF8-encoded files written by
>certain programs.
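
The byte-level argument above can be checked with a short Python sketch. The helper name decodes_as_utf8 and the sample contents are illustrative only; the codecs constants are from the standard library.

```python
import codecs


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_part = "data_a\n".encode("utf-8")

# A UTF-16 file carries its BOM as FE FF (big-endian) or FF FE
# (little-endian); neither byte can appear in valid UTF-8.
utf16_be = codecs.BOM_UTF16_BE + "data_b\n".encode("utf-16-be")
utf16_le = codecs.BOM_UTF16_LE + "data_b\n".encode("utf-16-le")
mixed_be_ok = decodes_as_utf8(utf8_part + utf16_be)
mixed_le_ok = decodes_as_utf8(utf8_part + utf16_le)

# A UTF-8 BOM is EF BB BF, which decodes cleanly to an embedded U+FEFF.
utf8_with_bom = codecs.BOM_UTF8 + "data_b\n".encode("utf-8")
joined_text = (utf8_part + utf8_with_bom).decode("utf-8")
```

Here mixed_be_ok and mixed_le_ok are both False, while joined_text contains U+FEFF between the two parts, which is the only case a CIF2 parser would actually see.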

I generally agree with that summary and analysis, though I observe that
a U+FEFF character may intentionally be embedded in a data value to serve
its (deprecated) role as a ZWNBSP.  It might arise from transferring text
from an existing manuscript into a CIF, such as an author may do while
preparing a new, CIF-formatted manuscript.

I can come up with other scenarios leading to an embedded U+FEFF that
don't involve directly concatenating files, though so far they all
seem far-fetched.

>If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I
>don't see that it is indicative of any problems beyond misguided choice
>of text editor.

There are cases where ignoring an embedded BOM would change the syntactic
interpretation of the CIF, generally when it is neither preceded nor
followed by whitespace.  That might occur, for example, when naively
appending a CIF2 CIF, with BOM and required version comment, to the end
of a CIF with no trailing newline. If the BOM is ignored then the last
token of the first CIF and the first token of the second are pasted
together, which might not result in a syntax error.  Of course, this is
a potential problem with concatenating CIFs without BOMs, too.
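
A small Python sketch of that scenario, using whitespace splitting as a crude stand-in for CIF tokenization (an assumption for illustration; the function name naive_tokens and the sample data are hypothetical):

```python
BOM = "\ufeff"


def naive_tokens(text):
    """Silently strip U+FEFF, then split on whitespace.

    This is a stand-in for a real CIF lexer, not a substitute for one.
    """
    return text.replace(BOM, "").split()


first = "data_a\n_x 1"                        # no trailing newline
second = BOM + "#\\#CIF_2.0\ndata_b\n_y 2\n"  # BOM plus version comment

tokens = naive_tokens(first + second)
# tokens[2] is "1#\#CIF_2.0": the value "1" and the version comment
# of the second file have been pasted into a single token.
```

Since a "#" that does not start a token would not, as I understand the syntax, begin a comment, the merged token could pass as a legal data value rather than trigger an error.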

There are nastier possible results from silently stripping embedded
U+FEFF, some owing to its legality in data names, and a few other tricks
I have brewing in the back of my head.  None of them are likely to
occur accidentally, though.

>So I would advocate ignoring (and removing) UTF8-BOMs in the input
>stream, and treating all other BOMs as syntax errors.  Individual
>applications may wish to give users the option of interpreting U+FEFF
>as the deprecated ZWNBSP (and translating to the correct character) on
>the understanding that if this occurs outside a delimited string it
>will cause a syntax error.

I am not at all comfortable with allowing parsers to strip or
substitute U+FEFF embedded in data values, much less requiring that
they do so: a data protocol should faithfully deliver the data
entrusted to it, or else complain.  I don't much like the idea of
stripping or substituting U+FEFF elsewhere, for that matter, but I
could live with that.

Requiring U+FEFF to be altered in some contexts but not in
others would present some practical challenges, to be sure, but not
insurmountable ones.  If that is unpalatable, though, and if treating
embedded U+FEFF as whitespace is unacceptable, then we're left with
treating it as an ordinary character, which no one seems to like
much, and treating it as an error, which has had mixed reviews.

>PS am I the only one who thinks it unlikely that Wordpad users would
>choose to use 'cat' to join file fragments together?

No, you're not.  But Wordpad is not the only editor of note, and its
users are not the only people who might end up concatenating CIFs edited
with it.  Personally, though, I tend to ascribe sufficient technical
acumen to 'cat' users to understand why there's a potential problem
and to have some idea how to tackle it.

John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer

ddlm-group mailing list
