
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 BOM
From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
Date: Thu, 13 May 2010 14:03:27 -0500

>   People make CIF out of pieces joined by cat or editors all the
>time.  We cannot tell them that they can only make CIF2s using
>a short list of applications, nor can we tell them that they
>cannot pick up material from old CIF1s.

I think we will be able to tell people that the limitations on
combining CIF2 fragments are about the same as those on combining CIF1
fragments.  Whatever decision is made about embedded BOMs, however,
there will be additional BOM-related considerations for CIF2 because
Unicode-aware text tools do not all treat BOMs the same way.  On the
other hand, whether we tell people or not, there is no escaping the
fact that there are more limitations on combining CIF2 fragments with
CIF1 fragments than there are on combining only CIF1 fragments, quite
apart from any question of BOM handling.  That was one of the costs of
abandoning 100% backwards compatibility.
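
To make the BOM point concrete, here is a rough Python sketch (mine,
not anything from the draft spec).  The two fragments and the helper
bom_offsets are made up for illustration; the only library behavior
relied on is that Python's "utf-8" codec keeps every BOM as U+FEFF,
while "utf-8-sig" drops a leading one and nothing else.

```python
import codecs

# Two hypothetical CIF2 fragments, each saved by a tool that prepends a UTF-8 BOM.
frag_a = codecs.BOM_UTF8 + b"#\\#CIF_2.0\ndata_a\n_item.one 1\n"
frag_b = codecs.BOM_UTF8 + b"data_b\n_item.two 2\n"

# Byte-level concatenation, which is what "cat a.cif b.cif" amounts to.
joined = frag_a + frag_b

# The plain codec keeps both BOMs; the "-sig" codec strips only the first.
plain = joined.decode("utf-8")
sig = joined.decode("utf-8-sig")


def bom_offsets(text):
    """Return the character offsets of every U+FEFF in text (illustrative helper)."""
    return [i for i, ch in enumerate(text) if ch == "\ufeff"]


print(bom_offsets(plain))  # two offsets: the leading BOM and the embedded one
print(bom_offsets(sig))    # one offset: the embedded BOM survives
print(sig.split())         # U+FEFF is not whitespace, so it sticks to "data_b"
```

Either way the second BOM ends up in the middle of the text, so a
parser has to have some rule for it.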


>                                         In most cases, if we
>treat the BOMs reasonably, the concatenated CIFs will make sense,
>and probably the sense that the user intended.

It is true that most users, for most purposes, will be able to ignore
CIF syntax versions and proceed largely as they have been accustomed
to doing.  Some others will be able to adjust by making one-time
changes to a few boilerplate CIF fragments.  But even with no BOMs,
blind concatenation of a well-formed CIF1 file with a well-formed CIF2
file is not certain to produce a CIF compliant with either
specification, and whether it does can depend on the order in which
the component files are concatenated.  Similarly, with CIF1 in use
alongside CIF2, there will be more cases where cutting and pasting of
fragments from one well-formed CIF into another will result in an
ill-formed CIF, again without any consideration of BOMs.

Indeed, we face the worst possible case in that the same kinds of
things that users have done before are likely to continue to work most
of the time, but they will fail some of the time.  That means that
errors are more likely to creep into CIFs, and bugs are more likely to
appear in software, than if CIF2 made a clean break with CIF1 or if
CIF2 maintained full backwards compatibility.  I daresay neither of
those alternatives is attractive to this group, especially at this
point, so we have what we have: some things people are used to doing
with CIFs are no longer reliably safe, whether anybody likes it or
not.


>   I see no immediate harm in treating an embedded BOM as
>whitespace, but also no specific need to do so.  The main thing
>is not to treat it as a printing character and not to completely
>ignore it -- it can be a tip-off to a serious problem.

In other words, almost anything other than what's currently in the
spec.  I'm OK with treating it as a printing character (à la the
current spec), though that is my least preferred alternative.  Doing
so is probably the worst choice for compatibility with the kinds of
manipulations we're discussing, however.

If you don't treat an embedded BOM as a printing character or as
whitespace, and you don't ignore it (which I agree we should not do),
then does that leave any alternative other than to account it an
error?
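
For comparison, the three remaining alternatives could be sketched
roughly as below.  This is only an illustration in Python: the
function name handle_embedded_bom and the policy names are
hypothetical, not drawn from any spec draft or existing CIF software,
and stripping a single leading BOM as an encoding signature is an
assumption of the sketch.

```python
BOM = "\ufeff"


def handle_embedded_bom(text, policy):
    """Apply one hypothetical policy to U+FEFF found after the start of text."""
    # Assumption: a single leading BOM is an encoding signature, not content.
    body = text[1:] if text.startswith(BOM) else text
    if BOM not in body:
        return body
    if policy == "printing":
        # Leave it in place; it becomes part of whatever token it touches.
        return body
    if policy == "whitespace":
        # Treat it as a token separator.
        return body.replace(BOM, " ")
    if policy == "error":
        raise ValueError(f"embedded BOM at character offset {body.index(BOM)}")
    raise ValueError(f"unknown policy: {policy!r}")


sample = "\ufeffdata_a\n\ufeffdata_b\n"
print(handle_embedded_bom(sample, "printing").split())
print(handle_embedded_bom(sample, "whitespace").split())
```

Under the "printing" policy the second block name silently becomes
a different string, which is why that choice interacts so badly with
concatenation.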


Cheers,

John

>   Regards,
>     Herbert
>
>=====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
>=====================================================

