Discussion List Archives

Re: [ddlm-group] [THREAD 4] UTF8

I am largely of a mind with James in this discussion. Better to have a
clear and straightforward standard, with deviations from that simply
marked as 'broken'. Otherwise, you increase the burden on applications
that feel they need to strive to handle a large number of deprecated
but tacitly accepted variants. For CIFs that come to Acta in such a broken
state, we shall fix them in the archival versions as a service to those
who subsequently access them. As James says, we do this already for
CIFs submitted with non-ASCII characters. I appreciate Herbert's point that
certain cues could help to make this a more automatic process, and I am sure
that we will build heuristics that take advantage of cues that people might
implant, wittingly or otherwise (such as vim-style tags if they happen to be
found in syntactically permitted locations), but I don't see a great
advantage in inventing new cues to tag particular types of brokenness!

And even if it helps in some cases, one can imagine others where it
will not - willy-nilly cut-and-paste using different editors and/or
OSes could result in a CIF where some text fields are ISO-8859 encoded,
others UTF-16, etc.

Note that we shall still need an efficient way to embed text-fields that
are in some measure "rich" - TeX/LaTeX for example. As I've suggested
before, these can be handled, if really necessary, by a mechanism such as
MIME headers that describes the content type and encoding. In this
framework, particular text encodings such as UTF-16 or ISO-8859 could
be accommodated (possibly further BASE64 encoded or the like, to take
account of whatever restricted legal character set we agree upon) if
they were really necessary, but any such judgement of "really necessary"
would inevitably reduce the value of that particular file as a universal


On Mon, Oct 26, 2009 at 11:36:59PM +1100, James Hester wrote:
> Hi Herbert and others:
>>   The heart of the problem is the person who submits a non-UTF-8 file to a
>> system, such as the IUCr, that chooses to consider anything other than UTF-8
>> an error.  If you have no explicit flag citing what the person thinks they
>> used as an encoding, the only way you can detect and flag this error is by
>> examining the text character by character, looking for two things:
>>   keywords and tags that are invalid
>>   strings that contain invalid characters
>> Keywords and tags are not likely to raise warning flags in, say, the
>> Latin-1 vs. UTF-8 encoding, because the keywords are all from the common
>> ASCII portion of both encodings, and the tag names from the official
>> dictionaries are also all from the common ASCII portion of both encodings.
> Agreed.
>> That leaves us only with the contents of the strings themselves to use to
>> spot the differences, a dubious proposition if the person has only a few
>> accented letters.
> As I have explained previously, this is actually a distinguishing
> feature of UTF-8: a single accented character in Latin-1 or any
> ISO-8859 encoding will be broken UTF-8 immediately.  Two accented
> characters side by side have an a priori chance of accidentally being
> correct UTF-8 of 1/32, and if even only two such combinations occur
> the chance of it being correct UTF-8 is <0.1%.  The more accented
> characters, the less likely an encoding error will not be detected,
> and the easier it is for a machine to detect the error; the fewer
> accented characters, the easier it is for a human to check the
> particular word involved.  And if there are any single high-bit set
> characters, you immediately know that it is not UTF-8.  So, I would
> assert that in the particular case of UTF-8 you do not need to worry
> about not being able to detect non-UTF-8 files.
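The detection argument above can be tried out with a strict UTF-8 decode. This is only a minimal sketch in Python: the helper name is_valid_utf8 and the sample strings are assumptions made for illustration, not anything taken from the CIF2 drafts. It shows a lone Latin-1 accented character failing the check, and a pair of adjacent high-bit Latin-1 characters that happens to form a valid UTF-8 sequence, which is the accidental case described above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# "cafe au lait" with an e-acute (U+00E9), written in Latin-1: the accent
# becomes the single byte 0xE9 followed by an ASCII space.
latin1_single = "caf\u00e9 au lait".encode("latin-1")

# The same text written as genuine UTF-8.
utf8_text = "caf\u00e9 au lait".encode("utf-8")

# Two adjacent high-bit Latin-1 characters (A-tilde, copyright sign) give the
# byte pair C3 A9, which is also the UTF-8 encoding of e-acute.
latin1_pair = "\u00c3\u00a9".encode("latin-1")

print(is_valid_utf8(latin1_single))  # False
print(is_valid_utf8(utf8_text))      # True
print(is_valid_utf8(latin1_pair))    # True (accidental collision)
```

The third case is the kind of collision the probability estimate refers to; it needs two high-bit bytes in exactly the right relationship, which is why a single accented character is already enough to expose a Latin-1 file.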
> The IUCr is not really a good use case for alternative encodings,
> because (1) the strings that are likely to have accented etc.
> characters in them are those that appear in print and so will be
> examined by human eyes (2) the author is available, so straightforward
> rejection of non UTF-8 files is perfectly OK (just as a non-ASCII file
> at the moment would be treated).
>> If we give people a standard place to flag their encoding, then, if they
>> ignore that option and the editors they use ignore that option, we are no
>> worse off than if the option was not made available, but if we
>> provide the option and they either pay attention to what they are doing
>> (very unlikely) or their software pays attention to what they are doing
>> (an increasing reality) then the chances of producing a journal article
>> with a mistransliterated accent are reduced.
> I strongly disagree, for the reasons outlined in point 1 of my
> previous email.  At the same time as introducing the possibility of
> catching a rogue encoding or two, you introduce the possibility of
> using the wrong encoding and not noticing, which is much less likely
> under UTF-8.  And if we do say 'UTF-8 only', why would anyone write
> software in the first place for CIF2.0 that uses a different encoding?
> (Not only a rhetorical question - perhaps you have a use case in
> mind).
>> I do not claim this is a huge benefit, but inasmuch as the cost is very
>> low for providing it, it seems worth having, as it is worth having in XML.
>> That does leave a disagreement on the cost.  I see only the expense of a
>> few extra characters in each file.  You seem primarily concerned about
>> "2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard."
> This is indeed an important concern, but I am also concerned about the
> other points in the previous email, including adding complexity for a
> doubtful gain, but to continue:
>> I do not see the problem here.  We are designing a tool to use, and, of
>> course, people will extend and adapt it, just as both the IUCr and the
>> PDB already "play fast and loose with the standard".  Aren't we better off
>> if we provide a clean documented way for people to flag their deviations
>> from the standard than to force them to secretly engage in deviant
>> practices?  CIF is a tool, not a religion.  If the IUCr or the PDB needs
>> to do something different from the standard to get their jobs done, we
>> should look at ways to document, not to conceal, those practices.
> We have a pretty clear datapoint here: CIF1 has been pure ASCII for 16
> years, and as far as I know all the CIFs submitted to the IUCr have
> managed to stick to ASCII.  Therefore, there is no great need or
> desire to use different encodings.  If you have some counterexamples
> e.g. from imgCIF, that would be good to hear about.  Pending that
> counterexample, I would assert that insofar as ASCII was already
> sufficient to cover all character encoding needs, UTF-8 is even more
> so.   If we design a tool that clearly does the job, where will this
> need to change it come from?  I would have argued completely
> differently 10 years ago, when a variety of encodings was a fact of
> life and no encoding was obviously superior.  But I really think that
> UTF-8 does the job so well that there will be no need to move beyond
> it in the lifetime of the CIF2 standard.  And yes, I know that this
> email will be recorded forever on the IUCr website...but given how
> long ASCII has lasted I feel safe.
>> CIF stopped being simple when mmCIF was introduced.  As Frances says, it
>> is like dealing with PostScript.  Unlike core CIF, most people would be
>> well advised not to try to read mmCIF (and certainly not imgCIF) or do
>> hand editing of it, even though it looks like something you should be able
>> to read.  As much as possible, it should be handled by appropriate
>> syntax-aware tools, and the primary target for this proposal is to make it
>> easy for the programmers of those tools to have a way to deal with the
>> reality of varying character encoding and to be able to reliably deliver
>> the UTF-8 version for external transmission, even on systems for which
>> UTF-8 is _not_ the native encoding.
> My apologies, I was referring to simple syntax rather than the whole
> CIF shooting match.  mmCIF has perfectly legal, simple CIF syntax,
> upon which you can indeed build beautiful, complex semantic
> structures.  Or not.  But that is your choice.  Contrast that to XML,
> where the syntax is far more complex (I believe there are 11 different
> node types, for example), and they accept all those different
> encodings to boot.  Also, in what sense is a given encoding a 'native'
> encoding for an OS?  I always assumed that this referred to the
> encoding of the filenames.  Surely there are not OSs out there that
> magically detect a text file and convert it always to UCS-2?  If such
> systems don't exist, then why do we care what the native encoding is?
>> I disagree about whether we should be looking at python and at XML.  Both
>> are successful tools that are, in fact, serving major communities that
>> have strong overlap with our community.  Both provide tools our
>> communities use.  We do not have to slavishly adopt every feature of
>> either one, but it certainly will pay to look at the choices that they
>> have made and to consider what lessons we may learn that will be of value
>> in the development of CIF.
> I do agree with this sentiment wholeheartedly.  My point was that
> using Emacs and vim tags for encoding is rather pointless in our
> context, whereas it is reasonable for a programming language.
> Likewise, adding features such as support for a variety of encodings
> is simple for them, because only one program has to change: the
> CPython interpreter, whereas in our case there are a couple of orders
> of magnitude more programs involved, and we need to think about
> transferability as well.
> all the best,
> James.
>> On Mon, 26 Oct 2009, SIMON WESTRIP wrote:
>>> Perhaps 'benefit' is the wrong word - I was reading Herbert's argument
>>> as suggesting that it would be
>> 'good practice' to include some sort of flag so that if different
>> encodings are permitted in the future, a mechanism is already in place to
>> identify them?
>> In practice, if the only permitted encoding is UTF-8, for Acta type CIFs I
>> suspect we would adopt a zero-tolerance policy with respect to other
>> encodings, though just as we will provide tools for converting between
>> CIF1 and CIF2, we may well also include tools for converting encodings.
>> Cheers
>> Simon
>> ________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Monday, 26 October, 2009 5:28:30
>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>> The more I think about the proposal to add an encoding header or tag,
>> the more it frightens me.  Before dealing with the specific email
>> below, let me make the following additional points:
>> 1.  We have no way of knowing whether or not the correct encoding has
>> been included in the header, as Simon points out.  This is not just a
>> case of 'oh well, it was worth a try'.  This is a case of making
>> things worse: if the character set for ISO-8859-1 is used instead of
>> ISO-8859-15, only attentive human reading of the text will turn up the
>> problem.  So, while helping to read in a few extra files, for the
>> majority of encodings this proposal opens up the possibility of
>> introducing hard-to-find errors in human-readable text.  For this
>> reason I would strongly, strongly recommend that only self-identifying
>> encodings are tolerated in the first place, and that no encoding
>> header is recognised.
>> 2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard.
>> 3. Emacs and Vim are not the preferred file editing platform for most
>> of the crystallographic world.  Of course Python can use vi/emacs
>> encoding tags, as considerably more programmers are likely to be
>> familiar with vi and Emacs.
>> 4. Let's be cautious when adopting practices from Python or (e.g.)
>> XML.  We need to appreciate the differences between them and us.  For
>> example, Python is essentially a single program (CPython) and so
>> upgrade paths are easier to manage.
>> 5.  Don't forget the data archives - if the IUCr don't remediate the
>> ISO-8859-15 file to UTF-8, the archives will have to, as they have to
>> be able to deliver CIFs which are readable for all recipients.  So
>> there is guaranteed additional work and complexity involved as soon as
>> anybody starts agreeing to take other encodings.
>> 6.  A key virtue of CIF is its simplicity.  A single acceptable
>> encoding is simple.  Multiple optional encodings is not.
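Point 1 in the list above (a wrong single-byte encoding label going unnoticed) can be illustrated with a short Python sketch. The sample bytes are invented for illustration. Every byte value is assigned in both ISO-8859-1 and ISO-8859-15 as implemented by Python's codecs, so decoding never fails under either label and only the resulting text differs.

```python
# One high-bit byte, 0xA4, in otherwise ASCII text (invented sample data).
raw = b"cost 10 \xa4"

as_latin1 = raw.decode("iso-8859-1")   # 0xA4 -> U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")  # 0xA4 -> U+20AC EURO SIGN

# Both decodes succeed silently, yet the texts disagree.
print(as_latin1 == as_latin9)  # False

# Strict UTF-8, by contrast, refuses the same bytes outright.
try:
    raw.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
print(utf8_rejected)  # True
```

No automatic check can tell which of the two single-byte readings the author intended; that decision needs a human reader, whereas the UTF-8 failure is visible to a machine.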
>> Returning to Herbert's latest email, I'm glad Simon can see the
>> benefit, but I still fail to.  Let's go through this again.  There are
>> two proposals on the table: both myself and Herbert are happy to state
>> that UTF-8 is the only official encoding for CIF2.0 files. Herbert
>> further proposes describing encoding flags in the standard, whereas I
>> think this is a bad idea.
>> Let's go through Herbert's cases, with two more added for completeness:
>>   H1.  Somebody submits a UTF-8 file without the flag
>>   H2.  Somebody submits a UTF-8 file with the UTF-8 flag
>>   H3.  Somebody submits a non-UTF-8 file either with no flag
>> or with a UTF-8 flag
>>   H4.  (deleted, becomes H5 or H6)
>>   H5.  Somebody submits a non-UTF-8 file with a flag correctly telling
>> us that it is encoding xxx
>>   H6.  Somebody submits a non-UTF-8 file with a flag incorrectly
>> telling us that it is encoding xxx
>> Under my proposal, this list degenerates to:
>> J1. Somebody submits a UTF-8 file
>> J2. Somebody submits a non-UTF-8 file
>> In case J2 (the equivalent of cases H3, H5 and H6 above), the file
>> is rejected as a syntactically incorrect CIF, just as incorrect CIFs
>> are rejected today.  I don't see anything wrong with this - in the
>> IUCr use case, the author is around to deal with the problem and
>> resubmit correctly.  Alternatively, under Herbert's proposal, a
>> further level of checking can be done by using the encoding flag to
>> see if a correct CIF can be produced - and will probably still fail.
>> 'Probably' because, as Simon points out, if the author thinks they've
>> sent a UTF-8 file and haven't, they are unlikely to get niceties like
>> an encoding flag correct, so H6 (and H3) are the most likely sorts of
>> files that will reach the IUCr.  Furthermore, for a large number of
>> encodings, it will fail in a way that cannot be detected automatically
>> (see point 1 at the beginning).
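As a rough sketch of how the J1/J2 split might look in practice, the Python function below either accepts a file as UTF-8 or returns a message locating the first offending byte, so that the author can be told where to look. The function name utf8_problem and the sample CIF fragment are hypothetical; nothing here describes how the IUCr actually processes submissions.

```python
from typing import Optional


def utf8_problem(data: bytes) -> Optional[str]:
    """Return None for case J1 (valid UTF-8), or a short report for case J2."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return (f"invalid UTF-8 at byte offset {exc.start}: "
                f"0x{data[exc.start]:02x}")
    return None


# Invented fragment: an author name containing a Latin-1 e-acute (byte 0xE9).
bad = b"_name 'Andr\xe9'"
good = "_name 'Andr\u00e9'".encode("utf-8")

print(utf8_problem(good))  # None
print(utf8_problem(bad))
```

The check needs no encoding flag at all, which is the simplification being argued for: the file either is UTF-8 or is returned to the author.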
>> It would be good to hear from those who haven't said anything yet,
>> even if only to hear if they are undecided.
>> James.
>> On Sun, Oct 25, 2009 at 3:18 AM, SIMON WESTRIP
>> <simonwestrip@btinternet.com> wrote:
>>> Dear Herbert
>>> thanks for clarifying this for me - I can now see the benefits of such flags
>>> (actually, if I'd stopped to think about it, I should have spotted an
>>> analogy
>>> with the use of <meta charset=UTF-8...> tags in html...)
>>> Cheers
>>> Simon
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Saturday, 24 October, 2009 16:07:50
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>> Dear Simon,
>>>   The world is not a perfect place, but there is increasing use of clear
>>> flags for encoding.  If we provide a place for the flag in CIF there are
>>> four possibilities on a submission:
>>>   1.  Somebody submits a UTF-8 file without the flag
>>>   2.  Somebody submits a UTF-8 file with the UTF-8 flag
>>>   3.  Somebody submits a non-UTF-8 file either with no flag
>>> or with a UTF-8 flag
>>>   4.  Somebody submits a non-UTF-8 file with a flag telling
>>> us that it is a non-UTF-8 file
>>> Cases 1,2 and 4 all allow for rational handling of the file.
>>> Case 3 can result in mishandling
>>> If we do not have the flag, we cannot have case 4, and all non-UTF-8
>>> files are highly likely to cause trouble.  Yes, getting into case 4 right
>>> now depends on users who know what encoding they are using, but python
>>> programmers are already learning to be careful about it, and both vi and
>>> emacs are pretty good at recognizing mismatches, so users are learning
>>> to fix the comment if it is wrong.
>>> What is the worst that happens if we include the identification of the
>>> encoding? -- everybody just leaves it set at UTF-8 no matter what they do.
>>> We will have lost nothing.  But if just one submission makes use of the
>>> identification properly for a non-UTF-8 encoding we will have gained, and
>>> over the next few years, as the editors and their supporting scripts get
>>> smarter, I expect you will start to see significant use of the encoding
>>> flags, especially to distinguish UTF-8 from other popular unicode
>>> encodings, such as UCS-2.
>>> vim supports both the comment and the BOM.  I personally prefer the BOM to
>>> other methods, but the comment is increasingly used.
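For readers unfamiliar with the BOM approach mentioned above, the following Python sketch checks the start of a byte string against the BOM constants in the standard codecs module. The function name sniff_bom and the returned labels are assumptions for illustration; this is not a proposal for CIF2 parser behaviour.

```python
import codecs
from typing import Optional

# Longer BOMs are tested first, because the UTF-32 little-endian BOM begins
# with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"data_example"))  # utf-8
print(sniff_bom(b"data_example"))                    # None
```

A missing BOM says nothing either way: plain UTF-8, ASCII and all the ISO-8859 encodings normally carry none, so the BOM only helps to separate the Unicode encodings from one another.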
>>> Regards,
>>>   Herbert
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>                   +1-631-244-3035
>>>                   yaya@dowling.edu
>>> =====================================================
>>> On Sat, 24 Oct 2009, SIMON WESTRIP wrote:
>>>> Herbert wrote:
>>> "I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error."
>>> I'm not sure what you're getting at here. Having a UTF-8 identifier would
>>> not help in this case? Or if you mean that the actual encoding used should
>>> be tagged, it seems unlikely that having already mistakenly (and probably
>>> unknowingly) used the wrong encoding, anyone would include such a tag? So
>>> unless the encoding can be determined from something inherent to the
>>> encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any
>>> benefit?
>>> If the standard specifies UTF-8 there should be no reason to identify this
>>> in the CIF.
>>> However, I can see the advantages of such a tag if it's envisaged that
>>> other encodings will be allowed in the future, or even simply to reinforce
>>> that the CIF is CIF2 (especially if the magic number has been omitted)?
>>> I have to confess that I am starting to worry about all this slightly. As
>>> much as in the work I do I can happily read/write UTF-8 and convert from
>>> other encodings, at this stage I would probably struggle to convert from
>>> an unrecognized encoding - which is fair enough because if it's CIF2 it
>>> should be UTF-8 and I shouldn't need to convert anyway (!), but it is a
>>> worry with respect to the issue of trying to make adoption of CIF-2 as
>>> painless as possible for the end users. But then again, I'm having a bad
>>> day :-)
>>> Cheers
>>> Simon
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Friday, 23 October, 2009 20:47:40
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>> Dear Colleagues,
>>>     I have only mild objections to saying that "UTF-8 is the only official
>>> encoding for CIF 2".  My mild objection is that imgCIF will not be
>>> compliant in several of its variants, but it certainly will always be able
>>> to provide at least one compliant translation of any file, 50-60% bigger
>>> than it has to be in, say, UCS-2, but compliant.
>>>     No, the real problem is not what is officially the "right" way to write
>>> CIFs, but what people will really do.  People will do what they have
>>> always done -- work with CIF on whatever system they have.  That system
>>> may be modern and support UTF-8, but, even then, its "native" mode may be
>>> something different.  If we are lucky, the differences will be
>>> sufficiently dramatic to allow the encoding used to be detected from
>>> context.  If somebody decides they are still using EBCDIC, we will have no
>>> trouble figuring that out, but sometimes the differences are more subtle.
>>> I just took a French message catalog for RasMol and converted it to the
>>> Latin-1 encoding.  Most of the text is absolutely the same.  Just a few
>>> accented characters differ.  In a large text with just a few accents, this
>>> could easily be missed, and lots of people in Europe use the Latin-1
>>> encoding.  I am not saying that we should handle Latin-1 in all CIF-2
>>> parsers.  I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error.
>>> This is the same issue as having the magic number #\# CIF 2.0 so we have a
>>> chance of spotting cases where somebody is trying to feed in a different
>>> CIF level.  Just because somebody might, somewhere, sometime, decide to
>>> send in a file to a CIF 1 parser with a magic number such as #\# CIF 2.0
>>> does not mean that suddenly we have to tell the person with the CIF 1
>>> parser that their parser is broken.  It just means the person with the CIF
>>> 1 parser or the person with the CIF 2 file has a better chance of quickly
>>> figuring out they have a mismatch.
>>> People will edit in different encodings, whether we approve of it or not.
>>> We lose nothing by flagging the UTF-8 encoding, and we can save people a
>>> lot of time in the future.
>>> Regards,
>>>     Herbert
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>           Idle Hour Blvd, Oakdale, NY, 11769
>>>                   +1-631-244-3035
>>>                   yaya@dowling.edu
>>> =====================================================
>>> On Fri, 23 Oct 2009, David Brown wrote:
>>>> I would just like to point out a philosophical principle which we tried to
>>>> observe in the earlier CIFs, and which I think very important, namely that
>>>> in
>>>> a standard like CIF it is only necessary to define one convention for each
>>>> feature in the standard.  Writers are required to convert the input to
>>>> this
>>>> convention and readers can always be confident that they will only have to
>>>> read this one convention.  Every time you allow alternative ways of
>>>> encoding
>>>> a piece of information you *require* the reader to be able to read both
>>>> alternatives.  If you allow three different encodings, you require three
>>>> different parsers.  If you allow ten different codings, you require ten
>>>> different parsers in every piece of reading software.  With one standard,
>>>> a
>>>> single parser works everywhere.
>>>> If a standard allows two different codings, it is no longer a standard, it
>>>> is
>>>> two standards, and that is something we have tried to avoid (not always
>>>> successfully) in CIF.  It should be a goal.
>>>> David