Discussion List Archives

Re: [ddlm-group] [THREAD 4] UTF8

Dear James,

   The heart of the problem is the person who submits a non-UTF-8 file to a 
system, such as the IUCr, that chooses to consider anything other than UTF-8 
an error.  If you have no explicit flag citing what the person thinks they 
used as an encoding, the only way you can detect and flag this error is by 
examining the text character by character, looking for two things:

   keywords and tags that are invalid
   strings that contain invalid characters

Keywords and tags are not likely to raise warning flags in, say, the 
Latin-1 vs. UTF-8 encoding, because the keywords are all from the common 
ASCII portion of both encodings, and the tag names from the official 
dictionaries are also all from the common ASCII portion of both encodings.

That leaves us only with the contents of the strings themselves to use to 
spot the differences, a dubious proposition if the person has only a few 
accented letters.
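
As a minimal sketch of the point above (the function name and the sample 
CIF lines are hypothetical, not taken from any existing tool), strict 
UTF-8 decoding can only complain about the bytes that encode the accented 
letters; the keywords and tag names are plain ASCII and look the same in 
Latin-1 and in UTF-8:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# ASCII-only content (keywords, dictionary tag names) is byte-for-byte the
# same in Latin-1 and UTF-8, so it can never reveal the mismatch.
tag_line = "_cell_length_a 5.0\n"
assert tag_line.encode("latin-1") == tag_line.encode("utf-8")

# Only a string value containing an accented letter differs between the two.
latin1_line = "_publ_author_name 'Andr\u00e9'\n".encode("latin-1")
print(first_invalid_utf8_offset(tag_line.encode("latin-1")))  # None
print(first_invalid_utf8_offset(latin1_line))  # offset of the Latin-1 'é' byte
```

A check of this kind says only that the file is not UTF-8; it does not say 
which encoding the author actually used.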

If we give people a standard place to flag their encoding, then, if they 
ignore that option and the editors they use ignore that option, we are no 
worse off than if the option was not made available, but if we 
provide the option and they either pay attention to what they are doing 
(very unlikely) or their software pays attention to what they are doing 
(an increasing reality) then the chances of producing a journal article 
with a mistransliterated accent are reduced.

I do not claim this is a huge benefit, but inasmuch as the cost is very 
low for providing it, it seems worth having, as it is worth having in XML.

That does leave a disagreement on the cost.  I see only the expense of a 
few extra characters in each file.  You seem primarily concerned about

"2. The very existence of an encoding tag will encourage people not to
use UTF-8 if their own favourite encoding is easier and the recipient
agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
We are thus encouraging people to play fast and loose with the standard."

I do not see the problem here.  We are designing a tool to use, and, of 
course, people will extend and adapt it, just as both the IUCr and the 
PDB already "play fast and loose with the standard".  Aren't we better off 
if we provide a clean documented way for people to flag their deviations 
from the standard than to force them to secretly engage in deviant 
practices?  CIF is a tool, not a religion.  If the IUCr or the PDB needs 
to do something different from the standard to get their jobs done, we 
should look at ways to document, not to conceal, those practices.

CIF stopped being simple when mmCIF was introduced.  As Frances says, it 
is like dealing with PostScript.  Unlike core CIF, most people would be 
well advised not to try to read mmCIF (and certainly not imgCIF) or do 
hand editing of it, even though it looks like something you should be able 
to read.  As much as possible, it should be handled by appropriate 
syntax-aware tools, and the primary target for this proposal is to make it 
easy for the programmers of those tools to have a way to deal with the 
reality of varying character encoding and to be able to reliably deliver 
the UTF-8 version for external transmission, even on systems for which 
UTF-8 is _not_ the native encoding.
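
A minimal sketch of what such a tool might do, assuming it has already 
obtained the declared encoding from wherever the flag ends up living (the 
function name deliver_utf8 and its parameter are hypothetical; no flag 
syntax has been agreed in this thread):

```python
def deliver_utf8(data: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Transcode a locally edited file to UTF-8 for external transmission.

    Raises UnicodeDecodeError if the bytes do not fit the declared encoding,
    and LookupError if Python does not know the encoding name.
    """
    text = data.decode(declared_encoding)  # interpret bytes as declared
    return text.encode("utf-8")            # always send UTF-8 outward


# A file edited on a system whose native encoding is Latin-1:
local_bytes = "_publ_author_name 'Andr\u00e9'\n".encode("latin-1")
wire_bytes = deliver_utf8(local_bytes, "latin-1")
print(wire_bytes)
```

The conversion is only as good as the declaration: a wrong but compatible 
encoding name will still decode without raising an error.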

I disagree about whether we should be looking at python and at XML.  Both 
are successful tools that are, in fact, serving major communities that 
have strong overlap with our community.  Both provide tools our 
communities use.  We do not have to slavishly adopt every feature of 
either one, but it certainly will pay to look at the choices that they 
have made and to consider what lessons we may learn that will be of value 
in the development of CIF.


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 26 Oct 2009, SIMON WESTRIP wrote:

> Perhaps 'benefit' is the wrong word - I was reading Herbert's argument 
> as suggesting that it would be 'good practice' to include some sort of 
> flag so that if different encodings are permitted in the future, a 
> mechanism is already in place to identify them?
>
> In practice, if the only permitted encoding is UTF-8, for Acta type CIFs I 
> suspect we would adopt a zero-tolerance policy with respect to other 
> encodings, though just as we will provide tools for converting between 
> CIF1 and CIF2, we may well also include tools for converting encodings.



From: James Hester <jamesrhester@gmail.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 26 October, 2009 5:28:30
Subject: Re: [ddlm-group] [THREAD 4] UTF8

The more I think about the proposal to add an encoding header or tag,
the more it frightens me.  Before dealing with the specific email
below, let me make the following additional points:

1.  We have no way of knowing whether or not the correct encoding has
been included in the header, as Simon points out.  This is not just a
case of 'oh well, it was worth a try'.  This is a case of making
things worse: if the character set for ISO-8859-1 is used instead of
ISO-8859-15, only attentive human reading of the text will turn up the
problem.  So, while helping to read in a few extra files, for the
majority of encodings this proposal opens up the possibility of
introducing hard-to-find errors in human-readable text.  For this
reason I would strongly, strongly recommend that only self-identifying
encodings are tolerated in the first place, and that no encoding
header is recognised.

2. The very existence of an encoding tag will encourage people not to
use UTF-8 if their own favourite encoding is easier and the recipient
agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
We are thus encouraging people to play fast and loose with the standard.

3. Emacs and Vim are not the preferred file editing platform for most
of the crystallographic world.  Of course Python can use vi/emacs
encoding tags, as considerably more programmers are likely to be
familiar with vi and Emacs.

4. Let's be cautious when adopting practices from Python or (e.g.)
XML.  We need to appreciate the differences between them and us.  For
example, Python is essentially a single program (CPython) and so
upgrade paths are easier to manage.

5.  Don't forget the data archives - if the IUCr don't remediate the
ISO-8859-15 file to UTF-8, the archives will have to, as they have to
be able to deliver CIFs which are readable for all recipients.  So
there is guaranteed additional work and complexity involved as soon as
anybody starts agreeing to take other encodings.

6.  A key virtue of CIF is its simplicity.  A single acceptable
encoding is simple.  Multiple optional encodings is not.
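
Point 1 above can be illustrated with a short sketch (the helper name
ambiguous_bytes is hypothetical).  ISO-8859-1 and ISO-8859-15 assign a
character to every byte value in the upper range, so decoding with the
wrong one of the two never raises an error; a handful of byte values
simply come out as different characters:

```python
from typing import List, Tuple


def ambiguous_bytes(data: bytes) -> List[Tuple[int, str, str]]:
    """List (offset, ISO-8859-1 char, ISO-8859-15 char) where the two differ."""
    as_latin1 = data.decode("iso-8859-1")
    as_latin9 = data.decode("iso-8859-15")
    return [
        (offset, a, b)
        for offset, (a, b) in enumerate(zip(as_latin1, as_latin9))
        if a != b
    ]


raw = b"caf\xe9 \xa4 \xbd"
# Both decodings succeed; 0xE9 is the same letter in each, but 0xA4 and
# 0xBD are different characters depending on which encoding is assumed.
for offset, latin1_char, latin9_char in ambiguous_bytes(raw):
    print(offset, repr(latin1_char), repr(latin9_char))
```

Nothing in the byte stream itself says which reading the author intended.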

Returning to Herbert's latest email, I'm glad Simon can see the
benefit, but I still fail to.  Let's go through this again.  There are
two proposals on the table: both myself and Herbert are happy to state
that UTF-8 is the only official encoding for CIF2.0 files. Herbert
further proposes describing encoding flags in the standard, whereas I
think this is a bad idea.

Let's go through Herbert's cases, with two more added for completeness:
   H1.  Somebody submits a UTF-8 file without the flag
   H2.  Somebody submits a UTF-8 file with the UTF-8 flag
   H3.  Somebody submits a non-UTF-8 file either with no flag
or with a UTF-8 flag
   H4.  (deleted, becomes H5 or H6)
   H5.  Somebody submits a non-UTF-8 file with a flag correctly telling
us that it is encoding xxx
   H6.  Somebody submits a non-UTF-8 file with a flag incorrectly
telling us that it is encoding xxx

Under my proposal, this list degenerates to:

J1. Somebody submits a UTF-8 file
J2. Somebody submits a non-UTF-8 file

In case J2 (the equivalent of cases H3, H5 and H6 above), the file
is rejected as a syntactically incorrect CIF, just as incorrect CIFs
are rejected today.  I don't see anything wrong with this - in the
IUCr use case, the author is around to deal with the problem and
resubmit correctly.  Alternatively, under Herbert's proposal, a
further level of checking can be done by using the encoding flag to
see if a correct CIF can be produced - and will probably still fail.
'Probably' because, as Simon points out, if the author thinks they've
sent a UTF-8 file and haven't, they are unlikely to get niceties like
an encoding flag correct, so H6 (and H3) are the most likely sorts of
files that will reach the IUCr.  Furthermore, for a large number of
encodings, it will fail in a way that cannot be detected automatically
(see point 1 at the beginning).
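
The two case lists above can be sketched as follows.  This is only an
illustration under stated assumptions: the function names are
hypothetical, the flag is assumed to have been extracted already as a
plain string (or None if absent), and "is valid UTF-8" is used as a
stand-in for "is a UTF-8 file", which a pure-ASCII file always satisfies.

```python
from typing import Optional


def is_utf8(data: bytes) -> bool:
    """True if the bytes decode under strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify_single_encoding(data: bytes) -> str:
    """Single-encoding proposal: accept (J1) or reject (J2)."""
    return "J1" if is_utf8(data) else "J2"


def classify_with_flag(data: bytes, flag: Optional[str]) -> str:
    """Flag proposal: sort a submission into the H cases listed above."""
    flag_says_utf8 = flag is not None and flag.strip().lower() in ("utf-8", "utf8")
    if is_utf8(data):
        if flag is None:
            return "H1"
        if flag_says_utf8:
            return "H2"
        return "not listed"  # UTF-8 content with a non-UTF-8 flag
    if flag is None or flag_says_utf8:
        return "H3"
    # Software cannot tell a correct flag (H5) from an incorrect one (H6)
    # when both encodings accept the same bytes.
    return "H5 or H6"


latin1_file = "_publ_author_name 'Andr\u00e9'\n".encode("latin-1")
print(classify_single_encoding(latin1_file))
print(classify_with_flag(latin1_file, "iso-8859-1"))
```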

It would be good to hear from those who haven't said anything yet,
even if only to hear if they are undecided.


On Sun, Oct 25, 2009 at 3:18 AM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
> Dear Herbert
> thanks for clarifying this for me - I can now see the benefits of such flags
> (actually, if I'd stopped to think about it, I should have spotted an
> analogy with the use of <meta charset=UTF-8...> tags in html...)
> Cheers
> Simon
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Saturday, 24 October, 2009 16:07:50
> Subject: Re: [ddlm-group] [THREAD 4] UTF8
> Dear Simon,
>   The world is not a perfect place, but there is increasing use of clear
> flags for encoding.  If we provide a place for the flag in CIF there are
> four possibilities on a submission:
>   1.  Somebody submits a UTF-8 file without the flag
>   2.  Somebody submits a UTF-8 file with the UTF-8 flag
>   3.  Somebody submits a non-UTF-8 file either with no flag
> or with a UTF-8 flag
>   4.  Somebody submits a non-UTF-8 file with a flag telling
> us that it is a non-UTF-8 file
> Cases 1, 2 and 4 all allow for rational handling of the file.
> Case 3 can result in mishandling.
> If we do not have the flag, we cannot have case 4, and all non-UTF-8
> files are highly likely to cause trouble.  Yes, getting into case 4 right
> now depends on users who know what encoding they are using, but python
> programmers are already learning to be careful about it, and both vi and
> emacs are pretty good at recognizing mismatches, so users are learning
> to fix the comment if it is wrong.
> What is the worst that happens if we include the identification of the
> encoding? -- everybody just leaves it set at UTF-8 no matter what they do.
> We will have lost nothing.  But if just one submission makes use of the
> identification properly for a non-UTF-8 encoding we will have gained, and
> over the next few years, as the editors and their supporting scripts get
> smarter, I expect you will start to see significant use of the encoding
> flags, especially to distinguish UTF-8 from other popular unicode
> encodings, such as UCS-2.
> vim supports both the comment and the BOM.  I personally prefer the BOM to
> other methods, but the comment is increasingly used.
> Regards,
>   Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> On Sat, 24 Oct 2009, SIMON WESTRIP wrote:
>> Herbert wrote:
> "I am saying that it would be a very good idea to conform to the
> vim or emacs editor conventions in marking CIF with their encoding, so
> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
> error."
> I'm not sure what you're getting at here. Having a UTF-8 identifier would
> not help in this case? Or if you mean that the actual encoding used should
> be tagged, it seems unlikely that having already mistakenly (and probably
> unknowingly) used the wrong encoding, anyone would include such a tag? So
> unless the encoding can be determined from something inherent to the
> encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any
> benefit?
> If the standard specifies UTF-8 there should be no reason to identify this
> in the CIF.
> However, I can see the advantages of such a tag if it's envisaged that
> other encodings will be allowed in the future, or even simply to reinforce
> that the CIF is CIF2 (especially if the magic number has been omitted)?
> I have to confess that I am starting to worry about all this slightly. As
> much as in the work I do I can happily read/write UTF-8 and convert from
> other encodings, at this stage I would probably struggle to convert from
> an unrecognized encoding - which is fair enough because if it's CIF2 it
> should be UTF-8 and I shouldn't need to convert anyway (!), but it is a
> worry with respect to the issue of trying to make adoption of CIF-2 as
> painless as possible for the end users. But then again, I'm having a bad
> day :-)
> Cheers
> Simon
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Friday, 23 October, 2009 20:47:40
> Subject: Re: [ddlm-group] [THREAD 4] UTF8
> Dear Colleagues,
>     I have only mild objections to saying that "UTF-8 is the only official
> encoding for CIF 2".  My mild objection is that imgCIF will not be
> compliant in several of its variants, but it certainly will always be able
> to provide at least one compliant translation of any file, 50-60% bigger
> than it has to be in, say, UCS-2, but compliant.
>     No, the real problem is not what is officially the "right" way to write
> CIFs, but what people will really do.  People will do what they have
> always done -- work with CIF on whatever system they have.  That system
> may be modern and support UTF-8, but, even then, its "native" mode may be
> something different.  If we are lucky, the differences will be
> sufficiently dramatic to allow the encoding used to be detected from
> context.  If somebody decides they are still using EBCDIC, we will have no
> trouble figuring that out, but sometimes the differences are more subtle.
> I just took a French message catalog for RasMol and converted it to the
> Latin-1 encoding.  Most of the text is absolutely the same.  Just a few
> accented characters differ.  In a large text with just a few accents, this
> could easily be missed, and lots of people in Europe use the Latin-1
> encoding.  I am not saying that we should handle Latin-1 in all CIF-2
> parsers.  I am saying that it would be a very good idea to conform to the
> vim or emacs editor conventions in marking CIF with their encoding, so
> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
> error.
> This is the same issue as having the magic number #\# CIF 2.0 so we have a
> chance of spotting cases where somebody is trying to feed in a different
> CIF level.  Just because somebody might, somewhere, sometime, decide to
> send in a file to a CIF 1 parser with a magic number such as #\# CIF 2.0
> does not mean that suddenly we have to tell the person with the CIF 1
> parser that their parser is broken.  It just means the person with the CIF
> 1 parser or the person with the CIF 2 file have a better chance of quickly
> figuring out they have a mismatch.
> People will edit in different encodings, whether we approve of it or not.
> We lose nothing by flagging the UTF-8 encoding, and we can save people a
> lot of time in the future.
> Regards,
>     Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>           Idle Hour Blvd, Oakdale, NY, 11769
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> On Fri, 23 Oct 2009, David Brown wrote:
>> I would just like to point out a philosophical principle which we tried to
>> observe in the earlier CIFs, and which I think very important, namely that
>> in a standard like CIF it is only necessary to define one convention for
>> each feature in the standard.  Writers are required to convert the input
>> to this convention and readers can always be confident that they will only
>> have to read this one convention.  Every time you allow alternative ways
>> of encoding a piece of information you *require* the reader to be able to
>> read both alternatives.  If you allow three different encodings, you
>> require three different parsers.  If you allow ten different codings, you
>> require ten different parsers in every piece of reading software.  With
>> one standard, a single parser works everywhere.
>> If a standard allows two different codings, it is no longer a standard, it
>> is two standards, and that is something we have tried to avoid (not always
>> successfully) in CIF.  It should be a goal.
>> David
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

