Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: SIMON WESTRIP <simonwestrip@btinternet.com>
Date: Sat, 24 Oct 2009 09:18:23 -0700 (PDT)
In-Reply-To: <20091024104627.N28064@epsilon.pair.com>
References: <279aad2a0910120838t5f400d71wf1f237d05338c08@mail.gmail.com><C6F976F1.1206C%nick@csse.uwa.edu.au><279aad2a0910221613m2a2a7891k4ae23476e50f98e4@mail.gmail.com><20091022214818.D61491@epsilon.pair.com><279aad2a0910222132t5c8297aao90914fa40c4fbd91@mail.gmail.com><4AE20173.9060700@mcmaster.ca><20091023152244.U10188@epsilon.pair.com><715417.99025.qm@web87006.mail.ird.yahoo.com><20091024104627.N28064@epsilon.pair.com>

Dear Herbert

thanks for clarifying this for me - I can now see the benefits of such flags
(actually, if I'd stopped to think about it, I should have spotted an analogy
with the use of <meta charset=UTF-8...> tags in html...)

Cheers

Simon

From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Saturday, 24 October, 2009 16:07:50
Subject: Re: [ddlm-group] [THREAD 4] UTF8

Dear Simon,

The world is not a perfect place, but there is increasing use of clear
flags for encoding. If we provide a place for the flag in CIF, there are
four possibilities on a submission:

1. Somebody submits a UTF-8 file without the flag
2. Somebody submits a UTF-8 file with the UTF-8 flag
3. Somebody submits a non-UTF-8 file either with no flag
or with a UTF-8 flag
4. Somebody submits a non-UTF-8 file with a flag telling
us that it is a non-UTF-8 file

Cases 1, 2 and 4 all allow for rational handling of the file.
Case 3 can result in mishandling.
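
The four cases above can be sketched in Python. This is only an illustration under stated assumptions: the flag syntax (an emacs-style "coding: <name>" comment in the first two lines) and the function names are hypothetical, since the thread does not fix a syntax. Whether the bytes decode as UTF-8 is also only a heuristic for "is a UTF-8 file", because pure ASCII decodes cleanly under either encoding.

```python
import re

# Assumed flag syntax: "coding: <name>" or "coding=<name>" inside a comment
# on one of the first two lines (the vim/emacs convention mentioned in the
# thread). The exact CIF syntax is not specified here.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")
UTF8_NAMES = ("utf-8", "utf8")


def declared_encoding(data: bytes):
    """Return the encoding named in the first two lines, or None."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes) -> int:
    """Map a submission onto cases 1 to 4 from the list above."""
    flag = declared_encoding(data)
    utf8 = is_valid_utf8(data)
    if utf8 and flag is None:
        return 1  # UTF-8 file, no flag
    if utf8 and flag in UTF8_NAMES:
        return 2  # UTF-8 file, UTF-8 flag
    if flag is None or flag in UTF8_NAMES:
        return 3  # non-UTF-8 file, missing or wrong flag
    return 4      # flag says the file is not UTF-8
```

Only case 3 leaves the reader with nothing trustworthy to act on; in case 4 the declared name can be passed straight to a decoder.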

If we do not have the flag, we cannot have case 4, and all non-UTF-8
files are highly likely to cause trouble. Yes, getting into case 4 right
now depends on users who know what encoding they are using, but Python
programmers are already learning to be careful about it, and both vi and
emacs are pretty good at recognizing mismatches, so users are learning
to fix the comment if it is wrong.

What is the worst that happens if we include the identification of the
encoding? -- everybody just leaves it set at UTF-8 no matter what they do.
We will have lost nothing. But if just one submission makes use of the
identification properly for a non-UTF-8 encoding we will have gained, and
over the next few years, as the editors and their supporting scripts get
smarter, I expect you will start to see significant use of the encoding
flags, especially to distinguish UTF-8 from other popular Unicode
encodings, such as UCS-2.

vim supports both the comment and the BOM. I personally prefer the BOM to
other methods, but the comment is increasingly used.
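
A minimal sketch of BOM detection in Python, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` is hypothetical, and a missing BOM proves nothing about the encoding, so this would complement a comment-style flag and not replace it.

```python
import codecs

# Longer BOMs are tested first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

UCS-2 files carry the same two-byte BOM as UTF-16, so they would be reported here under the UTF-16 names.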

Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sat, 24 Oct 2009, SIMON WESTRIP wrote:

> Herbert wrote:

"I am saying that it would be a very good idea to conform to the
vim or emacs editor conventions in marking CIF with their encoding, so
that if somebody does make a mistake and send a journal a Latin-1 CIF-2
file instead of a UTF-8 CIF-2, there will be some chance of spotting the
error."

I'm not sure what you're getting at here. Having a UTF-8 identifier would
not help in this case? Or if you mean that the actual encoding used should
be tagged, it seems unlikely that having already mistakenly (and probably
unknowingly) used the wrong encoding, anyone would include such a tag? So
unless the encoding can be determined from something inherent to the
encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any
benefit?

If the standard specifies UTF-8 there should be no reason to identify this
in the CIF.

However, I can see the advantages of such a tag if it's envisaged that
other encodings will be allowed in the future, or even simply to reinforce
that the CIF is CIF2 (especially if the magic number has been omitted)?

I have to confess that I am starting to worry about all this slightly. As
much as in the work I do I can happily read/write UTF-8 and convert from
other encodings, at this stage I would probably struggle to convert from
an unrecognized encoding - which is fair enough because if it's CIF2 it
should be UTF-8 and I shouldn't need to convert anyway (!), but it is a
worry with respect to the issue of trying to make adoption of CIF-2 as
painless as possible for the end users. But then again, I'm having a bad
day :-)

Cheers

Simon

________________________________
From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Friday, 23 October, 2009 20:47:40
Subject: Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

I have only mild objections to saying that "UTF-8 is the only official
encoding for CIF 2". My mild objection is that imgCIF will not be
compliant in several of its variants, but it certainly will always be able
to provide at least one compliant translation of any file, 50-60% bigger
than it has to be in, say, UCS-2, but compliant.

No, the real problem is not what is officially the "right" way to write
CIFs, but what people will really do. People will do what they have
always done -- work with CIF on whatever system they have. That system
may be modern and support UTF-8, but, even then, its "native" mode may be
something different. If we are lucky, the differences will be
sufficiently dramatic to allow the encoding used to be detected from
context. If somebody decides they are still using EBCDIC, we will have no
trouble figuring that out, but sometimes the differences are more subtle.
I just took a French message catalog for RasMol and converted it to the
Latin-1 encoding. Most of the text is absolutely the same. Just a few
accented characters differ. In a large text with just a few accents, this
could easily be missed, and lots of people in Europe use the Latin-1
encoding. I am not saying that we should handle Latin-1 in all CIF-2
parsers. I am saying that it would be a very good idea to conform to the
vim or emacs editor conventions in marking CIF with their encoding, so
that if somebody does make a mistake and send a journal a Latin-1 CIF-2
file instead of a UTF-8 CIF-2, there will be some chance of spotting the
error.
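
The point about Latin-1 and UTF-8 differing only in the accented characters can be seen in a few lines of Python. The sample sentence below is hypothetical and is not taken from the RasMol message catalog; the helper name `decodes_as_utf8` is likewise just for illustration.

```python
# Hypothetical sample sentence; it is not taken from the RasMol catalog.
text = "Le fichier a été chargé"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Latin-1 uses one byte per character; UTF-8 uses two bytes for each of
# these accented characters and one byte for every ASCII character.
accented = sum(1 for ch in text if ord(ch) > 127)
extra_bytes = len(utf8_bytes) - len(latin1_bytes)


def decodes_as_utf8(data: bytes) -> bool:
    """True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Reading UTF-8 bytes as Latin-1 never raises; it silently yields mojibake.
misread = utf8_bytes.decode("latin-1")

print(accented, extra_bytes)
print(decodes_as_utf8(utf8_bytes), decodes_as_utf8(latin1_bytes))
print(misread)
```

A strict UTF-8 decode catches the Latin-1 bytes here, but the reverse mistake passes silently, which is one reason an explicit marker gives a better chance of spotting the error.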

This is the same issue as having the magic number #\# CIF 2.0 so we have a
chance of spotting cases where somebody is trying to feed in a different
CIF level. Just because somebody might, somewhere, sometime, decide to
send in a file to a CIF 1 parser with a magic number such as #\# CIF 2.0
does not mean that suddenly we have to tell the person with the CIF 1
parser that their parser is broken. It just means the person with the CIF
1 parser or the person with the CIF 2 file has a better chance of quickly
figuring out they have a mismatch.

People will edit in different encodings, whether we approve of it or not.

We lose nothing by flagging the UTF-8 encoding, and we can save people a
lot of time in the future.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 23 Oct 2009, David Brown wrote:

> I would just like to point out a philosophical principle which we tried to
> observe in the earlier CIFs, and which I think very important, namely that in
> a standard like CIF it is only necessary to define one convention for each
> feature in the standard. Writers are required to convert the input to this
> convention and readers can always be confident that they will only have to
> read this one convention. Every time you allow alternative ways of encoding
> a piece of information you *require* the reader to be able to read both
> alternatives. If you allow three different encodings, you require three
> different parsers. If you allow ten different codings, you require ten
> different parsers in every piece of reading software. With one standard, a
> single parser works everywhere.
>
> If a standard allows two different codings, it is no longer a standard, it is
> two standards, and that is something we have tried to avoid (not always
> successfully) in CIF. It should be a goal.
> David
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group