Re: [ddlm-group] [THREAD 4] UTF8

Dear James,

   We see the balance points differently.  I would suggest taking a straw 
poll of those who are interested and moving on to other issues.  imgCIF
will use UTF-8, UCS-2 and several true binary representations, but I guess
that all but the UTF-8 can go under the "binary" heading.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 26 Oct 2009, James Hester wrote:

> Hi Herbert and others:
>
>>   The heart of the problem is the person who submits a non-UTF-8 file to a
>> system, such as the IUCr, that chooses to consider anything other than
>> UTF-8 an error.  If you have no explicit flag stating what encoding the
>> person thinks they used, the only way you can detect and flag this error
>> is by examining the text character by character, looking for two things:
>>
>>   keywords and tags that are invalid
>>   strings that contain invalid characters
>>
>> Keywords and tags are not likely to raise warning flags in, say, the
>> Latin-1 vs. UTF-8 encoding, because the keywords are all from the common
>> ASCII portion of both encodings, and the tag names from the official
>> dictionaries are also all from the common ASCII portion of both encodings.
>
> Agreed.
>
>> That leaves us only with the contents of the strings themselves to use to
>> spot the differences, a dubious proposition if the person has only a few
>> accented letters.
>
> As I have explained previously, this is actually a distinguishing
> feature of UTF-8: a single accented character in Latin-1 or any
> ISO-8859 encoding will be broken UTF-8 immediately.  Two accented
> characters side by side have an a priori chance of accidentally being
> valid UTF-8 of 1/32, and if even only two such combinations occur the
> chance of the file being valid UTF-8 is <0.1%.  The more accented
> characters, the more likely an encoding error will be detected, and
> the easier it is for a machine to detect it; the fewer accented
> characters, the easier it is for a human to check the particular word
> involved.  And if there is any isolated high-bit-set character, you
> immediately know that the file is not UTF-8.  So I would assert that
> in the particular case of UTF-8 you need not worry about failing to
> detect non-UTF-8 files.
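>
> The check itself is trivial to automate; a minimal Python 3 sketch of
> what I mean (the sample word is invented):
>
>     # Strict UTF-8 validation: any stray Latin-1 accented byte fails.
>     def is_valid_utf8(raw: bytes) -> bool:
>         try:
>             raw.decode('utf-8', errors='strict')
>             return True
>         except UnicodeDecodeError:
>             return False
>
>     # 'cafe' with a Latin-1 e-acute ends in a lone high-bit byte
>     # (0xE9): rejected immediately.
>     assert not is_valid_utf8(b'caf\xe9')
>     # The same word correctly encoded as UTF-8 (0xC3 0xA9): accepted.
>     assert is_valid_utf8(b'caf\xc3\xa9')
>
> A real validator only needs the first function; the asserts just
> demonstrate the point about accented characters.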
>
> The IUCr is not really a good use case for alternative encodings,
> because (1) the strings that are likely to contain accented etc.
> characters are those that appear in print and so will be examined by
> human eyes, and (2) the author is available, so straightforward
> rejection of non-UTF-8 files is perfectly OK (just as a non-ASCII
> file would be treated at the moment).
>
>> If we give people a standard place to flag their encoding, then, if they
>> ignore that option and the editors they use ignore that option, we are no
>> worse off than if the option was not made available; but if we
>> provide the option and they either pay attention to what they are doing
>> (very unlikely) or their software pays attention to what they are doing
>> (an increasing reality), then the chances of producing a journal article
>> with a mistransliterated accent are reduced.
>
> I strongly disagree, for the reasons outlined in point 1 of my
> previous email.  At the same time as introducing the possibility of
> catching a rogue encoding or two, you introduce the possibility of
> using the wrong encoding and not noticing, which is much less likely
> under UTF-8.  And if we do say 'UTF-8 only', why would anyone write
> software in the first place for CIF2.0 that uses a different encoding?
> (Not only a rhetorical question - perhaps you have a use case in
> mind).
>
>> I do not claim this is a huge benefit, but inasmuch as the cost is very
>> low for providing it, it seems worth having, as it is worth having in XML.
>>
>> That does leave a disagreement on the cost.  I see only the expense of a
>> few extra characters in each file.  You seem primarily concerned about
>>
>> "2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard."
>
> This is indeed an important concern, but I am also concerned about the
> other points in my previous email, including adding complexity for a
> doubtful gain.  To continue:
>
>> I do not see the problem here.  We are designing a tool to use, and, of
>> course, people will extend and adapt it, just as both the IUCr and the
>> PDB already "play fast and loose with the standard".  Aren't we better off
>> if we provide a clean documented way for people to flag their deviations
>> from the standard than to force them to secretly engage in deviant
>> practices?  CIF is a tool, not a religion.  If the IUCr or the PDB needs
>> to do something different from the standard to get their jobs done, we
>> should look at ways to document, not to conceal, those practices.
>
> We have a pretty clear datapoint here: CIF1 has been pure ASCII for 16
> years, and as far as I know all the CIFs submitted to the IUCr have
> managed to stick to ASCII.  Therefore, there is no great need or
> desire to use different encodings.  If you have some counterexamples
> e.g. from imgCIF, that would be good to hear about.  Pending that
> counterexample, I would assert that insofar as ASCII was already
> sufficient to cover all character encoding needs, UTF-8 is even more
> so.   If we design a tool that clearly does the job, where will this
> need to change it come from?  I would have argued completely
> differently 10 years ago, when a variety of encodings was a fact of
> life and no encoding was obviously superior.  But I really think that
> UTF-8 does the job so well that there will be no need to move beyond
> it in the lifetime of the CIF2 standard.  And yes, I know that this
> email will be recorded forever on the IUCr website...but given how
> long ASCII has lasted I feel safe.
>
>> CIF stopped being simple when mmCIF was introduced.  As Frances says, it
>> is like dealing with PostScript.  Unlike core CIF, most people would be
>> well advised not to try to read mmCIF (and certainly not imgCIF) or do
>> hand editing of it, even though it looks like something you should be able
>> to read.  As much as possible, it should be handled by appropriate
>> syntax-aware tools, and the primary target for this proposal is to make it
>> easy for the programmers of those tools to have a way to deal with the
>> reality of varying character encoding and to be able to reliably deliver
>> the UTF-8 version for external transmission, even on systems for which
>> UTF-8 is _not_ the native encoding.
>
> My apologies, I was referring to simple syntax rather than the whole
> CIF shooting match.  mmCIF has perfectly legal, simple CIF syntax,
> upon which you can indeed build beautiful, complex semantic
> structures.  Or not.  But that is your choice.  Contrast that to XML,
> where the syntax is far more complex (I believe there are 11 different
> node types, for example), and they accept all those different
> encodings to boot.  Also, in what sense is a given encoding a 'native'
> encoding for an OS?  I always assumed that this referred to the
> encoding of the filenames.  Surely there are no OSs out there that
> magically detect a text file and always convert it to UCS-2?  If such
> systems don't exist, then why do we care what the native encoding is?
>
>> I disagree about whether we should be looking at python and at XML.  Both
>> are successful tools that are, in fact, serving major communities that
>> have strong overlap with our community.  Both provide tools our
>> communities use.  We do not have to slavishly adopt every feature of
>> either one, but it certainly will pay to look at the choices that they
>> have made and to consider what lessons we may learn that will be of value
>> in the development of CIF.
>
> I do agree with this sentiment wholeheartedly.  My point was that
> using Emacs and vim tags for encoding is rather pointless in our
> context, whereas it is reasonable for a programming language.
> Likewise, adding features such as support for a variety of encodings
> is simple for them, because only one program has to change: the
> CPython interpreter.  In our case there are a couple of orders of
> magnitude more programs involved, and we need to think about
> transferability as well.
>
> all the best,
> James.
>
>> On Mon, 26 Oct 2009, SIMON WESTRIP wrote:
>>
>>> Perhaps 'benefit' is the wrong word - I was reading Herbert's argument
>>> as suggesting that it would be 'good practice' to include some sort of
>>> flag so that if different encodings are permitted in the future, a
>>> mechanism is already in place to identify them?
>>>
>>> In practice, if the only permitted encoding is UTF-8, for Acta-type CIFs
>>> I suspect we would adopt a zero-tolerance policy with respect to other
>>> encodings, though just as we will provide tools for converting between
>>> CIF1 and CIF2, we may well also include tools for converting encodings.
>>>
>>> Cheers
>>>
>>> Simon
>>
>> ________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Monday, 26 October, 2009 5:28:30
>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>
>> The more I think about the proposal to add an encoding header or tag,
>> the more it frightens me.  Before dealing with the specific email
>> below, let me make the following additional points:
>>
>> 1.  We have no way of knowing whether or not the correct encoding has
>> been included in the header, as Simon points out.  This is not just a
>> case of 'oh well, it was worth a try'.  This is a case of making
>> things worse: if ISO-8859-1 is assumed when the file is actually in
>> ISO-8859-15, only attentive human reading of the text will turn up the
>> problem.  So, while helping to read in a few extra files, for the
>> majority of encodings this proposal opens up the possibility of
>> introducing hard-to-find errors in human-readable text.  For this
>> reason I would strongly, strongly recommend that only self-identifying
>> encodings are tolerated in the first place, and that no encoding
>> header is recognised.
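>>
>> To make that failure mode concrete, a minimal Python 3 sketch (the
>> sample string is invented):
>>
>>     # ISO-8859-15 bytes misread as ISO-8859-1 raise no error at all -
>>     # every byte is "valid" - the text is just silently wrong.
>>     original = 'coûte 5€ par Œuvre'
>>     raw = original.encode('iso8859_15')
>>     misread = raw.decode('iso8859_1')   # succeeds without complaint
>>     print(misread)                      # 'coûte 5¤ par ¼uvre'
>>     # Only attentive human reading notices that € became ¤
>>     # and Œ became ¼.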
>>
>> 2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard.
>>
>> 3. Emacs and Vim are not the preferred file editing platforms for most
>> of the crystallographic world.  Of course Python can use vi/emacs
>> encoding tags, as considerably more programmers than crystallographers
>> are likely to be familiar with vi and Emacs.
>>
>> 4. Let's be cautious when adopting practices from Python or (e.g.)
>> XML.  We need to appreciate the differences between them and us.  For
>> example, Python is essentially a single program (CPython) and so
>> upgrade paths are easier to manage.
>>
>> 5.  Don't forget the data archives - if the IUCr don't remediate the
>> ISO-8859-15 file to UTF-8, the archives will have to, as they have to
>> be able to deliver CIFs which are readable for all recipients.  So
>> there is guaranteed additional work and complexity involved as soon as
>> anybody starts agreeing to take other encodings.
>>
>> 6.  A key virtue of CIF is its simplicity.  A single acceptable
>> encoding is simple.  Multiple optional encodings are not.
>>
>> Returning to Herbert's latest email, I'm glad Simon can see the
>> benefit, but I still fail to.  Let's go through this again.  There are
>> two proposals on the table: both myself and Herbert are happy to state
>> that UTF-8 is the only official encoding for CIF2.0 files. Herbert
>> further proposes describing encoding flags in the standard, whereas I
>> think this is a bad idea.
>>
>> Let's go through Herbert's cases, with two more added for completeness:
>>   H1.  Somebody submits a UTF-8 file without the flag
>>   H2.  Somebody submits a UTF-8 file with the UTF-8 flag
>>   H3.  Somebody submits a non-UTF-8 file either with no flag
>> or with a UTF-8 flag
>>   H4.  (deleted, becomes H5 or H6)
>>   H5.  Somebody submits a non-UTF-8 file with a flag correctly telling
>> us that it is encoding xxx
>>   H6.  Somebody submits a non-UTF-8 file with a flag incorrectly
>> telling us that it is encoding xxx
>>
>> Under my proposal, this list degenerates to:
>>
>> J1. Somebody submits a UTF-8 file
>> J2. Somebody submits a non-UTF-8 file
>>
>> In case J2 (the equivalent of cases H3, H5 and H6 above), the file
>> is rejected as a syntactically incorrect CIF, just as incorrect CIFs
>> are rejected today.  I don't see anything wrong with this - in the
>> IUCr use case, the author is around to deal with the problem and
>> resubmit correctly.  Alternatively, under Herbert's proposal, a
>> further level of checking can be done by using the encoding flag to
>> see if a correct CIF can be produced - which will probably still fail.
>> 'Probably' because, as Simon points out, if the author thinks they've
>> sent a UTF-8 file and haven't, they are unlikely to get niceties like
>> an encoding flag correct, so H6 (and H3) are the most likely sorts of
>> files that will reach the IUCr.  Furthermore, for a large number of
>> encodings, it will fail in a way that cannot be detected automatically
>> (see point 1 at the beginning).
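>>
>> As an illustration only (the function names are hypothetical), the two
>> proposals reduce to the following Python 3 sketch:
>>
>>     # Proposal J: UTF-8 or rejection - nothing else to get wrong.
>>     def accept_cif(raw: bytes) -> str:
>>         try:
>>             return raw.decode('utf-8', errors='strict')
>>         except UnicodeDecodeError:
>>             raise ValueError('rejected: not well-formed UTF-8')
>>
>>     # Proposal H adds a fallback on the declared flag.  For most 8-bit
>>     # encodings the fallback cannot fail, so a wrong flag (cases H3/H6)
>>     # yields garbled text instead of a detectable error.
>>     def accept_cif_with_flag(raw: bytes, declared_encoding: str) -> str:
>>         try:
>>             return raw.decode('utf-8', errors='strict')
>>         except UnicodeDecodeError:
>>             return raw.decode(declared_encoding)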
>>
>> It would be good to hear from those who haven't said anything yet,
>> even if only to hear if they are undecided.
>>
>> James.
>>
>> On Sun, Oct 25, 2009 at 3:18 AM, SIMON WESTRIP
>> <simonwestrip@btinternet.com> wrote:
>>> Dear Herbert
>>>
>>> thanks for clarifying this for me - I can now see the benefits of such flags
>>> (actually, if I'd stopped to think about it, I should have spotted an
>>> analogy
>>> with the use of <meta charset=UTF-8...> tags in html...)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Saturday, 24 October, 2009 16:07:50
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Simon,
>>>
>>>   The world is not a perfect place, but there is increasing use of clear
>>> flags for encoding.  If we provide a place for the flag in CIF there are
>>> four possibilities on a submission:
>>>
>>>   1.  Somebody submits a UTF-8 file without the flag
>>>   2.  Somebody submits a UTF-8 file with the UTF-8 flag
>>>   3.  Somebody submits a non-UTF-8 file either with no flag
>>> or with a UTF-8 flag
>>>   4.  Somebody submits a non-UTF-8 file with a flag telling
>>> us that it is a non-UTF-8 file
>>>
>>> Cases 1, 2 and 4 all allow for rational handling of the file.
>>> Case 3 can result in mishandling.
>>>
>>> If we do not have the flag, we cannot have case 4, and all non-UTF-8
>>> files are highly likely to cause trouble.  Yes, getting into case 4 right
>>> now depends on users who know what encoding they are using, but Python
>>> programmers are already learning to be careful about it, and both vi and
>>> emacs are pretty good at recognizing mismatches, so users are learning
>>> to fix the comment if it is wrong.
>>>
>>> What is the worst that happens if we include the identification of the
>>> encoding? -- everybody just leaves it set at UTF-8 no matter what they do.
>>> We will have lost nothing.  But if just one submission makes use of the
>>> identification properly for a non-UTF-8 encoding we will have gained, and
>>> over the next few years, as the editors and their supporting scripts get
>>> smarter, I expect you will start to see significant use of the encoding
>>> flags, especially to distinguish UTF-8 from other popular Unicode
>>> encodings, such as UCS-2.
>>>
>>> vim supports both the comment and the BOM.  I personally prefer the BOM to
>>> other methods, but the comment is increasingly used.
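>>>
>>> For what it is worth, the BOM check is only a few lines; a rough
>>> Python 3 sketch (UTF-32 variants omitted for brevity):
>>>
>>>     import codecs
>>>
>>>     # Self-identifying encodings announce themselves in the first
>>>     # bytes.  A UCS-2 file carries the same FF FE / FE FF mark
>>>     # as UTF-16.
>>>     def sniff_bom(raw):
>>>         if raw.startswith(codecs.BOM_UTF8):       # EF BB BF
>>>             return 'utf-8'
>>>         if raw.startswith(codecs.BOM_UTF16_LE):   # FF FE
>>>             return 'utf-16-le'
>>>         if raw.startswith(codecs.BOM_UTF16_BE):   # FE FF
>>>             return 'utf-16-be'
>>>         return None   # no BOM: fall back on the comment, if present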
>>>
>>> Regards,
>>>   Herbert
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                   +1-631-244-3035
>>>                   yaya@dowling.edu
>>> =====================================================
>>>
>>> On Sat, 24 Oct 2009, SIMON WESTRIP wrote:
>>>
>>>> Herbert wrote:
>>>
>>> "I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error."
>>>
>>> I'm not sure what you're getting at here.  Having a UTF-8 identifier would
>>> not help in this case?  Or if you mean that the actual encoding used should
>>> be tagged, it seems unlikely that, having already mistakenly (and probably
>>> unknowingly) used the wrong encoding, anyone would include such a tag?  So
>>> unless the encoding can be determined from something inherent to the
>>> encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any
>>> benefit?
>>>
>>> If the standard specifies UTF-8 there should be no reason to identify this
>>> in the CIF.
>>>
>>> However, I can see the advantages of such a tag if it's envisaged that
>>> other encodings will be allowed in the future, or even simply to reinforce
>>> that the CIF is CIF2 (especially if the magic number has been omitted)?
>>>
>>> I have to confess that I am starting to worry about all this slightly.  As
>>> much as, in the work I do, I can happily read/write UTF-8 and convert from
>>> other encodings, at this stage I would probably struggle to convert from
>>> an unrecognized encoding - which is fair enough, because if it's CIF2 it
>>> should be UTF-8 and I shouldn't need to convert anyway (!), but it is a
>>> worry with respect to the issue of trying to make adoption of CIF-2 as
>>> painless as possible for the end users.  But then again, I'm having a bad
>>> day :-)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Friday, 23 October, 2009 20:47:40
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Colleagues,
>>>
>>>     I have only mild objections to saying that "UTF-8 is the only official
>>> encoding for CIF 2".  My mild objection is that imgCIF will not be
>>> compliant in several of its variants, but it certainly will always be able
>>> to provide at least one compliant translation of any file - 50-60% bigger
>>> than it would be in, say, UCS-2, but compliant.
>>>
>>>     No, the real problem is not what is officially the "right" way to write
>>> CIFs, but what people will really do.  People will do what they have
>>> always done -- work with CIF on whatever system they have.  That system
>>> may be modern and support UTF-8, but, even then, its "native" mode may be
>>> something different.  If we are lucky, the differences will be
>>> sufficiently dramatic to allow the encoding used to be detected from
>>> context.  If somebody decides they are still using EBCDIC, we will have no
>>> trouble figuring that out, but sometimes the differences are more subtle.
>>> I just took a French message catalog for RasMol and converted it to the
>>> Latin-1 encoding.  Most of the text is absolutely the same.  Just a few
>>> accented characters differ.  In a large text with just a few accents, this
>>> could easily be missed, and lots of people in Europe use the Latin-1
>>> encoding.  I am not saying that we should handle Latin-1 in all CIF-2
>>> parsers.  I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error.
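>>>
>>> The overlap is easy to demonstrate; a minimal Python 3 sketch (the
>>> strings are invented, not taken from the actual catalog):
>>>
>>>     # Pure-ASCII text is byte-for-byte identical in the two encodings.
>>>     ascii_text = 'Most of a French catalog is still plain ASCII.'
>>>     assert ascii_text.encode('utf-8') == ascii_text.encode('iso8859_1')
>>>
>>>     # Only the accented letters differ: one byte in Latin-1,
>>>     # two bytes in UTF-8.
>>>     assert '\xe9'.encode('iso8859_1') == b'\xe9'      # e-acute, Latin-1
>>>     assert '\xe9'.encode('utf-8') == b'\xc3\xa9'      # e-acute, UTF-8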
>>>
>>> This is the same issue as having the magic number #\# CIF 2.0, so we have
>>> a chance of spotting cases where somebody is trying to feed in a different
>>> CIF level.  Just because somebody might, somewhere, sometime, decide to
>>> send in a file to a CIF 1 parser with a magic number such as #\# CIF 2.0
>>> does not mean that suddenly we have to tell the person with the CIF 1
>>> parser that their parser is broken.  It just means the person with the CIF
>>> 1 parser or the person with the CIF 2 file has a better chance of quickly
>>> figuring out they have a mismatch.
>>>
>>> People will edit in different encodings, whether we approve of it or not.
>>>
>>> We lose nothing by flagging the UTF-8 encoding, and we can save people a
>>> lot of time in the future.
>>>
>>> Regards,
>>>     Herbert
>>>
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>           Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                   +1-631-244-3035
>>>                   yaya@dowling.edu
>>> =====================================================
>>>
>>> On Fri, 23 Oct 2009, David Brown wrote:
>>>
>>>> I would just like to point out a philosophical principle which we tried
>>>> to observe in the earlier CIFs, and which I think very important, namely
>>>> that in a standard like CIF it is only necessary to define one convention
>>>> for each feature in the standard.  Writers are required to convert the
>>>> input to this convention and readers can always be confident that they
>>>> will only have to read this one convention.  Every time you allow
>>>> alternative ways of encoding a piece of information you *require* the
>>>> reader to be able to read both alternatives.  If you allow three
>>>> different encodings, you require three different parsers.  If you allow
>>>> ten different encodings, you require ten different parsers in every
>>>> piece of reading software.  With one standard, a single parser works
>>>> everywhere.
>>>>
>>>> If a standard allows two different encodings, it is no longer a standard:
>>>> it is two standards, and that is something we have tried to avoid (not
>>>> always successfully) in CIF.  It should be a goal.
>>>> David
>>>>
>>
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
