
Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: James Hester <jamesrhester@gmail.com>
Date: Fri, 23 Oct 2009 15:22:37 +1100
In-Reply-To: <20091022190955.H7480@epsilon.pair.com>
References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com><279aad2a0910160435x3876c24ev797e022adbc05529@mail.gmail.com><20091016091713.I65032@epsilon.pair.com><279aad2a0910221606k11e28de4la7582d9a85cadcf3@mail.gmail.com><20091022190955.H7480@epsilon.pair.com>

Hi Herbert and others: perhaps I've been a little vague.  I am not
proposing to 'determine the encoding' automatically from context.  As
I propose to allow only UTF8 encoding, I am simply proposing to answer
the question 'is this file UTF8 encoded?' based on the bit patterns in
the non-ASCII characters.  I assert that automatically answering this
question is practically possible.  Consider the following:

The first byte in an n-byte UTF8 sequence has the first n bits set,
followed by a zero bit.  Subsequent bytes in that sequence have the
first bit set, followed by a zero bit.  This is a distinctive pattern
of zero / one bits which is unlikely to occur in a different encoding.
 Specifically, for the worst case scenario of a random sequence of two
bytes, the probability of them corresponding to a valid two-byte UTF8
sequence is 1/8 * 1/4 = 1/32 (or 1/4 * 1/2 = 1/8 if both bytes are
already known to be non-ASCII).  Insofar as there will be more than
just a single group of two non-ASCII characters in a file, the
probability of misunderstanding the character encoding reduces
multiplicatively.  Not to mention multiplying by the probability that
someone gets the encoding wrong in the first place, and the
probability that the mistake is not picked up by some human editor
looking at the file.
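
To make the bit-pattern argument concrete, here is a minimal Python
sketch.  The name looks_like_utf8 is hypothetical and not part of any
CIF tool.  It checks only the lead-byte and continuation-byte patterns
described above, so it is an assumption-laden illustration rather than
a full validator: it does not reject overlong forms or surrogates.  It
then counts, by enumeration, how many byte pairs form a single
two-byte sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every non-ASCII byte fits the UTF-8 bit pattern."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:               # 0xxxxxxx: plain ASCII
            i += 1
            continue
        if byte >> 5 == 0b110:        # 110xxxxx: lead byte, 2-byte sequence
            n = 2
        elif byte >> 4 == 0b1110:     # 1110xxxx: lead byte, 3-byte sequence
            n = 3
        elif byte >> 3 == 0b11110:    # 11110xxx: lead byte, 4-byte sequence
            n = 4
        else:                         # stray continuation or invalid lead
            return False
        tail = data[i + 1:i + n]
        # Each following byte must look like 10xxxxxx.
        if len(tail) != n - 1 or any(b >> 6 != 0b10 for b in tail):
            return False
        i += n
    return True


# Count byte pairs that form one two-byte sequence (lead + continuation).
matches = sum(
    looks_like_utf8(bytes([a, b]))
    for a in range(0x80, 0x100)   # first byte is non-ASCII
    for b in range(0x100)         # second byte is arbitrary
)
fraction_of_all_pairs = matches / (256 * 256)        # two arbitrary bytes
fraction_of_non_ascii_pairs = matches / (128 * 128)  # both bytes >= 0x80
```

For example, looks_like_utf8("é".encode("utf-8")) is true, while the
same letter encoded as Latin-1 is rejected because its single byte
0xE9 looks like the start of a three-byte sequence with nothing valid
after it.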

So I don't agree with your statement that we can't determine encoding.
 Of course, if we are proposing to allow other encodings, then I agree
that we would need some sort of encoding specifier.
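
As a rough illustration of how a UTF8-only reader could answer the
yes/no question, the sketch below leans on Python's built-in strict
UTF-8 decoder.  The function name read_cif_text and the choice of
ValueError are my own assumptions for the example, not anything from a
CIF specification.

```python
def read_cif_text(raw: bytes) -> str:
    """Decode raw file bytes as UTF-8, or fail loudly (hypothetical helper)."""
    try:
        # The strict error handler refuses any byte sequence that is
        # not well-formed UTF-8 instead of guessing or substituting.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"not valid UTF-8 at byte offset {err.start}"
        ) from err
```

A file saved as Latin-1 with an accented or special character (say
"Å 1.54") fails here at the first non-ASCII byte, so the writer gets
immediate feedback rather than a silently wrong character.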

You also say 'you lose nothing by including a place to flag the
encoding'.   But we are not just 'including a place'; we are diluting
the standard, as I keep saying, by providing such an "option" and
thereby essentially guaranteeing that not all CIF files are readable
by all CIF readers.  And Herbert, I'm not clear on the benefits of
UCS2 as you put them forward.  Firstly, I have no problems looking at
UTF8-encoded web pages in my browser, so I don't know how ignoring
UCS2 hinders us there. Secondly, while Java's native encoding might be
UCS2, that doesn't mean that Java can't handle UTF8; indeed, according
to the Wikipedia entry, InputStreamReader and OutputStreamWriter do so.
 Thirdly, from what I can tell UCS2 is a deprecated encoding.  As you
may have a broader perspective than me, perhaps you could give us a
concrete example of where CIF UCS2 support would be desirable?

James.

On Fri, Oct 23, 2009 at 10:13 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
>   It really is not possible to determine the encoding of a file with a few
> accented characters from context.  You lose nothing by including a place to
> flag the encoding used, and can always then say that a conformant parser
> need only accept the declared UTF-8 encoding and is free to declare
> all other encodings to be an error, but you gain a lot (such as clean
> access to the entire java/browser world and many older unicode operating
> systems) by handling at least UCS-2 encoding.
>
>   Regards,
>     Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
>
> On Fri, 23 Oct 2009, James Hester wrote:
>
>> Dear All: Herbert's argument for at least UCS2 inclusion, if I have
>> understood correctly, is that CIF2 users may inadvertently or
>> deliberately save a CIF file containing multilingual characters in a
>> non-UTF8 encoding.  This file may look fine until it escapes from the
>> confines of that person's computer or lab, when suddenly other users
>> find it difficult or impossible to decode. I agree that inadvertent
>> saving in another encoding is entirely likely, given that virtually
>> all internationalised applications support multiple encodings out of
>> the box.  I would expect that this is more of a problem when a CIF
>> file is directly edited, whereas a program is likely to do it properly
>> - or at least get it wrong only once...
>>
>> I believe that John's concerns about encodings may also arise out of
>> the multiple possible ways of encoding any given piece of non-ASCII
>> text and having to deal with that incoming variety in a reliable way.
>> This is a real problem e.g. a file containing Cyrillic characters could
>> be encoded in any of at least 5 ways, and these encodings differ only
>> in which bytes correspond to which letters - if you don't read
>> Russian, you may not even know that there is an encoding error.  If
>> the letters are in isolation, even a Russian speaker may have trouble.
>>
>> But: despite agreeing with Herbert on the likelihood of CIF files
>> being saved in the wrong encoding, I repeat the point that we are
>> defining a standard for successfully communicating information between
>> computers across time and space.  In that context 'optional' does not
>> make sense.  If any conformant CIF reader cannot read any conformant
>> CIF file, the standard has failed. It would therefore be better to
>> completely ignore UCS2, so that when the CIF reader fails, it is
>> because of non-conformance to the standard either at the writer's end
>> or the reader's end, not because of lack of some option.  Those who
>> have inadvertently or otherwise miscoded their multilingual characters
>> will simply cause an error (I say 'will' and not 'may', see paragraph below
>> about distinctiveness of UTF8 encoding), and get feedback to correct
>> the problem, just as today when they mess up their hand-edited CIFs.
>>
>> One of the nice aspects of UTF8 is that the European non-ASCII
>> characters require at least two bytes where the likely alternative
>> non-Unicode encodings would have only one.  These two-byte pairs have
>> a distinctive bit pattern which makes it very likely that a non-UTF8
>> encoding would be detected almost immediately.  So in that sense I
>> think that allowing only UTF8 encoding is a robust solution to the
>> multiple encoding problem as a non-UTF8 encoding can be automatically
>> detected.
>>
>> Finally, Herbert says:
>>
>>        Even if we mandate UTF-8 as the archiving and file
>>        transmission standard, we really do need to deal with other
>>        encodings in a proper, self-identifying manner, just as
>>        emacs and vim do.
>>
>> I would suggest that specifying a standards violation whenever a
>> different encoding is detected is sufficiently proper.  I don't think
>> that the comparison with vim and emacs is valid: these applications
>> aim to actually deal with text in different encodings, whereas we do
>> not.
>>
>>
>> On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>>
>>> Dear Colleagues,
>>>
>>>   I think as a practical matter there are two encodings for which we need
>>> to consider providing support:
>>>
>>>   1.  UTF-8 -- I think we now all agree that this is the sensible default
>>> encoding for CIF-2
>>>
>>>   2.  UCS-2/UTF-16.  This is the encoding used in java and in web
>>> browsers.  It is also the encoding used in imgCIF base-32K binary
>>> encoding.  This is where the BOM flag becomes important -- it tells you
>>> when a switch to UCS-2/UTF-16 has occurred and whether what follows is
>>> big-endian or little-endian.  It also gives you the capability of
>>> switching back to UTF-8.  However, the major use is simply as a flag at
>>> the start of a file, all of which is in one encoding.
>>>
>>> Certainly there are other encodings that people may use -- in a system
>>> dependent manner -- e.g. EBCDIC (yes it is still around) or 7-bit ASCII
>>> (what we have used in the past).  I am not proposing that we try to get
>>> into the business of asking every parser to support every coding on every
>>> legacy system, and certainly for interchange, we should be telling people
>>> to stick to unicode, preferably as UTF-8, but I am certain that people
>>> will still want to use CIF in other environments with other "native" (i.e.
>>> system-dependent) encodings, and everybody gains from having a formalism
>>> for properly marking what should only be system-internal files with the
>>> encoding they are using, to avoid the disasters that can occur when such
>>> files escape from their system cage without proper marking as to what
>>> they are.  Think of the mess we could have if people using java accidentally
>>> shipped a UCS-2/UTF-16 file without a BOM.  Most text editors will _not_
>>> show you the alternating 0 bytes on the ordinary ASCII characters in that
>>> encoding, but it can produce very strange errors even there, and when we
>>> get to embedded accented characters, there is likely to simply be a wrong
>>> character with no indication of an error.
>>>
>>>   Even if we mandate UTF-8 as the archiving and file transmission
>>> standard, we really do need to deal with other encodings in a proper,
>>> self-identifying manner, just as emacs and vim do.
>>>
>>>   Regards,
>>>      Herbert
>>>
>>> =====================================================
>>>  Herbert J. Bernstein, Professor of Computer Science
>>>    Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                  +1-631-244-3035
>>>                  yaya@dowling.edu
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
