
Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: "Herbert J. Bernstein" <[email protected]>
Date: Thu, 22 Oct 2009 19:13:39 -0400 (EDT)
In-Reply-To: <[email protected]>
References: <C6F976F1.1206C%[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>

Dear Colleagues,

   It really is not possible to determine the encoding of a file with a few
accented characters from context.  You lose nothing by including a place
to flag the encoding used, and can always then say that a conformant
parser need only accept the declared UTF-8 encoding and is free to declare
all other encodings to be an error, but you gain a lot (such as clean
access to the entire Java/browser world and many older Unicode operating
systems) by handling at least UCS-2 encoding.
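
As a rough illustration of the "flag the encoding, but accept only UTF-8" position, the sketch below uses a leading byte-order mark as the flag. The helper names `sniff_bom` and `read_strict` are hypothetical and are not taken from any CIF specification or existing parser; a file with no BOM is assumed to be UTF-8.

```python
import codecs

# Hypothetical helpers for illustration only; not part of any CIF standard.
# UTF-32 BOMs are deliberately ignored to keep the sketch short.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_strict(data: bytes) -> str:
    """Accept only UTF-8; report any other flagged encoding as an error."""
    encoding, skip = sniff_bom(data)
    if encoding not in (None, "utf-8"):
        raise ValueError(f"unsupported encoding flagged by BOM: {encoding}")
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    return data[skip:].decode("utf-8")
```

A reader that also wanted to handle UCS-2/UTF-16 could decode with the sniffed encoding instead of raising; the flag is what makes either policy possible.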

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Fri, 23 Oct 2009, James Hester wrote:

> Dear All: Herbert's argument for at least UCS2 inclusion, if I have
> understood correctly, is that CIF2 users may inadvertently or
> deliberately save a CIF file containing multilingual characters in a
> non-UTF8 encoding.  This file may look fine until it escapes from the
> confines of that person's computer or lab, when suddenly other users
> find it difficult or impossible to decode. I agree that inadvertent
> saving in another encoding is entirely likely, given that virtually
> all internationalised applications support multiple encodings out of
> the box.  I would expect that this is more of a problem when a CIF
> file is directly edited, whereas a program is likely to do it properly
> - or at least get it wrong only once...
>
> I believe that John's concerns about encodings may also arise out of
> the multiple possible ways of encoding any given piece of non-ASCII
> text and having to deal with that incoming variety in a reliable way.
> This is a real problem: e.g. a file containing Cyrillic characters could
> be encoded in any of at least 5 ways, and these encodings differ only
> in which bytes correspond to which letters - if you don't read
> Russian, you may not even know that there is an encoding error.  If
> the letters are in isolation, even a Russian speaker may have trouble.
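
The Cyrillic ambiguity described above can be reproduced with a short sketch. The five codec names are Python's standard aliases for common 8-bit Cyrillic encodings, chosen here as an assumption about which "5 ways" might be meant; the sample word and the helper `decode_all` are purely illustrative.

```python
# Assumption: these five Python codecs stand in for the "at least 5 ways"
# mentioned in the text; the original message does not name them.
CYRILLIC_CODECS = ["cp1251", "koi8_r", "iso8859_5", "cp866", "mac_cyrillic"]


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under every candidate 8-bit encoding."""
    return {name: raw.decode(name) for name in CYRILLIC_CODECS}


# The writer saved the word in Windows-1251; the reader does not know that.
raw = "Привет".encode("cp1251")
readings = decode_all(raw)

# Every decode succeeds, so none of the codecs reports an error,
# yet each one yields a different string.
distinct_readings = set(readings.values())
```

Only the `cp1251` entry of `readings` is the intended word. The `koi8_r` entry, for instance, is still a run of ordinary Cyrillic letters, which is why a reader who does not know Russian has no cue that anything went wrong.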
>
> But: despite agreeing with Herbert on the likelihood of CIF files
> being saved in the wrong encoding, I repeat the point that we are
> defining a standard for successfully communicating information between
> computers across time and space.  In that context 'optional' does not
> make sense.  If any conformant CIF reader cannot read any conformant
> CIF file, the standard has failed. It would therefore be better to
> completely ignore UCS2, so that when the CIF reader fails, it is
> because of non-conformance to the standard either at the writer's end
> or the reader's end, not because of lack of some option.  Those who
> have inadvertently or otherwise miscoded their multilingual characters
> will simply cause an error (I say 'will' and not 'may', see paragraph
> below about distinctiveness of UTF8 encoding), and get feedback to correct
> the problem, just as today when they mess up their hand-edited CIFs.
>
> One of the nice aspects of UTF8 is that the European non-ASCII
> characters require at least two bytes where the likely alternative
> non-Unicode encodings would have only one.  These two-byte pairs have
> a distinctive bit pattern which makes it very likely that a non-UTF8
> encoding would be detected almost immediately.  So in that sense I
> think that allowing only UTF8 encoding is a robust solution to the
> multiple encoding problem as a non-UTF8 encoding can be automatically
> detected.
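
The detection argument can be checked with a strict decode. This is a minimal sketch, with `is_valid_utf8` a hypothetical helper name and Latin-1 assumed as the "likely alternative" single-byte encoding.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "é" is the single byte 0xE9 in Latin-1. In UTF-8, 0xE9 announces a
# three-byte sequence, so a lone 0xE9 at the end of the data is malformed.
latin1_bytes = "café".encode("latin-1")

# The same text in UTF-8 uses the two-byte pair 0xC3 0xA9.
utf8_bytes = "café".encode("utf-8")

# Detection is "very likely", not guaranteed: the Latin-1 text "Ã©"
# happens to produce exactly the bytes of a UTF-8 "é".
lookalike = "Ã©".encode("latin-1")
```

Here `is_valid_utf8(latin1_bytes)` is false and `is_valid_utf8(utf8_bytes)` is true, while `lookalike` passes the check despite being Latin-1. Such byte pairs are rare in real single-byte text, which is the sense in which a non-UTF8 file is detected almost immediately.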
>
> Finally, Herbert says:
>
> 	Even if we mandate UTF-8 as the archiving and file
> 	transmission standard, we really do need to deal with other
> 	encodings in a proper, self-identifying manner, just as
> 	emacs and vim do.
>
> I would suggest that specifying a standards violation whenever a
> different encoding is detected is sufficiently proper.  I don't think
> that the comparison with vim and emacs is valid: these applications
> aim to actually deal with text in different encodings, whereas we do
> not.
>
>
> On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein
> <[email protected]> wrote:
>> Dear Colleagues,
>>
>>   I think as a practical matter there are two encodings for which we need
>> to consider providing support:
>>
>>   1.  UTF-8 -- I think we now all agree that this is the sensible default
>> encoding for CIF-2
>>
>>   2.  UCS-2/UTF-16.  This is the encoding used in Java and in web
>> browsers.  It is also the encoding used in imgCIF base-32K binary
>> encoding.  This is where the BOM flag becomes important -- it tells you
>> when a switch to UCS-2/UTF-16 has occurred and whether what follows is
>> big-endian or little-endian.  It also gives you the capability of
>> switching back to UTF-8.  However, the major use is simply as a flag at
>> the start of a file, all of which is in one encoding.
>>
>> Certainly there are other encodings that people may use -- in a system
>> dependent manner -- e.g. EBCDIC (yes it is still around) or 7-bit ASCII
>> (what we have used in the past).  I am not proposing that we try to get
>> into the business of asking every parser to support every coding on every
>> legacy system, and certainly for interchange, we should be telling people
>> to stick to Unicode, preferably as UTF-8, but I am certain that people
>> will still want to use CIF in other environments with other "native" (i.e.
>> system-dependent) encodings, and everybody gains from having a formalism
>> for what should only be system-internal files properly marked with the
>> encoding they are using to avoid the disasters that can occur when such
>> files escape from their system cage without proper marking as to what they
>> are.  Think of the mess we could have if people using Java accidentally
>> shipped a UCS-2/UTF-16 file without a BOM.  Most text editors will _not_
>> show you the alternating 0 bytes on the ordinary ASCII characters in that
>> encoding, but it can produce very strange errors even there, and when we
>> get to embedded accented characters, there is likely to simply be a wrong
>> character with no indication of an error.
>>
>>   Even if we mandate UTF-8 as the archiving and file transmission
>> standard, we really do need to deal with other encodings in a proper,
>> self-identifying manner, just as emacs and vim do.
>>
>>   Regards,
>>     Herbert
>>
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  [email protected]
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

