
Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: James Hester <[email protected]>
Date: Fri, 23 Oct 2009 10:06:13 +1100
In-Reply-To: <[email protected]>
References: <C6F976F1.1206C%[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>

Dear All: Herbert's argument for at least UCS2 inclusion, if I have
understood correctly, is that CIF2 users may inadvertently or
deliberately save a CIF file containing multilingual characters in a
non-UTF8 encoding.  This file may look fine until it escapes from the
confines of that person's computer or lab, when suddenly other users
find it difficult or impossible to decode. I agree that inadvertent
saving in another encoding is entirely likely, given that virtually
all internationalised applications support multiple encodings out of
the box.  I would expect that this is more of a problem when a CIF
file is directly edited, whereas a program is likely to do it properly
- or at least get it wrong only once...

I believe that John's concerns about encodings may also arise out of
the multiple possible ways of encoding any given piece of non-ASCII
text and having to deal with that incoming variety in a reliable way.
This is a real problem: e.g. a file containing Cyrillic characters could
be encoded in any of at least 5 ways, and these encodings differ only
in which bytes correspond to which letters - if you don't read
Russian, you may not even know that there is an encoding error.  If
the letters are in isolation, even a Russian speaker may have trouble.
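To illustrate the Cyrillic point, here is a minimal Python sketch. The
sample word and the choice of five codecs are my own assumptions for
illustration (they are standard-library codec names, not a list taken
from this thread). It encodes one word five ways and then misreads the
KOI8-R bytes as Windows-1251:

```python
# Illustrative only: the word and the codec list are assumptions.
word = "Привет"
codecs = ["koi8_r", "cp1251", "iso8859_5", "cp866", "mac_cyrillic"]

# One byte string per legacy single-byte encoding.
encoded = {name: word.encode(name) for name in codecs}
for name, raw in encoded.items():
    print(f"{name:13s} {raw.hex(' ')}")

# Decoding KOI8-R bytes as Windows-1251 raises no exception: every byte
# still maps to some Cyrillic letter, just not the intended one.
misread = encoded["koi8_r"].decode("cp1251")
print(misread)
```

The decode step succeeds silently, which is the problem: nothing in the
byte stream itself says which of the single-byte encodings was meant.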

But: despite agreeing with Herbert on the likelihood of CIF files
being saved in the wrong encoding, I repeat the point that we are
defining a standard for successfully communicating information between
computers across time and space.  In that context 'optional' does not
make sense.  If any conformant CIF reader cannot read any conformant
CIF file, the standard has failed. It would therefore be better to
completely ignore UCS2, so that when the CIF reader fails, it is
because of non-conformance to the standard either at the writer's end
or the reader's end, not because of lack of some option.  Those who
have inadvertently or otherwise miscoded their multilingual characters
will simply cause an error (I say 'will' and not 'may'; see the paragraph below
about distinctiveness of UTF8 encoding), and get feedback to correct
the problem, just as today when they mess up their hand-edited CIFs.

One of the nice aspects of UTF8 is that the European non-ASCII
characters require at least two bytes where the likely alternative
non-Unicode encodings would have only one.  These two-byte pairs have
a distinctive bit pattern which makes it very likely that a non-UTF8
encoding would be detected almost immediately.  So in that sense I
think that allowing only UTF8 encoding is a robust solution to the
multiple encoding problem as a non-UTF8 encoding can be automatically
detected.
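A rough Python sketch of that detection follows. The helper name
is_valid_utf8 and the sample text are hypothetical, and this is only one
way a reader might do the check. Note that it is a heuristic in the
"very likely" sense above: a legacy-encoded file can, by coincidence,
still be a valid UTF-8 byte sequence.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


text = "Zürich café"  # sample text, chosen for illustration

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# In UTF-8, 'ü' is a lead byte 110xxxxx followed by a continuation 10xxxxxx.
lead, cont = "ü".encode("utf-8")
print(format(lead, "08b"), format(cont, "08b"))

print(is_valid_utf8(utf8_bytes))    # the UTF-8 bytes pass
print(is_valid_utf8(latin1_bytes))  # the Latin-1 bytes are rejected
```

In the Latin-1 version each accented letter is a single byte with the
high bit set that is not followed by a continuation byte, so a strict
UTF-8 decoder rejects the file at the first such character.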

Finally, Herbert says:

	Even if we mandate UTF-8 as the archiving and file
	transmission standard, we really do need to deal with other
	encodings in a properly, self-identifying manner, just as
	emacs and vim do.

I would suggest that specifying a standards violation whenever a
different encoding is detected is sufficiently proper.  I don't think
that the comparison with vim and emacs is valid: these applications
aim to actually deal with text in different encodings, whereas we do
not.

On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein
<[email protected]> wrote:
> Dear Colleagues,
>
>   I think as a practical matter there are two encodings for which we need
> to consider providing support:
>
>   1.  UTF-8 -- I think we now all agree that this is the sensible default
> encoding for CIF-2
>
>   2.  UCS-2/UTF-16.  This is the encoding used in Java and in web
> browsers.  It is also the encoding used in imgCIF base-32K binary
> encoding.  This is where the BOM flag becomes important -- it tells you
> when a switch to UCS-2/UTF-16 has occurred and whether what follows is
> big-endian or little-endian.  It also gives you the capability of
> switching back to UTF-8.  However, the major use is simply as a flag at
> the start of a file, all of which is in one encoding.
>
> Certainly there are other encodings that people may use -- in a system
> dependent manner -- e.g. EBCDIC (yes it is still around) or 7-bit ASCII
> (what we have used in the past).  I am not proposing that we try to get
> into the business of asking every parser to support every coding on every
> legacy system, and certainly for interchange, we should be telling people
> to stick to Unicode, preferably as UTF-8, but I am certain that people
> will still want to use CIF in other environments with other "native" (i.e.
> system-dependent) encodings, and everybody gains from having a formalism
> for properly marking what should only be system-internal files with the
> encoding they are using, to avoid the disasters that can occur when such
> files escape from their system cage without proper marking as to what they
> are.  Think of the mess we could have if people using Java accidentally
> shipped a UCS-2/UTF-16 file without a BOM.  Most text editors will _not_
> show you the alternating 0 bytes on the ordinary ASCII characters in that
> encoding, but it can produce very strange errors even there, and when we
> get to embedded accented characters, there is likely to simply be a wrong
> character with no indication of an error.
>
>   Even if we mandate UTF-8 as the archiving and file transmission
> standard, we really do need to deal with other encodings in a properly,
> self-identifying manner, just as emacs and vim do.
>
>   Regards,
>      Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148