Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: "Herbert J. Bernstein" <[email protected]>
Date: Fri, 16 Oct 2009 09:36:16 -0400 (EDT)
In-Reply-To: <[email protected]>
References: <C6F976F1.1206C%[email protected]><[email protected]><[email protected]><[email protected]>

Dear Colleagues,

   I think as a practical matter there are two encodings for which we need 
to consider providing support:

   1.  UTF-8 -- I think we now all agree that this is the sensible default
encoding for CIF-2

   2.  UCS-2/UTF-16.  This is the encoding used in Java and in web 
browsers.  It is also the encoding used in imgCIF base-32K binary 
encoding.  This is where the BOM flag becomes important -- it tells you 
when a switch to UCS-2/UTF-16 has occurred and whether what follows is 
big-endian or little-endian.  It also gives you the capability of 
switching back to UTF-8.  However, the major use is simply as a flag at 
the start of a file, all of which is in one encoding.
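
As a rough illustration of the "flag at the start of a file" use, here is 
a minimal Python sketch that checks the leading bytes for a BOM and falls 
back to UTF-8 when none is present.  The names sniff_bom and 
decode_cif_bytes are hypothetical, it covers only the two encodings 
listed above, and it does not attempt mid-stream switching.

```python
import codecs

# Hypothetical helper for illustration; not part of any CIF specification.
# Only the two encodings discussed above are considered. UTF-32 is ignored,
# so a UTF-32-LE BOM (FF FE 00 00) would be misread as UTF-16-LE here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) based on the first bytes of the stream."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: assume the proposed default encoding.
    return default, 0


def decode_cif_bytes(data: bytes) -> str:
    """Strip any leading BOM and decode the rest in the encoding it signals."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

A file written with a BOM then decodes correctly whichever byte order the 
writer used, and a plain UTF-8 file without a BOM is unaffected.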

Certainly there are other encodings that people may use -- in a 
system-dependent manner -- e.g. EBCDIC (yes, it is still around) or 7-bit 
ASCII (what we have used in the past).  I am not proposing that we get 
into the business of asking every parser to support every encoding on 
every legacy system, and certainly for interchange we should be telling 
people to stick to Unicode, preferably as UTF-8.  But I am certain that 
people will still want to use CIF in other environments with other 
"native" (i.e. system-dependent) encodings.  Everybody gains from having 
a formalism under which what should only be system-internal files are 
properly marked with the encoding they are using, to avoid the disasters 
that can occur when such files escape from their system cage without 
proper marking as to what they are.  Think of the mess we could have if 
people using Java accidentally shipped a UCS-2/UTF-16 file without a BOM.  
Most text editors will _not_ show you the alternating 0 bytes on the 
ordinary ASCII characters in that encoding, but those bytes can produce 
very strange errors even there, and when we get to embedded accented 
characters, there is likely to simply be a wrong character with no 
indication of an error.
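
To make the failure mode concrete, the following small Python sketch 
encodes a short string as UTF-16 without a BOM and then reads it back 
under two wrong assumptions.  The sample string and the choice of 
Latin-1 as the "native" single-byte encoding are assumptions made for 
illustration only.

```python
import codecs

text = "caf\u00e9"                 # "cafe" with an e-acute (U+00E9)
raw = text.encode("utf-16-le")     # UTF-16, little-endian, no BOM written

# Reader assumes a single-byte encoding: decoding succeeds without any
# exception, but a NUL character follows every real character.
as_latin1 = raw.decode("latin-1")

# Reader assumes the wrong byte order: decoding again succeeds, and every
# character, including the accented one, is silently replaced by another.
as_wrong_endian = raw.decode("utf-16-be")

# With a BOM in front, the generic "utf-16" codec picks the right byte order.
with_bom = (codecs.BOM_UTF16_LE + raw).decode("utf-16")

print(repr(as_latin1))
print(repr(as_wrong_endian))
print(repr(with_bom))
```

Neither wrong reading raises an error, so the damage is only visible if 
someone inspects the decoded text.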

   Even if we mandate UTF-8 as the archiving and file transmission 
standard, we really do need to deal with other encodings in a proper,
self-identifying manner, just as emacs and vim do.
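
A minimal Python sketch of the emacs/vim style of self-identification 
follows.  It scans the first two lines for a coding comment using the 
pattern documented for Python source files.  Reusing that pattern for 
CIF is an assumption of this sketch, not an agreed rule, and the name 
declared_encoding is hypothetical.  It can only work when the declaring 
comment is itself readable as ASCII, so UTF-16 would still need a BOM.

```python
import re

# Pattern documented for Python source files (PEP 263). Applying it to CIF
# is an assumption made for illustration. It matches both
#   # -*- coding: <encoding-name> -*-
#   # vim:fileencoding=<encoding-name>
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in a comment on line 1 or 2, else default."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default
```

A parser that does not support the named encoding can at least report 
that fact, rather than misreading the file silently.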

   Regards,
      Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Fri, 16 Oct 2009, James Hester wrote:

> Some thoughts on the first part of Herbert's proposals:
>
> Herbert proposes:
>  C1:  that the character set for a "new cif" be unicode, and
>  C2:  that the default encoding be UTF-8; and
>  C3:  that other encodings be permitted as an optional
> system-dependent feature when an explicit encoding
> has been specified by
>    C3.1:  a Unicode BOM (byte-order mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) introduced
> into a character stream, or
>    C3.2:  the first or second line being a comment of the form:
>      # -*- coding: <encoding-name> -*-
>    as recognized by GNU Emacs, or
>    C3.3:  the first or second line being a comment of the form:
>      # vim:fileencoding=<encoding-name>
>    as recognized by Bram Moolenaar's VIM
> (see section 2.1.4 of
> http://docs.python.org/reference/lexical_analysis.html for more
> information).
>
> (James again:)
> I agree with C1 and C2.  Regarding C3, I don't see the need for other
> encodings at all.  Furthermore, I want to run screaming from the room
> when I see the words 'system dependent'.  As a file transfer standard,
> we care most about the (possibly different) sending and receiving
> systems agreeing on the contents, and so 'system-dependent' is
> completely unacceptable. In contrast to CIF, system-independence is a
> lower priority for a programming language, as a programmer who does
> not wish to distribute their program widely can usefully take
> advantage of system-dependent features.
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
