Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Fri, 16 Oct 2009 09:36:16 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <C6F976F1.1206C%[email protected]><[email protected]><[email protected]><[email protected]>
Dear Colleagues,
I think as a practical matter there are two encodings for which we need
to consider providing support:
1. UTF-8 -- I think we now all agree that this is the sensible default
encoding for CIF-2.
2. UCS-2/UTF-16. This is the encoding used in Java and in web
browsers. It is also the encoding used in imgCIF base-32K binary
encoding. This is where the BOM flag becomes important -- it tells you
when a switch to UCS-2/UTF-16 has occurred and whether what follows is
big-endian or little-endian. It also gives you the capability of
switching back to UTF-8. However, the major use is simply as a flag at
the start of a file, all of which is in one encoding.
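To make the "flag at the start of a file" use concrete, here is a minimal
sketch in Python, assuming only the UTF-8 and UTF-16 BOMs matter (UTF-32 is
deliberately ignored, and a mid-stream switch is not handled). The function
names sniff_bom and decode_with_bom are hypothetical, chosen for
illustration; the BOM constants come from Python's standard codecs module.

```python
import codecs

# BOM byte sequences taken from the standard library.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) judged from the leading bytes only."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: fall back to the proposed default encoding.
    return default, 0


def decode_with_bom(data: bytes) -> str:
    """Decode a whole file that is in one encoding, skipping any BOM."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

A file with no BOM falls through to the UTF-8 default, which matches the
proposal that UTF-8 be assumed unless something says otherwise.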
Certainly there are other encodings that people may use -- in a
system-dependent manner -- e.g. EBCDIC (yes, it is still around) or 7-bit
ASCII (what we have used in the past). I am not proposing that we try to
get into the business of asking every parser to support every encoding on
every legacy system, and certainly for interchange we should be telling
people to stick to Unicode, preferably as UTF-8. But I am certain that
people will still want to use CIF in other environments with other
"native" (i.e. system-dependent) encodings, and everybody gains from
having a formalism by which what should only be system-internal files are
properly marked with the encoding they are using. That avoids the
disasters that can occur when such files escape from their system cage
without proper marking as to what they are. Think of the mess we could
have if people using Java accidentally shipped a UCS-2/UTF-16 file without
a BOM. Most text editors will _not_ show you the alternating 0 bytes on
the ordinary ASCII characters in that encoding, but it can produce very
strange errors even there, and when we get to embedded accented
characters, there is likely to simply be a wrong character with no
indication of an error.
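The failure mode described above can be reproduced in a few lines of
Python. This is only an illustrative sketch: the helper name
misread_examples and the choice of Latin-1 and big-endian UTF-16 as the
"wrong guesses" are assumptions made for the demonstration, not part of any
proposal.

```python
def misread_examples(text: str) -> dict:
    """Encode text as UTF-16 without a BOM, then decode it the wrong way."""
    # The explicit "-le" codec writes little-endian bytes and no BOM.
    raw = text.encode("utf-16-le")
    return {
        "raw": raw,
        # A reader assuming an 8-bit encoding sees a NUL after every letter.
        "as_latin1": raw.decode("latin-1"),
        # A reader guessing the wrong byte order gets unrelated characters,
        # and for this kind of input no exception is raised.
        "as_utf16_be": raw.decode("utf-16-be"),
    }


result = misread_examples("caf\u00e9")
print(result["raw"])
print(repr(result["as_latin1"]))
print(repr(result["as_utf16_be"]))
```

Both wrong decodings succeed silently here, which is exactly why an
unmarked file of this kind is hazardous.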
Even if we mandate UTF-8 as the archiving and file transmission
standard, we really do need to deal with other encodings in a proper,
self-identifying manner, just as emacs and vim do.
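As a rough illustration of what self-identification in the emacs and vim
style could look like to a parser, the sketch below checks the first two
lines for a coding comment. The regular expression is the one Python
documents for its own source files (PEP 263); applying it to CIF is an
assumption, as is the function name declared_encoding. It only works when
the file is ASCII-compatible up to the comment, so it complements a BOM
check rather than replacing one.

```python
import re

# Pattern from PEP 263; it matches both "coding: name" (emacs style)
# and "fileencoding=name" (vim style) inside a comment line.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the first or second line, if any."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # No declaration found: assume the default encoding.
    return default
```

For example, declared_encoding(b"# -*- coding: latin-1 -*-\ndata_x\n")
returns "latin-1", while a file with no such comment yields "utf-8".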
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Fri, 16 Oct 2009, James Hester wrote:
> Some thoughts on the first part of Herbert's proposals:
>
> Herbert proposes:
> C1: that the character set for a "new cif" be unicode, and
> C2: that the default encoding be UTF-8; and
> C3: that other encodings be permitted as an optional
> system-dependent feature when an explicit encoding
> has been specified by
> C3.1: a unicode BOM (byte-order-mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) has been introduced
> into a character stream, or
> C3.2. the first or second line being a comment of the form:
> # -*- coding: <encoding-name> -*-
> as recognized by GNU Emacs, or
> C3.3. the first or second line being a comment of the form:
> # vim:fileencoding=<encoding-name>
> as recognized by Bram Moolenaar's VIM
> (see section 2.1.4 of
> http://docs.python.org/reference/lexical_analysis.html for more
> information).
>
> (James again:)
> I agree with C1 and C2. Regarding C3, I don't see the need for other
> encodings at all. Furthermore, I want to run screaming from the room
> when I see the words 'system dependent'. As a file transfer standard,
> we care most about the (possibly different) sending and receiving
> systems agreeing on the contents, and so 'system-dependent' is
> completely unacceptable. In contrast to CIF, system-independence is a
> lower priority for a programming language, as a programmer who does
> not wish to distribute their program widely can usefully take
> advantage of system-dependent features.
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>