Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Fri, 16 Oct 2009 09:36:16 -0400 (EDT)
- In-Reply-To: <279aad2a0910160435x3876c24ev797e022adbc05529@mail.gmail.com>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com><279aad2a0910160435x3876c24ev797e022adbc05529@mail.gmail.com>
Dear Colleagues,

I think as a practical matter there are two encodings for which we need to consider providing support:

1. UTF-8 -- I think we now all agree that this is the sensible default encoding for CIF-2.

2. UCS-2/UTF-16. This is the encoding used in Java and in web browsers. It is also the encoding used in the imgCIF base-32K binary encoding.

This is where the BOM flag becomes important -- it tells you when a switch to UCS-2/UTF-16 has occurred and whether what follows is big-endian or little-endian. It also gives you the capability of switching back to UTF-8. However, the major use is simply as a flag at the start of a file, all of which is in one encoding.

Certainly there are other encodings that people may use in a system-dependent manner, e.g. EBCDIC (yes, it is still around) or 7-bit ASCII (what we have used in the past). I am not proposing that we get into the business of asking every parser to support every encoding on every legacy system, and certainly for interchange we should be telling people to stick to Unicode, preferably as UTF-8. But I am certain that people will still want to use CIF in other environments with other "native" (i.e. system-dependent) encodings. Everybody gains from having a formalism under which files that should only be system-internal are properly marked with the encoding they use, to avoid the disasters that can occur when such files escape from their system cage without proper marking as to what they are.

Think of the mess we could have if people using Java accidentally shipped a UCS-2/UTF-16 file without a BOM. Most text editors will _not_ show you the alternating 0 bytes on the ordinary ASCII characters in that encoding, but it can produce very strange errors even there, and when we get to embedded accented characters, there is likely to simply be a wrong character with no indication of an error.
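As a rough illustration of the BOM point above, here is a minimal Python sketch of how a reader might use a leading BOM to choose between UTF-8 and UTF-16. The function names `sniff_bom` and `decode_with_bom` are hypothetical and are not part of any CIF software; the fallback to UTF-8 when no BOM is present reflects the proposed default, not a settled rule.

```python
import codecs

# BOM byte sequences for the two encodings discussed above.
# UTF-32 is deliberately not handled in this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data, default="utf-8"):
    """Return (encoding, bom_length) based on a leading BOM, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: assume the default encoding (an assumption of this sketch).
    return default, 0


def decode_with_bom(data):
    """Decode bytes using the encoding signalled by the BOM."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)


# A properly marked UTF-16 file round-trips correctly.
marked = codecs.BOM_UTF16_LE + "café".encode("utf-16-le")
assert decode_with_bom(marked) == "café"

# The same bytes without a BOM, read with the wrong byte order,
# decode without any error but yield different characters.
unmarked = "café".encode("utf-16-le")
assert unmarked.decode("utf-16-be") != "café"
```

The last two lines show the silent failure mode: reading unmarked little-endian data as big-endian raises no exception, it just produces the wrong text.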
Even if we mandate UTF-8 as the archiving and file transmission standard, we really do need to deal with other encodings in a proper, self-identifying manner, just as emacs and vim do.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 16 Oct 2009, James Hester wrote:

> Some thoughts on the first part of Herbert's proposals:
>
> Herbert proposes:
> C1: that the character set for a "new cif" be unicode, and
> C2: that the default encoding be UTF-8; and
> C3: that other encodings be permitted as an optional
>     system-dependent feature when an explicit encoding
>     has been specified by
>   C3.1: a unicode BOM (byte-order-mark) (see
>     http://en.wikipedia.org/wiki/Byte-order_mark) has been introduced
>     into a character stream, or
>   C3.2. the first or second line being a comment of the form:
>       # -*- coding: <encoding-name> -*-
>     as recognized by GNU Emacs, or
>   C3.3. the first or second line being a comment of the form:
>       # vim:fileencoding=<encoding-name>
>     as recognized by Bram Moolenaar's VIM
>     (see section 2.1.4 of
>     http://docs.python.org/reference/lexical_analysis.html for
>     more information).
>
> (James again:)
> I agree with C1 and C2. Regarding C3, I don't see the need for other
> encodings at all. Furthermore, I want to run screaming from the room
> when I see the words 'system dependent'. As a file transfer standard,
> we care most about the (possibly different) sending and receiving
> systems agreeing on the contents, and so 'system-dependent' is
> completely unacceptable. In contrast to CIF, system-independence is a
> lower priority for a programming language, as a programmer who does
> not wish to distribute their program widely can usefully take
> advantage of system-dependent features.
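For reference, the comment-based declarations in C3.2 and C3.3 of the quoted proposal could be detected roughly as follows. This is only a sketch: the regular expression is the one Python documents for its own source-encoding declarations, the function name `declared_encoding` is hypothetical, and the approach assumes the first two lines are readable as ASCII, so it cannot identify UTF-16 files (those would need the BOM of C3.1).

```python
import re

# Pattern documented for Python source-encoding declarations; it matches both
# the Emacs form "-*- coding: NAME -*-" and the Vim form "vim:fileencoding=NAME".
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data):
    """Return the encoding named in a comment on line 1 or 2, else None."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


# Emacs-style declaration on the first line.
assert declared_encoding(b"# -*- coding: latin-1 -*-\ndata_x\n") == "latin-1"

# Vim-style declaration on the second line.
assert declared_encoding(b"# a comment\n# vim:fileencoding=utf-8\n") == "utf-8"

# A declaration on the third line or later is ignored.
assert declared_encoding(b"data_x\n_a 1\n# -*- coding: latin-1 -*-\n") is None
```

A reader following the proposal would presumably check for a BOM first and fall back to this comment scan, with UTF-8 as the default when neither is present.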
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- References:
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)