Discussion List Archives

Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

   I think as a practical matter there are two encodings for which we need 
to consider providing support:

   1.  UTF-8 -- I think we now all agree that this is the sensible default
encoding for CIF-2

   2.  UCS-2/UTF-16.  This is the encoding used in Java and in web 
browsers.  It is also the encoding used in imgCIF base-32K binary 
encoding.  This is where the BOM flag becomes important -- it tells you 
when a switch to UCS-2/UTF-16 has occurred and whether what follows is 
big-endian or little-endian.  It also gives you the capability of 
switching back to UTF-8.  However, the major use is simply as a flag at 
the start of a file, all of which is in one encoding.
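
To make the start-of-file case concrete, here is a minimal Python sketch of 
BOM sniffing.  It is only an illustration, not a proposed implementation: 
the function names sniff_bom and decode_with_bom are hypothetical, only 
UTF-8 and UTF-16 BOMs are checked (the UTF-32 BOMs, one of which begins 
with the same two bytes as the little-endian UTF-16 BOM, are deliberately 
ignored), and a switch of encoding in mid-stream is not handled.

```python
import codecs

# BOMs for the two encodings discussed above.  UTF-32 is deliberately
# left out of this sketch (assumption: it is not in use for CIF).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) based on the leading bytes of a file."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: fall back to the default encoding.
    return default, 0


def decode_with_bom(data: bytes) -> str:
    """Decode a whole file held in one encoding, dropping any leading BOM."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

A file with no BOM falls through to the UTF-8 default, which is the 
behaviour item 1 above would imply.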

Certainly there are other encodings that people may use -- in a 
system-dependent manner -- e.g. EBCDIC (yes, it is still around) or 7-bit 
ASCII (what we have used in the past).  I am not proposing that we get 
into the business of asking every parser to support every encoding on 
every legacy system, and certainly for interchange we should be telling 
people to stick to Unicode, preferably as UTF-8.  But I am certain that 
people will still want to use CIF in other environments with other 
"native" (i.e. system-dependent) encodings, and everybody gains from 
having a formalism under which what should only be system-internal files 
are properly marked with the encoding they are using.  That avoids the 
disasters that can occur when such files escape from their system cage 
without proper marking as to what they are.  Think of the mess we could 
have if people using Java accidentally shipped a UCS-2/UTF-16 file without 
a BOM.  Most text editors will _not_ show you the alternating 0 bytes on 
the ordinary ASCII characters in that encoding, but it can produce very 
strange errors even there, and when we get to embedded accented 
characters, there is likely to simply be a wrong character with no 
indication of an error.
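
A small Python sketch of the failure mode, under the assumption that the 
unmarked file is little-endian UTF-16 and the reader guesses wrong.  The 
sample strings are made up for illustration.

```python
# Hypothetical sample text: one ASCII-only string, one accented character.
ascii_text = "data_x"
accented = "\u00e9"  # e with acute accent

# The explicit "utf-16-le" codec writes no BOM.
raw_ascii = ascii_text.encode("utf-16-le")
raw_accented = accented.encode("utf-16-le")

# Read as UTF-8, the ASCII text decodes without any error; each character
# is simply followed by a NUL that many editors will not display.
as_utf8 = raw_ascii.decode("utf-8")

# Read with the wrong byte order, the accented character silently becomes
# a different code point (U+E900), again with no error raised.
wrong_endian = raw_accented.decode("utf-16-be")

print(repr(as_utf8))
print(repr(wrong_endian))
```

Neither decode raises an exception, which is the point: without a marker 
the reader has no way to know anything went wrong.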

   Even if we mandate UTF-8 as the archiving and file transmission 
standard, we really do need to deal with other encodings in a proper,
self-identifying manner, just as emacs and vim do.
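
As a rough illustration of what such self-identification could look like 
to a parser, the Python sketch below looks for an Emacs-style or vim-style 
encoding comment (the two forms quoted as C3.2 and C3.3 further down) in 
the first two lines of a file.  The regular expression is modelled on the 
one Python documents for its own source files; the function name 
declared_encoding is hypothetical, and the sketch assumes the header lines 
themselves are ASCII-compatible, so a BOM check would have to come first 
for UTF-16 files.

```python
import re

# Matches both "# -*- coding: <name> -*-" and "# vim:fileencoding=<name>",
# since "fileencoding=" also ends in "coding=".
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the first or second line, else default."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default
```

Whether a given parser actually supports the named encoding is a separate 
question; the sketch only recovers the name so that an unsupported file 
can be rejected cleanly rather than misread.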


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 16 Oct 2009, James Hester wrote:

> Some thoughts on the first part of Herbert's proposals:
> Herbert proposes:
>  C1:  that the character set for a "new cif" be unicode, and
>  C2:  that the default encoding be UTF-8; and
>  C3:  that other encodings be permitted as an optional
> system-dependent feature when an explicit encoding
> has been specified by
>    C3.1:  a unicode BOM (byte-order-mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) has been introduced
> into a character stream, or
>    C3.2.  the first or second line being a comment of the form:
>      # -*- coding: <encoding-name> -*-
>    as recognized by GNU Emacs, or
>    C3.3.  the first or second line being a comment of the form:
>      # vim:fileencoding=<encoding-name>
>    as recognized by Bram Moolenaar's VIM
> (see section 2.1.4 of
> http://docs.python.org/reference/lexical_analysis.html for more
> information).
> (James again:)
> I agree with C1 and C2.  Regarding C3, I don't see the need for other
> encodings at all.  Furthermore, I want to run screaming from the room
> when I see the words 'system dependent'.  As a file transfer standard,
> we care most about the (possibly different) sending and receiving
> systems agreeing on the contents, and so 'system-dependent' is
> completely unacceptable. In contrast to CIF, system-independence is a
> lower priority for a programming language, as a programmer who does
> not wish to distribute their program widely can usefully take
> advantage of system-dependent features.
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group