Discussion List Archives


Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   There is a misunderstanding about UTF-8.  From the point of view of any
C program intended to work with the 256-character ISO character sets,
a UTF-8 string is handled just the same as an ISO string.  The major
differences are that the bottom 128 characters are the US national variant
we call ASCII, and the second 128 characters, which in the past would have
held the accented and special characters needed to handle the western
European languages in an ASCII environment, have been replaced with the
variable-length encodings for a 31-bit character set.  That is what is
nice about UTF-8: it is actually using what should be printable
characters to do its encoding, avoiding anything that looks like
binary data.
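
   As a minimal sketch of the point above (in Python rather than C, purely
for brevity; the sample strings are made up for illustration), one can
check that ASCII text is unchanged under UTF-8 and that the non-ASCII
characters occupy only bytes in the upper half of the 8-bit range:

```python
# Sketch: how UTF-8 looks to byte-oriented (C-style) code.
# Both sample strings are hypothetical, chosen only for illustration.
ascii_text = "data_block"
accented = "\u00c5 \u00e9"  # A-with-ring, space, e-acute

ascii_bytes = ascii_text.encode("utf-8")
utf8_bytes = accented.encode("utf-8")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
same_as_ascii = ascii_bytes == ascii_text.encode("ascii")

# Every byte of a multi-byte sequence has the high bit set (>= 0x80),
# so it cannot be mistaken for an ASCII delimiter or a NUL terminator.
high_bytes = [b for b in utf8_bytes if b >= 0x80]
has_nul = 0 in utf8_bytes

print(same_as_ascii, list(utf8_bytes), has_nul)
```

The space between the two accented characters survives as the single byte
0x20, which is why byte-oriented tokenizers keep working on UTF-8 input.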

   UTF-16/UCS-2 is different.  There you have a lot that looks like binary
when working in an ASCII world, and you need special libraries (for wide
characters) to deal with them, unless you are working in Java or with a
browser, where that is the native encoding.
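
   A small sketch of why UTF-16 looks like binary to ASCII-oriented code
(again in Python for brevity; the three-letter sample is arbitrary): every
ASCII character picks up a zero byte, and the byte order depends on the
endianness chosen.

```python
# Sketch: UTF-16 as seen by byte-oriented code. The sample text is arbitrary.
text = "CIF"

le = text.encode("utf-16-le")  # little-endian, no byte-order mark
be = text.encode("utf-16-be")  # big-endian, no byte-order mark

# Each ASCII character is padded with a zero byte.
nul_count = le.count(0)

# A routine that stops at the first NUL, as C string functions do,
# would see only the first character of the little-endian form.
seen_by_c_string_code = le.split(b"\x00", 1)[0]

print(list(le), list(be), nul_count, seen_by_c_string_code)
```

The same text in UTF-8 is just the three bytes of "CIF", with no byte-order
question at all.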

   We are in the midst of a painful, worldwide transition in which we have
a mixture of:

   1.  The code-page based character encodings based on the multiple
ISO national variants.  ASCII is just the US national variant.
   2.  The UTF-16/UCS-2 version of Unicode heavily adopted by many hardware
vendors and used as the native encoding in many operating systems and all
   3.  The UTF-8 version of Unicode, extensively adopted in Linux-based
applications and slowly being accepted in almost all operating systems.

My guess is that 10 years from now, UTF-8 will have been fairly
completely adopted except for some legacy Java and browser UCS-2

   My suggestion would be to try to support ASCII, UCS-2 and UTF-8 for the
moment and work towards joining the march towards UTF-8.
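
   One possible way for a reader to tell the three apart, offered only as
a rough sketch: look for a byte-order mark first, then fall back to trial
decoding.  The helper name guess_encoding is hypothetical, Python's
"utf-16" codec stands in for UCS-2 (which Python does not offer as a
separate codec), and UTF-16 without a byte-order mark is deliberately not
detected here.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Hypothetical helper: guess ASCII, UTF-8 or UTF-16 for a byte string."""
    # A byte-order mark, when present, is the strongest hint.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No mark: try the strictest encoding first, then UTF-8.
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # Assumption: anything else is left for the caller to handle.
    return "unknown"


print(guess_encoding(b"data_block"))
print(guess_encoding("\u00e9".encode("utf-8")))
print(guess_encoding("\u00e9".encode("utf-16")))
```

The returned name can be passed straight to bytes.decode(); the "utf-16"
and "utf-8-sig" codecs consume the byte-order mark themselves.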


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Sat, 10 Oct 2009, Brian McMahon wrote:

> Regarding the adoption of the Unicode character set, I agree that
> this would make it easier to accommodate accented and non-Latin
> characters and symbols, and I see no reason to oppose implementing
> it as a UTF-8 encoding, and so I vote 3.2.
> (It's not a panacea, especially for maths, where new symbols can
> always be invented, and one must be able to specify a two-dimensional
> layout as well as just the glyphs, so we shall still need other
> approaches for various types of "rich" text.)
> However, this is a binary encoding, is it not, and so the underlying
> STAR specification must be modified to accommodate this. (I'm afraid
> I haven't got Nick's draft paper for the revised STAR specification
> to hand, so I apologise if that's already been addressed.)
> Does it raise issues of endian-ness? If we are introducing binary
> encodings, are there any reasons to restrict the character set
> encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
> and (ii) in CIF? And, ultimately, is there a prospect of extending
> the STAR spec in a way that properly accommodates at least the CBF
> implementation, and possibly other binary data incorporation?
> I am happy in this case that handling by "old" CIF software can
> be done by adopting a protocol that allows UTF-8 Unicode characters
> to be represented by ASCII encodings such as \u27. (I don't think
> that we need specify a protocol at this point, just be sure that
> one can be defined if needed.)
> I again draw attention to the amusing fact that with an ASCII
> Unicode encoding, "O\u27Neill" is a valid data value under the
> current proposals, whereas the UTF-8 equivalent would not be,
> because the UTF-8 encoding of ' is just ' !
> Brian
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group