Discussion List Archives

Re: [ddlm-group] UTF-8 versus extended ASCII

I agree with James on this one. It is specified as UTF-8, so that is what
you expect. Most of the files will be pure ASCII, as they are now, but over
time that will change. If we say it can be extended ASCII, which is ALMOST
(but not quite) the same as UTF-8, then I can only see confusion for users.
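
To make the "almost but not quite" concrete, here is a minimal Python sketch. It assumes ISO-8859-1 (Latin-1) as the stand-in for "extended ASCII", which is only one of several 8-bit code pages; the helper name is_valid_utf8 and the sample data name are purely illustrative.

```python
# Illustration only: Latin-1 is assumed here as the "extended ASCII" code page.
text = "é"

latin1_bytes = text.encode("latin-1")  # a single byte in the 128-255 range
utf8_bytes = text.encode("utf-8")      # two bytes, both in the 128-255 range


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A pure-ASCII line reads identically under both interpretations.
ascii_only = "_cell_length_a 5.43".encode("ascii")

print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(ascii_only.decode("utf-8") == ascii_only.decode("latin-1"))
```

The two encodings agree on pure-ASCII files, but a lone byte above 127 that is legal in an 8-bit code page is generally not a legal UTF-8 sequence, which is where I would expect the confusion to come from.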

I was discussing the move to UTF-8 with Syd the other day. He posed a
question whose answer I had taken for granted, but now I am wondering.

The specification for STAR is broad, so it will say the encoding is UTF-8.
But when it comes to specific instances like CIF, are we thinking that the
data names in the file will still be restricted to the ASCII subset of
UTF-8? I must admit I have been thinking of UTF-8 in terms of the data
values, not in terms of the data tags.

I have been trying to work out whether XML accepts non-ASCII UTF-8
characters in the names used in start- and end-tags (the element names). It
looks like it does, but every example I have seen sticks to the ASCII
character set.

Does anybody know the answer?
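
One quick way to probe this is to hand a parser a document with non-ASCII element names and see whether it objects. The sketch below uses Python's standard xml.etree.ElementTree module; the element and attribute names are made up for illustration, and a single parser accepting the input is suggestive rather than a substitute for reading the XML specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element and attribute names are invented
# for this test and deliberately contain non-ASCII characters.
document = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<données><valeur unité="Å">1.5</valeur></données>'
).encode("utf-8")

root = ET.fromstring(document)  # raises ParseError if the names are rejected
child = root[0]

print(root.tag)      # name of the outer element
print(child.tag)     # name of the inner element
print(child.attrib)  # attribute names and values
print(child.text)    # text content of the inner element
```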


On 6/11/09 11:31 PM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:

> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
> using character codes 128-255. UTF-8 gained broad support because it
> still fits this design, even though it encodes many more non-ASCII
> characters.
> 
> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
> characters 128-255, but not specifically declaring UTF-8 encoding. It is
> almost the same, but has a few potential advantages.
> 
> First, it becomes a bit more sensible for the DDL to declare where UTF-8
> is allowed, rather than excluding it from all of the other strings. I
> assume that UTF-8 is intended mainly for publication-oriented formatted
> text, but the numerous label strings will remain ASCII. If not, it still
> follows the original STAR/CIF idea where the exact details of string
> encoding are left to the DDL.
> 
> Second, generic 8-bit extended ASCII would make it easier to efficiently
> encode binary data, with 7 bits of raw binary data per byte. It has half
> the overhead of Base64, and does not require mapping characters in a
> look-up table. It is not as efficient as embedding binary in UCS-2, but
> it also does not have the UCS-2 overhead for all of the non-binary CIF
> files.
> 
> The advantage of UCS-2 is that its characters fit easily into short
> fixed-length strings, and sub-strings are much cheaper to manipulate. That
> is why Java and the MS-Windows kernel use UCS-2. UTF-8 is more efficient
> for storage, which is one reason MS-Windows does not default to UCS-2
> for text files. Therefore, in my opinion, UTF-8 is better suited to an
> archival format. However, UCS-2 might really be a better choice for
> mostly-binary CIF files. It would be nice for UCS-2 CIF beginning with
> the BOM encoding mark to also be a valid CIF alternative, instead of
> just a hacked pseudo-CIF.
> 
> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
> STAR syntax can be updated to define a more generic encoding. Herbert
> mentioned that using "not exactly CIF" often is useful to get work done,
> when the strict CIF format gets in the way. It would be nice if these
> sorts of files could at least stick to STAR syntax to avoid running into
> incompatibilities.
> 
> OTOH, I am much more picky about proper syntax standards than most
> people. Maybe this group is happy to declare standard CIF as UTF-8, and
> leave any alternative forms as a customised, non-standard CIF.
> 
> Joe Krahn
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
