Discussion List Archives


Re: [ddlm-group] [THREAD 4] UTF8

OK. Well, I think this discussion has died its death. One of the benefits (to Herb mostly) is that with UCS-2 encoding, CBF comes into the imgCIF fold. Herb’s group has a binary->UCS-2 encoding with a 7% added footprint, so it is efficient. At the moment (or at least shortly) the alternative is to go binary->BASE64, which has a 33% added footprint. I guess binary->UTF-8 would have the same footprint. By being able to switch encodings Herb can achieve a single file for his work. The price is that we have to adopt a switching encoding option, which is apparently not attractive to most people here.
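As a rough check on the BASE64 figure quoted above, the following sketch measures the size increase when arbitrary binary data is BASE64-encoded with Python's standard library. The function name `base64_overhead` and the sample size are illustrative assumptions; the 7% figure for the binary->UCS-2 scheme depends on details of Herb's encoding that are not given here, so it is not reproduced.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Return the fractional size increase from BASE64-encoding n_bytes of data.

    Hypothetical helper for illustration only; line breaks that some
    BASE64 transports insert would add slightly more.
    """
    data = os.urandom(n_bytes)          # stand-in for a binary image section
    encoded = base64.b64encode(data)    # 4 output characters per 3 input bytes
    return len(encoded) / len(data) - 1.0


overhead = base64_overhead(3_000_000)
print(f"BASE64 added footprint: {overhead:.1%}")
```

BASE64 emits four ASCII characters for every three input bytes, which is where the roughly one-third increase comes from.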

It seems the consensus is that UTF-8 is the encoding standard and everything else is a broken file, notwithstanding that any application may parse it, attempt to determine the encoding and produce a corrected file. That is nice, but not a required part of the standard (which is the standard, to quote David, quoting me).
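The distinction drawn here, between a strict reader and an optional repair step, might look something like the sketch below. Both function names and the list of candidate encodings are assumptions made for illustration; nothing here is part of any CIF specification, and the guessing step is only a heuristic.

```python
def read_cif2_text(raw: bytes) -> str:
    """Strict reader: anything that is not valid UTF-8 is a broken file."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on other encodings


def repair_to_utf8(raw: bytes, candidates=("utf-16", "latin-1")) -> bytes:
    """Optional, non-standard repair: guess the encoding and re-encode as UTF-8.

    The candidate list is hypothetical. Guessing can silently pick the
    wrong encoding, which is one reason to keep it outside the standard.
    """
    try:
        raw.decode("utf-8")
        return raw                      # already valid, nothing to repair
    except UnicodeDecodeError:
        pass
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeError:
            continue                    # this guess failed, try the next one
        return text.encode("utf-8")
    raise ValueError("could not determine the encoding")


broken = "é".encode("latin-1")          # a single byte, 0xE9, not valid UTF-8
fixed = repair_to_utf8(broken, candidates=("latin-1",))
print(read_cif2_text(fixed))
```

The strict reader stays trivial because it has one encoding to consider; all the uncertainty is confined to the repair tool, which an application may or may not choose to provide.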

It seems to me, Herb, that the easiest thing you can do with regard to CBF (since it can’t be a real CIF) is to adopt everything from CIF-2 except that CBF must be encoded as UCS-2. That may give you a bigger footprint per ASCII character, but since there are only a small number of such data values in a CBF it may well be worth the price. That way at least the binary part is efficiently encoded.
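To put a number on the "bigger footprint per ASCII character", this small sketch compares the encoded size of an ASCII-only string in UTF-8 and in a 16-bit encoding. Python has no codec named UCS-2, so `utf-16-le` is used as a stand-in on the assumption that all characters are in the Basic Multilingual Plane, where the two coincide. The sample text and the function name `encoded_sizes` are made up for illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8 and in a UCS-2-like encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Assumption: utf-16-le without a BOM stands in for UCS-2 here.
        "utf-16-le": len(text.encode("utf-16-le")),
    }


sizes = encoded_sizes("_example.tag 1.234")   # hypothetical ASCII-only content
print(sizes)
```

For ASCII-only text the 16-bit form takes exactly two bytes per character against one in UTF-8, so the ASCII portion of the file doubles in size.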

If in the DDLm we manage to flag order dependency in the loop_ header (along with a flag to indicate row ordering is important in the loop_ - something John W wanted) then CBF will essentially be CIF-2 except that the encoding is UCS-2.

On 26/10/09 10:43 PM, "David Brown" <idbrown@mcmaster.ca> wrote:

James has asked for the views of those of us who have just been watching this discussion (it seems to go by faster than I can follow).

For what it is worth, my stance is strongly the same as James'.  He has laid out all the arguments very succinctly - I will just briefly reinforce those I think most important.

1. Permitting one or more other encodings (presumably these must constitute a well defined list as their names must be recognized) immediately invites people to use them.  To quote Nick, if we have a standard it should be a standard.  If people decide to bend the standard they do so at their own peril.

2. If other encodings are deprecated (yes, that word again) in order to encourage people to use the real standard, people are likely to be unaware that their program has used a deprecated encoding and has therefore failed to identify it.  If the writer of the CIF knows that it is using a deprecated standard it can make the conversion.  Only if it does not know it is using a deprecated standard, and therefore neither converts nor identifies the encoding, will the deprecated encoding get through.

3. Providing a space (at this stage) for identifying encodings that may (or may not) later become part of the standard is unnecessary since it can be added if and when such other encodings are allowed.  UTF-8 then becomes the default.

4. Having a single standard requires that the readers need only consider one encoding and the writers need only support a conversion from the native to the CIF2 standard.  Allowing 5 or 10 other encodings makes life easier for the writer since it does not have to provide any conversion, but there is a price.  Every reader must be able to read 5 or 10 different encodings because it is not allowed to reject any of the deprecated standards.  With a single standard the IUCr may or may not decide that they will handle different encodings, but that is their choice.  Making different encodings legal removes that choice from the reader: it has to handle all possibilities, which is only likely to discourage people writing local programs for occasional use.


ddlm-group mailing list



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

International Union of Crystallography
