Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: Nick Spadaccini <[email protected]>
Date: Wed, 28 Oct 2009 15:16:40 +0800
In-Reply-To: <[email protected]>

OK. Well, I think this discussion has died its death. One of the benefits (to Herb mostly) is that with UCS-2 encoding, CBF comes into the imgCIF fold. Herb’s group has a binary->UCS-2 encoding with a 7% added footprint, so it is efficient. At the moment (or at least shortly) the alternative is to go binary->BASE64, which has a 33% added footprint. I guess binary->UTF-8 would have the same footprint. By being able to switch encodings Herb can achieve a single file for his work. The price is that we have to adopt a switching-encoding option, which is apparently not attractive to most people here.
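The footprint figures can be checked with a little arithmetic. The sketch below measures the Base64 overhead directly. The "15 payload bits per 16-bit code unit" packing is only an assumption, used to show how a figure near 7% could arise; it is not a description of Herb's actual encoding, and the function names are made up for illustration.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Fractional size increase when n_bytes of binary data are Base64-encoded."""
    raw = os.urandom(n_bytes)
    encoded = base64.b64encode(raw)  # 4 output characters per 3 input bytes
    return len(encoded) / len(raw) - 1.0


def packed_overhead(payload_bits: int, unit_bits: int) -> float:
    """Overhead when each unit_bits-wide code unit carries payload_bits of data.

    The 15-in-16 case used below is an assumed scheme, for illustration only.
    """
    return unit_bits / payload_bits - 1.0


b64 = base64_overhead(3000)          # multiple of 3, so no padding effects
ucs2_like = packed_overhead(15, 16)  # assumption: 15 data bits per UCS-2 unit

print(f"Base64 added footprint: {b64:.1%}")
print(f"15 bits per 16-bit unit (assumed): {ucs2_like:.1%}")
```

Under that assumption the two ratios are 4/3 and 16/15, which is where figures of roughly 33% and 7% would come from.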

It seems the consensus is that UTF-8 is the encoding standard and everything else is a broken file, notwithstanding that any application may parse it, attempt to determine the encoding and produce a corrected file. That is nice, but not a required part of the standard (which is the standard, to quote David, quoting me).
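As a minimal sketch of what "UTF-8 or broken" could mean for a reader, the function below decodes strictly and rejects anything else, with no attempt to guess an encoding. The name decode_cif2 is hypothetical and nothing here is taken from a CIF2 specification.

```python
def decode_cif2(raw: bytes) -> str:
    """Decode file contents as UTF-8, or raise ValueError.

    Hypothetical helper: it makes no attempt to detect or repair
    other encodings, which is left to optional tools.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8 at byte offset {exc.start}") from exc


good = "data_example \u00e5".encode("utf-8")
bad = "data_example \u00e5".encode("latin-1")  # same text, different encoding

print(decode_cif2(good))
try:
    decode_cif2(bad)
except ValueError as err:
    print("rejected:", err)
```

A separate repair tool could still sniff the encoding and rewrite the file, but that sits outside the reader and outside the standard.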

It seems to me, Herb, that the easiest thing you can do with regard to CBF (since it can’t be a real CIF) is to adopt everything from CIF-2 except that CBF must be encoded as UCS-2. That may give you a bigger footprint per ASCII character, but since these make up a small number of data values in a CBF it may well be worth the price. That way at least the binary part is efficiently encoded.
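To put a number on the "bigger footprint per ASCII character", the sketch below compares the byte counts of an ASCII-only string in UTF-8 and in a UCS-2-style encoding. Python has no codec named UCS-2, so "utf-16-le" stands in for it here; the two agree for characters in the Basic Multilingual Plane. The sample string is made up.

```python
text = "data_example"  # hypothetical ASCII-only content

utf8_size = len(text.encode("utf-8"))       # 1 byte per ASCII character
ucs2_size = len(text.encode("utf-16-le"))   # 2 bytes per BMP character, no BOM

print(f"UTF-8: {utf8_size} bytes, UCS-2 style: {ucs2_size} bytes")
```

The ASCII portion doubles in size, so the trade only pays off when the binary section dominates the file.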

If in the DDLm we manage to flag order dependency in the loop_ header (along with a flag to indicate row ordering is important in the loop_ - something John W wanted) then CBF will essentially be CIF-2 except that the encoding is UCS-2.

On 26/10/09 10:43 PM, "David Brown" <[email protected]> wrote:

James has asked for the views of those of us who have just been watching this discussion (it seems to go by faster than I can follow).

For what it is worth, my stance is strongly the same as James'. He has laid out all the arguments very succinctly - I will just briefly reinforce those I think most important.

1. Permitting one or more other encodings (presumably these must constitute a well defined list as their names must be recognized) immediately invites people to use them. To quote Nick, if we have a standard it should be a standard. If people decide to bend the standard they do so at their own peril.

2. If other encodings are deprecated (yes, that word again) in order to encourage people to use the real standard, they are likely to be unaware that their program has used a deprecated encoding and therefore has failed to identify it. If the writer of the CIF knows that it is using a deprecated standard it can make the conversion. Only if it does not know it is using a deprecated standard, and therefore neither converts nor identifies the encoding, will the deprecated encoding get through.

3. Providing a space (at this stage) for identifying encodings that may (or may not) later become part of the standard is unnecessary since it can be added if and when such other encodings are allowed. UTF-8 then becomes the default.

4. Having a single standard requires that the readers need only consider one encoding and the writers need only support a conversion from the native to the CIF2 standard. Allowing 5 or 10 other encodings makes life easier for the writer since it does not have to provide any conversion, but there is a price. Every reader must be able to read 5 or 10 different encodings because it is not allowed to reject any of the deprecated standards. With a single standard the IUCr may or may not decide that they will handle different encodings, but that is their choice. Making different encodings legal removes that choice from the reader: it has to handle all possibilities, which is only likely to discourage people from writing local programs for occasional use.

David

_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: [email protected]
