Re: [ddlm-group] UTF-8 versus extended ASCII

To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
From: James Hester <[email protected]>
Date: Tue, 10 Nov 2009 22:59:43 +1100
In-Reply-To: <C71F4206.123BF%[email protected]>
References: <[email protected]><C71F4206.123BF%[email protected]>

My answer to Syd's question would be that neither STAR nor CIF datanames
are restricted to plain ASCII by the syntax.  That said, I expect
that the IUCr dictionaries would specify only ASCII datanames in the
immediate future, perhaps augmented by accented characters once it is
clear that everybody can handle UTF-8.
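
To make the distinction concrete, here is a minimal Python sketch, not
anything taken from the STAR or CIF specifications. The helper names
is_ascii_dataname and decodes_as_utf8 are made up for illustration, and
the accented dataname is hypothetical (no dictionary defines it). It
shows that an ASCII-only dataname is byte-for-byte identical in ASCII and
UTF-8, while an accented one is valid UTF-8 only when actually encoded as
UTF-8, not as an "extended ASCII" code page such as Latin-1.

```python
def is_ascii_dataname(name: str) -> bool:
    """Return True if every character lies in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in name)


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_name = "_atom_site_label"
accented_name = "_atom_site_étiquette"  # hypothetical, for illustration only

# The same accented dataname under two different encodings.
utf8_bytes = accented_name.encode("utf-8")      # "é" becomes two bytes
latin1_bytes = accented_name.encode("latin-1")  # "é" becomes one byte, 0xE9

print(is_ascii_dataname(ascii_name), is_ascii_dataname(accented_name))
print(decodes_as_utf8(utf8_bytes), decodes_as_utf8(latin1_bytes))
```

A parser that promises UTF-8 would have to reject the Latin-1 bytes,
which is the kind of confusion Nick describes below.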

On Tue, Nov 10, 2009 at 7:15 PM, Nick Spadaccini <[email protected]> wrote:
> I agree with James on this one. It is specified as UTF-8, so that is what
> you expect. Most of the files will be pure ASCII, as they are now, but over
> time that will change. If we say it can be extended ASCII, which is ALMOST
> (but not) the same as UTF-8 then I can only see confusion with users.
>
> I was discussing the move to UTF-8 with Syd the other day. He posed a
> question, the answer to which I took for granted, but now I am wondering.
>
> The specification for STAR is broad so it will say encoding is UTF-8. But
> when it comes to specific instances like CIF are we thinking that the data
> names in the file will still be restricted to the ASCII subset of UTF-8? I
> must admit I have been thinking of UTF-8 in terms of the data values, not in
> terms of the data tags.
>
> I have been trying to work out if XML accepts UTF-8 characters in the
> strings that define start- and end-tags (the elements). It looks like they
> do but every example I have seen works with the ASCII character set.
>
> Does anybody know the answer?
>
>
> On 6/11/09 11:31 PM, "Joe Krahn" <[email protected]> wrote:
>
>> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
>> using character codes 128-255. UTF-8 gained broad support because it
>> still fits this design, even though it encodes many more non-ASCII
>> characters.
>>
>> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
>> characters 128-255, but not specifically declaring UTF-8 encoding. It is
>> almost the same, but has a few potential advantages.
>>
>> First, it becomes a bit more sensible for the DDL to declare where UTF-8
>> is allowed, rather than excluding it from all of the other strings. I
>> assume that UTF-8 is intended mainly for publication-oriented formatted
>> text, but the numerous label strings will remain ASCII. If not, it still
>> follows the original STAR/CIF idea where the exact details of string
>> encoding are left to the DDL.
>>
>> Second, generic 8-bit extended ASCII would make it easier to efficiently
>> encode binary data, with 7-bits of raw binary data per byte. It has half
>> the overhead of Base64, and does not require mapping characters in a
>> look-up table. It is not as efficient as embedding binary in UCS-2, but
>> it also does not have the UCS-2 overhead for all of the non-binary CIF
>> files.
>>
>> The advantage of UCS-2 is that its characters easily fit into short
>> fixed-length strings, and it is much more efficient for manipulating
>> sub-strings. That is why Java and the MS-Windows kernel use UCS-2.
>> UTF-8 is more efficient
>> for storage, which is one reason MS-Windows does not default to UCS-2
>> for text files. Therefore, in my opinion, UTF-8 is better suited to an
>> archival format. However, UCS-2 might really be a better choice for
>> mostly-binary CIF files. It would be nice for UCS-2 CIF beginning with
>> the BOM encoding mark to also be a valid CIF alternative, instead of
>> just a hacked pseudo-CIF.
>>
>> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
>> STAR syntax can be updated to define a more generic encoding. Herbert
>> mentioned that using "not exactly CIF" often is useful to get work done,
>> when the strict CIF format gets in the way. It would be nice if these
>> sorts of files could at least stick to STAR syntax to avoid running into
>> incompatibilities.
>>
>> OTOH, I am much more picky about proper syntax standards than most
>> people. Maybe this group is happy to declare standard CIF as UTF-8, and
>> leave any alternative forms as a customised, non-standard CIF.
>>
>> Joe Krahn
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
> MBDP �M002
>
> CRICOS Provider Code: 00126G
>
> e: [email protected]
>
>
>
>

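As a rough check on the overhead comparison in Joe's quoted second point,
the sketch below packs 7 payload bits into each output byte and compares
the result with Base64. The pack7 function is only an assumed reading of
his proposal (every output byte has the high bit set, so it falls in the
range 128-255); his message does not specify an actual layout.

```python
import base64


def pack7(data: bytes) -> bytes:
    """Pack data into bytes 128-255, carrying 7 payload bits per byte."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of unconsumed bits held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(0x80 | ((acc >> nbits) & 0x7F))
        acc &= (1 << nbits) - 1  # keep only the unconsumed bits
    if nbits:
        # Left-align the final partial group and pad it with zero bits.
        out.append(0x80 | ((acc << (7 - nbits)) & 0x7F))
    return bytes(out)


payload = bytes(range(256)) * 21  # 5376 bytes, divisible by both 3 and 7
packed = pack7(payload)
b64 = base64.b64encode(payload)

overhead_7bit = len(packed) / len(payload) - 1  # 8/7 - 1, about 14%
overhead_b64 = len(b64) / len(payload) - 1      # 4/3 - 1, about 33%
print(f"7-bit packing: {overhead_7bit:.1%}, Base64: {overhead_b64:.1%}")
```

So the 7-bit scheme costs somewhat less than half the Base64 overhead.
Note that a byte stream packed this way is not, in general, valid UTF-8,
so it would only be usable under the extended-ASCII proposal.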


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148