
Re: [ddlm-group] UTF-8 versus extended ASCII

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
From: James Hester <[email protected]>
Date: Mon, 9 Nov 2009 13:30:14 +1100
In-Reply-To: <[email protected]>
References: <[email protected]>

This raises the issue of how we write the specification: in terms of
UTF-8 characters, or in terms of allowed byte values.  If the STAR
syntax specifies the allowed lexical patterns in terms of byte values,
then a conforming but non-UTF-8 file could easily be produced, because
high-bit-set byte sequences need not correspond to any UTF-8 character.
I view this as undesirable, as it waters down the UTF-8 requirement.
Therefore, if we are serious about UTF-8 encoding, we should write the
specification in terms of UTF-8 characters.
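
As a minimal sketch of the difference (Python; the byte-level rule and
the function names below are illustrative assumptions, not taken from
any STAR or CIF2 draft), a rule that simply admits bytes 128-255 accepts
input that is not UTF-8 at all:

```python
import re

# Hypothetical byte-level lexical rule: tab, LF, CR, printable ASCII,
# or any byte with the high bit set (0x80-0xFF).
BYTE_RULE = re.compile(rb"[\x09\x0A\x0D\x20-\x7E\x80-\xFF]*")


def passes_byte_rule(data: bytes) -> bool:
    """True if every byte is allowed by the byte-level rule above."""
    return BYTE_RULE.fullmatch(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "\u00c5".encode("utf-8")  # A-ring encoded as UTF-8: b"\xc3\x85"
bad = b"\xc3\x28"                # lead byte followed by a non-continuation byte
latin1 = b"\xc5"                 # A-ring in ISO-8859-1, not valid UTF-8

for sample in (good, bad, latin1):
    print(sample, passes_byte_rule(sample), is_valid_utf8(sample))
```

All three samples satisfy the byte-level rule, but only the first is
UTF-8, which is the gap a character-level specification would close.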

I am of course aware that parsers can and will be written which simply
allow all bytes with values from 128 to 255 in data values or
comments, and leave the UTF-8 verification to a downstream data
consumer. Nevertheless, we should not lead CIF2 writers to rely on
such behaviour and so should not leave the door open for it any more
than necessary.  If at all possible, the 'flagship' CIF2 readers (e.g.
CheckCIF) should do the UTF-8 check and flag non-UTF-8 encoding.
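
A reader-side check of this kind could be sketched as follows (Python;
the function name is hypothetical and this is not how CheckCIF or any
existing CIF reader is known to work). It relies on the standard strict
UTF-8 decoder and reports where decoding first fails:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first non-UTF-8 sequence, or None.

    Uses Python's strict UTF-8 decoder; UnicodeDecodeError.start is the
    offset of the first byte that could not be decoded.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Example: flag a file whose contents are not valid UTF-8.
contents = b"_example.name 'ab\xc3('\n"
offset = first_invalid_utf8_offset(contents)
if offset is not None:
    print(f"non-UTF-8 byte sequence at offset {offset}")
```

A reader that runs such a check before lexing can reject or flag the
file up front instead of passing the raw bytes downstream.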

On Sat, Nov 7, 2009 at 2:31 AM, Joe Krahn <[email protected]> wrote:
> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
> using character codes 128-255. UTF-8 gained broad support because it
> still fits this design, even though it encodes many more non-ASCII
> characters.
>
> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
> characters 128-255, but not specifically declaring UTF-8 encoding. It is
> almost the same, but has a few potential advantages.
>
> First, it becomes a bit more sensible for the DDL to declare where UTF-8
> is allowed, rather than excluding it from all of the other strings. I
> assume that UTF-8 is intended mainly for publication-oriented formatted
> text, but the numerous label strings will remain ASCII. If not, it still
> follows the original STAR/CIF idea where the exact details of string
> encoding are left to the DDL.
>
> Second, generic 8-bit extended ASCII would make it easier to efficiently
> encode binary data, with 7 bits of raw binary data per byte. It has half
> the overhead of Base64, and does not require mapping characters in a
> look-up table. It is not as efficient as embedding binary in UCS-2, but
> it also does not have the UCS-2 overhead for all of the non-binary CIF
> files.
>
> The advantage of UCS-2 is that its characters fit easily into short
> fixed-length strings, and it is much more efficient for manipulating
> sub-strings. That
> is why Java and the MS-Windows kernel use UCS-2. UTF-8 is more efficient
> for storage, which is one reason MS-Windows does not default to UCS-2
> for text files. Therefore, in my opinion, UTF-8 is better suited to an
> archival format. However, UCS-2 might really be a better choice for
> mostly-binary CIF files. It would be nice for UCS-2 CIF beginning with
> the BOM encoding mark to also be a valid CIF alternative, instead of
> just a hacked pseudo-CIF.
>
> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
> STAR syntax can be updated to define a more generic encoding. Herbert
> mentioned that using "not exactly CIF" often is useful to get work done,
> when the strict CIF format gets in the way. It would be nice if these
> sorts of files could at least stick to STAR syntax to avoid running into
> incompatibilities.
>
> OTOH, I am much more picky about proper syntax standards than most
> people. Maybe this group is happy to declare standard CIF as UTF-8, and
> leave any alternative forms as a customised, non-standard CIF.
>
> Joe Krahn
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
