[ddlm-group] UTF-8 versus extended ASCII
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: [ddlm-group] UTF-8 versus extended ASCII
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Fri, 06 Nov 2009 10:31:52 -0500
Traditionally, non-ASCII characters are encoded as "extended" ASCII, using character codes 128-255. UTF-8 gained broad support because it still fits this design, even though it encodes many more non-ASCII characters. My suggestion is to define the low-level STAR2/CIF2 syntax as allowing characters 128-255, without specifically declaring UTF-8 encoding. It is almost the same, but has a few potential advantages.

First, it becomes a bit more sensible for the DDL to declare where UTF-8 is allowed, rather than excluding it from all of the other strings. I assume that UTF-8 is intended mainly for publication-oriented formatted text, and that the numerous label strings will remain ASCII. If not, it still follows the original STAR/CIF idea that the exact details of string encoding are left to the DDL.

Second, generic 8-bit extended ASCII would make it easier to encode binary data efficiently, with 7 bits of raw binary data per byte. It has about half the overhead of Base64, and does not require mapping characters through a look-up table. It is not as efficient as embedding binary in UCS-2, but it also does not impose the UCS-2 overhead on all of the non-binary CIF files.

The advantage of UCS-2 is that its fixed-width characters fit easily into short fixed-length strings and make sub-string manipulation much more efficient. That is why Java and the MS-Windows kernel use UCS-2. UTF-8 is more efficient for storage, which is one reason MS-Windows does not default to UCS-2 for text files. Therefore, in my opinion, UTF-8 is better suited to an archival format. However, UCS-2 might really be a better choice for mostly-binary CIF files. It would be nice for a UCS-2 CIF beginning with the BOM encoding mark to also be a valid CIF alternative, instead of just a hacked pseudo-CIF.

If CIF still wants to go with global UTF-8 encoding, maybe the low-level STAR syntax can be updated to define a more generic encoding.
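
[Editorial illustration, not part of the original message.] The 7-bits-per-byte idea above can be sketched as follows. This is only a minimal sketch under stated assumptions: the function names `encode_7bit` and `decode_7bit` are hypothetical, the layout (high bit always set, payload bits packed most-significant first, zero padding in the last byte) is one possible choice, and nothing here is defined by STAR or CIF.

```python
import base64


def encode_7bit(data: bytes) -> bytes:
    """Pack arbitrary bytes into bytes in the range 128-255.

    Each output byte has its high bit set and carries 7 payload bits.
    (Hypothetical layout chosen for illustration only.)
    """
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of not-yet-emitted bits held in acc
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(0x80 | ((acc >> nbits) & 0x7F))
        acc &= (1 << nbits) - 1  # keep only the unconsumed bits
    if nbits:
        # Pad the final group with zero bits on the right.
        out.append(0x80 | ((acc << (7 - nbits)) & 0x7F))
    return bytes(out)


def decode_7bit(encoded: bytes) -> bytes:
    """Inverse of encode_7bit; trailing padding bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for b in encoded:
        acc = (acc << 7) | (b & 0x7F)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)


if __name__ == "__main__":
    payload = bytes(range(256)) * 3            # 768 bytes of sample binary data
    packed = encode_7bit(payload)
    assert decode_7bit(packed) == payload      # round trip
    print(len(payload), len(packed), len(base64.b64encode(payload)))
```

For the 768-byte sample, the 7-bit form takes 878 bytes (about 14% overhead) and Base64 takes 1024 bytes (about 33% overhead, ignoring line breaks). No look-up table is needed, only shifts and masks. Note that the packed bytes are in general not valid UTF-8, which is why this scheme depends on the syntax permitting generic characters 128-255 rather than mandating UTF-8.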
Herbert mentioned that using "not exactly CIF" often is useful to get work done, when the strict CIF format gets in the way. It would be nice if these sorts of files could at least stick to STAR syntax to avoid running into incompatibilities.

OTOH, I am much more picky about proper syntax standards than most people. Maybe this group is happy to declare standard CIF as UTF-8, and leave any alternative forms as a customised, non-standard CIF.

Joe Krahn
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] UTF-8 versus extended ASCII (Nick Spadaccini)
- Re: [ddlm-group] UTF-8 versus extended ASCII (James Hester)
- Re: [ddlm-group] UTF-8 versus extended ASCII (Herbert J. Bernstein)