Re: [ddlm-group] UTF-8 versus extended ASCII
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
- From: James Hester <jamesrhester@gmail.com>
- Date: Tue, 10 Nov 2009 22:59:43 +1100
- In-Reply-To: <C71F4206.123BF%nick@csse.uwa.edu.au>
- References: <4AF44168.10402@niehs.nih.gov><C71F4206.123BF%nick@csse.uwa.edu.au>
My answer to Syd's question would be that the syntax does not restrict
either STAR or CIF datanames to plain ASCII. That said, I expect that
the IUCr dictionaries would specify only ASCII datanames in the
immediate future, perhaps augmented by accented characters once it is
clear that everybody can handle UTF-8.

On Tue, Nov 10, 2009 at 7:15 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> I agree with James on this one. It is specified as UTF-8, so that is what
> you expect. Most of the files will be pure ASCII, as they are now, but over
> time that will change. If we say it can be extended ASCII, which is ALMOST
> (but not) the same as UTF-8 then I can only see confusion with users.
>
> I was discussing the move to UTF-8 with Syd the other day. He posed a
> question, the answer for which I took for granted, but now I am wondering.
>
> The specification for STAR is broad so it will say encoding is UTF-8. But
> when it comes to specific instances like CIF are we thinking that the data
> names in the file will still be restricted to the ASCII subset of UTF-8? I
> must admit I have been thinking of UTF-8 in terms of the data values, not in
> terms of the data tags.
>
> I have been trying to work out if XML accepts UTF-8 characters in the
> strings that define start- and end-tags (the elements). It looks like they
> do but every example I have seen works with the ASCII character set.
>
> Anybody know the answer?
>
>
> On 6/11/09 11:31 PM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>
>> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
>> using character codes 128-255. UTF-8 gained broad support because it
>> still fits this design, even though it encodes many more non-ASCII
>> characters.
>>
>> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
>> characters 128-255, but not specifically declaring UTF-8 encoding. It is
>> almost the same, but has a few potential advantages.
>>
>> First, it becomes a bit more sensible for the DDL to declare where UTF-8
>> is allowed, rather than excluding it from all of the other strings. I
>> assume that UTF-8 is intended mainly for publication-oriented formatted
>> text, but the numerous label strings will remain ASCII. If not, it still
>> follows the original STAR/CIF idea where the exact details of string
>> encoding are left to the DDL.
>>
>> Second, generic 8-bit extended ASCII would make it easier to efficiently
>> encode binary data, with 7-bits of raw binary data per byte. It has half
>> the overhead of Base64, and does not require mapping characters in a
>> look-up table. It is not as efficient as embedding binary in UCS-2, but
>> it also does not have the UCS-2 overhead for all of the non-binary CIF
>> files.
>>
>> The advantage of UCS-2 is that its characters fit easily into short
>> fixed-length strings and make sub-string manipulation much more
>> efficient. That is why Java and the MS-Windows kernel use UCS-2. UTF-8
>> is more efficient for storage, which is one reason MS-Windows does not
>> default to UCS-2 for text files. Therefore, in my opinion, UTF-8 is
>> better suited to an archival format. However, UCS-2 might really be a
>> better choice for mostly-binary CIF files. It would be nice for UCS-2
>> CIF beginning with the BOM encoding mark to also be a valid CIF
>> alternative, instead of just a hacked pseudo-CIF.
>>
>> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
>> STAR syntax can be updated to define a more generic encoding. Herbert
>> mentioned that using "not exactly CIF" often is useful to get work done,
>> when the strict CIF format gets in the way. It would be nice if these
>> sorts of files could at least stick to STAR syntax to avoid running into
>> incompatibilities.
>>
>> OTOH, I am much more picky about proper syntax standards than most
>> people. Maybe this group is happy to declare standard CIF as UTF-8, and
>> leave any alternative forms as a customised, non-standard CIF.
>>
>> Joe Krahn
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
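
The point made in the thread that extended ASCII is "ALMOST (but not)
the same as UTF-8" can be illustrated with a minimal Python sketch. It
is only an illustration, not part of any STAR or CIF specification: the
helper is_valid_utf8 and the accented dataname "_propriété" are
hypothetical, and Latin-1 is assumed here as one example of an
"extended ASCII" code page using byte values 128-255.

```python
# Illustrative sketch only: the helper name and the accented dataname
# are hypothetical and are not taken from any STAR/CIF dictionary.

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A pure-ASCII dataname is byte-for-byte identical in ASCII and UTF-8.
ascii_name = "_cell_length_a".encode("ascii")

# The same accented dataname in two encodings.
latin1_name = "_propriété".encode("latin-1")  # each é is the single byte 0xE9
utf8_name = "_propriété".encode("utf-8")      # each é is the two bytes 0xC3 0xA9

for label, raw in [("ascii", ascii_name),
                   ("latin-1", latin1_name),
                   ("utf-8", utf8_name)]:
    print(label, len(raw), is_valid_utf8(raw))
```

Under these assumptions the ASCII and UTF-8 byte strings pass a strict
UTF-8 decode while the Latin-1 bytes do not, because a lone 0xE9 byte
is not a complete UTF-8 sequence. A reader that expects UTF-8 would
therefore reject, or have to guess at, a file written with a 128-255
code page.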
- References:
- [ddlm-group] UTF-8 versus extended ASCII (Joe Krahn)
- Re: [ddlm-group] UTF-8 versus extended ASCII (Nick Spadaccini)