Re: [ddlm-group] UTF-8 versus extended ASCII
- To: "Nick.Spadaccini@uwa.edu.au" <Nick.Spadaccini@uwa.edu.au>, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Tue, 10 Nov 2009 14:40:49 -0500
- In-Reply-To: <C71F8165.123CB%nick@csse.uwa.edu.au>
- References: <C71F8165.123CB%nick@csse.uwa.edu.au>
One reason I suggested extended ASCII was so that the low-level parser can process bytes rather than parse the full UTF-8 character set. Even if the entire file is UTF-8, the lexer should be able to "think" in terms of bytes, similar to current parsers, but with characters 128-255 passed as valid printable text. This requires the delimiter characters (underscore, reverse solidus, quotes and white space) to be defined in the 7-bit ASCII range. The lexer should also not have to actually process or validate UTF-8 sequences, but just pass the 128-255 characters through. This makes it much more efficient to parse through sections of a large file without saving data.

My idea is that the lexing syntax is defined by STAR, which could simply define characters 128-255 as allowed characters. CIF2 can then define that those characters are UTF-8. Of course, I am thinking in terms of STAR defining the low-level syntax, and CIF being the high-level syntax. In any case, it is useful to keep the lexing tokens as plain 7-bit ASCII.

Joe

Nick Spadaccini wrote:
> Thanks, Herb. That was what I thought from what I could decipher of the W3C
> documentation, though I didn't appreciate the requirement to handle UTF-16
> also.
>
> But for the record we have been talking about delimiters and other special
> token characters and we are talking about the ASCII set of those characters
> aren't we? When I say \ (reverse solidus, RS) has special significance I am
> writing my parser to look for 0x5c, not the other possible reverse solidi(?)
> such as 0xEF 0xB9 0xA8 (small RS) or 0xEF 0xBC 0xBC (fullwidth RS). Same for
> quotes, double quotes etc.
>
> On 10/11/09 8:16 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Dear Colleagues,
>> The basic answer is, yes, XML does accept much more than the ASCII
>> characters in its tags. XML explicitly requires all processors of XML to
>> be able to handle both UTF-8 and UTF-16, but restricts itself to the
>> following subset:
>> #x9
>> #xA
>> #xD
>> #x20-#xD7FF
>> #xE000-#xFFFD
>> #x10000-#x10FFFF
>>
>>
>> #x100
>> =========
>> and "Document authors are encouraged to avoid 'compatibility characters' "
>> and certain control characters or permanently undefined Unicode
>> characters.
>>
>> The whitespace characters are defined as space, carriage return, line feed
>> or the horizontal tab, but explicit use of carriage return is discouraged.
>> Carriage returns are required to be removed or replaced by new-line before
>> processing.
>>
>> Names may begin with
>> :
>> A-Z
>> _
>> a-z
>> #xC0-#xD6
>> #xD8-#xF6
>> #xF8-#x2FF
>> #x370-#x37D
>> #x37F-#x1FFF
>> #x200C-#x200D
>> #x2070-#x218F
>> #x2C00-#x2FEF
>> #x3001-#xD7FF
>> #xF900-#xFDCF
>> #xFDF0-#xFFFD
>> #x10000-#xEFFFF
>>
>> and may continue with those plus
>> -
>> .
>> 0-9
>> #xB7
>> #x0300-#x036F
>> #x203F-#x2040
>>
>> Does that answer the question? There is more.
>>
>> Regards,
>> Herbert
>>
>>
>> ============================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 10 Nov 2009, Nick Spadaccini wrote:
>>
>>> I agree with James on this one. It is specified as UTF-8, so that is what
>>> you expect. Most of the files will be pure ASCII, as they are now, but over
>>> time that will change. If we say it can be extended ASCII, which is ALMOST
>>> (but not) the same as UTF-8 then I can only see confusion with users.
>>>
>>> I was discussing the move to UTF-8 with Syd the other day. He posed a
>>> question, the answer for which I took for granted, but now I am wondering.
>>>
>>> The specification for STAR is broad so it will say encoding is UTF-8. But
>>> when it comes to specific instances like CIF are we thinking that the data
>>> names in the file will still be restricted to the ASCII subset of UTF-8? I
>>> must admit I have been thinking of UTF-8 in terms of the data values, not in
>>> terms of the data tags.
>>>
>>> I have been trying to work out if XML accepts UTF-8 characters in the
>>> strings that define start- and end-tags (the elements). It looks like they
>>> do but every example I have seen works with the ASCII character set.
>>>
>>> Anybody know the answer?
>>>
>>>
>>> On 6/11/09 11:31 PM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>>>
>>>> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
>>>> using character codes 128-255. UTF-8 gained broad support because it
>>>> still fits this design, even though it encodes many more non-ASCII
>>>> characters.
>>>>
>>>> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
>>>> characters 128-255, but not specifically declaring UTF-8 encoding. It is
>>>> almost the same, but has a few potential advantages.
>>>>
>>>> First, it becomes a bit more sensible for the DDL to declare where UTF-8
>>>> is allowed, rather than excluding it from all of the other strings. I
>>>> assume that UTF-8 is intended mainly for publication-oriented formatted
>>>> text, but the numerous label strings will remain ASCII. If not, it still
>>>> follows the original STAR/CIF idea where the exact details of string
>>>> encoding are left to the DDL.
>>>>
>>>> Second, generic 8-bit extended ASCII would make it easier to efficiently
>>>> encode binary data, with 7-bits of raw binary data per byte. It has half
>>>> the overhead of Base64, and does not require mapping characters in a
>>>> look-up table. It is not as efficient as embedding binary in UCS-2, but
>>>> it also does not have the UCS-2 overhead for all of the non-binary CIF
>>>> files.
>>>>
>>>> The advantage of UCS-2 is that its characters fit easily into short
>>>> fixed-length strings, and sub-strings are much more efficient to
>>>> manipulate. That is why Java and the MS-Windows kernel use UCS-2. UTF-8
>>>> is more efficient for storage, which is one reason MS-Windows does not
>>>> default to UCS-2 for text files. Therefore, in my opinion, UTF-8 is
>>>> better suited to an archival format. However, UCS-2 might really be a
>>>> better choice for mostly-binary CIF files. It would be nice for UCS-2 CIF
>>>> beginning with the BOM encoding mark to also be a valid CIF alternative,
>>>> instead of just a hacked pseudo-CIF.
>>>>
>>>> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
>>>> STAR syntax can be updated to define a more generic encoding. Herbert
>>>> mentioned that using "not exactly CIF" often is useful to get work done,
>>>> when the strict CIF format gets in the way. It would be nice if these
>>>> sorts of files could at least stick to STAR syntax to avoid running into
>>>> incompatibilities.
>>>>
>>>> OTOH, I am much more picky about proper syntax standards than most
>>>> people. Maybe this group is happy to declare standard CIF as UTF-8, and
>>>> leave any alternative forms as a customised, non-standard CIF.
>>>>
>>>> Joe Krahn
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
>>> MBDP M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>
> cheers
>
> Nick
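
The byte-oriented lexing described at the top of this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the STAR or CIF2 grammar: the function names (`is_text_byte`, `lex`), the token kinds, and the simplified quoting rule (a quoted string ends at the next matching quote byte) are all hypothetical. The point it shows is that every delimiter is a 7-bit ASCII byte and that bytes 128-255 are passed through without being decoded or validated. This is safe for UTF-8 input because every byte of a multi-byte UTF-8 sequence is in the range 128-255, so an ASCII delimiter byte never occurs inside one.

```python
# Delimiters are defined purely in the 7-bit ASCII range.
WHITESPACE = frozenset(b" \t\r\n")
QUOTES = frozenset(b"'\"")
UNDERSCORE = 0x5F  # leading byte of a data name (tag)


def is_text_byte(b: int) -> bool:
    """Printable 7-bit ASCII, or any byte 128-255 passed through unexamined."""
    return 0x21 <= b <= 0x7E or b >= 0x80


def lex(data: bytes):
    """Yield (kind, raw_bytes) tokens without decoding or validating UTF-8.

    Simplifying assumption: a quoted string runs to the next matching
    quote byte; the real STAR/CIF quoting rules are more involved.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in WHITESPACE:
            i += 1
        elif b in QUOTES:
            end = data.find(bytes([b]), i + 1)
            if end == -1:
                raise ValueError(f"unterminated quote at byte offset {i}")
            yield ("quoted", data[i + 1:end])
            i = end + 1
        elif is_text_byte(b):
            start = i
            while i < n and is_text_byte(data[i]):
                i += 1
            kind = "tag" if data[start] == UNDERSCORE else "value"
            yield (kind, data[start:i])
        else:
            raise ValueError(f"unexpected control byte 0x{b:02x} at offset {i}")


if __name__ == "__main__":
    sample = "_name 'Å ngström' 1.5Å\n".encode("utf-8")
    for kind, raw in lex(sample):
        print(kind, raw)
```

In this sketch the tokens stay as raw bytes; deciding that bytes 128-255 are UTF-8, and decoding them, is left to a higher layer, which mirrors the proposed split between STAR (low-level syntax) and CIF2 (encoding).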