Re: [ddlm-group] UTF-8 versus extended ASCII

To: "Nick.Spadaccini@uwa.edu.au" <Nick.Spadaccini@uwa.edu.au>, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
From: Joe Krahn <krahn@niehs.nih.gov>
Date: Tue, 10 Nov 2009 14:40:49 -0500
In-Reply-To: <C71F8165.123CB%nick@csse.uwa.edu.au>
References: <C71F8165.123CB%nick@csse.uwa.edu.au>

One reason I suggested extended ASCII is that I was thinking of the
low-level parser working in terms of bytes rather than decoding the full
UTF-8 character set. Even if the entire file is UTF-8, the lexer should
be able to "think" in terms of bytes, similar to current parsers, but
with byte values 128-255 passed through as valid printable text. This
requires the delimiter characters (underscore, reverse solidus, quotes
and white space) to be defined in the 7-bit ASCII range. The lexer should
also not have to actually process or validate UTF-8 sequences, but just
pass the 128-255 bytes through. This makes it much more efficient to
parse through sections of a large file without saving data.
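
A minimal sketch of the byte-oriented lexing described above, in Python. The
function name, token kinds and the simplified quoting rule are assumptions made
for illustration; they are not taken from the STAR or CIF specifications. The
point is only that every decision is made on 7-bit ASCII bytes, while bytes
0x80-0xFF are passed through without being decoded or validated.

```python
# Hypothetical byte-level tokenizer: names and token kinds are illustrative,
# not taken from any STAR/CIF specification.
WHITESPACE = b" \t\r\n"
QUOTES = b"'\""


def lex_bytes(data: bytes):
    """Yield (kind, raw_bytes) tokens, splitting only on 7-bit ASCII delimiters.

    Bytes 0x80-0xFF are treated as ordinary printable text and are never
    decoded or validated here.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in WHITESPACE:
            i += 1
            continue
        if b in QUOTES:
            # Simplified rule (an assumption): a quoted string runs to the
            # next identical ASCII quote byte.
            end = data.find(bytes([b]), i + 1)
            if end == -1:
                raise ValueError("unterminated quoted string")
            yield ("string", data[i + 1:end])
            i = end + 1
            continue
        # Bare word: a run of non-whitespace bytes, high bytes included.
        j = i
        while j < n and data[j] not in WHITESPACE:
            j += 1
        word = data[i:j]
        yield ("tag" if word.startswith(b"_") else "value", word)
        i = j
```

Because the tokens are returned as raw bytes, a caller can skip over values it
does not need without ever decoding them.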

My idea is that the lexing syntax is defined by STAR, which could simply 
define byte values 128-255 as allowed characters. CIF2 can then define
that those characters are UTF-8. Of course, I am thinking in terms of 
STAR defining the low-level syntax, and CIF being the high-level syntax. 
In any case, it is useful to keep the lexing tokens as plain 7-bit ASCII.
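
The layering above can be sketched as follows, again in Python and under stated
assumptions: the delimiter set and both function names are hypothetical. The
sketch relies on a known property of UTF-8, namely that every byte of a
multi-byte sequence is in the range 0x80-0xFF, so a 7-bit ASCII delimiter byte
can never appear inside an encoded non-ASCII character. Strict UTF-8 decoding
is left to a separate, higher-level step.

```python
ASCII_DELIMITERS = b"_\\'\" \t\r\n"  # assumed delimiter set, for illustration only


def contains_ascii_delimiter(raw: bytes) -> bool:
    """True if any byte of raw is one of the 7-bit ASCII delimiter bytes."""
    return any(b in ASCII_DELIMITERS for b in raw)


def decode_cif2_value(raw: bytes) -> str:
    """Hypothetical CIF2-layer step: interpret a lexed token as strict UTF-8."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid sequences


# Look-alike reverse solidi encode to high bytes only, never to 0x5C.
small_rs = "\uFE68".encode("utf-8")      # SMALL REVERSE SOLIDUS
fullwidth_rs = "\uFF3C".encode("utf-8")  # FULLWIDTH REVERSE SOLIDUS
```

A lexer that looks only for the byte 0x5C therefore cannot mistake either of
these look-alike characters for the ASCII reverse solidus.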

Joe

Nick Spadaccini wrote:
> Thanks, Herb. That was what I thought from what I could decipher of the W3C
> documentation, though I didn't appreciate the requirement to handle UTF-16
> also.
> 
> But for the record we have been talking about delimiters and other special
> token characters and we are talking about the ASCII set of those characters
> aren't we? When I say \  (reverse solidus, RS) has special significance I am
> writing my parser to look for 0x5c, not the other possible reverse solidi(?)
> such as 0xEF 0xB9 0xA8 (small RS) or 0xEF 0xBC 0xBC (fullwidth RS). Same for
> quotes, double quotes etc.
> 
> On 10/11/09 8:16 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
> 
>> Dear Colleagues,
>>    The basic answer is, yes, XML does accept much more than the ASCII
>> characters in its tags.  XML explicitly requires all processors of XML to
>> be able to handle both UTF-8 and UTF-16, but restricts itself to the
>> following subset:
>>      #x9
>>      #xA
>>      #xD
>>      #x20-#xD7FF
>>      #xE000-#xFFFD
>>      #x10000-#x10FFFF
>>
>>
>>      #x100
>> =========
>> and "Document authors are encouraged to avoid 'compatibility characters'",
>> as well as certain control characters and permanently undefined Unicode
>> characters.
>>
>> The whitespace characters are defined as space, carriage return, line feed
>> or the horizontal tab, but explicit use of carriage return is discouraged.
>> Carriage returns are required to be removed or replaced by a line feed
>> before processing.
>>
>> Names may begin with
>>      :
>>      A-Z
>>      _
>>      a-z
>>      #xC0-#xD6
>>      #xD8-#xF6
>>      #xF8-#x2FF
>>      #x370-#x37D
>>      #x37F-#x1FFF
>>      #x200C-#x200D
>>      #x2070-#x218F
>>      #x2C00-#x2FEF
>>      #x3001-#xD7FF
>>      #xF900-#xFDCF
>>      #xFDF0-#xFFFD
>>      #x10000-#xEFFFF
>>
>> and may continue with those plus
>>      -
>>      .
>>      0-9
>>      #xB7
>>      #x0300-#x036F
>>      #x203F-#x2040
>>
>> Does that answer the question?  There is more.
>>
>>    Regards,
>>      Herbert
>>
>>
>> ============================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 10 Nov 2009, Nick Spadaccini wrote:
>>
>>> I agree with James on this one. It is specified as UTF-8, so that is what
>>> you expect. Most of the files will be pure ASCII, as they are now, but over
>>> time that will change. If we say it can be extended ASCII, which is ALMOST
>>> (but not) the same as UTF-8 then I can only see confusion with users.
>>>
>>> I was discussing the move to UTF-8 with Syd the other day. He posed a
>>> question, the answer for which I took for granted, but now I am wondering.
>>>
>>> The specification for STAR is broad so it will say encoding is UTF-8. But
>>> when it comes to specific instances like CIF are we thinking that the data
>>> names in the file will still be restricted to the ASCII subset of UTF-8? I
>>> must admit I have been thinking of UTF-8 in terms of the data values, not in
>>> terms of the data tags.
>>>
>>> I have been trying to work out if XML accepts UTF-8 characters in the
>>> strings that define start- and end-tags (the elements). It looks like they
>>> do but every example I have seen works with the ASCII character set.
>>>
>>> Anybody know the answer?
>>>
>>>
>>> On 6/11/09 11:31 PM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>>>
>>>> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
>>>> using character codes 128-255. UTF-8 gained broad support because it
>>>> still fits this design, even though it encodes many more non-ASCII
>>>> characters.
>>>>
>>>> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
>>>> characters 128-255, but not specifically declaring UTF-8 encoding. It is
>>>> almost the same, but has a few potential advantages.
>>>>
>>>> First, it becomes a bit more sensible for the DDL to declare where UTF-8
>>>> is allowed, rather than excluding it from all of the other strings. I
>>>> assume that UTF-8 is intended mainly for publication-oriented formatted
>>>> text, but the numerous label strings will remain ASCII. If not, it still
>>>> follows the original STAR/CIF idea where the exact details of string
>>>> encoding are left to the DDL.
>>>>
>>>> Second, generic 8-bit extended ASCII would make it easier to efficiently
>>>> encode binary data, with 7 bits of raw binary data per byte. It has half
>>>> the overhead of Base64, and does not require mapping characters in a
>>>> look-up table. It is not as efficient as embedding binary in UCS-2, but
>>>> it also does not have the UCS-2 overhead for all of the non-binary CIF
>>>> files.
>>>>
>>>> The advantage of UCS-2 is that its characters easily fit into short
>>>> fixed-length strings, and sub-strings are much cheaper to manipulate. That
>>>> is why Java and the MS-Windows kernel use UCS-2. UTF-8 is more efficient
>>>> for storage, which is one reason MS-Windows does not default to UCS-2
>>>> for text files. Therefore, in my opinion, UTF-8 is better suited to an
>>>> archival format. However, UCS-2 might really be a better choice for
>>>> mostly-binary CIF files. It would be nice for UCS-2 CIF beginning with
>>>> the BOM encoding mark to also be a valid CIF alternative, instead of
>>>> just a hacked pseudo-CIF.
>>>>
>>>> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
>>>> STAR syntax can be updated to define a more generic encoding. Herbert
>>>> mentioned that using "not exactly CIF" often is useful to get work done,
>>>> when the strict CIF format gets in the way. It would be nice if these
>>>> sorts of files could at least stick to STAR syntax to avoid running into
>>>> incompatibilities.
>>>>
>>>> OTOH, I am much more picky about proper syntax standards than most
>>>> people. Maybe this group is happy to declare standard CIF as UTF-8, and
>>>> leave any alternative forms as a customised, non-standard CIF.
>>>>
>>>> Joe Krahn
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
> 
> cheers
> 
> Nick

