Re: [ddlm-group] UTF-8 versus extended ASCII

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 versus extended ASCII
From: Nick Spadaccini <[email protected]>
Date: Wed, 11 Nov 2009 15:54:02 +0800
In-Reply-To: <[email protected]>

I wouldn't specify what you suggest in STAR or CIF, but it is an excellent
approach to implementation. By parsing as an extended ASCII encoding you can
tokenize completely and validate the correctness of the file structure. Once
you have the tokens you can then parse them to ensure they are valid UTF-8
encodings.
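
A minimal sketch of that two-pass idea, assuming for illustration that the
only delimiters are ASCII white space (a real STAR/CIF lexer has more token
rules). The names tokenize_bytes and invalid_utf8_tokens are hypothetical and
not part of any existing STAR or CIF tool. It relies on the fact that every
byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII delimiter
byte can never appear inside one.

```python
# Sketch only: a byte-level splitter, not a real STAR/CIF lexer.
ASCII_WHITESPACE = b" \t\r\n"  # assumed delimiter set for this illustration


def tokenize_bytes(data: bytes) -> list[bytes]:
    """Pass 1: split on ASCII white space; bytes 128-255 pass through as-is."""
    tokens = []
    current = bytearray()
    for byte in data:
        if byte in ASCII_WHITESPACE:
            if current:
                tokens.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens


def invalid_utf8_tokens(tokens: list[bytes]) -> list[bytes]:
    """Pass 2: return the tokens that do not decode as valid UTF-8."""
    bad = []
    for token in tokens:
        try:
            token.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(token)
    return bad


# One ASCII token, one valid UTF-8 token, one token with a stray 0xFF byte.
sample = "_name caf\u00e9".encode("utf-8") + b" bad\xff"
tokens = tokenize_bytes(sample)
print(tokens)
print(invalid_utf8_tokens(tokens))
```

The first pass never inspects the high bytes, so it can skip through a large
file cheaply; only tokens that are actually wanted need the second pass.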

The relationship between STAR and CIF is that CIF is strictly a subset of
STAR. We tend to define everything in the widest application in STAR and
then take pragmatics into account for CIF. (I do realise that, at the byte
level, extended ASCII is the superset of UTF-8, so you are not
contradicting what I write above.) However, I feel we should specify
exactly how to interpret the stream in STAR, rather than suggest it is a
stream of bytes out of which any encoding can be extracted.

On 11/11/09 3:40 AM, "Joe Krahn" <[email protected]> wrote:

> One reason I suggested extended ASCII was thinking in terms of the
> low-level parser processing in terms of bytes rather than parsing a full
> UTF-8 character set. Even if the entire file is UTF-8, the lexer should
> be able to "think" in terms of bytes, similar to current parsers, but
> with characters 128-255 passed as valid printable text. This requires
> the delimiter characters to be defined in the 7-bit ASCII range, for
> underscore, reverse solidus, quotes and white space. The lexer should
> also not have to actually process or validate UTF-8 sequences, but just
> pass the 128-255 characters. This makes it much more efficient to parse
> through sections of a large file without saving data.
> 
> My idea is that the lexing syntax is defined by STAR, which could simply
> define characters 128-255 as allowed characters. CIF2 can then define
> that those characters are UTF-8. Of course, I am thinking in terms of
> STAR defining the low-level syntax, and CIF being the high-level syntax.
> In any case, it is useful to keep the lexing tokens as plain 7-bit ASCII.
> 
> Joe
> 
> Nick Spadaccini wrote:
>> Thanks, Herb. That was what I thought from what I could decipher of the W3C
>> documentation, though I didn't appreciate the requirement to handle UTF-16
>> also.
>> 
>> But for the record we have been talking about delimiters and other special
>> token characters and we are talking about the ASCII set of those characters
>> aren't we? When I say \  (reverse solidus, RS) has special significance I am
>> writing my parser to look for 0x5c, not the other possible reverse solidi(?)
>> such as 0xEF 0xB9 0xA8 (small RS) or 0xEF 0xBC 0xBC (fullwidth RS). Same for
>> quotes, double quotes etc.
>> 
>> On 10/11/09 8:16 PM, "Herbert J. Bernstein" <[email protected]>
>> wrote:
>> 
>>> Dear Colleagues,
>>>    The basic answer is, yes, XML does accept much more than the ASCII
>>> characters in its tags.  XML explicitly requires all processors of XML to
>>> be able to handle both UTF-8 and UTF-16, but restricts itself to the
>>> following subset:
>>>      #x9
>>>      #xA
>>>      #xD
>>>      #x20-#xD7FF
>>>      #xE000-#xFFFD
>>>      #x10000-#x10FFFF
>>> 
>>> 
>>>      #x100
>>> =========
>>> and "Document authors are encouraged to avoid 'compatibility characters'"
>>> and certain control characters or permanently undefined Unicode
>>> characters.
>>> 
>>> The whitespace characters are defined as space, carriage return, line feed
>>> or the horizontal tab, but XML discourages explicit use of carriage return.
>>> Carriage returns must be removed or replaced by new-line before processing.
>>> 
>>> Names may begin with
>>>      :
>>>      A-Z
>>>      _
>>>      a-z
>>>      #xC0-#xD6
>>>      #xD8-#xF6
>>>      #xF8-#x2FF
>>>      #x370-#x37D
>>>      #x37F-#x1FFF
>>>      #x200C-#x200D
>>>      #x2070-#x218F
>>>      #x2C00-#x2FEF
>>>      #x3001-#xD7FF
>>>      #xF900-#xFDCF
>>>      #xFDF0-#xFFFD
>>>      #x10000-#xEFFFF
>>> 
>>> and may continue with those plus
>>>      -
>>>      .
>>>      0-9
>>>      #xB7
>>>      #x0300-#x036F
>>>      #x203F-#x2040
>>> 
>>> Does that answer the question?  There is more.
>>> 
>>>    Regards,
>>>      Herbert
>>> 
>>> 
>>> ============================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>> 
>>>                   +1-631-244-3035
>>>                   [email protected]
>>> =====================================================
>>> 
>>> On Tue, 10 Nov 2009, Nick Spadaccini wrote:
>>> 
>>>> I agree with James on this one. It is specified as UTF-8, so that is what
>>>> you expect. Most of the files will be pure ASCII, as they are now, but over
>>>> time that will change. If we say it can be extended ASCII, which is ALMOST
>>>> (but not) the same as UTF-8 then I can only see confusion with users.
>>>> 
>>>> I was discussing the move to UTF-8 with Syd the other day. He posed a
>>>> question, the answer for which I took for granted, but now I am wondering.
>>>> 
>>>> The specification for STAR is broad so it will say encoding is UTF-8. But
>>>> when it comes to specific instances like CIF are we thinking that the data
>>>> names in the file will still be restricted to the ASCII subset of UTF-8? I
>>>> must admit I have been thinking of UTF-8 in terms of the data values, not
>>>> in terms of the data tags.
>>>> 
>>>> I have been trying to work out if XML accepts UTF-8 characters in the
>>>> strings that define start- and end-tags (the elements). It looks like it
>>>> does, but every example I have seen works with the ASCII character set.
>>>> 
>>>> Does anybody know the answer?
>>>> 
>>>> 
>>>> On 6/11/09 11:31 PM, "Joe Krahn" <[email protected]> wrote:
>>>> 
>>>>> Traditionally, non-ASCII characters are encoded as "extended" ASCII,
>>>>> using character codes 128-255. UTF-8 gained broad support because it
>>>>> still fits this design, even though it encodes many more non-ASCII
>>>>> characters.
>>>>> 
>>>>> My suggestion is to define the low-level STAR2/CIF2 syntax as allowing
>>>>> characters 128-255, but not specifically declaring UTF-8 encoding. It is
>>>>> almost the same, but has a few potential advantages.
>>>>> 
>>>>> First, it becomes a bit more sensible for the DDL to declare where UTF-8
>>>>> is allowed, rather than excluding it from all of the other strings. I
>>>>> assume that UTF-8 is intended mainly for publication-oriented formatted
>>>>> text, but the numerous label strings will remain ASCII. If not, it still
>>>>> follows the original STAR/CIF idea where the exact details of string
>>>>> encoding are left to the DDL.
>>>>> 
>>>>> Second, generic 8-bit extended ASCII would make it easier to efficiently
>>>>> encode binary data, with 7-bits of raw binary data per byte. It has half
>>>>> the overhead of Base64, and does not require mapping characters in a
>>>>> look-up table. It is not as efficient as embedding binary in UCS-2, but
>>>>> it also does not have the UCS-2 overhead for all of the non-binary CIF
>>>>> files.
>>>>> 
>>>>> The advantage of UCS-2 is that its characters fit easily into short
>>>>> fixed-length strings, and sub-strings can be manipulated much more
>>>>> efficiently. That
>>>>> is why Java and the MS-Windows kernel use UCS-2. UTF-8 is more efficient
>>>>> for storage, which is one reason MS-Windows does not default to UCS-2
>>>>> for text files. Therefore, in my opinion, UTF-8 is better suited to an
>>>>> archival format. However, UCS-2 might really be a better choice for
>>>>> mostly-binary CIF files. It would be nice for UCS-2 CIF beginning with
>>>>> the BOM encoding mark to also be a valid CIF alternative, instead of
>>>>> just a hacked pseudo-CIF.
>>>>> 
>>>>> If CIF still wants to go with global UTF-8 encoding, maybe the low-level
>>>>> STAR syntax can be updated to define a more generic encoding. Herbert
>>>>> mentioned that using "not exactly CIF" often is useful to get work done,
>>>>> when the strict CIF format gets in the way. It would be nice if these
>>>>> sorts of files could at least stick to STAR syntax to avoid running into
>>>>> incompatibilities.
>>>>> 
>>>>> OTOH, I am much more picky about proper syntax standards than most
>>>>> people. Maybe this group is happy to declare standard CIF as UTF-8, and
>>>>> leave any alternative forms as a customised, non-standard CIF.
>>>>> 
>>>>> Joe Krahn
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> [email protected]
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>> cheers
>>>> 
>>>> Nick
>>>> 
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>> 
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>> MBDP  M002
>>>> 
>>>> CRICOS Provider Code: 00126G
>>>> 
>>>> e: [email protected]
>> 
>> cheers
>> 
>> Nick

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: [email protected]



