
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Sat, 10 Oct 2009 13:02:35 -0400 (EDT)
In-Reply-To: <20091010164455.GA7668@emerald.iucr.org>
References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com><279aad2a0910100513u1e9ef18dua5f984cc20ac9a9b@mail.gmail.com><20091010125924.GA7536@emerald.iucr.org><20091010145830.GA7607@emerald.iucr.org><20091010113728.W18936@epsilon.pair.com><20091010164455.GA7668@emerald.iucr.org>

Yes, most modern Fortrans cannot tell the difference between UTF-8 and
ASCII.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, Brian McMahon wrote:

> Dear Herbert
>
> Thanks for the clarification. I've now read
>   http://en.wikipedia.org/wiki/UTF-8
> :-)
>
> It seems to me that the STAR spec still needs to be modified to
> state explicitly that its allowed character set is Unicode as
> expressed in UTF-8 encoding.
>
> I note also from the above Wikipedia entry that there is some
> latitude in practices for handling invalid byte sequences (and to some
> extent invalid code points). I think we should consider whether the
> full STAR/CIF1.2 specs should formalise exception handling procedures
> in such cases.
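
[Illustrative aside, not part of the original message.] The latitude in handling invalid byte sequences mentioned above can be seen in a minimal Python sketch. The sample bytes are an assumption chosen for illustration (a Latin-1 encoded word, which is not valid UTF-8), and the three policies shown are simply the ones Python's decoder offers, not ones proposed for STAR or CIF.

```python
# Assumed sample: "café" encoded as ISO 8859-1; the lone byte 0xE9 is invalid UTF-8.
bad = b"caf\xe9"

# Policy 1: reject the input outright (the decoder's default, "strict").
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: substitute U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = bad.decode("utf-8", errors="replace")

# Policy 3: silently drop the bad byte.
ignored = bad.decode("utf-8", errors="ignore")

print(strict_ok, repr(replaced), repr(ignored))
```

Each policy yields a different data value from the same bytes, which is why a specification may want to state which behaviour conforming software should adopt.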
>
> Regards
> Brian
>
> PS Just for my own information, does the statement
>  > From the point of view of any
>  > C-program intended to work with the 256-character ISO character sets,
>  > a UTF-8 string handles just the same as an ISO string.
> hold equally well for modern Fortran applications?
>
> On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    There is a misunderstanding about UTF-8.  From the point of view of any
>> C-program intended to work with the 256-character ISO character sets,
>> a UTF-8 string handles just the same as an ISO string.  The major
>> differences are that the bottom 128 characters are the US national variant
>> we call ASCII, and the second 128 characters that in the past would have
>> had the accented and special characters needed to handle the western
>> European languages in an ASCII environment have been replaced with the
>> variable length encodings for a 31 bit character set.  That is what is
>> nice about UTF-8 -- it is actually using what should be printable
>> characters to do its encoding, avoiding anything that looks like
>> binary data.
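
[Illustrative aside, not part of the original message.] The byte-level properties described above can be checked with a minimal Python sketch. The sample string is an assumption chosen for illustration: a word containing two accented Latin letters.

```python
# Assumed sample text: "Angstrom" with A-ring (U+00C5) and o-umlaut (U+00F6).
text = "\u00c5ngstr\u00f6m"
data = text.encode("utf-8")

# ASCII characters keep their single-byte ASCII values (below 0x80).
ascii_bytes = [b for b in data if b < 0x80]

# Every byte belonging to a multi-byte sequence has its high bit set.
high_bytes = [b for b in data if b >= 0x80]

# No NUL byte appears, so NUL-terminated C strings are not cut short.
has_nul = 0 in data

print(len(text), len(data), has_nul, bytes(ascii_bytes))
```

Byte-oriented code therefore copies, compares and searches such strings as it would an 8-bit ISO string. What it cannot do is count characters: here 8 characters occupy 10 bytes.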
>>
>>    UTF-16/UCS-2 is different.  There you have a lot that looks like binary
>> when working in an ASCII world, and you need special libraries (for wide
>> characters) to deal with them, unless you are working in Java or with a
>> browser, where that is the native encoding.
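
[Illustrative aside, not part of the original message.] A minimal Python sketch of why UTF-16 "looks like binary" to byte-oriented code, and why byte order matters for it but not for UTF-8. The three-letter sample string is an assumption chosen for illustration.

```python
# Assumed sample text: plain ASCII letters only.
text = "CIF"

utf8 = text.encode("utf-8")        # one byte per ASCII character
le = text.encode("utf-16-le")      # little-endian 16-bit code units
be = text.encode("utf-16-be")      # big-endian 16-bit code units

print(utf8)  # b'CIF'
print(le)    # b'C\x00I\x00F\x00'
print(be)    # b'\x00C\x00I\x00F'
```

The UTF-16 forms contain zero bytes, which a C string routine would treat as a terminator, and the two byte orders differ. UTF-8 is a sequence of single bytes, so it has no byte-order variants.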
>>
>>    We are in the midst of a painful, worldwide transition in which we have
>> a mixture of:
>>
>>    1.  The code-page based character encodings based on the multiple
>> ISO national variants.  ASCII is just the US national variant.
>>    2.  The UTF-16/UCS-2 version of Unicode heavily adopted by many hardware
>> vendors and used as the native encoding in many operating systems and all
>> browsers
>>    3.  The UTF-8 version of Unicode, extensively adopted in Linux-based
>> applications and slowly being accepted in almost all operating systems.
>>
>> My guess is that by 10 years from now, UTF-8 will have been fairly
>> completely adopted except for some legacy Java and browser UCS-2
>> stuff.
>>
>>    My suggestion would be to try to support ASCII, UCS-2 and UTF-8 for the
>> moment and work towards joining the march towards UTF-8.
>>
>>    Regards,
>>      Herbert
>>
>> On Sat, 10 Oct 2009, Brian McMahon wrote:
>>
>>> Regarding the adoption of the Unicode character set, I agree that
>>> this would make it easier to accommodate accented and non-Latin
>>> characters and symbols, and I see no reason to oppose implementing
>>> it as a UTF-8 encoding, and so I vote 3.2.
>>>
>>> (It's not a panacea, especially for maths, where new symbols can
>>> always be invented, and one must be able to specify a two-dimensional
>>> layout as well as just the glyphs, so we shall still need other
>>> approaches for various types of "rich" text.)
>>>
>>> However, this is a binary encoding, is it not, and so the underlying
>>> STAR specification must be modified to accommodate this. (I'm afraid
>>> I haven't got Nick's draft paper for the revised STAR specification
>>> to hand, so I apologise if that's already been addressed.)
>>>
>>> Does it raise issues of endian-ness? If we are introducing binary
>>> encodings, are there any reasons to restrict the character set
>>> encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
>>> and (ii) in CIF? And, ultimately, is there a prospect of extending
>>> the STAR spec in a way that properly accommodates at least the CBF
>>> implementation, and possibly other binary data incorporation?
>>>
>>> I am happy in this case that handling by "old" CIF software can
>>> be done by adopting a protocol that allows UTF-8 Unicode characters
>>> to be represented by ASCII encodings such as \u27. (I don't think
>>> that we need specify a protocol at this point, just be sure that
>>> one can be defined if needed.)
>>>
>>> I again draw attention to the amusing fact that with an ASCII
>>> Unicode encoding, "O\u27Neill" is a valid data value under the
>>> current proposals, whereas the UTF-8 equivalent would not be,
>>> because the UTF-8 encoding of ' is just ' !
>>>
>>> Brian
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
