
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear all

Before this thread diverges into a deeper discussion of UTF-8 and Unicode, can I ask for clarification of a few points?
As an observer, it seems to me that this thread has been 'stumbling' because of some fundamental issues with respect to
adopting DDLm. I now find myself questioning my understanding of the situation. At the risk of sounding as if I'm just repeating some of the recent comments from Brian and James (or indeed that I shouldn't have been asked to listen in at all), I've been observing these discussions under the assumption that:

1) it was already accepted that CIF1.2 is going to have to be treated as a distinct format, requiring new CIF1.2-enabled software. The new software should be backwards-compatible - able to read/write CIF1.1 if required. This is not an uncommon scenario (e.g. in the world of word-processing software - the latest format will not be readable by programs written for the previous formats, but programs supporting the latest format will be able to convert between the old and new). This is an acceptable annoyance if the new format markedly enhances the old format?

2) The general aim is to make the transition between the old and new as painless as possible, but not at the expense of realizing the benefits of the new?

3) The sooner the specs for the new are made available the better - so that developers can at least keep them in mind when they work on their projects - whether it be a fully fledged CIF reader/writer, or a program that just accepts CIF as a data source.

Forgive me if I'm off the mark with my assumptions, or if I'm going over ground you've already covered (being a newcomer, I'm afraid I may not be up to speed on all this, though as someone who may well be involved in implementing whatever is decided upon, even my ignorance may be of use to you when it comes to considerations of how the changes may be handled by interested parties).

Cheers

Simon

Simon P. Westrip




From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Saturday, 10 October, 2009 18:02:35
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Yes, most modern Fortrans cannot tell the difference between UTF-8 and
ascii.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, Brian McMahon wrote:

> Dear Herbert
>
> Thanks for the clarification. I've now read
> http://en.wikipedia.org/wiki/UTF-8
> :-)
>
> It seems to me that the STAR spec still needs to be modified to
> state explicitly that its allowed character set is Unicode as
> expressed in UTF-8 encoding.
>
> I note also from the above Wikipedia entry that there is some
> latitude in practices for handling invalid byte sequences (and to some
> extent invalid code points). I think we should consider whether the
> full STAR/CIF1.2 specs should formalise exception handling procedures
> in such cases.
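To make the "latitude in practices" concrete, here is a minimal Python sketch of three common ways a reader might treat an invalid UTF-8 byte sequence: reject it, substitute U+FFFD, or silently drop it. The helper name `decode_strict` and the sample bytes are assumptions for illustration only; nothing here is taken from a STAR or CIF specification.

```python
# ISO-8859-1 bytes for "café"; a lone 0xE9 byte is not valid UTF-8.
bad = b"caf\xe9"

def decode_strict(data):
    """Hypothetical helper: return the decoded text, or None if invalid."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None

# Policy 1: treat the whole value as an error.
strict_result = decode_strict(bad)

# Policy 2: keep going, marking the damage with U+FFFD.
replaced = bad.decode("utf-8", errors="replace")

# Policy 3: keep going, silently dropping the offending byte.
ignored = bad.decode("utf-8", errors="ignore")
```

The three policies give three different data values for the same input bytes, which is the kind of divergence a formal exception-handling rule would remove.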
>
> Regards
> Brian
>
> PS Just for my own information, does the statement
>  > From the point of view of any
>  > C-program intended to work with the 256-character ISO character sets,
>  > a UTF-8 string handles just the same as an ISO string.
> hold equally well for modern Fortran applications?
>
> On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    There is a misunderstanding about UTF-8.  From the point of view of any
>> C-program intended to work with the 256-character ISO character sets,
>> a UTF-8 string handles just the same as an ISO string.  The major
>> differences are that the bottom 128 characters are the US national variant
>> we call ASCII, and the second 128 characters that in the past would have
>> had the accented and special characters needed to handle the western
>> European languages in an ASCII environment have been replaced with the
>> variable length encodings for a 31 bit character set.  That is what is
>> nice about UTF8 -- it is actually using what should be printable
>> characters to do its encoding, avoiding anything that looks like
>> binary data.
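The byte-level behaviour described above can be checked with a short Python sketch. It only illustrates two properties of UTF-8: ASCII characters keep their single-byte values, and every byte of a multi-byte character has its high bit set. The sample string is an assumption chosen for illustration.

```python
# Sample text mixing ASCII with three accented characters.
text = "O'Neill caf\u00e9 \u00c5ngstr\u00f6m"
data = text.encode("utf-8")

# Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_identical = "O'Neill".encode("utf-8") == "O'Neill".encode("ascii")

# Every byte belonging to a non-ASCII character is >= 0x80.
non_ascii_bytes = [b for ch in text if ord(ch) > 127
                   for b in ch.encode("utf-8")]
all_high = all(b >= 0x80 for b in non_ascii_bytes)

# So a byte-level scan for an ASCII delimiter (here the apostrophe, 0x27)
# can never match in the middle of a multi-byte sequence.
apostrophe_positions = [i for i, b in enumerate(data) if b == 0x27]
```

This is why byte-oriented code that only looks for ASCII delimiters can usually pass UTF-8 text through untouched, although anything that counts characters or column positions will see bytes rather than characters.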
>>
>>    UTF-16/UCS-2 is different.  There you have a lot that looks like binary
>> when working in an ascii world, and you need special libraries (for wide
>> characters) to deal with them, unless you are working in java or with a
>> browser, where that is the native encoding.
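A small Python sketch of why UTF-16 "looks like binary" to ASCII-oriented code: even plain ASCII text acquires zero bytes, and the byte order depends on the chosen endianness. The sample string is an assumption for illustration.

```python
text = "CIF"

utf8 = text.encode("utf-8")          # one byte per ASCII character
utf16_le = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf16_be = text.encode("utf-16-be")  # big-endian, no byte-order mark

# A NUL byte terminates a C string, so its presence matters to C code.
has_nul_utf8 = 0 in utf8
has_nul_utf16 = 0 in utf16_le

# The same text yields different byte sequences depending on endianness.
byte_order_matters = utf16_le != utf16_be
```

UTF-8 has no such byte-order variants, since it is defined as a sequence of single bytes.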
>>
>>    We are in the midst of a painful, worldwide transition in which we have
>> a mixture of:
>>
>>    1.  The code-page based character encodings based on the multiple
>> ISO national variants.  ASCII is just the US national variant.
>>    2.  The UTF-16/UCS-2 version of unicode heavily adopted by many hardware
>> vendors and used as the native encoding in many operating systems and all
>> browsers
>>    3.  The UTF-8 version of unicode, extensively adopted in Linux-based
>> applications and slowly being accepted in almost all operating systems.
>>
>> My guess is that by 10 years from now, UTF-8 will have been fairly
>> completely adopted except for some legacy java and browser UCS-2
>> stuff.
>>
>>    My suggestion would be to try to support ascii, UCS-2 and UTF-8 for the
>> moment and work towards joining the march towards UTF-8.
>>
>>    Regards,
>>      Herbert
>>
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>> =====================================================
>>
>> On Sat, 10 Oct 2009, Brian McMahon wrote:
>>
>>> Regarding the adoption of the Unicode character set, I agree that
>>> this would make it easier to accommodate accented and non-Latin
>>> characters and symbols, and I see no reason to oppose implementing
>>> it as a UTF-8 encoding, and so I vote 3.2.
>>>
>>> (It's not a panacea, especially for maths, where new symbols can
>>> always be invented, and one must be able to specify a two-dimensional
>>> layout as well as just the glyphs, so we shall still need other
>>> approaches for various types of "rich" text.)
>>>
>>> However, this is a binary encoding, is it not, and so the underlying
>>> STAR specification must be modified to accommodate this. (I'm afraid
>>> I haven't got Nick's draft paper for the revised STAR specification
>>> to hand, so I apologise if that's already been addressed.)
>>>
>>> Does it raise issues of endian-ness? If we are introducing binary
>>> encodings, are there any reasons to restrict the character set
>>> encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
>>> and (ii) in CIF? And, ultimately, is there a prospect of extending
>>> the STAR spec in a way that properly accommodates at least the CBF
>>> implementation, and possibly other binary data incorporation?
>>>
>>> I am happy in this case that handling by "old" CIF software can
>>> be done by adopting a protocol that allows UTF-8 Unicode characters
>>> to be represented by ASCII encodings such as \u27. (I don't think
>>> that we need specify a protocol at this point, just be sure that
>>> one can be defined if needed.)
>>>
>>> I again draw attention to the amusing fact that with an ASCII
>>> Unicode encoding, "O\u27Neill" is a valid data value under the
>>> current proposals, whereas the UTF-8 equivalent would not be,
>>> because the UTF-8 encoding of ' is just ' !
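The point about the apostrophe can be sketched in Python. No escape protocol has been defined in this thread, so the `\u` plus two-to-four hex digits syntax and the function name `decode_escapes` below are purely hypothetical, chosen to match the `\u27` notation used above.

```python
import re

# Hypothetical escape syntax: a backslash, 'u', then 2 to 4 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{2,4})")

def decode_escapes(value):
    """Replace each hypothetical \\uXX escape with its Unicode character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)

escaped = r"O\u27Neill"            # pure ASCII, no literal apostrophe
decoded = decode_escapes(escaped)  # the escape becomes U+0027

# In UTF-8, U+0027 is the single byte 0x27, i.e. the ASCII apostrophe.
utf8_bytes = decoded.encode("utf-8")
contains_apostrophe_byte = 0x27 in utf8_bytes
```

A variable-length escape like this one is ambiguous when hex digits follow it (for example `\u27ef`), so a real protocol would probably need fixed-width or explicitly terminated escapes.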
>>>
>>> Brian
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
