Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sat, 10 Oct 2009 13:02:35 -0400 (EDT)
- In-Reply-To: <20091010164455.GA7668@emerald.iucr.org>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com><279aad2a0910100513u1e9ef18dua5f984cc20ac9a9b@mail.gmail.com><20091010125924.GA7536@emerald.iucr.org><20091010145830.GA7607@emerald.iucr.org><20091010113728.W18936@epsilon.pair.com><20091010164455.GA7668@emerald.iucr.org>
Yes, most modern Fortrans cannot tell the difference between UTF-8 and ASCII.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, Brian McMahon wrote:

> Dear Herbert
>
> Thanks for the clarification. I've now read
> http://en.wikipedia.org/wiki/UTF-8
> :-)
>
> It seems to me that the STAR spec still needs to be modified to
> state explicitly that its allowed character set is Unicode as
> expressed in UTF-8 encoding.
>
> I note also from the above Wikipedia entry that there is some
> latitude in practices for handling invalid byte sequences (and to some
> extent invalid code points). I think we should consider whether the
> full STAR/CIF1.2 specs should formalise exception handling procedures
> in such cases.
>
> Regards
> Brian
>
> PS Just for my own information, does the statement
> > From the point of view of any
> > C-program intended to work with the 256-character ISO character sets,
> > a UTF-8 string handles just the same as an ISO string.
> hold equally well for modern Fortran applications?
>
> On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>> There is a misunderstanding about UTF-8. From the point of view of any
>> C-program intended to work with the 256-character ISO character sets,
>> a UTF-8 string handles just the same as an ISO string. The major
>> differences are that the bottom 128 characters are the US national variant
>> we call ASCII, and the second 128 characters that in the past would have
>> had the accented and special characters needed to handle the western
>> European languages in an ASCII environment have been replaced with the
>> variable length encodings for a 31 bit character set. That is what is
>> nice about UTF-8 -- it is actually using what should be printable
>> characters to do its encoding, avoiding anything that looks like
>> binary data.
>>
>> UTF-16/UCS-2 is different. There you have a lot that looks like binary
>> when working in an ASCII world, and you need special libraries (for wide
>> characters) to deal with them, unless you are working in Java or with a
>> browser, where that is the native encoding.
>>
>> We are in the midst of a painful, worldwide transition in which we have
>> a mixture of:
>>
>> 1. The code-page based character encodings based on the multiple
>> ISO national variants. ASCII is just the US national variant.
>> 2. The UTF-16/UCS-2 version of Unicode heavily adopted by many hardware
>> vendors and used as the native encoding in many operating systems and all
>> browsers
>> 3. The UTF-8 version of Unicode, extensively adopted in Linux-based
>> applications and slowly being accepted in almost all operating systems.
>>
>> My guess is that by 10 years from now, UTF-8 will have been fairly
>> completely adopted except for some legacy Java and browser UCS-2
>> stuff.
>>
>> My suggestion would be to try to support ASCII, UCS-2 and UTF-8 for the
>> moment and work towards joining the march towards UTF-8.
>>
>> Regards,
>> Herbert
>>
>> On Sat, 10 Oct 2009, Brian McMahon wrote:
>>
>>> Regarding the adoption of the Unicode character set, I agree that
>>> this would make it easier to accommodate accented and non-Latin
>>> characters and symbols, and I see no reason to oppose implementing
>>> it as a UTF-8 encoding, and so I vote 3.2.
>>>
>>> (It's not a panacea, especially for maths, where new symbols can
>>> always be invented, and one must be able to specify a two-dimensional
>>> layout as well as just the glyphs, so we shall still need other
>>> approaches for various types of "rich" text.)
>>>
>>> However, this is a binary encoding, is it not, and so the underlying
>>> STAR specification must be modified to accommodate this. (I'm afraid
>>> I haven't got Nick's draft paper for the revised STAR specification
>>> to hand, so I apologise if that's already been addressed.)
>>>
>>> Does it raise issues of endian-ness? If we are introducing binary
>>> encodings, are there any reasons to restrict the character set
>>> encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
>>> and (ii) in CIF? And, ultimately, is there a prospect of extending
>>> the STAR spec in a way that properly accommodates at least the CBF
>>> implementation, and possibly other binary data incorporation?
>>>
>>> I am happy in this case that handling by "old" CIF software can
>>> be done by adopting a protocol that allows UTF-8 Unicode characters
>>> to be represented by ASCII encodings such as \u27. (I don't think
>>> that we need specify a protocol at this point, just be sure that
>>> one can be defined if needed.)
>>>
>>> I again draw attention to the amusing fact that with an ASCII
>>> Unicode encoding, "O\u27Neill" is a valid data value under the
>>> current proposals, whereas the UTF-8 equivalent would not be,
>>> because the UTF-8 encoding of ' is just ' !
>>>
>>> Brian

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
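
Several byte-level points raised in this thread can be checked with a short Python sketch: ASCII text is unchanged under UTF-8, a non-ASCII character becomes a run of bytes that all lie in the upper half (0x80 and above), an invalid byte sequence can be either rejected or replaced, and the apostrophe in "O\u27Neill" only appears once the escape is decoded. The thread deliberately specifies no escape protocol, so the `\u` followed by one to six hex digits syntax handled by `decode_escapes` is purely an assumption for illustration, and the helper names are hypothetical.

```python
import re

# Hypothetical escape syntax: a backslash, 'u', then 1-6 hex digits.
# The thread leaves the protocol undefined; this is only an illustration.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is below 0x80 (plain ASCII)."""
    return all(b < 0x80 for b in data)


def decode_escapes(text: str) -> str:
    """Replace each assumed \\uXX escape with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for invalid byte sequences."""
    return data.decode("utf-8", errors="replace")


# ASCII text is byte-for-byte identical when encoded as UTF-8.
plain = "O'Neill".encode("utf-8")

# A non-ASCII character (e with acute accent) becomes two bytes, both >= 0x80.
accented = "\u00e9".encode("utf-8")  # b'\xc3\xa9'

# The escaped form contains no apostrophe until it is decoded.
escaped = r"O\u27Neill"
unescaped = decode_escapes(escaped)

if __name__ == "__main__":
    print(plain, is_pure_ascii(plain))
    print(accented, is_pure_ascii(accented))
    print(escaped, "->", unescaped)
    print(repr(decode_lenient(b"O\xffNeill")))
```

Whether a reader should substitute a replacement character, as `decode_lenient` does, or reject the input outright (Python's default strict decoding raises `UnicodeDecodeError`) is exactly the kind of exception-handling choice the thread suggests the STAR/CIF specifications might need to formalise.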