Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: Brian McMahon <bm@iucr.org>
- Date: Sat, 10 Oct 2009 17:44:55 +0100
- In-Reply-To: <20091010113728.W18936@epsilon.pair.com>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com><279aad2a0910100513u1e9ef18dua5f984cc20ac9a9b@mail.gmail.com><20091010125924.GA7536@emerald.iucr.org><20091010145830.GA7607@emerald.iucr.org><20091010113728.W18936@epsilon.pair.com>

Dear Herbert

Thanks for the clarification. I've now read
http://en.wikipedia.org/wiki/UTF-8 :-)

It seems to me that the STAR spec still needs to be modified to state
explicitly that its allowed character set is Unicode as expressed in
UTF-8 encoding.

I note also from the above Wikipedia entry that there is some latitude
in practices for handling invalid byte sequences (and to some extent
invalid code points). I think we should consider whether the full
STAR/CIF1.2 specs should formalise exception handling procedures in
such cases.

Regards
Brian

PS Just for my own information, does the statement

> From the point of view of any
> C-program intended to work with the 256-character ISO character sets,
> a UTF-8 string handles just the same as an ISO string.

hold equally well for modern Fortran applications?

On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
> There is a misunderstanding about UTF-8. From the point of view of any
> C-program intended to work with the 256-character ISO character sets,
> a UTF-8 string handles just the same as an ISO string. The major
> differences are that the bottom 128 characters are the US national variant
> we call ASCII, and the second 128 characters that in the past would have
> had the accented and special characters needed to handle the western
> European languages in an ASCII environment have been replaced with the
> variable length encodings for a 31 bit character set. That is what is
> nice about UTF-8 -- it is actually using what should be printable
> characters to do its encoding, avoiding anything that looks like
> binary data.
>
> UTF-16/UCS-2 is different. There you have a lot that looks like binary
> when working in an ASCII world, and you need special libraries (for wide
> characters) to deal with them, unless you are working in Java or with a
> browser, where that is the native encoding.
>
> We are in the midst of a painful, worldwide transition in which we have
> a mixture of:
>
> 1. The code-page based character encodings based on the multiple
> ISO national variants. ASCII is just the US national variant.
> 2. The UTF-16/UCS-2 version of Unicode heavily adopted by many hardware
> vendors and used as the native encoding in many operating systems and all
> browsers
> 3. The UTF-8 version of Unicode, extensively adopted in Linux-based
> applications and slowly being accepted in almost all operating systems.
>
> My guess is that by 10 years from now, UTF-8 will have been fairly
> completely adopted except for some legacy Java and browser UCS-2
> stuff.
>
> My suggestion would be to try to support ASCII, UCS-2 and UTF-8 for the
> moment and work towards joining the march towards UTF-8.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Sat, 10 Oct 2009, Brian McMahon wrote:
>
> > Regarding the adoption of the Unicode character set, I agree that
> > this would make it easier to accommodate accented and non-Latin
> > characters and symbols, and I see no reason to oppose implementing
> > it as a UTF-8 encoding, and so I vote 3.2.
> >
> > (It's not a panacea, especially for maths, where new symbols can
> > always be invented, and one must be able to specify a two-dimensional
> > layout as well as just the glyphs, so we shall still need other
> > approaches for various types of "rich" text.)
> >
> > However, this is a binary encoding, is it not, and so the underlying
> > STAR specification must be modified to accommodate this. (I'm afraid
> > I haven't got Nick's draft paper for the revised STAR specification
> > to hand, so I apologise if that's already been addressed.)
> >
> > Does it raise issues of endian-ness? If we are introducing binary
> > encodings, are there any reasons to restrict the character set
> > encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
> > and (ii) in CIF? And, ultimately, is there a prospect of extending
> > the STAR spec in a way that properly accommodates at least the CBF
> > implementation, and possibly other binary data incorporation?
> >
> > I am happy in this case that handling by "old" CIF software can
> > be done by adopting a protocol that allows UTF-8 Unicode characters
> > to be represented by ASCII encodings such as \u27. (I don't think
> > that we need specify a protocol at this point, just be sure that
> > one can be defined if needed.)
> >
> > I again draw attention to the amusing fact that with an ASCII
> > Unicode encoding, "O\u27Neill" is a valid data value under the
> > current proposals, whereas the UTF-8 equivalent would not be,
> > because the UTF-8 encoding of ' is just ' !
> >
> > Brian
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
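
Two points from the exchange above can be checked with a small Python sketch: that ASCII characters such as the apostrophe keep their single-byte value under UTF-8, and that a decoder has a choice between rejecting an invalid byte sequence and substituting a replacement character. This is only an illustration; the function names are hypothetical and nothing here is taken from the STAR or CIF specifications.

```python
# Illustrative sketch only: these helper names are hypothetical and are
# not part of any STAR/CIF specification or library.

def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8 and return the raw bytes."""
    return text.encode("utf-8")

def decode_strict(data: bytes) -> str:
    """Reject invalid UTF-8 outright (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")

def decode_lenient(data: bytes) -> str:
    """Substitute U+FFFD for each undecodable byte instead of failing."""
    return data.decode("utf-8", errors="replace")

# ASCII characters keep their single-byte values in UTF-8, so the
# apostrophe U+0027 is still the byte 0x27, i.e. a literal '.
apostrophe = utf8_bytes("\u0027")

# A non-ASCII character becomes a multi-byte sequence in which every
# byte has its high bit set, so it cannot be mistaken for ASCII.
e_acute = utf8_bytes("\u00e9")

# A lone 0xE9 byte is e-acute in ISO 8859-1 but is not valid UTF-8.
bad = b"O\xe9Neill"

if __name__ == "__main__":
    print(apostrophe, e_acute)
    print(decode_lenient(bad))
    try:
        decode_strict(bad)
    except UnicodeDecodeError as err:
        print("strict decoding rejected the input:", err.reason)
```

Which of the two decoding policies a parser should follow is exactly the kind of exception handling the message suggests the specification might formalise; the sketch takes no position on that.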