
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Herbert

Thanks for the clarification. I've now read
   http://en.wikipedia.org/wiki/UTF-8
:-)

It seems to me that the STAR spec still needs to be modified to
state explicitly that its allowed character set is Unicode as
expressed in UTF-8 encoding.

I note also from the above Wikipedia entry that there is some
latitude in practices for handling invalid byte sequences (and to some
extent invalid code points). I think we should consider whether the
full STAR/CIF1.2 specs should formalise exception handling procedures
in such cases.
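
To make the latitude concrete, here is a minimal sketch (Python is my
choice for illustration only; the function names and the sample bytes are
hypothetical and are not part of any STAR/CIF draft). It decodes the same
input under three common policies: reject, substitute U+FFFD, or silently
drop the offending bytes. It also checks that an "overlong" two-byte form
of the apostrophe is refused by a strict decoder.

```python
def decode_policies(raw: bytes) -> dict:
    """Decode raw bytes as UTF-8 under three error-handling policies."""
    results = {}
    try:
        # Policy 1: reject the whole input on the first invalid sequence.
        results["strict"] = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        results["strict"] = f"rejected at byte offset {exc.start}"
    # Policy 2: substitute U+FFFD (REPLACEMENT CHARACTER) for bad bytes.
    results["replace"] = raw.decode("utf-8", errors="replace")
    # Policy 3: silently drop bad bytes (lossy, and hides the problem).
    results["ignore"] = raw.decode("utf-8", errors="ignore")
    return results


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café", then a stray 0xFF byte (never valid in UTF-8), then " ok".
sample = b"caf\xc3\xa9 \xff ok"
outcome = decode_policies(sample)

# 0xC0 0xA7 is an overlong two-byte form of the apostrophe (U+0027).
overlong_apostrophe = b"O\xc0\xa7Neill"
plain_apostrophe = b"O'Neill"

for policy, result in outcome.items():
    print(policy, "->", repr(result))
print("overlong accepted:", is_valid_utf8(overlong_apostrophe))
print("plain accepted:", is_valid_utf8(plain_apostrophe))
```

The three policies give three different data values from the same file,
which is why I think the spec may need to say which one is required.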

Regards
Brian

PS Just for my own information, does the statement
  > From the point of view of any
  > C-program intended to work with the 256-character ISO character sets,
  > a UTF-8 string handles just the same as an ISO string.
hold equally well for modern Fortran applications?
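
For what it is worth, the byte-level property behind that statement can be
checked independently of the programming language. The sketch below uses
Python purely as a neutral way to inspect bytes; the sample string and
variable names are my own, and it says nothing about how any particular
Fortran compiler treats character data. It shows that ASCII characters
keep their single-byte values in UTF-8, that every byte of a multi-byte
character is 0x80 or above, and that no zero bytes appear, whereas UTF-16
does contain zero bytes and depends on byte order.

```python
text = "O'Neill measured 1.54 \u00c5"  # ends with A-ring (angstrom sign)

utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")       # one of the 256-character ISO sets
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Bytes below 0x80 in the UTF-8 form are exactly the ASCII characters.
ascii_bytes = bytes(b for b in utf8 if b < 0x80)
ascii_chars = "".join(c for c in text if ord(c) < 128).encode("ascii")

# Bytes belonging to multi-byte characters all have the high bit set.
high_bytes = bytes(b for b in utf8 if b >= 0x80)

print("ASCII part unchanged:", ascii_bytes == ascii_chars)
print("multi-byte part:", high_bytes.hex())
print("UTF-8 contains NUL:", 0 in utf8)
print("UTF-16 contains NUL:", 0 in utf16_le)
print("UTF-16 byte order matters:", utf16_le != utf16_be)
print("chars:", len(text), "UTF-8 bytes:", len(utf8),
      "Latin-1 bytes:", len(latin1))
```

One caveat the last line brings out: the byte count no longer equals the
character count, so any code that assumes a fixed field width in
characters would need care.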

On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
> 
>    There is a misunderstanding about UTF-8.  From the point of view of any
> C-program intended to work with the 256-character ISO character sets,
> a UTF-8 string handles just the same as an ISO string.  The major
> differences are that the bottom 128 characters are the US national variant
> we call ASCII, and the second 128 characters, which in the past would have
> had the accented and special characters needed to handle the western
> European languages in an ASCII environment, have been replaced with the
> variable-length encodings for a 31-bit character set.  That is what is
> nice about UTF-8 -- it is actually using what should be printable
> characters to do its encoding, avoiding anything that looks like
> binary data.
> 
>    UTF-16/UCS-2 is different.  There you have a lot that looks like binary
> when working in an ASCII world, and you need special libraries (for wide
> characters) to deal with it, unless you are working in Java or with a
> browser, where that is the native encoding.
> 
>    We are in the midst of a painful, worldwide transition in which we have
> a mixture of:
> 
>    1.  The code-page based character encodings based on the multiple
> ISO national variants.  ASCII is just the US national variant.
>    2.  The UTF-16/UCS-2 version of Unicode, heavily adopted by many hardware
> vendors and used as the native encoding in many operating systems and all
> browsers.
>    3.  The UTF-8 version of Unicode, extensively adopted in Linux-based
> applications and slowly being accepted in almost all operating systems.
> 
> My guess is that by 10 years from now, UTF-8 will have been fairly
> completely adopted except for some legacy Java and browser UCS-2
> stuff.
> 
>    My suggestion would be to try to support ASCII, UCS-2 and UTF-8 for the
> moment and work towards joining the march towards UTF-8.
> 
>    Regards,
>      Herbert
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> 
> On Sat, 10 Oct 2009, Brian McMahon wrote:
> 
> > Regarding the adoption of the Unicode character set, I agree that
> > this would make it easier to accommodate accented and non-Latin
> > characters and symbols, and I see no reason to oppose implementing
> > it as a UTF-8 encoding, and so I vote 3.2.
> >
> > (It's not a panacea, especially for maths, where new symbols can
> > always be invented, and one must be able to specify a two-dimensional
> > layout as well as just the glyphs, so we shall still need other
> > approaches for various types of "rich" text.)
> >
> > However, this is a binary encoding, is it not, and so the underlying
> > STAR specification must be modified to accommodate this. (I'm afraid
> > I haven't got Nick's draft paper for the revised STAR specification
> > to hand, so I apologise if that's already been addressed.)
> >
> > Does it raise issues of endian-ness? If we are introducing binary
> > encodings, are there any reasons to restrict the character set
> > encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
> > and (ii) in CIF? And, ultimately, is there a prospect of extending
> > the STAR spec in a way that properly accommodates at least the CBF
> > implementation, and possibly other binary data incorporation?
> >
> > I am happy in this case that handling by "old" CIF software can
> > be done by adopting a protocol that allows UTF-8 Unicode characters
> > to be represented by ASCII encodings such as \u27. (I don't think
> > that we need specify a protocol at this point, just be sure that
> > one can be defined if needed.)
> >
> > I again draw attention to the amusing fact that with an ASCII
> > Unicode encoding, "O\u27Neill" is a valid data value under the
> > current proposals, whereas the UTF-8 equivalent would not be,
> > because the UTF-8 encoding of ' is just ' !
> >
> > Brian
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
