
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Regarding the adoption of the Unicode character set, I agree that
this would make it easier to accommodate accented and non-Latin
characters and symbols, and I see no reason to oppose implementing
it as a UTF-8 encoding, and so I vote 3.2.

(It's not a panacea, especially for maths, where new symbols can
always be invented, and one must be able to specify a two-dimensional
layout as well as just the glyphs, so we shall still need other
approaches for various types of "rich" text.)

However, this is a binary encoding, is it not? If so, the underlying
STAR specification must be modified to accommodate it. (I'm afraid
I haven't got Nick's draft paper for the revised STAR specification
to hand, so I apologise if that's already been addressed.)

Does it raise issues of endian-ness? If we are introducing binary
encodings, are there any reasons to restrict the character set
encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
and (ii) in CIF? And, ultimately, is there a prospect of extending
the STAR spec in a way that properly accommodates at least the CBF
implementation, and possibly other binary data incorporation?
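
As a rough illustration of the endian-ness question, the sketch below
prints the bytes produced for a single accented character. The sample
character is chosen only for illustration; the codec names are the
standard Python ones. UTF-8 is defined as a sequence of single bytes and
so has only one byte order, whereas UTF-16 comes in little-endian and
big-endian forms.

```python
# Illustrative only: byte sequences for one accented character.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8 = text.encode("utf-8")         # byte-oriented, no byte-order variants
utf16_le = text.encode("utf-16-le")  # 16-bit units, low byte first
utf16_be = text.encode("utf-16-be")  # 16-bit units, high byte first

print(utf8.hex())      # c3a9
print(utf16_le.hex())  # e900
print(utf16_be.hex())  # 00e9
```

A reader of UTF-16 therefore has to know (or detect) the byte order,
which is not a concern for UTF-8.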

I am happy in this case that handling by "old" CIF software can
be done by adopting a protocol that allows UTF-8 Unicode characters
to be represented by ASCII encodings such as \u27. (I don't think
that we need specify a protocol at this point, just be sure that
one can be defined if needed.)
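
To show that such a protocol can at least be defined, here is a minimal
sketch. The escape form \u{HEX} (with braces) and the function names
to_ascii and from_ascii are hypothetical, chosen here only so that the
escapes are unambiguous to decode; nothing in the proposals fixes a
notation.

```python
import re

def to_ascii(text: str) -> str:
    """Replace each non-ASCII character with a hypothetical \\u{HEX} escape."""
    return "".join(
        ch if ord(ch) < 128 else f"\\u{{{ord(ch):x}}}"
        for ch in text
    )

# Matches the hypothetical escape form, e.g. \u{c5}
_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")

def from_ascii(text: str) -> str:
    """Turn \\u{HEX} escapes back into the characters they stand for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

original = "\u00c5ngstr\u00f6m"      # "Angstrom" with its accented letters
escaped = to_ascii(original)
print(escaped)                        # \u{c5}ngstr\u{f6}m
print(from_ascii(escaped) == original)  # True
```

This sketch does not handle text that already contains a literal
sequence of the form \u{...}; a real protocol would need a rule for
escaping the escape character itself.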

I again draw attention to the amusing fact that with an ASCII
Unicode encoding, "O\u27Neill" is a valid data value under the
current proposals, whereas the UTF-8 equivalent would not be,
because the UTF-8 encoding of ' is just ' itself!
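
The point can be checked directly. The sketch below only inspects the
bytes involved; it does not attempt to apply the CIF quoting rules, and
the variable names are illustrative.

```python
# U+0027 is the ASCII apostrophe, and UTF-8 encodes it as the single byte 0x27.
apostrophe = "\u0027"
print(apostrophe == "'")                  # True
print(apostrophe.encode("utf-8").hex())   # 27

# The UTF-8 form of the name therefore contains a literal apostrophe byte...
utf8_form = "O'Neill".encode("utf-8")
print(b"'" in utf8_form)                  # True

# ...whereas the ASCII-escaped form (a raw string here) does not.
escaped_form = r"O\u27Neill"
print("'" in escaped_form)                # False
```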
