Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .

I plan to be at the ACA in Chicago and will be happy to meet anyone else interested in this issue.

David




Herbert J. Bernstein wrote:
Dear Colleagues,

   It is an unfortunate reality that we seem unable to agree on this issue
and perhaps others related to CIF2 and DDLm.  Perhaps we need a meeting.
If enough of us are at the ACA meeting in Chicago, and a few others
could join in via Skype, maybe we could make some progress.

   Right now we seem to be going in circles.

   Regards,
     Herbert



At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
I've attempted to take a step back and look at the encoding problem 
from the perspective of my working experience.

Fair enough.

[...]

To start with, please indulge me by putting aside the 
philosophical/respectful ('internationalization') considerations.
What are the short/medium term benefits of extending CIF beyond ASCII text?

1) With regard to the promise of DDLm (all ASCII) - none?
I'm insufficiently informed to respond to that one.

2) With regard to processing crystallographic data output by e.g. 
refinement software - none?
As far as I know, no current refinement software outputs non-ASCII 
CIF content, except by using the limited and somewhat arcane system 
of ASCII elides described among the CIF 1.1 "Common Semantic 
Features" (and which technically is not part of the CIF 1.1 spec). 
If there are any that do otherwise then the files they produce may 
not conform to CIF 1.1.  Any existing processing software that 
consumes CIFs therefore either will assume the character set to be 
restricted to ASCII, or will make some specific local provision for 
handling non-standard CIFs.  Some such software may be able to 
immediately take advantage of the larger character repertoire 
afforded by Unicode, but a lot of software will need to be updated 
to make any use of it.

I'm not sure any of that answers the question, though.  What 
behaviors count as "processing"?  To the extent that few 
crystallographic computations can be performed on non-numeric data, 
I see no special benefit for that kind of processing.

On the other hand, I do see certain advantages to CIF being able to 
represent personal names without transliteration, as variant 
transliteration approaches applied to the same name sometimes 
produce different results.  If the "processing" in question involves 
storing CIF data in a database then there are searching and 
normalization advantages to having names, at least, written in their 
native script.  (The elide system covers many of these cases, at 
least for European names, but not all possible cases.)
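
To make the normalization point concrete, here is a short Python sketch (purely illustrative, not anything from the CIF specifications): the same name can arrive from different sources as precomposed or decomposed Unicode, and a database search only matches reliably once both sides are normalized.

    import unicodedata

    precomposed = "M\u00fcller"    # "Müller" with precomposed U+00FC
    decomposed  = "Mu\u0308ller"   # same name, 'u' + combining diaeresis U+0308

    print(precomposed == decomposed)    # False: different code point sequences
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(precomposed) == nfc(decomposed))    # True once both are in NFC

Variant transliterations of the same name, by contrast, are not related by any normalization at all.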

3) With regard to richer content within data values - minimal?
Again, names.

Also, deprecating the elide system -- I understand that it is 
designed to be mnemonic, and it *is* easier to read than Unicode 
escape codes would be, but it's still limited and hard to read.  I 
contend that this is one thing that is broken in CIF1 (whether you 
characterize the problem as an insufficient character repertoire or 
as an insufficient elide system).

Plus, there are various non-ASCII characters in routine use in 
crystallography and related fields that it would be nice to 
represent directly, among them the degree symbol and many upper- and 
lower-case Greek letters.  The elide system currently covers these, 
but again, it's uncomfortable and not an official standard.
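
As a purely illustrative aside, a cut-down decoder for a handful of the elide codes might look like the Python sketch below; the real "Common Semantic Features" table is much larger and has context-dependent rules this ignores, so treat the particular code/character pairs shown as assumptions to be checked against that document rather than as authoritative:

    ELIDES = {
        r"\a": "\u03b1",     # Greek small alpha
        r"\b": "\u03b2",     # Greek small beta
        r"\'e": "\u00e9",    # e with acute accent
        r'\"o': "\u00f6",    # o with diaeresis
    }

    def decode_elides(text: str) -> str:
        # Replace longer codes first, the usual precaution when
        # escape codes can share prefixes.
        for code in sorted(ELIDES, key=len, reverse=True):
            text = text.replace(code, ELIDES[code])
        return text

    print(decode_elides(r"caf\'e, angle \a"))    # -> café, angle α

Even at this size the fragility is visible: the mapping is ad hoc, replacement order matters, and nothing marks where a code ends.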

Furthermore, if there is some hope or expectation of CIF2 as an 
electronic representation of non-English manuscripts, then that 
virtually requires direct support for all the characters of the 
scripts in which such manuscripts will be written.  The elide system 
is workable for short pieces of text, but only via machine 
translation could it be comfortable for longer texts.

I think these amount to more than a minimal advantage for Unicode in 
data values.

In the latter case an extended character set can be represented 
using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on 
my experience (and in light of the issues we've been discussing), 
it will probably be considerably easier for a user to adapt to a 
few extra ASCII control sequences than asking them to pay any 
attention to the underlying text encodings. The same applies from a 
developer's point of view - i.e. it's far easier to accept extended 
ASCII control sequences than to try to determine the text encoding 
(unless of course the encodings are unambiguously identifiable).
Java / Python-style Unicode escapes have the advantages of covering 
all of Unicode, of providing an unambiguous encoding of an 
underlying Unicode text model, and of embedding that encoding in an 
ASCII-based host format.

They have the disadvantages of being difficult for a human to 
directly read or edit, and of introducing their own set of issues. 
For example, consider the following potential CIF2 fragment:

        _foo \u000A;bar\u000A;

What is the value assigned to data name _foo?  If the Unicode 
escapes are processed according to the Java model (i.e. as if 
replaced by the corresponding character prior to lexical analysis), 
then the value is bar.  If the escapes are processed later, then the 
value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1 
calls them, but containing <LF> characters (in fact, this particular 
value cannot be represented in CIF 1 at all).

These issues do not by any means block Unicode escapes from being 
adopted for CIF, but they do mean that taking such an approach 
requires some additional details to be settled, and that there will 
be interesting gotchas involved in adapting some existing CIF1 
software for CIF2.
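
To make the gotcha concrete, a toy Python sketch (far simpler than any real CIF2 lexer, and purely illustrative) shows both readings of that fragment:

    import re

    raw = r"_foo \u000A;bar\u000A;"

    def expand(s):
        # Replace each \uXXXX escape with the character it names.
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), s)

    # Java model: expand escapes BEFORE lexing.  The line becomes
    # '_foo <LF>;bar<LF>;', a text field whose value is 'bar'.
    print(repr(expand(raw)))

    # Late model: lex FIRST (one whitespace-delimited token), then expand.
    # The value is '<LF>;bar<LF>;', with literal <LF> characters inside.
    print(repr(expand(raw.split(None, 1)[1])))

Same bytes, two different values, depending only on when the escapes are expanded.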

Furthermore, extending the character set (however represented) does 
not address issues such as representing mathematical
content in a CIF data value, nor images (imgCIF will not be fully 
compliant with CIF2 - but please correct me if I'm wrong). There 
are yet unexplored alternatives to enabling richer publication and 
archival content using CIF, but they do not concern the fundamental 
syntax/encoding.
By "mathematical content" I suppose you mean formulae.  I agree, 
formulae, images, and various other content types that might be of 
interest are not supported by a Unicode character model alone, 
however encoded.  It was never my understanding that supporting such 
content types was a reason for switching to a Unicode character 
model, however much (or little) it might be advantageous to imgCIF.

So the leading ('forward thinking') motivation for basing CIF2 on 
Unicode lies in 'internationalization'. In the short/medium term I 
don't imagine that introducing an extended character set through 
Unicode or multiple encodings is going to lead to anyone (or any 
group) adopting the new CIF2 as the basis of their private/public data 
archive/retrieval system. Hopefully they will take advantage of 
what DDLm has to offer, though most likely by using third-party 
software.
I think that's missing the point.  CIF already has to deal with 
internationalization issues, which it does, as best it can, via the 
elide system.  Even in English it has to in some way provide a 
character model that extends beyond ASCII.

At this point in my train of thought, I might say stick to ASCII as 
'internationalization' has not been widely called for by the 
community and has minimal benefits at this time.
As a practical matter, CIF already goes beyond ASCII.  The usual 
manner in which it does so, however, is explicitly NOT standardized. 
Personally, I find this a sorry state of affairs indeed.

 However, I think CIF should move forward in this respect. So how 
do we achieve this? Unicode is the accepted answer? Unicode was 
designed for this and has some established unambiguous encodings?
I think Unicode or (almost) equivalently, ISO-10646, is indeed the 
accepted answer, at least inasmuch as ISO-10646 is an international 
standard.  As far as I know, there is no competing standard of 
comparable scope.

 The majority (including Microsoft) recommend adopting UTF-8 in 
preference to other encodings?
XML gives special status to UTF-8 as the encoding to assume in the 
absence of internal or external metadata directing otherwise. 
Nevertheless, XML also requires conformant processors to be able to 
recognize and handle UTF-16 (though not necessarily UTF-16LE, 
UTF-16BE, or other variants).  I believe Microsoft NT-based 
operating systems internally use UCS-2 or UTF-16 for file names, 
depending on OS version and patch level.  Microsoft and many others 
provide decent support for creating, reading, and editing Unicode 
text files encoded in UTF-8, but this frequently is not the default 
encoding.  I am not aware of Microsoft in particular promoting UTF-8 
above locale-specific code pages, but it is my general, personal 
perception that UTF-8 use is broad, expanding, and widely 
recommended.  However, I do not see UTF-8 or any other encoding ever 
being preferred over all others for all purposes.
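
For comparison, XML's rule is mechanical enough to sketch in a few lines of Python: trust a byte-order mark when present, otherwise fall back to UTF-8. The function name and the fallback policy below are illustrative assumptions, not anything drawn from a CIF specification:

    import codecs

    def guess_encoding(data: bytes) -> str:
        # A fuller version would test UTF-32 BOMs first, since
        # BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"     # UTF-8; the codec consumes the BOM
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"        # the BOM fixes the byte order
        return "utf-8"             # XML-style default when no metadata exists

    with open("example.cif", "rb") as f:    # hypothetical file name
        data = f.read()
    text = data.decode(guess_encoding(data))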

So in the light of current CIF practice (i.e. unspecified-encoding 
of ASCII text, where the encoding has never to my knowledge been a 
problem), why not specify UTF-8 only, don't accommodate any 
non-ASCII code points in the dictionaries (which is what is 
proposed anyway?), and see what happens? :-) At worst a few users 
will find that existing software will not handle the non-ASCII text 
they have diligently included in their UTF-8 CIF (but this is 
inevitable once you extend beyond ASCII). At best their text will 
be handled as UTF-8 by CIF2 software.
That is a possible way forward, and indeed, it is basically what is 
in the current spec.  The main problem I see with it is that in 
practice, many people will create, use, and exchange (successfully 
or not) "CIFs" that are not UTF-8 encoded, regardless of what the 
spec says about that.  Although it is certainly possible to declare 
that such files are not compliant CIFs, I don't see how that 
provides any benefit.
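
In code, "UTF-8 only" amounts to strict decoding, and the failure mode a non-compliant file would hit is easy to sketch (the helper name is hypothetical):

    def read_cif2_text(path: str) -> str:    # hypothetical helper
        with open(path, "rb") as f:
            data = f.read()
        try:
            return data.decode("utf-8", errors="strict")
        except UnicodeDecodeError as e:
            raise ValueError(f"{path}: not valid UTF-8 at byte {e.start}; "
                             f"not a compliant CIF2 file?") from e

Note the asymmetry: an ASCII-only file passes no matter what encoding its author intended, while, say, a Latin-1 file containing a single accented character is rejected outright rather than silently misread.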

So what about the issue of accessing archived UTF-8 CIFs? Make it 
clear to the recipient that the CIF will be encoded in UTF-8; if 
for some reason they have trouble reading the CIF, point them at 
appropriate UTF-8 software (preferably provide them with a fully 
compliant CIF2 editor/viewer that introduces them to the benefits 
of CIF2 and its support for Unicode :-)
And that is exactly the same thing that would be done if CIF2 did 
not specify a particular encoding.

Similarly, with day-to-day transmission of a CIF, if the CIF 
doesn't contain any characters beyond the ASCII set, the chances 
are there won't be any issues (there haven't been in the past?). If a 
diligent user has followed the spec and prepared a UTF-8 CIF, again 
the chances are it will be interpreted as UTF-8 (very few modern 
systems struggle with UTF-8?).
I'm not in a position to know how many encoding-related issues there 
may have been in the past.  UTF-16 variants and EBCDIC variants are 
the only encodings I know that are in wide use and might present an 
interchange problem for CIF 1.1 compliant CIFs.  They would present 
exactly the same problems if used to encode ASCII-only CIF2 text.
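
Both cases are easy to demonstrate; in Python, with cp500 standing in for one EBCDIC variant (illustrative choices, not encodings any CIF software is known to emit):

    line = "_cell_length_a 10.0"
    print(line.encode("utf-16"))    # BOM plus NUL-interleaved bytes
    print(line.encode("cp500"))     # EBCDIC: byte values differ from ASCII throughout

Neither byte stream is readable by an ASCII-assuming CIF reader, and that is equally true whether the decoded text is CIF 1.1 or ASCII-only CIF2.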

I fully expect to be 'shot down' on any number of my thoughts - 
but, given the number of emails it has generated, I don't think it 
is unreasonable to put this issue in the context of perceived 
current practice (however narrow the viewpoint - others have 
referred to CIF systems that I have no idea about)?
It is not my goal to "shoot you down", or anyone else.  I am not 
debating for the sake of the debate.  I want CIF2 to be as 
technically sound and as practically useful as possible, and I don't 
foresee a lot of latitude for tweaking or revising it after it is 
adopted.

I started by probing several areas where the draft spec seemed to 
give too little consideration to the implications of expanding the 
CIF character repertoire to all of Unicode.  For the most part these 
have been resolved easily, but the issue of embedded U+FEFF 
characters was contentious (and still has not been resolved).  That 
led into the related area of character encoding and text vs. binary, 
which has become such a brouhaha.

Much of the disagreement over these contentious issues arises from 
CIF's split-personality design.  It has always been promoted as a 
human-readable text format, yet it is intended largely to be 
produced and primarily to be consumed by computers.  Humans and 
computers have different requirements, and it is not always possible 
to align them.  XML followed a similar path, and nowadays the 
prevailing opinion seems to be that XML isn't well suited to direct 
human reading or modification.  Opinion of CIF has not reached that 
point yet, and it's unclear whether it ever will.

Best,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital





  



_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
