Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line

On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>I've attempted to take a step back and look at the encoding problem from the perspective of my working experience.

Fair enough.


>To start with, please indulge me by putting aside the philosophical/respectful ('internationalization') considerations.
>What are the short/medium term benefits of extending CIF beyond ASCII text?
>1) With regard to the promise of DDLm (all ASCII) - none?

I'm insufficiently informed to respond to that one.

>2) With regard to processing crystallographic data output by e.g. refinement software - none?

As far as I know, no current refinement software outputs non-ASCII CIF content, except by using the limited and somewhat arcane system of ASCII elides described among the CIF 1.1 "Common Semantic Features" (and which technically is not part of the CIF 1.1 spec).  If there are any that do otherwise then the files they produce may not conform to CIF 1.1.  Any existing processing software that consumes CIFs therefore either will assume the character set to be restricted to ASCII, or will make some specific local provision for handling non-standard CIFs.  Some such software may be able to immediately take advantage of the larger character repertoire afforded by Unicode, but a lot of software will need to be updated to make any use of it.

I'm not sure any of that answers the question, though.  What behaviors count as "processing"?  To the extent that few crystallographic computations can be performed on non-numeric data, I see no special benefit for that kind of processing.

On the other hand, I do see certain advantages to CIF being able to represent personal names without transliteration, as variant transliteration approaches applied to the same name sometimes produce different results.  If the "processing" in question involves storing CIF data in a database then there are searching and normalization advantages to having names, at least, written in their native script.  (The elide system covers many of these cases, at least for European names, but not all possible cases.)

>3) With regard to richer content within data values - minimal?

Again, names.

Also, deprecating the elide system -- I understand that it is designed to be mnemonic, and it *is* easier to read than Unicode escape codes would be, but it's still limited and hard to read.  I contend that this is one thing that is broken in CIF1 (whether you characterize the problem as an insufficient character repertoire or as an insufficient elide system).

Plus, there are various non-ASCII characters in routine use in crystallography and related fields that it would be nice to represent directly, among them the degree symbol and many upper- and lower-case Greek letters.  The elide system currently covers these, but again, it's uncomfortable and not an official standard.
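To make the elide idea concrete, here is a minimal Python sketch that expands a handful of elides into the Unicode characters they stand for.  It assumes the CIF 1.1 conventions as I understand them (\'e for e-acute, \"u for u-umlaut, \a, \b, \g for lower-case alpha, beta, gamma); the tables are deliberately incomplete, and the names COMBINING, GREEK, and expand_elides are invented for illustration.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny subset of the elide table (assumed mappings).
COMBINING = {"'": "\u0301", '"': "\u0308"}             # acute, diaeresis
GREEK = {"a": "\u03b1", "b": "\u03b2", "g": "\u03b3"}  # alpha, beta, gamma

# Either backslash + accent mark + letter, or backslash + Greek code letter.
ELIDE = re.compile(r"""\\(['"])([A-Za-z])|\\([abg])""")

def expand_elides(text):
    """Replace the elides covered by the tables above with Unicode characters."""
    def repl(match):
        if match.group(1):
            # Base letter plus combining mark, composed into a single code point.
            composed = match.group(2) + COMBINING[match.group(1)]
            return unicodedata.normalize("NFC", composed)
        return GREEK[match.group(3)]
    return ELIDE.sub(repl, text)

print(expand_elides(r"M\"uller"))   # umlaut elide
print(expand_elides(r"Andr\'e"))    # acute elide
print(expand_elides(r"\a-quartz"))  # Greek elide
```

A name whose script has no entry in such a table cannot be written this way at all, which is the limitation at issue.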

Furthermore, if there is some hope or expectation of CIF2 as an electronic representation of non-English manuscripts, then that virtually requires direct support for all the characters of the scripts in which such manuscripts will be written.  The elide system is workable for short pieces of text, but only via machine translation could it be comfortable for longer texts.

I think these amount to more than a minimal advantage for Unicode in data values.

>In the latter case an extended character set can be represented using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on my experience (and in light of the issues we've been discussing), it will probably be considerably easier for a user to adapt to a few extra ASCII control sequences than asking them to pay any attention to the underlying text encodings. The same applies from a developer's point of view - i.e. it's far easier to accept extended ASCII control sequences than to try to determine the text encoding (unless of course the encodings are unambiguously identifiable).

Java / Python-style Unicode escapes have the advantages of covering all of Unicode, of providing an unambiguous encoding of an underlying Unicode text model, and of embedding that encoding in an ASCII-based host format.

They have the disadvantages of being difficult for a human to directly read or edit, and of introducing their own set of issues.  For example, consider the following potential CIF2 fragment:

        _foo \u000A;bar\u000A;

What is the value assigned to data name _foo?  If the Unicode escapes are processed according to the Java model (i.e. as if replaced by the corresponding character prior to lexical analysis), then the value is bar.  If the escapes are processed later, then the value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1 calls them, but containing <LF> characters (in fact, this particular value cannot be represented in CIF 1 at all).

These issues do not by any means block Unicode escapes from being adopted for CIF, but they do mean that taking such an approach requires some additional details to be settled, and that there will be interesting gotchas involved in adapting some existing CIF1 software for CIF2.
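The two readings of the _foo fragment above can be sketched in Python.  This is a toy, not a CIF parser: read_value understands only one data name followed by either a whitespace-delimited value or a semicolon-delimited text field, and all of the names (ESCAPE, ITEM, decode_escapes, read_value) are invented for illustration.

```python
import re

ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")      # \uXXXX, four hex digits
ITEM = re.compile(r"(_\S+)(\s+)(.*)", re.DOTALL)  # data name, whitespace, rest

def decode_escapes(text):
    """Replace each \\uXXXX escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def read_value(text):
    """Toy reader: return the value that follows the first data name."""
    _name, sep, rest = ITEM.match(text).groups()
    if sep.endswith("\n") and rest.startswith(";"):
        # Text field: opened by ';' at the start of a line, closed by newline + ';'.
        end = rest.index("\n;", 1)
        return rest[1:end]
    # Otherwise treat the next whitespace-delimited token as a simple value.
    return rest.split()[0]

fragment = r"_foo \u000A;bar\u000A;"

# Java model: escapes are replaced before lexical analysis.
early = read_value(decode_escapes(fragment))
# Alternative: tokenize first, then expand escapes inside the value.
late = decode_escapes(read_value(fragment))

print(repr(early))  # 'bar'
print(repr(late))   # '\n;bar\n;'
```

The same bytes yield two different values depending only on the order of the two steps, which is the detail a spec adopting such escapes would have to pin down.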

>Furthermore, extending the character set (however represented) does not address issues such as representing mathematical
>content in a CIF data value, nor images (imgCIF will not be fully compliant with CIF2 - but please correct me if I'm wrong). There are yet unexplored alternatives to enabling richer publication and archival content using CIF, but they do not concern the fundamental syntax/encoding.

By "mathematical content" I suppose you mean formulae.  I agree, formulae, images, and various other content types that might be of interest are not supported by a Unicode character model alone, however encoded.  It was never my understanding that supporting such content types was a reason for switching to a Unicode character model, however much (or little) it might be advantageous to imgCIF.

>So the leading ('forward thinking') motivation for basing CIF2 on unicode lies in 'internationalization'. In the short/medium term I don't imagine that introducing an extended character set through unicode or multiple encodings is going to lead to any one/group adopting the new CIF2 as the basis of their private/public data archive/retrieval system. Hopefully they will take advantage of what DDLm has to offer, though most likely by using third-party software.

I think that's missing the point.  CIF already has to deal with internationalization issues, which it does, as best it can, via the elide system.  Even in English it has to in some way provide a character model that extends beyond ASCII.

>At this point in my train of thought, I might say stick to ASCII as 'internationalization' has not been widely called for by the community and has minimal benefits at this time.

As a practical matter, CIF already goes beyond ASCII.  The usual manner in which it does so, however, is explicitly NOT standardized.  Personally, I find this a sorry state of affairs indeed.

> However, I think CIF should move forward in this respect. So how do we achieve this? Unicode is the accepted answer? Unicode was designed for this and has some established unambiguous encodings?

I think Unicode or (almost) equivalently, ISO-10646, is indeed the accepted answer, at least inasmuch as ISO-10646 is an international standard.  As far as I know, there is no competing standard of comparable scope.

> The majority (including Microsoft) recommend adopting UTF-8 in preference to other encodings?

XML gives special status to UTF-8 as the encoding to assume in the absence of internal or external metadata directing otherwise.  Nevertheless, XML also requires conformant processors to be able to recognize and handle UTF-16 (though not necessarily UTF-16LE, UTF-16BE, or other variants).  I believe Microsoft NT-based operating systems internally use UCS-2 or UTF-16 for file names, depending on OS version and patch level.  Microsoft and many others provide decent support for creating, reading, and editing Unicode text files encoded in UTF-8, but this frequently is not the default encoding.  I am not aware of Microsoft in particular promoting UTF-8 above locale-specific code pages, but it is my general, personal perception that UTF-8 use is broad, expanding, and widely recommended.  However, I do not see UTF-8 or any other encoding ever being preferred over all others for all purposes.

>So in the light of current CIF practice (i.e. unspecified-encoding of ASCII text, where the encoding has never to my knowledge been a problem), why not specify UTF-8 only, don't accommodate any non-ASCII code points in the dictionaries (which is what is proposed anyway?), and see what happens? :-) At worst a few users will find that existing software will not handle the non-ASCII text they have diligently included in their UTF-8 CIF (but this is inevitable once you extend beyond ASCII). At best their text will be handled as UTF-8 by CIF2 software.

That is a possible way forward, and indeed, it is basically what is in the current spec.  The main problem I see with it is that in practice, many people will create, use, and exchange (successfully or not) "CIFs" that are not UTF-8 encoded, regardless of what the spec says about that.  Although it is certainly possible to declare that such files are not compliant CIFs, I don't see how that provides any benefit.

>So what about the issue of accessing archived UTF-8 CIFs? Make it clear to the recipient that the CIF will be encoded in UTF-8; if for some reason they have trouble reading the CIF, point them at appropriate UTF-8 software (preferably provide them with a fully compliant CIF2 editor/viewer that introduces them to the benefits of CIF2 and its support for unicode:-)

And that is exactly the same thing that would be done if CIF2 did not specify a particular encoding.

>Similarly, with day-to-day transmission of a CIF, if the CIF doesn't contain any characters beyond the ASCII set, the chances are there won't be any issues (there haven't been in the past?). If a diligent user has followed the spec and prepared a UTF-8 CIF, again the chances are it will be interpreted as UTF-8 (very few modern systems struggle with UTF-8?).

I'm not in a position to know how many encoding-related issues there may have been in the past.  UTF-16 variants and EBCDIC variants are the only encodings I know that are in wide use and might present an interchange problem for CIF 1.1 compliant CIFs.  They would present exactly the same problems if used to encode ASCII-only CIF2 text.
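As a small illustration of that point, the following Python sketch encodes the same ASCII-only text several ways and checks which byte streams an ASCII-assuming reader would recover unchanged.  The sample text is made up, and cp037 is used merely as one representative EBCDIC code page.

```python
# ASCII-only sample text (hypothetical content, for illustration only).
text = "data_example\n_cell_length_a 5.43\n"

encoded = {name: text.encode(name)
           for name in ("ascii", "utf-8", "utf-16-le", "cp037")}

def reads_back_as_ascii(raw):
    """True if a reader that assumes ASCII recovers the original text."""
    try:
        return raw.decode("ascii") == text
    except UnicodeDecodeError:
        return False

for name, raw in encoded.items():
    print(f"{name:10} {len(raw):3d} bytes  "
          f"ASCII-compatible: {reads_back_as_ascii(raw)}")
```

For ASCII-only text the UTF-8 bytes are identical to the ASCII bytes, whereas the UTF-16 and EBCDIC encodings are not, regardless of whether the text is labelled CIF 1.1 or CIF2.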

>I fully expect to be 'shot down' on any number of my thoughts - but, given the number of emails it has generated, I don't think it is unreasonable to put this issue in the context of perceived current practice (however narrow the viewpoint - others have referred to CIF systems that I have no idea about)?

It is not my goal to "shoot you down", or anyone else.  I am not debating for the sake of the debate.  I want CIF2 to be as technically sound and as practically useful as possible, and I don't foresee a lot of latitude for tweaking or revising it after it is adopted.

I started by probing several areas where the draft spec seemed to give too little consideration to the implications of expanding the CIF character repertoire to all of Unicode.  For the most part these have been resolved easily, but the issue of embedded U+FEFF characters was contentious (and still has not been resolved).  That led into the related area of character encoding and text vs. binary, which has become such a brouhaha.

Much of the disagreement over these contentious issues arises from CIF's split-personality design.  It has always been promoted as a human-readable text format, yet it is intended largely to be produced and primarily to be consumed by computers.  Humans and computers have different requirements, and it is not always possible to align them.  XML followed a similar path, and nowadays the prevailing opinion seems to be that XML isn't well suited to direct human reading or modification.  Opinion of CIF has not reached that point yet, and it's unclear whether it ever will do.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
