
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .

Dear Colleagues,

   It is an unfortunate reality that we seem unable to agree on this issue
and perhaps others related to CIF2 and DDLm.  Perhaps we need a meeting.
If enough of us are at the ACA meeting in Chicago, and a few others
could join in via Skype, maybe we could make some progress.

   Right now we seem to be going in circles.


At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>I've attempted to take a step back and look at the encoding problem 
>>from the perspective of my working experience.
>Fair enough.
>>To start with, please indulge me by putting aside the 
>>philosophical/respectful ('internationalization') considerations.
>>What are the short/medium term benefits of extending CIF beyond ASCII text?
>>1) With regard to the promise of DDLm (all ASCII) - none?
>I'm insufficiently informed to respond to that one.
>>2) With regard to processing crystallographic data output by e.g. 
>>refinement software - none?
>As far as I know, no current refinement software outputs non-ASCII 
>CIF content, except by using the limited and somewhat arcane system 
>of ASCII elides described among the CIF 1.1 "Common Semantic 
>Features" (and which technically is not part of the CIF 1.1 spec). 
>If there are any that do otherwise then the files they produce may 
>not conform to CIF 1.1.  Any existing processing software that 
>consumes CIFs therefore either will assume the character set to be 
>restricted to ASCII, or will make some specific local provision for 
>handling non-standard CIFs.  Some such software may be able to 
>immediately take advantage of the larger character repertoire 
>afforded by Unicode, but a lot of software will need to be updated 
>to make any use of it.
>I'm not sure any of that answers the question, though.  What 
>behaviors count as "processing"?  To the extent that few 
>crystallographic computations can be performed on non-numeric data, 
>I see no special benefit for that kind of processing.
>On the other hand, I do see certain advantages to CIF being able to 
>represent personal names without transliteration, as variant 
>transliteration approaches applied to the same name sometimes 
>produce different results.  If the "processing" in question involves 
>storing CIF data in a database then there are searching and 
>normalization advantages to having names, at least, written in their 
>native script.  (The elide system covers many of these cases, at 
>least for European names, but not all possible cases.)
>>3) With regard to richer content within data values - minimal?
>Again, names.
>Also, deprecating the elide system -- I understand that it is 
>designed to be mnemonic, and it *is* easier to read than Unicode 
>escape codes would be, but it's still limited and hard to read.  I 
>contend that this is one thing that is broken in CIF1 (whether you 
>characterize the problem as an insufficient character repertoire or 
>as an insufficient elide system).
>Plus, there are various non-ASCII characters in routine use in 
>crystallography and related fields that it would be nice to 
>represent directly, among them the degree symbol and many upper- and 
>lower-case Greek letters.  The elide system currently covers these, 
>but again, it's uncomfortable and not an official standard.
>Furthermore, if there is some hope or expectation of CIF2 as an 
>electronic representation of non-English manuscripts, then that 
>virtually requires direct support for all the characters of the 
>scripts in which such manuscripts will be written.  The elide system 
>is workable for short pieces of text, but only via machine 
>translation could it be comfortable for longer texts.
>I think these amount to more than a minimal advantage for Unicode in 
>data values.
>>In the latter case an extended character set can be represented 
>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on 
>>my experience (and in light of the issues we've been discussing), 
>>it will probably be considerably easier for a user to adapt to a 
>>few extra ASCII control sequences than asking them to pay any 
>>attention to the underlying text encodings. The same applies from a 
>>developer's point of view - i.e. it's far easier to accept extended 
>>ASCII control sequences than to try to determine the text encoding 
>>(unless of course the encodings are unambiguously identifiable).
>Java / Python-style Unicode escapes have the advantages of covering 
>all of Unicode, of providing an unambiguous encoding of an 
>underlying Unicode text model, and of embedding that encoding in an 
>ASCII-based host format.
>They have the disadvantages of being difficult for a human to 
>directly read or edit, and of introducing their own set of issues. 
>For example, consider the following potential CIF2 fragment:
>         _foo \u000A;bar\u000A;
>What is the value assigned to data name _foo?  If the Unicode 
>escapes are processed according to the Java model (i.e. as if 
>replaced by the corresponding character prior to lexical analysis), 
>then the value is bar.  If the escapes are processed later, then the 
>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1 
>calls them, but containing <LF> characters (in fact, this particular 
>value cannot be represented in CIF 1 at all).
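
The two processing orders described above can be sketched in Python. This is only an illustration under stated assumptions: `toy_value` is a hypothetical, heavily simplified stand-in for a CIF lexer (it recognizes just one whitespace-delimited value or one semicolon-delimited text field), and the escape syntax is assumed to be a backslash, `u`, and exactly four hex digits.

```python
import re

# Assumed escape syntax: backslash, 'u', exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")
# A text field opens with ';' at the start of a line and closes at the
# next line that starts with ';'.
TEXT_FIELD = re.compile(r"\n;(.*?)\n;", re.DOTALL)


def expand_escapes(text):
    # Replace each escape with the character it names.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def toy_value(fragment):
    # Hypothetical mini-lexer: drop the data name, then return either the
    # body of a semicolon-delimited text field or the bare remaining token.
    _name, rest = fragment.split(" ", 1)
    field = TEXT_FIELD.match(rest)
    if field:
        return field.group(1)
    return rest.strip()


fragment = r"_foo \u000A;bar\u000A;"

# Java model: escapes are expanded before lexical analysis.
java_style = toy_value(expand_escapes(fragment))

# Alternative: lexical analysis first, escapes expanded in the value.
late_style = expand_escapes(toy_value(fragment))

print(repr(java_style))
print(repr(late_style))
```

Under these assumptions the first order sees a text field and the second sees a single token whose expanded value contains line feeds, which is the ambiguity the fragment is meant to expose.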
>These issues do not by any means block Unicode escapes from being 
>adopted for CIF, but they do mean that taking such an approach 
>requires some additional details to be settled, and that there will 
>be interesting gotchas involved in adapting some existing CIF1 
>software for CIF2.
>>Furthermore, extending the character set (however represented) does 
>>not address issues such as representing mathematical
>>content in a CIF data value, nor images (imgCIF will not be fully 
>>compliant with CIF2 - but please correct me if I'm wrong). There 
>>are yet unexplored alternatives to enabling richer publication and 
>>archival content using CIF, but they do not concern the fundamental 
>By "mathematical content" I suppose you mean formulae.  I agree, 
>formulae, images, and various other content types that might be of 
>interest are not supported by a Unicode character model alone, 
>however encoded.  It was never my understanding that supporting such 
>content types was a reason for switching to a Unicode character 
>model, however much (or little) it might be advantageous to imgCIF.
>>So the leading ('forward thinking') motivation for basing CIF2 on 
>>Unicode lies in 'internationalization'. In the short/medium term I 
>>don't imagine that introducing an extended character set through 
>>Unicode or multiple encodings is going to lead to anyone/any group 
>>adopting the new CIF2 as the basis of their private/public data 
>>archive/retrieval system. Hopefully they will take advantage of 
>>what DDLm has to offer, though most likely by using third-party 
>I think that's missing the point.  CIF already has to deal with 
>internationalization issues, which it does, as best it can, via the 
>elide system.  Even in English it has to in some way provide a 
>character model that extends beyond ASCII.
>>At this point in my train of thought, I might say stick to ASCII as 
>>'internationalization' has not been widely called for by the 
>>community and has minimal benefits at this time.
>As a practical matter, CIF already goes beyond ASCII.  The usual 
>manner in which it does so, however, is explicitly NOT standardized. 
>Personally, I find this a sorry state of affairs indeed.
>>  However, I think CIF should move forward in this respect. So how 
>>do we achieve this? Unicode is the accepted answer? Unicode was 
>>designed for this and has some established unambiguous encodings?
>I think Unicode or (almost) equivalently, ISO-10646, is indeed the 
>accepted answer, at least inasmuch as ISO-10646 is an international 
>standard.  As far as I know, there is no competing standard of 
>comparable scope.
>>  The majority (including Microsoft) recommend adopting UTF-8 in 
>>preference to other encodings?
>XML gives special status to UTF-8 as the encoding to assume in the 
>absence of internal or external metadata directing otherwise. 
>Nevertheless, XML also requires conformant processors to be able to 
>recognize and handle UTF-16 (though not necessarily UTF-16LE, 
>UTF-16BE, or other variants).  I believe Microsoft NT-based 
>operating systems internally use UCS-2 or UTF-16 for file names, 
>depending on OS version and patch level.  Microsoft and many others 
>provide decent support for creating, reading, and editing Unicode 
>text files encoded in UTF-8, but this frequently is not the default 
>encoding.  I am not aware of Microsoft in particular promoting UTF-8 
>above locale-specific code pages, but it is my general, personal 
>perception that UTF-8 use is broad, expanding, and widely 
>recommended.  However, I do not see UTF-8 or any other encoding ever 
>being preferred over all others for all purposes.
>>So in the light of current CIF practice (i.e. unspecified-encoding 
>>of ASCII text, where the encoding has never to my knowledge been a 
>>problem), why not specify UTF-8 only, don't accommodate any 
>>non-ASCII code points in the dictionaries (which is what is 
>>proposed anyway?), and see what happens? :-) At worst a few users 
>>will find that existing software will not handle the non-ASCII text 
>>they have diligently included in their UTF-8 CIF (but this is 
>>inevitable once you extend beyond ASCII). At best their text will 
>>be handled as UTF-8 by CIF2 software.
>That is a possible way forward, and indeed, it is basically what is 
>in the current spec.  The main problem I see with it is that in 
>practice, many people will create, use, and exchange (successfully 
>or not) "CIFs" that are not UTF-8 encoded, regardless of what the 
>spec says about that.  Although it is certainly possible to declare 
>that such files are not compliant CIFs, I don't see how that 
>provides any benefit.
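
As a rough illustration of what a UTF-8-only rule means in practice, the Python sketch below checks whether a byte string decodes as strict UTF-8. The helper name `decodes_as_utf8` and the sample values are hypothetical. Note the limitation: a successful decode does not prove the author intended UTF-8, since byte sequences from other encodings can occasionally be well-formed UTF-8 by accident.

```python
def decodes_as_utf8(data: bytes) -> bool:
    # True if the bytes are well-formed UTF-8 (ASCII-only input always is).
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


name = "Müller"  # hypothetical author name with one non-ASCII letter

ascii_only = b"_publ_author_name 'Mueller'"
utf8_bytes = name.encode("utf-8")      # what a UTF-8-only spec expects
latin1_bytes = name.encode("latin-1")  # what a locale code page might emit

for label, data in [("ascii", ascii_only),
                    ("utf-8", utf8_bytes),
                    ("latin-1", latin1_bytes)]:
    print(label, decodes_as_utf8(data))
```

A file like the Latin-1 case is the kind of non-compliant "CIF" that will circulate whatever the spec says; a check of this sort can flag it but cannot say which encoding was actually used.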
>>So what about the issue of accessing archived UTF-8 CIFs? Make it 
>>clear to the recipient that the CIF will be encoded in UTF-8; if 
>>for some reason they have trouble reading the CIF, point them at 
>>appropriate UTF-8 software (preferably provide them with a fully 
>>compliant CIF2 editor/viewer that introduces them to the benefits 
>>of CIF2 and its support for Unicode :-)
>And that is exactly the same thing that would be done if CIF2 did 
>not specify a particular encoding.
>>Similarly, with day-to-day transmission of a CIF, if the CIF 
>>doesn't contain any characters beyond the ASCII set, the chances 
>>are there won't be any issues (there haven't been in the past?). If a 
>>diligent user has followed the spec and prepared a UTF-8 CIF, again 
>>the chances are it will be interpreted as UTF-8 (very few modern 
>>systems struggle with UTF-8?).
>I'm not in a position to know how many encoding-related issues there 
>may have been in the past.  UTF-16 variants and EBCDIC variants are 
>the only encodings I know that are in wide use and might present an 
>interchange problem for CIF 1.1 compliant CIFs.  They would present 
>exactly the same problems if used to encode ASCII-only CIF2 text.
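
The point about UTF-16 and EBCDIC can be seen with a small Python sketch. The CIF-like sample text is made up for illustration, and `cp037` is used as one representative EBCDIC code page; other EBCDIC variants would differ in detail.

```python
import codecs

# Hypothetical ASCII-only CIF-like content.
text = "data_example\n_cell_length_a 10.5\n"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")   # Python prepends a byte-order mark
as_ebcdic = text.encode("cp037")   # one EBCDIC code page

# For ASCII-only text, UTF-8 and ASCII give identical bytes.
same_as_ascii = as_utf8 == as_ascii

# UTF-16 output starts with a BOM and uses two bytes per character here.
has_bom = as_utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# EBCDIC bytes are not readable as ASCII at all.
try:
    as_ebcdic.decode("ascii")
    ebcdic_reads_as_ascii = True
except UnicodeDecodeError:
    ebcdic_reads_as_ascii = False

print(same_as_ascii, has_bom, ebcdic_reads_as_ascii)
print(len(as_ascii), len(as_utf16), len(as_ebcdic))
```

So a reader that assumes ASCII-compatible bytes handles the UTF-8 form unchanged, but needs explicit knowledge of the encoding for the other two, whether the content is CIF 1.1 or ASCII-only CIF2.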
>>I fully expect to be 'shot down' on any number of my thoughts - 
>>but, given the number of emails it has generated, I don't think it 
>>is unreasonable to put this issue in the context of perceived 
>>current practice (however narrow the viewpoint - others have 
>>referred to CIF systems that I have no idea about)?
>It is not my goal to "shoot you down", or anyone else.  I am not 
>debating for the sake of the debate.  I want CIF2 to be as 
>technically sound and as practically useful as possible, and I don't 
>foresee a lot of latitude for tweaking or revising it after it is 
>I started by probing several areas where the draft spec seemed to 
>give too little consideration to the implications of expanding the 
>CIF character repertoire to all of Unicode.  For the most part these 
>have been resolved easily, but the issue of embedded U+FEFF 
>characters was contentious (and still has not been resolved).  That 
>led into the related area of character encoding and text vs. binary, 
>which has become such a brouhaha.
>Much of the disagreement over these contentious issues arises from 
>CIF's split-personality design.  It has always been promoted as a 
>human-readable text format, yet it is intended largely to be 
>produced and primarily to be consumed by computers.  Humans and 
>computers have different requirements, and it is not always possible 
>to align them.  XML followed a similar path, and nowadays the 
>prevailing opinion seems to be that XML isn't well suited to direct 
>human reading or modification.  Opinion of CIF has not reached that 
>point yet, and it's unclear whether it ever will do.
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>ddlm-group mailing list

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769
