Discussion List Archives

Re: Survey of available CIF software and request for wish list

At 07:17 04/10/00 +0100, Nick Spadaccini wrote:

Thanks Nick...

>On Mon, 2 Oct 2000, Peter Murray-Rust wrote:
>
>As a general note, looking at things from a STAR point of view, I have
>never considered XML an exclusive competitor to STAR with respect to the
>discipline domains that have adopted STAR or some derivative of it.
>Certainly STAR is not a competitor to XML in wider web based applications.
>I believe that discipline specific derivatives of XML (such as CML) can
>and should coexist with STAR. It would be pointless not to leverage off
>the XML based tools which are touted to be "just around the corner".

The coexistence of XML and STAR (and other formats) is very important and 
one that we shall need to address. A common question in XML is "I have 
(binary) data - how do I incorporate this into XML?" I see this at 4 levels:
         - encoding. Are the character sets compatible? If not some 
conversion will be needed. A good approach is to convert "binary" data into 
base64 (or similar) and wrap this with appropriate delimiters.

         - syntax. The syntax of each component must be carefully 
preserved. Some minimal escaping will always be needed in case the 
delimiters occur by chance in the included material.

         - semantics. How do we determine what to do with the included 
material? It must at least be labelled with appropriate metadata, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cml SYSTEM "http://www.xml-cml.org/dtd">
<cml xmlns="http://www.xml-cml.org/dtd/V1.0">
   <molecule id="toz">
     <string title="data" convention="org.iucr/CIF/DDL1.0/data">
<![CDATA[
data_TOZ
#==============================================================================
# 5. CHEMICAL DATA
_chemical_name_systematic 
"trans-3-Benzoyl-2-(tert-butyl)-4-(iso-butyl)-1,3-oxazolidin-5one"
_chemical_formula_moiety          'C18 H25 N O3'
_chemical_formula_sum             'C18 H25 N O3'
_chemical_formula_weight          303.40
loop_
_atom_type_symbol     _atom_type_scat_dispersion_real 
_atom_type_scat_dispersion_imag     _atom_type_scat_source
   C    .017  .009  International_Tables_Vol_IV_Table_2.3.1
   H    0  0 International_Tables_Vol_IV_Table_2.3.1
   O    .047  .032  International_Tables_Vol_IV_Table_2.3.1
   N    .029  .018  International_Tables_Vol_IV_Table_2.3.1
]]>
</string>
</molecule>
</cml>

There are many important points here.

         1. The whole file is a well-formed XML file. The CDATA mechanism 
escapes all characters except ]]> so that the body of the <string> element 
is seen simply as character data (#PCDATA in XML). The file identifies 
itself as XML - and this mechanism (<?xml...?>) is registered with the 
IETF. (It would also carry a media type of text/xml.)

         2. The file identifies the tags (element names) as belonging to a 
namespace, uniquified by the URI http://www.xml-cml.org/dtd/V1.0. THERE ARE 
NO SEMANTICS ASSOCIATED WITH THIS STATEMENT. In XML there is no current 
agreement on how to determine the semantics of a namespace; it is simply 
there to uniquify the tags. Mechanisms are required and are starting to 
emerge but there is no universal way of applying semantics.

         3. The file can be validated against the DTD listed in the 
DOCTYPE. This (for example) recognises that the <molecule> tag is allowed 
in CML but would forbid (say) <unitCell> which is not part of CML. The DTD 
is required to contain *prose* semantics but has no machine-processable 
means of delivering semantics.

         This is as far as XML can go. Beyond this it is up to the 
semantics of the particular Language (application).
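
The encoding and syntax levels, and the CDATA escaping in point 1, can be 
sketched roughly as follows. This is only an illustration in Python using the 
standard library: the helper names and the encoding="base64" attribute are 
hypothetical and not part of CML. The one delimiter that CDATA cannot carry, 
]]>, is split across two CDATA sections so that it survives a round trip.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_text_as_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def wrap_binary_as_base64(payload: bytes) -> str:
    """Encode arbitrary bytes as base64, which is safe as XML character data."""
    return base64.b64encode(payload).decode("ascii")


# Text payload: a CIF fragment that happens to contain the CDATA end delimiter.
cif_text = "data_TOZ\n_chemical_formula_sum 'C18 H25 N O3'\n# stray ]]> delimiter\n"
doc = (
    '<string convention="org.iucr/CIF/DDL1.0/data">'
    + wrap_text_as_cdata(cif_text)
    + "</string>"
)
recovered = ET.fromstring(doc).text

# Binary payload: every byte value, carried as base64.
# The encoding attribute is a hypothetical label, not a CML attribute.
binary = bytes(range(256))
encoded = wrap_binary_as_base64(binary)
elem = ET.fromstring('<string encoding="base64">' + encoded + "</string>")
decoded = base64.b64decode(elem.text)

print(recovered == cif_text, decoded == binary)
```

Both payloads come back unchanged after parsing, which is all the encoding 
and syntax levels have to guarantee; what the payload means is a separate 
question.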

         Implicit in the file is that there are CML semantics. Thus if I 
read the file into JUMBO3 it will create a molecule object. This object has 
a string child - that is all that JUMBO3 knows. [If there had been <atom> 
children, JUMBO3 would have drawn a molecule.]
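
To make the "string child" point concrete, here is a minimal sketch of what a 
generic, namespace-aware parser sees in such a file. It uses Python's standard 
ElementTree rather than JUMBO3, and the function name is hypothetical; the 
element names, namespace URI and convention value are those of the example 
above.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/dtd/V1.0"

document = """<cml xmlns="http://www.xml-cml.org/dtd/V1.0">
  <molecule id="toz">
    <string title="data" convention="org.iucr/CIF/DDL1.0/data"><![CDATA[
data_TOZ
_chemical_formula_sum 'C18 H25 N O3'
]]></string>
  </molecule>
</cml>"""


def embedded_payloads(xml_text):
    """Yield (molecule id, convention, payload) for each <string> child of a <molecule>."""
    root = ET.fromstring(xml_text)
    for molecule in root.iter(f"{{{CML_NS}}}molecule"):
        for child in molecule.findall(f"{{{CML_NS}}}string"):
            yield molecule.get("id"), child.get("convention"), child.text.strip()


payloads = list(embedded_payloads(document))
for mol_id, convention, payload in payloads:
    # The parser knows only that this is a labelled string; it does not parse the CIF.
    print(mol_id, convention, payload.splitlines()[0])
```

The convention attribute is the only hook available for deciding what to do 
with the payload next.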

         JUMBO recognises the keyword "convention" (this is part of CML). 
There is no agreed way of treating this in CML at present. The intention is 
that there will be a list of conventions which CML recognises. This is the 
real challenge! The possibilities are:
         1 the system simply carries the information through. This is 
likely to be the first phase. At least we avoid information loss.
         2 the system can hyperlink to appropriate dictionaries. This is 
also possible if the convention-provider produces a URL. Thus:
         <cml:float convention="IUCr" title="_cell.measurement_temperature" 
units="K">293</cml:float>
could be processed to something like:

<a href="http://www.iucr.org/cif/core/dic.html#_cell.measurement_temperature">_cell.measurement_temperature</a>: 293

so that at least the human reader knows what the quantity is and what it 
means in human terms (by reading the dictionary).
         3 there can be a mapping of equivalent terms. Thus <cml:builtin 
type="a">... maps to cell.length_a in CIF. This has to be done manually and 
hopefully agreed by curatorial humans.
         4 the "terms" can have machine semantics included. STAR can do this 
through dREL and Python/Java. JUMBO does it by associating an XML element 
with a class through the DOM mechanism. There is still the question of how 
to discover these semantics. In the first instance I suspect we shall 
compile lists of conventions with which we can interoperate. Thus CML could 
know that when it encountered a STAR/CIF term for which it had no mapping 
but knew it was CIF from the convention attribute, it could extract the 
dRELs (if any) and apply the Python/Java. This is getting rather hairy - but 
it represents the current cutting edge.
         5. We then run up against ontology. Do I mean the same by 
bond-order as CIF? Probably not. In which case there has to be extensive 
mapping by humans. This is a highly valuable, if very tedious activity. I 
suspect that we shall want to limit the number of conventions with which 
each interacts. Thus for CML I would see:
                 - core XML tools (XSL, XSL-FO, Schemas)
                 - CIF
                 - MathML
                 - SVG
                 - UnitsML (if it happens)
                 - various Bio-MLs, possibly
and I have to be able to do some horrid stuff with the main legacy formats 
in chemistry.
STAR/CIF will presumably have a similar list.
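
The first three possibilities (carry through, hyperlink, map equivalent terms) 
can be sketched as a pair of lookup tables. This is a hedged illustration in 
Python, not how JUMBO does it: the table and function names are hypothetical, 
the URL is the one quoted above, and the single mapping entry is the cell 
length example (written with the leading underscore that CIF data names carry).

```python
from html import escape

# Hypothetical registry: convention name -> base URL of an online dictionary.
DICTIONARY_URLS = {
    "IUCr": "http://www.iucr.org/cif/core/dic.html",
}

# Hypothetical hand-curated mapping of CML builtin types to CIF data names.
CML_TO_CIF = {
    "a": "_cell.length_a",
}


def render_item(convention, title, value):
    """Hyperlink the title when the convention publishes a URL; otherwise carry it through."""
    base = DICTIONARY_URLS.get(convention)
    if base is None:
        # Possibility 1: no information is lost, but nothing is interpreted.
        return f"{escape(title)}: {escape(value)}"
    # Possibility 2: point the human reader at the dictionary definition.
    return f'<a href="{escape(base)}#{escape(title)}">{escape(title)}</a>: {escape(value)}'


def cif_name_for(builtin_type):
    """Possibility 3: return the mapped CIF data name, or None if none has been agreed."""
    return CML_TO_CIF.get(builtin_type)


print(render_item("IUCr", "_cell.measurement_temperature", "293"))
print(render_item("unknown", "_cell.measurement_temperature", "293"))
print(cif_name_for("a"), cif_name_for("bondOrder"))
```

An unknown convention or an unmapped term falls back to carrying the 
information through, so nothing is lost while the mappings are still being 
curated.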

Back to the example. At present JUMBO would be able to:
         - hold the CIF
         - convert the CIF to generic XML (I have a CIFDOM for this. It has 
general elements like <data> and <loop>)
         - extract some of the key equivalences (essentially cell params 
and atoms)
         - orthogonalise things
         - keep the other stuff safe but uninterpreted
         - be able to write out a CIF or a CML file at the end.
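
The "convert the CIF to generic XML" step can be sketched as below. This is 
not the CIFDOM code, only a minimal Python illustration under stated 
assumptions: it ignores semicolon-delimited text fields, save frames and the 
finer points of CIF quoting, and apart from <data> and <loop> the element 
names (<cif>, <row>, <item>) are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# A quoted string or a run of non-blank characters (simplified CIF tokens).
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\S+)""")


def tokens(cif_text):
    """Yield (text, was_quoted) pairs, dropping '#' comments."""
    for line in cif_text.splitlines():
        for match in TOKEN.finditer(line):
            single, double, bare = match.groups()
            if bare is not None:
                if bare.startswith("#"):
                    break  # the rest of the line is a comment
                yield bare, False
            else:
                yield (single if single is not None else double), True


def kind(token):
    """Classify a token as 'data', 'loop', 'name' or 'value'."""
    text, quoted = token
    if quoted:
        return "value"
    if text.lower().startswith("data_"):
        return "data"
    if text.lower() == "loop_":
        return "loop"
    return "name" if text.startswith("_") else "value"


def cif_to_xml(cif_text):
    """Build a generic XML tree: <cif><data><item/>...<loop><row><item/>..."""
    toks = list(tokens(cif_text))
    root = ET.Element("cif")
    data = None
    i = 0
    while i < len(toks):
        k = kind(toks[i])
        if k == "data":
            data = ET.SubElement(root, "data", name=toks[i][0][5:])
            i += 1
            continue
        if data is None:
            raise ValueError("content before the first data_ block")
        if k == "loop":
            i += 1
            names, values = [], []
            while i < len(toks) and kind(toks[i]) == "name":
                names.append(toks[i][0])
                i += 1
            while i < len(toks) and kind(toks[i]) == "value":
                values.append(toks[i][0])
                i += 1
            if not names:
                raise ValueError("loop_ without data names")
            loop = ET.SubElement(data, "loop")
            for start in range(0, len(values), len(names)):
                row = ET.SubElement(loop, "row")
                for name, value in zip(names, values[start:start + len(names)]):
                    ET.SubElement(row, "item", name=name).text = value
        elif k == "name" and i + 1 < len(toks) and kind(toks[i + 1]) == "value":
            ET.SubElement(data, "item", name=toks[i][0]).text = toks[i + 1][0]
            i += 2
        else:
            raise ValueError(f"unexpected token {toks[i][0]!r}")
    return root


sample = """data_TOZ
# 5. CHEMICAL DATA
_chemical_formula_sum             'C18 H25 N O3'
_chemical_formula_weight          303.40
loop_
_atom_type_symbol     _atom_type_scat_dispersion_real
   C    .017
   H    0
"""
root = cif_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Once the CIF is in this generic form, the extraction of equivalences and the 
writing of either format become ordinary tree operations.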

The reverse might also be possible. Consider:

data_cmlfile
_cml
;
<molecule id="NaCl">
   <atomArray>
     <atom id="a1">
       <builtin type="elementType">Na</builtin>
     </atom>
     <atom id="cl1">
       <builtin type="elementType">Cl</builtin>
     </atom>
   </atomArray>
</molecule>
;

This is (I think) a valid CIF. CIF would have to decide how to wrap the 
XML/CML - what metadata to provide, etc. I would suggest that this would 
become increasingly common so that CIF might wish to be able to support 
namespaces, e.g.

_xml.namespace "http://www.xml-cml.org/dtd/V1.0"
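
The reverse wrapping can be sketched in a few lines. This is a hedged Python 
illustration, not an agreed CIF mechanism: the function names are 
hypothetical and the _cml data name is simply the one used in the example 
above. The one syntactic hazard is a line of XML that starts with a 
semicolon, since that would close the text field early, so the sketch refuses 
such input rather than trying to escape it.

```python
import xml.etree.ElementTree as ET


def xml_to_cif(block_name, data_name, xml_text):
    """Wrap XML in a semicolon-delimited CIF text field."""
    lines = xml_text.strip("\n").splitlines()
    if any(line.startswith(";") for line in lines):
        raise ValueError("a line starting with ';' would end the text field early")
    return "\n".join([f"data_{block_name}", data_name, ";", *lines, ";"]) + "\n"


def cif_to_xml_text(cif_text, data_name):
    """Recover the text field that follows data_name (sketch: first match only)."""
    lines = cif_text.splitlines()
    start = lines.index(data_name) + 1
    if lines[start] != ";":
        raise ValueError("expected a semicolon-delimited text field")
    end = lines.index(";", start + 1)
    return "\n".join(lines[start + 1:end])


cml = """<molecule id="NaCl">
  <atomArray>
    <atom id="a1">
      <builtin type="elementType">Na</builtin>
    </atom>
    <atom id="cl1">
      <builtin type="elementType">Cl</builtin>
    </atom>
  </atomArray>
</molecule>"""

cif = xml_to_cif("cmlfile", "_cml", cml)
recovered = cif_to_xml_text(cif, "_cml")
molecule = ET.fromstring(recovered)  # still well-formed after the round trip
print(cif)
print(molecule.get("id"), len(molecule.findall("atomArray/atom")))
```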

> > XML seeing something I have already tried to tackle in CIF/STAR and
> > realising why I found it difficult! My analysis is that *semantically*, XML
> > and STAR are virtually identical.
>
>Yes, I would think so, otherwise the universality of XML would be brought
>into question. I think the new developments with respect to methods
>included in the dictionary definitions, and then compiling the dictionary
>into classes and object instantiations of those means we have moved on
>significantly from the view of STAR/CIF/DDL as piles of text. The
>dictionaries in our new system are executable, and any attempt to "access"
>a data item results in the Java/Python object for that data item being
>invoked. In this way all manner of validation, verification and evaluation
>can be done on data items. The Java/Python blend has worked for us because
>both support "reflection" (the programmatic term, not the crystallographic
>term), meaning that executable objects written in either code can be
>pulled in at run-time.

I now call these things "information objects". In one sense they can be 
seen as documents - and XML has excellent tools for processing this - XSLT 
and XSL-FO. On the other hand they are objects with methods and behaviour. 
XML provides a DOM - which is fairly basic and mainly consists of 
navigating the tree (and editing it) but there is no very good way of 
adding element-specific semantics. Each ML has to make these up. XML 
schemas may add a bit here but I think we are near the limit of consensus.
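
The idea of associating an element with a class can be sketched as follows. 
This is only an illustration in Python with the standard ElementTree, not the 
JUMBO or DOM mechanism itself; the registry, decorator and class names are 
all hypothetical. Elements with no registered class fall back to a generic 
node that knows only the tree structure.

```python
import xml.etree.ElementTree as ET

REGISTRY = {}  # element name -> class providing element-specific behaviour


def handles(tag):
    """Class decorator that registers a class for one element name."""
    def register(cls):
        REGISTRY[tag] = cls
        return cls
    return register


def wrap(element):
    """Wrap an element in its registered class, or in the generic Node."""
    return REGISTRY.get(element.tag, Node)(element)


class Node:
    """Generic behaviour: navigation of the tree and nothing more."""
    def __init__(self, element):
        self.element = element
        self.children = [wrap(child) for child in element]


@handles("atom")
class Atom(Node):
    def element_type(self):
        for child in self.element.findall("builtin"):
            if child.get("type") == "elementType":
                return child.text
        return None


@handles("molecule")
class Molecule(Node):
    def element_types(self):
        return [wrap(atom).element_type() for atom in self.element.iter("atom")]


xml_text = """<molecule id="NaCl">
  <atomArray>
    <atom id="a1"><builtin type="elementType">Na</builtin></atom>
    <atom id="cl1"><builtin type="elementType">Cl</builtin></atom>
  </atomArray>
</molecule>"""

molecule = wrap(ET.fromstring(xml_text))
print(type(molecule).__name__, molecule.element_types())
print(type(molecule.children[0]).__name__)  # atomArray has no class of its own
```

Every markup language ends up writing its own version of this registry, which 
is the gap in the generic tools described above.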



>I know all of this can be specified in XML but I think the generation of
>an executable version of the XML based dictionary isn't going to result
>from "tools", someone is going to have to knuckle down and write some
>significant code.

No question!! There are different sorts of tools:
         1 generic tools. These can deal with a wide variety of 
documents/objects but mainly move the material round. Examples are XSLT and 
DOM which can reorder and reformat but don't do domain-specific stuff like 
inverting matrices :-). However the generic tools will almost certainly 
form the basis of editors, etc.
         2 discipline-specific but application-independent. I put SVG, 
MathML, CML in this class. They don't know what the user is going to do in 
detail. I have also developed a generic dictionary application 
(http://www.vhg.org.uk) which will browse any hierarchical dictionary. For 
example it will index and search dictionaries. It may be useful for parts 
of what CIF does.
         3 application-specific. This could be a publishing application 
(e.g. ActaCryst), a logfile analyser for X-ray refinement, a database of 
crystal structures, etc. In general these will have to include several 
components of 2 mixed in varying proportions.

I put CIF/STAR in 2, but note that it actually contains several 
disciplines. Wherever possible these should be separated and modularised. 
In some cases it will be seen that there are solutions which CIF needs to 
provide (e.g. validation against crystallographic concepts) in others 
(typesetting) it may be easier to convert CIF to another approach.


> > My own approach to CIF - which is the only one that I personally can write
> > code for - is to transform it into XML and use XML tools. This may seem
> > like heresy, but ... If I wish to use CIF/STAR syntax then I write DOM and
> > XSL-based converters in both directions.  This does NOT mean I abandon the
> > CIF effort - quite the reverse. In CML I specifically support the use of
> > ontologies (dictionaries) from IU's and other learned bodies and I put the
> > CIF dictionaries at the top. But I have to convert them to XML to make the
> > reading, editing, display, validation, etc. possible.
>
>Doesn't sound like heresy to me. It sounds like astute and sensible re-use
>of existing technologies to leverage up CIF/STAR as a usable format.

Glad it's not heresy!

> > Apart from syntax (above) checking should involve:
> >          document VS dictionary (equivalent to XML validation). not trivial
> >          dictionary VS DDL       ditto but additional effort
> >          DDL vs DDL
>
>We do this in STAR. In fact everything is driven through the dictionaries,
>the discipline specific dictionary and the DDL dictionary (the dictionary
>that defines the DDL language)
>
> > All these are equivalent to XSLT operations. For example, sorting a CIF is
> > not trivial.
>
>What do you mean by "sorting a CIF"?

I mean that some readers/authors may wish to view a CIF in a different 
order from the authors/readers, perhaps based on category names, or a local 
set of "most important data items". This is what XSLT is good at.
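
As a rough illustration of such a reordering (in Python here rather than 
XSLT, and with a hypothetical function name), the sketch below puts a local 
list of "most important" items first and groups the rest by category. The 
rule for deriving a category from a data name is an assumption made for the 
example; a real tool would take it from the dictionary.

```python
def sort_items(items, priority=()):
    """Order (name, value) pairs: priority names first, then grouped by category."""
    rank = {name: position for position, name in enumerate(priority)}

    def category(name):
        # Assumption: the category is the part before the first '.', or for
        # names without a '.', the first underscore-separated word.
        body = name.lstrip("_")
        return body.split(".")[0] if "." in body else body.split("_")[0]

    def key(pair):
        name = pair[0]
        if name in rank:
            return (0, rank[name], "")
        return (1, 0, category(name))

    # sorted() is stable, so items keep their original order within a category.
    return sorted(items, key=key)


items = [
    ("_chemical_name_systematic", "?"),
    ("_cell.length_a", "?"),
    ("_chemical_formula_weight", "303.40"),
    ("_atom_type_symbol", "C"),
    ("_cell.measurement_temperature", "293"),
]
ordered = sort_items(items, priority=("_chemical_formula_weight",))
for name, value in ordered:
    print(name, value)
```

The data itself is untouched; only the order of presentation changes, which 
is why this kind of view belongs in a transformation step and not in the 
file.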

> > I now use java because it is the lingua franca of the web and the first
> > tool for XML developers. It also comes with a huge library (e.g. Date,
> > Math, Collections, etc.) which simplify a lot of things.
>
>We have focussed on both Java and Python. I think Java is more than just a
>web language (though many will disagree) and I think it will probably
>survive any onslaught from Microsoft's C#. We use the reflection
>capabilities of both languages to write code components in either
>language.

Agreed. Personally Java is ideal for what I do - development - but I don't 
want to force it on others. Jon Bosak has commented on how Java and XML 
complement each other.

> > No - please no! I have horror stories of TIFFs and GIFs for chemistry.
> > Sometimes when rescaled bits disappear and bonds can literally disappear. 4
> > can be transformed to +, etc.
> >
> > I strongly urge SVG - the new graphics language from W3C. It's gorgeous.
> > See http://www.adobe.com/svg for some examples.
> > See also http://www.xml-cml.org
>
>I had a quick browse of svg. Very impressive, but plug-ins are restricted
>to WinTel and Macintosh. The fact that Adobe is behind it and given the
>great job they have done with postscript and pdf I think svg is very
>likely to be around for a while.

There is a CSIRO applet/application in Java which is being used and 
developed by the Apache/FOP effort. Also there was a very good early IBM 
Java implementation. I am not sure which efforts are being pursued most 
strongly. I believe that Netscape6 also has woken up to SVG in the end but 
I can't quote...

         P.


>Nick
>
>--------------------------------
>Dr Nick Spadaccini
>Department of Computer Science              voice: +(61 8) 9380 3452
>University of Western Australia               fax: +(61 8) 9380 1089
>Nedlands, Perth,  WA  6907                 email: nick@cs.uwa.edu.au
>AUSTRALIA                        web: http://www.cs.uwa.edu.au/~nick
>

Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk