Discussion List Archives

Re: Survey of available CIF software and request for wish list

At 14:22 20/09/00 +0100, Brian McMahon wrote:

>There has been a private discussion among some members over the
>last few days about how to direct the development of software to
>advance the use of CIF. I'd like to take that discussion onto the whole
>COMCIFS list for two reasons : (1) to survey what is needed, and
>(2) to canvass opinions on how to secure development effort and funding. It
>will be best to split these two threads, so I'll start here by trying to
>categorise the types of tools we need to consider, and reviewing what I know
>about the ones that exist.

I have spent a lot of my life writing tools for CIF and am fully committed 
to the CIF effort and process. I frequently tell people in other 
disciplines that I think CIF is a major achievement in scientific 
informatics. [So, if any of the remarks below seem to suggest anything 
else, I assure you that they do not, but warn you in advance that we need 
to look at other technologies as well as CIF. But CIF works and 
Chester/diffractometers will continue to speak in CIF for some time yet.]

Firstly, writing a protocol like CIF implies a huge amount of work that is 
invisible at the start and is gradually catching up with us. I have learnt 
this the hard way (!) and have found that only when I became heavily 
engaged in the XML effort did I realise the full extent of what was 
required in managing structured documents (SDs). CIF, and more so STAR, are 
structured documents and require a *large* amount of software to process 
them properly. This software is not easy to write, is tedious, does not 
bring glamorous rewards and cannot normally justify research grants. So if 
the CIF community intends to develop all its tools among itself I doubt 
whether there is the resource and commitment, especially for quality control.

My experience is taken from XML. I have spent the last 3-4 years heavily 
involved in XML, including development of the language. This includes 
Chemical Markup Language (CML) for chemistry. This does NOT mean that I 
have deserted the CIF camp but I can speak from experience about the issues 
involved. Essentially, XML and STAR (and probably the mmCIF syntax) are of 
the same complexity. XML and STAR are specifications (metalanguages) for 
creating domain-specific languages - XML is used to define XHTML, MathML, 
CML and so on; STAR is used to define CIF, mmCIF, pdCIF and so on. It is 
natural to define other support processes using the language itself, so XML 
has XMLSchemas written in XML to control the structure of the languages; 
CIF has DDLs (written in CIF). Almost weekly I get a feeling of deja vu in 
XML seeing something I have already tried to tackle in CIF/STAR and 
realising why I found it difficult! My analysis is that *semantically*, XML 
and STAR are virtually identical.

The reality of XML is that the community has put in a huge amount of effort 
to prove the language and an even larger amount to build tools (many being 
open source, including mine). CIF/STAR will have to go through the same 
steps - there is no real alternative unless the scope and power of CIF/STAR 
are dumbed down. For example, I have more or less finished writing a 
Document Object Model (DOM) for CML - I didn't realise I would have to do 
this when I developed CML - now I realise it is inevitable. The same will be 
required for CIF.

In essence there are these conservation laws:
         - you cannot hide complexity, you can only move it around
         - for everything you define in a specification, someone has to 
write code
         - it is far easier to write a specification than to implement it

CIF/STAR is *unavoidably* complex, and also requires a great deal of code 
to support it (** if it is to be processed by machines **). If we were 
writing software on an industrial basis we would be talking of 10+ years' 
work (and only that low because we already have the experience from XML of 
what needs to be done).

         The technical options for software to support CIF/STAR are:
                 - continue with CIF-specific software and commit much more 
resource than we currently do
                 - re-use non-CIF tools already written in other contexts
                 - re-define what we wish to do using CIF and what using 
other representations

I believe that only the last two are feasible.

My experience is that the effort involved in implementing a protocol 
increases in the order:
         1 a "paper" specification
         2 tools to write documents in the specification
         3 tools to read documents in the specification (much harder if the 
spec is flexible, like STAR or XML)
         4 tools to edit and transform documents (you have to have the 
equivalent of DOM or XSLT)

I have been through all of these with CML and reached about 3.5. I would 
only have got to 2.5 if there had not already been a community of thousands 
of XML developers and masses of free, high quality software (e.g. from James 
Clark).

(This omits all the discipline-specific stuff like checking bond orders, 
cell parameters, etc. which is where our most valuable efforts should be put).

Do not underestimate the problems that many people find with SD technology. 
The W3C has developed XML schemas (very similar to DDLs) and many people - 
including software developers - are questioning whether they are too 
complex. There is real doubt as to whether some of the XML constructs will 
be easy enough for general implementation.

My own approach to CIF - which is the only one that I personally can write 
code for - is to transform it into XML and use XML tools. This may seem 
like heresy, but ... If I wish to use CIF/STAR syntax then I write DOM and 
XSL-based converters in both directions.  This does NOT mean I abandon the 
CIF effort - quite the reverse. In CML I specifically support the use of 
ontologies (dictionaries) from IUs and other learned bodies and I put the 
CIF dictionaries at the top. But I have to convert them to XML to make the 
reading, editing, display, validation, etc. possible.
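
The CIF-to-XML step can be illustrated with a minimal Python sketch. It is 
not the converter described above: it handles only data_ headers and 
single-line name/value pairs, and the element names (cif, block, item) are 
invented for the illustration and are not CML.

```python
import xml.etree.ElementTree as ET


def cif_to_xml(text):
    """Convert a tiny CIF subset to an XML tree.

    Assumptions: only data_ headers and single-line "_name value" pairs;
    loops, multi-line text fields and save frames are not handled.
    The element names <cif>, <block> and <item> are made up for this sketch.
    """
    root = ET.Element("cif")
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        if line.lower().startswith("data_"):
            block = ET.SubElement(root, "block", id=line[5:])
        elif line.startswith("_"):
            if block is None:
                raise ValueError("data item found before any data_ block")
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"no value on the same line as {parts[0]}")
            name, value = parts
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                value = value[1:-1]  # drop the surrounding quote characters
            item = ET.SubElement(block, "item", name=name)
            item.text = value
        else:
            raise ValueError(f"unsupported construct: {line!r}")
    return root


example = """data_demo
_cell_length_a 5.431
_chemical_name_common 'table salt'
"""
print(ET.tostring(cif_to_xml(example), encoding="unicode"))
```

Once the data are in a tree like this, generic XML tools (XPath queries, 
XSLT, validators) can be applied without any CIF-specific code.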

If you have stayed so far :-), I'll comment on specifics below.



>=== Executive Summary ===
>There is a shortage of basic tools for handling syntax issues and dictionary
>validation checks. The existing ones are often incomplete or not fully
>robust. In particular, support for fashionable scripting languages (Perl,
>Tcl, Python) is poor. The needs of the small-molecule crystallographer are
>(or soon will be) reasonably well met, but uptake of mmCIF and imgCIF are
>still weak. Even with small-molecule applications much would be gained by
>working in an environment that can interface easily with existing
>lexer/parser tools, graphical widget sets and object storage conventions.
>=========================

There must be an object storage convention. There are two approaches to 
this - XML effectively defines one in the DOM, and OMG/CORBA define one in 
IDL. My impression is that XML specs are easier for most people to 
understand but that IDL is more powerful. At the limit they can both be 
defined in UML (and I have started to do this for CML). UML allows other 
tools to automatically generate code, specs, etc. though there is still a 
lot of manual work to be done.

>A major problem with CIF is its breadth. Unlike rendering a graphics image,
>which is well defined (so TIFF, GIF, JPEG, PNG etc are addressing the same
>problem), CIF (and friends) includes raw and processed data, connectivity
>maps, 3d coordinate sets, symmetry operations, discursive text etc etc, and
>is used to describe inorganic, molecular, macromolecular and incommensurate
>structures at least - there are already many other dictionaries in the
>pipeline.

This is a major task. XML addresses it through the namespace mechanism and 
assumes that different domains will develop protocols in parallel. It also 
presupposes that there are high-quality tools for processing all of the 
components and a means for assembling them and managing the ensemble. I 
have a list of ca 12 "media-types" I have to support in CML - and these 
basically cover the range of STM documents in general:
         text, hypermedia, image, vectorgraphics, tables, units, math, 
bibliography, terminology, multimedia?, metadata, molecules
         There are XML solutions for "most" of these. CIF should not try to 
address these problems independently.


I shall add comments from my CIF and XML experience below. Please treat 
these as constructive, though taken altogether they may seem negative.

>So we need to envisage domain-specific applications; but we must also provide
>a core of utilities that can be used in any domain. Let's begin by thinking
>about these application-independent tools. What can we identify as essential
>or even desirable?
>
>
>1. PURE SYNTAX HANDLERS
>-----------------------
>Tools that handle CIF tokens without any interpretation, and so are
>universal across all domains.
>
>a. Standalone tools
>   Function            Description                             Exists?
>   --------            -----------                             -------
>   Syntax checker      Returns result code if there is a       vcif (C)
>                        definite syntax error, and perhaps
>                        a human-readable error message

Agreed. Equivalent to well-formed XML checkers.
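
As a rough illustration of what a pure syntax check involves, here is a 
minimal Python sketch of a single-line tokeniser. It assumes the usual CIF 
rule that a quoted string ends at the matching quote character followed by 
whitespace or end of line; multi-line text fields are not handled, and the 
class and function names are hypothetical. The error code is the one 
proposed elsewhere in this message.

```python
class CifSyntaxError(Exception):
    """Syntax error carrying a machine-readable code (hypothetical class)."""

    def __init__(self, code, line_no, col):
        super().__init__(f"{code} at line {line_no}, column {col}")
        self.code, self.line_no, self.col = code, line_no, col


def tokenize_line(line, line_no=1):
    """Split one CIF line into (kind, text) tokens without interpreting them."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch == "#":
            break  # a '#' at the start of a token begins a comment
        elif ch in "'\"":
            j = i + 1
            while j < n:
                # the closing quote must be followed by whitespace or end of line
                if line[j] == ch and (j + 1 == n or line[j + 1].isspace()):
                    break
                j += 1
            else:
                raise CifSyntaxError("INCOMPLETE_QUOTE_DELIMITED_STRING", line_no, n)
            tokens.append(("STRING", line[i + 1:j]))
            i = j + 1
        else:
            j = i
            while j < n and not line[j].isspace():
                j += 1
            word = line[i:j]
            if word.startswith("_"):
                kind = "TAG"
            elif word.lower().startswith("data_"):
                kind = "DATA"
            elif word.lower() == "loop_":
                kind = "LOOP"
            else:
                kind = "VALUE"
            tokens.append((kind, word))
            i = j
    return tokens


print(tokenize_line("_a A _b 'O'Neil said' # trailing comment"))
```

A real checker would also need to track state across lines (loops, text 
fields, save frames), which is where most of the effort goes.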

>   Intelligent syntax  Indicates (probably) where the error    No
>    checker             really occurred

Apart from syntax (above) checking should involve:
         document VS dictionary (equivalent to XML validation). not trivial
         dictionary VS DDL       ditto but additional effort
         DDL vs DDL
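
The first of these checks (document vs dictionary) might be sketched as 
follows. This is a minimal Python sketch under stated assumptions: the 
dictionary is a plain Python dict standing in for a parsed DDL dictionary, 
and the keys "type", "range" and "enum" are hypothetical names only loosely 
modelled on DDL attributes.

```python
def validate(items, dictionary):
    """Check {name: raw_value} data items against a simplified dictionary.

    Returns a list of human-readable problems; an empty list means no
    problem was found by these (deliberately incomplete) checks.
    """
    problems = []
    for name, raw in items.items():
        definition = dictionary.get(name.lower())
        if definition is None:
            problems.append(f"{name}: not defined in dictionary")
            continue
        if raw in ("?", "."):
            continue  # unknown / inapplicable values are always allowed
        if definition["type"] == "numb":
            try:
                # drop a trailing standard uncertainty such as 5.431(2)
                value = float(raw.split("(")[0])
            except ValueError:
                problems.append(f"{name}: {raw!r} is not a number")
                continue
            low, high = definition.get("range", (None, None))
            if low is not None and value < low:
                problems.append(f"{name}: {value} is below minimum {low}")
            if high is not None and value > high:
                problems.append(f"{name}: {value} is above maximum {high}")
        elif "enum" in definition and raw not in definition["enum"]:
            problems.append(f"{name}: {raw!r} is not an allowed value")
    return problems


# Hypothetical dictionary fragment and data, for illustration only.
dictionary = {
    "_cell_length_a": {"type": "numb", "range": (0.0, None)},
    "_symmetry_cell_setting": {"type": "char", "enum": {"cubic", "triclinic"}},
}
print(validate({"_cell_length_a": "-1", "_made_up_name": "x"}, dictionary))
```

The other two checks (dictionary vs DDL, DDL vs DDL) have the same shape 
but need the DDL itself to be parsed first.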

>   Prettifier          Enforces line lengths, aligns loop      cif2cif (Fortran)
>                        elements

equivalent to simple XSLT transforms

>   Stream editor       Allows CIF elements to be added,        No
>                        deleted by command-line instruction
>   Rearranger          Modifies order of existing elements     quasar (f77)
>                                                               cif2cif (f77)
>   Interrogator        Extracts CIF data meeting specified     starbase (C)
>                        criteria

All these are equivalent to XSLT operations. For example, sorting a CIF is 
not trivial.

>   Tokeniser           Reads CIF and passes individual tokens  cifzinc (C)
>                        to stdout in some normalised meta
>                        representation
>   Interactive editor  Enforces correct syntax during          emacs cif.el
>                        on-screen editing                      (Lisp)

This is ultimately equivalent to XML editors with Schema-enforced 
validation. These are very complex to write. It may be that the CIF 
versions can be simpler because the range of operations is less complex, 
but ultimately an editor should check:
         - document structure (what elements can go here?)
         - element content (what generic content can this element have?)
         - domain-specific content. Is this value allowed (e.g. by the 
dictionary)


>b. Libraries
>
>   CIFtbx (Fortran), CIFLIB (C API), CIFOBJ (C++ class library) are publicly
>   available, CCDC has developed a C++ class library within the CIFer
>   project, Luca Lutterotti of Trento, Italy has advertised an incomplete
>   Java class library on cif-developers. There is also Peter Murray-Rust's
>   old C++ library (somewhere).

I can look out everything I have written and make it available! Some may 
have decayed. I have some more recent Java stuff for CIF2XML which can act 
as a basis for someone to work with.


>   So far as I am aware, the Rutgers libraries compile (easily) on only a
>   small number of platforms.

I now use Java because it is the lingua franca of the web and the first 
tool for XML developers. It also comes with a huge library (e.g. Date, 
Math, Collections, etc.) which simplifies a lot of things.

>   Is it beneficial to define a standard applications program interface that
>   different libraries could converge to?

It is ultimately essential. It is also expensive and boring. I know! This 
is effectively what a DOM is. CML DOM has ca 50 classes and over 1000 
methods. CIF DOMs will be smaller (if there is no crystallography involved).


>  For example, a standard set of
>   exceptions defining types of syntax error (applications would of course
>   use their own exception handlers, but the specific errors in a file would
>   be well defined across all libraries, e.g.
>         _a  A  _b  'Broken char string  _c C
>   would raise the exception INCOMPLETE_QUOTE_DELIMITED_STRING at the end of
>   the line). Likewise, how closely aligned are the library functions across
>   the existing libraries? Does CIFtbx have an equivalent function to the
>   CIFLIB cifGetRowByIndex, for example? Should it have?
>
>
>2. DICTIONARY TOOLS
>-------------------
>The next most general category contains tools which know how to handle
>dictionaries, but have no domain-specific content. Ideally they should be
>able to handle DDL1 and DDL2 dictionaries transparently.
>
>a. Standalone tools
>   Function            Description                             Exists?
>   --------            -----------                             -------
>   Syntax checker      As for data files, but knows about      vcif (C)
>                        save_ frames which are absent from
>                        data files
>   Intelligent syntax  Less important than for data files      No
>    checker
>   Prettifier          Aligns lists of definition elements     No
>   Merger              Combines dictionary files and fragments No
>                        into a single dictionary a la
>                        McMahon/Bernstein/Westbrook protocol
>   Name locator        Finds CIF datanames in dictionaries     cyclops (f77)
>   Extractor           Extracts definition                     cman (rudimentary) (C)
>   Browser             Graphical tool to browse dictionary     No
>                        (read-only)

I wrote an mmCIF dictionary browser in Java ca 2 years ago. It would be 
easier now. The dictionary is sufficiently complex that it has to have a 
browser.

>   Web browser         Really an implementation of a           mmCIF (Rutgers)/
>                        cif2html conversion                    core/pdCIF (IUCr)

Again I wrote something which expanded CIFs into something that could be 
displayed on the screen. There is a real challenge with mmCIF as it can be 
viewed as a structured document and/or a set of relational tables. It is 
very difficult to devise a generic approach to browsing that satisfies all 
possible mmCIFs. I would certainly now address it through XSLT which allows 
joins through keys.
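
The "join through keys" idea can be shown outside XSLT. The following is a 
minimal Python sketch, not XSLT, treating two mmCIF categories as tables of 
rows and joining child rows to their parent through a key. The function 
name is hypothetical and the data values are invented for illustration.

```python
def join(child_rows, child_key, parent_rows, parent_key):
    """Attach the matching parent row to each child row via a key lookup."""
    index = {row[parent_key]: row for row in parent_rows}  # key -> parent row
    joined = []
    for row in child_rows:
        parent = index.get(row[child_key])
        if parent is None:
            raise KeyError(f"no parent with {parent_key} = {row[child_key]!r}")
        merged = dict(parent)
        merged.update(row)  # child columns are kept alongside parent columns
        joined.append(merged)
    return joined


# Invented example rows for two categories.
entity = [
    {"_entity.id": "1", "_entity.type": "polymer"},
    {"_entity.id": "2", "_entity.type": "water"},
]
atom_site = [
    {"_atom_site.id": "1", "_atom_site.label_entity_id": "1"},
    {"_atom_site.id": "2", "_atom_site.label_entity_id": "2"},
]
for row in join(atom_site, "_atom_site.label_entity_id", entity, "_entity.id"):
    print(row["_atom_site.id"], row["_entity.type"])
```

A generic browser would need to discover which items act as keys from the 
dictionary rather than have them passed in by hand as here.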


>b. Libraries
>
>The primary requirement is to validate data files against the contents of
>one or more nominated dictionaries. CIFtbx (f77) and CIFOBJ (C++) provide
>routines for this (probably some also in CIFLIB), but I think these are all
>incomplete - please correct me if I'm wrong. CIFOBJ is DDL2 specific. CCDC's
>HICCuP program had some Python validation routines against DDL1
>dictionaries, again incomplete.
>
>Specific things that need doing include:
>
>     completing validation functions for DDL1/2 dictionaries in CIFtbx;
>     a C or C++ DDL1 validator;
>     a reference _type_construct parser/validator to check data typing
>      through regular expressions (_type_construct has been used in the
>      msCIF dictionary, but without software it's difficult to be sure that
>      Gotzon's expressions will work). In fact, _type_construct would need
>      to be fully specified before such software can be developed;
>     an IP-enabled tool to retrieve and cache public dictionaries referenced
>      through _audit_conform... data items and the IUCr registry;
>     implementation of the dictionary merging protocol.
>
>c. "Trip" test
>
>A suite of tests that would allow developers to confirm that they are
>writing CIFs fully compliant with the standard would be beneficial. This
>should be at the level of checking syntax and compliance against specified
>dictionaries.

Does this mean roundtripping? I mean the ability to transform a CIF into 
something else (memory or other format) and retransform to original CIF 
without information loss. I have just finished doing this for a 
(non-molecular) XML application and it has been very useful. There is also 
the question of whether there should be a canonical CIF representation: 
given 2 CIF representations of the same data, can we normalise/canonicalise 
them to show they are identical?
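
One possible reading of canonicalisation, as a minimal Python sketch: data 
names are compared case-insensitively and item order is ignored. Collapsing 
whitespace inside values is an extra assumption made here for illustration 
(it would be wrong for multi-line text fields), and numerically equal 
values written differently (5.0 vs 5.00) are not reconciled.

```python
def canonical(items):
    """Return an order-independent, case-normalised form of {name: value} items."""
    return tuple(sorted(
        (name.lower(), " ".join(value.split()))  # normalise name case and spacing
        for name, value in items.items()
    ))


first = {"_Cell_Length_A": "5.431", "_chemical_name_common": "table  salt"}
second = {"_chemical_name_common": "table salt", "_cell_length_a": "5.431"}
print(canonical(first) == canonical(second))
```

A roundtrip test then amounts to checking that canonical(original) equals 
canonical(reread) after writing and re-reading the data.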

>3. SEMANTIC TRANSLATORS
>-----------------------
>Still steering clear of applications that need specifically crystallographic
>programming...
>
>a. Standalone programs
>   Function            Description                             Exists?
>   --------            -----------                             -------
>   Formatters          Render in readable format via TeX,      ciftex, cif2xml,
>                        HTML, SGML, XML etc                     Rutgers dic->HTML

If you start with XML, XSLT does all of these and could do XML2CML. XSL-FO 
is also being developed to render to PDF.

>   Data converters     Conversion of all (or some) CIF data    cif2sx (ShelX)
>                        to various other existing formats      pdb2cif/cif2pdb

XSLT can sometimes do this, but other times there needs to be a DOM.


>b. Libraries
>   Such utilities will tend to be fairly specific, but it would help to have
>   common routines for mapping tokens between identical or similar data
>   structures. So an mmCIF and associated DDL2 dictionary are isomorphous
>   to a relational database with an associated schema. My ciftex output is
>   a linear stream of tagged values, and is essentially isomorphous to the
>   input CIF. However, an SGML translation is harder, because the document
>   structure in SGML (depending on how it is defined by a DTD) may be a
>   hierarchical model; how does the flat-field CIF map into that structure?

This is a useful point. core CIF is less complex than STAR and is flattish. 
But it still needs some of the SD technology.


>4. CRYSTALLOGRAPHIC APPLICATIONS
>--------------------------------
>Now we get to the bit where we ask what the crystallographic community
>wants. Here are a few observations and suggestions from me; others are
>welcome to add their 2 cents (or $2!).
>
>Small-molecule community
>------------------------
>a. A structured CIF editor. CCDC are working well on this. The tool can import
>    data files and data blocks (so things like descriptions of equipment can
>    be stored in a template block). There is a "wizard" that prompts for
>    "required" data items (to be supplied by journals or other applications
>    in a lookup file). There is a visualisation window where a 3D structure
>    can be rendered and rotated - this borrows code from the CSD database
>    software, and so is quite crystallographically aware - it can (I think)
>    show symmetry-generated parts of a molecule and packing in a unit cell,
>    in a variety of rendering styles.

I differentiate between an *editor* and a *primary authoring tool*. An 
editor has to be able to read in *any* compliant CIF (which could have any 
elements in any order) and validate it. A p.a.t simply has to be able to 
emit valid CIF. This is normally a lot easier.

>    What's missing? I would guess that the version 1 release will lack the
>    following features and functionality that CCDC want to have in due course
>    (please correct me if I've got anything wrong, Owen).
>
>    "WYSIWYG". Text needs to be entered using the CIF backslash coding
>    conventions. Probably WYSIWYG will be introduced initially through
>    cut-and-paste out of a word-processor window. I don't know whether
>    it's possible to support clipboard formats across different platforms
>    (Microsoft Windows, Mac, Linux StarOffice etc).
>
>    Two-dimensional chemical diagrams. CCDC and Acta have requirements for
>    2D diagrams. There are various possible avenues of approach. (i) One is to
>    embed a graphics file (TIFF or PostScript) in a text file in the CIF.

No - please no! I have horror stories of TIFFs and GIFs for chemistry. 
Sometimes, when an image is rescaled, bits vanish and bonds can literally 
disappear. A "4" can be transformed into a "+", etc.

I strongly urge SVG - the new graphics language from W3C. It's gorgeous. 
See http://www.adobe.com/svg for some examples.
See also http://www.xml-cml.org

>    This would require an embedding convention, similar to the imgCIF
>    MIME convention; software to de-embed and decode the graphic;
>    software to render the resulting TIFF or PS image. Substantial effort,
>    and the result is just a picture. (ii) Another way is to embed the output
>    file from common drawing packages such as ChemDraw and ISISDraw. As
>    before, one needs to de-embed the file, decode it, render it in the
>    style of the original package, and then parse it for chemical
>    connectivity information (which is what is really wanted). The payoff
>    is that the connectivity is read, but the software engineering is
>    substantial and at the mercy of several proprietary formats.

I sympathise with this and as a result developed CML. CML is open, and I am 
developing an opensource set of tools. So far I have a CMLDOM (to be 
announced shortly), and am developing display and editing software. A major 
problem with *all* chemical editors is that there is no agreed ontology 
(unlike CIF!!) and so conventions from different manufacturers require 
proprietary software to convert them. As Brian mentioned IUPAC is working 
on a unique chemical identifier (IChI) which will address the unique 
representation of molecules and I have actively committed CML to this.

Proprietary tools are a first step, but open protocols should be used asap.

>    (iii) One could use the CIF (or, better, MIF) connectivity datanames.
>    Ideally one would persuade the major manufacturers of such software
>    to provide CIF/MIF as an export format from their packages. It may
>    still be necessary to embed a graphics file for high-resolution
>    publication, however. (iv) The other approach to connectivity is to
>    infer chemical bond types from the 3D image, and allow the user to
>    edit the 3D diagram interactively, trapping the result in CIF/MIF
>    fields. This captures the chemical information, but loses the
>    aesthetics of the commercial graphic presentation. It also alienates
>    chemist authors who are familiar with the existing software
>    packages. Of these options, (iii) looks best, but depends on
>    persuading the manufacturers... usual story.
>
>    Polyhedron rendering for inorganics.
>
>    Intensity profiles for powder patterns? Not one I've discussed anywhere
>    else, but if a structural CIF included a powder pattern it would be nice
>    to be able to visualise the intensity data. Maybe not an essential
>    component of a CIF editor, though.

I think it's critical to start capturing as much data *in machine 
processable form* as possible. I assume that pdCIF will do this anyway so 
this is a question of how it is displayed. SVG could be very useful here - 
I use it for spectra from JCAMP, but I admit that there are no high-level 
tools yet.

>    Consistency checks against CIF dictionaries.
>
>    mmCIF compatibility (i.e. I don't think it will be able to read a
>    small-molecule structure written in the DDL2 version of the Core).
>
>
>b. Three-dimensional visualiser
>
>    Existing tools are: Xtal_GX - not bad, and with a lot of crystallographic
>    knowledge. Accesses CIF data blocks through a Tcl/Tk parser and GUI
>    editor. Undoubtedly very useful to Xtal users, but the user interface
>    is probably not intuitive to other users. I don't think it can read mmCIF
>    format.
>
>    OpenSource RasMol - favourite tool of protein folk; can read DDL1 and
>    DDL2 CIFs, though can make incorrect bond assignments in small-molecule
>    structures. Not crystallographically aware - cannot generate missing
>    molecular fragments through application of symmetry operations, nor
>    cell packing diagrams. Displays properly annotated disordered ensembles
>    in different colours. Easy to use. Its major drawback (other than its
>    lack of crystallography) is that it is available only as a helper and not
>    a plugin to web browser windows - though I understand from Herbert that
>    developing Netscape/IE plugins is a very high-overhead business.
>
>    There are also commercial products: I know of at least Crystallographica,
>    WebLabViewer. These are platform-specific (Windows, Mac respectively) and
>    cannot read mmCIF.
>
>
>c. Data exchange
>
>    Most small-molecule refinement packages seem to read and write coreCIF
>    satisfactorily. Some make assumptions about data ordering or content
>    that are not mandated (or even warranted) by the specification.

This is a fundamental aspect of the spec. It requires all software to be 
able to read CIFs *from another tool*. No assumption about ordering can be 
made - the spec says so. It is a good example of how reading is a lot 
harder than writing!
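
The ordering point can be made concrete with a minimal Python sketch of a 
loop reader that looks values up by data name, never by column position. It 
assumes a single loop_ with whitespace-separated, unquoted values; the 
function name is hypothetical and the coordinates are invented.

```python
def read_loop(text):
    """Parse one simple CIF loop_ into a list of rows keyed by data name."""
    tokens = text.split()
    if not tokens or tokens[0].lower() != "loop_":
        raise ValueError("expected loop_")
    names = []
    i = 1
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i].lower())  # data names are case-insensitive
        i += 1
    values = tokens[i:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("loop values do not fill whole rows")
    width = len(names)
    return [dict(zip(names, values[r:r + width]))
            for r in range(0, len(values), width)]


# The same data written by two imaginary programs with different column order.
from_tool_a = """loop_
_atom_site_label _atom_site_fract_x _atom_site_fract_y
C1 0.10 0.20
O1 0.30 0.40
"""
from_tool_b = """loop_
_atom_site_fract_y _atom_site_label _atom_site_fract_x
0.20 C1 0.10
0.40 O1 0.30
"""
print(read_loop(from_tool_a) == read_loop(from_tool_b))
```

A reader written this way gives the same result for both files; one that 
assumes "label is always the first column" silently misreads the second.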

Note that I am NOT suggesting that diffractometers, IUCr editors, etc. 
abandon CIF *syntax*. It's here, works well and has a very high success 
rate. But the process should accommodate more recent technologies when 
they become appropriate.


>Macromolecular community
>------------------------
>
>I invite comment from mmCIFers. As I understand things, protein
>crystallographers deposit data through a web editor which transforms the
>input to mmCIF files. The web editor uses the mmCIF dictionary for
>validation as the deposition proceeds, and ensures a high degree of data
>consistency. It is configurable to different purposes, but I'm not sure that
>it would have any application to constructing a small-molecule CIF (though I
>shall be happy to be corrected). mmCIFs are also available for download for
>every structure in the PDB, generated in the case of legacy data from
>Herbert's reworking of Phil Bourne's original pdb2cif translator. Despite
>community awareness of deficiencies, the old PDB format remains the de facto
>standard for macromolecular software, though a small number of refinement
>packages now write (and read?) mmCIF. RasMol is an effective macromolecular
>structure viewer.
>
>
>Powder diffraction, modulated structures
>----------------------------------------
>pdCIF and msCIF are written by a small number of programs in their
>respective fields (msCIF still in beta). I am not aware of any visualisation
>tools or any specific requirements by journals that would impinge upon
>software for these domains.
>
>
>Image plate data
>----------------
>The imgCIF dictionary is now under active COMCIFS review, and the imgCIF/CBF
>working group have a well developed API and library. The handling of images,
>though not a trivial task, is well defined. Support is still lacking from
>equipment manufacturers.
>
>
>Chemistry
>---------
>As mentioned above in my lengthy discourse on the CCDC editor, it would be
>beneficial to have 2D chemical structural information output in MIF format
>by standard commercial software packages. Perhaps relevant to this is an
>IUPAC initiative to generate identifiers for chemical compounds that is
>derivable from the compound's connection table. Perhaps also of some
>relevance to CIF matters is IUPAC's official endorsement of CML (chemical
>markup language) as an information interchange mechanism.

See above.

I hope this is useful. I am willing to try to unearth (though not to 
repair!) any CIF-related software I may have written and make it available 
as a first step. But software decays if not used and I can't make promises.

My summary is roughly:
         - the CIF initiative in creating dictionaries is absolutely the 
right way to go
         - the greatest effort should go into verifying the domain-specific 
aspects of the dictionaries. Only IUCr/COMCIFS can reasonably do this
         - the dictionaries should be re-usable by other disciplines 
(chemistry, materials science, etc.). In this way we start to normalise the 
use of crystallographic information over the world
         - in reverse, CIF should borrow from other disciplines (e.g. 
chemistry) where appropriate
         - the CIF project implies a large amount of generic technology for 
structured documents. Where possible this technology should be borrowed 
from elsewhere rather than rewritten by crystallographers
         - this is a general problem facing all IUs, scientific authors and 
publishers. The last 3 years have shown dramatic changes in technology and 
appreciation of the challenges. Whatever is decided must have an element of 
flexibility and an element of consistency. Not easy!

And - if it is some reassurance - I see a number of other disciplines and 
crystallography is often well ahead of them. Several of them are moving to 
XML and I am sure that this will play a central role in the future.

         Peter

>Regards
>Brian

Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk