Survey of available CIF software and request for wish list

  • To: Multiple recipients of list <comcifs-l@iucr.org>
  • Subject: Survey of available CIF software and request for wish list
  • From: Brian McMahon <bm@iucr.org>
  • Date: Wed, 20 Sep 2000 14:22:45 +0100 (BST)

There has been a private discussion among some members over the
last few days about how to direct the development of software to 
advance the use of CIF. I'd like to take that discussion onto the whole
COMCIFS list for two reasons: (1) to survey what is needed, and
(2) to canvass opinions on how to secure development effort and funding. It
will be best to split these two threads, so I'll start here by trying to
categorise the types of tools we need to consider, and reviewing what I know
about the ones that exist.

=== Executive Summary ===
There is a shortage of basic tools for handling syntax issues and dictionary
validation checks. The existing ones are often incomplete or not fully
robust. In particular, support for fashionable scripting languages (Perl,
Tcl, Python) is poor. The needs of the small-molecule crystallographer are
(or soon will be) reasonably well met, but uptake of mmCIF and imgCIF is
still weak. Even with small-molecule applications, much would be gained by
working in an environment that can interface easily with existing
lexer/parser tools, graphical widget sets and object storage conventions.
=========================

A major problem with CIF is its breadth. Unlike rendering a graphics image,
which is well defined (so TIFF, GIF, JPEG, PNG etc are addressing the same
problem), CIF (and friends) includes raw and processed data, connectivity
maps, 3D coordinate sets, symmetry operations, discursive text etc etc, and
is used to describe inorganic, molecular, macromolecular and incommensurate
structures at least - there are already many other dictionaries in the
pipeline.

So we need to envisage domain-specific applications; but we must also provide
a core of utilities that can be used in any domain. Let's begin by thinking
about these application-independent tools. What can we identify as essential
or even desirable?


1. PURE SYNTAX HANDLERS
-----------------------
Tools that handle CIF tokens without any interpretation, and so are
universal across all domains.

a. Standalone tools
  Function            Description                             Exists?
  --------            -----------                             -------
  Syntax checker      Returns result code if there is a       vcif (C)
                       definite syntax error, and perhaps
                       a human-readable error message
  Intelligent syntax  Indicates (probably) where the error    No
   checker             really occurred
  Prettifier          Enforces line lengths, aligns loop      cif2cif (Fortran)
                       elements
  Stream editor       Allows CIF elements to be added,        No
                       deleted by command-line instruction
  Rearranger          Modifies order of existing elements     quasar (f77)
                                                              cif2cif (f77)
  Interrogator        Extracts CIF data meeting specified     starbase (C)
                       criteria
  Tokeniser           Reads CIF and passes individual tokens  cifzinc (C)
                       to stdout in some normalised meta
                       representation
  Interactive editor  Enforces correct syntax during          emacs cif.el
                       on-screen editing                      (Lisp)
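
As a rough illustration of the "Tokeniser" row in the table above, here is a
minimal Python sketch that reads CIF text and emits tokens in a normalised
(kind, value) form. It is not cifzinc and does not reproduce its output
format; the token kinds (DATA, SAVE, LOOP, NAME, VALUE, TEXT) and the
function name tokenise are assumptions made for this example. It does no
error reporting: an unclosed quote simply falls through as a bare value.

```python
import re

# One token per match within a single line: a comment, a quote-delimited
# string (the closing quote must be followed by whitespace or end of line),
# or a run of non-blank characters.
_TOKEN_RE = re.compile(
    r"""(?P<comment>\#.*)
      | (?P<quote>['"])(?P<quoted>.*?)(?P=quote)(?=\s|$)
      | (?P<bare>\S+)""",
    re.VERBOSE,
)


def _classify(word):
    """Map an unquoted word to a (kind, value) pair (kinds are hypothetical)."""
    lowered = word.lower()
    for prefix, kind in (("data_", "DATA"), ("save_", "SAVE")):
        if lowered.startswith(prefix):
            return (kind, word[len(prefix):])
    if lowered == "loop_":
        return ("LOOP", "")
    if word.startswith("_"):
        return ("NAME", word)
    return ("VALUE", word)


def tokenise(text):
    """Yield (kind, value) pairs for a CIF string; comments are dropped."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Semicolon-delimited text field: collect lines up to the
            # next line that starts with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            yield ("TEXT", "\n".join(body))
            i += 1  # step past the closing ';' line
            continue
        for match in _TOKEN_RE.finditer(line):
            if match.group("comment") is not None:
                break  # the comment runs to the end of the line
            if match.group("quote") is not None:
                yield ("VALUE", match.group("quoted"))
            else:
                yield _classify(match.group("bare"))
        i += 1
```

A standalone tool would wrap this in a loop that prints one token per line
to stdout, so that Perl, Tcl or shell scripts could consume the stream
without knowing anything about CIF quoting rules.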

b. Libraries
  
  CIFtbx (Fortran), CIFLIB (C API) and CIFOBJ (C++ class library) are publicly
  available; CCDC has developed a C++ class library within the CIFer
  project; and Luca Lutterotti of Trento, Italy has advertised an incomplete
  Java class library on cif-developers. There is also Peter Murray-Rust's
  old C++ library (somewhere).

  So far as I am aware, the Rutgers libraries compile (easily) on only a
  small number of platforms.

  Is it beneficial to define a standard applications program interface that
  different libraries could converge to? For example, a standard set of
  exceptions defining types of syntax error (applications would of course
  use their own exception handlers, but the specific errors in a file would
  be well defined across all libraries, e.g.
        _a  A  _b  'Broken char string  _c C
  would raise the exception INCOMPLETE_QUOTE_DELIMITED_STRING at the end of
  the line). Likewise, how closely aligned are the library functions across
  the existing libraries? Does CIFtbx have an equivalent function to the
  CIFLIB cifGetRowByIndex, for example? Should it have?
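
  To make the idea of a shared exception vocabulary concrete, here is a
  minimal Python sketch. Only the name INCOMPLETE_QUOTE_DELIMITED_STRING
  comes from the text above; the second error code, the class names and the
  function check_quotes are hypothetical and do not correspond to any
  existing library. The check handles one line at a time and reports the
  error at the end of the line, as in the example.

```python
import re
from enum import Enum


class SyntaxErrorCode(Enum):
    """Error codes a set of libraries might agree on (illustrative only)."""
    INCOMPLETE_QUOTE_DELIMITED_STRING = 1
    UNTERMINATED_TEXT_FIELD = 2  # hypothetical second member


class CifSyntaxError(Exception):
    """Carries a shared error code plus the position where it was raised."""

    def __init__(self, code, line, column):
        super().__init__(f"{code.name} at line {line}, column {column}")
        self.code = code
        self.line = line
        self.column = column


# A quoted string closes only at a matching quote that is followed by
# whitespace or the end of the line.
_CLOSED_QUOTE = re.compile(r"""(['"]).*?\1(?=\s|$)""")


def check_quotes(line_text, line_no=1):
    """Raise CifSyntaxError if a quoted string on this line is never closed."""
    pos = 0
    end = len(line_text)
    while pos < end:
        ch = line_text[pos]
        if ch.isspace():
            pos += 1
        elif ch == "#":
            return  # the rest of the line is a comment
        elif ch in "'\"":
            match = _CLOSED_QUOTE.match(line_text, pos)
            if match is None:
                # Reported at the end of the line, where the string
                # should have been closed.
                raise CifSyntaxError(
                    SyntaxErrorCode.INCOMPLETE_QUOTE_DELIMITED_STRING,
                    line_no,
                    end,
                )
            pos = match.end()
        else:
            # Skip over an unquoted token.
            while pos < end and not line_text[pos].isspace():
                pos += 1
```

  An application would catch CifSyntaxError and decide for itself how to
  present the problem; the point is only that the code attached to the
  exception would mean the same thing whichever library raised it.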


2. DICTIONARY TOOLS
-------------------
The next most general category contains tools which know how to handle
dictionaries, but have no domain-specific content. Ideally they should be
able to handle DDL1 and DDL2 dictionaries transparently.

a. Standalone tools
  Function            Description                             Exists?
  --------            -----------                             -------
  Syntax checker      As for data files, but knows about      vcif (C)
                       save_ frames which are absent from
                       data files
  Intelligent syntax  Less important than for data files      No
   checker             
  Prettifier          Aligns lists of definition elements     No
  Merger              Combines dictionary files and fragments No
                       into a single dictionary a la
                       McMahon/Bernstein/Westbrook protocol
  Name locator        Finds CIF datanames in dictionaries     cyclops (f77)
  Extractor           Extracts definition                     cman
                                                              (rudimentary) (C)
  Browser             Graphical tool to browse dictionary     No
                       (read-only)
  Web browser         Really an implementation of a           mmCIF (Rutgers)/
                       cif2html conversion                    core/pdCIF (IUCr)

b. Libraries

The primary requirement is to validate data files against the contents of
one or more nominated dictionaries. CIFtbx (f77) and CIFOBJ (C++) provide 
routines for this (probably some also in CIFLIB), but I think these are all
incomplete - please correct me if I'm wrong. CIFOBJ is DDL2 specific. CCDC's
HICCuP program had some Python validation routines against DDL1
dictionaries, again incomplete.

Specific things that need doing include:

    completing validation functions for DDL1/2 dictionaries in CIFtbx;
    a C or C++ DDL1 validator;
    a reference _type_construct parser/validator to check data typing
     through regular expressions (_type_construct has been used in the
     msCIF dictionary, but without software it's difficult to be sure that
     Gotzon's expressions will work). In fact, _type_construct would need
     to be fully specified before such software could be developed;
    an IP-enabled tool to retrieve and cache public dictionaries referenced
     through _audit_conform... data items and the IUCr registry;
    implementation of the dictionary merging protocol.
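
The regular-expression item in the list above might look something like the
following Python sketch. Since _type_construct is not yet fully specified,
this simply treats each construct as an ordinary Python regular expression;
the data names _example_date and _example_count, their patterns, and the
function validate_types are all invented for illustration and come from no
real dictionary.

```python
import re

# Hypothetical dictionary extract: data name -> regular expression that a
# value must match in full. Real _type_construct syntax may differ.
TYPE_CONSTRUCTS = {
    "_example_date": r"\d{4}-\d{2}-\d{2}",
    "_example_count": r"[0-9]+",
}


def validate_types(items, constructs=TYPE_CONSTRUCTS):
    """Return a list of (data name, problem) pairs; empty means all passed.

    items maps data names to their values as strings.
    """
    problems = []
    for name, value in items.items():
        pattern = constructs.get(name)
        if pattern is None:
            problems.append((name, "not defined in dictionary"))
        elif re.fullmatch(pattern, value) is None:
            problems.append((name, f"value {value!r} does not match {pattern!r}"))
    return problems
```

A reference implementation would also need to settle which regular
expression dialect the dictionaries are written in, which is exactly the
specification gap noted above.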

c. "Trip" test

A suite of tests that would allow developers to confirm that they are
writing CIFs fully compliant with the standard would be beneficial. This
should be at the level of checking syntax and compliance against specified
dictionaries.


3. SEMANTIC TRANSLATORS
-----------------------
Still steering clear of applications that need specifically crystallographic
programming...

a. Standalone programs
  Function            Description                             Exists?
  --------            -----------                             -------
  Formatters          Render in readable format via TeX,      ciftex, cif2xml,
                       HTML, SGML, XML etc                     Rutgers dic->HTML
  Data converters     Conversion of all (or some) CIF data    cif2sx (ShelX)
                       to various other existing formats       pdb2cif/cif2pdb

b. Libraries
  Such utilities will tend to be fairly specific, but it would help to have
  common routines for mapping tokens between identical or similar data
  structures. For example, an mmCIF and its associated DDL2 dictionary are
  isomorphous to a relational database with an associated schema. My ciftex
  output is a linear stream of tagged values, and is essentially isomorphous
  to the input CIF. However, an SGML translation is harder, because the
  document structure in SGML (depending on how it is defined by a DTD) may
  be a hierarchical model; how does the flat-field CIF map into that
  structure?
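
  The relational correspondence for DDL2-style names can be sketched in a
  few lines of Python. This assumes data names of the form
  _category.attribute and single-valued items only; looped data, which
  would become multi-row tables, is left out. The function name
  group_by_category is made up for this example.

```python
from collections import defaultdict


def group_by_category(items):
    """Group flat DDL2-style items into {category: {attribute: value}}.

    items maps data names such as '_cell.length_a' to values. A name
    without a '.' ends up with an empty attribute name.
    """
    tables = defaultdict(dict)
    for name, value in items.items():
        category, _, attribute = name.lstrip("_").partition(".")
        tables[category][attribute] = value
    return dict(tables)
```

  Each top-level key then corresponds to a table and each inner key to a
  column. Mapping the same flat items into an SGML hierarchy needs extra
  information about nesting that the data names alone do not carry, which
  is the difficulty raised above.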


4. CRYSTALLOGRAPHIC APPLICATIONS
--------------------------------
Now we get to the bit where we ask what the crystallographic community
wants. Here are a few observations and suggestions from me; others are
welcome to add their 2 cents (or $2!).

Small-molecule community
------------------------
a. A structured CIF editor. CCDC are working well on this. The tool can import
   data files and data blocks (so things like descriptions of equipment can
   be stored in a template block). There is a "wizard" that prompts for
   "required" data items (to be supplied by journals or other applications
   in a lookup file). There is a visualisation window where a 3D structure
   can be rendered and rotated - this borrows code from the CSD database
   software, and so is quite crystallographically aware - it can (I think)
   show symmetry-generated parts of a molecule and packing in a unit cell,
   in a variety of rendering styles.

   What's missing? I would guess that the version 1 release will lack the
   following features and functionality that CCDC want to have in due course
   (please correct me if I've got anything wrong, Owen).

   "WYSIWYG". Text needs to be entered using the CIF backslash coding
   conventions. Probably WYSIWYG will be introduced initially through
   cut-and-paste out of a word-processor window. I don't know whether
   it's possible to support clipboard formats across different platforms
   (Microsoft Windows, Mac, Linux StarOffice etc).

   Two-dimensional chemical diagrams. CCDC and Acta have requirements for
   2D diagrams. There are various possible avenues of approach.

   (i) One is to embed a graphics file (TIFF or PostScript) in a text field
   in the CIF. This would require an embedding convention, similar to the
   imgCIF MIME convention; software to de-embed and decode the graphic;
   software to render the resulting TIFF or PS image. Substantial effort,
   and the result is just a picture.

   (ii) Another way is to embed the output file from common drawing
   packages such as ChemDraw and ISISDraw. As before, one needs to de-embed
   the file, decode it, render it in the style of the original package, and
   then parse it for chemical connectivity information (which is what is
   really wanted). The payoff is that the connectivity is read, but the
   software engineering is substantial and at the mercy of several
   proprietary formats.

   (iii) One could use the CIF (or, better, MIF) connectivity datanames.
   Ideally one would persuade the major manufacturers of such software
   to provide CIF/MIF as an export format from their packages. It may
   still be necessary to embed a graphics file for high-resolution
   publication, however.

   (iv) The other approach to connectivity is to infer chemical bond types
   from the 3D image, and allow the user to edit the 3D diagram
   interactively, trapping the result in CIF/MIF fields. This captures the
   chemical information, but loses the aesthetics of the commercial graphic
   presentation. It also alienates chemist authors who are familiar with
   the existing software packages.

   Of these options, (iii) looks best, but depends on persuading the
   manufacturers... usual story.

   Polyhedron rendering for inorganics.

   Intensity profiles for powder patterns? Not one I've discussed anywhere
   else, but if a structural CIF included a powder pattern it would be nice
   to be able to visualise the intensity data. Maybe not an essential component
   of a CIF editor, though.

   Consistency checks against CIF dictionaries.

   mmCIF compatibility (i.e. I don't think it will be able to read a
   small-molecule structure written in the DDL2 version of the Core).
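
   For option (iv) in the discussion of 2D diagrams above, the first step
   (deciding which atoms are connected) is commonly done by comparing
   interatomic distances with sums of covalent radii. The Python sketch
   below shows that idea only. It does not assign bond types, it expects
   Cartesian coordinates in angstroms rather than the fractional
   coordinates a CIF stores, and the radii and tolerance are approximate
   values assumed for illustration. The function name infer_bonds is
   hypothetical.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

# Assumed slack, in angstroms, added to the sum of the two radii.
TOLERANCE = 0.4


def infer_bonds(atoms):
    """Return index pairs (i, j), i < j, of atoms judged to be bonded.

    atoms is a list of (element symbol, (x, y, z)) tuples with Cartesian
    coordinates in angstroms.
    """
    bonds = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        limit = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(xyz_i, xyz_j) <= limit:
            bonds.append((i, j))
    return bonds
```

   The resulting pairs are what an editor would present to the user for
   correction before writing them out to the CIF/MIF connectivity fields.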


b. Three-dimensional visualiser

   Existing tools are: Xtal_GX - not bad, and with a lot of crystallographic
   knowledge. Accesses CIF data blocks through a Tcl/Tk parser and GUI
   editor. Undoubtedly very useful to Xtal users, but the user interface
   is probably not intuitive to other users. I don't think it can read mmCIF
   format.

   OpenSource RasMol - favourite tool of protein folk; can read DDL1 and
   DDL2 CIFs, though can make incorrect bond assignments in small-molecule
   structures. Not crystallographically aware - cannot generate missing
   molecular fragments through application of symmetry operations, nor
   cell packing diagrams. Displays properly annotated disordered ensembles
   in different colours. Easy to use. Its major drawback (other than its
   lack of crystallography) is that it is available only as a helper and not
   a plugin to web browser windows - though I understand from Herbert that
   developing Netscape/IE plugins is a very high-overhead business.
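
   Generating symmetry-related fragments, the capability noted above as
   missing, comes down to applying each symmetry operator to the fractional
   coordinates. A minimal Python sketch follows, assuming operators written
   in the usual "x,y,z" string form with rational translations such as 1/2.
   Decimal translations and malformed input are not handled, and the
   function name apply_symop is made up for this example.

```python
import re
from fractions import Fraction

# A term is an optional sign followed by either a rational number
# (e.g. 1/2) or one of the axis letters x, y, z.
_TERM = re.compile(r"([+-]?)(\d+(?:/\d+)?|[xyz])")


def apply_symop(op, xyz):
    """Apply an operator such as '-x, y+1/2, -z' to fractional coordinates.

    Returns the transformed (x, y, z) as a tuple of floats. The result is
    not wrapped back into the unit cell.
    """
    coords = dict(zip("xyz", xyz))
    result = []
    for component in op.lower().replace(" ", "").split(","):
        total = 0.0
        for sign, body in _TERM.findall(component):
            value = coords[body] if body in coords else float(Fraction(body))
            total += -value if sign == "-" else value
        result.append(total)
    return tuple(result)
```

   A viewer would run every operator of the space group over the asymmetric
   unit and then convert the results to Cartesian coordinates using the cell
   parameters to draw complete molecules or packing diagrams.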

   There are also commercial products: I know of at least Crystallographica
   and WebLabViewer. These are platform-specific (Windows and Mac
   respectively) and cannot read mmCIF.


c. Data exchange

   Most small-molecule refinement packages seem to read and write coreCIF
   satisfactorily. Some make assumptions about data ordering or content
   that are not mandated (or even warranted) by the specification.


Macromolecular community
------------------------

I invite comment from mmCIFers. As I understand things, protein
crystallographers deposit data through a web editor which transforms the
input to mmCIF files. The web editor uses the mmCIF dictionary for
validation as the deposition proceeds, and ensures a high degree of data
consistency. It is configurable to different purposes, but I'm not sure that
it would have any application to constructing a small-molecule CIF (though I
shall be happy to be corrected). mmCIFs are also available for download for
every structure in the PDB, generated in the case of legacy data from
Herbert's reworking of Phil Bourne's original pdb2cif translator. Despite
community awareness of deficiencies, the old PDB format remains the de facto
standard for macromolecular software, though a small number of refinement
packages now write (and read?) mmCIF. RasMol is an effective macromolecular
structure viewer.


Powder diffraction, modulated structures
----------------------------------------
pdCIF and msCIF are written by a small number of programs in their
respective fields (msCIF still in beta). I am not aware of any visualisation
tools or any specific requirements by journals that would impinge upon
software for these domains.


Image plate data
----------------
The imgCIF dictionary is now under active COMCIFS review, and the imgCIF/CBF
working group have a well developed API and library. The handling of images,
though not a trivial task, is well defined. Support is still lacking from
equipment manufacturers.


Chemistry
---------
As mentioned above in my lengthy discourse on the CCDC editor, it would be
beneficial to have 2D chemical structural information output in MIF format
by standard commercial software packages. Perhaps relevant to this is an
IUPAC initiative to generate identifiers for chemical compounds that are
derivable from the compound's connection table. Perhaps also of some
relevance to CIF matters is IUPAC's official endorsement of CML (chemical
markup language) as an information interchange mechanism.


Regards
Brian