Survey of available CIF software and request for wish list
- To: Multiple recipients of list <[email protected]>
- Subject: Survey of available CIF software and request for wish list
- From: Brian McMahon <[email protected]>
- Date: Wed, 20 Sep 2000 14:22:45 +0100 (BST)
There has been a private discussion among some members over the
last few days about how to direct the development of software to
advance the use of CIF. I'd like to take that discussion onto the whole
COMCIFS list for two reasons: (1) to survey what is needed, and
(2) to canvass opinions on how to secure development effort and funding. It
will be best to split these two threads, so I'll start here by trying to
categorise the types of tools we need to consider, and reviewing what I know
about the ones that exist.
=== Executive Summary ===
There is a shortage of basic tools for handling syntax issues and dictionary
validation checks. The existing ones are often incomplete or not fully
robust. In particular, support for fashionable scripting languages (Perl,
Tcl, Python) is poor. The needs of the small-molecule crystallographer are
(or soon will be) reasonably well met, but uptake of mmCIF and imgCIF is
still weak. Even with small-molecule applications much would be gained by
working in an environment that can interface easily with existing
lexer/parser tools, graphical widget sets and object storage conventions.
=========================
A major problem with CIF is its breadth. Unlike rendering a graphics image,
which is well defined (so TIFF, GIF, JPEG, PNG etc are addressing the same
problem), CIF (and friends) includes raw and processed data, connectivity
maps, 3d coordinate sets, symmetry operations, discursive text etc etc, and
is used to describe inorganic, molecular, macromolecular and incommensurate
structures at least - there are already many other dictionaries in the
pipeline.
So we need to envisage domain-specific applications; but we must also provide
a core of utilities that can be used in any domain. Let's begin by thinking
about these application-independent tools. What can we identify as essential
or even desirable?
1. PURE SYNTAX HANDLERS
-----------------------
Tools that handle CIF tokens without any interpretation, and so are
universal across all domains.
a. Standalone tools
Function            Description                              Exists?
--------            -----------                              -------
Syntax checker      Returns result code if there is a        vcif (C)
                    definite syntax error, and perhaps
                    a human-readable error message
Intelligent syntax  Indicates (probably) where the error     No
checker             really occurred
Prettifier          Enforces line lengths, aligns loop       cif2cif (Fortran)
                    elements
Stream editor       Allows CIF elements to be added,         No
                    deleted by command-line instruction
Rearranger          Modifies order of existing elements      quasar (f77)
                                                             cif2cif (f77)
Interrogator        Extracts CIF data meeting specified      starbase (C)
                    criteria
Tokeniser           Reads CIF and passes individual tokens   cifzinc (C)
                    to stdout in some normalised meta
                    representation
Interactive editor  Enforces correct syntax during           emacs cif.el
                    on-screen editing                        (Lisp)
b. Libraries
CIFtbx (Fortran), CIFLIB (C API) and CIFOBJ (C++ class library) are publicly
available. CCDC has developed a C++ class library within the CIFer
project, and Luca Lutterotti of Trento, Italy has advertised an incomplete
Java class library on cif-developers. There is also Peter Murray-Rust's
old C++ library (somewhere).
So far as I am aware, the Rutgers libraries compile (easily) on only a
small number of platforms.
Is it beneficial to define a standard applications program interface that
different libraries could converge to? For example, a standard set of
exceptions defining types of syntax error (applications would of course
use their own exception handlers, but the specific errors in a file would
be well defined across all libraries, e.g.
_a A _b 'Broken char string _c C
would raise the exception INCOMPLETE_QUOTE_DELIMITED_STRING at the end of
the line). Likewise, how closely aligned are the library functions across
the existing libraries? Does CIFtbx have an equivalent function to the
CIFLIB cifGetRowByIndex, for example? Should it have?
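To make the idea of library-independent syntax exceptions concrete, here is a
minimal sketch in Python. It is not taken from any of the libraries named
above: the class names, the `code` attribute and the function `tokenise_line`
are all hypothetical, and only single-line quoted strings are handled (no
comments, no semicolon-delimited text fields). The one thing it borrows from
the text is the exception name INCOMPLETE_QUOTE_DELIMITED_STRING, reported at
the end of the offending line.

```python
class CifSyntaxError(Exception):
    """Hypothetical base class; `code` is the library-independent error name."""
    code = "CIF_SYNTAX_ERROR"

    def __init__(self, line, column):
        super().__init__(f"{self.code} at line {line}, column {column}")
        self.line = line
        self.column = column


class IncompleteQuoteDelimitedString(CifSyntaxError):
    code = "INCOMPLETE_QUOTE_DELIMITED_STRING"


def tokenise_line(text, line_no=1):
    """Split one line of CIF into tokens (illustrative subset of the syntax)."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        if text[i] in "'\"":
            quote = text[i]
            j = i + 1
            # A closing quote only counts if followed by whitespace or end of line.
            while j < n:
                if text[j] == quote and (j + 1 == n or text[j + 1].isspace()):
                    break
                j += 1
            else:
                # Ran off the end of the line without finding a closing quote.
                raise IncompleteQuoteDelimitedString(line_no, n)
            tokens.append(text[i + 1:j])
            i = j + 1
        else:
            j = i
            while j < n and not text[j].isspace():
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens
```

An application would catch `CifSyntaxError` with its own handler and could
compare `exc.code` across libraries, which is the point of agreeing the names.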
2. DICTIONARY TOOLS
-------------------
The next most general category contains tools which know how to handle
dictionaries, but have no domain-specific content. Ideally they should be
able to handle DDL1 and DDL2 dictionaries transparently.
a. Standalone tools
Function            Description                              Exists?
--------            -----------                              -------
Syntax checker      As for data files, but knows about       vcif (C)
                    save_ frames which are absent from
                    data files
Intelligent syntax  Less important than for data files       No
checker
Prettifier          Aligns lists of definition elements      No
Merger              Combines dictionary files and fragments  No
                    into a single dictionary a la
                    McMahon/Bernstein/Westbrook protocol
Name locator        Finds CIF datanames in dictionaries      cyclops (f77)
Extractor           Extracts definition                      cman
                                                             (rudimentary) (C)
Browser             Graphical tool to browse dictionary      No
                    (read-only)
Web browser         Really an implementation of a            mmCIF (Rutgers)/
                    cif2html conversion                      core/pdCIF (IUCr)
b. Libraries
The primary requirement is to validate data files against the contents of
one or more nominated dictionaries. CIFtbx (f77) and CIFOBJ (C++) provide
routines for this (probably some also in CIFLIB), but I think these are all
incomplete - please correct me if I'm wrong. CIFOBJ is DDL2 specific. CCDC's
HICCuP program had some Python validation routines against DDL1
dictionaries, again incomplete.
Specific things that need doing include:
  completing validation functions for DDL1/2 dictionaries in CIFtbx;
  a C or C++ DDL1 validator;
  a reference _type_construct parser/validator to check data typing
    through regular expressions (_type_construct has been used in the
    msCIF dictionary, but without software it's difficult to be sure that
    Gotzon's expressions will work). In fact, _type_construct would need
    to be fully specified before such software can be developed;
  an IP-enabled tool to retrieve and cache public dictionaries referenced
    through _audit_conform... data items and the IUCr registry;
  implementation of the dictionary merging protocol.
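As a rough illustration of the validation step described above, the sketch
below checks data items against a toy dictionary in Python. It assumes that
each definition's type has already been reduced to an ordinary Python regular
expression; it does not parse _type_construct itself, which (as noted) is not
yet fully specified. The patterns and the names `TOY_DICTIONARY` and
`validate` are invented for the example and are not taken from any published
dictionary.

```python
import re

# Hypothetical stand-in for a parsed dictionary: data name -> regular expression.
# The patterns are illustrative only.
TOY_DICTIONARY = {
    # A number with an optional standard uncertainty in parentheses, e.g. 10.234(5)
    "_cell_length_a": r"[0-9]+(\.[0-9]+)?(\([0-9]+\))?",
    # A positive integer
    "_cell_formula_units_z": r"[1-9][0-9]*",
}


def validate(items, dictionary):
    """Return a list of (data name, problem) pairs for one data block.

    `items` maps data names to their string values as read from the file.
    Data names are compared case-insensitively.
    """
    problems = []
    for name, value in items.items():
        pattern = dictionary.get(name.lower())
        if pattern is None:
            problems.append((name, "not defined in dictionary"))
        elif value in ("?", "."):
            continue  # unknown / inapplicable placeholders are not type-checked
        elif re.fullmatch(pattern, value) is None:
            problems.append((name, f"value {value!r} does not match type"))
    return problems
```

For example, `validate({"_cell_length_a": "10.234(5)", "_cell_formula_units_Z": "four"}, TOY_DICTIONARY)`
reports only the second item. A real validator would also need to handle
loops, enumerations, ranges and parent/child relationships.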
c. "Trip" test
A suite of tests that would allow developers to confirm that they are
writing CIFs fully compliant with the standard would be beneficial. This
should be at the level of checking syntax and compliance against specified
dictionaries.
3. SEMANTIC TRANSLATORS
-----------------------
Still steering clear of applications that need specifically crystallographic
programming...
a. Standalone programs
Function            Description                              Exists?
--------            -----------                              -------
Formatters          Render in readable format via TeX,       ciftex, cif2xml,
                    HTML, SGML, XML etc                      Rutgers dic->HTML
Data converters     Conversion of all (or some) CIF data     cif2sx (ShelX)
                    to various other existing formats        pdb2cif/cif2pdb
b. Libraries
Such utilities will tend to be fairly specific, but it would help to have
common routines for mapping tokens between identical or similar data
structures. So an mmCIF and associated DDL2 dictionary are isomorphous
to a relational database with an associated schema. My ciftex output is
a linear stream of tagged values, and is essentially isomorphous to the
input CIF. However, an SGML translation is harder, because the document
structure in SGML (depending on how it is defined by a DTD) may be a
hierarchical model; how does the flat-field CIF map into that structure?
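One possible answer to the mapping question, sketched in Python under stated
assumptions: treat each loop as a list element whose children are the rows,
with one child element per data item. The sketch assumes DDL2-style names of
the form `_category.item`; the element names (`atom_site_list` and so on) and
the function `loop_to_xml` are invented for illustration and do not follow
any existing DTD.

```python
import xml.etree.ElementTree as ET


def loop_to_xml(names, rows):
    """Map one flat CIF loop onto a two-level XML hierarchy.

    `names` are DDL2-style data names ('_category.item') that all share one
    category; `rows` is a list of value lists in the same column order.
    """
    category = names[0].lstrip("_").split(".", 1)[0]
    root = ET.Element(category + "_list")
    for row in rows:
        row_elem = ET.SubElement(root, category)
        for name, value in zip(names, row):
            item = name.split(".", 1)[1]
            ET.SubElement(row_elem, item).text = value
    return root


# Usage: two rows of an atom_site loop become two <atom_site> elements.
tree = loop_to_xml(
    ["_atom_site.id", "_atom_site.type_symbol"],
    [["1", "C"], ["2", "O"]],
)
xml_text = ET.tostring(tree, encoding="unicode")
```

This only reproduces the flat structure inside a wrapper. A DTD with deeper
nesting (say, atoms grouped under residues) would need extra grouping rules
that the CIF itself does not state, which is where the difficulty lies.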
4. CRYSTALLOGRAPHIC APPLICATIONS
--------------------------------
Now we get to the bit where we ask what the crystallographic community
wants. Here are a few observations and suggestions from me; others are
welcome to add their 2 cents (or $2!).
Small-molecule community
------------------------
a. A structured CIF editor. CCDC are working well on this. The tool can import
data files and data blocks (so things like descriptions of equipment can
be stored in a template block). There is a "wizard" that prompts for
"required" data items (to be supplied by journals or other applications
in a lookup file). There is a visualisation window where a 3D structure
can be rendered and rotated - this borrows code from the CSD database
software, and so is quite crystallographically aware - it can (I think)
show symmetry-generated parts of a molecule and packing in a unit cell,
in a variety of rendering styles.
What's missing? I would guess that the version 1 release will lack the
following features and functionality that CCDC want to have in due course
(please correct me if I've got anything wrong, Owen).
"WYSIWYG". Text needs to be entered using the CIF backslash coding
conventions. Probably WYSIWYG will be introduced initially through
cut-and-paste out of a word-processor window. I don't know whether
it's possible to support clipboard formats across different platforms
(Microsoft Windows, Mac, Linux StarOffice etc).
Two-dimensional chemical diagrams. CCDC and Acta have requirements for
2D diagrams. There are various possible avenues of approach. (i) One is to
embed a graphics file (TIFF or PostScript) in a text field in the CIF.
This would require an embedding convention, similar to the imgCIF
MIME convention; software to de-embed and decode the graphic;
software to render the resulting TIFF or PS image. Substantial effort,
and the result is just a picture. (ii) Another way is to embed the output
file from common drawing packages such as ChemDraw and ISISDraw. As
before, one needs to de-embed the file, decode it, render it in the
style of the original package, and then parse it for chemical
connectivity information (which is what is really wanted). The payoff
is that the connectivity is read, but the software engineering is
substantial and at the mercy of several proprietary formats.
(iii) One could use the CIF (or, better, MIF) connectivity datanames.
Ideally one would persuade the major manufacturers of such software
to provide CIF/MIF as an export format from their packages. It may
still be necessary to embed a graphics file for high-resolution
publication, however. (iv) The other approach to connectivity is to
infer chemical bond types from the 3D image, and allow the user to
edit the 3D diagram interactively, trapping the result in CIF/MIF
fields. This captures the chemical information, but loses the
aesthetics of the commercial graphic presentation. It also alienates
chemist authors who are familiar with the existing software
packages. Of these options, (iii) looks best, but depends on
persuading the manufacturers... usual story.
Polyhedron rendering for inorganics.
Intensity profiles for powder patterns? Not one I've discussed anywhere
else, but if a structural CIF included a powder pattern it would be nice
to be able to visualise the intensity data. Maybe not an essential component
of a CIF editor, though.
Consistency checks against CIF dictionaries.
mmCIF compatibility (i.e. I don't think it will be able to read a
small-molecule structure written in the DDL2 version of the Core).
b. Three-dimensional visualiser
Existing tools are: Xtal_GX - not bad, and with a lot of crystallographic
knowledge. Accesses CIF data blocks through a Tcl/Tk parser and GUI
editor. Undoubtedly very useful to Xtal users, but the user interface
is probably not intuitive to other users. I don't think it can read mmCIF
format.
OpenSource RasMol - favourite tool of protein folk; can read DDL1 and
DDL2 CIFs, though can make incorrect bond assignments in small-molecule
structures. Not crystallographically aware - cannot generate missing
molecular fragments through application of symmetry operations, nor
cell packing diagrams. Displays properly annotated disordered ensembles
in different colours. Easy to use. Its major drawback (other than its
lack of crystallography) is that it is available only as a helper and not
a plugin to web browser windows - though I understand from Herbert that
developing Netscape/IE plugins is a very high-overhead business.
There are also commercial products: I know of at least Crystallographica
and WebLabViewer. These are platform-specific (Windows and Mac
respectively) and cannot read mmCIF.
c. Data exchange
Most small-molecule refinement packages seem to read and write coreCIF
satisfactorily. Some make assumptions about data ordering or content
that are not mandated (or even warranted) by the specification.
Macromolecular community
------------------------
I invite comment from mmCIFers. As I understand things, protein
crystallographers deposit data through a web editor which transforms the
input to mmCIF files. The web editor uses the mmCIF dictionary for
validation as the deposition proceeds, and ensures a high degree of data
consistency. It is configurable to different purposes, but I'm not sure that
it would have any application to constructing a small-molecule CIF (though I
shall be happy to be corrected). mmCIFs are also available for download for
every structure in the PDB, generated in the case of legacy data from
Herbert's reworking of Phil Bourne's original pdb2cif translator. Despite
community awareness of deficiencies, the old PDB format remains the de facto
standard for macromolecular software, though a small number of refinement
packages now write (and read?) mmCIF. RasMol is an effective macromolecular
structure viewer.
Powder diffraction, modulated structures
----------------------------------------
pdCIF and msCIF are written by a small number of programs in their
respective fields (msCIF still in beta). I am not aware of any visualisation
tools or any specific requirements by journals that would impinge upon
software for these domains.
Image plate data
----------------
The imgCIF dictionary is now under active COMCIFS review, and the imgCIF/CBF
working group have a well developed API and library. The handling of images,
though not a trivial task, is well defined. Support is still lacking from
equipment manufacturers.
Chemistry
---------
As mentioned above in my lengthy discourse on the CCDC editor, it would be
beneficial to have 2D chemical structural information output in MIF format
by standard commercial software packages. Perhaps relevant to this is an
IUPAC initiative to generate identifiers for chemical compounds that are
derivable from each compound's connection table. Perhaps also of some
relevance to CIF matters is IUPAC's official endorsement of CML (chemical
markup language) as an information interchange mechanism.
Regards
Brian

