Discussion List Archives


CHEMISTRY (was Re: Survey of available CIF software and

I have been working on how CIF and CML can interoperate and benefit from
each other, and the synergy looks very good. There are a few specific comments
below about chemistry.

**I would be very grateful for sample CIFs that support chemistry; see below.**

At 14:22 20/09/00 +0100, Brian McMahon wrote:

>    Two-dimensional chemical diagrams. CCDC and Acta have requirements for
>    2D diagrams.

I would suggest that "chemical diagrams" be replaced by "structural
formula", "connection table" and "2D coordinates". CIF core provides
support for all these concepts but I suspect they are drastically
underused, to the detriment of everyone.

The purpose of a structural formula is to communicate to humans **and 
machines** what the compound(s) actually *are*! This is, of course, not 
trivial and at present there is a surprising amount of implicit human 
perception required in many diagrams.  Work with CML has shown that some 
chemistry can *only* be represented with a graphical component but the vast 
majority of compounds can be presented by some or all of:
         - connection tables
         - 3-D coordinates
Note, of course, that neither is formally deducible from the other by an 
algorithm - charges, formal bond orders, etc. are matters of human opinion 
and hopefully convention. 3-D coordinates do not normally represent 
fluxional and similar molecules completely.

Does Acta currently accept (a) 2D diagrams (b) connection tables in
CIF-based papers? (Please excuse ignorance here :-) If so, do they use
pixel-based representations or use 2-D coordinates in _chemical_ to draw
the diagrams?



>There are various possible avenues of approach. (i) One is to
>    embed a graphics file (TIFF or PostScript) in a text file in the CIF.
>    This would require an embedding convention, similar to the imgCIF
>    MIME convention; software to de-embed and decode the graphic;
>    software to render the resulting TIFF or PS image. Substantial effort,
>    and the result is just a picture.

I argue very strongly against the continuing use of pixel-based diagrams. I
have examples where a diagram is embedded in HTML, rescaled by the browser
and **BONDS DISAPPEAR**. This happens when a horizontal or vertical bond is
1 pixel wide and falls on a non-integer coordinate. There are also many
diagrams where it is impossible to be sure whether a set of pixels is (say)
a 4, a + or something else.
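
The disappearing bond can be reproduced with a small sketch. It assumes the
browser shrinks the image by nearest-neighbour sampling (real browsers may
use other resampling filters), and the function names are invented for the
illustration:

```python
def make_bitmap(width, height, bond_row):
    """Return a height x width grid of 0s with one 1-pixel-thick horizontal bond."""
    return [[1 if y == bond_row else 0 for _ in range(width)] for y in range(height)]


def downscale_nearest(bitmap, factor):
    """Shrink by keeping every `factor`-th row and column (nearest-neighbour sampling)."""
    return [row[::factor] for row in bitmap[::factor]]


def has_ink(bitmap):
    """True if any pixel in the bitmap is set."""
    return any(any(row) for row in bitmap)


original = make_bitmap(width=12, height=9, bond_row=4)
halved = downscale_nearest(original, 2)  # samples rows 0, 2, 4, 6, 8
thirds = downscale_nearest(original, 3)  # samples rows 0, 3, 6

print(has_ink(halved))  # the bond on row 4 is sampled and survives
print(has_ink(thirds))  # the bond on row 4 falls between sampled rows
```

Whether the bond survives depends only on where it happens to fall relative
to the sampling grid, which is why the failure looks arbitrary to a reader.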



>(ii) Another way is to embed the output
>    file from common drawing packages such as ChemDraw and ISISDraw. As
>    before, one needs to de-embed the file, decode it, render it in the
>    style of the original package, and then parse it for chemical
>    connectivity information (which is what is really wanted). The payoff
>    is that the connectivity is read, but the software engineering is
>    substantial and at the mercy of several proprietary formats.

I have several examples of such files that I cannot interpret. ChemDraw has 
some binary formats and these are unreadable without the software.

>    (iii) One could use the CIF (or, better, MIF) connectivity datanames.

I support this. CIF already has everything required - we only need to use it.

>    Ideally one would persuade the major manufacturers of such software
>    to provide CIF/MIF as an export format from their packages. It may
>    still be necessary to embed a graphics file for high-resolution
>    publication, however. (iv) The other approach to connectivity is to
>    infer chemical bond types from the 3D image, and allow the user to
>    edit the 3D diagram interactively, trapping the result in CIF/MIF
>    fields. This captures the chemical information, but loses the
>    aesthetics of the commercial graphic presentation. It also alienates
>    chemist authors who are familiar with the existing software
>    packages. Of these options, (iii) looks best, but depends on
>    persuading the manufacturers... usual story.

There is an important principle here. Any bond types, charges, etc. depend 
on conventions (ontologies). Unless the source of these is documented, 
there is considerable opportunity for confusion. Thus (I believe)
MDL Molfiles use 4 for aromatic whereas other packages use this for (the
rare) quadruple bond. CCDC use -5 for aromatic, etc. Therefore any
representation requires either:
         - agreement to use a single convention
         - careful recording of the convention.
CML uses both approaches - it has a small core ontology but can support the 
use of any other convention.
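
A minimal sketch of the second approach, recording the convention alongside
each bond code, might look as follows. The convention names and the
`Bond`/`bond_type` helpers are hypothetical, and the aromatic codes are
simply the ones quoted above ("I believe"), not checked against the
packages' documentation:

```python
from dataclasses import dataclass

# Code tables keyed by convention name. The aromatic codes (4 for MDL, -5 for
# CCDC) are taken from this message and are unverified; "hypothetical_package"
# stands in for any package that uses 4 for a quadruple bond.
BOND_CODES = {
    "mdl_molfile": {1: "single", 2: "double", 3: "triple", 4: "aromatic"},
    "ccdc": {-5: "aromatic"},
    "hypothetical_package": {1: "single", 2: "double", 3: "triple", 4: "quadruple"},
}


@dataclass(frozen=True)
class Bond:
    atom_1: str
    atom_2: str
    code: int
    convention: str  # the source of the code travels with the data


def bond_type(bond: Bond) -> str:
    """Translate a convention-specific code into a shared vocabulary."""
    table = BOND_CODES.get(bond.convention)
    if table is None:
        raise ValueError(f"undocumented convention: {bond.convention!r}")
    if bond.code not in table:
        raise ValueError(f"code {bond.code} not defined by {bond.convention!r}")
    return table[bond.code]


same_code = [
    Bond("C1", "C2", 4, "mdl_molfile"),
    Bond("Re1", "Re2", 4, "hypothetical_package"),
]
for bond in same_code:
    print(bond.convention, bond.code, bond_type(bond))
```

The same integer 4 resolves to two different bond types, and a bond whose
convention is not recorded is rejected rather than guessed.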

I suspect that current CIF terminology, extended with some MIF terminology
(e.g. for stereochemistry), would cater for 99% of "small" crystal structures.

There is thus a further possible option (v):

Convert legacy formats to the appropriate CIF datanames (this is possible
within Core CIF without breaking it). These can be extended if appropriate
with either MIF datanames or IUPAC terms. CIF/MIF/IUPAC is probably strong
enough to hold most communal concepts. Re-export to legacy formats is
not always possible because these are not extensible (thus "PDB" and
MDL Molfile do not support much of what is in CIF).
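
As a rough sketch of option (v), a connection table already parsed from a
legacy file could be written out with the Core CIF _chemical_conn_atom_ and
_chemical_conn_bond_ datanames. The function name, the input layout and the
example coordinates are assumptions made for the illustration; the datanames
and bond-type codes should be checked against the current core dictionary:

```python
# Shared bond vocabulary mapped to the four-letter Core CIF bond-type codes.
BOND_TYPE_TO_CIF = {
    "single": "sing",
    "double": "doub",
    "triple": "trip",
    "quadruple": "quad",
    "aromatic": "arom",
}


def connection_table_to_cif(block_name, atoms, bonds):
    """Serialise atoms [(symbol, x, y)] and bonds [(i, j, type)] as CIF loops.

    Atom numbers are 1-based positions in `atoms`; x and y are 2D display
    coordinates for the diagram.
    """
    lines = [
        f"data_{block_name}",
        "",
        "loop_",
        "_chemical_conn_atom_number",
        "_chemical_conn_atom_type_symbol",
        "_chemical_conn_atom_display_x",
        "_chemical_conn_atom_display_y",
    ]
    for number, (symbol, x, y) in enumerate(atoms, start=1):
        lines.append(f"{number} {symbol} {x:.2f} {y:.2f}")
    lines += [
        "",
        "loop_",
        "_chemical_conn_bond_atom_1",
        "_chemical_conn_bond_atom_2",
        "_chemical_conn_bond_type",
    ]
    for atom_1, atom_2, bond_kind in bonds:
        lines.append(f"{atom_1} {atom_2} {BOND_TYPE_TO_CIF[bond_kind]}")
    return "\n".join(lines) + "\n"


# Heavy-atom skeleton of a carbonyl group with made-up display coordinates.
cif_text = connection_table_to_cif(
    "example",
    atoms=[("C", 0.40, 0.50), ("O", 0.60, 0.50)],
    bonds=[(1, 2, "double")],
)
print(cif_text)
```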

I am developing support for this approach by using an internal DOM to hold 
the CIF enhanced by CMLDOM.

>[...]



>Chemistry
>---------
>As mentioned above in my lengthy discourse on the CCDC editor, it would be
>beneficial to have 2D chemical structural information output in MIF format
>by standard commercial software packages. Perhaps relevant to this is an
>IUPAC initiative to generate identifiers for chemical compounds that is
>derivable from the compound's connection table. Perhaps also of some
>relevance to CIF matters is IUPAC's official endorsement of CML (chemical
>markup language) as an information interchange mechanism.

CML interoperates very well with CIF because both are extensible and CIF
provides a top-class dictionary approach. Thus I do not reinvent ontologies 
- I re-use existing ones. Thus CML uses CIF _cell_ concepts to hold cell 
data. CML will interoperate extremely closely with the IUPAC initiative and 
help to separate syntax from semantics and ontology.

I am therefore developing CIF2CML and vice versa. XML provides high-quality 
graphics tools (SVG) which make it possible to provide true vector-based 
graphics *with semantic and ontological enhancement*. Thus it's possible to 
create smart diagrams which can be clicked and carry the whole _chemical_ 
information underneath (see http://www.xmlcml.org for examples).
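
To illustrate the idea only (this is not the CIF2CML code), a diagram can be
generated as SVG so that every bond and atom is a named element rather than a
patch of pixels. The `data-atoms` and `data-order` attribute names and the
input layout are choices made for this sketch, not part of CML or CIF:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def molecule_to_svg(atoms, bonds, size=200):
    """Draw atoms {id: (symbol, x, y)} and bonds [(id1, id2, order)] as SVG.

    x and y are fractional coordinates in the range 0..1; each bond becomes a
    <line> element that carries its chemistry as attributes.
    """
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(size), "height": str(size), "viewBox": f"0 0 {size} {size}"},
    )
    for atom_1, atom_2, order in bonds:
        x1, y1 = atoms[atom_1][1:]
        x2, y2 = atoms[atom_2][1:]
        ET.SubElement(
            svg,
            f"{{{SVG_NS}}}line",
            {
                "x1": f"{x1 * size:.1f}",
                "y1": f"{y1 * size:.1f}",
                "x2": f"{x2 * size:.1f}",
                "y2": f"{y2 * size:.1f}",
                "stroke": "black",
                "data-atoms": f"{atom_1} {atom_2}",
                "data-order": order,
            },
        )
    for atom_id, (symbol, x, y) in atoms.items():
        label = ET.SubElement(
            svg,
            f"{{{SVG_NS}}}text",
            {"id": atom_id, "x": f"{x * size:.1f}", "y": f"{y * size:.1f}"},
        )
        label.text = symbol
    return ET.tostring(svg, encoding="unicode")


svg_text = molecule_to_svg(
    atoms={"a1": ("C", 0.3, 0.5), "a2": ("O", 0.7, 0.5)},
    bonds=[("a1", "a2", "double")],
)
print(svg_text)
```

Because the bond is an element with coordinates, rescaling the picture cannot
lose it, and software can read the connectivity back from the same file that
the reader sees.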

CML has now been adopted as a central part of one of the submissions to OMG
for "small molecules". This means that with the mmCIF-based submission we
have a very strong crystallographic input into the formal representation of 
both small and large molecular objects. This will make it much easier to 
use standard tools.

To do this I would be grateful for some sample CIFs that contain chemical
connectivity, and also for any that contain 2D coordinates, with or without
3D coordinates.

         P.


Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk