CHEMISTRY (was Re: Survey of available CIF software and
- To: Multiple recipients of list <[email protected]>
- Subject: CHEMISTRY (was Re: Survey of available CIF software and
- From: Peter Murray-Rust <[email protected]>
- Date: Fri, 1 Dec 2000 11:13:11 GMT
I have been working on how CIF and CML can interoperate and benefit from
each other - the synergy looks very good. There are a few specific comments
below about chemistry.
**I would be very grateful for sample CIFs that support chemistry; see below**
At 14:22 20/09/00 +0100, Brian McMahon wrote:
> Two-dimensional chemical diagrams. CCDC and Acta have requirements for
> 2D diagrams.
I would suggest that "chemical diagrams" is replaced by "structural
formula", "connection table" and "2D coordinates". CIF core provides
support for all these concepts but I suspect they are drastically
underused, to the detriment of everyone.
The purpose of a structural formula is to communicate to humans **and
machines** what the compound(s) actually *are*! This is, of course, not
trivial and at present there is a surprising amount of implicit human
perception required in many diagrams. Work with CML has shown that some
chemistry can *only* be represented with a graphical component but the vast
majority of compounds can be represented by one or both of:
- connection tables
- 3-D coordinates
Note, of course, that neither is formally deducible from the other by an
algorithm - charges, formal bond orders, etc. are matters of human opinion
and hopefully convention. 3-D coordinates do not normally represent
fluxional and similar molecules completely.
Does Acta currently accept (a) 2D diagrams or (b) connection tables in
CIF-based papers? (Please excuse ignorance here :-) If so, do they use
pixel-based representations, or do they use 2-D coordinates in _chemical_
to draw the diagrams?
>There are various possible avenues of approach. (i) One is to
> embed a graphics file (TIFF or PostScript) in a text file in the CIF.
> This would require an embedding convention, similar to the imgCIF
> MIME convention; software to de-embed and decode the graphic;
> software to render the resulting TIFF or PS image. Substantial effort,
> and the result is just a picture.
I argue very strongly against the continuing use of pixel-based diagrams. I
have examples where a diagram is embedded in HTML, rescaled by the browser
and **BONDS DISAPPEAR**. This happens when a horizontal or vertical bond is
1-pixel wide and falls on a non-integer coordinate. There are also many
diagrams where it is impossible to be sure whether a set of pixels is (say)
a 4, a + or something else.
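The disappearing-bond effect can be reproduced with a minimal sketch, assuming the rescaling is done by nearest-neighbour sampling (real browsers may use other resampling filters). The names downscale_rows and scale_segment and the toy bitmap are illustrative only. A one-pixel horizontal bond falls between the sampled rows and vanishes, whereas the same bond held as vector coordinates simply gets shorter:

```python
def downscale_rows(image, new_height):
    """Nearest-neighbour vertical rescale: keep one source row per output row."""
    old_height = len(image)
    return [image[int(i * old_height / new_height)] for i in range(new_height)]


def scale_segment(segment, factor):
    """Scale a vector line segment; the segment always survives."""
    (x1, y1), (x2, y2) = segment
    return ((x1 * factor, y1 * factor), (x2 * factor, y2 * factor))


# 10-row, 8-column bitmap with a one-pixel-thick horizontal "bond" on row 3.
image = [[0] * 8 for _ in range(10)]
image[3] = [1] * 8

# Rescaling 10 rows to 4 samples source rows 0, 2, 5 and 7, so row 3 is lost.
small = downscale_rows(image, 4)
bond_survives = any(any(row) for row in small)

# The same bond as a vector segment, scaled by the same factor of 0.4.
vector_bond = scale_segment(((0, 3), (7, 3)), 0.4)

print("bitmap bond survives:", bond_survives)
print("vector bond after scaling:", vector_bond)
```

Here bond_survives comes out False, while vector_bond still has two distinct end points.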
>(ii) Another way is to embed the output
> file from common drawing packages such as ChemDraw and ISISDraw. As
> before, one needs to de-embed the file, decode it, render it in the
> style of the original package, and then parse it for chemical
> connectivity information (which is what is really wanted). The payoff
> is that the connectivity is read, but the software engineering is
> substantial and at the mercy of several proprietary formats.
I have several examples of such files that I cannot interpret. ChemDraw has
some binary formats and these are unreadable without the software.
> (iii) One could use the CIF (or, better, MIF) connectivity datanames.
I support this. CIF has got it all present - we only need to use it.
> Ideally one would persuade the major manufacturers of such software
> to provide CIF/MIF as an export format from their packages. It may
> still be necessary to embed a graphics file for high-resolution
> publication, however. (iv) The other approach to connectivity is to
> infer chemical bond types from the 3D image, and allow the user to
> edit the 3D diagram interactively, trapping the result in CIF/MIF
> fields. This captures the chemical information, but loses the
> aesthetics of the commercial graphic presentation. It also alienates
> chemist authors who are familiar with the existing software
> packages. Of these options, (iii) looks best, but depends on
> persuading the manufacturers... usual story.
There is an important principle here. Any bond types, charges, etc. depend
on conventions (ontologies). Unless the source of these is documented,
there is considerable opportunity for confusion. Thus (I believe)
MDL Molfiles use 4 for aromatic whereas other packages use this for (the
rare) quadruple bond. CCDC use -5 for aromatic, etc. Therefore any
representation requires either:
- agreement to use a single convention
- careful recording of the convention.
CML uses both approaches - it has a small core ontology but can support the
use of any other convention.
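One way of recording the convention alongside each value is sketched below. The convention labels "molfile" and "other_package", the function normalise_bond and the output vocabulary are all hypothetical. The tables hold only the codes mentioned above plus 1/2/3 for single/double/triple, and should be checked against each format's own documentation before use:

```python
# Each table maps one package's bond codes onto a shared vocabulary.
# "other_package" is a stand-in for any format that uses 4 for quadruple.
CONVENTIONS = {
    "molfile": {1: "single", 2: "double", 3: "triple", 4: "aromatic"},
    "other_package": {1: "single", 2: "double", 3: "triple", 4: "quadruple"},
}


def normalise_bond(code, convention):
    """Translate a bond code, keeping a record of where it came from."""
    if convention not in CONVENTIONS:
        raise ValueError(f"unknown convention: {convention!r}")
    table = CONVENTIONS[convention]
    if code not in table:
        raise ValueError(f"code {code!r} is not defined in {convention!r}")
    return {"order": table[code], "convention": convention, "source_code": code}


# The same raw code means different chemistry under different conventions.
print(normalise_bond(4, "molfile"))
print(normalise_bond(4, "other_package"))
```

A bare code with no convention attached is rejected rather than guessed at, which is the point of the second approach.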
I suspect that current CIF terminology, extended with some MIF terminology
(e.g. for stereochemistry), would cater for 99% of "small" crystal structures.
There is thus a further option (v):
convert legacy formats to the appropriate CIF datanames (this is possible
within Core CIF without breaking it). These can be extended if appropriate
with either MIF datanames or IUPAC terms. CIF/MIF/IUPAC is probably strong
enough to hold most communal concepts. Re-export to legacy formats is
not always possible because these are not extensible (thus "PDB" and
MDL Molfile do not support much of what is in CIF).
I am developing support for this approach by using an internal DOM to hold
the CIF enhanced by CMLDOM.
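As a rough sketch of option (v), the following writes a small connection table out under the core CIF _chemical_conn_atom_ and _chemical_conn_bond_ datanames. It is not the DOM-based implementation mentioned above: the function conn_table_to_cif, the in-memory tuple layout and the CO2 example values are assumptions made for illustration, and parsing of the legacy format itself is left out.

```python
def conn_table_to_cif(block_name, atoms, bonds):
    """Render a connection table as core CIF _chemical_conn_ loops.

    atoms: list of (number, element_symbol, x, y), with the 2D display
           coordinates x and y expressed in the range 0 to 1
    bonds: list of (atom_1, atom_2, bond_type), where bond_type is a core
           CIF code such as 'sing', 'doub', 'trip', 'quad' or 'arom'
    """
    lines = [
        f"data_{block_name}",
        "",
        "loop_",
        "_chemical_conn_atom_number",
        "_chemical_conn_atom_type_symbol",
        "_chemical_conn_atom_display_x",
        "_chemical_conn_atom_display_y",
    ]
    for number, symbol, x, y in atoms:
        lines.append(f"{number} {symbol} {x:.3f} {y:.3f}")
    lines += [
        "",
        "loop_",
        "_chemical_conn_bond_atom_1",
        "_chemical_conn_bond_atom_2",
        "_chemical_conn_bond_type",
    ]
    for atom_1, atom_2, bond_type in bonds:
        lines.append(f"{atom_1} {atom_2} {bond_type}")
    return "\n".join(lines) + "\n"


# Illustrative input: carbon dioxide drawn as a horizontal O=C=O.
atoms = [(1, "O", 0.2, 0.5), (2, "C", 0.5, 0.5), (3, "O", 0.8, 0.5)]
bonds = [(1, 2, "doub"), (2, 3, "doub")]
cif_text = conn_table_to_cif("co2", atoms, bonds)
print(cif_text)
```

Going the other way (CIF back to a legacy format) would need a decision about every dataname the legacy format cannot hold, which is why re-export is not always possible.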
>[...]
>Chemistry
>---------
>As mentioned above in my lengthy discourse on the CCDC editor, it would be
>beneficial to have 2D chemical structural information output in MIF format
>by standard commercial software packages. Perhaps relevant to this is an
>IUPAC initiative to generate identifiers for chemical compounds that is
>derivable from the compound's connection table. Perhaps also of some
>relevance to CIF matters is IUPAC's official endorsement of CML (chemical
>markup language) as an information interchange mechanism.
CML interoperates very well with CIF because both are extensible and CIF
provides a top-class dictionary approach. Thus I do not reinvent ontologies
- I re-use existing ones. Thus CML uses CIF _cell_ concepts to hold cell
data. CML will interoperate extremely closely with the IUPAC initiative and
help to separate syntax from semantics and ontology.
I am therefore developing CIF2CML and vice versa. XML provides high-quality
graphics tools (SVG) which make it possible to provide true vector-based
graphics *with semantic and ontological enhancement*. Thus it's possible to
create smart diagrams which can be clicked and carry the whole _chemical_
information underneath (see http://www.xmlcml.org for examples).
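A minimal sketch of such a diagram, using Python's standard xml.etree.ElementTree: each bond becomes an SVG line element carrying an id, a class and a title child that states which atoms it joins. Using class and title to carry the chemistry is an assumption made for illustration, not the scheme used by CML or at xmlcml.org, and the function diagram_to_svg and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def diagram_to_svg(atoms, bonds, size=200):
    """Draw a 2D connection table as SVG, keeping chemistry on each element.

    atoms: list of (number, symbol, x, y) with x and y in the range 0 to 1
    bonds: list of (atom_1, atom_2, bond_type)
    """
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(size), "height": str(size)})
    # SVG's y axis points down, so flip y to keep the origin at lower left.
    pos = {n: (x * size, (1 - y) * size) for n, _, x, y in atoms}
    symbol = {n: s for n, s, _, _ in atoms}
    # Every bond is drawn as one plain line; the bond type lives in the
    # class and title, not in the drawing style.
    for i, (a1, a2, btype) in enumerate(bonds, start=1):
        (x1, y1), (x2, y2) = pos[a1], pos[a2]
        line = ET.SubElement(svg, f"{{{SVG_NS}}}line", {
            "id": f"b{i}", "class": f"bond {btype}",
            "x1": f"{x1:.1f}", "y1": f"{y1:.1f}",
            "x2": f"{x2:.1f}", "y2": f"{y2:.1f}",
            "stroke": "black",
        })
        title = ET.SubElement(line, f"{{{SVG_NS}}}title")
        title.text = f"{symbol[a1]}{a1}-{symbol[a2]}{a2} ({btype})"
    for n, (x, y) in pos.items():
        label = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
            "id": f"a{n}", "class": "atom",
            "x": f"{x:.1f}", "y": f"{y:.1f}",
        })
        label.text = symbol[n]
    return ET.tostring(svg, encoding="unicode")


# Illustrative input: carbon dioxide drawn as a horizontal O=C=O.
atoms = [(1, "O", 0.2, 0.5), (2, "C", 0.5, 0.5), (3, "O", 0.8, 0.5)]
bonds = [(1, 2, "doub"), (2, 3, "doub")]
svg_text = diagram_to_svg(atoms, bonds)
print(svg_text)
```

Because the bonds stay as addressable elements rather than pixels, the diagram can be rescaled without loss and a script can read the connectivity straight back out of it.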
CML has now been adopted as a central part of one of the submissions to OMG
for "small molecules". This means that with the mmCIF-based submission we
have a very strong crystallographic input into the formal representation of
both small and large molecular objects. This will make it much easier to
use standard tools.
To do this I would be grateful for some sample CIFs which contain chemical
connectivity, and also for any which contain 2D coordinates, with or
without 3D coordinates.
P.
Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk

