[Fwd: Re: Proposed CIF addition for chemical descriptions]

  • To: corecifchem@iucr.org
  • Subject: [Fwd: Re: Proposed CIF addition for chemical descriptions]
  • From: David Brown <idbrown@mcmaster.ca>
  • Date: Tue, 05 Oct 2004 14:35:46 -0500
The following comments were received from Peter Murray-Rust to whom I 
circulated the latest discussion (#6).  Since Peter is not yet on this 
list I am taking the liberty of circulating them to the rest of the 
group, as he has some very pertinent comments.  I have taken the liberty 
of deleting parts of the discussion paper that are not relevant to 
Peter's comments in order to keep the file to a manageable size.

David

Peter writes:

I have read to the end of the TNT example and made several comments. What 
is proposed is valuable and necessary. It is also tackling about 5 hard 
problems which have not been publicly solved. As you will see I think it is 
very ambitious and I am worried about whether it is implementable. It also 
addresses problems of general chemistry as well as crystallography - how 
widely have views been sought in chemoinformatics?

I mention CML at regular intervals. That is because CML has had to tackle 
several of the problems that you do. The only places in chemistry where 
this type of analysis is being conducted publicly are:
- IUPAC stereochemistry (I am on this group)
- other IUPAC committees (I have little knowledge there)
- INChI
- CML
- some OpenSource projects

In some cases I believe that CML has a useful formalism. If so, CIF is 
welcome to it if it helps. In several areas there is not enough use of 
non-proprietary systems to know whether a particular approach works. 
(Proprietary systems can rely on inbuilt semantics and are often not easy 
to evaluate and describe).

I comment several times on implementability. This is critical to decide 
what parts of the spec work. I imagine that one essential software product 
is a tool that will annotate the results of a diffraction analysis as you 
suggest. It depends what tools you already have but I would think that to 
design, write and test it would be at least 3 years. (I'm gauging this by 
the performance of the very good authors of OpenSource display and editing 
tools).

My comments are made in the order of reading the document

I will prefix remarks by

PMR Response
-------------------

Generally...
I think you are addressing important points. There are several of them and 
they are complex. They have not all been fully or even partially solved 
elsewhere. The following points are taken from my 15 years' writing 
software for CIF and related systems and I hope they will be taken positively.

1. IMO a system has to be implementable. I adhere to the IETF motto "rough 
consensus and running code". I tried hard in the mid-90's and later at 
Syd's invite to implement the DDL system. It is far more difficult to 
implement than it appears on reading. In any such system there are lots of 
nuances which can be surprisingly difficult. I have now formally 
implemented a complete CIF DOM using the SAX/DOM/Infoset approach, but 
without dictionary control. Similarly I spent some time looking at mmCIF 
and again found that the task was large. mmCIF has been implemented, but 
has a considerable amount of dedicated resource. So do you have similar 
resource?

So if you wish software to *read and understand* this design it will need a 
lot of effort. Without prototyping the system you don't know whether the 
design is complete or self-consistent. Note the questions I have asked 
recently on COMCIFs - there are still several areas that I have no 
definitive answers to and I am sure this is likely to happen here. So I'm 
simply asking you to be aware of the size of the problem. IMO this was the 
problem with MIF - it was a reasonable spec but no-one really implemented it.

Also who or what will generate these files? I think the analysis you 
present below for TNT will require quite a bit of heuristic software, so 
presumably an author has to edit it. At the least they will have to have 
an editing tool to keep the referential integrity for the pointers.
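As a minimal sketch of the kind of referential-integrity check such an editing tool would need, the following Python assumes a parent loop of atom ids and a child loop of bonds that point at them. The function name, the dictionary keys and the sample ids are hypothetical and are not taken from the draft.

```python
def dangling_references(parent_ids, child_rows, key):
    """Return the child rows whose `key` value is not a known parent id."""
    known = set(parent_ids)
    return [row for row in child_rows if row[key] not in known]


# Hypothetical data: ids from a parent atom loop, pointers from a bond loop.
atom_ids = ["C1", "C2", "N2"]
bonds = [
    {"atom_1": "C1", "atom_2": "C2"},
    {"atom_1": "C2", "atom_2": "N2"},
    {"atom_1": "C2", "atom_2": "N9"},  # N9 was never defined in the parent loop
]

broken = dangling_references(atom_ids, bonds, "atom_2")
print(broken)
```

A real tool would have to run a check like this for every parent-child pair of items, each time the author edits the file.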

2. There are several concepts here that we are tackling in CML and 
have not fully solved. I distinguish "design" where there is a formal spec 
and "implement" where it is actually shown to work or not. They include:
        - multiple conformations (designed but not implemented in software)
        - unique atom ids (designed and implemented, but possibly fragile)
        - levels of indirection (pointers) (designed but not widely implemented)
        - role of atomSets (your tectons) (designed and partially implemented)
        - polymeric systems (not designed; far too difficult at present)

  It is possible to come up with reasonable solutions but it is often 
unclear how well these will stand up to the variety of examples that they 
will be exposed to when the system is released. It almost certainly will 
require a redesign. That has happened for CML and I predict it will occur here.

As an example I am on an IUPAC group tackling stereochemical representation. 
It is a tough problem. If/when some consensus is reached all of that would 
ideally have to be implemented in CIF solely to help describe what the 
substance actually is.

3. My suggestion would be to be somewhat less ambitious and to engage the 
community in a structured program
 - concentrate on molecular crystals (they are easier and the 
informatics/representation is better understood)
 - get the authors actually to submit the chemical structures (if this 
doesn't happen then the design won't be tested)
 - anticipate the problems of mapping crystallographic atoms onto chemical 
connection tables. These are primarily:
    - symmetry
    - disorder
    - unreported atoms

Detailed comments follow

>HDF on tecton v molecular unit
>--------------------------------------
>   I've used the word 'tecton' to mean a general building block instead of
>molecular_unit. I heard it used in a talk by Guy Orpen but Guy has written to
>me to say he did not invent. He has sent me a few references which I have not
>yet had time to read.
>
>IDB response
>-----------
>I have adopted this terminology to refer to a collection of bonded atoms whose
>topology we describe.  These may not conform to the concept of a tecton used
>elsewhere, so I hope this does not cause confusion.

PMR Response
-------------------

I'm not sure tecton is a good term. I find the following on the web

a concise definition of a tecton. A tecton is a molecule whose interactions 
are dominated by particular associative forces that induce self-assembly of 
an organized network, with specific architectural or functional features.

and there are others similar. Your use of tecton will collide with those 
and will mean nothing to most chemoinformaticians. CML uses atomSet, other 
programs use fragment. You may also have to decide whether an atom can 
belong to two tectons, or whether a symmetrical tecton can map onto itself.

IDB
---
>In summary, the list-reference is always a string that has no semantic content
>but is used solely for file management (e.g., locating particular lines in a
>list).  Such strings are never parsed by the computer.  The chemical
>information always resides in other items on the line.

PMR Response
-------------------
In CML the use of unique atom IDs is critical and they are required to 
define bonds, torsions, etc. Uniqueness is only required within a 
"molecule" - a user-defined set of atoms. The atoms *can* be canonically 
labelled through the IUPAC INChI if required. This is what we provide on 
our web site http://wwmm.ch.cam.ac.uk/Bob
Software has to be able to distinguish atoms in different molecules and 
resolve/ignore duplicate ids. Merging molecules is a problem if two 
molecules are joined.
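One possible way round the duplicate-id problem when molecules are merged is to qualify each atom id with a per-molecule prefix. The sketch below is an illustration under that assumption only. It is not how CML or the CIF draft does it, and the function name, prefixes and sample molecules are hypothetical.

```python
def merge_atom_ids(mol_a, mol_b, prefix_a="m1_", prefix_b="m2_"):
    """Merge two {atom_id: element} tables, prefixing ids to keep them unique."""
    merged = {prefix_a + atom_id: el for atom_id, el in mol_a.items()}
    merged.update({prefix_b + atom_id: el for atom_id, el in mol_b.items()})
    if len(merged) != len(mol_a) + len(mol_b):
        raise ValueError("prefixes did not make the atom ids unique")
    return merged


# Both molecules use the ids H1 and H2, which is legal while they are separate.
water = {"O1": "O", "H1": "H", "H2": "H"}
ammonia = {"N1": "N", "H1": "H", "H2": "H", "H3": "H"}

merged = merge_atom_ids(water, ammonia)
print(sorted(merged))
```

Any bond, angle or torsion that refers to the old ids has to be rewritten in the same step, which is where the fragility arises.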

>HDF proposes that all bonds and angles be defined in terms of atoms
>-------------------------------------------------------------------
>In defining the geometry of a tecton David uses two atoms to define a
>geometric bond, two topological bonds to define an angle and worries what
>should be the correct way of doing a dihedral angle either by way of atoms or
>bonds. I maintain that the only correct way to define the geometry is in all
>three cases to use a set of atoms: 2 for distances, 3 for angles and 4 for
>dihedral angles. The reason is as follows: the geometry section allows
>interatomic distances to be specified but nothing requires that the two atoms
>concerned form a bond as defined by the topology; similarly the three atoms
>used to specify an angle may or may not be forming bonds as specified by the
>topology; etc. One fairly frequently specifies angles by specifying
>interatomic distances between atoms which are not bonded as defined by the
>topology.
>
>IDB response
>------------
>I agree.  This simplifies the CIF since (almost) all bonds, distances, angles
>and mappings use the atom_ids that are defined in the tecton_topology_atom
>loop.


PMR Response
-------------------
This is how CML does it
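The atom-based convention (2 atoms for a distance, 3 for an angle, 4 for a torsion) can be sketched as below. This is only an illustration: it assumes Cartesian coordinates, so fractional crystallographic coordinates would first have to be converted using the cell, and the function names are hypothetical. Nothing in it requires the atoms to be bonded in the topology.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def distance(p1, p2):
    """Distance between two atoms."""
    d = _sub(p2, p1)
    return math.sqrt(_dot(d, d))


def angle(p1, p2, p3):
    """Angle p1-p2-p3 in degrees, with p2 at the apex."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def torsion(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


print(distance((0, 0, 0), (3, 4, 0)))
print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(torsion((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```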

IDB
---
>Howard mentions crystallographic models of disorder that have no atomic
>description.  A method is currently being developed by the CIF Core Dictionary
>Maintenance Group whereby the number of electrons in a diffuse patch of
>electron density will be indicated by including dummy atoms in the _atom_site
>loop, representing atoms presumed to be present in the crystal.  A direct
>mapping from the topology to the real and dummy atoms in the atom_site loop
>should be possible even in these cases even though the positions of the dummy
>atoms are not defined.

PMR Response
-------------------
CML can support concepts like this. There is no requirement in CML to have 
3D coords for all atoms. Also an atom can possess a number of electron children

>HDF comment on compatibility with INChI
>---------------------------------------
>I would like to have some reassurance that the molecular data structures we
>are trying to define in CIF are as compatible as possible with those used in
>the IUPAC project for producing unique chemical identifiers (I've forgotten
>its name yet again).
>
>IDB response
>------------
>The name is now INChI, the IUPAC-NIST CHemical Identifier.  I don't think
>there is a problem at the topology level, but I don't know if there are
>problems in dealing with conformers.  I will have to look into this.

PMR Response
-------------------
There is no way that I know of for labelling conformers other than humans 
making arbitrary labels such as "chair". Conformers involve mapping real 
numbers onto labels and this cannot be done canonically

>HDF comment on dangling bonds
>-----------------------------
>In the chemical sub-groups like the 1,2,4,6 benzene and nitro groups, it makes
>sense to me to include the dangling bonds. I'm also very much in favour of
>including ALL the atoms especially the hydrogen atoms.
>
>IDB response
>------------
>Fine.  I have retained this feature of Howard's proposal, but see the note
>below about the difficulties of mapping dangling bonds.

PMR Response
-------------------
One of the most important contributions would be to require that EVERY atom 
is reported. This allows systems like INChI to determine **what the 
molecule actually is**. So please include every atom exactly once.
Question - could software automatically read the TNT example and come out 
with the correct formula for the molecule?
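If every atom really is listed exactly once, deriving the formula reduces to counting. The sketch below assumes the element symbols have already been extracted from the atom list (that parsing step is not shown) and writes them in Hill order: C, then H, then the rest alphabetically. The function name is hypothetical.

```python
from collections import Counter


def hill_formula(elements):
    """Build a formula string such as 'C7 H5 N3 O6' from element symbols."""
    counts = Counter(elements)
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return " ".join(e if counts[e] == 1 else f"{e}{counts[e]}" for e in order)


# One symbol per atom of TNT, each atom counted exactly once.
tnt_atoms = ["C"] * 7 + ["H"] * 5 + ["N"] * 3 + ["O"] * 6
print(hill_formula(tnt_atoms))
print(hill_formula(["N", "O", "O"]))
```

Atoms that are listed twice (for example in overlapping tectons or in both disorder components) would inflate the counts, which is the practical reason for the "exactly once" rule.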

>HDF comment on mapping
>----------------------
>   I was disturbed by David's use of the word 'map'. In mathematics it has a
>very precise meaning [If you map set A on to set B then you have to assign one
>single element of B to every element in A. This means every element in A has
>to have a unique son in B although several different elements in A can lead to
>the same element in B. Also whereas every element in A must have a son in B,
>not every element in B has to be the son of an element in A.] Especially in
>the relation between the tectons and the crystal structure, these criteria
>were not being obeyed.
>
>IDB response
>------------
>While it would be good to adhere to the mathematical practice, I wonder how
>well it would be observed by crystallographers most of whom are unaware of the
>mathematical rules, particularly as the kind of 'mapping' we do in this file
>does not, in general, follow the mathematical rules.  Perhaps we should use a
>different word, but I have not been able to think of a good substitute.  I
>have continued to use the word 'map' in the present draft in the absence of a
>suitable alternative even though the mathematical rules of mapping are not
>followed.

PMR Response
-------------------
CML has a map element which consists of a series of links. A link can have 
many roles (it is modelled on XML's XLink). One role is to map atoms in one 
molecule onto those in another.
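Whether a given set of links is a map in the strict mathematical sense described by HDF can be tested mechanically. The sketch below represents the links as (source, target) pairs and checks that every source atom has exactly one target. The atom labels are hypothetical and the function is an illustration, not part of CML or of the CIF draft.

```python
def is_mapping(links, domain):
    """True if every element of `domain` is the source of exactly one link."""
    sources = [source for source, _ in links]
    return sorted(sources) == sorted(set(domain))


nitro_atoms = ["N", "O1", "O2"]

# Each nitro atom is linked to exactly one atom of the larger molecule.
ordered = [("N", "N4"), ("O1", "O41"), ("O2", "O42")]

# In a disordered group one atom may be linked to two alternative sites,
# so the links no longer form a map in the strict sense.
disordered = [("N", "N2"), ("O1", "O21a"), ("O1", "O21b"), ("O2", "O22a")]

print(is_mapping(ordered, nitro_atoms))
print(is_mapping(disordered, nitro_atoms))
```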


>IDB response
>------------
>The present proposal has a remarkable flexibility.  The topology can be mapped
>to the crystal structure with or without reference to the conformation, but if
>the conformers are specified, they can be combined in any desired way that
>matches the known or supposed molecular structure of the crystal.  Similarly
>different isomers may be mapped to the crystal in any combination.  (N.B.
>isomers differ at the topological level, conformers have the same topology but
>differ at the geometry level).

PMR Response
-------------------
If there is flexibility in interpretations it is harder to write programs. 
It is easiest when there are no implicit semantics or heuristics

>HDF on the need for all the atoms in the bond graph to be in the crystal
>------------------------------------------------------------------------
> > It is not necessary that the molecular units (tectons) account for all
> > the atoms found in the crystal structure, nor that the crystal structure
> > contain all the atoms specified in the molecular units.
>
>  I have no trouble with the first part of the sentence but the second part
>after 'nor' leaves me somewhat perplexed. I expected that all of the atoms
>specified in the molecular units would be in the crystal structure even if one
>could not see them clearly. Could you give examples of what you have in mind
>here.
>
>IDB response
>------------
>The inherent structure of the CIF does not require that every atom in the
>crystal map onto an atom in the tecton and vice versa.  To require such a
>restriction seems unnecessary and difficult to enforce in any automatic way.
>Someone may wish to define a tecton for which no crystal structure has been
>reported, or for which only the unit cell is known.  They may wish to define a
>monomer and a dimer as two tectons, where only the monomer appears in the
>crystal.

PMR Response
-------------------
I can certainly see that there can be:
- atoms without known positions but whose bonding is known or assumed
- atoms without bonding but with known positions
CML manages these by including each atom exactly once and deciding whether 
it should have formal connectivity or 3D coordinates or both or neither. If 
atoms are included multiple times it may become very difficult to determine 
what the constituents actually are.
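The "each atom exactly once" approach can be sketched as a single atom record in which the coordinates and the connectivity are both optional. The class and field names below are hypothetical and do not reproduce the CML schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Atom:
    id: str
    element: str
    xyz: Optional[Tuple[float, float, float]] = None    # None: position unknown
    bonded_to: List[str] = field(default_factory=list)  # empty: no formal bonds


def describe(atom):
    """Say which of the two optional descriptions the atom carries."""
    has_xyz = atom.xyz is not None
    has_bonds = bool(atom.bonded_to)
    if has_xyz and has_bonds:
        return "bonded and located"
    if has_bonds:
        return "bonded, position unknown"
    if has_xyz:
        return "located, no formal bonding"
    return "present, neither bonded nor located"


atoms = [
    Atom("C1", "C", (0.0, 0.0, 0.0), ["H1"]),
    Atom("H1", "H", None, ["C1"]),      # unreported hydrogen, bonding assumed
    Atom("O9", "O", (1.0, 2.0, 3.0)),   # isolated site with no connectivity
]
for atom in atoms:
    print(atom.id, describe(atom))
```

Because each atom appears in one list only, the constituents can still be counted even when part of the description is missing.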

On this point there are many CIFs where it is impossible to tell what the 
substance actually is simply by looking in the CIF. Proper use of measured 
formula and chemical connectivity could have solved many of these problems

>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>2. PREAMBLE TO THE REPORT TO THE CORE DICTIONARY MAINTENANCE GROUP
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>With this addition, the CIF may include a description of the bonding topology
>of one or more tectons, a tecton being defined as a group of atoms linked by
>bonds, usually representing a molecule, complex or functional group.  There is
>no limit to the number of tectons that may be described in a given CIF and
>there is provision for mapping the atoms of one tecton onto the atoms of
>another, as well as identifying the atoms in a tecton with the atoms in the
>crystal.

PMR Response
-------------------
You may have to decide how to determine whether inter-tecton bonds count 
atoms more than once

>The conformation and geometry of the tectons are given in the tecton_conformer
>and tecton_geom categories, the former identifying the different conformers
>that may be present, the latter defining their geometry.
>
>   TECTON_CONFORMER         Lists different conformers and their properties
>   TECTON_CONFORMER_EQUIV   Defines the geometry labels of conformer
>   TECTON_GEOM_ATOM         Gives coordinates of ideal geometry
>   TECTON_GEOM_DIST         Gives ideal interatomic distances
>   TECTON_GEOM_ANGLE        Gives ideal bond angles
>   TECTON_GEOM_TORSION      Gives ideal torsion angles

PMR response
------------
Are conformers only relevant for disordered structures or might a species 
such as TNT have one NO2 tecton with three conformations (I would argue 
against that)

>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>                            3. SAMPLE CIFS
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>At this stage in the development of the tecton description we are not
>attempting to write dictionary definitions, but only create two sample CIFs to
>ensure that the file structure is organized in a way that can handle both
>organic and inorganic crystals in the simplest possible way.
>
>The first CIF describes the structure of the molecule trinitrotoluene, TNT.
>It shows how a molecule with a finite bond graph is handled when the molecule
>lies on a Wyckoff special position and two of the nitro groups are disordered.
>By way of illustration, tectons corresponding to several subunits of the
>molecule are also defined and are mapped onto the molecule itself.
>
>The second CIF describes the structure of CaCrF5 which has an infinite bond
>graph and a formula unit that spans more than one asymmetric unit.
>
>[Editorial comment: Data names may be changed in the final report and
>dictionary definitions will eventually be needed.  Suggestions for better
>names are welcome.  Items marked as 'list-reference' are required for the
>management of the CIF's relational file structure and must be unique for each
>line in a list.  The list-reference item in one loop is frequently parent to
>similarly named items in other loops.  There is at least one serious
>unresolved problem (in the geom categories of the second CIF).  Its solution
>is deferred to the next draft.  The following sample CIFs contain extensive
>comments to explain how the CIF is to be interpreted.  The same CIFs with the
>comments stripped out appear at the end of this file so that one can see more
>clearly what they look like.]

PMR Response
-------------------
These preambles are a useful overview of the problem


>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>                         3.1 FIRST SAMPLE CIF
>                         --------------------
>
>                        TRINITROTOLUENE
>
>                          O    CH3  O
>                          |    |    |
>                    O --- N2   C1   N6 --- O
>                           \  /  \ /
>                             C2  C6
>                             |    |
>                        H -- C3  C5 -- H
>                              \   /
>                               C4
>                                |
>                                N4
>                               / \
>                              O   O
>
>In the fictitious crystal structure I have invented for the purposes of this
>illustration, the molecule contains a crystallographic mirror plane that
>passes through the methyl group and the N4 nitro group and is perpendicular to
>the plane of the molecule.  The N2 and N6 nitro groups are related by the
>mirror plane and are disordered with the two components each having occupation
>numbers of 0.5.  Because of the disorder the crystallographic structure does
>not define the point group of an individual molecule.  By choosing one
>combination of the disordered nitro groups the molecule would have Cs
>symmetry, but by choosing a different combination the individual molecules
>would have C1 symmetry.  Either or both combinations may of course be present
>in the real crystal but x-ray diffraction cannot distinguish between them.

PMR Response
-------------------
It would be useful to see an example of a simple structure without 
problems, and perhaps one without disorder but either symmetry or multiple 
molecules. I think the present example is trying to tackle too many 
problems at once

>############# Beginning of first CIF #############
>#
>#
>data_disordered_TNT
># If a crystal contained molecules of more than one compound, or more than one
># isomer of a compound, each would be described by a separate tecton.
># If the crystal contained more than one copy of the same molecule in the
># asymmetric unit (Z'>1) the topology of the tecton would be given only once
># but it would be mapped onto all the crystallographically distinct copies.

PMR Response
-------------------
So there is one tecton definition but it is dereferenced twice?

># Need to find out about IUPAC rules for naming conformers

PMR Response
-------------------
AFAIK there aren't any

HDF
---
># group of permutations (of atoms). If I understand correctly there are no
># standard symbols for these automorphism groups although it seems that in a
># fair number of cases they are isomorphic to a point group in three
># dimensions. So often one could use a Schoenflies symbol. [Even that is
># equivocal - point groups Cs, Ci and C2 are isomorphic] The TNT graph should
># be given the symbol C2v for its _tecton_graph_automorphism_group. [C2v is
># isomorphic to D2 and C2h. Why is it I prefer C2v? Am I (are we) too geometry
># oriented?
># I think we (David?) need to call John Rutherford again.] What one does in
># the case that the graph automorphism group is not isomorphic to a 3D point
># group, I do not have the least idea apart from writing a ? .

PMR response
------------
There are many groups that are not isomorphic to a point group. They 
include permutation groups and products. I spent some time many years ago 
looking at whether such groups could usefully be represented geometrically.

># IDB response
># ------------
># The CIF dictionary already contains instructions for drawing a 2-D molecular
># diagram in the group of chemical_conn categories.  Although the
># chemical_conn categories also describe the topology of a molecule they are
># not a substitute for the tecton categories because
>1) they are restricted to
># organic molecules,

PMR Response
-------------------
...restricted to molecules where a connection table is a valuable 
description. Many inorganic molecules would fit this

>  2) they are designed only to display a molecular diagram,

PMR Response
-------------------
The display is not an integral part of a connection table. There are some 
stereochemical features which are sometimes represented graphically but 
otherwise the connectivity is what matters

># 3) only one molecule can be described

PMR Response
-------------------
CML can store multi-molecules - e.g. hexane+urea. The problem seems to come 
from conformers

>and 4) the atoms are not mapped onto
># the atom_sites in the crystal.

PMR Response
-------------------

This is a serious omission. CML supports this intrinsically as there is 
only one table

>loop_
>_tecton_topology_id           # List-reference
>_tecton_topology_name         # Name e.g. full IUPAC name
>_tecton_topology_formula      # Numbers of atoms in the tecton
>_tecton_topology_Zprime       # Number of symmetry independent copies of the
>                                     # tecton in the crystal
>_tecton_topology_special_details
>TNT   '2,4,6 trinitrotoluene'     'C7 H5 N3 O6' 1  molecule
>BNZ   '1,2,4,6 benzene ring'      'C6 H2'       1  moiety
>NITRO 'nitro group'               'N O2'        2  group
>#

PMR Response
-------------------
CML is only just starting to tackle the problem of describing molecules as 
assemblies of fragments. Do your fragments have unfilled valences, dummy 
atoms, etc.?

># I have added an item _tecton_topology_atom_chirality which is not needed in
># this example, but is needed in chiral structures to identify any atom that
># serves as a chiral center.  Chirality is not captured by the topology, but
># it is, like topology, a feature of the structure that can only be changed by
># breaking and making bonds.  It is included here because it is more closely
># related to the topology than to the geometry which can be changed without
># breaking any bonds.  I will defer to others what values should be associated
># with this item - presumably some letter like R or S.
>#

PMR Response
-------------------
Your "chirality" appears to be atom connectivity. It needs careful definition

># Howard has added dangling bonds to show the full coordination around all the
># atoms in the tecton as well as indicating the points at which the tecton is
># attached to other species.  The dummy atoms are indicated by the default '.'
># meaning that this atom cannot be defined.  It would make sense to include
># the dangling bonds when, for example, the benzene ring is mapped onto TNT in
># the map_tecton loop.  However, this requires that the atom at the far end of
># the dangling bond be given a name, which in turn means that the name must be
># added to the tecton_topology_atom list in order to preserve the parent-child
># relations.  It would be necessary to identify such atoms as dummies, which
># could be done by assigning them a non-existent atom_type such as X though
># this would in turn have to be defined in the atom_type loop.  It all seems a
># little convoluted. A simpler scheme may be possible.
># To avoid these problems the dangling bonds are not mapped and the CIF is
># fully compliant.

PMR Response
-------------------
CML allows dummy atoms ("Du") or R-groups "R". There may be bonds to such 
atoms.

># HDF on 'delocalized'
># -------------------
>#   Between C1 and C2:
>#    (sigma) there is a sigma bond due to the overlap of a lobe of an sp2
># hybrid on C1 with a lobe of an sp2 hybrid on C2 with consequent sharing of
># electrons. That part of the 'bond' is not delocalized.
>#    (pi) participation in a localized pi bond due to overlap of the pz
># orbitals and consequent sharing of electrons.
>#   I don't think the C1-C2 interaction should be described as 'delocalized'.
># Only a part of the bond could be so described.
>#
># IDB response
># ------------
># I have adopted the convention used by CCDC.

PMR Response
-------------------
Delocalised does not describe the precise mechanism of bonding but the 
formal 2D representation. Different groups have different ideas on 
delocalisation, aromaticity, etc. For this reason the INChI approach - 
don't have any bond types but include every atom exactly once - is superior

># The following loop appears in Howard's draft as a way of allowing the
># geometry common to all conformers to appear only once.  In his draft TNTaa
># etc. as well as TNT were defined as tectons in a previous loop (see above).
>#
># loop_
># _tecton_topology_combine_id              # Child of _tecton_id
># _tecton_topology_combine_source_id       # Child of _tecton_id
># TNTaa   TNT  # Means any information about TNT also applies, as such,
>#              # to TNTaa
># TNTbb   TNT
># TNTab   TNT
># TNTba   TNT
># MAY BE THIS CAN BE DONE LEGALLY WITHIN CURRENT CIF SYNTAX BY SAVE FRAMES ???
># [Save frames are used in dictionaries but are not yet part of CIF - IDB]

PMR Response
-------------------
Unless you have working software I would caution against using save frames

>#
># In the draft presented here the combining of the geometries is achieved in a
># different way which, I believe, is more appropriate for CIF, is more
># flexible and would make programming simpler.  The above loop therefore is
># not part of the current draft.
>#
># The first loop in this group of categories is one that identifies the
># different conformers, but if only one conformation is present this loop may
># be omitted unless one wished to give properties of the geometry as a whole
># such as the point group of the tecton.
>#
># Since in the TNT example the ideal geometries of the conformers differ only
># in the torsion angles, the remaining geometry of the molecule is
># common and need only to be given once.  This means that each conformer is
># described in part by items that give the common geometry and in part by
># items that give the distinctive geometry of the conformer (the torsion
># angles in this case).  Each distance or angle in the geom loops is assigned
># a conformer_label (e.g. aa, ab, all) to identify which conformer (or group
># of conformers) it describes.  The second loop in this group
># (tecton_conformer_equiv) associates each conformer_id with the appropriate
># conformer_labels.
># Then follow the tecton_geom loops which define the atomic coordinates, the
># interatomic distances, the angles and the torsion angles.

PMR Response
-------------------
CML has an approach to conformers that defines the common properties of a 
molecule or group and allows another molecule or group to override them. It 
is designed but not deployed. I am wary of committing to a design at this stage
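The idea of defining the common properties once and letting a conformer override them can be sketched with a layered lookup. This is an illustration only: the item names and the numerical values are invented placeholders, and the code does not reproduce the CML design or the tecton_conformer_equiv loop.

```python
from collections import ChainMap

# Geometry shared by all conformers (placeholder values, not real TNT data).
common = {"dist C1-C2": 1.40, "dist C2-N2": 1.47, "torsion C1-C2-N2-O": 0.0}

# Each conformer lists only the items that differ from the common geometry.
overrides = {
    "aa": {"torsion C1-C2-N2-O": 30.0},
    "ab": {"torsion C1-C2-N2-O": -30.0},
}


def conformer_geometry(label):
    """Full geometry of one conformer: its overrides first, then the common part."""
    return dict(ChainMap(overrides.get(label, {}), common))


geom_aa = conformer_geometry("aa")
geom_ab = conformer_geometry("ab")
print(geom_aa)
print(geom_ab)
```

A conformer with no overrides simply inherits the common geometry unchanged.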


Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069


