Discussion List Archives


Re: coreCIFchem #5

Comments by H.D. Flack on coreCIFchem#5

  I've only really considered David's comments and the TNT example; I did not work through the CaCrF5 example in detail. As you have already seen, I've made a new skeleton CIF of the TNT example which to me is simpler and easier to read. I guess it breaks many CIF rules of syntax.

Some general remarks

  One of the aspects of David's implementation, as seen in the TNT example, which troubles me is the necessity for each atom in each molecule to have a unique identifier as coded in '_molecular_unit_atom_mu_id'. As far as I remember there are already several million known molecules, and giving every atom in each molecule a unique identifier is cumbersome to say the least. I definitely think that we will see molecular libraries come into existence either locally or globally. New molecules can be added to a library by editing, cutting and pasting in bits from other molecules, so again atom identifiers containing the molecular name are very heavy in use. One perverse aspect of using unique atom identifiers over a set of molecules is that it does not per se ensure molecular integrity. In defining the bond topology it is quite possible to do the stupid thing of defining a bond between two atoms which are not in the same unit.
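  As a minimal sketch of the integrity point, assume each atom is addressed by the pair (unit, label) rather than by one globally unique identifier. The names `atoms`, `bonds`, `undefined_atoms` and `cross_unit_bonds` are hypothetical and are not CIF data items; the atom labels are only illustrative.

```python
# Hypothetical illustration: these names are not CIF data items.
# Each atom is addressed by the pair (unit, label), so a label needs to be
# unique only within its own unit.
atoms = {
    ("TNT", "C2"), ("TNT", "N2"), ("TNT", "O21"), ("TNT", "O22"),
    ("nitro", "N1"), ("nitro", "O1"), ("nitro", "O2"),
}

bonds = [
    (("TNT", "C2"), ("TNT", "N2")),
    (("TNT", "N2"), ("TNT", "O21")),
    (("nitro", "N1"), ("nitro", "O1")),
    (("TNT", "N2"), ("nitro", "O2")),  # deliberate error: spans two units
]

def undefined_atoms(atoms, bonds):
    """Return bond end points that are not declared atoms."""
    return [end for bond in bonds for end in bond if end not in atoms]

def cross_unit_bonds(bonds):
    """Return bonds whose two atoms belong to different units."""
    return [(a, b) for a, b in bonds if a[0] != b[0]]

print(cross_unit_bonds(bonds))
```

  With a single flat identifier per atom the second check has nothing to compare; with the unit kept as part of the pointer it is a one-line test.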

  Another aspect which seemed rather heavy in David's TNT example was the repetition of certain information. As I see it there are four conformers (aa, bb, ab, ba in David's nomenclature) all of which correspond to the same molecule, meaning to the same molecular topology. To improve this state of affairs it seems natural to input a unit defining only its topological features - the TNT molecule - and then reuse this unit several times, adding in either the minimum or complete geometric information necessary to distinguish the 4 conformers. So I ended up with four units of identical topology but differing geometry and one unit defining only the topology. This makes the relationship between the conformers and the parent molecule easy to perceive in the file. I felt that it was essential to be able to define all four conformers. Although I well understand that often one cannot make an unequivocal assignment of molecules or conformers in the case of a disordered crystal structure, it is certainly also the case at present that models of disorder in molecular crystal-structure determinations are being used which have no possible interpretation in terms of the (assumed) constituent molecules.
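  The arrangement of one topology-only unit reused by four conformers could be sketched as follows. This is only an illustration under assumptions: the class names `Topology` and `Conformer`, their fields, and the truncated atom and bond lists are hypothetical, and the geometry is left empty rather than filled with invented values.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Topology:
    """Atoms and bonds only; no geometry (hypothetical structure)."""
    name: str
    atoms: tuple
    bonds: tuple  # pairs of atom labels

@dataclass
class Conformer:
    """Geometry that distinguishes one conformer, plus a pointer to the topology."""
    label: str
    topology: Topology
    geometry: dict = field(default_factory=dict)  # e.g. torsion angles, left empty here

# A fragment of the TNT topology, defined once (atom list truncated for brevity).
tnt = Topology(
    name="TNT",
    atoms=("C1", "C2", "N2", "O21", "O22"),
    bonds=(("C1", "C2"), ("C2", "N2"), ("N2", "O21"), ("N2", "O22")),
)

# The four conformers in David's nomenclature all reference the same topology.
conformers = {label: Conformer(label, tnt) for label in ("aa", "bb", "ab", "ba")}

for c in conformers.values():
    print(c.label, "->", c.topology.name)
```

  Because every conformer points at the same topology object, the relationship to the parent molecule is explicit and the bond list is stated only once.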

  I would like to have some reassurance that the molecular data structures we are trying to define in CIF are as compatible as possible with those used in the IUPAC project for producing unique chemical identifiers (I've forgotten its name yet again).

  In the chemical sub-groups like the 1,2,4,6 benzene and nitro groups, it makes sense to me to include the dangling bonds. I'm also very much in favour of including ALL the atoms especially the hydrogen atoms.

  I've used the word 'tecton' to mean a general building block instead of molecular_unit. I heard it used in a talk by Guy Orpen, but Guy has written to me to say he did not invent it. He has sent me a few references which I have not yet had time to read.

  I was disturbed by David's use of the word 'map'. In mathematics it has a very precise meaning [If you map set A on to set B then you have to assign one single element of B to every element in A. This means every element in A has to have a unique son in B although several different elements in A can lead to the same element in B. Also whereas every element in A must have a son in B, not every element in B has to be the son of an element in A.] Especially in the relation between the tectons and the crystal structure, these criteria were not being obeyed.
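  The criteria in the bracketed definition can be written as a small test. This is a sketch only; the function name `is_map` and the example sets are hypothetical and stand in for, say, tecton atoms (A) and crystal-structure atom sites (B).

```python
def is_map(pairs, domain, codomain):
    """True if `pairs` assigns exactly one element of `codomain`
    to every element of `domain`."""
    images = {}
    for a, b in pairs:
        if a not in domain or b not in codomain:
            return False          # pair refers to something outside A or B
        if a in images and images[a] != b:
            return False          # an element of A with two different sons
        images[a] = b
    return set(images) == set(domain)  # every element of A must have a son

A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}

# Allowed: two elements of A share a son, and b3 is nobody's son.
print(is_map([("a1", "b1"), ("a2", "b1"), ("a3", "b2")], A, B))
# Not a map: a1 has two sons.
print(is_map([("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a3", "b2")], A, B))
# Not a map: a3 has no son.
print(is_map([("a1", "b1"), ("a2", "b2")], A, B))
```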

  In defining the geometry of a tecton David uses two atoms to define a geometric bond, two topological bonds to define an angle, and worries what should be the correct way of doing a dihedral angle, either by way of atoms or bonds. I maintain that the only correct way to define the geometry is in all three cases to use a set of atoms: 2 for distances, 3 for angles and 4 for dihedral angles. The reason is as follows: the geometry section allows interatomic distances to be specified but nothing requires that the two atoms concerned form a bond as defined by the topology; similarly the three atoms used to specify an angle may or may not be forming bonds as specified by the topology; etc. One fairly frequently specifies angles by specifying interatomic distances between atoms which are not bonded as defined by the topology.
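  To make the atoms-only convention concrete, here is a minimal sketch in which all three quantities are computed from atom positions alone, with no reference to the bond list. The function names and the test coordinates are hypothetical, coordinates are assumed to be in an orthonormal basis, and the sign convention of the dihedral angle is one common choice, not something fixed by the proposal.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def norm(u):
    return math.sqrt(dot(u, u))

def distance(p1, p2):
    """Distance between 2 atoms, bonded or not."""
    return norm(sub(p2, p1))

def angle(p1, p2, p3):
    """Angle in degrees at the middle atom p2, defined by 3 atoms."""
    u, v = sub(p1, p2), sub(p3, p2)
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))

def dihedral(p1, p2, p3, p4):
    """Dihedral angle in degrees about the p2-p3 axis, defined by 4 atoms."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = dot(b2, cross(n1, n2)) / norm(b2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

print(distance((0, 0, 0), (3, 4, 0)))
print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

  None of the functions asks whether the atoms involved are bonded in the topology, which is exactly why a non-bonded distance can be used to fix an angle.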

More specific comments

Disorder in molecular crystals:
  As Greg makes clear, it may well be that there are several interpretations (mappings) which relate the topological definition of a molecule to the atomic coordinates determined from crystal-structure analysis. We must be sure that we provide a mechanism to encode these alternate molecular interpretations and associated geometry. It seems to me that the needs for journal publishing (i.e. checking) disordered structures and the way that they are subsequently entered into a database are somewhat different. For the publishing/checking side of the business one needs to provide a mechanism to evaluate the structural sense of the molecules including the disordered part of the structure.  I've seen too many papers where the disordered part of the structure makes absolutely no molecular sense at all. (One of my colleagues in inorganic chemistry recently received a paper to referee in which about 50% of the electron density was modelled through Ton Spek's BYPASS procedure with no attempt at any molecular interpretation of the disordered region. To our minds in that case the structure analysis could have been improved so we recommended reanalysis.) On the other hand I think that for the data bases, tentative interpretations of disordered regions have much less use and probably what is required is that although the topology of the complete molecule be defined, the mapping (atoms and bonds) between the topology and the crystal structure relate only to those parts of the molecule which are well ordered in the crystal structure. 

> since it does not require the author to specify how the disordered atoms 
> sites are combined in the individual molecules,

  I'm very suspicious of that. One must provide a mechanism that allows multiple mappings of the topological definition of the molecule onto the atoms seen in the crystal-structure analysis. Of course it's not for coreCIFchem to 'require' such information but I certainly see that it could be put to very good use for the purposes of checking a crystal-structure analysis.

> Finally the proposed chemical description allows the ideal geometry and
> conformation of the molecular units to be specified - information which can be
> used during the refinement of the crystal structure of for validating the
> experimental bond distances and angles.

  Typo: "of for" should be "or for"

> It is not necessary that the molecular units account for all
> the atoms found in the crystal structure, nor that the crystal structure
> contain all the atoms specified in the molecular units.

  I have no trouble with the first part of the sentence but the second part after 'nor' leaves me somewhat perplexed. I expected that all of the atoms specified in the molecular units would be in the crystal structure even if one could not see them clearly. Could you give examples of what you have in mind here?

  The paragraph starting
> The decision as to what  .....
  is badly written as it starts off by explaining some of the considerations concerning MOLECULAR UNITS and then drifts into chatting about formula units. The latter no longer figure as an integral part of our current chemical description of a crystal. Also in
> the size of the formula unit is necessarily arbitrary.
 I'm not sure that is quite what you intend to say.

> The geometry may be given by specifying atomic coordinates in a 
> rectangular Cartesian coordinate system of arbitrary orientation,

  Can a 'Cartesian' coordinate system be anything other than rectangular? I think the term you require is 'orthonormal basis'. I think the 'arbitrary orientation' is superfluous.

  As the highest level of operation is the MAP which is qualified at a first level to be molecular unit to molecular unit and further qualified at the second level to be the information on the atoms, I think one should use here and elsewhere in the text:  MAP_MOL2MOL_ATOM etc

> This feature will likely not be used often. 
  It is perhaps not a very good idea to start this description with an item you do not expect to use.

> and infinitely bond graphs.
  Typo: Should be 'infinite'.

> Because of the disorder. the
  Typo: Should be 'Because of the disorder, the'

> here by was of illustration. #
  Typo:  Should be
    here by way of illustration.

> only once. #
   should be 
    only once.


                           H H H
                        O22  C7   O62
                        |    |    |
                O21 --- N2   C1   N6 --- O61
                         \  /  \ /
                           C2  C6
                           |    |
                     H3 -- C3  C5 -- H5
                            \   /
                             C4
                              |
                             N4
                             / \
                            O41 O42

># The list reference items in each loop are unique for each line and are here
># given sequential numbers which is satisfactory for computer analysis but
># makes a visual inspection of the mappings more difficult.

  For the examples I find the sequential numbers as the list reference items make it very difficult to follow what is going on. So I have replaced them by something more contextual.

>                                                              However, the list
># reference items could be constructed from, e.g., the _molecular_unit_id and
># the _molecular_unit_atom_label since the contents of the _*_id character
># string may have any value so long as it is unique within the list.

  Maybe this is yet another thing that I have not really understood about the CIF syntax. David is suggesting the construction of a unique reference item by the concatenation of two others. Why not just use the initial pair of reference items together as a unique pointer in its own right? This is what I have done in my 'improved' CIF.
  What I always have in mind is that for this system to be practical, users will need a 'library' or 'dictionary' of molecules and molecular units, perhaps even a standard one provided by the IUCr (in which case it would need to be referenced but not copied into an individual CIF - BMcM will explain to me how this might be done on my next visit to Chester).

>_molecular_unit_id          # List reference

  Note especially that I think that it should be possible to retrieve the 'molecular' information (topology and geometry) from a data bank / data base. Each 'molecule' should stand in its own right. So the sort of comment that David has in his _details ("This is the whole molecule", "A portion of the TNT molecule", "A group that appears three times in the TNT molecule") should not be included in the above loop, since it renders the information dependent on a particular instance.

> 2  'benzene ring'    'C6 H2'       mm2 1 'A portion of the TNT molecule'
  David's CIF is slightly inconsistent with respect to this portion. The two hydrogens should be systematically included. The same applies to the whole molecule which needs H3 and H5.

> # molecular_geom_atom category. #
  should be 
# molecular_geom_atom category. 

>7  1   C7    C    4   4  ?
>8  1   H71   H    1   1  ?
>9  1   H72   H    1   1  ?
>10 1   H73   H    1   1  ?


>11 1   N1    N    3   3  ?
>12 1   O1    O    2   1  ?
>13 1   O2    O    2   1  ?

  The above three atoms should be N2, O21, O22

>14 1   N1    N    3   3  ?
>15 1   O1    O    2   1  ?
>16 1   O2    O    2   1  ?

  The above three atoms should be N4, O41, O42

>17 1   N1    N    3   3  ?
>18 1   O1    O    2   1  ?
>19 1   O2    O    2   1  ?

  The above three atoms should be N6, O61, O62

> _molecular_unit_atom_valence
  We need to define closely what it is that one wishes to encode under this item.

> # The above items define all the atoms in the molecule.
    What about H3 and H5 ?
    I think that as this is the chemistry bit of the CIF and hydrogen atoms have a considerable influence on the chemistry of a molecule we must avoid the crystallographic habit of forgetting about the H atoms.

> # derived directly form the topology

> 1   1   2   1.5  delocalized     # TNT Benzene ring
   Between C1 and C2:
    (sigma) there is a sigma bond due to the overlap of a lobe of an sp2 hybrid on C1 with a lobe of an sp2 hybrid on C2 with consequent sharing of electrons. That part of the 'bond' is not delocalized.
    (pi) participation in a delocalized pi bond due to overlap of the pz orbitals and consequent sharing of electrons.
   I don't think the C1-C2 interaction should be described as 'delocalized'. Only a part of the bond could be so described.

>11  2  11   1.5  delocalized     # TNT N2 nitro group
>12  11 12   2.0  double
>13  11 13   2.0  double
>14  4  14   1.5  delocalized     # TNT N4 nitro group
>15  14 15   2.0  double
>16  14 16   2.0  double
>17  6  17   1.5  delocalized     # TNT N6 nitro group
>18  17 18   2.0  double
>19  17 19   2.0  double

>26  26 27   2.0  double           # Nitro group
>27  26 28   2.0  double

   The boys in chemistry here did not go much on those bond orders.

># The rest of this loop lists the bonds in the benzene ring (20-25) and nitro
># group (26-27) molecular units.

  I see good reason to code the dangling bonds in the molecular moieties. I've done this in my CIF.

> # being the enatiomer of the other). 

  Sorry for some of the repetition but I jotted down these notes as I went along.

coreCIFchem mailing list
