Discussion List Archives

coreCIFchem Discussion #3

  • To: corecifchem@iucr.org
  • Subject: coreCIFchem Discussion #3
  • From: David Brown <idbrown@mcmaster.ca>
  • Date: Mon, 26 Jan 2004 12:00:57 -0500
Dear colleagues,

     Thank you for the feedback on coreCIFchem discussion paper
#2.  In Part I of this email I summarize the discussion, in
Part II I make a proposal and in Part III I enclose a CIF that
illustrates the proposal.  I would invite you to comment on these
by replying to the group list before February 29.  If you need
more time, let me know and we can negotiate a later date.  This
email can be saved as a text file.  It prints on about 15 pages.


I. David Brown, Professor Emeritus of Physics
Brockhouse Institute for Materials Research
McMaster University, Hamilton, Ontario
Canada L8S 4M1
Tel: +905 525 9140 ext 24710
Fax: +905 521 2773
email: idbrown@mcmaster.ca

================= PART I =========================
  In the previous paper I outlined a possible graph-theoretical
approach to the problem and ended with the following:

 >>     Before we try to define CIF items for particular chemical
 >> concepts, we need to have a consensus about the definition of
 >> a molecule.  I have made some
 >> suggestions above, and I would be interested in people's
 >> comments.  Is graph theory a fruitful way to go or should we
 >> take a different approach?  What are the problems we might
 >> encounter using the approach described above?

Comments related to the use of graph theory
John Bollinger commented:-

    I find the idea of relating chemical properties to a bond
  graph to be rather attractive, although I confess to the
  influence of a bit of background in formal mathematics.

    One aspect that David left unexplored is the possibility of
  applying multiple properties to bond graph edges.  One need
  not choose a single measure of bond strength, nor make
  measures of bond strength the only properties a bond graph
  edge may have.  For instance, one could apply an explicit
  bond categorization (e.g. "covalent", "dative", "hydrogen
  bond", "non-bond").  One could also express purely chemical
  information, such as the fact that "this is the bond that is
  broken in the course of the von Foo reaction".  CIF is
  conveniently flexible in this regard, as authors may include
  exactly the properties they wish to describe while ignoring
  all others.

    One oddity I see with that approach is that the interatomic
  distances currently described by _geom_bond_distance fit
  nicely into the collection of properties that could be
  associated with bond graph edges, but many other _geom_*
  items do not.  Graph theorists do have concepts that could
  be applied there, but we must take care to avoid making CIF
  (more) incomprehensible to mere mortals.  Perhaps, though,
  it does make sense to consider whether all the various
  geom_* categories should be subsumed into a scheme such as
  this -- they are all examples of data that have both
  crystallographic significance and chemical significance. 

Herbert Bernstein also argues for a flexible scheme:-

    How about this approach -- instead of trying to agree on a
  definition of a molecule, should we not be trying to clearly
  state a reasonable set of chemical views of matter and,
  where possible the relationships and/or transformations
  among those views, as has already been started for
  macromolecules.  The approaches, chemical formulae, 3-D
  structures, charge density maps, residue-based polymeric
  descriptions, all have something to say about how various
  chemical entities can interact with other chemical entities,
  which is, after all, what Chemistry is all about.

Comments related to the definition of a molecule
Howard Flack:-

    One property of a molecule is that it is composed of atoms.

    Even better would be:

      A molecule is a set containing atoms and molecules.

    It might be advisable to add the restriction that an atom,
  as an individual member, may only occur in one molecule.
  This restriction definitely does not apply to the non-
  molecular tectons or 'building blocks' as in the quartz

John Bollinger explores a number of possible approaches:-

    As chemists in general have no single consistent definition
  of a molecule, it would be fruitless for us to attempt to
  impose a universal one of our own in hopes of satisfying
  everyone.  The alternatives I see are
    (1) to choose our own definition for CIF purposes and use it;
    (2) to support diverse CIF items with which to describe
  multiple different molecule concepts;
    (3) to provide sufficient data for a chemist to apply his or
  her own definition of "molecule"; or
    (4) to attempt to ignore molecules altogether.

    Although (4) is perhaps most true to pure crystallography, I
  think it is least suitable for our purpose.  Option (2)
  strikes me as inelegant and short-sighted.  Option (1) might
  be feasible if we could actually come up with a suitable
  definition, but I question whether that is possible. That
  leaves option (3), to which category I would assign most
  applications of the bond graph approach that I have imagined
  so far.

Summary of these comments:-
We should take advantage of the ability of both graph theory and
CIF to associate a variety of different properties with the
vertices and edges (atoms and bonds) of a graph in a way that
leaves people free to use whatever model they feel comfortable
with.  The definition of a molecule should also be left flexible
with the possibility of nesting smaller molecular units within
larger ones.
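
As a rough illustration of this summary, the sketch below (in Python,
which is not part of the proposal) attaches free-form property
dictionaries to the atoms and bonds of a bond graph.  The property
names "category" and "order" are invented for the example; the atom
labels and bond orders are those of the nitro group at C2 in the
sample CIF of Part III.

```python
# Atoms (vertices) and bonds (edges) each carry an open-ended set of properties.
# The property names are hypothetical, chosen only for this illustration.
atoms = {
    "C2": {"element": "C"},
    "N2": {"element": "N"},
    "O21": {"element": "O"},
    "O22": {"element": "O"},
}
bonds = {
    frozenset(("C2", "N2")): {"category": "covalent", "order": "single"},
    frozenset(("N2", "O21")): {"category": "covalent", "order": "double"},
    frozenset(("N2", "O22")): {"order": "double"},  # category deliberately omitted
}


def bond_property(a, b, name):
    """Return a bond property, or '?' (the CIF unknown value) if it is not given."""
    return bonds.get(frozenset((a, b)), {}).get(name, "?")


def neighbours(label):
    """Atoms joined to `label` by an edge of the bond graph."""
    return sorted(other for edge in bonds if label in edge for other in edge - {label})


print(bond_property("O21", "N2", "order"))     # double
print(bond_property("N2", "O22", "category"))  # ?
print(neighbours("N2"))                        # ['C2', 'O21', 'O22']
```

An author can supply only the properties of interest; anything omitted
simply reads back as unknown.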

=================== PART II ===========================
Below I outline an arrangement of categories built on the ideas
expressed above.  As it is important that we have a clear idea of
what these categories represent and the relationships between
them I start with some definitions.

Concepts and definitions
I shall refer to clusters of atoms as MOLECULAR UNITS.  The atoms
in these molecular units need not be fully connected by bonds but
frequently will be.  The molecular units may be charged (e.g.,
complex ions) and they may consist of a single atom.  Molecular
units may be nested, e.g., MgSO4.7(H2O) could be broken down into
{H14, Mg, O11, S} or {Mg, (SO4), (H2O)7} or {Mg(H2O)6, SO4, H2O}
or described at the highest level by the formula unit
{MgSO4(H2O)7}.  Each
of the clusters divided by commas in the above descriptions is a
molecular unit and there is a hierarchy in which the molecular
units at the higher levels are composed of molecular units at a
lower level.
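
The hierarchy can be made concrete with a small Python sketch (an
illustration only; the dictionary layout and the function name
atom_counts are my own assumptions, not proposed CIF items).  It
expands the decomposition {Mg(H2O)6, SO4, H2O} recursively and
recovers the atom list {H14, Mg, O11, S}.

```python
from collections import Counter

# Lowest-level molecular units, given directly as element counts.
LEAF_UNITS = {
    "Mg": Counter({"Mg": 1}),
    "SO4": Counter({"S": 1, "O": 4}),
    "H2O": Counter({"H": 2, "O": 1}),
}
# Higher-level units, given as (component unit, multiplicity) pairs.
NESTED_UNITS = {
    "Mg(H2O)6": [("Mg", 1), ("H2O", 6)],
    "formula_unit": [("Mg(H2O)6", 1), ("SO4", 1), ("H2O", 1)],
}


def atom_counts(name):
    """Expand a molecular unit recursively into element counts."""
    if name in LEAF_UNITS:
        return Counter(LEAF_UNITS[name])
    total = Counter()
    for component, multiplicity in NESTED_UNITS[name]:
        for element, n in atom_counts(component).items():
            total[element] += n * multiplicity
    return total


print(dict(sorted(atom_counts("formula_unit").items())))
# {'H': 14, 'Mg': 1, 'O': 11, 'S': 1}
```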

The FORMULA UNIT is the smallest number of atoms that captures
all the differences in the chemical and crystallographic
structures.  It is at least as large as the asymmetric unit and
should be no larger than the contents of the unit cell.  It may
contain more than one molecule in cases where Z' > 1.  It will
contain more than one asymmetric unit if the molecular units of
interest contain crystallographic symmetry.

There are three layers at which a DESCRIPTION OF A STRUCTURE can
be given:

1: The highest layer of description is the TOPOLOGICAL
description which consists of a listing of the atoms and the
bonds that link them.  This is the graph theoretical level.  The
topological description is close to the nineteenth century model
of chemical structure described by the familiar 2-dimensional
molecular diagrams (bond graphs).  It is the normal level at
which molecular structure is described, organic syntheses are
planned and chemical structures are taught to students.  The list
of atoms is determined by the chemical formula (assumed to be
given) and the bonds between them are assigned by applying a set
of simple rules.  This topological description does not include
any information on the three-dimensional geometry such as bond
lengths and angles.  Although it is customary to draw a molecular
diagram as the projection of the 3-dimensional structure, the
positions of the atoms in a bond graph are not defined: atoms can
be placed in any arbitrary position, and only the linkage between
them is important.  The description at this layer excludes
properties such as the electron density distribution, which
depend on a knowledge of the atomic positions.

2: Next is the GEOMETRY layer.  In this layer the 3-dimensional
coordinates of atoms are introduced, usually expressed using an
orthogonal Cartesian basis.  Atom labels are identical to those
defined in the topological layer.  Interatomic distances and
vectors can be derived and, by mapping from the topological
level, distances can be identified as either bonds or not-bonds.

3: Third is the CRYSTAL layer.  The lattice parameters form the
basis for the atomic coordinates.  The atoms of the formula unit
are labelled by augmenting the labels of the asymmetric unit with
the appropriate crystallographic symmetry operation.  Interatomic
distances may be calculated, and can be classified as bonds or
not-bonds by mapping from the topological layer.  Although there
are effective empirical schemes for assigning bonds on the basis
of interatomic distance, there is no guarantee that this
assignment agrees with the assignment in the topological layer.
The topological assignment takes precedence.

(An aside: electron densities
Models of chemical bonding based on electron densities, while
containing much important information about the stabilities of
different structures, are still struggling to find their place in
the simple descriptions used by chemists.  In any case these
models usually require an a priori knowledge of the 3-dimensional
structure.  The electron density can be measured or calculated in
the crystal layer, and can be calculated in the geometry layer.
We need to make provision for electron density studies, but we
need first to get the broad structures established.)

(Another aside: treatment of infinite graphs
The graph of an infinitely connected structure might seem to
present further difficulties, but these can be handled.
Discussion of this can be deferred.  For the moment the
discussion is restricted to finite molecular units.)

There are some obvious symmetry relationships we should not lose
sight of.

The SYMMETRY of a graph is called an AUTOMORPHISM, an operation
that interchanges the labels of atoms without changing the graph,
e.g., the 6-fold rotation of the atoms in the benzene molecule is
an automorphism.  The symmetry in the geometry layer is given by
the POINT GROUP of the molecular unit, and symmetry in the
crystal layer by the SPACE GROUP of the crystal or by the SITE
SYMMETRY of the molecular unit.  This is the site symmetry of a
special position lying within the molecular unit which may or may
not be occupied by an atom.
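
To make the idea of an automorphism concrete, the following Python
sketch enumerates, by brute force, the relabellings of a six-membered
carbon ring that leave its bond graph unchanged.  It is an
illustration only: the labels C1 to C6 follow the sample CIF in Part
III, hydrogen atoms are left out, and the function name
is_automorphism is invented here.

```python
from itertools import permutations

ring = ["C1", "C2", "C3", "C4", "C5", "C6"]
# Bond graph of the ring: each atom is bonded to its two ring neighbours.
edges = {frozenset((ring[i], ring[(i + 1) % 6])) for i in range(6)}


def is_automorphism(mapping):
    """True if relabelling the atoms with `mapping` reproduces the same edge set."""
    return {frozenset(mapping[atom] for atom in edge) for edge in edges} == edges


# Test every relabelling of the six atoms (720 permutations).
automorphisms = [
    dict(zip(ring, perm))
    for perm in permutations(ring)
    if is_automorphism(dict(zip(ring, perm)))
]

# The 6-fold rotation sends each atom to the next one round the ring.
rotation = {ring[i]: ring[(i + 1) % 6] for i in range(6)}

print(is_automorphism(rotation))  # True
print(len(automorphisms))         # 12
```

The twelve automorphisms found are the six rotations of the ring
together with six relabellings that reverse its direction.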

The following theorems apply to all molecular units.

THEOREM 1: Every symmetry operation of the point group of the
molecular unit (in the geometry layer) must correspond to an
automorphism of the bond graph.  The converse is not true.

THEOREM 2: Every symmetry operation of the crystallographic site
symmetry of the molecular unit must be an element of the point
group of the molecular unit in the geometry layer.  Again, the
converse is not true.

COROLLARY: Automorphism order >= point group order >= site
symmetry order.

We are aiming to introduce into CIF a description of the
chemistry of a compound in the topological layer.  This
description should allow for the nesting of complex structures
and the assignment of a variety of properties to the atoms, the
bonds and the molecular units.  The properties of an atom include
its name, label, and physical properties such as valence,
coordination number, etc., the properties of a bond are its
length, valence, character, etc. and the properties of a
molecular unit are its formula, name, formula mass, etc.

The chemistry (i.e. the pattern of bonds that link the atoms) is
defined in the topological layer and in this layer the atoms are
labelled according to a chemical scheme.  In this layer, and only
in this layer, there are rules for determining which atoms are
bonded, though knowledge of the nearest neighbours may be needed
in some cases (e.g., for assigning bonds around alkali metal
atoms).  Experimental structures are most often determined from
crystals and the atom labels in the crystal layer are based on a
combination of crystallographic symmetry operations and the
labels of the atoms in the asymmetric unit.  A mapping between
the topological and crystal layers is necessary to link the
chemical and crystallographic labels.

Problems in mapping between the layers
In principle the mapping between layers is straightforward, but
there are practical difficulties.

The labelling of atoms in the topological and crystal layers is,
in general, different because the crystal labels contain
crystallographic symmetry operations that have no meaning in the
topological layer, and the labelling in the topological layer
will reflect chemical rather than crystallographic properties.

There are different ways to approach the labelling of atoms in
the different layers.  A labelling that tries to capture the
mappings implicitly rapidly becomes impossibly complex.  A better
approach is to define topological labels for each atom in the
formula unit on a chemical basis, and then explicitly map these
onto the atom labels assigned to the formula unit in the
crystallographic layer.  The lower-level molecular units are
defined separately and are mapped back to the formula unit or
other molecular unit as appropriate.
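
The explicit mapping can be sketched with two small lookup tables.
The Python example below is an illustration, not part of the
proposal; it uses the nitro-group rows of the sample CIF in Part III
(including the mirror plane assumed there), and the names
crystal_site and nitro_groups are invented here.  It checks that no
crystallographic site is used twice and counts the symmetry
independent nitro groups.

```python
# Topological label in the formula unit -> (asymmetric-unit label, symop id).
crystal_site = {
    "N2": ("N2", 1), "O21": ("O21", 1), "O22": ("O22", 1),
    "N4": ("N4", 1), "O41": ("O41", 1), "O42": ("O41", 2),
    "N6": ("N2", 2), "O61": ("O21", 2), "O62": ("O22", 2),
}
# The three copies of the NO2 molecular unit within the formula unit.
nitro_groups = {
    "at C2": ["N2", "O21", "O22"],
    "at C4": ["N4", "O41", "O42"],
    "at C6": ["N6", "O61", "O62"],
}

# Each (label, symop) pair must identify exactly one atom of the formula unit.
sites_are_unique = len(set(crystal_site.values())) == len(crystal_site)

# Two groups are symmetry equivalent if they use the same asymmetric-unit atoms.
independent_groups = {
    frozenset(crystal_site[atom][0] for atom in atoms)
    for atoms in nitro_groups.values()
}

print(sites_are_unique)         # True
print(len(independent_groups))  # 2
```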

Role of the geometric description
The atomic labels used in the geometry layer can be the same as
the labels used in the topology layer, so in principle these two
layers could be combined.  However there may be practical as well
as theoretical reasons for keeping them separate, namely that
software manipulating the CIF, including printing out the
information it contains, would probably treat the topology and
geometry separately.  In the sample CIF below they are shown as
separate loops.

==================== PART III ===============================
# This file illustrates how CIF might accommodate chemical information.  It
# is based on the idea that the bond topology, bonding geometry and
# crystallography are distinct descriptions of the structure, that each is
# reported in its own set of categories and that these can be mapped on to
# each other.  This draft is intended only to show the organization of the
# information.  Data names may change, they may be dropped or new names may
# be added.
# There are two questions we should consider.  The first is whether the
# topology and geometry categories should be combined.  The second is whether
# the description of the formula unit should appear in its own loop, since it
# is treated differently from the lower-level molecular units.
# The first loop defines the properties of the molecular units the user wishes
# to define.  These can be nested so that a higher level molecular unit is
# composed of lower level units.  In this description the highest level
# molecular unit must be the formula unit of the crystal; the other molecular
# units listed are components of the formula unit and may themselves be
# composed of lower-level molecular units.  This example describes the
# molecular structure of trinitrotoluene and its relationship to the crystal
# structure.  The lower-level molecular units are identified here as a
# deprotonated benzene ring, a methyl group, a nitro group, a hydrogen atom
# and a carbon atom.  This example illustrates the nesting of molecular units
# (TNT contains a methyl molecular unit that is itself composed of one C and
# three H molecular units.) This example also shows how low-level molecular
# units (NO2 and H) can be used more than once in building the higher-level
# structure.  In practice nested molecular units may not often be used.  Most
# chemical descriptions would be much less complex than this one.
# This first loop defines the formula and point group.  It could also include
# molecular mass, formal charge, chirality etc.
1  TNT           'C7 H5 N3 O6'   m  
   'formula unit composed of a benzene ring, methyl and three nitro groups'
2  benzene_ring  'C6'            mm    ?
3  methyl        'C H3'          3m  'composed of C and three H'
4  nitro         'N O2'          mm    ?
5  hydrogen       H              ?   'atom as a molecular unit'
6  carbon         C              ?   'atom as a molecular unit'

# The following loop identifies the atoms that comprise the different
# molecular units.  _topology_atom_molecule_id is the child of _molecule_id
# and identifies the molecular unit to which the atom belongs.
# _topology_atom_label is parent to _*_atom_label items in other categories.
# The first two items taken together may appear only once in the list.  Note
# that each lower-level unit, e.g., NO2, is given only once even if it appears
# more than once in the formula unit, and that the same atom may be defined
# several times if it appears in different molecular units, though it does not
# necessarily have the same name.
# _topology_atom_element might link to a category that gives elemental
# properties such as the atomic mass, but this connection is not shown in this
# draft.
# First come the atoms in the formula unit
1   C1    C    4   3   ?
1   C2    C    4   3   ?
1   C3    C    4   3   ? 
1   C4    C    4   3   ?
1   C5    C    4   3   ?
1   C6    C    4   3   ?
1   C7    C    4   4   ?
1   H3    H    1   1   ?
1   H5    H    1   1   ?
1   H71   H    1   1   ?
1   H72   H    1   1   ?
1   H73   H    1   1   ?
1   N2    N    3   3   ?
1   O21   O    2   1   ?
1   O22   O    2   1   ?
1   N4    N    3   3   ?
1   O41   O    2   1   ?
1   O42   O    2   1   ?
1   N6    N    3   3   ?
1   O61   O    2   1   ?
1   O62   O    2   1   ?
# The benzene ring is shown next
2   C1    C    4   3   ?
2   C2    C    4   3   ?
2   C3    C    4   3   ?
2   C4    C    4   3   ?
2   C5    C    4   3   ?
2   C6    C    4   3   ?
# Then the methyl group
3   C1    C    4   4   ?
3   H1    H    1   1   ?
3   H2    H    1   1   ?
3   H3    H    1   1   ?
# The nitro group follows
4   N1    N    3   3   ?
4   O1    O    2   1   ?
4   O2    O    2   1   ?
# The H and C atoms are defined as single atom molecular units
5   H     H    1   1   ?
6   C     C    4   4   ?

# The following loop maps the different molecular units on to each other and
# therefore contains information about the nesting of molecular units.  Each
# atom is identified by its molecule_id and atom_label.  The first pair of
# items, which together must be unique in the list, define the atom in the
# molecular unit at the higher level; the second pair define the atoms in the
# lower-level molecular units.  The same lower-level descriptor may appear
# more than once.  In this example, the formula unit (1) is composed of
# molecular units 2 (C6), 3 (CH3), 4 (NO2, three times) and 5 (H, twice).
# Molecular unit 3 is composed of 6 (C) and 5 (H, three times).
# The atoms of the formula unit map onto the atoms of the lower level
# molecular units
1   C1   2   C1
1   C2   2   C2
1   C3   2   C3
1   C4   2   C4
1   C5   2   C5
1   C6   2   C6
1   H3   5   H
1   H5   5   H
1   C7   3   C1
1   H71  3   H1
1   H72  3   H2
1   H73  3   H3
1   N2   4   N1
1   O21  4   O1
1   O22  4   O2
1   N4   4   N1
1   O41  4   O1
1   O42  4   O2
1   N6   4   N1
1   O61  4   O1
1   O62  4   O2
# The methyl group maps on to the molecular units that consist of single C and
# three H atoms
3   C1   6   C
3   H1   5   H
3   H2   5   H
3   H3   5   H

# The following loop maps the atoms of the formula unit onto the
# crystallographic atom descriptor.  _*_crystal_atom_label is the child of the
# existing _atom_site_label and _*_crystal_symop_id is the child of the
# existing _space_group_symop_id.  Further items to give the lattice
# translations of the symmetry operation will sometimes be needed,
# particularly in infinitely connected structures.  These are not shown here.
# To illustrate the use of crystallographic symmetry I have assumed
# (presumably incorrectly) that the TNT molecule lies on a crystallographic
# mirror plane perpendicular to the plane of the benzene ring.  This mirror
# operation is denoted by the _*_symop_id value of 2.
# Note that the lower-level molecular units are not mapped on to the
# crystallographic formula unit for various reasons, the main one being that
# the topological bond graph and the crystallographic bond graph are only
# isomorphous for the formula unit, so there is no unique mapping between the
# crystallographic formula unit and the lower-level molecular units.  The
# strategy for mapping the lower-level molecular units onto the crystal is
# first to map them back to the formula unit (in this case NO2 maps back to
# three different nitro groups in the formula unit) and then to map these onto
# the crystal (here onto two symmetry independent NO2 groups).
1   C1   C1   1
1   C2   C2   1
1   C3   C3   1
1   C4   C4   1
1   C5   C3   2
1   C6   C2   2
1   H3   H3   1
1   H5   H3   2
1   C7   C7   1
1   H71  H71  1
1   H72  H72  1
1   H73  H72  2
1   N2   N2   1
1   O21  O21  1
1   O22  O22  1
1   N4   N4   1
1   O41  O41  1
1   O42  O41  2
1   N6   N2   2
1   O61  O21  2
1   O62  O22  2

# The following loop gives the topological connectivities (i.e., the bonds)
# found in each of the molecular units.  The first three items must together
# be unique.  _topology_bond_molecule_id is a child of _molecule_id and the
# two _topology_bond_atom_label_* items are children of _topology_atom_label.
# Note that although in this example the formula unit is fully connected, this
# is not a requirement.  A crystal containing two independent examples of the
# same molecule (Z' = 2) would have both listed (but disconnected) at the
# level of the formula unit and these would both be mapped to a lower-level
# molecular unit containing a generic example of the molecule.
# In this example the properties of the same bonds are defined at different
# levels, giving rise to a possible conflict (e.g., the N-O bonds might be
# defined as delocalized in the formula unit, but as double bonds in the NO2
# unit).  What is the proper way to resolve these conflicts?
1   C1   C2   delocalized
1   C2   C3   delocalized
1   C3   C4   delocalized
1   C4   C5   delocalized
1   C5   C6   delocalized
1   C6   C1   delocalized
1   C1   C7   single
1   C3   H3   single
1   C5   H5   single
1   C7   H71  single
1   C7   H72  single
1   C7   H73  single
1   C2   N2   single
1   N2   O21  double
1   N2   O22  double
1   C4   N4   single
1   N4   O41  double
1   N4   O42  double
1   C6   N6   single
1   N6   O61  double
1   N6   O62  double
2   C1   C2   delocalized
2   C2   C3   delocalized
2   C3   C4   delocalized
2   C4   C5   delocalized
2   C5   C6   delocalized
2   C6   C1   delocalized
3   C1   C7   single
2   C3   H3   single
2   C5   H5   single
3   C1   H1   single
3   C1   H2   single
3   C1   H3   single
4   C2   N2   single
4   N1   O1   double
4   N1   O2   double

# The next two loops give the atomic coordinates and bond geometries of the
# molecular units described above.  Each row is uniquely labelled using the
# same items as are used to label the topological atoms and bonds, raising the
# question as to whether the geometry categories can be combined with the
# topology categories, so that the coordinates, bond lengths etc. could be
# given in the topology loops above.  Is this desirable or is there a virtue
# in making a clear distinction between topology and geometry?  It may not,
# for example, be possible to combine all the properties conveniently on a
# single line so there may be an advantage in splitting them into different
# loops, cf. atomic coordinates and ADPs in the current CIFs.
# Note that the geometry is not necessarily derived from the crystal structure
# and this would in general not be possible for the lower level molecular
# units, e.g., the NO2 molecular unit maps onto two crystallographically
# distinct NO2 groups.  We may wish later to distinguish between different
# sources of the geometry (databases, crystal structure, theory etc.)
# The next loop gives the atomic coordinates at the geometry level, i.e., in
# orthogonal Cartesian coordinates.  As before the first two items are
# children of the above ids and their combined values must be unique.  For
# simplicity in this example the coordinates are represented by the place
# holder '?'.
# Each molecular unit will be referred to its own coordinate system and
# presumably we need to make provision for including the crystallographic to
# molecular transformation matrices, at least for the formula unit.
1   C1        ?    ?    ?
1   C2        ?    ?    ?
1   C3        ?    ?    ?
1   C4        ?    ?    ?
1   C5        ?    ?    ?
1   C6        ?    ?    ?
1   C7        ?    ?    ?
1   H3        ?    ?    ?
1   H5        ?    ?    ?
1   H71       ?    ?    ?
1   H72       ?    ?    ?
1   H73       ?    ?    ?
1   N2        ?    ?    ?
1   O21       ?    ?    ?
1   O22       ?    ?    ?
1   N4        ?    ?    ?
1   O41       ?    ?    ?
1   O42       ?    ?    ?
1   N6        ?    ?    ?
1   O61       ?    ?    ?
1   O62       ?    ?    ?
2   C1        ?    ?    ?
2   C2        ?    ?    ?
2   C3        ?    ?    ?
2   C4        ?    ?    ?
2   C5        ?    ?    ?
2   C6        ?    ?    ?
2   H3        ?    ?    ?
2   H4        ?    ?    ?
3   C1        ?    ?    ?
3   H1        ?    ?    ?
3   H2        ?    ?    ?
3   H3        ?    ?    ?
4   N1        ?    ?    ?
4   O1        ?    ?    ?
4   O2        ?    ?    ?
# The atoms H and C do not need molecular coordinates since they appear at the
# origin in their own coordinate system.

# The following list gives the geometry of the bonds.  The first three items
# together must be unique.
1   C1   C2  1.346(3)   1.246(3)  -0.615(3)  0.347(3)
# Items omitted for the sake of brevity, the total number being equal to the
# number of items in the _topology_bond list.
4   N1   O2    ?   ?   ?   ?
############ End of file ################
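
The uniqueness rule for bonds and the numbers given in the atom loop
can be cross-checked mechanically.  The Python sketch below is only
an illustration and rests on two assumptions of mine: that the second
numeric column of the atom loop is a coordination number, and that
the three methyl C-H bonds of the formula unit run to H71, H72 and
H73.  The variable names are invented here.

```python
from collections import Counter

# Formula-unit atoms (molecule id 1) with the assumed coordination numbers.
coordination = {
    "C1": 3, "C2": 3, "C3": 3, "C4": 3, "C5": 3, "C6": 3, "C7": 4,
    "H3": 1, "H5": 1, "H71": 1, "H72": 1, "H73": 1,
    "N2": 3, "O21": 1, "O22": 1,
    "N4": 3, "O41": 1, "O42": 1,
    "N6": 3, "O61": 1, "O62": 1,
}
# Bonds of the formula unit, as pairs of topological atom labels.
bonds = [
    ("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
    ("C4", "C5"), ("C5", "C6"), ("C6", "C1"),
    ("C1", "C7"), ("C3", "H3"), ("C5", "H5"),
    ("C7", "H71"), ("C7", "H72"), ("C7", "H73"),
    ("C2", "N2"), ("N2", "O21"), ("N2", "O22"),
    ("C4", "N4"), ("N4", "O41"), ("N4", "O42"),
    ("C6", "N6"), ("N6", "O61"), ("N6", "O62"),
]

# Count how many bonds meet at each atom.
degree = Counter()
for a, b in bonds:
    degree[a] += 1
    degree[b] += 1

# A bond listed twice (in either direction) violates the uniqueness rule.
pair_counts = Counter(frozenset(pair) for pair in bonds)
duplicates = [sorted(pair) for pair, n in pair_counts.items() if n > 1]

# Atoms whose listed coordination number differs from the bond count.
mismatches = {
    atom: (cn, degree[atom])
    for atom, cn in coordination.items()
    if degree[atom] != cn
}
# Atoms used in a bond but missing from the atom list.
unknown = sorted(set(degree) - set(coordination))

print(duplicates, mismatches, unknown)  # [] {} []
```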
