
# coreCIFchem Discussion #3

- *To*: corecifchem@iucr.org
- *Subject*: coreCIFchem Discussion #3
- *From*: David Brown <idbrown@mcmaster.ca>
- *Date*: Mon, 26 Jan 2004 12:00:57 -0500

Dear colleagues,

Thank you for the feedback on coreCIFchem discussion paper #2. In Part I of this email I summarize the discussion, in Part II I make a proposal and in Part III I enclose a CIF which illustrates the proposal.

I would invite you to comment on these by replying to the group list before February 29. If you need more time, let me know and we can negotiate a later date.

This email can be saved as a text file. It prints on about 15 pages.

David

***************************************
I. David Brown, Professor Emeritus of Physics
Brockhouse Institute for Materials Research
McMaster University, Hamilton, Ontario
Canada L8S 4M1
Tel: +905 525 9140 ext 24710
Fax: +905 521 2773
email: idbrown@mcmaster.ca
***************************************

================= PART I =========================

COMMENTS RECEIVED
-----------------

In the previous paper I outlined a possible graph theoretical approach to the problem and ended with the following questions:

>> Before we try to define CIF items for particular chemical
>> concepts, we need to have a consensus about the definition of
>> a molecule. I have made some
>> suggestions above, and I would be interested in people's
>> comments. Is graph theory a fruitful way to go or should we
>> take a different approach? What are the problems we might
>> encounter using the approach described above?

Comments related to the use of graph theory
-------------------------------------------

John Bollinger commented:-

I find the idea of relating chemical properties to a bond graph to be rather attractive, although I confess to the influence of a bit of background in formal mathematics. One aspect that David left unexplored is the possibility of applying multiple properties to bond graph edges. One need not choose a single measure of bond strength, nor make measures of bond strength the only properties a bond graph edge may have. For instance, one could apply an explicit bond categorization (e.g.
"covalent", "dative", "hydrogen bond", "non-bond"). One could also express purely chemical information, such as the fact that "this is the bond that is broken in the course of the von Foo reaction". CIF is conveniently flexible in this regard, as authors may include exactly the properties they wish to describe while ignoring all others. One oddity I see with that approach is that the interatomic distances currently described by _geom_bond_distance fit nicely into the collection of properties that could be associated with bond graph edges, but many other _geom_* items do not. Graph theorists do have concepts that could be applied there, but we must take care to avoid making CIF (more) incomprehensible to mere mortals. Perhaps, though, it does make sense to consider whether all the various geom_* categories should be subsumed into a scheme such as this -- they are all examples of data that have both crystallographic significance and chemical significance. Herbert Bernstein also argues for a flexible scheme:- How about this approach -- instead of trying to agree on a definition of a molecule, should we not be trying to clearly state a reasonable set of chemical views of matter and, where possible the relationships and/or transformations among those views, as has already been started for macromolecules. The approaches, chemical formulae, 3-D structures, charge density maps, residue-based polymeric descriptions, all have something to say about how various chemical entities can interact with other chemical entities, which is, after all, what Chemistry is all about. Comments related to the definition of a molecule ------------------------------------------------ Howard Flack:- One property of a molecule is that it is composed of atoms. Even better would be: A molecule is a set containing atoms and molecules. It might be advisable to add the restriction that an atom, as an individual member, may only occur in one molecule. 
This restriction definitely does not apply to the non-molecular tectons or 'building blocks' as in the quartz example.

John Bollinger explores a number of possible approaches:-

As chemists in general have no single consistent definition of a molecule, it would be fruitless for us to attempt to impose a universal one of our own in hopes of satisfying everyone. The alternatives I see are (1) to choose our own definition for CIF purposes and use it consistently; (2) to support diverse CIF items with which to describe multiple different molecule concepts; (3) to provide sufficient data for a chemist to apply his or her own definition of "molecule"; or (4) to attempt to ignore molecules altogether. Although (4) is perhaps most true to pure crystallography, I think it is least suitable for our purpose. Option (2) strikes me as inelegant and short-sighted. Option (1) might be feasible if we could actually come up with a suitable definition, but I question whether that is possible. That leaves option (3), to which category I would assign most applications of the bond graph approach that I have imagined so far.

Summary of these comments:-
---------------------------

We should take advantage of the ability of both graph theory and CIF to associate a variety of different properties with the vertices and edges (atoms and bonds) of a graph in a way that leaves people free to use whatever model they feel comfortable with. The definition of a molecule should also be left flexible with the possibility of nesting smaller molecular units within larger ones.

=================== PART II ===========================

A PROPOSAL
----------

Below I outline an arrangement of categories built on the ideas expressed above. As it is important that we have a clear idea of what these categories represent and the relationships between them I start with some definitions.

Concepts and definitions
------------------------

I shall refer to clusters of atoms as MOLECULAR UNITS.
The atoms in these molecular units need not be fully connected by bonds but frequently will be. The molecular units may be charged (e.g., complex ions) and they may consist of a single atom. Molecular units may be nested, e.g., MgSO4.7(H2O) could be broken down into {H14, Mg, O11, S} or {Mg, (SO4), (H2O)7} or {Mg(H2O)6, SO4, H2O} or, at the highest level, described by the formula unit {MgSO4(H2O)7}. Each of the clusters divided by commas in the above descriptions is a molecular unit, and there is a hierarchy in which the molecular units at the higher levels are composed of molecular units at a lower level.

The FORMULA UNIT is the smallest number of atoms that captures all the differences in the chemical and crystallographic structures. It is at least as large as the asymmetric unit and should be no larger than the contents of the unit cell. It may contain more than one molecule in cases where Z' > 1. It will contain more than one asymmetric unit if the molecular units of interest contain crystallographic symmetry.

There are three layers at which a DESCRIPTION OF A STRUCTURE can be given:

1: The highest layer of description is the TOPOLOGICAL description, which consists of a listing of the atoms and the bonds that link them. This is the graph theoretical level. The topological description is close to the nineteenth century model of chemical structure described by the familiar 2-dimensional molecular diagrams (bond graphs). It is the normal level at which molecular structure is described, organic syntheses are planned and chemical structures are taught to students. The list of atoms is determined by the chemical formula (assumed to be given) and the bonds between them are assigned by applying a set of simple rules. This topological description does not include any information on the three-dimensional geometry such as bond lengths and angles.
Although it is customary to draw a molecular diagram as the projection of the 3-dimensional structure, the positions of the atoms in a bond graph are not defined: atoms can be placed in any arbitrary position, and only the linkage between them is important. The description at this layer excludes properties such as the electron density distribution, which depends on a knowledge of the atomic positions.

2: Next is the GEOMETRY layer. In this layer the 3-dimensional coordinates of atoms are introduced, usually expressed using an orthogonal Cartesian basis. Atom labels are identical to those defined in the topological layer. Interatomic distances and vectors can be derived and, by mapping from the topological layer, distances can be identified as either bonds or not-bonds.

3: Third is the CRYSTAL layer. The lattice parameters form the basis for the atomic coordinates. The atoms of the formula unit are labelled by augmenting the labels of the asymmetric unit with the appropriate crystallographic symmetry operation. Interatomic distances may be calculated, and can be classified as bonds or not-bonds by mapping from the topological layer. Although there are effective empirical schemes for assigning bonds on the basis of interatomic distance, there is no guarantee that this assignment agrees with the assignment in the topological layer. The topological assignment takes precedence.

(An aside: electron densities
-----------------------------

Models of chemical bonding based on electron densities, while containing much important information about the stabilities of different structures, are still struggling to find their place in the simple descriptions used by chemists. In any case these models usually require an a priori knowledge of the 3-dimensional structure. The electron density can be measured or calculated in the crystal layer, and can be calculated in the geometry layer.
We need to make provision for electron density studies, but we need first to get the broad structures established.)

(Another aside: treatment of infinite graphs
--------------------------------------------

The graph of an infinitely connected structure might seem to present further difficulties, but these can be handled. Discussion of this can be deferred. For the moment the discussion is restricted to finite molecular units.)

SOME NOTES ON SYMMETRY
----------------------

There are some obvious symmetry relationships we should not lose sight of. A SYMMETRY of a graph is called an AUTOMORPHISM: an operation that interchanges the labels of atoms without changing the graph, e.g., the 6-fold rotation of the atoms in the benzene molecule is an automorphism. The symmetry in the geometry layer is given by the POINT GROUP of the molecular unit, and symmetry in the crystal layer by the SPACE GROUP of the crystal or by the SITE SYMMETRY of the molecular unit. The latter is the site symmetry of a special position lying within the molecular unit, which may or may not be occupied by an atom.

The following theorems apply to all molecular units.

THEOREM 1: Every symmetry operation of the point group of the molecular unit (in the geometry layer) must correspond to an automorphism of the bond graph. The converse is not true.

THEOREM 2: Every symmetry operation of the crystallographic site symmetry of the molecular unit must be an element of the point group of the molecular unit in the geometry layer. Again, the converse is not true.

COROLLARY: Automorphism order >= point group order >= site symmetry order.

OUR GOAL
--------

We are aiming to introduce into CIF a description of the chemistry of a compound in the topological layer. This description should allow for the nesting of complex structures and the assignment of a variety of properties to the atoms, the bonds and the molecular units.
The properties of an atom include its name, label, and physical properties such as valence, coordination number, etc.; the properties of a bond are its length, valence, character, etc.; and the properties of a molecular unit are its formula, name, formula mass, etc.

RELATIONSHIPS BETWEEN LAYERS
----------------------------

The chemistry (i.e., the pattern of bonds that link the atoms) is defined in the topological layer, and in this layer the atoms are labelled according to a chemical scheme. In this layer, and only in this layer, there are rules for determining which atoms are bonded, though knowledge of the nearest neighbours may be needed in some cases (e.g., for assigning bonds around alkali metal atoms). Experimental structures are most often determined from crystals, and the atom labels in the crystal layer are based on a combination of crystallographic symmetry operations and the labels of the atoms in the asymmetric unit. A mapping between the topological and crystal layers is necessary to link the chemical and crystallographic labels.

Problems in mapping between the layers
--------------------------------------

In principle the mapping between layers is straightforward, but there are practical difficulties. The labelling of atoms in the topological and crystal layers is, in general, different because the crystal labels contain crystallographic symmetry operations that have no meaning in the topological layer, and the labelling in the topological layer will reflect chemical rather than crystallographic properties.

There are different ways to approach the labelling of atoms in the different layers. A labelling that tries to capture the mappings implicitly rapidly becomes impossibly complex. A better approach is to define topological labels for each atom in the formula unit on a chemical basis, and then explicitly map these onto the atom labels assigned to the formula unit in the crystal layer.
The lower-level molecular units are defined separately and are mapped back to the formula unit or other molecular unit as appropriate.

Role of the geometric description
---------------------------------

The atomic labels used in the geometry layer can be the same as the labels used in the topology layer, so in principle these two layers could be combined. However, there may be practical as well as theoretical reasons for keeping them separate: software that manipulates the CIF, including printing out the information it contains, would probably treat the topology and geometry separately. In the sample CIF below they are shown as separate loops.

==================== PART III ===============================

SAMPLE CIF
----------

data_sample_chemical_CIF
#
# This file illustrates how CIF might accommodate chemical information. It is
# based on the idea that the bond topology, bonding geometry and
# crystallography are distinct descriptions of the structure, that each is
# reported in its own set of categories and that these can be mapped on to
# each other. This draft is intended only to show the organization of the
# information. Data names may change, they may be dropped or new names may be
# added.
#
# There are two questions we should consider. The first is whether the
# topology and geometry categories should be combined. The second is whether
# the description of the formula unit should appear in its own loop, since it
# is treated differently from the lower-level molecular units.
#
# The first loop defines the properties of the molecular units the user wishes
# to define. These can be nested so that a higher level molecular unit is
# composed of lower level units. In this description the highest level
# molecular unit must be the formula unit of the crystal, the other molecular
# units listed are components of the formula unit and may themselves be
# composed of lower-level molecular units.
# This example describes the
# molecular structure of trinitrotoluene and its relationship to the crystal
# structure. The lower-level molecular units are identified here as a
# deprotonated benzene ring, a methyl group, a nitro group, a hydrogen atom
# and a carbon atom. This example illustrates the nesting of molecular units
# (TNT contains a methyl molecular unit that is itself composed of one C and
# three H molecular units.) This example also shows how low-level molecular
# units (NO2 and H) can be used more than once in building the higher-level
# structure. In practice nested molecular units may not often be used. Most
# chemical descriptions would be much less complex than this one.
#
# This first loop defines the formula and point group. It could also include
# molecular mass, formal charge, chirality etc.
#
loop_
_molecule_id
_molecule_name
_molecule_formula
_molecule_point_group
_molecule_details
1 TNT 'C7 H5 N3 O6' m 'formula unit composed of a benzene ring, methyl and three nitro groups'
2 benzene_ring 'C6' mm ?
3 methyl 'C H3' 3m 'composed of C and three H'
4 nitro 'N O2' mm ?
5 hydrogen H ? 'atom as a molecular unit'
6 carbon C ? 'atom as a molecular unit'
#
# The following loop identifies the atoms that comprise the different
# molecular units. _topology_atom_molecule_id is the child of _molecule_id
# and identifies the molecular unit to which the atom belongs.
# _topology_atom_label is parent to _*_atom_label items in other categories.
# The first two items taken together may appear only once in the list. Note
# that each lower-level unit, e.g., NO2, is given only once even if it occurs
# more than once in the formula unit, and that the same atom may be defined
# several times if it appears in different molecular units, though it does not
# necessarily have the same name.
#
# _topology_atom_element might link to a category that gives elemental
# properties such as the atomic mass, but this connection is not shown in this
# draft.
#
loop_
_topology_atom_molecule_id
_topology_atom_label
_topology_atom_element
_topology_atom_valence
_topology_atom_coord_number
_topology_atom_details
#
# First come the atoms in the formula unit
#
1 C1 C 4 3 ?
1 C2 C 4 3 ?
1 C3 C 4 3 ?
1 C4 C 4 3 ?
1 C5 C 4 3 ?
1 C6 C 4 3 ?
1 C7 C 4 4 ?
1 H3 H 1 1 ?
1 H5 H 1 1 ?
1 H71 H 1 1 ?
1 H72 H 1 1 ?
1 H73 H 1 1 ?
1 N2 N 3 3 ?
1 O21 O 2 1 ?
1 O22 O 2 1 ?
1 N4 N 3 3 ?
1 O41 O 2 1 ?
1 O42 O 2 1 ?
1 N6 N 3 3 ?
1 O61 O 2 1 ?
1 O62 O 2 1 ?
#
# The benzene ring is shown next
#
2 C1 C 4 3 ?
2 C2 C 4 3 ?
2 C3 C 4 3 ?
2 C4 C 4 3 ?
2 C5 C 4 3 ?
2 C6 C 4 3 ?
#
# Then the methyl group
#
3 C1 C 4 4 ?
3 H1 H 1 1 ?
3 H2 H 1 1 ?
3 H3 H 1 1 ?
#
# The nitro group follows
#
4 N1 N 3 3 ?
4 O1 O 2 1 ?
4 O2 O 2 1 ?
#
# The H and C atoms are defined as single atom molecular units
#
5 H H 1 1 ?
6 C C 4 4 ?
#
# The following loop maps the different molecular units on to each other and
# therefore contains information about the nesting of molecular units. Each
# atom is identified by its molecule_id and atom_label. The first pair of
# items, which together must be unique in the list, defines the atom in the
# molecular unit at the higher level; the second pair defines the atom in the
# lower-level molecular unit. The same lower-level descriptor may appear
# more than once. In this example, the formula unit (1) is composed of
# molecular units 2 (C6), 3 (CH3), 4 (NO2, three times) and 5 (H, twice).
# Molecular unit 3 is composed of 6 (C) and 5 (H, three times).
#
loop_
_topology_mapping_molecule_1
_topology_mapping_atom_label_1
_topology_mapping_molecule_2
_topology_mapping_atom_label_2
#
# The atoms of the formula unit map onto the atoms of the lower level
# molecular units
#
1 C1 2 C1
1 C2 2 C2
1 C3 2 C3
1 C4 2 C4
1 C5 2 C5
1 C6 2 C6
1 H3 5 H
1 H5 5 H
1 C7 3 C1
1 H71 3 H1
1 H72 3 H2
1 H73 3 H3
1 N2 4 N1
1 O21 4 O1
1 O22 4 O2
1 N4 4 N1
1 O41 4 O1
1 O42 4 O2
1 N6 4 N1
1 O61 4 O1
1 O62 4 O2
#
# The methyl group maps on to the molecular units that consist of a single C
# and three H atoms
#
3 C1 6 C
3 H1 5 H
3 H2 5 H
3 H3 5 H
#
# The following loop maps the atoms of the formula unit onto the
# crystallographic atom descriptor. _*_crystal_atom_label is the child of the
# existing _atom_site_label and _*_crystal_symop_id is the child of the
# existing _space_group_symop_id. Further items to give the lattice
# translations of the symmetry operation will sometimes be needed,
# particularly in infinitely connected structures. These are not shown here.
# To illustrate the use of crystallographic symmetry I have assumed
# (presumably incorrectly) that the TNT molecule lies on a crystallographic
# mirror plane perpendicular to the plane of the benzene ring. This mirror
# operation is denoted by the _*_symop_id value of 2.
#
# Note that the lower-level molecular units are not mapped on to the
# crystallographic formula unit for various reasons, the main one being that
# the topological bond graph and the crystallographic bond graph are only
# isomorphous for the formula unit, so there is no unique mapping between the
# crystallographic formula unit and the lower-level molecular units. The
# strategy for mapping the lower-level molecular units onto the crystal is
# first to map them back to the formula unit (in this case NO2 maps back to
# three different nitro groups in the formula unit) and then to map these onto
# the two symmetry-independent NO2 groups of the crystal.
#
loop_
_crystal_mapping_molecule_id
_crystal_mapping_topology_atom_label
_crystal_mapping_crystal_atom_label
_crystal_mapping_crystal_symop_id
1 C1 C1 1
1 C2 C2 1
1 C3 C3 1
1 C4 C4 1
1 C5 C3 2
1 C6 C2 2
1 H3 H3 1
1 H5 H3 2
1 C7 C7 1
1 H71 H71 1
1 H72 H72 1
1 H73 H72 2
1 N2 N2 1
1 O21 O21 1
1 O22 O22 1
1 N4 N4 1
1 O41 O41 1
1 O42 O41 2
1 N6 N2 2
1 O61 O21 2
1 O62 O22 2
#
# The following loop gives the topological connectivities (i.e., the bonds)
# found in each of the molecular units. The first three items must together
# be unique. _topology_bond_molecule_id is a child of _molecule_id and the
# two _topology_bond_atom_label_* items are children of _topology_atom_label.
#
# Note that although in this example the formula unit is fully connected, this
# is not a requirement. A crystal containing two independent examples of the
# same molecule (Z' = 2) would have both listed (but disconnected) at the
# level of the formula unit and these would both be mapped to a lower-level
# molecular unit containing a generic example of the molecule.
#
# In this example the properties of the same bonds are defined at different
# levels, giving rise to a possible conflict (e.g., the N-O bonds might be
# defined as delocalized in the formula unit, but as double bonds in the NO2
# unit). What is the proper way to resolve these conflicts?
#
loop_
_topology_bond_molecule_id
_topology_bond_atom_label_1
_topology_bond_atom_label_2
_topology_bond_type
1 C1 C2 delocalized
1 C2 C3 delocalized
1 C3 C4 delocalized
1 C4 C5 delocalized
1 C5 C6 delocalized
1 C6 C1 delocalized
1 C1 C7 single
1 C3 H3 single
1 C5 H5 single
1 C7 H71 single
1 C7 H72 single
1 C7 H73 single
1 C2 N2 single
1 N2 O21 double
1 N2 O22 double
1 C4 N4 single
1 N4 O41 double
1 N4 O42 double
1 C6 N6 single
1 N6 O61 double
1 N6 O62 double
2 C1 C2 delocalized
2 C2 C3 delocalized
2 C3 C4 delocalized
2 C4 C5 delocalized
2 C5 C6 delocalized
2 C6 C1 delocalized
3 C1 C7 single
2 C3 H3 single
2 C5 H5 single
3 C7 H71 single
3 C7 H72 single
3 C7 H73 single
4 C2 N2 single
4 N2 O21 double
4 N2 O22 double
#
# The next two loops give the atomic coordinates and bond geometries of the
# molecular units described above. Each row is uniquely labelled using the
# same items as are used to label the topological atoms and bonds, raising the
# question as to whether the geometry categories can be combined with the
# topology categories, so that the coordinates, bond lengths etc. could be
# given in the topology loops above. Is this desirable or is there a virtue
# in making a clear distinction between topology and geometry? It may not,
# for example, be possible to combine all the properties conveniently on a
# single line so there may be an advantage in splitting them into different
# loops, cf. atomic coordinates and ADPs in the current CIFs.
#
# Note that the geometry is not necessarily derived from the crystal structure;
# in general such a derivation would not be possible for the lower-level
# molecular units, e.g., the NO2 molecular unit maps onto two
# crystallographically independent NO2 groups. We may wish later to
# distinguish between different sources of the geometry (databases, crystal
# structure, theory etc.)
#
# The next loop gives the atomic coordinates at the geometry level, i.e., in
# orthogonal Cartesian coordinates.
# As before the first two items are
# children of the above ids and their combined values must be unique. For
# simplicity in this example the coordinates are represented by the place
# holder '?'.
#
# Each molecular unit will be referred to its own coordinate system and
# presumably we need to make provision for including the crystallographic to
# molecular transformation matrices, at least for the formula unit.
#
loop_
_geometry_atom_molecule_id
_geometry_atom_label
_geometry_atom_x
_geometry_atom_y
_geometry_atom_z
1 C1 ? ? ?
1 C2 ? ? ?
1 C3 ? ? ?
1 C4 ? ? ?
1 C5 ? ? ?
1 C6 ? ? ?
1 C7 ? ? ?
1 H3 ? ? ?
1 H5 ? ? ?
1 H71 ? ? ?
1 H72 ? ? ?
1 H73 ? ? ?
1 N2 ? ? ?
1 O21 ? ? ?
1 O22 ? ? ?
1 N4 ? ? ?
1 O41 ? ? ?
1 O42 ? ? ?
1 N6 ? ? ?
1 O61 ? ? ?
1 O62 ? ? ?
2 C1 ? ? ?
2 C2 ? ? ?
2 C3 ? ? ?
2 C4 ? ? ?
2 C5 ? ? ?
2 C6 ? ? ?
2 H3 ? ? ?
2 H4 ? ? ?
3 C7 ? ? ?
3 H1 ? ? ?
3 H2 ? ? ?
3 H3 ? ? ?
4 N1 ? ? ?
4 O1 ? ? ?
4 O2 ? ? ?
#
# The atoms H and C do not need molecular coordinates since they appear at the
# origin in their own coordinate system.
#
# The following list gives the geometry of the bonds. The first three items
# together must be unique.
#
loop_
_geometry_bond_molecule_id
_geometry_bond_atom_label_1
_geometry_bond_atom_label_2
_geometry_bond_distance
_geometry_bond_vector_x
_geometry_bond_vector_y
_geometry_bond_vector_z
1 C1 C2 1.346(3) 1.246(3) -0.615(3) 0.347(3)
#
# Items omitted for the sake of brevity, the total number being equal to the
# number of items in the _topology_bond list.
#
4 N2 O22 ? ? ? ?
#
############ End of file ################

_______________________________________________
coreCIFchem mailing list
coreCIFchem@iucr.org
http://scripts.iucr.org/mailman/listinfo/corecifchem
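
The nesting recorded in the _topology_atom and _topology_mapping loops of the sample CIF lends itself to a mechanical check. What follows is only a minimal sketch, assuming the two loops have already been read into Python; it is not part of the proposal, and the names `atom_rows`, `TOPOLOGY_ATOMS`, `MAPPING_ROWS`, `check_mapping` and `unit_counts` are hypothetical. It checks that each higher-level atom is mapped only once, that both ends of every mapping row exist in the atom list and carry the same element, and it counts how often each lower-level molecular unit is used.

```python
from collections import Counter


def atom_rows(mol_id, element, labels):
    """Return {(molecule_id, atom_label): element} for a run of labels."""
    return {(mol_id, label): element for label in labels.split()}


# Hypothetical in-memory copy of the _topology_atom loop (id, label, element).
TOPOLOGY_ATOMS = {
    **atom_rows(1, "C", "C1 C2 C3 C4 C5 C6 C7"),
    **atom_rows(1, "H", "H3 H5 H71 H72 H73"),
    **atom_rows(1, "N", "N2 N4 N6"),
    **atom_rows(1, "O", "O21 O22 O41 O42 O61 O62"),
    **atom_rows(2, "C", "C1 C2 C3 C4 C5 C6"),
    **atom_rows(3, "C", "C1"),
    **atom_rows(3, "H", "H1 H2 H3"),
    **atom_rows(4, "N", "N1"),
    **atom_rows(4, "O", "O1 O2"),
    **atom_rows(5, "H", "H"),
    **atom_rows(6, "C", "C"),
}

# Rows of the _topology_mapping loop: molecule_1 label_1 molecule_2 label_2.
MAPPING_TEXT = """
1 C1 2 C1    1 C2 2 C2    1 C3 2 C3    1 C4 2 C4    1 C5 2 C5    1 C6 2 C6
1 H3 5 H     1 H5 5 H
1 C7 3 C1    1 H71 3 H1   1 H72 3 H2   1 H73 3 H3
1 N2 4 N1    1 O21 4 O1   1 O22 4 O2
1 N4 4 N1    1 O41 4 O1   1 O42 4 O2
1 N6 4 N1    1 O61 4 O1   1 O62 4 O2
3 C1 6 C     3 H1 5 H     3 H2 5 H     3 H3 5 H
"""


def parse_rows(text):
    """Split whitespace-separated values into 4-item mapping rows."""
    tokens = text.split()
    return [
        (int(tokens[i]), tokens[i + 1], int(tokens[i + 2]), tokens[i + 3])
        for i in range(0, len(tokens), 4)
    ]


MAPPING_ROWS = parse_rows(MAPPING_TEXT)


def check_mapping(atoms, mapping):
    """Return a list of problems found in the mapping (empty if none)."""
    problems = []
    seen = set()
    for hi_mol, hi_label, lo_mol, lo_label in mapping:
        hi, lo = (hi_mol, hi_label), (lo_mol, lo_label)
        if hi in seen:
            problems.append(f"{hi} is mapped more than once")
        seen.add(hi)
        for key in (hi, lo):
            if key not in atoms:
                problems.append(f"{key} is not in the atom list")
        if hi in atoms and lo in atoms and atoms[hi] != atoms[lo]:
            problems.append(f"{hi} and {lo} are different elements")
    return problems


def unit_counts(atoms, mapping):
    """Count how many copies of each lower-level unit a higher unit uses."""
    size = Counter(mol for mol, _label in atoms)
    rows = Counter((hi_mol, lo_mol) for hi_mol, _, lo_mol, _ in mapping)
    return {pair: n // size[pair[1]] for pair, n in rows.items()}


if __name__ == "__main__":
    print(check_mapping(TOPOLOGY_ATOMS, MAPPING_ROWS))
    print(unit_counts(TOPOLOGY_ATOMS, MAPPING_ROWS))
```

The counts returned by `unit_counts` can be compared with the comment in the sample CIF stating that the formula unit uses NO2 three times and H twice, and that the methyl unit uses H three times. Keying every atom by the pair (molecule id, label) mirrors the rule that the first two items of the loop must together be unique.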

**Follow-Ups**: **Re: coreCIFchem Discussion #3** (Howard Flack)
