(36) Comings and goings; IUPAC formula; H bonds; _type_construct

To: [email protected]
Subject: (36) Comings and goings; IUPAC formula; H bonds; _type_construct
From: bm
Date: Thu, 6 Jul 1995 14:48:05 +0100
Dear Colleagues

This time I have to report some changes to our roster of Consultants.
Howard Flack has tendered his resignation, and Hans Wondratschek has
joined, pending consideration by the Executive Committee of his adoption as
a full member.

Howard, who has done a very great deal in terms of building a centralised
repository of crystallographic information on the net, finds that 
"The realities of networking, transfer of information in digital
  form, provision of acceptable (and accepted) software, and user training
  and information are quite different from what I could (and did) imagine.
  My own needs and interests were satisfied 99.5% by what evolved into the core
  dictionary (i.e. very similar to the contents of the SCFS description). I
  have consequently and certainly contributed nothing to the work of COMCIFS
  by being a consultant. It's best that I leave."
I would dispute Howard's contention that he has contributed nothing. Apart
from some well-aimed pedantry (which - within reason - is a merit within
COMCIFS!), it has always seemed important to me that our discussions have
been under the eye of a number of experts in a variety of crystallographic
disciplines, so that even their silence can be taken as confirmation that
we're not committing terrible mistakes in areas outside our own expertise.
I shall miss Howard's correspondence on COMCIFS matters, but at least look
forward to continued interaction with him on other topics. So, Howard,
adieu - and this is the last circular from this group that you will
receive from me!


Hans Wondratschek has been nominated by the Chairman of the International
Tables Commission to join COMCIFS as a member to oversee the handling of
symmetry information within the CIF universe. Until this is confirmed by
the Executive Committee, we welcome him as a Consultant. Hans has already
been involved with Gotzon Madariaga in a project to codify the
group-subgroup tables that will constitute an additional volume of
International Tables in CIF formalism. His interest in extending this to
cover all the information in International Tables volume A ties in with our
preliminary remarks on such a project in circular 26. We welcome him to
our committee.


This will (probably) be the last mailing before the ACA and ECM meetings,
at which I look forward to seeing many of you. 


New item
========

D36.1  New chemical formula
---------------------------
The existing _chemical_formula_ definitions in the Core describe
analytical, structural, moiety and sum formulae according to a set of rules
(described in _chemical_formula_appendix) that are parseable, and that
conform broadly to the Cambridge database rule set. However, in Acta we
have found these rules to be too constricting for typesetting formulae in
more expressive ways, and so we have used a local data name
(_chemical_formula_iucr) for alternative representations. Really that
should be _iucr_chemical..., but we invented it even before COMCIFS was
thought of. But it's been suggested that we propose instead an alternative
global data name to allow expression of any formula in a manner that
satisfies the IUPAC rules (which formulae in Acta do). This data name would
then be available for that purpose by any other journal. I have to confess
that we have jumped the gun on this, for the name has been used in
the CIF instruction booklet we prepared for the summer workshops. I would
of course usually insist that anyone anticipating COMCIFS decisions in such
a way be shot, with no heed paid to any excuse (such as "printer's deadlines"),
and I shall await my punishment stoically.

data_chemical_formula_iupac
    _name                       '_chemical_formula_iupac'
    _type                        char
    _example                    '[Co Re (C12 H22 P)2 (C O)6].0.5C H3 O H'
    _definition
;              Formula expressed in conformance with IUPAC rules for
               inorganic and metal-organic compounds (Nomenclature of
               Inorganic Chemistry, 1990. Oxford: Blackwell Scientific
               Publications), where these conflict with the rules for
               any other _chemical_formula_ entries. Typically used for
               formatting a formula in accordance with journal rules.
               This should appear in the CIF in addition to the most
               appropriate of the other _chemical_formula_ data names.
;

Note how this differs, for example, from _chemical_formula_structural
by the inclusion of square brackets and premultipliers. It is intended to
allow greater freedom to deviate from the strict CIF rules where necessary.


Continuing discussions
======================

(31)D30.1 Hydrogen bonds
------------------------
We almost jumped the gun on this, too, by adding to the CIF Guide
the _geom_hbond_ set of data names I proposed in circular 31, but
wiser counsels prevailed, and we listed the old local names, but with
a caution that they might be replaced in the new Core release. (As with
the proposed _chemical_formula_iupac above, we can and will if necessary
continue with local names for our own purposes, but these seem to be of
potentially wider use.)

I continue to prefer my option (2), which was a new category geom_hbond
with data names

     _geom_hbond_atom_site_label_D
     _geom_hbond_atom_site_label_H
     _geom_hbond_atom_site_label_A
     _geom_hbond_distance_DH
     _geom_hbond_distance_HA
     _geom_hbond_distance_DA
     _geom_hbond_angle_DHA
     _geom_hbond_site_symmetry_A                                    
     _geom_hbond_site_symmetry_D
     _geom_hbond_site_symmetry_H
     _geom_hbond_publ_flag

despite David's objection that this confers a special status on hydrogen
bonds. The reasons for my preference are pragmatic: it's much easier to
teach authors to use this largely intuitive set of codes than to set up
pointers between elements of different lists. Again, I emphasise that this
is for Acta purposes, and is not designed for mmCIF applications where
hydrogen bonding and other interactions are handled differently.

Syd has endorsed this approach in his review of the draft CIF Guide. Are
there any further objections?

Here's an example to show the idea in action:

loop_
     _geom_hbond_atom_site_label_D
     _geom_hbond_atom_site_label_H
     _geom_hbond_atom_site_label_A
     _geom_hbond_distance_DH
     _geom_hbond_distance_HA
     _geom_hbond_distance_DA
     _geom_hbond_angle_DHA
     _geom_hbond_site_symmetry_A                                    

# D     H     A    D-H   H...A   D...A    D-H...A  symm
# -     -     -    ---   -----   -----    -------  ----
  N10   H10   O6   0.95   1.99   2.816(5)  145    3_745
  C3    H3    O31  0.95   2.23   2.640(7)  105      .
  C4A   H4A   O51  0.95   2.26   2.735(5)  110      .
  C7    H7A   O8   0.95   2.19   2.647(7)  109      .
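
The rows of the example loop are regular enough to be read mechanically. A minimal Python sketch follows, assuming whitespace-separated values with no quoted strings. The names HBOND_ROWS, FIELDS, split_su and parse_hbonds are illustrative only and belong to no existing CIF library.

```python
import re

# Data rows copied from the example loop (comment lines omitted).
HBOND_ROWS = """
N10   H10   O6   0.95   1.99   2.816(5)  145    3_745
C3    H3    O31  0.95   2.23   2.640(7)  105      .
C4A   H4A   O51  0.95   2.26   2.735(5)  110      .
C7    H7A   O8   0.95   2.19   2.647(7)  109      .
"""

# Short keys standing in for the _geom_hbond_ data names, in loop order.
FIELDS = ("label_D", "label_H", "label_A",
          "distance_DH", "distance_HA", "distance_DA",
          "angle_DHA", "site_symmetry_A")
NUMERIC = ("distance_DH", "distance_HA", "distance_DA", "angle_DHA")


def split_su(text):
    """Split a number such as '2.816(5)' into (value, standard uncertainty)."""
    m = re.fullmatch(r"([-+]?\d*\.?\d+)(?:\((\d+)\))?", text)
    if m is None:
        raise ValueError("not a numeric value: %r" % text)
    number, su_digits = m.group(1), m.group(2)
    if su_digits is None:
        return float(number), None
    # The s.u. applies to the last quoted decimal places of the value.
    decimals = len(number.split(".")[1]) if "." in number else 0
    return float(number), int(su_digits) * 10 ** -decimals


def parse_hbonds(text):
    """Return one dict per hydrogen bond; '.' (inapplicable) becomes None."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError("expected %d values: %r" % (len(FIELDS), line))
        row = dict(zip(FIELDS, values))
        for key in NUMERIC:
            row[key] = split_su(row[key])
        if row["site_symmetry_A"] == ".":
            row["site_symmetry_A"] = None
        rows.append(row)
    return rows


HBONDS = parse_hbonds(HBOND_ROWS)
```

Each row carries everything needed for one line of a hydrogen-bond table, with no pointers into other lists to resolve, which is the pragmatic point made above.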


D33.2  _type_construct in the MS dictionary
-------------------------------------------
D>      Here are some further thoughts on 33.2 dealing with type
D> constructs.
D> 
D>      As I understand it, regex tells you only about the
D> *characters* one can expect to find at a particular point in the
D> string of characters that make up a data item.  This clearly has
D> some value in validating files to ensure that the characters that
D> appear conform to the dictionary specifications, but it does not
D> say anything about the *information* that the string contains.  A
D> program that looks no further than REGEX can see that the file
D> conforms in a syntactical way without having any idea of the
D> semantics.  It is like a program that checks that a crossword
D> puzzle consists of black and white squares, but does not check the
D> clues. 

Yes, this is certainly the case.

D>      I feel that we should be working towards a dictionary that not
D> only describes the allowed character strings, but also gives the
D> computer information on how these strings are related.  We should
D> be moving from merely providing the computer with typographical
D> information to semantic information.  Clearly this is something
D> that needs to be done carefully but Gotzon's draft of the modulated
D> structure dictionary has opened the door.
D> 
D>      Our handling of symmetry in the core is one place where we
D> could use such an approach to advantage so I will use this as an
D> example of what we could do.  There is at present an important
D> ambiguity in the core definitions.  In the geom sections we give
D> the symmetry transformations applied to an atom in the form
D> 
D> _geom_*_site_symmetry_+            3_546
D> 
D> where '*' stands for 'bond', 'contact', 'angle' or 'torsion' and '+'
D> stands for '1', '2', '3' or '4'.  These data items ought to be
D> described by a _type_construct since they are, and always have
D> been, understood to be a parsable string, the first character
D> referring to the symmetry operator and the last three to the
D> translation operator applied in order to generate the atom.  The
D> programs in Chester already parse this information in producing
D> copy for Acta Cryst. C.  
D> 
D>      Of course, when the first version of the core was prepared
D> there were no _type_constructs so this is one feature that we need
D> to add.  This is not the problem.  The problem we have is more
D> serious.  The symmetry operators used in _geom are never explicitly
D> defined in the _symmetry_equiv category.  There is an implicit
D> assumption that the list of equivalent positions is ordered, but
D> there is nothing in the cif definition that prevents these from
D> being sorted into a different order, in which case all the
D> information about the symmetry operators applied in _geom is lost. 
D> We clearly need a _symmetry_equiv_pos_label item in the
D> _symmetry_equiv category to identify each symmetry operator. 
D> 
D>      This too can be added without much difficulty, but this _label
D> will be parent to a family of 11 closely related children having
D> the form _geom_*_site_symmetry_label_+!  However, this data item is
D> not yet defined, though its value does appear combined with
D> information on translations in _geom_*_site_symmetry_+.  This means
D> that we have, at present, no way to indicate the parent-child
D> relationship.  
D> 
D>      We could, using the conventions proposed by Gotzon, make this
D> relation explicit by using a _type_construct for
D> _geom_*_site_symmetry_+ that has the form:
D> 
D> (_geom_*_site_symmetry_label_+)_(_geom_*_site_symmetry_trans_a_+)\
D> (_geom_*_site_symmetry_trans_b_+)(_geom_*_site_symmetry_trans_c_+)
D> 
D> This in turn requires the definition of each of these four items
D> for each of the 11 substitutions of * and +, a total of 44 new data
D> items as follows:
D> 
D>      2 _geom_bond_site_symmetry_+, 
D>      3 _geom_angle_site_symmetry_+,
D>      4 _geom_torsion_site_symmetry_+ 
D> and  2 _geom_contact_site_symmetry_+ items, 
D> 
D> This sort of fecundity we can do without.  However there are
D> methods of birth control at hand!
D> 
D>      Each of these 44 items refers to the same set of four primitive
D> definitions:
D> 
D>           _site_symmetry_label
D>           _site_symmetry_trans_a
D>           _site_symmetry_trans_b
D>           _site_symmetry_trans_c
D> 
D> Since these primitive data items do not appear explicitly in the
D> _geom_* loops, they do not have to be in the same category as the
D> _geom_*_site_symmetry_+ items and so do not have to be separately
D> defined for each of the 4 _geom_ categories.  This leads to a
D> considerable simplification.  Thus the _type_construct for each of
D> the _geom_*_site_symmetry_+ items will be identical, viz:
D> 
D> (_site_symmetry_label)_(_site_symmetry_trans_a)\
D> (_site_symmetry_trans_b)(_site_symmetry_trans_c)
D> 
D> meaning that instead of defining 44 different names we only need to
D> define 4.  
D> 
D>      There is yet a further simplification we can make.  Since
D> _site_symmetry_label is a child of _symmetry_equiv_pos_label and
D> since the category of _site_symmetry_label is not yet defined we can
D> replace the name of the child by the name of the parent.  The above
D> _type_construct can then be written as:
D> 
D> (_symmetry_equiv_pos_label)_(_site_symmetry_trans_a)\
D> (_site_symmetry_trans_b)(_site_symmetry_trans_c)
D> 
D> leaving only three new data items to define, the three translation
D> items.

Yes, this is the sort of approach I described in D25.6(b). There I spoke of
assigning the category 'null' to ensure that such new data names (defined
just to circumscribe the components within a _type_construct field) did not
appear alone in a DATA file. However, I have an open mind as to whether
this is necessary, or whether explicit categories should be introduced.

The DDL2 approach uses _item_type_list.construct for the same purpose. Are
there any considerations specific to the DDL2 implementation that we should
be aware of (where categories are treated somewhat differently)?
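
To make the idea concrete, here is a minimal Python sketch of how a program might turn the final form of David's _type_construct into a parser for values such as 3_546. It is a sketch under stated assumptions, not a description of any existing software: the per-item patterns in COMPONENT_PATTERNS are hypothetical (a real dictionary would supply them), the function names are invented for illustration, and the rule that 5 denotes zero translation follows the usual Core convention (555 means no translation) rather than anything stated in this circular.

```python
import re

# Hypothetical patterns for the component items (an assumption).
COMPONENT_PATTERNS = {
    "_symmetry_equiv_pos_label": r"[1-9][0-9]*",
    "_site_symmetry_trans_a": r"[0-9]",
    "_site_symmetry_trans_b": r"[0-9]",
    "_site_symmetry_trans_c": r"[0-9]",
}

# The construct from the discussion, with the line continuation joined.
CONSTRUCT = ("(_symmetry_equiv_pos_label)_(_site_symmetry_trans_a)"
             "(_site_symmetry_trans_b)(_site_symmetry_trans_c)")


def compile_construct(construct, patterns):
    """Build a regex with one named group per '(data_name)' component."""
    parts = []
    pos = 0
    for m in re.finditer(r"\((_[A-Za-z0-9_]+)\)", construct):
        # Text between components (here the '_' separator) is literal.
        parts.append(re.escape(construct[pos:m.start()]))
        name = m.group(1)
        parts.append("(?P<%s>%s)" % (name.lstrip("_"), patterns[name]))
        pos = m.end()
    parts.append(re.escape(construct[pos:]))
    return re.compile("".join(parts))


SITE_SYMMETRY = compile_construct(CONSTRUCT, COMPONENT_PATTERNS)


def decode_site_symmetry(code):
    """Return (operator label, (ta, tb, tc)) with 5 taken as zero shift."""
    m = SITE_SYMMETRY.fullmatch(code)
    if m is None:
        raise ValueError("value does not match the construct: %r" % code)
    shifts = tuple(int(m.group("site_symmetry_trans_" + axis)) - 5
                   for axis in "abc")
    return int(m.group("symmetry_equiv_pos_label")), shifts


print(decode_site_symmetry("3_546"))
```

The named groups are what carry the semantics: the first group is tied by name to _symmetry_equiv_pos_label, so a validator could go on to check that operator 3 actually exists in the symmetry list, which a bare character pattern cannot express.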

D>      This simplification can be extended to other definitions.  For
D> example, there are 11 items of the type _geom_*_atom_site_label_+
D> each of which is separately defined in the current core dictionary. 
D> However, they could each be given the same _type_construct:
D> 
D>           (_atom_site_label)
D> 
D> This automatically identifies each of the children with its parent
D> and ensures that it automatically inherits such properties of its
D> parent as _type and would eliminate the need for _list_link_parent. 
D> As far as I can see, if we followed this idea systematically
D> through the dictionary we could completely eliminate the need for
D> the _list_link items!  Is there a snag that I have not spotted?

I suspect that the _list_link items convey some deeper information about
the relationships between categories ("lists" or "tables") than would be
satisfied by the substitution of the elements in the _type_construct only,
but I'm not sure. Perhaps the relational experts can pay particular
attention to this suggestion.


D> (Brian, comcifs may not be ready yet for this next section but it
D> could be passed on to Gotzon for his comments)

David, I'm sure you're right that this could go out on a smaller loop, but
I think it's useful to have this cross-fertilization of ideas visible - at
least in part - to all active dictionary authors. Anyway, everyone (except
Gotzon) can take your disclaimer as an excuse not to read the next section,
at least not with their usual attention to every detail!

D>      Turning now to the modulated structure dictionary, I have some
D> problems with the complexity of items such as
D> _atom_site_fourier_label_disp.  I feel that there should be a
D> simpler form.  
D> 
D>      Most of the modulating functions are expressed as a Fourier
D> series using either cosine or sine functions or both.  Each
D> modulation is defined by a wave vector given in
D> _cell_wave_vector_x, *_y, *_z and *_seq_id (maybe *_label would
D> represent our usual practice for this last item).  The modulation
D> that occurs in any parameter is then defined by an amplitude that
D> multiplies the sine or cosine function of the dot product of the
D> wave vector and the position vector, to which product may be added
D> a phase.  Therefore for each wave vector, each atom and each
D> coordinate we need any two of *_amp_sin, *_amp_cos and *_phase
D> (where * is a prefix that will refer to translation, rotation,
D> temperature factor or occupation).  This looks as if we need triply
D> nested loops, but it can be neatly handled by a construct such as:
D> 
D> # This example is based in part on the example given in the draft
D> # msCIFDIC extended to show how more than one modulation vector
D> # would be handled.
D> 
D> # Define the wave vectors
D> 
D> loop_
D>  _cell_wave_vector_label
D>  _cell_wave_vector_x
D>  _cell_wave_vector_y
D> 4 0.318(5) 0
D> 5 0 0.0223(4)
D> # Note that the z component of both wave vectors will have the
D> # default of 0
D> 
D> # Define the translation amplitudes.  These vary with modulation
D> # vector 4
D> 
D> loop_
D>  _mod_trans_atom_label
D>  _mod_trans_axis_label
D>  _wave_vector_label
D>  _mod_trans_amp_cos
D>  _mod_trans_amp_sin
D> K1 z 4 0.0084(4) -0.0106(5)
D> K2 z 4 0.0159(4) 0.0071(6)
D> SeO4 z 4 -0.0089(2) -0.0058(2)
D> 
D> # Define the rotation amplitudes.  These vary with modulation
D> # vector 5
D> 
D> loop_
D>  _mod_rot_atom_label
D>  _mod_rot_axis_label
D>  _wave_vector_label
D>  _mod_rot_amp_cos
D>  _mod_rot_amp_sin
D> SeO4 x 5 -4.2(1) 0.91(3)
D> SeO4 y 5 4.3(1) -5.5(2)
D> 
D> This way rotations and translations are kept in separate categories
D> so that their units are not mixed.  Translations are given in
D> fractions of the unit cell, rotations in degrees.  In this scheme,
D> higher order components in the Fourier series must be treated as
D> different modulation vectors, a problem that could be avoided if we
D> were allowed to use nested loops.
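
As a worked illustration of the description above, the following Python sketch evaluates one modulation term from a cosine and a sine amplitude, and converts the pair to the equivalent amplitude and phase. Several things here are assumptions not fixed by the text: the factor of 2*pi (wave vector in reciprocal-lattice units, position in fractional coordinates), the sign convention A*cos(arg + phase), the phase being in radians, and the atomic position used in the usage note, which is purely hypothetical. Standard uncertainties are dropped from the wave vectors.

```python
import math

# Wave vectors from the example loop; z components take the default 0.
WAVE_VECTORS = {"4": (0.318, 0.0, 0.0), "5": (0.0, 0.0223, 0.0)}


def modulation(amp_cos, amp_sin, q, r):
    """Return amp_cos*cos(2*pi*q.r) + amp_sin*sin(2*pi*q.r)."""
    arg = 2.0 * math.pi * sum(qi * ri for qi, ri in zip(q, r))
    return amp_cos * math.cos(arg) + amp_sin * math.sin(arg)


def to_amplitude_phase(amp_cos, amp_sin):
    """Return (A, phase) such that
    A*cos(arg + phase) == amp_cos*cos(arg) + amp_sin*sin(arg).

    Expanding the left side gives amp_cos = A*cos(phase) and
    amp_sin = -A*sin(phase), hence the atan2 arguments below.
    """
    amplitude = math.hypot(amp_cos, amp_sin)
    phase = math.atan2(-amp_sin, amp_cos)
    return amplitude, phase
```

For instance, with the K1 entry for modulation vector 4 (0.0084 and -0.0106) and an invented position r = (0.25, 0.10, 0.30), modulation(0.0084, -0.0106, WAVE_VECTORS["4"], r) gives the z displacement in fractions of the cell, and to_amplitude_phase(0.0084, -0.0106) gives the same term in amplitude and phase form. This is why any two of the three quantities suffice.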



(34)D28.2, D28.3  R Factors
---------------------------

B> I am completely in favo[u]r of a systematic naming system for R-factors.
B> David's proposals are fine. Should the powder defs have _pd_ in front? (I
B> have no strong feelings on this either).



Regards
Brian