This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

(30) DDL version 2; hydrogen bonds; su; R factors; _type_construct

Dear Colleagues

There is a new ftp area reserved for the use of COMCIFS members. To access files
in this area, make an ftp connection to agate.iucr.ac.uk (or 192.70.242.60) and
log in as "comcifs" with password "wheatear". The directory now contains
copies of the DDL2 and mmCIF dictionaries (see discussion below).

D16.1 esd versus su
-------------------
Some time ago, we left open ("pending further developments") the adoption of
the term "standard uncertainty" in place of "estimated standard deviation".
Howard now draws our attention to the fact that there have been further
developments.

H> The Executive Committee has approved the Statistical Descriptors II
H> report which is now with the Technical Editor's office in Chester being
H> prepared for publication in Acta. 

A copy of this is available from the ftp directory as the file 
statistical_descriptors_II.

H>                                   As I mentioned earlier two of the
H> recommendations concern COMCIFS and there was some short discussion about
H> them previously.
H> (a) the term 'standard uncertainty' (symbol su) is recommended in place of
H>     'estimated standard deviation' (symbol esd).

The expression 'esd' or 'e.s.d.' occurs in the core dictionary in the
following lines (these are simply 'grepped' from the dictionary, but the
context is usually fairly clear).

    _type_conditions             esd   (in several places)
               chemical formulae. Parentheses are used only for e.s.d.'s.
;              Net intensity and e.s.d. calculated from the diffraction counts 
;              The e.s.d. of the individual mean standard scales applied to the 
    _refine_ls_shift/esd_max              .535
    _refine_ls_shift/esd_mean             .044
               876-881. The value must be between 0. and 1. with an e.s.d.
               weight [1/(e.s.d. squared)]. See also _refine_ls_restrained_S_ 
               weight [1/(e.s.d. squared)] and wr is the restraint weight.
data_refine_ls_shift/esd_
    loop_ _name                '_refine_ls_shift/esd_max'
                               '_refine_ls_shift/esd_mean'
          _enumeration_detail    sigma    "based on measured e.s.d.'s"
               criterion is usually expressed in terms of an e.s.d. threshold.

There seems to be no problem with changing these references to "s.u." except in
the data names _refine_ls_shift/esd_max and _mean. Here, the data names should
be retained, but the definition might read

    _definition
;              The largest and the average ratios of the final least-squares 
               parameter shift divided by the final standard uncertainty (s.u.,
               formerly described as estimated standard deviation, e.s.d.).
;

- verbose, but unambiguous. Yes?

There is also the option to change the enumeration of _type_conditions to 'su',
but this may be difficult, given that the DDL1.4 paper has now gone to press.
Syd, are you willing to make this change? If not, an explanation of the
historical reasons for the code 'esd' can always be given in any later
description of this term.

H> (b) the 2 to 19 rule is recommended for the number of figures used to report
H>     su's. [My programming assistant, Howard D. Flack, has modified the Geneva
H>     version of the cif output programme of XTAL to take account of this. The
H>     Acta Cryst C editor has a copy of this and claims it works 
H>     properly down under as well.]

My ear has frequently been bent on the subject of the "Rule of 19" (i.e.
uncertainty values of less than 2 in the last decimal place should be
expanded to another significant figure). It is likely that this ruling will
be strictly enforced within IUCr journals. Should this be made a part of the
CIF specification - in other words, should the occurrence of a quantity such
as .1243(1) render the CIF invalid? If so, several CIFs already in the IUCr
would be invalid by this ruling. 
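
For illustration only, the rule as described above can be sketched in Python.
The sketch rounds the su to two significant figures, keeps them if the result
is 19 or less, and otherwise falls back to one figure. The function name
format_with_su and the round-half-up convention are assumptions made for this
example; they are not taken from the Statistical Descriptors report or from
any CIF specification.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_with_su(value, su):
    """Return 'value(su)' with the parenthesised integer kept at 19 or less.

    Hypothetical helper: values are passed through str() so that decimal
    inputs such as 0.1243 keep the digits the caller wrote.
    """
    v, s = Decimal(str(value)), Decimal(str(su))
    if s <= 0:
        raise ValueError("su must be positive")
    exp = s.adjusted()  # decimal position of the leading digit of su
    # Try two significant figures first; fall back to one if that exceeds 19.
    for places in (exp - 1, exp):
        places = min(places, 0)  # sketch only: never round left of the units
        quantum = Decimal(1).scaleb(places)
        digits = int((s / quantum).to_integral_value(rounding=ROUND_HALF_UP))
        if digits <= 19:
            break
    # The value is rounded (or zero-padded) to the same decimal place.
    rounded = v.quantize(quantum, rounding=ROUND_HALF_UP)
    return f"{rounded:f}({digits})"


print(format_with_su("0.1243", "0.0001"))
print(format_with_su("3.276", "0.005"))
```

Under these assumptions a quantity such as .1243(1) would be rewritten with one
more figure, whereas 3.276(5) is left alone. Note that the extra figure is only
meaningful if the refinement program supplies it; the sketch simply pads with a
zero when it is given no more digits.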

D28.2 and D28.3  R factors
--------------------------
D> In items 28.2 and 28.3 I agree with your suggestion for *_wR_factor,
D> making it the definitive value based on the reflections used in the
D> refinement.  It would be useful if, in the next circular, you could
D> summarise all the other types of R factor that exist or have been proposed,
D> together with your proposal for the ones that should be included in the
D> core.  I think we need these in front of us so we can register our
D> approval or otherwise.

Oh, dear. I find that my understanding of all the finer points of detail on
these topics is fragile, but I shall do my best. I give below my summary of
the existing definitions, and follow this with the full definitions, and the
proposals Brian Toby has made for similar definitions in the powder
dictionary (the relevant point here is that powder workers do require an
unweighted R factor based on intensities). There are various R factors
defined in the macromolecular dictionary, generally for shells of resolution;
I would suppose the _all, _obs and (nothing) suffixes can be applied to those
in line with our decisions on the core definitions. So, as I understand it,
the existing definitions are:

  '_refine_ls_R_factor_all'
  '_refine_ls_R_factor_obs'
R factors calculated on F (for comparison with older calculations quoting
this as the conventional R factor). 'All' means 'calculated using all
collected data'; 'observed' means 'using all data satisfying the "observed"
criterion', i.e. all data satisfying Fo > n.sigma(Fo), where n is some
arbitrary cutoff factor stipulated in _reflns_observed_criterion.

  '_refine_ls_wR_factor_all'
  '_refine_ls_wR_factor_obs'
Weighted R factors calculated on |F|, F^2^ or I~net~, according to which
quantity was chosen in the least-squares minimization function. The '_all' and
'_obs' suffixes carry the same meaning as before.

There is a problem with calculations within SHELXL93, which omits from the
refinement some reflections that are believed to be systematically wrong.
The number of reflections used in the refinement is therefore not 'all'
(because some reflections were collected, but are now ignored), nor the number
'observed' as per the cutoff criterion (although often the two will
coincide). Hence the proposal was to calculate a weighted R factor using just
those reflections that are actually used in the least-squares minimisation
function. This quantity would be denoted '_refine_ls_wR_factor' (with no
trailing suffix).

There are three points on which my mind remains a little unclear.

(1) This suggestion (for *_wR_factor) is well suited to the way SHELX works.
    Is it appropriate as a general definition? 

(2) Is there any merit in following the same principle for other calculated
    quantities that have '_all' and '_obs' flavours (in particular,
    _refine_ls_R_factor_all and _refine_ls_restrained_S_all)?

(3) What is the meaning of _refine_ls_number_reflns? In my previous mailing I
    enquired whether this was to be understood as the number of reflections
    used in the refinement (in other words, just the number of data points
    taken into account in calculating the putative _refine_ls_wR_factor). But
    I note that it is referred to in the definition of _refine_ls_restrained_S_
    data names. Is this usage consistent?


............... The existing core definitions for R, wR and S are ................
data_refine_ls_R_factor_
    loop_ _name                 '_refine_ls_R_factor_all'
                                '_refine_ls_R_factor_obs'
    _type                        numb
    _enumeration_range           0.0:
    _definition
;              Residual factors for all reflection data, and for reflection data
               classified as 'observed' (see _reflns_observed_criterion).
               R = (sum||Fm|-|Fc|| / sum|Fm|); Fm and Fc are measured and 
               calculated structure factors. This is the conventional R factor. 
               See also _refine_ls_wR_factor_ definitions.
;

data_refine_ls_wR_factor_  
    loop_ _name                 '_refine_ls_wR_factor_all'
                                '_refine_ls_wR_factor_obs'
    _type                        numb
    _enumeration_range           0.0:
    _definition
;              Residual factors for all reflection data, and for reflection data
               classified as 'observed' (see _reflns_observed_criterion). 
               wR = [sum(w|Ym-Yc|^2^) / sum(wYm^2^)]^1/2^ where Ym and Yc are 
               the measured and calculated coefficients specified by the
               _refine_ls_structure_factor_coef; w is the least-squares weight.
               See also the _refine_ls_R_factor_ definitions.
;

data_refine_ls_restrained_S_
    loop_ _name                 '_refine_ls_restrained_S_all'
                                '_refine_ls_restrained_S_obs'
    _type                        numb
    _enumeration_range           0.0:
    _definition
;              The least-squares goodness-of-fit parameter S' for all data, and
               for observed data, after the final cycle of least squares. This
               parameter explicitly includes the restraints applied in the 
               least-squares process.
               S' = {[sum(w|Ym-Yc|^2^) + sumr(wr|Pc-Pt|^2^)]
                                                / (Nref+Nrestr-Nparam)}^1/2^
               where the sum is over the specified reflection data; sumr is over
               the restraint data; Nref is the number of reflections used in the
               refinement (see _refine_ls_number_reflns); Nparam is the number
               of refined parameters (see _refine_ls_number_parameters); Nrestr
               is the number of restraints (see _refine_ls_number_restraints);
               Ym and Yc are the measured and calculated coefficients specified
               in _refine_ls_structure_factor_coef; Pc and Pt are the calculated
               and target restraint values; w is the least-squares reflection
               weight [1/(e.s.d. squared)] and wr is the restraint weight.
               See also _refine_ls_goodness_of_fit_ definitions.
;
...............................................................................
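
To make the distinction between the '_all', '_obs' and unsuffixed flavours
concrete, here is a minimal Python sketch of the three formulae quoted above.
The reflection list, the cutoff N_SIGMA and the 'used' flag are invented for
illustration; the weights are taken as 1/(e.s.d. squared), as in the
definitions.

```python
from math import sqrt


def r_factor(f_meas, f_calc):
    """R = sum||Fm|-|Fc|| / sum|Fm| (the conventional R factor)."""
    num = sum(abs(abs(fm) - abs(fc)) for fm, fc in zip(f_meas, f_calc))
    return num / sum(abs(fm) for fm in f_meas)


def wr_factor(y_meas, y_calc, weights):
    """wR = [sum(w|Ym-Yc|^2) / sum(w Ym^2)]^1/2."""
    num = sum(w * (ym - yc) ** 2 for ym, yc, w in zip(y_meas, y_calc, weights))
    den = sum(w * ym ** 2 for ym, w in zip(y_meas, weights))
    return sqrt(num / den)


def restrained_s(y_meas, y_calc, weights, p_calc, p_target, r_weights, n_param):
    """S' = {[sum(w|Ym-Yc|^2) + sumr(wr|Pc-Pt|^2)] / (Nref+Nrestr-Nparam)}^1/2."""
    refl = sum(w * (ym - yc) ** 2 for ym, yc, w in zip(y_meas, y_calc, weights))
    restr = sum(wr * (pc - pt) ** 2
                for pc, pt, wr in zip(p_calc, p_target, r_weights))
    return sqrt((refl + restr) / (len(y_meas) + len(p_target) - n_param))


# Invented reflections: (Fm, Fc, sigma(Fm), used in the refinement?)
reflections = [
    (10.0, 9.0, 0.5, True),
    (20.0, 22.0, 0.5, True),
    (5.0, 5.0, 0.5, True),
    (1.0, 1.5, 0.5, True),    # below the cutoff, but still refined against
    (8.0, 12.0, 0.5, False),  # 'observed', but omitted from the refinement
]
N_SIGMA = 3.0  # hypothetical n in Fo > n.sigma(Fo)

subsets = {
    "all": reflections,
    "obs": [r for r in reflections if r[0] > N_SIGMA * r[2]],
    "used": [r for r in reflections if r[3]],  # the unsuffixed flavour
}
for name, rows in subsets.items():
    fm = [r[0] for r in rows]
    fc = [r[1] for r in rows]
    w = [1.0 / r[2] ** 2 for r in rows]
    print(name, len(rows), r_factor(fm, fc), wr_factor(fm, fc, w))
```

In this made-up data set the three subsets all differ, which is the situation
the unsuffixed _refine_ls_wR_factor is meant to cover. The sketch takes Nref
in S' to be the number of reflections actually summed over, which is one
possible reading of point (3) above and not a settled answer to it.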

............... The proposed additions in the powder dictionary are ...........

data_proc_ls_I_R_factor
    _name                       '_refine_proc_ls_I_R_factor'
    _category                    refine
    _type                        numb
    _enumeration_range           0.0:
    _definition
;              Residual factors for estimated reflection intensities,
                 R~I~ = (sum~hkl~ |I~obs~(hkl) - I~calc~(hkl)| / sum I~obs~(hkl))
               where I~obs~(hkl) and I~calc~(hkl) are the squares of the
               observed and calculated structure factors. This is often
               referred to as R~B~ or R~Bragg~ in Rietveld refinements.
               See also _pd_proc_ls_prof_ for profile R-factor definitions.
;


data_pd_proc_ls_prof_
    loop_ _name                 '_pd_proc_ls_prof_R_factor'
                                '_pd_proc_ls_prof_wR_factor'
                                '_pd_proc_ls_prof_wR_expected'
    _category                    pd_proc_ls
    _type                        numb
    _definition
;              Rietveld/Profile fit R-factors

               Note that the R-factor computed for Rietveld refinements
               using the extracted reflection intensity values (often
               called the Rietveld or Bragg R-factor) is not properly a 
               profile R-factor. This R-factor may be specified using 
               _proc_ls_I_R_factor.

              _pd_proc_ls_prof_R_factor, often called R~p~, is an 
                unweighted fitness metric for the agreement between the 
                observed and computed diffraction patterns
                   R~p~ = sum~i~ | I~obs~(i) - I~calc~(i) | 
                          / sum~i~ ( I~obs~(i) )

              _pd_proc_ls_prof_wR_factor, often called R~wp~, is a
                weighted fitness metric for the agreement between the 
                observed and computed diffraction patterns
                  R~wp~ = SQRT {
                           sum~i~ ( w(i) * [ I~obs~(i) - I~calc~(i) ] ^2^ )
                           / sum~i~ ( w(i) * [I~obs~(i)]^2^ ) }

              _pd_proc_ls_prof_wR_expected, sometimes called the 
                theoretical R~wp~ or R~e~, is a weighted fitness metric for the
                statistical precision of the dataset. For an idealized fit, 
                where all deviations between the observed intensities and 
                those computed from the model are due to statistical 
                fluctuations, the observed R~wp~ should match the expected 
                R-factor. In reality R~wp~ will always be higher than 
                R~e~.
                  R~e~ = SQRT { 
                                 (n - p)  / sum~i~ ( w(i) * [I~obs~(i)]^2^ ) }

                Note that in the above equations, 
                   w(i) is the weight for the ith data point (see
                        _pd_proc_ls_weight)
                   I~obs~(i) is the observed intensity for the ith data
                        point, sometimes referred to as y~i~(obs)
                        or y~oi~. (See _pd_meas_count_total, 
                        _pd_meas_intensity_total or _pd_proc_total).
                   I~calc~(i) is the computed intensity for the ith data
                        point with background and other corrections
                        applied to match the scale of the observed dataset, 
                        sometimes referred to as y~i~(calc)
                        or y~ci~. (See _pd_calc_intensity_total).
                   n is the total number of data points (see
                        _pd_proc_number_of_points) less the number of
                        data points excluded from the refinement.
                   p is the total number of refined parameters.
;
...............................................................................
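
As a rough check on how the three profile statistics relate, they can be
sketched in Python as follows. R~p~ is computed with the absolute value of
each difference, as is usual for this statistic. The five-point pattern and
the choice of counting-statistics weights w(i) = 1/I~obs~(i) are assumptions
made purely for the example.

```python
from math import sqrt


def r_p(i_obs, i_calc):
    """Unweighted profile R factor, using absolute differences."""
    return sum(abs(o - c) for o, c in zip(i_obs, i_calc)) / sum(i_obs)


def r_wp(i_obs, i_calc, weights):
    """Weighted profile R factor."""
    num = sum(w * (o - c) ** 2 for o, c, w in zip(i_obs, i_calc, weights))
    den = sum(w * o ** 2 for o, w in zip(i_obs, weights))
    return sqrt(num / den)


def r_exp(i_obs, weights, n_params):
    """Expected R factor; n is the number of points passed in, i.e. those
    remaining after any excluded points have been dropped by the caller."""
    n = len(i_obs)
    den = sum(w * o ** 2 for o, w in zip(i_obs, weights))
    return sqrt((n - n_params) / den)


# Invented five-point pattern with two refined parameters.
i_obs = [100.0, 400.0, 900.0, 400.0, 100.0]
i_calc = [110.0, 380.0, 920.0, 410.0, 90.0]
weights = [1.0 / o for o in i_obs]  # assumed counting statistics

print("Rp  =", r_p(i_obs, i_calc))
print("Rwp =", r_wp(i_obs, i_calc, weights))
print("Re  =", r_exp(i_obs, weights, n_params=2))
```

With these numbers R~wp~ comes out larger than R~e~, in line with the remark
in the definition.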

Howard has made the following remarks, which undoubtedly have some bearing on
this discussion, but I am not expert enough to see how exactly they affect
our deliberations.

H>D> I assume that this item is the one on which refinement is based.  
H> 
H>  There are all sorts of problems with loose statements like that.
H>  (a) The minimized function in least squares does not usually have (constant)
H>      terms in the denominator.
H>  (b) The LS minimisation function is defined with the scale factor(s) applied
H>      to the calculated quantities whereas the R factors in general have the
H>      scale factor applied to the observed quantities (which means that the 
H>      denominator is not a constant)
H>  (c) If you apply restraints, these act on the minimisation function but
H>      not on the R Factors that you seem to be talking about.

D25.6 _type_construct
---------------------
D> Perhaps the next circular, which promises to say more about the 
D> new DDL, could also let us know how we can get a copy of REGEX since this 
D> will be necessary for our further discussions.

I have put a copy of the POSIX document discussing regular expression syntax
in the new ftp area.

I feel that our discussions of _type_construct have demonstrated the
feasibility of this approach (and the same approach is supported in DDL2),
but I am unsure how to proceed at this point. The examples I have seen so far
are incomplete, and a thorough approach needs to be taken to ensure
self-consistency through all the dependent components. I think I would favour
dropping this from the current (that is, forthcoming!) release of the
dictionaries, but working on it energetically for future releases. Is there
general agreement on this, or does anyone feel that it is essential to have
this feature implemented at this point?


New topics
==========

D30.1 Hydrogen bonds
--------------------
Here's a set of notes I made some time ago (in fact, pre-COMCIFS) that has
just come to light again, reminding me of another matter that is overdue for
discussion. Please bear with me if I include these old notes verbatim, rather
than seek to rephrase them in modern terminology! I would suppose that
our preferred route now is option (2) below, but I have discovered that
option (1) has been routinely implemented by the Acta staff. Comments welcome.
I notice, by the way, that the struct_conn category of the mmCIF details
hydrogen bonds and other interactions, but I suggest the _geom_hbond_
approach would be more suitable for small-molecule CIFs.

Authors frequently wish to describe hydrogen-bond geometry, and a typical
table in Acta might look like this:

    D---H...A               D...A      D---H       H...A     D---H...A
C(6)---H(C6)...O(2)^i^    3.276 (5)   1.00 (4)    2.34 (4)    157 (1)
C(9)---H(C9)...O(2)^ii^   3.243 (5)   0.90 (4)    2.55 (4)    134 (1)

(1) Certain authors have suggested the following additional data names to be
used in such a case; the data naming scheme preserves chemical information
(i.e. DH is a bond distance, DA a contact), but the resultant loop contains
an inelegant mixture of _bond_, _contact_ and _angle_ identifiers.

_geom_bond_atom_site_label_D
_geom_bond_atom_site_label_H
_geom_bond_distance_DH
_geom_contact_atom_site_label_A
_geom_contact_distance_HA
_geom_contact_distance_DA
_geom_angle_DHA
_geom_contact_site_symmetry_A

(2) An alternative is to group all the entities under a new second-level
identifier [i.e. create a new category, geom_hbond], to obtain

_geom_hbond_atom_site_label_D
_geom_hbond_atom_site_label_H
_geom_hbond_atom_site_label_A
_geom_hbond_distance_DH
_geom_hbond_distance_HA
_geom_hbond_distance_DA
_geom_hbond_angle_DHA
_geom_hbond_site_symmetry_A

and perhaps also (for completeness)

_geom_hbond_site_symmetry_D
_geom_hbond_site_symmetry_H
_geom_hbond_publ_flag

(3) A third possibility is to embed all the data within the existing geometry
loops [e.g. the first example would have components within
          loop_
               _geom_bond_atom_site_label_1
               _geom_bond_atom_site_label_2
               _geom_bond_distance
                     C(6)     H(C6)   1.00(4)
          and
          loop_
               _geom_contact_atom_site_label_1
               _geom_contact_atom_site_label_2
               _geom_contact_distance
                     C(6)     O(2)    3.276(5)
                     H(C6)    O(2)    2.34(4)    ]
but to have a set of identifier 'pointers' in a separate loop
loop_
     _geom_hbond_donor
     _geom_hbond_hydrogen
     _geom_hbond_acceptor
     _geom_hbond_symmetry_acceptor
        C(6) H(C6) O(2) 2
        C(9) H(C9) O(2) 2_655
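
To show what option (2) would look like in a data file, the following Python
sketch writes the two table rows above as a single _geom_hbond_ loop. The data
names are those proposed in option (2); the acceptor symmetry codes are copied
from the pointer loop in option (3). The helper write_loop is hypothetical and
does no quoting of values, which is adequate only because none of these values
contains white space.

```python
HBOND_NAMES = [
    "_geom_hbond_atom_site_label_D",
    "_geom_hbond_atom_site_label_H",
    "_geom_hbond_atom_site_label_A",
    "_geom_hbond_distance_DH",
    "_geom_hbond_distance_HA",
    "_geom_hbond_distance_DA",
    "_geom_hbond_angle_DHA",
    "_geom_hbond_site_symmetry_A",
]

# One tuple per hydrogen bond, in the same order as HBOND_NAMES.
HBOND_ROWS = [
    ("C(6)", "H(C6)", "O(2)", "1.00(4)", "2.34(4)", "3.276(5)", "157(1)", "2"),
    ("C(9)", "H(C9)", "O(2)", "0.90(4)", "2.55(4)", "3.243(5)", "134(1)", "2_655"),
]


def write_loop(names, rows):
    """Return a CIF loop_ as text: one data name per line, then the rows."""
    lines = ["loop_"]
    lines += [f"    {name}" for name in names]
    for row in rows:
        if len(row) != len(names):
            raise ValueError("each row needs one value per data name")
        lines.append("    " + "  ".join(row))
    return "\n".join(lines)


print(write_loop(HBOND_NAMES, HBOND_ROWS))
```

Every value for one hydrogen bond sits on one row of one loop, which is the
practical attraction of option (2) over the split loops of options (1) and (3).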


D30.2 The New DDL
-----------------
As I mentioned in passing last October, the macromolecular community decided
at the mmCIF workshop in Brussels to develop an enhanced version of the
dictionary definition language for use with the mmCIF dictionary. Syd, who
with Tony Cook is the author of the original DDL, agreed to this development,
and has been involved in the formulation of the new version, which is to be
called DDL version 2. John Westbrook (of the Nucleic Acids Data Bank at
Rutgers University) has been the main architect of this, and he has been
assisted by Syd and by Nick Spadaccini (who is also at the University of Western
Australia). I have added John and Nick to the mailing list for any
discussions we may have on the new version.

However, while I am sure that any constructive comments on the formalism will
be welcomed, I see our role more as assessing the applicability of DDL 2 to
the core and other dictionaries. Because the mmCIF will include the core
definitions, it is of course necessary to have the core definitions expressed
in the same formalism as the mmCIF dictionary itself, and Paula has done a
magnificent job in merging the core dictionary with the mmCIF definitions to
produce a compound dictionary using DDL2. The question we need to address is
whether we should distribute the revised core dictionary itself in DDL2
formalism, or whether it should go out in DDL1.4 formalism and be maintained
in parallel by the mmCIF developers. 

To help in deciding this, I have mailed to everyone a (paper) copy of the ciftex
representation of Paula's latest revision. This is to demonstrate that the
dictionary need not look very different from the published Core, even though
the underlying representation is somewhat different; but it will also allow
us to concentrate attention on the content of the definitions - the new DDL
is rather more verbose than the old one, and definitions can be hard to
locate. Also, it will in the long run save paper - the ciftex version is
only 102 pages long (!), as opposed to the 300 or so needed for a full ASCII
printout. The potential drawback is that the details of the DDL are masked,
perhaps to the extent that the full power of the new formalism is not
apparent. I shall therefore append to this message a listing of the new DDL
dictionary, and I shall be pleased to e-mail the draft mmCIF file to anyone
who wishes to see that in its full (850 kb) glory.

I emphasise again that this exercise is to allow us to consider the effects
of the change of formalism, not to give approval to the definitions
themselves, work on which is still in progress. Indeed, I have a slightly
more recent version which includes major revisions to the _entity_... items 
(this is available in the ftp directory). And it is important that we not
become hung up on details of the formalism, except where there appear to be
real problems - the gestation period for this dictionary is already
undesirably long.

Let me make a few general remarks about the philosophy behind DDL2. We have
already had extensive discussions on the desirability of providing a
self-consistent machine-readable set of data attributes, and over the last
year or so the version 1 DDL has grown to include relations between data
items. This approach is now taken a stage further. In the new DDL dictionary,
a hierarchy of objects is defined: category_groups (arbitrarily definable
groups of categories, so that the geom_bond and geom_angle categories would
naturally be collected into the geom category_group); categories (corresponding
to the current definition of a category as a collection of data names which
may occur in the same looped list, or outside of loops in a related aggregate);
subcategories (collections of data items that form a coherent set within a
category, e.g. *_h, *_k and *_l items might form a miller_index subcategory);
and individual data items. Each of these hierarchical objects may be
described by a separate set of DDL definitions (so there is, for example, a
_category.description and a _sub_category.description).

The organisation of DDL2 dictionaries is different from DDL1. Each definition
is given within a save_ frame, where previously each appeared within its own
data block. The save_ frames are permitted STAR syntactic devices for
encapsulating blocks of information which may be referenced from other places
within the current data block. At this point, however, such references are
not used - the save_ frames merely split the dictionary up into logical
chunks, as did the previous fragmentation into data blocks. But because each
definition within a dictionary is related to the rest of the information in
the dictionary, it is best to have a single data block encompassing the whole
dictionary. John explains the reason for this reorganisation thus:

JW> The save_ syntax has been used in order to have a more consistent use of
JW> scope between data files and dictionaries.   Since we are representing
JW> links between data items we are using save frames so that the referenced
JW> data items are all within the scope of the current dictionary.  This is
JW> not the case now where data_ sections are used.  Links between data
JW> blocks really violate the STAR scope rule that requires each data block
JW> to have a separate name space.

Another point of difference is that in earlier dictionaries a single data
block might contain the descriptions of several (more or less) related
data names. Hence, in the core we have

data_cell_length_                                        
    loop_ _name                  '_cell_length_a'
                                 '_cell_length_b'
                                 '_cell_length_c'
    _type                        numb
    _enumeration_range           0.0:
    _esd                         yes
    _esd_default                 0.0
    loop_ _units_extension _units_description _units_conversion
        ' '  'Angstroms' *1.0 '_pm' 'picometres' /100. '_nm' 'nanometres' *10.
    _definition
;              Unit-cell lengths corresponding to the structure reported. ...
;

In the new formulation, each such definition would have its own save_ frame
(i.e. one each for _cell_length_a, _b and _c).

However, it IS possible to have more than one definition within a save_
frame, and this occurs when 'parent' and 'children' are defined together
(recall that the child relationship provides for pointers between identifiers
in different lists - a typical example is a _geom_bond_atom_site_label_1
which must match an _atom_site_label). In the new dictionaries, this would be
written as

save_atom_site.label
    _item_description.description
;              The _atom_site.label is a unique identifier ...
;
     loop_
    _item.name
    _item.category_id 
    _item.mandatory_code
               '_atom_site.label'                 atom_site            yes
               '_geom_bond.atom_site_label_1'     geom_bond            yes
               '_geom_bond.atom_site_label_2'     geom_bond            yes
     loop_
    _item_linked.child_name
    _item_linked.parent_name
               '_geom_bond.atom_site_label_1'     '_atom_site.label'
               '_geom_bond.atom_site_label_2'     '_atom_site.label'
    _item_type.code               char
     loop_
    _item_examples.case           C12 Ca3g28 Fe3+17     H*251  boron2a
     save_

I find that this creates some problems in producing the dictionary - the
entry for '_geom_bond.atom_site_label_1' must be looked up under
'_atom_site.label', for instance. However, Paula has solved this problem by
including save_ frames for the _geom_bond... stuff that act as cross-references
to the primary definition, and I am satisfied with this. Again I asked John
for some clarification of this, and he describes the way in which this
arrangement better mirrors the organisation of data tables in a relational
description:

JW> Here is the model for what we are doing.  Each category definition
JW> defines a table and each item (attribute) defines a column in the table.
JW> The DDL defines that table structure or logical schema on which the
JW> macromolecular dictionary is built.  Each instance of a DDL category
JW> in the macromolecular dictionary adds a row to its category's table.
JW> Since this is the logical model, it is no longer possible to simply
JW> inspect the contents of a dictionary definition and expect to find
JW> all of the information about an item.
JW> 
JW> As [BM] points out, this departs from the current usage of searching
JW> within each definition for all of the information about an item.  I look
JW> at this in the following way.  The dictionary lays out the logical
JW> representation for the data.  This does not mean that the structure of the
JW> dictionary is the most efficient way of accessing the data.  We are
JW> reading the dictionary and building a table structure that we can
JW> search rather than roaming around looking for stuff in the dictionary.

BM> In many ways I appreciate the way this is done in your formulation, but I am
BM> still worried about how one answers the question "what is the meaning of the
BM> data name _geom_angle.atom_site_label_1 that appears in this data file?" One
BM> turns to Paula's dictionary, and locates the data name within a certain save
BM> frame (save_atom_site.label, of course ;->). There is lots of useful
BM> information in that save frame about the data name's attributes, but its
BM> "meaning", in human terms, is located in the description given in
BM> save_geom_...label_1 (which saveframe contains no instance of the data name
BM> itself, except implicitly in the framecode). It's all self-consistent, but
BM> to unwrap all these details needs a set of conventions or rules which are
BM> not yet explicitly set out anywhere.
BM>
BM> I guess I'm arguing that the save frame for this example should contain as a
BM> minimum
BM>     save_geom_angle.atom_site_label_1
BM>         _item.name                  '_geom_angle.atom_site_label_1'
BM>         _item_linked.parent_name    '_atom_site.label'
BM>     save_
BM> which is suspiciously like the DDL1 structure, and I can hear you screaming
BM> already. If you won't let me have this, I can still make my dictionary
BM> typesetting program work (the application I'm actually playing with just
BM> now), but it involves a certain amount of special coding - the new DDL is no
BM> more 'self-defining' to me than the old DDL was to you.

JW> This really goes to the issue of how you search for things in the dictionary
JW> that I discussed in the previous section.  I agree that it is more difficult
JW> to find things in the new structure.  The alternative is an assembly of
JW> complete definitions.  This would be almost impossible to maintain given
JW> the size of the macromolecular dictionary, and it would be even more
JW> difficult to maintain consistency between related items.
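
The "build a table structure and search it" approach can be sketched in Python
as follows. The rows are transcribed from the save_atom_site.label example
above. Everything else is an assumption made for illustration, in particular
the attachment of the description to the parent item and the rule that a child
without its own description inherits its parent's. This is not a description
of John's software.

```python
# Rows of the ITEM table (one per _item.name in the save frame).
ITEM = [
    {"name": "_atom_site.label", "category_id": "atom_site", "mandatory_code": "yes"},
    {"name": "_geom_bond.atom_site_label_1", "category_id": "geom_bond", "mandatory_code": "yes"},
    {"name": "_geom_bond.atom_site_label_2", "category_id": "geom_bond", "mandatory_code": "yes"},
]

# Rows of the ITEM_LINKED table (child -> parent pointers).
ITEM_LINKED = [
    {"child_name": "_geom_bond.atom_site_label_1", "parent_name": "_atom_site.label"},
    {"child_name": "_geom_bond.atom_site_label_2", "parent_name": "_atom_site.label"},
]

# Assumption: the frame's description belongs to the parent item.
ITEM_DESCRIPTION = {
    "_atom_site.label": "The _atom_site.label is a unique identifier ...",
}


def category_of(name):
    """Look the category up in the ITEM table; None if the name is unknown."""
    for row in ITEM:
        if row["name"] == name:
            return row["category_id"]
    return None


def parent_of(name):
    """Return the parent data name, or None if the item has no parent."""
    for row in ITEM_LINKED:
        if row["child_name"] == name:
            return row["parent_name"]
    return None


def describe(name):
    """Follow child -> parent links until a description is found."""
    seen = set()
    while name is not None and name not in seen:
        if name in ITEM_DESCRIPTION:
            return ITEM_DESCRIPTION[name]
        seen.add(name)  # guards against a circular chain of links
        name = parent_of(name)
    return None


print(category_of("_geom_bond.atom_site_label_1"))
print(describe("_geom_bond.atom_site_label_1"))
```

Once the tables are loaded, the question "what does this data name mean?" is
answered by a lookup and not by reading frames, but the inheritance rule used
in describe() is exactly the kind of convention that would have to be written
down somewhere.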

One other major change that you may have noticed is the introduction of a dot
character into data names to differentiate the category name from the
instance within the category. John believes this to be very important for the
efficient validation of tabular relationships (in other words it makes it
easy to enforce the rule that a loop_ contains only items from the same
category). Note that it is not essential to do this - each dictionary
definition may explicitly list the category to which the data name belongs,
and indeed in Paula's elaboration of the mmCIF dictionary, she has done this.
But John prefers that the category should be easily extractable from the data
name alone, using the dot (or some other separator character). This will
directly contradict one of our decisions (see, for example, (19)A10.6) that
no character beyond the leading underscore should have a special meaning.
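
A minimal Python sketch of what the dot buys: the category can be read off the
data name without consulting a dictionary, so the same-category rule for a
loop_ becomes a one-line test. The function names are hypothetical, and the
data names are those of the examples above.

```python
def category_of(data_name):
    """Return the text between the leading underscore and the first dot."""
    if not data_name.startswith("_") or "." not in data_name:
        raise ValueError(f"not a dotted data name: {data_name!r}")
    return data_name[1:].split(".", 1)[0]


def loop_is_single_category(data_names):
    """True if every data name in the loop belongs to one category."""
    return len({category_of(name) for name in data_names}) == 1


good_loop = ["_geom_bond.atom_site_label_1", "_geom_bond.atom_site_label_2"]
bad_loop = ["_atom_site.label", "_geom_bond.atom_site_label_1"]

print(loop_is_single_category(good_loop))
print(loop_is_single_category(bad_loop))
```

With undotted names such as _geom_bond_atom_site_label_1 the same test needs a
dictionary lookup for each name, since the underscore alone cannot say where
the category name ends.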

It also raises the minor difficulty that all data names adhering to this
convention and including a dot will be different from the datanames published
in the core dictionary. To permit compatibility with existing data files,
John has introduced an alias mechanism, so that _atom_site_label will be
recognised and internally translated to _atom_site.label (and, indeed, to
_atom_site.id also in this particular instance), which is an entry in the new
dictionary. We need to give full consideration as to the wisdom or otherwise
of this approach. For the most part, the trade-off seems to be between
computational efficiency in John's applications, and the multitude of
headaches that might result from changing all the existing data names.
However, it's not quite so simple, since some of the aliases in Paula's draft
do not map to exact equivalents in the new formulation (see, for instance,
_atom_site.fract_x and _atom_site.fract_x_esd versus _atom_site_fract_x).
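
The alias mechanism, and the way it stops being a simple renaming, can be
sketched in Python. The label aliases are the ones mentioned above. The
treatment of _atom_site_fract_x is only one interpretation of the mismatch: it
assumes that a DDL1 value carrying an appended su, such as 0.1243(1), has to be
split between _atom_site.fract_x and _atom_site.fract_x_esd. The table and
function names are hypothetical.

```python
import re
from decimal import Decimal

# DDL1 data name -> DDL2 item(s) it is translated to.
ALIASES = {
    "_atom_site_label": ["_atom_site.label", "_atom_site.id"],
    "_atom_site_fract_x": ["_atom_site.fract_x"],
}

# Assumption: DDL2 items whose su is carried in a separate item.
ESD_COMPANION = {"_atom_site.fract_x": "_atom_site.fract_x_esd"}

# A number followed by its su in parentheses, e.g. 0.1243(1).
_VALUE_SU = re.compile(r"^([-+]?\d*\.?\d+)\((\d+)\)$")


def translate(name, value):
    """Map one DDL1 name/value pair onto a dict of DDL2 items."""
    out = {}
    for target in ALIASES.get(name, [name]):
        match = _VALUE_SU.match(value)
        companion = ESD_COMPANION.get(target)
        if companion and match:
            number = Decimal(match.group(1))
            # The su digits apply to the last decimal places of the number.
            su = Decimal(match.group(2)).scaleb(number.as_tuple().exponent)
            out[target] = f"{number:f}"
            out[companion] = f"{su:f}"
        else:
            out[target] = value
    return out


print(translate("_atom_site_label", "C12"))
print(translate("_atom_site_fract_x", "0.1243(1)"))
```

The first call is a pure renaming (to two items at once); the second changes
the value as well as the name, which is why an alias table alone does not
describe it.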

There is also a proposal to introduce a new type of data structure within the
CIF formulation that allows vectors or matrices to be presented as coherent
entities (described by suitable dictionary descriptions); but I am having
grave difficulty with seeing how this formulation can be compatible with the
current restriction on single-level loop structures in CIF, and suggest that
for the present we should represent matrices in the traditional way (by
listing individual components).

Hence I wish to encourage the discussion to follow these strands
(which I number separately for ease of reference):

D30.3 Dictionary organisation within DDL2
-----------------------------------------
I move that we accept the save_ frame organisation in DDL2-compliant
dictionaries, and that we require each data name to have a matching save_
frame to allow location of the data name.

   [I remind you of what this is about with an example. The save_ frame
    save_atom_site.label contains a loop_ of _item.name's, including
    '_geom_bond.atom_site_label_1'. Paula has constructed a save_ frame,
    save_geom_bond.atom_site_label_1 which does NOT contain the _item.name
    '_geom_bond.atom_site_label_1', but it does have an
    _item_description.description that points the reader to the appropriate
    parent definition. I wish this to be adopted as a systematic convention
    throughout DDL2 CIF dictionaries.]

D30.4 Dot separator in data names
---------------------------------
I put forward the proposal that the dot be permitted in data names to allow
explicit reference to the category to which the data name belongs. I put
this forward with reservations of my own, and I invite John to elaborate on
the merits of this over an explicit _item.category_id listing in every case.
If it is just a matter of computational efficiency, how does it affect real
computations that he has been involved in?

D30.5 Matrix/vector structure types
-----------------------------------
I propose that we not implement _item_structure* components of DDL2 (which
describe higher-level matrix and vector structures) in the current mmCIF
dictionary. If John and Phil wish to challenge this, they are invited to
provide examples of how these would work in practice, within the current STAR
syntax rules and CIF restricted-STAR syntax conventions.

Best wishes
Brian