(44) More substantial changes to submitted Core dictionary

To: COMCIFS@iucr.ac.uk
Subject: (44) More substantial changes to submitted Core dictionary
From: bm
Date: Tue, 9 Jul 1996 17:24:31 +0100

Dear Colleagues

I now circulate David Brown's more substantive comments on the submitted
core dictionary, and my reactions. I haven't yet implemented the changes
suggested and agreed to below, and shall in some cases await further
dialogue before doing so. These are all constructive criticisms, and
should be considered by you in some detail. Further suggestions along
these lines are very welcome.

D44.1  _atom_site_aniso_B_* 
----------------------------
D>      The commission on nomenclature specifically recommends against
D> the use of B.  I would suggest that we drop this from the core.  If
D> the mm people want it, they should include it only in their
D> extension.  The commission also recommend the use of superscripts,
D> not subscripts e.g. U^ij^ rather than U~ij~ as being more
D> mathematically correct.  We should convert throughout (except where
D> the U's are referred to orthogonal cartesian coordinates where
D> subscripts are correct.  However this does not occur in cif).  See
D> also *_thermal_displacement_type.

I am told that we occasionally get Bij's in CIFs submitted to Acta, and
that we not infrequently get lists of Beq. The Managing Editor is unhappy
at the prospect of losing these from the core. I recommend their
retention, but perhaps with some addition to the definition suggesting
their restricted appropriateness. I don't think I have any
difficulty with the U^ij^ notation.

D44.2 _atom_site_disorder_assembly and *_group
----------------------------------------------
D>      I cannot make sense of the definitions here.  Can these be
D> worded in a more user friendly way?  If not, perhaps these
D> definitions should be dropped.

As I recall the history of this topic, the idea is that it is necessary to
flag atomic sites which represent alternative locations for a disordered
atom. Consider a disordered methyl group which is represented by two
tetrahedral groupings of H atoms displaced by 60 degrees with respect to
each other. Each of six H atoms might appear in the _atom_site_ list
(presumably with occupancy 50%). The original _atom_site_disorder_group
("a code to link disordered atom sites of a group that exist
simultaneously in the crystal structure") would allow you to flag each of
these sites with a code - 'M' say - to distinguish it from another group
of disordered atoms - a 'P' disordered 6-carbon ring, perhaps -
elsewhere in the structure. But there was no way to relate the atoms
within the 'M' group to each other. The new _atom_site_disorder_assembly
allows for this: denote by 'A' (say) the three H atoms in one tetrahedron,
and by 'B' the other three. Hence:

loop_ _atom_site_label                  # The *_group 'M' is a disordered methyl
      _atom_site_occupancy              # with configurations 'A' and 'B':
      _atom_site_disorder_group         #
      _atom_site_disorder_assembly      #    H11B    H11A      H13B
                                        #      .      |      .
   C1     1      .       .              #        .    |    .
   H11A   .5     M       A              #          .  |  .
   H12A   .5     M       A              #             C1 --------C2---
   H13A   .5     M       A              #           / .  \
   H11B   .5     M       B              #         /   .    \
   H12B   .5     M       B              #       /     .      \
   H13B   .5     M       B              #    H12A    H12B    H13A

(in the dictionary examples, numeric codes '1' and '2' are used instead of
my 'A' and 'B'; note the special meaning of '-1').
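
A minimal Python sketch of how a reading program might use the two codes,
following the usage in this message (group 'M', alternative configurations
'A' and 'B'). The rows are transcribed from the loop above; the function
name is hypothetical, no real CIF library is assumed, and the special
value '-1' is not handled.

```python
from collections import defaultdict

# Rows transcribed from the loop above:
# (label, occupancy, disorder_group, disorder_assembly).
# '.' is the CIF "inapplicable" value, used here for the ordered atom C1.
ATOM_SITES = [
    ("C1",   1.0, ".", "."),
    ("H11A", 0.5, "M", "A"),
    ("H12A", 0.5, "M", "A"),
    ("H13A", 0.5, "M", "A"),
    ("H11B", 0.5, "M", "B"),
    ("H12B", 0.5, "M", "B"),
    ("H13B", 0.5, "M", "B"),
]


def alternatives(sites):
    """Map each disorder group to {assembly code: [site labels]}."""
    result = defaultdict(lambda: defaultdict(list))
    for label, _occupancy, group, assembly in sites:
        if group == ".":
            continue  # ordered site, not part of any disordered group
        result[group][assembly].append(label)
    # Convert the nested defaultdicts to plain dicts for easy comparison.
    return {g: dict(a) for g, a in result.items()}


configs = alternatives(ATOM_SITES)
for group, by_assembly in configs.items():
    for assembly, labels in by_assembly.items():
        print(group, assembly, labels)
```

With both codes present, a program can pick one configuration of the
methyl group (say 'A') and ignore the other when checking geometry or
symmetry.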

SHELXL-93 outputs the *_disorder_group codes but not (I believe) the
*_assembly ones; but John Davies has developed his BUILDER code to make
use of different *_assembly values, and this appears to be a well-tried
approach in his working code. Obviously there are many cases where this
level of differentiation is useful: we have just had a structure with 4
molecules in the asymmetric unit passed by MISSYM, while the author
subsequently found there was additional symmetry. The cause was the
presence of additional disordered sites which refined differently in the
separate molecules. An explicit indication of which groups of atoms to
regard as disordered would have been a great help to us.

Given this background, I think the definitions are terse, though correct.
If anyone wishes to proffer a less ambiguous set of definitions, I would
welcome them. But I strongly favour keeping the data names in the dictionary.

D44.3  _atom_type_scat_versus_stol_list
---------------------------------------
D>      It should be made clear in the definition that, since we
D> cannot have nested loops, this field is a text field that is not
D> computer interpretable and therefore any appropriate arrangement of
D> the numbers is allowed.  However, if we wish, we could allow the
D> field to have a cif-like structure so that, even though the field
D> is treated as text, this text file could be fed back into a cif
D> reader for machine interpretation, e.g.,
D> 
D> _atom_type_scat_versus_stol_list
D> ;  loop_
D>      _stol
D>      _scattering_factor
D>      0.0       32.0
D>      0.05      31.2
D>      0.10      30.3
D>      0.15      28.5  #The list has been truncated for brevity
D> ;  

This may be a dangerous precedent to establish, as it runs the risk of
confusing the non-expert in CIF. I'd like to hear other opinions on this.
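
To make the proposal concrete, here is a rough Python sketch of what
"feeding the text back into a reader" could look like for the example
above. It is an assumption-laden illustration, not a CIF parser: it
handles one loop_ of unquoted numeric values, treats everything after '#'
as a comment, and the function name is hypothetical.

```python
# Contents of the text field from the example above (without the ';' delimiters).
FIELD = """\
  loop_
     _stol
     _scattering_factor
     0.0       32.0
     0.05      31.2
     0.10      30.3
     0.15      28.5  #The list has been truncated for brevity
"""


def parse_embedded_loop(text):
    """Parse a single loop_ of numeric values into a list of dicts."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        tokens.extend(line.split())
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected the field to start with loop_")

    # Data names follow loop_ and start with an underscore.
    names = []
    i = 1
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i])
        i += 1

    values = tokens[i:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the name count")

    width = len(names)
    return [
        dict(zip(names, (float(v) for v in values[j:j + width])))
        for j in range(0, len(values), width)
    ]


rows = parse_embedded_loop(FIELD)
for row in rows:
    print(row["_stol"], row["_scattering_factor"])
```

The sketch also shows the cost of the idea: the inner data names (_stol,
_scattering_factor) are not defined in any dictionary, so a reader has to
know them by private convention.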

D44.4 _audit_conform_[]
-----------------------
D>      Remove the example showing mixed mif and cif dictionaries.  It
D> is not clear that this will work and it is therefore a dangerous
D> example.

OK, I'll remove this example for now, but I'd like some feedback on
whether anyone thinks this approach can be made workable.

D44.5 _audit_contact_author_fax   *_phone (and other *_phone items)
-------------------------------------------------------------------
D>      These fields (particularly *_fax) should be numbers that could
D> be directly dialed by computer.  One could imagine a program that
D> allows a fax to be generated on line.  The fax could then be sent
D> directly by reference to this field.  In this case, all extraneous
D> parentheses should be omitted, the country code must be included. 
D> Presumably spaces might be allowed to assist in visual reading and
D> could be ignored by a program.  I would in any case favour only one
D> convention.  Otherwise we might as well abandon all attempts at
D> convention and allow people to use the format of their choice.  But
D> let us stick to our machine interpretable rule here and go for one
D> machine interpretable form.  The same is, of course, true for
D> email, but this should not present a problem.

The motive behind introducing the 12(345)678900 convention (which is that
of the World Directory) is to supply a more machine-readable format. The
example parses into three components: a country code, an area code, and a
local number. Recognition of these is useful (in the UK, an intRAnational
call would be made by dropping the international prefix, and prepending 0 to
the area code). Extension phone numbers are also recognised by the
non-numeric 'x' delimiter.

The trouble with enforcing only this format is that all existing CIFs now
become invalid, if one is strict about this. (Note that we don't have any
way to validate this convention, unlike the mmCIF case where an
_item_type_list.code could be defined for a phone/fax number. Indeed, with
extended regular expressions involving the 'OR' operator '|', both forms
could be permitted.)

So: should we enforce ONLY the original conventions? Only the new one?
BOTH? Some new one? or accept any attempt at an intelligible representation?
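
As an illustration of what the 12(345)678900 convention buys, the
following Python sketch splits a value into country code, area code,
local number and optional 'x' extension, and rebuilds the UK-style
national form described above. The regular expression, the tolerance of
embedded spaces and the function names are assumptions made for this
example; nothing here is prescribed by the dictionary.

```python
import re

# country code, (area code), local number, optional x<extension>
PHONE = re.compile(
    r"^(?P<country>\d+)\((?P<area>\d+)\)(?P<local>\d+)(?:x(?P<ext>\d+))?$"
)


def parse_phone(value):
    """Return the components of a phone/fax value, or None if it does not fit."""
    compact = value.replace(" ", "")  # assumption: spaces are only visual aids
    match = PHONE.match(compact)
    return match.groupdict() if match else None


def national_form(parts):
    """Drop the country code and prepend 0 to the area code (the UK example)."""
    return f"0{parts['area']} {parts['local']}"


parts = parse_phone("12(345)678900")
print(parts)
print(national_form(parts))
print(parse_phone("12(345)678900x123")["ext"])
print(parse_phone("(345) 678900"))  # no country code: rejected
```

A value that returns None here would be one of the "existing CIFs" that
a strict reading of the new convention would invalidate.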

D44.6 _cell_* and _cell_measurement_*
-------------------------------------
D>      It is a pity that these two different categories got
D> themselves defined, as I can see cases where one will want to give
D> a list of different cell parameters as a function of temperature or
D> pressure.  Normally one would want to include these in the same
D> loop.  To get around this difficulty we need to define
D>      _cell_id  and
D>      _cell_measurement_id
D> with the appropriate parent child flags (I guess _cell_id defines
D> the parent.)

Hmmm, yes. I think these are only 'pseudo-categories', in the sense that
they weren't assigned on the basis of relational tables, but rather under
the intuitive notion that _cell_[] things were a "property" of the cell,
_cell_measurement_[] of the cell measurement process (is that right Paula?
Do you remember?). The _cell_measurement_refln_[] grouping, however, does
form a separate category. I'll add the _cell_id and _cell_measurement_id
unless there is a strong feeling that the two should yet be recombined.

D44.7 _citation_journal_coden_ASTM  *_CAS
-----------------------------------------
D>      These two are the same.  They were originally defined by ASTM
D> but CAS took them over.  Strictly, the word CODEN refers to these
D> items, so the _ASTM or _CAS is redundant.  However, we are using
D> the word in a more general sense (though I suspect that CODEN may
D> be copyright), so we need to specify something extra.  Let's go for
D> _CAS since they are now the ones in charge.  Remove *_ASTM.

OK. But what about _database_journal_ASTM?

D44.8 _diffrn_ambient_environment
---------------------------------
D>      Air and vacuum are not the same.  The default should be air. 
D> Measurement in a vacuum should be specified.

OK.

D44.9 _diffrn_attenuator_scale
------------------------------
D>      Is this adequately defined?  I am not sure what number is
D> expected here; the amount by which an intensity must be multiplied
D> to get all measurements on the same scale or its inverse?  I guess
D> the enumeration of 1.0: gives the answer, but it would be better
D> if it were more clearly explained.

D44.10 _diffrn_measurement_details
----------------------------------
D>      The example is unfortunate.  All the items given in the
D> example should have their own separately defined data items.  I
D> suppose this field is intended to include comments such as:
D> ; The results may not be entirely reliable as the measurement was
D>   made during a heat wave when the air-conditioning had broken down.
D> ;

Yes, I agree that there should be an appropriate set of separately defined
data items, and I am still hoping that an expert on the various types of
current experimental equipment will supply these. In the meantime it may
be better to put this information in the *_details field rather than lose
it altogether. However, I'll defer to your suggestion and swap in your
suggested example.

D44.11 _diffrn_measurement_device_type
--------------------------------------
D>      Again the example:  What is a 3- or 4-circle camera?  Should
D> 'camera' not be 'diffractometer'?  Likewise kappa-geometry.
D> Do the new members of this group make _diffrn_measurement_device
D> obsolete?  Is there an advantage in breaking down this field into
D> three separate fields?  In no case is the data machine
D> interpretable, so I cannot see why we need *_details *_specific and
D> *_type.  I would vote to remove these three items from the core.

We are gradually developing the art of determining the appropriate level
of granularity or atomicity of data representation, and I think there are
many cases where the decisions are indeed arguable. Paula put these
datanames into mmCIF to allow a categorisation of machine parameters that
she felt appropriate to the mm community - it can make sense to be able to
extract the generic type of equipment from a database, even if the
quantities are not machine-interpretable and therefore there is no a
priori method of analysing this information. I have no strong feelings
either way, and shall act with the consensus.

D44.12 _diffrn_orientation_matrix_*
-----------------------------------
D>      This matrix is defined through the use of *_type since there is
D> no definition given in the cif dictionary.  This is a somewhat
D> unsatisfactory arrangement as it means a universal program cannot
D> be written to make use of this matrix, but the definition was left
D> open for a reason.  However, without a definition, the matrix is
D> meaningless.  Therefore the example given under
D> _diffrn_orient_matrix_[] MUST include also a
D> _diffrn_orientation_matrix_type item.

OK, I'll see what I can do. (The example CIF - a real one - doesn't
include a _diffrn_orientation_matrix_type item!)

D44.13 _diffrn_radiation
------------------------
D> I have some difficulties here.
D> 1. Again, what have we gained by introducing _specific, _details
D> and _type fields, except to ensure that all these aspects are
D> reported?
D> 2. More seriously, we are getting our nomenclatures confused. 
D> Originally we had _diffrn_radiation_*, but now we are introducing
D> terms like _diffrn_radiation_source_* and
D> _diffrn_radiation_detector_*.  If we are going to divide the field
D> into source and detector etc. we need to recognise that
D> *_monochromator may belong to either group (pre and/or post
D> monochromation might occur).  Likewise collimation is important in
D> both the entrance and the exit beams (see also
D> _diffrn_refln_detect_slit_* where this information appears to be
D> given).  As I mentioned before, *_power has to be 'num' with units
D> of W or kW.  If we need to define voltage, current etc. this must
D> be done in other data items.  Presumably different items are needed
D> for synchrotron and for x-ray tubes, gamma rays, neutrons,
D> electrons etc.  I recommend that we not rush these items.  Delete
D> all the proposed additions from the present version (both those
D> from powder and mm) and let us take our time over getting the
D> definitions well thought out.  Perhaps Paula and Brian T should be
D> alerted to this concern.

I shall delete the proposed additions from the core for now. (Note that I
see no problems with making relatively small incremental changes to the
dictionary henceforth. The major reason for the delay in getting to this
stage has been the structural concerns that we have now resolved.)

D44.13 _exptl_absorpt_correction_type
-------------------------------------
D>      Surely we should add 'psi-scan' to the enumeration list and,
D> if it were possible, delete 'empirical'.  Acta Cryst. should make
D> it clear that 'empirical' is not a satisfactory enumeration for
D> submitted cifs, but we need something that explicitly defines the
D> much used psi-scan.  Alternatively, we should restrict 'empirical'
D> to describe psi-scan corrections.

This one slipped into the corrections I made and posted last time. The new
suggested definition and enumeration reads:

    _name                      '_exptl_absorpt_correction_type'
    _category                    exptl
    _type                        char
    loop_ _enumeration
          _enumeration_detail    analytical  'analytical from crystal shape'
                                 cylinder    'cylindrical'
                                 empirical   'empirical from diffraction data'
                                 integration 'integration from crystal shape'
                                 none        'no absorption correction applied'
                                 psiscan     'psi-scan corrections'
                                 refdelf     'refined from delta-F'
                                 sphere      'spherical'
    _definition
;              The absorption correction type and method. The value 'empirical'
               should NOT be used unless no more detailed information is
               available.
;


D44.14 _exptl and _exptl_crystals
---------------------------------
D>      The definitions of these two categories are virtually
D> identical.  Shape, size and density are NOT included in the _exptl
D> category.

Oops. Will fix this.

D44.15 _geom_[]
---------------
D>      In the definition it is best to say 'Geometry data are
D> therefore USUALLY redundant'.  One can visualise cases where cell
D> constants or atomic coordinates are not included in a cif, since
D> the emphasis is entirely on geometry and _chemical_conn_*
D> information.  Acta Cryst is not the only recipient of cifs!

Done.

D44.16 _geom_angle_site_symmetry_*, *_bond_site_symmetry_*, etc.
----------------------------------------------------------------
D>      As you know, I am still unhappy about this hangover from the
D> old ways of highly compressed thinking.  At least we now provide
D> for a definition of _symmetry_equiv_pos_id and it should be pointed
D> out that the first number in the character string must correspond
D> to one of these values.  The definition of the remaining digits in
D> this string is still far from clear.  I would suggest something
D> like: 'The character string n_klm is composed as follows:
D>      ** n refers to the symmetry operation that is applied to the
D>      coordinates stored in _atom_site_fract_x, _atom_site_fract_y
D>      and _atom_site_fract_z.  It must match a number given in
D>      _symmetry_equiv_pos_id.
D>      ** k, l and m refer to the translations that are subsequently
D>      applied to the symmetry transformed coordinates to generate
D>      the atom used in calculating the angle (bond, contact).  These
D>      translations (x,y,z) are related to (k,l,m) by the relations
D>           k = 5 + x
D>           l = 5 + y
D>           m = 5 + z
D>      By adding 5 to the translations, the use of negative numbers
D>      is avoided.'

OK with everyone? I appreciate the concern to identify each transformation
as a specific data item, but I think that there are practical reasons for
bailing out at a certain level (as with dates, we don't supply separate
_year, _month and _day fields, though we could in principle). Our current
in-house bible, "The SGML Implementation Guide", says in a slightly
different context "A business should spend no time and money defining and
managing information in smaller chunks than they need to". (A by-product
of SGML and many other techniques of modern electronic publishing seems to
be that the rules of grammar are considered obsolete!) Given that the 
n_klm notation is already established by precedent, I vote we stick with
the status quo (but with the improved definitions).
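
For reference, the proposed definition is straightforward to implement.
The Python sketch below decodes and encodes an n_klm string exactly as
worded above (k = 5 + x, and so on); the function names are hypothetical,
and looking up operation n in the _symmetry_equiv_pos_id list is left to
the caller.

```python
def decode_site_symmetry(code):
    """Split 'n_klm' into (n, (x, y, z)), where x = k - 5, y = l - 5, z = m - 5."""
    n_text, _, klm = code.partition("_")
    if not n_text.isdigit() or len(klm) != 3 or not klm.isdigit():
        raise ValueError(f"not an n_klm code: {code!r}")
    return int(n_text), tuple(int(digit) - 5 for digit in klm)


def encode_site_symmetry(n, translation):
    """Build 'n_klm' from operation number n and integer translations (x, y, z)."""
    digits = [t + 5 for t in translation]
    if any(not 0 <= d <= 9 for d in digits):
        # Single digits only allow translations from -5 to +4.
        raise ValueError("translation outside the range -5..4")
    return f"{n}_" + "".join(str(d) for d in digits)


print(decode_site_symmetry("4_565"))
print(encode_site_symmetry(2, (-1, 0, 1)))
```

The range check makes one limit of the compressed form explicit: with a
single digit per axis, only translations from -5 to +4 cells can be
expressed.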

D44.17 _journal_[]
------------------
D>      Is there a reason for including the definitions for the new
D> items in this category, but not for the old items?  If these are
D> for private journal use, so that definitions are not needed, then
D> why are they needed for _journal_index_* items?  I suppose these
D> come under the category of reserved names (a concept that we have
D> been cautious about adopting anywhere else).  However, I suppose
D> that since the IUCr owns the standard it is OK to reserve names for
D> its own use, though I can see no reason why other journals that
D> accept cifs might not use these items as well.  I do not want to
D> raise fundamental questions of philosophy, but we should treat all
D> items in this category in the same way, unless there is a special
D> reason why people outside the office need to know the definitions. 
D> I propose that we drop the definitions and just add the datanames
D> to the existing list.

I added the definitions for the benefit of reviewers in Chester who had
not previously seen these. I agree that we should be consistent, but I'm
happy to be persuaded to add definitions for the other items if it's
thought useful to make them available for other journals (an approach I
would favour).

D44.18 _publ_section_*
----------------------
D>      Why do we need two datanames (for ascii text or word processor
D> text) when the text is included as a single block, but not when it
D> is included in sections?  

We have never needed to use the _publ_manuscript_ items, nor would I know
how to. I wish they would just go away. We do allow different formats
within the new _publ_body_ category (see _publ_body_format), where the
components of the paper are looped together; it would be cumbersome to do
this in the _publ_ category, where each dataname refers to a particular
document element. But I'll think a bit more about this.

D44.19 _publ_body_*
-------------------
D>      Should this category be mentioned under _publ_section_ and
D> _publ_manuscript_?  There seem to be so many ways of presenting the
D> text of a paper that some guidance would be in order.  Which format
D> should be used for which paper?  What is the preferred format (or
D> is this left to the journal to state)?  Are any of these items
D> obsolescent?

Yes, I'll cross-reference the appropriate categories. The journals should
stipulate in their respective Notes for Authors exactly which datanames to
use.

D44.20 _publ_manuscript_incl_[]
-------------------------------
D>      This section worries me.  We are asking it to do two unrelated
D> functions, which is usually a recipe for disaster.
D>      1. It flags items normally not included in the published
D> manuscript but which the author would like to see printed
D>      2. It allows the author to define ad-hoc data names to include
D> items that are not defined in cif.
D>      I would prefer to see these functions separated, but if this
D> is not deemed feasible, then the description should make this
D> double function quite explicit.  The examples are unfortunate. 
D> The first example shows exactly what one should NOT do, namely
D> define ad hoc names for items that have real names, and the second
D> seems redundant, since the function is performed by
D> _geom_hbond_publ_flag.  I propose that this category be restricted
D> to listing existing cif items that should be included in the
D> printed paper and that a different category, say
D> _publ_manuscript_undef_name be used for inclusion of other items
D> that are not defined in the cif but regarded as important enough to
D> be included in the printed paper.  The magnetic permeability is a
D> good example and authors of inorganic papers may wish to print the
D> bond valence sums around the atoms, an item for which no datanames
D> currently exist.

Strictly, the purpose of the _publ_manuscript_incl_ items is to add the
datanames listed to the request list that the journal employs for the
paper in question. The "*_info" and "*_defn" datanames are intended to
supply information to the curious, rather than introduce a bona fide DDL
definition. Pragmatically, we find that we have to define new formatting
macros for any new dataname not in the standard request list, irrespective
of whether it's in the dictionary or not. It's all something of a dirty
trick, I admit, but it's one that works superbly well for Acta, and I
wouldn't be happy to see any changes made to these data items.

D44.21 _refln_symmetry_epsilon
------------------------------
D>      The enumeration range should surely be 1:48, not 1:32

OK.

D44.22 _refln_symmetry_multiplicity
-----------------------------------
D>      Again, I think the enumeration range should be 1:48 though I
D> am not certain of this.  It is also the *magnitudes* of the
D> structure factors that are the same.  The structure factors
D> themselves may have different phases and are not necessarily the
D> same.

OK.


Regards
Brian