review of core CIF dictionary

  • To: Multiple recipients of list <coredmg@iucr.org>
  • Subject: review of core CIF dictionary
  • From: "I. David Brown" <idbrown@mcmail.cis.mcmaster.ca>
  • Date: Wed, 5 Jun 2002 16:42:48 +0100 (BST)
To members of the Core Dictionary Maintenance Group (DMG),

     First let me apologize for this long email.  It introduces a
proposal for a major review of the coreCIF dictionary which will
need the help of all members of the core Dictionary Maintenance
Group.  You are either a member of this group or a valued
consultant and we are anxious to hear your views and
recommendations.  Enclosed with this email is an agenda of items
that we need to consider.

     In making this review we should remember that the original
version of this dictionary was prepared when CIF was perceived as
being nothing more than a format for transferring and archiving
information about crystal structures.  The first version of this
dictionary was conceived as a printed guide for programmers and
it was only later that it evolved into a computer-readable
document in which the concept of categories with their keys and
links gave the dictionary a relational form.  Although changes
have been made over the years to keep it in line with the
changing perception of CIF, these have occurred in a piecemeal
fashion.  With major developments now underway that will convert
CIF dictionaries into self-contained computer-accessible
compendia of crystallographic knowledge, we need to carry out our
review with an eye to the future and prepare the core dictionary
so that it can take advantage of the changes that lie ahead.
These will introduce computer-readable algorithms and include
compact representations of vectors and tensors, but to be
effective these changes will require a strict adherence to the
dictionary definitions that give CIF its relational structure.
These features have not yet been systematically developed or
exploited by the Core dictionary.  This review gives us the
chance to prepare the dictionary to ease the transition to future
versions.

     Over the past few years a number of changes have been
suggested for the CIF core dictionary but only a few of them have
so far been adopted.  Some suggestions have come from
individuals, some from Acta Cryst. which has discovered items
that cannot be correctly given using the present version of the
dictionary, and some from the Cambridge Crystallographic Data
Centre which is working on ways to format the existing CSD
entries in CIF.  It is therefore a good time to undertake a
comprehensive review of the CIF_core dictionary.  If you have
other suggestions for changes, now is the time to bring them
forward.

     Brian has collected together all the suggestions that he has
received so far, along with the discussion of these items that
appeared on the Core DMG list server.  This document can be
viewed at:

     http://agate.iucr.org/cif/cif_core/revisions23.html

The items in this document (marked with (w) in the agenda) are
arranged in order of completeness - that is to say the first item
is a simple correction of a definition that does not need DMG
approval, then follow items that have already been approved,
followed by those that are fully documented and should be simple
to approve, all the way down to some general comments about
problems with no particular suggestions as to how they might be
resolved.  As full details of the discussion so far are available
on the website, I only summarize them below.

     The items on the agenda given below are arranged by category
and, apart from item 1, are listed in the order they appear in
the core dictionary.  I have given each a number so that it can
be easily referred to in the discussion.  I have marked with an
asterisk items that are relatively uncontroversial and could
possibly be fast-tracked.

     Please post your comments to the core list server
(coredmg@iucr.org).  If you make a substantive comment on a
particular topic, it is best to give it in its own email with the
item number in the subject line.  This will allow the list server
to establish a thread that will make the discussion easier to
follow.  Before you send any comments, I recommend that you check
the coredmg list server at

      http://www.iucr.org/iucr-top/lists/coredmg/

to see how this works.

     I would recommend that you download the file Brian has
prepared on the agate web page quoted above, together with the
current versions of the core and symmetry dictionaries.  These
are the documents that will be needed as we work on these
revisions.

     Please give some thought to the following queries and
suggestions and send your comments to the coredmg list server at
the address given above.  Suggestions for other changes are also
welcome.  Brian and I will organize your comments into dictionary
format in order to focus the discussion and help us develop a
consensus.

     I am sorry that we have such a long list of items to
consider, but there is an advantage in reviewing all of these at
the same time as it helps us to get a better overview and will
probably reveal some other changes that are needed.

          I look forward to seeing your comments.

                         Best wishes

                              David
                              Chair of the coreDMG

Agenda given below
*****************************************************
Dr.I.David Brown,  Professor Emeritus
Brockhouse Institute for Materials Research,
McMaster University, Hamilton, Ontario, Canada
Tel: 1-(905)-525-9140 ext 24710
Fax: 1-(905)-521-2773
idbrown@mcmaster.ca
*****************************************************

               AGENDA FOR CORE DICTIONARY REVIEW

SUMMARY LIST OF ITEMS POSTED FOR CONSIDERATION.
(Detailed descriptions follow)
-----------------------------------------------------------
* Item that can probably be fast-tracked
(w) Further details appear on CIF-core revisions at
        agate.iucr.org/cif/cif_core/revisions23.html.

1.(w)     A general problem.

2. ATOM_SITE and ATOM_TYPE categories
     2.1(w)    Rigid groups
     2.2(w)    Representation of disorder
     2.3  Scattering factors for multiple structure
          determinations
     2.4  _atom_type_scat_versus_stol_list
     2.5  _atom_sites_special_details
     2.6  _atom_site_fract_*
     2.7  Anharmonic atomic displacement parameters
     2.8  _atom_site_refinement_flags

4. CELL category
     4.1*(w)   Reciprocal cell
     4.2*(w)   Z'

5. CHEMICAL categories
     5.1(w)    Origins and properties of the sample.
     5.2(w)    Inclusion of peptide sequences (CCDC is preparing
               a report)
     5.3       Crystal properties.

6. CITATION categories
     6.1*(w)   REFCODES for citations

8. DATABASE categories
     8.1*(w)   Deposit numbers for CCDC (already approved)
     8.2*(w)   Database history

9. DIFFRN categories
     9.1(w)    Twins (a report is being prepared)
     9.2*(w)   Replacing theta_max by resolution
     9.3*(w)   _diffrn_source_takeoff_angle
     9.4(w)    _diffrn_orient_matrix
     9.5  Flag for systematic absences in diffrn_refln category
     9.6  _diffrn_source_target
     9.7  Rethinking _diffrn_standards_*

10. EXPTL categories
     10.1(w)   Sample history
     10.2      Provision for describing the shape of more than
               one crystal

11. GEOM categories
     11.1(w)   _geom_bond_multiplicity

13. PUBL categories
     13.1*(w)  Links to the World Directory

14. REFINE categories
     14.1(w)   _refine_ls_F_calc
     14.2(w)   _refine_ls_restrained_wR-factor_all
     14.3(w)   Twins (a report is being prepared)
     14.4      The default value of _refine_ls_extinction_method
     14.5(w)   _refine_ls_extinction_coef

17. SYMMETRY categories.
     17.1*     Replacement with the SPACE GROUP categories

------------------------------------------------------------
DETAILS OF THE ABOVE ITEMS
--------------------------

1(w). A general problem.
     Surveys of the Acta Cryst. and  the CCDC archives show that
it is often not possible to supply the exact numerical value
required by CIF for, e.g. temperatures and other parameters.
Sometimes there is only an upper or lower limit, a range of
values, or even just an approximate value.  The way to handle
this problem is probably to define new data items, e.g., _*_lt
and _*_gt (for less than and greater than) for the items in
question.  This would cover 'less than', 'greater than' and
'ranges', but does not address the problem of how one gives
approximate numbers when a standard uncertainty is not supplied.
It also does not address the problem of how one includes
qualitative information such as 'high' and 'low' which are found
in the CSD.  Perhaps the  text field _*_details could be used for
these cases since it is unlikely that any precise quantitative
use could be made of such approximate information.

     The items for which _lt and _gt (or one of them) could be
defined occur in a number of different categories and are
summarized here.  Some of the suggested items do not yet exist in
the dictionary.

Item                               Current name
--------------------------------   -----------------------------------
melting temperature                _chemical_melting_point
decomposition temperature
sublimation temperature
temperature of experiment          _exptl_crystal_density_measure_temp
                                   _diffrn_ambient_temperature
pressure of experiment             _diffrn_ambient_pressure
phase transition temperature
phase transition pressure
measured density                   _exptl_crystal_density_meas
shift/su                           _refine_ls_shift/esd_max
                                   _refine_ls_shift/esd_mean
decay of diffraction standards     _diffrn_standards_decay
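
As an illustration only, the following Python sketch shows how a
program might interpret an item that is reported either exactly or
through the proposed limit items.  The class name BoundedValue and its
fields are hypothetical; the _lt and _gt data names mentioned in the
comments do not yet exist in the dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedValue:
    """A quantity given exactly, or only through limits (hypothetical)."""
    value: Optional[float] = None  # e.g. _chemical_melting_point
    lt: Optional[float] = None     # e.g. a proposed _chemical_melting_point_lt
    gt: Optional[float] = None     # e.g. a proposed _chemical_melting_point_gt

    def describe(self) -> str:
        if self.value is not None:
            return f"exactly {self.value}"
        if self.gt is not None and self.lt is not None:
            return f"between {self.gt} and {self.lt}"
        if self.lt is not None:
            return f"less than {self.lt}"
        if self.gt is not None:
            return f"greater than {self.gt}"
        return "unknown"
```

A range is expressed by giving both limits.  Approximate and
qualitative values ('high', 'low') are not covered by this sketch, in
line with the limitations noted above.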

2. ATOM_SITE and ATOM_TYPE categories
     Several additions have been proposed here, and since these
two categories are closely related they are listed together.  The
solutions to the problems raised by several of these items may be
related.

2.1(w). A proposal for a method of defining rigid groups was
first presented in the modulated structure dictionary (msCIF),
but was thought to be of general interest and so was transferred
to the core DMG for our consideration.  There is a fairly
detailed discussion on the web but no consensus has yet been
reached on the best way to handle this.

2.2(w). Disorder
     At present disorder is handled somewhat inelegantly in the
core which makes any automatic manipulation awkward.  The problem
arises because the present version of CIF does not make the
distinction between a site in the crystal and the atoms that
occupy that site.  Ideally these should be given in two different
lists, though this would be cumbersome for the majority of
structures in which there is no disorder.

2.2a Occupational disorder.
     At present this is handled by giving all the elements that
occupy a given site identical coordinates and occupation numbers
that sum to 1.0 or less.  The more elegant solution is to define
the properties of the site in the atom_site loop and the
properties of the atoms that occupy the site in the atom_type
loop.  An example of such a loop would be

loop_
     _atom_type_id
     _atom_type_symbol
     _atom_type_element
     _atom_type_occupancy
     1    T1   Al   0.34(4)
     2    T1   Si   0.66(4)
     3    T2   Al   0.21(3)
     4    T2   Si   0.70(3)

_atom_type_id is the category key (a unique identifier for each
line).  It is not currently defined for this category, but category
keys should as a matter of principle be added in preparation for
the advanced applications that are being developed.
_atom_type_symbol is the child of _atom_site_label and links to
the atom_site category.  This item is already in the dictionary.
A more logical name that follows current naming conventions would
be _atom_type_site_label.
_atom_type_element is not currently defined but probably should
be a recognized element symbol.
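
One benefit of such a loop is that the occupancy constraint becomes
easy to check by program.  The Python sketch below is a hedged
illustration: it hard-codes the four rows of the example loop above
(rather than reading a CIF) and checks that the occupancies on each
site sum to 1.0 or less.  The helper names strip_su and site_totals
are hypothetical.

```python
import re
from collections import defaultdict

# Rows of the example loop: (id, site label, element, occupancy)
rows = [
    ("1", "T1", "Al", "0.34(4)"),
    ("2", "T1", "Si", "0.66(4)"),
    ("3", "T2", "Al", "0.21(3)"),
    ("4", "T2", "Si", "0.70(3)"),
]


def strip_su(text: str) -> float:
    """Return the numeric part of a CIF number, dropping a trailing (su)."""
    return float(re.sub(r"\(\d+\)$", "", text))


def site_totals(rows):
    """Sum the occupancies of all atoms sharing a site label."""
    totals = defaultdict(float)
    for _id, site, _element, occupancy in rows:
        totals[site] += strip_su(occupancy)
    return dict(totals)


totals = site_totals(rows)
# Sites whose summed occupancy exceeds 1.0 (with a small tolerance)
overfull = [site for site, total in totals.items() if total > 1.0 + 1e-6]
```

The standard uncertainties are ignored here; a fuller check would
compare the sum with 1.0 within the combined uncertainty.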

2.2b Displacive disorder
     Displacive disorder is currently handled by two items:
     _atom_site_disorder_group
     _atom_site_disorder_assembly
The intent of these two items (whose definition in the current
dictionary could probably be improved) is to link together groups
of atoms whose disorder is correlated, e.g., a phenyl group that
may occur in two different disordered orientations.  The atoms
belonging to one orientation are assigned, say, to _*_group 1 and
those belonging to the other orientation are assigned to _*_group
2.  All sites in the same group are simultaneously occupied and
have the same occupation number.  Adjacent sites in different
groups cannot be simultaneously occupied.  _*_assembly is used
if, for example, two different phenyl groups are disordered.
Each would be assigned a different value of _*_assembly, so that
the restrictions on simultaneous occupation only apply to groups
in the same assembly.  We could use a good example.  What happens
if these groups are rigid groups as presumably frequently
happens?  Can we combine a rigid group description and disordered
group description into a single category?  See also item 2.1.
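
To make the intended logic concrete, here is a minimal Python sketch
of the occupation rule as I read it from the description above: two
sites exclude one another only if they belong to different groups of
the same assembly.  The Site tuple, the function may_coexist and the
atom labels are hypothetical, and the sketch ignores the qualification
that the sites be adjacent.

```python
from typing import NamedTuple, Optional


class Site(NamedTuple):
    label: str
    assembly: Optional[str]  # _atom_site_disorder_assembly (None if '.')
    group: Optional[str]     # _atom_site_disorder_group (None if '.')


def may_coexist(a: Site, b: Site) -> bool:
    """True unless a and b lie in different groups of one assembly."""
    if a.group is None or b.group is None:
        return True   # at least one site is not disordered
    if a.assembly != b.assembly:
        return True   # restrictions apply only within one assembly
    return a.group == b.group


# Two disordered phenyl groups (assemblies A and B), each with two
# orientations (groups 1 and 2), plus one ordered atom.
c1a = Site("C1A", "A", "1")
c1b = Site("C1B", "A", "2")
c7a = Site("C7A", "B", "1")
n1 = Site("N1", None, None)
```

Here may_coexist(c1a, c1b) is false because the two sites belong to
alternative orientations of the same phenyl group, while c1a and c7a
may both be occupied because they belong to different assemblies.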

2.3 Brian Toby has raised a problem with the ATOM_TYPE category
suggesting that ATOM_TYPE_SCAT should be defined as a separate
category.  These items in this new category would give details of
the scattering  factors which depend on the radiation and
wavelength used.  In multi-radiation or multi-wavelength
experiments the scattering factors may need to be looped and
keyed to structure factors measured with different radiations.
It then makes no sense to include these in the same loop that
describes structure-related properties such as _*_number_in_cell
and _*_oxidation_state. (See 2.4).

2.4 _atom_type_scat_versus_stol_list was introduced to allow a
list of scattering factors as a function of sin(theta)/lambda to be
included as a loop.  However, since nested loops are not allowed,
this loop was given in the form of a text field that needed to be
parsed in order to be accessible to the computer.  This is
inelegant, but could be rectified by introducing a new category
keyed back to the ATOM_TYPE category, e.g.

loop_
     _atom_scat_factor_type_symbol #matches _atom_type_symbol
     _atom_scat_factor_stol
     _atom_scat_factor_scat_factor
     S    0         16.0
     S    0.01      15.3
     S    0.03      14.8
# data values omitted for brevity
     V    0         23.0
     V    0.015     22.1
     V    0.03      21.5
# list truncated for brevity

By including _atom_scat_factor_radiation_code, it might also be
possible to meet the requirements of Brian Toby described in 2.3.
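
With the table held as a loop rather than a text field, a program can
use it directly.  The Python sketch below is an illustration under
stated assumptions: it hard-codes the truncated rows of the example
loop above, groups them by type symbol, and interpolates linearly
between tabulated points.  Linear interpolation and the function names
build_tables and scat_factor are my own choices, not part of any
proposal.

```python
from bisect import bisect_left
from collections import defaultdict

# (type symbol, sin(theta)/lambda, scattering factor) from the truncated
# example loop; a real table would hold many more points.
rows = [
    ("S", 0.0, 16.0), ("S", 0.01, 15.3), ("S", 0.03, 14.8),
    ("V", 0.0, 23.0), ("V", 0.015, 22.1), ("V", 0.03, 21.5),
]


def build_tables(rows):
    """Group the rows by type symbol, sorted by sin(theta)/lambda."""
    tables = defaultdict(list)
    for symbol, stol, factor in rows:
        tables[symbol].append((stol, factor))
    return {symbol: sorted(points) for symbol, points in tables.items()}


def scat_factor(tables, symbol, stol):
    """Interpolate linearly inside the tabulated range (an assumption)."""
    points = tables[symbol]
    xs = [x for x, _ in points]
    if not xs[0] <= stol <= xs[-1]:
        raise ValueError("stol outside tabulated range")
    i = bisect_left(xs, stol)
    if xs[i] == stol:
        return points[i][1]
    (x0, f0), (x1, f1) = points[i - 1], points[i]
    return f0 + (f1 - f0) * (stol - x0) / (x1 - x0)


tables = build_tables(rows)
```

A radiation code, if added to the loop as suggested above, would
simply become part of the key used to group the rows.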

2.5  Is there a need for
     _atom_sites_special_details
for including a general discussion of, say, disorder, or the
effects of twinning on the coordinates given?

2.6 The default value of _atom_site_fract_* is 0.0, which does
not make much sense.  If the coordinates are not given they are,
presumably, unknown.  As the definitions now stand, coordinates
must be given the explicit value of '?' or '.' if they are not
known, otherwise all atoms are assumed to lie at the origin!
Should we remove the default?
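
The consequence of the default can be shown with a small Python
sketch.  It assumes a much-simplified, hypothetical model of a
dictionary-driven reader in which an omitted item takes the
dictionary default, while an explicit '?' or '.' is reported as
unknown; the names DEFAULTS and lookup are illustrative only.

```python
# Hypothetical, much-simplified dictionary: data name -> default value
DEFAULTS = {
    "_atom_site_fract_x": 0.0,
    "_atom_site_fract_y": 0.0,
    "_atom_site_fract_z": 0.0,
}

UNKNOWN = {"?", "."}


def lookup(record: dict, name: str):
    """Return the value a dictionary-driven reader would see for name."""
    if name in record:
        raw = record[name]
        return None if raw in UNKNOWN else float(raw)
    return DEFAULTS.get(name)  # item omitted: fall back to the default


omitted = {"_atom_site_label": "H1"}
explicit = {"_atom_site_label": "H1", "_atom_site_fract_x": "?"}
```

With the default in place, lookup(omitted, "_atom_site_fract_x")
yields 0.0, i.e. the atom is placed at the origin, whereas the
explicit '?' yields None.  Removing the default would make both cases
return None.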

2.7  Do we need to make provision for including anharmonic atomic
displacement parameters?

2.8 _atom_site_refinement_flags has seven enumerated values, all
single letters, but unlike all other enumerated flags, these
values can be concatenated.  A normal dictionary-driven check of
this field would fail if it encounters more than one letter in
the string.  We need either to include all reasonable
combinations of the 7 characters in the enumeration list or treat
this item as free text.  Alternatively, we could replace this
item by three new items, one referring to the refinement of
positions (four allowed letters), another to occupation (one
allowed letter) and the third to atomic displacement parameters
(two allowed letters).  The enumeration lists could then include
all reasonable combinations of flags for each of the three items.
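
The difference between the checks can be sketched in Python as
follows.  The flag letters are those I recall from the current
enumeration list (S, G, R, D for positions, T and U for displacement
parameters, P for occupancy) and should be verified against the
dictionary; the function names and the keys returned by split_flags
are hypothetical.

```python
POSITION_FLAGS = set("SGRD")
ADP_FLAGS = set("TU")
OCCUPANCY_FLAGS = set("P")
ALL_FLAGS = POSITION_FLAGS | ADP_FLAGS | OCCUPANCY_FLAGS


def strict_enum_check(value: str) -> bool:
    """A plain enumeration check: the whole string must be a listed value."""
    return value in ALL_FLAGS


def concatenated_check(value: str) -> bool:
    """Accept any non-empty string of distinct flag letters."""
    return (bool(value)
            and len(set(value)) == len(value)
            and set(value) <= ALL_FLAGS)


def split_flags(value: str) -> dict:
    """Split a concatenated string into the three proposed items."""
    return {
        "posn": "".join(c for c in value if c in POSITION_FLAGS),
        "adp": "".join(c for c in value if c in ADP_FLAGS),
        "occupancy": "".join(c for c in value if c in OCCUPANCY_FLAGS),
    }
```

A value such as SU fails strict_enum_check but passes
concatenated_check, which is the failure described above.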

3. AUDIT categories
     No changes proposed.

4. CELL category
4.1*(w)  CCDC request that we define the reciprocal cell.  We
could add 'reciprocal' after 'cell' in all the relevant names
e.g., _cell_reciprocal_angle_alpha or use a name such as
_cell_angle_alpha* in line with current usage (except that * is by
convention used as a wildcard character).  In any case the
definitions should be straightforward.  I assume we would also
need to include the reciprocal cell volume.  What would be the
best form of the dataname?
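
For reference, the quantities in question follow from the direct cell
by the standard textbook relations, e.g. a* = b c sin(alpha) / V and
cos(alpha*) = (cos(beta) cos(gamma) - cos(alpha)) / (sin(beta)
sin(gamma)), with V* = 1/V.  The Python sketch below evaluates them;
the function name reciprocal_cell is hypothetical, and the convention
without a factor of 2 pi is assumed.

```python
import math


def reciprocal_cell(a, b, c, alpha, beta, gamma):
    """Return (a*, b*, c*, alpha*, beta*, gamma*, V*) for a direct cell.

    Lengths are in angstroms and angles in degrees; reciprocal lengths
    are in reciprocal angstroms.
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sa, sb, sg = (math.sin(math.radians(x)) for x in (alpha, beta, gamma))
    volume = a * b * c * math.sqrt(
        1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    a_star = b * c * sa / volume
    b_star = a * c * sb / volume
    c_star = a * b * sg / volume
    alpha_star = math.degrees(math.acos((cb * cg - ca) / (sb * sg)))
    beta_star = math.degrees(math.acos((ca * cg - cb) / (sa * sg)))
    gamma_star = math.degrees(math.acos((ca * cb - cg) / (sa * sb)))
    return (a_star, b_star, c_star,
            alpha_star, beta_star, gamma_star, 1.0 / volume)
```

Since every value is derivable from the direct cell, a definition in
the dictionary would mainly fix the names and the units.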

4.2*(w)  CCDC has also requested an item for Z', the number of
formula units in the asymmetric unit. This item is used to
identify the number of molecules in the structure that are not
related by symmetry.   _cell_formula_units_Z' could be defined as
(_cell_formula_units_Z)/(multiplicity of the general position).
Would this definition fail under some circumstances such as when
two independent centrosymmetric molecules are found in P-1?
Would the prime after Z be too inconspicuous a character when the
CIF is printed?  Would it be best to name this
_cell_formula_units_Z_prime?
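
The proposed definition is simple to evaluate, as the Python sketch
below shows; the function name z_prime is hypothetical and a Fraction
is used so that values such as 1/2 are kept exact.

```python
from fractions import Fraction


def z_prime(z: int, general_multiplicity: int) -> Fraction:
    """Z' as Z divided by the multiplicity of the general position."""
    return Fraction(z, general_multiplicity)


# One molecule on a general position in P2(1)/c: Z = 4, multiplicity 4
general = z_prime(4, 4)

# Two independent centrosymmetric molecules in P-1, each lying on an
# inversion centre: Z = 2, multiplicity 2
two_on_centres = z_prime(2, 2)
```

Both cases give Z' = 1, although the second contains two
symmetry-independent molecules.  If that reading of the P-1 case is
correct, the formula is well defined there but does not by itself
report the number of independent molecules.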

5. CHEMICAL categories
     This category is used to describe the chemical properties of
the sample, but some of these properties are included in the
EXPTL category as they are regarded as part of the structure
determination or specific to the sample being studied.

5.1(w) The CCDC has identified a need for including more
information about the origins and properties of the sample.  They
propose additional names:
     _chemical_compound_source_recrystallization (or
_*_recrystallisation - do we have conventions on spelling?).
This might be better included under _exptl since it refers to the
preparation of the sample studied, see 10.1 below.
     _chemical_properties_physical
     _chemical_properties_biological
These would be text fields that would allow for descriptive
comments.

5.2(w) CCDC have also identified a need to include peptide
sequences for some of their polypeptides.  They are currently
working on a scheme compatible with mmCIF for including this
information.  Discussion on this topic should be deferred until
we receive their report.

5.3 What we do not seem to have is a category that describes
specifically the properties of the crystal such as the phase
(though there is a _chemical_name_structure_type which could
include terms like perovskite, NaCl, etc.)  Some properties that
are related to the crystal rather than the compound, such as
refractive index and optical activity, perhaps need their own
category.  The density is strictly a crystal property (different
phases can have different densities) but is given in the exptl
category, i.e. it is not treated as a chemical or crystal
property but as part of the structure determination, even though
few people routinely measure it (see 10).

6. CITATION categories
     This category is used for including references in the CIF to
other work, principally references to journal articles.

6.1*(w) A proposal has been made for an item
_citation_database_id_CSD which would contain the REFCODE of a
CSD entry that was being cited (not the REFCODE of the crystal
described in the CIF which is given in _database_code_CSD).  The
precedent is _citation_database_id_medline.  Should we add
additional database_ids at the same time, e.g. to PDF, PDB, ICSD
etc.?

7. COMPUTING categories
     No changes proposed.

8. DATABASE categories
     This category is used for giving codes for database entries
and the codens used by the databases for journal names, but it
could easily be extended to include other database-related items.

8.1*(w) Recently the CCDC has received approval for a couple of
additional codes to give deposit numbers to entries in some of
their archival files that are not part of the main database, e.g.
_database_code_depnum_ccdc_journal for entries passed to CCDC by
journals before they are processed into the main database.  Since
this change has been approved, no further discussion is needed.

8.2*(w) This category would also seem to be the best place for
including information on changes that have been made in an entry
by a database prior to the production of an output CIF.  One
possibility is to introduce items such as _database_CSD_audit or
_database_CSD_history to record these changes.

9. DIFFRN categories
     Several changes are proposed for this category.

9.1(w) There is a proposal to introduce items for describing
twins, partly in this category and partly in REFINE (see 14.3).
This suggestion is currently being pursued by a special
subcommittee.  We should await their report.

9.2*(w) The suggestion is made that the items:
     _diffrn_measured_fraction_theta_full
     _diffrn_measured_fraction_theta_max
     _diffrn_reflns_theta_full
     _diffrn_reflns_theta_max
should be replaced by:
     _diffrn_reflns_measured_fraction_resolution_full
     _diffrn_reflns_measured_fraction_resolution_max
     _diffrn_reflns_resolution_full
     _diffrn_reflns_resolution_max
This would place all four items in the same category (which is
logical) and would replace theta (which depends on the
wavelength) with the resolution in Angstroms (which does not).
This is in line with other definitions in CIF.
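
The conversion between the two quantities is Bragg's law,
d = lambda / (2 sin(theta)).  The Python sketch below is given only as
an illustration; the function names are hypothetical and the
wavelengths used are the usual Mo K-alpha and Cu K-alpha values.

```python
import math


def resolution_from_theta(theta_deg: float, wavelength: float) -> float:
    """Resolution d in angstroms for a Bragg angle theta in degrees."""
    return wavelength / (2.0 * math.sin(math.radians(theta_deg)))


def theta_from_resolution(d: float, wavelength: float) -> float:
    """Bragg angle theta in degrees for a resolution d in angstroms."""
    return math.degrees(math.asin(wavelength / (2.0 * d)))


MO_K_ALPHA = 0.71073  # angstroms
CU_K_ALPHA = 1.54184  # angstroms

# The same resolution corresponds to different theta for each radiation
theta_mo = theta_from_resolution(0.84, MO_K_ALPHA)
theta_cu = theta_from_resolution(0.84, CU_K_ALPHA)
```

The two angles differ by more than a factor of two for the same
resolution, which is why a wavelength-independent item is easier to
compare between experiments.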

9.3*(w) The powder diffraction community has suggested that
the following item might be useful:
     _diffrn_source_takeoff_angle

9.4(w) The item:
     _diffrn_orient_matrix
depends on the particular diffraction geometry used and is
defined by the user in _diffrn_orient_matrix_type.  This runs
counter to our philosophy that no item should depend on a second
item to determine its meaning (i.e. all items are insensitive to
their context).  In some of the recent dictionaries considerable
care has been taken to define an axis system that does not depend
on a particular diffractometer or its orientation.  For example,
coordinate axes can be chosen along the incident beam and the
normal to the plane of diffraction, but even this basic
definition runs into problems in the case of area detectors.  Can
we improve the definition used in the core dictionary?

9.5 Is a flag needed for systematic absences in the DIFFRN_REFLN
category?  Since _diffrn_reflns_number excludes the systematic
absences, a computer can only check that the sum is correct if it
can identify the systematically absent reflections.  There is
such a flag in the REFLN category but the original definitions
were based on the notion that systematic absences are a property
of the structural model being refined (hence of the calculated
structure factors) and not a property of the diffraction
measurements (observed structure factors).

9.6 Details of the radiation used are given in the category
DIFFRN_RADIATION, except for  _diffrn_source_target.  Is this
arrangement appropriate?  The problem arises if a crystal is
studied using different types of radiation.

9.7 With the growing use of area detectors do we need to rethink
the _diffrn_standards_* items by adding items suitable for area
detector measurements?

10. EXPTL categories
     This is where various properties of the studied sample are
presented.  There is some overlap with the CHEMICAL categories
(see 5 above).

10.1(w)  Do we need to give details of the sample history,
especially for inorganic materials where heat treatment, annealing
in different atmospheres, etc. are important?  This might also be
the best place to describe the growth of the single crystal
specimen used in the study (see 5.1).

10.2  Although the CIF dictionaries envision the possibility of
reporting measurements on more than one crystal, there is only
provision for describing the faces of one of them.  Should we
define an item _exptl_crystal_face_crystal_id which would be a
child of _exptl_crystal_id, allowing the faces of more than one
crystal to be reported?

11. GEOM categories

11.1(w) The current geom definitions allow all the bonds around a
given atom to be listed with their symmetry operations, but Acta
Cryst. only prints those in the asymmetric set, i.e., bonds
related to others by symmetry are not printed.  Thus for NaCl
only one of the six Na-Cl bonds is printed, but it is useful to
know in these cases how many bonds of this type there are.  The
proposal is for an item:
     _geom_bond_multiplicity
This could be given as an alternative to listing all the bonds,
or it could be defined only for the bond that is flagged for
printing with the choice left up to the user.  This problem
occurs frequently in inorganic compounds.  Would a similar item
be useful for angles?  I don't know what kind of fix Acta Cryst.
currently uses when setting up for print.

12. JOURNAL categories
No changes proposed

13. PUBL categories
13.1*(w) Acta Cryst. is proposing two items which allow them to
connect the authors of papers with their entry in the World
Directory via an author id number:
     _publ_contact_author_id_iucr
     _publ_author_id_iucr

14. REFINE categories
14.1(w) The following items have also been transferred from the
msCIF dictionary as being of more general interest:
     _refine_ls_F_calc_accuracy
     _refine_ls_F_calc_details
     _refine_ls_F_calc_formula
The last item gives the analytical expression used to calculate
the structure factors (presumably when this is not standard) with
further details given in _*_details leading to an estimate of the
accuracy of the calculated Fs.  Presumably these items would only
be used in special circumstances.  Full dictionary descriptions
are given on the web.

14.2(w) Also from the msCIF dictionary, with full dictionary
descriptions given on the web, are:
     _refine_ls_restrained_wR-factor_all
     _refine_ls_restraints_weighting_scheme
The intent is to define an R factor for restrained refinements.
There is some discussion of the proposed expression on the web.

14.3(w) There is a proposal that CIF should be able to describe
twins with new items in DIFFRN (see 9.1) and REFINE.  This item
is currently being reviewed by a special committee.  We should
wait for their report.

14.4 The default value of _refine_ls_extinction_method is
'Zachariasen'.  This default not only makes presumptions about
the standard type of extinction correction (which is
inappropriate), but it also assumes that an extinction correction
was made if no extinction is reported!  Strictly, the absence of
an extinction correction can only be signalled by setting this
value explicitly to '.'  Omitting this item from a CIF implies
that a Zachariasen correction was made.  Should we remove this
default?

14.5(w)  At present _refine_ls_extinction_coef is a catch-all for
the parameter refined for any type of extinction correction.  Its
value only has meaning in the context of the value of
_refine_ls_extinction_method.  Since CIFs are supposed to be
context independent (i.e. the meaning of an item does not depend
on the value given to any other item), we should define different
names for each of the coefficients determined by different
methods.

15. REFLN categories
No changes proposed

16. REFLNS categories
No changes proposed

17. SYMMETRY categories
17.1*  The symmetry CIF dictionary approved by COMCIFS last
December provides a much more carefully thought out set of
symmetry definitions than is available in the current
symmetry categories.  This dictionary can be found on the IUCr
web site.  It contains three categories:
     SPACE_GROUP
     SPACE_GROUP_SYMOP
     SPACE_GROUP_WYCKOFF
The recommendation is that the current symmetry categories, which
have some conceptual weaknesses, should be replaced by the new
SPACE_GROUP categories.  The question is: do we wish to include
the whole of the symmetry CIF dictionary in the core or only a
subset of items, and if so, which?

