(49) Remaining issues: block id, disorder, diffrn

To: [email protected]
Subject: (49) Remaining issues: block id, disorder, diffrn
From: bm
Date: Wed, 25 Sep 1996 14:00:08 +0100

Dear Colleagues

I shall send out in an accompanying email the latest working draft of the 
extended core dictionary, in which I have tried to pull together all the
currently open threads. There are one or two areas where I am still not
happy (which I discuss below); but I would like you to consider the draft as
complete except for these areas, and study it with a view to giving your
formal approval. What I intend to do is send another circular next week,
summarising any discussions on the problems that are still unresolved, and
addressing any structural problems that have come to light in the interim.
I shall then ask you to submit your formal vote of approval at the end of
next week; so it is important that we resolve all current problems
to our satisfaction in that time.

Here are my remaining reservations:

D44.2 Disorder
--------------
I promised that I would have some discussions with John Davies on the
_atom_site_disorder_assembly and *_group items, and I have now done so.
Recall the intent behind these items (and here I correct a slip of the
keyboard in the example I posted in circular 44: the *_group and *_assembly
lines in the loop header should be reversed, so that the example is now as
given below). Where a disordered cluster of atoms (identified by a unique
value of _atom_site_disorder_assembly) can be represented by two (or more)
alternative conformations, the members of each separate conformation are
assigned the same value of _atom_site_disorder_group.

loop_ _atom_site_label                  # *_assembly 'M' is a disordered methyl
      _atom_site_occupancy              # with configurations 'A' and 'B':
      _atom_site_disorder_assembly      #
      _atom_site_disorder_group         #    H11B    H11A      H13B
                                        #      .      |      .
   C1     1      .       .              #        .    |    .
   H11A   .5     M       A              #          .  |  .
   H12A   .5     M       A              #             C1 --------C2---
   H13A   .5     M       A              #           / .  \
   H11B   .5     M       B              #         /   .    \
   H12B   .5     M       B              #       /     .      \
   H13B   .5     M       B              #    H12A    H12B    H13A
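
As an illustration of how an application might use the two codes, here is a
minimal Python sketch that collects the alternative conformations of each
assembly from the loop above. The row tuples and the function name
conformations() are assumptions made for this example only; they are not part
of the dictionary or of any existing CIF library.

```python
from collections import defaultdict

# Rows of the atom_site loop above: (label, occupancy, assembly, group).
# '.' is the CIF "not applicable" value, used here for the ordered atom C1.
ATOM_SITES = [
    ("C1",   1.0, ".", "."),
    ("H11A", 0.5, "M", "A"),
    ("H12A", 0.5, "M", "A"),
    ("H13A", 0.5, "M", "A"),
    ("H11B", 0.5, "M", "B"),
    ("H12B", 0.5, "M", "B"),
    ("H13B", 0.5, "M", "B"),
]


def conformations(sites):
    """Map each disorder assembly to its groups and their member labels."""
    result = defaultdict(lambda: defaultdict(list))
    for label, _occupancy, assembly, group in sites:
        if assembly == "." or group == ".":
            continue  # ordered site: belongs to no disordered assembly
        result[assembly][group].append(label)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {a: dict(groups) for a, groups in result.items()}


for assembly, groups in conformations(ATOM_SITES).items():
    for group, labels in groups.items():
        print(assembly, group, labels)
```

Sites sharing a *_group value within one assembly are taken together as one
conformation; the ordered atom C1 is skipped because it carries '.' in both
columns.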

John points out that there are two approaches to describing disorder in the
crystal - that of the crystallographer and that of the chemist. The
crystallographer wishes to interpret a density map, and ends up with a list
of sites in the unit cell that are occupied by clumps of electron density.
He identifies them with atoms of one or more chemical species (a particular
site can be occupied by silver or silicon atoms in a non-periodic fashion);
and with complete or partial occupancy. So long as he can identify all
populated sites in this way, he is content. The chemist wishes to identify
sensibly-bonded chemical entities in the cell, and so will interpret the
contents as a superposition of distinct moieties, each such moiety having
some occupancy.

Consider the following hypothetical molecule.  In the diagram below,
suppose the group of atoms labelled as 1-2-3-4 is sometimes disordered
                                to the conformation 1-2A-3A-4,
                    3B          and sometimes (but less frequently)
                 /     \        to 1-2-3B-4. Atoms 1 and 4 have 100% occupancy,
                /       \       atoms 2, 2A and 3A have 50% occupancy, atom 3
               /         \      40% occupancy and atom 3B 10% occupancy (so
    ---1------2-----3-----4---  that the 1-2-3B-4 conformation occurs 10%
        \                /      of the time, and 1-2-3-4 40% of the time).
         \              /       This can be described in a CIF with the data
           2A ------ 3A         names that refer to the positions and
                                occupancies, and one can even add detailed
text descriptions to help a human reader understand what is going on:

loop_
_atom_site_label
_atom_site_occupancy
_atom_site_description
1    1.0     'ordered site, forms with 4 the static nodes of a disordered chain'
2    0.5     'site in disordered chain 1-2-3-4, bonding to 3 or 3B'
3    0.4     'site in disordered chain 1-2-3-4, split with site 3B'
4    1.0     'ordered site, forms with 1 the static nodes of a disordered chain'
2A   0.5     ...etc
3A   0.5
3B   0.1

but _atom_site_description is not of much use to a software application
seeking to extract the contents of one moiety.

The chemist's problem is resolved by labelling the three distinct moieties.
In CIF we could set up another table which encodes the following information:

 Name of moiety    Contents
    A              1, 2, 3, 4 (+other atoms in the ordered part of the molecule)
    B              1, 2A, 3A, 4   "
    C              1, 2, 3B, 4    "

One could then go on to build descriptions of the moieties and their
relative abundance:

loop_
_moiety_name
_moiety_occupancy
_moiety_description
  A  .4     'disordered conformation of main compound'
  B  .5     'disordered conformation of main compound'
  C  .1     'disordered conformation of main compound'
  D  1.0    'solvate'

Because of the structure of CIF (which isn't very tolerant of an indeterminate
number of values for an item), the "name of moiety" table would be rather
verbose, but presumably easy enough to write from an application that allowed
the user to input chemical knowledge in disentangling the disordered components.

In principle one could carry back into the _atom_site_ list the details of
which moiety or moieties each occupied site was associated with (that is,
instead of building a "name of moiety"/"contents" table), but again there
are some technical problems in doing this in CIF, related to the fact that
each atom may belong to several chemical moieties. In other words, a table
containing the entries

loop_
_atom_site_label
_atom_site_occupancy
_atom_site_associated_moiety
1    1.0     ABC
2    0.5     AC
3    0.4     A
4    1.0     ABC
2A   0.5     B
3A   0.5     B
3B   0.1     C

COULD be interpreted in such a way as to give you a description of molecule
A (or B, or C...), though this runs counter to our distaste for having
separate parsing rules for separate data names.
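
To make the point concrete, the following Python sketch shows the kind of
item-specific parsing rule such a table would demand: each character of the
_atom_site_associated_moiety value is treated as one moiety name. The
dictionaries and function names are assumptions for illustration only. The
sketch also checks an observation that holds for this particular example,
namely that each site occupancy equals the sum of the occupancies of the
moieties containing that site (e.g. atom 2: 0.4 + 0.1 = 0.5).

```python
# Values copied from the two tables above.
SITES = {  # label: (site occupancy, associated moiety codes)
    "1": (1.0, "ABC"), "2": (0.5, "AC"), "3": (0.4, "A"), "4": (1.0, "ABC"),
    "2A": (0.5, "B"), "3A": (0.5, "B"), "3B": (0.1, "C"),
}
MOIETY_OCCUPANCY = {"A": 0.4, "B": 0.5, "C": 0.1}


def moiety_contents(sites):
    """Return {moiety name: [site labels]}, one character per moiety name."""
    contents = {}
    for label, (_occupancy, codes) in sites.items():
        for code in codes:  # the special rule: split the value by character
            contents.setdefault(code, []).append(label)
    return contents


def inconsistent_sites(sites, moiety_occupancy, tol=1e-6):
    """List sites whose occupancy differs from the sum over their moieties."""
    bad = []
    for label, (occupancy, codes) in sites.items():
        total = sum(moiety_occupancy[code] for code in codes)
        if abs(total - occupancy) > tol:
            bad.append(label)
    return bad


print(moiety_contents(SITES))
print(inconsistent_sites(SITES, MOIETY_OCCUPANCY))
```

The character-splitting step is exactly the sort of rule that a generic CIF
parser would not know about, which is the objection raised above.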

I note that the mmCIF dictionary allows a much richer description of the
structural elements associated with the results of the refinement, and can
describe an ensemble of disordered fragments. However, I'm not sure that
it's appropriate to bring the full richness of that description to bear on
small molecules.

Another point that needed clarification was the use of negative values of
the *_group code. This arises from the practice in SHELXL of controlling the
connectivity by grouping disordered atoms into different "PART"s. In
general, each PART number will map to a value of _atom_site_disorder_group.
However, the mapping can be one-to-many, for the PARTs are not intended as
descriptions of the disordered cluster per se, but are rather clusters of
atoms in the refinement that obey different constraints designed to model
the effects of disorder. Hence, automatic bond generation is inhibited
between atoms with different PART numbers, unless one of them is 0 (this is
the value associated with the main ordered fragment of the molecule).
Now, if you imagine a group disordered on a symmetry element, the group
can be described by a set of atoms all belonging to the same PART (because
the disordered positions will be generated when the symmetry element is
applied); but a connectivity algorithm runs the risk of building spurious
bonds between the atoms listed in the PART and their symmetry-generated
mates. SHELXL adopts the convention that such PARTs should be given a
negative PART number, as a flag to switch off special position constraints
and bonds to symmetry-generated atoms.
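
The bonding rules just described can be summarised in a short Python sketch.
It encodes only the rules as stated in this paragraph and makes no claim to
reproduce SHELXL's actual implementation; the function name and its
parameters are hypothetical.

```python
def may_generate_bond(part_a: int, part_b: int,
                      b_is_symmetry_generated: bool = False) -> bool:
    """Decide whether automatic bond generation is allowed between two atoms.

    part_a and part_b are the PART numbers of the two atoms;
    b_is_symmetry_generated is True when the second atom is a
    symmetry-generated copy rather than a listed atom.
    """
    # A negative PART number flags a group disordered about a symmetry
    # element: bonds to symmetry-generated atoms are switched off.
    if b_is_symmetry_generated and (part_a < 0 or part_b < 0):
        return False
    # PART 0 is the main ordered fragment and may bond to any PART.
    if part_a == 0 or part_b == 0:
        return True
    # Otherwise bonds are generated only within one and the same PART.
    return part_a == part_b


print(may_generate_bond(1, 2))          # different non-zero PARTs
print(may_generate_bond(0, 2))          # ordered fragment to a PART
print(may_generate_bond(-1, -1, True))  # negative PART to its symmetry mate
```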

PROPOSED ACTION
---------------
Action on all this? I propose to include David's new definitions for
_atom_site_disorder_assembly and *_group in (45)D44.2, with a different
description of the negative part numbers, thus:

data_atom_site_disorder_assembly
    _name                      '_atom_site_disorder_assembly'
    _category                    atom_site
    _type                        char
    _list                        yes
    _list_reference            '_atom_site_label'
    loop_ _example
          _example_detail   A  'disordered methyl assembly with groups 1 and 2'
                            B  'disordered sites related by a mirror'
                            S  'disordered sites independent of symmetry'
    _definition
;              A code which identifies a cluster of atoms that show long range
               positional disorder but are locally ordered.  Within each such
               cluster of atoms, _atom_site_disorder_group is used to identify
               the sites that are simultaneously occupied.  This field is only
               needed if there is more than one cluster of disordered atoms
               showing independent local order.
;

data_atom_site_disorder_group
    _name                      '_atom_site_disorder_group'
    _category                    atom_site
    _type                        char
    _list                        yes
    _list_reference            '_atom_site_label'
    loop_ _example
          _example_detail    1  'unique disordered site in group 1'
                             2  'unique disordered site in group 2'
                            -1  'symmetry-independent disordered site'
    _definition
;              A code that identifies a group of positionally disordered atom
               sites that are locally simultaneously occupied.  Atoms that are
               positionally disordered over two or more sites (e.g. the H
               atoms of a methyl group that exists in two orientations) can
               be assigned to two or more groups. Sites belonging to the same
               group are simultaneously occupied, but those belonging to
               different groups are not. A minus prefix (e.g. "-1") is used to
               indicate sites disordered about a special position.
;

This will not give a complete description of complex disordered structures,
but will satisfy the many cases where a functional group is simply
disordered into two or more distinct conformations. I suggest we leave to a
later revision of the dictionary any attempt to describe the ensemble of
moieties in a manner different from the mmCIF STRUCT categories.


D45.7 _audit_link_block_code
----------------------------
G> D45.7. I agree with your solution of transferring audit_link_block_code 
G> to a block_id. What I do not know (or remember) is the scheme of 
G> prefixes. Anyway is a good solution. 

There seems to be a consensus on identifying data blocks and links between
blocks through the _audit_link_ category and the identifier "_block_id",
which I have in fact named as "_audit_block_code", as the "AUDIT" category
seems the right place for it. The current definitions are listed below, for
convenience.

The problem is in assigning a unique value to _audit_block_code. I can see
three ways to proceed:

(1) Require only uniqueness within the current file. Then the linking
mechanism is just a formal way of stating the relationship between the data
blocks in a file - by no means a useless function, since the standard does
not impose any particular function on any individual data block.

(2) Formulate an algorithm that guarantees the generation of a unique
string. I am open to suggestions on how this may be achieved in a
machine/OS/program-independent way. The simplest solution seems to be Brian
Toby's, of setting up a registry of unique prefixes and assigning them to
individuals, then relying on the individuals to ensure the uniqueness of
every data block they generate. 

(3) Adopt a naming philosophy along the lines of the URL/URI/URN rules that
we see on the web. The notion here is that an identifier can take the form
          (protocol:)(/location)(#fragment)
and that partial forms of the identifier can be employed for cross-references
of restricted scope. How might this work? Say we have the following values
of _audit_link_block_code (I use '|' for the fragment separator, in view of
the special meaning to STAR of the '#' character):
                            |TOZ     (a)
                       file2|TOZ     (b)
               ../peer/file3|TOZ     (c)
              /top/mid/file4|TOZ     (d)
http://x.y.z/somewhere/file5|TOZ     (e)

(a) means "data block with _audit_block_code value of TOZ in the same file",
and is portable (provided the file is kept intact). (b) means
block TOZ in a file named file2 at the same level of the storage hierarchy
(usually this means "in the same directory", but it's intended to be a bit
more general than that). (c) means block TOZ in file3 in a neighbouring
container (directory) at the same level in the hierarchy; (d) means block
TOZ within a file, file4, at a specific location in the hierarchy that is
currently accessible (normally this means on the same file system). (b), (c)
and (d) are progressively less portable, in the sense that the links will
work only if the files linked to are carried around in the same relative
positions in the namespace. For instance, if two files, file1 and file2,
are submitted to ACTA, both published in the same issue of the journal,
and both archived in the same directory, then link (b) would work.
Link (e) is again entirely portable, in that it identifies the host
machine on the Internet and the data access protocol required to
retrieve the file.

This scheme has the merit of supporting purely local links where that is all
that is required, while remaining extensible when links need to be made public. The
values of the "block codes" need to be "unique" only within the scope
permitted by the form of the identifier - the http: value is unique across
the Internet, but is only assigned by the public archive hosting that file.
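
A minimal Python sketch of how an application might take such an identifier
apart, assuming the '|' separator used in the examples above. The function
names and the scope labels returned are invented for this illustration; the
scheme itself is only a suggestion at this stage.

```python
def split_block_reference(value: str):
    """Split 'location|block_code' into (location, block_code)."""
    # Split on the last '|' so that any earlier part is kept as the location.
    location, separator, block_code = value.rpartition("|")
    if not separator:
        raise ValueError(f"no '|' fragment separator in {value!r}")
    return location, block_code


def reference_scope(location: str) -> str:
    """Classify the location part by the forms (a) to (e) listed above."""
    if location == "":
        return "same file"        # form (a)
    if "://" in location:
        return "network"          # form (e)
    if location.startswith("/"):
        return "absolute path"    # form (d)
    if "/" in location:
        return "relative path"    # form (c)
    return "sibling file"         # form (b)


for value in ("|TOZ", "file2|TOZ", "../peer/file3|TOZ",
              "/top/mid/file4|TOZ", "http://x.y.z/somewhere/file5|TOZ"):
    location, code = split_block_reference(value)
    print(f"{value!r}: block {code}, scope {reference_scope(location)}")
```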

PROPOSED ACTION
---------------
Unless I hear compelling arguments to the contrary, I intend to implement
solution (1) at this point; it is possible that it might be extended to a
scheme such as (3) in the future.


data_audit_block_code
    _name                      '_audit_block_code'
    _category                    audit
    _type                        char
    _example                     TOZ_1991-03-20
    _definition
;              A code intended to identify uniquely the current data block.
;

data_audit_link_[]
    _name                      '_audit_link_[]'
    _category                    category_overview
    _type                        null
    loop_ _example
          _example_detail
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    loop_
    _audit_link_block_code
    _audit_link_block_description
       .             'discursive text of paper with two structures'
       morA_(1)      'structure 1 of 2'
       morA_(2)      'structure 2 of 2'
;
;
    Example 1 - multiple structure paper, as illustrated
                in A Guide to CIF for Authors (1995). IUCr: Chester.
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    loop_
    _audit_link_block_code
    _audit_link_block_description
       .        'publication details'
       KSE_COM  'experimental data common to ref./mod. structures'
       KSE_REF  'reference structure'
       KSE_MOD  'modulated structure'
;
;
    Example 2 - example file for the one-dimensional incommensurately
                modulated structure of K~2~SeO~4~.
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    _definition
;              Data items in the AUDIT_LINK category record details about the
               relationships between data blocks in the current CIF.
;

data_audit_link_block_code
    _name                      '_audit_link_block_code'
    _category                    audit_link
    _type                        char
    _list                        yes
    _list_mandatory              yes
    _definition
;              The value of _audit_block_code associated with a data block
               in the current file related to the current data block. The
               special value '.' may be used to refer to the current data
               block for completeness.
;
 
data_audit_link_block_description
    _name                      '_audit_link_block_description'
    _category                    audit_link
    _type                        char
    _list                        yes
    _list_reference            '_audit_link_block_code'
    _definition
;              A textual description of the relationship of the referenced
               data block to the current one.
;
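
Under solution (1), a checking program would only need to confirm that the
block codes are unique within the file and that every link resolves to one of
them. The Python sketch below assumes the file has already been parsed into
(block code, list of link codes) pairs; that representation, the function
name, and the code KSE_PUB given to the publication block are hypothetical.

```python
def check_links(blocks):
    """Return a list of problems found in one file's block codes and links.

    blocks is a list of (audit_block_code, [audit_link_block_code, ...]),
    one entry per data block in the file.
    """
    problems = []
    seen = set()
    for code, _links in blocks:
        if code in seen:
            problems.append(f"duplicate _audit_block_code: {code}")
        seen.add(code)
    for code, links in blocks:
        for link in links:
            # '.' refers to the current data block and always resolves.
            if link != "." and link not in seen:
                problems.append(
                    f"{code}: unresolved _audit_link_block_code {link}")
    return problems


# Modelled on Example 2; KSE_PUB is an invented code for the first block.
FILE_BLOCKS = [
    ("KSE_PUB", [".", "KSE_COM", "KSE_REF", "KSE_MOD"]),
    ("KSE_COM", []),
    ("KSE_REF", []),
    ("KSE_MOD", []),
]
print(check_links(FILE_BLOCKS))
```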


D45.8 Inheritance across data blocks
------------------------------------
Gotzon sees another benefit to the _audit_link_ mechanism:

G> On the other hand audit_link_block_code would define the scope where 
G> the parent/child link should work. I mean that the DDL requirement 
G> "Identifies a data item....and which must be present in the same data 
G> block..." would be "substituted" for certain CIF applications by:
G> "Identifies a data item....and which must be present in the same 
G> (logical) data block...", where a logical data block is composed by those 
G> physical data blocks that are included in the audit_link_block_code list 
G> (uniqueness of block_id seems to be crucial).

This is a formal problem which I leave as an open issue at this point - it
doesn't immediately affect the current revision to the dictionary itself.
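
For what it is worth, one possible reading of Gotzon's "logical data block"
can be sketched in Python as a lookup that falls back from the current block
to the blocks it links to. This is only an illustration of the idea, not a
proposal; the dictionary layout and the function name are assumptions, and
the '?' value is a placeholder rather than real data.

```python
def find_item(name, current_code, blocks):
    """Return the code of the block supplying `name`, or None if absent.

    The current block is searched first, then each block named in its
    _audit_link_block_code list ('.' denotes the current block itself).
    """
    block = blocks[current_code]
    if name in block:
        return current_code
    for code in block.get("_audit_link_block_code", []):
        if code != "." and name in blocks.get(code, {}):
            return code
    return None


# Hypothetical parsed file: block code -> {data name: value}.
BLOCKS = {
    "KSE_REF": {"_audit_link_block_code": [".", "KSE_COM"]},
    "KSE_COM": {"_diffrn_ambient_temperature": "?"},
}
print(find_item("_diffrn_ambient_temperature", "KSE_REF", BLOCKS))
```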

D48.1 _diffrn_ categories
-------------------------
I have adopted most of the changes suggested by David in his proposed
reworking of the various DIFFRN_ categories. Following some detailed
correspondence with him, the final version is slightly different from the
proposals in circular 48, and I would appreciate it if you would review
these sections carefully. There are cases (e.g. _diffrn_radiation_detector
and _diffrn_detector_device; likewise _diffrn_radiation_detector_dtime and
_diffrn_detector_dtime) where essentially the same information is repeated
in two categories, according to whether one or several instruments or
experiments are described. This is somewhat unsatisfactory; but the overall
logic of the DIFFRN categories is at least better argued than it used to be.

----------
Regards
Brian