Appended below is a modified version of an article written to brief COMCIFS members on the evolution of the DDL formalism utilised in CIF application dictionaries.
This syntactic format was chosen as the base layer for the CIF, but with some simplifications to aid programming. The only recognised data cell delimiter is the data_xxx block code; nesting of loops is prohibited; block codes and datanames are restricted to 32 characters in length; lines are restricted to 80 characters.
The initial semantic layering (i.e. the imposition of "meaning" onto the abstract tokens) was done by devising datanames that were self-expressive, and by imposing some basic data types on the associated values. Effectively only two data types were permitted - "numb" for numerical values (with a permitted standard uncertainty value in trailing parentheses); and "char" for textual information (though some applications might choose to differentiate between multi-line text extending over several lines and bracketed by newlines, and the single-line-or-less character strings with quote marks or no delimiters).
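To make the "numb" convention concrete, here is a minimal sketch (in Python, with invented function names - this is not code from any official CIF library) of splitting a numb value with a trailing standard uncertainty into its two parts:

```python
import re

# Minimal sketch: split a CIF "numb" value such as "1.234(5)" into the
# numeric value and its standard uncertainty (su). The parenthesised
# digits scale with the last decimal place of the value.
_NUMB = re.compile(r'^([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)(?:\((\d+)\))?$')

def parse_numb(token):
    """Return (value, su) for a numb token; su is None if absent."""
    m = _NUMB.match(token)
    if not m:
        raise ValueError("not a numb value: %r" % token)
    value = float(m.group(1))
    su = None
    if m.group(2):
        decimals = len(m.group(1).split('.')[1]) if '.' in m.group(1) else 0
        su = int(m.group(2)) * 10 ** (-decimals)
    return value, su
```

So parse_numb('1.234(5)') yields the value 1.234 with an uncertainty of 0.005, while a plain '42' yields 42.0 with no uncertainty.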
Hence, this could be considered a totally valid CIF:
    data_proto
    _my.crystal's.habit     'turns green in daylight'
But there is a problem here: the word 'habit', used here in the sense of 'customary behaviour', is used in crystallography with a specialised meaning. This ambiguity is fatal to the purpose of devising a universal exchange mechanism.
And this was indeed how the original CIF dictionary was submitted to Acta - as an MS-Word file with the definitions laid out just as in a lexicographic dictionary. An appendix listed permitted codes for certain values (what we habitually call the 'enumeration lists'), and some general elements of the definitions were also included in the main text of the paper. All this information was available only to the human reader.
However, Tony Cook, who was working with Syd Hall on the chemical (MIF) applications of STAR files, pointed out that much of the stored information on each dataname could be extracted by computer if it were presented in an appropriate way - and what way would be more appropriate than as a STAR file, so that the same software being written to extract information from a CIF could be used to extract information from the dictionary? And so it came to pass - by the time the original CIF paper went to press, the dictionary had been recast as a STAR file (with the same syntax restrictions as a CIF). The dictionary information was associated with a new set of datanames, which form the vocabulary of the Dictionary Definition Language, or DDL. Note that this formalism is only mentioned in passing in the CIF paper: the typeset version of the dictionary translates the DDL names into sentences, so that the dictionary again resembles a lexicographic one in its layout of entries.
Here is an example of an entry in the core dictionary:
    data_atom_site_attached_hydrogens
        _name                      '_atom_site_attached_hydrogens'
        _type                      numb
        _list                      yes
        _list_identifier           '_atom_site_label'
        _enumeration_range         0:4
        _enumeration_default       0
        loop_ _example
              _example_detail      2   'water oxygen'
                                   1   'hydroxyl oxygen'
                                   4   'ammonium nitrogen'
        _definition
    ;              The number of hydrogen atoms attached to the atom at this
                   site excluding any H atoms for which coordinates (measured
                   or calculated) are given.
    ;
The DDL used for this version is documented only in comments at the end of the dictionary file cifdic.C91. I reproduce it here for historical interest, and shall refer to this version as DDL0.
##############################################################################
#
#  DDL Data Name Descriptions
#  --------------------------
#
#  _compliance           The dictionary version in which the item is defined.
#
#  _definition           The description of the item.
#
#  _enumeration          A permissible value for an item. The value 'unknown'
#                        signals that the item can have any value.
#
#  _enumeration_default  The default value for an item if it is not specified
#                        explicitly. 'unknown' means the default is not known.
#
#  _enumeration_detail   The description of a permissible value for an item.
#                        Note that the code '.' normally signals a null
#                        or 'not applicable' condition.
#
#  _enumeration_range    The range of values for a numerical item. The
#                        construction is 'min:max'. If 'max' is omitted then
#                        the item can have any value greater than or equal
#                        to 'min'.
#
#  _esd                  Signals if an estimated standard deviation is
#                        expected to be appended (enclosed within brackets)
#                        to a numerical item. May be 'yes' or 'no'.
#
#  _esd_default          The default value for the esd of a numerical item
#                        if a value is not appended.
#
#  _example              An example of the item.
#
#  _example_detail       A description of the example.
#
#  _list                 Signals if an item is expected to occur in a looped
#                        list. Possible values 'yes', 'no' or 'both'.
#
#  _list_identifier      Identifies a data item that MUST appear in the list
#                        containing the currently defined data item.
#
#  _name                 The data name of the item defined.
#
#  _type                 The data type 'numb' or 'char' (the latter includes
#                        'text').
#
#  _units_extension      The data name extension code used to specify the
#                        units of a numerical item.
#
#  _units_description    A description of the units.
#
#  _units_conversion     The method of converting the item into a value based
#                        on the default units. Each conversion number is
#                        preceded by an operator code *, /, + or - which
#                        indicates how the conversion number is applied.
#
#  _update_history       A record of the changes to this file.
#
#-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof
I know of two applications which are able to read the CIF core and validate CIFs against that dictionary, expressed in DDL0: CIFCHK from the CCDC, which we use to check CIFs as we archive them (but this program has never been released to the public domain); and sb (which has gaps - it can't handle _units_extension).
The benefit of this approach is, of course, that you do not need to write a new subroutine to handle each new entry in the dictionary - no small consideration, now that we have more than 1200 entries. By the time of the Veszprem computing school in mid-1992, Peter Murray-Rust, for instance, was already working on a C++ object library for CIF. The ambition was to provide a library of routines (written in C++ but callable from C, and maybe even from Fortran) for extracting CIF data and validating it against the dictionary.
But only "to a large extent". The hierarchies as set up in the core dictionary are not rigid. The CIF paper lists _chemical_ and _chemical_conn_ as separate "category" parents. (I put the word category in quotes here because it was not a well-defined idea. Most people have a feel for the type of categorisation being attempted, but the rules for assigning categories were for a long time difficult to pin down.) In like manner, the _atom_site_ and _atom_site_aniso_ groups of datanames were clearly related, yet distinct. The relationship was not as clean as the similar form of the datanames suggested.
In practice, related data are grouped together - conventionally in the form of tables; in CIFs within loops. A 'relational' data model is one which can represent data as tables obeying certain definite rules: each line of the table must be indexed by a unique key. It was argued that the CIF tables could be mapped onto a relational data model, provided that the necessary rules were enforced through the relationships between data names stored in the dictionary. In other words, the DDL should be revised to allow valid relational tables to be constructed and validated. This would make it much easier to load a relational database with the relevant information from a CIF; or, conversely, one supposes, to borrow the tools of relational database management for managing CIFs.
So, between the York meeting and the workshop in Tarrytown of October '93, there was a great deal of e-mail correspondence and a consequent evolution in the DDL to meet these objectives. Each dataname was now assigned formally to a category (the name of which was usually the same as the dataname stem, e.g. atom_site - but not necessarily so). All the items in a table (i.e. looped list) belonged to the same category. Items not in a table were assigned a different category. Certain uniqueness constraints could be applied to one or more of the datanames. This had the effect of defining the unique 'key' for the table. For instance, in the atom_site category, _atom_site_label had to be unique (in crystallography-speak, this just means that every atomic site had to have a unique label). Additional DDL names were introduced to record parent-child relationships between data names in different categories. Atom labels occur in the geometry tables, for instance _geom_bond_atom_site_label_1. These labels must match corresponding labels of atoms in the atom_site list, so the 'parent' of _geom_bond_atom_site_label_1 is _atom_site_label, and vice versa.
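The two relational checks just described - uniqueness of a key within a category's loop, and parent-child matching of labels across categories - can be sketched as follows. This is an illustrative Python fragment with hand-made tables and invented helper names, not part of any CIF toolkit:

```python
# Toy tables: each loop is a dict of column name -> list of values.
atom_site = {
    '_atom_site_label': ['C1', 'C2', 'C3', 'C4', 'C7'],
}
geom_bond = {
    '_geom_bond_atom_site_label_1': ['C1', 'C2'],
    '_geom_bond_atom_site_label_2': ['C2', 'C3'],
}

def key_is_unique(table, key_column):
    """The uniqueness rule: no repeated values in the key column."""
    values = table[key_column]
    return len(values) == len(set(values))

def children_match_parent(child_table, child_col, parent_table, parent_col):
    """The parent-child rule: every child label must occur in the parent list."""
    parents = set(parent_table[parent_col])
    return all(v in parents for v in child_table[child_col])

assert key_is_unique(atom_site, '_atom_site_label')
assert children_match_parent(geom_bond, '_geom_bond_atom_site_label_1',
                             atom_site, '_atom_site_label')
```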
To meet all these requirements, the original DDL was extended and modified, and is now in press as version 1.4. I append below a list of the datanames defined in that version; those prefixed by * are extensions to the DDL0 set:
    *  _category
       _definition
    *  _dictionary_history
    *  _dictionary_name
    *  _dictionary_update
    *  _dictionary_version
       _enumeration
       _enumeration_default
       _enumeration_detail
       _enumeration_range
       _example
       _example_detail
       _list
    *  _list_level
    *  _list_link_child
    *  _list_link_parent
    *  _list_mandatory
    *  _list_reference      (replaces _list_identifier)
    *  _list_uniqueness
       _name
    *  _related_item
    *  _related_function
       _type
    *  _type_conditions
    *  _type_construct
       _units
       _units_detail
The following DDL0 terms were dropped:
       _compliance           (replaced by the _dictionary_ attributes)
       _esd                  (incorporated in _type_conditions)
       _esd_default
       _list_identifier      (replaced by _list_reference)
       _update_history       (replaced by _dictionary_history)
       _units_extension    |
       _units_description  |-  (replaced by _units and _units_detail)
       _units_conversion   |
The _units_ items pose something of a problem in reading a dictionary. The original idea was that _cell_length_a would be defined in the dictionary as a quantity in angstroms; but it would have a _units_extension code _pm, which means that the dataname _cell_length_a_pm would be understood as a cell length in picometres (the numeric conversion factor is given by _units_conversion). However, a straightforward dictionary lookup would not return _cell_length_a_pm as a valid dataname. This mechanism, often criticised for the ill logic of modifying the attributes of a defined term by changing its name, was finally dropped, so that each data name now has a unique physical unit associated with it.
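The operator-prefixed conversion codes of DDL0 ('*1.0', '/100.' and so on) are simple to apply mechanically. Here is a hedged sketch, assuming the reading given in the DDL0 comments that the operator and number convert the extended item back to the default units (so a cell length in picometres is divided by 100 to recover angstroms); the function name is my own invention:

```python
def apply_conversion(value, conversion):
    """Apply a DDL0 _units_conversion code such as '/100.' to a value."""
    op, factor = conversion[0], float(conversion[1:])
    if op == '*':
        return value * factor
    if op == '/':
        return value / factor
    if op == '+':
        return value + factor
    if op == '-':
        return value - factor
    raise ValueError('bad operator in %r' % conversion)

# 540.5 pm is 5.405 Angstroms under the '/100.' rule.
assert apply_conversion(540.5, '/100.') == 5.405
```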
Two sets of items have involved especially protracted labours. The _list_ items define the attributes and relationships between items in a looped list. The _list_level allows for data names to be assigned to deeper levels in nested loops. As such, it has no use in CIF applications, where nested loops are forbidden; but it allows nested loop structures to be defined in other STAR applications which do permit it. In my view this is important, for it permits the definition of hierarchical data models, where the existence of certain data items is dependent on the existence of others at a higher level in the hierarchy (you can't have a loop at level 2 unless it is nested within a loop at level 1). I don't know enough to say whether this is a necessary or sufficient condition for mapping STAR data structures onto hierarchical data models, but I have a feeling that it may be important for doing this, and I'd be glad to hear any informed commentary on this.
In CIF loops, which are always at level 1, the _list_ attributes describe the relations between the data values in a table, and so this is an attempt to map onto a relational data model. If one wants to identify the key to entries in such a table (the data name or names which must have unique values within the table), one must collect together entries with _list_mandatory set to 'yes' and the complete set of _list_uniqueness pointers.
Here's a simple example, somewhat adapted from the MIF paper, to show what's going on. The following loop defines a table of bonds in a chemical structure:
    loop_
        _bond_id_1
        _bond_id_2
        _bond_type
        C1   C2   double
        C2   C3   single
        C3   C4   double
        C1   C7   single
The MIF dictionary entry for _bond_type includes the line
    loop_ _list_reference      '_bond_id_1'
                               '_bond_id_2'

meaning that both bond id values must be present for the table entry to make sense. In the entries for _bond_id_1 (and _2) are found the lines
    _list_mandatory            yes
    loop_ _list_uniqueness     '_bond_id_1'
                               '_bond_id_2'
meaning that these datanames must be present in the loop, and must together be unique (C1 appears twice as _bond_id_1 in the example, but that's OK: on one occasion it's teamed with C2, on another with C7).
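The composite-key rule at work in this example can be checked in a few lines. This is a toy Python sketch over the bond table above, not output from any CIF validator:

```python
# The _list_uniqueness constraint: the pair (_bond_id_1, _bond_id_2)
# must be unique across rows, though either column alone may repeat.
rows = [('C1', 'C2', 'double'),
        ('C2', 'C3', 'single'),
        ('C3', 'C4', 'double'),
        ('C1', 'C7', 'single')]

keys = [(id1, id2) for id1, id2, _ in rows]
assert len(keys) == len(set(keys))            # the composite key is unique
assert [k[0] for k in keys].count('C1') == 2  # C1 alone repeats, which is fine
```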
These relationships do all hang together; but it's necessary to do a certain amount of hunting through the dictionary to ensure that you've got all the relevant information (and it requires a lot of work to make sure that the dictionary yields this information in a consistent manner).
The other set of items I marked for interest are the _type_ group. Because the STAR philosophy is to store and deliver text strings, it was felt that the assignment of data types was largely unnecessary. Integers, floats, double-precision complex numbers and booleans did not need to be stored in any different manner. They can all be coded as text strings: "3", "2.76", "-1.30000988765876 + 0.00022456342987i", "yes". But there have been many voices raised to counter this view, and the solution adopted in DDL1.4 is to permit three fundamental types, with _type values of "numb", "char" and "null". ("null" is a device adopted to allow additional information to be stored in dictionaries for humans to read, but machine parsers to ignore.) _type_conditions extends this, so that 1.23(4) is understood as a number with associated standard uncertainty (e.s.d.) in CIF applications, and 1:4 is understood as a range of allowed integer values in certain MIF applications. _type_construct is a more general device which allows a data value to be compared with a pattern (in regular expression notation). Hence a date quantity could be described in the dictionary with a _type_construct of [0-9][0-9]:[0-9][0-9]:[0-9][0-9], meaning any triplet of two-digit integers separated by colons (not the format used in CIF!). The regex notation is very powerful, and in principle this allows very tight control over acceptable patterns in the string representing a data value.
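A _type_construct check reduces to matching the whole data value against the dictionary's pattern. A minimal sketch using the two-digit-triplet pattern from the text (the helper name is illustrative):

```python
import re

# The _type_construct pattern from the text: three two-digit integers
# separated by colons.
construct = r'[0-9][0-9]:[0-9][0-9]:[0-9][0-9]'

def matches_construct(value, pattern):
    """True if the entire value matches the dictionary's regex pattern."""
    return re.fullmatch(pattern, value) is not None

assert matches_construct('12:30:45', construct)
assert not matches_construct('1991-04-15', construct)
```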
A final comment on category. The category assigned to each data item defines which loop (table) it may appear in. If it doesn't normally appear in a loop, its category is more generally defined to encompass related items with the same status. Note that this linking of categories to loops results in rather a large number of distinct category assignments. It is not, perhaps, an effective taxonomic classification.
In COMCIFS mailing 30 this year (part of which is reproduced below), I already described how (I think) DDL2 is intended to work. Here I shall just pick up a few threads to contrast with specific remarks I made above on DDL1. First, the taxonomy employed allows a classification hierarchy: there are categories, defined just as the categories in DDL1, but there are also category_groups (clusters of categories, or supercategories), and subcategories. Hence _cell.length_a, _cell.length_b and _cell.length_c are the (only) three members of the subcategory "cell_length"; they are all members of the "cell" category; they are all members of the "cell_group" supercategory (as is, say, _cell_measurement_refln.index_h); and they are also members of the "inclusive_group" supercategory (as is everything in the mmCIF dictionary).
Second, the properties of a category are listed separately from the properties of its constituent members. If I were to extend my MIF example into DDL2, you would have an entry something like
    save_BOND
        _category.id                 bond
        loop_ _category_key.name    '_bond.id_1'
                                    '_bond.id_2'
    save_

which defines the key of the 'bond' category. None of the entries for _bond.id_1, _bond.id_2 or _bond.type (as they would become in DDL2) would contain anything about key values, except insofar as they indicate membership of the 'bond' category.
The mmCIF2 dictionary also contains a larger spread of what are effectively user-defined types: a list of type codes is established at the beginning of the dictionary using the equivalent of _type_construct. In each data name definition, an _item_type.code value is listed, which indexes into this table of user-defined types. The original DDL1 types are preserved (as _item_type_list.primitive_code and _item_type_conditions.code, I think), but there is more freedom to define application-specific types for the dependent application to handle.
Note that datanames described by DDL2 dictionaries will have an embedded dot character to separate their "category" part from their "instance" part: this will be the most obvious difference between CIFs containing datanames described by DDL2 dictionaries and existing CIFs. The DDL2 proposal endeavours to honour existing CIFs through an aliasing mechanism.
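At its simplest, the aliasing mechanism is a lookup table consulted before dictionary access. The sketch below uses a hand-made two-entry alias table for illustration only; the real mmCIF alias list is of course far larger:

```python
# Toy fragment of a DDL1 -> DDL2 alias table.
aliases = {
    '_atom_site_label': '_atom_site.label',
    '_cell_length_a':   '_cell.length_a',
}

def canonical_name(name):
    """Return the DDL2 name, translating known DDL1 aliases."""
    return aliases.get(name, name)

assert canonical_name('_atom_site_label') == '_atom_site.label'
assert canonical_name('_atom_site.label') == '_atom_site.label'
```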
For the sake of completeness, I attach here a simple list of all the data names used in the DDL2 set:
    _category.description                _category.id
    _category.implicit_key               _category.mandatory_code
    _category_examples.case              _category_examples.detail
    _category_examples.id                _category_group.category_id
    _category_group.id                   _category_group_list.description
    _category_group_list.id              _category_group_list.parent_id
    _category_key.id                     _category_key.name
    _category_methods.category_id        _category_methods.method_id
    _datablock.description               _datablock.id
    _datablock_methods.datablock_id      _datablock_methods.method_id
    _dictionary.datablock_id             _dictionary.title
    _dictionary.version                  _dictionary_history.revision
    _dictionary_history.update           _dictionary_history.version
    _item.category_id                    _item.mandatory_code
    _item.name                           _item_aliases.alias_name
    _item_aliases.dictionary             _item_aliases.dictionary_version
    _item_aliases.name                   _item_default.name
    _item_default.value                  _item_dependent.dependent_name
    _item_dependent.name                 _item_description.description
    _item_description.name               _item_enumeration.detail
    _item_enumeration.name               _item_enumeration.value
    _item_examples.case                  _item_examples.detail
    _item_examples.name                  _item_linked.child_name
    _item_linked.parent_name             _item_methods.method_id
    _item_methods.name                   _item_range.maximum
    _item_range.minimum                  _item_range.name
    _item_related.function_code          _item_related.name
    _item_related.related_name           _item_structure.code
    _item_structure.name                 _item_structure_list.code
    _item_structure_list.dimension       _item_structure_list.index
    _item_sub_category.id                _item_sub_category.name
    _item_type.code                      _item_type.name
    _item_type_conditions.code           _item_type_conditions.name
    _item_type_list.code                 _item_type_list.construct
    _item_type_list.detail               _item_type_list.primitive_code
    _item_units.code                     _item_units.name
    _item_units_conversion.factor        _item_units_conversion.from_code
    _item_units_conversion.operator      _item_units_conversion.to_code
    _item_units_list.code                _item_units_list.detail
    _method_list.code                    _method_list.detail
    _method_list.id                      _method_list.inline
    _method_list.language                _sub_category.description
    _sub_category.id                     _sub_category_examples.case
    _sub_category_examples.detail        _sub_category_examples.id
    _sub_category_methods.method_id      _sub_category_methods.sub_category_id
DDL1.4 has evolved in the way I described above, and is a general mechanism for defining data names. It is not tied to any specific data model, but can be mapped onto a relational model, and possibly onto a hierarchical one.
DDL2 was developed as a relational model, and is better structured and more consistent in that respect. But it enforces this model on data files that it describes: if a CIF had a table representing some raw experimental data, it is possible that some lines of that table might be repeated (if the same reflection were re-measured, for instance). The relational viewpoint forces those lines to be distinguished, even if only by the addition of a new data name whose only purpose is to number the rows of the table! The stricter categorisation rules may also make more problematic the handling of external data names (i.e. those introduced by a user which have no definition in an official dictionary).
At this stage, the two formalisms are very closely compatible - DDL2 was designed to achieve this. But a whole-hearted implementation of all the DDL2 ideas may well lead to a divergence between data files described by DDL1 and DDL2 dictionaries. At present, an application reading a DDL2 dictionary is intended to be able to read and verify existing CIFs (currently described by DDL1 dictionaries) through an in-built aliasing mechanism. We have yet to see it demonstrated how well this will work; but it's unlikely that such a mechanism will track any changes that are made to the DDL1 language and its dependent data files in the future.
So we need to consider carefully whether all existing dictionaries should be recast in the DDL2 formalism, and whether that will affect the future evolution of CIF vis-a-vis parallel developments such as MIF.
(Here is an extract from the COMCIFS mailing mentioned above.)
D30.2 The New DDL
-----------------

Let me make a few general remarks about the philosophy behind DDL2. We have already had extensive discussions on the desirability of providing a self-consistent machine-readable set of data attributes, and over the last year or so the version 1 DDL has grown to include relations between data items. This approach is now taken a stage further.

In the new DDL dictionary, a hierarchy of objects is defined: category_groups (arbitrarily definable groups of categories, so that the geom_bond and geom_angle categories would naturally be collected into the geom category_group); categories (corresponding to the current definition of a category as a collection of data names which may occur in the same looped list, or outside of loops in a related aggregate); subcategories (collections of data items that form a coherent set within a category, e.g. *_h, *_k and *_l items might form a miller_index subcategory); and individual data items. Each of these hierarchical objects may be described by a separate set of DDL definitions (so there is, for example, a _category.description and a _sub_category.description).

The organisation of DDL2 dictionaries is different from DDL1. Each definition is given within a save_ frame, where previously each appeared within its own data block. The save_ frames are permitted STAR syntactic devices for encapsulating blocks of information which may be referenced from other places within the current data block. At this point, however, such references are not used - the save_ frames merely split the dictionary up into logical chunks, as did the previous fragmentation into data blocks. But because each definition within a dictionary is related to the rest of the information in the dictionary, it is best to have a single data block encompassing the whole dictionary.
John explains the reason for this reorganisation thus: 'The save_ syntax has been used in order to have a more consistent use of scope between data files and dictionaries. Since we are representing links between data items we are using save frames so that the referenced data items are all within the scope of the current dictionary. This is not the case now where data_ sections are used. Links between data blocks really violate the STAR scope rule that requires each data block to have a separate name space.'

Another point of difference is that in earlier dictionaries a single data block might contain the description of more than one (more or less) related data names. Hence, in the core we have

    data_cell_length_
        loop_ _name                '_cell_length_a'
                                   '_cell_length_b'
                                   '_cell_length_c'
        _type                      numb
        _enumeration_range         0.0:
        _esd                       yes
        _esd_default               0.0
        loop_ _units_extension
              _units_description
              _units_conversion    ' '     'Angstroms'    *1.0
                                   '_pm'   'picometres'   /100.
                                   '_nm'   'nanometres'   *10.
        _definition
    ;              Unit-cell lengths corresponding to the structure
                   reported. ...
    ;

In the new formulation, each such definition would have its own save_ frame (i.e. one each for _cell_length_a, _b and _c). However, it IS possible to have more than one definition within a save_ frame, and this occurs when 'parent' and 'children' are defined together (recall that the child relationship provides for pointers between identifiers in different lists - a typical example is a _geom_bond_atom_site_label_1 which must match an _atom_site_label). In the new dictionaries, this would be written as

    save_atom_site.label
        _item_description.description
    ;              The _atom_site.label is a unique identifier ...
    ;
        loop_ _item.name                      _item.category_id   _item.mandatory_code
              '_atom_site.label'              atom_site           yes
              '_geom_bond.atom_site_label_1'  geom_bond           yes
              '_geom_bond.atom_site_label_2'  geom_bond           yes
        loop_ _item_linked.child_name         _item_linked.parent_name
              '_geom_bond.atom_site_label_1'  '_atom_site.label'
              '_geom_bond.atom_site_label_2'  '_atom_site.label'
        _item_type.code                       char
        loop_ _item_examples.case             C12  Ca3g28  Fe3+17  H*251  boron2a
    save_

One other major change that you may have noticed is the introduction of a dot character into data names to differentiate the category name from the instance within the category. John believes this to be very important for the efficient validation of tabular relationships (in other words it makes it easy to enforce the rule that a loop_ contains only items from the same category). Note that it is not essential to do this - each dictionary definition may explicitly list the category to which the data name belongs. But John prefers that the category should be easily extractable from the data name alone, using the dot (or some other separator character). This directly contradicts an earlier COMCIFS decision that no character beyond the leading underscore should have a special meaning. It also raises the minor difficulty that all data names adhering to this convention and including a dot will be different from the data names published in the core dictionary.

To permit compatibility with existing data files, John has introduced an alias mechanism, so that _atom_site_label will be recognised and internally translated to _atom_site.label (and, indeed, to _atom_site.id also in this particular instance), which is an entry in the new dictionary. We need to give full consideration to the wisdom or otherwise of this approach. For the most part, the trade-off seems to be between computational efficiency in John's applications and the multitude of headaches that might result from changing all the existing data names.
However, it's not quite so simple, since some of the aliases in Paula's draft do not map to exact equivalents in the new formulation (see, for instance, _atom_site.fract_x and _atom_site.fract_x_esd versus _atom_site_fract_x).
Copyright © 1997 International Union of Crystallography