Appended below is a modified version of an article written to brief COMCIFS members on the evolution of the DDL formalism utilised in CIF application dictionaries.
This syntactic format was chosen as the base layer for the CIF, but with some simplifications to aid programming. The only recognised data cell delimiter is the data_xxx block code; nesting of loops is prohibited; block codes and datanames are restricted to 32 characters in length; lines are restricted to 80 characters.
The initial semantic layering (i.e. the imposition of "meaning" onto the abstract tokens) was done by devising datanames that were self-expressive, and by imposing some basic data types on the associated values. Effectively only two data types were permitted - "numb" for numerical values (with a permitted standard uncertainty value in trailing parentheses); and "char" for textual information (though some applications might choose to differentiate between multi-line text extending over several lines and bracketed by newlines, and the single-line-or-less character strings with quote marks or no delimiters).
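To make the "numb" convention concrete, here is a minimal sketch (in Python, with invented function names - this is not code from any official CIF library) of splitting a numb value with a trailing standard uncertainty into its two parts:

```python
import re

# Minimal sketch: split a CIF "numb" value such as "1.234(5)" into the
# numeric value and its standard uncertainty (su). The parenthesised
# digits scale with the last decimal place of the value.
_NUMB = re.compile(r'^([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)(?:\((\d+)\))?$')

def parse_numb(token):
    """Return (value, su) for a numb token; su is None if absent."""
    m = _NUMB.match(token)
    if not m:
        raise ValueError("not a numb value: %r" % token)
    value = float(m.group(1))
    su = None
    if m.group(2):
        decimals = len(m.group(1).split('.')[1]) if '.' in m.group(1) else 0
        su = int(m.group(2)) * 10 ** (-decimals)
    return value, su
```

So parse_numb('1.234(5)') yields the value 1.234 with an uncertainty of 0.005, while a plain '42' yields 42.0 with no uncertainty.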
Hence, this could be considered a totally valid CIF:
    data_proto
    _my.crystal's.habit     'turns green in daylight'
But there is a problem here: the word 'habit', used here in the sense of 'customary behaviour', is used in crystallography with a specialised meaning. This ambiguity is fatal to the purpose of devising a universal exchange mechanism.
And this was indeed how the original CIF dictionary was submitted to Acta - as an MS-Word file with the definitions laid out just as in a lexicographic dictionary. An appendix listed permitted codes for certain values (what we habitually call the 'enumeration lists'), and some general elements of the definitions were also included in the main text of the paper. All this information was available only to the human reader.
However, Tony Cook, who was working with Syd Hall on the chemical (MIF) applications of STAR files, pointed out that much of the stored information on each dataname could be extracted by computer if it were presented in an appropriate way - and what way would be more appropriate than as a STAR file, so that the same software being written to extract information from a CIF could be used to extract information from the dictionary? And so it came to pass - by the time the original CIF paper went to press, the dictionary had been recast as a STAR file (with the same syntax restrictions as a CIF). The dictionary information was associated with a new set of datanames, which form the vocabulary of the Dictionary Definition Language, or DDL. Note that this formalism is only mentioned in passing in the CIF paper: the typeset version of the dictionary translates the DDL names into sentences, so that the dictionary again resembles a lexicographic one in its layout of entries.
Here is an example of an entry in the core dictionary:
    data_atom_site_attached_hydrogens
        _name                      '_atom_site_attached_hydrogens'
        _type                      numb
        _list                      yes
        _list_identifier           '_atom_site_label'
        _enumeration_range         0:4
        _enumeration_default       0
        loop_ _example
              _example_detail      2   'water oxygen'
                                   1   'hydroxyl oxygen'
                                   4   'ammonium nitrogen'
        _definition
    ;              The number of hydrogen atoms attached to the atom at this
                   site excluding any H atoms for which coordinates (measured
                   or calculated) are given.
    ;
The DDL used for this version is documented only in comments at the end of the dictionary file cifdic.C91. I reproduce it here for historical interest, and shall refer to this version as DDL0.
##############################################################################
#
#  DDL Data Name Descriptions
#  --------------------------
#
#  _compliance           The dictionary version in which the item is defined.
#
#  _definition           The description of the item.
#
#  _enumeration          A permissible value for an item. The value 'unknown'
#                        signals that the item can have any value.
#
#  _enumeration_default  The default value for an item if it is not specified
#                        explicitly. 'unknown' means the default is not known.
#
#  _enumeration_detail   The description of a permissible value for an item.
#                        Note that the code '.' normally signals a null
#                        or 'not applicable' condition.
#
#  _enumeration_range    The range of values for a numerical item. The
#                        construction is 'min:max'. If 'max' is omitted then
#                        the item can have any value greater than or equal
#                        to 'min'.
#
#  _esd                  Signals if an estimated standard deviation is
#                        expected to be appended (enclosed within brackets)
#                        to a numerical item. May be 'yes' or 'no'.
#
#  _esd_default          The default value for the esd of a numerical item
#                        if a value is not appended.
#
#  _example              An example of the item.
#
#  _example_detail       A description of the example.
#
#  _list                 Signals if an item is expected to occur in a looped
#                        list. Possible values 'yes', 'no' or 'both'.
#
#  _list_identifier      Identifies a data item that MUST appear in the list
#                        containing the currently defined data item.
#
#  _name                 The data name of the item defined.
#
#  _type                 The data type 'numb' or 'char' (the latter includes
#                        'text').
#
#  _units_extension      The data name extension code used to specify the
#                        units of a numerical item.
#
#  _units_description    A description of the units.
#
#  _units_conversion     The method of converting the item into a value based
#                        on the default units. Each conversion number is
#                        preceded by an operator code *, /, + or - which
#                        indicates how the conversion number is applied.
#
#  _update_history       A record of the changes to this file.
#
#-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof
I know of two applications which are able to read the CIF core and validate CIFs against that dictionary, expressed in DDL0: CIFCHK from the CCDC, which we use to check CIFs as we archive them (but this program has never been released to the public domain); and sb (which has gaps - it can't handle _units_extension).
The benefit of this approach is, of course, that you do not need to write a new subroutine to handle each new entry in the dictionary - no small consideration, now that we have more than 1200 entries. By the time of the Veszprem computing school in mid-1992, Peter Murray-Rust, for instance, was already working on a C++ object library for CIF. The ambition was to provide a library of routines (written in C++ but callable from C, and maybe even from Fortran) for extracting CIF data and validating it against the dictionary.
But only "to a large extent". The hierarchies as set up in the core dictionary are not rigid. The CIF paper lists _chemical_ and _chemical_conn_ as separate "category" parents. (I put the word category in quotes here because it was not a well-defined idea. Most people have a feel for the type of categorisation being attempted, but the rules for assigning categories were for a long time difficult to pin down.) In like manner, the _atom_site_ and _atom_site_aniso_ groups of datanames were clearly related, yet distinct. The relationship was not as clean as the similar form of the datanames suggested.
In practice, related data are grouped together - conventionally in the form of tables; in CIFs within loops. A 'relational' data model is one which can represent data as tables obeying certain definite rules: each line of the table must be indexed by a unique key. It was argued that the CIF tables could be mapped onto a relational data model, provided that the necessary rules were enforced through the relationships between data names stored in the dictionary. In other words, the DDL should be revised to allow valid relational tables to be constructed and validated. This would make it much easier to load a relational database with the relevant information from a CIF; or, conversely, one supposes, to borrow the tools of relational database management for managing CIFs.
So, between the York meeting and the workshop in Tarrytown of October '93, there was a great deal of e-mail correspondence and a consequent evolution in the DDL to meet these objectives. Each dataname was now assigned formally to a category (the name of which was usually the same as the dataname stem, e.g. atom_site - but not necessarily so). All the items in a table (i.e. looped list) belonged to the same category. Items not in a table were assigned a different category. Certain uniqueness constraints could be applied to one or more of the datanames. This had the effect of defining the unique 'key' for the table. For instance, in the atom_site category, _atom_site_label had to be unique (in crystallography-speak, this just means that every atomic site had to have a unique label). Additional DDL names were introduced to record parent-child relationships between data names in different categories. Atom labels occur in the geometry tables, for instance _geom_bond_atom_site_label_1. These labels must match corresponding labels of atoms in the atom_site list, so the 'parent' of _geom_bond_atom_site_label_1 is _atom_site_label, and vice versa.
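The two relational checks just described - uniqueness of a key within a category's loop, and parent-child matching of labels across categories - can be sketched as follows. This is an illustrative Python fragment with hand-made tables and invented helper names, not part of any CIF toolkit:

```python
# Toy tables: each loop is a dict of column name -> list of values.
atom_site = {
    '_atom_site_label': ['C1', 'C2', 'C3', 'C4', 'C7'],
}
geom_bond = {
    '_geom_bond_atom_site_label_1': ['C1', 'C2'],
    '_geom_bond_atom_site_label_2': ['C2', 'C3'],
}

def key_is_unique(table, key_column):
    """The uniqueness rule: no repeated values in the key column."""
    values = table[key_column]
    return len(values) == len(set(values))

def children_match_parent(child_table, child_col, parent_table, parent_col):
    """The parent-child rule: every child label must occur in the parent list."""
    parents = set(parent_table[parent_col])
    return all(v in parents for v in child_table[child_col])

assert key_is_unique(atom_site, '_atom_site_label')
assert children_match_parent(geom_bond, '_geom_bond_atom_site_label_1',
                             atom_site, '_atom_site_label')
```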
To meet all these requirements, the original DDL was extended and modified, and is now in press as version 1.4. I append below a list of the datanames defined in that version; those prefixed by * are extensions to the DDL0 set:
    *  _category
       _definition
    *  _dictionary_history
    *  _dictionary_name
    *  _dictionary_update
    *  _dictionary_version
       _enumeration
       _enumeration_default
       _enumeration_detail
       _enumeration_range
       _example
       _example_detail
       _list
    *  _list_level
    *  _list_link_child
    *  _list_link_parent
    *  _list_mandatory
    *  _list_reference      (replaces _list_identifier)
    *  _list_uniqueness
       _name
    *  _related_item
    *  _related_function
       _type
    *  _type_conditions
    *  _type_construct
       _units
       _units_detail
The following DDL0 terms were dropped:
       _compliance           (replaced by the _dictionary_ attributes)
       _esd                  (incorporated in _type_conditions)
       _esd_default
       _list_identifier      (replaced by _list_reference)
       _update_history       (replaced by _dictionary_history)
       _units_extension    |
       _units_description  |-  (replaced by _units and _units_detail)
       _units_conversion   |
The _units_ items pose something of a problem in reading a dictionary. The original idea was that _cell_length_a would be defined in the dictionary as a quantity in angstroms; but it would have a _units_extension code _pm, which means that the dataname _cell_length_a_pm would be understood as a cell length in picometres (the numeric conversion factor is given by _units_conversion). However, a straightforward dictionary lookup would not return _cell_length_a_pm as a valid dataname. This mechanism, often criticised for the ill logic of modifying the attributes of a defined term by changing its name, was finally dropped, so that each data name now has a unique physical unit associated with it.
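The operator-prefixed conversion codes of DDL0 ('*1.0', '/100.' and so on) are simple to apply mechanically. Here is a hedged sketch, assuming the reading given in the DDL0 comments that the operator and number convert the extended item back to the default units (so a cell length in picometres is divided by 100 to recover angstroms); the function name is my own invention:

```python
def apply_conversion(value, conversion):
    """Apply a DDL0 _units_conversion code such as '/100.' to a value."""
    op, factor = conversion[0], float(conversion[1:])
    if op == '*':
        return value * factor
    if op == '/':
        return value / factor
    if op == '+':
        return value + factor
    if op == '-':
        return value - factor
    raise ValueError('bad operator in %r' % conversion)

# 540.5 pm is 5.405 Angstroms under the '/100.' rule.
assert apply_conversion(540.5, '/100.') == 5.405
```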
Two sets of items have involved especially protracted labours. The _list_ items define the attributes and relationships between items in a looped list. The _list_level allows for data names to be assigned to deeper levels in nested loops. As such, it has no use in CIF applications, where nested loops are forbidden; but it allows nested loop structures to be defined in other STAR applications which do permit it. In my view this is important, for it permits the definition of hierarchical data models, where the existence of certain data items is dependent on the existence of others at a higher level in the hierarchy (you can't have a loop at level 2 unless it is nested within a loop at level 1). I don't know enough to say whether this is a necessary or sufficient condition for mapping STAR data structures onto hierarchical data models, but I have a feeling that it may be important for doing this, and I'd be glad to hear any informed commentary on this.
In CIF loops, which are always at level 1, the _list_ attributes describe the relations between the data values in a table, and so this is an attempt to map onto a relational data model. If one wants to identify the key to entries in such a table (the data name or names which must have unique values within the table), one must collect together entries with _list_mandatory set to 'yes' and the complete set of _list_uniqueness pointers.
Here's a simple example, somewhat adapted from the MIF paper, to show what's going on. The following loop defines a table of bonds in a chemical structure:
    loop_
        _bond_id_1
        _bond_id_2
        _bond_type
        C1   C2   double
        C2   C3   single
        C3   C4   double
        C1   C7   single
The MIF dictionary entry for _bond_type includes the line
    loop_ _list_reference      '_bond_id_1'
                               '_bond_id_2'

meaning that both bond id values must be present for the table entry to make sense. In the entries for _bond_id_1 (and _2) are found the lines
    _list_mandatory            yes
    loop_ _list_uniqueness     '_bond_id_1'
                               '_bond_id_2'
meaning that these datanames must be present in the loop, and must together be unique (C1 appears twice as _bond_id_1 in the example, but that's OK: on one occasion it's teamed with C2, on another with C7).
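The composite-key rule at work in this example can be checked in a few lines. This is a toy Python sketch over the bond table above, not output from any CIF validator:

```python
# The _list_uniqueness constraint: the pair (_bond_id_1, _bond_id_2)
# must be unique across rows, though either column alone may repeat.
rows = [('C1', 'C2', 'double'),
        ('C2', 'C3', 'single'),
        ('C3', 'C4', 'double'),
        ('C1', 'C7', 'single')]

keys = [(id1, id2) for id1, id2, _ in rows]
assert len(keys) == len(set(keys))            # the composite key is unique
assert [k[0] for k in keys].count('C1') == 2  # C1 alone repeats, which is fine
```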
These relationships do all hang together; but it's necessary to do a certain amount of hunting through the dictionary to ensure that you've got all the relevant information (and it requires a lot of work to make sure that the dictionary yields this information in a consistent manner).
The other set of items I marked for interest are the _type_ group. Because the STAR philosophy is to store and deliver text strings, it was felt that the assignment of data types was largely unnecessary. Integers, floats, double-precision complex numbers and booleans did not need to be stored in any different manner. They can all be coded as text strings: "3", "2.76", "-1.30000988765876 + 0.00022456342987i", "yes". But there have been many voices raised to counter this view, and the solution adopted in DDL1.4 is to permit three fundamental types, with _type values of "numb", "char" and "null". ("null" is a device adopted to allow additional information to be stored in dictionaries for humans to read, but machine parsers to ignore.) _type_conditions extends this, so that 1.23(4) is understood as a number with associated standard uncertainty (e.s.d.) in CIF applications, and 1:4 is understood as a range of allowed integer values in certain MIF applications. _type_construct is a more general device which allows a data value to be compared with a pattern (in regular expression notation). Hence a date quantity could be described in the dictionary with a _type_construct of [0-9][0-9]:[0-9][0-9]:[0-9][0-9], meaning any triplet of two-digit integers separated by colons (not the format used in CIF!). The regex notation is very powerful, and in principle this allows very tight control over acceptable patterns in the string representing a data value.
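A _type_construct check reduces to matching the whole data value against the dictionary's pattern. A minimal sketch using the two-digit-triplet pattern from the text (the helper name is illustrative):

```python
import re

# The _type_construct pattern from the text: three two-digit integers
# separated by colons.
construct = r'[0-9][0-9]:[0-9][0-9]:[0-9][0-9]'

def matches_construct(value, pattern):
    """True if the entire value matches the dictionary's regex pattern."""
    return re.fullmatch(pattern, value) is not None

assert matches_construct('12:30:45', construct)
assert not matches_construct('1991-04-15', construct)
```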
A final comment on category. The category assigned to each data item defines which loop (table) it may appear in. If it doesn't normally appear in a loop, its category is more generally defined to encompass related items with the same status. Note that this linking of categories to loops results in rather a large number of distinct category assignments. It is not, perhaps, an effective taxonomic classification.
In COMCIFS mailing 30 this year (part of which is reproduced below), I already described how (I think) DDL2 is intended to work. Here I shall just pick up a few threads to contrast with specific remarks I made above on DDL1. First, the taxonomy employed allows a classification hierarchy: there are categories, defined just as the categories in DDL1, but there are also category_groups (clusters of categories, or supercategories), and subcategories. Hence _cell.length_a, _cell.length_b and _cell.length_c are the (only) three members of the subcategory "cell_length"; they are all members of the "cell" category; they are all members of the "cell_group" supercategory (as is, say, _cell_measurement_refln.index_h); and they are also members of the "inclusive_group" supercategory (as is everything in the mmCIF dictionary).
Second, the properties of a category are listed separately from the properties of its constituent members. If I were to extend my MIF example into DDL2, you would have an entry something like
    save_BOND
        _category.id                 bond
        loop_ _category_key.name    '_bond.id_1'
                                    '_bond.id_2'
    save_

which defines the key of the 'bond' category. None of the entries for _bond.id_1, _bond.id_2 or _bond.type (as they would become in DDL2) would contain anything about key values, except insofar as they indicate membership of the 'bond' category.
The mmCIF2 dictionary also contains a larger spread of what are effectively user-defined types: a list of type codes is established at the beginning of the dictionary using the equivalent of _type_construct. In each data name definition, an _item_type.code value is listed, which indexes into this table of user-defined types. The original DDL1 types are preserved (as _item_type_list.primitive_code and _item_type_conditions.code, I think), but there is more freedom to define application-specific types for the dependent application to handle.
Note that datanames described by DDL2 dictionaries will have an embedded dot character to separate their "category" part from their "instance" part: this will be the most obvious difference between CIFs containing datanames described by DDL2 dictionaries and existing CIFs. The DDL2 proposal endeavours to honour existing CIFs through an aliasing mechanism.
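At its simplest, the aliasing mechanism is a lookup table consulted before dictionary access. The sketch below uses a hand-made two-entry alias table for illustration only; the real mmCIF alias list is of course far larger:

```python
# Toy fragment of a DDL1 -> DDL2 alias table.
aliases = {
    '_atom_site_label': '_atom_site.label',
    '_cell_length_a':   '_cell.length_a',
}

def canonical_name(name):
    """Return the DDL2 name, translating known DDL1 aliases."""
    return aliases.get(name, name)

assert canonical_name('_atom_site_label') == '_atom_site.label'
assert canonical_name('_atom_site.label') == '_atom_site.label'
```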
For the sake of completeness, I attach here a simple list of all the data names used in the DDL2 set:
    _category.description                _category.id
    _category.implicit_key               _category.mandatory_code
    _category_examples.case              _category_examples.detail
    _category_examples.id                _category_group.category_id
    _category_group.id                   _category_group_list.description
    _category_group_list.id              _category_group_list.parent_id
    _category_key.id                     _category_key.name
    _category_methods.category_id        _category_methods.method_id
    _datablock.description               _datablock.id
    _datablock_methods.datablock_id      _datablock_methods.method_id
    _dictionary.datablock_id             _dictionary.title
    _dictionary.version                  _dictionary_history.revision
    _dictionary_history.update           _dictionary_history.version
    _item.category_id                    _item.mandatory_code
    _item.name                           _item_aliases.alias_name
    _item_aliases.dictionary             _item_aliases.dictionary_version
    _item_aliases.name                   _item_default.name
    _item_default.value                  _item_dependent.dependent_name
    _item_dependent.name                 _item_description.description
    _item_description.name               _item_enumeration.detail
    _item_enumeration.name               _item_enumeration.value
    _item_examples.case                  _item_examples.detail
    _item_examples.name                  _item_linked.child_name
    _item_linked.parent_name             _item_methods.method_id
    _item_methods.name                   _item_range.maximum
    _item_range.minimum                  _item_range.name
    _item_related.function_code          _item_related.name
    _item_related.related_name           _item_structure.code
    _item_structure.name                 _item_structure_list.code
    _item_structure_list.dimension       _item_structure_list.index
    _item_sub_category.id                _item_sub_category.name
    _item_type.code                      _item_type.name
    _item_type_conditions.code           _item_type_conditions.name
    _item_type_list.code                 _item_type_list.construct
    _item_type_list.detail               _item_type_list.primitive_code
    _item_units.code                     _item_units.name
    _item_units_conversion.factor        _item_units_conversion.from_code
    _item_units_conversion.operator      _item_units_conversion.to_code
    _item_units_list.code                _item_units_list.detail
    _method_list.code                    _method_list.detail
    _method_list.id                      _method_list.inline
    _method_list.language                _sub_category.description
    _sub_category.id                     _sub_category_examples.case
    _sub_category_examples.detail        _sub_category_examples.id
    _sub_category_methods.method_id      _sub_category_methods.sub_category_id
DDL1.4 has evolved in the way I described above, and is a general mechanism for defining data names. It is not tied to any specific data model, but can be mapped onto a relational model, and possibly onto a hierarchical one.
DDL2 was developed as a relational model, and is better structured and more consistent in that respect. But it enforces this model on data files that it describes: if a CIF had a table representing some raw experimental data, it is possible that some lines of that table might be repeated (if the same reflection were re-measured, for instance). The relational viewpoint forces those lines to be distinguished, even if only by the addition of a new data name whose only purpose is to number the rows of the table! The stricter categorisation rules may also make more problematic the handling of external data names (i.e. those introduced by a user which have no definition in an official dictionary).
At this stage, the two formalisms are very closely compatible - DDL2 was designed to achieve this. But a whole-hearted implementation of all the DDL2 ideas may well lead to a divergence between data files described by DDL1 and DDL2 dictionaries. At present, an application reading a DDL2 dictionary is intended to be able to read and verify existing CIFs (currently described by DDL1 dictionaries) through an in-built aliasing mechanism. We have yet to see it demonstrated how well this will work; but it's unlikely that such a mechanism will track any changes that are made to the DDL1 language and its dependent data files in the future.
So we need to consider carefully whether all existing dictionaries should be recast in the DDL2 formalism, and whether that will affect the future evolution of CIF vis-a-vis parallel developments such as MIF.
(Here is an extract from the COMCIFS mailing mentioned above.)
D30.2 The New DDL
-----------------

Let me make a few general remarks about the philosophy behind DDL2. We have already had extensive discussions on the desirability of providing a self-consistent machine-readable set of data attributes, and over the last year or so the version 1 DDL has grown to include relations between data items. This approach is now taken a stage further.

In the new DDL dictionary, a hierarchy of objects is defined: category_groups (arbitrarily definable groups of categories, so that the geom_bond and geom_angle categories would naturally be collected into the geom category_group); categories (corresponding to the current definition of a category as a collection of data names which may occur in the same looped list, or outside of loops in a related aggregate); subcategories (collections of data items that form a coherent set within a category, e.g. *_h, *_k and *_l items might form a miller_index subcategory); and individual data items. Each of these hierarchical objects may be described by a separate set of DDL definitions (so there is, for example, a _category.description and a _sub_category.description).

The organisation of DDL2 dictionaries is different from DDL1. Each definition is given within a save_ frame, where previously each appeared within its own data block. The save_ frames are permitted STAR syntactic devices for encapsulating blocks of information which may be referenced from other places within the current data block. At this point, however, such references are not used - the save_ frames merely split the dictionary up into logical chunks, as did the previous fragmentation into data blocks. But because each definition within a dictionary is related to the rest of the information in the dictionary, it is best to have a single data block encompassing the whole dictionary.
John explains the reason for this reorganisation thus: 'The save_ syntax has been used in order to have a more consistent use of scope between data files and dictionaries. Since we are representing links between data items we are using save frames so that the referenced data items are all within the scope of the current dictionary. This is not the case now where data_ sections are used. Links between data blocks really violate the STAR scope rule that requires each data block to have a separate name space.'

Another point of difference is that in earlier dictionaries a single data block might contain the description of more than one (more or less) related data names. Hence, in the core we have

    data_cell_length_
        loop_ _name                '_cell_length_a'
                                   '_cell_length_b'
                                   '_cell_length_c'
        _type                      numb
        _enumeration_range         0.0:
        _esd                       yes
        _esd_default               0.0
        loop_ _units_extension
              _units_description
              _units_conversion    ' '     'Angstroms'    *1.0
                                   '_pm'   'picometres'   /100.
                                   '_nm'   'nanometres'   *10.
        _definition
    ;              Unit-cell lengths corresponding to the structure
                   reported. ...
    ;

In the new formulation, each such definition would have its own save_ frame (i.e. one each for _cell_length_a, _b and _c). However, it IS possible to have more than one definition within a save_ frame, and this occurs when 'parent' and 'children' are defined together (recall that the child relationship provides for pointers between identifiers in different lists - a typical example is a _geom_bond_atom_site_label_1 which must match an _atom_site_label). In the new dictionaries, this would be written as

    save_atom_site.label
        _item_description.description
    ;              The _atom_site.label is a unique identifier ...
    ;
        loop_ _item.name                      _item.category_id   _item.mandatory_code
              '_atom_site.label'              atom_site           yes
              '_geom_bond.atom_site_label_1'  geom_bond           yes
              '_geom_bond.atom_site_label_2'  geom_bond           yes
        loop_ _item_linked.child_name         _item_linked.parent_name
              '_geom_bond.atom_site_label_1'  '_atom_site.label'
              '_geom_bond.atom_site_label_2'  '_atom_site.label'
        _item_type.code                       char
        loop_ _item_examples.case             C12  Ca3g28  Fe3+17  H*251  boron2a
    save_

One other major change that you may have noticed is the introduction of a dot character into data names to differentiate the category name from the instance within the category. John believes this to be very important for the efficient validation of tabular relationships (in other words it makes it easy to enforce the rule that a loop_ contains only items from the same category). Note that it is not essential to do this - each dictionary definition may explicitly list the category to which the data name belongs. But John prefers that the category should be easily extractable from the data name alone, using the dot (or some other separator character). This directly contradicts an earlier COMCIFS decision that no character beyond the leading underscore should have a special meaning. It also raises the minor difficulty that all data names adhering to this convention and including a dot will be different from the data names published in the core dictionary.

To permit compatibility with existing data files, John has introduced an alias mechanism, so that _atom_site_label will be recognised and internally translated to _atom_site.label (and, indeed, to _atom_site.id also in this particular instance), which is an entry in the new dictionary. We need to give full consideration to the wisdom or otherwise of this approach. For the most part, the trade-off seems to be between computational efficiency in John's applications and the multitude of headaches that might result from changing all the existing data names.
However, it's not quite so simple, since some of the aliases in Paula's draft do not map to exact equivalents in the new formulation (see, for instance, _atom_site.fract_x and _atom_site.fract_x_esd versus _atom_site_fract_x).
Copyright © 1997 International Union of Crystallography