(34) _units simplification; history of DDL

To: [email protected]
Subject: (34) _units simplification; history of DDL
From: bm
Date: Thu, 11 May 1995 14:56:18 +0100
Dear Colleagues

It occurs to me that I haven't posted news of another move: Phil Bourne has
been at the San Diego Supercomputing Center for a few months now - his
current e-mail address is [email protected]. Sorry not to have mentioned it
before, Phil!


Call for Agreement
==================

D34.1  Simplification of units descriptions in DDL1.4
-----------------------------------------------------
I call for agreement from all full members of COMCIFS on the following
proposal. Consultants are encouraged to give an 'aye' or 'nay' if they see
this decision affecting their own work and interests.

I mentioned briefly in D33.5 that Syd was proposing to drop the
_units_extension, _units_description and _units_conversion attributes from
the published DDL1.4 specification, and to replace them with _units and
_units_detail. Some of us have already been involved in correspondence
over this, and I include a few extracts from that correspondence below.
The DDL1.4 specification is currently being revised post-review, so this is
the last opportunity to modify it before publication.

You will see from my later article on the development of the DDL that there
is precedent for dropping unpublished DDL attributes. We would need to
promote the derivative data names (such as _cell_length_a_pm) to "full"
dataname status to ensure compatibility with existing uses of the
_units_extension idea, but I see no fundamental problems.

S> I am in the final throes of returning an expanded DDL specification paper to
S> JCICS.  In doing this expansion I pondered on the "_units_" attributes
S> again. They are an increasing source of irritation. This is the last
S> opportunity to expunge them painlessly from the record.
S> 
S> I raise the possibility with you of replacing the attributes
S> _units_extension, _units_description and _units_conversion with two new
S> ones _units and _units_detail. The first will be a parsable code as
S> in ..._extension and the second the description of the units.
S> 
S> This new usage implies that each data item will have fixed units, which I
S> think will make most people happy.

BM> Syd: In a sense it would be only a "small" job to fix the archived CIFs -
BM> the whole point of CIF is that you have well defined tags which are trivial
BM> to find (and thus amend). There might be a lot of them, and it might take
BM> many CPU cycles, but our machines are patient and forbearing enough...
BM> 
BM> But I don't see that they need to be fixed. Just add _cell_length_a_pm etc
BM> to the existing core (as extra, full, datanames with new definitions
BM> specifying the units - or, rather, with new _units). This preserves the
BM> integrity of any existing files (outside of the archive) which conform to
BM> core version 1991. Rather a lot of clutter, it's true. But at least in DDL2,
BM> these old items could all be put into a supercategory (what does John call
BM> it - _category_group ?) called something like 'compatibility' that will tend
BM> to keep them out of the way.

PMR> I have no problems in principle with simplifying the _units_.  There is a
PMR> general problem which I think the CIF community has to face which is how
PMR> and when there can be stable software for the language.  I originally 
PMR> wrote quite extensive software for the _units_* hierarchy and this now 
PMR> would need to be rewritten if anyone is going to use it.  
PMR> 	I can't see a simple way round this in the CIF language, but I do 
PMR> think it's important that we get enough stability to start writing code.  
PMR> Until that happens, it will be possible for software to check the 
PMR> basic CIF syntax but not, say, to check entries against the dictionary.  
PMR> It's not trivial at present to see whether cell_length_a_pm is a 
PMR> dictionary entry since there is no simple clue to the parser that _pm is 
PMR> a unit and not part of the name.
PMR> 	I believe that the appropriate way forward is through attributes 
PMR> such as are possible in SGML (no - I'm not deserting CIF, but I think we may 
PMR> get help from this direction).  Here I would write something like:
PMR> <CELL><LENGTH><A UNIT="pm" ESD=0.3>1412.3</A></LENGTH></CELL> which shows 
PMR> the structure more clearly for the parser.  An alternative, which could 
PMR> map easily onto CIF, could be:
PMR> <FLOAT ID="_cell_length_a" UNITS="pm" ESD=0.3>1412.3</FLOAT>
PMR> 
PMR> 	At the risk of being branded a heretic, this has the merit of 
PMR> parsability and extensibility.  A CIF could be translated into that form 
PMR> by using a translation table.

JW> I would welcome the elimination of the current manner in which units
JW> attributes are expressed.  In the new DDL we have assigned a code for
JW> each unit type using the attribute _item_units.code.   The unit codes
JW> are then defined/described once in a separate category item_units_list.
JW> We have also included the category item_units_conversion which provides
JW> the correspondence and conversion information between unit types. This has
JW> been implemented in the mmCIF dictionary and I think that it has
JW> worked out rather well.

S> Peter's call for stability is of course appropriate ... he is right and
S> it is most definitely what I am trying to achieve with the publication
S> of DDL1.4. The problem Peter cites about identifying what is an extension
S> to a dataname or a new dataname is in my view the crucial one -- it can be
S> done, but....!


Continuing discussions
======================

(33)D28.2, D28.3 R Factors
--------------------------
G> I must congratulate Paula on achieving what I failed to achieve in a long
G> exchange of emails with Syd when the original CIF was being developed,
G> namely to change the definition of the R-factor ! (if I had succeeded, life
G> might now be easier for authors and co-editors of Acta C, but it was never
G> the intention of CIF to make life easy).
G> 
G> I feel strongly that there are two separate pieces of information that should
G> be kept logically separate:
G> 
G> (a) A 'conventional' R-factor for the purposes of comparing structures 
G> that may have been refined by different procedures by different people,
G> including the many structures published before the CIF revolution.  This
G> should be what most people have always understood as an R-factor, e.g.
G> the formula given by Paula for reflections defined by the resolution 
G> ranges and 'observed' criteria exactly as she has specified, but without
G> the clause 'and that were included in the refinement'.
G> 
G> (b) The procedure used for refining the structure, including the quantity
G> minimized (well, almost) and the specification of which reflections were
G> used in this procedure.  For example we quite happily refine proteins 
G> against F-squared for all data whereas it appears that Paula is refining
G> against F for a specified subset of the data.  It is not the responsibility
G> of COMCIFS to define one particular refinement strategy as correct and to
G> force everyone to use it, and it would stifle scientific progress to do so.
G> 
G> Given the R-factor defined as in (a), the maximum resolution and the 
G> completeness of the data, any experienced crystallographer has a feel for
G> the quality of a particular structure determination.  This only works if
G> we keep it simple and in keeping with generally accepted practice.  For
G> (b), we must be as flexible as possible to allow for progress.  Both sets
G> of information are essential in the CIF file, but must be kept logically
G> separate.

D30.3 The New DDL
-----------------
In the last mailing (in section D33.1) David Brown posed a few questions
about the relationship between DDL1 and DDL2, and asked for a history of
the DDL. To meet that request, I have scribbled down a few notes which
explain the position as I see it. I am still far from sure in my own mind
how future (and indeed, current draft) dictionaries should be written. This
worries me slightly, for I cannot clearly articulate why I am uneasy at the
suggestion to discard DDL1 for CIF dictionary purposes. Perhaps I am just
growing irrationally conservative in my old age!

                       A BRIEF HISTORY OF THE DDL

1. The STAR file
================
The initial intention of a STAR ("Self-Defining Text Archive and Retrieval")
file was to provide a set of simple syntactic rules for storing and
retrieving text strings in a flexible and extensible manner. The rules are
indeed simple. Here they are, in essence:
   The file is divided into tokens separated by white space.
   A token may include white-space characters provided
    (a) it is surrounded by matching single or double quotes and contains
    no <newline> characters; or
    (b) it is surrounded by <newline><semicolon> character pairs (digraphs).
   A token beginning with an underscore character is a dataname that MUST
    have an accompanying value. For non-looped data, this value is the
    following token.
   Looped data are permitted. A loop structure contains a header and a
    body. The header is introduced by the reserved word "loop_" and contains
    a list of datanames. The values in the loop body are associated with
    matching datanames in the header, in strict rotation. In STAR (but not
    in CIF), nesting of loops is permitted. In the header, each level of
    nesting below the first is introduced by a "loop_" keyword and
    terminated by "stop_".
   Other reserved words are "data_xxx" where xxx is an arbitrary string
    that must be unique within the file; "global_"; "save_xxx" and "save_".
    These are used to partition the file into distinct data cells with
    specific scoping rules. Of these, only "data_xxx" is used in CIF. All
    datanames and associated values must occur within one of these data
    cells.
   Comments (introduced by the "#" character and terminated by <newline>)
    are permitted anywhere that a token is valid.
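
The token rules above can be sketched as a small tokenizer. This is an
illustrative Python sketch only, not the code of any existing STAR
application: the function name tokenize is hypothetical, the sketch assumes
that a closing quote must be followed by white space or end of line, and it
does not record whether a token was quoted.

```python
def tokenize(text):
    """Split STAR/CIF text into tokens following the rules listed above."""
    tokens = []
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Rule (b): a text field opens with <newline><semicolon> and runs
            # to the next line that begins with a semicolon.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            tokens.append("\n".join(body))
            i += 1  # step over the closing semicolon line
            continue
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch.isspace():
                pos += 1
            elif ch == "#":
                break  # comment: ignore the rest of the line
            elif ch in "'\"":
                # Rule (a): quoted token on a single line. Assumption: the
                # closing quote is followed by white space or end of line.
                end = pos + 1
                while end < len(line):
                    at_end = end + 1 == len(line)
                    if line[end] == ch and (at_end or line[end + 1].isspace()):
                        break
                    end += 1
                tokens.append(line[pos + 1:end])
                pos = end + 1
            else:
                # Unquoted token: runs up to the next white-space character.
                end = pos
                while end < len(line) and not line[end].isspace():
                    end += 1
                tokens.append(line[pos:end])
                pos = end
        i += 1
    return tokens
```

Pairing each dataname (a token beginning with an underscore) with the token
that follows it, and handling loop_ headers, would be built on top of this
token stream.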
 
There is no restriction on the nature of the information conveyed by these
tokens - the STAR rules are purely syntactic, and allow a specified token
to be retrieved by an application - i.e. given a dataname, the matching
data value or values should be retrieved. The application starbase (sb)
works entirely at this syntactic level, and guarantees to return requested
data in fully compliant STAR format.

This syntactic format was chosen as the base layer for the CIF, but with
some simplifications to aid programming. The only recognised data cell
delimiter is the data_xxx block code; nesting of loops is prohibited; block
codes and datanames are restricted to 32 characters in length; lines are
restricted to 80 characters.

The initial semantic layering (i.e. the imposition of "meaning" onto the
abstract tokens) was done by devising datanames that were self-expressive,
and by imposing some basic data types on the associated values. Effectively
only two data types were permitted - "numb" for numerical values (with a
permitted standard uncertainty value in trailing parentheses); and "char"
for textual information (though some applications might choose to
differentiate between multi-line text fields bracketed by
<newline><semicolon> digraphs and single-line character strings with
quote marks or no delimiters).

[No, Paula, I haven't forgotten "null" - but this is still In The Beginning...]

Hence, this could be considered a totally valid CIF:
   data_proto
   _my.crystal's.habit   'turns green in daylight'

But there is a problem here: the word 'habit', used here in the sense of
'customary behaviour', is used in crystallography with a specialised
meaning. This ambiguity is fatal to the purpose of devising a universal
exchange mechanism.

2. A Dictionary of Universal Terms
==================================
So the next step was to devise a dictionary of datanames which represented
very specific terms and definitions in crystallography. The dictionary
would list all datanames with a universal meaning, together with a
definition of that meaning, an indication of whether the associated value
was numeric or textual, and any constraints that could be applied to that
value. The datanames were constructed in a way designed to illustrate their
relationship with each other, through a hierarchy of subcomponents
separated by underscores - hence _atom_site_symmetry_multiplicity etc.

And this was indeed how the original CIF dictionary was submitted to Acta -
as a MS-Word file with the definitions laid out just as in a lexicographic
dictionary. An appendix listed permitted codes for certain values (what we
habitually call the 'enumeration lists'), and some general elements of the
definitions were also included in the main text of the paper. All this
information was available only to the human reader.

However, Tony Cook, who was working with Syd on the chemical (MIF)
applications of STAR files, pointed out that much of the stored information
on each dataname could be extracted by computer if it were presented in an
appropriate way - and what way would be more appropriate than as a STAR
file, so that the same software being written to extract information from a
CIF could be used to extract information from the dictionary? And so it
came to pass - by the time the original CIF paper went to press, the
dictionary had been recast as a STAR file (with the same syntax
restrictions as a CIF). The dictionary information was associated with a
new set of datanames, which form the vocabulary of the Dictionary
Definition Language, or DDL. Note that this formalism is only mentioned in
passing in the CIF paper: the typeset version of the dictionary translates
the DDL names into sentences, so that the dictionary again resembles a
lexicographic one in its layout of entries.

Here is an example of an entry in the core dictionary:

data_atom_site_attached_hydrogens
    _name                       '_atom_site_attached_hydrogens'
    _type                        numb
    _list                        yes
    _list_identifier            '_atom_site_label'
    _enumeration_range           0:4   
    _enumeration_default         0      
    loop_ _example              
          _example_detail        2    'water oxygen' 
                                 1    'hydroxyl oxygen'
                                 4    'ammonium nitrogen' 
    _definition
;              The number of hydrogen atoms attached to the atom at this site 
               excluding any H atoms for which coordinates (measured or 
               calculated) are given. 
;

The DDL used for this version is documented only in comments at the end of
the dictionary file cifdic.C91. I reproduce it here for historical
interest, and shall refer to this version as DDL0.

##############################################################################
#
#                        DDL Data Name Descriptions
#                        --------------------------
#
# _compliance           The dictionary version in which the item is defined.
#
# _definition           The description of the item.
#
# _enumeration          A permissible value for an item. The value 'unknown'
#                       signals that the item can have any value.
#
# _enumeration_default  The default value for an item if it is not specified
#                       explicitly. 'unknown' means default is not known.
#
# _enumeration_detail   The description of a permissible value for an item.
#                       Note that that the code '.' normally signals a null
#                       or 'not applicable' condition.
#
# _enumeration_range    The range of values for a numerical item. The
#                       construction is 'min:max'. If 'max' is omitted then the 
#                       item can have any value greater than or equal to 'min'.
#
# _esd                  Signals if an estimated standard deviation is 
#                       expected to be appended (enclosed within brackets)
#                       to a numerical item. May be 'yes' or 'no'.
#
# _esd_default          The default value for the esd of a numerical item
#                       if a value is not appended.
#
# _example              An example of the item.
#
# _example_detail       A description of the example.
#
# _list                 Signals if an item is expected to occur in a looped
#                       list. Possible values 'yes','no' or 'both'.
#
# _list_identifier      Identifies a data item that MUST appear in the list
#                       containing the currently defined data item.
#
# _name                 The data name of the item defined.
#
# _type                 The data type 'numb' or 'char' (latter includes 'text').
#
# _units_extension      The data name extension code used to specify the units 
#                       of a numerical item.
#
# _units_description    A description of the units.
#
# _units_conversion     The method of converting the item into a value based
#                       on the default units. Each conversion number is 
#                       preceded by an operator code *, /, +, or - which 
#                       indicates how the conversion number is applied.
#
# _update_history       A record of the changes to this file.
#
#-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof


I know of two applications which are able to read the CIF core and validate
CIFs against that dictionary, expressed in DDL0: CIFCHK from the CCDC,
which we use to check CIFs as we archive them (but this program has never
been released to the public domain); and sb (which has gaps - it can't
handle _units_extension).

The benefit of this approach is, of course, that you do not need to write a
new subroutine to handle each new entry in the dictionary - no small
consideration, now that we have more than 1200 entries. By the time of the
Veszprem computing school in mid-1992, Peter Murray-Rust, for instance, was
already working on a C++ object library for CIF. The ambition was to
provide a library of routines (written in C++ but callable from C (and
maybe even Fortran?)) for extracting CIF data and validating it against the
dictionary.
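
To illustrate the point about not needing a subroutine per entry, here is a
minimal Python sketch of a validator driven entirely by dictionary
attributes. It is not how CIFCHK or sb work internally; the names
DICTIONARY, NUMB and validate are hypothetical, only the _type and
_enumeration_range attributes are handled, and treating an omitted 'min' as
"no lower bound" is an assumption.

```python
import re

# Attributes for one data name, transcribed from the core entry shown earlier.
DICTIONARY = {
    "_atom_site_attached_hydrogens": {
        "_type": "numb",
        "_enumeration_range": "0:4",
    },
}

# A number, optionally followed by a standard uncertainty in parentheses.
NUMB = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\((\d+)\))?$")


def validate(name, value, dictionary=DICTIONARY):
    """Return a list of problems found for one dataname/value pair."""
    entry = dictionary.get(name)
    if entry is None:
        return [f"{name}: not defined in dictionary"]
    problems = []
    if entry.get("_type") == "numb":
        match = NUMB.match(value)
        if match is None:
            return [f"{name}: {value!r} is not a number"]
        number = float(match.group(1))
        rng = entry.get("_enumeration_range")
        if rng:
            # 'min:max'; an omitted 'max' means no upper bound.
            low, _, high = rng.partition(":")
            if low and number < float(low):
                problems.append(f"{name}: {value} is below minimum {low}")
            if high and number > float(high):
                problems.append(f"{name}: {value} is above maximum {high}")
    return problems
```

Adding a new entry to the dictionary then requires no new code: the same
validate routine picks up its attributes.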

3. Relationships Between Data Names - the DDL Extended
======================================================
At the York meeting in April '93, the viewpoint was advanced that CIFs
could usefully be treated as implementations of various formal data models.
As I have described its evolution so far, the CIF data model has no
explicit mechanism for recording relationships between items of data 
contained in the file. But there were already certain implicit conventions.
The dataname scheme embodied a hierarchy of definitions: consider the three
data names _atom_site_label, _atom_site_fract_x and
_atom_type_oxidation_number. "Obviously" (to a human), these all have
something to do with atoms; the first two have something to do with the
positions occupied by atoms in a cell; and one would expect
_atom_site_fract_x to have corresponding _y and _z equivalents (which it
has). In this hierarchical picture, the hierarchical relationships can be
unravelled to a large extent by parsing the name components in the
dataname.

But only "to a large extent". The hierarchies as set up in the core
dictionary are not rigid. The CIF paper lists _chemical_ and
_chemical_conn_ as separate "category" parents. (I put the word category in
quotes here because it was not a well defined idea. Most people have a feel
for the type of categorisation being attempted, but the rules for assigning
categories were for a long time difficult to pin down.) In like manner, the
_atom_site_ and _atom_site_aniso_ groups of datanames were clearly related,
but yet distinct. The relationship was not so clean as the similar form of
datanames suggested.

In practice, related data are grouped together - conventionally in the form
of tables; in CIFs within loops. A 'relational' data model is one which
can represent data as tables obeying certain definite rules: each line of
the table must be indexed by a unique key. It was argued that the CIF
tables could be mapped onto a relational data model, provided that the
necessary rules were enforced through the relationships between data names
stored in the dictionary. In other words, the DDL should be revised to
allow valid relational tables to be constructed and validated. This would
make it much easier to load a relational database with the relevant
information from a CIF; or, conversely, one supposes, to borrow the tools
of relational database management for managing CIFs.

So, between the York meeting and the workshop in Tarrytown of October '93,
there was a great deal of e-mail correspondence and a consequent evolution
in the DDL to meet these objectives. Each dataname was now assigned
formally to a category (the name of which was usually the same as the
dataname stem, e.g. atom_site - but not necessarily so). All the items in a
table (i.e. looped list) belonged to the same category. Items not in a
table were assigned a different category. Certain uniqueness constraints
could be applied to one or more of the datanames. This had the effect of
defining the unique 'key' for the table. For instance, in the atom_site
category, _atom_site_label had to be unique (in crystallography-speak, this
just means that every atomic site had to have a unique label). Additional
DDL names were introduced to record parent-child relationships between data
names in different categories. Atom labels occur in the geometry tables,
for instance _geom_bond_atom_site_label_1. These labels must match
corresponding labels of atoms in the atom_site list, so the 'parent'
of _geom_bond_atom_site_label_1 is _atom_site_label, and conversely
_geom_bond_atom_site_label_1 is a 'child' of _atom_site_label.

To meet all these requirements, the original DDL was extended and modified,
and is now in press as version 1.4. I append below a list of the datanames
defined in that version; those prefixed by * are extensions to the DDL0 set:

     *   _category   
         _definition
     *   _dictionary_history   
     *   _dictionary_name   
     *   _dictionary_update   
     *   _dictionary_version   
         _enumeration
         _enumeration_default
         _enumeration_detail
         _enumeration_range
         _example
         _example_detail
         _list
     *   _list_level   
     *   _list_link_child   
     *   _list_link_parent   
     *   _list_mandatory   
     *   _list_reference   (replaces _list_identifier)
     *   _list_uniqueness   
         _name
     *   _related_item   
     *   _related_function   
         _type
     *   _type_conditions   
     *   _type_construct   
         _units_extension        |
         _units_description      |- under review - see below
         _units_conversion       |

The following DDL0 terms were dropped:

  _compliance      (replaced by the _dictionary_ attributes)
  _esd             (incorporated in _type_conditions)
  _esd_default
  _list_identifier (replaced by _list_reference)
  _update_history  (replaced by _dictionary_history)

The _units_ items are currently under review by Syd, because they pose a
problem in reading a dictionary. The idea is that _cell_length_a is defined
in the dictionary as a quantity in angstroms; but it has a _units_extension
code _pm, which means that the dataname _cell_length_a_pm is to be
understood as a cell length in picometres (the numeric conversion factor is
embodied in _units_conversion). But a straightforward dictionary lookup
doesn't return _cell_length_a_pm as a valid dataname.
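
The extra work a program must do can be sketched as follows. This is an
illustrative Python sketch, not code from any existing CIF application: the
names CORE, resolve and to_default_units are hypothetical, and the
conversion strings '/100' and '*10' are assumed values chosen to follow the
"operator code followed by a number" form described for _units_conversion.

```python
import operator

OPS = {"*": operator.mul, "/": operator.truediv,
       "+": operator.add, "-": operator.sub}

# Assumed dictionary content: each base name lists its units extension codes
# with the conversion that yields a value in the default units (angstroms).
CORE = {
    "_cell_length_a": {
        "_units_extension": {"_pm": "/100", "_nm": "*10"},
    },
}


def resolve(name, dictionary=CORE):
    """Map a dataname to (base name, conversion string or None)."""
    if name in dictionary:
        return name, None
    # A plain lookup failed, so try every base name plus every extension.
    for base, entry in dictionary.items():
        for ext, conversion in entry.get("_units_extension", {}).items():
            if name == base + ext:
                return base, conversion
    raise KeyError(name)


def to_default_units(name, value, dictionary=CORE):
    """Return (base name, value expressed in the default units)."""
    base, conversion = resolve(name, dictionary)
    if conversion is None:
        return base, value
    op, number = conversion[0], float(conversion[1:])
    return base, OPS[op](value, number)
```

The search over every base name and extension is exactly the step that a
straightforward lookup avoids, which is why promoting names such as
_cell_length_a_pm to full datanames simplifies matters.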

Two sets of items have involved especially protracted labours. The _list_
items define the attributes and relationships between items in a looped
list. The _list_level allows for data names to be assigned to deeper levels
in nested loops. As such, it has no use in CIF applications, where nested
loops are forbidden; but it allows nested loop structures to be defined in
other STAR applications which do permit it. In my view this is important,
for it permits the definition of hierarchical data models, where the
existence of certain data items is dependent on the existence of others at
a higher level in the hierarchy (you can't have a loop at level 2 unless it
is nested within a loop at level 1). I don't know enough to say whether
this is a necessary or sufficient condition for mapping STAR data
structures onto hierarchical data models, but I have a feeling that it may
be important for doing this, and I'd be glad to hear any informed
commentary on this.

In CIF loops, which are always at level 1, the _list_ attributes describe
the relations between the data values in a table, and so this is an attempt
to map onto a relational data model. If one wants to identify the key to
entries in such a table (the data name or names which must have unique
values within the table), one must collect together entries with
_list_mandatory set to 'yes' and the complete set of _list_uniqueness
pointers.

Here's a simple example, somewhat adapted from the MIF paper, to show what's
going on. The following loop defines a table of bonds in a chemical structure:

     loop_  _bond_id_1   _bond_id_2  _bond_type
              C1   C2   double
              C2   C3   single
              C3   C4   double
              C1   C7   single

The MIF dictionary entry for _bond_type includes the line
          loop_  _list_reference    '_bond_id_1'   '_bond_id_2'
meaning that both bond id values must be present for the table entry to
make sense. In the entries for _bond_id_1 (and _2) are found the lines
                 _list_mandatory    yes
          loop_  _list_uniqueness   '_bond_id_1'   '_bond_id_2'
meaning that these datanames must be present in the loop, and must together
be unique (C1 appears twice as _bond_id_1 in the example, but that's OK: on
one occasion it's teamed with C2, on another with C7).
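
The mandatory and uniqueness checks just described can be sketched in a few
lines of Python. The function name check_loop and its arguments are
hypothetical; in a real application the mandatory names and key names would
be gathered from the _list_mandatory and _list_uniqueness entries in the
dictionary rather than passed in by hand.

```python
def check_loop(header, rows, mandatory, key_names):
    """Return a list of problems for one looped table.

    header    -- datanames in the loop header, in order
    rows      -- list of value tuples, one per table line
    mandatory -- datanames that must appear in the header
    key_names -- datanames whose values must together be unique
    """
    problems = [f"missing mandatory item {name}"
                for name in mandatory if name not in header]
    if problems:
        return problems
    columns = [header.index(name) for name in key_names]
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            problems.append(f"duplicate key {key}")
        seen.add(key)
    return problems


header = ["_bond_id_1", "_bond_id_2", "_bond_type"]
rows = [("C1", "C2", "double"), ("C2", "C3", "single"),
        ("C3", "C4", "double"), ("C1", "C7", "single")]
keys = ["_bond_id_1", "_bond_id_2"]
problems = check_loop(header, rows, mandatory=keys, key_names=keys)
```

For the bond table above, problems is empty, because no (id_1, id_2) pair is
repeated even though C1 occurs twice in the first column.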

These relationships do all hang together; but it's necessary to do a
certain amount of hunting through the dictionary to ensure that you've got
all the relevant information (and it requires a lot of work to make sure
that the dictionary yields this information in a consistent manner).

The other set of items I marked for interest are the _type_ group. Because
the STAR philosophy is to store and deliver text strings, it was felt that
the assignment of data types was largely unnecessary. Integers, floats,
double-precision complex numbers and booleans did not need to be stored in
any different manner. They can all be coded as text strings: "3", "2.76",
"-1.30000988765876 + 0.00022456342987i", "true". But there have been many
voices raised to counter this view, and the solution adopted in DDL1.4 is
to permit three fundamental types, with _type values of "numb", "char" and
"null". ("null" is a device adopted to allow additional information to be
stored in dictionaries for humans to read, but machine parsers to ignore.)
_type_conditions extends this, so that 1.23(4) is understood as a number
with associated standard uncertainty (e.s.d.) in CIF applications, and
1:4 is understood as a range of allowed integer values in certain MIF
applications. _type_construct is a more general device which allows
a data value to be compared with a pattern (in regular expression
notation). Hence a date quantity could be described in the dictionary with
a _type_construct of [0-9][0-9]:[0-9][0-9]:[0-9][0-9], meaning any triplet
of two-digit integers separated by colons (not the format used in CIF!).
The regex notation is very powerful, and in principle this allows very
tight control over acceptable patterns in the string representing a data
value.
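
Both devices can be illustrated with a short Python sketch. The function
names matches_construct and split_esd are hypothetical, Python's re syntax
is assumed to be compatible with the dictionary's regular-expression
notation for simple patterns like the one above, and the bracketed digits
are taken to apply to the last decimal places of the value, following the
usual crystallographic convention.

```python
import re


def matches_construct(value, construct):
    """True if the whole value matches the _type_construct pattern."""
    return re.fullmatch(construct, value) is not None


ESD = re.compile(r"([+-]?\d+(?:\.\d+)?)\((\d+)\)")


def split_esd(text):
    """Split a string such as '1.23(4)' into (value, standard uncertainty)."""
    match = ESD.fullmatch(text)
    if match is None:
        raise ValueError(f"not a number with appended e.s.d.: {text!r}")
    mantissa, digits = match.groups()
    # The uncertainty digits refer to the last decimal places of the value.
    decimals = len(mantissa.partition(".")[2])
    return float(mantissa), int(digits) / 10 ** decimals


TIME_LIKE = "[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"
ok = matches_construct("12:30:45", TIME_LIKE)
value, su = split_esd("1.23(4)")
```

Here the pattern accepts "12:30:45" but would reject "1995-05-11", and
"1.23(4)" is split into the value 1.23 and an uncertainty of 0.04.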

A final comment on category. The category assigned to each data item
defines which loop (table) it may appear in. If it doesn't normally appear
in a loop, its category is more generally defined to encompass related
items with the same status. Note that this linking of categories to loops
results in rather a large number of distinct category assignments. It is
not, perhaps, an effective taxonomic classification.

4. Version 2 Unleashed 
======================
The DDL2 version developed by John Westbrook followed on from yet another
CIF workshop, held in Brussels in October '94. Its philosophy was to use the
same mechanism of supplying machine-readable attributes of data names in a
STAR dictionary, the attributes themselves labelled by STAR data names. But
the intention was more specifically to provide a representation better
suited for mapping onto a relational data model than the original DDL.
Macromolecular data are increasingly manipulated in relational databases,
and it is noteworthy that the PDB has recently published a schema for its
proposed relational database implementation of its stored data.

In mailing 30 of 15 Feb this year, I already described how (I think) DDL2
is intended to work. Here I shall just pick up a few threads to contrast
with specific remarks I made above on DDL1. First, the taxonomy employed
allows a classification hierarchy: there are categories, defined in the same
way as the categories in DDL1, but there are also category_groups (clusters
of categories, or supercategories), and subcategories. Hence _cell.length_a,
_cell.length_b and _cell.length_c are the (only) three members of the
subcategory "cell_length"; they are all members of the "cell" category;
they are all members of the "cell_group" supercategory (as is, say,
_cell_measurement_refln.index_h); and they are also members of the
"inclusive_group" supercategory (as is everything in the mmCIF
dictionary).

Second, the properties of a category are listed separately from the
properties of its constituent members. If I were to extend my MIF example
into DDL2, you would have an entry something like
   save_BOND
          _category.id               bond
    loop_ _category_key.name  '_bond.id_1'
                              '_bond.id_2'
   save_
which defines the key of the 'bond' category. None of the entries for
_bond.id_1, _bond.id_2 or _bond.type (as they would become in DDL2)
would contain anything about key values, except insofar as they indicate
membership of the 'bond' category.

The DDL2 approach also allows other general properties to be described in
the dictionary in (arguably) a more structured way. The _units_conversion
entries in DDL1 appear in individual definition blocks, so that the
conversion factor from angstroms to picometres is repeated for every data
name which may have associated data names in different units. In the mmCIF2
dictionary, all units required are gathered together in a single table at
the beginning of the dictionary. Individual data names have associated with
them the unit to attach to the quantity described, and any other units may
be generated by consulting the conversion table. (Note, however, that the
mmCIF2 does not permit a quantity to be expressed in other units in the
CIF: the units of _cell.length_a are angstroms, and there is no
_cell.length_a_pm or equivalent.)
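
The single-table arrangement might be used by an application as in the
following Python sketch. The unit codes ('angstroms', 'picometres',
'nanometres'), the table contents and the function name convert are all
assumptions made for illustration; only the column meanings (from_code,
to_code, operator, factor) follow the _item_units_conversion names listed
at the end of this note.

```python
import operator

OPS = {"*": operator.mul, "/": operator.truediv,
       "+": operator.add, "-": operator.sub}

# One shared table: (from_code, to_code, operator, factor). Values assumed.
CONVERSIONS = [
    ("angstroms", "picometres", "*", 100.0),
    ("angstroms", "nanometres", "/", 10.0),
]

# Each data name carries a single units code (cf. _item_units.code).
ITEM_UNITS = {"_cell.length_a": "angstroms"}


def convert(name, value, to_code):
    """Convert a stored value from the item's own units to to_code."""
    from_code = ITEM_UNITS[name]
    if from_code == to_code:
        return value
    for src, dst, op, factor in CONVERSIONS:
        if (src, dst) == (from_code, to_code):
            return OPS[op](value, factor)
    raise LookupError(f"no conversion from {from_code} to {to_code}")
```

The conversion is applied by the reading program after retrieval; the value
stored in the CIF itself remains in the one unit assigned to the data name.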

The mmCIF2 dictionary also contains a larger spread of what are effectively
user-defined types: a list of type codes is established at the beginning of
the dictionary using the equivalent of _type_construct. In each data name
definition, an _item_type.code value is listed, which indexes into this
table of user-defined types. The original DDL1 types are preserved (as
_item_type_list.primitive_code and _item_type_conditions.code, I think),
but there is more freedom to define application-specific types for the
dependent application to handle.

Both the units and type lists were dropped from my ciftex'd version of the
mmCIF dictionary because of technical difficulties in printing them, but
they are present in the dictionary file itself.

Note that datanames described by DDL2 dictionaries will have an embedded
dot character to separate their "category" part from their "instance" part:
this will be the most obvious difference between CIFs containing datanames
described by DDL2 dictionaries and existing CIFs. The DDL2 proposal
endeavours to honour existing CIFs through an aliasing mechanism.
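
A minimal sketch of how a reading application might apply such aliases is
given below. The alias table plays the part of the _item_aliases.alias_name
and _item_aliases.name pairs; the particular entries and the function names
are assumptions made for illustration only.

```python
# Hypothetical alias table: existing (DDL1-style) name -> DDL2-style name.
ITEM_ALIASES = {
    "_cell_length_a": "_cell.length_a",
    "_cell_length_b": "_cell.length_b",
}


def resolve(name):
    """Map an aliased name to its DDL2 form; other names pass through."""
    return ITEM_ALIASES.get(name, name)


def split_name(name):
    """Split a DDL2 data name at the dot into its two parts."""
    category, dot, item = name.lstrip("_").partition(".")
    if not dot:
        raise ValueError("no category separator in " + name)
    return category, item
```

For example, split_name(resolve("_cell_length_a")) would give the pair
("cell", "length_a"), so that an existing CIF can be checked against the
definition of _cell.length_a.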

For the sake of completeness, I attach here a simple list of all the data
names used in the DDL2 set:

_block.description
_block.id
_category.description
_category.id
_category.mandatory_code
_category.method_id
_category_examples.case
_category_examples.detail
_category_examples.id
_category_group.category_id
_category_group.id
_category_group_list.description
_category_group_list.id
_category_group_list.parent_id
_category_key.id
_category_key.name
_dictionary.block_id
_dictionary.title
_dictionary.version
_dictionary_history.revision
_dictionary_history.update
_dictionary_history.update_day
_dictionary_history.update_month
_dictionary_history.update_year
_dictionary_history.version
_item.category_id
_item.mandatory_code
_item.sub_category_id
_item_aliases.alias_name
_item_aliases.name
_item_default.name
_item_default.value
_item_dependent.dependent_name
_item_dependent.name
_item_description.description
_item_description.name
_item_enumeration.detail
_item_enumeration.name
_item_enumeration.value
_item_examples.case
_item_examples.detail
_item_examples.name
_item_linked.child_name
_item_linked.parent_name
_item_range.maximum
_item_range.minimum
_item_range.name
_item_related.function_code
_item_related.name
_item_related.related_name
_item_structure.code
_item_structure.name
_item_structure_list.code
_item_structure_list.dimension
_item_structure_list.index
_item_type.code
_item_type.name
_item_type_conditions.code
_item_type_conditions.name
_item_type_list.code
_item_type_list.construct
_item_type_list.detail
_item_type_list.primitive_code
_item_units.code
_item_units.name
_item_units_conversion.factor
_item_units_conversion.from_code
_item_units_conversion.operator
_item_units_conversion.to_code
_item_units_list.code
_item_units_list.detail
_method.id
_method.name
_method_list.code
_method_list.detail
_method_list.id
_method_list.inline
_method_list.language
_sub_category.description
_sub_category.id
_sub_category.method_id
_sub_category_examples.case
_sub_category_examples.detail
_sub_category_examples.id

5. The Present
==============
So now we have two formalisms that can be used to describe the information
content of a STAR file: DDL1.4, which is used in the MIF core dictionary,
the CIF core, powder and modulated structures extensions, and in minor
applications like the WDC9 and ACA abstracts dictionaries; and DDL2.0.x,
which is used in the mmCIF dictionary. The current mmCIF also includes all
the core definitions reworked in the new formalism.

DDL1.4 has evolved in the way I described above, and is a general mechanism
for defining data names. It is not tied to any specific data model, but can
be mapped onto a relational model, and possibly onto a hierarchical one.

DDL2 was developed as a relational model, and is better structured and more
consistent in that respect. But it enforces this model on data files that
it describes: if a CIF had a table representing some raw experimental data,
it is possible that some lines of that table might be repeated (if the same
reflection were re-measured, for instance). The relational viewpoint forces
those lines to be distinguished, even if only by the addition of a new data
name whose only purpose is to number the rows of the table! The stricter
categorisation rules may also make more problematic the handling of
external data names (i.e. those introduced by a user which have no
definition in an official dictionary).
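
The point about repeated rows can be made concrete with a short sketch. The
column names and the function names here are hypothetical, chosen only to
show why a row-numbering data name becomes necessary under the relational
view.

```python
def has_unique_key(rows, key_names):
    """Return True if no two rows share the same values for the key."""
    seen = set()
    for row in rows:
        key = tuple(row[k] for k in key_names)
        if key in seen:
            return False
        seen.add(key)
    return True


def add_row_number(rows, id_name="id"):
    """Return copies of the rows with an extra item numbering them from 1."""
    return [dict(row, **{id_name: i}) for i, row in enumerate(rows, start=1)]


# The same reflection measured twice (hypothetical column names).
measurements = [
    {"index_h": 1, "index_k": 0, "index_l": 0, "intensity": 1523.0},
    {"index_h": 1, "index_k": 0, "index_l": 0, "intensity": 1523.0},
    {"index_h": 0, "index_k": 2, "index_l": 0, "intensity": 877.5},
]
numbered = add_row_number(measurements)
```

Here the indices alone cannot serve as a key for the table, whereas the
added "id" item can, although it carries no experimental information.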

At this stage, the two formalisms are very closely compatible - DDL2 was
designed to achieve this. But a whole-hearted implementation of all the DDL2
ideas may well lead to a divergence between data files described by DDL1
and DDL2 dictionaries. At present, an application reading a DDL2 dictionary
is intended to be able to read and verify existing CIFs (currently described
by DDL1 dictionaries) through an in-built aliasing mechanism. How well this
will work has yet to be demonstrated; but it is unlikely that such a
mechanism will track any changes made in future to the DDL1 language and
its dependent data files.

So we need to consider carefully whether all existing dictionaries should
be recast in the DDL2 formalism, and whether that will affect the future
evolution of CIF vis-a-vis parallel developments such as MIF.

=====

Two last small points: DDL stands for "Dictionary Definition Language"
(there have been numerous other descriptions over the last few years!).

And DDL is a STAR application (i.e. a set of data names and "values" in
STAR format). It's not "the description of the STAR standard". It's of
interest to us because it's the language in which CIF definitions are
written; but I suppose that, if all else were to fail, we could go back to
MS-Word and the language of Shakespeare :-)


Regards
Brian