Status of the STAR Dictionary Definition Language

SYDNEY R HALL

Crystallography Centre, University of Western Australia, Nedlands 6009, Australia

& ANTHONY P F COOK

Synopsis Scientific Systems, 175 Woodhouse Lane, Leeds LS2 3AR, United Kingdom


INTRODUCTION

The primary purpose of this workshop is to determine which software tools will be developed, and where and when, for the exchange and archiving of macromolecular data in CIF format. The mmCIF dictionary cif_mm_core.dic plays a central role in these developments, and it follows that the functionality of the language of CIF dictionaries, the STAR dictionary definition language (DDL)[1], is of critical importance. The purpose of this talk is to outline the attributes that have been defined in the Core DDL dictionary file ddl_core.dic [this may be obtained by anonymous FTP from 130.95.232.12 in directory /cif].

The structure of the STAR DDL was first proposed[2] by one of us (APFC) in 1991. Since then a great deal has been learnt about what types of definitions are needed in a CIF dictionary. The scope of the language has expanded mainly in response to the definition requirements of the two major STAR developments, CIF[3] and MIF[4]. These projects have, in particular, led to an understanding of which data attributes are essential, and how they should be implemented. However, this process has not been without its problems. We soon realised that it is impossible to reach unanimity on all issues, and early attempts to do so resulted in significant delays. We now adopt the procedure of circulating proposed additions to the DDL, weighing up the arguments pro and con, and then making a decision on their implementation. This is a pragmatic approach which works in a fast moving field where considerable dogma and inertia exist. The IUCr Committee on CIF Standards (COMCIFS) has been assisting us in promoting constructive discussion [well, most of the time, anyway :-) ].

The underpinning objectives of the DDL development are that the attributes be relatively simple (to maximise their understanding by inexpert users), extensible (to provide for future extension and new data applications) and adaptable to different data modelling paradigms. The last objective has not been enthusiastically received by those who favour a particular relational, hierarchical or object-oriented data model. We argue that this level of data interpretation is an application specific matter, and the DDL should service, where the constraints of simplicity and flexibility allow, all of these data paradigms.

The Core DDL attributes we describe here provide the basis for the stable and logical development of STAR applications and associated tools. They formalise the DDL concepts and vocabulary that are now being applied consistently to existing applications, dictionaries and software. These attributes represent the core vocabulary of the language, and their definition must not be changed in the future. They also represent the platform from which future expansion of the language vocabulary may take place.

DEFINITION STRUCTURE

The STAR dictionary definition language is composed of discrete data attributes which are organised within a STAR File structure. These attributes, individually and collectively, provide the vocabulary and the functionality of the DDL. The attribute definitions are contained within the DDL dictionary file ddl_core.dic. The text version of this dictionary, translated using CIFtex[5], will be distributed at the workshop.

The general structure of a data definition in a dictionary file is best illustrated with an example from an existing STAR application. The chemical melting point is defined in the CIF dictionary cif_core.dic as the data item _chemical_melting_point. Here is its definition.

	
data_chemical_melting_point
    _name                      '_chemical_melting_point'
    _category                    chemical
    _type                        numb
    _enumeration_range           0.0:
    loop_ _units_extension
          _units_description
          _units_conversion      ' '     'Kelvin'    +0
                                 '_C'    'Celsius'   +273.0
    _definition
;              Temperature at which a crystalline solid changes to a
               liquid.
;

The definition attributes _name, _type, etc. are self-descriptive. Each definition is contained within a single STAR data block which has a blockname which identifies the defined item. This example uses only a few of the possible DDL attributes. Details of all core attributes are given later. The structure of a DDL definition, and of the dictionary file as a whole, is STAR conformant and may be accessed using the same procedures and software normally applied to STAR data files. As a consequence of this conformance, the actual order of the attributes in the definition is irrelevant to their access and application.
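As an illustration of this point, the following Python sketch reads the melting point definition into a dictionary keyed by attribute name, so that attributes can be looked up regardless of their order in the block. It is a minimal sketch and not a full STAR parser: the function names tokenize and parse_block are hypothetical, only one data block with un-nested loops is handled, and the white space inside text fields is collapsed for convenience.

```python
import re

# Quoted string, comment, or bare word (simplified STAR lexical rules).
TOKEN = re.compile(r"'([^']*)'|\"([^\"]*)\"|(#.*)|(\S+)")

def tokenize(text):
    """Return (kind, value) pairs; kind is 'block', 'loop', 'name' or 'value'."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].startswith(";"):          # text field opened by ';' in column 1
            body = [lines[i][1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            i += 1                            # skip the closing ';'
            tokens.append(("value", " ".join(" ".join(body).split())))
            continue
        for m in TOKEN.finditer(lines[i]):
            single, double, comment, bare = m.groups()
            if comment is not None:
                continue                      # comments carry no data
            if single is not None:
                tokens.append(("value", single))
            elif double is not None:
                tokens.append(("value", double))
            elif bare == "loop_":
                tokens.append(("loop", bare))
            elif bare.startswith("data_"):
                tokens.append(("block", bare))
            elif bare.startswith("_"):
                tokens.append(("name", bare))
            else:
                tokens.append(("value", bare))
        i += 1
    return tokens

def parse_block(text):
    """Map each attribute name to its value (a list for looped attributes)."""
    toks = tokenize(text)
    items = {}
    i = 0
    while i < len(toks):
        kind, tok = toks[i]
        if kind == "loop":
            i += 1
            names, values = [], []
            while i < len(toks) and toks[i][0] == "name":
                names.append(toks[i][1])
                i += 1
            while i < len(toks) and toks[i][0] == "value":
                values.append(toks[i][1])
                i += 1
            for k, name in enumerate(names):
                items[name] = values[k::len(names)]
        elif kind == "name":
            items[tok] = toks[i + 1][1]
            i += 2
        else:
            i += 1                            # block header
    return items

DEFINITION = """\
data_chemical_melting_point
    _name                      '_chemical_melting_point'
    _category                    chemical
    _type                        numb
    _enumeration_range           0.0:
    loop_ _units_extension
          _units_description
          _units_conversion      ' '     'Kelvin'    +0
                                 '_C'    'Celsius'   +273.0
    _definition
;              Temperature at which a crystalline solid changes to a
               liquid.
;
"""

items = parse_block(DEFINITION)
print(items["_type"], items["_units_conversion"])
```

Because the result is keyed by attribute name, an application asks for items["_type"] or items["_definition"] directly and never depends on where the attribute appeared in the definition.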

The same example definition may be converted with CIFtex[5] software into a text entry of the published CIF dictionary[3].

_chemical_melting_point(numb)

Temperature at which a crystalline solid changes to a liquid.

The permitted range is 0.0 → ∞. The units extensions are:

' ' (Kelvin +0) '_C' (Celsius +273.0) [chemical]

ATTRIBUTES

Language attributes constitute the working vocabulary of the DDL. They may be subdivided into four attribute classes.

To establish the identity of a data item
To specify purpose or function of a data item
To define the links between data items
To perform dictionary control functions

1. Attribute Identification

Data identification provides the simplest form of data validation. In a STAR File each data item is represented by a data name and a data value. The DDL attribute used to define a data name is

_name '<data name>'

Multiple data names may be entered (as a looped list) for items that have a similar function, or that form an irreducible set. The data name specification represents the first level of validation for a STAR File. That is, it provides a check that an item is recognisable to a particular application. The _name attribute also provides a spelling check for validation software such as Cyclops[6]. The presence of an unrecognised item (i.e. a data name which is not present in the application dictionary) does not constitute a data violation. The presence of specific items may, however, be important in a particular application and, in some cases, missing items can invalidate other data references (the method for specifying mandatory data is discussed below). Extra (i.e. unrecognised) items will usually be ignored.
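This first level of validation can be pictured with a short Python sketch. The function name check_data_names and the sample names are hypothetical; the sketch assumes the dictionary has already been reduced to a plain list of its _name values, and it reports unrecognised names without treating them as errors, as described above.

```python
def check_data_names(file_names, dictionary_names):
    """Split the data names found in a STAR file into recognised and unrecognised."""
    known = set(dictionary_names)
    recognised = [n for n in file_names if n in known]
    unrecognised = [n for n in file_names if n not in known]
    return recognised, unrecognised

# Hypothetical dictionary and file contents, for illustration only.
dictionary_names = ["_atom_id", "_atom_type", "_atom_attach_h"]
file_names = ["_atom_id", "_atom_type", "_atom_tpye"]  # the last name is misspelt

recognised, unrecognised = check_data_names(file_names, dictionary_names)
print("recognised:  ", recognised)
print("unrecognised:", unrecognised)  # reported (a spelling check), not rejected
```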

2. Function Attributes

The majority of attributes in the DDL vocabulary describe the function or purpose of a defined data item. These are:

_definition, _enumeration, _enumeration_default, _enumeration_detail, _enumeration_range, _example, _example_detail, _list, _list_level, _type, _type_conditions, _type_construct, _units_extension, _units_description, _units_conversion

The _definition attribute is the text description of the defined data item. This provides the primary semantic information about the function and purpose of the defined data item. This text may also contain additional information about machine-parsible attributes present in the definition. This attribute is not intended to be machine-parsible.

The _enumeration attributes specify the permitted values for the defined data item. If a data item is restricted to a set of specific values, _enumeration and _enumeration_detail serve to itemise these. For example, in the definition of the atomic element symbol these two attributes are used to list the IUPAC element symbols in the periodic table (e.g. in the mif_core.dic dictionary file for _atom_type). The _enumeration_range attribute specifies the extreme (minimum and maximum) values of data items with a preordained sequence of values. The _enumeration_default attribute specifies the assumed value if a data item is not explicitly specified. Enumeration attributes are machine-parsible.
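A range check of this kind might be sketched in Python as follows. The sketch assumes the min:max form seen in the examples of this paper (0.0:, 0: and -99:99), treats a missing bound as unlimited, and assumes that both bounds are inclusive; the function names are hypothetical.

```python
def parse_range(spec):
    """Split a 'min:max' range; an empty side means that bound is unlimited."""
    low, _, high = spec.partition(":")
    lower = float(low) if low.strip() else None
    upper = float(high) if high.strip() else None
    return lower, upper

def in_range(value, spec):
    """Return True if value lies within the range (bounds assumed inclusive)."""
    lower, upper = parse_range(spec)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

print(in_range(273.0, "0.0:"))   # a melting point in Kelvin
print(in_range(100, "-99:99"))   # a formal charge outside the permitted range
```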

The _example attributes provide typical invocations of the defined data item and are not machine-parsible attributes.

The _list attributes specify how data items are used in looped lists. The attribute _list is a switch which has the value yes if the defined item must be contained within a loop_ structure, and no if it must not (the default mode). The _list value may also be both for data items that can be used in either mode. The attribute _list_level specifies the nest level of the loop structure in which the defined item is used. The level for all CIF applications is 1.
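The effect of the _list switch can be summarised in a few lines of Python. This is only a sketch: the function name list_usage_ok is hypothetical, and the caller is assumed to know already whether the item was found inside a loop_ structure.

```python
def list_usage_ok(list_mode, in_loop):
    """Check an item's placement against its _list value (yes, no or both)."""
    if list_mode == "yes":
        return in_loop          # must be inside a loop_ structure
    if list_mode == "no":
        return not in_loop      # must not be looped (the default mode)
    if list_mode == "both":
        return True             # either placement is acceptable
    raise ValueError("unknown _list value: %r" % list_mode)

print(list_usage_ok("yes", in_loop=False))
```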

The _type attributes fix the form or construction of the defined data item. The attribute _type is used to restrict the type of the defined item to either a string representing a number [the protocol of 'an acceptable number' is primarily an application matter], a non-numerical string (a single line or multi-line text bounded by semi-colons), or a data item to be used for DDL dictionary descriptions only. These data types are signalled with the codes numb, char or null. The attribute _type_conditions is used to specify extra conditions on the _type specification of numb and char data items, individually and globally. For example, _type_conditions is set to esd for all CIFs so that a standard deviation estimate may be appended, in parentheses, to a numerical item.
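The esd condition can be illustrated with a small Python sketch that separates a numerical value from an appended standard deviation. It assumes the common crystallographic convention that the digits in parentheses apply to the last decimal places of the value, and it accepts only plain decimal numbers without exponents; since the protocol for an acceptable number is an application matter, the pattern and the name parse_numb are assumptions rather than part of the DDL.

```python
import re
from decimal import Decimal

# Plain decimal number, optionally followed by esd digits in parentheses.
NUMB = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:\((\d+)\))?$")

def parse_numb(text):
    """Return (value, esd) as Decimals; esd is None when none is appended."""
    m = NUMB.match(text)
    if m is None:
        raise ValueError("not an acceptable number: %r" % text)
    value = Decimal(m.group(1))
    if m.group(2) is None:
        return value, None
    # Scale the esd digits to the last decimal place of the value.
    exponent = value.as_tuple().exponent
    esd = Decimal(int(m.group(2))).scaleb(exponent)
    return value, esd

print(parse_numb("1.234(5)"))
```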

The attribute _type_construct is used to specify how certain character data items must be encoded in terms of the regular expression language REGEX (version POSIX[7]) and other data items. For example, a chronological 'date' may be expressed in a variety of formats, all of which involve the components of day, month and year. _type_construct may be used to specify such encoded data as a precise representation. The definition of _enumeration_range in Appendix I contains an example of this type of construction.

The _units attributes specify the measurement units of defined numerical data items. _units_extension values are permitted adjuncts of a defined data name which signal different measurement units. The measurement units are identified with the attribute _units_description, and the expression to convert to each measurement unit is specified with _units_conversion. The format of the conversion expression is <arithmetic operator><conversion factor>. The permitted operators are multiplication '*', division '/', addition '+' and subtraction '-'. The value in default units (as defined for a data name with no extension) is calculated by applying the specified values to the conversion expression.
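As a worked example, the Python sketch below applies a conversion expression to a value given in extension units to obtain the value in default units. The function name to_default_units is hypothetical; the expressions are taken from the melting point definition above (+273.0 for the _C extension) and from the coordinate definition in Appendix II (/100. for _pm and *10. for _nm).

```python
import operator

# The four operators permitted in a conversion expression.
OPERATORS = {
    "*": operator.mul,
    "/": operator.truediv,
    "+": operator.add,
    "-": operator.sub,
}

def to_default_units(value, conversion):
    """Apply '<arithmetic operator><conversion factor>' to a value."""
    op = OPERATORS[conversion[0]]
    factor = float(conversion[1:])
    return op(value, factor)

# _chemical_melting_point_C of 25.0 becomes 25.0 + 273.0 Kelvin.
print(to_default_units(25.0, "+273.0"))   # 298.0
# A coordinate of 154 picometres expressed in Angstroms.
print(to_default_units(154.0, "/100."))
```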

3. Link Attributes

The third attribute class contains the DDL items responsible for specifying the relational links between data items. These are

_category, _list_link_child, _list_link_parent, _list_mandatory, _list_reference, _list_uniqueness, _related_item, _related_function

The _category attribute specifies the natural grouping of the defined data item. This attribute is particularly important for list data items (i.e. those with _list set to either yes or both), because all items in the same loop structure must have the same _category designation. [Note that items of the same category can appear in more than one loop structure, provided that appropriate 'reference data items' are available, but each loop should contain only items of this category.] The _category attribute provides a machine-parsible group identifier which is often contained in the data name construction itself. Data names are usually constructed with the format <category>_<sub-category>_<descriptor>.

The _list_link_ attributes serve to inter-relate data between different looped lists. They identify items which are used to connect different data structures. For example, atom sites in a molecule are identified by unique labels (see Example 1). These labels appear in the list of atomic sites as _atom_id and also in a separate list of atomic connections as _bond_id_1 and _bond_id_2. The labels used in both lists are identical, in one loop referring to the site information and in the other to the molecular connectivity.

The _list_link_child attribute is used to identify data items which depend implicitly on the presence of the defined item. This is a child dependency relationship. In the molecule example (Example 1) the items _bond_id_1 and _bond_id_2 are dependent on _atom_id because bonds are meaningless without the molecular site identification. The _list_link_parent attribute is the converse of _list_link_child in that it identifies a data item that must be present for the defined data item to be valid. In the molecular example _atom_id is the parent item of _bond_id_1 and _bond_id_2.

The _list_mandatory and _list_reference attributes are also closely related. _list_mandatory is a switch (yes or no) which signals if a defined data item is essential to the validity of a given category of loop structure (as specified by _category). _list_reference identifies items which represent the reference points for a set (or packet) of looped data. In other words, this attribute defines the data items by which a specific packet is referenced or accessed within the list. In the molecule example (Example 1) _atom_id is the _list_reference value for both _atom_type (the element symbol) and _atom_attach_h (the number of attached hydrogen atoms) because both these items refer to the atom site. Similarly, the items _bond_id_1 and _bond_id_2 are the reference values for _bond_type_ccdc.

The attribute _list_uniqueness identifies data items which, collectively, must have unique values in order that the looped list of a given category is valid. This is strictly a validation quantity and is defined only in association with items with a _list_mandatory value of yes (which are almost always reference data items for the loop category).
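The link and uniqueness checks can be sketched in Python using the values of Example 1. The helper names are hypothetical, and the looped values are assumed to have been read into plain lists already; the sketch checks that the _atom_id values are unique and that every _bond_id_1 and _bond_id_2 value refers to an existing parent _atom_id.

```python
# Looped values transcribed from Example 1.
atom_id = ["1", "2", "3", "4", "5", "6"]
bond_id_1 = ["1", "2", "3", "4", "5", "1"]
bond_id_2 = ["2", "3", "4", "5", "1", "6"]

def values_unique(values):
    """_list_uniqueness: no value may occur twice in the list."""
    return len(set(values)) == len(values)

def missing_parents(child_values, parent_values):
    """_list_link_parent: return child values with no matching parent value."""
    parents = set(parent_values)
    return [v for v in child_values if v not in parents]

print(values_unique(atom_id))
print(missing_parents(bond_id_1 + bond_id_2, atom_id))
```

A bond that refers to an atom label absent from the _atom_id list would appear in the second result, which is the sense in which bonds are meaningless without the site identification.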

The _related_ attributes provide relational and replacement information. These two attributes are always used together. _related_item identifies the item or items that are related to the defined data item. The nature of this relationship is specified with _related_function in terms of the preset codes alternate, convention, conversion and replace. The definition of the function codes is detailed in Appendix I. These two attributes provide simple relational capabilities, and may be used to change validation and access pathways. This enables archived files containing superseded data to remain active, i.e. requests for particular items can be redirected to earlier or to more recently defined data. They facilitate the fundamental STAR requirement that definitions not be altered or removed, although they may be superseded. The related attributes ensure that old archive files may always be validated and accessed.

4. Dictionary Controls

The final class of attributes is used exclusively within the DDL dictionary. They are

_dictionary_history, _dictionary_name, _dictionary_update, _dictionary_version, _include_file

The _dictionary_ attributes are used to audit and identify dictionary information. _dictionary_history is a text record containing the entry and update information of the dictionary. _dictionary_name is the generic name of the electronic file containing the dictionary (the actual name of the file can vary from site to site). The attributes _dictionary_version and _dictionary_update are the version number and the date of the last change in the dictionary. Both items represent important external reference information.

The _include_file command is used in a dictionary to import definition information. The value of this attribute is the name of a file containing data in STAR format. Dictionary processing software converts a _include_file command into the contents of the file (the _include_file item is effectively "overwritten"). The usual extension code for the names of included files is ".dic" for complete dictionaries and ".val" for partial validation data.
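The substitution performed by dictionary processing software might be sketched in Python as follows. This is a simplified illustration: the function name expand_includes is hypothetical, the attribute and its value are assumed to sit together on one line, included files are supplied as an in-memory mapping rather than read from disk, and no protection against circular inclusion is attempted.

```python
def expand_includes(text, files):
    """Replace each '_include_file <name>' line with the named file's contents."""
    expanded = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "_include_file":
            # The _include_file item is overwritten by the file contents,
            # which may themselves contain further includes.
            expanded.extend(expand_includes(files[parts[1]], files).splitlines())
        else:
            expanded.append(line)
    return "\n".join(expanded)

# Hypothetical file contents, for illustration only.
files = {"star_core.dic": "data_a\n    _name '_a'"}
dictionary = "data_include_related_dictionaries\n    _include_file star_core.dic\ndata_b"

print(expand_includes(dictionary, files))
```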

DICTIONARY CONSTRUCTION

The STAR File syntax permits DDL dictionaries to be constructed in a wide variety of ways, all of which will be accessible to validation software. The most commonly used layout for DDL dictionaries is illustrated with an example dictionary in Appendix II.

REFERENCES

1 Hall, S R & Cook, A P F (1994) STAR Dictionary Definition Language: Initial Specifications. J. Chem. Inf. Comput. Sci. (in preparation).

2 Cook, A P F (1991) Dictionary Definition Language in STAR File format. ORAC report.

3 Hall, S R, Allen, F H, Brown, I D (1991) The Crystallographic Information File (CIF): a New Standard Archive File for Crystallography. Acta Cryst. A47, 655-685.

4 Allen, F H, Barnard, J M, Cook, A P F, Hall, S R (1994) The Molecular Information File (MIF): Initial Specifications. J. Chem. Inf. Comput. Sci. (in preparation).

5 McMahon, B (1992) CIF and the IUCr. Proceedings of the first Macromolecular Crystallographic Information File (CIF) Tools Workshop, Columbia University, NY.

6 Hall, S R (1993) Cyclops: J. Appl. Cryst. 26, 480-481.

7 (1991) POSIX Regular Expression Standard P1003.2, IEEE draft 11.2.

Example 1.

data_thiabutyrolactone

     loop_    
         _atom_id
         _atom_type
         _atom_attach_h
                         1  C  0     2  S  0     3  C  2
                         4  C  2     5  C  2     6  O  0

     loop_                        
         _bond_id_1
         _bond_id_2
         _bond_type_mif
                         1  2  s     2  3  s     3  4  s
                         4  5  s     5  1  s     1  6  d

APPENDIX I

DDL Dictionary as text version

(To be posted separately)

APPENDIX II

Recommended construction of a Dictionary using the DDL

##############################################################################
#                                                                            #
#                            XYZ DATA DICTIONARY                             #
#                                                                            #
##############################################################################

data_on_this_dictionary
    _dictionary_name             xyz_core.dic
    _dictionary_version          0.1
    _dictionary_update           1994-02-22
    _dictionary_history
;              1994-02-22  Created as a typical dictionary construction.
;

data_include_related_dictionaries
    _include_file                star_core.dic

global_
    _list                        no
    _list_mandatory              no
    _list_level                  1
    _type_conditions             seq

data_atom_attach_
    loop_ _name                '_atom_attach_all'
                               '_atom_attach_ring'
                               '_atom_attach_nh'
                               '_atom_attach_h'
    _category                    atom
    _type                        numb
    _list                        yes
    _list_reference            '_atom_id'
    _enumeration_range           0:
    _definition
;              The number of atom sites considered to be attached to this
               site.
                  all     all sites
                  ring    all sites forming rings
                  nh      all sites excl. hydrogens and unshared
                          electron pairs
                  h       hydrogen sites
;

data_atom_charge
    _name                      '_atom_charge'
    _category                    atom
    _type                        numb
    _list                        yes
    _list_reference            '_atom_id'
    _enumeration_range           -99:99
    _definition
;              Specifies the formal electronic charge on the atom for the
               different atomic representation conventions. The convention
               for charge is specified by _define_bonding_convention.
;

data_atom_cip
    _name                      '_atom_cip'
    _category                    atom
    _type                        char
    _list                        yes
    _list_reference            '_atom_id'
    _definition
;              Specifies the Cahn-Ingold-Prelog designation for the atom.
               The designators are by Prelog and Helmchen (Angew. Chem.
               Int. Ed. Engl. 1982, 21, 567-583).
;

data_atom_coord_
    loop_ _name                '_atom_coord_x'
                               '_atom_coord_y'
                               '_atom_coord_z'
    _category                    atom
    _type                        numb
    _list                        yes
    _list_reference            '_atom_id'
    loop_ _units_extension
          _units_description
          _units_conversion      ' '     'Angstroms'    *1.0
                                 '_pm'   'picometres'   /100.
                                 '_nm'   'nanometres'   *10.
    _definition
;              Specifies the Cartesian coordinates for the atom at an
               arbitrary origin and arbitrary orthogonal axes.
;

data_atom_id
    _name                      '_atom_id'
    _category                    atom
    _type                        char
    _list                        yes
    _list_mandatory              yes
    _list_uniqueness           '_atom_id'
    loop_ _list_link_child     '_bond_id_1'
                               '_bond_id_2'
                               '_stereo_vertex_id'
    _definition
;              This specifies a unique numeric identifier for an 'atom
               site' in a molecule or fragment. A designated atom site may
               be occupied by a 'dummy' atom (see _atom_type).
;

Copyright © 1997 International Union of Crystallography