
The Crystallographic Information File

Common Semantic Features

Version 1.1 Specification

Draft of 10 July 2002

Introduction

1. The Crystallographic Information File (CIF) standard is an extensible mechanism for the archival and interchange of information in crystallography and related structural sciences. As computer techniques evolve, it becomes more appropriate to discuss the machine-accessible semantic content, or "meaning", of the data in such a file. Ultimately CIF seeks to establish an ontology for machine-readable crystallographic information - that is, a collection of statements providing the relations between concepts and the logical rules for reasoning about them.

Essential components in the development of such an ontology are:

In the CIF framework, the objects of discourse are described in so-called data dictionary files, which provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to serve different domains of crystallography, and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in that dictionary.

However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.

Definition of terms

2. The following terms are used in this document with the specific meanings indicated here.

Semantics of data items

3. While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs.

4. At the current revision, two conventions (known as Dictionary Definition Languages or DDL) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook, 19xx), and differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items.

5. While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.

Data name semantics

6. Strictly, data names should be considered as void of semantic content - they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary.

7. However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit cell, but it should be emphasised from a computer programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.

8. In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.
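The inference described in paragraph 8 can be sketched in a few lines of Python. This is only an illustration: the function name split_ddl2_name is hypothetical, and the sketch assumes that the first period in the name is the category separator.

```python
def split_ddl2_name(data_name):
    """Split a DDL2-style data name into (category, item).

    Returns None when the name carries no period, i.e. when no category
    can be inferred from the name alone.
    """
    if not data_name.startswith("_"):
        raise ValueError("CIF data names begin with an underscore")
    # Assumption: the first period separates the category from the item.
    category, sep, item = data_name[1:].partition(".")
    if not sep or not category or not item:
        return None
    return category, item


print(split_ddl2_name("_atom_site.fract_x"))
print(split_ddl2_name("_atom_site_fract_x"))
```

A DDL1-style name such as _atom_site_fract_x yields None, since without a dictionary its category cannot be inferred reliably from the name.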

Namespace

9. The intention of the maintainers of public CIF dictionaries is to formulate a single authoritative set of data names for each CIF dialect (i.e. DDL1 and DDL2), thus facilitating the reliable archive and interchange of crystallographic data. However, it is also permissible for users to introduce local data names into a CIF. Two mechanisms exist to reduce the danger of collision of data names that are not incorporated into public dictionaries.

10. The character string [local] is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.

11. Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at

http://www.iucr.org/iucr-top/cif/spec/reserved.html

A reserved prefix, e.g. foo, must be used in the following ways

12. There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may themselves contain underscore characters, so a maximal matching search must be made.
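A maximal matching search of the kind paragraph 12 calls for might look like the following Python sketch. It makes several assumptions: the registry has already been loaded into a plain list of strings, the function name find_reserved_prefix is hypothetical, and only the leading component of the data name is examined (other positions in which a prefix may appear are not handled here).

```python
def find_reserved_prefix(data_name, registered_prefixes):
    """Return the longest registered prefix that starts the data name.

    Prefixes may themselves contain underscores, so every registered
    prefix is tried and the longest match wins.
    """
    name = data_name[1:] if data_name.startswith("_") else data_name
    name = name.lower()
    best = None
    for prefix in registered_prefixes:
        candidate = prefix.lower()
        # The prefix must be followed by an underscore, so that "foo"
        # does not match a name beginning "food_".
        if name.startswith(candidate + "_"):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


registry = ["foo", "foo_bar"]  # hypothetical registry contents
print(find_reserved_prefix("_foo_bar_length", registry))
```

With both foo and foo_bar registered, the name _foo_bar_length is attributed to foo_bar rather than foo, which is the point of the maximal match.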

Note on handling of units

13. The published specification for CIF Version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall, Allen & Brown, 1991, p. 657):

... Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The default units, except for the ångström, conform to the SI Standard adopted by the IUCr.

This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the Core. All data values must be expressed in the single unit assigned in the associated dictionary.

A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the Core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic may be used for the purpose. No other use should be made of this dictionary.
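For software that must read such archived files, one possible approach is to map each legacy data name explicitly to its current name and a conversion factor. The Python sketch below is illustrative only: the table LEGACY_UNIT_NAMES and the function normalise_legacy_item are hypothetical, and the table lists just the one example quoted above (1 pm = 0.01 Å). It does not consult cif_compat.dic.

```python
# Hypothetical table: legacy data name -> (current data name, factor that
# converts the legacy unit to the dictionary's default unit).
LEGACY_UNIT_NAMES = {
    "_cell_length_a_pm": ("_cell_length_a", 0.01),  # picometres to ångströms
}


def normalise_legacy_item(data_name, value):
    """Return (current_name, value_in_default_units) for a data item."""
    entry = LEGACY_UNIT_NAMES.get(data_name.lower())
    if entry is None:
        return data_name, value  # not a known legacy name: unchanged
    current_name, factor = entry
    return current_name, value * factor


print(normalise_legacy_item("_cell_length_a_pm", 1234.5))
```

An explicit table is used rather than stripping a suffix such as _pm, because a data name could legitimately end with the same characters.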

Data value semantics

14. The STAR syntax permits retrieval of data by simply requesting a specific data name within a specific data block. Prior knowledge about data type (e.g. text or numbers), whether the item is looped, or whether the item exists in the file at all, is unnecessary. However, applications in general need to know data type, valid ranges of values, and relationships between data items; and a program designer needs to know the purpose of the data item (i.e. what physical quantity or internal book-keeping function it represents). While such semantic information may be defined informally for local data items (ones not intended for exchange between different users or software applications), formal descriptions of the semantics associated with data values are catalogued in data dictionary files. Currently two formalisms (dictionary definition languages) for describing data value attributes are supported; full specifications of these formalisms (known as DDL1 and DDL2) are provided elsewhere.

Data typing

15. Four base data types are supported in CIF. These are

16. Many applications distinguish between multi-line text fields and character string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be "char" or "uchar". Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character string delimiters in paragraph 20 of the document "Syntax" are predicated on an approach that handles all subtypes of character or text data equivalently.

17. Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type numb. However, an explicit dictionary declaration of type will override such an assumption.
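The fallback rule of paragraph 17 can be sketched as follows in Python. The regular expression is an approximation of a CIF number (optional sign, optional exponent, optional standard uncertainty in parentheses), the function name infer_base_type is hypothetical, and treating everything else as char is an assumption made for this illustration.

```python
import re

# Approximate pattern for a CIF number, e.g. 12, -1.5e-3 or 34.5(12).
_NUMB = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?")


def infer_base_type(raw_value, declared_type=None):
    """Guess the base type of a raw value string.

    An explicit dictionary declaration, when available, always wins.
    """
    if declared_type is not None:
        return declared_type
    # Assumption: anything that is not number-like is treated as char.
    return "numb" if _NUMB.fullmatch(raw_value) else "char"


print(infer_base_type("34.5(12)"))
print(infer_base_type("1995", declared_type="char"))
```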

Subtyping

18. The base data types detailed in the previous section are very general, and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case.

19. DDL1 dictionaries
Standard uncertainties: Values of type numb may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute

     _type_conditions     esd
For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).
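As an illustration of this convention, the Python sketch below separates a value from its standard uncertainty. The function name parse_numb_with_su is hypothetical; Decimal is used so that the digits of the uncertainty are scaled exactly to the last decimal place of the mantissa.

```python
import re
from decimal import Decimal

_NUMB_SU = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+))([eE][+-]?\d+)?\((\d+)\)")


def parse_numb_with_su(text):
    """Return (value, standard_uncertainty) as Decimals."""
    match = _NUMB_SU.fullmatch(text)
    if match is None:
        raise ValueError("not a number with a standard uncertainty: %r" % text)
    mantissa, exponent, su_digits = match.groups()
    value = Decimal(mantissa + (exponent or ""))
    # The uncertainty applies to the final digit(s) of the mantissa, so it
    # is scaled by the number of decimal places and by any exponent.
    decimals = len(mantissa.partition(".")[2])
    power = int(exponent[1:]) if exponent else 0
    su = Decimal(su_digits).scaleb(power - decimals)
    return value, su


print(parse_numb_with_su("34.5(12)"))
print(parse_numb_with_su("3.45E1(12)"))
```

Both forms quoted above yield the value 34.5 with a standard uncertainty of 1.2.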

20. DDL2 dictionaries
DDL2 provides a number of tags that may be used in a dictionary file to specify subtypes for data items defined by that dictionary alone. Examples of the subtypes specified for the macromolecular CIF dictionary are:

code identifying code strings or single words
ucode identifying code strings or single words (case-insensitive)
uchar1 single-character codes (case-insensitive)
uchar3 three-character codes (case-insensitive)
line character strings forming a single line of text
uline character strings forming a single line of text (case-insensitive)
text multi-line text
int integers
float floating-point real numbers
yyyy-mm-dd dates
symop symmetry operations
any any type permitted

Special generic values

21. The unquoted character literals ? (query mark) and . (full point) are special and are valid expressions for any data type.

22. The value ? means that the actual value of a requested data item is unknown.

23. The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.
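One way for an application to represent these two special values is sketched below in Python. The names Special and interpret_value are hypothetical, and the sketch assumes that the parser reports whether a token was quoted, since only the unquoted literals are special.

```python
from enum import Enum


class Special(Enum):
    UNKNOWN = "?"        # actual value is unknown
    INAPPLICABLE = "."   # actual value is inapplicable


def interpret_value(token, was_quoted=False):
    """Map the unquoted literals ? and . to sentinel values."""
    if not was_quoted:
        if token == "?":
            return Special.UNKNOWN
        if token == ".":
            return Special.INAPPLICABLE
    return token  # ordinary value, or a quoted "?" or "."


print(interpret_value("?"))
print(interpret_value("?", was_quoted=True))
```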

Embedded data semantics

24. The attributes of data items defined in CIF dictionaries serve to direct crystallographic applications in the retrieval, storage and validation of relevant data. In principle a CIF might include as data items suitably encoded fields representing data suitable for manipulation by text processing, image, spreadsheet, database or other applications. It would be useful to have a formal mechanism allowing a CIF to invoke appropriate content handlers for such data fields, and this is under investigation for the next CIF version specification.

CIF conventions for special characters in text

25. The one existing example of embedded semantics is the text character markup introduced in the CIF Version 1.0 specification and summarised in paragraphs 29-36 below. The specification is silent on which fields should be interpreted according to these markup conventions, but the published examples suggest that they may be used in any character field in a CIF data file except as prohibited by a dictionary directive. It is intended that the next CIF version specification shall formally declare where such markup may be used.

Dictionary compliance

26. Dictionary files containing the definitions and attribute sets for the data items contained in a CIF should be identified within the CIF by some or all of the data items
    _audit_conform_dict_name
    _audit_conform_dict_version
    _audit_conform_dict_location
corresponding to DDL1 dictionaries, or
    _audit_conform.dict_name
    _audit_conform.dict_version
    _audit_conform.dict_location
for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform to the core CIF dictionary.

27. The _audit_conform data items may be looped in the case where more than one dictionary is used to define the items in a CIF, and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL.
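The rules in paragraphs 26 and 27 could be applied along the lines of the following Python sketch. It assumes that a data block has already been parsed into a mapping from data name to either a single value or a list of looped values; the function name conforming_dictionaries and the placeholder name used for the core dictionary are assumptions made for this illustration.

```python
# Assumed placeholder for "the core CIF dictionary" when nothing is declared.
CORE_DEFAULT = {"name": "cif_core.dic", "version": None, "location": None}
_FIELDS = ("name", "version", "location")


def conforming_dictionaries(block):
    """List the dictionaries a data block declares, one dict per row."""
    for separator in ("_", "."):  # DDL1 style first, then DDL2 style
        base = "_audit_conform" + separator + "dict_"
        columns = {}
        for field in _FIELDS:
            value = block.get(base + field)
            if value is not None:
                # A non-looped item is treated as a one-row loop.
                columns[field] = value if isinstance(value, list) else [value]
        if columns:
            rows = max(len(col) for col in columns.values())
            return [
                {f: columns[f][i] if f in columns and i < len(columns[f]) else None
                 for f in _FIELDS}
                for i in range(rows)
            ]
    return [dict(CORE_DEFAULT)]


print(conforming_dictionaries({}))
```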

28. A detailed protocol exists for locating, merging and overlaying multiple dictionary files.

CIF markup conventions

29. If permitted by the relevant dictionary and if no other indication is present, the contents of a text or character field are assumed to be interpretable as text in English or some other human language. Certain special codes are used to indicate special characters or accented letters not available in the ASCII character set, as listed below.

Greek letters

30. Greek letters are in general indicated by the corresponding letter of the Latin alphabet, prefixed by a backslash character. The complete set is:
α  A  \a  \A  alpha
β  B  \b  \B  beta
χ  X  \c  \C  chi
δ  Δ  \d  \D  delta
ε  E  \e  \E  epsilon
φ  Φ  \f  \F  phi
γ  Γ  \g  \G  gamma
η  H  \h  \H  eta
ι  I  \i  \I  iota
κ  K  \k  \K  kappa
λ  Λ  \l  \L  lambda
μ  M  \m  \M  mu
ν  N  \n  \N  nu
o  O  \o  \O  omicron
π  Π  \p  \P  pi
θ  Θ  \q  \Q  theta
ρ  R  \r  \R  rho
σ  Σ  \s  \S  sigma
τ  T  \t  \T  tau
υ  Υ  \u  \U  upsilon
ω  Ω  \w  \W  omega
ξ  Ξ  \x  \X  xi
ψ  Ψ  \y  \Y  psi
ζ  Z  \z  \Z  zeta
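A decoder for the Greek letter codes listed above might be sketched in Python as follows. The function name decode_greek is hypothetical. The sketch leaves a backslash that is itself preceded by a backslash untouched, on the assumption that such sequences belong to the double-backslash codes (for example \\times) rather than to this table.

```python
import re

# Latin code letter -> Greek letter, following the table above.
_GREEK = dict(zip("abcdefghiklmnopqrstuwxyz",
                  "αβχδεφγηικλμνοπθρστυωξψζ"))


def decode_greek(text):
    """Replace codes such as \\a or \\D by the corresponding Greek letter."""
    def replace(match):
        letter = match.group(1)
        greek = _GREEK.get(letter.lower())
        if greek is None:
            return match.group(0)  # no code is defined for j or v
        return greek.upper() if letter.isupper() else greek

    # (?<!\\) skips the second backslash of a double-backslash code.
    return re.sub(r"(?<!\\)\\([A-Za-z])", replace, text)


print(decode_greek(r"Cu K\a radiation"))
```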

Accented letters

31. Accents should be indicated by using the following codes before the letter to be modified (i.e. use \'e for an acute e):
\'  acute (é)
\`  grave (à)
\^  circumflex (â)
\,  cedilla (ç)
\"  umlaut (ü)
\~  tilde (ñ)
\;  ogonek
\>  Hungarian umlaut
\=  overbar
\.  overdot
\<  hacek
\(  breve
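The accent codes can be mapped onto Unicode combining characters, as in the Python sketch below. This is one possible approach, not part of the specification: the function name decode_accents is hypothetical, the choice of combining code points is the sketch's own, and only unaccented Latin letters are handled as the letter to be modified.

```python
import re
import unicodedata

# Accent code character -> Unicode combining mark (assumed mapping).
_COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    ",": "\u0327",  # cedilla
    '"': "\u0308",  # umlaut
    "~": "\u0303",  # tilde
    ";": "\u0328",  # ogonek
    ">": "\u030b",  # Hungarian umlaut (double acute)
    "=": "\u0304",  # overbar
    ".": "\u0307",  # overdot
    "<": "\u030c",  # hacek
    "(": "\u0306",  # breve
}
_ACCENT = re.compile(r"\\(['\"=`~.^;<,>(])([A-Za-z])")


def decode_accents(text):
    """Turn codes such as \\'e into precomposed accented letters."""
    # The combining mark follows the base letter; NFC then composes them.
    combined = _ACCENT.sub(lambda m: m.group(2) + _COMBINING[m.group(1)], text)
    return unicodedata.normalize("NFC", combined)


print(decode_accents(r"Sch\"onflies"))
```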

Other characters

32. Other special alphabetic characters should be indicated as follows:
\%a  a-ring (å)
\/o  o-slash (ø)
\?i  dotless i
\/l  Polish l (ł)
\&s  German "ss" (ß)
\/d  barred d

Capital letters may also be used in these codes, so an ångström symbol (Å) may be given as \%A.

33. Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:

superscripts  Csp^3^ for Csp³
subscripts    U~eq~ for U with subscript eq

The closing symbol is essential to return to normal text.
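As an illustration, the paired delimiters can be converted to HTML-like markup with two substitutions. The Python sketch below is an assumption-laden example: the function name scripts_to_html and the choice of sup and sub elements as the output are not part of the specification, and a circumflex or tilde preceded by a backslash is skipped on the assumption that it is an accent code.

```python
import re


def scripts_to_html(text):
    """Convert ^...^ to <sup>...</sup> and ~...~ to <sub>...</sub>."""
    # (?<!\\) leaves the accent codes \^ and \~ alone.
    text = re.sub(r"(?<!\\)\^([^\^]*)\^", r"<sup>\1</sup>", text)
    text = re.sub(r"(?<!\\)~([^~]*)~", r"<sub>\1</sub>", text)
    return text


print(scripts_to_html("Csp^3^ and U~eq~"))
```

An unpaired opening symbol is left unchanged by this sketch, which reflects the requirement that the closing symbol be present.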

34. Some other codes are accepted by convention. These are:

\%            degree (°)
--            dash
---           single bond
\\db          double bond
\\tb          triple bond
\\ddb         delocalized double bond
\\sim         ~ (N.B. ~ is the code for subscript)
\\simeq       ≃
\\infty       ∞
\\times       ×
+-            ±
-+            ∓
\\square      square
\\neq         ≠
\\rangle      >
\\langle      <
\\rightarrow  →
\\leftarrow   ←

Note that \\db, \\tb and \\ddb should always be followed by a space, e.g. C=C is denoted by C\\db C.

Typographic style codes

35. The codes indicated above are designed to refer to special characters not expressible within the CIF character set, and the initial specification did not permit markup for typographic style such as italic or bold-face type. However, in some cases the ability to indicate type style is useful. In addition to the codes above, HTML-like conventions are therefore allowed: text may be surrounded by <i> </i> to indicate the beginning and end of italic type, and by <b> </b> to indicate the beginning and end of boldface type.

36. If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.

References

37.