# Common semantic features

## Introduction

1. The Crystallographic Information File (CIF) standard is an extensible mechanism for the archival and interchange of information in crystallography and related structural sciences. Ultimately CIF seeks to establish an ontology for machine-readable crystallographic information - that is, a collection of statements providing the relations between concepts and the logical rules for reasoning about them.

Essential components in the development of such an ontology are:

• the basic rules of grammar and syntax, described in the document File syntax
• a vocabulary of the tags or data names specifying particular objects
• a taxonomy, or classification scheme relating the specified objects
• descriptions of the attributes and relationships of individual and related objects

In the CIF framework, the objects of discourse are described in so-called data dictionary files, which provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to service different domains of crystallography, and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in the dictionary.

However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.

## Definition of terms

2. The following terms are used in the CIF specification documents with the specific meanings indicated here.
• 2.1. A CIF is a file conforming to the specification herein stated, containing either information on a crystallographic experiment or its results (or similar scientific content); or descriptions of the data identifiers in such a file.
• 2.2. A data file is understood to convey information relating to a crystallographic experiment.
• 2.3. A dictionary file is understood to contain information about the data items in one or more data files as identified by their data names.
• 2.4. A data name is a case-insensitive identifier (a string of characters beginning with an underscore character) of the content of an associated data value.
• 2.5. A data value is a string of characters representing a particular item of information. It may represent a single numerical value; a letter, word or phrase; extended discursive text; or in principle any coherent unit of data such as an image, audio clip or virtual-reality object.
• 2.6. A data item is a specific piece of information defined by a data name and an associated data value.
• 2.7. A tag is understood in this document to be a synonym for data name.
• 2.8. A data block is the highest-level component of a CIF, containing data items or save frames. A data block is identified by a data block header, which is an isolated character string (that is, bounded by white space and not forming part of a data value) beginning with the case-insensitive reserved characters data_.
• 2.9. A block code is the variable part of a data block header, e.g. the string foo in the header data_foo.
• 2.10. A save frame is a partitioned collection of data items within a data block, started by a save frame header, which is an isolated character string beginning with the case-insensitive reserved characters save_, and terminated with an isolated character string containing only the case-insensitive reserved characters save_.
• 2.11. A frame code is the variable part of a save frame header, e.g. the string foo in the header save_foo.
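The header definitions in 2.8 to 2.11 can be illustrated with a minimal Python sketch. The function name `classify_header` and its return convention are hypothetical, and the token is assumed to have been isolated already (bounded by white space and not part of a data value). Treating a bare `data_` with no block code as invalid is also an assumption of this sketch.

```python
from typing import Optional, Tuple


def classify_header(token: str) -> Optional[Tuple[str, Optional[str]]]:
    """Classify an isolated token as a data block or save frame header.

    Returns ("data", block_code), ("save_begin", frame_code),
    ("save_end", None), or None if the token is not a header.
    """
    lowered = token.lower()  # the reserved prefixes are case-insensitive
    if lowered.startswith("data_"):
        code = token[len("data_"):]
        # Assumption: a bare "data_" with no block code is not a header.
        return ("data", code) if code else None
    if lowered.startswith("save_"):
        code = token[len("save_"):]
        # "save_" on its own terminates the current save frame.
        return ("save_begin", code) if code else ("save_end", None)
    return None
```

For example, `classify_header("data_foo")` yields the block code `foo`, while an ordinary data name such as `_cell_length_a` is not classified as a header.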

## Semantics of data items

3. While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs.

4. At the current revision, two conventions (known as Dictionary Definition Languages or DDL) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook & Hall, 1995), and differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items.

5. While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.

### Data name semantics

6. Strictly, data names should be considered as void of semantic content - they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary.

7. However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit cell, but it should be emphasised from a computer programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.

8. In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.
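As a rough illustration of the inference described in paragraph 8, the following Python sketch extracts the category from a DDL2-style data name. The helper name `ddl2_category` is hypothetical, and the result is only a convention-based guess: the dictionary remains the authoritative source.

```python
from typing import Optional


def ddl2_category(data_name: str) -> Optional[str]:
    """Guess the category of a DDL2-style data name, or return None."""
    name = data_name.lower()  # data names are case-insensitive
    if not name.startswith("_") or "." not in name:
        return None  # not a DDL2-style name; nothing can be inferred
    category, _, _item = name[1:].partition(".")
    return category or None
```

Items for which this function returns the same category, such as `_atom_site.fract_x` and `_atom_site.fract_y`, would be expected to appear in the same looped list.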

#### Namespace

9. The intention of the maintainers of public CIF dictionaries is to formulate a single authoritative set of data names for each CIF dialect (i.e. DDL1 and DDL2), thus facilitating the reliable archive and interchange of crystallographic data. However, it is also permissible for users to introduce local data names into a CIF. Two mechanisms exist to reduce the danger of collision of data names that are not incorporated into public dictionaries.

10. The character string [local] (including the literal bracket characters) is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.

11. Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at

http://www.iucr.org/iucr-top/cif/spec/reserved.html

A reserved prefix, e.g. foo, must be used in the following ways:

• If the data file contains items defined in a DDL1 dictionary, the local data names assigned under the reserved prefix must contain it as their first component, e.g. _foo_atom_site_my_item.
• If the data file contains items defined in a DDL2 dictionary, then the reserved prefix must be
  • the first component of data names in a category defined for local use, e.g. _foo_my_category.my_item
  • the first component following the period character in a data name describing a new item in a category already defined in a public dictionary, e.g. _atom_site.foo_my_item

12. There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may not themselves contain underscore characters.
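The matching described in paragraphs 11 and 12 might be sketched in Python as follows. The function names and the `registry` argument (any iterable of registered prefixes, obtained separately from the registry page) are assumptions made for illustration. Requiring an underscore immediately after the prefix is how this sketch interprets "first component".

```python
from typing import Iterable, Optional


def uses_reserved_prefix(data_name: str, prefix: str) -> bool:
    """Check whether a data name carries `prefix` in a permitted position."""
    if "_" in prefix:
        raise ValueError("reserved prefixes may not contain underscores")
    name, lead = data_name.lower(), prefix.lower() + "_"
    if not name.startswith("_"):
        return False
    if "." in name:  # DDL2-style name: _category.item
        category, _, item = name[1:].partition(".")
        # Either a local category, or a local item in a public category.
        return category.startswith(lead) or item.startswith(lead)
    return name[1:].startswith(lead)  # DDL1-style name


def find_reserved_prefix(data_name: str, registry: Iterable[str]) -> Optional[str]:
    """Scan the whole registry and return the first matching prefix, if any."""
    for prefix in registry:
        if uses_reserved_prefix(data_name, prefix):
            return prefix
    return None
```

Because no syntactic property marks a reserved prefix, the scan has to try every registered prefix; a name such as `_food_item` does not match the prefix `foo`, since the component boundary is respected.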

### Note on handling of units

13. The published specification for CIF Version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall, Allen & Brown, 1991, p. 657):

... Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The default units, except for the ångström, conform to the SI Standard adopted by the IUCr.

This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the Core. All data values must be expressed in the single unit assigned in the associated dictionary.

A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the Core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic may be used for the purpose. No other use should be made of this dictionary.

### Data value semantics

14. The STAR syntax permits retrieval of data by simply requesting a specific data name within a specific data block. Prior knowledge about data type (e.g. text or numbers), whether the item is looped, or whether the item exists in the file at all, is unnecessary. However, applications in general need to know data type, valid ranges of values, and relationships between data items; and a program designer needs to know the purpose of the data item (i.e. what physical quantity or internal book-keeping function it represents). While such semantic information may be defined informally for local data items (ones not intended for exchange between different users or software applications), formal descriptions of the semantics associated with data values are catalogued in data dictionary files. Currently two formalisms (dictionary definition languages) for describing data value attributes are supported; full specifications of these formalisms (known as DDL1 and DDL2) are provided elsewhere.

#### Data typing

15. Four base data types are supported in CIF. These are

• numb: a value interpretable as a decimal base number and supplied as an integer, a floating-point number or in scientific notation;
• char: a value to be interpreted as character or text data (where the value contains white-space characters, it must be quoted);
• uchar: a value to be interpreted as character or text data but in a case-insensitive manner (i.e. the values FOO and foo are to be taken as identical);
• null: a special data type associated with items for which no definite value may be stored in computer memory. It is the type associated with the special character literal values ? (query mark) and . (full point) which may appear as values for any data item within a data file (see section on "Special generic values" below). It is also the type assigned to items defined in dictionary files which may not occur in data files.

16. Many applications distinguish between multi-line text fields and character string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be "char" or "uchar". Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character string delimiters in paragraph 20 of the document "Syntax" are predicated on an approach that handles all subtypes of character or text data equivalently.

17. Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type numb. However, an explicit dictionary declaration of type will override such an assumption.
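The fallback rule in paragraph 17 could look like the Python sketch below. The regular expression is an approximation of "interpretable as a number" (integer, floating-point or scientific notation, optionally with a parenthesised standard uncertainty) and is an assumption of this example, as is the function name `guess_base_type`. The input is assumed to be an unquoted value.

```python
import re
from typing import Optional

# Approximate pattern for a numeric value; an assumption of this sketch.
_NUMBER = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?$")


def guess_base_type(raw: str, declared: Optional[str] = None) -> str:
    """Return the base type of an unquoted value.

    `declared` is the type given by a dictionary, if one is available;
    an explicit declaration always overrides the guess.
    """
    if declared is not None:
        return declared
    if raw in ("?", "."):
        return "null"  # special values valid for any data item
    if _NUMBER.match(raw):
        return "numb"
    return "char"
```

With no dictionary, `480.05` is taken as `numb`, but a dictionary declaring the item as `char` would take precedence.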

#### Subtyping

18. The base data types detailed in the previous section are very general, and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case.

19. DDL1 dictionaries
Standard uncertainties: Values of type numb may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute

     _type_conditions     esd

(or _type_conditions su, a synonym introduced to DDL1 in 2005). For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).
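The standard uncertainty notation can be unpacked as in the following Python sketch, which uses `decimal.Decimal` to avoid binary rounding. The function name `parse_su` and the exact pattern accepted are assumptions of this example; the rule applied is that the digits in parentheses are aligned with the final digits of the mantissa.

```python
import re
from decimal import Decimal
from typing import Tuple

# mantissa, optional exponent, then the uncertainty digits in parentheses
_SU = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:[eE]([+-]?\d+))?\((\d+)\)$")


def parse_su(text: str) -> Tuple[Decimal, Decimal]:
    """Split a value such as '34.5(12)' into (value, standard uncertainty)."""
    match = _SU.match(text)
    if match is None:
        raise ValueError("not a number with a standard uncertainty: %r" % text)
    mantissa, exponent, su_digits = match.groups()
    value = Decimal(mantissa)
    # Number of digits after the decimal point in the mantissa.
    decimals = -value.as_tuple().exponent
    su = Decimal(su_digits).scaleb(-decimals)
    if exponent is not None:
        value = value.scaleb(int(exponent))
        su = su.scaleb(int(exponent))
    return value, su
```

Both forms quoted above, `34.5(12)` and `3.45E1(12)`, give a value of 34.5 with a standard uncertainty of 1.2.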

20. DDL2 dictionaries
DDL2 provides a number of tags that may be used in a dictionary file to specify subtypes for data items defined by that dictionary alone. Examples of the subtypes specified for the macromolecular CIF dictionary are:

    code        identifying code strings or single words
    ucode       identifying code strings or single words (case-insensitive)
    uchar1      single-character codes (case-insensitive)
    uchar3      three-character codes (case-insensitive)
    line        character strings forming a single line of text
    uline       character strings forming a single line of text (case-insensitive)
    text        multi-line text
    int         integers
    float       floating-point real numbers
    yyyy-mm-dd  dates
    symop       symmetry operations
    any         any type permitted

#### Special generic values

21. The unquoted character literals ? (query mark) and . (full point) are special and are valid expressions for any data type.

22. The value ? means that the actual value of a requested data item is unknown.

23. The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.

### Embedded data semantics

24. The attributes of data items defined in CIF dictionaries serve to direct crystallographic applications in the retrieval, storage and validation of relevant data. In principle a CIF might include as data items suitably encoded fields representing data suitable for manipulation by text processing, image, spreadsheet, database or other applications. It would be useful to have a formal mechanism allowing a CIF to invoke appropriate content handlers for such data fields, and this is under investigation for the next CIF version specification.

#### CIF conventions for special characters in text

25. The one existing example of embedded semantics is the text character markup introduced in the CIF Version 1.0 specification and summarised in paragraphs 30-37 below. The specification is silent on which fields should be interpreted according to these markup conventions, but the published examples suggest that they may be used in any character field in a CIF data file except as prohibited by a dictionary directive. It is intended that the next CIF version specification shall formally declare where such markup may be used.

### Handling of long lines

26. The restriction in line length within CIF requires techniques to handle without semantic loss the content of lines of text exceeding the limit (2048 characters in this revision, 80 characters in the initial CIF specification). The line folding protocol defined here provides a general mechanism for wrapping lines of text within CIFs to any extent within the overall line length limit. A specific application where this would be useful is the conversion of lines longer than 80 characters to the CIF 1.0 limit. This 80-character limit is used in the examples below for illustrative purposes.

These techniques are applied only to the contents of text fields and to comments.

In order to permit such folding a special semantics is defined for use of the backslash. It is important to understand that this does not change the syntax of CIF 1.0. All existing CIFs conforming to the CIF 1.0 specification can be viewed as having exactly the same semantics as they now have. Use of these transformational semantics is optional, but recommended.

In order to avoid confusion between CIFs that have undergone these transformations and those that have not, the special comment beginning with a hash mark immediately followed by a backslash (#\) as the last non-blank characters on a line is reserved to mark the beginning of comments created by folding long-line comments, and the special text field beginning with the sequence line-termination, semicolon, backslash (<eol>;\) as the only non-blank characters on a line is reserved to mark the beginning of text fields created by folding long-line text fields.

The backslash character is used to fold long lines in character strings and comments. Consider a comment which extends beyond column 80. In order to provide a comment with the same meaning which can be fitted into 80 character lines, prefix the comment with the special comment consisting of a hash mark followed by a backslash (#\) and the line terminator. Then on new lines take appropriate fragments of the original comment, beginning each fragment with a hash mark and ending all but the last fragment with a backslash. In doing this conversion, check for an original line that ends with a backslash followed only by blanks or tabs. To preserve that backslash in the conversion, add another backslash after it. If the next lexical token (not counting blanks or tabs) is another comment, to avoid fusing this comment with the next comment, be sure to insert a line with just a hash mark.

Similarly, for a character string that extends beyond column 80,

• first convert it to be a text field delimited by line-termination-semicolon (<eol>;) sequences
• then change the initial line-termination-semicolon (<eol>;) sequence to line-termination-semicolon-backslash-line-termination (<eol>;\<eol>)
• and break all subsequent lines that do not fit within 80 columns with a trailing backslash. In the course of doing the translation,
  • check for any original text lines that end with a backslash followed only by blanks or tabs;
  • to preserve that backslash in the conversion, add another backslash after it, and then an empty line.

(More formally, the line folding should be done separately and directly on single-line, non-semicolon-delimited character strings to allow for recognition of the fact that no terminal line-termination is intended - see below).
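The folding steps for comments and character strings described above can be sketched in Python as follows. The function names, the `width` parameter and the list-of-lines return format are assumptions of this sketch. It also refuses to start a folded fragment with a semicolon (which would otherwise terminate the text field), a precaution not spelled out in the text, and it does not handle pathological runs of semicolons.

```python
from typing import List


def _ends_with_backslash(line: str) -> bool:
    """True if the line ends with a backslash followed only by blanks or tabs."""
    return line.rstrip(" \t").endswith("\\")


def fold_comment(comment: str, width: int = 80) -> List[str]:
    """Fold one comment (starting with '#') into lines of at most `width`."""
    if not comment.startswith("#") or width < 3:
        raise ValueError("expected a '#' comment and width >= 3")
    body = comment[1:]
    protect = _ends_with_backslash(body)
    if protect:
        body += "\\"  # preserve an original trailing backslash
    out = ["#\\"]  # special comment announcing a folded comment
    room = width - 2  # space left after '#' and the trailing backslash
    start = 0
    while len(body) - start > width - 1:
        out.append("#" + body[start:start + room] + "\\")
        start += room
    out.append("#" + body[start:])
    if protect:
        out.append("#")  # bare hash: avoids fusing with a following comment
    return out


def fold_text_field(value: str, width: int = 80) -> List[str]:
    """Return the lines of a folded text field, delimiters included."""
    if width < 3:
        raise ValueError("width must be at least 3")
    out = [";\\"]
    for raw in value.split("\n"):
        protect = _ends_with_backslash(raw)
        body = raw + "\\" if protect else raw
        start = 0
        while len(body) - start > width:
            cut = start + width - 1  # leave room for the backslash
            while cut > start + 1 and body[cut] == ";":
                cut -= 1  # never begin a line with a semicolon
            out.append(body[start:cut] + "\\")
            start = cut
        out.append(body[start:])
        if protect:
            out.append("")  # the empty line that follows a doubled backslash
    out.append(";")
    return out
```

In this sketch the bare hash line is always emitted after a comment whose original text ended in a backslash; this is a conservative simplification of the rule that it is needed only when the next token is another comment.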

To illustrate this scheme, suppose that the lines of CIF fragment (1) below were considered too long. They could then be transformed as shown in (2):

(1) Initial CIF

#######################################################
### CIF submission form for Rietveld refinements    ###
###                        Version 14 December 1998 ###
#######################################################
data_znvodata
_chemical_name_systematic
;
zinc dihydroxide divanadate dihydrate
;

_chemical_formula_moiety        'H2 O9 V2 Zn3, 2(H2 O)'
_chemical_formula_sum           'H6 O11 V2 Zn3'
_chemical_formula_weight        480.05


(2) Transformed CIF

#\
###########################\
###########################
### CIF submission form for Rietveld refinements    ###
###                        Version 14 December 1998 ###
#######################################################
data_znvodata
_chemical_name_systematic
;\
zinc dihydroxide divan\
adate dihydrate
;

_chemical_formula_moiety
;\
H2 O9 V2 Zn3, 2(H2 O)\
;
_chemical_formula_sum           'H6 O11 V2 Zn3'
_chemical_formula_weight        480.05


In making the transformation from the backslash folded form to long lines, it is very important to strip trailing blanks before attempting to recognize a backslash as the last character. When re-assembling text field lines, no reassembly should be done except in text fields that begin with the special sequence described above, line-termination-semicolon-backslash-line-termination, (<eol>;\<eol>), so that text fields which happen to contain backslashes, but which were not created by folding long lines, are not changed. It is also important to remove the trailing backslashes when reassembling long lines. The final line-termination-semicolon sequence of a text field takes priority over the reassembly process and ends it, but a trailing backslash on the last line of a text field very nicely conveys the information that no trailing line termination is intended to be included within the character string.

Similarly, when reassembling long-line comments, the reassembly begins with a comment of the form hash-backslash-line-termination. The initial hash mark is retained and then a forward scan is made through line-terminations and blanks for the next comment, from which the initial hash mark is stripped and then the contents of the comment are appended. If that comment ends with a backslash, the trailing backslash is stripped and the process repeats. Note that the process will be ended by intervening tags, values, data blocks or other non-whitespace information, and that the process will not start at all without the special hash-backslash-line-termination comment.
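The two reassembly procedures just described might be sketched in Python as follows. The function names and input conventions are assumptions of this sketch: `unfold_text_field` receives the lines between the delimiting semicolons (the first element being whatever follows the opening semicolon on its line), and `unfold_comments` handles only comments that occupy a whole line.

```python
from typing import List


def unfold_text_field(lines: List[str]) -> List[str]:
    """Reassemble a text field; fields not starting with ';\\' are unchanged."""
    if not lines or lines[0].rstrip(" \t") != "\\":
        return list(lines)
    out: List[str] = []
    current, pending = "", False
    for raw in lines[1:]:
        stripped = raw.rstrip(" \t")  # strip trailing blanks before testing
        if stripped.endswith("\\"):
            current += stripped[:-1]  # drop the backslash, join the next line
            pending = True
        else:
            out.append(current + raw)
            current, pending = "", False
    if pending:
        out.append(current)  # closing delimiter ends the reassembly
    return out


def unfold_comments(lines: List[str]) -> List[str]:
    """Reassemble folded whole-line comments in a list of file lines."""
    out: List[str] = []
    i = 0
    while i < len(lines):
        if lines[i].strip(" \t") != "#\\":
            out.append(lines[i])
            i += 1
            continue
        joined = "#"  # the initial hash mark is retained
        i += 1
        while True:
            j = i
            while j < len(lines) and not lines[j].strip(" \t"):
                j += 1  # scan forward over blank lines
            if j == len(lines) or not lines[j].lstrip(" \t").startswith("#"):
                break  # ended by a non-comment token or end of input
            body = lines[j].strip(" \t")[1:]
            i = j + 1
            if not body.endswith("\\"):
                joined += body
                break
            joined += body[:-1]  # strip the trailing backslash and repeat
        out.append(joined)
    return out
```

A text field that merely contains backslashes, but does not open with the semicolon-backslash sequence, passes through `unfold_text_field` untouched.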

Since there are very few, if any, CIFs which contain text fields and comments beginning this way, in most cases, it is reasonable to adopt the policy of doing this processing unless it is disabled.

Here is another example of folding. The following three text fields would be equivalent:

;C:\foldername\filename
;

;\
C:\foldername\filename
;

and
;\
C:\foldername\file\
name
;

but the next example would be a two-line value where the first line had the value "C:\foldername\file\" and the second had the value "name":
;
C:\foldername\file\
name
;


Note that backslashes should not be used to fold lines outside of comments and text fields. That would introduce extraneous characters into the CIF and violate the basic syntax rules. In any case, such an action is not necessary.

## Dictionary compliance

27. Dictionary files containing the definitions and attribute sets for the data items contained in a CIF should be identified within the CIF by some or all of the data items

    _audit_conform_dict_name
    _audit_conform_dict_version
    _audit_conform_dict_location

corresponding to DDL1 dictionaries, or

    _audit_conform.dict_name
    _audit_conform.dict_version
    _audit_conform.dict_location

for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform to the core CIF dictionary.

28. The _audit_conform data items may be looped in the case where more than one dictionary is used to define the items in a CIF, and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL.

29. A detailed protocol exists for locating, merging and overlaying multiple dictionary files (McMahon, Westbrook & Bernstein, 2000).

## CIF markup conventions

30. If permitted by the relevant dictionary and if no other indication is present, the contents of a text or character field are assumed to be interpretable as text in English or some other human language. Certain special codes are used to indicate special characters or accented letters not available in the ASCII character set, as listed below.

### Greek letters

31. In general, a Greek letter is coded as the corresponding letter of the Latin alphabet, prefixed by a backslash character. The complete set is:

    lower  upper  letter
    \a     \A     alpha    (α Α)
    \b     \B     beta     (β Β)
    \c     \C     chi      (χ Χ)
    \d     \D     delta    (δ Δ)
    \e     \E     epsilon  (ε Ε)
    \f     \F     phi      (φ Φ)
    \g     \G     gamma    (γ Γ)
    \h     \H     eta      (η Η)
    \i     \I     iota     (ι Ι)
    \k     \K     kappa    (κ Κ)
    \l     \L     lambda   (λ Λ)
    \m     \M     mu       (μ Μ)
    \n     \N     nu       (ν Ν)
    \o     \O     omicron  (ο Ο)
    \p     \P     pi       (π Π)
    \q     \Q     theta    (θ Θ)
    \r     \R     rho      (ρ Ρ)
    \s     \S     sigma    (σ Σ)
    \t     \T     tau      (τ Τ)
    \u     \U     upsilon  (υ Υ)
    \w     \W     omega    (ω Ω)
    \x     \X     xi       (ξ Ξ)
    \y     \Y     psi      (ψ Ψ)
    \z     \Z     zeta     (ζ Ζ)
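A decoder for the Greek letter codes might look like the Python sketch below. The function name `decode_greek` is hypothetical. The sketch skips any backslash that is itself preceded by a backslash, so that the double-backslash codes defined later in this document are left alone. It should only be applied to fields where this markup is actually permitted, since other text (a Windows file path, for instance) would be altered.

```python
import re

_GREEK_LOWER = {
    "a": "α", "b": "β", "c": "χ", "d": "δ", "e": "ε", "f": "φ",
    "g": "γ", "h": "η", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "o": "ο", "p": "π", "q": "θ", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "w": "ω", "x": "ξ", "y": "ψ", "z": "ζ",
}
# A single backslash followed by one Latin letter.
_CODE = re.compile(r"(?<!\\)\\([A-Za-z])")


def decode_greek(text: str) -> str:
    """Replace codes such as \\a or \\D with the corresponding Greek letter."""
    def replace(match):
        letter = match.group(1)
        greek = _GREEK_LOWER.get(letter.lower())
        if greek is None:
            return match.group(0)  # \j and \v are not defined; keep as-is
        return greek.upper() if letter.isupper() else greek
    return _CODE.sub(replace, text)
```

For example, the string `\a-quartz, \D = 2\q` would be rendered as α-quartz, Δ = 2θ.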

### Accented letters

32. Accents should be indicated by using the following codes before the letter to be modified (i.e. use \'e for an acute e):

    \'    acute (é)
    \"    umlaut (ü)
    \=    overbar
    \`    grave (à)
    \~    tilde (ñ)
    \.    overdot
    \^    circumflex (â)
    \;    ogonek
    \<    hacek
    \,    cedilla (ç)
    \>    Hungarian umlaut
    \(    breve

These codes will always be followed by a non-whitespace character.
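One possible way to render the accent codes is to map each code to a Unicode combining mark and let NFC normalization compose the result, as in the Python sketch below. The choice of combining characters, the restriction to unaccented Latin letters and the function name `decode_accents` are assumptions of this sketch rather than part of the specification.

```python
import re
import unicodedata

# Accent code character -> Unicode combining mark (choices are assumptions).
_COMBINING = {
    "'": "\u0301",  # acute
    '"': "\u0308",  # umlaut (diaeresis)
    "=": "\u0304",  # overbar (macron)
    "`": "\u0300",  # grave
    "~": "\u0303",  # tilde
    ".": "\u0307",  # overdot
    "^": "\u0302",  # circumflex
    ";": "\u0328",  # ogonek
    "<": "\u030c",  # hacek (caron)
    ",": "\u0327",  # cedilla
    ">": "\u030b",  # Hungarian umlaut (double acute)
    "(": "\u0306",  # breve
}
_ACCENT = re.compile(
    r"(?<!\\)\\([" + re.escape("".join(_COMBINING)) + r"])([A-Za-z])"
)


def decode_accents(text: str) -> str:
    """Replace accent codes such as \\'e with the accented letter."""
    def replace(match):
        mark, letter = match.groups()
        # The combining mark follows the base letter; NFC composes the pair.
        return unicodedata.normalize("NFC", letter + _COMBINING[mark])
    return _ACCENT.sub(replace, text)
```

Where Unicode has no precomposed character for a combination, the result remains a base letter followed by a combining mark.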

### Other characters

33. Other special alphabetic characters should be indicated as follows:

    \%a   a-ring (å)
    \?i   dotless i
    \&s   German "ss" (ß)
    \/o   o-slash (ø)
    \/l   Polish l (ł)
    \/d   barred d

Capital letters may also be used in these codes, so an ångström symbol (Å) may be given as \%A.

34. Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:

    superscripts   Csp^3^   for Csp³
    subscripts     U~eq~    for U with subscript eq
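As an illustration, the bracketing convention can be translated to HTML-like tags with two regular expressions, as in the Python sketch below. The function name `scripts_to_html` and the choice of `<sup>` and `<sub>` as output are assumptions. A circumflex or tilde preceded by a backslash is treated as the start of an accent code and is not used as an opening delimiter; this sketch does not attempt to handle accent codes inside a bracketed run.

```python
import re

# Opening delimiter must not be part of an accent code (\^ or \~).
_SUP = re.compile(r"(?<!\\)\^([^^]+?)\^")
_SUB = re.compile(r"(?<!\\)~([^~]+?)~")


def scripts_to_html(text: str) -> str:
    """Convert ^...^ to <sup>...</sup> and ~...~ to <sub>...</sub>."""
    text = _SUP.sub(r"<sup>\1</sup>", text)
    return _SUB.sub(r"<sub>\1</sub>", text)
```

With this function, `Csp^3^` becomes `Csp<sup>3</sup>` and `U~eq~` becomes `U<sub>eq</sub>`.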

35. Some other codes are accepted by convention. These are:

    \%            degree (°)
    --            dash
    ---           single bond
    \\db          double bond
    \\tb          triple bond
    \\ddb         delocalized double bond
    \\sim         ~ (N.B. ~ is the code for subscript)
    \\simeq       approximately equal
    \\infty       ∞
    \\times       ×
    +-            ±
    -+            ∓
    \\square      square
    \\neq         ≠
    \\rangle      >
    \\langle      <
    \\rightarrow  →
    \\leftarrow   ←

Note that \\db, \\tb and \\ddb should always be followed by a space, e.g. C=C is denoted by C\\db C.

### Typographic style codes

36. The codes indicated above are designed to refer to special characters not expressible within the CIF character set, and the initial specification did not permit markup for typographic style such as italic or bold-face type. However, in some cases the ability to indicate type style is useful, and in addition to the codes above HTML-like conventions are allowed of surrounding text by <i> </i> to indicate the beginning and end of italic, and by <b> </b> to indicate the beginning and end of boldface type.

37. If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.
