Essential components in the development of such an ontology are:
In the CIF framework, the objects of discourse are described in so-called data dictionary files, which provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to service different domains of crystallography, and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in that dictionary.
However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.
3. While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs.
4. At the current revision, two conventions (known as Dictionary Definition Languages or DDL) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook, 19xx), and differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items.
5. While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.
6. Strictly, data names should be considered as void of semantic content - they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary.
7. However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit cell, but it should be emphasised from a computer programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.
8. In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.
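The inference described in paragraph 8 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `split_ddl2_name` and `group_by_category` are hypothetical, and the grouping is a convenience for use when no dictionary is at hand, not a substitute for consulting one.

```python
from collections import defaultdict


def split_ddl2_name(data_name):
    """Split a DDL2-style data name into (category, item).

    Returns None when the name has no leading underscore or no period,
    i.e. when the DDL2 naming convention cannot be applied.
    """
    if not data_name.startswith("_") or "." not in data_name:
        return None
    category, item = data_name[1:].split(".", 1)
    return category, item


def group_by_category(data_names):
    """Group names by inferred category; names without a period are skipped."""
    groups = defaultdict(list)
    for name in data_names:
        parts = split_ddl2_name(name)
        if parts is not None:
            groups[parts[0]].append(name)
    return dict(groups)
```

For example, `split_ddl2_name("_atom_site.fract_x")` yields the category `atom_site` and the item `fract_x`, while a DDL1-style name such as `_atom_site_fract_x` yields `None` because no category can be inferred from the name alone.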
10. The character string [local] is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.
11. Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at
http://www.iucr.org/iucr-top/cif/spec/reserved.html
A reserved prefix, e.g. foo, must be used in the following ways
12. There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may themselves contain underscore characters, so a maximal matching search must be made.
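The maximal matching search required by paragraph 12 might be sketched as follows. Several points are assumptions made for illustration: the registry is represented as a plain set of strings, the prefix `foo_bar` is hypothetical, and the prefix is assumed to sit either immediately after the leading underscore or immediately after the period of a DDL2-style name, followed in both cases by an underscore. The function name `match_reserved_prefix` is likewise hypothetical.

```python
def match_reserved_prefix(data_name, registry):
    """Return the longest registered prefix found in data_name, or None.

    Assumed positions for a prefix (not taken from the registry rules):
      * directly after the leading underscore, or
      * directly after the period in a DDL2-style name.
    """
    body = data_name[1:] if data_name.startswith("_") else data_name
    segments = [body]
    if "." in body:
        segments.append(body.split(".", 1)[1])

    best = None
    for segment in segments:
        for prefix in registry:
            # The trailing underscore keeps 'foo' from matching 'foobar_x'.
            if segment.startswith(prefix + "_"):
                if best is None or len(prefix) > len(best):
                    best = prefix
    return best
```

Because prefixes may contain underscores, both `foo` and `foo_bar` can match the same name; keeping the longest candidate is what makes the search maximal.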
13. The published specification for CIF Version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall, Allen & Brown, 1991, p. 657):
... Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The default units, except for the ångström, conform to the SI Standard adopted by the IUCr.
This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the Core. All data values must be expressed in the single unit assigned in the associated dictionary.
A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the Core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic may be used for the purpose. No other use should be made of this dictionary.
16. Many applications distinguish between multi-line text fields and character string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be "char" or "uchar". Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character string delimiters in paragraph 20 of the document "Syntax" are predicated on an approach that handles all subtypes of character or text data equivalently.
17. Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type numb. However, an explicit dictionary declaration of type will override such an assumption.
18. The base data types detailed in the previous section are very general, and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case.
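The fallback rule of paragraph 17 can be illustrated with a short sketch. The mapping `dictionary_types`, the function name `base_type` and the data names beginning `_foo_` are hypothetical. Treating every non-numeric string as type char is an assumption of this sketch, as is the regular expression used to decide what counts as "interpretable as a number".

```python
import re

# Assumed numeric syntax: optional sign, digits with optional decimal point,
# optional exponent, optional standard uncertainty in parentheses.
_NUMERIC = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?")


def base_type(data_name, raw_value, dictionary_types):
    """Return the base type for a value, preferring the dictionary declaration."""
    declared = dictionary_types.get(data_name)
    if declared is not None:
        return declared  # an explicit declaration overrides any inference
    return "numb" if _NUMERIC.fullmatch(raw_value) else "char"
```

With an empty mapping, `base_type("_cell_length_a", "5.959(1)", {})` falls back to numb, whereas a value such as `0042` is reported as char when the mapping declares its data name to be of type char.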
19. DDL1 dictionaries
Standard uncertainties: Values of type numb may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute
_type_conditions esd
For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).
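The arithmetic behind the two examples above can be made explicit with a small parser. This is a sketch under stated assumptions: the function name `parse_numb` is hypothetical, the accepted number syntax is limited to what the regular expression below allows, and `Decimal` is used only so that 1.2 comes out exactly rather than as a binary floating-point approximation.

```python
import re
from decimal import Decimal

# mantissa, optional exponent, optional standard uncertainty digits
_NUMB = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+))(?:[eE]([+-]?\d+))?(?:\((\d+)\))?")


def parse_numb(text):
    """Return (value, su) as Decimals; su is None when no uncertainty is given."""
    match = _NUMB.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numb value: {text!r}")
    mantissa, exponent, su_digits = match.groups()

    scale = Decimal(10) ** int(exponent or 0)
    value = Decimal(mantissa) * scale
    if su_digits is None:
        return value, None

    # The uncertainty applies to the final digit(s) of the mantissa, so it is
    # shifted by the number of decimal places and then by the exponent.
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    su = Decimal(su_digits) * Decimal(10) ** (-decimals) * scale
    return value, su
```

Both `parse_numb("34.5(12)")` and `parse_numb("3.45E1(12)")` give a value equal to 34.5 with a standard uncertainty equal to 1.2, which is the equivalence the paragraph describes.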
20. DDL2 dictionaries
DDL2 provides a number of tags that may be used in a dictionary file to specify subtypes for data items defined by that dictionary alone. Examples of the subtypes specified for the macromolecular CIF dictionary are:
code | identifying code strings or single words |
ucode | identifying code strings or single words (case-insensitive) |
uchar1 | single-character codes (case-insensitive) |
uchar3 | three-character codes (case-insensitive) |
line | character strings forming a single line of text |
uline | character strings forming a single line of text (case-insensitive) |
text | multi-line text |
int | integers |
float | floating-point real numbers |
yyyy-mm-dd | dates |
symop | symmetry operations |
any | any type permitted |
22. The value ? means that the actual value of a requested data item is unknown.
23. The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.
_audit_conform_dict_name
_audit_conform_dict_version
_audit_conform_dict_location
corresponding to DDL1 dictionaries, or
_audit_conform.dict_name
_audit_conform.dict_version
_audit_conform.dict_location
for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform to the core CIF dictionary.
27. The _audit_conform data items may be looped in the case where more than one dictionary is used to define the items in a CIF, and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL.
28. A detailed protocol exists for locating, merging and overlaying multiple dictionary files.
\' | acute (é) | \" | umlaut (ü) | \= | overbar |
\` | grave (à) | \~ | tilde (ñ) | \. | overdot |
\^ | circumflex (â) | \; | ogonek | \< | hacek |
\, | cedilla (ç) | \> | Hungarian umlaut | \( | breve |
\%a | a-ring (å) | \?i | dotless i | \&s | German "ss" (ß) |
\/o | o-slash (ø) | \/l | Polish l (ł) | \/d | barred d |
Capital letters may also be used in these codes, so an ångström symbol (Å) may be given as \%A.
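A decoder for the codes tabulated above might look like the following sketch. It assumes that each accent code immediately precedes the letter it modifies (as \%A does for Å), and it builds accented letters from Unicode combining marks followed by NFC normalisation. The names `COMBINING`, `SPECIAL` and `decode_accents` are hypothetical, and the glyphs chosen for the capital forms of the special letters are assumptions.

```python
import re
import unicodedata

# Accent code character -> Unicode combining mark
COMBINING = {
    "'": "\u0301", '"': "\u0308", "=": "\u0304",   # acute, umlaut, overbar
    "`": "\u0300", "~": "\u0303", ".": "\u0307",   # grave, tilde, overdot
    "^": "\u0302", ";": "\u0328", "<": "\u030c",   # circumflex, ogonek, hacek
    ",": "\u0327", ">": "\u030b", "(": "\u0306",   # cedilla, Hungarian umlaut, breve
}

# Two-character codes that stand for a whole letter
SPECIAL = {
    "%a": "\u00e5", "%A": "\u00c5", "?i": "\u0131", "&s": "\u00df",
    "/o": "\u00f8", "/O": "\u00d8", "/l": "\u0142", "/L": "\u0141",
    "/d": "\u0111", "/D": "\u0110",
}

_CODE = re.compile(r"\\(.)([A-Za-z])")


def decode_accents(text):
    """Replace backslash accent codes with the corresponding Unicode letters."""
    def replace(match):
        code, letter = match.group(1), match.group(2)
        if code + letter in SPECIAL:
            return SPECIAL[code + letter]
        if code in COMBINING:
            return unicodedata.normalize("NFC", letter + COMBINING[code])
        return match.group(0)  # unknown code: leave the text untouched
    return _CODE.sub(replace, text)
```

Codes that this sketch does not recognise, including those introduced by a doubled backslash, pass through unchanged, so the function can be applied before any other markup handling.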
33. Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:
superscripts | Csp^3^ | for | Csp3 |
subscripts | U~eq~ | for | Ueq |
The closing symbol is essential to return to normal text.
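As an illustration of the bracketing convention, the sketch below converts the markers to HTML `<sup>` and `<sub>` elements. The choice of HTML as the output format and the function name `scripts_to_html` are assumptions made for the example. A circumflex or tilde preceded by a backslash is treated as an accent code and left alone.

```python
import re

# A marker counts only when it is not preceded by a backslash, so that the
# accent codes \^ and \~ are not mistaken for script delimiters.
_SUPERSCRIPT = re.compile(r"(?<!\\)\^(.+?)(?<!\\)\^")
_SUBSCRIPT = re.compile(r"(?<!\\)~(.+?)(?<!\\)~")


def scripts_to_html(text):
    """Convert ^...^ to <sup>...</sup> and ~...~ to <sub>...</sub>."""
    text = _SUPERSCRIPT.sub(r"<sup>\1</sup>", text)
    text = _SUBSCRIPT.sub(r"<sub>\1</sub>", text)
    return text
```

An unpaired marker matches nothing and is left in place, which reflects the requirement that the closing symbol be present.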
34. Some other codes are accepted by convention. These are:
\% | degree (°) | \\times | × |
-- | dash | +- | ± |
--- | single bond | -+ | ∓ |
\\db | double bond | \\square | square |
\\tb | triple bond | \\neq | ≠ |
\\ddb | delocalized double bond | \\rangle | > |
\\sim | ~ (N.B. ~ is the code for subscript) | \\langle | < |
\\simeq | ≃ | \\rightarrow | → |
\\infty | ∞ | \\leftarrow | ← |
Note that \\db, \\tb and \\ddb should always be followed by a space, e.g. C=C is denoted by C\\db C.
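The handling of the trailing space after a bond code can be sketched as follows. Only a subset of the conventional codes is covered, the rendering of a triple bond as ≡ is an assumption, and the names `BONDS`, `SYMBOLS` and `expand_codes` are hypothetical.

```python
import re

BONDS = {"db": "=", "tb": "\u2261"}            # bond codes swallow one following space
SYMBOLS = {"times": "\u00d7", "sim": "~", "rangle": ">", "langle": "<"}

# Two literal backslashes, then either a bond code plus its mandatory space,
# or a symbol code that is not the start of a longer code (e.g. \\simeq).
_CODE = re.compile(r"\\\\(db|tb) |\\\\(times|sim|rangle|langle)(?![A-Za-z])")


def expand_codes(text):
    """Expand a few doubled-backslash codes; anything else is left unchanged."""
    def replace(match):
        if match.group(1) is not None:
            return BONDS[match.group(1)]
        return SYMBOLS[match.group(2)]
    return _CODE.sub(replace, text)
```

Requiring the space as part of the bond code is what lets `C\\db C` collapse to C=C without leaving a stray blank between the atoms.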
36. If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.