Essential components in the development of such an ontology are:
In the CIF framework, the objects of discourse are described in so-called data dictionary files, which provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to service different domains of crystallography, and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in the dictionary.
However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.
3. While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs.
4. At the current revision, two conventions (known as Dictionary Definition Languages or DDL) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook, 19xx), and differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items.
5. While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.
6. Strictly, data names should be considered as void of semantic content - they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary.
7. However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit cell, but it should be emphasised from a computer programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.
8. In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.
10. The character string [local] is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.
11. Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at
http://www.iucr.org/iucr-top/cif/spec/reserved.html
A reserved prefix, e.g. foo, must be used in the following ways
12. There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may themselves contain underscore characters, so a maximal matching search must be made.
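The maximal matching search described in paragraph 12 can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `match_reserved_prefix` and the in-memory `registry` list are hypothetical, and the sketch assumes the DDL1-style placement in which the reserved prefix is the first component of the data name after the leading underscore. It also assumes that names are compared without regard to case.

```python
def match_reserved_prefix(data_name, registry):
    """Return the longest registered prefix that opens data_name, or None.

    Assumption: the prefix is the first component after the leading
    underscore and is followed by an underscore (DDL1-style names only).
    """
    name = data_name[1:] if data_name.startswith("_") else data_name
    name = name.lower()  # assumption: case-insensitive comparison
    best = None
    for prefix in registry:
        candidate = prefix.lower() + "_"
        # A prefix may itself contain underscores, so keep the longest match.
        if name.startswith(candidate) and (best is None or len(prefix) > len(best)):
            best = prefix
    return best


# Hypothetical registry contents, for illustration only.
registry = ["foo", "foo_bar", "xyz"]
print(match_reserved_prefix("_foo_bar_my_item", registry))
print(match_reserved_prefix("_atom_site_fract_x", registry))
```

Keeping the longest match, rather than the first, is what prevents `foo` from shadowing a separately registered `foo_bar`.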
13. The published specification for CIF Version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall, Allen & Brown, 1991, p. 657):
... Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The default units, except for the ångström, conform to the SI Standard adopted by the IUCr.
This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the Core. All data values must be expressed in the single unit assigned in the associated dictionary.
A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the Core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic may be used for the purpose. No other use should be made of this dictionary.
16. Many applications distinguish between multi-line text fields and character string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be "char" or "uchar". Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character string delimiters in paragraph 20 of the document "Syntax" are predicated on an approach that handles all subtypes of character or text data equivalently.
17. Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type numb. However, an explicit dictionary declaration of type will override such an assumption.
18. The base data types detailed in the previous section are very general, and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case.
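The fallback rule of paragraph 17 can be sketched as follows. The function name `infer_base_type`, the regular expression used to decide what is "interpretable as a number", and the choice of `char` as the fallback for everything else are all assumptions made for this illustration. A real application should follow the number grammar of the syntax specification.

```python
import re

# Assumed approximation of a CIF number: optional sign, digits with an
# optional decimal point, optional exponent, optional standard uncertainty.
_NUMBER = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?$")


def infer_base_type(value, declared_type=None):
    """Return the base type of a value.

    An explicit dictionary declaration always wins; only when none is
    available is a number-like string assumed to be of type numb.
    """
    if declared_type is not None:
        return declared_type
    return "numb" if _NUMBER.match(value) else "char"


print(infer_base_type("34.5(12)"))
print(infer_base_type("1995", declared_type="char"))
```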
19. DDL1 dictionaries
Standard uncertainties: Values of type numb may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute
_type_conditions esd
For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).
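The interpretation of a trailing standard uncertainty can be illustrated with a short sketch. The function name `parse_numb` and the regular expression are assumptions made for this example. The sketch scales the parenthesised digits to the last decimal place of the mantissa and then applies the exponent, which reproduces the two forms given in the text.

```python
import re
from decimal import Decimal

# Assumed pattern: mantissa, optional exponent, optional (su) digits.
_NUMB = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:[eE]([+-]?\d+))?(?:\((\d+)\))?$")


def parse_numb(text):
    """Split a numb string into (value, standard_uncertainty_or_None)."""
    m = _NUMB.match(text)
    if m is None:
        raise ValueError("not a number: %r" % text)
    mantissa, exponent, su_digits = m.groups()
    scale = Decimal(10) ** int(exponent or 0)
    value = Decimal(mantissa) * scale
    if su_digits is None:
        return value, None
    # The su digits line up with the final digit(s) of the mantissa.
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    su = Decimal(su_digits) * Decimal(10) ** (-decimals) * scale
    return value, su


print(parse_numb("34.5(12)"))
print(parse_numb("3.45E1(12)"))
```

`Decimal` is used rather than `float` so that the number of digits recorded in the file is not disturbed by binary rounding.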
20. DDL2 dictionaries
DDL2 provides a number of tags that may be used in a dictionary
file to specify subtypes for data items defined by that
dictionary alone. Examples of the subtypes specified for the
macromolecular CIF dictionary are:
code | identifying code strings or single words |
ucode | identifying code strings or single words (case-insensitive) |
uchar1 | single-character codes (case-insensitive) |
uchar3 | three-character codes (case-insensitive) |
line | character strings forming a single line of text |
uline | character strings forming a single line of text (case-insensitive) |
text | multi-line text |
int | integers |
float | floating-point real numbers |
yyyy-mm-dd | dates |
symop | symmetry operations |
any | any type permitted |
22. The value ? means that the actual value of a requested data item is unknown.
23. The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.
These techniques are applied only to the contents of text fields and to comments.
In order to permit such folding we define a special semantics for use of the backslash. It is important to understand that this does not change the syntax of CIF 1.0. All existing CIFs conforming to the CIF 1.0 specification can be viewed as having exactly the same semantics as they now have. Use of these transformational semantics is optional, but recommended.
In order to avoid confusion between CIFs that have undergone these transformations and those that have not, the special comment beginning with a hash mark immediately followed by a backslash (#\) as the last non-blank characters on a line is reserved to mark the beginning of comments created by folding long-line comments, and the special text field beginning with the sequence line-termination, semicolon, backslash (<eol>;\) as the only non-blank characters on a line is reserved to mark the beginning of text fields created by folding long-line text fields.
The backslash character is used to fold long lines in character strings and comments. Consider a comment which extends beyond column 80. In order to provide a comment with the same meaning which can be fitted into 80 character lines, prefix the comment with the special comment consisting of a hash mark followed by a backslash (#\) and the line terminator. Then on new lines take appropriate fragments of the original comment, beginning each fragment with a hash mark and ending all but the last fragment with a backslash. In doing this conversion, check for an original line that ends with a backslash followed only by blanks or tabs. To preserve that backslash in the conversion, add another backslash after it. If the next lexical token (not counting blanks or tabs) is another comment, to avoid fusing this comment with the next comment, be sure to insert a line with just a hash mark.
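The comment-folding procedure just described can be sketched as follows. The function name `fold_comment` and its parameters are hypothetical. The sketch handles one comment at a time, so the caller must say whether another comment follows. It emits the separating bare hash line only when the last fragment ends in a doubled backslash, which is one reading of the rule above: that is the case in which reassembly would otherwise continue into the next comment.

```python
def fold_comment(comment, width=80, next_is_comment=False):
    """Fold one comment (a string starting with '#') into lines <= width."""
    if len(comment) <= width:
        return [comment]          # short enough: leave untouched
    body = comment[1:]            # text after the original hash mark
    chunk = width - 2             # room for the leading '#' and trailing '\'
    fragments = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    lines = ["#\\"]               # special marker comment
    lines += ["#" + frag + "\\" for frag in fragments[:-1]]
    last = fragments[-1]
    stripped = last.rstrip(" \t")
    if stripped.endswith("\\"):
        # Preserve an original trailing backslash by doubling it.
        lines.append("#" + stripped + "\\" + last[len(stripped):])
        if next_is_comment:
            lines.append("#")     # stops reassembly before the next comment
    else:
        lines.append("#" + last)
    return lines


for line in fold_comment("#" + "x" * 100):
    print(line)
```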
Similarly, for a character string that extends beyond column 80,
In order to understand this scheme, suppose the CIF fragment (1) below were considered to have long lines, then we could transform them as follows (2):
(1) Initial CIF
################################################### # # # Converted from PDB format to CIF format by # # pdb2cif version 2.3.1 24 Aug 96 # # by # # P.E. Bourne, H.J. Bernstein and F.C. Bernstein # # # ################################################### data_1DIN _entry.id 1DIN loop_ _struct.entry_id _struct.title 1DIN ; DIENELACTONE HYDROLASE AT 2.8 ANGSTROMS Compound:: MOL_ID: 1; MOLECULE: DIENELACTONE HYDROLASE; CHAIN: NULL; SYNONYM: DLH; EC: 3.1.1.45; ENGINEERED: YES Source:: MOL_ID: 1; ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.; STRAIN: B13; EXPRESSION_SYSTEM: EXPRESSED UNDER OWN PROMOTER; EXPRESSION_SYSTEM_PLASMID: PDC100; EXPRESSION_SYSTEM_GENE: CLC D ; _exptl.entry_id 1DIN _exptl.method ' X-RAY DIFFRACTION '
(2) Transformed CIF
#\ ##########################\ ########################## # # #\ # Converted from PDB format\ # to CIF format by # # pdb2cif version 2.3.1 24 Aug 96 # # by # # P.E. Bourne, H.J. Bernstein and F.C. Bernstein # # # ################################################### data_1DIN _entry.id 1DIN loop_ _struct.entry_id _struct.title 1DIN ;\ DIENELACTONE HYDROLASE\ AT 2.8 ANGSTROMS Compound:\ : MOL_ID: 1; MOLECULE: DIENELACTONE HYDROLASE; CHAIN: NULL; SYNONYM: DLH; EC: 3.1.1.45; ENGINEERED: YES Source:: MOL_ID: 1; ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.; STRAIN: B13; EXPRESSION_SYSTEM:\ EXPRESSED UNDER OWN PROMOTER; EXPRESSION_SYSTEM_PLASMID: PDC100; EXPRESSION_SYSTEM_GENE: CLC D ; _exptl.entry_id 1DIN _exptl.method ;\ X-RAY DIFFRACTION \ ;
In making the transformation from the backslash folded form to long lines, it is very important to strip trailing blanks before attempting to recognize a backslash as the last character. When re-assembling text field lines, no reassembly should be done except in text fields that begin with the special sequence described above, line-termination-semicolon-backslash-line-termination, (<eol>;\<eol>), so that text fields which happen to contain backslashes, but which were not created by folding long lines, are not changed. It is also important to remove the trailing backslashes when reassembling long lines. The final line-termination-semicolon sequence of a text field takes priority over the reassembly process and ends it, but a trailing backslash on the last line of a text field very nicely conveys the information that no trailing line termination is intended to be included within the character string.
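The reassembly of a folded text field can be sketched as follows. The function name `unfold_text_field` is hypothetical, and the sketch assumes that `body` holds everything between the opening semicolon and the final line-termination-semicolon sequence, with lines separated by `\n`.

```python
def unfold_text_field(body):
    """Undo backslash folding in the body of a semicolon-delimited text field."""
    lines = body.split("\n")
    # Only fields whose first line is a lone backslash were folded.
    if lines[0].rstrip(" \t") != "\\":
        return body
    pieces = []       # completed logical lines
    current = ""      # text accumulated from folded physical lines
    pending = False   # True while the last physical line ended in a backslash
    for line in lines[1:]:
        stripped = line.rstrip(" \t")   # strip trailing blanks before testing
        if stripped.endswith("\\"):
            current += stripped[:-1]    # drop the folding backslash
            pending = True
        else:
            pieces.append(current + line)
            current = ""
            pending = False
    if pending:
        # A trailing backslash on the last line: no final line termination.
        pieces.append(current)
    return "\n".join(pieces)


print(unfold_text_field("\\\nC:\\foldername\\file\\\nname"))
```

A field that merely contains backslashes but does not open with the marker line is returned unchanged, as the text requires.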
Similarly, when reassembling long-line comments, the reassembly begins with a comment of the form hash-backslash-line-termination. The initial hash mark is retained and then a forward scan is made through line-terminations and blanks for the next comment, from which the initial hash mark is stripped and then the contents of the comment are appended. If that comment ends with a backslash, the trailing backslash is stripped and the process repeats. Note that the process will be ended by intervening tags, values, data blocks or other non-whitespace information, and that the process will not start at all without the special hash-backslash-line-termination comment.
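The comment reassembly can be sketched in the same spirit. The function name `unfold_comments` is hypothetical, and the sketch makes a simplifying assumption: it works on a list of physical lines and only recognises comments that occupy a whole line, whereas a full parser would also have to handle comments that follow other tokens on the same line.

```python
def unfold_comments(lines):
    """Rejoin comments that were folded with the '#\\' marker convention."""
    out = []
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip(" \t") != "#\\":
            out.append(lines[i])       # ordinary line: copy through
            i += 1
            continue
        joined = "#"                   # the initial hash mark is retained
        i += 1
        while True:
            j = i
            while j < n and lines[j].strip(" \t") == "":
                j += 1                 # scan through blank lines
            if j >= n or not lines[j].lstrip(" \t").startswith("#"):
                break                  # ended by other content or end of input
            # Strip the hash mark and trailing blanks from the fragment.
            frag = lines[j].lstrip(" \t")[1:].rstrip(" \t")
            i = j + 1
            if frag.endswith("\\"):
                joined += frag[:-1]    # strip the backslash and keep going
            else:
                joined += frag
                break
        out.append(joined)
    return out


print(unfold_comments(["#\\", "#abc\\", "#def", "data_x"]))
```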
Since there are very few, if any, CIFs which contain text fields and comments beginning this way, in most cases, it is reasonable to adopt the policy of doing this processing unless it is disabled.
Here is another example of folding. The following three text fields would be equivalent:
;C:\foldername\filename
;

;\
C:\foldername\filename
;

and

;\
C:\foldername\file\
name
;

but the next example would be a two-line value where the first line had the value "C:\foldername\file\" and the second had the value "name":
; C:\foldername\file\
name
;
Note that backslashes should not be used to fold lines outside of comments and text fields. That would introduce extraneous characters into the CIF and violate the basic syntax rules. In any case, such an action is not necessary.
_audit_conform_dict_name
_audit_conform_dict_version
_audit_conform_dict_location
corresponding to DDL1 dictionaries, or
_audit_conform.dict_name
_audit_conform.dict_version
_audit_conform.dict_location
for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform against the core CIF dictionary.
27. The _audit_conform data items may be looped in the case where more than one dictionary is used to define the items in a CIF, and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL.
28. A detailed protocol exists for locating, merging and overlaying multiple dictionary files.
\' | acute (é) | \" | umlaut (ü) | \= | overbar |
\` | grave (à) | \~ | tilde (ñ) | \. | overdot |
\^ | circumflex (â) | \; | ogonek | \< | hacek |
\, | cedilla (ç) | \> | Hungarian umlaut | \( | breve |
These codes will always be followed by a non-whitespace character.
\%a | a-ring (å) | \?i | dotless i | \&s | German "ss" (ß) |
\/o | o-slash (ø) | \/l | Polish l (ł) | \/d | barred d |
Capital letters may also be used in these codes, so an ångström symbol (Å) may be given as \%A.
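A conversion of these accent and letter codes to Unicode can be sketched as follows. The function name `decode_accents` is hypothetical, and mapping each code to a Unicode combining mark followed by NFC normalisation is a design choice of this sketch rather than something the specification prescribes. Only the codes tabulated above are handled, and the capital forms are limited to those with an obvious Unicode counterpart.

```python
import re
import unicodedata

# Accent code character -> Unicode combining mark.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    ",": "\u0327",  # cedilla
    '"': "\u0308",  # umlaut
    "~": "\u0303",  # tilde
    ";": "\u0328",  # ogonek
    ">": "\u030b",  # Hungarian umlaut (double acute)
    "=": "\u0304",  # overbar
    ".": "\u0307",  # overdot
    "<": "\u030c",  # hacek
    "(": "\u0306",  # breve
}

# Two-character letter codes -> single Unicode letters.
LETTERS = {
    "%a": "\u00e5", "%A": "\u00c5",  # a-ring
    "/o": "\u00f8", "/O": "\u00d8",  # o-slash
    "/l": "\u0142", "/L": "\u0141",  # Polish l
    "/d": "\u0111", "/D": "\u0110",  # barred d
    "?i": "\u0131",                  # dotless i
    "&s": "\u00df",                  # German ss
}

_ACCENT = re.compile(r"\\([" + re.escape("".join(COMBINING)) + r"])(\S)")
_LETTER = re.compile(r"\\([%/?&][A-Za-z])")


def decode_accents(text):
    """Replace the tabulated accent and letter codes with Unicode characters."""
    def letter(m):
        # Unknown combinations are left exactly as written.
        return LETTERS.get(m.group(1), m.group(0))

    def accent(m):
        # Base letter first, then the combining mark, composed by NFC.
        return unicodedata.normalize("NFC", m.group(2) + COMBINING[m.group(1)])

    return _ACCENT.sub(accent, _LETTER.sub(letter, text))


print(decode_accents(r"\%Angstr\"om"))
```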
33. Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:
superscripts | Csp^3^ | for | Csp3 |
subscripts | U~eq~ | for | Ueq |
The closing symbol is essential to return to normal text.
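The bracketing convention can be illustrated by a small converter. The function name `markup_to_html` and the choice of HTML `<sup>` and `<sub>` elements as the output are assumptions made for this sketch; any other rich-text target would serve equally well. A circumflex or tilde that is preceded by a backslash is treated as an accent code and left alone.

```python
import re

# A bracketing character must not be preceded by a backslash,
# otherwise it is an accent code rather than the start of a script.
_SUP = re.compile(r"(?<!\\)\^([^^]+)\^")
_SUB = re.compile(r"(?<!\\)~([^~]+)~")


def markup_to_html(text):
    """Convert ^...^ to <sup>...</sup> and ~...~ to <sub>...</sub>."""
    text = _SUP.sub(r"<sup>\1</sup>", text)
    return _SUB.sub(r"<sub>\1</sub>", text)


print(markup_to_html("Csp^3^ and U~eq~"))
```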
34. Some other codes are accepted by convention. These are:
\% | degree (°) | \\times | × |
-- | dash | +- | ± |
--- | single bond | -+ | ∓ |
\\db | double bond | \\square | square |
\\tb | triple bond | \\neq | ≠ |
\\ddb | delocalized double bond | \\rangle | > |
\\sim | ~ | \\langle | < |
(N.B. ~ is the code for subscript) | \\rightarrow | → |
\\simeq | ≃ | \\leftarrow | ← |
\\infty | ∞ |
Note that \\db, \\tb and \\ddb should always be followed by a space, e.g. C=C is denoted by C\\db C.
36. If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.