This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

The Crystallographic Information File

Syntax

Version 1.1 Specification

Draft of 13 September 2002

Modifications from the last posted version (10 July 2002) are indicated in RED type

Introduction
Definition of terms
File syntax

General features
Character set
White space
End-of-line conventions
Case sensitivity
Implementation restrictions

Maximum line length
Maximum data name, block code and frame code lengths
Single-level loop constructions
Non-expansion of save frame references
Exclusion of global_ blocks

Version identification
Appendix A: A formal grammar for CIF

Summary
Explanation of the formal syntax
Lexical tokens
CIF grammar

References

Introduction

1. This document describes the full syntax of the Crystallographic Information File (CIF).

Definition of terms

2. The following terms are used in this document with the specific meanings indicated here.

2.1. A CIF is a file conforming to the specification herein stated, containing either information on a crystallographic experiment or its results (or similar scientific content); or descriptions of the data identifiers in such a file.
2.2. A data file is understood to convey information relating to a crystallographic experiment.
2.3. A dictionary file is understood to contain information about the data items in one or more data files as identified by their data names.
2.4. A data name is an identifier (a string of characters beginning with an underscore character) of the content of an associated data value.
2.5. A data value is a string of characters representing a particular item of information. It may represent a single numerical value; a letter, word or phrase; extended discursive text; or in principle any coherent unit of data such as an image, audio clip or virtual-reality object.
2.6. A data item is a specific piece of information defined by a data name and an associated data value.
2.7. A tag is understood in this document to be a synonym for data name.
2.8. A data block is the highest-level component of a CIF, containing data items or save frames. A data block is identified by a data block header, which is an isolated character string (that is, bounded by white space and not forming part of a data value) beginning with the reserved characters data_.
2.9. A block code is the variable part of a data block header, e.g. the string foo in the header data_foo.
2.10. A save frame is a partitioned collection of data items within a data block, started by a save frame header, which is an isolated character string beginning with the reserved characters save_, and terminated with an isolated character string containing only the reserved characters save_.
2.11. A frame code is the variable part of a save frame header, e.g. the string foo in the header save_foo.

File syntax

3. The syntax of CIF is a proper subset of the syntax of STAR Files as described by Hall (1991) and Hall & Spadaccini (1994). The general structure is described below under the heading General features, and a number of subsections list specific restrictions to the STAR syntax that are in force within CIF. A formal language grammar using computer science notation is included as Appendix A.

General features

4. A CIF consists of data names (tags) and associated values organized into data blocks. A data block may contain data items (associated data names and data values) and/or it may contain save frames.

5. Save frames may only be used in dictionary files.

5a. Implementation note: At a purely syntactic level there is no way to distinguish between dictionary and data files. (It is also to be noted that not all dictionary files contain save frames.) A fully validating parser must therefore be able to detect the start and termination of save frames, the uniqueness of the frame code within a data block, and the uniqueness of data names within a save frame. It is however legitimate for an application-based parser designed to handle only the contents of data files to consider the presence of a save frame as an error.

6. A data block begins with the reserved string data_ followed immediately by the name of the data block, forming a data block header. A save frame has a similar structure to a data block, but may not itself contain further save frames. A save frame begins with the reserved string save_ followed immediately by the name of the save frame, forming a save frame header. Unlike a data block, a save frame also has a marker for the end of the frame in the form of a repetition of the reserved word save_, this time without the name of the frame. Save frames may not nest. Within a single CIF, no two data blocks may have the same name; within a single data block no two save frames may have the same name, although a save frame may have the same name as a data block in the same CIF.

7. A given data name (tag) (see 2.4 and 2.7) may appear no more than once in a given data block or save frame. A tag may be followed by a single value, or a list of one or more tags may be marked by the preceding reserved word loop_ as the headings of the columns of a table of values. White space is used to separate a data block or save frame header from the contents of the data block or save frame, and to separate tags, values and the reserved word loop_.
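The table-like reading of a loop can be sketched in Python as follows. This is only an illustration: the function name `rows_from_loop` and the tag names in the usage note are hypothetical, and the requirement that the number of values be a multiple of the number of tags is an assumption drawn from the "columns of a table" description rather than a rule stated in this paragraph.

```python
def rows_from_loop(tags, values):
    """Arrange a flat list of loop values into rows keyed by tag.

    Assumption: the values fill the table row by row, so their count
    must be a whole multiple of the number of tags.
    """
    if not tags:
        raise ValueError("a loop needs at least one tag")
    # Data names are case-insensitive, so compare them in lower case.
    lowered = [tag.lower() for tag in tags]
    if len(set(lowered)) != len(lowered):
        raise ValueError("a tag may appear only once")
    width = len(lowered)
    if len(values) % width != 0:
        raise ValueError("value count is not a multiple of the tag count")
    return [
        dict(zip(lowered, values[i:i + width]))
        for i in range(0, len(values), width)
    ]
```

For instance, `rows_from_loop(["_site_label", "_site_x"], ["C1", "0.1", "O1", "0.2"])` yields two rows, one per site.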

8. The reserved word stop_ may optionally be used to mark the end of a loop header and/or of a loop body.

9. The word global_, used in STAR Files to introduce a group of data values with a scope extending to the end of the file, is an additional reserved word in CIF (that is, it may not be used as the unquoted value of any data item).

10. If a data value (see 2.5) contains white space or begins with a character string reserved for a special purpose, it must be delimited by one of several sets of special character strings (the choice of which is constrained if the data value contains characters interpretable as marking a new line of text according to the discussion in the following paragraphs). Such a data value will be indicated by the term non-simple data value.

11. A simple data value (i.e. one which does not contain white space or begin with a special character string) may optionally be delimited by any of the same set of delimiting character strings, except for data values that are to be interpreted as numbers.

12. The special character strings in this context are listed in the following table. The term "non-simple data values" in this table refers to data values beginning with these special character strings.

Character or string	Role
`_`	identifies data name
`#`	identifies comment
`$`	identifies save frame pointer
`'`	delimits non-simple data values
`"`	delimits non-simple data values
`[`	reserved opening delimiter for non-simple data values (see paragraph 19)
`]`	reserved closing delimiter for non-simple data values (see paragraph 19)
`;` at beginning of line of text	delimits non-simple data values
`data_`	identifies data block header
`save_`	identifies save frame header or terminator

In addition the following reserved words may not occur as unquoted data values.

Reserved word	Role
`loop_`	identifies looped list of data
`stop_`	terminates looped list of data
`global_`	reserved as STAR global block header
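A writer of CIFs has to decide when a value must be delimited. The following Python sketch applies the two tables above; the function name `needs_delimiters` is hypothetical. It is deliberately conservative: it treats a leading semicolon as special wherever it occurs (the specification restricts it only at the beginning of a line), and it assumes that an empty string must also be delimited.

```python
# Characters or strings that are special at the start of a value.
SPECIAL_PREFIXES = ("_", "#", "$", "'", '"', "[", "]", ";", "data_", "save_")
# Reserved words that may not appear as unquoted values.
RESERVED_WORDS = ("loop_", "stop_", "global_")


def needs_delimiters(value):
    """Return True if the value cannot safely be written unquoted."""
    if value == "" or any(ch in value for ch in " \t\r\n"):
        return True
    lowered = value.lower()  # reserved words are case-insensitive
    if lowered.startswith(SPECIAL_PREFIXES):
        return True
    return lowered in RESERVED_WORDS
```

The sketch does not choose between quotes and a text field; that choice depends on whether the value contains end-of-line characters or embedded quote characters followed by white space.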

13. The complete syntactic description of a numeric data value is included in Appendix A (section 59) under the production <Numeric>.

14. The base CIF specification distinguishes between character and numeric values (see section 15 of the document "Common semantic features"). Particular CIF applications may make more finely-grained distinctions within these types. The paragraphs immediately above have the corollary that a data value such as 12 that appears within a CIF may be quoted (e.g. '12') if, and only if, it is to be interpreted and stored in computer memory as a character string and not a numeric value. For example '12' might legitimately appear as a label for an atomic site, where another alphabetic or alphanumeric string such as 'C12' is also acceptable; but it may not legitimately be used to represent an integer quantity twelve.
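The distinction between numeric and character values can be approximated with a regular expression. The sketch below is not a transcription of the <Numeric> production: it follows its apparent intent (optional sign, digits with an optional decimal point, optional exponent, optional standard uncertainty in parentheses), and it assumes that an integer may also carry an exponent. The name `looks_numeric` is hypothetical.

```python
import re

# Approximation of <Numeric>: sign, mantissa, exponent, uncertainty.
_NUMERIC = re.compile(
    r"""
    [+-]?                                # optional sign
    (?: \d+ \. \d* | \. \d+ | \d+ )      # 12.  12.5  .5  12
    (?: [eEdD] [+-]? \d+ | [+-] \d+ )?   # e5  E-5  d+5  or bare +5 / -5
    (?: \( \d+ \) )?                     # standard uncertainty, e.g. (3)
    """,
    re.VERBOSE,
)


def looks_numeric(token):
    """Return True if an unquoted token reads as a number."""
    return _NUMERIC.fullmatch(token) is not None
```

A parser following paragraph 14 would apply such a test only to unquoted tokens; a quoted '12' is a character string regardless of its content.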

15. Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.

16. Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item

     _example  'a dog's life'

is legal; the data value is a dog's life.
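The closing-quote rule can be sketched in Python as below. The function name `read_quoted` is hypothetical, and the sketch assumes the line has already been stripped of its end-of-line characters, so that reaching the end of the string counts as being followed by white space.

```python
def read_quoted(line, start):
    """Return (value, next_index) for a quoted string opening at line[start].

    A quote character closes the string only when it is followed by
    white space or by the end of the line.
    """
    quote = line[start]
    if quote not in ("'", '"'):
        raise ValueError("no opening quote at this position")
    i = start + 1
    while i < len(line):
        at_end = i + 1 == len(line)
        if line[i] == quote and (at_end or line[i + 1] in " \t"):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")
```

Applied to the line `_example  'a dog's life'` at the position of the first apostrophe, the function returns the value `a dog's life`, because the apostrophe in `dog's` is followed by the letter s rather than by white space.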

17. The special sequence of end-of-line followed immediately by a semicolon in column one (in C-string notation, "\n;") may also be used as a delimiter at the beginning and end of a character string comprising a data value. The complete bounded string is called a text field, and may be used to convey multi-line values. The end-of-line associated with the closing semicolon does not form part of the data value. Within a multi-line text field, leading white space within text lines must be retained as part of the data value; trailing white space on a line may however be elided.

18. A text field delimited by the \n; digraph may not include a semicolon at the start of a line of text as part of its value. ~~Such a value must be delimited by the matching square brackets described in the next section.~~
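A minimal Python sketch of reading a text field follows. It assumes the file has already been split into lines with the end-of-line characters removed; the name `read_text_field` is hypothetical. Trailing white space is kept as found, although the specification allows it to be elided.

```python
def read_text_field(lines, start):
    """Return (value, closing_index) for a text field opening at lines[start].

    The opening line must have ';' in column one. The field ends at the
    next line that begins with ';'. Leading white space is preserved.
    """
    if not lines[start].startswith(";"):
        raise ValueError("a text field opens with ';' in column one")
    body = [lines[start][1:]]  # text after the opening semicolon
    for i in range(start + 1, len(lines)):
        if lines[i].startswith(";"):
            # The end-of-line before the closing ';' is not part of the value.
            return "\n".join(body), i
        body.append(lines[i])
    raise ValueError("unterminated text field")
```

With the lines `"; foo"`, `"  bar"` and `";"`, the value returned is `" foo\n  bar"`: the leading spaces on both lines are retained.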

19. Matching square bracket characters, '[' and ']', are reserved for possible future introduction as delimiters of multi-line data values. At this revision of the CIF specification a data value may not begin with an unquoted left square bracket character '['. (While not strictly necessary, the right square bracket character ']' is restricted in the same way in recognition of its reserved use as a closing delimiter.)

20. For example, the data value foo may be expressed equivalently as an unquoted string foo, as a quoted string 'foo' ~~, as a bracketed string [foo]~~ or as a text field

;foo
;

By contrast the value of the text field

; foo
  bar
;

is " foo\n  bar" (with the C-string convention of using the \n digraph to represent an end-of-line); the embedded space characters are significant. ~~This value may also occur in a CIF delimited by square brackets as~~
~~[ foo bar]~~
~~where the whitespace before the leading bracket has no significance except to separate the value from any preceding tokens, but the whitespace immediately following the leading bracket is part of the value.~~

21. A comment in a CIF begins with an unquoted character "#" and extends to the end of the current line.
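Putting the white-space, quoting and comment rules together, a single line can be split into tokens roughly as follows. This is a sketch only: the name `tokens_on_line` is hypothetical, text fields are not handled, and the line is assumed to carry no end-of-line characters. A "#" starts a comment only where a new token could begin, so a "#" inside a token or inside a quoted string is ordinary data.

```python
def tokens_on_line(line):
    """Split one line into tokens, dropping any trailing comment."""
    tokens = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch in " \t":
            i += 1  # white space separates tokens
        elif ch == "#":
            break  # unquoted '#' at a token boundary: rest is a comment
        elif ch in "'\"":
            j = i + 1
            # The closing quote must be followed by white space or line end.
            while j < len(line) and not (
                line[j] == ch and (j + 1 == len(line) or line[j + 1] in " \t")
            ):
                j += 1
            if j == len(line):
                raise ValueError("unterminated quoted string")
            tokens.append(line[i:j + 1])
            i = j + 1
        else:
            j = i
            while j < len(line) and line[j] not in " \t":
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens
```

Quoted tokens are returned with their delimiters so that a later stage can still tell '12' from 12.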

Character set

22. Characters within a CIF are restricted to certain printable or white-space characters. Specifically, these are the ones located in the ASCII character set at decimal positions 09 (HT or horizontal tab), 10 (LF or line feed), 13 (CR or carriage return) and the letters, numerals and punctuation marks at positions 32-126.

The ASCII characters at decimal positions 11 (VT or vertical tab) and 12 (FF or form feed), often included in library implementations as white space characters, are explicitly excluded from the CIF character set at this revision.
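A character-set check along these lines is straightforward to sketch in Python. The name `illegal_characters` is hypothetical, and the sketch assumes the file content has already been decoded to a string whose code points match ASCII positions.

```python
# HT, LF, CR and the printable range 32-126; VT (11) and FF (12) are excluded.
ALLOWED = frozenset([9, 10, 13]) | frozenset(range(32, 127))


def illegal_characters(text):
    """Return (index, character) pairs for characters outside the CIF set."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) not in ALLOWED]
```

An empty result means that every character in the text belongs to the permitted set.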

23. The reference to the ASCII character set is specifically to identify characters in an established and widely available standard. It is understood that CIFs may be constructed and maintained on computer platforms that implement other character-set encodings. However, for maximum portability only the characters identified in the section above may be used. Other printable characters, even if available in an accessible character set such as Unicode, must be indicated by some encoding mechanism using only the permitted characters. At this revision, only the encoding convention detailed in sections 29-36 of the document "Common semantic features" is recognised for this purpose.

White space

24. Any of the whitespace characters listed under the heading Character set (i.e. HT, LF, CR) and the visible space character SP (position number 32 in the ASCII encoding) may be used interchangeably to separate tokens, with the exception that the semicolon characters delimiting multiline text fields must be preceded by the whitespace character or characters understood as indicating an end of line (see next section).

End-of-line conventions

25. The way in which a line is terminated is operating-system dependent. The STAR File specification does not address different operating-system conventions for encoding the end of a line of text in a text file. For a file generated and read in the same machine environment, this is rarely a problem, but increasingly applications on a network host may access files on different hosts through protocols designed to present a unified view of a file system. In practice, for current common operating systems many applications may regard the ASCII characters LF or CR or the sequence CR LF as signalling an end-of-line, inasmuch as these represent the end-of-line conventions supported under the common operating systems Unix, MacOS or DOS/Windows. On platforms with record-oriented operating systems applications must understand and implement the appropriate end-of-line convention. Care must be taken when transferring such files to other operating systems to insert the appropriate end-of-line characters for the target operating system. A more complete discussion is given in section 42 below.

Case sensitivity

26. Data names, block and frame codes, and reserved words are case-insensitive. The case of any characters within data values must be respected.
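Because names are case-insensitive, uniqueness checks on data names, block codes and frame codes have to compare them in a single case. A minimal Python sketch, with the hypothetical name `first_duplicate`:

```python
def first_duplicate(names):
    """Return the first name that repeats an earlier one, ignoring case."""
    seen = set()
    for name in names:
        key = name.lower()  # names are restricted to ASCII, so lower() suffices
        if key in seen:
            return name
        seen.add(key)
    return None
```

Data values are never folded in this way, since their case must be respected.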

Implementation restrictions

27. Certain allowed features of STAR File syntax have been expressly excluded or restricted from the CIF implementation.

Maximum line length

28. Lines of text may not exceed 2048 characters in length. This count excludes the character or characters used by the operating system to mark the line termination.

Maximum data name, block code and frame code lengths

29. Data names may not exceed 75 characters in length.

30. Data block codes and save frame codes may not exceed 75 characters in length (and therefore data block headers and save frame headers may not exceed 80 characters in length).
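The three length limits can be checked with a short Python sketch such as the one below. The name `length_problems` is hypothetical. The sketch splits lines naively on white space, so it would misjudge tokens that sit inside quoted strings or text fields; a real checker would run on the output of a proper lexical scan.

```python
MAX_LINE = 2048  # characters per line, excluding the line terminator
MAX_NAME = 75    # data names, block codes and frame codes


def length_problems(lines):
    """Return (line_number, message) pairs for length violations."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append((number, "line too long"))
        for token in line.split():
            lowered = token.lower()
            if token.startswith("_"):
                # The leading underscore is part of the data name.
                if len(token) > MAX_NAME:
                    problems.append((number, "data name too long"))
            elif lowered.startswith(("data_", "save_")):
                # The code is what follows the five reserved characters.
                if len(token) - 5 > MAX_NAME:
                    problems.append((number, "block or frame code too long"))
    return problems
```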

Single-level loop constructs

31. Only a single level of looping is permitted.

Non-expansion of save frame references

32. Save frames are permitted in CIFs, but expressly for the purpose of encapsulating data name definitions within data dictionaries. No reference to these save frames is envisaged, and the save frame reference code permitted in STAR is not used. This means that unquoted character strings commencing with the $ character may not be interpreted as save frame codes in CIF. Use of such unquoted character strings is reserved to guard against subsequent relaxation of this constraint.

Exclusion of global_ blocks

33. In the full STAR specification, blocks of data headed by the special word global_ are permitted before normal data blocks. They contain data names and associated values which are inherited in subsequent data blocks; the scope of a value extends from its point of declaration in a global block to the end of the file. Because rearrangements of the order of data blocks and concatenation of data blocks from different files are commonplace operations in many CIF applications, and because of the difficulty in properly tracking and implementing values implied by global blocks, use of the global_ feature of STAR is expressly forbidden at this revision. To guard against its future introduction, the special word global_ remains reserved in CIF.

Version identification

34. As an archival file format, the CIF specification is expected to change infrequently. Revised specifications will be issued to accompany each substantial modification. A CIF may be considered compliant against the most recent version for which in practice it satisfies all syntactic and content rules as detailed in the formal specification document. However, to signal the version against which compliance was claimed at the time of creation, or to signal the file type and version to applications (such as operating-system utilities), it is recommended that a CIF begin with a structured comment that identifies the version of CIF used. For CIFs compliant with the current specification, the first 11 bytes of the file should be the string

#\#CIF_1.1

immediately followed by one of the whitespace characters permitted in the section Character set.
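A file-type check based on this structured comment might look like the Python sketch below. The name `has_cif11_magic` is hypothetical. The sketch reads the rule as the 10-character string followed by one white-space byte, 11 bytes in all, which is an interpretation of the wording above.

```python
MAGIC = b"#\\#CIF_1.1"  # ten bytes: '#', backslash, '#', then CIF_1.1
WHITESPACE = (b" ", b"\t", b"\n", b"\r")


def has_cif11_magic(data):
    """Return True if the byte string starts with the CIF 1.1 identifier."""
    return data[:10] == MAGIC and data[10:11] in WHITESPACE
```

Since the comment is only recommended, its absence does not by itself make a file invalid.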

Appendix A: A formal grammar for CIF

Summary

Syntactic Unit	Syntax	Case Sensitive?
Basic Structure of a CIF
`<CIF>`	`<Comments>? <WhiteSpace>? { <DataBlock> { <WhiteSpace> <DataBlock> }* { <WhiteSpace> }? }?`	yes
`<DataBlock>`	`<DataBlockHeading> {<WhiteSpace> { <DataItems> | <SaveFrame>} }*`	yes
`<DataBlockHeading>`	`<DATA_> { <NonBlankChar> }+`	no
`<SaveFrame>`	`<SaveFrameHeading> { <WhiteSpace> <DataItems> }+ <WhiteSpace> <SAVE_>`	yes
`<SaveFrameHeading>`	`<SAVE_> { <NonBlankChar> }+`	no
`<DataItems>`	`<Tag> <WhiteSpace> <Value> | <LoopHeader> <LoopBody>`	yes
`<LoopHeader>`	`<LOOP_> {<WhiteSpace> <Tag>}+ { <WhiteSpace> <STOP_> }?`	no
`<LoopBody>`	`<Value> { <WhiteSpace> <Value> }* { <WhiteSpace> <STOP_> }?`	yes
Reserved Words
`<DATA_>`	`{'D' | 'd'} {'A' | 'a'} {'T' | 't'} {'A' | 'a'} '_'`	no
`<LOOP_>`	`{'L' | 'l'} {'O' | 'o'} {'O' | 'o'} {'P' | 'p'} '_'`	no
`<GLOBAL_>`	`{'G' | 'g'} {'L' | 'l'} {'O' | 'o'} {'B' | 'b'} {'A' | 'a'} {'L' | 'l'} '_'`	no
`<SAVE_>`	`{'S' | 's'} {'A' | 'a'} {'V' | 'v'} {'E' | 'e'} '_'`	no
`<STOP_>`	`{'S' | 's'} {'T' | 't'} {'O' | 'o'} {'P' | 'p'}'_'`	no
Tags and Values
`<Tag>`	`'_'{ <NonBlankChar>}+`	no
`<Value>`	`{ '.' | '?' | <Numeric> | <CharString> | <TextField> }`	yes
Numeric Values
`<Numeric>`	`{ <Number> | <Number> '(' <UnsignedInteger> ')' }`	no
`<Number>`	`{<Integer> | <Float> }`	no
`<Integer>`	`{ '+' | '-' }? <UnsignedInteger>`	no
`<Float>`	`{ <Integer> | { {'+'|'-'} ? { {<Digit>} * '.' <UnsignedInteger> } | { <Digit>} + '.' } } {<Exponent>} ? } }`	no
`<Exponent>`	`{ {'e' | 'E' | 'd' | 'D'} | { '+' | '- ' } | {'e' | 'E' | 'd' | 'D'} { '+' | '- ' } } <UnsignedInteger>`	no
`<UnsignedInteger>`	`{ <Digit> }+`	no
`<Digit>`	`{ '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' }`	no
Character Strings and Text Fields
`<CharString>`	`<UnquotedString> | <SingleQuotedString> | <DoubleQuotedString>`	yes
`<eol><UnquotedString>`	`<eol><OrdinaryChar> {<NonBlankChar>}*`	yes
`<noteol><UnquotedString>`	`<noteol>{<OrdinaryChar>|';'} {<NonBlankChar>}*`	yes
`<SingleQuotedString> <WhiteSpace>`	`<single_quote>{<AnyPrintChar>}* <single_quote> <WhiteSpace>`	yes
`<DoubleQuotedString> <WhiteSpace>`	`<double_quote> {<AnyPrintChar>}* <double_quote> <WhiteSpace>`	yes
`<TextField>`	`{ <SemiColonTextField> | <BracketTextField> }`	yes
`<eol><SemiColonTextField>`	`<eol>';' { {<AnyPrintChar>}* <eol> {{<TextLeadChar> {<AnyPrintChar>}}? <eol>} } ';'`	yes
~~`<BracketTextField>`~~	~~`'[' { <NonBracketChar> | <BracketTextField>}* ']'`~~	~~yes~~
WhiteSpace and Comments
`<WhiteSpace>`	`{ <SP> | <HT> | <eol> | <TokenizedComments>}+`	yes
`<Comments>`	`{ '#' {<AnyPrintChar>}* <eol>}+`	yes
`<TokenizedComments>`	`{ <SP> | <HT> | <eol> |}+ <Comments>`	yes
Character Sets
`<OrdinaryChar>`	{ '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.' | '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | '\' | '^' | '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | '{' | '|' | '}' | '~' }	yes
~~`<NameChar>`~~	~~`<OrdinaryChar> | '#' | '$' | '_' | '[' | ']'`~~	~~yes~~
`<NonBlankChar>`	`<OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' |';' | '[' | ']'`	yes
`<TextLeadChar>`	`<OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> |'[' | ']'`	yes
`<AnyPrintChar>`	`<OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> | ';' | '[' | ']'`	yes
~~`<NonBracketChar>`~~	~~`<OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> | ';' | '\[' | '\]' | <eol>`~~	~~yes~~

35. The rows of this table are called "productions". Productions are rules for constructing sentences in a language. They are written in terms of "terminal symbols" and "non-terminal symbols". "Terminal symbols" are what actually appear in a language. For example, 'poodle' might be given as a string of terminal symbols in some language discussing dogs. Non-terminal symbols are the higher-level constructs of the language, e.g. sentences, clauses, etc. For example <DOG> might be given as a non-terminal symbol in some language discussing dogs. Productions may be used to infer rules for parsing the language. For example, <DOG> ::= { 'poodle' | 'terrier' | 'bulldog' | 'greyhound' } might be given as a rule telling us what names of types of dogs we are allowed to write in this language. In this table, terminal symbols (i.e. terminal character strings) are enclosed in single quotes. To avoid confusion, the terminal symbol consisting of a single quote (i.e. an apostrophe) is indicated by <single_quote> and the terminal symbol consisting of a double quote is indicated by <double_quote>. The printable space character is indicated by <SP>, the horizontal tab character by <HT>, and the end of a line by <eol>. To allow for the occurrence of a semicolon as the initial character of an unquoted character string, provided it is not the first character in a line of text, the special symbol <noteol> is used below to indicate any character that is not interpretable as a line terminator. The cases of context sensitivity involving the beginning of text fields and the ends of quoted strings are discussed below, but they are most commonly resolved in a lexical scan.

36. Productions can be used to produce documents, or equivalently to check a document to see if it is valid in this grammar. The angle brackets delimit names for the syntactic units (the "non-terminal symbols") we are defining. The curly braces enclose alternatives separated by vertical bars and/or followed by a plus sign for "one or more", an asterisk for "zero or more", or a question mark for "zero or one".

37. In most cases, each production has a single non-terminal symbol in the syntactic unit we are defining. However, in some cases, both the syntactic unit and the syntax begin or end with some common symbol. This indicates that a specific context is required in order for the rule to be applied. We do this because the initial semicolon of a semicolon-delimited text field only has meaning at the beginning of a line, and quoted strings may contain their initial quoting character provided the embedded quoting character is not immediately followed by white space. ~~(Bracket-delimited text fields follow a different quoting convention; see below).~~ This "context sensitive" notation is unusual in defining computer languages (though very common in the full specifications of many computer and non-computer languages). This context-sensitive notation greatly simplifies our definitions and is simple to implement. We will elaborate the formal definitions below.

37a. In the present revision the production for <TextField> is a trivial equivalence to <SemiColonTextField>. The redundancy is retained to permit possible future extensions to text fields, in particular the possible introduction of a bracket-delimited text value.

Explanation of the Formal Syntax

38. Those not familiar with the conventions used in describing language grammars may wish to consult various lecture notes on the subject available on the web, e.g. http://www.bernstein-plus-sons.com/TMM/Parsing

39. In creating a parser for CIF, the normal process is to first perform a "lexical scan" to identify "tokens" in the CIF. A "token" is a grammatical unit, such as a special character, or a tag or a value, or some major grammatical subunit. In the course of a lexical scan, the input stream is reduced to manageable pieces, so that the rest of the parsing may be done more efficiently. The convention we will follow in this document is to mark the "non-terminal" tokens that we build up out of actual strings of characters or which do not have an immediate representation as printable characters by angle brackets, <>, and to indicate the tokens that are actual strings of characters as quoted strings of characters.

40. The precise division between a lexical scan and a full parse is a matter of convenience. We will present a suggested division. Before getting to that point, however, there are some highly machine-dependent matters that need to be resolved. There must be a clear understanding of the character set to be used, and of how files and lines begin and end. The character set will be specified in terms of printable characters and a few control characters from the venerable 7-bit ASCII (the US national variant of the ISO character set). In addition we will need some means of specifying the end of a line.

41. The character set in CIF is restricted to the ASCII control characters <HT> (horizontal tab, position 09 in the ASCII character set), <NL> (new line, position 10 in the ASCII character set, also named <LF>), and <CR> (carriage return, position 13 in the ASCII character set), and the printable characters in positions 32-126 of the ASCII character set. These are the characters permitted by STAR with the exception of VT (vertical tab, position 11 in the ASCII character set) and FF (form feed, position 12 in the ASCII character set). In general it is a poor practice to use characters which are not common to all national variants of the ISO character set. On systems or in programming languages which do not "work in ASCII", the characters themselves may have different numeric values and in some cases there is no access to all the control characters.

42. The <eol> token stands for the system-dependent end-of-line.

Implementation note: CIF implementations may follow common HTML and XML practice in handling <eol>:

On many modern systems, "lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA). To simplify the tasks of applications, the characters passed to an application ... must be as if the ... [parser] normalized all line breaks in external parsed entities ... on input, before parsing, [e.g.] by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character." [from the XML specification http://www.w3.org/TR/2000/REC-xml-20001006].

Because Unix systems use '\n' (the ASCII LF control character, or #xA), and MS Windows systems use '\r''\n' (the ASCII CR control character, or #xD, followed by the ASCII LF control character, or #xA), and classic MacOS systems use '\r', a parser which covers a wide range of systems in a reasonable manner could be constructed using a pseudo-production for <eol> such as

  <eol> ::= { <LF> | <CR><LF> | <CR> }

provided the supporting infrastructure (such as the lexer) deals with the necessary minor adjustment to ensure that each end-of-line is recognized and that all end-of-line control characters are filtered out from the portions of the text stream that are to be processed by other productions. One case to handle with care is the end-of-document case. It is not uncommon to encounter a last line in a document which is not terminated by any of the above-mentioned control characters. Instead, it may be terminated by the end of the character stream or by a special end-of-text-document control character (e.g. #x4 (control-D) or #x1A (control-Z)). A CIF parser should normalize such unterminated terminal lines to appear to an application as if they had been properly terminated. On the other hand, care should also be taken so that in multiple generations of CIF processing such processing does not result in an ever-growing "tail" of empty lines at the end of a CIF document.

This discussion is not meant to imply that a parser for a system which uses one of these line termination conventions must recognize a CIF written using another of these line termination conventions.

This discussion is not meant to imply that parsers on systems that use other line termination conventions and/or non-ASCII character sets need to handle these ASCII control characters.

In processing a valid CIF document, it is always sufficient that a parser be able to recognize the line-termination conventions of text files local to its system environment, and that it recognize the local translations of and the printable characters used to construct a CIF.

However, when circumstances permit, if a parser is able to recognize "alien" line terminations, it is permissible for the parser to accept and process the CIF in that form without treating it as an error.

In writing CIF documents, the software that emits lines should follow the text file line termination conventions of the target system for which it is writing the CIF documents, and not mix conventions from multiple systems. In transmitting CIF documents from system to system, software should be used that causes the document to conform to the line termination conventions of the target system. In most cases this objective can best be achieved by using "text" or "ascii" transmission modes, rather than "binary" or "image" transmission modes.
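The input normalization described in this implementation note can be sketched in Python as follows. The name `normalize_eol` is hypothetical. The sketch maps CR LF and lone CR to LF, drops a trailing control-D or control-Z, and terminates an unterminated last line without adding a further empty line when the text already ends properly.

```python
def normalize_eol(text):
    """Present every end-of-line to the rest of the parser as a single LF."""
    # Replace the two-character sequence first so its CR is not counted twice.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop end-of-text control characters (control-D, control-Z) at the end.
    text = text.rstrip("\x04\x1a")
    # Terminate an unterminated final line, but never add a second LF.
    if text and not text.endswith("\n"):
        text += "\n"
    return text
```

Applying the function to its own output leaves the text unchanged, which avoids the growing tail of empty lines mentioned above.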

43. In order to write the grammar, we need a way to refer to the single-quote characters which we use both to quote within the syntax and to quote within a CIF. To avoid system dependent confusion, we define the following special tokens:

token	meaning
<SP>	' ', the printable space character
<HT>	the horizontal tab character on the system
<eol>	the machine-dependent end of line
<noteol>	the complement of the above; any character that does not indicate the machine-dependent end of line
<single_quote>	the apostrophe, `'`
<double_quote>	the double quote character, `"`

44. There are CIF specifications not definable directly in a context-free BNF. Restrictions in record and dataname lengths, and the parsing of text fields and quoted character strings are best handled in the initial lexical scan. A pure BNF can then be used to parse the tokenized input stream.

Lexical tokens

45. We define a "comment" to be initiated with the character # and followed by any sequence of characters (which may include <SP> or <HT>). The only characters not allowed are those in the production <eol>, because an <eol> terminates the comment. A comment is recognized only at the beginning of a line or after blanks, i.e. only after space, tab or <eol>. For this reason we define both comments and "tokenized comments". No portion of the essential machine-readable content within a CIF is conveyed by the comments. Comments are for the convenience of human readers of CIFs and may be freely introduced or removed. Note however the optional structured comment sanctioned in the section on Version identification, which has the purpose of indicating the file type and revision level to general-purpose file-handling software.

<Comments>               ::= { '#' {<AnyPrintChar>}* <eol>}+
<TokenizedComments>      ::= { <SP> | <HT> | <eol> }+ <Comments>
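
The rule that a # opens a comment only at the start of a line or after a blank can be sketched as follows. This is an illustrative Python fragment; the function name `comment_start` is an assumption, and the sketch deliberately ignores quoted strings and text fields, which a full scanner would have to skip first.

```python
def comment_start(line: str) -> int:
    """Return the index where a comment begins in `line`, or -1 if there is none.

    Assumes `line` holds no quoted strings or text-field content.
    """
    for i, ch in enumerate(line):
        # A '#' counts only in column 1 or directly after a space or tab.
        if ch == "#" and (i == 0 or line[i - 1] in " \t"):
            return i
    return -1
```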

46. We accept as whitespace all appropriate combinations of spaces, tabs, ends of lines and comments, as well as the beginning of the file. Whitespace comprises the characters able to delimit the lexical tokens.

<WhiteSpace>            ::= { <SP> | <HT> | <eol> | <TokenizedComments> }+

47. Non-blank characters are composed of all the characters in our set, excluding <SP> and <HT> and <eol> characters.

<NonBlankChar>          ::= <OrdinaryChar> | <double_quote> | '#' | '$'
      | <single_quote> | '_' | ';' | '[' | ']'

48. AnyPrintChar characters are composed of all the characters in our set, excluding <eol> characters.

<AnyPrintChar>          ::= <OrdinaryChar> | <double_quote> | '#' | '$' 
      | <single_quote> | '_' | <SP> | <HT> | ';' | '[' | ']'

49. We define a "line of text" to be a line contained within a semicolon-bounded text field. Hence the first character cannot be a semicolon; it may be followed by any number of characters from the set <AnyPrintChar> and terminated with a line termination character. We define the characters in TextLeadChar as those in AnyPrintChar except for the semicolon.

<TextLeadChar>          ::= <OrdinaryChar> | <double_quote> | '#' | '$' 
      | <single_quote> | '_' | <SP> | <HT> | '[' | ']'
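
As a rough illustration of how lines of text are collected, the following Python sketch reads a semicolon-delimited text field from a list of lines whose terminators have already been removed. The function name `read_text_field` and the choice to join the lines with "\n" are assumptions made for the example; the check for whitespace after the closing semicolon is omitted.

```python
from typing import List, Tuple


def read_text_field(lines: List[str], start: int) -> Tuple[str, int]:
    """Read a text field whose opening ';' is the first character of lines[start].

    Returns the field content and the index of the line holding the closing ';'.
    """
    if not lines[start].startswith(";"):
        raise ValueError("text field must open with ';' in column 1")
    body = [lines[start][1:]]
    for i in range(start + 1, len(lines)):
        # Only a semicolon in column 1 ends the field; others are plain text.
        if lines[i].startswith(";"):
            return "\n".join(body), i
        body.append(lines[i])
    raise ValueError("unterminated text field")
```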

50. In a bracket-delimited text field, the semicolon is legal, but brackets must either be properly nested or quoted with a backslash. We must recognize the characters which are not brackets. Since we do not have to recognize a semicolon in column 1, the end-of-line is essentially an ordinary character in this context.

~~<NonBracketChar> ::= <OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> | ';' | '\[' | '\]' | <eol>~~

51. The characters used in the names of tags and data blocks in CIF 1.1 are drawn from the non-blank printable characters minus the quoting symbols other than the square brackets. For historical reasons, square brackets are permitted in names.

<NameChar>              ::=
<OrdinaryChar> | '#' | '$' | '_' | '[' | ']'

52. Ordinary characters are all those printable characters that can initiate a non-quoted character string. These exclude the special characters, ", #, $, ', [, ] and _ and in some cases ;.

<OrdinaryChar>          ::= '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.' 
      | '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' 
      | '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C' 
      | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' 
      | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' 
      | 'X' | 'Y' | 'Z' | '\' | '^' | '`' | 'a' | 'b' | 'c' | 'd'
      | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n'
      | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x'
      | 'y' | 'z' | '{' | '|' | '}' | '~'
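
The character classes defined so far can be written down as plain sets. The sketch below is illustrative Python; the constant names are assumptions chosen to mirror the production names, and `PRINTABLE` is taken to be the characters '!' through '~', which is what the explicit lists above enumerate.

```python
# Non-blank printable characters, '!' (33) through '~' (126).
PRINTABLE = {chr(c) for c in range(33, 127)}

# Characters that are excluded from <OrdinaryChar>.
SPECIAL = set("\"#$'_;[]")

ORDINARY_CHAR = PRINTABLE - SPECIAL
NON_BLANK_CHAR = ORDINARY_CHAR | SPECIAL
ANY_PRINT_CHAR = NON_BLANK_CHAR | {" ", "\t"}
TEXT_LEAD_CHAR = ANY_PRINT_CHAR - {";"}
NAME_CHAR = ORDINARY_CHAR | set("#$_[]")
```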

53. The reserved word data_ (in a case insensitive form).

<DATA_>                 ::= {'d'|'D'} {'a'|'A'} {'t'|'T'} {'a'|'A'} '_'

54. The reserved word loop_ (in a case insensitive form).

<LOOP_>                 ::= {'l'|'L'} {'o'|'O'} {'o'|'O'} {'p'|'P'} '_'

55. The reserved word save_ (in a case insensitive form).

<SAVE_>                 ::= {'s'|'S'} {'a'|'A'} {'v'|'V'} {'e'|'E'} '_'

56. The reserved word stop_ (in a case insensitive form).

<STOP_>                 ::= {'s'|'S'} {'t'|'T'} {'o'|'O'} {'p'|'P'} '_'

57. The reserved word global_ (in a case insensitive form). This is actually a reserved word of STAR, but we define it here so that it may be explicitly excluded as the start of an unquoted string. We do this so that any possible future adoption of STAR features will not invalidate existing CIFs.

<GLOBAL_>               ::= 
      {'g'|'G'} {'l'|'L'} {'o'|'O'} {'b'|'B'} {'a'|'A'} {'l'|'L'} '_'

58. Quoted strings need to be recognized in the lexical scan, because their definition is context sensitive. A string quoted by single quotes may contain a single quote as long as it is not followed by whitespace. A string quoted by double quotes may contain a double quote as long as it is not followed by whitespace. Formally we express this with context-sensitive productions. In practice, a one-character look-ahead is required: when a character matching the opening quote is encountered, the scan continues if the following character is not a space, tab or end of line. When processing a semicolon-delimited text field, the column position has to be remembered to decide whether a semicolon should be recognized. When processing a bracket-delimited text field, the count of open and close brackets needs to be maintained. Note that the requirement that the first open bracket be preceded by whitespace and that the last close bracket be followed by whitespace does not apply to the square brackets within a bracket-delimited text field, and that a backslash suppresses interpretation of a square bracket. For a semicolon-delimited text string, failure to provide trailing whitespace is an error. The <WhiteSpace> on the left-hand side must evaluate to the same string instance on the right-hand side, and the parse must terminate on the first valid match reading left to right.

<SingleQuotedString> <WhiteSpace> ::= <single_quote>{<AnyPrintChar>}* 
      <single_quote> <WhiteSpace>
<DoubleQuotedString> <WhiteSpace> ::= <double_quote> {<AnyPrintChar>}* 
      <double_quote> <WhiteSpace>
<TextField>             ::= { <SemiColonTextField>  | <BracketTextField>  }
<eol><SemiColonTextField>         ::= <eol>';' { {<AnyPrintChar>}* <eol>
            {{<TextLeadChar> {<AnyPrintChar>}*}? <eol>}*
            } ';'
<BracketTextField>      ::= '[' { <NonBracketChar> | <BracketTextField>}* ']'
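
The one-character look-ahead for quoted strings might be implemented along the following lines. This is a hedged Python sketch, not a reference scanner: the name `scan_quoted` is hypothetical, and the input is assumed to be a single line with its terminator already stripped, so reaching the end of the line stands in for <eol>.

```python
from typing import Tuple


def scan_quoted(line: str, start: int) -> Tuple[str, int]:
    """Scan a quoted string that opens at line[start].

    Returns the unquoted content and the index just past the closing quote.
    """
    quote = line[start]
    if quote not in "'\"":
        raise ValueError("not at an opening quote")
    i = start + 1
    while i < len(line):
        if line[i] == quote:
            # Look ahead one character: the quote closes the string only
            # when a space, a tab or the end of the line follows it.
            if i + 1 == len(line) or line[i + 1] in " \t":
                return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")
```

The scan stops at the first quote that satisfies the look-ahead, which corresponds to terminating on the first valid match reading left to right.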

59. Tags and Values are appropriate lexical tokens. The special values of '.' and '?' represent data that are inapplicable or unknown, respectively.

No string the initial 5 characters of which match the production for <LOOP_> is accepted as a non-quoted string.
No string the initial 5 characters of which match the production for <STOP_> is accepted as a non-quoted string.
No string the initial 5 characters of which match the production for <DATA_> is accepted as a non-quoted string.
No string the initial 5 characters of which match the production for <SAVE_> is accepted as a non-quoted string.
No string the initial 7 characters of which match the production for <GLOBAL_> is accepted as a non-quoted string.
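
These five exclusions amount to a case-insensitive prefix test. A minimal Python sketch follows; the names `RESERVED_PREFIXES` and `may_be_unquoted` are illustrative, and the function checks only the reserved-word rule, not the restrictions on the first character of an unquoted string.

```python
RESERVED_PREFIXES = ("data_", "loop_", "save_", "stop_", "global_")


def may_be_unquoted(token: str) -> bool:
    """Return False when `token` begins with a reserved word in any letter case."""
    return not token.lower().startswith(RESERVED_PREFIXES)
```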

Unquoted strings are described by a pair of productions to permit the initial letter of an unquoted string to be a semicolon so long as that does not occur at the beginning of a line. The parser is required to evaluate <noteol> to the same string instance on both sides of the production.

<Tag>                   ::= '_'{ <NonBlankChar>}+
<Value>                 ::= { '.' | '?' | <Numeric> | <CharString> | <TextField> }
<Numeric>               ::= { <Number> | <Number> '(' <UnsignedInteger> ')' }
<Number>                ::= { <Integer> | <Float> }
<Integer>               ::= { '+' | '-' }? <UnsignedInteger>
<Exponent>              ::= { {'e' | 'E' | 'd' | 'D'} | { '+' | '-' } | {'e' | 'E' | 'd' | 'D'} { '+' | '-' } } <UnsignedInteger>
<UnsignedInteger>       ::= { <Digit> }+
<Digit>                 ::= { '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' }
<Float>                 ::= { <Integer> |
                                { {'+'|'-'}?
                                  { { {<Digit>}* '.' <UnsignedInteger> } |
                                    { {<Digit>}+ '.' }
                                  }
                                  {<Exponent>}?
                                }
                            }
<CharString>            ::= <UnquotedString> | <SingleQuotedString> | <DoubleQuotedString>
<eol><UnquotedString>    ::= <eol><OrdinaryChar> {<NonBlankChar>}*
<noteol><UnquotedString> ::= <noteol>{<OrdinaryChar>|';'} {<NonBlankChar>}*
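
One possible reading of the <Numeric> productions as a regular expression is sketched below in Python. The names `NUMERIC_RE` and `parse_numeric` are hypothetical. Two assumptions are made: an exponent is also accepted after a plain integer mantissa, and the parenthesised integer is returned as a digit string without interpreting it.

```python
import re
from typing import Optional, Tuple

NUMERIC_RE = re.compile(
    r"(?P<mantissa>[+-]?(?:\d+\.\d*|\.\d+|\d+))"              # digits, optional '.'
    r"(?:(?:[eEdD](?P<s1>[+-]?)|(?P<s2>[+-]))(?P<exp>\d+))?"  # optional exponent
    r"(?:\((?P<su>\d+)\))?"                                   # optional '(digits)'
)


def parse_numeric(token: str) -> Optional[Tuple[float, Optional[str]]]:
    """Return (value, parenthesised_digits) for a numeric token, else None."""
    m = NUMERIC_RE.fullmatch(token)
    if m is None:
        return None
    text = m.group("mantissa")
    if m.group("exp") is not None:
        # Normalise 'd'/'D' markers and sign-only exponents to the 'e' form.
        sign = m.group("s1") or m.group("s2") or ""
        text += "e" + sign + m.group("exp")
    return float(text), m.group("su")
```

The special values '.' and '?' do not match, so a caller can test for them separately before falling back to a character string.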

CIF grammar

60. A CIF may be an empty file, or it may contain only comments or whitespace, or it may contain one or more data blocks. Comments before the first block are acceptable, and there must be white space between blocks. ~~In CIF 1.1 a file consisting only of comments and whitespace is not acceptable.~~

<CIF>                   ::=
      <Comments>? <WhiteSpace>? 
      { <DataBlock> { <WhiteSpace> <DataBlock> }* { <WhiteSpace> }? }?

61. For a data block, there must be a data heading and zero or more data items or save frames. ~~This means a file consisting of just a data block heading is invalid.~~

<DataBlock>             ::=
      <DataBlockHeading> {<WhiteSpace> { <DataItems> | <SaveFrame> } }*

62. A data block heading consists of the 5 characters data_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters.

<DataBlockHeading>      ::= <DATA_> { <NonBlankChar> }+

63. For a save frame, there must be a save frame heading, some data items and then the reserved word save_.

<SaveFrame>             ::= <SaveFrameHeading> {<WhiteSpace> <DataItems>}+
      <WhiteSpace> <SAVE_>

64. A save frame heading consists of the 5 characters save_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters.

<SaveFrameHeading>      ::= <SAVE_> { <NonBlankChar> }+
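
Both heading productions share one shape, so a single pattern can recognize them. The Python sketch below is illustrative only; `HEADING_RE` and `parse_heading` are assumed names, and the character class `[!-~]` stands for <NonBlankChar>.

```python
import re
from typing import Optional, Tuple

# Reserved word (any letter case), then one or more non-blank characters.
HEADING_RE = re.compile(r"(?i)(data|save)_([!-~]+)")


def parse_heading(token: str) -> Optional[Tuple[str, str]]:
    """Return ('data' or 'save', code) for a heading token, else None."""
    m = HEADING_RE.fullmatch(token)
    if m is None:
        return None
    return m.group(1).lower(), m.group(2)
```

A bare save_ returns None here, which lets a caller treat it as the terminator of a save frame rather than as a heading.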

65. Data comes in two forms.

A data name tag separated from its associated value by a <WhiteSpace>.
Looped data. The number of values in the body must be a multiple of the number of tags in the header.

66. Optionally, the reserved word stop_ may be used to terminate a loop header and/or to terminate a loop body.

<DataItems>             ::= <Tag> <WhiteSpace> <Value> |
         <LoopHeader> <LoopBody>
<LoopHeader>           ::= <LOOP_> { <WhiteSpace> <Tag> }+ { <WhiteSpace> <STOP_> }?
<LoopBody>             ::= <Value> { <WhiteSpace> <Value> }* { <WhiteSpace> <STOP_> }?
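
The requirement that the number of values be a multiple of the number of tags can be checked when the flat value list is grouped into rows. A minimal Python sketch follows; the function name `loop_rows` and the tag names in the usage line are made up for illustration.

```python
from typing import Dict, List


def loop_rows(tags: List[str], values: List[str]) -> List[Dict[str, str]]:
    """Group a flat list of loop values into one dict per row."""
    if not tags:
        raise ValueError("a loop needs at least one tag")
    if not values or len(values) % len(tags) != 0:
        raise ValueError("value count must be a positive multiple of the tag count")
    width = len(tags)
    # Take `width` values at a time and pair them with the tags in order.
    return [dict(zip(tags, values[i:i + width])) for i in range(0, len(values), width)]
```

For example, `loop_rows(["_example_label", "_example_x"], ["O1", "0.5", "C1", "0.25"])` yields two rows, while a list of three values raises `ValueError`.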

References

67.

Hall, S. R. (1991). "The STAR File: A New Format for Electronic Data Transfer and Archiving", J. Chem. Inf. Comput. Sci., 31, 326-333.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). "The Crystallographic Information File (CIF): A New Standard Archive File for Crystallography", Acta Cryst., A47, 655-685.
Hall, S. R. & Cook, A. P. F. (1995). "STAR Dictionary Definition Language: Initial Specification", J. Chem. Inf. Comput. Sci., 35, 819-825.
Hall, S. R. & Spadaccini, N. (1994). "The STAR File: Detailed Specifications", J. Chem. Inf. Comput. Sci., 34, 505-508. See http://www.crystal.uwa.edu.au/cc_star.html

Westbrook, J. D. What is the canonical reference (preferably to a formal published paper) for DDL2?