
DRAFT

The Backus-Naur Form of the CIF syntax and grammar

Revised : 11th October 2000

Nick Spadaccini (nick@cs.uwa.edu.au)

Note the character set in STAR (and CIF by implication) is restricted to ASCII 09-13, 32-126. Other characters from the ASCII set are illegal. If present in a file the error state is well defined, but the functionality of the error handler is not specified. For instance one may choose to return an "illegal file exception" and terminate or one may choose to ignore and skip over the illegal characters.
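The character-set restriction can be checked directly. The following is a minimal sketch in Python; the name `find_illegal_chars` is hypothetical, and reporting offsets (rather than raising an exception or skipping the characters) is just one of the handler behaviours the text leaves open.

```python
# Code points permitted by the text above: ASCII 9-13 and 32-126.
ALLOWED = frozenset(range(9, 14)) | frozenset(range(32, 127))

def find_illegal_chars(text):
    """Return (offset, character) pairs for every character outside the allowed set."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) not in ALLOWED]
```

A caller could raise an "illegal file exception" when the returned list is non-empty, or use the offsets to skip the offending characters.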

The concept of whitespace <wspace> includes a comment, since these only serve to (in a parser sense) delimit tokens anyway.

A text file version of the BNF productions, without highlights and annotations, is available from here.

There are CIF specifications not definable in the BNF. These refer to restrictions in record and dataname lengths. The BNF can be used to define the tokenisation of the input stream, and the following error conditions should be detected in your CIF parsing programs:
  1. The CIF data name restriction to 32 characters, in the original specification, has been superseded (see http://www.iucr.org/iucr-top/cif/cif_core/diff2.0-1.0.html#syn).
  2. A CIF record (line of text) is restricted to 80 characters (from the set defined in <char>). This character count does not include the record termination character(s).
  3. The number of data elements in a <data_loop_values> of a <data_loop> MUST be an integer multiple of the number of data names in the associated <data_loop_field>.
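Conditions 2 and 3 fall outside the grammar, so a parser has to test them separately. A minimal sketch in Python follows; the function names are hypothetical, and the records are assumed to have been split already, with their terminators removed.

```python
def overlong_records(records, limit=80):
    """Return the 1-based numbers of records longer than `limit` characters.

    `records` is a list of lines with the terminators already stripped,
    so the terminators are not counted.
    """
    return [n for n, rec in enumerate(records, start=1) if len(rec) > limit]

def loop_is_balanced(n_names, n_values):
    """True if a loop with `n_names` data names holds a whole number of rows."""
    return n_names > 0 and n_values > 0 and n_values % n_names == 0
```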

LEXICAL tokens

We accept a space, a horizontal and a vertical tab as <blank>.

<blank>                ::= ' ' | '\t' | '\v'  (ASCII 32 09 11)
The non-printing single ASCII characters 10, 12 and 13 are always interpreted as line terminators, as is the two-character sequence 13&10. In this way there is never any operating-system-dependent interpretation of line terminations. It necessarily requires that these characters be used only for line termination.

<terminate>             ::= '\n' | '\r' | '\r\n'  | '\f'  (ASCII 10 13 13&10 12)
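As an illustration of the <terminate> production, the sketch below splits a character stream into records with a Python regular expression. The names `TERMINATE` and `split_records` are hypothetical. The alternative for the two-character sequence is listed first so that 13&10 is consumed as one terminator and not as two.

```python
import re

# '\r\n' must be tried before '\r' so that the pair counts as a single terminator.
TERMINATE = re.compile(r"\r\n|\n|\r|\f")

def split_records(text):
    """Split text into records at every <terminate>, dropping the terminators."""
    return TERMINATE.split(text)
```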
We define a "comment" to be initiated with the character # and followed by any sequence of characters (which include <blank>). The only characters not allowed are those in the production <terminate>, and hence these characters terminate a comment.

<comment>               ::= '#' <char>* 
We accept as whitespace ALL elements in the above three productions. Whitespace comprises the characters that can delimit the lexical tokens. Note that a comment is legitimate whitespace because it must end with a line terminator, and hence delimits tokens.

<wspace>                ::= { <blank> | <terminate> | <comment> }+ 
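One way to realise <wspace> in a hand-written scanner is a single regular expression that treats blanks, terminators and comments alike. This Python sketch uses hypothetical names (`WSPACE`, `skip_wspace`). It assumes the input has already been checked against the permitted character set, so the comment pattern simply accepts anything up to the next terminator.

```python
import re

# <blank> | <terminate> | <comment>, one or more times.
# A comment runs from '#' up to, but not including, the next terminator.
WSPACE = re.compile(r"(?:[ \t\v]|\r\n|[\n\r\f]|#[^\n\r\f]*)+")

def skip_wspace(text, pos=0):
    """Return the index of the first character after any whitespace starting at pos."""
    m = WSPACE.match(text, pos)
    return m.end() if m else pos
```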
Non-blank characters are composed of all the characters in our set, excluding <blank> and <terminate> characters.

<non_blank_char>        ::= <ordinary_char> | '"' | '#' | '$' | '\'' | ';' | '_'
Char characters are composed of all the characters in our set, excluding <terminate> characters.

<char>                  ::= <blank> |  <non_blank_char>
We define a "line of text" to be a line contained within a semicolon bounded text block. Hence the first character CANNOT be a semicolon; it is followed by any number of characters from the set <char>, and the line is terminated with a line termination character.

<line_of_text>          ::= <not_a_semi_colon> <char>* <terminate>
The double quote character.

<D_quote>               ::= '"' (ASCII 34)
The single quote character.

<S_quote>               ::= '\'' (ASCII 39)
The semi colon character.

<semi_colon>            ::= ';' (ASCII 59)
All printable characters EXCEPT the double quote.

<not_a_D_quote>         ::= <ordinary_char> | '#' | '$' | '\'' | ';' | '_'
                          | <blank>
All printable characters EXCEPT the single quote.

<not_a_S_quote>         ::= <ordinary_char> | '\"' | '#' | '$' | ';' | '_'
                          | <blank>
Ordinary characters are all those printable characters that can initiate a non-quoted text string. These exclude the special characters, ", #, $, ' and _ and in some cases ;.

<ordinary_char>         ::= '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.' 
                          | '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' 
                          | '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C' 
                          | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' 
                          | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' 
                          | 'X' | 'Y' | 'Z' | '[' | '\\' | ']' | '^' | '`' | 'a' | 'b' 
                          | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' 
                          | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' 
                          | 'w' | 'x' | 'y' | 'z' | '{' | '|' | '}' | '~'
All printable characters EXCEPT the semi colon.

<not_a_semi_colon>      ::= <ordinary_char> | '\"' | '#' | '$' | '\'' | '_'
                          | <blank>
The reserved word data_ (in a case insensitive form).

<DATA_>                 ::= {'d'|'D'} {'a'|'A'} {'t'|'T'} {'a'|'A'} '_'
The reserved word loop_ (in a case insensitive form).

<LOOP_>                 ::= {'l'|'L'} {'o'|'O'} {'o'|'O'} {'p'|'P'} '_'
The operating system dependent end-of-file marker.

<EOF>                   ::= end-of-file marker
The reserved words save_, stop_ and global_ (in a case insensitive form). These are actually reserved words of STAR, but we define them here so that they may be explicitly excluded as <non_quoted_1_string> and <non_quoted_2_string>. We do this so that any possible future adoption of these STAR features will not invalidate existing CIFs.

<SAVE_>                 ::= {'s'|'S'} {'a'|'A'} {'v'|'V'} {'e'|'E'} '_'
<STOP_>                 ::= {'s'|'S'} {'t'|'T'} {'o'|'O'} {'p'|'P'} '_'
<GLOBAL_>               ::= {'g'|'G'} {'l'|'L'} {'o'|'O'} {'b'|'B'} {'a'|'A'} {'l'|'L'} '_'

CIF grammar

A CIF file may be an empty file, or it may contain 1 or more data blocks.

<CIF_file>              ::= <data_block> *
There can be any amount of whitespace (remember <wspace> includes comments) before and at least one whitespace after a data block or an end-of-file (EOF). This forces whitespace between data blocks in a single file. There must be a data heading and AT LEAST one data item. This means a file consisting of just a data block heading is INVALID.

<data_block>            ::= <wspace>* <data_heading> <data>+ { <wspace> | <EOF> }+
A data block heading consists of the 5 characters data_ (case-insensitive) immediately followed by at least one non-blank character.

<data_heading>          ::= <DATA_> <non_blank_char>+
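A data heading can be recognised, and the block name extracted, with a case-insensitive pattern. This is a sketch only: `DATA_HEADING` and `block_name` are hypothetical names, and `[!-~]` stands for <non_blank_char> (ASCII 33-126).

```python
import re

# 'data_' in any mix of cases, then at least one non-blank character.
DATA_HEADING = re.compile(r"data_([!-~]+)", re.IGNORECASE)

def block_name(token):
    """Return the block name if token is a <data_heading>, otherwise None."""
    m = DATA_HEADING.fullmatch(token)
    return m.group(1) if m else None
```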
Data comes in 3 forms.
  1. A data name tag separated from its associated value by a trailing <blank>. Note it is explicitly a <blank> and not a <wspace>. This is a TYPE I data value.
  2. A data name tag separated from its associated value by a trailing line break. This is a TYPE II data value.
  3. Looped data.

<data>                  ::= { <wspace>+ <data_name> <wspace>* <blank> <data_value_1> }
                          | { <wspace>+ <data_name> <wspace>* <terminate> <data_value_2> }
                          | <data_loop>
Self explanatory. We must allow for white space preceding the loop_ (case insensitive) keyword, since this is not covered by any of the other productions.

<data_loop>             ::= <wspace>+ <LOOP_> <data_loop_field> <data_loop_values>
A list of at least one data name.

<data_loop_field>       ::= { <wspace>+ <data_name> }+
The definition of a data name. Initiated by a _ character and followed by one or more non-blank and non-terminating characters from the STAR character set.

<data_name>             ::= '_' <non_blank_char>+
As with the <data> production, except that the data loop is not supported since CIF does not accept nested loops.

<data_loop_values>      ::= { { <wspace>* <blank> <data_value_1> }
                          | { <wspace>* <terminate> <data_value_2> } }+ 
TYPE I data values which are immediately preceded by a space. Note the <D_quote_string> and <S_quote_string> data appears as both TYPE I and TYPE II because these are delimited by a character namely " and '.

<data_value_1>          ::= <non_quoted_1_string>
                          | { <D_quote> <D_quote_string> <D_quote> }
                          | { <S_quote> <S_quote_string> <S_quote> }
TYPE II data values which are immediately preceded by a line break. Note the <D_quote_string> and <S_quote_string> data appears as both TYPE I and TYPE II because these are delimited by a character, namely " and '.

<data_value_2>          ::= <non_quoted_2_string>
                          | { <D_quote> <D_quote_string> <D_quote> }
                          | { <S_quote> <S_quote_string> <S_quote> }
                          | { <semi_colon> <SC_bounded_string> <semi_colon> }
This is an unquoted string which is immediately preceded by a space. It cannot begin with certain characters (those in the complement of the <ordinary_char> set), i.e. ", #, $, ' and _. $ is excluded because it has meaning in STAR, and even though CIF is a subset which does not support save frames, the special significance of a string beginning with $ should be protected. However, it can begin with a semi colon. The first character is then followed by any number of non-blank characters.
    Specific exceptions to lexemes which match the <non_quoted_1_string> production.
  1. No string beginning with an underscore (including a single underscore) is accepted as a type 1 non-quoted string. This is actually specified within the production rule.
  2. No string that matches the production for <LOOP_> is accepted as a type 1 non-quoted string.
  3. No string that matches the production for <DATA_> is accepted as a type 1 non-quoted string.
  4. No string that matches the production for <SAVE_> is accepted as a type 1 non-quoted string.
  5. No string that matches the production for <GLOBAL_> is accepted as a type 1 non-quoted string.
  6. No string that matches the production for <data_heading> is accepted as a type 1 non-quoted string.

    In all cases if one wishes to define data values which match lexemes excluded in cases 1-6 above, they should be quoted data values. In this way the quotes protect matching against other CIF tokens.


<non_quoted_1_string>   ::= { <ordinary_char> | <semi_colon> } <non_blank_char>*
This is an unquoted string which is immediately preceded by a line break. As with TYPE I it too cannot begin with a ", #, $, ' or _. It also CANNOT begin with a semi colon, since this would make it semi-colon-delimited data.
    Specific exceptions to lexemes which match the <non_quoted_2_string> production.
  1. Strings beginning with an underscore (including a single underscore) are not accepted as type 2 non-quoted strings. This is actually specified within the production rule.
  2. No string that matches the production for <LOOP_> is accepted as a type 2 non-quoted string.
  3. No string that matches the production for <DATA_> is accepted as a type 2 non-quoted string.
  4. No string that matches the production for <SAVE_> is accepted as a type 2 non-quoted string.
  5. No string that matches the production for <GLOBAL_> is accepted as a type 2 non-quoted string.
  6. No string that matches the production for <data_heading> is accepted as a type 2 non-quoted string.

    In all cases if one wishes to define data values which match lexemes excluded in cases 1-6 above, they should be quoted data values. In this way the quotes protect matching against other CIF tokens.


<non_quoted_2_string>   ::= <ordinary_char> <non_blank_char>*
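The two non-quoted productions and their listed exceptions can be combined into one check. The Python sketch below is illustrative and not a definitive implementation; all names are hypothetical. It also rejects stop_, following the earlier statement that save_, stop_ and global_ are all excluded, even though the numbered exceptions do not list <STOP_>. That choice is an assumption.

```python
import re

# <ordinary_char>: printable ASCII 33-126 except " # $ ' ; _
ORDINARY = r"""(?!["#$';_])[!-~]"""
NON_BLANK = r"[!-~]"  # <non_blank_char>, ASCII 33-126

TYPE_1 = re.compile(rf"(?:{ORDINARY}|;){NON_BLANK}*")  # preceded by a blank
TYPE_2 = re.compile(rf"{ORDINARY}{NON_BLANK}*")        # preceded by a line break

# Reserved words, plus 'data_' followed by anything (a data heading).
# Including stop_ is an assumption; see the lead-in.
EXCLUDED = re.compile(r"loop_|save_|stop_|global_|data_[!-~]*", re.IGNORECASE)

def is_non_quoted(token, after_line_break=False):
    """True if token is acceptable as a type 1 (default) or type 2 non-quoted string."""
    pattern = TYPE_2 if after_line_break else TYPE_1
    return bool(pattern.fullmatch(token)) and not EXCLUDED.fullmatch(token)
```

Values that fail this check can still be stored by quoting them, as the text notes.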
The string between a set of double quotes can consist of any character that is not a double quote, or it can be a double quote as long as it is immediately followed by a non-blank character.

<D_quote_string>        ::= { <D_quote> <non_blank_char> | <not_a_D_quote> }*
The string between a set of single quotes can consist of any character that is not a single quote, or it can be a single quote as long as it is immediately followed by a non-blank character.

<S_quote_string>        ::= { <S_quote> <non_blank_char> | <not_a_S_quote> }*
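The practical consequence of the two quoted-string productions is that a quote character closes the string only when it is followed by whitespace or the end of input. The Python sketch below follows the prose rule one character at a time; `read_quoted` is a hypothetical name, and stopping at a line terminator reflects the fact that neither <not_a_D_quote> nor <not_a_S_quote> includes <terminate>.

```python
WHITESPACE = " \t\v\n\r\f"

def read_quoted(text, start):
    """Read a quoted string whose opening quote is at text[start].

    Return (value, index just past the closing quote); raise ValueError
    if the string is not closed on the same record.
    """
    quote = text[start]
    if quote not in "\"'":
        raise ValueError("not at a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\n\r\f":
            break  # quoted strings cannot span records
        # A quote closes the string only if whitespace or end of input follows.
        if ch == quote and (i + 1 == len(text) or text[i + 1] in WHITESPACE):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")
```

With this rule a value such as 'it's' is read as the four characters it's, because the embedded quote is followed by a non-blank character.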
The string bounded by semi-colons can begin with any number of characters (including those in the <blank> production) but is necessarily terminated by a line break. This forces a line break on the line that contains the "opening" semi colon. After the first line you can have any number of <line_of_text>. Note we treat the first line as special, since it can contain a leading semi-colon, which is not true of <line_of_text>. A <line_of_text> is always terminated with a line break, thus ensuring the closing semi colon is in column 1.

<SC_bounded_string>     ::= <char>* <terminate> <line_of_text>*
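Reading a semicolon-bounded value amounts to collecting records until one begins with a semi colon in column 1. The following Python sketch assumes the opening semi colon has already been found at the start of a record; `read_text_field` is a hypothetical name. It is more lenient than the production in one respect: it accepts empty lines, whereas <line_of_text> as written requires at least one character.

```python
import re

TERMINATE = re.compile(r"\r\n|\n|\r|\f")

def read_text_field(text, start):
    """Read a semicolon-bounded value whose opening ';' is at text[start].

    Return (lines, index just past the closing ';').
    """
    if text[start] != ";":
        raise ValueError("not at a semi colon")
    lines = []
    pos = start + 1
    while True:
        m = TERMINATE.search(text, pos)
        if m is None:
            raise ValueError("unterminated text field")
        lines.append(text[pos:m.start()])
        pos = m.end()
        # A ';' directly after a terminator is in column 1 and closes the field.
        if text.startswith(";", pos):
            return lines, pos + 1
```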