This is an archive copy of the IUCr web site dating from 2008. For current content please visit
https://www.iucr.org.
DRAFT
The Backus-Naur Form of the CIF syntax and grammar
Revised : 11th October 2000
Nick Spadaccini (nick@cs.uwa.edu.au)
Note the character set in STAR (and CIF by implication) is restricted to ASCII 09-13, 32-126.
Other characters from the ASCII set are illegal. If present in a file the error state is well
defined, but the functionality of the error handler is not specified. For instance one may choose
to return an "illegal file exception" and terminate or one may choose to ignore and skip over the
illegal characters.
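As an illustration of the two error-handling policies mentioned above, the following is a minimal Python sketch. The function names are hypothetical and are not part of any CIF specification; the allowed set is taken directly from the note above.

```python
# Allowed code points per the note above: ASCII 9-13 and 32-126.
ALLOWED = frozenset(range(9, 14)) | frozenset(range(32, 127))

def find_illegal_chars(data: bytes):
    """Return (offset, byte value) pairs for every byte outside the allowed set."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]

def strip_illegal_chars(data: bytes) -> bytes:
    """The 'ignore and skip' policy: drop illegal bytes and keep the rest."""
    return bytes(b for b in data if b in ALLOWED)
```

A parser preferring the "illegal file exception" policy could raise an error whenever find_illegal_chars returns a non-empty list.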
The concept of whitespace <wspace> includes a comment, since these
only serve to (in a parser sense) delimit tokens anyway.
A text file version of the BNF productions, without highlights and annotations, is available
from here.
There are CIF specifications not definable in the BNF. These refer to restrictions in record and
dataname lengths. The BNF can be used to define the tokenisation of the input stream, and the
following error conditions should be detected in your CIF parsing programs:
- The CIF data name restriction to 32 characters, in the original specification, has been
superseded (see http://www.iucr.org/iucr-top/cif/cif_core/diff2.0-1.0.html#syn).
- A CIF record (line of text) is restricted to 80 characters (from the set defined in
<char>). This character count does not include the record termination character(s).
- The number of data elements in a <data_loop_values> of a <data_loop> MUST be an
integer multiple of the number of data names in the associated <data_loop_field>.
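The record-length and loop-count checks listed above sit outside the BNF, so a parser has to apply them separately. A minimal Python sketch follows; the function names are hypothetical, and the input is assumed to be already split into records and already counted by the caller.

```python
MAX_RECORD_LENGTH = 80  # limit quoted above; terminators are not counted

def overlong_records(records):
    """Return 1-based numbers of records longer than MAX_RECORD_LENGTH.

    `records` is an iterable of lines with terminators already removed.
    """
    return [n for n, rec in enumerate(records, start=1)
            if len(rec) > MAX_RECORD_LENGTH]

def loop_is_balanced(n_names, n_values):
    """True when the values fill a whole number of loop rows (at least one)."""
    return n_names > 0 and n_values > 0 and n_values % n_names == 0
```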
|
LEXICAL tokens
We accept a space, a horizontal and a vertical tab as <blank>.
<blank> ::= ' ' | '\t' | '\v' (ASCII 32 09 11)
| |
The non-printing single ASCII characters, 10, 12, 13 are always interpreted as being line
terminators as is the two character sequence 13&10. In this way there is never any operating system
dependent interpretation of line terminations. It necessarily requires that these characters can
only be used for line termination.
<terminate> ::= '\n' | '\r' | '\r\n' | '\f' (ASCII 10 13 13&10 12)
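One way to apply this production when splitting an input stream into records is sketched below in Python. The names TERMINATE and split_records are hypothetical. The two character sequence is listed first in the pattern so that it is consumed as a single terminator rather than two.

```python
import re

# '\r\n' comes first so that it counts as one terminator, not two.
TERMINATE = re.compile(r"\r\n|\n|\r|\f")

def split_records(text):
    """Split text into records on any <terminate> sequence."""
    return TERMINATE.split(text)
```

Note that under this rule a lone '\n' followed by '\r' is two terminators, so an empty record lies between them.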
| |
We define a "comment" to be initiated with the character # and followed by any sequence
of characters (which includes <blank>). The only characters not allowed are those in the
production <terminate>, and hence these characters terminate a comment.
<comment> ::= '#' <char>*
| |
We accept as whitespace ALL elements in the above three productions. Whitespace is any sequence of
characters able to delimit the lexical tokens. Note that a comment is a legitimate whitespace because it must
end with a line terminator, and hence delimits tokens.
|
<wspace> ::= { <blank> | <terminate> | <comment> }+
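The <wspace> production can be expressed as a single regular expression. The Python sketch below is one possible rendering, with hypothetical names; it is meant to be applied only at a token boundary, because a # inside a token is an ordinary <non_blank_char> and does not start a comment.

```python
import re

# <blank> | <terminate> | <comment>, repeated one or more times.
# A comment is '#' followed by any <char> (tab, vertical tab, ASCII 32-126).
WSPACE = re.compile(r"(?:[ \t\x0b]|\r\n|[\n\r\f]|#[\t\x0b\x20-\x7e]*)+")

def skip_wspace(text, pos=0):
    """Return the index of the first character after any whitespace at pos."""
    m = WSPACE.match(text, pos)
    return m.end() if m else pos
```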
|
Non-blank characters are composed of all the characters in our set, excluding <blank>
and <terminate> characters.
|
<non_blank_char> ::= <ordinary_char> | '"' | '#' | '$' | '\'' | ';' | '_'
|
Char characters are composed of all the characters in our set, excluding <terminate> characters.
|
<char> ::= <blank> | <non_blank_char>
|
We define a "line of text" to be a line contained within a semicolon bounded text block.
Hence the first character CANNOT be a semicolon, followed by any number of characters
from the set <char> and terminated with a line termination character.
|
<line_of_text> ::= <not_a_semi_colon> <char>* <terminate>
|
The double quote character.
|
<D_quote> ::= '"' (ASCII 34)
|
The single quote character.
|
<S_quote> ::= '\'' (ASCII 39)
|
The semi colon character.
|
<semi_colon> ::= ';' (ASCII 59)
|
All printable characters EXCEPT the double quote.
|
<not_a_D_quote> ::= <ordinary_char> | '#' | '$' | '\'' | ';' | '_'
| <blank>
|
All printable characters EXCEPT the single quote.
|
<not_a_S_quote> ::= <ordinary_char> | '\"' | '#' | '$' | ';' | '_'
| <blank>
|
Ordinary characters are all those printable characters that can initiate a
non-quoted text string. These exclude the special characters, ",
#, $, ' and _ and in some cases ;.
|
<ordinary_char> ::= '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.'
| '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8'
| '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C'
| 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M'
| 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W'
| 'X' | 'Y' | 'Z' | '[' | '\\' | ']' | '^' | '`' | 'a' | 'b'
| 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l'
| 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v'
| 'w' | 'x' | 'y' | 'z' | '{' | '|' | '}' | '~'
|
All printable characters EXCEPT the semi colon.
|
<not_a_semi_colon> ::= <ordinary_char> | '\"' | '#' | '$' | '\'' | '_'
| <blank>
|
The reserved word data_ (in a case insensitive form).
|
<DATA_> ::= {'d'|'D'} {'a'|'A'} {'t'|'T'} {'a'|'A'} '_'
|
The reserved word loop_ (in a case insensitive form).
|
<LOOP_> ::= {'l'|'L'} {'o'|'O'} {'o'|'O'} {'p'|'P'} '_'
|
The operating system dependent end-of-file marker.
|
<EOF> ::= end-of-file marker
|
The reserved words save_, stop_ and global_ (in a case insensitive form). These are
actually reserved words of STAR, but we define them here so that they may be explicitly excluded
from <non_quoted_1_string> and <non_quoted_2_string>. We do this so that any possible future adoption of
these STAR features will not invalidate existing CIFs.
|
<SAVE_> ::= {'s'|'S'} {'a'|'A'} {'v'|'V'} {'e'|'E'} '_'
<STOP_> ::= {'s'|'S'} {'t'|'T'} {'o'|'O'} {'p'|'P'} '_'
<GLOBAL_> ::= {'g'|'G'} {'l'|'L'} {'o'|'O'} {'b'|'B'} {'a'|'A'} {'l'|'L'} '_'
|
CIF grammar
A CIF file may be an empty file, or it may contain 1 or more data blocks.
|
<CIF_file> ::= <data_block> *
|
There can be any amount of whitespace (remember <wspace> includes comments) before a data block,
and the block must be followed by at least one whitespace or an end-of-file (EOF). This
forces whitespace between data blocks in a single file.
There must be a data heading and AT LEAST one data item. This means a file
consisting of just a data block heading is INVALID.
|
<data_block> ::= <wspace>* <data_heading> <data>+ { <wspace> | <EOF> }+
|
A data block heading consists of the 5 characters data_ (case-insensitive) immediately followed by at least one
non-blank character.
|
<data_heading> ::= <DATA_> <non_blank_char>+
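A data heading can be recognised with a case-insensitive pattern. In the Python sketch below, the names DATA_HEADING and block_code are hypothetical. The range \x21-\x7e is the full <non_blank_char> set, that is <ordinary_char> plus the characters ", #, $, ', ; and _.

```python
import re

# <DATA_> in any letter case, then one or more <non_blank_char> (ASCII 33-126).
DATA_HEADING = re.compile(r"data_([\x21-\x7e]+)", re.IGNORECASE)

def block_code(token):
    """Return the text after data_ if token is a data heading, else None."""
    m = DATA_HEADING.fullmatch(token)
    return m.group(1) if m else None
```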
|
Data comes in 3 forms.
- A data name tag separated from its associated value by a trailing <blank>.
Note it is explicitly a <blank> and not a <wspace>. This is a TYPE I data value.
- A data name tag separated from its associated value by a trailing line break.
This is a TYPE II data value.
- Looped data.
|
<data> ::= { <wspace>+ <data_name> <wspace>* <blank> <data_value_1> }
| { <wspace>+ <data_name> <wspace>* <terminate> <data_value_2> }
| <data_loop>
|
Self explanatory. We must allow for white space preceding the loop_
(case insensitive) keyword, since this is
not covered by any of the other productions.
|
<data_loop> ::= <wspace>+ <LOOP_> <data_loop_field> <data_loop_values>
|
A list of at least one data name.
|
<data_loop_field> ::= { <wspace>+ <data_name> }+
|
The definition of a data name. Initiated by a _ character and followed
by one or more non-blank and non-terminating characters from the STAR character set.
|
<data_name> ::= '_' <non_blank_char>+
|
As with the <data> production, except that the data loop is not supported since CIF does not
accept nested loops.
|
<data_loop_values> ::= { { <wspace>* <blank> <data_value_1> }
| { <wspace>* <terminate> <data_value_2> } }+
|
TYPE I data values which are immediately preceded by a space. Note the <D_quote_string>
and <S_quote_string> data appears as both TYPE I and TYPE II because these are delimited
by a character namely " and '.
|
<data_value_1> ::= <non_quoted_1_string>
| { <D_quote> <D_quote_string> <D_quote> }
| { <S_quote> <S_quote_string> <S_quote> }
|
TYPE II data values which are immediately preceded by a line break. Note the <D_quote_string>
and <S_quote_string> data appears as both TYPE I and TYPE II because these are delimited
by a character namely " and '.
|
<data_value_2> ::= <non_quoted_2_string>
| { <D_quote> <D_quote_string> <D_quote> }
| { <S_quote> <S_quote_string> <S_quote> }
| { <semi_colon> <SC_bounded_string> <semi_colon> }
|
This is an unquoted string which is immediately preceded by a space. It cannot begin with any
of the characters in the complement of the <ordinary_char> set, i.e. ", #, $, ' and _.
$ is excluded because it has meaning in STAR, and even though CIF is a subset which does not
support save frames, the special significance of a string beginning with $ should be protected.
However, it can begin with a semi colon. The first character is then followed by any number of non-blank characters.
|
Specific exceptions to lexemes which match the <non_quoted_1_string> production.
- No string beginning with an underscore (including a single underscore) is accepted
as a type 1 non-quoted string. This is actually specified within the production rule.
- No string that matches the production for <LOOP_> is accepted as a type 1 non-quoted string.
- No string that matches the production for <DATA_> is accepted as a type 1 non-quoted string.
- No string that matches the production for <SAVE_> is accepted as a type 1 non-quoted string.
- No string that matches the production for <GLOBAL_> is accepted as a type 1 non-quoted string.
- No string that matches the production for <data_heading> is accepted as a type 1 non-quoted
string.
In all cases if one wishes to define data values which match lexemes excluded in cases 1-6
above, they should be quoted data values. In this way the quotes protect matching against
other CIF tokens.
|
<non_quoted_1_string> ::= { <ordinary_char> | <semi_colon> } <non_blank_char>*
|
This is an unquoted string which is immediately preceded by a line break. As with TYPE I
it too cannot begin with a ", #, $, ' or _. It also CANNOT begin with a semi colon, since this
would make it semi-colon-delimited data.
|
Specific exceptions to lexemes which match the <non_quoted_2_string> production.
- Strings beginning with an underscore (including a single underscore) are not accepted
as type 2 non-quoted strings. This is actually specified within the production rule.
- No string that matches the production for <LOOP_> is accepted as a type 2 non-quoted string.
- No string that matches the production for <DATA_> is accepted as a type 2 non-quoted string.
- No string that matches the production for <SAVE_> is accepted as a type 2 non-quoted string.
- No string that matches the production for <GLOBAL_> is accepted as a type 2 non-quoted string.
- No string that matches the production for <data_heading> is accepted as a type 2 non-quoted
string.
In all cases if one wishes to define data values which match lexemes excluded in cases 1-6
above, they should be quoted data values. In this way the quotes protect matching against
other CIF tokens.
|
<non_quoted_2_string> ::= <ordinary_char> <non_blank_char>*
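The two non-quoted productions and their listed exceptions can be checked together. The Python sketch below uses hypothetical names. The reserved words are matched exactly as their productions are written (data_ may carry a suffix, which covers <data_heading>). Including stop_ is an assumption: the exception lists above do not name it, but the reserved-word section says it is excluded.

```python
import re

NON_BLANK = r"[\x21-\x7e]"
# First character: any non-blank except " # $ ' _ (and also ; for type 2).
TYPE_1 = re.compile(rf"(?![\"#$'_]){NON_BLANK}+")
TYPE_2 = re.compile(rf"(?![\"#$'_;]){NON_BLANK}+")
# Assumption: stop_ is treated as reserved alongside the listed words.
RESERVED = re.compile(rf"loop_|save_|stop_|global_|data_{NON_BLANK}*",
                      re.IGNORECASE)

def is_non_quoted(token, after_line_break=False):
    """True if token is a valid type 1 (or, after a line break, type 2) string."""
    shape = TYPE_2 if after_line_break else TYPE_1
    return bool(shape.fullmatch(token)) and not RESERVED.fullmatch(token)
```

A value rejected by this check can still be written as a quoted data value, as the text above recommends.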
|
The string between a set of double quotes can consist of any character that is not a double quote,
or it can be a double quote as long as it is immediately followed by a non-blank character.
|
<D_quote_string> ::= { <D_quote> <non_blank_char> | <not_a_D_quote> }*
|
The string between a set of single quotes can consist of any character that is not a single quote,
or it can be a single quote as long as it is immediately followed by a non-blank character.
|
<S_quote_string> ::= { <S_quote> <non_blank_char> | <not_a_S_quote> }*
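The two quoted-string productions differ only in the quote character, so one pattern builder can serve both. The following Python sketch uses hypothetical names. The trailing lookahead is an assumption that goes beyond the productions themselves: it reflects that a data value is followed by whitespace or end-of-file, so a closing quote cannot be immediately followed by a non-blank character.

```python
import re

def quoted_string_pattern(quote):
    """Build a pattern for a string delimited by `quote` (either " or ')."""
    q = re.escape(quote)
    non_blank = r"[\x21-\x7e]"
    # Any <char> other than the quote itself (tab, vertical tab, ASCII 32-126).
    not_quote = rf"(?!{q})[\t\x0b\x20-\x7e]"
    # Body: quote followed by a non-blank, or any non-quote character.
    return re.compile(rf"{q}((?:{q}{non_blank}|{not_quote})*){q}(?!{non_blank})")

D_QUOTED = quoted_string_pattern('"')
S_QUOTED = quoted_string_pattern("'")

def read_quoted(pattern, text, pos=0):
    """Return (contents, end index) for a quoted string at pos, else None."""
    m = pattern.match(text, pos)
    return (m.group(1), m.end()) if m else None
```

For example, read_quoted(S_QUOTED, "'a dog's life' x") accepts the embedded quote because it is immediately followed by the non-blank character s.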
|
The string bounded by semi-colons can begin with any number of characters (including those in the
<blank> production) but necessarily terminated by a line break. This forces a line break
on the line that contains the "opening" semi colon. After the first line you can have any number of
<line_of_text>. Note we treat the first line as special, since
it can contain a leading semi-colon, which is not true of <line_of_text>. A
<line_of_text> is always terminated with a line break, thus ensuring the closing
semi colon is in column 1.
|
<SC_bounded_string> ::= <char>* <terminate> <line_of_text>*
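The semicolon-bounded value can be sketched as a pattern that mirrors the productions literally. The Python below uses hypothetical names and assumes the caller has positioned pos at an opening semi colon that directly follows a line terminator.

```python
import re

TERM = r"(?:\r\n|[\n\r\f])"            # <terminate>
CHAR = r"[\t\x0b\x20-\x7e]"            # <char>
NOT_SEMI = r"[\t\x0b\x20-\x3a\x3c-\x7e]"  # <not_a_semi_colon>

# ';' <char>* <terminate> <line_of_text>* ';'
SC_FIELD = re.compile(rf";({CHAR}*{TERM}(?:{NOT_SEMI}{CHAR}*{TERM})*);")

def read_text_field(text, pos=0):
    """Return (contents, end index) for a semicolon-bounded string, else None."""
    m = SC_FIELD.match(text, pos)
    return (m.group(1), m.end()) if m else None
```

Because the pattern only allows the closing semi colon directly after a terminator, it is always found in column 1. Read literally, <line_of_text> needs at least one character before its terminator, so this sketch does not match a block containing a completely empty line.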
|