[ddlm-group] Draft EBNF for CIF2

  • To: ddlm-group <ddlm-group@iucr.org>
  • Subject: [ddlm-group] Draft EBNF for CIF2
  • From: James Hester <jamesrhester@gmail.com>
  • Date: Wed, 20 Aug 2014 16:33:07 +1000
Dear DDLm group,

John Bollinger and I have put our heads together and produced an ISO 14977 EBNF specification of CIF2.0 syntax.  We were working from the 10 August 2011 "CIF Changes" document agreed by this group and approved in Madrid, available in the archives of this list at http://www.iucr.org/__data/iucr/lists/ddlm-group/pdf00001.pdf, with the following change:

(1) triple-quoted strings were returned to the specification as agreed in Montreal, in the form contained in the draft posted at http://www.iucr.org/__data/assets/pdf_file/0020/59420/cif2_syntax_changes-jcb20110728.pdf

Please study the attached EBNF (plain text format) and advise of any errors, omissions or ambiguities.  Note that EBNF is clearly not the best format for machine-generation of parsers, but it should be sufficiently precise and understandable to serve as a foundation document for CIF2.

For those, such as myself, who have not seen the EBNF '-' operator before, the meaning of "A-B" is: all character sequences satisfying A, except those that satisfy B.
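
As a rough illustration of the '-' operator, here is a minimal Python sketch that models A and B as regular expressions and accepts a string only if it satisfies A and does not satisfy B. The helper name `ebnf_except` and the two toy productions are hypothetical and are not taken from the attached EBNF.

```python
import re

def ebnf_except(a: str, b: str):
    """Return a predicate modelling the EBNF exception "A - B".

    `a` and `b` are regular expressions standing in for the productions
    A and B. This helper is illustrative only.
    """
    pattern_a, pattern_b = re.compile(a), re.compile(b)
    return lambda s: (pattern_a.fullmatch(s) is not None
                      and pattern_b.fullmatch(s) is None)

# Toy productions: A = one or more lowercase letters, B = the word "loop".
word_not_loop = ebnf_except(r"[a-z]+", r"loop")

print(word_not_loop("data"))  # True: satisfies A and not B
print(word_not_loop("loop"))  # False: satisfies A but also B
print(word_not_loop("123"))   # False: does not satisfy A
```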

My intention is to start producing documentation for CIF2 on the basis of this EBNF, so your earliest comments would be most appreciated.

all the best,
James.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
(*
 *   Extended Backus-Naur Form of the CIF2 syntax and grammar
 *
 * The CIF2 syntax is closely related to the STAR2 syntax published 
 * by Spadaccini and Hall (J. Chem. Inf. Model., 2012, 52 (8),
 * pp 1901–1906, DOI: 10.1021/ci300074v). The CIF1.1 syntax was derived from
 * the original STAR syntax. 
 * 
 * The allowed character set is those having Unicode
 * code points U+0009, U+000A, U+000D, U+0020 to U+D7FF, and
 * U+E000 to U+10FFFD, less code points of the form U+xFFFE and U+xFFFF,
 * where x is any hexadecimal digit (including 0). Note that the U+0007
 * character used in STAR2 for escaping string terminators is not used
 * in CIF2.
 * 
 * This document follows EBNF syntax as given in ISO/IEC 14977. In 
 * particular, the common "+" notation is replaced by option or 
 * repeat brackets, e.g. "(ab)+" is replaced by "(ab), {ab}".  Moreover,
 * the provided EBNF applies to sequences of Unicode characters -- not
 * sequences of bytes or encoded characters -- independent of any
 * character encoding scheme (but see also below).
 *
 * This particular EBNF uses the "special sequence" mechanism to
 * represent Unicode code points and code point ranges.  Individual
 * code points are represented in a special sequence by the form
 * U+[[h]h]hhhh -- that is, "U+" followed by the 4 to 6 hexadecimal
 * digits of the unsigned, 21-bit code point value.  Ranges are
 * represented by two such code point values separated by a hyphen
 * character.  Whitespace is permitted on either or both sides of
 * any code point value within this special sequence formulation.
 *
 *
 *   CIF2 Grammar and Syntax
 *
 * CIF 2.0 is a binary format consisting of Unicode text encoded
 * in UTF-8.  Notwithstanding its use of (nearly) the full Unicode
 * character repertoire, CIF applies only the semantics described below
 * to decoded character data, especially with respect to whitespace and
 * line termination.  Software consuming CIF data and names generally
 * ascribes additional semantics to them, however, which may include
 * additional Unicode semantics.
 *
 * A CIF2 file consists of a sequence of data blocks separated by
 * whitespace, with required magic code as the first characters.
 * Data may additionally be structured into save frames, which have
 * the same form as data blocks and nest within data block or other
 * save frames.  The data themselves may appear either as individual
 * (key, value) pairs, or they may be organized into tabular
 * structures called "loops".
 *
 * The main grammatical elements of CIFs are data block headers,
 * save frame headers, data names, data values, and a few keywords.
 * These must always be separated from each other by whitespace.
 * However, text blocks (one kind of data value) have a newline as
 * part of their delimiter, and that newline can do double-duty
 * to satisfy the whitespace separation requirement.  Additionally,
 * the requirement does not apply to separating the delimiters of
 * list and table values from member values.  The productions
 * below explicitly include whitespace wherever it is required.
 *
 * Although it cannot be represented in EBNF, CIF 2.0 requires that
 * newlines in the file (see below) not be separated by more than
 * 2048 non-newline characters.
 *)
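
The line-length requirement that EBNF cannot express can be checked separately from parsing. The following is a minimal Python sketch, not part of the draft. It assumes the text is already decoded to Unicode characters, that line terminators have been normalized to U+000A, and that the start and end of the file count as line boundaries. The names `MAX_LINE_LENGTH` and `overlong_lines` are hypothetical.

```python
from typing import List

MAX_LINE_LENGTH = 2048  # limit stated in the comment above

def overlong_lines(text: str) -> List[int]:
    """Return the 1-based numbers of lines longer than MAX_LINE_LENGTH.

    Length is counted in Unicode code points, not bytes, because the
    grammar is defined over characters rather than encoded data.
    """
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if len(line) > MAX_LINE_LENGTH
    ]
```

Calling `overlong_lines` on the decoded file contents yields an empty list when every line is within the limit.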

(*
 * This formulation of the CIF2-file production accepts the end of the
 * input as whitespace for the purpose of satisfying the CIF 2.0
 * requirement that the version comment be followed by whitespace.
 *)
CIF2-file = file-heading, [ newline, [
   [ wspace ], data-block,
   { wspace, data-block }
], [ wspace ], [ comment ] ];

file-heading = [ ?U+FEFF? ], magic-code, { space } ;

(*
 * The "magic code" identifies the CIF version with which an instance
 * document claims to comply.
 *)
magic-code = '#\#CIF_2.0' ;

(*
 * A datablock consists of a data heading followed by zero or more 
 * data items and save frames.
 *)
data-block = data-heading, { container-content } ;

data-heading = data-token, container-name ;

(*
 * A save frame's content has the same form as a data block's, but a save
 * frame cannot occur at the top level. Save frames can be nested.
 *)
save-frame = save-heading, { container-content }, wspace, save-token ;

save-heading = save-token, container-name;

(* A datablock or save frame name is composed of one or more non-blank characters *)
container-name = non-blank-char, { non-blank-char } ;

(*
 * Each element inside a container is either data or a save frame, separated
 * from the header or previous element by whitespace.
 *)
container-content = wspace, ( data | save-frame ) ;

(* Data appear either as key-value pairs, or within data loops. *)
data = ( data-name, wspace-data-value ) | data-loop ;

(*
 * A data loop consists of a loop header (the case-insensitive word "loop_"
 * followed by a sequence of datanames) and then a sequence of one or more
 * whitespace-separated values.  Though it cannot be expressed in EBNF, CIF
 * requires that a loop whose header contains N data names must contain an
 * integral multiple of N data values.
 *)
data-loop = loop-token, wspace, data-name, { wspace, data-name },
  wspace-data-value, { wspace-data-value } ;
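
The multiple-of-N requirement on loops can be enforced after parsing. The following is a minimal Python sketch, assuming the data names and data values of one loop have already been collected into two lists. The function name `loop_rows` is hypothetical and not part of the draft.

```python
from typing import Any, Dict, List

def loop_rows(names: List[str], values: List[Any]) -> List[Dict[str, Any]]:
    """Group loop values into rows of len(names) values each.

    Raises ValueError when the number of values is not a positive
    integral multiple of the number of data names.
    """
    n = len(names)
    if n == 0 or len(values) == 0 or len(values) % n != 0:
        raise ValueError(
            f"{len(values)} values do not fill rows of {n} data names"
        )
    return [dict(zip(names, values[i:i + n]))
            for i in range(0, len(values), n)]
```

For example, two data names followed by four values give two rows, while two data names followed by three values raise `ValueError`.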

(*
 * A dataname begins with an underscore character, and contains one or more
 * additional, non-blank characters.
 *)
data-name = '_' , non-blank-char, { non-blank-char } ;

(*
 * A list contains zero or more whitespace-separated values.  The delimiting
 * brackets may optionally be separated by whitespace from the values,
 * or from each other if there are no values.
 *)
list = '[', [ list-values-start, { wspace-data-value } ], [ wspace ], ']' ;

list-values-start =
    ( [ wspace ], nospace-value )
  | ( [ wspace ], [ comment ], semicolon-text )
  | ( [ wspace-no-line ], unquoted-string )
  | (   wspace-line, unquoted-string-sol ) ;

(*
 * A table contains zero or more whitespace-separated key/value pairs.
 * The delimiting brackets may optionally be separated by whitespace from
 * the values, or from each other if there are no values.
 *)
table = '{', [ [ wspace ], table-pair, { wspace, table-pair } ], [ wspace ], '}' ;

(*
 * Key-value pairs appearing in a table structure take the form 'key':value,
 * where key must be a delimited string (but not a text block) and the value
 * may be any data value.
 *)
table-pair = ( apostrophe-string | quoted-string | apostrophe3-string | quote3-string ),
  ':', ( nospace-value | unquoted-string | semicolon-text ) ;

(*
 * In most contexts, data values must be preceded by explicit and/or
 * implicit whitespace.  Only text blocks have implicit leading whitespace.
 * Additionally, the whitespace preceding an unquoted string affects the
 * form that string may take.
 *)
wspace-data-value =
    ( wspace, nospace-value )
  | ( wspace-no-line, unquoted-string)
  | ( wspace-line, unquoted-string-sol)
  | ( [ wspace ], [ comment ], semicolon-text ) ;

(*
 * These data values have neither implicit leading whitespace nor any special
 * sensitivity to leading whitespace.
 *)
nospace-value =
    quoted-string
  | apostrophe-string
  | quote3-string
  | apostrophe3-string
  | list
  | table ;

(* 
 * Unquoted strings draw from a subset of the CIF character set, have an even
 * more limited first character, and may not have the same form as a data block
 * header, save frame header, or any of several reserved words.  When they appear
 * at the start of a line, unquoted strings may not start with a semicolon.
 *)
unquoted-string-sol = unquoted-string - ( ';', { non-blank-char } ) ;

unquoted-string = ( lead-char, {restrict-char} )
  - ( ( ( data-token | save-token ), { non-blank-char } ) | loop-token | global-token | stop-token ) ;

lead-char = restrict-char - ( '"' | '#' | '$' | "'" | '_' ) ;

restrict-char = non-blank-char - ( '[' | ']' | '{' | '}' ) ;

(* quote-delimited string *)
quoted-string = quoted-delim, quoted-content, quoted-delim ;

quoted-delim = '"' ;

quoted-content = { char - quoted-delim } ;

(* apostrophe-delimited string *)
apostrophe-string = apostrophe-delim, apostrophe-content, apostrophe-delim ;

apostrophe-delim = "'" ;

apostrophe-content = { char - apostrophe-delim } ;

(* triple-quote-delimited string *)
quote3-string = quote3-delim, quote3-content, quote3-delim ;

quote3-delim = '"""' ;

quote3-content = { allchars } - ( { allchars }, quote3-delim, { allchars } ) ;

(* triple-apostrophe-delimited string *)
apostrophe3-string = apostrophe3-delim, apostrophe3-content, apostrophe3-delim ;

apostrophe3-delim = "'''" ;

apostrophe3-content = { allchars } - ( { allchars }, apostrophe3-delim, { allchars } ) ;

(* text block *)
semicolon-text = text-delim, text-content, text-delim ;

text-delim = newline, ';' ;

text-content = { allchars } - ( { allchars }, text-delim, { allchars } ) ;

(*
 * CIF keywords are case-insensitive.
 *
 * The global and stop tokens are part of the original STAR specification;
 * they are reserved in case of future use in CIF.
 *)

data-token = ( 'D' | 'd' ), ( 'A' | 'a' ), ( 'T' | 't' ), ( 'A' | 'a' ), '_';

save-token = ( 'S' | 's' ), ( 'A' | 'a' ), ( 'V' | 'v' ), ( 'E' | 'e' ), '_';

loop-token = ( 'L' | 'l' ), ( 'O' | 'o' ),( 'O' | 'o' ), ( 'P' | 'p' ) , '_' ;

global-token = ( 'G' | 'g' ), ( 'L' | 'l' ), ( 'O' | 'o' ), ( 'B' | 'b' ), ( 'A' | 'a' ), ( 'L' | 'l' ), '_' ;

stop-token = ( 'S' | 's' ), ( 'T' | 't' ), ( 'O' | 'o' ), ( 'P' | 'p' ), '_' ;
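
The exclusions in the unquoted-string production combine these case-insensitive tokens in two ways: anything beginning with the data or save token is excluded, while the loop, global and stop tokens are excluded only as exact matches. The following is a minimal Python sketch of that test, assuming the candidate has already been isolated as a run of non-blank characters. The function name `is_reserved_word` is hypothetical.

```python
def is_reserved_word(token: str) -> bool:
    """Report whether `token` is excluded from being an unquoted string
    by the keyword part of the unquoted-string production.

    str.lower() is used rather than a case-insensitive regular
    expression so that only the ASCII letters of the keywords match.
    """
    lowered = token.lower()
    return (lowered.startswith(("data_", "save_"))
            or lowered in ("loop_", "global_", "stop_"))
```

Under this sketch `DATA_block` and `Loop_` are reserved, whereas `loop_1` and `datum` are not.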

(* Non-whitespace characters: *)
non-blank-char = char - space ;

(*
 * Runs of spaces, tabs, newlines, and newline-terminated comments are
 * required to separate many higher-level elements of this grammar.
 *)
wspace = wspace-no-line | wspace-line ;

(*
 * For some purposes, it matters whether the last character of a run of
 * whitespace is a newline.
 *)
wspace-line = wspace-to-eol, { wspace-to-eol } ;

wspace-no-line = { wspace-to-eol }, space, { space } ;

wspace-to-eol = { space }, [ comment ], newline ;

(*
 * A comment is a hash symbol followed by every character up to, but not
 * including, the end of the line.
 *)
comment = '#', { char } ;

(* All characters except the newline character: *)
char = allchars - newline ;

(* 
 * Only ASCII space and tab characters are significant as inline
 * whitespace. Unicode's classification of certain code points as
 * whitespace is not significant for the purposes of CIF.
 *)
space = ?U+0020? | ?U+0009? ;

(*
 * The two-character sequence U+000D U+000A is recognized as a line
 * terminator as are each of the characters U+000A and U+000D when they
 * appear outside such a sequence.  Similarly to XML, CIF2 interprets
 * each of these line terminator sequences as the single character U+000A,
 * wherever they appear in a CIF instance document, as if that
 * translation were performed prior to parsing.
 *)
newline = ?U+000A? ;
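
The line terminator translation described above can be sketched as a preprocessing step applied to the decoded text before parsing. This is a minimal Python sketch, and the function name `normalize_newlines` is hypothetical.

```python
import re

def normalize_newlines(text: str) -> str:
    """Translate CR LF pairs and lone CR characters to a single U+000A.

    The alternation tries the two-character sequence first, so a CR LF
    pair becomes one newline rather than two. Lone LF is left as is.
    """
    return re.sub(r"\r\n|\r", "\n", text)
```

After this step the grammar only ever sees U+000A as a line terminator, which is what the newline production assumes.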

(* For ease of specification we define a token for the full character set. *)
allchars = ?U+0009? | ?U+000A? | ?U+000D? | ?U+0020 - U+007E?
  | ?U+00A0 - U+D7FF? | ?U+E000 - U+FDCF? | ?U+FDF0 - U+FFFD?
  | ?U+10000 - U+1FFFD?  | ?U+20000 - U+2FFFD?  | ?U+30000 - U+3FFFD?
  | ?U+40000 - U+4FFFD?  | ?U+50000 - U+5FFFD?  | ?U+60000 - U+6FFFD?
  | ?U+70000 - U+7FFFD?  | ?U+80000 - U+8FFFD?  | ?U+90000 - U+9FFFD?
  | ?U+A0000 - U+AFFFD?  | ?U+B0000 - U+BFFFD?  | ?U+C0000 - U+CFFFD?
  | ?U+D0000 - U+DFFFD?  | ?U+E0000 - U+EFFFD?  | ?U+F0000 - U+FFFFD?
  | ?U+100000 - U+10FFFD? ;
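
The allchars production can be transcribed into a code point test. The following is a minimal Python sketch that follows the ranges exactly as written in the production, not an authoritative validator. The function name `is_cif2_char` is hypothetical. Note that the production as written also leaves out U+007F to U+009F and U+FDD0 to U+FDEF.

```python
def is_cif2_char(ch: str) -> bool:
    """Report whether the single character `ch` is matched by allchars."""
    cp = ord(ch)
    if cp in (0x09, 0x0A, 0x0D):
        return True
    if 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xD7FF:
        return True
    if 0xE000 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFFD:
        return True
    # Supplementary planes 1 to 16: every code point except the last
    # two of each plane (those of the form U+xFFFE and U+xFFFF).
    return 0x10000 <= cp <= 0x10FFFD and (cp & 0xFFFF) <= 0xFFFD
```

A whole string can then be checked with `all(is_cif2_char(c) for c in text)`.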

