CIF 2.0 syntax proposal for retaining backwards CIF 1.x compatibility

  • To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
  • Subject: CIF 2.0 syntax proposal for retaining backwards CIF 1.x compatibility
  • From: Saulius Gražulis <grazulis@ibt.lt>
  • Date: Thu, 29 Aug 2013 13:54:07 +0300
Dear COMCIFS members,

as we have discussed during the recent IUCr Data Management workshop, I
have put together a proposal for the CIF 2.0 syntax that is fully
backwards compatible with CIF 1.1. I attach a draft of this proposal
(see the attached "CIF1-CIF2-compatibility.txt" file). The new proposal
uses '[[' and ']]' to delimit tables (instead of '{' and '}'), and
retains the syntax of CIF 1.x quoted strings.

James has offered to try to "break" the proposed syntax, i.e. to see if
it can be parsed correctly in all kinds of unusual cases. I have done
preliminary tests of this kind using my sample implementation; the tests
I have used are attached as the "breakit.zip" file (it features 6
correct cases and 5 incorrect syntax examples). My impression is that
the proposed '[['/']]' syntax can always be parsed as expected, and
syntax errors can be unambiguously detected.

I would be grateful for comments from the COMCIFS members and, if the
proposal is found to be acceptable, for consideration of its inclusion
into the forthcoming CIF 2.0 standard.

Sincerely yours,
Saulius

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
# encoding: UTF-8
Request for CIF Comments: 2013-08-27 v1
Proposal to COMCIFS

S. Gražulis
Vilnius University Institute of Biotechnology
Graičiūno 8
LT-02241 Vilnius
Lithuania
grazulis@ibt.lt
August 2013

      Proposal for a CIF2 syntax backwards compatible with CIF1
      =========================================================

Status of This Memo
-------------------

Abstract
--------

1. Current CIF 2.0 Syntax Proposal
----------------------------------

  The current CIF 2.0 syntax draft introduces several extensions
  compared to the existing CIF 1.x syntax:

    1. The use of the Unicode character set and the UTF-8 file
       encoding;

    2. Different syntax for quoted strings -- the delimiting quotes
       are no longer allowed within such strings (the CIF 1 grammar
       would permit delimiting quotes within quoted strings provided
       these non-closing quotes were not immediately followed by a
       white space);

    3. The curly brace characters, '{' and '}', get special meaning --
       they now denote the beginning and the end of key-value tables.
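
  The CIF 1 quoting rule mentioned in item 2 can be illustrated with a
  minimal Python sketch. The function name 'read_cif1_quoted' is
  hypothetical, and the sketch assumes a single line of input in which
  only space, tab or the end of the line may follow a closing quote.

```python
def read_cif1_quoted(line, start=0):
    """Scan a CIF 1.x style quoted string beginning at line[start].

    Returns (value, index just past the closing quote). A quote character
    closes the string only when it is followed by white space or by the
    end of the line; any other embedded quote is part of the value.
    """
    quote = line[start]
    if quote not in "'\"":
        raise ValueError("not positioned at a quote character")
    i = start + 1
    while i < len(line):
        at_end = i + 1 == len(line)
        if line[i] == quote and (at_end or line[i + 1] in " \t"):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")
```

  With this rule the input 'it's here' is read as the single value
  it's here, whereas a rule that forbids the delimiting quote inside
  the string would stop after the first two characters.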

2. Drawbacks of the Current CIF2 Syntax Proposal
------------------------------------------------

  The current proposal, as outlined in par. 1, has IMHO a number of
  drawbacks. While the suggestion in the par. 1.1 above is fully
  backwards compatible with the CIF 1.x syntax (since every ASCII file
  is at the same time a valid UTF-8 file), the other two changes are
  *not* backwards compatible with CIF 1.x. This gives rise to the
  following inconveniences:

  1. The prospective CIF users will have to learn two different sets
     of syntax rules, and pay attention to when each set of lexical
     rules applies;

  2. The software developers and maintainers will have to keep track
     of two distinct CIF grammars and two distinct parsers. While this
     is of course a technically achievable feat, the question arises
     as to whether it is necessary: "the cheapest, fastest, smallest and
     most reliable software components are those that are absent";

  3. Distinguishing between the CIF 1.x and CIF 2.0 files would
     require either the obligatory '#\#CIF_2.0' header or,
     alternatively, a backtracking parser that would restart parsing a
     file from the beginning if a CIF 2.0 construct is
     encountered. The header, as practice shows, may often be lost
     during file handling, leaving files potentially uninterpretable;
     the back-tracking will fail on non-seekable files, such as
     network connections, Unix-like pipes or other data streams
     generated on-the-fly.

3. A New Proposed Syntax
------------------------

  I will argue that all new features of CIF 2.0, in particular
  syntactically distinct tables, can be achieved while retaining full
  compatibility with the CIF 1.1 syntax, and therefore avoiding the
  drawbacks mentioned in pp. 2.1-3. It also, IMHO, does not add any
  new problems of its own.

  The proposed new CIF 2.0 syntax would look as follows:

  1. The UTF-8 proposal remains unchanged;

  2. The quoted string handling is retained exactly the same as it
     was mandated in CIF 1.1;

  3. The key-value tables, instead of using '{' and '}' as delimiters,
     would use the double-character tokens '[[' and ']]'. No special
     token would separate keys from values; the data would simply come
     in 'key value' pairs. The '[[' delimiter would unambiguously
     signal the start of a table, and an odd number of data values
     would be an error detected at the parse stage.
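
  The pairing rule of item 3 can be sketched in a few lines of Python.
  This is only an illustration under stated assumptions: the name
  'values_to_table' is hypothetical, the values between '[[' and ']]'
  are assumed to be already split into a flat list, and keys are kept
  exactly as written (including any trailing colon).

```python
def values_to_table(values):
    """Pair a flat list of values into a key-value table.

    Values at even positions are keys and values at odd positions are
    the corresponding table values. An odd number of values means that
    a key or a value is missing, which is reported as an error.
    """
    if len(values) % 2 != 0:
        raise ValueError(
            "table has an odd number of values (%d)" % len(values)
        )
    return dict(zip(values[0::2], values[1::2]))
```

  For instance, the values save:, EXPERIMENTAL, file:, core_exptl.dic
  yield a table with the two keys 'save:' and 'file:', while a list of
  three values is rejected.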

4. Example of a CIF2 Syntax file
--------------------------------

  Example (based on the DDLm 'cif_core.dic'):

   _import.get
        [ [[ save: EXPERIMENTAL  file: core_exptl.dic  mode: full ]]
          [[ save: DIFFRACTION   file: core_diffr.dic  mode: full ]]
          [[ save: STRUCTURE     file: core_struc.dic  mode: full ]]
          [[ save: MODEL         file: core_model.dic  mode: full ]]
          [[ save: PUBLICATION   file: core_publn.dic  mode: full ]]
          [[ save: FUNCTION      file: core_funct.dic  mode: full ]] ]

  If a visual cue to highlight keys is desired, a colon can be
  added to the end of the key value, as in the example above ("file:"
  instead of "file"). The use of this colon is not mandated by the
  syntax, but is instead a stylistic convention.

5. Benefits of the New Syntax
-----------------------------

  There are several benefits of adopting the modified proposal from
  par. 3 (using '[[' and CIF 1.x strings instead of '{' and modified
  string syntax):

  1. Every CIF 1.1 file would be at the same time a syntactically
     correct CIF 2.0 file, thus easing compatibility problems and
     making transition from CIF 1.x to CIF 2.0 syntax easier;

  2. One parser could be used for parsing the existing CIF 1.x files
     and the new CIF 2.0, reducing the labour and costs of CIF parser
     support;

  3. The new CIF 2.0 syntax would be easier to explain to students and
     current users -- basically, everything they have learnt about
     CIF 1.1 remains valid, with a few extra additions;

  4. A CIF 1.x style value would be permitted anywhere in the CIF file
     (both 1.x and 2.0); this would make understanding of CIF
     structure for both programmers and chemists/crystallographers
     easier (no need to memorise exceptions that would inevitably be
     necessary in the tables).

  As for the drawbacks, the newly proposed method of using the '[['
  delimiters has, IMHO, none. The need to use spaces to disambiguate
  nested lists and arrays ('[ [[' vs. '[[ [' vs '[ [ [') might be seen
  as a slight inconvenience; however, the practice of requiring spaces
  for token disambiguation is usual in many computer languages (cf.
  'a+++b' in C or C++, and "aaa""bbb" in CIF) and does not lead to any
  noticeable readability or usage difficulties. Accidental omissions of
  single spaces that might change the meaning of the CIF text will in
  all cases lead to detectable syntax errors, noticeable at the
  parsing stage; thus the chance of accidental data corruption because
  of the '[[/[' syntax peculiarities is minimal.
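
  The effect of spaces on bracket tokenisation can be shown with a
  small Python sketch of longest-match tokenisation. It is a
  simplification, not the lexer of any real parser: the names 'TOKEN'
  and 'tokenize' are hypothetical, and bare values are assumed to
  contain neither white space nor brackets (quoted strings and text
  fields are ignored).

```python
import re

# '[[' and ']]' are listed before '[' and ']', so the two-character
# delimiters are tried first (longest match wins).
TOKEN = re.compile(r"\[\[|\]\]|\[|\]|[^\s\[\]]+")


def tokenize(text):
    """Split text into delimiter tokens and bare values."""
    return TOKEN.findall(text)
```

  Here tokenize("[ [[ a ]] ]") gives a list opener followed by a table
  opener, tokenize("[[ [ a ] ]]") gives the reverse order, and
  tokenize("[ [ [") gives three list openers, so the placement of
  spaces alone decides which structure is meant.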

6. Comparison with the CIF2 file under the Current Proposal
-----------------------------------------------------------

  When displayed side-by-side, the texts in the current and the new
  CIF proposal look as follows (extracts are from the DDLm
  'cif_core.dic'):

   # The current CIF 2.0 proposal (using '{/}' and changed strings):
   _import.get       
        [{"save":'EXPERIMENTAL', "file":'core_exptl.dic', "mode":'full' },
         {"save":'DIFFRACTION',  "file":'core_diffr.dic', "mode":'full' },
         {"save":'STRUCTURE',    "file":'core_struc.dic', "mode":'full' },
         {"save":'MODEL',        "file":'core_model.dic', "mode":'full' },
         {"save":'PUBLICATION',  "file":'core_publn.dic', "mode":'full' },
         {"save":'FUNCTION',     "file":'core_funct.dic', "mode":'full' }]
 

   # The new CIF 2.0 proposal (using '[[/]]' and the CIF 1.x strings):
   _import.get
        [ [[ save: EXPERIMENTAL  file: core_exptl.dic   mode: full ]]
          [[ save: DIFFRACTION   file: core_diffr.dic   mode: full ]]
          [[ save: STRUCTURE     file: core_struc.dic   mode: full ]]
          [[ save: MODEL         file: core_model.dic   mode: full ]]
          [[ save: PUBLICATION   file: core_publn.dic   mode: full ]]
          [[ save: FUNCTION      file: core_funct.dic   mode: full ]] ]

  As we can see from the comparison, the new form is shorter (!),
  despite requiring spaces for '[ [[' disambiguation and despite using
  extra spaces for highlighting key-value pairs. It also has less
  "line noise", since with the new proposal we can use unquoted
  strings for keys and the text does not need so many quotes for
  delimiting.

  All these features of the new proposal, IMHO, contribute to the
  readability of the CIF files.

7. Possibility of Future Extensions
-----------------------------------

  The '[[[' in my current test implementation is parsed as '[[ ['.
  This might be confusing, and precludes future extensions of CIF
  2.0 to use triple brackets '[[[' as a single token; thus it is
  probably wise to make the triple brackets, and all other tokens that
  start with double brackets, reserved for future use.
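
  One possible reading of such a reservation is sketched below in
  Python. The function name 'find_reserved' is hypothetical, and the
  sketch assumes that a "token" is a white-space delimited word, so
  that any word beginning with '[[' other than '[[' itself counts as
  reserved. Quoted strings and text fields are not handled, so a
  quoted value that starts with '[[[' would be flagged by mistake.

```python
def find_reserved(text):
    """Return white-space delimited words that would be reserved.

    A word is reported when it starts with '[[' but is not exactly the
    table opener '[['.
    """
    return [
        word for word in text.split()
        if word.startswith("[[") and word != "[["
    ]
```

  Under this reading '[[[ a b c ]]]' is reported because of its first
  word, while '[ [[ a b ]] ]' contains no reserved words.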

8. A Sample Implementation of the Newly Proposed CIF 2.0 Syntax
---------------------------------------------------------------

  An extension of a CIF 1.1 parser to recognise the new proposed CIF
  2.0 syntax is very straightforward and requires addition of two
  lexical token rules and two grammar rules; all other rules remain
  unchanged, proving that the CIF 2.0 in this case is a compatible
  extension of CIF 1.1.

  An example implementation of the newly proposed CIF 2.0 syntax can
  be found (open for anonymous checkouts) on the COD server at:

  svn://www.crystallography.net/cod-tools/branches/saulius-CIF2-proposal/

  Below is the unified diff from the COD CIF parser in cod-tools/:

# BEGINNING OF THE PATCH:

Index: CIFParser.yp
===================================================================
--- CIFParser.yp	(revision 2217)
+++ CIFParser.yp	(revision 2218)
@@ -527,8 +527,34 @@
 				if( $CIFParser::debug >= 1 && $CIFParser::debug <= 2);
 			$_[1];
 		}
+       |        table
+                {       print "TABLE\t\t->\t" .
+				$_[1]->{value} . "\n"
+				if( $CIFParser::debug >= 1 && $CIFParser::debug <= 2);
+			$_[1];
+		}
+       |        list
+                {       print "LIST\t\t->\t" .
+				$_[1]->{value} . "\n"
+				if( $CIFParser::debug >= 1 && $CIFParser::debug <= 2);
+			$_[1];
+		}
 ;
 
+list :
+    START_LIST cif_value_list END_LIST
+    {
+        { value => $_[2] };
+    }
+;
+
+table :
+    START_TABLE cif_value_list END_TABLE
+    {
+        { value => { @{$_[2]} } };
+    }
+;
+
 string
 	:	SQSTRING
 		{
@@ -667,6 +693,8 @@
 sub _Lexer
 {
 	my($parser) = shift;
+
+        ## print $parser->YYData->{INPUT}, "\n";
 	
 	#trimming tokenized comments
 	if( defined $parser->YYData->{INPUT} &&
@@ -798,6 +826,28 @@
 			advance_token($parser, length($1 . $2));
 			return('SQSTRING', $1);
 		}
+		#matching begining or end of tables (associative arrays):
+		if(s/^ \[\[ //x)
+		{
+			advance_token($parser, 2);
+			return('START_TABLE', '[[');
+		}
+		if(s/^ \]\] //x)
+		{
+			advance_token($parser, 2);
+			return('END_TABLE', ']]');
+		}
+		#matching begining or end of lists (arrays):
+		if(s/^\[//)
+		{
+			advance_token($parser, 1);
+			return('START_LIST', '[');
+		}
+		if(s/^\]//)
+		{
+			advance_token($parser, 1);
+			return('END_LIST', ']');
+		}
 		#matching single quoted strings without a closing quote
 		if( ( $parser->{USER}{OPTIONS}{fix_errors} ||
                       $parser->{USER}{OPTIONS}

# END OF THE PATCH
 
  As can be seen, the only additions are the 'table' and 'list'
  grammar productions and their use in the 'cif_value' rule, and the
  code to recognise '[[/]]' and '[/]' delimiter tokens.
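
  For readers who do not use Perl or yapp, the same two productions
  can be sketched as a small recursive-descent parser in Python. This
  is not a port of the COD parser: all names are hypothetical, bare
  values are assumed to contain neither white space nor brackets,
  quoted strings are ignored, and table keys are restricted to bare
  values (an assumption made only to keep the sketch short).

```python
import re

# '[[' and ']]' are tried before '[' and ']' (longest match first).
TOKEN = re.compile(r"\[\[|\]\]|\[|\]|[^\s\[\]]+")
CLOSER = {"[": "]", "[[": "]]"}


def parse_value(tokens, pos):
    """Parse one value starting at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    opener = tokens[pos]
    if opener in ("]", "]]"):
        raise ValueError(f"unexpected {opener!r}")
    if opener not in CLOSER:
        return opener, pos + 1  # a bare value
    items = []
    pos += 1
    while pos < len(tokens) and tokens[pos] not in ("]", "]]"):
        item, pos = parse_value(tokens, pos)
        items.append(item)
    found = tokens[pos] if pos < len(tokens) else "end of input"
    if found != CLOSER[opener]:
        raise ValueError(f"{opener!r} closed by {found!r}")
    if opener == "[":
        return items, pos + 1  # 'list' production
    keys, values = items[0::2], items[1::2]
    if len(keys) != len(values):
        raise ValueError("table has an odd number of values")
    if not all(isinstance(key, str) for key in keys):
        raise ValueError("table keys must be bare values in this sketch")
    return dict(zip(keys, values)), pos + 1  # 'table' production


def parse(text):
    """Parse a single list, table or bare value from text."""
    tokens = TOKEN.findall(text)
    value, pos = parse_value(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the value")
    return value
```

  In this sketch a list of tables such as
  '[ [[ save: MODEL file: core_model.dic ]] ]' becomes a Python list
  holding one dictionary, while run-together brackets such as
  '[[[ a b c ]]]' raise an error because the inner '[' is closed by
  ']]'.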

9. Possible Errors and their Detection by the Parser
----------------------------------------------------

  One source of possible errors is the omission of spaces between the
  '[' and '[[' tokens, resulting in a potentially ambiguous triple
  bracket sequence '[[['.

  The current implementation interprets '[[[' as '[[ [' and ']]]' as
  ']] ]' (the longest possible token is recognised first). This
  results in a mismatched closing sequence if "run-together" bracket
  delimiters are used:

      [[[ a b c ]]]

  is interpreted as:

      [[ [ a b c ]] ]

  As a result, the '[' list is closed by the table delimiter ']]', and
  this mismatch is detected at compile time.

  If the proposal of par. 7 is adopted, the '[[[' token itself would
  already be detected as reserved at compile time, reducing
  ambiguities even further.

  Another source of possible errors is the loss of some values in
  tables. Despite the absence of a delimiter between a key and a
  value, such situations are immediately detected since they result in
  a table with an odd number of values:

  [[ k1: v1 k2: v2 ]] # A correct table with 2 keys and 2 values

  [[ k1: v1 k2: ]]    # Error, one value is lost and we have an odd
                      # number of elements in the table -- detectable
                      # at compile time.

  Thus, the most common mistakes, the loss of a single character or of
  a single value, would be detectable at compile time (without even
  consulting dictionaries) and would not result in data loss or
  confusion.

10. Conclusions
---------------

  The proposal to keep CIF 2.0 backwards compatible with CIF 1.x is
  easy to implement, achieves all goals of CIF 2.0 (syntactically
  distinct lists of values and key-value tables) even more simply than
  the current proposal, and guards against potential data loss at the
  parser level.

  Retaining CIF 2.0 -- CIF 1.x backwards compatibility would make
  maintenance of parsers easier and facilitate learning of the CIF 2.0
  syntax by users and programmers, which should contribute to wider
  and faster adoption of the CIF 2.0 framework, and to easier ways to
  archive and exchange scientific data.

Vilnius
2013-08-29

breakit.zip
