Re: A formal specification for CIF version 1.1 (Draft)

  • Subject: Re: A formal specification for CIF version 1.1 (Draft)
  • From: Brian McMahon <bm@xxxxxxxx>
  • Date: Fri, 13 Sep 2002 14:16:10 +0100 (BST)
Dear John

I shall shortly issue a slightly modified draft of the CIF 1.1
specification, arising from discussions at Geneva and elsewhere. Some
of the annotations arise from your comments, and especially your
detailed critique of July 11, which was left hanging in the air
a bit. I would like to address some of your concerns in a little more
detail here.



> Syntax, section 5: "Save frames may only be used in dictionary files."
> The language for CIF data files is therefore a restriction of the
> language for CIF dictionaries.  I find it a bit disingenuous to claim
> that they are the same.  ... A non-validating CIF parser does not
> have to recognize the full language specified by the draft specification.

The point of bringing dictionary files and data files together is to make it
possible to write a single parser library that can read both types of
file. If you are writing your own parser for an application that is
designed only to handle data files, I guess it's OK to treat save frames as
errors. I've accepted that by introducing an "implementation note":

  5a. <b>Implementation note:</b> At a purely syntactic level there is no way
  to distinguish between dictionary and data files. (It is also to be noted
  that not all dictionary files contain save frames.) A fully validating
  parser must therefore be able to detect the start and termination of save
  frames, the uniqueness of the framecode within a data block, and the
  uniqueness of data names within a frame code. It is however legitimate for
  an application-based parser designed to handle only the contents of data
  files to consider the presence of a save frame as an error.
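
The two behaviours described in the implementation note might be sketched
as follows. This is only an illustration under stated assumptions: the
function name scan_save_frames and the exception SaveFrameError are
hypothetical, the input is assumed to be already split into
whitespace-separated tokens (no quoting, comments or text fields), and
reserved words are assumed to be matched case-insensitively. Uniqueness of
data names within a frame is not checked here.

```python
class SaveFrameError(ValueError):
    """Raised for a disallowed, malformed or duplicated save frame."""


def scan_save_frames(tokens, allow_save_frames=True):
    """Return {data block name: [frame codes]} for a pre-split token list.

    With allow_save_frames=False the function behaves like a data-only
    parser and treats any save frame as an error.
    """
    frames = {}        # data block name -> frame codes, in order of appearance
    block = None       # name of the data block currently being read
    open_frame = None  # frame code of the save frame currently open, if any
    for tok in tokens:
        low = tok.lower()
        if low.startswith("data_"):
            if open_frame is not None:
                raise SaveFrameError(f"save frame {open_frame!r} not terminated")
            block = low[len("data_"):]
            frames.setdefault(block, [])
        elif low == "save_":
            # A bare save_ terminates the open frame.
            if open_frame is None:
                raise SaveFrameError("save_ terminator without an open frame")
            open_frame = None
        elif low.startswith("save_"):
            if not allow_save_frames:
                raise SaveFrameError("save frame found by a data-only parser")
            if block is None:
                raise SaveFrameError("save frame outside a data block")
            if open_frame is not None:
                raise SaveFrameError("save frame opened inside another frame")
            code = low[len("save_"):]
            if code in frames[block]:
                raise SaveFrameError(f"duplicate frame code {code!r}")
            frames[block].append(code)
            open_frame = code
    if open_frame is not None:
        raise SaveFrameError(f"save frame {open_frame!r} not terminated")
    return frames
```

A dictionary-aware caller would use the default; an application that only
reads data files would pass allow_save_frames=False.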


  

>>> I do not see any point whatsoever to adding the stop_ keyword to
>>> the accepted CIF syntax.  It is not necessary as long as CIF does
>>> not permit nested loops, so it only makes parsers more difficult
>>> to write.  The question should be "why add it?" rather than "why not?"
>>
>> stop_ has always been a reserved word, so now, instead of recognizing
>> stop_ and declaring an error in all cases, a parser is allowed to
>> recognize stop_ and discard it in certain cases.
>  ...
> 
> from a parsing perspective, it is much simpler to recognize
> "stop_" and then unequivocally issue an error than it is to evaluate
> the parser state every time a "stop_" is encountered to check whether
> it is legal or not, and if so to modify the state appropriately.

The use of stop_ as an active loop terminator (as in STAR) was introduced
following requests to help error recovery. The idea was that it could be
identified as a clear intention on the part of the CIF creator to terminate
a loop (or loop header), and so make it easier to recover some data from
a mangled CIF. In practice I think it doesn't make error recovery much
easier except perhaps in some rather specific cases. I don't myself have
an opinion on this, and would be interested in hearing from anyone else
on the list who believes there to be a strong case for retaining the use
of stop_ as a list terminator. [I do believe though that it must be retained
explicitly as a reserved word.]




>>> And what about data values beginning with a substring matching a
>>> reserved word?  (Paragraph 10)  In CIF 1.0 it was reasonably clear
>>> that something like this applied to data_ because such a construct
>>> had its own semantics defined, but it was not clear that this was
>>> a general restriction applied to all the reserved words.
>> 
>>   CIF has always been presented as an application of STAR, so the
>> reserved words have, in fact always been reserved, and it has
>> always been the case that having a data value beginning data_ or
>> save_ was incorrect.  By applying exactly the same logic to the
>> full set of reserved words, I believe we should make the design of
>> most parsers cleaner and simpler.

I have split the table in Para. 12 so that now loop_, stop_ and global_
are discussed separately:

   In addition the following <b>reserved words</b> may not occur as unquoted
   data values.

   Reserved word    Role
   loop_            identifies looped list of data
   stop_            terminates looped list of data
   global_          reserved as STAR global block header
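
One possible reading of the rules above, as a check applied to an unquoted
data value, is sketched below. The helper name is_legal_unquoted_value is
hypothetical. The sketch assumes that loop_, stop_ and global_ are
excluded only as whole words, that data_ and save_ are excluded as
prefixes, and that matching is case-insensitive; other restrictions on
unquoted values (leading quotes, underscores, brackets and so on) are
deliberately not checked.

```python
# Whole words that may not appear as unquoted data values (from the table).
RESERVED_WORDS = {"loop_", "stop_", "global_"}

# Prefixes that may not begin an unquoted data value (assumed from the
# discussion of data_ and save_ earlier in this message).
RESERVED_PREFIXES = ("data_", "save_")


def is_legal_unquoted_value(text):
    """Return False if text collides with a reserved word or prefix."""
    low = text.lower()  # assumption: reserved words are case-insensitive
    if low in RESERVED_WORDS:
        return False
    if low.startswith(RESERVED_PREFIXES):
        return False
    return True
```

A writer could call this before emitting a value and fall back to a quoted
form whenever it returns False.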




>>> What exactly is the point of introducing the square bracket 
>>> delimiters for text values?

The square brackets are being introduced into STAR as a framework for
extended processing of data values through dictionary-specified methods,
and their greatest use will be in handling tuples of numeric values
within vector or matrix calculations. We decided in Geneva that the
time was not yet right to introduce these procedures into CIF (they're
still being worked on by Syd and his group). However, square brackets will
be handled as reserved characters at the start of data values to permit
their use in this way at a later date. The draft has been modified
accordingly. I shall encourage Syd and Nick to keep this list posted of
developments (perhaps they may even be able to recruit collaborators).


 

>>> In paragraph 17: "The end-of-line associated with the closing semicolon
>>> does not form part of the data value."
>>> ...  I had thought that that last eol was part of the value.
>> 
>>   If you exclude the terminal <eol> from the text field, you then allow
>> the semi-colon to quote arbitrary text fields, including those that
>> do not have a terminal [<eol>].  If you do not exclude the terminal
>> <eol> from the text fields, then the only text that can be quoted with
>> semicolons is text that ends with a [<eol>].
> 
> This must be one of those cases that has always seen conflicting
> interpretations, but I think this may be the wrong level at which to
> discuss it.  If CIF is to retain compatibility with STAR, then it is
> the interpretation required by STAR that we must use.  The 1994 STAR
> specification paper describes semicolon-quoted text as "a sequence of
> lines," with "lines" emphasized.  To me that indicates that the
> trailing <eol> is part of the quoted material, not part of the delimiter.

I think this is important and merits further discussion on the list. The
example in Para. 20 shows that the current intention is to treat 'foo' and 
;foo
;
as equivalent. I guess that properly 
;
foo
;
is different (its string value is '\nfoo'), and from Para. 17
; foo
;
different again (' \nfoo').
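
The two readings of the closing end-of-line can be compared with a small
sketch. It assumes the raw field has already been isolated, runs from the
opening semicolon to the closing semicolon, and uses '\n' as the line
terminator; the function name text_field_value and the keep_final_eol
flag are hypothetical.

```python
def text_field_value(raw, keep_final_eol=False):
    """Extract the value of a semicolon-delimited text field.

    raw starts with the opening ';' and ends with the closing ';' that
    stands at the start of a line.  With keep_final_eol=False the <eol>
    before the closing ';' is treated as part of the delimiter (the draft's
    reading); with True it is kept in the value (the "sequence of lines"
    reading).
    """
    if not raw.startswith(";"):
        raise ValueError("text field must start with ';'")
    end = raw.rfind("\n;")
    if end < 0:
        raise ValueError("no closing ';' at the start of a line")
    return raw[1:end + 1] if keep_final_eol else raw[1:end]
```

Under the draft's reading, ";foo\n;" gives 'foo' and ";\nfoo\n;" gives
'\nfoo'; with keep_final_eol=True every value ends in a newline, so a value
without a terminal <eol> cannot be expressed.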




>>> In paragraphs 22 and 41: Exclusion of ASCII characters 11 and 12
>>> decimal is a departure from and incompatibility with CIF 1.0. 

I've added a sentence to draw attention to the specific exclusion of 
ASCII 11 and 12.




> But CIF has never before restricted data names to only those that could
> be defined in a DDL1 or DDL2 dictionary.  Moreover, with the increased
> line lengths in CIF 1.1, the dictionary storage problem should be
> alleviated anyway.

That's technically correct. However, a 75-character limitation ensures that
any data names that are used can be defined in, and according to the standard 
conventions of, a CIF-1.0 level dictionary.



 
>>> Paragraph 42 ... line termination
> 
> As I reread section 42 of the syntax document, I see what appear to be
> conflicting statements about line-termination handling.  In fact, the
> first two sentences seem to be inconsistent -- the first says that
> <eol> is the system-dependent end-of-line, and the second says that
> CIF follows the same convention as XML (complete with a quote from the
> XML recommendation, which is more or less along the lines of my stated
> preference above).  A few lines later, the spec proposes a parser
> that recognizes exactly the line termination semantics I prefer, but
> this is at variance with the earlier definition of <eol>.  Moreover,
> the quotation from the XML recommendation describes how the XML
> processor _translates_ end-of-line sequences to a standard (normalized)
> representation.  Is that in fact what CIF 1.1 parsers will be expected
> to do?  That would be fine by me, but other statements in this section
> of the draft seem to indicate that that is not the intent.

I have gone through the relevant paragraphs again (25 and 42) and do not
want to change anything of substance. The line terminator *is* OS-dependent;
technically a CIF created on a Unix machine and transferred by binary ftp to
a Windows machine is invalid. But it's an invalidity that is amenable to
diagnosis and handling, and that's what we should do: write permissive
applications that take account of this without troubling the end-user.
As written, the spec gives guidance for the
99% of software authors who wish to handle the Unix/Windows/MacOS cases
cross-platform, while still explaining the principle to be applied on a VMS,
IBM mainframe or other unusual crystallographic computer platform.

So I preface the second sentence of para. 42 with "Implementation note:",
reword it as "CIF implementations may follow common HTML and XML
practice..." and drop the font size so that it is seen as a commentary on
an implementation strategy to accommodate the unavoidable OS dependency.
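
A permissive reader along these lines might normalize line terminators
before parsing. The sketch below is not taken from the draft; it assumes
only the three common conventions (LF, CR LF and bare CR), and the name
normalize_eol is hypothetical.

```python
def normalize_eol(data: bytes) -> bytes:
    """Map CR LF and bare CR line terminators to a single LF.

    CR LF must be replaced first, otherwise each CR LF pair would turn
    into two line breaks.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Working on bytes keeps the step independent of whatever newline
translation the platform's text mode would otherwise apply. Platforms with
record-oriented files (VMS, IBM mainframes) would need their own handling.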




>>> According to paragraph 60, a file containing only whitespace and
>>> comments but no data block is not a valid 1.1 CIF.
>>> ...
>>> Paragraph 61: this is another departure from CIF 1.0, which did allow
>>> data blocks without data items.  Another of the ciftest trip files
>>> tests this case.  (vcif evidently produces a warning, which seems
>>> reasonable, but this is not an error.)
> 
> These two cases are similar.  A CIF with no data content would often be
> an error case for an application, but I prefer to let the application
> decide that, rather than enforcing it in the CIF spec.

I have been checking my notes (going back to email discussions in 1995 with
Nick on the esoteric topic of how to evaluate data values within a null data 
block that were declared in a prior global_ block). I think the explicit
invalidations of para. 60 and 61 were introduced late in the day and are
wrong. (Of course, John is right that such null cases might raise
application errors.) The amended productions are:

<CIF>                   ::=
      <Comments>? <WhiteSpace>? 
      { <DataBlock> { <WhiteSpace> <DataBlock> }* { <WhiteSpace> }? }?

and

<DataBlock>             ::=
      <DataBlockHeading> {<WhiteSpace> { <DataItems> | <SaveFrame> } }*

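The effect of the amended productions can be illustrated at the token
level: a CIF with no data blocks, and a data block with no contents, are
both accepted. This is a simplified sketch, not a grammar implementation;
it assumes comments and whitespace have already been stripped and the rest
split into tokens, and the name split_blocks is hypothetical.

```python
def split_blocks(tokens):
    """Group tokens into (block name, [tokens in block]) pairs.

    An empty token list yields no blocks, and a data block heading with
    nothing after it yields an empty block; neither is an error here.
    """
    blocks = []
    for tok in tokens:
        if tok.lower().startswith("data_"):
            blocks.append((tok[len("data_"):], []))
        elif not blocks:
            raise ValueError("content found before the first data block heading")
        else:
            blocks[-1][1].append(tok)
    return blocks
```

An application that regards an empty file or an empty block as a problem
can test the returned list itself, rather than relying on the syntax to
reject such input.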
 


>> There is an open debate as to whether the production for <Tag>
>> should be:
>> 
>>    <Tag> ::= '_'{<NonBlankChar>}+
>> 
>> or
>>    <Tag> ::= '_'{<NameChar>}*
> 
> Well, the former is easier for an electronic parser because it has
> less context dependency.  Yes, it allows nasty, ugly data names, but
> as far as I am concerned anyone who uses such deserves what he gets.
> That's the one I would prefer.

Since the current draft had the first production, and to help the debate
along, I have deleted the production for <NameChar> and deleted para. 51 :-)
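
The first production translates almost directly into a regular expression.
The sketch below assumes that <NonBlankChar> means the printable ASCII
characters 33 to 126, which is not stated in this message; the name is_tag
is hypothetical, and the optional max_len argument (counted here including
the leading underscore, also an assumption) is only a convenience for
applying the 75-character limit discussed earlier.

```python
import re

# '_' followed by one or more non-blank characters.
# Assumption: <NonBlankChar> is printable ASCII, codes 33-126.
TAG_RE = re.compile(r"_[\x21-\x7e]+")


def is_tag(token, max_len=None):
    """Return True if token matches <Tag> ::= '_'{<NonBlankChar>}+ ."""
    if TAG_RE.fullmatch(token) is None:
        return False
    return max_len is None or len(token) <= max_len
```

As the quoted comment says, this accepts ugly names such as _a[1]#x; the
only context needed is that the token starts with an underscore.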


Regards
Brian
