Discussion List Archives

RE: A formal specification for CIF version 1.1 (Draft)

  • Subject: RE: A formal specification for CIF version 1.1 (Draft)
  • From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
  • Date: Thu, 11 Jul 2002 01:29:10 +0100 (BST)
More comments below -- HJB

 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 020
        Idle Hour Blvd, Oakdale, NY, 11769


On Thu, 11 Jul 2002, Bollinger, John Clayton wrote:

> As a follow-on to my remarks about my perceived differences between
> versions 1.0 and 1.1 of CIF, here is an additional specific
> response to Brian's message and some further commentary on the
> specification.
> Brian McMahon [mailto:bm@iucr.org] wrote:
> [...]
> > CIF is intended as an archival and portable format. For this reason,
> > the description of certain syntactic features has been constructed
> > with care to try to avoid machine or operating-system dependencies.
> > This is particularly the case with the discussion regarding
> > end-of-line delimiters. Here an attempt has been made to reconcile
> > the practical handling of files which are transported or shared
> > across common operating systems such as Unix, MacOS and MSWindows
> > with the more general formulation that is required to support files
> > on mainframe or elderly record-oriented OS architectures.
> Regardless of whether the end of line handling is different in 1.1
> than it was in 1.0, I think that those comments are a
> mischaracterization of the details of the draft 1.1 spec.  As far as
> I can tell, what the spec now says is that CIF line termination
> is in fact machine dependent, and that an external utility must
> -- must! -- be used to convert a CIF from any foreign machine
> line termination convention to the local machine convention (if they
> differ) before a conforming CIF parser can successfully parse the
> file.  I think this is exactly the wrong direction.

The issue you raise is a fundamental one for many data formats.
CIF has always been specified as an editable text format.  This
is not unusual for archival scientific data formats.  You seem
to be saying that you would prefer a binary format.  In that
case I would suggest the binary variant of CIF:  CBF/imgCIF.

> I think it unfortunate that the specification lumps together CIF
> dictionaries and CIF data files as CIF, considering that they are
> in fact slightly different STAR dialects.  It furthermore seems
> like the spec has been tailored to allow this combination (by addition
> of save frames, at least), which I find a questionable strategy --
> especially given that it did not really accomplish the apparent
> goal anyway (that apparent goal being to produce a single STAR dialect
> with which both the dictionaries and the data files could be expressed).

What in particular is still missing to allow a common format
for CIFs and dictionaries?

> I do not see any point whatsoever to adding the stop_ keyword to
> the accepted CIF syntax.  It is not necessary as long as CIF does
> not permit nested loops, so it only makes parsers more difficult
> to write.  The question should be "why add it?" rather than "why not?"
stop_ has always been a reserved word, so now, instead of recognizing
stop_ and declaring an error in all cases, a parser is allowed to
recognize stop_ and discard it in certain cases.

> What exactly is the point of introducing the square bracket delimiters
> for text values?
It is more convenient to use than semicolon delimiters, and allows
a handy nesting.

> This is a bit picky, but I don't see the point of introducing distinct
> productions for <Comments> and <TokenizedComments> in the formal
> grammar.  Why not just forget about what is currently called <Comments>,
> use <TokenizedComments> in its place, and adjust the description to
> match?  That also relieves the spec of having to note the exception
> that a '#' embedded in non-whitespace does not initiate a comment.
> More fundamentally, though, why express a production for comments
> (plural) without expressing one for a single comment?  That's a bit
> quirky, I think, although it appears to work.

The problem is that most people think of a CIF comment as beginning
at the "#", so we use the <comment> token that way.  It also allows
the production for <CIF> to be more compact, since it is legal to
have a comment before the first whitespace, an item for which
we need the concept of a tokenized comment, i.e. a comment that
has been made easy to recognize as a token by virtue of the SP, HT
or <eol> before it.
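
As a rough illustration of the tokenized-comment idea (a sketch only, not
text from the draft): within a single line, a "#" starts a comment when it
is the first character or is preceded by SP or HT. The function name
comment_start is hypothetical, and the sketch deliberately ignores quoted
strings and semicolon-delimited text fields, where a "#" is ordinary data.

```python
def comment_start(line):
    """Return the index of the '#' that begins a comment, or -1 if none.

    Simplification (assumption): quoted strings and text fields are not
    handled, so a '#' inside a quoted value would be misreported here.
    """
    for i, ch in enumerate(line):
        # A '#' embedded in non-whitespace does not initiate a comment.
        if ch == "#" and (i == 0 or line[i - 1] in " \t"):
            return i
    return -1


print(comment_start("_tag value # note"))
print(comment_start("_tag val#ue"))
print(comment_start("# header comment"))
```

The second call shows the exception mentioned in the quoted text: the "#"
in "val#ue" is part of the value, not the start of a comment.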

> The <NameChar> non-terminal is not used anywhere.  And if it were used
> as the accompanying text describes (section 51), it would be another
> break from CIF 1.0 because quotation characters are excluded from data
> block names and data tags.  Those are not excluded in STAR (at least
> not in the 1994 paper) and as far as I know that exclusion was never
> before expressed as a CIF restriction of STAR.
There is an open debate as to whether the production for <Tag>
should be:

   <Tag> ::= '_'{<NonBlankChar>}+

or

   <Tag> ::= '_'{<NameChar>}*

We need the production to facilitate the discussion.

> Also in the formal grammar, the productions for
> <SingleQuotedString><WhiteSpace> and <DoubleQuotedString><WhiteSpace>
> are ambiguous.  What is intended, I think, is that the shortest string
> that matches the production be used.  Here is an example that
> could be misinterpreted:
> 	_d1 'a character value' _d2 'another one'
> Is that one data item or two?  Either interpretation seems to match
> the production.  I think Nick had some more precise productions in the
> BNFs he floated.

Please read paragraph 58, where it says:  "The <WhiteSpace> on the
lefthand side must evaluate to the same string instance on the righthand
side and the parse must terminate on the first valid match reading left
to right."  If one uses a parser which accepts the first full-depth
match in a left to right scan, the productions are not ambiguous,
and are sufficient to define the quoted strings without having to
define digraphs.
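
To make the first-match rule concrete, here is a minimal sketch (an
illustration under stated assumptions, not the parser the draft mandates).
It assumes the closing delimiter is the first matching quote character that
is followed by white space or by the end of the input, and that a quoted
string does not span lines. The name scan_quoted is hypothetical.

```python
def scan_quoted(text, start):
    """Return (value, next_index) for the quoted string at text[start].

    Scans left to right and stops at the first quote character that is
    followed by white space or end of input (assumed reading of the rule).
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("not positioned at a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            break  # assumption: quoted strings stay on one line
        if ch == quote and (i + 1 == len(text) or text[i + 1] in " \t\r\n"):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


line = "_d1 'a character value' _d2 'another one'"
first, pos = scan_quoted(line, line.index("'"))
second, _ = scan_quoted(line, line.index("'", pos))
print(first)
print(second)
```

Under this reading the example line yields two data items, because the scan
for the first value ends at the quote that precedes the white space before
_d2. An embedded quote that is not followed by white space, as in 'don't',
stays part of the value.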

> In section 59, no production is given for <UnsignedInteger>.  A
> production can be found in the Appendix A summary, but it should be
> duplicated here as the other related productions are.  Similarly for
> the <Exponent> non-terminal.
> In section 59, the production for <Float> matches "+1.0" but not "1.0".
Good points.  The production for <Float> should be changed to:

<Float> ::= { <Integer> |
            { {'+'|'-'} ? { {<Digit>} * '.' <UnsignedInteger> } |
            { {<Digit>} + '.' } } {<Exponent>} ? }
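
One way to sanity-check a production like this is to transcribe it into a
regular expression and try a few values. The sketch below is one possible
reading, not an authoritative translation. It assumes <UnsignedInteger> is
one or more digits, <Integer> is an optionally signed <UnsignedInteger>,
and <Exponent> is "e" or "E" followed by an optionally signed
<UnsignedInteger>; those definitions are not quoted in this message, so
treat them as assumptions. It also assumes the optional exponent applies to
either alternative. The name is_float is hypothetical.

```python
import re

UNSIGNED = r"[0-9]+"                 # assumed <UnsignedInteger>
INTEGER = r"[+-]?" + UNSIGNED        # assumed <Integer>
EXPONENT = r"[eE][+-]?" + UNSIGNED   # assumed <Exponent>

# Optional sign, then either digits* '.' digits+  or  digits+ '.'
FRACTION = r"[+-]?(?:[0-9]*\." + UNSIGNED + r"|[0-9]+\.)"

FLOAT = re.compile(
    r"(?:" + INTEGER + r"|" + FRACTION + r")(?:" + EXPONENT + r")?"
)


def is_float(text):
    """True if the whole string matches this reading of <Float>."""
    return FLOAT.fullmatch(text) is not None


for sample in ["1.0", "+1.0", "1.", ".5", "-1.5e-3", ".", "abc"]:
    print(sample, is_float(sample))
```

With the sign made optional, both "1.0" and "+1.0" are accepted, which is
the case raised above. A lone "." is rejected because each fractional form
requires at least one digit.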

> Moreover, is there any value to including the <Numeric> non-terminal
> and its children in the grammar at all?  Anything that matches
> <Numeric> will also match <CharString>, so <Numeric> is not necessary
> to describe the language.

CIF differs from STAR in paying attention to numeric items.  The
dictionaries control the semantics and help to resolve the ambiguities.

> John Bollinger
> jobollin@indiana.edu
