RE: A formal specification for CIF version 1.1 (Draft)

  • Subject: RE: A formal specification for CIF version 1.1 (Draft)
  • From: "Bollinger, John Clayton" <jobollin@xxxxxxxxxxx>
  • Date: Thu, 11 Jul 2002 00:08:47 +0100 (BST)

As a follow-on to my remarks about the differences I perceive between
versions 1.0 and 1.1 of CIF, here is an additional specific
response to Brian's message, along with some further commentary on the
specification.

Brian McMahon [mailto:bm@iucr.org] wrote:

[...]

> CIF is intended as an archival and portable format. For this reason, the
> description of certain syntactic features has been constructed with care to
> try to avoid machine or operating-system dependencies. This is particularly
> the case with the discussion regarding end-of-line delimiters. Here an
> attempt has been made to reconcile the practical handling of files which are
> transported or shared across common operating systems such as Unix, MacOS
> and MSWindows with the more general formulation that is required to support
> files on mainframe or elderly record-oriented OS architectures.

Regardless of whether the end-of-line handling is different in 1.1
than it was in 1.0, I think that those comments are a
mischaracterization of the details of the draft 1.1 spec.  As far as
I can tell, what the spec now says is that CIF line termination
is in fact machine dependent, and that an external utility must
-- must! -- be used to convert a CIF from any foreign machine
line termination convention to the local machine convention (if they
differ) before a conforming CIF parser can successfully parse the
file.  I think this is exactly the wrong direction.
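
To illustrate the alternative, here is a minimal Python sketch of how a
parser might accept the three common line termination conventions directly,
with no external conversion step.  The function name split_cif_lines is
hypothetical, and the sketch assumes the file is available as ASCII bytes;
it is not taken from the draft spec or from any existing CIF software.

```python
import re

# Try the two-character CR LF sequence first so that it is treated as a
# single terminator rather than as two (CR followed by LF).
_EOL = re.compile(r"\r\n|\r|\n")


def split_cif_lines(raw):
    """Split raw CIF bytes into lines, whatever the originating platform.

    Assumption: the content is ASCII; undecodable bytes are replaced
    rather than rejected, which a real parser might not want.
    """
    text = raw.decode("ascii", errors="replace")
    lines = _EOL.split(text)
    # A terminator at the very end of the file leaves one empty trailing
    # element; drop it so that it is not reported as an extra line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


if __name__ == "__main__":
    mixed = b"data_example\r\n_cell_length_a 5.0\r_cell_length_b 6.0\n"
    for line in split_cif_lines(mixed):
        print(repr(line))
```

The same tokenizer then sees identical lines whether the file came from
Unix, MacOS or MSWindows.  Record-oriented systems would still need their
own reader, but that reader could present its records in the same form.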

I think it unfortunate that the specification lumps together CIF
dictionaries and CIF data files as CIF, considering that they are
in fact slightly different STAR dialects.  It furthermore seems
that the spec has been tailored to allow this combination (by the addition
of save frames, at least), which I find a questionable strategy --
especially given that it did not really accomplish the apparent
goal anyway (that apparent goal being to produce a single STAR dialect
with which both the dictionaries and the data files could be expressed).

I do not see any point whatsoever to adding the stop_ keyword to
the accepted CIF syntax.  It is not necessary as long as CIF does
not permit nested loops, so it only makes parsers more difficult
to write.  The question should be "why add it?" rather than "why not?"
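
As a rough illustration of why stop_ is redundant without nesting, here is
a Python sketch of reading one flat loop from a list of tokens.  It assumes
the input has already been split into whitespace-delimited tokens and it
ignores quoted strings and text fields; the names is_boundary and
parse_flat_loop are hypothetical.

```python
def is_boundary(token):
    """True if the token cannot be a loop value, so the loop has ended."""
    lowered = token.lower()
    return (
        token.startswith("_")                      # a data name
        or lowered in ("loop_", "global_")         # reserved words
        or lowered.startswith(("data_", "save_"))  # block or frame header
    )


def parse_flat_loop(tokens, start):
    """Parse one non-nested loop beginning at tokens[start] == 'loop_'.

    Returns (tags, rows, next_index), where next_index is the position of
    the first token after the loop.
    """
    if tokens[start].lower() != "loop_":
        raise ValueError("expected loop_")
    i = start + 1

    # The loop header is the run of data names that follows loop_.
    tags = []
    while i < len(tokens) and tokens[i].startswith("_"):
        tags.append(tokens[i])
        i += 1

    # The loop body is every value up to the next data name or keyword.
    values = []
    while i < len(tokens) and not is_boundary(tokens[i]):
        values.append(tokens[i])
        i += 1

    if not tags or len(values) % len(tags) != 0:
        raise ValueError("loop values do not fill whole rows")
    width = len(tags)
    rows = [values[j:j + width] for j in range(0, len(values), width)]
    return tags, rows, i


if __name__ == "__main__":
    tokens = "loop_ _x _y 1 2 3 4 _cell 10".split()
    print(parse_flat_loop(tokens, 0))
```

The end of the loop is found from the next data name or reserved word
alone, so nothing in a flat loop needs an explicit terminator.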

What exactly is the point of introducing the square bracket delimiters
for text values?

This is a bit picky, but I don't see the point of introducing distinct
productions for <Comments> and <TokenizedComments> in the formal
grammar.  Why not just forget about what is currently called <Comments>,
use <TokenizedComments> in its place, and adjust the description to
match?  That also relieves the spec of having to note the exception
that a '#' embedded in non-whitespace does not initiate a comment.
More fundamentally, though, why express a production for comments
(plural) without expressing one for a single comment?  That's a bit
quirky, I think, although it appears to work.

The <NameChar> non-terminal is not used anywhere.  And if it were used
as the accompanying text describes (section 51), it would be another
break from CIF 1.0, because quotation characters would then be excluded
from data block names and data tags.  Those are not excluded in STAR (at least
not in the 1994 paper) and as far as I know that exclusion was never
before expressed as a CIF restriction of STAR.

Also in the formal grammar, the productions for
<SingleQuotedString><WhiteSpace> and <DoubleQuotedString><WhiteSpace>
are ambiguous.  What is intended, I think, is that the shortest string
that matches the production be used.  Here is an example that
could be misinterpreted:

	_d1 'a character value' _d2 'another one'

Is that one data item or two?  Either interpretation seems to match
the production.  I think Nick had some more precise productions in the
BNFs he floated.
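
The two readings can be reproduced with regular expressions.  The Python
sketch below is only an approximation of the production, on the assumption
that it amounts to "quote, any printable characters, quote, whitespace";
the patterns are mine and are not taken from the draft.

```python
import re

line = "_d1 'a character value' _d2 'another one'\n"

# Longest match: the body may itself contain quote-plus-space sequences.
greedy = re.compile(r"'([^\n]*)'(?=\s)")

# Shortest match: stop at the first quote that is followed by whitespace.
shortest = re.compile(r"'([^\n]*?)'(?=\s)")

print(greedy.findall(line))
print(shortest.findall(line))
```

The first pattern finds a single value that runs from the first quote to
the last, so the line holds one data item; the second finds two values,
so it holds two.  The grammar as written does not say which is meant.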

In section 59, no production is given for <UnsignedInteger>.  A
production can be found in the Appendix A summary, but it should be
duplicated here as the other related productions are.  Similarly for
the <Exponent> non-terminal.

In section 59, the production for <Float> matches "+1.0" but not "1.0".
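
For comparison, here is a Python sketch of the difference between a
mandatory and an optional sign.  Both patterns are simplified stand-ins
that I made up for illustration; neither is a transcription of the draft's
<Float> production.

```python
import re

# Stand-in for the behaviour described above: a leading sign is required.
SIGN_REQUIRED = re.compile(r"[+-]\d+\.\d*(?:[eE][+-]?\d+)?")

# The presumably intended behaviour: the leading sign is optional.
SIGN_OPTIONAL = re.compile(r"[+-]?\d+\.\d*(?:[eE][+-]?\d+)?")


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("+1.0", "1.0", "-2.5e3"):
        print(text, matches(SIGN_REQUIRED, text), matches(SIGN_OPTIONAL, text))
```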

Moreover, is there any value to including the <Numeric> non-terminal
and its children in the grammar at all?  Anything that matches
<Numeric> will also match <CharString>, so <Numeric> is not necessary
to describe the language.

John Bollinger
jobollin@indiana.edu
