Discussion List Archives

RE: A formal specification for CIF version 1.1 (Draft)

  • Subject: RE: A formal specification for CIF version 1.1 (Draft)
  • From: "Bollinger, John Clayton" <jobollin@xxxxxxxxxxx>
  • Date: Thu, 11 Jul 2002 19:22:26 +0100 (BST)

This is a combined response to two messages, both from Herb.  I have
cut and pasted together parts of the responses, and I hope I have not
thereby taken anything out of context.

Herbert J. Bernstein [mailto:yaya@bernstein-plus-sons.com] wrote:
> On Wed, 10 Jul 2002, Bollinger, John Clayton wrote:
> > I think it unfortunate that the specification lumps together CIF
> > dictionaries and CIF data files as CIF, considering that they are
> > in fact slightly different STAR dialects.  It furthermore seems
> > like the spec has been tailored to allow this combination (by
> > addition of save frames, at least), which I find a questionable
> > strategy -- especially given that it did not really accomplish the
> > apparent goal anyway (that apparent goal being to produce a single
> > STAR dialect with which both the dictionaries and the data files
> > could be expressed).
> 
> What in particular is still missing to allow a common format
> for CIFs and dictionaries?

Syntax, section 5: "Save frames may only be used in dictionary files."
The language for CIF data files is therefore a restriction of the
language for CIF dictionaries.  I find it a bit disingenuous to claim
that they are the same.  Yes, perhaps this is a picky point, but picky
points are what specifications are all about.  A non-validating CIF
parser does not have to recognize the full language specified by the
draft specification.
 
> > I do not see any point whatsoever to adding the stop_ keyword to
> > the accepted CIF syntax.  It is not necessary as long as CIF does
> > not permit nested loops, so it only makes parsers more difficult
> > to write.  The question should be "why add it?" rather than
> > "why not?"
> >
> stop_ has always been a reserved word, so now, instead of recognizing
> stop_ and declaring an error in all cases, a parser is allowed to
> recognize stop_ and discard it in certain cases.

[and]

>   I believe that this use of stop_ and save_  does not invalidate
> any previously valid CIFs, and is a realistic approach to dealing
> with these reserved words.  Any validating CIF parser needs to have
> a module to read dictionaries, where it will encounter save frames.
> Any properly written CIF parser has to recognize stop_ to distinguish
> it from a data value.  By making these changes in the specification,
> we are specifying a common practice (save frames), and saying
> that a use of a reserved word (stop_) in a context in which it
> clearly is not an error, should not be treated as an error.

I agree that allowing stop_ and save_ does not break existing valid
CIFs.  As for their usefulness and propriety, however, I am
not persuaded.  Yes, a validating CIF parser must be able to read
dictionaries, which use save_ and which therefore are written in a
superset of the language for data CIFs.  In a sense, adding save_ is
then not a problem for non-validating parsers, because they may
continue to reject the keyword as an error.  I would prefer, though,
to just acknowledge that the two languages remain different.

Stop_ is another story.

First, from a language design perspective, stop_ is absolutely useless
in CIF.  A valid usage in a version 1.1 compliant CIF would not
express one iota of information, because removing the stop_ keyword
would in no way whatsoever change the semantic interpretation.

Second, from a parsing perspective, it is much simpler to recognize
"stop_" and then unequivocally issue an error than it is to evaluate
the parser state every time a "stop_" is encountered to check whether
it is legal or not, and if so to modify the state appropriately.
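
To make the parsing argument concrete, here is a minimal Python sketch of
the two approaches.  It assumes the input has already been split into bare,
whitespace-delimited tokens (no quoted strings or text fields), and it
assumes that the draft tolerates stop_ only immediately after the values of
a loop; the exact permitted cases are not spelled out in this message.  The
function names are hypothetical.

```python
def check_stop_strict(tokens):
    """Rule with no context: any stop_ token is an error."""
    for i, tok in enumerate(tokens):
        if tok.lower() == "stop_":
            raise ValueError(f"token {i}: stop_ is not permitted")


def check_stop_contextual(tokens):
    """Assumed draft-style rule: stop_ is legal only after loop values."""
    state = "outside"              # "outside", "header" (loop tags), "values"
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low == "loop_":
            state = "header"
        elif low == "stop_":
            if state != "values":
                raise ValueError(f"token {i}: stop_ not allowed here")
            state = "outside"      # loop closed; the keyword is discarded
        elif tok.startswith("_"):
            if state == "values":
                state = "outside"  # a new tag ends the loop implicitly
        elif low.startswith("data_"):
            state = "outside"      # a new data block ends any open loop
        elif state == "header":
            state = "values"       # first value after the loop tags
```

The strict check needs no knowledge of where it is in the file, whereas the
contextual check has to carry loop state through every token just to decide
whether a keyword that conveys no information may be thrown away.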

As for whether or not a particular usage of stop_ is an error, I would
think that that was a matter dictated by the specifications we are
discussing, not by how logical or how compliant with STAR the usage may
be.  I don't see that it is particularly relevant that using stop_ in
the contexts the draft spec permits is consistent with STAR or that it
is human interpretable.

> > And what about data values beginning with a substring matching a
> > reserved word?  (Paragraph 10)  In CIF 1.0 it was reasonably clear
> > that something like this applied to data_ because such a construct
> > had its own semantics defined, but it was not clear that this was
> > a general restriction applied to all the reserved words.  Did I
> > just miss it somewhere, or is this one of those points of 1.0 that
> > is being clarified via the 1.1 spec?  If the latter, then let me
> > throw in that I don't like it.  I think that's because it is a
> > departure from the normal sense of the term "reserved word."  In any
> > case, it makes a parser that incremental bit trickier to write.
> 
>   CIF has always been presented as an application of STAR, so the
> reserved words have, in fact, always been reserved, and it has
> always been the case that having a data value beginning data_ or
> save_ was incorrect.  By applying exactly the same logic to the
> full set of reserved words, I believe we should make the design of
> most parsers cleaner and simpler.

Well, I don't think I agree that parsers for 1.1 would be cleaner or
simpler by virtue of this language feature, but I'll withdraw my claim
that they would be trickier -- so long as they don't have to support
both the 1.0 spec and the 1.1 spec.

This change has more potential to break existing CIFs than do most, but
my biggest objection remains that this is not the behavior that I would
expect when presented only with the claim that loop_, stop_, save_,
data_, and global_ are reserved words.  If this feature is desired then
the specification text should be changed to say something to the effect
that strings starting with those substrings are reserved, and the
language that calls out those particular instances of such strings
as reserved words should be dropped or suitably marked as describing
special cases.
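
The two readings of "reserved word" can be sketched in a few lines of
Python.  This is only an illustration: the function names are hypothetical,
the tokens are assumed to be unquoted, and case-insensitive matching of the
keywords is an assumption.

```python
RESERVED = ("data_", "loop_", "global_", "save_", "stop_")


def is_reserved_exact(token):
    """Usual sense of 'reserved word': the whole token equals a keyword."""
    return token.lower() in RESERVED


def is_reserved_prefix(token):
    """Prefix reading: any unquoted token that starts with a keyword."""
    return token.lower().startswith(RESERVED)


for tok in ("stop_", "stop_sign", "loop_1", "_stop_"):
    print(tok, is_reserved_exact(tok), is_reserved_prefix(tok))
```

Under the exact reading an unquoted value such as stop_sign is an ordinary
character string; under the prefix reading it is not a legal data value.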

> > What exactly is the point of introducing the square bracket
> > delimiters for text values?
> >
> It is more convenient to use than semicolon delimiters, and allows
> a handy nesting.

Okay, I'll buy that it's a convenience feature -- for CIF writers.
It's an inconvenience for CIF parsers, but it can be handled.  I'd like
it better if it served a useful role that was not otherwise performed;
see below.

> > In paragraph 17: "The end-of-line associated with the closing
> > semicolon does not form part of the data value."  Is this another
> > change/clarification, or another published detail that had
> > previously escaped me?  I had thought that that last eol was part
> > of the value.
> 
>   If you exclude the terminal <eol> from the text field, you then
> allow the semi-colon to quote arbitrary text fields, including those
> that do not have a terminal semicolon.  If you do not exclude the
> terminal <eol> from the text fields, then the only text that can be
> quoted with semicolons is text that ends with a semicolon.

I'm sorry, but I don't follow that.  I'm guessing you mean that
including the <eol> as part of the delimiter enables quotation of
strings that do not include a terminal <eol>.  Indeed, I always thought
that the exclusion of such strings was a quirk of the CIF language.  I
am certain that some of the earlier BNFs floated as candidates for a
CIF BNF included the <eol> in the production for the quoted content,
although I suppose that means little.

This must be one of those cases that has always seen conflicting
interpretations, but I think this may be the wrong level at which to
discuss it.  If CIF is to retain compatibility with STAR, then it is
the interpretation required by STAR that we must use.  The 1994 STAR
specification paper describes semicolon-quoted text as "a sequence of
lines," with "lines" emphasized.  To me that indicates that the
trailing <eol> is part of the quoted material, not part of the
delimiter.

I observe, however, that the bracket-delimited quoting mechanism being
introduced in the draft specification does fill the hole left by the
interpretation of semicolon-delimited quoting that I am advocating.
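
The difference between the two interpretations is small enough to show in a
short Python sketch.  It assumes LF-only line endings and a field that is
handed over complete, from the opening semicolon to the closing one; the
function name and the eol_in_value flag are hypothetical.

```python
def semicolon_field_value(raw, eol_in_value):
    """Extract the value of a semicolon-delimited text field.

    raw starts with the opening ';' (assumed to be at the start of a
    line) and ends with the closing ';' that follows an end-of-line.
    """
    if not (raw.startswith(";") and raw.endswith("\n;")):
        raise ValueError("not a complete semicolon-delimited field")
    body = raw[1:-1]          # drop both semicolons, keep the final eol
    # eol_in_value=True:  the value is a sequence of whole lines.
    # eol_in_value=False: the final eol belongs to the closing delimiter.
    return body if eol_in_value else body[:-1]


raw = ";line one\nline two\n;"
print(repr(semicolon_field_value(raw, eol_in_value=True)))
print(repr(semicolon_field_value(raw, eol_in_value=False)))
```

With eol_in_value=True every value necessarily ends in an end-of-line, so a
string without one cannot be expressed with semicolon quoting alone.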
 
> > In paragraphs 22 and 41: Exclusion of ASCII characters 11 and 12
> > decimal is a departure from and incompatibility with CIF 1.0.  Not
> > that I particularly object -- handling these appropriately is a
> > pain.
> >
> 
>   The second  sentence of the abstract of the Hall, Allen, Brown paper
> says:
> 
>   "The CIF is a general, flexible and easily extensible free-format
>   archive file; it is human and machine readable and can be edited
>   by a simple text editor."
> 
> It is not always possible to edit texts containing ASCII control
> characters other than HT with a "simple text editor".  VT and FF
> serve to useful purpose in a CIF, and, as you note, they can
> be a pain to handle.
 
Did you mean VT and FF serve _no_ useful purpose in CIF?  If you looked
hard I think you might find people who would argue in favor of FF, at
least, but I personally agree with you.  My point was that this is
another difference from CIF 1.0, and another restriction of STAR.  Both
facts should be documented.

> > In paragraph 29: the data name length restriction to 75 characters
> > is another incompatibility with CIF 1.0 (as revised) where the data
> > name length was restricted only indirectly by the line length
> > restriction.  Thus in CIF 1.0 data names could be 80 characters
> > long.
> >
> 
> Actually, to allow a data name to be defined in a dictionary you have
> to allow it to appear with a prepended "data_" or "save_".  In DDL1
> dictionaries, the leading underscore of the data name is dropped,
> which has created a limit of 76 characters.  In DDL2 the underscore
> is retained, which has created a limit of 75 characters.  Thus the 75
> character limit is simply a recognition of the implicit line
> length restrictions that had been in effect in the past, and helps
> to ensure that old systems will be able to work with these new names.

But CIF has never before restricted data names to only those that could
be defined in a DDL1 or DDL2 dictionary.  Moreover, with the increased
line lengths in CIF 1.1, the dictionary storage problem should be
alleviated anyway.
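
For reference, the arithmetic behind the 76- and 75-character figures
quoted above can be written out as follows, assuming the 80-character line
limit of CIF 1.0 and a five-character "data_" or "save_" prefix.  The names
used here are purely illustrative.

```python
LINE_LIMIT = 80          # CIF 1.0 line length limit, in characters
PREFIX = "data_"         # "save_" has the same length


def max_name_length(keep_underscore):
    """Longest data name that fits on one line after the prefix.

    If the dictionary drops the name's leading underscore (DDL1 style),
    the name itself may be one character longer.
    """
    dropped = 0 if keep_underscore else 1
    return LINE_LIMIT - len(PREFIX) + dropped


print(max_name_length(keep_underscore=False))   # DDL1 style
print(max_name_length(keep_underscore=True))    # DDL2 style
```

Raising LINE_LIMIT raises both figures by the same amount, which is why a
longer line length eases the dictionary storage problem.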

> > Paragraph 42 makes it optional to support line termination semantics
> > different from the host OS'.  That would be another departure
> > from CIF 1.0, I think, and, in my opinion, an all-around bad idea if
> > CIFs are supposed to be portable.  As far as I can tell, the
> > pseudo-production presented for <eol> is in fact the required
> > implementation for a fully-conformant CIF 1.0 parser.
> >
> 
> If you are on a unix system, the pseudo-production is almost right
> for a "liberal-reader" CIF parser.  It misses the case of a final
> line in a file which has not been terminated by "\n".  If you are
> on a VMS system, or an IBM mainframe, the pseudo-production may be
> completely wrong for a CIF created locally as a text file.  If CIFs
> are truly to be portable, it must be possible for someone on
> a non-Unix system (and non-Windows, non-Mac system) to work with them.
> 
> 
> > Paragraph 43: In combination with the formal grammar presented
> > earlier, the definitions of the <eol> and <noteol> non-terminals in
> > fact seem to _preclude_ CIF parsers from handling non-native line
> > termination semantics.  Even if that's not a departure from CIF 1.0,
> > it's still a bad idea.
> >
> 
> We are not trying to preclude people from writing parsers which are
> liberal and able to read a wider range of CIF formats than those
> produced by the text editors of their own machines, but it would
> be unreasonable and impractical to insist that every parser be able
> to read every line format that ever has or will be invented.  It
> is not even reasonable to insist that every parser be able to
> read some short list of non-native line formats.  That would,
> for example, make Fortran-implemented parsers non-conformant on
> certain systems.

[and]

> > Regardless of whether the end of line handling is different in 1.1
> > than it was in 1.0, I think that those comments are a
> > mischaracterization of the details of the draft 1.1 spec.  As far as
> > I can tell, what the spec now says is that CIF line termination
> > is in fact machine dependent, and that an external utility must
> > -- must! -- be used to convert a CIF from any foreign machine
> > line termination convention to the local machine convention (if they
> > differ) before a conforming CIF parser can successfully parse the
> > file.  I think this is exactly the wrong direction.
> >
> 
> The issue you raise is a fundamental one for many data formats.
> CIF has always been specified as an editable text format.  This
> is not unusual for archival scientific data formats.  You seem
> to be saying that you would prefer a binary format.  In that
> case I would suggest the binary variant of CIF:  CBF/imgCIF.

I would prefer the approach taken by Postscript and some other
languages: <CR>, <LF>, and a <CR><LF> sequence are all accepted as
line terminators.  Support for systems that have record-oriented text
files or different character encodings necessarily requires conversion
in both directions, which I consider a separate issue altogether.
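
As a minimal Python sketch of that approach (the function names are
hypothetical, and the input is assumed to be already-decoded text rather
than a record-oriented file):

```python
import re

# <CR><LF> must be tried first so that it counts as one terminator, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_eol(text):
    """Replace every <CR><LF>, bare <CR>, or bare <LF> with a single LF."""
    return _EOL.sub("\n", text)


def split_lines(text):
    """Split text into lines on any of the three terminators."""
    return _EOL.split(text)


print(split_lines("data_x\r\n_a 1\r_b 2\n_c 3"))
```

An explicit pattern is used instead of str.splitlines() because the latter
also breaks lines on VT, FF, and several other characters, which is more
than the three terminators listed above.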

As I reread section 42 of the syntax document, I see what appear to be
conflicting statements about line-termination handling.  In fact, the
first two sentences seem to be inconsistent -- the first says that
<eol> is the system-dependent end-of-line, and the second says that
CIF follows the same convention as XML (complete with a quote from the
XML recommendation, which is more or less along the lines of my stated
preference above).  A few lines later, the spec proposes a parser
that recognizes exactly the line termination semantics I prefer, but
this is at variance with the earlier definition of <eol>.  Moreover,
the quotation from the XML recommendation describes how the XML
processor _translates_ end-of-line sequences to a standard (normalized)
representation.  Is that in fact what CIF 1.1 parsers will be expected
to do?  That would be fine by me, but other statements in this section
of the draft seem to indicate that that is not the intent.

I find it particularly troublesome that at the end of section 42 the
nature of software used to transfer CIFs is specified.  Not only
ought this to be beyond the scope of the specification, but it also
seems to be unnecessarily restrictive.  It says, for instance, that
I may not move a CIF from a Win32 system to a Linux system by diskette.
And what about my personal desktop, which dual boots Windows and Linux?
May I not reboot without worrying that I have violated the CIF spec?

As for not being able to write parsers in Fortran that support line
terminations different from the host OS', I say (1) Fortran is not an
ideal language for this sort of thing; (2) it will be easier when
Fortran acquires stream I/O in the next iteration of its standard; and
(3) it CAN be done with Fortran 77, and I have the working code to
prove it.  (Works with both DEC/Compaq/Intel Fortran and g77 on Win32
and Linux, at least, and requires no language extensions that I am
aware of.)

> > According to paragraph 60, a file containing only whitespace and
> > comments but no data block is not a valid 1.1 CIF.  That is another
> > departure from CIF 1.0 if it is really the intent.  One of the
> > ciftest trip files actually tests this case, in fact.
> 
>   This sounds like a good topic for further discussion.  I for one
> would favor allowing such a file to be a CIF, but I am not certain
> what I would do with it.
> 
> >
> > Paragraph 61: this is another departure from CIF 1.0, which did
> > allow data blocks without data items.  Another of the ciftest trip
> > files tests this case.  (vcif evidently produces a warning, which
> > seems reasonable, but this is not an error.)
> 
>   Yet another good topic for discussion.

These two cases are similar.  A CIF with no data content would often be
an error case for an application, but I prefer to let the application
decide that, rather than enforcing it in the CIF spec.

For the sake of discussion, I point out that the formal STAR grammar in
the 1994 STAR specification paper recognizes a file without any data
or global block as a valid STAR file, but requires a data block to
contain at least one data item, data loop, or save frame, and requires
a save frame to contain at least one data item or data loop.  Oddly,
however, a data loop with no data values can legally be the only content
of a STAR data block or save frame, according to the grammar presented
there.

> There is an open debate as to whether the production for <Tag>
> should be:
> 
>    <Tag> ::= '_'{<NonBlankChar>}+
> 
> or
>    <Tag> ::= '_'{<NameChar>}*

Well, the former is easier for an electronic parser because it has
less context dependency.  Yes, it allows nasty, ugly data names, but
as far as I am concerned anyone who uses such deserves what he gets.
That's the one I would prefer.
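
A rough Python sketch of the first production, to show how little a lexer
needs to know in order to apply it.  It assumes <NonBlankChar> means
printable ASCII other than space (codes 33 to 126), which is an assumption
rather than a quotation from the draft.  The second production is not
sketched because <NameChar> is not defined in this message.  The function
name is hypothetical.

```python
import re

# '_' followed by one or more non-blank printable ASCII characters.
_TAG = re.compile(r"_[!-~]+")


def is_tag(token):
    """True if the whole token matches <Tag> ::= '_'{<NonBlankChar>}+ ."""
    return _TAG.fullmatch(token) is not None


for tok in ("_cell_length_a", "_ugly[name]#1", "_", "_a b", "cell"):
    print(repr(tok), is_tag(tok))
```

With this form, any whitespace-delimited token that begins with an
underscore and has at least one more character is a data name, so the lexer
never has to inspect the interior of the token.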

> > Also in the formal grammar, the productions for
> > <SingleQuotedString><WhiteSpace> and <DoubleQuotedString><WhiteSpace>
> > are ambiguous.  What is intended, I think, is that the
[...]
> Please read paragraph 58, where it says:  "The <WhiteSpace> on the
> lefthand side must evalue to the same string instance on the righthand
> side and the parse must terminate on the first valid match reading
> left to right."  If one uses a parser which accepts the first
> full-depth match in a left to right scan, the productions are not
> ambiguous, and are sufficient to define the quoted strings without
> having to define digraphs.

I would much rather see this expressed in the formal grammar than
(or in addition to) in the commentary.  It would be clearer that way.
(Evidently so, as I at first missed the part of the text that explains
this.)

> > Moreover, is there any value to including the <Numeric> non-terminal
> > and its children in the grammar at all?  Anything that matches
> > <Numeric> will also match <CharString>, so <Numeric> is not
> > necessary to describe the language.
> 
> CIF differs from STAR in paying attention to numeric items.  The
> dictionaries control the semantics and help to resolve the
> ambiguities.

I realize that CIF has more extensive data typing than does STAR, and
that CIF dictionaries can be used to resolve ambiguities.  My point
is to question whether it is necessary or useful to ambiguate and
expand the formal grammar by including the <Numeric> non-terminal.  My
current opinion is that the information conveyed by those productions
is more appropriate for the description of language semantics.


John Bollinger
jobollin@indiana.edu
