Discussion List Archives

Re: Backus-Naur Form for CIF

  • To: Multiple recipients of list <comcifs-l@iucr.org>
  • Subject: Re: Backus-Naur Form for CIF
  • From: Brian McMahon <bm@iucr.org>
  • Date: Wed, 4 Oct 2000 10:46:50 +0100 (BST)

Nick, Herbert

I've reflected this discussion onto the cif-developers list, because I think
it's useful to canvass a wider opinion than the comcifs list alone.

 On Wed, Oct 04, 2000 at 03:08:18AM +0100, Nick Spadaccini (>) wrote:
 > On Tue, 3 Oct 2000, Herbert J. Bernstein (>>) wrote:
 >> 1.  In the paper it says under "The CIF restrictions to the STAR File
 >> syntax are...":
 >>   " ... All data names and block codes are case insensitive, i.e. _ABS and
 >> _abs are treated identically."
 >> The usual approach used in Fortran of redefining a-z with productions of
 >> the form a ::= "A"|"a" won't work here, since we need to preserve case
 >> sensitivity for text.  In practice this would be fudged in the lexical
 >> scanner, but, for clarity, I would suggest adding an explicit comment
 >> explaining the case-insensitivity of data names and some productions of
 >> the form:
 >>   <DATA_>  ::=  {"D"|"d"} {"A"|"a"} {"T"|"t"} {"A"|"a"} "_"
 >>   <LOOP_>  ::=  {"L"|"l"} {"O"|"o"} {"O"|"o"} {"P"|"p"} "_"
 >> to use in place of the "data_" and "loop_" strings
 > Absolutely. This is very much what is done in the yacc implementation for
 > starbase. Namely we redefine the characters a,b,d etc to be of either case
 > and then define the tokens using these, as in ....
 > a       [aA]
 > b       [bB]
 > d       [dD]
 > e       [eE]
 > g       [gG]
 > l       [lL]
 > o       [oO]
 > p       [pP]
 > s       [sS]
 > t       [tT]
 > v       [vV]
 > Data_       {d}{a}{t}{a}_
 > Loop_       {l}{o}{o}{p}_
 > Global_     {g}{l}{o}{b}{a}{l}_
 > Stop_       {s}{t}{o}{p}_
 > Save_       {s}{a}{v}{e}_
 > In the javacc implementation there is this wonderful global setting,
 > namely
 > options {
 >     IGNORE_CASE=true;
 >     }
 > which simplifies things even more. I will adjust the BNF accordingly.
 > Thanks for picking that up Herb.

This is a nice example of "literalism", and good reason to have these details
thrashed out formally. I had always taken the view that the "reserved words"
data_ and loop_ were case-SENSITIVE; so data_foo and data_FOO were
identical, but DATA_foo was invalid. I'm happy to accept the new convention,
although it does mean some code rewriting.
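
As an illustration of the convention agreed above, here is a minimal Python
sketch (not taken from starbase or the javacc grammar) that recognises the
reserved words without regard to case while leaving the rest of the token
untouched. The names RESERVED and classify are hypothetical, and the choice of
which keywords may carry a trailing name is an assumption made for the example.

```python
import re

# Reserved words from the lex fragment quoted above, matched in any case.
# Assumption: data_ and save_ may carry a name; loop_, stop_ and global_ stand alone.
RESERVED = re.compile(r"(?:(data|save)_(\S*)|(loop|stop|global)_)", re.IGNORECASE)


def classify(token):
    """Return (keyword, name) for a reserved word, or None for any other token."""
    m = RESERVED.fullmatch(token)
    if m is None:
        return None
    if m.group(1):
        # Only the keyword is normalised; the name keeps its original spelling.
        return (m.group(1).lower() + "_", m.group(2))
    return (m.group(3).lower() + "_", "")
```

Because only the keyword is lower-cased, text values elsewhere in the file
keep their case, and a caller that wants case-insensitive block codes can
compare the returned names with .lower().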

 >> 2.  The production for <data_block> does not require any leading or
 >> trailing whitespace, so that a <CIF_file> could consist of a
 >> <data_heading> and a <data> item immediately followed by another
 >> <data_heading>, etc.  I cannot seem to find where the productions
 >> explicitly require whitespace between the data item and the second
 >> data heading.  A similar problem seems to exist in the production for
 >> loop values.  This would certainly be solved by implicit precedence
 >> among the productions or by operation of the lexical scanner, but it would
 >> be best to have the BNF be unambiguous in the handling of whitespace.
 > I have said it before and I will say it again, "Now you know why I have
 > been reluctant to include productions specific to whitespace into the
 > BNF". They are a purely lexical issue and language BNFs all exclude them
 > with the proviso that " whitespace can be used anywhere to delimit tokens
 > etc etc" without any explicit rules. I can see a fix, but it would need an
 > exception. Namely change 
 >    <data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>*
 > to 
 >    <data_block>   ::= <wspace>+ <data_heading> <data>+ <wspace>*
 > The exception being the leading <wspace> need not be there IF IT IS THE
 > BEGINNING OF THE FILE. You could equally have
 >    <data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>+
 > with the exception about the end of the file.
 > This exception would have to be "written as a comment" and not formally
 > part of the BNF syntax (unless someone can see how to do it elegantly).
 > What's the consensus?

I prefer the exception at the end of the file (i.e. the second alternative).
Could it be formalised by including an end-of-file token?
   <data_block>   ::= <wspace>* <data_heading> <data>+ (<wspace>|<eof>)+
Though I guess the problem is that you then need to insert <eof>'s
everywhere that they might legitimately occur in a valid file description.
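
To show how a lexical scanner sidesteps the problem, here is a rough Python
sketch in which whitespace or end of input are the only token delimiters. The
function name block_headings is hypothetical, and the sketch deliberately
ignores quoted strings, text fields and comments, so it is an illustration of
the delimiter rule only, not a CIF tokenizer.

```python
import re


def block_headings(text):
    """Return the data block headings found in text."""
    # \S+ runs up to the next whitespace character or to end of input, so the
    # last token is terminated by end of file without a trailing <wspace>.
    tokens = re.findall(r"\S+", text)
    # A heading is recognised only at the start of a token (keyword in any case).
    return [t for t in tokens if t.lower().startswith("data_")]
```

With this rule the input "data_one _a 1data_two" contains a single block whose
value is the string "1data_two"; a second block begins only when whitespace
precedes data_two.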

 >> 3.  The paper speaks of blanks, but not of tabs and vertical tabs and
 >> formfeeds.  Most systems will handle tabs reasonably.  Not all
 >> systems can handle vertical tab or form feed.  Are we requiring all
 >> CIF parsers to be able to handle more than blank and tab?
 > The vt and ff was an attempt to catch other non-printing characters that
 > could be reasonably interpreted as the equivalent of spaces or tabs (the
 > vt) or of a newline (ff). If it clarifies things, and restrictions always
 > do, I can delete references to vt and ff. Opinions?

If you permit <vt> and <ff> as allowed characters, you should treat them as
white space. I've no objection to forbidding them if that's what is
generally preferred. BUT I have a half-recollection that <ff> was introduced
at some stage as a mandatory character in the header to image data in the
crystallographic binary file (to stop "more" and other Unix pagers from
writing binary data to screen). Is that still the case? And how were such
embedded <ff>s to be handled in imgCIF?
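
The practical effect of the two choices can be seen in a small Python sketch.
The two whitespace sets and the function split_tokens are assumptions made for
the example; nothing here settles which set the specification should adopt.

```python
# Assumed candidate whitespace sets; whether <vt> and <ff> belong is the open question.
BLANK_TAB_NEWLINE = " \t\r\n"
WITH_VT_FF = BLANK_TAB_NEWLINE + "\v\f"


def split_tokens(text, whitespace):
    """Split text into tokens, treating every character in whitespace as a delimiter."""
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

If <vt> and <ff> are permitted but not treated as white space, they end up
inside data names and values, which is why the two decisions belong together.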

 >> 4.  The paper speaks of recognising a number, and gives a syntax for a
 >> number (with and without an ESD).  Shouldn't this be in the BNF?
 > I guess I really view the BNF down to the level of what is a data value in
 > terms of the allowed character sequence. Whether it is a number or not is
 > a higher level of abstraction. I can include the production for a number
 > (with or without parentheses) but it would be a lexical definition. That
 > is it would not appear in any of the grammar productions because the
 > complexity would grow enormously. Imagine having to now define when a
 > <number> can appear within an <SC_bounded_string>! A *number* can be
 > included for the sake of lexical definition. Opinions?

I agree with Nick on this. Data value tokens are just strings. A "number"
is validated against typing rules. Whether the data typing in DDL1 CIF is
adequately described mechanistically is a good question; in DDL2 data types
are expressed *in the domain dictionary* as regexps (see mmCIF).
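
As a sketch of validating a value against a typing rule after parsing, the
Python below checks a string against a regexp for a number with an optional
ESD in parentheses. The pattern is an assumption written for this example (it
is neither the syntax given in the paper nor an mmCIF dictionary regexp), and
parse_number is a hypothetical name.

```python
import re

# Assumed shape: optional sign, digits with optional decimal point, optional
# exponent, then an optional ESD as an unsigned integer in parentheses.
NUMBER = re.compile(
    r"(?P<num>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\((?P<esd>\d+)\))?"
)


def parse_number(value):
    """Return (number, esd_digits or None) if value matches NUMBER, else None."""
    m = NUMBER.fullmatch(value)
    if m is None:
        return None  # the token stays an ordinary string
    esd = m.group("esd")
    return (float(m.group("num")), int(esd) if esd is not None else None)
```

The grammar itself never sees this rule: the parser hands over a string, and
the typing layer decides afterwards whether it is a number.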

 >> 5.  The paper includes an example with use of "\" (e.g. 'Cu K\a') escapes
 >> in text and character fields.  Shouldn't this escape mechanism be
 >> mentioned in the BNF, at least in the comments?
 > As far as the BNF is concerned the use of \ is not excluded as a
 > legitimate character.

Again I agree with Nick that it's not the role of the BNF. But I agree with
Herbert that the rules for interpreting formatting codes within text fields
are not sufficiently well specified. I'll think about this and come back in
due course with proposals for tackling that issue.

 >> 6.  The BNF does not seem to break out the "." and "?" metacharacter data
 >> values.  In real parsers, these are very important cases to distinguish.
 > Again as far as the BNF is concerned the use of . and ? are not excluded
 > as legitimate characters.
 > In 5. and 6. you seem to be speaking of *semantic* meaning. Such
 > definitions are not part of the BNF. The paper you speak of details these
 > characters and how to interpret them. One cannot appreciate what CIF is
 > with just a BNF; readers will need to consult other specifications that are
 > not reproducible in a BNF and are only explained in textual form (as in the
 > paper).
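
The separation Nick describes can be sketched as a semantic step applied after
the grammar has delivered a value. The Python below assumes the usual CIF
reading of "?" as unknown and "." as inapplicable, and further assumes that a
quoted value is always an ordinary string; Kind and interpret are hypothetical
names.

```python
from enum import Enum


class Kind(Enum):
    UNKNOWN = "?"
    INAPPLICABLE = "."
    STRING = "string"


def interpret(raw, quoted=False):
    """Classify a parsed data value; the grammar only supplies raw and quoted."""
    if not quoted:
        if raw == "?":
            return Kind.UNKNOWN
        if raw == ".":
            return Kind.INAPPLICABLE
    return Kind.STRING
```

A real parser can still distinguish these cases, as Herbert wants, without the
BNF having to treat "." and "?" as anything other than legitimate characters.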

The new policy statement includes URLs designed to point to formal
specifications for CIF, STAR and DDL. The pages at these URLs are currently
rudimentary, but will be expanded to include complete specifications,
including but not restricted to BNFs. I'm responsible for maintaining those
pages, so will be glad of suggestions for specific items that should be
included on those pages.

Best wishes