Discussion List Archives

Re: Backus-Naur Form for CIF

On Tue, 3 Oct 2000, Herbert J. Bernstein wrote:

> The usual approach used in Fortran of redefining a-z with productions of
> the form a ::= "A"|"a" won't work here, since we need to preserve case
> sensitivity for text.  In practice this would be fudged in the lexical
> scanner, but, for clarity, I would suggest adding an explicit comment
> explaining the case-insensitivity of data names and some productions of
> the form:
> 
>   <DATA_>  ::=  {"D"|"d"} {"A"|"a"} {"T"|"t"} {"A"|"a"} "_"
>   <LOOP_>  ::=  {"L"|"l"} {"O"|"o"} {"O"|"o"} {"P"|"p"} "_"
> 
> to use in place of the "data_" and "loop_" strings

Absolutely. This is very much what is done in the yacc implementation for
starbase. Namely, we redefine the characters a, b, d, etc. to match either
case and then define the tokens using these, as in:

a       [aA]
b       [bB]
d       [dD]
e       [eE]
g       [gG]
l       [lL]
o       [oO]
p       [pP]
s       [sS]
t       [tT]
v       [vV]
Data_       {d}{a}{t}{a}_
Loop_       {l}{o}{o}{p}_
Global_     {g}{l}{o}{b}{a}{l}_
Stop_       {s}{t}{o}{p}_
Save_       {s}{a}{v}{e}_
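
For readers who do not use lex, the per-letter definitions above can be
sketched in Python. This is only an illustration under stated assumptions:
the function names are hypothetical, and the keyword list is simply the five
tokens defined above.

```python
import re

def case_insensitive_pattern(keyword: str) -> str:
    """Expand each letter into a two-case class, e.g. 'd' -> '[dD]'."""
    parts = []
    for ch in keyword:
        if ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            # Non-letters (here only '_') are matched literally.
            parts.append(re.escape(ch))
    return "".join(parts)

# The five reserved tokens from the lex definitions above.
KEYWORDS = ["data_", "loop_", "global_", "stop_", "save_"]
PATTERNS = {kw: re.compile(case_insensitive_pattern(kw)) for kw in KEYWORDS}

def match_keyword(text: str):
    """Return the canonical keyword that `text` starts with, or None."""
    for kw, pattern in PATTERNS.items():
        if pattern.match(text):
            return kw
    return None
```

Only the reserved tokens are made case-insensitive this way; any other text
is left untouched, which is the point of defining the classes letter by
letter.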

In the javacc implementation there is this wonderful global setting,
namely

options {
    IGNORE_CASE=true;
    }

which simplifies things even more. I will adjust the BNF accordingly.
Thanks for picking that up, Herb.
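
A rough Python analogue of such a global switch is the re.IGNORECASE flag.
The sketch below assumes a hypothetical heading token made of "data_"
followed by at least one non-blank character; it is not taken from the
javacc grammar. It shows that the keyword is matched in any case while the
captured block name keeps its original case.

```python
import re

# Assumed token shape: reserved prefix plus a non-blank block name.
DATA_HEADING = re.compile(r"data_(\S+)", re.IGNORECASE)

def parse_data_heading(token: str):
    """Return the block name with its case preserved, or None if no match."""
    m = DATA_HEADING.fullmatch(token)
    return m.group(1) if m else None
```

For example, parse_data_heading("DATA_MyBlock") returns "MyBlock", not
"myblock".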

> 2.  The production for <data_block> does not require any leading or
> trailing whitespace, so that a <CIF_file> could consist of a
> <data_heading> and a <data> item immediately followed by another
> <data_heading>, etc.  I cannot seem to find where the productions
> explicitly require whitespace between the data item and the second
> data heading.  A similar problem seems to exist in the production for
> loop values.  This would certainly be solved by implicit precedence
> among the productions or by operation of the lexical scanner, but it would
> be best to have the BNF be unambiguous in the handling of whitespace.

I have said it before and I will say it again: now you know why I have
been reluctant to include whitespace-specific productions in the BNF.
Whitespace is a purely lexical issue, and language BNFs typically exclude it
with the proviso that "whitespace can be used anywhere to delimit tokens",
without any explicit rules. I can see a fix, but it would need an
exception. Namely, change

<data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>*

to 

<data_block>   ::= <wspace>+ <data_heading> <data>+ <wspace>*

The exception is that the leading <wspace> need not be there IF IT IS THE
BEGINNING OF THE FILE. You could equally have

<data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>+

with the exception about the end of the file.

This exception would have to be written as a comment rather than being
formally part of the BNF syntax (unless someone can see how to do it
elegantly).

What's the consensus?
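
To make the first variant concrete, here is a minimal Python sketch of how a
scanner might apply the rule "leading whitespace is required, except at the
beginning of the file". The regular expression and function name are
assumptions for illustration, and the sketch ignores quoted strings and text
fields, which a real scanner would have to skip.

```python
import re

# (?<!\S) succeeds at the start of the input or right after whitespace,
# which is exactly the "<wspace>+ unless beginning of file" exception.
HEADING = re.compile(r"(?<!\S)data_\S+", re.IGNORECASE)

def find_data_headings(text: str):
    """Return the data headings that are properly delimited on the left."""
    return [m.group(0) for m in HEADING.finditer(text)]
```

With this rule, "data_a\n_x 1\ndata_b" yields two headings, whereas in
"data_a\n_x 1data_b" the sequence "1data_b" stays a single data value and
only one heading is found.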

> 3.  The paper speaks of blanks, but not of tabs and vertical tabs and
> formfeeds.  Most systems will handle tabs reasonably.  Not all
> systems can handle vertical tab or form feed.  Are we requiring all
> CIF parsers to be able to handle more than blank and tab?

The vt and ff were an attempt to catch other non-printing characters that
could reasonably be interpreted as the equivalent of spaces or tabs (the
vt) or of a newline (the ff). If it clarifies things, and restrictions always
do, I can delete the references to vt and ff. Opinions?
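
If vt and ff were kept, one way a scanner could treat them is to normalise
them before tokenising. The mapping below is only a sketch of the
interpretation described above (vt as a blank, ff as a newline); the names
are hypothetical.

```python
# Assumed mapping: vertical tab acts as a blank, form feed as a newline.
WHITESPACE_MAP = str.maketrans({"\v": " ", "\f": "\n"})

def normalize_whitespace(text: str) -> str:
    """Replace vt and ff with their assumed plain equivalents."""
    return text.translate(WHITESPACE_MAP)
```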

> 4.  The paper speaks of recognising a number, and gives a syntax for a
> number (with and without an ESD).  Shouldn't this be in the BNF?

I guess I really view the BNF as going down only to the level of what a data
value is in terms of its allowed character sequence. Whether it is a number
or not is a higher level of abstraction. I can include the production for a
number (with or without the ESD in parentheses), but it would be a lexical
definition. That is, it would not appear in any of the grammar productions,
because the complexity would grow enormously. Imagine having to define when a
<number> can appear within an <SC_bounded_string>! A *number* can be
included for the sake of lexical definition. Opinions?
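
As an illustration of keeping the number definition purely lexical, the
Python sketch below classifies an already-scanned data value after the
grammar has accepted it. The pattern is an assumed form (optional sign,
digits with an optional decimal point, optional exponent, optional ESD in
parentheses) and is not copied from the paper; the names are hypothetical.

```python
import re

# Assumed lexical form of a number with an optional ESD, e.g. "1.234(5)".
NUMBER = re.compile(
    r"""
    (?P<value>
        [+-]?                       # optional sign
        (?: \d+ \.? \d* | \. \d+ )  # digits with an optional decimal point
        (?: [eE] [+-]? \d+ )?       # optional exponent
    )
    (?: \( (?P<esd> \d+ ) \) )?     # optional ESD in parentheses
    """,
    re.VERBOSE,
)

def parse_number(token: str):
    """Return (value, esd) if the whole token is a number, else None."""
    m = NUMBER.fullmatch(token)
    if m is None:
        return None
    esd = int(m.group("esd")) if m.group("esd") is not None else None
    return float(m.group("value")), esd
```

The grammar never needs to mention <number>: a value is first accepted as a
character sequence, and only then tested against the lexical definition.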

> 5.  The paper includes an example with use of "\" (e.g. 'Cu K\a') escapes
> in text and character fields.  Shouldn't this escape mechanism be
> mentioned in the BNF, at least in the comments?

As far as the BNF is concerned the use of \ is not excluded as a
legitimate character.

> 6.  The BNF does not seem to break out the "." and "?" metacharacter data
> values.  In real parsers, these are very important cases to distinguish.

Again as far as the BNF is concerned the use of . and ? are not excluded
as legitimate characters.

In 5. and 6. you seem to be speaking of *semantic* meaning. Such
definitions are not part of the BNF. The paper you speak of details these
characters and how to interpret them. One cannot appreciate what CIF is
from a BNF alone; readers will need to consult other specifications that
cannot be reproduced in a BNF and are explained only in textual form (as in
the paper).
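
The split between syntax and semantics can be sketched as a separate
interpretation step applied after parsing. In the Python sketch below the
names are hypothetical, and the meanings attached to "." and "?"
(inapplicable and unknown) follow the usual CIF convention described in the
paper rather than anything in the BNF. The sketch also assumes the parser
reports whether a value was quoted, since a quoted '?' would be an ordinary
string.

```python
from enum import Enum

class Special(Enum):
    INAPPLICABLE = "."
    UNKNOWN = "?"

def interpret(raw: str, quoted: bool = False):
    """Map an unquoted '.' or '?' to its special meaning; pass others through."""
    if not quoted:
        for special in Special:
            if raw == special.value:
                return special
    return raw
```

The grammar accepts ".", "?" and "\" as ordinary characters; a layer like
this one is where their meaning would be assigned.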

cheers

Nick

I will make the changes after some review of this correspondence by
others. 


--------------------------------
Dr Nick Spadaccini
Department of Computer Science              voice: +(61 8) 9380 3452
University of Western Australia               fax: +(61 8) 9380 1089
Nedlands, Perth,  WA  6907                 email: nick@cs.uwa.edu.au
AUSTRALIA                        web: http://www.cs.uwa.edu.au/~nick