Discussion List Archives

Re: Backus-Naur Form for CIF

At 03:08 04/10/00 +0100, Nick Spadaccini wrote:
>On Tue, 3 Oct 2000, Herbert J. Bernstein wrote:
>
>
> > 2.  The production for <data_block> does not require any leading or
> > trailing whitespace, so that a <CIF_file> could consist of a
> > <data_heading> and a <data> item immediately followed by another
> > <data_heading>, etc.  I cannot seem to find where the productions
> > explicitly require whitespace between the data item and the second
> > data heading.  A similar problem seems to exist in the production for
> > loop values.  This would certainly be solved by implicit precedence
> > among the productions or by operation of the lexical scanner, but it would
> > be best to have the BNF be unambiguous in the handling of whitespace.
>
>I have said it before and I will say it again, "Now you know why I have
>been reluctant to include productions specific to whitespace into the
>BNF". They are a purely lexical issue and language BNFs all exclude them
>with the proviso that " whitespace can be used anywhere to delimit tokens
>etc etc" without any explicit rules. I can see a fix, but it would need an
>exception. Namely change
>
><data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>*
>
>to
>
><data_block>   ::= <wspace>+ <data_heading> <data>+ <wspace>*
>
>The exception being the leading <wspace> need not be there IF IT IS THE
>BEGINNING OF THE FILE. You could equally have
><data_block>   ::= <wspace>* <data_heading> <data>+ <wspace>+
>
>with the exception about the end of the file.
>
>This exception would have to be "written as a comment" and not formally
>part of the BNF syntax (unless someone can see how to do it elegantly).
>
>What's the consensus?

> > 3.  The paper speaks of blanks, but not of tabs and vertical tabs and
> > formfeeds.  Most systems will handle tabs reasonably.  Not all
> > systems can handle vertical tab or form feed.  Are we requiring all
> > CIF parsers to be able to handle more than blank and tab?
>
>The vt and ff were an attempt to catch other non-printing characters that
>could be reasonably interpreted as the equivalent of spaces or tabs (the
>vt) or of a newline (ff). If it clarifies things, and restrictions always
>do, I can delete references to vt and ff. Opinions?

We spent a *lot* of time on whitespace in the XML discussions and it is 
still one of the troublesome areas. Without looking at CIF in detail I 
would think that Nick's approach is manageable. The important things include:
         - what is the allowed character set? are vt and ff allowed?
         - is there a difference between whitespace as token separators and 
as "data"? If so it must be defined.
         - is any normalization of whitespace allowed? required?

XML is more demanding on its whitespace mechanisms than CIF, but some of the 
areas still arise.
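
As a rough illustration of Nick's proposed rule (a data heading must be preceded by whitespace unless it opens the file), here is a minimal Python sketch. The function name `data_headings` is hypothetical, and the whitespace set (blank, tab, CR, LF, with vt and ff left out) is an assumption, since that is exactly the point still under discussion. It also ignores quoted strings and text fields, so it is a lexical toy, not a CIF scanner.

```python
import re

# Assumption: whitespace is blank, tab, CR and LF only; vt and ff are excluded.
# The negative lookbehind succeeds at the start of the file or after whitespace,
# which mirrors "leading <wspace>+ except at the beginning of the file".
HEADING = re.compile(r"(?<![^ \t\r\n])data_[^ \t\r\n]+")

def data_headings(text):
    """Return the data headings that are properly delimited by whitespace."""
    return [m.group(0) for m in HEADING.finditer(text)]

print(data_headings("data_a\n_x 1\ndata_b\n_y 2\n"))
print(data_headings("data_a _x 1data_b _y 2"))
```

In the second call the text `1data_b` is not split into a value and a heading, because nothing separates them; the whole run would remain a single token.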



> > 4.  The paper speaks of recognising a number, and gives a syntax for a
> > number (with and without an ESD).  Shouldn't this be in the BNF?
>
>I guess I really view the BNF down to the level of what is a data value in
>terms of the allowed character sequence. Whether it is a number or not is
>a higher level of abstraction. I can include the production for a number
>(with or without parentheses) but it would be a lexical definition. That
>is it would not appear in any of the grammar productions because the
>complexity would grow enormously. Imagine having to now define when a
><number> can appear within an <SC_bounded_string>! A *number* can be
>included for the sake of lexical definition. Opinions?

I think this is a useful separation. The XML spec consists of EBNF 
productions and additional constraints.  These constraints cannot be 
(or are not) expressed in EBNF and occur in numbered prose items. There 
is now a movement towards including some of these in XML Schemas, which 
allow simple patterns or regexps. I would agree to the same for CIF.
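
To make the "simple patterns or regexps" idea concrete, here is a hedged Python sketch of a number with an optional ESD in parentheses, applied to a token after the BNF has already delimited it. The pattern is my assumption of what such a definition might look like (optional sign, decimal digits, optional exponent, optional integer ESD); it is not taken from the paper, and `parse_number` is a hypothetical name.

```python
import re

# Assumed lexical pattern, not the definition from the paper.
NUMBER = re.compile(
    r"""(?P<value>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)
        (?:\((?P<esd>\d+)\))?""",
    re.VERBOSE,
)

def parse_number(token):
    """Return (value, esd_digits) if the whole token is a number, else None."""
    m = NUMBER.fullmatch(token)
    if m is None:
        return None
    esd = m.group("esd")
    # The ESD is returned as the raw digits; scaling it to the last decimal
    # place of the value is left to a higher layer.
    return float(m.group("value")), (int(esd) if esd is not None else None)

print(parse_number("1.234(5)"))
print(parse_number("Cu"))
```

The grammar productions never mention this pattern; it is only consulted once a data value has been recognised, which keeps the BNF itself small.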

I believe that the productions and the constraints are sequential - i.e. 
the BNF can tokenize an XML file and then the constraints are applied to 
see whether it is well-formed or valid or neither. CIF should adopt the 
same philosophy - i.e. the tokenization comes first, then constraints are 
examined. If the two are more strongly coupled then it may imply context 
sensitivity which must be avoided.
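
A minimal Python sketch of that sequential philosophy, under loud assumptions: the tokenizer simply splits on whitespace (no quoted strings, text fields or comments), and the one constraint checked, that a data name is not repeated within a data block, is chosen only for illustration. The names `tokenize` and `check_constraints` are hypothetical.

```python
def tokenize(text):
    # Stage 1: purely lexical. Assumption: tokens are whitespace-separated.
    # Note that str.split() also treats vt and ff as whitespace.
    return text.split()

def check_constraints(tokens):
    # Stage 2: constraints are applied to the token stream, never to raw
    # characters, so the tokenizer needs no knowledge of them.
    errors = []
    seen = set()
    for tok in tokens:
        if tok.startswith("data_"):
            seen = set()              # a new data block starts a fresh scope
        elif tok.startswith("_"):
            if tok in seen:
                errors.append("duplicate data name: " + tok)
            seen.add(tok)
    return errors

print(check_constraints(tokenize("data_a _x 1 _x 2 data_b _x 3")))
```

Because stage 2 only ever sees finished tokens, adding or changing a constraint cannot alter how the file is tokenized.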

> > 5.  The paper includes an example with use of "\" escapes (e.g. 'Cu K\a')
> > in text and character fields.  Shouldn't this escape mechanism be
> > mentioned in the BNF, at least in the comments?
>
>As far as the BNF is concerned the use of \ is not excluded as a
>legitimate character.

Agreed.

What characters does CIF allow? Is it 8-13, 32-127? What about Unicode? It 
needs to be clear. [I haven't got the latest CIF spec so forgive me if this 
is obvious.]
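
Whatever the answer turns out to be, the check itself is small. A Python sketch, where the `ALLOWED` set (tab, LF, CR and codes 32-126) is a placeholder of mine and not the CIF definition, and `disallowed_characters` is a hypothetical name:

```python
# Assumption: placeholder character set; substitute whatever the spec decides.
ALLOWED = frozenset([9, 10, 13]) | frozenset(range(32, 127))

def disallowed_characters(text, allowed=ALLOWED):
    """Return (offset, code point) for each character outside the allowed set."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) not in allowed]

print(disallowed_characters("data_a\n_x 1\n"))
print(disallowed_characters("a\x0bb\u00e9"))
```

Passing the set as a parameter means the same check works whether vt, ff or non-ASCII characters end up inside or outside the definition.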

> > 6.  The BNF does not seem to break out the "." and "?" metacharacter data
> > values.  In real parsers, these are very important cases to distinguish.
>
>Again as far as the BNF is concerned the use of . and ? are not excluded
>as legitimate characters.
>
>In 5. and 6. you seem to be speaking of *semantic* meaning. Such
>definitions are not part of the BNF. The paper you speak of details these
>characters and how to interpret them. One cannot appreciate what CIF is
>with just a BNF, they will need to read other specifications not
>reproducible in a BNF, and only explained in the textual form (as in the
>paper).

Agreed.

In essence it should be possible to use lex and yacc to tokenize a CIF file 
according to the BNF (or to reject inappropriate characters and tokens). Then 
constraints are applied, and additional semantics such as the above are 
processed. [NB I have never felt happy about the *semantics* of dot and query - 
it certainly used to be possible to interpret them so that CIF processors were 
required to expand ? in DDLs into default values in data files.]
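
A small Python sketch of where the dot and query handling could sit: after tokenization, as a semantic step on each data value. The labels "unknown" for ? and "inapplicable" for . reflect my understanding of the usual CIF reading and should be treated as an assumption here, as should the `quoted` flag, which presumes the tokenizer records whether a value was delimited by quotes. `classify_value` is a hypothetical name.

```python
def classify_value(token, quoted=False):
    """Attach a semantic label to a data value the tokenizer already accepted."""
    if not quoted:
        if token == "?":
            return ("unknown", None)        # assumed meaning of bare ?
        if token == ".":
            return ("inapplicable", None)   # assumed meaning of bare .
    # Anything else, including a quoted '?' or '.', is ordinary data.
    return ("value", token)

print(classify_value("?"))
print(classify_value("?", quoted=True))
```

Nothing in the grammar changes: to the BNF both characters are just legitimate data values, and only this later step distinguishes them.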


         P.

>cheers
>
>Nick
>
>I will make the changes after some review of this correspondence by
>others.
>
>
>--------------------------------
>Dr Nick Spadaccini
>Department of Computer Science              voice: +(61 8) 9380 3452
>University of Western Australia               fax: +(61 8) 9380 1089
>Nedlands, Perth,  WA  6907                 email: nick@cs.uwa.edu.au
>AUSTRALIA                        web: http://www.cs.uwa.edu.au/~nick
>

Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk