
Re: [ddlm-group] CIF-2 changes

I wrote:

> And the only CIF2 parsers that will fail when they see a square
> bracket in a dataname are those that are (incorrectly) prepared to
> accept no spaces between dataname and datavalue.  So I repeat: the
> only reason we have moved away from square brackets as list delimiters
> is so that in the specific case that a space is missing between a
> dataname and a datavalue the parser can continue.  I see no other
> justification.

Nick responded:
>Yes it is the reason. But short of re-visiting the long discussion on
>whitespace as token separators (as they usually are) versus whitespace being
>1 of the 2 token terminating characters, and the subsequent problem that
>there need to be two definitions for every type depending on their position
>in recursion, it is a necessary consequence.

I don't see any necessary consequence.  We stopped using whitespace as
one of two token terminating characters the moment we agreed that a
closing quote/double quote finished a quote-delimited string
regardless of the following character (and we have adopted the same
philosophy for bracket-delimited values).  Whitespace in CIF2 is
purely a token separator, and remains so whether or not brackets are
allowed inside datanames.  I repeat, allowing brackets inside
datanames will not change the grammar *at all*: it will simply mean
two extra characters in the list of acceptable characters for a
dataname.  In particular, I see no relevance for recursive parsing or
the need for two definitions for every type.

Take the following CIF fragment:

...
_foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]

A lexer will tokenize the first entry as 'dataname', with a value of
'_foo[bar]_blahxyz', because it will continue eating characters until
it gets to a disallowed character or to the token separator
(whitespace).  It then tokenizes the run of whitespace in the same way,
consuming every character that falls within the definition of
whitespace, and then tokenizes the single open square bracket.  In what
way has having an open square bracket inside the dataname complicated
the parse?  Would this be simpler without square brackets in the list
of allowed characters for a dataname?  Note that the parse is identical
no matter what type of bracket is used to start the list, so why use
braces anyway?
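
As a rough illustration of the tokenization described above, here is a
minimal Python sketch.  The token names, the regular expression and the
exact set of characters allowed in a dataname are assumptions made for
this example only; this is not the CIF2 grammar.

```python
import re

# Hypothetical, simplified token definitions (NOT the CIF2 grammar).
# The DATANAME character class deliberately includes '[' and ']'.
TOKEN_RE = re.compile(r"""
    (?P<WHITESPACE>[ \t\r\n]+)            # token separator
  | (?P<DATANAME>_[A-Za-z0-9_.\[\]]*)     # starts with '_', may contain brackets
  | (?P<OPEN>\[)                          # list opener
  | (?P<CLOSE>\])                         # list closer
  | (?P<COMMA>,)                          # separator used in the fragment
  | (?P<BARE>[A-Za-z0-9]+)                # non-delimited value (simplified)
""", re.VERBOSE)


def tokenize(text):
    """Return (token_type, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


fragment = "_foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]"
for token_type, lexeme in tokenize(fragment):
    print(token_type, lexeme)
```

In this sketch the dataname rule only ever fires on a leading '_', so
the brackets inside '_foo[bar]_blahxyz' never compete with the bracket
that opens the list after the whitespace.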

Put another way, we are in the nice position that following a
whitespace we can almost always predict the token based purely on the
first character.

If '_', then it is a dataname.
If <quote> or <double quote>, then it is a datavalue.
If alphanumeric, then it is a non-delimited datavalue, unless the first
characters are 'loop_' or 'data_'.
If <open bracket>, then it is a list.

This is true whether or not brackets of any sort are included in the
allowed character set for a dataname.
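
The first-character rules above can be sketched as a small Python
function.  The function name, the returned labels, the `list_open`
parameter and the case-insensitive keyword check are all assumptions
for illustration; the input is taken to be a token that has already
been split off at whitespace.

```python
def predict_token(token, list_open="["):
    """Guess the token type from its first character (illustrative only).

    `list_open` is a hypothetical parameter naming whichever bracket
    opens a list; the decision never looks inside a dataname.
    """
    if not token:
        raise ValueError("empty token")
    first = token[0]
    if first == "_":
        return "dataname"
    if first in ("'", '"'):
        return "quoted datavalue"
    if first == list_open:
        return "list"
    if first.isalnum():
        # Assumption: keywords are matched case-insensitively.
        if token.lower().startswith(("loop_", "data_")):
            return "keyword"
        return "non-delimited datavalue"
    return "unknown"


for tok in ["_foo[bar]_blahxyz", "[elephant,", "loop_", "elephant", "'abc'"]:
    print(tok, "->", predict_token(tok))
```

Nothing in the function depends on which characters may follow the
leading '_' of a dataname, which is the point being made.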

If you disagree with this, I would like to see an example of where
having brackets in a dataname complicates the grammar compared to not
having them.

On Mon, Nov 9, 2009 at 7:26 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> As I said in my previous email. The gain is that you can determine where you
> are at a lexical level without having to go further in to the parsing. There
> is a reason why languages use [] and {} separately, and that ease.
>
> If computer scientists have learnt one thing in the last 50 years, it is how
> to design and specify languages so that you avoid ambiguity and complexity.

Agreed that a different type of bracket for tables is preferable.
-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
