
Re: [ddlm-group] CIF-2 changes

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] CIF-2 changes
From: Nick Spadaccini <nick@csse.uwa.edu.au>
Date: Tue, 10 Nov 2009 12:25:19 +0800
In-Reply-To: <279aad2a0911090330o6adeeb29he87cef7486071743@mail.gmail.com>




On 9/11/09 7:30 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> I wrote:
> 
>> And the only CIF2 parsers that will fail when they see a square
>> bracket in a dataname are those that are (incorrectly) prepared to
>> accept no spaces between dataname and datavalue.  So I repeat: the
>> only reason we have moved away from square brackets as list delimiters
>> is so that in the specific case that a space is missing between a
>> dataname and a datavalue the parser can continue.  I see no other
>> justification.
> 
> Nick responded:
>> Yes it is the reason. But short of re-visiting the long discussion on
>> whitespace as token separators (as they usually are) versus whitespace being
>> 1 of the 2 token terminating characters, and the subsequent problem that
>> there need to be two definitions for every type depending on their position
>> in recursion, it is a necessary consequence.
> 
> I don't see any necessary consequence.  We stopped using whitespace as
> one of two token terminating characters the moment we agreed that a
> closing quote/double quote finished a quote-delimited string
> regardless of the following character (and we have adopted the same
> philosophy for bracket-delimited values).  Whitespace in CIF2 is
> purely a token separator, and remains so whether or not brackets are
> allowed inside datanames.  I repeat, allowing brackets inside
> datanames will not change the grammar *at all*: it will simply mean
> two extra characters in the list of acceptable characters for a
> dataname.  In particular, I see no relevance for recursive parsing or
> the need for two definitions for every type.

During my last Sydney stay we discussed defining an unquoted string and a
data name with the same production rule, with the addition that the first
character of a data name has to be _.

What you are saying here is that if we allow [] inside a data name (which is
not allowed in an unquoted string) then we can have the best of both worlds:
we keep the new grammar (and coercion rules for the data) and simply accept
[] inside a data name.

If we want to say the characters of a data name are the same as those of an
unquoted string PLUS the [] characters, then this can be done. BUT there is a
price to be paid with regard to ambiguity when you include punctuation that
initiates a token in the character set of a data name. See below.
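
To make the two options concrete, here is a minimal Python sketch of the
shared production rule. The character class is a simplification assumed for
illustration only (the real CIF2 character set is whatever the specification
ends up saying), and the names UNQUOTED_CHARS, DATANAME_STRICT,
DATANAME_WITH_BRACKETS and match_dataname are hypothetical.

```python
import re

# Assumption: a simplified unquoted-string character class, i.e. anything
# that is not whitespace, a quote, a square bracket or a brace.
UNQUOTED_CHARS = r"[^\s\[\]{}'\"]"

# Option 1: a data name is '_' followed by unquoted-string characters.
DATANAME_STRICT = re.compile(rf"_{UNQUOTED_CHARS}+")

# Option 2: the same rule, plus '[' and ']' accepted inside the name.
DATANAME_WITH_BRACKETS = re.compile(rf"_(?:{UNQUOTED_CHARS}|[\[\]])+")


def match_dataname(pattern, text):
    """Return the data name matched at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None
```

Under option 1 the match on `_foo[bar]_blahxyz` stops at the first `[`;
under option 2 it runs on to the whitespace.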

> 
> Take the following CIF fragment:
> 
> ...
> _foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]
> 
> A lexer will tokenize the first entry as 'dataname', with a value of
> '_foo[bar]_blahxyz', because it will continue eating characters until
> it gets to a disallowed character, or the token separator
> (whitespace).  It then tokenises all whitespace the same way, by
> including all characters included in the definition of whitespace, and
> then tokenizes the single open square bracket.  In what way has having
> an open square bracket inside the dataname complicated the parse?
> Would this be simpler without square brackets in the list of allowed
> characters for a dataname?  Note that the parse is identical no matter

I would say yes, because parsing what you have written, and employing the
coercion rules for a missing whitespace separator, I would have come up with

_foo [bar]
_blahxyz      [elephant, cow, orangutang, [xxx]]

That parse is predicated on the user having made the error of omitting
separators. Which parse is correct is unknown, because the syntax is
ambiguous.

No ambiguity exists if the syntax is

_foo[bar]_blahxyz      {elephant, cow, orangutang, {xxx}}

because [ is not a token initiator.

On the other hand we could say lists are in [] and [] are also accepted in a
dataname and live with the ambiguity.
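
The three readings can be reproduced with a toy tokenizer. This is only a
sketch under simplifying assumptions (no quoted strings, no keywords, and a
hypothetical tokenize function with made-up parameter names), not a CIF2
lexer.

```python
WHITESPACE = " \t\r\n"


def tokenize(text, delims="[],", names_keep_brackets=False):
    """Split text into words and single-character delimiter tokens.

    delims: characters that always form a token on their own when they
        begin a token, and that end a word unless the exception applies.
    names_keep_brackets: if True, a word starting with '_' runs on to the
        next whitespace, so delimiters inside it stay part of the name.
    """
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in WHITESPACE:
            i += 1
        elif ch in delims:
            tokens.append(ch)
            i += 1
        else:
            stop = WHITESPACE
            if not (ch == "_" and names_keep_brackets):
                stop += delims  # a delimiter terminates the word (coercion)
            j = i
            while j < len(text) and text[j] not in stop:
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens
```

With names_keep_brackets=True the first token of the [] example is
`_foo[bar]_blahxyz`; with the default it is `_foo`, followed by `[`, `bar`,
`]` and `_blahxyz`. Passing delims="{}," for the brace form gives
`_foo[bar]_blahxyz` either way, since [ never starts or ends a token there.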

> what type of brackets are used to start the list, so why use braces
> anyway?
> 
> Put another way, we are in the nice position that following a
> whitespace we can almost always predict the token based purely on the
> first character.
> 
> If '_', then it is a dataname
> If <quote> or <double quote> it is a datavalue
> If alphanumeric then it is a non-delimited datavalue, unless the first
> characters are 'loop_' or 'data_'
> If <open bracket> then it is a list

This is true, up to the point where an illegal character causes ambiguity.
ABC is an unquoted string until you read ABC[; now you have terminated the
unquoted string and (likely) initiated a list. You have gone from one token
to another in the absence of any whitespace. The same interpretation could
arise if the unquoted string were a dataname.
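
A rough Python sketch of the first-character prediction, and of how a token
initiator in mid-word changes the answer. The function names classify and
split_at_initiator are hypothetical, and the rules are only those listed in
the quoted text, nothing more.

```python
def classify(word, list_open="["):
    """Predict the token kind from the start of a whitespace-delimited word."""
    first = word[0]
    if first == "_":
        return "dataname"
    if first in "'\"":
        return "quoted datavalue"
    if first == list_open:
        return "list"
    if word.startswith(("loop_", "data_")):
        return "keyword"
    return "unquoted datavalue"


def split_at_initiator(word, list_open="["):
    """Coerce a missing separator: cut the word where a list would begin."""
    cut = word.find(list_open, 1)  # skip position 0, handled by classify
    if cut == -1:
        return [word]
    return [word[:cut], word[cut:]]
```

For example, `[classify(p) for p in split_at_initiator("ABC[")]` yields an
unquoted datavalue followed by a list, and `split_at_initiator("_foo[bar]")`
cuts the word into `_foo` and `[bar]`.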

Hence my dilemma, but if others want to live with such ambiguity I am sure
it won't kill CIF2.

> 
> This is true whether or not brackets of any sort are included in the
> allowed characterset for a dataname.
> 
> If you disagree with this, I would like to see an example of where
> having brackets in a dataname complicates the grammar compared to not
> having them.

As above.

> 
> On Mon, Nov 9, 2009 at 7:26 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> As I said in my previous email. The gain is that you can determine where you
>> are at a lexical level without having to go further in to the parsing. There
>> is a reason why languages use [] and {} separately, and that is ease.
>> 
>> If computer scientists have learnt one thing in the last 50 years, it is how
>> to design and specify languages so that you avoid ambiguity and complexity.
> 
> Agreed that a different type of bracket for tables is preferable.

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au




