Re: [ddlm-group] CIF-2 changes
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] CIF-2 changes
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Tue, 10 Nov 2009 12:25:19 +0800
- In-Reply-To: <279aad2a0911090330o6adeeb29he87cef7486071743@mail.gmail.com>
On 9/11/09 7:30 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> I wrote:
>
>> And the only CIF2 parsers that will fail when they see a square
>> bracket in a dataname are those that are (incorrectly) prepared to
>> accept no spaces between dataname and datavalue. So I repeat: the
>> only reason we have moved away from square brackets as list delimiters
>> is so that in the specific case that a space is missing between a
>> dataname and a datavalue the parser can continue. I see no other
>> justification.
>
> Nick responded:
>> Yes it is the reason. But short of re-visiting the long discussion on
>> whitespace as token separators (as they usually are) versus whitespace being
>> 1 of the 2 token terminating characters, and the subsequent problem that
>> there need to be two definitions for every type depending on their position
>> in recursion, it is a necessary consequence.
>
> I don't see any necessary consequence. We stopped using whitespace as
> one of two token terminating characters the moment we agreed that a
> closing quote/double quote finished a quote-delimited string
> regardless of the following character (and we have adopted the same
> philosophy for bracket-delimited values). Whitespace in CIF2 is
> purely a token separator, and remains so whether or not brackets are
> allowed inside datanames. I repeat, allowing brackets inside
> datanames will not change the grammar *at all*: it will simply mean
> two extra characters in the list of acceptable characters for a
> dataname. In particular, I see no relevance for recursive parsing or
> the need for two definitions for every type.

During my last Sydney stay we discussed that we could define an unquoted
string and a data name to be the same production rule, with the addition
that the first character of a data name had to be _. What you are saying
here is that if we allow [] inside a data name (which is not allowed in an
unquoted string) then we can have the best of both worlds: we keep the new
grammar (and coercion rules for the data) and we simply accept [] inside a
data name.

If we want to say the characters of a data name are the same as an unquoted
string PLUS the [] characters, then this can be done. BUT there is a price
to be paid with regard to ambiguity when you include punctuation that
initiates a token in the character set of a data name. See below.

> Take the following CIF fragment:
>
> ...
> _foo[bar]_blahxyz [elephant, cow, orangutang, [xxx]]
>
> A lexer will tokenize the first entry as 'dataname', with a value of
> '_foo[bar]_blahxyz', because it will continue eating characters until
> it gets to a disallowed character, or the token separator
> (whitespace). It then tokenises all whitespace the same way, by
> including all characters included in the definition of whitespace, and
> then tokenizes the single open square bracket. In what way has having
> an open square bracket inside the dataname complicated the parse?
> Would this be simpler without square brackets in the list of allowed
> characters for a dataname? Note that the parse is identical no matter

I would say yes. Because parsing what you have written, and employing the
coercion rules for a missing whitespace separator, I would have come up with

_foo [bar] _blahxyz [elephant, cow, orangutang, [xxx]]

which is a parse predicated on the user having made an error by omitting
separators. Which parse is correct is unknown, because the syntax is
ambiguous. No ambiguity exists if the syntax is

_foo[bar]_blahxyz {elephant, cow, orangutang, {xxx}}

because [ is not a token initiator. On the other hand we could say lists are
in [] and [] are also accepted in a dataname, and live with the ambiguity.

> what type of brackets are used to start the list, so why use braces
> anyway?
>
> Put another way, we are in the nice position that following a
> whitespace we can almost always predict the token based purely on the
> first character.
>
> If '_', then it is a dataname
> If <quote> or <double quote> it is a datavalue
> If alphanumeric then it is a non-delimited datavalue, unless the first
> characters are 'loop_' or 'data_'
> If <open bracket> then it is a list

This is true, up to the point that you have an illegal character that causes
ambiguity. ABC is an unquoted string until you read ABC[; now you have
terminated the unquoted string and (likely) initiated a list. You have gone
from one token to another (in the absence of any whitespace). The same
interpretation could happen if the unquoted string were a dataname. Hence my
dilemma, but if others want to live with such ambiguity I am sure it won't
kill CIF2.

> This is true whether or not brackets of any sort are included in the
> allowed characterset for a dataname.
>
> If you disagree with this, I would like to see an example of where
> having brackets in a dataname complicates the grammar compared to not
> having them.

As above.

> On Mon, Nov 9, 2009 at 7:26 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> As I said in my previous email. The gain is that you can determine where you
>> are at a lexical level without having to go further in to the parsing. There
>> is a reason why languages use [] and {} separately, and that ease.
>>
>> If computer scientists have learnt one thing in the last 50 years, it is how
>> to design and specify languages so that you avoid ambiguity and complexity.
>
> Agreed that a different type of bracket for tables is preferable.

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
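
The three readings of the fragment discussed in the message can be illustrated with a toy lexer. This is only a sketch under stated assumptions, not the CIF2 grammar: the function `lex` and its parameters `dataname_extra` and `list_delims` are hypothetical names introduced here, and quoted strings, comments, and the `loop_`/`data_` keywords are left out. A name or value runs until whitespace or a punctuation character; `dataname_extra` lists punctuation that is nevertheless accepted inside a token starting with `_`.

```python
def lex(text, dataname_extra="", list_delims="[]"):
    """Toy tokenizer for a CIF-like fragment (illustrative assumptions only)."""
    punct = set(list_delims) | {","}      # characters that are tokens on their own
    tokens, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():                   # whitespace only separates tokens
            i += 1
            continue
        if c in punct:                    # list delimiter or comma
            tokens.append(("PUNCT", c))
            i += 1
            continue
        # A token starting with '_' is a dataname; anything else is a bare value.
        extra = set(dataname_extra) if c == "_" else set()
        j = i
        while j < n and not text[j].isspace() and (
            text[j] not in punct or text[j] in extra
        ):
            j += 1
        tokens.append(("DATANAME" if c == "_" else "VALUE", text[i:j]))
        i = j
    return tokens


square = "_foo[bar]_blahxyz [elephant, cow, orangutang, [xxx]]"
braces = "_foo[bar]_blahxyz {elephant, cow, orangutang, {xxx}}"

# Reading 1: [] are list delimiters AND legal inside a dataname.
in_name = lex(square, dataname_extra="[]", list_delims="[]")

# Reading 2: [] are list delimiters only; missing whitespace is tolerated,
# so '[' ends the dataname and opens a list.
coerced = lex(square, dataname_extra="", list_delims="[]")

# Reading 3: lists use {}, so '[' never initiates a token.
with_braces = lex(braces, dataname_extra="", list_delims="{}")

for label, toks in [("in_name", in_name), ("coerced", coerced),
                    ("with_braces", with_braces)]:
    print(label, [value for _, value in toks][:5])
```

The first two calls take identical input and differ only in the lexer's rules, which is the ambiguity at issue: the same text yields either one dataname or two datanames with a list between them. With braces as list delimiters the leading token is `_foo[bar]_blahxyz` without any special rule for datanames.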
- Follow-Ups:
- Re: [ddlm-group] CIF-2 changes (James Hester)
- References:
- Re: [ddlm-group] CIF-2 changes (James Hester)