Re: [ddlm-group] CIF-2 changes
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] CIF-2 changes
- From: James Hester <jamesrhester@gmail.com>
- Date: Tue, 10 Nov 2009 17:06:32 +1100
- In-Reply-To: <C71F0C2F.123AB%nick@csse.uwa.edu.au>
- References: <279aad2a0911090330o6adeeb29he87cef7486071743@mail.gmail.com><C71F0C2F.123AB%nick@csse.uwa.edu.au>
Thanks Nick for the explanation. I think I understand precisely where we differ. You place a higher value on detecting missing whitespace than I do - I think that keeping square brackets as list delimiters is worth the slight loss of functionality (loss of a coercion rule), but this is really something we all have to just evaluate for ourselves and vote on, as Herbert has said. I'm just glad we now have a clear explanation of exactly what the issue is. I've added some minor comments below in any case.

In a last valiant effort to argue my case: I really don't want to extend the hand of peace to syntactically incorrect files. If there are some coercion rules that logically follow from our syntax, jolly good, but let's not compromise the syntax for the sake of files that aren't correctly written.

On the other hand, Doug has raised a further issue in an email: if datanames can have square brackets in them, what does that mean for dREL methods that will contain those datanames? Will a dREL parser fail if it expects the square bracket syntax to refer to list elements? Should we define an alternative built-in function for list access for these cases?

On Tue, Nov 10, 2009 at 3:25 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>
> During my last Sydney stay we discussed that we could define an unquoted
> string and a data name to be the same production rule with the addition the
> first character of a data name had to be _.
>
> What you are saying here is that if we allow [] inside a data name (which is
> not allowed in an unquoted string) then we can have the best of both worlds.
> The new grammar (and coercion rules for the data) and we simply accept []
> inside a data name.
>
> If we want to say the characters of a data name are the same as an unquoted
> string PLUS the [] characters, then this can be done. BUT there is a price
> to be paid with regard to ambiguity when you include punctuation that
> initiates a token in to the character set of a data name. See below.
>
>>
>> Take the following CIF fragment:
>>
>> ...
>> _foo[bar]_blahxyz [elephant, cow, orangutang, [xxx]]
>>
>> A lexer will tokenize the first entry as 'dataname', with a value of
>> '_foo[bar]_blahxyz', because it will continue eating characters until
>> it gets to a disallowed character, or the token separator
>> (whitespace). It then tokenises all whitespace the same way, by
>> including all characters included in the definition of whitespace, and
>> then tokenizes the single open square bracket. In what way has having
>> an open square bracket inside the dataname complicated the parse?
>> Would this be simpler without square brackets in the list of allowed
>> characters for a dataname? Note that the parse is identical no matter
>
> I would say yes. Because parsing what you have written, and employing the
> coercion rules for missing whitespace separator I would have come up with,
>
> _foo [bar]
> _blahxyz [elephant, cow, orangutang, [xxx]]
>
> Which is a parse predicated on the user making an error in missing
> separators. Which is the correct parse is unknown because the syntax is
> ambiguous.

I view the coercion rules as logical consequences of the syntax we define, not as part of the syntax itself (is that wrong?). If we include square brackets in datanames, the coercion rule you describe ceases to exist and so does the ambiguity.

> No ambiguity exists if the syntax is
>
> _foo[bar]_blahxyz {elephant, cow, orangutang, {xxx}}
>
> because [ is not token initiator.
>
> On the other hand we could say lists are in [] and [] are also accepted in a
> dataname and live with the ambiguity.

More precisely, live without the coercion rule. The syntax itself is unambiguous - as there is no whitespace between the dataname and the square bracket, the square bracket is part of the dataname.

>> what type of brackets are used to start the list, so why use braces
>> anyway?
>>
>> Put another way, we are in the nice position that following a
>> whitespace we can almost always predict the token based purely on the
>> first character.
>>
>> If '_', then it is a dataname
>> If <quote> or <double quote> it is a datavalue
>> If alphanumeric then it is a non-delimited datavalue, unless the first
>> characters are 'loop_' or 'data_'
>> If <open bracket> then it is a list
>
> This is true, up to the point that you have an illegal character that causes
> ambiguity. ABC is an unquoted string until you read ABC[, now you have
> terminated the unquoted string and (likely) initiated a list. You have gone
> from one token to another (in the absence of any whitespace). Same
> interpretation could happen if the unquoted string were a dataname.

As square brackets are not allowed in unquoted strings, you get to keep the coercion rule in this latest example here. Mind you, the file is syntactically incorrect, so if it were up to me, I'd send the file back with maximum prejudice.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
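
As a rough illustration of the first-character rule and the "keep eating characters until whitespace" dataname lexing discussed in the message above, here is a minimal Python sketch. It assumes the proposal under discussion, namely that square brackets are permitted inside datanames and that only whitespace ends a dataname. The names classify and lex_dataname are hypothetical and are not taken from any CIF software, and the sketch ignores quoting, comments and text fields.

```python
from typing import Tuple

WHITESPACE = " \t\r\n"


def classify(token: str) -> str:
    """Guess a token's type from its leading characters (rule quoted above)."""
    lowered = token.lower()
    # 'loop_' and 'data_' are the stated exceptions to the alphanumeric case.
    if lowered.startswith("loop_") or lowered.startswith("data_"):
        return "reserved"
    first = token[0]
    if first == "_":
        return "dataname"
    if first in "'\"":
        return "quoted datavalue"
    if first == "[":
        return "list"
    if first.isalnum():
        return "unquoted datavalue"
    return "unknown"


def lex_dataname(text: str, start: int = 0) -> Tuple[str, int]:
    """Consume a dataname greedily, stopping only at whitespace.

    Assumption: '[' and ']' are legal dataname characters, so they do not
    terminate the token.
    """
    end = start
    while end < len(text) and text[end] not in WHITESPACE:
        end += 1
    return text[start:end], end


fragment = "_foo[bar]_blahxyz [elephant, cow, orangutang, [xxx]]"
name, pos = lex_dataname(fragment)
rest = fragment[pos:].lstrip(WHITESPACE)

print(name, "->", classify(name))
print(rest[0], "->", classify(rest))
```

Under these assumptions the whole of _foo[bar]_blahxyz is one dataname and the list starts only at the bracket that follows the whitespace. A lexer that instead treated '[' as a token initiator inside datanames would need the missing-whitespace coercion rule to choose between the two readings.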