Re: [ddlm-group] CIF-2 changes
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] CIF-2 changes
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Tue, 10 Nov 2009 15:00:27 +0800
- In-Reply-To: <279aad2a0911092206g3c68199at8e06336e0b72b3c0@mail.gmail.com>
On 10/11/09 2:06 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> Thanks Nick for the explanation. I think I understand precisely where
> we differ. You place a higher value on detecting missing whitespace
> than I do - I think that keeping square brackets as list delimiters is
> worth the slight loss of functionality (loss of a coercion rule), but
> this is really something we all have to just evaluate for ourselves
> and vote on, as Herbert has said. I'm just glad we now have a clear
> explanation of exactly what the issue is. I've added some minor
> comments below in any case.

Trust me, it is not that I am fixated on missing whitespace; it is that I
want to build a simplified parser. That is, I handle ABC[ in the same way as
_ABC[ (given we accept that [ initiates a token and there SHOULD be
whitespace separating tokens, but there may not be). Programming languages
handle [] as part of an identifier and don't confuse it with [] as part of
the data because they demand an assignment or other operators between them,
so you always know where you are. STAR/CIF have only whitespace to do that
job!

> In a last valiant effort to argue my case: I really don't want to
> extend the hand of peace to syntactically incorrect files. If there
> are some coercion rules that logically follow from our syntax, jolly
> good, but let's not compromise the syntax for the sake of files that
> aren't correctly written.

I can live with that, but below we address the REAL problem with punctuation
as part of an identifier.

> On the other hand, Doug has raised a further issue in an email: if
> datanames can have square brackets in them, what does that mean for
> dREL methods that will contain those datanames? Will a dREL parser
> fail if it expects the square bracket syntax to refer to list

Well, here is the real issue. I can re-engineer (with considerable work) the
dREL parser, BUT the more important question is what everybody is going to do
on the implementation side of everything (not just dREL).

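The "simplified parser" point above (ABC[ handled the same way as _ABC[) can be illustrated with a minimal sketch. The token rules here are hypothetical and heavily simplified; they are not the CIF2 grammar, and the names `TOKEN_RE` and `tokenize` are made up for illustration. The only assumption taken from the message is that [ always initiates a token, so a missing whitespace separator before it can be recovered.

```python
import re

# Hypothetical, heavily simplified token rules (NOT the real CIF2 grammar).
# '[' is excluded from data names and unquoted strings, so it always
# starts a new token, with or without preceding whitespace.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)                           # whitespace separator
  | (?P<open>\[)                          # list opener: initiates a token
  | (?P<close>\])                         # list closer
  | (?P<comma>,)                          # list item separator
  | (?P<name>_[^\s\[\],]+)                # data name: underscore first
  | (?P<value>[^\s\[\],_][^\s\[\],]*)     # unquoted string
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("ABC["))    # unquoted string, then list opener
print(tokenize("_ABC["))   # data name, then list opener
print(tokenize("_ABC ["))  # same tokens as the line above
```

Under these assumed rules the lexer needs no special case for data names: the same "stop at [" logic splits both inputs, which is the simplification being argued for. Allowing [ inside data names would remove that symmetry.
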
Typically I would expect to generate source code in a static or dynamic
(scripting) language using the data names as identifiers. Now I am in trouble
because _atom_site.U[1][2] or _sint/lambda are NOT acceptable as identifier
names in most languages. The former is, but only if declared as an array.
_atom_site.U[1][2] is an element of an array in the definition but there has
not been (until DDLm) an array object. I can see a torturous way to handle
that, but it is messy.

> elements? Should we define an alternative built-in function for list
> access for these cases?

The problem is that there are characters not acceptable in an identifier
string in the target language. sint/lambda is going to cause problems as an
identifier. There was a reason the dictionaries are filled with new data
names, none of which contain punctuation. In the 2008 submitted dictionaries
the data names are _atom_site_aniso.B_12 (instead of _atom_site_aniso.U[1][2]
in mmCIF) and refln.sin_theta_over_lambda (instead of refln.sint/lambda in
core CIF). We made the problem disappear. It would be nice if we could map
the data name to the identical identifier name in the target language.

>
>
> On Tue, Nov 10, 2009 at 3:25 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>>
>> During my last Sydney stay we discussed that we could define an unquoted
>> string and a data name to be the same production rule, with the addition
>> that the first character of a data name had to be _.
>>
>> What you are saying here is that if we allow [] inside a data name (which is
>> not allowed in an unquoted string) then we can have the best of both worlds:
>> the new grammar (and coercion rules for the data), and we simply accept []
>> inside a data name.
>>
>> If we want to say the characters of a data name are the same as an unquoted
>> string PLUS the [] characters, then this can be done.
>> BUT there is a price to be paid with regard to ambiguity when you include
>> punctuation that initiates a token in the character set of a data name.
>> See below.
>>
>>>
>>> Take the following CIF fragment:
>>>
>>> ...
>>> _foo[bar]_blahxyz [elephant, cow, orangutang, [xxx]]
>>>
>>> A lexer will tokenize the first entry as 'dataname', with a value of
>>> '_foo[bar]_blahxyz', because it will continue eating characters until
>>> it gets to a disallowed character, or the token separator
>>> (whitespace). It then tokenizes all whitespace the same way, by
>>> including all characters included in the definition of whitespace, and
>>> then tokenizes the single open square bracket. In what way has having
>>> an open square bracket inside the dataname complicated the parse?
>>> Would this be simpler without square brackets in the list of allowed
>>> characters for a dataname? Note that the parse is identical no matter
>>
>> I would say yes. Because parsing what you have written, and employing the
>> coercion rules for a missing whitespace separator, I would have come up with
>>
>> _foo [bar]
>> _blahxyz [elephant, cow, orangutang, [xxx]]
>>
>> which is a parse predicated on the user making an error in missing
>> separators. Which is the correct parse is unknown because the syntax is
>> ambiguous.
>
> I view the coercion rules as logical consequences of the syntax we
> define, not as part of the syntax itself (is that wrong?). If we
> include square brackets in datanames, the coercion rule you describe
> ceases to exist and so does the ambiguity.
>
>> No ambiguity exists if the syntax is
>>
>> _foo[bar]_blahxyz {elephant, cow, orangutang, {xxx}}
>>
>> because [ is not a token initiator.
>>
>> On the other hand we could say lists are in [] and [] are also accepted in a
>> dataname and live with the ambiguity.
>
> More precisely, live without the coercion rule.
> The syntax itself is unambiguous - as there is no whitespace between the
> dataname and the square bracket, the square bracket is part of the dataname.
>
>>> what type of brackets are used to start the list, so why use braces
>>> anyway?
>>>
>>> Put another way, we are in the nice position that following a
>>> whitespace we can almost always predict the token based purely on the
>>> first character.
>>>
>>> If '_', then it is a dataname
>>> If <quote> or <double quote> it is a datavalue
>>> If alphanumeric then it is a non-delimited datavalue, unless the first
>>> characters are 'loop_' or 'data_'
>>> If <open bracket> then it is a list
>>
>> This is true, up to the point that you have an illegal character that causes
>> ambiguity. ABC is an unquoted string until you read ABC[; now you have
>> terminated the unquoted string and (likely) initiated a list. You have gone
>> from one token to another (in the absence of any whitespace). The same
>> interpretation could happen if the unquoted string were a dataname.
>
> As square brackets are not allowed in unquoted strings, you get to
> keep the coercion rule in this latest example here. Mind you, the
> file is syntactically incorrect, so if it were up to me, I'd send the
> file back with maximum prejudice.

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
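
The identifier problem raised in the message above can be made concrete with a small check, taking Python as one possible target language. This is only a sketch under assumptions not stated in the thread: the helper `is_usable_identifier` is hypothetical, and it assumes a code generator would treat the '.' in a data name as a category/attribute separator and require each part to be a legal identifier.

```python
def is_usable_identifier(data_name: str) -> bool:
    """Return True if every '.'-separated part is a valid Python identifier.

    Assumption: '.' is mapped to attribute access in the generated code.
    Note that str.isidentifier() does not reject reserved words; a real
    generator would need an additional keyword check.
    """
    return all(part.isidentifier() for part in data_name.split("."))


# Data names quoted in the message above.
names = [
    "_atom_site.U[1][2]",           # brackets: not an identifier
    "_sint/lambda",                 # slash: not an identifier
    "_atom_site_aniso.B_12",        # punctuation-free replacement
    "refln.sin_theta_over_lambda",  # punctuation-free replacement
]

for name in names:
    print(f"{name}: {is_usable_identifier(name)}")
```

Under this assumption the two older names are rejected and the two punctuation-free names pass, which is the sense in which the renamed items can map directly to identifiers in a target language.
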
- Follow-Ups:
- Re: [ddlm-group] CIF-2 changes (David Brown)
- References:
- Re: [ddlm-group] CIF-2 changes (James Hester)