
Re: [ddlm-group] CIF-2 changes

On 10/11/09 2:06 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> Thanks Nick for the explanation.  I think I understand precisely where
> we differ. You place a higher value on detecting missing whitespace
> than I do - I think that keeping square brackets as list delimiters is
> worth the slight loss of functionality (loss of a coercion rule), but
> this is really something we all have to just evaluate for ourselves
> and vote on, as Herbert has said.  I'm just glad we now have a clear
> explanation of exactly what the issue is.  I've added some minor
> comments below in any case.

Trust me, it is not that I am fixated on missing whitespace; it is that I
want to build a simplified parser. That is, I handle ABC[ in the same way as
_ABC[ (given we accept that [ initiates a token and there SHOULD be
whitespace separating tokens, but may not be). Programming languages can
treat [] as part of an identifier without confusing it with [] as part of a
data value because they demand an assignment or other operator between them,
so you always know where you are. STAR/CIF only has whitespace to do that job!
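A minimal sketch of that simplified-parser idea, in Python (illustrative only, not the CIF-2 grammar): because whitespace and the brackets themselves are the only token boundaries, ABC[ and _ABC[ split identically.

```python
# Illustrative lexer sketch: '[' and ']' always initiate (one-character)
# tokens, so a bracket glued to preceding text still splits it off.
def tokenize(text):
    tokens, current = [], ""
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append(current)
            current = ""
        elif ch in "[]":
            # A bracket ends the current token and stands alone.
            if current:
                tokens.append(current)
            current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("ABC[1, 2]"))   # ['ABC', '[', '1,', '2', ']']
print(tokenize("_ABC[1, 2]"))  # ['_ABC', '[', '1,', '2', ']']
```

This is exactly why allowing [ inside a data name breaks the scheme: the lexer cannot keep the bracket in the name without extra context.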

> In a last valiant effort to argue my case:  I really don't want to
> extend the hand of peace to syntactically incorrect files.  If there
> are some coercion rules that logically follow from our syntax, jolly
> good, but let's not compromise the syntax for the sake of files that
> aren't correctly written.

I can live with that, but below we address the REAL problem with punctuation
as part of an identifier.

> On the other hand, Doug has raised a further issue in an email: if
> datanames can have square brackets in them, what does that mean for
> dREL methods that will contain those datanames?  Will a dREL parser
> fail if it expects the square bracket syntax to refer to list

Well, here is the real issue. I can re-engineer the dREL parser (with
considerable work), BUT the more important question is what everybody is
going to do on the implementation side of everything (not just dREL).

Typically I would expect to generate source code in a static or dynamic
(scripting) language, using the data names as identifiers. Now I am in
trouble because _atom_site.U[1][2] and _sint/lambda are NOT acceptable as
identifier names in most languages. The former is, but only if U is declared
as an array. _atom_site.U[1][2] is an element of an array in the definition,
but there has not been (until DDLm) an array object. I can see a tortuous
way to handle that, but it is messy.
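To make the array point concrete, here is a Python sketch (the class and names are hypothetical): U[1][2] is only meaningful as indexing into a pre-declared array object, never as a plain identifier.

```python
# Hypothetical illustration: the mmCIF-style name _atom_site.U[1][2]
# cannot be a plain identifier in a target language; U[1][2] parses as
# array indexing, which works only if U was declared as an array first.
class AtomSite:
    def __init__(self):
        # Pre-declare U as a 3x3 array object so U[1][2] means anything.
        self.U = [[0.0] * 3 for _ in range(3)]

atom_site = AtomSite()
atom_site.U[1][2] = 0.05   # legal, but only as indexing into the array
print(atom_site.U[1][2])   # 0.05
# There is no way at all to spell an identifier containing '/',
# e.g. sint/lambda, which a compiler reads as a division expression.
```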

> elements?  Should we define an alternative built-in function for list
> access for these cases?

The problem is that there are characters not acceptable in an identifier
string in the target language. sint/lambda is going to cause problems as an
identifier. There was a reason the dictionaries were filled with new data
names, none of which contain punctuation. In the 2008 submitted dictionaries
the data names are _atom_site_aniso.B_12 (instead of
_atom_site_aniso.U[1][2] in mmCIF) and refln.sin_theta_over_lambda (instead
of refln.sint/lambda in core CIF). We made the problem disappear.

It would be nice if we could map the data name to the identical identifier
name in the target language.
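Failing that, one is forced into some mangling scheme. A hypothetical sketch in Python (this particular mapping is my own invention, not part of DDLm or any dictionary):

```python
import re

# Hypothetical mangling: rewrite punctuation so every data name becomes
# a legal identifier in the target language, at the cost of the
# identifier no longer being identical to the data name.
def mangle(data_name):
    name = data_name.lstrip("_")
    name = name.replace("[", "_").replace("]", "")
    name = name.replace("/", "_over_").replace(".", "__")
    # Catch-all for any remaining punctuation.
    return re.sub(r"[^0-9A-Za-z_]", "_", name)

print(mangle("_atom_site.U[1][2]"))   # atom_site__U_1_2
print(mangle("_refln.sint/lambda"))   # refln__sint_over_lambda
```

The cost is visible immediately: round-tripping from identifier back to data name needs a lookup table, since the mangling is not invertible in general.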

> On Tue, Nov 10, 2009 at 3:25 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> During my last Sydney stay we discussed that we could define an unquoted
>> string and a data name to be the same production rule, with the addition
>> that the first character of a data name had to be _.
>> What you are saying here is that if we allow [] inside a data name (which is
>> not allowed in an unquoted string) then we can have the best of both worlds.
>> The new grammar (and coercion rules for the data) and we simply accept []
>> inside a data name.
>> If we want to say the characters of a data name are the same as an unquoted
>> string PLUS the [] characters, then this can be done. BUT there is a price
>> to be paid with regard to ambiguity when you include punctuation that
>> initiates a token into the character set of a data name. See below.
>>> Take the following CIF fragment:
>>> ...
>>> _foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]
>>> A lexer will tokenize the first entry as 'dataname', with a value of
>>> '_foo[bar]_blahxyz', because it will continue eating characters until
>>> it gets to a disallowed character, or the token separator
>>> (whitespace).  It then tokenises all whitespace the same way, by
>>> including all characters included in the definition of whitespace, and
>>> then tokenizes the single open square bracket.  In what way has having
>>> an open square bracket inside the dataname complicated the parse?
>>> Would this be simpler without square brackets in the list of allowed
>>> characters for a dataname?  Note that the parse is identical no matter
>> I would say yes. Because, parsing what you have written and employing the
>> coercion rules for a missing whitespace separator, I would have come up with
>> _foo [bar]
>> _blahxyz      [elephant, cow, orangutang, [xxx]]
>> which is a parse predicated on the user making an error in omitting
>> separators. Which parse is correct is unknown because the syntax is
>> ambiguous.
> I view the coercion rules as logical consequences of the syntax we
> define, not as part of the syntax itself (is that wrong?).  If we
> include square brackets in datanames, the coercion rule you describe
> ceases to exist and so does the ambiguity.
>> No ambiguity exists if the syntax is
>> _foo[bar]_blahxyz      {elephant, cow, orangutang, {xxx}}
>> because [ is not a token initiator.
>> On the other hand we could say lists are in [] and [] are also accepted in a
>> dataname and live with the ambiguity.
> More precisely, live without the coercion rule.  The syntax itself is
> unambiguous - as there is no whitespace between the dataname and the
> square bracket, the square bracket is part of the dataname.
>>> what type of brackets are used to start the list, so why use braces
>>> anyway?
>>> Put another way, we are in the nice position that following a
>>> whitespace we can almost always predict the token based purely on the
>>> first character.
>>> If '_', then it is a dataname
>>> If <quote> or <double quote> it is a datavalue
>>> If alphanumeric then it is a non-delimited datavalue, unless the first
>>> characters are 'loop_' or 'data_'
>>> If <open bracket> then it is a list
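That first-character prediction can be written down as a simple dispatch (an illustrative sketch; save_ frames and other reserved words are omitted):

```python
# Sketch of predicting the token kind from its first character(s)
# after whitespace, following the list above.
def predict(token):
    if token.startswith("_"):
        return "dataname"
    if token[0] in ("'", '"'):
        return "quoted datavalue"
    if token.startswith("["):
        return "list"
    if token.startswith(("loop_", "data_")):
        return "keyword"
    return "non-delimited datavalue"

assert predict("_atom_site.label") == "dataname"
assert predict("[1, 2, 3]") == "list"
assert predict("loop_") == "keyword"
assert predict("1.234") == "non-delimited datavalue"
```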
>> This is true, up to the point that you have an illegal character that causes
>> ambiguity. ABC is an unquoted string until you read ABC[; now you have
>> terminated the unquoted string and (likely) initiated a list. You have gone
>> from one token to another (in the absence of any whitespace). The same
>> interpretation could happen if the unquoted string were a dataname.
> As square brackets are not allowed in unquoted strings, you get to
> keep the coercion rule in this latest example here.  Mind you, the
> file is syntactically incorrect, so if it were up to me, I'd send the
> file back with maximum prejudice.



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

ddlm-group mailing list
