Discussion List Archives

Re: [ddlm-group] CIF-2 changes

On 10/11/09 2:06 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> Thanks Nick for the explanation.  I think I understand precisely where
> we differ. You place a higher value on detecting missing whitespace
> than I do - I think that keeping square brackets as list delimiters is
> worth the slight loss of functionality (loss of a coercion rule), but
> this is really something we all have to just evaluate for ourselves
> and vote on, as Herbert has said.  I'm just glad we now have a clear
> explanation of exactly what the issue is.  I've added some minor
> comments below in any case.

Trust me, it is not that I am fixated on missing whitespace; it is that I
want to build a simplified parser. That is, I handle ABC[ in the same way as
_ABC[ (given we accept that [ initiates a token and that there SHOULD be
whitespace separating tokens, though there may not be). Programming languages
handle [] as part of an identifier and don't confuse it with [] as part of the
data because they demand an assignment or other operators between them, so you
always know where you are. STAR/CIF has only whitespace to do that job!

> In a last valiant effort to argue my case:  I really don't want to
> extend the hand of peace to syntactically incorrect files.  If there
> are some coercion rules that logically follow from our syntax, jolly
> good, but let's not compromise the syntax for the sake of files that
> aren't correctly written.

I can live with that, but below we address the REAL problem with punctuation
as part of an identifier.

> On the other hand, Doug has raised a further issue in an email: if
> datanames can have square brackets in them, what does that mean for
> dREL methods that will contain those datanames?  Will a dREL parser
> fail if it expects the square bracket syntax to refer to list

Well, here is the real issue. I can re-engineer (with considerable work) the
dREL parser, BUT the more important question is what everybody is going to
do on the implementation side of everything (not just dREL).

Typically I would expect to generate source code in a static or dynamic
(scripting) language using the data names as identifiers. Now I am in
trouble because _atom_site.U[1][2] or _sint/lambda are NOT acceptable as
identifier names in most languages. The former is, but only if declared as
an array. _atom_site.U[1][2] is an element of an array in the definition, but
there has not been (until DDLm) an array object. I can see a tortuous way
to handle that, but it is messy.

> elements?  Should we define an alternative built-in function for list
> access for these cases?

The problem is there are characters not acceptable as an identifier string
in the target language. sint/lambda is going to cause problems as an
identifier. There was a reason the dictionaries are filled with new data
names, none of which contain punctuation. In the 2008 submitted dictionaries
the data names are _atom_site_aniso.B_12 (instead of
_atom_site_aniso.U[1][2] in mmCIF) and refln.sin_theta_over_lambda (instead
of refln.sint/lambda in core CIF). We made the problem disappear.

It would be nice if we could map the data name to the identical identifier
name in the target language.
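
To make the point concrete, here is a minimal sketch, assuming Python as the
target language and assuming the '.' between category and attribute is mapped
to attribute access. The helper name as_identifier_path is hypothetical and
only for illustration; it is not part of dREL or any CIF library.

```python
import keyword


def as_identifier_path(data_name):
    """Return the dotted parts of a data name if every part could be used
    verbatim as a Python identifier, otherwise None.

    Hypothetical helper: splitting on '.' is an assumption about how a
    category.attribute name would be mapped onto the target language.
    """
    parts = data_name.split(".")
    for part in parts:
        # str.isidentifier() rejects '[', ']', '/' and similar punctuation;
        # reserved words are rejected separately.
        if not part.isidentifier() or keyword.iskeyword(part):
            return None
    return parts


if __name__ == "__main__":
    for name in [
        "_atom_site.U[1][2]",           # brackets: not an identifier
        "_sint/lambda",                 # slash: not an identifier
        "_atom_site_aniso.B_12",        # punctuation-free replacement
        "refln.sin_theta_over_lambda",  # punctuation-free replacement
    ]:
        print(name, "->", as_identifier_path(name))
```

Under those assumptions the two names containing punctuation come back as
None, while the punctuation-free names map straight across with no renaming
table in between.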

> On Tue, Nov 10, 2009 at 3:25 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> During my last Sydney stay we discussed that we could define an unquoted
>> string and a data name to be the same production rule, with the addition
>> that the first character of a data name had to be _.
>> What you are saying here is that if we allow [] inside a data name (which is
>> not allowed in an unquoted string) then we can have the best of both worlds.
>> The new grammar (and coercion rules for the data) and we simply accept []
>> inside a data name.
>> If we want to say the characters of a data name are the same as an unquoted
>> string PLUS the [] characters, then this can be done. BUT there is a price
>> to be paid with regard to ambiguity when you include punctuation that
>> initiates a token into the character set of a data name. See below.
>>> Take the following CIF fragment:
>>> ...
>>> _foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]
>>> A lexer will tokenize the first entry as 'dataname', with a value of
>>> '_foo[bar]_blahxyz', because it will continue eating characters until
>>> it gets to a disallowed character, or the token separator
>>> (whitespace).  It then tokenises all whitespace the same way, by
>>> including all characters included in the definition of whitespace, and
>>> then tokenizes the single open square bracket.  In what way has having
>>> an open square bracket inside the dataname complicated the parse?
>>> Would this be simpler without square brackets in the list of allowed
>>> characters for a dataname?  Note that the parse is identical no matter
>> I would say yes. Because parsing what you have written, and employing the
>> coercion rules for missing whitespace separator I would have come up with,
>> _foo [bar]
>> _blahxyz      [elephant, cow, orangutang, [xxx]]
>> Which is a parse predicated on the user making an error in missing
>> separators. Which is the correct parse is unknown because the syntax is
>> ambiguous.
> I view the coercion rules as logical consequences of the syntax we
> define, not as part of the syntax itself (is that wrong?).  If we
> include square brackets in datanames, the coercion rule you describe
> ceases to exist and so does the ambiguity.
>> No ambiguity exists if the syntax is
>> _foo[bar]_blahxyz      {elephant, cow, orangutang, {xxx}}
>> because [ is not a token initiator.
>> On the other hand we could say lists are in [] and [] are also accepted in a
>> dataname and live with the ambiguity.
> More precisely, live without the coercion rule.  The syntax itself is
> unambiguous - as there is no whitespace between the dataname and the
> square bracket, the square bracket is part of the dataname.
>>> what type of brackets are used to start the list, so why use braces
>>> anyway?
>>> Put another way, we are in the nice position that following a
>>> whitespace we can almost always predict the token based purely on the
>>> first character.
>>> If '_', then it is a dataname
>>> If <quote> or <double quote> it is a datavalue
>>> If alphanumeric then it is a non-delimited datavalue, unless the first
>>> characters are 'loop_' or 'data_'
>>> If <open bracket> then it is a list
>> This is true, up to the point that you have an illegal character that causes
>> ambiguity. ABC is an unquoted string until you read ABC[, now you have
>> terminated the unquoted string and (likely) initiated a list. You have gone
>> from one token to another (in the absence of any whitespace). Same
>> interpretation could happen if the unquoted string were a dataname.
> As square brackets are not allowed in unquoted strings, you get to
> keep the coercion rule in this latest example here.  Mind you, the
> file is syntactically incorrect, so if it were up to me, I'd send the
> file back with maximum prejudice.
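
The two readings of the _foo[bar]_blahxyz fragment can be sketched with a toy
lexer. This is an illustration only, not the CIF-2 grammar: the function
tokenize, the flag brackets_in_datanames and the token kinds are hypothetical
names, and quoted strings, loop_ and data_ are deliberately left out.

```python
import re


def tokenize(text, brackets_in_datanames):
    """Split a CIF-like fragment into (kind, text) tokens.

    brackets_in_datanames=True : '[' and ']' are ordinary dataname characters.
    brackets_in_datanames=False: '[' always starts a list token, so a dataname
    ends at the first bracket even when no whitespace precedes it (the
    coercion rule for a missing separator).
    """
    name_body = r"\S+" if brackets_in_datanames else r"[^\s\[\]]+"
    pattern = re.compile(
        r"(?P<ws>\s+)"
        r"|(?P<dataname>_" + name_body + r")"
        r"|(?P<open>\[)"
        r"|(?P<close>\])"
        r"|(?P<comma>,)"
        # Unquoted values never contain brackets in either mode.
        r"|(?P<value>[^\s\[\],]+)"
    )
    tokens = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        if match.lastgroup != "ws":  # whitespace only separates tokens
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    fragment = "_foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]"
    for flag in (True, False):
        print(flag, tokenize(fragment, flag))
```

With the flag set, the first token is the single dataname _foo[bar]_blahxyz
followed by one list. With it cleared, the same characters come out as _foo,
a list [bar], a second dataname _blahxyz and then the list. In both modes
ABC[ splits into the value ABC and an opening bracket, since brackets are
never allowed in an unquoted string.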



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
