Re: [ddlm-group] CIF-2 changes

Thanks Nick for the explanation.  I think I understand precisely where
we differ. You place a higher value on detecting missing whitespace
than I do - I think that keeping square brackets as list delimiters is
worth the slight loss of functionality (loss of a coercion rule), but
this is really something we all have to just evaluate for ourselves
and vote on, as Herbert has said.  I'm just glad we now have a clear
explanation of exactly what the issue is.  I've added some minor
comments below in any case.

In a last valiant effort to argue my case:  I really don't want to
extend the hand of peace to syntactically incorrect files.  If there
are some coercion rules that logically follow from our syntax, jolly
good, but let's not compromise the syntax for the sake of files that
aren't correctly written.

On the other hand, Doug has raised a further issue in an email: if
datanames can have square brackets in them, what does that mean for
dREL methods that will contain those datanames?  Will a dREL parser
fail if it expects the square bracket syntax to refer to list
elements?  Should we define an alternative built-in function for list
access for these cases?


On Tue, Nov 10, 2009 at 3:25 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>
>
>
> During my last Sydney stay we discussed that we could define an unquoted
> string and a data name to be the same production rule with the addition that the
> first character of a data name had to be _.
>
> What you are saying here is that if we allow [] inside a data name (which is
> not allowed in an unquoted string) then we can have the best of both worlds.
> The new grammar (and coercion rules for the data) and we simply accept []
> inside a data name.
>
> If we want to say the characters of a data name are the same as an unquoted
> string PLUS the [] characters, then this can be done. BUT there is a price
> to be paid with regard to ambiguity when you include punctuation that
> initiates a token into the character set of a data name. See below.
>
>>
>> Take the following CIF fragment:
>>
>> ...
>> _foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]
>>
>> A lexer will tokenize the first entry as 'dataname', with a value of
>> '_foo[bar]_blahxyz', because it will continue eating characters until
>> it gets to a disallowed character, or the token separator
>> (whitespace).  It then tokenises all whitespace the same way, by
>> including all characters included in the definition of whitespace, and
>> then tokenizes the single open square bracket.  In what way has having
>> an open square bracket inside the dataname complicated the parse?
>> Would this be simpler without square brackets in the list of allowed
>> characters for a dataname?  Note that the parse is identical no matter
>
> I would say yes. Because parsing what you have written, and employing the
> coercion rules for missing whitespace separator I would have come up with,
>
> _foo [bar]
> _blahxyz      [elephant, cow, orangutang, [xxx]]
>
> Which is a parse predicated on the user making an error in missing
> separators. Which is the correct parse is unknown because the syntax is
> ambiguous.

I view the coercion rules as logical consequences of the syntax we
define, not as part of the syntax itself (is that wrong?).  If we
include square brackets in datanames, the coercion rule you describe
ceases to exist and so does the ambiguity.

> No ambiguity exists if the syntax is
>
> _foo[bar]_blahxyz      {elephant, cow, orangutang, {xxx}}
>
> because [ is not a token initiator.
>
> On the other hand we could say lists are in [] and [] are also accepted in a
> dataname and live with the ambiguity.

More precisely, live without the coercion rule.  The syntax itself is
unambiguous - as there is no whitespace between the dataname and the
square bracket, the square bracket is part of the dataname.
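
To make the two readings concrete, here is a rough Python sketch of a
greedy lexer. It is not a CIF-2 grammar: the token names, the function
name tokenize and the brackets_in_dataname switch are all invented for
this illustration, and quoted strings, comments and reserved words are
ignored. The only thing it shows is how the same line splits depending
on whether square brackets belong to the dataname character set.

```python
import re

def tokenize(text, brackets_in_dataname=True):
    """Greedy tokenizer sketch (hypothetical, not the CIF-2 grammar)."""
    # A dataname runs until whitespace; if brackets are NOT allowed in
    # datanames, it also stops at the first '[' or ']'.
    name_body = r"[^\s]*" if brackets_in_dataname else r"[^\s\[\]]*"
    spec = [
        ("DATANAME", "_" + name_body),
        ("WS", r"\s+"),
        ("OPEN", r"\["),
        ("CLOSE", r"\]"),
        ("COMMA", r","),
        ("VALUE", r"[^\s\[\],]+"),
    ]
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in spec))
    # Whitespace only separates tokens, so it is dropped from the result.
    return [(m.lastgroup, m.group())
            for m in pattern.finditer(text)
            if m.lastgroup != "WS"]

line = "_foo[bar]_blahxyz      [elephant, cow, orangutang, [xxx]]"

# Brackets allowed in datanames: one dataname, then the list opens.
with_brackets = tokenize(line, brackets_in_dataname=True)

# Brackets disallowed: the lexer stops at '[' and, with no whitespace
# required, yields the "coerced" reading _foo [bar] _blahxyz [...].
without_brackets = tokenize(line, brackets_in_dataname=False)
```

In the first call the leading token is the whole of _foo[bar]_blahxyz;
in the second the same characters come out as _foo, a one-element list,
and a second dataname _blahxyz, which is the coerced parse Nick describes.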

>> what type of brackets are used to start the list, so why use braces
>> anyway?
>>
>> Put another way, we are in the nice position that following a
>> whitespace we can almost always predict the token based purely on the
>> first character.
>>
>> If '_', then it is a dataname
>> If <quote> or <double quote> it is a datavalue
>> If alphanumeric then it is a non-delimited datavalue, unless the first
>> characters are 'loop_' or 'data_'
>> If <open bracket> then it is a list
>
> This is true, up to the point that you have an illegal character that causes
> ambiguity. ABC is an unquoted string until you read ABC[, now you have
> terminated the unquoted string and (likely) initiated a list. You have gone
> from one token to another (in the absence of any whitespace). Same
> interpretation could happen if the unquoted string were a dataname.

As square brackets are not allowed in unquoted strings, you get to
keep the coercion rule in this latest example here.  Mind you, the
file is syntactically incorrect, so if it were up to me, I'd send the
file back with maximum prejudice.
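
For what it's worth, the first-character rules quoted above can be
written down as a small dispatch function. This is only a sketch under
my own assumptions: the function name classify and the returned labels
are made up, it expects a whitespace-delimited chunk as input, and it
treats 'loop_' and 'data_' case-insensitively, which may or may not be
what we finally specify.

```python
def classify(chunk):
    """Guess the token kind from the start of a whitespace-delimited chunk.

    Hypothetical helper for illustration; labels are not from any spec.
    """
    first = chunk[:1]          # empty string if chunk is empty
    lowered = chunk.lower()    # assumption: keywords are case-insensitive
    if first == "_":
        return "dataname"
    if first in ("'", '"'):
        return "quoted datavalue"
    if first == "[":
        return "list"
    # Keywords must be checked before the generic alphanumeric rule.
    if lowered.startswith("loop_"):
        return "loop keyword"
    if lowered.startswith("data_"):
        return "data block header"
    if first.isalnum():
        return "unquoted datavalue"
    return "unrecognised"

kinds = [classify(c) for c in
         ["_foo[bar]_blahxyz", "'quoted'", "[elephant,", "loop_", "data_x", "ABC"]]
```

Note that nothing here needs to look beyond the start of the chunk,
which is the point: after whitespace, the first character (plus the two
keyword prefixes) is enough to pick the token type.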

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group

