
Re: Fine-tuning CIF dictionary regexes

On Mon, 18 Apr 2005, James Hester wrote:

> The point I want to discuss boils down to: should the regular
> expressions in the CIF dictionary be fine-tuned to be compatible with
> more than just POSIX-compliant regular expression engines?

POSIX compliance guarantees that the engine keeps consuming the input
string until it has found the longest matching sequence. This is necessary
to get the "correct" token.

>
> The following two constructs from mm_cif, although POSIX compliant, will
> not correctly match in a Perl or Python or Tcl regular expression (and
> any other NFA engine)
>
> floating point numbers:
>
> '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
>
> symmetry operations:
> '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
>
> The problem is that the non-POSIX engines will go through the
> alternations (separated by |) in the above expressions from left to
> right, returning the first match, and as the second part is optional,
> there is no requirement to match it.  In contrast, a POSIX engine must
> return the longest match.  So e.g. if Python is fed the number
> 78.456(22), "78." will be matched by the floating point expression, as
> this satisfies the first part of the alternation, and everything else in
> the regular expression is optional.
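
The behaviour described above can be reproduced with a short Python
sketch. It assumes the standard `re` module; the names FLOAT_ORIGINAL,
prefix and whole are only illustrative. The pattern is the floating point
expression quoted from mm_cif, unchanged.

```python
import re

# Floating point pattern quoted above from mm_cif, unchanged.
FLOAT_ORIGINAL = re.compile(
    r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
)

text = "78.456(22)"

# re.match anchors only at the start of the string. The first alternative
# succeeds on "78." and everything after it is optional, so the engine
# stops there instead of looking for a longer match.
prefix = FLOAT_ORIGINAL.match(text).group(0)

# re.fullmatch requires the whole string to match, which forces the engine
# to backtrack into the second alternative.
whole = FLOAT_ORIGINAL.fullmatch(text)

print(repr(prefix))
print(whole is not None)
```

When the expression is only used to validate a complete value, anchoring
at both ends in this way may sidestep the problem; it does not help when
the expression is used to pull a token off the front of a longer input.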

But if you are throwing the "number" at a series of compiled regular
expressions, won't 78.456(22) also match '7', an integer, and return the
INT token if that happens to be the first rule it comes across?
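
A rough sketch of that concern, in Python. The rule table, the rule names
and the INT pattern are assumptions made for illustration and are not
taken from any CIF dictionary; only the FLOAT pattern is the one quoted
above. With a greedy integer rule the leading run of digits is consumed.

```python
import re

# Hypothetical rule table: names, order and the INT pattern are assumptions.
RULES = [
    ("INT", re.compile(r"-?[0-9]+")),
    ("FLOAT", re.compile(
        r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
    )),
]

def first_rule_wins(text):
    """Return (token, lexeme) for the first rule matching a prefix of text."""
    for name, pattern in RULES:
        m = pattern.match(text)
        if m:
            return name, m.group(0)
    return None

def longest_rule_wins(text):
    """Try every rule and keep the longest lexeme; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

print(first_rule_wins("78.456(22)"))
print(longest_rule_wins("78.456(22)"))
```

Comparing lengths across rules removes the dependence on rule order, but
each individual lexeme is still whatever the leftmost-first engine
returned for that rule, so the alternation problem inside the FLOAT
pattern remains.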

>
> One suggestion is that these two regular expressions are re-ordered so
> that those alternatives in an alternation which are a subset of other
> alternatives come later.  This remains POSIX-compliant and means many
> non-POSIX engines will find the longest match.

Are you sure you can order the alternatives such that this eliminates
every instance of the problem you describe?
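
For the two expressions quoted above, one possible reordering can at
least be spot-checked in Python. The reordered patterns and all names
below are an illustrative guess at what is being proposed, not text from
the dictionary, and a handful of passing cases is not a proof that every
instance is covered.

```python
import re

# Original symmetry operation pattern, as quoted from mm_cif.
SYMOP_ORIGINAL = re.compile(
    r"([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?"
)

# Assumed reordering: alternatives that match longer text come first.
SYMOP_REORDERED = re.compile(
    r"(19[0-2]|1[0-8][0-9]|[1-9][0-9]|[1-9])(_[1-9][1-9][1-9])?"
)
FLOAT_REORDERED = re.compile(
    r"-?(([0-9]*[.][0-9]+)|([0-9]+)[.]?)([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
)

def lexeme(pattern, text):
    """Return the prefix of text matched by pattern, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

# Leftmost-first matching stops early with the original ordering...
print(lexeme(SYMOP_ORIGINAL, "192_555"))
# ...but finds the whole value once the alternatives are reordered.
print(lexeme(SYMOP_REORDERED, "192_555"))

# Spot checks of the reordered floating point pattern.
for sample in ["78.456(22)", "78.", "42", "1.5e-3", "-.5(3)"]:
    print(sample, lexeme(FLOAT_REORDERED, sample) == sample)
```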

cheers

Nick

--------------------------------
Dr N. Spadaccini                                      Head of School

School of Computer Science &                voice: +(61 8) 6488 3452
Software Engineering                          fax: +(61 8) 6488 1089
The University of Western Australia      email: nick@csse.uwa.edu.au 
35 Stirling Highway                    w3: www.csse.uwa.edu.au/~nick
CRAWLEY, Perth,  WA  6009 
AUSTRALIA                               CRICOS Provider Code: 00126G

_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
