Discussion List Archives

Re: Fine-tuning CIF dictionary regexes

On Mon, 18 Apr 2005, James Hester wrote:

> The point I want to discuss boils down to: should the regular
> expressions in the CIF dictionary be fine-tuned to be compatible not
> only with POSIX-compliant regular expression engines?

POSIX compliance ensures that the engine keeps consuming the input string
until it has found the longest matching sequence. This is necessary to get
the "correct" token.

>
> The following two constructs from mm_cif, although POSIX compliant, will
> not correctly match in a Perl or Python or Tcl regular expression (and
> any other NFA engine)
>
> floating point numbers:
>
> '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
>
> symmetry operations:
>
> '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
>
> The problem is that the non-POSIX engines will go through the
> alternations (separated by |) in the above expressions from left to
> right, returning the first match, and as the second part is optional,
> there is no requirement to match it.  In contrast, a POSIX engine must
> return the longest match.  So e.g. if Python is fed the number
> 78.456(22), "78." will be matched by the floating point expression, as
> this satisfies the first part of the alternation, and everything else in
> the regular expression is optional.
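
The behaviour described in the quoted paragraph can be reproduced with a
few lines of Python. This is only a rough sketch: the two patterns are
copied verbatim from the message, the helper name leftmost_first is
made up for illustration, and "155_565" is an arbitrary test value.

```python
import re

# Patterns copied verbatim from the quoted message.
FLOAT = r'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
SYMOP = r'([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'

def leftmost_first(pattern, text):
    """Return the prefix that Python's backtracking engine settles on."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# The first alternative succeeds and everything after it is optional,
# so the engine stops early instead of looking for a longer match.
print(leftmost_first(FLOAT, "78.456(22)"))   # -> 78.
print(leftmost_first(SYMOP, "155_565"))      # -> 1

# Forcing the match to cover the whole string makes the engine backtrack
# into the later alternatives, so the pattern itself does accept the value.
print(re.fullmatch(FLOAT, "78.456(22)") is not None)   # -> True
```

So the expression accepts the full value; it is only the prefix match
that comes back short.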

But if you are throwing the "number" at a series of compiled regular
expressions, won't 78.456(22) also match '7', an integer, and return the
INT token if that happens to be the first rule it comes across?
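
A small sketch of that situation, assuming a hypothetical two-rule table
(the INT pattern and the names RULES, first_rule and longest_rule are
invented for illustration; only the FLOAT expression comes from the
message). It compares "first rule that matches" with "longest match
across all rules".

```python
import re

# Hypothetical rule table. INT is an assumed pattern; FLOAT is the
# mm_cif expression quoted in the message.
RULES = [
    ("INT", re.compile(r'-?[0-9]+')),
    ("FLOAT", re.compile(
        r'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?')),
]

def first_rule(text):
    """Return (token, lexeme) from the first rule that matches a prefix."""
    for name, rx in RULES:
        m = rx.match(text)
        if m:
            return name, m.group(0)
    return None

def longest_rule(text):
    """Try every rule and keep the longest prefix; ties go to the earlier rule."""
    best = None
    for name, rx in RULES:
        m = rx.match(text)
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

print(first_rule("78.456(22)"))     # -> ('INT', '78')
print(longest_rule("78.456(22)"))   # -> ('FLOAT', '78.')
```

With a greedy [0-9]+ the INT rule takes "78" rather than "7", but the
token is still wrong. Picking the longest result across rules selects
FLOAT, yet the lexeme is still "78." because each individual rule is
matched leftmost-first.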

>
> One suggestion is that these two regular expressions are re-ordered so
> that those alternatives in an alternation which are a subset of other
> alternatives come later.  This remains POSIX-compliant and means many
> non-POSIX engines will find the longest match.

Are you sure you can order the rules such that it eliminates all instances 
of the problem you allude to?
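
One way to probe that question empirically is to compare what a
backtracking engine returns with a brute-force longest match. The
following is a sketch under stated assumptions: REORDERED is one possible
reordering of the floating point expression (not taken from any
dictionary), the helper names and the sample values are invented, and an
empty result on a handful of samples is evidence, not a proof.

```python
import re

ORIGINAL = r'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
# Assumed reordering: the alternative that can consume more text comes first.
REORDERED = r'-?(([0-9]*[.][0-9]+)|([0-9]+)[.]?)([(][0-9]+[)])?([eE][+-]?[0-9]+)?'

def longest_prefix(pattern, text):
    """Emulate a longest match anchored at position 0 by brute force."""
    for end in range(len(text), 0, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None

def disagreements(pattern, samples):
    """Samples where re.match does not return the longest matching prefix."""
    bad = []
    for s in samples:
        m = re.match(pattern, s)
        got = m.group(0) if m else None
        if got != longest_prefix(pattern, s):
            bad.append(s)
    return bad

SAMPLES = ["78.456(22)", "78.", "78", ".5e-3", "-1.2(3)E+4", "12e5", "7.e2"]
print(disagreements(ORIGINAL, SAMPLES))    # -> ['78.456(22)', '-1.2(3)E+4']
print(disagreements(REORDERED, SAMPLES))   # -> []
```

The same check could be run over a much larger generated sample set, and
over the symmetry operation expression, before trusting any particular
ordering.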

cheers

Nick

--------------------------------
Dr N. Spadaccini                                      Head of School

School of Computer Science &                voice: +(61 8) 6488 3452
Software Engineering                          fax: +(61 8) 6488 1089
The University of Western Australia      email: nick@csse.uwa.edu.au 
35 Stirling Highway                    w3: www.csse.uwa.edu.au/~nick
CRAWLEY, Perth,  WA  6009 
AUSTRALIA                               CRICOS Provider Code: 00126G

_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
