Fine-tuning CIF dictionary regexes



The point I want to discuss boils down to: should the regular
expressions in the CIF dictionary be fine-tuned so that they also work
with regular expression engines that are not POSIX-compliant?

The following two constructs from mm_cif, although POSIX-compliant, will
not match correctly in a Perl, Python or Tcl regular expression (or in
any other NFA engine):

floating point numbers:

'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'

symmetry operations:

'([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'

The problem is that non-POSIX engines try the alternatives (separated
by |) in the above expressions from left to right and accept the first
one that matches.  As everything after the alternation is optional,
nothing forces the engine to go back and try a later alternative.  In
contrast, a POSIX engine must return the longest match.  So, for
example, if Python is fed the number 78.456(22), only "78." will be
matched by the floating point expression, as this satisfies the first
alternative, and everything else in the regular expression is optional.
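
To make this concrete, here is a small Python sketch using the two
expressions exactly as quoted above.  The helper name matched_prefix and
the sample value 192_555 are only for illustration and do not come from
the dictionary.

```python
import re

# The two mm_cif expressions, copied verbatim from above.
FLOAT_RE = r'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
SYMOP_RE = r'([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'


def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None (illustrative helper)."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


# Unanchored matching: the first alternative that succeeds wins.
float_prefix = matched_prefix(FLOAT_RE, "78.456(22)")
symop_prefix = matched_prefix(SYMOP_RE, "192_555")
print(float_prefix)   # 78.
print(symop_prefix)   # 1

# If the whole string is required to match, the engine is forced to
# backtrack into the later alternatives and the full value is accepted.
float_full = re.fullmatch(FLOAT_RE, "78.456(22)") is not None
symop_full = re.fullmatch(SYMOP_RE, "192_555") is not None
print(float_full, symop_full)   # True True
```

So the short match only shows up when the expression is used
unanchored, or embedded in a larger expression where what follows also
happens to match.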

One suggestion is that these two regular expressions are re-ordered so
that any alternative which can match only a leading part of what
another alternative matches comes later.  This remains POSIX-compliant
and means many non-POSIX engines will find the longest match.
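
As a sketch of what such a re-ordering might look like (these exact
orderings are my own guess, not text from any dictionary), the
alternatives can simply be listed longest-first.  Note that re-ordering
the floating point expression also changes the numbering of its capture
groups, which matters to anyone who reads the groups back.

```python
import re

# Hypothetical re-ordered forms: same alternatives, longest-matching first.
FLOAT_REORDERED = r'-?(([0-9]*[.][0-9]+)|([0-9]+)[.]?)([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
SYMOP_REORDERED = r'(19[0-2]|1[0-8][0-9]|[1-9][0-9]|[1-9])(_[1-9][1-9][1-9])?'


def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None (illustrative helper)."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


# Sample values chosen for illustration only.
float_samples = ["78.456(22)", "78.", "78", "-.5e-3", "1e5"]
symop_samples = ["192_555", "150", "45", "5_555"]

# Record what an unanchored Python match returns for each sample.
float_results = {s: matched_prefix(FLOAT_REORDERED, s) for s in float_samples}
symop_results = {s: matched_prefix(SYMOP_REORDERED, s) for s in symop_samples}

for sample, found in {**float_results, **symop_results}.items():
    print(f"{sample!r} -> {found!r}")
```

For these samples the unanchored match covers the whole value, which is
the behaviour a POSIX engine would give.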

Does anyone read the dictionary-defined regexes directly into their
program?

James.

_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
