
RE: Fine-tuning CIF dictionary regexes



 
James Hester wrote:

[...]

> My question came up in connection with validating a CIF against a
> dictionary: all I want is to be able to determine whether or 
> not a given string matches the regexp, so rather than 
> throwing a series of regexps at a string to get a token, I'm 
> throwing a string corresponding to a data item value at a 
> single regexp.  I had hoped to be able to read the regexps 
> from the dictionary rather than hard code them.

For your particular case, it seems that you ought to be able to read a
regex from the dictionary, wrap it in a group, prepend a '^', append a
'$', and go.  (The group matters when the regex has a top-level
alternation; without it the anchors bind only to the first and last
alternatives.)  Alternatively, some regex engines (e.g. Java's) allow
you to exert control at the API level over whether the whole string,
the beginning of the string, or just any part of the string needs to
match.
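
As a rough illustration of both approaches, here is a minimal sketch in
Python.  The pattern DICT_REGEX and the function names are hypothetical
stand-ins, not taken from any actual CIF dictionary, and Python's `re`
module is not a POSIX engine, so this shows only the anchoring
mechanics.

```python
import re

# Hypothetical stand-in for a regex read from a dictionary.
DICT_REGEX = r"[0-9]+|[0-9]+\.[0-9]*"

def matches_anchored(regex, value):
    """Whole-string test by wrapping the regex in explicit anchors."""
    # The non-capturing group keeps '^' and the end anchor from binding
    # only to the first and last alternatives of a top-level alternation.
    # \Z is used instead of '$' because Python's '$' also matches just
    # before a trailing newline.
    anchored = "^(?:" + regex + r")\Z"
    return re.search(anchored, value) is not None

def matches_api(regex, value):
    """Whole-string test using the API instead of editing the regex."""
    return re.fullmatch(regex, value) is not None

for candidate in ["12.5", "12.5x", "x12"]:
    print(candidate, matches_anchored(DICT_REGEX, candidate),
          matches_api(DICT_REGEX, candidate))
```

In Python, `re.fullmatch`, `re.match` and `re.search` correspond to
matching the whole string, the beginning of the string, and any part of
the string respectively.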
 
> >> One suggestion is that these two regular expressions are re-ordered
> >> so that those alternatives in an alternation which are a subset of 
> >> other alternatives come later.  This remains POSIX-compliant and 
> >> means many non-POSIX engines will find the longest match.
> 
> > Are you sure you can order the rules such that it eliminates all 
> > instances of the problem you allude to?
> 
> Not at all. However, such a reordering will increase the
> number of regexp engines which will match the entire string.
> POSIX correctness is maintained, so nothing is lost and
> something (not necessarily all the time) practical is gained in that
> Perl/Python/Tcl/? programmers can automate type checking.

To the extent it is feasible, I agree that it is useful to arrange the
regexes so that they exhibit favorable behavior in the widest possible
range of regex engines.  Some standard needed to be chosen to
unambiguously establish the meaning of the regexes, however, and it may
not be possible to arrange all the regexes so that they have the same
meaning to regex engines that do not conform to the chosen standard
(POSIX).  One could document how the regexes used in the dictionary are
affected by the different regex semantics of some other engine(s) (e.g.
Perl's), and that might be useful, but such a document would have to be
specific to each engine; one cannot write a generic one.
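
To make the difference in alternation semantics concrete, here is a
small sketch in Python, whose `re` module takes the first alternative
that succeeds rather than the longest one.  The two patterns are
hypothetical examples, not the dictionary's regexes; they differ only
in the order of their alternatives.

```python
import re

# Hypothetical patterns: same alternatives, different order.
SHORTER_FIRST = r"[0-9]+|[0-9]+\.[0-9]*"
LONGER_FIRST = r"[0-9]+\.[0-9]*|[0-9]+"

def prefix_match(regex, value):
    """Return the text matched at the start of value, or None."""
    m = re.match(regex, value)
    return m.group(0) if m else None

value = "12.5"

# First-match alternation stops at the first alternative that succeeds,
# so the order of alternatives changes how much text is consumed.
print(prefix_match(SHORTER_FIRST, value))  # '12'
print(prefix_match(LONGER_FIRST, value))   # '12.5'

# When the whole string is required to match, the engine backtracks
# into the second alternative, so both orderings accept this value.
print(re.fullmatch(SHORTER_FIRST, value) is not None)  # True
print(re.fullmatch(LONGER_FIRST, value) is not None)   # True
```

A POSIX leftmost-longest engine would be expected to consume "12.5"
with either ordering; the reordered pattern simply makes a first-match
engine agree with it in this case.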


-- 

John C. Bollinger, Ph.D.
Indiana University
Molecular Structure Center

jobollin@indiana.edu 
_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers

