Discussion List Archives


RE: Fine-tuning CIF dictionary regexes

  • Subject: RE: Fine-tuning CIF dictionary regexes
  • From: "Bollinger, John Clayton" <jobollin@xxxxxxxxxxx>
  • Date: Thu, 16 Jun 2005 08:53:22 -0500

James Hester wrote:


> My question came up in connection with validating a CIF against a
> dictionary: all I want is to be able to determine whether or 
> not a given string matches the regexp, so rather than 
> throwing a series of regexps at a string to get a token, I'm 
> throwing a string corresponding to a data item value at a 
> single regexp.  I had hoped to be able to read the regexps 
> from the dictionary rather than hard code them.

For your particular case, it seems that you ought to be able to read a
regex from the dictionary, prepend a '^', append a '$', and go.
Alternatively, some regex engines (e.g. Java's) allow you to exert
control at the API level over whether the whole string, the
beginning of the string, or just any old part of the string needs to
match.

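A minimal sketch of both approaches, in Python rather than Java. The
pattern is a made-up stand-in for a regex read from a dictionary, and the
helper names are hypothetical. Wrapping the pattern in a group before
adding the anchors is an assumption added here: without it, a top-level
alternation would bind '^' to its first alternative only and '$' to its
last.

```python
import re

# Hypothetical stand-in for a regex read from a dictionary;
# not taken from any real CIF dictionary.
DICT_REGEX = r"[0-9]+|[0-9]+\.[0-9]+"


def matches_anchored(pattern, value):
    """Approach 1: add the anchors ourselves, then search."""
    # The (?:...) group keeps '^' and '$' applying to the whole pattern.
    # Caveat: in Python, '$' also matches just before a trailing newline.
    anchored = "^(?:" + pattern + ")$"
    return re.search(anchored, value) is not None


def matches_whole(pattern, value):
    """Approach 2: let the API require a whole-string match."""
    return re.fullmatch(pattern, value) is not None


def matches_prefix(pattern, value):
    """Only the beginning of the string has to match."""
    return re.match(pattern, value) is not None


def matches_anywhere(pattern, value):
    """Any part of the string may match."""
    return re.search(pattern, value) is not None
```

In Java the corresponding calls on java.util.regex.Matcher are
matches(), lookingAt() and find().
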
> >> One suggestion is that these two regular expressions are re-ordered
> >> so that those alternatives in an alternation which are a subset of 
> >> other alternatives come later.  This remains POSIX-compliant and 
> >> means many non-POSIX engines will find the longest match.
> > Are you sure you can order the rules such that it eliminates all 
> > instances of the problem you allude to?
> Not at all. However, such a reordering will increase the 
> number of regexp engines which will match the entire string.  
> POSIX correctness is maintained, so nothing is lost and 
> something (not necessarily all the time) practical is gained
> in that Perl/Python/Tcl/? programmers can automate type checking.

To the extent it is feasible, I agree that it is useful to arrange the
regexes so that they exhibit favorable behavior in the widest possible
range of regex engines.  Some standard needed to be chosen to
unambiguously establish the meaning of the regexes, however, and it may
not be possible to arrange all the regexes so that they have the same
meaning to regex engines that do not conform to the chosen standard
(POSIX).  One could document how the regexes used in the dictionary are
affected by the different regex semantics of some other engine(s) (e.g.
Perl's), and that might be useful, but one cannot write a generic
document of that nature.
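
The ordering effect discussed above can be seen in a toy case. Python's
re module, like Perl's engine, takes the first alternative that lets the
match succeed, where POSIX requires the longest. The two patterns and the
helper name below are hypothetical and are not drawn from any dictionary.

```python
import re

# Toy patterns: the alternative "a" is a prefix of the alternative "ab".
SHORT_FIRST = r"a|ab"
LONG_FIRST = r"ab|a"


def matched_text(pattern, value):
    """Return the text matched at the start of value, or None."""
    m = re.match(pattern, value)
    return m.group(0) if m else None


# Leftmost-first semantics: the order of the alternatives decides.
short_result = matched_text(SHORT_FIRST, "ab")   # "a"
long_result = matched_text(LONG_FIRST, "ab")     # "ab", as POSIX would give

# When a whole-string match is demanded, the engine backtracks into the
# second alternative, so in this toy case the order no longer matters.
whole_ok = re.fullmatch(SHORT_FIRST, "ab") is not None   # True
```

Putting the longer alternative first recovers the POSIX result here, and
requiring a whole-string match hides the difference in this small case.
Neither observation shows that every dictionary regex can be arranged
this way.
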


John C. Bollinger, Ph.D.
Indiana University
Molecular Structure Center

cif-developers mailing list

International Union of Crystallography
