Discussion List Archives

RE: Fine-tuning CIF dictionary regexes

  • Subject: RE: Fine-tuning CIF dictionary regexes
  • From: "Bollinger, John Clayton" <jobollin@xxxxxxxxxxx>
  • Date: Mon, 18 Apr 2005 10:04:20 -0500
 
James Hester wrote:

> The point I want to discuss boils down to: should the regular 
> expressions in the CIF dictionary be fine-tuned to be 
> compatible not only with POSIX-compliant regular expression engines?

It seems to me that it is desirable for the REs to be as general as
possible.  POSIX has the advantage of being a formal (series of)
standard(s).  Perl-compatible REs, on the other hand, have the advantage
of widespread use, support, and acceptance, to the extent that I'd have
to call them a de facto standard.  POSIX compliance is attractive from
the formal standards point of view, but Perl compatibility is more
likely to be useful to software developers.  If a particular RE in the
dictionary must choose only one, then I think I favor the Perl
direction.

> The following two constructs from mm_cif, although POSIX 
> compliant, will not correctly match in a Perl or Python or 
> Tcl regular expression (and any other NFA engine)
> 
> floating point numbers:
> 
> '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
> 
> symmetry operations:
> 
> '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
> 
> The problem is that the non-POSIX engines will go through the 
> alternations (separated by |) in the above expressions from 
> left to right, returning the first match, and as the second 
> part is optional, there is no requirement to match it.  In 
> contrast, a POSIX engine must return the longest match.  So 
> e.g. if Python is fed the number 78.456(22), "78." will be 
> matched by the floating point expression, as this satisfies 
> the first part of the alternation, and everything else in the 
> regular expression is optional.
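
The behaviour described in the quoted text can be reproduced with a
minimal sketch, assuming Python's standard re module.  The two patterns
are copied from the quoted message; the helper name prefix_match and the
symmetry test value "192_555" are illustrative only and do not come from
the dictionary.

```python
import re

# Patterns as quoted above from mm_cif.
FLOAT_RE = r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
SYMOP_RE = r"([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?"


def prefix_match(pattern, token):
    """Hypothetical helper: text matched at the start of token, or None."""
    # re.match anchors only at the start, so the engine may stop as soon as
    # the first alternative that works has been found.
    m = re.match(pattern, token)
    return m.group(0) if m else None


print(prefix_match(FLOAT_RE, "78.456(22)"))  # 78.
print(prefix_match(SYMOP_RE, "192_555"))     # 1
```

In both cases the first alternative succeeds and the optional trailing
groups match nothing, so the backtracking engine reports a short prefix
rather than the longest match that a POSIX engine would return.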

Isn't it implied that the provided REs must match an entire input
token?  As far as I can tell, that makes this particular distinction
between RE semantics moot.
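
As a rough illustration of that point, here is a minimal sketch, again
assuming Python's standard re module (re.fullmatch requires Python 3.4
or later).  The helper name whole_token_match and the symmetry test
values are made up for the example.  When the whole token must match, a
backtracking engine is forced to keep trying alternatives until one
consumes all of the input.

```python
import re

# Patterns as quoted above from mm_cif.
FLOAT_RE = r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
SYMOP_RE = r"([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?"


def whole_token_match(pattern, token):
    """Hypothetical helper: the token if the pattern matches all of it, else None."""
    # re.fullmatch succeeds only if the pattern consumes the entire string,
    # so the engine backtracks into later alternatives when an earlier one
    # leaves input unconsumed.
    m = re.fullmatch(pattern, token)
    return m.group(0) if m else None


print(whole_token_match(FLOAT_RE, "78.456(22)"))  # 78.456(22)
print(whole_token_match(SYMOP_RE, "192_555"))     # 192_555
print(whole_token_match(SYMOP_RE, "193"))         # None
```

With an engine that has no whole-string matching call, explicit anchors
around the pattern should have the same effect.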


Regards,

John Bollinger

-- 

John C. Bollinger, Ph.D.
Indiana University
Molecular Structure Center

jobollin@indiana.edu 

