Discussion List Archives

Re: Fine-tuning CIF dictionary regexes

  • Subject: Re: Fine-tuning CIF dictionary regexes
  • From: James Hester <jrh@xxxxxxxxxxxx>
  • Date: Thu, 16 Jun 2005 17:05:15 +0900

On Mon Apr 18th Nick wrote:

> POSIX compliance makes sure you exhaust the input string until you find 
> the longest matching sequence. This is necessary to get the "correct" 
> token.

I understand the POSIX rule as "leftmost, longest": of the matches that
start at the leftmost possible position, the regexp engine must consider
every alternative and return the longest.
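
To make the difference concrete, here is a minimal Python sketch. It
assumes a made-up alternation in which the first alternative is a prefix
of the second; the pattern and the helper names `first_match` and
`longest_match` are illustrative only. Python's `re` module is a
backtracking engine that returns the first alternative that succeeds,
and the second helper only approximates the POSIX "longest" rule for a
top-level alternation by trying each alternative separately.

```python
import re

# Hypothetical alternation: the first alternative is a prefix of the second.
PATTERN = r"[0-9]+|[0-9]+\.[0-9]+"


def first_match(pattern, text):
    """Return the prefix match as Python's backtracking engine reports it."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_match(alternatives, text):
    """Approximate the 'longest' rule by trying each alternative on its own."""
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


print(first_match(PATTERN, "78.456"))
print(longest_match([r"[0-9]+", r"[0-9]+\.[0-9]+"], "78.456"))
```

With this input the first call stops after the integer part, while the
second returns the whole string.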

> But if you are throwing the "number" to a series of compiled regular 
> expressions won't 78.456(22) also match '7', an integer, and return the 
> INT token? If that happens to be the first rule it comes across?

My question came up in connection with validating a CIF against a
dictionary.  All I want is to determine whether or not a given string
matches the regexp.  So rather than throwing a series of regexps at a
string to get a token, I'm throwing a string corresponding to a data
item value at a single regexp.  I had hoped to be able to read the
regexps from the dictionary rather than hard-code them.
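
A rough sketch of that kind of check in Python follows. The
`TYPE_REGEXPS` table stands in for regexps read from a dictionary file;
its patterns are simplified illustrations and are not the actual CIF
dictionary definitions. The helper name `is_valid` is likewise
hypothetical. `re.fullmatch` needs Python 3.4 or later.

```python
import re

# Hypothetical stand-in for type regexps read from a dictionary file.
# These patterns are simplified and are NOT the real CIF definitions.
TYPE_REGEXPS = {
    "int": r"[+-]?[0-9]+",
    "float": r"[+-]?[0-9]+\.[0-9]*(\([0-9]+\))?",
}


def is_valid(value, type_name, regexps=TYPE_REGEXPS):
    """Return True if the whole data value string matches the type's regexp."""
    return re.fullmatch(regexps[type_name], value) is not None


print(is_valid("78.456(22)", "float"))
print(is_valid("78.456(22)", "int"))
```

Because `re.fullmatch` only succeeds when the entire string is consumed,
a backtracking engine will try later alternatives before giving up, so
this form of check should be less sensitive to the order of alternatives
than a prefix match followed by a length comparison.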

(As an aside, I have split CIF processing into syntax and validation, so
that no tokenisation in terms of INT/FLOAT/NUMBER etc. happens during
syntax checking.  All data values after the syntax stage are strings,
which are then inspected during validation.)

>> One suggestion is that these two regular expressions are re-ordered so
>> that those alternatives in an alternation which are a subset of other
>> alternatives come later.  This remains POSIX-compliant and means many
>> non-POSIX engines will find the longest match.

> Are you sure you can order the rules such that it eliminates all instances 
> of the problem you allude to?

Not at all. However, such a reordering will increase the number of
regexp engines that match the entire string.  POSIX correctness is
maintained, so nothing is lost, and something practical is gained
(though not necessarily in every case): Perl/Python/Tcl/? programmers
can automate type checking.
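
The effect of the reordering can be sketched in Python, whose `re`
module takes the first alternative that succeeds. The two number
patterns below are simplified, made-up examples and not the dictionary's
own regexps; `matches_entire` is a hypothetical helper that does a
prefix match and then checks that the match reached the end of the
value.

```python
import re

# Hypothetical, simplified number patterns (not taken from the CIF dictionary).
INTEGER = r"[0-9]+"
DECIMAL_SU = r"[0-9]+\.[0-9]+\([0-9]+\)"

# Same two alternatives, in the two possible orders.
SHORT_FIRST = f"{INTEGER}|{DECIMAL_SU}"
LONG_FIRST = f"{DECIMAL_SU}|{INTEGER}"


def matches_entire(pattern, value):
    """Prefix-match the value, then check that the match covers all of it."""
    m = re.match(pattern, value)
    return m is not None and m.end() == len(value)


for value in ("78.456(22)", "78"):
    print(value, matches_entire(SHORT_FIRST, value), matches_entire(LONG_FIRST, value))
```

With the shorter alternative first, the engine stops at the integer part
of 78.456(22) and the check fails; with the longer alternative first,
both values are accepted. A POSIX engine would give the same answer for
either ordering.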

(This reply is so late because I seem to have dropped off the mailing
list and only noticed that some discussion had occurred when checking
the archive later on.)

James.


_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
