This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

More thoughts on regular expressions.

Peter Keller (bsspak@bath.ac.uk)
Mon, 6 Nov 1995 18:46:27 +0000 (GMT)


Hi all,

In spite of my earlier embarrassment about regular expressions on the 
mmCIF mailing list, I will wade in with more criticism (constructive, I 
hope...).

Using \n and \t for newline and tab is not part of the POSIX 
specification for regular expressions. In fact, the standard says:

> The interpretation of an ordinary character preceded by a backslash (\) 
> is undefined.

(POSIX 1003.2 section 2.8.4.1.1)

and 'n' and 't' are most definitely ordinary characters.

The intention in DDL and mmCIF is that these should be interpreted as in C
string constants. In that case, however, other special characters should,
in principle, be interpreted in this way. Consider the expression for text
in mmCIF: 

     '[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'

One might strip off the leading and trailing single quotes, and use the
libgen function strccpy to compress the two-byte sequences "\n" and "\t"
into true newline and tab characters. In that case, one would also lose
the genuine backslash, since "\{" would be compressed into "{". Without
compressing the string, the four bytes "\n\t" would be left in the
pattern, and produce undefined results from regular expression matching. 
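
To make the difference concrete, here is a minimal sketch in Python. It is
an illustration only: the function names are hypothetical, and
compress_every_escape merely imitates the compression described above, it
is not the libgen strccpy routine itself. The first function expands just
"\n" and "\t" and leaves every other backslash alone; the second treats
every backslash as an escape, and so loses the genuine one.

```python
# The mmCIF text construct quoted above, with the outer single quotes removed.
TEXT_CONSTRUCT = r'''[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'''


def expand_newline_tab(construct):
    """Replace only the two-byte sequences \\n and \\t with a real
    newline and tab; any other backslash is copied through unchanged."""
    special = {"\\n": "\n", "\\t": "\t"}
    out = []
    i = 0
    while i < len(construct):
        pair = construct[i:i + 2]
        if pair in special:
            out.append(special[pair])
            i += 2
        else:
            out.append(construct[i])
            i += 1
    return "".join(out)


def compress_every_escape(construct):
    """Assumed stand-in for the compression described in the text:
    every backslash is consumed, so "\\{" becomes "{"."""
    special = {"n": "\n", "t": "\t"}
    out = []
    i = 0
    while i < len(construct):
        ch = construct[i]
        if ch == "\\" and i + 1 < len(construct):
            nxt = construct[i + 1]
            out.append(special.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Applied to TEXT_CONSTRUCT, expand_newline_tab leaves exactly one backslash
in the bracket expression, whereas compress_every_escape leaves none, so
text containing a backslash would no longer match.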

The POSIX-conforming way of representing these types of characters would be
to use character class expressions, such as "[[:blank:]]" instead of
"[ \t]", say, but I think that this would limit their usefulness. I don't
think that there are any reliable regex packages which implement the
complete bracket expressions, and which come with an easy-going enough
licence to be incorporated into typical academic-produced software (a
policy of charging for non-academic use can cause problems here, for
example). Even if such a package were available, it might not be easily
portable to non-POSIX operating environments, since it would make use of 
locale information.

I am not really arguing for the regular expressions to be changed again,
but I think that _item_description.description for
save__item_type_list.construct in the DDL should be re-worded to reflect
that they are _based_ on P1003.2, and the differences documented. I think
that there are just two of these:

1). \n and \t should be interpreted as <newline> and <tab>

2). The constructs should be taken to match the whole string, i.e. they
should be treated as if anchored by ^.....$, without <newline> being
treated as a special character. This is the treatment implied by the flag
REG_NEWLINE being unset on a call to regcomp(), as described in
P1003.2/D11.2 section B5.2 (line 662 ff.) - in this case, ^/$ match the
start and end of strings, not of lines. 
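
The whole-string rule in 2) can be sketched as follows. This is an
illustration in Python, whose re module is not a POSIX regex
implementation (it happens to understand \n and \t itself); the function
names and the sample constructs are hypothetical and are not taken from
the DDL. re.fullmatch plays the part of the implied ^.....$ with <newline>
not special, and re.MULTILINE is used as a rough analogue, for the anchors
only, of setting REG_NEWLINE.

```python
import re


def matches_whole(construct, value):
    """Treat the construct as if anchored at both ends of the whole
    string; a newline inside the value gets no special treatment."""
    return re.fullmatch(construct, value) is not None


def matches_some_line(construct, value):
    """Contrast case: ^ and $ also match at line boundaries, roughly
    what the anchors do when REG_NEWLINE is set."""
    anchored = "^(?:" + construct + ")$"
    return re.search(anchored, value, re.MULTILINE) is not None


# Hypothetical constructs, for illustration only.
DIGITS = r"[0-9]+"
DIGITS_AND_NEWLINES = r"[0-9\n]+"
```

With these definitions, matches_whole(DIGITS, "123\nabc") is false because
the construct does not account for the whole value, while
matches_some_line(DIGITS, "123\nabc") is true because the first line alone
satisfies it. A value that spans lines is accepted under the whole-string
rule only when the construct itself admits <newline>, as
DIGITS_AND_NEWLINES does for "12\n34".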

Eventually, it may be better to go to the character class notation, but it
may be a long time before suitable software is generally available. 

Regards,
Peter.

========================================================================
Peter Keller.            \  "Having beguiled with fiction until I had
Dept. of Biology and      \    none left I resorted to facts, which
    Biochemistry,          \     also ran out."
University of Bath,         \          - Alisdair Gray
Bath, BA2 7AY, UK.           \ 
------------------------------\-----------------------------------------
Tel. (+44/0)1225 826826 x 4302 | Email: P.A.Keller@bath.ac.uk (Internet)
Fax. (+44/0)1225 826449        |   P.A.Keller%bath.ac.uk@UKACRL (BITNET)
========================================================================