Hi all,

In spite of my earlier embarrassment about regular expressions on the mmCIF mailing list, I will wade in with more criticism (constructive, I hope...).

Using \n and \t for newline and tab is not part of the POSIX specification for regular expressions. In fact, the standard says:

> The interpretation of an ordinary character preceded by a backslash (\)
> is undefined.

(POSIX 1003.2, section 2.8.4.1.1), and 'n' and 't' are most definitely ordinary characters.

The intention in DDL and mmCIF is that these should be interpreted as in C string constants. In that case, however, other special characters should, in principle, be interpreted in the same way. Consider the expression for text in mmCIF:

'[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'

One might strip off the leading and trailing single quotes, and use the libgen function strccpy to compress the two-byte sequences "\n" and "\t" into true newline and tab characters. In that case, one would also lose the genuine backslash, since "\{" would be compressed into "{". Without compressing the string, the four bytes "\n\t" would be left in the pattern, and produce undefined results from regular expression matching.

The POSIX-conforming way of representing these types of characters would be to use character class expressions, such as "[[:blank:]]" instead of "[ \t]", say, but I think that this would limit the usefulness of the expressions. I don't think that there are any reliable regex packages which implement the complete bracket expressions, and which come with an easy-going enough licence to be incorporated into typical academic-produced software (a policy of charging for non-academic use can cause problems here, for example). Even if such a package were available, it might not be easily portable to non-POSIX operating environments, since it would make use of locale information.
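To make the backslash problem with the mmCIF text expression concrete, here is a minimal sketch, written in Python purely for illustration. The function compress_escapes is a hypothetical, simplified stand-in for a strccpy-style pass, not the libgen routine itself: it only knows \n, \t and \\, and it assumes (as described above) that an unrecognised escape such as \{ collapses to the bare character.

```python
# Simplified, hypothetical stand-in for a strccpy-style compression pass.
# Assumption: unknown escapes such as \{ are reduced to the bare character.
C_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}


def compress_escapes(pattern: str) -> str:
    """Collapse backslash pairs roughly as a C string constant would be read."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # Known escapes become control characters; for anything else
            # the backslash is dropped and the following character kept.
            out.append(C_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The mmCIF text construct with its surrounding single quotes stripped.
MMCIF_TEXT = r"""[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*"""

compressed = compress_escapes(MMCIF_TEXT)

print("\n" in compressed, "\t" in compressed)  # real newline and tab are now present
print("\\" in compressed)                      # False: the genuine backslash is gone
```

Neither form is satisfactory as it stands: the compressed pattern can no longer match a backslash, while the uncompressed one still contains the two-byte sequences whose meaning POSIX leaves undefined.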
I am not really arguing for the regular expressions to be changed again, but I think that _item_description.description for save__item_type_list.construct in the DDL should be re-worded to reflect that they are _based_ on P1003.2, and the differences documented. I think that there are just two of these:

1). \n and \t should be interpreted as <newline> and <tab>

2). The constructs should be taken to match the whole string, i.e. they should be treated as if anchored by ^.....$, without <newline> being treated as a special character. This is the treatment implied by the flag REG_NEWLINE being unset on a call to regcomp(), as described in P1003.2/D11.2 section B5.2 (line 662 ff.) - in this case, ^/$ match the start and end of strings, not of lines.

Eventually, it may be better to go to the character class notation, but it may be a long time before suitable software is generally available.

Regards,
Peter.

========================================================================
Peter Keller.                 \ "Having beguiled with fiction until I had
Dept. of Biology and          \  none left I resorted to facts, which
  Biochemistry,               \  also ran out."
University of Bath,           \                - Alisdair Gray
Bath, BA2 7AY, UK.            \
------------------------------\-----------------------------------------
Tel. (+44/0)1225 826826 x 4302 | Email: P.A.Keller@bath.ac.uk (Internet)
Fax. (+44/0)1225 826449        |        P.A.Keller%bath.ac.uk@UKACRL (BITNET)
========================================================================