Re: Fine-tuning CIF dictionary regexes

Now that CIF can handle long lines and has documented handling
of special characters, it should be feasible to convert
any previously fudged regular expressions into fully POSIX-compliant
regexes that can be used for automatic validation.  I propose
that we start collecting and testing a full set of compliant
regexes for the types in, say, the mmCIF dictionary and, once
we have general agreement on the expressions, update the
dictionaries and our on-line documentation.
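
As a rough illustration of what "collecting and testing" could look like,
here is a minimal Python sketch of a table-driven check.  The names
CONSTRUCTS, is_valid, CASES and run_cases are hypothetical, the test
values are made up for illustration, and only constructs that avoid '\n',
'\t' and bracket-expression corner cases are included, on the assumption
that POSIX ERE and Python's re module read those the same way.

```python
import re

# Hypothetical subset of the ITEM_TYPE_LIST table: type code -> construct.
# Copied from the appended list; the harder character-class constructs
# (code, line, text, ...) are deliberately left out of this sketch.
CONSTRUCTS = {
    "int": r"-?[0-9]+",
    "float": r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?",
    "idname": r"[_A-Za-z0-9]+",
    "yyyy-mm-dd": r"[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9][0-9]",
}

COMPILED = {code: re.compile(expr) for code, expr in CONSTRUCTS.items()}


def is_valid(type_code, value):
    """Return True if the whole of `value` matches the construct for `type_code`."""
    return COMPILED[type_code].fullmatch(value) is not None


# Hypothetical test cases: (type code, value, expected result).
CASES = [
    ("int", "-42", True),
    ("int", "4.2", False),
    ("float", "78.456(22)", True),
    ("float", "1.5e-3", True),
    ("idname", "atom_site", True),
    ("idname", "atom site", False),
    ("yyyy-mm-dd", "2005-06-16", True),
    ("yyyy-mm-dd", "16/06/2005", False),
]


def run_cases(cases=CASES):
    """Return the cases whose actual result differs from the expected one."""
    return [(c, v, e) for c, v, e in cases if is_valid(c, v) != e]
```

An empty list from run_cases() means every case behaved as expected.  A
shared file of such cases, run against several regex engines, would be one
way to find the expressions on which engines disagree.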

I have appended John's list from the current mmCIF dictionary, which
is, I think, in fairly good shape.  I would suggest we do as much
as we can before the IUCr meeting in Florence, so that those of us
who are there can have a productive discussion.

   -- Herbert

####################
## ITEM_TYPE_LIST ##
####################
#
#
#  The regular expressions defined here are not compliant
#  with the POSIX 1003.2 standard as they include the
#  '\n' and '\t' special characters. These regular expressions
#  have been tested using version 0.12 of Richard Stallman's
#  GNU regular expression library in POSIX mode.
#
#
# For some data items, a standard syntax is assumed. The syntax is
#   described for each data item in the dictionary, but is summarized here:
#
#   Names:     The family name(s), followed by a comma, precede the first
#              name(s) or initial(s).
#
#   Telephone numbers:
#              The international code is given in brackets and any extension
#              number is preceded by 'ext'.
#
#   Dates:     In the form yyyy-mm-dd.
#
##############################################################################

       loop_
      _item_type_list.code
      _item_type_list.primitive_code
      _item_type_list.construct
      _item_type_list.detail
                 code      char
                 '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words ...
;
                 ucode      uchar
                 '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words  (case insensitive) ...
;
                 line      char
                 '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items ...
;
                 uline      uchar
                 '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items (case insensitive)...
;
                 text      char
                 '[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              text item types / multi-line text ...
;
                 int       numb
                 '-?[0-9]+'
;              int item types are the subset of numbers that are the negative
                 or positive integers.
;
                 float     numb
                 '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
;              float item types are the subset of numbers that are the
                 floating-point numbers.
;
                 name      uchar
                 '_[_A-Za-z0-9]+\.[][_A-Za-z0-9%-]+'
;              name item types take the form...
;
                 idname    uchar
                 '[_A-Za-z0-9]+'
;              idname item types take the form...
;
                 any       char
                 '.*'
;              A catch all for items that may take any form...
;
                 yyyy-mm-dd  char
                  '[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9][0-9]'
;
                 Standard format for CIF dates.
;
                 uchar3    uchar
                '[+]?[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]'
;
                 data item for 3 character codes
;
                 uchar1    uchar
                '[+]?[A-Za-z0-9]'
;
                 data item for 1 character codes
;
                 symop    char
                 '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
;              symop item types take the form n_klm, where n refers to the
                 symmetry operation that is applied to the coordinates in the
                 ATOM_SITE category identified by _atom_site_label.  It must
                 match a number given in _symmetry_equiv_pos_site_id.

                 k, l, and m refer to the translations that are subsequently
                 applied to the symmetry transformed coordinates to generate
                 the atom used.  These translations (x,y,z) are related to
                 (k,l,m) by
                       k = 5 + x
                       l = 5 + y
                       m = 5 + z
                 By adding 5 to the translations, the use of negative numbers
                 is avoided.
;
                 atcode      char
                 '[][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              Character data type for atom names  ...
;
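
The symop entry is the one construct above whose value needs decoding
after it matches.  The following Python sketch shows the k = 5 + x
arithmetic in both directions.  The function names decode_symop and
encode_symop and the named groups are hypothetical additions, and
treating a missing '_klm' suffix as zero translation is an assumption
made here, not something the dictionary text states.

```python
import re

# The symop construct from the table above, with named groups added for
# convenience (the group names are not part of the dictionary).
SYMOP = re.compile(
    r"(?P<n>[1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])"
    r"(_(?P<k>[1-9])(?P<l>[1-9])(?P<m>[1-9]))?"
)


def decode_symop(value):
    """Split 'n_klm' into (n, (x, y, z)) using x = k - 5, y = l - 5, z = m - 5."""
    match = SYMOP.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid symop: {value!r}")
    n = int(match.group("n"))
    if match.group("k") is None:
        # Assumption: an absent suffix means no translation, i.e. '_555'.
        return n, (0, 0, 0)
    x, y, z = (int(match.group(g)) - 5 for g in "klm")
    return n, (x, y, z)


def encode_symop(n, translation):
    """Inverse of decode_symop: build 'n_klm' with k = 5 + x, l = 5 + y, m = 5 + z."""
    suffix = "".join(str(5 + t) for t in translation)
    value = f"{n}_{suffix}"
    # Re-check against the construct so out-of-range input is rejected.
    if SYMOP.fullmatch(value) is None:
        raise ValueError(f"cannot encode n={n}, translation={translation}")
    return value
```

For example, decode_symop("4_565") gives operation 4 with translation
(0, 1, 0), and encode_symop(2, (-1, 0, 0)) gives "2_455".  Note that the
construct as written only admits digits 1 to 9 for k, l and m, so
translations outside -4..4 are rejected by this sketch.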




At 5:05 PM +0900 6/16/05, James Hester wrote:
>On Mon Apr 18th Nick wrote:
>
>>  POSIX compliance makes sure you exhaust the input string until you find
>>  the longest matching sequence. This is necessary to get the "correct"
>>  token.
>
>I understand it as "leftmost, longest" so that the regexp engine must
>search through all alternative matches to find the longest.
>
>>  But if you are throwing the "number" to a series of compiled regular
>>  expressions won't 78.456(22) also match '7', an integer, and return the
>>  INT token? If that happens to be the first rule it comes across?
>
>My question came up in connection with validating a CIF against a
>dictionary: all I want is to be able to determine whether or not a given
>string matches the regexp, so rather than throwing a series of regexps
>at a string to get a token, I'm throwing a string corresponding to a
>data item value at a single regexp.  I had hoped to be able to read the
>regexps from the dictionary rather than hard code them.
>
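
The difference between the two uses can be sketched in Python, taking the
int and float constructs from the list above.  This is only an
illustration in one engine: a prefix match lets the int expression accept
the leading digits of 78.456(22), while a whole-string match, which is
what validation needs, does not.

```python
import re

INT = re.compile(r"-?[0-9]+")
FLOAT = re.compile(
    r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
)

value = "78.456(22)"

# Prefix match: INT is satisfied by the leading digits and ignores the rest.
prefix_text = INT.match(value).group(0)          # "78"

# Whole-string match: the entire value must be consumed by the expression.
int_ok = INT.fullmatch(value) is not None        # False
float_ok = FLOAT.fullmatch(value) is not None    # True
```

With whole-string matching there is no need to try the expressions in any
particular order: the value is checked only against the construct that
the dictionary assigns to its data item.
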
>(As an aside, I have split CIF processing into syntax and validation, so
>that no tokenisation in terms of INT/FLOAT/NUMBER etc. happens during
>syntax checking.  All data values after the syntax stage are strings
>which are then inspected during validation).
>
>>>  One suggestion is that these two regular expressions are re-ordered so
>>>  that those alternatives in an alternation which are a subset of other
>>>  alternatives come later.  This remains POSIX-compliant and means many
>>>  non-POSIX engines will find the longest match.
>
>>  Are you sure you can order the rules such that it eliminates all instances
>>  of the problem you allude to?
>
>Not at all. However, such a reordering will increase the number of
>regexp engines which will match the entire string.  POSIX correctness is
>maintained, so nothing is lost and something (not necessarily all the
>time) practical is gained in that Perl/Python/Tcl/? programmers can
>automate type checking.
>
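
The effect of reordering can be seen on the alternation inside the float
construct.  The sketch below uses Python's re module, which takes the
first alternative that succeeds rather than the longest one.  REORDERED
is a hypothetical rearrangement written for this example, not an agreed
replacement for the dictionary entry.

```python
import re

# Alternation as it appears in the float construct: the "digits with an
# optional trailing dot" branch comes first.
ORIGINAL = re.compile(r"([0-9]+)[.]?|([0-9]*[.][0-9]+)")

# Hypothetical reordering: the branch that can consume more comes first.
REORDERED = re.compile(r"([0-9]*[.][0-9]+)|([0-9]+)[.]?")


def leading_match(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = pattern.match(text)
    return m.group(0) if m else None


first = leading_match(ORIGINAL, "78.456")    # "78."   (first alternative wins)
second = leading_match(REORDERED, "78.456")  # "78.456" (agrees with leftmost-longest)
```

A POSIX leftmost-longest engine would return 78.456 for both orderings,
so the reordered form changes nothing there.  In Python, a whole-string
match (fullmatch) accepts 78.456 with either ordering, because the engine
backtracks into the second alternative when the first leaves input
unconsumed.  The ordering therefore matters mostly for unanchored
matching; how other engines behave would need to be tested rather than
assumed.
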
>(This reply is so late because I seem to have dropped off the mailing
>list and only noticed that some discussion had occurred when checking
>the archive later on).
>
>James.
>
>
>_______________________________________________
>cif-developers mailing list
>cif-developers@iucr.org
>http://scripts.iucr.org/mailman/listinfo/cif-developers

-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

               Office:  +1-631-244-3035
            Lab (KSC 020): +1-631-244-3451
                  yaya@dowling.edu
=====================================================
