YappsStarParser.nw

Noweb literate programming file for Star grammar and parser specification. We are using Amit Patel's excellent context-sensitive Yapps2 parser. This was chosen because it enables us to process long semicolon delimited strings without running into Python recursion limits. In the original kjParsing implementation, it was impossible to get the lexer to return a single line of text within the semicolon-delimited string as that re would have matched a single line of text anywhere in the file. The resulting very long match expression only worked for text strings less than about 9000 characters in length. For further information about Yapps2, see http://theory.stanford.edu/ amitp/Yapps/

<*>=
from StarFile import *
from types import *
import copy
<Helper functions>
%%
parser StarParser:
    <Regular expressions>
    <Grammar specification>
%%

Helper functions.

We have a monitor function which we can call to save the last parsed value (and print, if we are debugging). We also have functions for stripping off delimiters from strings. Finally, we match up our loops after reading them in. Note that we have function stripextras, which is only for semicolon strings, and stripstring, which is for getting rid of the inverted commas.

<Helper functions>= (<-U)
# An alternative specification for the Cif Parser, based on Yapps2
# by Amit Patel (http://theory.stanford.edu/~amitp/Yapps)
#
# helper code: we define our match tokens
lastval = ''
def monitor(location,value):
    global lastval
    # print 'At %s: %s' % (location,`value`)
    lastval = `value`
    return value

# Strip extras gets rid of leading and trailing whitespace, and
# semicolons.
def stripextras(value):
    # we get rid of semicolons and leading/trailing terminators etc.
     import re
     jj = re.compile("[\n\r\f \t\v]*")
     semis = re.compile("[\n\r\f \t\v]*[\n\r\f]\n*;")
     cut = semis.match(value)
     if cut:        #we have a semicolon-delimited string
          nv = value[cut.end():len(value)-2]
          try:
             if nv[-1]=='\r': nv = nv[:-1]
          except IndexError:    #empty data value
             pass
          return nv 
     else: 
          cut = jj.match(value)
          if cut:
               return stripstring(value[cut.end():])
          return value

# helper function to get rid of inverted commas etc.

def stripstring(value):
     if value:
         if value[0]== '\'' and value[-1]=='\'':
           return value[1:-1]
         if value[0]=='"' and value[-1]=='"':
           return value[1:-1]
     return value

# helper function to populate a nested LoopBlock structure given an
# empty structure together with listed values.   The values are 
# organised into a list of lists, where each time 'stop' was
# encountered one list terminates and a new one starts. 
# For a correctly constructed loop, the final 'popout' will pop out
# of the iteration completely and raise a StopIteration error.
#
# Note that there may be an empty list at the very end of our itemlists,
# so we remove that if necessary.
#
# We optimise for CIF files by loading differently if we have a flat loop

def makeloop(loopstructure,itemlists):
    if itemlists[-1] == []: itemlists.pop(-1)
    # print 'Making loop with %s' % `itemlists`
    if loopstructure.dimension == 1 and loopstructure.loops == []:
        storage_iter = loopstructure.fast_load_iter()
    else:
        storage_iter = loopstructure.load_iter()
    nowloop = loopstructure
    for datalist in itemlists:
       for datavalue in datalist:
           try:
               nowloop,target = storage_iter.next()
           except StopIteration:
               print "StopIter at %s/%s" % (datavalue,datalist)
               raise StopIteration
           # print 'Got %s %s ->' % (`nowloop`,`target`),
           target.append(datavalue)
           # print '%s' % `target`
       # the end of each list is the same as a stop_ token
       # print 'Saw end of list'
       nowloop.popout = True
       nowloop,blank = storage_iter.next()  #execute the pop
       # print 'discarding %s/%s' % (`nowloop`,`blank`)
    # print 'Makeloop returning %s' % `loopstructure`
    return loopstructure

# return an object with the appropriate amount of nesting
def make_empty(nestlevel):
    gd = []
    for i in range(1,nestlevel):
        gd = [gd]
    return gd

# this function updates a dictionary first checking for name collisions,
# which imply that the CIF is invalid.  We need case insensitivity for
# names. 

# Unfortunately we cannot check loop item contents against non-loop contents
# in a non-messy way during parsing, as we may not have easy access to previous
# key value pairs in the context of our call (unlike our built-in access to all 
# previous loops).
# For this reason, we don't waste time checking looped items against non-looped
# names during parsing of a data block.  This would only match a subset of the
# final items.   We do check against ordinary items, however.
#
# Note the following situations:
# (1) new_dict is empty -> we have just added a loop; do no checking
# (2) new_dict is not empty -> we have some new key-value pairs
#
def cif_update(old_dict,new_dict,loops):
    old_keys = map(lambda a:a.lower(),old_dict.keys())
    if new_dict != {}:    # otherwise we have a new loop
        #print 'Comparing %s to %s' % (`old_keys`,`new_dict.keys()`)
        for new_key in new_dict.keys():
            if new_key.lower() in old_keys:
                raise CifError, "Duplicate dataname or blockname %s in input file" % new_key
            old_dict[new_key] = new_dict[new_key]
#
# this takes two lines, so we couldn't fit it into a one line execution statement...
def order_update(order_array,new_name):
    order_array.append(new_name) 
    return new_name

The regular expressions aren't quite as easy to deal with as in kjParsing; in kjParsing we could pass string variables as re arguments, but here we have to have a raw string. However, we can simplify the BNC specification of Nick Spadaccini. First of all, we do not have to have type I and type II strings, which are distinguished by the presence or absence of a line feed directly preceding them and thus by being allowed a semicolon at the front or not. We take care of this by treating as whitespace all terminators except for those with a following semicolon, so that a carriage-return-semicolon sequence matches the start_sc_line uniquely.

We include reserved words and save frames, although we semantically utterly ignore any save frames that we see - but no syntax error is flagged. The other reserved words have no rules defined, so will flag a syntax error. However, as yapps is a context-sensitive parser, it will by default make any word found starting with our reserved words into a data value if it occurs in the expected position, so we explicity exclude stuff starting with our words in the definition of data_value_1.

<Regular expressions>= (<-U)
# first handle whitespace and comments, keeping whitespace
# before a semicolon
ignore: "([ \t\n\r](?!;))|[ \t]"
ignore: "#.*[\n\r](?!;)"
ignore: "#.*"
# now the tokens
token LBLOCK:  "(L|l)(O|o)(O|o)(P|p)_"        # loop_
token GLOBAL: "(G|g)(L|l)(O|o)(B|b)(A|a)(L|l)_"
token STOP: "(S|s)(T|t)(O|o)(P|p)_"
token save_heading: "(S|s)(A|a)(V|v)(E|e)_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+"
token save_end: "(S|s)(A|a)(V|v)(E|e)_"
token data_name: "_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+" #_followed by stuff
token data_heading: "(D|d)(A|a)(T|t)(A|a)_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+"
token start_sc_line: "(\n|\r\n);([^\n\r])*(\r\n|\r|\n)+"
token sc_line_of_text: "[^;\r\n]([^\r\n])*(\r\n|\r|\n)+"
token end_sc_line: ";"
token data_value_1: "((?!(((S|s)(A|a)(V|v)(E|e)_[^\s]*)|((G|g)(L|l)(O|o)(B|b)(A|a)(L|l)_[^\s]*)|((S|s)(T|t)(O|o)(P|p)_[^\s]*)|((D|d)(A|a)(T|t)(A|a)_[^\s]*)))[^\s\"#$'_\[\]][^\s]*)|'(('(?=\S))|([^\n\r\f']))*'+|\"((\"(?=\S))|([^\n\r\"]))*\"+"
token END: '$'

The final returned value is a StarFile, with each key a datablock name. The value attached to each key is an entire dictionary for that block. We bypass the standard __setitem__ methods to gain precision in checking for duplicate blocknames and skipping name checks.

<Grammar specification>= (<-U)
# now the rules

rule input: ((
            dblock              {{allblocks = StarFile(); allblocks.NewBlock(dblock[0],blockcontents=dblock[1],fix=False,replace=False)}}
            (
            dblock              {{allblocks.NewBlock(dblock[0],blockcontents=monitor('input',dblock[1]),fix=False,replace=False)}} #
            )*
            END                
            )
            |
            (
             END                 {{allblocks = StarFile()}}
            ))                    {{return allblocks}}

     rule dblock: (data_heading           {{heading = data_heading[5:];thisblock=StarBlock(overwrite=False)}}# a data heading
                  (
                   dataseq<<thisblock>>  
                  |
                  save_frame              {{thisblock["saves"].NewBlock(save_frame[0],save_frame[1],fix=False,replace=True)}}
                  )*
                   )                      {{return (heading,monitor('dblock',thisblock))}} # but may be empty

     rule dataseq<<starblock>>:  data<<starblock>>   
                       (
                       data<<starblock>>            
                       )*                         

     rule data<<currentblock>>:        top_loop      {{currentblock.insert_loop(top_loop,audit=False)}} 
                                        | 
                                        datakvpair    {{currentblock.AddLoopItem(datakvpair,precheck=True)}} #kv pair

     rule datakvpair: data_name data_value  {{return (data_name,data_value) }} # name-value
     rule data_value: (data_value_1          {{thisval = stripstring(data_value_1)}}
                      |
                      sc_lines_of_text      {{thisval = stripextras(sc_lines_of_text)}}
                      )                     {{return monitor('data_value',thisval)}}

     rule sc_lines_of_text: start_sc_line   {{lines = start_sc_line}}
                            (
                            sc_line_of_text {{lines = lines+sc_line_of_text}}
                            )*
                            end_sc_line     {{return monitor('sc_line_of_text',lines+end_sc_line)}}

# due to the inability of the parser to backtrack, we contruct our loops in helper functions,
# and simply collect data during parsing proper.

     rule top_loop: LBLOCK loopfield loopvalues {{return makeloop(loopfield,loopvalues)}}

# OK: a loopfield is either a sequence of dataname*,loopfield with stop
# or else dataname,loopfield without stop

     rule loopfield: (            {{toploop=LoopBlock(dimension=1,overwrite=False);curloop=toploop;poploop=None;dim=1}}
                     (
                                  data_name    {{curloop[data_name]=[]}}
                                  |
                                  LBLOCK       {{dim=dim+1;newloop=LoopBlock(dimension=dim,overwrite=False);poploop=curloop;curloop.insert_loop(newloop,audit=False);curloop=newloop}}
                                  |
                                  STOP         {{curloop=poploop;dim=dim-1}}
                      )*
                      )                        {{return toploop}} # sequence of data names 
                         

     rule loopvalues: ( 
                       data_value    {{dataloop=[[data_value]]}}
                       ( 
                       data_value     {{dataloop[-1].append(monitor('loopval',data_value))}}
                       |
                       STOP           {{dataloop.append([])}} 
                       )*             
                       )              {{return dataloop}}

     rule save_frame: save_heading       {{savehead = save_heading[5:];savebody = StarBlock(overwrite=False) }} 
                      (
                      dataseq<<savebody>> 
                      )*
                      save_end           {{return (savehead,monitor('save_frame',savebody))}}

<*>: D1
<Grammar specification>: U1, D2
<Helper functions>: U1, D2
<Regular expressions>: U1, D2