Re: [ddlm-group] Use of elides in strings

The only problem with referring all elides to the application is that,
now that a \n; no longer has to be followed by a blank to be
effective, the line folding protocol develops a slight gap.  The
case is as follows:

;\
;\
;

is a valid single text field in CIF 1.1, which, when handled with the
line folding protocol, translates to the equivalent of ';' because the
embedded ;\ is not a valid text terminator.  Once the blank is no longer
required, that same ;\ would close the field early.  If we require that
a text field that begins with "\n;\\" be terminated by "\n; "
or "\n;\n" or "\n;\t", that problem would be fixed.
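
To make the gap concrete, here is a minimal sketch of the two terminator
rules.  The function names are hypothetical, the field is given as a list
of lines with the opening line first, and nothing else about CIF lexing
is modelled.

```python
def terminator_index(lines, blank_required):
    """Index of the line that closes a text field opened on lines[0], or None."""
    for i, line in enumerate(lines[1:], start=1):
        if not line.startswith(";"):
            continue
        rest = line[1:]
        # An empty rest stands for "\n;\n"; space and tab cover "\n; " and "\n;\t".
        if not blank_required or rest == "" or rest[0] in " \t":
            return i
    return None


def proposed_terminator_index(lines):
    """Require the blank only when the field opens with ';\\' (a folded field)."""
    return terminator_index(lines, blank_required=lines[0].startswith(";\\"))


field = [";\\", ";\\", ";"]
print(terminator_index(field, blank_required=False))  # 1: the embedded ;\ closes the field
print(proposed_terminator_index(field))               # 2: only the final ; closes it
```

With the blank required for folded fields, the three-line example keeps
its CIF 1.1 reading.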

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 25 Nov 2009, James Hester wrote:

> I wholeheartedly agree with Nick's suggestion.
>
> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> It appears to me that we have spent far too long on a syntactic issue which
>> can be avoided 99.9999% of the time. Quite simply given the 5 ways to
>> delimit strings, it is next to impossible to get a situation where you
>> cannot choose one of those to make the problem go away.
>>
>> I think the RCSB systematically avoid it by choosing
>>
>> "ab'cd"
>> 'ab"cd'
>> ;ab'"cd
>> ;
>>
>> But now we additionally have """ and ''' to choose from, making it even
>> easier.
>>
>> So I propose, in line with James' position, that there is NO eliding of
>> terminator characters at the CIF2 syntax level. ALL elides in the string are
>> assumed to be user-specific encoding (say TeX, IUCr \greek) which can be
>> resolved at the dictionary level.
>>
>> This necessarily means NO terminator character can appear in a string
>> delimited by the same terminator character. You will need to choose a
>> different terminator character. That is
>>
>> No " in "strings"
>> No ' in 'strings'
>> No """ in """strings""" (but separable individual and doublet " are allowed)
>> No ''' in '''strings''' (but separable individual and doublet ' are allowed)
>>
>> EVERYTHING in the string is returned as raw (except the initiating and
>> terminating character).
>>
>> The only time you will not be able to encode a value in a delimited string
>> is when you want to include ' " """ ''' and \n; in the one string. The
>> likelihood of that is almost zero, unless you may want to include a CIF
>> within a CIF (a silly thing to do IMHO). In that case the contents can be
>> encoded in a dictionary driven way. I suggest it be declared as a BASE64
>> type and then all the syntactic ambiguity disappears.
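
A rough sketch of the "choose a different terminator" rule, for
illustration only.  The names DELIMITERS, choose_delimiter and
encode_fallback are hypothetical, and the sketch checks nothing beyond
the rule stated above (it ignores newlines in single-line strings and a
quote character sitting directly against a closing triple quote).

```python
import base64

# Candidate delimiters in order of preference.  "\n;" stands for the
# semicolon-delimited text field.
DELIMITERS = ['"', "'", '"""', "'''", "\n;"]


def choose_delimiter(value):
    """Return the first delimiter that does not occur in value, or None."""
    for delim in DELIMITERS:
        if delim not in value:
            return delim
    return None


def encode_fallback(value):
    """BASE64 form for the rare value that defeats every delimiter."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


print(repr(choose_delimiter("ab'cd")))    # '"'
print(repr(choose_delimiter('ab"cd')))    # "'"
print(repr(choose_delimiter("ab'\"cd")))  # '"""'
print(choose_delimiter("' \" \"\"\" ''' \n;"))  # None
```

Only a value containing """, ''' and \n; all at once returns None, which
is the case the BASE64 suggestion is meant to cover.
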
>>
>> Problem solved! No need to elide because of CIF2 syntax rules; all elides are
>> user driven, and contents are returned raw.
>>
>> As for Herb's comment in a recent email asking about line-folding, the
>> same holds. That is NOT a lexer issue and it has nothing to do with the
>> parser; everything is read literally and returned raw, and what to do with it
>> is left to the downstream application.
>>
>> Straw vote - No elides of terminator strings as described above - Nick
>>
>>
>> On 24/11/09 10:00 AM, "James Hester" <jamesrhester@gmail.com> wrote:
>>
>>> OK, my rewritten voting proposal appears to be an abject failure.  Let
>>> me repeat question 1 as clearly as possible:
>>>
>>> 1.  Should CIF2 allow elision of terminator characters?  In other
>>> words, should we make it possible to include <quote> as a normal
>>> character in a <quote> delimited string?
>>>
>>> Herbert:  It's difficult to understand how to rephrase things if it is
>>> not clear where exactly the problem lies.
>>>
>>> Joe: good point about double backslash.  Consider this added to proposal (a).
>>>
>>> Before we discuss (2) precisely, can we agree to use the following
>>> abstract model and terminology for CIF2 file parsing and dictionary
>>> application?  If not, please indicate your alternative.
>>>
>>> 1. A CIF lexer separates a CIF file into tokens according to the CIF2
>>> syntax specification only, that is, this process cannot be altered by
>>> DDL directives.
>>>
>>> 2. A CIF parser accepts the tokens from the lexer.  CIF parsers can be
>>> modelled as performing at least the following actions with these
>>> tokens:
>>>   (i) assignment of datavalue to dataname
>>>   (ii) grouping looped datanames into a set
>>>   (iii) assigning looped datavalues to the appropriate dataname and packet
>>>   (iv) editing datavalues according to the syntax specification if
>>> this has not been performed in the lexer (e.g. stripping enclosing
>>> quotes, removing elides)
>>>
>>> 3. DDL dictionaries operate on and refer to the datavalues and
>>> datanames returned by the CIF parser after (2).  They have no ability
>>> to influence the lexing process, or the parsing actions listed above
>>> (in particular the datavalue editing).
>>>
>>> 4. The 'string value' or 'value' of a token is that value returned by
>>> the parser in (2).  In particular, this is the value that:
>>>   (i) may be checked against regular expressions in the dictionary;
>>>   (ii) is accessed by dREL expressions;
>>>   (iii) is returned by dREL expressions;
>>>   (iv) is referred to in dictionary descriptive text;
>>>   (v) may be passed to client routines for further editing;
>>>   (vi) may be passed to external applications
>>>
>>> [Side note: in other words the parser returns the CIF "infoset" and
>>> the dictionaries refer to the CIF "infoset", but we haven't been
>>> talking in those terms so I've been more explicit].
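
The abstract model above can be sketched roughly as follows.  This is a
toy under stated assumptions, not a CIF2 implementation: it handles only
whitespace-separated dataname/datavalue pairs with " and ' delimiters and
no elides, it omits loops and text fields, and every function name is
hypothetical.

```python
import re

# Step 1 (lexer): syntax only.  With no elides, a quoted token simply runs
# to the next occurrence of its own delimiter.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|\S+')


def lex(text):
    return TOKEN.findall(text)


def strip_delimiters(raw):
    # Step 2(iv): datavalue editing is done by the parser, not the dictionary.
    if len(raw) >= 2 and raw[0] in "\"'" and raw[-1] == raw[0]:
        return raw[1:-1]
    return raw


def parse(tokens):
    # Step 2(i): assign each datavalue to the dataname that precedes it.
    values = {}
    for name, raw in zip(tokens[0::2], tokens[1::2]):
        values[name] = strip_delimiters(raw)
    return values


def dictionary_check(values, name, pattern):
    # Steps 3 and 4(i): the dictionary sees only the parser's string value.
    return re.fullmatch(pattern, values[name]) is not None


values = parse(lex("""_a "x y" _b 'p"q' _c 42"""))
print(values)                                    # {'_a': 'x y', '_b': 'p"q', '_c': '42'}
print(dictionary_check(values, "_c", r"[0-9]+"))  # True
```

The dictionary-level check never sees the enclosing quotes, which is the
sense of 'string value' in (4).
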
>>>
>>> So my voting question (2) is: should the 'string value' of a token
>>> referred to in (4) include the eliding characters?
>>>
>>>
>>> On Tue, Nov 24, 2009 at 10:57 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>>> A few points to consider:
>>>>
>>>> James Hester wrote:
>>>> ...
>>>>> 2. Character(s) used to indicate elision should be part of the string value
>>>> This does not specify where the elision character should be stripped. It
>>>> could be done by the parser or the dictionary-level code. The rule only
>>>> refers to the final string for the final output text, right?
>>>>
>>>>>
>>>>> Now for the specifics:
>>>>>
>>>>> 3.  Which of the following elision proposals do you support (more than one
>>>>> OK)?
>>>>>
>>>>>   Proposal (a) (intended to correspond to Nick's)
>>>>>    (i) A character which would otherwise be interpreted as a delimiter
>>>>> is elided by immediately preceding it with a reverse solidus.
>>>>>   (ii) Otherwise a reverse solidus in the string has no special
>>>>> lexical significance.
>>>>>
>>>>>   Proposal (b)
>>>>>    (i) The combinations <reverse solidus><quote> or a <reverse
>>>>> solidus><double quote> always signify <quote> and <double quote>
>>>>> respectively, regardless of the delimiter used in a particular string.
>>>>>    (ii) The combinations in (i) elide the <quote> or <double quote>
>>>>> character where that character would otherwise terminate the string
>>>>>    (iii) Apart from (i) and (ii), the reverse solidus has no special
>>>>> significance
>>>>>    (iv) If not used as the string delimiter, <quote> or <double quote>
>>>>> when not preceded by <reverse solidus> represent themselves.
>>>>
>>>> In both forms <reverse solidus><reverse solidus> should also be defined
>>>> in order to allow a literal string that ends in <reverse solidus>. For
>>>> example, a single <reverse solidus> character has to be written as "\\",
>>>> to avoid eliding the close quote.
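
A minimal sketch of proposal (a) with the double reverse solidus added.
The name read_quoted is hypothetical, and the sketch assumes the eliding
reverse solidus is removed from the returned value, which is exactly the
point still under vote.

```python
def read_quoted(text, start=0):
    """Read a delimited string at text[start]; return (value, index after close)."""
    quote = text[start]
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in (quote, "\\"):
            out.append(text[i + 1])  # elided delimiter or literal reverse solidus
            i += 2
        elif ch == quote:
            return "".join(out), i + 1
        else:
            out.append(ch)           # any other reverse solidus is kept as is
            i += 1
    raise ValueError("unterminated string")


print(read_quoted(r'"a\"b"'))  # ('a"b', 6)
print(read_quoted(r'"\\"'))    # ('\\', 4): the value is one reverse solidus
try:
    read_quoted(r'"\"')        # the lone reverse solidus elides the close quote
except ValueError as err:
    print(err)                 # unterminated string
```

The last case is the one described above: without the doubled form, a
string ending in a reverse solidus cannot be closed.
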
>>>>
>>>> Joe Krahn
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>
>>>
>>
>> cheers
>>
>> Nick
>>
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>>
>> The University of Western Australia    t: +61 (0)8 6488 3452
>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>> MBDP  M002
>>
>> CRICOS Provider Code: 00126G
>>
>> e: Nick.Spadaccini@uwa.edu.au
>>
>>
>>
>>
>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
