Re: [ddlm-group] Use of elides in strings

I wholeheartedly agree with Nick's suggestion.

On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> It appears to me that we have spent far too long on a syntactic issue which
> can be avoided 99.9999% of the time. Quite simply, given the 5 ways to
> delimit strings, it is next to impossible to get a situation where you
> cannot choose one of those to make the problem go away.
> I think the RCSB systematically avoid it by choosing
> "ab'cd"
> 'ab"cd'
> ;ab'"cd
> ;
> But now we additionally have """ and ''' to choose from, making it even
> easier.
> So I propose, in line with James' position, that there is NO eliding of
> terminator characters at the CIF2 syntax level. ALL elides in the string are
> assumed to be user-specific encoding (say TeX, IUCr \greek), which can be
> resolved at the dictionary level.
> This necessarily means NO terminator character can appear in a string
> delimited by the same terminator character. You will need to choose a
> different terminator character. That is
> No " in "strings"
> No ' in 'strings'
> No """ in """strings""" (but separable individual and doublet " are allowed)
> No ''' in '''strings''' (but separable individual and doublet ' are allowed)
> EVERYTHING in the string is returned as raw (except the initiating and
> terminating characters).
> The only time you will not be able to encode a value in a delimited string
> is when you want to include ' " """ ''' and \n; all in the one string. The
> likelihood of that is almost zero, unless you want to include a CIF
> within a CIF (a silly thing to do IMHO). In that case the contents can be
> encoded in a dictionary-driven way. I suggest it be declared as a BASE64
> type, and then all the syntactic ambiguity disappears.
> Problem solved! No need to elide because of CIF2 syntax rules: all elides
> are user driven, and contents are returned raw.
> As for Herb's comment in a recent email (what about line-folding?), the
> same holds. That is NOT a lexer issue and it has nothing to do with the
> parser: everything is read literally and returned raw, and what to do with
> it is left to the downstream application.
> Straw vote - No elides of terminator strings as described above - Nick
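
To make the delimiter-choice rule above concrete, here is a minimal Python sketch. It is not part of Nick's proposal: the function name choose_delimiter is hypothetical, and it assumes that ' and " strings may not span lines and that a triple-quoted string may not end in its own quote character (one reading of "separable").

```python
from typing import Optional


def choose_delimiter(value: str) -> Optional[str]:
    """Return a delimiter that can wrap `value` with no eliding, or None.

    Hypothetical helper: tries the five delimiters in a fixed order
    (', ", ''', triple double quote, semicolon text field).
    """
    # Assumption: single- and double-quoted strings stay on one line.
    single_line = "\n" not in value
    for quote in ("'", '"'):
        if single_line and quote not in value:
            return quote

    # Triple quotes: the value must not contain the triple itself, and
    # (assumption) must not end with the quote character, otherwise the
    # closing run of quotes would terminate the string one character early.
    for quote in ("'", '"'):
        triple = quote * 3
        if triple not in value and not value.endswith(quote):
            return triple

    # Semicolon text field: terminated by a newline followed by ';'.
    if "\n;" not in value:
        return ";"

    # Nothing fits: the value would need dictionary-level encoding.
    return None
```

With this sketch, choose_delimiter("ab'cd") returns the double quote, and a value containing ''' and """ together with a newline followed by ; returns None, which is the case where a dictionary-driven encoding such as BASE64 would be needed.
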
> On 24/11/09 10:00 AM, "James Hester" <jamesrhester@gmail.com> wrote:
>> OK, my rewritten voting proposal appears to be an abject failure.  Let
>> me repeat 1 as clearly as possible
>> 1.  Should CIF2 allow elision of terminator characters?  In other
>> words, should we make it possible to include <quote> as a normal
>> character in a <quote> delimited string?
>> Herbert:  It's difficult to understand how to rephrase things if it is
>> not clear where exactly the problem lies.
>> Joe: good point about double backslash.  Consider this added to proposal (a).
>> Before we discuss (2) precisely, can we agree to use the following
>> abstract model and terminology for CIF2 file parsing and dictionary
>> application?  If not, please indicate your alternative.
>> 1. A CIF lexer separates a CIF file into tokens according to the CIF2
>> syntax specification only, that is, this process cannot be altered by
>> DDL directives.
>> 2. A CIF parser accepts the tokens from the lexer.  CIF parsers can be
>> modelled as performing at least the following actions with these
>> tokens:
>>   (i) assignment of datavalue to dataname
>>   (ii) grouping looped datanames into a set
>>   (iii) assigning looped datavalues to the appropriate dataname and packet
>>   (iv) editing datavalues according to the syntax specification if
>> this has not been performed in the lexer (e.g. stripping enclosing
>> quotes, removing elides)
>> 3. DDL dictionaries operate on and refer to the datavalues and
>> datanames returned by the CIF parser after (2).  They have no ability
>> to influence the lexing process, or the parsing actions listed above
>> (in particular the datavalue editing).
>> 4. The 'string value' or 'value' of a token is that value returned by
>> the parser in (2).  In particular, this is the value that:
>>   (i) may be checked against regular expressions in the dictionary;
>>   (ii) is accessed by dREL expressions;
>>   (iii) is returned by dREL expressions;
>>   (iv) is referred to in dictionary descriptive text;
>>   (v) may be passed to client routines for further editing;
>>   (vi) may be passed to external applications
>> [Side note: in other words the parser returns the CIF "infoset" and
>> the dictionaries refer to the CIF "infoset", but we haven't been
>> talking in those terms so I've been more explicit].
>> So my voting question (2) is: should the 'string value' of a token
>> referred to in (4) include the eliding characters?
>> On Tue, Nov 24, 2009 at 10:57 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>> A few points to consider:
>>> James Hester wrote:
>>> ...
>>>> 2. Character(s) used to indicate elision should be part of the string value
>>> This does not specify where the elision character should be stripped. It
>>> could be done by the parser or the dictionary-level code. The rule only
>>> refers to the final string for the final output text, right?
>>>> Now for the specifics:
>>>> 3.  Which of the following elision proposals do you support (more than one
>>>> OK)?
>>>>   Proposal (a) (intended to correspond to Nick's)
>>>>    (i) A character which would otherwise be interpreted as a delimiter
>>>> is elided by immediately preceding it with a reverse solidus.
>>>>   (ii) Otherwise a reverse solidus in the string has no special
>>>> lexical significance.
>>>>   Proposal (b)
>>>>    (i) The combinations <reverse solidus><quote> and <reverse
>>>> solidus><double quote> always signify <quote> and <double quote>
>>>> respectively, regardless of the delimiter used in a particular string.
>>>>    (ii) The combinations in (i) elide the <quote> or <double quote>
>>>> character where that character would otherwise terminate the string
>>>>    (iii) Apart from (i) and (ii), the reverse solidus has no special
>>>> significance
>>>>    (iv) If not used as the string delimiter, <quote> or <double quote>
>>>> when not preceded by <reverse solidus> represent themselves.
>>> In both forms <reverse solidus><reverse solidus> should also be defined
>>> in order to allow a literal string that ends in <reverse solidus>. For
>>> example, a single <reverse solidus> character has to be written as "\\",
>>> to avoid eliding the close quote.
>>> Joe Krahn
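
For comparison, here is a rough Python sketch of how a lexer might find the end of a string under Proposal (a) with Joe's double-backslash addition. The function name scan_quoted and its return shape are hypothetical. It returns the raw content with the eliding characters still in place, because whether they belong to the string value is exactly voting question (2).

```python
from typing import Tuple


def scan_quoted(source: str, start: int = 0) -> Tuple[str, int]:
    """Scan a quote-delimited string beginning at source[start].

    Returns (raw_content, index_after_closing_delimiter).
    Sketch of Proposal (a): a reverse solidus elides the delimiter, and
    (Joe's addition) a doubled reverse solidus is a literal one. Any other
    reverse solidus has no lexical significance.
    """
    delim = source[start]
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] in (delim, "\\"):
            # Skip the elided delimiter or the second backslash of a pair.
            i += 2
            continue
        if ch == delim:
            # Unelided delimiter: the string ends here.
            return source[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string")
```

Here a backslash before any other character, as in "\g", is passed through untouched, while "\\" closes normally instead of eliding the final quote.
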
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> cheers
> Nick
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
> CRICOS Provider Code: 00126G
> e: Nick.Spadaccini@uwa.edu.au

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148