Re: [ddlm-group] Use of elides in strings

It appears to me that we have spent far too long on a syntactic issue which
can be avoided 99.9999% of the time. Quite simply, given the 5 ways to
delimit strings, it is next to impossible to end up in a situation where you
cannot choose one of them to make the problem go away.

I think the RCSB systematically avoid it by choosing

"ab'cd"
'ab"cd'
;ab'"cd
;

But now we additionally have """ and ''' to choose from, making it even
easier.
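
To make the choice concrete, here is a minimal Python sketch of picking a
delimiter for a given value. The names choose_delimiter and DELIMITERS are
made up for illustration and are not part of any CIF library. It also assumes
that the single-character delimiters cannot span lines, and it writes the
text-field delimiter as "\n;" (a semicolon at the start of a line).

```python
# Candidate (opener, closer, multiline_ok) triples, in order of preference.
# Assumption: "..." and '...' strings must stay on one line.
DELIMITERS = [
    ('"', '"', False),
    ("'", "'", False),
    ('"""', '"""', True),
    ("'''", "'''", True),
    ("\n;", "\n;", True),   # text field: semicolon in column 1
]


def choose_delimiter(value):
    """Return value wrapped in the first usable delimiter, or None."""
    for opener, closer, multiline_ok in DELIMITERS:
        if "\n" in value and not multiline_ok:
            continue
        # The closer must first appear exactly at the end of the value,
        # otherwise the string would terminate early.
        if (value + closer).find(closer) == len(value):
            return opener + value + closer
    return None  # no delimiter fits; the value needs another encoding
```

With this sketch, ab'cd comes back double-quoted, ab"cd comes back
single-quoted, and a value containing both falls through to one of the
multi-line forms.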

So I propose, in line with James' position, that there is NO eliding of
terminator characters at the CIF2 syntax level. ALL elides in the string are
assumed to be user-specific encoding (say TeX, IUCr \greek) which can be
resolved at the dictionary level.
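
As a rough illustration of resolving such an encoding downstream, the Python
sketch below applies a small code table to a raw value after parsing. The
table is a hypothetical three-entry subset in the style of the IUCr \greek
codes, and resolve_markup is a made-up name; a real dictionary would declare
the full encoding.

```python
# Hypothetical subset of a dictionary-declared markup table.
GREEK_CODES = {
    "\\a": "\u03b1",  # alpha
    "\\b": "\u03b2",  # beta
    "\\g": "\u03b3",  # gamma
}


def resolve_markup(raw_value, codes=GREEK_CODES):
    """Replace each markup code in the raw parser value with its character."""
    result = raw_value
    for code, char in codes.items():
        result = result.replace(code, char)
    return result
```

The point is only that this step sees the raw value and runs after the
parser; the lexer and parser never look at the backslash.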

This necessarily means NO terminator character can appear in a string
delimited by the same terminator character. You will need to choose a
different terminator character. That is

No " in "strings"
No ' in 'strings'
No """ in """strings""" (but separable individual and doublet " are allowed)
No ''' in '''strings''' (but separable individual and doublet ' are allowed)

EVERYTHING in the string is returned as raw (except the initiating and
terminating characters).
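
The four quote rules above can be sketched in a few lines of Python. This is
an illustration under the proposal as stated, not a complete CIF2 lexer: the
name read_quoted is invented, text fields delimited by ; are left out, and
the first occurrence of the opening delimiter is taken as the terminator.

```python
def read_quoted(text, pos=0):
    """Return (raw_value, next_pos) for the quoted string at text[pos]."""
    # Try the triple delimiters first so that """ is not read as "" + ".
    for delim in ('"""', "'''", '"', "'"):
        if text.startswith(delim, pos):
            start = pos + len(delim)
            end = text.find(delim, start)
            if end == -1:
                raise ValueError("unterminated string")
            # No elide handling: a backslash is returned like any other
            # character, and only the delimiters are stripped.
            return text[start:end], end + len(delim)
    raise ValueError("no string delimiter at position %d" % pos)
```

Note that there is no escape processing anywhere in the function, which is
the whole point of the proposal.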

The only time you will not be able to encode something in a delimited string
is when you want to include ' " """ ''' and \n; all in the one string. The
likelihood of that is almost zero, unless you want to include a CIF
within a CIF (a silly thing to do IMHO). In that case the contents can be
encoded in a dictionary-driven way. I suggest it be declared as a BASE64
type and then all the syntactic ambiguity disappears.
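
A short Python sketch of that fallback, using the standard base64 module. The
function names are made up, and UTF-8 is assumed as the text encoding; how a
dictionary would actually declare the BASE64 type is not shown.

```python
import base64


def to_base64_value(raw_text):
    """Encode text so that it contains no CIF delimiter characters."""
    return base64.b64encode(raw_text.encode("utf-8")).decode("ascii")


def from_base64_value(encoded):
    """Recover the original text from a BASE64-typed value."""
    return base64.b64decode(encoded).decode("utf-8")
```

The Base64 alphabet has no quotes, semicolons or newlines, so the encoded
value can sit inside any of the delimiters.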

Problem solved! No need to elide because of CIF2 syntax rules: all elides
are user-driven, and contents are returned raw.

As for Herb's comment in a recent email (what about line-folding?), the
same holds. That is NOT a lexer issue and it has nothing to do with the
parser: everything is read literally and returned raw, and what to do with
it is left to the downstream application.

Straw vote - No elides of terminator strings as described above - Nick


On 24/11/09 10:00 AM, "James Hester" <jamesrhester@gmail.com> wrote:

> OK, my rewritten voting proposal appears to be an abject failure.  Let
> me repeat 1 as clearly as possible
> 
> 1.  Should CIF2 allow elision of terminator characters?  In other
> words, should we make it possible to include <quote> as a normal
> character in a <quote> delimited string?
> 
> Herbert:  It's difficult to understand how to rephrase things if it is
> not clear where exactly the problem lies.
> 
> Joe: good point about double backslash.  Consider this added to proposal (a).
> 
> Before we discuss (2) precisely, can we agree to use the following
> abstract model and terminology for CIF2 file parsing and dictionary
> application?  If not, please indicate your alternative.
> 
> 1. A CIF lexer separates a CIF file into tokens according to the CIF2
> syntax specification only, that is, this process cannot be altered by
> DDL directives.
> 
> 2. A CIF parser accepts the tokens from the lexer.  CIF parsers can be
> modelled as performing at least the following actions with these
> tokens:
>   (i) assignment of datavalue to dataname
>   (ii) grouping looped datanames into a set
>   (iii) assigning looped datavalues to the appropriate dataname and packet
>   (iv) editing datavalues according to the syntax specification if
> this has not been performed in the lexer (e.g. stripping enclosing
> quotes, removing elides)
> 
> 3. DDL dictionaries operate on and refer to the datavalues and
> datanames returned by the CIF parser after (2).  They have no ability
> to influence the lexing process, or the parsing actions listed above
> (in particular the datavalue editing).
> 
> 4. The 'string value' or 'value' of a token is that value returned by
> the parser in (2).  In particular, this is the value that:
>   (i) may be checked against regular expressions in the dictionary;
>   (ii) is accessed by dREL expressions;
>   (iii) is returned by dREL expressions;
>   (iv) is referred to in dictionary descriptive text;
>   (v) may be passed to client routines for further editing;
>   (vi) may be passed to external applications
> 
> [Side note: in other words the parser returns the CIF "infoset" and
> the dictionaries refer to the CIF "infoset", but we haven't been
> talking in those terms so I've been more explicit].
> 
> So my voting question (2) is: should the 'string value' of a token
> referred to in (4) include the eliding characters?
> 
> 
> On Tue, Nov 24, 2009 at 10:57 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>> A few points to consider:
>> 
>> James Hester wrote:
>> ...
>>> 2. Character(s) used to indicate elision should be part of the string value
>> This does not specify where the elision character should be stripped. It
>> could be done by the parser or the dictionary-level code. The rule only
>> refers to the final string for the final output text, right?
>> 
>>> 
>>> Now for the specifics:
>>> 
>>> 3.  Which of the following elision proposals do you support (more than one
>>> OK)?
>>> 
>>>   Proposal (a) (intended to correspond to Nick's)
>>>    (i) A character which would otherwise be interpreted as a delimiter
>>> is elided by immediately preceding it with a reverse solidus.
>>>   (ii) Otherwise a reverse solidus in the string has no special
>>> lexical significance.
>>> 
>>>   Proposal (b)
>>>    (i) The combinations <reverse solidus><quote> or a <reverse
>>> solidus><double quote> always signify <quote> and <double quote>
>>> respectively, regardless of the delimiter used in a particular string.
>>>    (ii) The combinations in (i) elide the <quote> or <double quote>
>>> character where that character would otherwise terminate the string
>>>    (iii) Apart from (i) and (ii), the reverse solidus has no special
>>> significance
>>>    (iv) If not used as the string delimiter, <quote> or <double quote>
>>> when not preceded by <reverse solidus> represent themselves.
>> 
>> In both forms <reverse solidus><reverse solidus> should also be defined
>> in order to allow a literal string that ends in <reverse solidus>. For
>> example, a single <reverse solidus> character has to be written as "\\",
>> to avoid eliding the close quote.
>> 
>> Joe Krahn
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> 
> 
> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au





