
Re: [ddlm-group] Use of elides in strings

Dear James,

   I started to write:
   "No, in CIF 1.1, none of the terminal quote marks, including the \n;, are
effective unless followed by whitespace (\n, space, tab, or end of file).
This is a well-established, and very tricky part of the CIF spec going 
back to 1990.  That is why Nick had to explicitly specify that a terminal 
quote mark would be effective no matter what it was followed by."

   But the grammar currently on the IUCr web site is _not_ the one that
I recall COMCIFS discussing and approving.  It now explicitly removes
the requirement for terminal white space in the special case of
the \n; text field terminator.  I don't recall when that change was 
adopted, but it appears that you are right under the current spec
about the example I chose.  Inasmuch as there is a lot of working code
that enforces and uses the original whitespace handling and uses it
in line-folding, I will not revise CIFtbx 3, but I will try to do
something to adapt to this change for CIFtbx 4.
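
   To make the difference between the two readings concrete, here is a
minimal Python sketch. It is not CIFtbx code; the function name
text_field_end and the require_whitespace flag are invented purely for
illustration. It finds the line that closes a text field under each rule,
applied to the three-line example quoted below.

```python
def text_field_end(lines, start, require_whitespace):
    """Return the index of the line that closes the text field opened on
    lines[start], or None if the field is never closed.

    require_whitespace=True  : a line-initial ';' closes the field only when
                               followed by space, tab, or end of line/file
                               (the older handling described above).
    require_whitespace=False : any line-initial ';' closes the field
                               (the grammar as currently published).
    """
    assert lines[start].startswith(";")
    for i in range(start + 1, len(lines)):
        line = lines[i]
        if not line.startswith(";"):
            continue
        rest = line[1:]
        if not require_whitespace or rest == "" or rest[0] in " \t":
            return i
    return None


# The example from the quoted message: ;\  ;\  ;
example = [";\\", ";\\", ";"]
print(text_field_end(example, 0, require_whitespace=True))   # 2
print(text_field_end(example, 0, require_whitespace=False))  # 1
```

   Under the older rule the field runs to the third line; under the
current grammar it closes on the second line, leaving a stray reverse
solidus behind.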

   I guess we are just going to have yet another few dialects of CIF.

   Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 25 Nov 2009, James Hester wrote:

> To be precise, we are not 'referring all elides to the application'
> because no elides are recognised by the lexer under Nick's latest
> suggestion, so there are no elides to refer to the application.
>
> My understanding of CIF1.1 syntax suggests that the string you provide
> would produce a syntax error in CIF1.1, as the semicolon at the start
> of the second line would terminate the string, and so whitespace
> should then appear as the second character on the second line, rather
> than reverse solidus.
>
> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> The only problem with referring all elides to the application is that
>> with the removal of the requirement of a blank after a \n; for it to be
>> effective, the line folding protocol develops a slight gap.  The
>> case is as follows
>>
>> ;\
>> ;\
>> ;
>>
>> is a valid single text field in CIF 1.1, which when handled with the
>> line folding protocol translates to the equivalent of ';' because the
>> embedded ;\ is not a valid text terminator.  If we require that
>> a text field that begins with "\n;\\" must be terminated by "\n; "
>> or "\n;\n" or "\n;\t" that problem would be fixed.
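
A rough Python sketch of the unfolding step for this case. It assumes the
CIF 1.1 line-folding convention in which a lone reverse solidus on the
opening line switches folding on, and a trailing reverse solidus (optionally
followed by blanks) joins a line to the next. The function name unfold is
hypothetical, and the terminator has already been located using the older
whitespace rule.

```python
def unfold(content_lines):
    """Undo CIF 1.1 style line folding on the lines of a text field.

    content_lines excludes the opening ';' character and the closing ';' line.
    Folding is active only if the first line is a lone backslash
    (trailing blanks allowed); a trailing backslash then joins a line
    to the one that follows it.
    """
    if not content_lines or content_lines[0].rstrip(" \t") != "\\":
        return "\n".join(content_lines)   # not folded: return unchanged
    out, current = [], ""
    for line in content_lines[1:]:
        stripped = line.rstrip(" \t")
        if stripped.endswith("\\"):
            current += stripped[:-1]      # drop the fold mark, keep joining
        else:
            out.append(current + line)
            current = ""
    if current:
        out.append(current)               # field ended on a folded line
    return "\n".join(out)


# Contents of the example field: a fold marker, then the line ;\
print(unfold(["\\", ";\\"]))  # ;
```
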
>>
>>
>> On Wed, 25 Nov 2009, James Hester wrote:
>>
>>> I wholeheartedly agree with Nick's suggestion.
>>>
>>> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>> wrote:
>>>>
>>>> It appears to me that we have spent far too long on a syntactic issue
>>>> which can be avoided 99.9999% of the time. Quite simply, given the 5
>>>> ways to delimit strings, it is next to impossible to get a situation
>>>> where you cannot choose one of those to make the problem go away.
>>>>
>>>> I think the RCSB systematically avoid it by choosing
>>>>
>>>> "ab'cd"
>>>> 'ab"cd'
>>>> ;ab'"cd
>>>> ;
>>>>
>>>> But now we additionally have """ and ''' to choose from, making it even
>>>> easier.
>>>>
>>>> So I propose, in line with James' position, that there is NO eliding of
>>>> terminator characters at the CIF2 syntax level. ALL elides in the
>>>> string are assumed to be user specific encoding (say TeX, IUCr \greek)
>>>> which can be resolved at the dictionary level.
>>>>
>>>> This necessarily means NO terminator character can appear in a string
>>>> delimited by the same terminator character. You will need to choose a
>>>> different terminator character. That is
>>>>
>>>> No " in "strings"
>>>> No ' in 'strings'
>>>> No """ in """strings""" (but separable individual and doublet " are allowed)
>>>> No ''' in '''strings''' (but separable individual and doublet ' are allowed)
>>>>
>>>> EVERYTHING in the string is returned as raw (except the initiating and
>>>> terminating character).
>>>>
>>>> The only time you will not be able to encode anything in a delimited
>>>> string is when you want to include ' " """ ''' and \n; in the one
>>>> string. The likelihood of that is almost zero, unless you may want to
>>>> include a CIF within a CIF (a silly thing to do IMHO). In that case the
>>>> contents can be encoded in a dictionary driven way. I suggest it be
>>>> declared as a BASE64 type and then all the syntactic ambiguity
>>>> disappears.
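
One way to picture the "choose a different delimiter" rule is the Python
sketch below. The function names are hypothetical. It assumes that ' and "
strings stay on one line, and it adds one extra assumption not stated in the
proposal: contents that end in the quote character are refused a
triple-quoted form, since the closing delimiter would then be ambiguous.

```python
import base64


def choose_delimiter(value):
    """Pick a delimiter for value such that no eliding is needed.

    Returns (opening, closing), or None when every delimiter collides
    with the contents.
    """
    single_line = "\n" not in value
    if single_line and '"' not in value:
        return ('"', '"')
    if single_line and "'" not in value:
        return ("'", "'")
    for q in ('"""', "'''"):
        # Assumption: contents ending in the quote character would make
        # the closing delimiter ambiguous, so they are rejected here.
        if q not in value and not value.endswith(q[0]):
            return (q, q)
    if "\n;" not in value:
        return ("\n;", "\n;")
    return None


def encode_fallback(value):
    """Base64 form for the rare value that no delimiter can hold."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


print(choose_delimiter("ab'cd"))        # ('"', '"')
print(choose_delimiter('ab"cd'))        # ("'", "'")
print(choose_delimiter("ab'\"cd"))      # ('"""', '"""')
print(choose_delimiter("'\"\"\"'''\n;"))  # None
```

Only the last value, which contains all five delimiters at once, needs the
encoded fallback.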
>>>>
>>>> Problem solved! No need to elide because of CIF2 syntax rules; all
>>>> elides are user driven, and contents are returned raw.
>>>>
>>>> As for Herb's comment in a recent email about line-folding, the same
>>>> holds. That is NOT a lexer issue and it has nothing to do with the
>>>> parser; everything is read literally and returned raw, and what to do
>>>> with it is left to the downstream application.
>>>>
>>>> Straw vote - No elides of terminator strings as described above - Nick
>>>>
>>>>
>>>> On 24/11/09 10:00 AM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>
>>>>> OK, my rewritten voting proposal appears to be an abject failure.  Let
>>>>> me repeat 1 as clearly as possible
>>>>>
>>>>> 1.  Should CIF2 allow elision of terminator characters?  In other
>>>>> words, should we make it possible to include <quote> as a normal
>>>>> character in a <quote> delimited string?
>>>>>
>>>>> Herbert:  It's difficult to understand how to rephrase things if it is
>>>>> not clear where exactly the problem lies.
>>>>>
>>>>> Joe: good point about double backslash.  Consider this added to
>>>>> proposal (a).
>>>>>
>>>>> Before we discuss (2) precisely, can we agree to use the following
>>>>> abstract model and terminology for CIF2 file parsing and dictionary
>>>>> application?  If not, please indicate your alternative.
>>>>>
>>>>> 1. A CIF lexer separates a CIF file into tokens according to the CIF2
>>>>> syntax specification only, that is, this process cannot be altered by
>>>>> DDL directives.
>>>>>
>>>>> 2. A CIF parser accepts the tokens from the lexer.  CIF parsers can be
>>>>> modelled as performing at least the following actions with these
>>>>> tokens:
>>>>>   (i) assignment of datavalue to dataname
>>>>>   (ii) grouping looped datanames into a set
>>>>>   (iii) assigning looped datavalues to the appropriate dataname and
>>>>> packet
>>>>>   (iv) editing datavalues according to the syntax specification if
>>>>> this has not been performed in the lexer (e.g. stripping enclosing
>>>>> quotes, removing elides)
>>>>>
>>>>> 3. DDL dictionaries operate on and refer to the datavalues and
>>>>> datanames returned by the CIF parser after (2).  They have no ability
>>>>> to influence the lexing process, or the parsing actions listed above
>>>>> (in particular the datavalue editing).
>>>>>
>>>>> 4. The 'string value' or 'value' of a token is that value returned by
>>>>> the parser in (2).  In particular, this is the value that:
>>>>>   (i) may be checked against regular expressions in the dictionary;
>>>>>   (ii) is accessed by dREL expressions;
>>>>>   (iii) is returned by dREL expressions;
>>>>>   (iv) is referred to in dictionary descriptive text;
>>>>>   (v) may be passed to client routines for further editing;
>>>>>   (vi) may be passed to external applications
>>>>>
>>>>> [Side note: in other words the parser returns the CIF "infoset" and
>>>>> the dictionaries refer to the CIF "infoset", but we haven't been
>>>>> talking in those terms so I've been more explicit].
>>>>>
>>>>> So my voting question (2) is: should the 'string value' of a token
>>>>> referred to in (4) include the eliding characters?
>>>>>
>>>>>
>>>>> On Tue, Nov 24, 2009 at 10:57 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>>>>>
>>>>>> A few points to consider:
>>>>>>
>>>>>> James Hester wrote:
>>>>>> ...
>>>>>>>
>>>>>>> 2. Character(s) used to indicate elision should be part of the
>>>>>>> string value
>>>>>>
>>>>>> This does not specify where the elision character should be stripped.
>>>>>> It could be done by the parser or the dictionary-level code. The rule
>>>>>> only refers to the final string for the final output text, right?
>>>>>>
>>>>>>>
>>>>>>> Now for the specifics:
>>>>>>>
>>>>>>> 3.  Which of the following elision proposals do you support (more
>>>>>>> than one OK)?
>>>>>>>
>>>>>>>   Proposal (a) (intended to correspond to Nick's)
>>>>>>>    (i) A character which would otherwise be interpreted as a delimiter
>>>>>>> is elided by immediately preceding it with a reverse solidus.
>>>>>>>   (ii) Otherwise a reverse solidus in the string has no special
>>>>>>> lexical significance.
>>>>>>>
>>>>>>>   Proposal (b)
>>>>>>>    (i) The combinations <reverse solidus><quote> or a <reverse
>>>>>>> solidus><double quote> always signify <quote> and <double quote>
>>>>>>> respectively, regardless of the delimiter used in a particular string.
>>>>>>>    (ii) The combinations in (i) elide the <quote> or <double quote>
>>>>>>> character where that character would otherwise terminate the string
>>>>>>>    (iii) Apart from (i) and (ii), the reverse solidus has no special
>>>>>>> significance
>>>>>>>    (iv) If not used as the string delimiter, <quote> or <double quote>
>>>>>>> when not preceded by <reverse solidus> represent themselves.
>>>>>>
>>>>>> In both forms <reverse solidus><reverse solidus> should also be defined
>>>>>> in order to allow a literal string that ends in <reverse solidus>. For
>>>>>> example, a single <reverse solidus> character has to be written as
>>>>>> "\\", to avoid eliding the close quote.
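
A minimal Python sketch of a scanner for Proposal (a) with the
double-backslash addition shows why the extra rule is needed. The function
name read_quoted is hypothetical, and the sketch returns the contents with
the elide characters still in place, leaving any stripping to a later stage.

```python
def read_quoted(text, start):
    """Scan a quoted string beginning at text[start].

    A reverse solidus elides a following delimiter or a following
    reverse solidus; anywhere else it has no special meaning.
    Returns (raw_contents, index_after_closing_quote).
    """
    quote = text[start]
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in (quote, "\\"):
            i += 2          # elided delimiter or elided backslash
        elif ch == quote:
            return text[start + 1:i], i + 1
        else:
            i += 1
    raise ValueError("unterminated string")


print(read_quoted('"\\\\"', 0))   # ('\\\\', 4): the file text "\\" closes
try:
    read_quoted('"\\"', 0)        # the file text "\" never closes
except ValueError as err:
    print(err)                    # unterminated string
```
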
>>>>>>
>>>>>> Joe Krahn
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> ddlm-group@iucr.org
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>
>>>>>
>>>>>
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>>
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>> MBDP  M002
>>>>
>>>> CRICOS Provider Code: 00126G
>>>>
>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>
>>
>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
