
Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: "Herbert J. Bernstein" <[email protected]>
Date: Sun, 22 Nov 2009 13:29:13 -0500 (EST)

Yes, I would prefer that the lexer and parser deliver

"A\"BC"

as


A\"BC
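
The behaviour described above can be sketched as follows. This is only an illustration in Python, not part of any CIF2 specification or library; the function name read_double_quoted is hypothetical. It assumes that <backslash><quote> does not terminate the string and that the reverse solidus is passed through untouched.

```python
def read_double_quoted(text, start=0):
    """Return (raw_value, index_after_closing_quote) for a string whose
    opening double quote is at text[start].

    Hypothetical sketch: the lexer recognises <backslash><quote> only so
    that it does not end the string; it does not remove the backslash.
    """
    if text[start] != '"':
        raise ValueError("expected an opening double quote")
    i = start + 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] == '"':
            # Elided delimiter: keep both characters in the delivered value.
            chars.append('\\"')
            i += 2
        elif ch == '"':
            # Unelided delimiter: the string ends here.
            return "".join(chars), i + 1
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string")


raw, end = read_double_quoted(r'"A\"BC"')
print(raw)  # the five characters A \ " B C
```

Any further interpretation of the reverse solidus is left to whatever sits above the lexer.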

-- Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Sun, 22 Nov 2009, SIMON WESTRIP wrote:

> Thanks for this
> 
> Does everyone else agree that the value in
> 
> _label "A\"BC"
> 
> is A\"BC ?
> 
> Sorry to keep on about this, but at the end of THREAD3 I understood this to
> be the conclusion, then more recently it looked like there was some
> acceptance of dropping the backslash in a context-sensitive manner - i.e.
> the value being A"BC in this particular case.
> 
> Cheers
> 
> Simon
> 
> 
> ____________________________________________________________________________
> From: John Westbrook <[email protected]>
> To: SIMON WESTRIP <[email protected]>
> Cc: [email protected]; Group finalising DDLm and associated
> dictionaries <[email protected]>
> Sent: Sunday, 22 November, 2009 15:12:10
> Subject: Re: [ddlm-group] Use of elides in strings
> 
> 
> Hi Simon -
> 
> Subject to the regex example we would process this as A\"BC
> as the '\' is allowed in the regex.  We have loads of similar cases in which
> there is an embedded quote in a character string which is not surrounded
> by whitespace.  When Nick visited us recently he analyzed these cases
> and we agreed that we would be able to quote these with the opposite
> quote character, as in 'AB"C' or "AB'C", or in semicolons for the odd case
> in which both occur.  In none of these cases would we expect the
> internal quote to be escaped.
> 
> John
> 
> 
> SIMON WESTRIP wrote:
> > lexical analysis and parsing aside, in terms of specifying the syntax of
> > CIF2, what should someone expect the following to represent:
> >
> > _label "A\"BC"
> >
> > Is the value A\"BC or A"BC?
> >
> > I'm talking CIF2 only here (the use of elides for Greek etc in CIF1 will
> > no longer be part of the spec; rather it can be handled at the application
> > level or perhaps defined in the dictionary in some way as some sort of item
> > content type?)
> >
> > Cheers
> >
> > Simon
> >
> >
> >
> >
> > ________________________________
> > From: John Westbrook <[email protected]>
> > To: Group finalising DDLm and associated dictionaries <[email protected]>
> > Sent: Saturday, 21 November, 2009 14:02:19
> > Subject: Re: [ddlm-group] Use of elides in strings
> >
> >
> > To take another example in support of passing the data without
> > processing back to the application, DDL2 depends heavily on
> > using dictionary regex's to define the interpretation for
> > the application.  For instance, the regex for an atom code in
> > our dictionary is -
> >
> > [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*
> >
> > Not only does processing this regex present an issue for the
> > treatment of the '\', but it also defines how the '\' will
> > be interpreted in data subject to the regex.
> >
> > I agree with Herb's conclusion that the lexer should do
> > the minimum of interpretation.
> >
> > Regards,
> >
> > John
> >
> > Herbert J. Bernstein wrote:
> >> Dear Colleagues,
> >>
> >>  Let us consider James' example.  He is actually making the case
> >> for _not_ removing the reverse-solidus from a string at the
> >> lexical level.
> >>
> >>  xxxx<backslash><quote>elxxxx
> >>
> >> or to be more specific
> >>
> >>  abcd\'efgh
> >>
> >> and we are presented with the question of how the
> >> dictionary should interpret that string.
> >>
> >> If we have a string intended to be part of the modern pythonesque
> >> world, then I would expect the data element to have been typed
> >> in a way that says we should read the string as
> >>
> >>  abcd'efgh
> >>
> >> If we have a string that is a legacy from a CIF 1 file with
> >> IUCr type-setting codes, I would expect the data element to
> >> have been typed in a way that says we should read the string as
> >>
> >>  abcd(e with an acute accent)fgh
> >>
> >> Anything the lexer does to remove the reverse-solidus is
> >> going to disfavor one interpretation or the other.
> >>
> >> By moving these two interpretations one level up to two
> >> different utility routines, we gain much more use from
> >> a common lexer and nobody loses any functionality.
> >>
> >>  Regards,
> >>    Herbert
> >>
> >> On Sat, 21 Nov 2009, James Hester wrote:
> >>
> >>> Joe, I agree with you.  There is a fundamental issue here that I have
> >>> already raised, but can't see Herbert and John's proposal addressing:
> >>> if we allow lexical escaping, but then pass on both the escaping and
> >>> escaped character, how does the dictionary layer know if a given
> >>> character sequence represents an escape, or corresponds to something
> >>> else?  If the dictionary layer gets a string like:
> >>>
> >>> 'xxxx<backslash><quote>elxxxx', does that mean:
> >>>
> >>> 'xxxx<quote>elxxxx'
> >>>
> >>> or does it mean
> >>>
> >>> 'xxxx<e acute>lxxxx' ?
> >>>
> >>> (First case might be from a string "He said 'elephants are pink' ",
> >>> second case "Fren<e acute>l formalism"; apologies to French speakers,
> >>> I have no idea when to use e acute.)
> >>>
> >>> Similar examples can be constructed no matter what the alternative
> >>> meaning of <backslash><quote> might be in the particular domain.  The
> >>> key point is that you can't overload the meaning of
> >>> <escape><terminator>: either it is an instruction to the lexer, or it
> >>> has semantic meaning, but not both.  It doesn't even matter if the
> >>> lexer reads the dictionary definition before reading in the string
> >>> value: if two meanings are possible, the dictionary layer faces the
> >>> same problem.
> >>>
> >>> So: here is my latest proposal to deal with this issue:
> >>>
> >>> 1.  As in CIF1, there is no lexical elision available at all, ever.
> >>> All instances of the terminator terminate (unlike CIF1).
> >>>
> >>> 2.  Dictionary writers anticipate when a string value may run into
> >>> trouble due to this lack of elision (because those string values could
> >>> contain all of triple quote/triple double quote/<eol><semicolon>) and
> >>> describe a workaround in the dictionary: for example, inserting a
> >>> space between <eol> and <semicolon> when writing these string values,
> >>> and removing the space when reading them back.  We could provide
> >>> support by defining a special string type in DDLm with these
> >>> properties.
> >>>
> >>> I believe that this deals with all real and imagined problems.
> >>>
> >>> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <[email protected]> wrote:
> >>>> Clearly, Herbert is not referring to the 'reading and writing
> >>>> application' as the parser, but the application calling the parser. It
> >>>> makes things easier for the parser, but harder for the caller. It would
> >>>> not be that much of a problem, except that there are now several ways
> >>>> to quote strings, and the disallowed character sequences that need
> >>>> encoding vary among them.
> >>>>
> >>>> Herbert seems to view "the calling application" as a middle layer,
> >>>> rather than the program making use of the data. That sort of makes
> >>>> sense, in that conversion between strings and numeric values cannot
> >>>> happen at the CIF level. You could argue that a dictionary level middle
> >>>> layer is required to convert data to the final end-user form, and that
> >>>> handling character conversions at that level is more flexible. In
> >>>> general, that is a reasonable approach. However, even in that case, I
> >>>> think it is much less problematic to handle the few conversions that
> >>>> are specific to a given string quoting method at the parser level.
> >>>>
> >>>> Joe
> >>>>
> >>>> James Hester wrote:
> >>>>> First in reply to Joe: I believe that when Nick refers to the 'reading
> >>>>> and writing application' he indeed has in mind the CIF parser/CIF
> >>>>> writer layer, so I would guess that he agrees with your opinion as
> >>>>> well.  The issue is that we do not present an opaque storage format,
> >>>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
> >>>>> with text tools, so we need to cater to this method of interfacing to
> >>>>> CIFs as well.
> >>>>>
> >>>>> In reply to Herbert: your suggestion implies that we abandon any
> >>>>> *lexical* meaning for <elide><terminator>.  Or are you suggesting that
> >>>>> an application reads the dataname, then looks up the dictionary to
> >>>>> decide if it should continue to input the string when it sees
> >>>>> <elide><terminator>?  So we have dictionary-driven parsing?
> >>>>>
> >>>>> I can't work out from your previous email whether you are now in
> >>>>> support of abandoning elision as well as supporting treating all
> >>>>> strings as raw.  Please clarify...
> >>>>>
> >>>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
> >>>>> <[email protected]> wrote:
> >>>>>> Dear Colleagues,
> >>>>>>
> >>>>>>  There is a difference between what are useful utilities to have in
> >>>>>> an API in support of CIF2 and what is formally part of the base CIF2.
> >>>>>> I am all in favor of utilities to apply and unapply the various
> >>>>>> uses for the reverse solidus -- one for cleaning up python-style
> >>>>>> use, one to handle the IUCr special characters, one for line folding,
> >>>>>> etc., but I don't think that means we have to make one of those
> >>>>>> particular uses formally part of the base CIF2.
> >>>>>>
> >>>>>>  Regards,
> >>>>>>    Herbert
> >>>>>>
> >>>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
> >>>>>>
> >>>>>>> Unlike others here, I feel that a proper text archive library should
> >>>>>>> be able to take any string from the calling application, and return
> >>>>>>> that exact same string when reading it back in. It is the job of the
> >>>>>>> archive format to avoid delimiter problems. An application should be
> >>>>>>> able to store and retrieve strings without such worries, and
> >>>>>>> interface to an SQL database the same as it would interface to CIF.
> >>>>>>> All commonly used database libraries work this way. Why should CIF
> >>>>>>> continue to take an archaic approach?
> >>>>>>>
> >>>>>>> I essentially agree with the design below, except that the library
> >>>>>>> should handle insertion and removal of the reverse solidus for the
> >>>>>>> limited cases where it is required.
> >>>>>>>
> >>>>>>> If it is the client application's responsibility to deal with
> >>>>>>> reverse solidus escape sequences, then the description below doesn't
> >>>>>>> make sense. In that case, the reverse solidus never has any special
> >>>>>>> meaning to CIF2. Instead, CIF2 simply disallows certain character
> >>>>>>> sequences. A client application can use whatever it wants to
> >>>>>>> encode/decode the disallowed character sequences.
> >>>>>>>
> >>>>>>> The advantage of having well-defined escape sequences at the I/O
> >>>>>>> library level is that updates to the format do not require updates to
> >>>>>>> client applications. A CIF client application should be able to send a
> >>>>>>> string to the CIF library, and not have to know in advance what CIF
> >>>>>>> revision is in use, or whether the string is semicolon block quoted
> >>>>>>> or triple quoted. By requiring the client to escape invalid
> >>>>>>> sequences, the client will have to escape strings differently, i.e.
> >>>>>>> triple quote is OK within semi-colon quotes, and a leading semicolon
> >>>>>>> is OK within triple quotes, but not the other way around.
> >>>>>>>
> >>>>>>> Joe Krahn
> >>>>>>>
> >>>>>>>
> >>>>>>> Nick Spadaccini wrote:
> >>>>>>>> SUMMARISING.
> >>>>>>>>
> >>>>>>>> (a) The contents of delimited strings are returned as raw, with the
> >>>>>>>> token delimiters removed.
> >>>>>>>> (b) Where a delimiter character is to be part of the string, that
> >>>>>>>> character must be preceded by a reverse solidus when written out to
> >>>>>>>> the file. When read, any reverse solidus preceding a terminating
> >>>>>>>> character is deleted.
> >>>>>>>> (c) It is the responsibility of the writing and reading application
> >>>>>>>> to insert and remove the reverse solidus preceding the terminating
> >>>>>>>> character.
> >>>>>>>> (d) Otherwise the presence of a reverse solidus in the string has no
> >>>>>>>> meaning.
> >>>>>>> _______________________________________________
> >>>>>>> ddlm-group mailing list
> >>>>>>> [email protected]
> >>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> -- T +61 (02) 9717 9907
> >>> F +61 (02) 9717 3145
> >>> M +61 (04) 0249 4148
> >
> 
>
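
For reference, the two readings of abcd\'efgh discussed in the quoted thread can be sketched as two separate utility routines sitting above a common raw lexer. This is a rough Python illustration only: the function names are hypothetical, the first routine assumes a Python-style reading in which only <backslash><quote> is unescaped, and the second handles just the acute-accent code out of the IUCr type-setting codes.

```python
import unicodedata


def unescape_python_style(raw):
    """Read <backslash><quote> as a plain quote character."""
    return raw.replace("\\'", "'").replace('\\"', '"')


def decode_acute_codes(raw):
    """Read <backslash><apostrophe><letter> as that letter with an acute
    accent. Only this one type-setting code is handled in this sketch."""
    out = []
    i = 0
    while i < len(raw):
        if raw.startswith("\\'", i) and i + 2 < len(raw) and raw[i + 2].isalpha():
            # Compose the letter with U+0301 COMBINING ACUTE ACCENT.
            out.append(unicodedata.normalize("NFC", raw[i + 2] + "\u0301"))
            i += 3
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


raw = r"abcd\'efgh"                # what the raw lexer would deliver
print(unescape_python_style(raw))  # quote reading
print(decode_acute_codes(raw))     # accented-letter reading
```

Both routines take the same raw string, so the choice between them can be made from the item's dictionary type rather than by the lexer.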
