Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sun, 22 Nov 2009 13:29:13 -0500 (EST)
- In-Reply-To: <838719.99061.qm@web87005.mail.ird.yahoo.com>
- References: <C72C423A.12515%nick@csse.uwa.edu.au><4B06DEAF.4070109@niehs.nih.gov><alpine.BSF.2.00.0911201441550.25803@epsilon.pair.com><279aad2a0911201545m22547e50i39df8f165c1c340e@mail.gmail.com><4B0744F9.3040907@niehs.nih.gov><279aad2a0911210437v53726196p7ffd0fa9a3e1cee8@mail.gmail.com><alpine.BSF.2.00.0911210813280.11915@epsilon.pair.com><4B07F2EB.5070606@pdb-mail.rutgers.edu><576511.20853.qm@web87003.mail.ird.yahoo.com><4B0954CA.2070202@pdb-mail.rutgers.edu><838719.99061.qm@web87005.mail.ird.yahoo.com>
Yes, I would prefer that the lexer and parser deliver "A\"BC" as A\"BC

  -- Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sun, 22 Nov 2009, SIMON WESTRIP wrote:

> Thanks for this
>
> Does everyone else agree that the value in
>
> _label "A\"BC"
>
> is A\"BC ?
>
> Sorry to keep on about this, but at the end of THREAD3 I understood this
> to be the conclusion, then more recently it looked like there was some
> acceptance of dropping the backslash in a context-sensitive manner - i.e.
> the value being A"BC in this particular case.
>
> Cheers
>
> Simon
>
> ____________________________________________________________________________
> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
> To: SIMON WESTRIP <simonwestrip@btinternet.com>
> Cc: jwest@rcsb.rutgers.edu; Group finalising DDLm and associated
> dictionaries <ddlm-group@iucr.org>
> Sent: Sunday, 22 November, 2009 15:12:10
> Subject: Re: [ddlm-group] Use of elides in strings
>
> Hi Simon -
>
> Subject to the regex example we would process this as A\"BC
> as the '\' is allowed in the regex. We have loads of similar cases in
> which there is an embedded quote in a character string which is not
> surrounded by whitespace. When Nick visited us recently he analyzed these
> cases and we agreed that we would be able to quote these with the opposite
> quote character, as in 'AB"C' or "AB'C", or in semi-colons for the odd
> case in which both should occur. In none of these cases would we expect
> the internal quote to be escaped.
>
> John
>
> SIMON WESTRIP wrote:
> > lexical analysis and parsing aside, in terms of specifying the syntax of
> > CIF2, what should someone expect the following to represent:
> >
> > _label "A\"BC"
> >
> > Is the value A\"BC or A"BC?
> >
> > I'm talking CIF2 only here (the use of elides for greek etc in CIF1 will
> > no longer be part of the spec; rather it can be handled at the
> > application level or perhaps defined in the dictionary in some way as
> > some sort of item content type?)
> >
> > Cheers
> >
> > Simon
> >
> > ________________________________
> > From: John Westbrook <jwest@pdb-mail.rutgers.edu>
> > To: Group finalising DDLm and associated dictionaries
> > <ddlm-group@iucr.org>
> > Sent: Saturday, 21 November, 2009 14:02:19
> > Subject: Re: [ddlm-group] Use of elides in strings
> >
> > To take another example in support of passing the data without
> > processing back to the application, DDL2 depends heavily on
> > using dictionary regex's to define the interpretation for
> > the application. For instance, the regex for an atom code in
> > our dictionary is -
> >
> > [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*
> >
> > Not only does processing this regex present an issue for the
> > treatment of the '\', but it also defines how the '\' will
> > be interpreted in data subject to the regex.
> >
> > I agree with Herb's conclusion that the lexer should do
> > the minimum of interpretation.
> >
> > Regards,
> >
> > John
> >
> > Herbert J. Bernstein wrote:
> >> Dear Colleagues,
> >>
> >> Let us consider James' example. He is actually making the case
> >> for _not_ removing the reverse-solidus from a string at the
> >> lexical level.
> >>
> >> xxxx<backslash><quote>elxxxx
> >>
> >> or to be more specific
> >>
> >> abcd\'efgh
> >>
> >> and we are presented with the question of how should the
> >> dictionary interpret that string.
> >>
> >> If we have a string intended to be part of the modern pythonesque
> >> world, then I would expect the data element to have been typed
> >> in a way that says we should read the string as
> >>
> >> abcd'efgh
> >>
> >> If we have a string that is a legacy from a CIF 1 file with
> >> IUCr type-setting codes, I would expect the data element to
> >> have been typed in a way that says we should read the string as
> >>
> >> abcd{e with an acute accent}fgh
> >>
> >> Anything the lexer does to remove the reverse-solidus is
> >> going to disfavor one interpretation or the other.
> >>
> >> By moving these two interpretations one level up to two
> >> different utility routines, we gain much more use from
> >> a common lexer and nobody loses any functionality.
> >>
> >> Regards,
> >> Herbert
> >>
> >> =====================================================
> >> Herbert J. Bernstein, Professor of Computer Science
> >> Dowling College, Kramer Science Center, KSC 121
> >> Idle Hour Blvd, Oakdale, NY, 11769
> >>
> >> +1-631-244-3035
> >> yaya@dowling.edu
> >> =====================================================
> >>
> >> On Sat, 21 Nov 2009, James Hester wrote:
> >>
> >>> Joe, I agree with you. There is a fundamental issue here that I have
> >>> already raised, but can't see Herbert and John's proposal addressing:
> >>> if we allow lexical escaping, but then pass on both the escaping and
> >>> escaped character, how does the dictionary layer know if a given
> >>> character sequence represents an escape, or corresponds to something
> >>> else? If the dictionary layer gets a string like:
> >>>
> >>> 'xxxx<backslash><quote>elxxxx', does that mean:
> >>>
> >>> 'xxxx<quote>elxxxx'
> >>>
> >>> or does it mean
> >>>
> >>> 'xxxx<e acute>lxxxx' ?
> >>>
> >>> (First case might be from a string "He said 'elephants are pink' ",
> >>> second case "Fren<e acute>l formalism" (apologies to French speakers,
> >>> I have no idea when to use e acute)).
> >>>
> >>> Similar examples can be constructed no matter what the alternative
> >>> meaning of <backslash><quote> might be in the particular domain. The
> >>> key point is that you can't overload the meaning of
> >>> <escape><terminator>: either it is an instruction to the lexer, or it
> >>> has semantic meaning, but not both. It doesn't even matter if the
> >>> lexer reads the dictionary definition before reading in the string
> >>> value: if two meanings are possible, the dictionary layer faces the
> >>> same problem.
> >>>
> >>> So: here is my latest proposal to deal with this issue:
> >>>
> >>> 1. As in CIF1, there is no lexical elision available at all, ever.
> >>> All instances of the terminator terminate (unlike CIF1).
> >>>
> >>> 2. Dictionary writers anticipate when a string value may run into
> >>> trouble due to this lack of elision (because those string values could
> >>> contain all of triple quote/triple double quote/<eol><semicolon>) and
> >>> describe a workaround in the dictionary: for example, inserting a
> >>> space between <eol> and <semicolon> when writing these string values,
> >>> and removing the space when reading them back. We could provide
> >>> support by defining a special string type in DDLm with these
> >>> properties.
> >>>
> >>> I believe that this deals with all real and imagined problems.
> >>>
> >>> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
> >>>> Clearly, Herbert is not referring to the 'reading and writing
> >>>> application' as the parser, but the application calling the parser. It
> >>>> makes things easier for the parser, but harder for the caller. It would
> >>>> not be that much of a problem, except that there are now several ways
> >>>> to quote strings, and the disallowed character sequences that need
> >>>> encoding vary among them.
> >>>>
> >>>> Herbert seems to view "the calling application" as a middle layer,
> >>>> rather than the program making use of the data.
> >>>> That sort of makes sense, in that conversion between strings and
> >>>> numeric values cannot happen at the CIF level. You could argue that a
> >>>> dictionary level middle layer is required to convert data to the final
> >>>> end-user form, and that handling character conversions at that level
> >>>> is more flexible. In general, that is a reasonable approach. However,
> >>>> even in that case, I think it is much less problematic to handle the
> >>>> few conversions that are specific to a given string quoting method at
> >>>> the parser level.
> >>>>
> >>>> Joe
> >>>>
> >>>> James Hester wrote:
> >>>>> First in reply to Joe: I believe that when Nick refers to the 'reading
> >>>>> and writing application' he indeed has in mind the CIF parser/CIF
> >>>>> writer layer, so I would guess that he agrees with your opinion as
> >>>>> well. The issue is that we do not present an opaque storage format,
> >>>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
> >>>>> with text tools, so we need to cater to this method of interfacing to
> >>>>> CIFs as well.
> >>>>>
> >>>>> In reply to Herbert: your suggestion implies that we abandon any
> >>>>> *lexical* meaning for <elide><terminator>. Or are you suggesting that
> >>>>> an application reads the dataname, then looks up the dictionary to
> >>>>> decide if it should continue to input the string when it sees
> >>>>> <elide><terminator>? So we have dictionary-driven parsing?
> >>>>>
> >>>>> I can't work out from your previous email whether you are now in
> >>>>> support of abandoning elision as well as supporting treating all
> >>>>> strings as raw. Please clarify...
> >>>>>
> >>>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
> >>>>> <yaya@bernstein-plus-sons.com> wrote:
> >>>>>> Dear Colleagues,
> >>>>>>
> >>>>>> There is a difference between what are useful utilities to have in
> >>>>>> an API in support of CIF2 and what is formally part of the base CIF2.
> >>>>>> I am all in favor of utilities to apply and unapply the various
> >>>>>> uses for the reverse solidus -- one for cleaning up python-style
> >>>>>> use, one to handle the IUCr special characters, one for line folding,
> >>>>>> etc., but I don't think that means we have to make one of those
> >>>>>> particular uses formally part of the base CIF2.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Herbert
> >>>>>>
> >>>>>> =====================================================
> >>>>>> Herbert J. Bernstein, Professor of Computer Science
> >>>>>> Dowling College, Kramer Science Center, KSC 121
> >>>>>> Idle Hour Blvd, Oakdale, NY, 11769
> >>>>>>
> >>>>>> +1-631-244-3035
> >>>>>> yaya@dowling.edu
> >>>>>> =====================================================
> >>>>>>
> >>>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
> >>>>>>
> >>>>>>> Unlike others here, I feel that a proper text archive library should
> >>>>>>> be able to take any string from the calling application, and return
> >>>>>>> that exact same string when reading it back in. It is the job of the
> >>>>>>> archive format to avoid delimiter problems. An application should be
> >>>>>>> able to store and retrieve strings without such worries, and
> >>>>>>> interface to an SQL database the same as it would interface to CIF.
> >>>>>>> All commonly used database libraries work this way. Why should CIF
> >>>>>>> continue to take an archaic approach?
> >>>>>>>
> >>>>>>> I essentially agree with the design below, except that the library
> >>>>>>> should handle insertion and removal of the reverse solidus for the
> >>>>>>> limited cases where it is required.
> >>>>>>>
> >>>>>>> If it is the client application's responsibility to deal with
> >>>>>>> reverse solidus escape sequences, then the description below doesn't
> >>>>>>> make sense. In that case, the reverse solidus never has any special
> >>>>>>> meaning to CIF2. Instead, CIF2 simply disallows certain character
> >>>>>>> sequences.
> >>>>>>> A client application can use whatever it wants to encode/decode the
> >>>>>>> disallowed character sequences.
> >>>>>>>
> >>>>>>> The advantage of having well-defined escape sequences at the I/O
> >>>>>>> library level is that updates to the format do not require updates
> >>>>>>> to client applications. A CIF client application should be able to
> >>>>>>> send a string to the CIF library, and not have to know in advance
> >>>>>>> what CIF revision is in use, or whether the string is semicolon
> >>>>>>> block quoted or triple quoted. By requiring the client to escape
> >>>>>>> invalid sequences, the client will have to escape strings
> >>>>>>> differently, i.e. triple quote is OK within semi-colon quotes, and a
> >>>>>>> leading semicolon is OK within triple quotes, but not the other way
> >>>>>>> around.
> >>>>>>>
> >>>>>>> Joe Krahn
> >>>>>>>
> >>>>>>> Nick Spadaccini wrote:
> >>>>>>>> SUMMARISING.
> >>>>>>>>
> >>>>>>>> (a) The contents of delimited strings are returned as raw, with the
> >>>>>>>> token delimiters removed.
> >>>>>>>> (b) Where a delimiter character is to be part of the string, that
> >>>>>>>> character must be preceded by a reverse solidus when written out to
> >>>>>>>> the file. When read, any reverse solidus preceding a terminating
> >>>>>>>> character is deleted.
> >>>>>>>> (c) It is the responsibility of the writing and reading application
> >>>>>>>> to insert and remove the reverse solidus preceding the terminating
> >>>>>>>> character.
> >>>>>>>> (d) Otherwise the presence of a reverse solidus in the string has
> >>>>>>>> no meaning.
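Points (a) to (d) of the summary above can be read as a pair of small routines. The following is only a rough sketch of that reading, not anything agreed in the thread: the function names `write_delimited` and `read_delimited` are hypothetical, and it covers single-character delimiters only (no triple quotes or semicolon text fields).

```python
def write_delimited(value: str, delim: str = '"') -> str:
    """Rule (b), writing side: precede each delimiter character in the
    value with a reverse solidus, then wrap the value in the delimiters."""
    return delim + value.replace(delim, "\\" + delim) + delim


def read_delimited(token: str, delim: str = '"') -> str:
    """Rules (a), (b) and (d), reading side: remove the token delimiters,
    delete a reverse solidus that directly precedes a delimiter character,
    and leave every other reverse solidus untouched."""
    if len(token) < 2 or token[0] != delim or token[-1] != delim:
        raise ValueError("token is not wrapped in the expected delimiter")
    body = token[1:-1]
    return body.replace("\\" + delim, delim)
```

Under this reading the value A"BC is written as "A\"BC" and read back as A"BC, which is the interpretation the later messages in the thread argue against in favour of returning A\"BC unchanged. Note also that the sketch receives the token already cut out of the file; a value that ends in a reverse solidus would hide the closing delimiter from a scanner applying the same rule.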
> >>>>>>> _______________________________________________
> >>>>>>> ddlm-group mailing list
> >>>>>>> ddlm-group@iucr.org
> >>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >>>
> >>> --
> >>> T +61 (02) 9717 9907
> >>> F +61 (02) 9717 3145
> >>> M +61 (04) 0249 4148
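The position taken at the top of this message (the lexer delivers "A\"BC" as A\"BC, and interpretation happens one level up in separate utility routines) might be sketched as follows. This is an illustration under stated assumptions, not the group's agreed design: the names `lex_quoted`, `unelide_python_style` and `unelide_iucr_acute` are hypothetical, the lexer assumes that a reverse solidus directly before the delimiter keeps the string open, and the IUCr routine handles only the acute-accent code used in the thread's example.

```python
import re
import unicodedata
from typing import Tuple


def lex_quoted(text: str, start: int = 0) -> Tuple[str, int]:
    """Scan a quoted string that begins at text[start].

    Returns the raw body and the index just past the closing quote.
    A reverse solidus followed by the delimiter does not end the string,
    and the reverse solidus is left in the returned body."""
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not positioned at a quote delimiter")
    i = start + 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] == quote:
            i += 2  # skip the pair without interpreting it
            continue
        if text[i] == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string")


def unelide_python_style(raw: str) -> str:
    """Utility 1: read a reverse solidus before a quote as a plain quote."""
    return re.sub(r"\\(['\"])", r"\1", raw)


def unelide_iucr_acute(raw: str) -> str:
    """Utility 2: read \\' followed by a letter as that letter with an
    acute accent, in the manner of the CIF1 type-setting codes."""
    def accent(match):
        return unicodedata.normalize("NFC", match.group(1) + "\u0301")
    return re.sub(r"\\'([A-Za-z])", accent, raw)
```

With these definitions the lexer returns A\"BC for the token "A\"BC", and the same raw value abcd\'efgh yields abcd'efgh from the first utility and the accented form from the second. Which routine to call would be decided by how the data item is typed in the dictionary, which is the point of keeping the lexer neutral.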
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group