Re: [ddlm-group] Use of elides in strings
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sun, 22 Nov 2009 23:13:02 -0500 (EST)
- In-Reply-To: <C73024F0.12573%nick@csse.uwa.edu.au>
- References: <C73024F0.12573%nick@csse.uwa.edu.au>
I am now totally lost.  Please start over with a coherent proposal for
the syntax of a quoted string.  In particular, please state how the
following strings will be parsed:

"ab\"cd"
'ab\"cd'
"ab\\"cd"
'ab\\"cd'
;ab\"cd\
;
;ab\\"cd\\
;
"""ab\""""
"""ab\\""""
{"abcd\"":ggg}
{'abcd\"':ggg}
"resum\'ee"
'resum\'ee'

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Mon, 23 Nov 2009, Nick Spadaccini wrote:

> I'll try again.
>
> Yes Simon, what I am suggesting here after James' proposal is that it is
> different to the end of THREAD 3.
>
> READ MY LIPS (FINGERS)
>
> (1) This is the CIF2 specification. What is done in CIF1 encoding is
> irrelevant. ALL parsers will have to drop in and out of CIF1 and CIF2 modes
> for legacy reasons. Please desist from including CIF1 examples; they are a
> distraction.
>
> (2) James proposes (and I support) that the very special behaviour of a
> terminator (AND ONLY THIS CHARACTER) within a string delimited by the same
> character is a CIF2 issue that should be handled by the parser (read this to
> mean the application that is responsible for writing and reading CIF2
> files). WE MUST agree where this is done, because users have to know when
> they write or read a string who will handle the specialness of terminators.
>
> Herb and John argue it is the user's responsibility. Sorry, but most users
> would not understand what the issues are (there is enough confusion amongst
> us to make that clear).
>
> (3) As an example let's just discuss "" delimited strings, and I wish to
> include a " in that string. The proposal is that the parser ONLY EVER deals
> with the " character. ALL OTHER ELIDES etc are irrelevant - they will be
> passed on as raw. The \\ is NOT a special case, and I can see no reason for
> it to be considered.
>
> The process is: if the string to be written out has a " included, this will
> be elided as it is written out. When reading, as you parse left to right,
> when you find a ", if it has a preceding \ delete it, skip over the " and
> continue; otherwise it must be the terminating character. Everything else is
> left untouched.
>
> Examples
>
> HB example
>
> Say a user (Dr Joe Ordinary) wishes to output in a "" delimited string
>
> (1) abcd\"efgh. The parser would output "abcd\\"efgh"
> (2) abcd"efgh. The parser would output "abcd\"efgh"
>
> JW example
>
> [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*. The parser would output
> "[][ _(),.;:\"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*"
>
> The proposal is that ONLY \" is of any relevance to this process; all other
> elides are irrelevant. John, do you really expect to leave it up to the user
> to know that the \" in your regular expression was inserted as a CIF2
> requirement, rather than being part of the actual regex?
>
> Again the price is you can't cut and paste, but you can't do that anyway
> whether you take the JRH/NS view or the HB/JW view.
>
> I see real strength in requiring the parser to do this for the user. I can
> see no downside, but I do see downsides if you expect all users to know what
> to do.
>
> Herb is correct that all other handling of elides (the encodings) is left to
> the dictionary level definitions. JW quite correctly states that they avoid
> the problem by using a different delimiter character to the character they
> wish to insert. There will be very rare cases, though, where that may not be
> possible and the above HAS TO be done.
>
>
> On 23/11/09 2:29 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Yes, I would prefer that the lexer and parser deliver
>>
>> "A\"BC"
>>
>> as
>>
>> A\"BC
>>
>> -- Herbert
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>> =====================================================
>>
>> On Sun, 22 Nov 2009, SIMON WESTRIP wrote:
>>
>>> Thanks for this
>>>
>>> Does everyone else agree that the value in
>>>
>>> _label "A\"BC"
>>>
>>> is A\"BC ?
>>>
>>> Sorry to keep on about this, but at the end of THREAD3 I understood this
>>> to be the conclusion, then more recently it looked like there was some
>>> acceptance of dropping the backslash in a context-sensitive manner - i.e.
>>> the value being A"BC in this particular case.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ____________________________________________________________________________
>>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>>> To: SIMON WESTRIP <simonwestrip@btinternet.com>
>>> Cc: jwest@rcsb.rutgers.edu; Group finalising DDLm and associated
>>> dictionaries <ddlm-group@iucr.org>
>>> Sent: Sunday, 22 November, 2009 15:12:10
>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>
>>> Hi Simon -
>>>
>>> Subject to the regex example we would process this as A\"BC,
>>> as the '\' is allowed in the regex. We have loads of similar cases in
>>> which there is an embedded quote in a character string which is not
>>> surrounded by whitespace. When Nick visited us recently he analyzed these
>>> cases and we agreed that we would be able to quote these in the opposite
>>> sense, as 'AB"C' or "AB'C", or in semi-colons for the odd case in which
>>> both should occur. In none of these cases would we expect the
>>> internal quote to be escaped.
>>>
>>> John
>>>
>>>
>>> SIMON WESTRIP wrote:
>>>> lexical analysis and parsing aside, in terms of specifying the syntax of
>>>> CIF2, what should someone expect the following to represent:
>>>>
>>>> _label "A\"BC"
>>>>
>>>> Is the value A\"BC or A"BC?
>>>>
>>>> I'm talking CIF2 only here (the use of elides for greek etc in CIF1 will
>>>> no longer be part of the spec; rather it can be handled at the application
>>>> level or perhaps defined in the dictionary in some way as some sort of item
>>>> content type?)
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>> ________________________________
>>>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>>>> To: Group finalising DDLm and associated dictionaries
>>>> <ddlm-group@iucr.org>
>>>> Sent: Saturday, 21 November, 2009 14:02:19
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>> To take another example in support of passing the data without
>>>> processing back to the application, DDL2 depends heavily on
>>>> using dictionary regex's to define the interpretation for
>>>> the application. For instance, the regex for an atom code in
>>>> our dictionary is -
>>>>
>>>> [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*
>>>>
>>>> Not only does processing this regex present an issue for the
>>>> treatment of the '\', but it also defines how the '\' will
>>>> be interpreted in data subject to the regex.
>>>>
>>>> I agree with Herb's conclusion that the lexer should do
>>>> the minimum of interpretation.
>>>>
>>>> Regards,
>>>>
>>>> John
>>>>
>>>> Herbert J. Bernstein wrote:
>>>>> Dear Colleagues,
>>>>>
>>>>> Let us consider James' example. He is actually making the case
>>>>> for _not_ removing the reverse-solidus from a string at the
>>>>> lexical level.
>>>>>
>>>>> xxxx<backslash><quote>elxxxx
>>>>>
>>>>> or to be more specific
>>>>>
>>>>> abcd\'efgh
>>>>>
>>>>> and we are presented with the question of how should the
>>>>> dictionary interpret that string.
>>>>>
>>>>> If we have a string intended to be part of the modern pythonesque
>>>>> world, then I would expect the data element to have been typed
>>>>> in a way that says we should read the string as
>>>>>
>>>>> abcd'efgh
>>>>>
>>>>> If we have a string that is a legacy from a CIF 1 file with
>>>>> IUCr type-setting codes, I would expect the data element to
>>>>> have been typed in a way that says we should read the string as
>>>>>
>>>>> abcd(e with an acute accent)fgh
>>>>>
>>>>> Anything the lexer does to remove the reverse-solidus is
>>>>> going to disfavor one interpretation or the other.
>>>>>
>>>>> By moving these two interpretations one level up to two
>>>>> different utility routines, we gain much more use from
>>>>> a common lexer and nobody loses any functionality.
>>>>>
>>>>> Regards,
>>>>> Herbert
>>>>>
>>>>> =====================================================
>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>
>>>>>                  +1-631-244-3035
>>>>>                  yaya@dowling.edu
>>>>> =====================================================
>>>>>
>>>>> On Sat, 21 Nov 2009, James Hester wrote:
>>>>>
>>>>>> Joe, I agree with you. There is a fundamental issue here that I have
>>>>>> already raised, but can't see Herbert and John's proposal addressing:
>>>>>> if we allow lexical escaping, but then pass on both the escaping and
>>>>>> escaped character, how does the dictionary layer know if a given
>>>>>> character sequence represents an escape, or corresponds to something
>>>>>> else? If the dictionary layer gets a string like:
>>>>>>
>>>>>> 'xxxx<backslash><quote>elxxxx', does that mean:
>>>>>>
>>>>>> 'xxxx<quote>elxxxx'
>>>>>>
>>>>>> or does it mean
>>>>>>
>>>>>> 'xxxx<e acute>lxxxx' ?
>>>>>>
>>>>>> (First case might be from a string "He said 'elephants are pink' ",
>>>>>> second case "Fren<e acute>l formalism" (apologies to French speakers,
>>>>>> I have no idea when to use e acute).)
>>>>>>
>>>>>> Similar examples can be constructed no matter what the alternative
>>>>>> meaning of <backslash><quote> might be in the particular domain. The
>>>>>> key point is that you can't overload the meaning of
>>>>>> <escape><terminator>: either it is an instruction to the lexer, or it
>>>>>> has semantic meaning, but not both. It doesn't even matter if the
>>>>>> lexer reads the dictionary definition before reading in the string
>>>>>> value: if two meanings are possible, the dictionary layer faces the
>>>>>> same problem.
>>>>>>
>>>>>> So: here is my latest proposal to deal with this issue:
>>>>>>
>>>>>> 1. As in CIF1, there is no lexical elision available at all, ever.
>>>>>> All instances of the terminator terminate (unlike CIF1).
>>>>>>
>>>>>> 2. Dictionary writers anticipate when a string value may run into
>>>>>> trouble due to this lack of elision (because those string values could
>>>>>> contain all of triple quote/triple double quote/<eol><semicolon>) and
>>>>>> describe a workaround in the dictionary: for example, inserting a
>>>>>> space between <eol> and <semicolon> when writing these string values,
>>>>>> and removing the space when reading them back. We could provide
>>>>>> support by defining a special string type in DDLm with these
>>>>>> properties.
>>>>>>
>>>>>> I believe that this deals with all real and imagined problems.
>>>>>>
>>>>>> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>>>>>> Clearly, Herbert is not referring to the 'reading and writing
>>>>>>> application' as the parser, but the application calling the parser. It
>>>>>>> makes things easier for the parser, but harder for the caller. It would
>>>>>>> not be that much of a problem, except that there are now several ways
>>>>>>> to quote strings, and the disallowed character sequences that need
>>>>>>> encoding vary among them.
>>>>>>>
>>>>>>> Herbert seems to view "the calling application" as a middle layer,
>>>>>>> rather than the program making use of the data. That sort of makes
>>>>>>> sense, in that conversion between strings and numeric values cannot
>>>>>>> happen at the CIF level. You could argue that a dictionary level middle
>>>>>>> layer is required to convert data to the final end-user form, and that
>>>>>>> handling character conversions at that level is more flexible. In
>>>>>>> general, that is a reasonable approach. However, even in that case, I
>>>>>>> think it is much less problematic to handle the few conversions that
>>>>>>> are specific to a given string quoting method at the parser level.
>>>>>>>
>>>>>>> Joe
>>>>>>>
>>>>>>> James Hester wrote:
>>>>>>>> First in reply to Joe: I believe that when Nick refers to the 'reading
>>>>>>>> and writing application' he indeed has in mind the CIF parser/CIF
>>>>>>>> writer layer, so I would guess that he agrees with your opinion as
>>>>>>>> well. The issue is that we do not present an opaque storage format,
>>>>>>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
>>>>>>>> with text tools, so we need to cater to this method of interfacing to
>>>>>>>> CIFs as well.
>>>>>>>>
>>>>>>>> In reply to Herbert: your suggestion implies that we abandon any
>>>>>>>> *lexical* meaning for <elide><terminator>. Or are you suggesting that
>>>>>>>> an application reads the dataname, then looks up the dictionary to
>>>>>>>> decide if it should continue to input the string when it sees
>>>>>>>> <elide><terminator>? So we have dictionary-driven parsing?
>>>>>>>>
>>>>>>>> I can't work out from your previous email whether you are now in
>>>>>>>> support of abandoning elision as well as supporting treating all
>>>>>>>> strings as raw. Please clarify...
>>>>>>>>
>>>>>>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>> Dear Colleagues,
>>>>>>>>>
>>>>>>>>> There is a difference between what are useful utilities to have in
>>>>>>>>> an API in support of CIF2 and what is formally part of the base CIF2.
>>>>>>>>> I am all in favor of utilities to apply and unapply the various
>>>>>>>>> uses for the reverse solidus -- one for cleaning up python-style
>>>>>>>>> use, one to handle the IUCr special characters, one for line folding,
>>>>>>>>> etc., but I don't think that means we have to make one of those
>>>>>>>>> particular uses formally part of the base CIF2.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Herbert
>>>>>>>>>
>>>>>>>>> =====================================================
>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>
>>>>>>>>>                  +1-631-244-3035
>>>>>>>>>                  yaya@dowling.edu
>>>>>>>>> =====================================================
>>>>>>>>>
>>>>>>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
>>>>>>>>>
>>>>>>>>>> Unlike others here, I feel that a proper text archive library should
>>>>>>>>>> be able to take any string from the calling application, and return
>>>>>>>>>> that exact same string when reading it back in. It is the job of the
>>>>>>>>>> archive format to avoid delimiter problems. An application should be
>>>>>>>>>> able to store and retrieve strings without such worries, and
>>>>>>>>>> interface to an SQL database the same as it would interface to CIF.
>>>>>>>>>> All commonly used database libraries work this way. Why should CIF
>>>>>>>>>> continue to take an archaic approach?
>>>>>>>>>>
>>>>>>>>>> I essentially agree with the design below, except that the library
>>>>>>>>>> should handle insertion and removal of the reverse solidus for the
>>>>>>>>>> limited cases where it is required.
>>>>>>>>>>
>>>>>>>>>> If it is the client application's responsibility to deal with
>>>>>>>>>> reverse solidus escape sequences, then the description below doesn't
>>>>>>>>>> make sense. In that case, the reverse solidus never has any special
>>>>>>>>>> meaning to CIF2. Instead, CIF2 simply disallows certain character
>>>>>>>>>> sequences. A client application can use whatever it wants to
>>>>>>>>>> encode/decode the disallowed character sequences.
>>>>>>>>>>
>>>>>>>>>> The advantage of having well-defined escape sequences at the I/O
>>>>>>>>>> library level is that updates to the format do not require updates
>>>>>>>>>> to client applications. A CIF client application should be able to
>>>>>>>>>> send a string to the CIF library, and not have to know in advance
>>>>>>>>>> what CIF revision is in use, or whether the string is semicolon
>>>>>>>>>> block quoted or triple quoted. By requiring the client to escape
>>>>>>>>>> invalid sequences, the client will have to escape strings
>>>>>>>>>> differently, i.e. triple quote is OK within semi-colon quotes, and a
>>>>>>>>>> leading semicolon is OK within triple quotes, but not the other way
>>>>>>>>>> around.
>>>>>>>>>>
>>>>>>>>>> Joe Krahn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nick Spadaccini wrote:
>>>>>>>>>>> SUMMARISING.
>>>>>>>>>>>
>>>>>>>>>>> (a) The contents of delimited strings are returned as raw, with the
>>>>>>>>>>> token delimiters removed.
>>>>>>>>>>> (b) Where a delimiter character is to be part of the string, that
>>>>>>>>>>> character must be preceded by a reverse solidus when written out to
>>>>>>>>>>> the file. When read, any reverse solidus preceding a terminating
>>>>>>>>>>> character is deleted.
>>>>>>>>>>> (c) It is the responsibility of the writing and reading application
>>>>>>>>>>> to insert and remove the reverse solidus preceding the terminating
>>>>>>>>>>> character.
>>>>>>>>>>> (d) Otherwise the presence of a reverse solidus in the string has
>>>>>>>>>>> no meaning.
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ddlm-group mailing list
>>>>>>>>>> ddlm-group@iucr.org
>>>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>
>>>>>> --
>>>>>> T +61 (02) 9717 9907
>>>>>> F +61 (02) 9717 3145
>>>>>> M +61 (04) 0249 4148
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
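
For readers trying to answer the parsing question at the top of this message, Nick's rule for singly delimited strings can be sketched in a few lines. This is only one literal reading of the rule as he states it ("if it has a preceding \ delete it, skip over the delimiter and continue, otherwise it must be the terminating character"), not an agreed CIF2 algorithm. The names write_quoted and read_quoted are made up for illustration, and triple-quoted and semicolon-delimited strings are deliberately left out.

```python
from typing import Tuple


def write_quoted(value: str, delim: str = '"') -> str:
    """Writer side: put a reverse solidus before each embedded delimiter.

    No other character is touched, so an existing backslash is not doubled.
    """
    return delim + value.replace(delim, "\\" + delim) + delim


def read_quoted(text: str, start: int = 0) -> Tuple[str, int]:
    """Reader side: return (value, index just past the closing delimiter).

    text[start] is assumed to be the opening delimiter.
    """
    delim = text[start]
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch != delim:
            out.append(ch)        # every other character passes through raw
        elif out and out[-1] == "\\":
            out[-1] = delim       # drop the backslash, keep the delimiter
        else:
            return "".join(out), i + 1   # unescaped delimiter: terminator
        i += 1
    raise ValueError("no terminating delimiter found")


if __name__ == "__main__":
    # A few of the test strings from the top of this message.
    for token in (r'"ab\"cd"', r"'ab\"cd'", r'"ab\\"cd"',
                  r'"resum\'ee"', r"'resum\'ee'"):
        print(token, "->", read_quoted(token)[0])
```

Under this reading, "ab\"cd" yields ab"cd, 'ab\"cd' yields ab\"cd (the double quote is not the delimiter, so nothing is removed), and "ab\\"cd" yields ab\"cd, which matches Nick's HB example (1). One consequence worth noting: a value that ends in a reverse solidus does not round-trip, because the writer emits a final \" that the reader takes for an escaped delimiter.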
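
The alternative position in the thread, that the lexer returns the raw string and "two different utility routines" one level up decide what a reverse solidus means, can be sketched the same way. Both function names below are hypothetical, and the CIF1 routine handles only the acute-accent code used in the thread's example, not the full set of IUCr type-setting codes.

```python
import re
import unicodedata


def python_style_unescape(raw: str) -> str:
    """Python-like reading: backslash-quote stands for a literal quote."""
    return raw.replace("\\'", "'")


def cif1_acute_decode(raw: str) -> str:
    """CIF1-like reading: backslash-quote-letter is that letter with an acute.

    Only the acute accent is covered here; this is an illustration, not a
    complete decoder for the IUCr special-character codes.
    """
    def accent(match):
        # Compose letter + COMBINING ACUTE ACCENT into a single character.
        return unicodedata.normalize("NFC", match.group(1) + "\u0301")

    return re.sub(r"\\'([A-Za-z])", accent, raw)


raw = r"abcd\'efgh"                  # what a minimal lexer would hand back
print(python_style_unescape(raw))    # the "pythonesque" interpretation
print(cif1_acute_decode(raw))        # the CIF1 type-setting interpretation
```

The same raw value gives two different results depending on which routine the data item's type selects, which is the ambiguity James describes and the reason Herbert wants the choice made above the lexer.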