Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sat, 21 Nov 2009 08:27:56 -0500 (EST)
- In-Reply-To: <279aad2a0911210437v53726196p7ffd0fa9a3e1cee8@mail.gmail.com>
- References: <C72C423A.12515%nick@csse.uwa.edu.au><4B06DEAF.4070109@niehs.nih.gov><alpine.BSF.2.00.0911201441550.25803@epsilon.pair.com><279aad2a0911201545m22547e50i39df8f165c1c340e@mail.gmail.com><4B0744F9.3040907@niehs.nih.gov><279aad2a0911210437v53726196p7ffd0fa9a3e1cee8@mail.gmail.com>
Dear Colleagues,

Let us consider James' example. He is actually making the case for
_not_ removing the reverse solidus from a string at the lexical level.
Take

  xxxx<backslash><quote>elxxxx

or, to be more specific,

  abcd\'efgh

and we are presented with the question of how the dictionary should
interpret that string.

If we have a string intended to be part of the modern pythonesque
world, then I would expect the data element to have been typed in a
way that says we should read the string as

  abcd'efgh

If we have a string that is a legacy from a CIF 1 file with IUCr
type-setting codes, I would expect the data element to have been
typed in a way that says we should read the string as

  abcd{e with an acute accent}fgh

Anything the lexer does to remove the reverse solidus is going to
disfavor one interpretation or the other. By moving these two
interpretations one level up to two different utility routines, we gain
much more use from a common lexer and nobody loses any functionality.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sat, 21 Nov 2009, James Hester wrote:

> Joe, I agree with you. There is a fundamental issue here that I have
> already raised, but can't see Herbert and John's proposal addressing:
> if we allow lexical escaping, but then pass on both the escaping and
> escaped character, how does the dictionary layer know if a given
> character sequence represents an escape, or corresponds to something
> else? If the dictionary layer gets a string like:
>
> 'xxxx<backslash><quote>elxxxx', does that mean:
>
> 'xxxx<quote>elxxxx'
>
> or does it mean
>
> 'xxxx<e acute>lxxxx' ?
>
> (First case might be from a string "He said 'elephants are pink' ",
> second case "Fren<e acute>l formalism" (apologies to French speakers,
> I have no idea when to use e acute)).
>
> Similar examples can be constructed no matter what the alternative
> meaning of <backslash><quote> might be in the particular domain. The
> key point is that you can't overload the meaning of
> <escape><terminator>: either it is an instruction to the lexer, or it
> has semantic meaning, but not both. It doesn't even matter if the
> lexer reads the dictionary definition before reading in the string
> value: if two meanings are possible, the dictionary layer faces the
> same problem.
>
> So: here is my latest proposal to deal with this issue:
>
> 1. As in CIF1, there is no lexical elision available at all, ever.
> All instances of the terminator terminate (unlike CIF1).
>
> 2. Dictionary writers anticipate when a string value may run into
> trouble due to this lack of elision (because those string values could
> contain all of triple quote/triple double quote/<eol><semicolon>) and
> describe a workaround in the dictionary: for example, inserting a
> space between <eol> and <semicolon> when writing these string values,
> and removing the space when reading them back. We could provide
> support by defining a special string type in DDLm with these
> properties.
>
> I believe that this deals with all real and imagined problems.
>
> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>> Clearly, Herbert is not referring to the 'reading and writing
>> application' as the parser, but the application calling the parser. It
>> makes things easier for the parser, but harder for the caller. It would
>> not be that much of a problem, except that there are now several ways to
>> quote strings, and the disallowed character sequences that need encoding
>> vary among them.
>>
>> Herbert seems to view "the calling application" as a middle layer,
>> rather than the program making use of the data. That sort of makes
>> sense, in that conversion between strings and numeric values cannot
>> happen at the CIF level. You could argue that a dictionary level middle
>> layer is required to convert data to the final end-user form, and that
>> handling character conversions at that level is more flexible. In
>> general, that is a reasonable approach. However, even in that case, I
>> think it is much less problematic to handle the few conversions that are
>> specific to a given string quoting method at the parser level.
>>
>> Joe
>>
>> James Hester wrote:
>>> First in reply to Joe: I believe that when Nick refers to the 'reading
>>> and writing application' he indeed has in mind the CIF parser/CIF
>>> writer layer, so I would guess that he agrees with your opinion as
>>> well. The issue is that we do not present an opaque storage format,
>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
>>> with text tools, so we need to cater to this method of interfacing to
>>> CIFs as well.
>>>
>>> In reply to Herbert: your suggestion implies that we abandon any
>>> *lexical* meaning for <elide><terminator>. Or are you suggesting that
>>> an application reads the dataname, then looks up the dictionary to
>>> decide if it should continue to input the string when it sees
>>> <elide><terminator>? So we have dictionary-driven parsing?
>>>
>>> I can't work out from your previous email whether you are now in
>>> support of abandoning elision as well as supporting treating all
>>> strings as raw. Please clarify...
>>>
>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
>>> <yaya@bernstein-plus-sons.com> wrote:
>>>> Dear Colleagues,
>>>>
>>>> There is a difference between what are useful utilities to have in
>>>> an API in support of CIF2 and what is formally part of the base CIF2.
>>>> I am all in favor of utilities to apply and unapply the various
>>>> uses for the reverse solidus -- one for cleaning up python-style
>>>> use, one to handle the IUCr special characters, one for line folding,
>>>> etc., but I don't think that means we have to make one of those
>>>> particular uses formally part of the base CIF2.
>>>>
>>>> Regards,
>>>> Herbert
>>>>
>>>> =====================================================
>>>> Herbert J. Bernstein, Professor of Computer Science
>>>> Dowling College, Kramer Science Center, KSC 121
>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>> +1-631-244-3035
>>>> yaya@dowling.edu
>>>> =====================================================
>>>>
>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
>>>>
>>>>> Unlike others here, I feel that a proper text archive library should be
>>>>> able to take any string from the calling application, and return that
>>>>> exact same string when reading it back in. It is the job of the archive
>>>>> format to avoid delimiter problems. An application should be able to
>>>>> store and retrieve strings without such worries, and interface to an SQL
>>>>> database the same as it would interface to CIF. All commonly used
>>>>> database libraries work this way. Why should CIF continue to take an
>>>>> archaic approach?
>>>>>
>>>>> I essentially agree with the design below, except that the library
>>>>> should handle insertion and removal of the reverse solidus for the
>>>>> limited cases where it is required.
>>>>>
>>>>> If it is the client application's responsibility to deal with reverse
>>>>> solidus escape sequences, then the description below doesn't make sense.
>>>>> In that case, the reverse solidus never has any special meaning to CIF2.
>>>>> Instead, CIF2 simply disallows certain character sequences. A client
>>>>> application can use whatever it wants to encode/decode the disallowed
>>>>> character sequences.
>>>>>
>>>>> The advantage of having well-defined escape sequences at the I/O library
>>>>> level is that updates to the format do not require updates to client
>>>>> applications. A CIF client application should be able to send a string
>>>>> to the CIF library, and not have to know in advance what CIF revision is
>>>>> in use, or whether the string is semicolon block quoted or triple
>>>>> quoted. By requiring the client to escape invalid sequences, the client
>>>>> will have to escape strings differently, i.e. triple quote is OK within
>>>>> semi-colon quotes, and a leading semicolon is OK within triple quotes,
>>>>> but not the other way around.
>>>>>
>>>>> Joe Krahn
>>>>>
>>>>>
>>>>> Nick Spadaccini wrote:
>>>>>> SUMMARISING.
>>>>>>
>>>>>> (a) The contents of delimited strings are returned as raw, with the token
>>>>>> delimiters removed.
>>>>>> (b) Where a delimiter character is to be part of the string, that character
>>>>>> must be preceded by a reverse solidus when written out to the file. When
>>>>>> read, any reverse solidus preceding a terminating character is deleted.
>>>>>> (c) It is the responsibility of the writing and reading application to
>>>>>> insert and remove the reverse solidus preceding the terminating character.
>>>>>> (d) Otherwise the presence of a reverse solidus in the string has no
>>>>>> meaning.
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
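
For concreteness, the two readings of abcd\'efgh discussed at the top of this message could be sketched roughly as follows. This is only an illustration under stated assumptions: the names read_python_style and read_iucr_codes and the one-entry accent table are hypothetical, not part of CIF2, DDLm, or any existing library, and only the acute-accent code is handled. The point it tries to show is that both routines start from the same raw string, which a lexer that had already removed the reverse solidus could no longer supply.

```python
import unicodedata

# Hypothetical table for the legacy reading: the character after the
# reverse solidus selects a combining mark. Only the acute accent is shown.
COMBINING_MARKS = {"'": "\u0301"}  # U+0301 COMBINING ACUTE ACCENT


def read_python_style(raw: str) -> str:
    """Modern reading: a reverse solidus before a quote elides that quote."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw) and raw[i + 1] in "'\"":
            out.append(raw[i + 1])  # keep the quote, drop the reverse solidus
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


def read_iucr_codes(raw: str) -> str:
    """Legacy reading: reverse solidus, accent code, letter -> accented letter."""
    out = []
    i = 0
    while i < len(raw):
        if (
            raw[i] == "\\"
            and i + 2 < len(raw)
            and raw[i + 1] in COMBINING_MARKS
            and raw[i + 2].isalpha()
        ):
            # Put the combining mark after the base letter, then compose
            # the pair into a single code point where one exists.
            pair = raw[i + 2] + COMBINING_MARKS[raw[i + 1]]
            out.append(unicodedata.normalize("NFC", pair))
            i += 3
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


# The ten characters abcd\'efgh, exactly as a raw lexer would hand them on.
raw = "abcd\\'efgh"
print(read_python_style(raw))  # abcd'efgh
print(read_iucr_codes(raw))    # abcdéfgh
```

Which routine is called would be decided by how the data element is typed in the dictionary, not by the lexer.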
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group