Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: John Westbrook <jwest@pdb-mail.rutgers.edu>
- Date: Thu, 19 Nov 2009 07:51:03 -0500
- In-Reply-To: <alpine.BSF.2.00.0911190638540.75409@epsilon.pair.com>
- References: <279aad2a0911182058j4d41e6d0i4b059f175ea3dfcd@mail.gmail.com> <C72B1C8D.124EF%nick@csse.uwa.edu.au> <279aad2a0911190331o1e4f97b6p2dcd69acdfb5b91b@mail.gmail.com><alpine.BSF.2.00.0911190638540.75409@epsilon.pair.com>
Hi all,

In a previous posting I registered the PDB's preference to leave the
interpretation of elided characters to the application. I am not
presently aware of cases from our work where it would be useful to
introduce handling of elides within the CIF syntax.

I am particularly worried about any lexical interpretation of '\' that
may interfere with its use in regular expressions, on which we have a
significant dependency in our dictionaries.

I continue to vote that the particular interpretation of '\' and other
special characters be defined by the dictionary definition of the item.
I think this provides the greatest flexibility for all applications.

Regards,

John

Herbert J. Bernstein wrote:
> Dear Colleagues,
>
> My personal preference would be to leave things in what to me seems the
> simpler approach of passing all reverse solidus glyphs to the application.
> However, the pragmatics of achieving a consensus and getting on with
> coding is more important than my personal taste.
>
> The major impact of a change in the handling of the reverse solidus, in
> having some of them absorbed by the CIF2 parsers, would be in the
> handling of legacy CIFs at the IUCr and at the PDB. James is right
> that what we are discussing is the difference between raw and cooked
> python strings. Inasmuch as CIF2 is now going to forbid the use of
> quote marks within non-delimited strings, in order to make the
> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
> may I suggest that we adopt both cooked and raw quoted strings
> from python, so that r" and r' can be used to introduce any raw,
> unconverted string taken from a CIF1 in which almost all existing
> CIF1 reverse solidus behavior could be left untouched, and that
> we accept James' cooked approach for quoted strings not marked with
> the r' or r".
>
> What say the IUCr journal operation and the PDB? It is their ox we
> are goring here.
>
> Regards,
> Herbert
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Thu, 19 Nov 2009, James Hester wrote:
>
>> OK, fair enough. Just to clarify, I am not advocating the full
>> repertoire of backslash elides, only two specific ones:
>> <backslash><terminator> and <backslash><backslash>. Any other use of
>> backslash would simply leave that backslash untouched.
>>
>> Would suggesting that the cut-and-pasters restrict themselves to
>> semicolon-delimited strings or triple-quote delimited strings help
>> with legacy issues?
>>
>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>
>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini
>> <nick@csse.uwa.edu.au> wrote:
>>>
>>> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>
>>>> We need to figure out the behaviour of elides. This was previously
>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>> strings", especially in messages around Oct 16th. The behaviour
>>>> advocated by Nick is for both the eliding and elided character to be
>>>> returned from the parser. The behaviour I would prefer is for the
>>>> eliding character to disappear; it should itself be elided if it is to
>>>> remain in the string.
>>>>
>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>> Oct 16, 2009 at 6:22AM and subsequently:
>>>>
>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>> use (i.e. it might be a greek character sequence)
>>>>
>>>> 2. The elide simply signals that the lexer should not interpret the
>>>> following character
>>>>
>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>> elide character is really eliding a syntactically significant
>>>> character (i.e. a terminator character or an elide character), the
>>>> elide sequence is replaced by the single character. I counter the
>>>> above arguments as follows:
>>>>
>>>> (a) The profusion of algorithms for backslash processing is
>>>> irrelevant. We can interpret the elides because the only algorithm
>>>> that has any relevance at the parser level is the simple
>>>> <backslash><character> -> <character>. All other potential uses
>>>> belong to higher levels. If the higher levels require a
>>>> <backslash><quote>, that is created by writing
>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>
>>> Couldn't agree with you more, and you are preaching to the converted who
>>> were converted away by others. This is what I was arguing months ago for
>>> how to interpret the """ strings. That is \n (EXPLICITLY THE ASCII
>>> REVERSE SOLIDUS) is always a newline, \t is always a tab etc. The parser
>>> should always substitute the single binary character for these character
>>> doublets ala unix/python/C etc. And you quite rightly argue if you want
>>> \n to really mean the IUCr Greek nu then it will have to be \\n, and the
>>> same parser will give the downstream application \n (having removed the
>>> leading elide). Beautiful, that's what the computer scientist in me
>>> argues.
>>>
>>> However others argued that many users vim/emacs the file and cut and
>>> paste the text content. So if you have a LaTeX string
>>> "{\\em I am italicised}" that you cut and paste then it fails. And the
>>> blasted backward compatibility argument comes in with existing CIF1
>>> files that are not doubly elided.
>>>
>>> What we can do is push the idea that a CIF2 string is a COMPLETELY
>>> different beast to a CIF1 string. We know that with CIF1 data names and
>>> data values we have to push our CIF2 parser in to a different grammar to
>>> handle things correctly. At that level elides in a string will have a
>>> strict CIF1 meaning (ie IUCr Greek markup).
>>>
>>> In CIF2 an elide in a string protects the following character from being
>>> interpreted as a delimiter. There is special meaning for \n, \t etc
>>> which are replaced by their single character. \u123456 (up to 6 hex
>>> numbers) indicate a unicode character which should be replaced by the
>>> correct byte sequence. All other first reverse solidus should be
>>> removed, and the immediately following character passed on as part of
>>> the string. Characters can be (multibyte) UTF-8.
>>>
>>> If you want to encode LaTeX (or IUCr-speak or something similar) then
>>> you are going to have to double all your reverse solidi. You can't cut
>>> and paste from an editor - bad luck.
>>>
>>> I will wait for Herb's response to this because he was an advocate of
>>> leaving things as they were (I think). I am happy to move forward with
>>> your suggested interpretation.
>>>
>>>> (b) The profusion of algorithms for backslash processing means that
>>>> we *must* remove ambiguity by removing the eliding character during
>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>> at an escaped prime or an acute accent without applying ugly
>>>> heuristics. Note also that a caller of a CIF reading program doesn't
>>>> currently need to know what the particular string delimiting character
>>>> was for a given string value; in order to make a guess at what
>>>> the backslash might mean, it would often need to know this.
>>>>
>>>> It appears that Nick is describing Python raw string behaviour,
>>>> and I am describing Python 'cooked' string behaviour. Note the
>>>> following paragraph from
>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>
>>>>    When an 'r' or 'R' prefix is present, a character following a
>>>>    backslash is included in the string without change, and all
>>>>    backslashes are left in the string. For example, the string
>>>>    literal r"\n" consists of two characters: a backslash and a
>>>>    lowercase 'n'. String quotes can be escaped with a backslash,
>>>>    but the backslash remains in the string; for example, r"\"" is
>>>>    a valid string literal consisting of two characters: a
>>>>    backslash and a double quote; r"\" is not a valid string
>>>>    literal (even a raw string cannot end in an odd number of
>>>>    backslashes). Specifically, a raw string cannot end in a
>>>>    single backslash (since the backslash would escape the
>>>>    following quote character). Note also that a single backslash
>>>>    followed by a newline is interpreted as those two characters
>>>>    as part of the string, not as a line continuation.
>>>>
>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>> them slightly less expressive than cooked strings, which can express
>>>> everything.
>>>>
>>>> I would challenge Nick et al. to explain what the advantage
>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>> CIF syntax from their users (as a service), and this proposal appears
>>>> to want to expose a confusing part of it to them. Some questions we
>>>
>>> The original "advantage" (if you could call it that) was to keep others
>>> happy and to support backwards compatibility.
>>>
>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>> handle escaping any strings passed to you for output? How do you know
>>>> if the caller has done the escaping already, or not? Do you really
>>>> expect the calling software to work out whether it wants a single or
>>>> double or triple quote delimited string? Isn't that the service
>>>> provided by your software? What are they (not) paying you for, anyway?
>>>
>>> When they pay, I'll answer that question!
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
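
[Editorial illustration, not part of the original message.] The two
parser behaviours debated in this thread can be sketched roughly as
follows. This is only a sketch under stated assumptions: the function
names cook_minimal and pass_through are hypothetical, the terminator is
assumed to be a single character (so triple-quote and semicolon
delimiters are not modelled), and Nick's fuller \n, \t, \u handling is
not implemented. It shows James' restricted proposal, in which only
<backslash><terminator> and <backslash><backslash> are collapsed, next
to the "leave everything to the application" behaviour.

```python
def pass_through(on_disk: str) -> str:
    """'Raw' behaviour: every reverse solidus reaches the application."""
    return on_disk


def cook_minimal(on_disk: str, terminator: str) -> str:
    """Hypothetical sketch of the restricted 'cooked' proposal.

    Only <backslash><terminator> and <backslash><backslash> are replaced
    by their second character; any other backslash is left untouched.
    Assumes `terminator` is a single character.
    """
    out = []
    i = 0
    while i < len(on_disk):
        ch = on_disk[i]
        nxt = on_disk[i + 1] if i + 1 < len(on_disk) else None
        if ch == "\\" and nxt in (terminator, "\\"):
            out.append(nxt)  # drop the eliding character, keep the elided one
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    samples = [
        (r"it\'s", "'"),       # elided terminator
        (r"\d+\.\d+", "'"),    # regular expression without \\ or \'
        (r"a\\b", "'"),        # regular expression matching a literal backslash
        (r"{\em text}", '"'),  # LaTeX-style markup
    ]
    for text, term in samples:
        print(repr(pass_through(text)), "->", repr(cook_minimal(text, term)))
```

Under these assumptions a regular expression such as \d+\.\d+ and the
LaTeX-style string pass through unchanged, while a regular expression
containing a doubled backslash loses one backslash. That is the kind of
interference with dictionary regular expressions raised in the reply at
the top of this message.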
- References:
- [ddlm-group] Use of elides in strings (James Hester)
- Re: [ddlm-group] Use of elides in strings (Nick Spadaccini)
- Re: [ddlm-group] Use of elides in strings (James Hester)
- Re: [ddlm-group] Use of elides in strings (Herbert J. Bernstein)