Re: [ddlm-group] Use of elides in strings
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: James Hester <jamesrhester@gmail.com>
- Date: Thu, 19 Nov 2009 22:31:27 +1100
- In-Reply-To: <C72B1C8D.124EF%nick@csse.uwa.edu.au>
- References: <279aad2a0911182058j4d41e6d0i4b059f175ea3dfcd@mail.gmail.com><C72B1C8D.124EF%nick@csse.uwa.edu.au>
OK, fair enough. Just to clarify, I am not advocating the full repertoire of backslash elides, only two specific ones: <backslash><terminator> and <backslash><backslash>. Any other use of backslash would simply leave that backslash untouched.

Would suggesting that the cut-and-pasters restrict themselves to semicolon-delimited strings or triple-quote delimited strings help with legacy issues?

Anyway, let us await the opinions of our Western Hemisphere colleagues...

On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>
> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>
>> We need to figure out the behaviour of elides. This was previously
>> discussed in a thread entitled "The alphabet of non-delimited
>> strings", especially in messages around Oct 16th. The behaviour
>> advocated by Nick is for both the eliding and elided character to be
>> returned from the parser. The behaviour I would prefer is for the
>> eliding character to disappear; it should itself be elided if it is to
>> remain in the string.
>>
>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>> Oct 16, 2009 at 6:22AM and subsequently:
>>
>> 1. We don't interpret elides because we don't know what algorithm to
>> use (i.e. it might be a Greek character sequence)
>>
>> 2. The elide simply signals that the lexer should not interpret the
>> following character
>>
>> My counter-proposal is similar to Simon's original expectation: if the
>> elide character is really eliding a syntactically significant
>> character (i.e. a terminator character or an elide character), the
>> elide sequence is replaced by the single character. I counter the
>> above arguments as follows:
>>
>> (a) The profusion of algorithms for backslash processing is
>> irrelevant. We can interpret the elides because the only algorithm
>> that has any relevance at the parser level is the simple
>> <backslash><character> -> <character>. All other potential uses
>> belong to higher levels. If the higher levels require a
>> <backslash><quote>, that is created by writing
>> <backslash><backslash><backslash><quote> in the on-disk string.
>
> Couldn't agree with you more, and you are preaching to the converted who
> were converted away by others. This is what I was arguing months ago for how
> to interpret the """ strings. That is, \n (EXPLICITLY THE ASCII REVERSE
> SOLIDUS) is always a newline, \t is always a tab, etc. The parser should
> always substitute the single binary character for these character doublets,
> à la Unix/Python/C etc. And you quite rightly argue that if you want \n to
> really mean the IUCr Greek nu then it will have to be \\n, and the same
> parser will give the downstream application \n (having removed the leading
> elide). Beautiful, that's what the computer scientist in me argues.
>
> However, others argued that many users vim/emacs the file and cut and paste
> the text content. So if you have a LaTeX string "{\\em I am italicised}"
> that you cut and paste, then it fails. And the blasted backward
> compatibility argument comes in with existing CIF1 files that are not doubly
> elided.
>
> What we can do is push the idea that a CIF2 string is a COMPLETELY different
> beast to a CIF1 string. We know that with CIF1 data names and data values we
> have to push our CIF2 parser into a different grammar to handle things
> correctly. At that level elides in a string will have a strict CIF1 meaning
> (i.e. IUCr Greek markup).
>
> In CIF2 an elide in a string protects the following character from being
> interpreted as a delimiter. There is special meaning for \n, \t etc., which
> are replaced by their single character. \u123456 (up to 6 hex digits)
> indicates a Unicode character, which should be replaced by the correct byte
> sequence. Any other leading reverse solidus should be removed, and the
> immediately following character passed on as part of the string. Characters
> can be (multibyte) UTF-8.
>
> If you want to encode LaTeX (or IUCr-speak or something similar) then you
> are going to have to double all your reverse solidi. You can't cut and paste
> from an editor - bad luck.
>
> I will wait for Herb's response to this because he was an advocate of
> leaving things as they were (I think). I am happy to move forward with your
> suggested interpretation.
>
>> (b) The profusion of algorithms for backslash processing means that
>> we *must* remove ambiguity by removing the eliding character during
>> processing; otherwise, an application can't tell if it is e.g. looking
>> at an escaped prime or an acute accent without applying ugly
>> heuristics. Note also that a caller of a CIF reading program doesn't
>> currently need to know what the particular string delimiting character
>> was for a given string value; in order to make a guess at what
>> the backslash might mean, it would often need to know this.
>>
>> It appears that Nick is describing Python raw string behaviour,
>> and I am describing Python 'cooked' string behaviour. Note the
>> following paragraph from
>> docs.python.org/reference/lexical_analysis.html#strings:
>>
>>     When an 'r' or 'R' prefix is present, a character following a
>>     backslash is included in the string without change, and all
>>     backslashes are left in the string. For example, the string
>>     literal r"\n" consists of two characters: a backslash and a
>>     lowercase 'n'. String quotes can be escaped with a backslash,
>>     but the backslash remains in the string; for example, r"\"" is
>>     a valid string literal consisting of two characters: a
>>     backslash and a double quote; r"\" is not a valid string
>>     literal (even a raw string cannot end in an odd number of
>>     backslashes). Specifically, a raw string cannot end in a
>>     single backslash (since the backslash would escape the
>>     following quote character). Note also that a single backslash
>>     followed by a newline is interpreted as those two characters
>>     as part of the string, not as a line continuation.
>>
>> Note that raw strings cannot end in a backslash, so I would consider
>> them slightly less expressive than cooked strings, which can express
>> everything.
>>
>> I would challenge Nick et al. to explain what the advantage
>> of keeping the eliding character in the datavalue is, keeping in mind
>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>> CIF syntax from their users (as a service), and this proposal appears
>> to want to expose a confusing part of it to them. Some questions we
>
> The original "advantage" (if you could call it that) was to keep others
> happy and to support backwards compatibility.
>
>> toolbox maintainers will need to ask if this goes through: Do you
>> handle escaping any strings passed to you for output? How do you know
>> if the caller has done the escaping already, or not? Do you really expect
>> the calling software to work out whether it wants a single or double
>> or triple quote delimited string? Isn't that the service provided by
>> your software? What are they (not) paying you for, anyway?
>
> When they pay, I'll answer that question!
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
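For illustration only, and not part of the original message: the two-elide rule proposed at the top of this reply (<backslash><terminator> and <backslash><backslash> collapse to the single character, any other backslash is left untouched) could be sketched in Python roughly as follows. The names `unelide` and `elide` are hypothetical, and the sketch assumes a single-character terminator; triple-quote and semicolon-delimited strings would need additional handling.

```python
def unelide(raw: str, terminator: str) -> str:
    """Parser side: turn the on-disk string body into the data value.

    Only <backslash><terminator> and <backslash><backslash> are collapsed.
    Every other backslash is passed through unchanged (assumption: the
    terminator is a single character).
    """
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw) and raw[i + 1] in (terminator, "\\"):
            out.append(raw[i + 1])  # drop the eliding backslash
            i += 2
        else:
            out.append(ch)          # ordinary character or unrelated backslash
            i += 1
    return "".join(out)


def elide(value: str, terminator: str) -> str:
    """Writer side: escape a data value so that unelide() recovers it."""
    # Double the backslashes first, then protect the terminator.
    return value.replace("\\", "\\\\").replace(terminator, "\\" + terminator)


if __name__ == "__main__":
    # <backslash><backslash><backslash><quote> on disk yields <backslash><quote>.
    print(unelide('\\\\\\"', '"'))
    # A backslash that elides nothing significant is left alone.
    print(unelide("{\\em I am italicised}", '"'))
```

Under these assumptions a toolbox would call `elide` on output and `unelide` on input, so the calling application never sees the eliding character and does not need to know which delimiter was used.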
- Follow-Ups:
- Re: [ddlm-group] Use of elides in strings (Herbert J. Bernstein)
- References:
- [ddlm-group] Use of elides in strings (James Hester)
- Re: [ddlm-group] Use of elides in strings (Nick Spadaccini)