Re: [ddlm-group] Use of elides in strings
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 20 Nov 2009 16:10:21 +1100
- In-Reply-To: <C72C423A.12515%nick@csse.uwa.edu.au>
- References: <166143.63604.qm@web87011.mail.ird.yahoo.com><C72C423A.12515%nick@csse.uwa.edu.au>
Hi Nick: you have described exactly what I would like to see. I am
currently working on some way of taking into account John's issues and
will hopefully post something later on today.

On Fri, Nov 20, 2009 at 3:55 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> The general sentiment is to pass on the string as raw.
>
> Herb's suggestion of raw and cooked can't be done at a lexical level as it
> is with Python. r"..." u"..." are syntactically incorrect in STAR. It can
> happen at a dictionary level where a type could be declared as a raw string
> or a cooked string. Probably better still is that such data items should
> have an associated data item that indicates what the mark up is and how to
> handle it. This will handle a multitude of markups.
>
> James' original email didn't address the whole hog approach, but focussed
> only on what to do with \", \' and \\. I think James is suggesting that if
> a user wants to store the string abc"def as a double-quote string then a
> user will want the application to deal with putting it out as "abc\"def"
>
> The same user will want the application to deliver on a read the string as
> abc"def.
>
> I think this is a reasonable expectation, and we should specify this
> behaviour in a CIF2 parser - THIS IS A PROPOSAL.
>
> But James goes further to discuss \\, which I didn't think was necessary at
> first and neither does Simon. But I can see pathological cases where it is
> problematic. Say the string is abc\"def.
>
> The above algorithm will create "abc\\"def", but on reading according to the
> current draft specification it should probably say \\ -> \, then a " ->
> terminate token. An alternative is we need to create "abc\\\"def" but now we
> have to duplicate all elides. Doing nothing and creating "abc\"def" will
> parse, but return abc"def - incorrect.
>
> But all of this can be solved simply by revisiting the wording of the
> specification.
> I have said that a \ protects the next character and it
> should be ignored as far as parsing is concerned. But since the contents are
> raw, it should be (eg)
>
> For a "" delimited string a \ has no meaning unless it immediately precedes
> a ", in which case the " is ignored as a token terminator. Hence given the
> above algorithm the string abc\"def is encoded as "abc\\"def" by the parser.
> When it is read the parsing process is first \ precedes a \, hence has no
> meaning (pass it through). Second \ precedes a ", thus this is an elide,
> drop that reverse solidus and pass through the " as a legitimate character.
>
> You end up deriving abc\"def as required.
>
> Have I missed anything?
>
> SUMMARISING.
>
> (a) The contents of delimited strings are returned as raw, with the token
> delimiters removed.
> (b) Where a delimiter character is to be part of the string, that character
> must be preceded by a reverse solidus when written out to the file. When
> read, any reverse solidus preceding a terminating character is deleted.
> (c) It is the responsibility of the writing and reading application to
> insert and remove the reverse solidus preceding the terminating character.
> (d) Otherwise the presence of a reverse solidus in the string has no
> meaning.
>
> Does this cover all bases?
>
>
> On 20/11/09 4:57 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> Dear all
>>
>> Haven't caught up with all the recent discussions yet, but hopefully have
>> identified the following views appropriately:
>>
>> 1) Nick's proposal (preference):
>>
>> "In CIF2 an elide in a string protects the following character from being
>> interpreted as a delimiter.
>>
>> There is special meaning for \n, \t etc which
>> are replaced by their single character.
>>
>> \u123456 (up to 6 hex numbers)
>> indicate a unicode character which should be replaced by the correct byte
>> sequence.
>>
>> All other first reverse solidus should be removed, and the
>> immediately following character passed on as part of the string.
>>
>> Characters can be (multibyte) UTF-8.
>> "
>>
>> SPW: Though the logic of this is unquestionable (from a programmer's
>> perspective), I think this might be too disruptive. Though CIF2 promises
>> interpretable content to enhance data processing, CIF is also an archiving
>> format. I believe that restrictions on the content of a data value should
>> be minimal, governed by necessity (e.g. restrictions to avoid delimiter
>> conflicts), rather than restricting the character set of the content to
>> facilitate parsing or interpretability by any particular programming
>> language. On the one hand CIF2 promises to be a more flexible archiving
>> format by extending its character set, while on the other hand it could
>> become more restrictive by requiring that every reverse solidus has to be
>> 'doubled-up' in a data value.
>>
>> Granted, there are strong arguments that people will decreasingly need to
>> interact with a CIF in its raw form so extra complexities of syntax are not
>> too much of a problem, but as many have pointed out, they still will
>> read/edit raw CIFs, and may well have no alternative on occasion
>> (for example, the IUCr will shortly be requiring authors to include
>> refinement-software instruction listings in their CIFs, which will need to
>> be included 'as is' within the restrictions of the data value delimiters
>> and line lengths, purely for review purposes and only available in their
>> raw form in the CIF)
>>
>> So on a fundamental level, I don't see that \n, \t, ... need to be reserved
>> as special within a data value, nor \u123456. Definition of special meanings
>> for these can be handled at a higher level?
>> Equally, unless the reverse solidus escapes a delimiter character within
>> the context of the identified opening delimiter, I don't see why it should
>> be discarded by a parser.
>>
>> 2) James' proposal:
>>
>> "backslash elides, only two specific ones:
>>
>> <backslash><terminator> and <backslash><backslash>.
>>
>> Any other use of
>> backslash would simply leave that backslash untouched.
>> "
>>
>> SPW: tend to agree with this (see above), but why escape a backslash when
>> they will be untouched anyway if they're not followed by a terminator?
>>
>> 3) Herbert's proposal:
>>
>> "may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r" and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James cooked approach for quoted strings not marked with
>> the r' or r".
>> "
>>
>> SPW: could be a neat solution for backward-compatibility, but with more
>> complexity comes the potential for more errors?
>> Also, what about r; (assuming we're not just talking about quoted strings)?
>>
>>
>> So if it's not possible to allow context-sensitive handling of elides
>> (escaping a delimiter if the value is delimited by the same delimiter),
>> then I find myself supporting Nick's earlier conclusion (a month back) that
>> all elides will be returned at the parser level for the application to deal
>> with (THREAD 3)? If either of these approaches is considered
>> unsatisfactory, then 'go the whole hog' and adopt the familiar
>> 'programming syntax' treatment of elides as described by Nick.
>>
>> Cheers
>>
>> Simon
>>
>> PS usual disclaimer that these aren't necessarily the IUCr's views
>>
>>
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Cc: Nick.Spadaccini@uwa.edu.au
>> Sent: Thursday, 19 November, 2009 11:55:37
>> Subject: Re: [ddlm-group] Use of elides in strings
>>
>> Dear Colleagues,
>>
>> My personal preference would be to leave things in what to me seems the
>> simpler approach of passing all reverse solidus glyphs to the application.
>> However, the pragmatics of achieving a consensus and getting on with coding
>> is more important than my personal taste.
>>
>> The major impact of a change in the handling of the reverse solidus in
>> having some of them absorbed by the CIF2 parsers would be in the
>> handling of legacy CIFs at the IUCr and at the PDB. James is right
>> that what we are discussing is the difference between raw and cooked
>> python strings. Inasmuch as CIF2 is now going to forbid the use of
>> quote marks within non-delimited strings, in order to make the
>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>> may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r" and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James cooked approach for quoted strings not marked with
>> the r' or r".
>>
>> What say the IUCr journal operation and the PDB? It is their ox we are
>> goring here.
>>
>> Regards,
>> Herbert
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya@dowling.edu
>> =====================================================
>>
>> On Thu, 19 Nov 2009, James Hester wrote:
>>
>>> OK, fair enough.
>>> Just to clarify, I am not advocating the full
>>> repertoire of backslash elides, only two specific ones:
>>> <backslash><terminator> and <backslash><backslash>. Any other use of
>>> backslash would simply leave that backslash untouched.
>>>
>>> Would suggesting that the cut-and-pasters restrict themselves to
>>> semicolon-delimited strings or triple-quote delimited strings help
>>> with legacy issues?
>>>
>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>>
>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>> wrote:
>>>>
>>>> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>
>>>>> We need to figure out the behaviour of elides. This was previously
>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>> strings", especially in messages around Oct 16th. The behaviour
>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>> returned from the parser. The behaviour I would prefer is for the
>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>> remain in the string.
>>>>>
>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>>
>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>> use (i.e. it might be a greek character sequence)
>>>>>
>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>> following character
>>>>>
>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>> elide character is really eliding a syntactically significant
>>>>> character (i.e. a terminator character or an elide character), the
>>>>> elide sequence is replaced by the single character. I counter the
>>>>> above arguments as follows:
>>>>>
>>>>> (a) The profusion of algorithms for backslash processing is
>>>>> irrelevant.
>>>>> We can interpret the elides because the only algorithm
>>>>> that has any relevance at the parser level is the simple
>>>>> <backslash><character> -> <character>. All other potential uses
>>>>> belong to higher levels. If the higher levels require a
>>>>> <backslash><quote>, that is created by writing
>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>>
>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>> were converted away by others. This is what I was arguing months ago for how
>>>> to interpret the """ strings. That is \n (EXPLICITLY THE ASCII REVERSE
>>>> SOLIDUS) is always a newline, \t is always a tab etc. The parser should
>>>> always substitute the single binary character for these character doublets
>>>> ala unix/python/C etc. And you quite rightly argue if you want \n to really
>>>> mean the IUCr Greek nu then it will have to be \\n, and the same parser will
>>>> give the downstream application \n (having removed the leading elide).
>>>> Beautiful, that's what the computer scientist in me argues.
>>>>
>>>> However others argued that many users vim/emacs the file and cut and paste
>>>> the text content. So if you have a LaTeX string "{\\em I am italicised}"
>>>> that you cut and paste then it fails. And the blasted backward
>>>> compatibility argument comes in with existing CIF1 files that are not doubly
>>>> elided.
>>>>
>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY different
>>>> beast to a CIF1 string. We know that with CIF1 data names and data values we
>>>> have to push our CIF2 parser into a different grammar to handle things
>>>> correctly. At that level elides in a string will have a strict CIF1 meaning
>>>> (ie IUCr Greek markup).
>>>>
>>>> In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter. There is special meaning for \n, \t etc which
>>>> are replaced by their single character.
>>>> \u123456 (up to 6 hex numbers)
>>>> indicate a unicode character which should be replaced by the correct byte
>>>> sequence. All other first reverse solidus should be removed, and the
>>>> immediately following character passed on as part of the string. Characters
>>>> can be (multibyte) UTF-8.
>>>>
>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>> are going to have to double all your reverse solidii. You can't cut and paste
>>>> from an editor - bad luck.
>>>>
>>>> I will wait for Herb's response to this because he was an advocate of
>>>> leaving things as they were (I think). I am happy to move forward with your
>>>> suggested interpretation.
>>>>
>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>> at an escaped prime or an acute accent without applying ugly
>>>>> heuristics. Note also that a caller of a CIF reading program doesn't
>>>>> currently need to know what the particular string delimiting character
>>>>> was for a given string value; in order to make a guess at what
>>>>> the backslash might mean, it would often need to know this.
>>>>>
>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>> and I am describing Python 'cooked' string behaviour. Note the
>>>>> following paragraph from
>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>>
>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>> backslash is included in the string without change, and all
>>>>> backslashes are left in the string. For example, the string
>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>> lowercase 'n'.
>>>>> String quotes can be escaped with a backslash,
>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>> a valid string literal consisting of two characters: a
>>>>> backslash and a double quote; r"\" is not a valid string
>>>>> literal (even a raw string cannot end in an odd number of
>>>>> backslashes). Specifically, a raw string cannot end in a
>>>>> single backslash (since the backslash would escape the
>>>>> following quote character). Note also that a single backslash
>>>>> followed by a newline is interpreted as those two characters
>>>>> as part of the string, not as a line continuation.
>>>>>
>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>> them slightly less expressive than cooked strings, which can express
>>>>> everything.
>>>>>
>>>>> I would challenge Nick et al. to explain what the advantage
>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>> to want to expose a confusing part of it to them. Some questions we
>>>>
>>>> The original "advantage" (if you could call it that) was to keep others
>>>> happy and to support backwards compatibility.
>>>>
>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>> handle escaping any strings passed to you for output? How do you know
>>>>> if the caller has done the escaping already, or not? Do you really expect
>>>>> the calling software to work out whether it wants a single or double
>>>>> or triple quote delimited string? Isn't that the service provided by
>>>>> your software? What are they (not) paying you for, anyway?
>>>>
>>>> When they pay, I'll answer that question!
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>>
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
>>>> <http://www.csse.uwa.edu.au/%7Enick>
>>>> MBDP M002
>>>>
>>>> CRICOS Provider Code: 00126G
>>>>
>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>
>
> cheers
>
> Nick
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
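
For concreteness, rules (a) to (d) in Nick's summary can be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions, not an agreed CIF2 algorithm: the names `encode`, `decode` and the `double_backslash` switch are hypothetical, tokens are assumed non-empty with a single-character delimiter, and the switch models the <backslash><backslash> elide from James's earlier proposal purely for comparison.

```python
BACKSLASH = "\\"


def encode(value, delimiter='"', double_backslash=False):
    """Writer side (hypothetical helper).

    Rule (b): put a reverse solidus before each embedded delimiter.
    With double_backslash=True, every reverse solidus is doubled first,
    which models the <backslash><backslash> elide (an assumption made
    here for comparison, not part of rules (a) to (d)).
    """
    if double_backslash:
        value = value.replace(BACKSLASH, BACKSLASH * 2)
    return delimiter + value.replace(delimiter, BACKSLASH + delimiter) + delimiter


def decode(token, double_backslash=False):
    """Reader side (hypothetical helper).

    Returns the string value with delimiters removed (rule a), or raises
    ValueError if no unelided closing delimiter ends the token.
    """
    delimiter = token[0]
    out = []
    i = 1
    while i < len(token):
        ch = token[i]
        nxt = token[i + 1] if i + 1 < len(token) else ""
        elides_delimiter = ch == BACKSLASH and nxt == delimiter
        elides_backslash = double_backslash and ch == BACKSLASH and nxt == BACKSLASH
        if elides_delimiter or elides_backslash:
            # Drop the eliding reverse solidus, keep the elided character.
            out.append(nxt)
            i += 2
        elif ch == delimiter:
            if i != len(token) - 1:
                raise ValueError("text after closing delimiter")
            return "".join(out)
        else:
            # Rule (d): any other reverse solidus has no meaning.
            out.append(ch)
            i += 1
    raise ValueError("unterminated string")


if __name__ == "__main__":
    for value in ['abc"def', 'abc\\"def', '{\\em I am italicised}']:
        token = encode(value)
        print(value, "->", token, "->", decode(token))
```

With the default settings the string abc\"def is written as "abc\\"def" and read back unchanged, as in Nick's worked example, and a LaTeX-style value needs no doubling. Under this reading of the rules, though, a value whose last character is a reverse solidus does not appear to round-trip, since the reader cannot tell the written \" from an elided quote; this is analogous to the Python raw-string restriction quoted in the thread. Setting `double_backslash=True` avoids that, at the cost of the doubled elides that Nick notes break cut-and-paste.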