Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Fri, 20 Nov 2009 15:08:39 +0800
- In-Reply-To: <279aad2a0911192255i7826ef8fu62789d1ec71baf6b@mail.gmail.com>
We are agreed then. When the northern hemisphere wakes up they will have
read a brilliant argument as to why it is Option 1.

As for *nix tools: isn't that all broken the moment we go to UTF-8? You'll
grep/sed/awk the ASCII characters easily enough, but the result will be
packed with "junk". And I don't know if I can specify a UTF-8 character
easily to grep on.

If necessary it wouldn't take much to create a *nix shell tool such that
you would execute

    cifparse | grep ...

and cifparse can do whatever is necessary to the file to "clean" it up.

On 20/11/09 2:55 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> I agree that it is either option 1 or nothing. Note that option 4 is
> there to see if everyone is awake, but thanks for taking it seriously.
> Don't forget that 'readability' includes not only editors, but fine
> text tools like grep, awk and sed and scripting languages that want to
> do quick and dirty scans without importing or writing a CIF module.
> By keeping CIF compatible with such tools we make the format that much
> more accessible.
>
> Anyway, let's see what our Northern Hemisphere colleagues have to say...
>
> On Fri, Nov 20, 2009 at 5:34 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
>> On 20/11/09 1:49 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>
>>> The essential issue here appears to be that CIF files are not only
>>> accessed via CIF applications, but also via general-purpose editors
>>> and text utilities, so the lexing/parsing stage should not produce
>>> string values which differ from what might be seen by non-CIF-aware
>>> applications. I think this is a reasonable concern.
>>
>> I do not think it is a reasonable concern any longer and certainly not
>> into the future. I don't know how many people hand-crank or eye-read a
>> CIF but it has got to be vanishingly small. I can't see how a modern
>> system can encode anything in an (essentially) ASCII base and expect to
>> have to conform to human reading.
>>
>> It is almost like saying all image formats have to be ASCIIArt so that the
>> user can view it in an editor. Time to move forward. If anyone wants to
>> read a CIF in that way then it comes with a proviso.
>>
>> Having said that, you are correct in that, given the 5 ways to delimit a
>> string, the likelihood of having to use elision is low, BUT we need to
>> allow for it so that at least if the case were to arise we can handle it.
>>
>> To that end only option 1 below is reasonable. 3 and 4 would just make
>> things far too complex, and 2 is not lexically possible and would have
>> to be driven through the dictionary. And this is the way to specify what to
>> do with the raw string regarding its markup, anyway. So the behaviour I
>> specified in my previous mail will only need to be invoked in rare cases,
>> unless there are people who insist on only ever using "" strings, in which
>> case elision will be needed more frequently.
>>
>>>
>>> However: if we disallow meddling with string content, then we must
>>> also not provide for eliding terminators, for the reasons put forward
>>> in my previous email: the higher-level application has no guaranteed
>>> way of determining whether the <elide><quote> digraph it finds in a
>>> string refers to a lexical escape, or to a domain specific meaning.
>>>
>>> I would note that we have 5 different possibilities for delimiting
>>> strings. What string is going to fail all those variations? It must
>>> contain both triple double and triple single quotes, as well as a
>>> semicolon following a newline. Frankly, the only realistic scenario
>>> that would produce such a monster string is CIF-inside-CIF, which is a
>>> scenario that can be dealt with either at the dictionary level by
>>> defining a transformation to break up the relevant di/trigraph, or by
>>> preparing the contents of the CIF-inside-CIF using alternative
>>> delimiters.
>>>
>>> I therefore strongly suggest that we just forget about trying to elide
>>> terminator characters.
>>>
>>> If there isn't support for dropping elision, I note that the following
>>> proposals are on the table, in addition to the original one of leaving
>>> the eliding character in the string (numbers 3,4 are ones I just
>>> thought up):
>>>
>>> 1. Nick has just put forward a refined version of my recent proposal
>>> which is about as minimalist as one could make it.
>>> 2. Herbert's r" suggestion
>>> 3. Adding a string concatenation character to the syntax, so that
>>> problem strings could be split into separate bits that use different
>>> delimiters.
>>> 4. Specify Standard LISP behaviour for all lists where the first entry
>>> is 'eval'. String concatenation is one of many possibilities that is
>>> opened up... :)
>>>
>>> On Fri, Nov 20, 2009 at 7:57 AM, SIMON WESTRIP
>>> <simonwestrip@btinternet.com> wrote:
>>>> Dear all
>>>>
>>>> Haven't caught up with all the recent discussions yet, but hopefully have
>>>> identified the following views appropriately:
>>>>
>>>> 1) Nick's proposal (preference):
>>>>
>>>> "In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter.
>>>>
>>>> There is special meaning for \n, \t etc which
>>>> are replaced by their single character.
>>>>
>>>> \u123456 (up to 6 hex numbers)
>>>> indicate a unicode character which should be replaced by the correct byte
>>>> sequence.
>>>>
>>>> All other first reverse solidus should be removed, and the
>>>> immediately following character passed on as part of the string.
>>>>
>>>> Characters can be (multibyte) UTF-8.
>>>> "
>>>>
>>>> SPW: Though the logic of this is unquestionable (from a programmer's
>>>> perspective), I think this might be too disruptive. Though CIF2 promises
>>>> interpretable content to enhance data processing, CIF is also an
>>>> archiving format.
>>>> I believe that restrictions on the content of a data value should be
>>>> minimal, governed by necessity (e.g. restrictions to avoid delimiter
>>>> conflicts), rather than restricting the character set of the content to
>>>> facilitate parsing or interpretability by any particular programming
>>>> language. On the one hand CIF2 promises to be a more flexible archiving
>>>> format by extending its character set, while on the other hand it could
>>>> become more restrictive by requiring that every reverse solidus has to
>>>> be 'doubled-up' in a data value.
>>>>
>>>> Granted, there are strong arguments that people will decreasingly need to
>>>> interact with a CIF in its raw form, so extra complexities of syntax are
>>>> not too much of a problem, but as many have pointed out, they still will
>>>> read/edit raw CIFs, and may well have no alternative on occasion (for
>>>> example, the IUCr will shortly be requiring authors to include
>>>> refinement-software instruction listings in their CIFs, which will need
>>>> to be included 'as is' within the restrictions of the data value
>>>> delimiters and line lengths, purely for review purposes and only
>>>> available in their raw form in the CIF).
>>>>
>>>> So on a fundamental level, I don't see that \n, \t, ... need to be
>>>> reserved as special within a data value, nor \u123456. Definition of
>>>> special meanings for these can be handled at a higher level? Equally,
>>>> unless the reverse solidus escapes a delimiter character within the
>>>> context of the identified opening delimiter, I don't see why it should
>>>> be discarded by a parser.
>>>>
>>>> 2) James' proposal:
>>>>
>>>> "backslash elides, only two specific ones:
>>>>
>>>> <backslash><terminator> and <backslash><backslash>.
>>>>
>>>> Any other use of
>>>> backslash would simply leave that backslash untouched.
>>>> "
>>>>
>>>> SPW: tend to agree with this (see above), but why escape a backslash when
>>>> they will be untouched anyway if they're not followed by a terminator?
>>>>
>>>> 3) Herbert's proposal:
>>>>
>>>> "may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James's cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>> "
>>>>
>>>> SPW: could be a neat solution for backward-compatibility, but with more
>>>> complexity comes the potential for more errors?
>>>> Also, what about r; (assuming we're not just talking about quoted strings)?
>>>>
>>>>
>>>> So if it's not possible to allow context-sensitive handling of elides
>>>> (escaping a delimiter if the value is delimited by the same delimiter),
>>>> then I find myself supporting Nick's earlier conclusion (a month back)
>>>> that all elides will be returned at the parser level for the application
>>>> to deal with (THREAD 3)? If either of these approaches is considered
>>>> unsatisfactory, then 'go the whole hog' and adopt the familiar
>>>> 'programming syntax' treatment of elides as described by Nick.
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>> PS usual disclaimer that these aren't necessarily the IUCr's views
>>>> ________________________________
>>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>>> Cc: Nick.Spadaccini@uwa.edu.au
>>>> Sent: Thursday, 19 November, 2009 11:55:37
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>> Dear Colleagues,
>>>>
>>>> My personal preference would be to leave things in what to me seems the
>>>> simpler approach of passing all reverse solidus glyphs to the application.
>>>> However, the pragmatics of achieving a consensus and getting on with
>>>> coding are more important than my personal taste.
>>>>
>>>> The major impact of a change in the handling of the reverse solidus, in
>>>> having some of them absorbed by the CIF2 parsers, would be in the
>>>> handling of legacy CIFs at the IUCr and at the PDB. James is right
>>>> that what we are discussing is the difference between raw and cooked
>>>> python strings. Inasmuch as CIF2 is now going to forbid the use of
>>>> quote marks within non-delimited strings, in order to make the
>>>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>>>> may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James's cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>>
>>>> What say the IUCr journal operation and the PDB? It is their ox we are
>>>> goring here.
>>>>
>>>> Regards,
>>>> Herbert
>>>> =====================================================
>>>> Herbert J. Bernstein, Professor of Computer Science
>>>> Dowling College, Kramer Science Center, KSC 121
>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>> +1-631-244-3035
>>>> yaya@dowling.edu
>>>> =====================================================
>>>>
>>>> On Thu, 19 Nov 2009, James Hester wrote:
>>>>
>>>>> OK, fair enough. Just to clarify, I am not advocating the full
>>>>> repertoire of backslash elides, only two specific ones:
>>>>> <backslash><terminator> and <backslash><backslash>. Any other use of
>>>>> backslash would simply leave that backslash untouched.
>>>>>
>>>>> Would suggesting that the cut-and-pasters restrict themselves to
>>>>> semicolon-delimited strings or triple-quote delimited strings help
>>>>> with legacy issues?
>>>>>
>>>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>>>>
>>>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>>>> wrote:
>>>>>>
>>>>>> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>>>
>>>>>>> We need to figure out the behaviour of elides. This was previously
>>>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>>>> strings", especially in messages around Oct 16th. The behaviour
>>>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>>>> returned from the parser. The behaviour I would prefer is for the
>>>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>>>> remain in the string.
>>>>>>>
>>>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>>>>
>>>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>>>> use (i.e. it might be a Greek character sequence)
>>>>>>>
>>>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>>>> following character
>>>>>>>
>>>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>>>> elide character is really eliding a syntactically significant
>>>>>>> character (i.e. a terminator character or an elide character), the
>>>>>>> elide sequence is replaced by the single character. I counter the
>>>>>>> above arguments as follows:
>>>>>>>
>>>>>>> (a) The profusion of algorithms for backslash processing is
>>>>>>> irrelevant. We can interpret the elides because the only algorithm
>>>>>>> that has any relevance at the parser level is the simple
>>>>>>> <backslash><character> -> <character>. All other potential uses
>>>>>>> belong to higher levels. If the higher levels require a
>>>>>>> <backslash><quote>, that is created by writing
>>>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>>>>
>>>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>>>> were converted away by others. This is what I was arguing months ago for
>>>>>> how to interpret the """ strings. That is, \n (EXPLICITLY THE ASCII
>>>>>> REVERSE SOLIDUS) is always a newline, \t is always a tab etc. The parser
>>>>>> should always substitute the single binary character for these character
>>>>>> doublets a la unix/python/C etc. And you quite rightly argue that if you
>>>>>> want \n to really mean the IUCr Greek nu then it will have to be \\n,
>>>>>> and the same parser will give the downstream application \n (having
>>>>>> removed the leading elide). Beautiful, that's what the computer
>>>>>> scientist in me argues.
>>>>>>
>>>>>> However, others argued that many users vim/emacs the file and cut and
>>>>>> paste the text content. So if you have a LaTeX string
>>>>>> "{\\em I am italicised}" that you cut and paste then it fails. And the
>>>>>> blasted backward compatibility argument comes in with existing CIF1
>>>>>> files that are not doubly elided.
>>>>>>
>>>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY
>>>>>> different beast to a CIF1 string. We know that with CIF1 data names and
>>>>>> data values we have to push our CIF2 parser into a different grammar to
>>>>>> handle things correctly. At that level elides in a string will have a
>>>>>> strict CIF1 meaning (i.e. IUCr Greek markup).
>>>>>>
>>>>>> In CIF2 an elide in a string protects the following character from being
>>>>>> interpreted as a delimiter. There is special meaning for \n, \t etc
>>>>>> which are replaced by their single character. \u123456 (up to 6 hex
>>>>>> numbers) indicate a unicode character which should be replaced by the
>>>>>> correct byte sequence. All other first reverse solidus should be
>>>>>> removed, and the immediately following character passed on as part of
>>>>>> the string.
>>>>>> Characters can be (multibyte) UTF-8.
>>>>>>
>>>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>>>> are going to have to double all your reverse solidi. You can't cut and
>>>>>> paste from an editor - bad luck.
>>>>>>
>>>>>> I will wait for Herb's response to this because he was an advocate of
>>>>>> leaving things as they were (I think). I am happy to move forward with
>>>>>> your suggested interpretation.
>>>>>>
>>>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>>>> at an escaped prime or an acute accent without applying ugly
>>>>>>> heuristics. Note also that a caller of a CIF reading program doesn't
>>>>>>> currently need to know what the particular string delimiting character
>>>>>>> was for a given string value; in order to make a guess at what
>>>>>>> the backslash might mean, it would often need to know this.
>>>>>>>
>>>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>>>> and I am describing Python 'cooked' string behaviour. Note the
>>>>>>> following paragraph from
>>>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>>>>
>>>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>>>> backslash is included in the string without change, and all
>>>>>>> backslashes are left in the string. For example, the string
>>>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>>>> a valid string literal consisting of two characters: a
>>>>>>> backslash and a double quote; r"\" is not a valid string
>>>>>>> literal (even a raw string cannot end in an odd number of
>>>>>>> backslashes).
>>>>>>> Specifically, a raw string cannot end in a
>>>>>>> single backslash (since the backslash would escape the
>>>>>>> following quote character). Note also that a single backslash
>>>>>>> followed by a newline is interpreted as those two characters
>>>>>>> as part of the string, not as a line continuation.
>>>>>>>
>>>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>>>> them slightly less expressive than cooked strings, which can express
>>>>>>> everything.
>>>>>>>
>>>>>>> I would challenge Nick et al. to explain what the advantage
>>>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>>>> to want to expose a confusing part of it to them. Some questions we
>>>>>>
>>>>>> The original "advantage" (if you could call it that) was to keep others
>>>>>> happy and to support backwards compatibility.
>>>>>>
>>>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>>>> handle escaping any strings passed to you for output? How do you know
>>>>>>> if the caller has done the escaping already, or not? Do you really
>>>>>>> expect the calling software to work out whether it wants a single or
>>>>>>> double or triple quote delimited string? Isn't that the service
>>>>>>> provided by your software? What are they (not) paying you for, anyway?
>>>>>>
>>>>>> When they pay, I'll answer that question!
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> --------------------------------
>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>> School of Computer Science & Software Engineering
>>>>>>
>>>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>>>> CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
>>>>>> MBDP M002
>>>>>>
>>>>>> CRICOS Provider Code: 00126G
>>>>>>
>>>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>>>
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> ddlm-group@iucr.org
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>
>>>>> --
>>>>> T +61 (02) 9717 9907
>>>>> F +61 (02) 9717 3145
>>>>> M +61 (04) 0249 4148
>>
>> cheers
>>
>> Nick

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
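
For readers comparing the proposals quoted in this thread, the two parser-level rules can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not anything the group agreed on: the function names, the contents of the escape table (the thread only says "\n, \t etc") and the treatment of malformed input are all hypothetical.

```python
import re

# Assumed escape letters; the thread only says "\n, \t etc".
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r"}

# A reverse solidus followed by either u + 1..6 hex digits or any one character.
_ELIDE = re.compile(r"\\(u[0-9A-Fa-f]{1,6}|.)", re.DOTALL)


def unescape_option1(raw: str) -> str:
    """Sketch of Nick's proposal ("Option 1"): every leading reverse
    solidus is consumed and the following character is passed on."""

    def repl(match: re.Match) -> str:
        body = match.group(1)
        if len(body) > 1:
            # \u followed by hex digits. chr() raises ValueError above
            # 0x10FFFF; how a parser should react is left open here.
            return chr(int(body[1:], 16))
        # Known letters become control characters; anything else
        # (a quote, a backslash, ...) is passed through unchanged.
        return SIMPLE_ESCAPES.get(body, body)

    # A lone reverse solidus at the very end of the input does not match
    # the pattern and is left in place.
    return _ELIDE.sub(repl, raw)


def unescape_minimal(raw: str, terminator: str) -> str:
    """Sketch of James's earlier proposal: only <backslash><terminator>
    and <backslash><backslash> are elides; other backslashes stay."""
    pattern = re.compile(r"\\(" + re.escape(terminator) + r"|\\)")
    return pattern.sub(lambda m: m.group(1), raw)


if __name__ == "__main__":
    latex = r"{\em I am italicised}"
    # Option 1 eats the undoubled backslash; the minimal rule keeps it.
    print(repr(unescape_option1(latex)))
    print(repr(unescape_minimal(latex, '"')))
```

Under these assumptions both functions turn the on-disk sequence <backslash><backslash><backslash><quote> into <backslash><quote>, as James describes, while only the Option 1 sketch alters a pasted, undoubled LaTeX or IUCr markup string.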