
Re: [ddlm-group] Use of elides in strings

OK, fair enough.  Just to clarify, I am not advocating the full
repertoire of backslash elides, only two specific ones:
<backslash><terminator> and <backslash><backslash>.  Any other use of
backslash would simply leave that backslash untouched.
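
To make the two-elide rule concrete, here is a minimal Python sketch. The function name unelide and the restriction to a single-character terminator are my assumptions for illustration only; nothing here is agreed syntax.

```python
def unelide(raw: str, terminator: str) -> str:
    """Collapse only <backslash><terminator> and <backslash><backslash>.

    Hypothetical helper: 'terminator' is assumed to be the single
    character that closes the string (e.g. a quote). Every other
    backslash is passed through untouched.
    """
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw) and raw[i + 1] in (terminator, "\\"):
            out.append(raw[i + 1])  # drop the eliding backslash, keep the elided character
            i += 2
        else:
            out.append(ch)          # ordinary character, or a backslash with no special role
            i += 1
    return "".join(out)
```

With this rule a pasted LaTeX fragment such as {\em x} comes back unchanged, while <backslash><backslash><backslash><quote> on disk is delivered as <backslash><quote>.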

Would suggesting that the cut-and-pasters restrict themselves to
semicolon-delimited strings or triple-quote delimited strings help
with legacy issues?

Anyway, let us await the opinions of our Western Hemisphere colleagues...

On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>> We need to figure out the behaviour of elides.  This was previously
>> discussed in a thread entitled "The alphabet of non-delimited
>> strings", especially in messages around Oct 16th.  The behaviour
>> advocated by Nick is for both the eliding and elided character to be
>> returned from the parser.  The behaviour I would prefer is for the
>> eliding character to disappear; it should itself be elided if it is to
>> remain in the string.
>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>> Oct 16, 2009 at 6:22AM and subsequently
>> 1. We don't interpret elides because we don't know what algorithm to
>> use (e.g. it might be a Greek character sequence)
>> 2. The elide simply signals that the lexer should not interpret the
>> following character
>> My counter-proposal is similar to Simon's original expectation: if the
>> elide character is really eliding a syntactically significant
>> character (i.e. a terminator character or an elide character), the
>> elide sequence is replaced by the single character.  I counter the
>> above arguments as follows:
>> (a) The profusion of algorithms for backslash processing is
>> irrelevant. We can interpret the elides because the only algorithm
>> that has any relevance at the parser level is the simple
>> <backslash><character> -> <character>.  All other potential uses
>> belong to higher levels.  If the higher levels require a
>> <backslash><quote>, that is created by writing
>> <backslash><backslash><backslash><quote> in the on-disk string.
> Couldn't agree with you more, and you are preaching to the converted who
> were converted away by others. This is what I was arguing months ago for how
> to interpret the """ strings. That is \n (EXPLICITLY THE ASCII REVERSE
> SOLIDUS) is always a newline, \t is always a tab etc. The parser should
> always substitute the single binary character for these character doublets
> à la Unix/Python/C etc. And you quite rightly argue that if you want \n to really
> mean the IUCr Greek nu then it will have to be \\n, and the same parser will
> give the downstream application \n (having removed the leading elide).
> Beautiful, that's what the computer scientist in me argues.
> However others argued that many users vim/emacs the file and cut and paste
> the text content. So if you have a LaTeX string "{\\em I am italicised}"
> that you cut and paste then it fails.  And the blasted backward
> compatibility argument comes in with existing CIF1 files that are not doubly
> elided.
> What we can do is push the idea that a CIF2 string is a COMPLETELY different
> beast to a CIF1 string. We know that with CIF1 data names and data values we
> have to push our CIF2 parser into a different grammar to handle things
> correctly. At that level elides in a string will have a strict CIF1 meaning
> (ie IUCr Greek markup).
> In CIF2 an elide in a string protects the following character from being
> interpreted as a delimiter. There is special meaning for \n, \t etc., which
> are replaced by their single character. \u123456 (up to 6 hex digits)
> indicates a Unicode character, which should be replaced by the correct byte
> sequence. Any other leading reverse solidus should be removed, and the
> immediately following character passed on as part of the string. Characters
> can be (multibyte) UTF-8.
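
The CIF2 rule described in the paragraph above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the name cook is hypothetical, "etc" is read as covering only \n and \t, \u takes up to 6 hex digits greedily, and a lone trailing backslash is kept as-is.

```python
import string

# Assumption: the message only says "\n, \t etc", so only these two are mapped.
SIMPLE = {"n": "\n", "t": "\t"}
HEX_DIGITS = set(string.hexdigits)


def cook(raw: str) -> str:
    """Hypothetical sketch of the proposed CIF2 elide handling."""
    out = []
    i, n = 0, len(raw)
    while i < n:
        ch = raw[i]
        if ch != "\\" or i + 1 == n:
            out.append(ch)  # ordinary character; a lone trailing backslash is kept (assumption)
            i += 1
            continue
        nxt = raw[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt == "u":
            j = i + 2
            while j < n and j - (i + 2) < 6 and raw[j] in HEX_DIGITS:
                j += 1      # greedy: consume up to 6 hex digits (assumption)
            if j > i + 2:
                # chr() raises ValueError for values beyond U+10FFFF
                out.append(chr(int(raw[i + 2:j], 16)))
            else:
                out.append("u")  # no hex digits followed: treat as a plain elided 'u'
            i = j
        else:
            out.append(nxt)  # any other elide: drop the backslash, keep the character
            i += 2
    return "".join(out)
```

The result is a Python str; encoding it as UTF-8 on output would give the byte sequence mentioned above. Note that cook(r'{\em x}') returns '{em x}', which is exactly the cut-and-paste failure under discussion, whereas the doubled form {\\em x} survives as {\em x}.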
> If you want to encode LaTeX (or IUCr-speak or something similar) then you
> are going to have to double all your reverse solidi. You can't cut and paste
> from an editor - bad luck.
> I will wait for Herb's response to this because he was an advocate of
> leaving things as they were (I think). I am happy to move forward with your
> suggested interpretation.
>> (b) The profusion of algorithms for backslash processing means that
>> we *must* remove ambiguity by removing the eliding character during
>> processing; otherwise, an application can't tell if it is e.g. looking
>> at an escaped prime or an acute accent without applying ugly
>> heuristics.  Note also that a caller of a CIF reading program doesn't
>> currently need to know what the particular string delimiting character
>> was for a given string value; in order to make a guess at what
>> the backslash might mean, it would often need to know this.
>> It appears that Nick is describing Python raw string behaviour,
>> and I am describing Python 'cooked' string behaviour.  Note
>> the following paragraph from
>> docs.python.org/reference/lexical_analysis.html#strings:
>> When an 'r' or 'R' prefix is present, a character following a
>> backslash is included in the string without change, and all
>> backslashes are left in the string. For example, the string
>> literal r"\n" consists of two characters: a backslash and a
>> lowercase 'n'. String quotes can be escaped with a backslash,
>> but the backslash remains in the string; for example, r"\"" is
>> a valid string literal consisting of two characters: a
>> backslash and a double quote; r"\" is not a valid string
>> literal (even a raw string cannot end in an odd number of
>> backslashes). Specifically, a raw string cannot end in a
>> single backslash (since the backslash would escape the
>> following quote character). Note also that a single backslash
>> followed by a newline is interpreted as those two characters
>> as part of the string, not as a line continuation.
>> Note that raw strings cannot end in a backslash, so I would consider
>> them slightly less expressive than cooked strings, which can express
>> everything.
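
The raw versus cooked distinction quoted above can be checked directly in Python. The variable names are only for illustration.

```python
# Raw string: the backslash stays in the value.
raw_newline = r"\n"
assert len(raw_newline) == 2          # a backslash followed by 'n'

# Cooked string: the two-character sequence becomes one newline character.
cooked_newline = "\n"
assert len(cooked_newline) == 1

# Raw string with an escaped quote: the quote does not end the literal,
# but the backslash is still delivered as part of the value.
raw_quote = r"\""
assert raw_quote == '\\"'             # backslash + double quote

# Cooked strings can end in a backslash by doubling it; a raw literal
# such as r"\" is a syntax error and cannot express this.
cooked_trailing = "ends in a backslash\\"
assert cooked_trailing[-1] == "\\"
```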
>> I would challenge Nick et al. to explain what the advantage
>> of keeping the eliding character in the datavalue is, keeping in mind
>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>> CIF syntax from their users (as a service), and this proposal appears
>> to want to expose a confusing part of it to them.  Some questions we
> The original "advantage" (if you could call it that) was to keep others
> happy and to support backwards compatibility.
>> toolbox maintainers will need to ask if this goes through: Do you
>> handle escaping any strings passed to you for output?  How do you know
>> if the caller has done the escaping already, or not?  Do you really expect
>> the calling software to work out whether it wants a single or double
>> or triple quote delimited string?  Isn't that the service provided by
>> your software?  What are they (not) paying you for, anyway?
> When they pay, I'll answer that question!
> cheers
> Nick
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
> CRICOS Provider Code: 00126G
> e: Nick.Spadaccini@uwa.edu.au
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148