# Re: [ddlm-group] Use of elides in strings



On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> We need to figure out the behaviour of elides.  This was previously
> discussed in a thread entitled "The alphabet of non-delimited
> strings", especially in messages around Oct 16th.  The behaviour
> advocated by Nick is for both the eliding and elided character to be
> returned from the parser.  The behaviour I would prefer is for the
> eliding character to disappear; it should itself be elided if it is to
> remain in the string.
>
> To summarize Nick's and Herbert's arguments from the emails dated Fri
> Oct 16, 2009 at 6:22AM and subsequently
>
> 1. We don't interpret elides because we don't know what algorithm to
> use (i.e. it might be a greek character sequence)
>
> 2. The elide simply signals that the lexer should not interpret the
> following character
>
> My counter-proposal is similar to Simon's original expectation: if the
> elide character is really eliding a syntactically significant
> character (i.e. a terminator character or an elide character), the
> elide sequence is replaced by the single character.  I counter the
> above arguments as follows:
>
> (a) The profusion of algorithms for backslash processing is
> irrelevant. We can interpret the elides because the only algorithm
> that has any relevance at the parser level is the simple
> <backslash><character> -> <character>.  All other potential uses
> belong to higher levels.  If the higher levels require a
> <backslash><quote>, that is created by writing
> <backslash><backslash><backslash><quote> in the on-disk string.

Couldn't agree with you more; you are preaching to the converted, who was
then converted away by others. This is what I was arguing months ago for how
to interpret the """ strings: \n (EXPLICITLY THE ASCII REVERSE SOLIDUS
followed by n) is always a newline, \t is always a tab, etc. The parser
should always substitute the single binary character for these character
doublets, a la Unix/Python/C. And you quite rightly argue that if you want
\n to really mean the IUCr Greek nu, then it will have to be written \\n,
and the same parser will give the downstream application \n (having removed
the leading elide). Beautiful; that's what the computer scientist in me
argues.

However, others argued that many users vim/emacs the file and cut and paste
the text content. So if you have a LaTeX string "{\\em I am italicised}"
that you cut and paste then it fails.  And the blasted backward
compatibility argument comes in with existing CIF1 files that are not doubly
elided.

What we can do is push the idea that a CIF2 string is a COMPLETELY different
beast to a CIF1 string. We know that with CIF1 data names and data values we
have to push our CIF2 parser into a different grammar to handle things
correctly. At that level elides in a string will have a strict CIF1 meaning
(i.e. IUCr Greek markup).

In CIF2 an elide in a string protects the following character from being
interpreted as a delimiter. There is special meaning for \n, \t etc., which
are replaced by their single character. \u123456 (up to 6 hex digits)
indicates a Unicode character, which should be replaced by the correct byte
sequence. In every other case the leading reverse solidus should be removed,
and the immediately following character passed on as part of the string.
Characters can be (multibyte) UTF-8.
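
A rough sketch of the rules in the paragraph above, in Python. Nothing here
is agreed syntax: the function name process_elides is made up for
illustration, the table of "special" escapes holds only \n and \t because
those are the only two named above, and reading the hex digits after \u
greedily (up to six) is my assumption about how "up to 6" would be resolved.

```python
# Sketch only: the names and the exact escape table are assumptions.
SPECIAL = {"n": "\n", "t": "\t"}      # only the escapes named in the post
HEX_DIGITS = "0123456789abcdefABCDEF"


def process_elides(raw: str) -> str:
    """Return the string value for the on-disk text between the delimiters."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch != "\\":
            out.append(ch)            # ordinary character, possibly non-ASCII
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("string ends with an unpaired elide")
        nxt = raw[i + 1]
        if nxt in SPECIAL:
            out.append(SPECIAL[nxt])  # \n, \t -> single control character
            i += 2
        elif nxt == "u":
            # Assumption: take hex digits greedily, at most six of them.
            j = i + 2
            while j < len(raw) and j - (i + 2) < 6 and raw[j] in HEX_DIGITS:
                j += 1
            digits = raw[i + 2:j]
            if digits:
                # chr() raises ValueError above U+10FFFF.
                out.append(chr(int(digits, 16)))
                i = j
            else:
                out.append("u")       # assumption: bare \u is a plain elided u
                i += 2
        else:
            out.append(nxt)           # drop the elide, keep the character
            i += 2
    return "".join(out)
```

With this, the on-disk text \\n comes back as the two characters \n, and
<backslash><backslash><backslash><quote> comes back as <backslash><quote>,
which is the behaviour James describes in (a).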

If you want to encode LaTeX (or IUCr-speak or something similar) then you
are going to have to double all your reverse solidi. You can't cut and paste
from an editor - bad luck.
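
To make the doubling concrete, here is a small Python sketch of what a
writer would have to do before putting such text on disk. The function name
elide_for_cif2 is hypothetical, and it assumes a single-character delimiter;
triple-quoted strings are not handled.

```python
def elide_for_cif2(text: str, delimiter: str = '"') -> str:
    """Prefix every reverse solidus and every delimiter with an elide.

    Sketch only: assumes a one-character delimiter.
    """
    out = []
    for ch in text:
        if ch == "\\" or ch == delimiter:
            out.append("\\")          # add the eliding character
        out.append(ch)
    return "".join(out)


# The LaTeX fragment {\em I am italicised} has to be stored with its
# reverse solidus doubled.
print(elide_for_cif2(r"{\em I am italicised}"))
```

A parser that strips the leading elide then hands {\em I am italicised}
back to the application unchanged.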

I will wait for Herb's response to this because he was an advocate of
leaving things as they were (I think). I am happy to move forward with your
suggested interpretation.

> (b) The profusion of algorithms for backslash processing means that
> we *must* remove ambiguity by removing the eliding character during
> processing; otherwise, an application can't tell if it is e.g. looking
> at an escaped prime or an acute accent without applying ugly
> heuristics.  Note also that a caller of a CIF reading program doesn't
> currently need to know what the particular string delimiting character
> was for a given string value; in order to make a guess at what
> the backslash might mean, it would often need to know this.
>
> It appears that Nick is describing Python raw string behaviour,
> and I am describing Python 'cooked' string behaviour.  Note the
> following paragraph from
> docs.python.org/reference/lexical_analysis.html#strings:
>
> When an 'r' or 'R' prefix is present, a character following a
> backslash is included in the string without change, and all
> backslashes are left in the string. For example, the string
> literal r"\n" consists of two characters: a backslash and a
> lowercase 'n'. String quotes can be escaped with a backslash,
> but the backslash remains in the string; for example, r"\"" is
> a valid string literal consisting of two characters: a
> backslash and a double quote; r"\" is not a valid string
> literal (even a raw string cannot end in an odd number of
> backslashes). Specifically, a raw string cannot end in a
> single backslash (since the backslash would escape the
> following quote character). Note also that a single backslash
> followed by a newline is interpreted as those two characters
> as part of the string, not as a line continuation.
>
> Note that raw strings cannot end in a backslash, so I would consider
> them slightly less expressive than cooked strings, which can express
> everything.
>
> I would challenge Nick et al. to explain what the advantage
> of keeping the eliding character in the datavalue is, keeping in mind
> that programs like CIFtbx and PyCIFRW and several others aim to hide
> CIF syntax from their users (as a service), and this proposal appears
> to want to expose a confusing part of it to them.  Some questions we

The original "advantage" (if you could call it that) was to keep others
happy and to support backwards compatibility.

> toolbox maintainers will need to ask if this goes through: Do you
> handle escaping any strings passed to you for output?  How do you know
> if the caller has done the escaping already, or not?  Do you really expect
> the calling software to work out whether it wants a single or double
> or triple quote delimited string?  Isn't that the service provided by
> your software?  What are they (not) paying you for, anyway?

When they pay, I'll answer that question!

cheers

Nick

--------------------------------
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G