
[ddlm-group] Use of elides in strings

We need to figure out the behaviour of elides.  This was previously
discussed in a thread entitled "The alphabet of non-delimited
strings", especially in messages around Oct 16th.  The behaviour
advocated by Nick is for both the eliding and elided character to be
returned from the parser.  The behaviour I would prefer is for the
eliding character to disappear; it should itself be elided if it is to
remain in the string.

To summarize Nick's and Herbert's arguments from the emails dated Fri
Oct 16, 2009 at 6:22AM and subsequently:

1. We don't interpret elides because we don't know what algorithm to
use (e.g. it might be a Greek character sequence)

2. The elide simply signals that the lexer should not interpret the
following character

My counter-proposal is similar to Simon's original expectation: if the
elide character is really eliding a syntactically significant
character (i.e. a terminator character or an elide character), the
elide sequence is replaced by the single character.  I counter the
above arguments as follows:

(a) The profusion of algorithms for backslash processing is
irrelevant. We can interpret the elides because the only algorithm
that has any relevance at the parser level is the simple
<backslash><character> -> <character>.  All other potential uses
belong to higher levels.  If the higher levels require a
<backslash><quote>, that is created by writing
<backslash><backslash><backslash><quote> in the on-disk string.

(b) The profusion of algorithms for backslash processing means that
we *must* remove ambiguity by removing the eliding character during
processing; otherwise, an application can't tell if it is e.g. looking
at an escaped prime or an acute accent without applying ugly
heuristics.  Note also that a caller of a CIF reading program doesn't
currently need to know what the particular string delimiting character
was for a given string value; in order to make a guess at what
the backslash might mean, it would often need to know this.
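
To make the rule in (a) concrete, here is a minimal sketch in Python.
The function name unelide and the handling of a lone trailing backslash
(kept as-is) are my own assumptions for illustration, not anything
agreed on this list.

```python
def unelide(raw: str, elide: str = "\\") -> str:
    """Apply <elide><character> -> <character> in one left-to-right pass.

    Hypothetical helper: the name and the treatment of a trailing lone
    elide character (left untouched) are assumptions for illustration.
    """
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == elide and i + 1 < len(raw):
            # Drop the eliding character, keep the elided one verbatim.
            out.append(raw[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# On-disk text <backslash><backslash><backslash><quote> ...
on_disk = '\\\\\\"'
# ... reaches the higher level as <backslash><quote>.
delivered = unelide(on_disk)
```

With this rule, any backslash that survives in the returned value was
deliberately written as a doubled backslash on disk, so the higher
level does not need to know which delimiter the string was written with.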

It appears that Nick is describing Python raw string behaviour,
and I am describing Python 'cooked' string behaviour.  Note the
following paragraph from
docs.python.org/reference/lexical_analysis.html#strings:

	When an 'r' or 'R' prefix is present, a character following a
	backslash is included in the string without change, and all
	backslashes are left in the string. For example, the string
	literal r"\n" consists of two characters: a backslash and a
	lowercase 'n'. String quotes can be escaped with a backslash,
	but the backslash remains in the string; for example, r"\"" is
	a valid string literal consisting of two characters: a
	backslash and a double quote; r"\" is not a valid string
	literal (even a raw string cannot end in an odd number of
	backslashes). Specifically, a raw string cannot end in a
	single backslash (since the backslash would escape the
	following quote character). Note also that a single backslash
	followed by a newline is interpreted as those two characters
	as part of the string, not as a line continuation.

Note that raw strings cannot end in an odd number of backslashes, so I
would consider them slightly less expressive than cooked strings, which
can express any character sequence.
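
The difference can be checked directly in Python.  The sketch below
only illustrates the Python behaviour quoted above; the helper name
compiles is mine, and nothing here is CIF-specific.

```python
# Raw literal: the backslash stays in the value (two characters).
raw_value = r"\""
# Cooked literal: the backslash is consumed by the lexer (one character).
cooked_value = "\""

# A cooked string can end in a backslash by eliding the backslash.
cooked_trailing = "ends with \\"


def compiles(source: str) -> bool:
    """Return True if `source` is a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Source text r"\" : the backslash escapes the closing quote, so the
# literal is never terminated.
raw_trailing_ok = compiles('r"\\"')
# Source text "\\" : a valid cooked literal holding one backslash.
cooked_trailing_ok = compiles('"\\\\"')
```

Here raw_trailing_ok comes out False and cooked_trailing_ok comes out
True, which is the expressiveness gap I am referring to.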

I would challenge Nick et al. to explain what the advantage
of keeping the eliding character in the datavalue is, keeping in mind
that programs like CIFtbx and PyCIFRW and several others aim to hide
CIF syntax from their users (as a service), and this proposal appears
to want to expose a confusing part of it to them.  Some questions we
toolbox maintainers will need to ask if this goes through: Do you
handle escaping any strings passed to you for output?  How do you know
if the caller has done the escaping already, or not?  Do you really expect
the calling software to work out whether it wants a single or double
or triple quote delimited string?  Isn't that the service provided by
your software?  What are they (not) paying you for, anyway?
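
For comparison, under the 'cooked' proposal the output side can be
sketched as below.  The names elide_for_output and read_back are
hypothetical, and I assume a single-character delimiter for brevity
(triple-quoted strings are not handled); this is not how CIFtbx or
PyCIFRW currently work.

```python
def elide_for_output(value: str, delimiter: str = '"', elide: str = "\\") -> str:
    """Wrap `value` in `delimiter`, eliding every delimiter and elide character.

    Hypothetical writer-side helper; assumes a one-character delimiter.
    """
    out = []
    for ch in value:
        if ch == elide or ch == delimiter:
            out.append(elide)
        out.append(ch)
    return delimiter + "".join(out) + delimiter


def read_back(token: str, elide: str = "\\") -> str:
    """Strip the delimiters and apply <elide><character> -> <character>."""
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == elide and i + 1 < len(body):
            i += 1  # skip the eliding character
        out.append(body[i])
        i += 1
    return "".join(out)


# The caller passes and receives plain values; the eliding happens only
# inside the library.
value = 'say "hi" \\ bye'
token = elide_for_output(value)
round_tripped = read_back(token)
```

The caller never has to say whether it has escaped anything already,
because it is never asked to escape anything at all.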

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
