
[ddlm-group] Python-type eliding for triple-quoted strings

I am going to divide Ralf's proposal into two parts, each of which
separately solves the problem of representing every possible string in
a CIF file.

Proposal A: strings can be delimited by three quotes or three
apostrophes ("cooked strings" hereafter) or else by three quotes or
three apostrophes immediately preceded by the letter 'r' ("raw
strings").  Both cooked and raw strings define two special sequences:
<backslash><delimiter> and <backslash><backslash>.  When these
sequences are encountered in a cooked string, the first backslash is
removed and the second character no longer has any special meaning
(delimiter or elide).  When these sequences are encountered in a raw
string, they function as for a cooked string, but the initial
<backslash> is not removed. Note that I have deliberately excluded the
following escape sequences from this proposal as they are not
syntactically relevant: \newline, \a, \b, \f, \n, \r, \t, \v, \ooo, \xhh.

Under Proposal A, the sequence <backslash><delimiter> is represented
as <backslash><backslash><backslash><delimiter> in a cooked string.
In a raw string, it may be left as <backslash><delimiter>.  In a raw
string, a string terminating with <delimiter> must contain
<backslash><delimiter> as the last two characters.  A raw string
cannot finish with a single <backslash>.
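
The rules of Proposal A can be illustrated with a short Python sketch.
This is not a reference implementation. The function names
(decode_cooked, decode_raw, find_close) are hypothetical, and the
sketch assumes that <delimiter> means the single quote or apostrophe
character from which the triple delimiter is built, and that the
opening delimiter has already been consumed.

```python
def find_close(text: str, start: int, quote: str) -> int:
    """Return the index of the closing triple delimiter, or -1 if none.

    The scan is the same for cooked and raw strings: the pairs
    <backslash><backslash> and <backslash><delimiter> are skipped as a
    unit, so the second character cannot start or end anything.
    """
    i = start
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ("\\", quote):
            i += 2  # skip the two-character special sequence
            continue
        if text.startswith(quote * 3, i):
            return i
        i += 1
    return -1


def decode_cooked(body: str, quote: str) -> str:
    """Cooked string: drop the first backslash of each special sequence."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in ("\\", quote):
            out.append(body[i + 1])  # keep only the second character
            i += 2
        else:
            out.append(body[i])  # any other backslash is ordinary data
            i += 1
    return "".join(out)


def decode_raw(body: str, quote: str) -> str:
    """Raw string: the special sequences are kept verbatim in the value."""
    return body
```

With these definitions, the cooked body <backslash><backslash><backslash><delimiter>
decodes to <backslash><delimiter>, while the raw body <backslash><delimiter>
is returned unchanged. A raw body ending in a single <backslash>
directly before the closing delimiter makes find_close return -1,
because the backslash pairs with the first character of the delimiter.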

Proposal B: strings can be delimited by three quotes or three
apostrophes or else by three quotes or three apostrophes immediately
preceded by the letter 'u' ("Unicode strings").  In a non-Unicode
string, no special behaviour is defined (as in the current CIF2
proposal).  In a Unicode string, the escapes \uxxxx and \Uxxxxxx
stand for the corresponding Unicode code point.
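
A minimal Python sketch of the Unicode-string rule in Proposal B
follows. The name decode_unicode is hypothetical, and the sketch takes
the digit counts literally as written above (four hexadecimal digits
after \u, six after \U). The proposal does not say how a literal
<backslash>u would be written, so the sketch does not handle that case.

```python
import re

# \u followed by 4 hex digits, or \U followed by 6 hex digits
# (digit counts taken literally from the proposal text).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})")


def decode_unicode(body: str) -> str:
    """Replace each \\uxxxx or \\Uxxxxxx escape with its code point.

    All other characters, including other backslash sequences, are
    passed through unchanged. chr() raises ValueError for values
    above 0x10FFFF, which six hex digits can express.
    """
    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))

    return _ESCAPE.sub(replace, body)
```

A non-Unicode string would simply skip this step and use the body
as-is.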


I believe that this scheme is not particularly appropriate for the CIF
context, which is unsurprising given that Python literals are designed
for embedding in programs and CIF literals are intended to encapsulate
arbitrary data.  My criticisms are as follows:

(1) Many of the <backslash><character> sequences in non-raw strings
already have a meaning as IUCr markup or LaTeX markup.
(2) The lexer must be informed of the
(3) Raw strings will include the <backslash><delimiter> sequence in
the datavalue, meaning that the


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
