[ddlm-group] Python-type eliding for triple-quoted strings

I am going to divide Ralf's proposal into two parts, each of which
independently solves the problem of representing every possible string
in a CIF file.

Proposal A: strings can be delimited by three quotes or three
apostrophes ("cooked strings" hereafter) or else by three quotes or
three apostrophes immediately preceded by the letter 'r' ("raw
strings").  Both cooked and raw strings define two special sequences:
<backslash><delimiter> and <backslash><backslash>.  When these
sequences are encountered in a cooked string, the first backslash is
removed and the second character no longer has any special meaning
(delimiter or elide).  When these sequences are encountered in a raw
string, they function as for a cooked string, but the initial
<backslash> is not removed.  Note that I have deliberately excluded the
following escape sequences from this proposal, as they are not
syntactically relevant: \newline, \a, \b, \f, \n, \r, \t, \v, \ooo, \xhh.
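
As a rough illustration of the decoding rule in Proposal A, here is a
minimal Python sketch.  The function name decode_proposal_a is
hypothetical, and it assumes that <delimiter> means the single quote or
apostrophe character rather than the full three-character sequence, and
that the text passed in is the body already stripped of its opening and
closing delimiters.

```python
def decode_proposal_a(body: str, delimiter: str, raw: bool) -> str:
    """Decode the body of a triple-quoted string under Proposal A.

    Hypothetical helper: `delimiter` is assumed to be a single
    character, either '"' or "'".
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        nxt = body[i + 1] if i + 1 < len(body) else ""
        if ch == "\\" and nxt in ("\\", delimiter) and nxt != "":
            # Special sequence: <backslash><backslash> or <backslash><delimiter>.
            if raw:
                out.append(ch)  # a raw string keeps the initial backslash
            out.append(nxt)     # the second character loses its special meaning
            i += 2
        else:
            # Any other character, including sequences such as \n or \t,
            # is passed through untouched (they are outside the proposal).
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, decode_proposal_a(r'a\"b', '"', raw=False) returns the
three characters a"b, while the same call with raw=True returns the
body unchanged.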

Under Proposal A, the sequence <backslash><delimiter> is represented
as <backslash><backslash><backslash><delimiter> in a cooked string.
In a raw string, it may be left as <backslash><delimiter>.  In a raw
string, a string terminating with <delimiter> must contain
<backslash><delimiter> as the last two characters.  A raw string
cannot finish with a single <backslash>.
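
The encoding consequences described above can be sketched in Python as
follows.  Both function names are hypothetical.  encode_cooked escapes
every backslash and every delimiter character, which is more than the
proposal strictly requires but is always safe.  raw_ending_ok checks
only the two end-of-string rules stated above, and it assumes that "a
single <backslash>" generalises to any unpaired trailing backslash.

```python
def encode_cooked(value: str, delimiter: str = '"') -> str:
    """Write `value` as a cooked triple-quoted string (hypothetical helper)."""
    # Double the backslashes first so that the backslashes added in the
    # second step are not doubled again.
    escaped = value.replace("\\", "\\\\").replace(delimiter, "\\" + delimiter)
    return delimiter * 3 + escaped + delimiter * 3


def _backslashes_before(text: str, end: int) -> int:
    """Count the consecutive backslashes immediately before index `end`."""
    count = 0
    while end - count - 1 >= 0 and text[end - count - 1] == "\\":
        count += 1
    return count


def raw_ending_ok(value: str, delimiter: str = '"') -> bool:
    """Check the two raw-string ending rules (assumed interpretation)."""
    if value.endswith(delimiter):
        # The final delimiter must be preceded by an unpaired backslash.
        return _backslashes_before(value, len(value) - 1) % 2 == 1
    # Otherwise the value must not end in an unpaired backslash.
    return _backslashes_before(value, len(value)) % 2 == 0
```

With these definitions, the two-character value <backslash><quote> is
written by encode_cooked as three backslashes followed by the quote,
enclosed in triple quotes, which matches the representation given above.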

Proposal B: strings can be delimited by three quotes or three
apostrophes or else by three quotes or three apostrophes immediately
preceded by the letter 'u' ("Unicode strings").  In a non-Unicode
string, no special behaviour is defined (as in the current CIF2
proposal).  In a Unicode string, the escapes \uxxxx and \Uxxxxxx are
defined as the corresponding Unicode code point.
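
A minimal Python sketch of the Proposal B escapes follows.  The
function name decode_unicode_escapes is hypothetical.  It assumes four
hexadecimal digits after \u and six after \U, exactly as written above
(Python's own \U escape takes eight), and it assumes that no other
backslash sequence is special.

```python
import re

# \u followed by four hex digits, or \U followed by six hex digits
# (digit counts taken from the proposal text, not from Python).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})")


def decode_unicode_escapes(body: str) -> str:
    """Replace \\uxxxx and \\Uxxxxxx with the corresponding code point."""

    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF, which six
        # hex digits can express; this sketch lets that error propagate.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, body)
```

All other text, including sequences such as \t, passes through
unchanged, in line with the statement that no other special behaviour
is defined.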


I believe that this scheme is not particularly appropriate for the CIF
context, which is unsurprising given that Python literals are designed
for embedding in programs and CIF literals are intended to encapsulate
arbitrary data.  My criticisms are as follows:

(1) Many of the <backslash><character> sequences in non-raw strings
already have a meaning as IUCr markup or LaTeX markup.
(2) The lexer must be informed of the
(3) Raw strings will include the <backslash><delimiter> sequence in
the datavalue, meaning that the

