[ddlm-group] Python-type eliding for triple-quoted strings

I do not think Ralf's proposal as it stands is suitable for CIF, for
the following reasons:
(i) It implies 10 escape sequences in non-raw strings which are
syntactically irrelevant and a need for which has not been identified.
  These are \newline, \a, \b, \f, \n, \r, \t, \v, \ooo and \xhh.
(ii) Raw strings are mostly useful for the non-lexically significant
escape sequences listed in (i).  Raw and cooked strings are almost
equivalent when these escape sequences are removed (see below for
discussion of this point). It is not clear therefore that raw strings
provide much benefit.
(iii) The inclusion of Unicode strings is not required to satisfy the
need for expressing any string in CIF.  They are a viable stand-alone
solution, however (Proposal B below).

Why are cooked and raw strings almost equivalent in our situation?
The point of raw strings is to allow backslash escape sequences to be
preserved in the input string, which is particularly important for
backslash-rich markup such as LaTeX strings.  If we exclude the 10
unneeded escape sequences and allow only the <backslash><delimiter>
and <backslash><backslash> sequences to be significant during parsing,
then we win very little by including raw strings in the proposal. Most
importantly, we cannot determine whether a <backslash><delimiter>
sequence in our raw string is due to a need to elide the delimiter, or
is a backslash combination that is intended for the string's consumer.
To resolve this ambiguity, we must process the string and remove the
<backslash> wherever it serves only to elide the delimiter, and we
must provide some way of marking the other case.  The simplest is to
precede the <backslash><delimiter> with <backslash><backslash> when a
backslash is meant to remain in the string.  The raw string is then
interpreted just like the cooked string, because the extra backslashes
do not form part of the string's value.  The one thing a raw string
does give us in this situation is that double backslashes elsewhere in
the string can be left alone, as long as they are not associated with
a <delimiter>.  Proposal A below also has this property.

Most of these criticisms are rectifiable, so I'm going to simplify and
divide Ralf's proposal into two parts, which both separately solve the
problem of representing every possible string in a CIF file.

Proposal A: strings can be delimited by three quotes or three
apostrophes.  Whenever the sequence <backslash><delimiter> is
encountered when reading such a string, it is replaced by <delimiter>,
and <delimiter> loses any special meaning.
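
As a rough illustration of the reading rule in Proposal A, the
following Python sketch decodes the body of a triple-delimited string
once the lexer has extracted it.  The function name decode_a and its
delim parameter are hypothetical and are not part of any CIF
specification.

```python
def decode_a(body, delim='"'):
    """Replace every <backslash><delimiter> pair with a bare <delimiter>.

    `body` is the text between the opening and closing triple delimiters;
    `delim` is the single delimiter character (quote or apostrophe).
    All other backslashes are left untouched.
    """
    return body.replace('\\' + delim, delim)


print(decode_a('say \\"hi\\"'))         # say "hi"
print(decode_a('\\alpha \\\\ \\beta'))  # unchanged: \alpha \\ \beta
```

Because only the <backslash><delimiter> pair is touched, LaTeX-style
markup passes through without any doubling of backslashes.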

Proposal B: strings can be delimited by three quotes or three
apostrophes, optionally preceded immediately by the letter 'u'
("Unicode strings").  In a non-Unicode string, no special behaviour is
defined (as in the current CIF2 proposal).  In a Unicode string, the
escapes \uxxxx and \Uxxxxxx denote the corresponding Unicode code
points.  Delimiters and backslashes can therefore be included in the
string by using their Unicode code point numbers.
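
A minimal Python sketch of how the body of a Proposal B Unicode string
might be decoded is given below.  The name decode_b is hypothetical,
and the six-digit form of \U is taken literally from the wording above
(Python's own \U escape takes eight digits), so treat the digit counts
as an assumption.

```python
import re

# \u followed by four hex digits, or \U followed by six hex digits.
# The six-digit form is an assumption based on the proposal's wording.
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})')


def decode_b(body):
    """Replace each \\uxxxx or \\Uxxxxxx escape with its code point.

    Values above 10FFFF are not valid code points and raise ValueError.
    """
    def to_char(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(to_char, body)


print(decode_b('\\u0022quoted\\u0022'))  # "quoted"
print(decode_b('\\u005Cu0041'))          # \u0041 (the result is not re-scanned)
```

The substitution makes a single left-to-right pass, so a backslash
produced from \u005C cannot combine with the text that follows it to
form a new escape.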

JRH comments on these two proposals:

(i) Both proposals require the cooperation of the lexer.  Proposal A
requires it only because of the particular case of the source string
terminating with <delimiter>; in all other cases the triple
<delimiter> is broken by <backslash>, and so the lexer can be
oblivious to any special meaning.  Proposal B obviously requires that
the initial <u><delimiter> sequence is recognised.
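
The lexer's part in Proposal A can be sketched in Python as a scanner
that skips over <backslash><delimiter> pairs while looking for the
closing triple <delimiter>.  The function scan_a and its return
convention are hypothetical; this is one possible reading of the rule,
not a reference implementation.

```python
def scan_a(text, start=0):
    """Return (raw_body, end_index) for the triple-delimited string at `start`.

    The body is returned undecoded; <backslash><delimiter> pairs are only
    skipped here so that they cannot take part in the closing delimiter.
    """
    delim = text[start]
    if delim not in '"\'' or not text.startswith(delim * 3, start):
        raise ValueError('no triple delimiter at this position')
    i = start + 3
    while i < len(text):
        if text.startswith('\\' + delim, i):
            i += 2                      # elided delimiter, never a terminator
        elif text.startswith(delim * 3, i):
            return text[start + 3:i], i + 3
        else:
            i += 1
    raise ValueError('unterminated string')


body, end = scan_a('"""ends with a quote\\""""')
print(body)  # ends with a quote\"
```

Without the first branch, the string above would appear to close one
character early, which is the terminating-<delimiter> case described
in (i).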

(ii) When preparing a string for output, the rules differ between the
two proposals.  In Proposal A, all <backslash><delimiter> sequences
*must* be replaced by <backslash><backslash><delimiter>.  <delimiter>
*must* be preceded by a <backslash> wherever it would otherwise
terminate the string prematurely.  <delimiter> *may* be preceded by a
<backslash> elsewhere, but this is not a requirement.  In Proposal B,
<delimiter> *must* be replaced by <backslash>uxxxx, where xxxx is the
Unicode code point number for <delimiter>, wherever it would otherwise
terminate the string prematurely.  <backslash> *must* be replaced by
<backslash>u005C if the character sequence <backslash><u> is contained
in the source text.
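
The output rules in (ii) can be sketched in Python as follows.  Both
functions are hypothetical and take the simplest route of escaping
every <delimiter>.  Proposal A explicitly permits this; for Proposal B
it is assumed to be harmless, since the decoded value is the same.
Treating <backslash><U> like <backslash><u> is a further assumption of
this sketch.

```python
import re


def encode_a(value, delim='"'):
    # Prefixing every delimiter with a backslash also turns an existing
    # <backslash><delimiter> into <backslash><backslash><delimiter>.
    return value.replace(delim, '\\' + delim)


def encode_b(value, delim='"'):
    # Protect backslashes that would otherwise start a \u or \U escape.
    protected = re.sub(r'\\(?=[uU])', lambda m: '\\u005C', value)
    # Then write each delimiter as its four-digit code point escape.
    return protected.replace(delim, '\\u%04X' % ord(delim))


print(encode_a('say "hi"'))  # say \"hi\"
print(encode_b('say "hi"'))  # say \u0022hi\u0022
print(encode_b('\\u0041'))   # \u005Cu0041
```

A writer that wants minimal output could escape a <delimiter> only
where three would otherwise occur in a row or where one ends the
string, at the cost of a more involved scan.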

(iii) Proposal B allows Unicode characters to be included in strings
even without access to a Unicode-aware editor, so this proposal may be
useful in general.

(iv) No proposal can make cutting and pasting foolproof, but the need
for editing is considerably reduced if we restrict the operation of
eliding to triple-<delimiter> delimited strings.

What responses do others have to these proposals?  Even if your
response is predictable from previous discussions, please at least
indicate that this is the case.