Discussion List Archives

Re: [ddlm-group] Use of elides in strings

Dear All,

As before, I maintain my position that we should abandon eliding
completely. I examine here the proposition that all elide processing
is performed at a higher level, where one might expect that different
behaviours can be logically separated.

Before doing this analysis, note the following:

(1) If the meaning of <elide><terminator> in a string received from
the lexer is ambiguous, something akin to at least the minimal
approach suggested by Nick of mechanically adding/removing one <elide>
character from before every <terminator> character is necessary in
order to reliably lift the ambiguity.  This may be done at the lexer
level, as we originally proposed, or indeed at the dictionary level.
Regardless of where it is done, the raw string on disk will have extra
<elide> characters in those situations where <elide><terminator> does
not mean <terminator>.  As Nick said in a later email, cutting and
pasting will still not work in this case.

As a concrete example, <backslash><quote> in an IUCr 'legacy' string
may mean <acute accent> or it may mean <quote>, but by inserting an
extra backslash before those combinations that mean <acute accent>, we
can remove ambiguity.
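
To make the mechanical scheme in (1) concrete, here is a minimal sketch in Python. It assumes the elide is a single backslash and the terminator is a single character; the function names `add_elides` and `remove_elides` are purely illustrative and come from no existing CIF software. It does not address a value that itself ends in an elide character.

```python
ELIDE = "\\"  # assumption: the elide character is a single backslash


def add_elides(value: str, terminator: str) -> str:
    """Insert one elide before every terminator (string value -> on-disk text)."""
    return value.replace(terminator, ELIDE + terminator)


def remove_elides(raw: str, terminator: str) -> str:
    """Remove one elide from before every terminator (on-disk text -> string value)."""
    return raw.replace(ELIDE + terminator, terminator)


QUOTE = "'"
accent_value = "abcd\\'efgh"  # backslash-quote kept in the value: legacy accent code
quote_value = "abcd'efgh"     # plain quote character in the value

accent_on_disk = add_elides(accent_value, QUOTE)  # abcd\\'efgh (two backslashes)
quote_on_disk = add_elides(quote_value, QUOTE)    # abcd\'efgh  (one backslash)
```

The two values get different on-disk forms, so removing one elide before each terminator recovers each value unambiguously. Neither on-disk form is identical to the string value, which is the cut-and-paste problem mentioned above.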

(2) In the approach of (1), if the dictionary level doesn't know what
the particular terminator character was, it has no way of knowing
which character sequences it has to remove the <elide>s from: before
all the <quote>s, or before all the <double quotes>?  So the lexer
will need to pass the particular string delimiter character used to
the dictionary level.  Alternatively, we can specify that all
potential terminator characters are always escaped, even if that
particular string has different delimiters.  In either case, we are
adding significant additional complexity to our specification.
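
The two options in (2) can be sketched as follows. This is an illustration only: the `LexedString` type, both function names, and the choice of <quote> and <double quote> as the only potential terminators are assumptions made for the example, not part of any proposal.

```python
from dataclasses import dataclass

ELIDE = "\\"                        # assumption: elide is a single backslash
POTENTIAL_TERMINATORS = ("'", '"')  # assumption: the only possible delimiters


@dataclass
class LexedString:
    """Hypothetical token: the lexer passes on the delimiter it actually saw."""
    raw: str
    delimiter: str


def value_with_delimiter(token: LexedString) -> str:
    """Option 1: remove elides only before the delimiter used for this string."""
    return token.raw.replace(ELIDE + token.delimiter, token.delimiter)


def value_all_terminators(raw: str) -> str:
    """Option 2: every potential terminator is always elided, so remove them all."""
    for terminator in POTENTIAL_TERMINATORS:
        raw = raw.replace(ELIDE + terminator, terminator)
    return raw


raw = "a\\'b\\\"c"  # on disk: a\'b\"c
as_single_quoted = value_with_delimiter(LexedString(raw, "'"))
as_double_quoted = value_with_delimiter(LexedString(raw, '"'))
as_always_elided = value_all_terminators(raw)
```

The same raw text gives three different values depending on which rule is applied, so the dictionary level cannot recover the value from the raw text alone. It needs either the delimiter or a fixed always-elide convention.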

Now to Herbert's email:

>  Let us consider James' example.  He is actually making the case
> for _not_ removing the reverse-solidus from a string at the
> lexical level.
>  xxxx<backslash><quote>elxxxx
> or to be more specific
>  abcd\'efgh
> and we are presented with the question of how should the
> dictionary interpret that string.
> If we have a string intended to be part of the modern pythonesque
> world, then I would expect the data element to have been typed
> in a way that says we should read the string as
>  abcd'efgh
> If we have a string that is a legacy from a CIF 1 file with
> IUCr type-setting codes, I would expect the data element to
> have been typed in a way that says we should read the string as
> abcd(e with an acute accent)fgh

My point was that *both* readings are possible in a *single* string
because, as far as I know, the IUCr currently accepts a plain <quote>
character as meaning <quote>.  The interpretation is therefore
ambiguous, and we need some scheme to disambiguate these uses.

> Anything the lexer does to remove the reverse-solidus is
> going to disfavor one interpretation or the other.

It would not disfavour either; it would simply separate lexical and
semantic functions.

> By moving these two interpretations one level up to two
> different utility routines, we gain much more use from
> a common lexer and nobody loses any functionality.

To repeat: we cannot separate these interpretations into two different
routines/dictionary types, because both interpretations are possible
in a single string.
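
As an illustration of both readings occurring in one string, the sketch below first removes elides at the lexical level and then applies the legacy accent reading at a higher level. It assumes a quote-delimited string, and that <backslash><quote> followed by a letter is the code for that letter with an acute accent; the function names are hypothetical.

```python
import re
import unicodedata


def remove_elides(raw: str, terminator: str, elide: str = "\\") -> str:
    """Lexical step: remove one elide from before every terminator."""
    return raw.replace(elide + terminator, terminator)


def apply_acute_codes(value: str) -> str:
    """Semantic step (assumed convention): <backslash><quote><letter> -> accented letter."""
    composed = re.sub(r"\\'([A-Za-z])", lambda m: m.group(1) + "\u0301", value)
    return unicodedata.normalize("NFC", composed)


on_disk = "caf\\\\'e isn\\'t"        # on disk: caf\\'e isn\'t
lexed = remove_elides(on_disk, "'")  # caf\'e isn't
text = apply_acute_codes(lexed)      # the first quote is an accent code, the second a plain quote
```

A single routine chosen per dictionary type could not read this string correctly without the extra elide, since it contains one <quote> of each kind.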

To take this further: what about strings for which only one meaning of
<elide><terminator> is possible, that meaning is not <terminator>
(because that reduces trivially to the minimalist proposal), and
<terminator> cannot appear apart from in the sequence
<elide><terminator>?  Can any of you produce a string type from
anywhere (computer language, legacy CIF, whatever) for which this is
true?  If not, I would suggest that leaving handling of elides to the
dictionary gains us nothing, at the cost of additional complexity and
confusion among users, as Nick points out in a later email.

Note that it is reasonable to suppose that if a language has a special
meaning for <elide><terminator>, that meaning exists in order to
escape the ordinary meaning of <terminator>, which must therefore also
exist in that same language.

I rest my case that there is no advantage now or ever in leaving elide
treatment to the dictionary level, because (a) all elide treatment will
require differences between the on-disk and actual string value, and
(b) complexity is added due to the need either to pass information
about string delimiters to the dictionary level, or to elide all
potential delimiters in all strings.
