[ddlm-group] Eliding in triple-quoted strings: Proposals C and D

Dear DDLm group members,

Most of you will be aware that the CIF2 standard has been approved by
COMCIFS, with one dissenting vote.  I propose to revisit the point
raised by Ralf in his dissenting vote, in order to see if we can't
improve this aspect of the standard.  The particular problem
identified by Ralf, which also exists to a more limited extent in
CIF1, is that there is no mechanism to elide instances of the string
delimiter sequence, meaning that certain pathological strings cannot
be included in a CIF2 file.  A further issue is that
CIF writing programs have to run through a long series of checks when
determining how to delimit any given string. I propose that we revisit
this problem, with the restriction proposed by Ralf that we consider
only triple quote/triple apostrophe delimited strings.

To get us back up to speed on this issue, you will recall some salient
points from previous discussions, which taken together led to our
failure to make any progress:

(1) CIF files are often edited in text editors.  Working with CIF text
in a text editor should not produce unexpected behaviour for a typical
workflow.
(2) CIF text may include LaTeX or other marked-up text, which will be
cumbersome to insert in the file if it contains many instances of
elide characters (see point (1)).
(3) IUCr "markup" for Greek letters uses backslash to introduce the
special character combination.
(4) Any characters that function as elides must be removed from the
string at parse time to avoid ambiguity in interpretation when
returned to the calling application.

If we limit ourselves to triple quote/apostrophe delimited strings, as
Ralf proposes, then we can construct an elide scheme that is invisible
to the lexer, by simply breaking the trigraph appropriately.  I
propose the following general scheme, where <delimiter> refers to one
delimiter character, so the full string delimiter would be
<delimiter><delimiter><delimiter>:

Proposal C:

When reconstructing the datavalue from an input triple-<delimiter>
delimited string, the following simple transformation is performed:
all occurrences of <delimiter><elide> are replaced by <delimiter>.

My comments on this scheme are as follows:
(0) When preparing a string for output, any occurrences of
<delimiter><elide> *must* be replaced by <delimiter><elide><elide>;
otherwise, <delimiter> needs to be elided only where necessary to
break up triple-<delimiter> sequences in the source string, or where
the final character of the string is <delimiter>.
(1) It is invisible to the lexer, which will correctly find the string
terminator characters without knowledge of the <elide> character used.
(2) With an appropriate choice of <elide>, there is a low likelihood of
ever encountering a string that needs to be transformed:
transformation is necessary only where the string contains
<delimiter><elide>, contains three or more delimiter characters in a
row, or concludes with a delimiter character.
(3) The <elide> is a post-elide, by which I mean it elides the
preceding character, not the next character.  This is preferable to
cover the case of an input string finishing with the <delimiter>
character, in which case some non-<delimiter> character must appear
after it to ensure the lexer does not consider the final <delimiter>
character in the string as the first character of the terminating
<delimiter><delimiter><delimiter> sequence.
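
The writer side described in comments (0) to (3) could be sketched as
follows.  This is only one possible strategy, not something the
proposal prescribes: the function name elide_body is hypothetical, and
the choice to elide the second delimiter of each would-be triple is an
assumption (any placement that breaks the run would do).

```python
def elide_body(value: str, delimiter: str, elide: str) -> str:
    """Prepare a datavalue for output between triple delimiters (Proposal C)."""
    if len(delimiter) != 1 or len(elide) != 1 or delimiter == elide:
        raise ValueError("delimiter and elide must be distinct single characters")
    out = []
    run = 0  # unelided delimiters currently at the end of the output
    for i, ch in enumerate(value):
        out.append(ch)
        if ch != delimiter:
            run = 0
            continue
        run += 1
        nxt = value[i + 1] if i + 1 < len(value) else None
        if (
            nxt is None                          # string ends with a delimiter
            or nxt == elide                      # literal <delimiter><elide> in the source
            or (run == 2 and nxt == delimiter)   # a third delimiter would follow
        ):
            out.append(elide)  # post-elide: marks the delimiter just written
            run = 0
    return "".join(out)
```

The result never contains three delimiters in a row and never ends
with a delimiter, so a lexer that knows nothing about <elide> still
finds the correct terminator.  Replacing every <delimiter><elide> in
the result by <delimiter> gives back the original value.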

Finally, consider a general proposal D:

Elided triple-<delimiter> strings are delimited by
<char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>.
The initial <char> defines the character to use to post-elide the
contents of the string as per proposal C. <char> would initially be
any non-alphanumeric ASCII character, with the set expanded in the
future to include Unicode characters once most applications were
Unicode-aware.
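
To make the reading side of proposal D concrete, here is a minimal
sketch in Python.  The function name parse_elided_string is
hypothetical, and the checks on <char> are my reading of
"non-alphanumeric ASCII"; in particular, rejecting whitespace and the
two quote characters as <char> is an assumption that the proposal does
not settle.

```python
def parse_elided_string(token: str) -> str:
    """Return the datavalue of a complete <char><d><d><d>...<d><d><d> token."""
    if len(token) < 7:
        raise ValueError("too short to hold <char> and two triple delimiters")
    elide, delimiter = token[0], token[1]
    if delimiter not in ('"', "'"):
        raise ValueError("expected a quote or apostrophe after <char>")
    # Assumption: <char> is ASCII, non-alphanumeric, non-space, and not a quote.
    if (not elide.isascii() or elide.isalnum() or elide.isspace()
            or elide in ('"', "'")):
        raise ValueError("unsuitable elide character")
    fence = delimiter * 3
    if not token.startswith(fence, 1):
        raise ValueError("opening triple delimiter not found")
    # The lexer step: the first triple delimiter after the opener terminates
    # the string.  No knowledge of the elide character is needed here.
    end = token.find(fence, 4)
    if end == -1:
        raise ValueError("unterminated string")
    if end + 3 != len(token):
        raise ValueError("text after the closing triple delimiter")
    # The Proposal C step: drop each elide that directly follows a delimiter.
    return token[4:end].replace(delimiter + elide, delimiter)
```

Note that the lexing and the un-eliding are separate steps, which is
the sense in which the scheme is invisible to the lexer.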

Examples (LHS is the string as written in the CIF file; RHS is the
actual datavalue, shown inside angle brackets):

      &""" Bleg blah blah ""&"  and so forth "&"""    < Bleg blah blah """  and so forth ">
      $'''''$' AAABBB ''$' CCCDDD '$'''               <''' AAABBB ''' CCCDDD '>

This allows the string writer to choose the elide character to
minimise <delimiter><elide> occurrences in the source text.  Note that
the need to choose and prepend a character to the string minimises the
likelihood that somebody will do a naive cut and paste.
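
One way a writer might pick the elide character is sketched below.
The function name choose_elide, the candidate set (ASCII punctuation
from Python's string.punctuation, minus the two quote characters) and
the tie-breaking rule (first candidate in that order) are all
assumptions for illustration.

```python
import string


def choose_elide(value: str, delimiter: str) -> str:
    """Pick the candidate that follows `delimiter` least often in `value`."""
    # Assumption: any ASCII punctuation except the quote characters may serve.
    candidates = [c for c in string.punctuation if c not in ('"', "'")]
    # min() returns the first candidate with the lowest count, so ties are
    # broken by position in string.punctuation.
    return min(candidates, key=lambda c: value.count(delimiter + c))
```

For most real text some candidate never follows the delimiter at all,
in which case no doubling of the elide character is needed on output.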

An even more general proposal would prepend a character to the string
to indicate pre-elide (as per Proposal A in a separate email) or
append a character to indicate post-elide.  I don't propose to
consider this.

Again, please indicate your views on including any of these proposals
in the CIF standard.
-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148