
Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D

Dear Colleagues,

   There are many rational alternatives to Ralf's proposal,
but that misses the point -- there is a well-established,
well-supported mechanism for string quoting in Python,
and we are simply making a confusing mess out of CIF2
by not simply adopting the Python quoting mechanism in

   I propose that we do precisely what Ralf has suggested
for the triple-quoted strings -- follow the Python rules
as written.
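
For reference, a small illustration of the Python rule referred to here: inside a triple-quoted literal, a backslash before a quote keeps that quote from counting towards the terminator, and the backslash is dropped from the value (in a raw literal it is kept). The variable names are chosen only for this illustration.

```python
# A backslash before a quote stops it from counting towards the terminator.
embedded = """three quotes: \"\"\" and a trailing quote\""""
print(embedded)   # three quotes: """ and a trailing quote"

# In a raw literal the backslash still protects the quote but stays in the value.
raw = r"""quote: \" end"""
print(raw)        # quote: \" end
```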


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 7 Jan 2011, James Hester wrote:

> Dear DDLm group members,
> Most of you will be aware that the CIF2 standard has been approved by
> COMCIFS, with one dissenting vote.  I propose to revisit the point
> raised by Ralf in his dissenting vote, in order to see if we can't
> improve this aspect of the standard.  The particular problem
> identified by Ralf, which also exists to a more limited extent in
> CIF1, is that there is no mechanism to elide instances of
> the string delimiter sequence, meaning that certain pathological
> strings cannot be included in a CIF2 file.  A further issue is that
> CIF writing programs have to run through a long series of checks when
> determining how to delimit any given string. I propose that we revisit
> this problem, with the restriction proposed by Ralf that we consider
> only triple quote/triple apostrophe delimited strings.
> To get us back up to speed on this issue, you will recall some salient
> points from previous discussions, which taken together led to our
> failure to make any progress:
> (1) CIF files are often edited in text editors.  Working with CIF text
> in a text editor should not produce unexpected behaviour for a typical
> workflow.
> (2) CIF text may include LaTeX or other marked-up text, which will be
> cumbersome to insert in the file if it contains many instances of
> elide characters (see point (1)).
> (3) IUCr "markup" for Greek letters uses backslash to introduce the
> special character combination.
> (4) Any characters that function as elides must be removed from the
> string at parse time to avoid ambiguity in interpretation when
> returned to the calling application.
> If we limit ourselves to triple quote/apostrophe delimited strings, as
> Ralf proposes, then we can construct an elide scheme that is invisible
> to the lexer, by simply breaking the trigraph appropriately.  I
> propose the following general scheme, where <delimiter> refers to one
> delimiter character, so the full string delimiter would be
> <delimiter><delimiter><delimiter>:
> Proposal C:
> When reconstructing the datavalue from an input triple-<delimiter>
> delimited string, the following simple transformation is performed:
> all occurrences of <delimiter><elide> are replaced by <delimiter>.
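
A minimal sketch of the Proposal C reading transformation in Python. The function name `unelide` and the use of `&` as the default <elide> character are assumptions made for illustration; the proposal fixes neither.

```python
def unelide(body: str, delimiter: str = '"', elide: str = "&") -> str:
    """Replace every <delimiter><elide> pair with <delimiter> (Proposal C)."""
    # str.replace scans left to right without overlapping matches, so a
    # doubled elide after a delimiter leaves exactly one elide behind.
    return body.replace(delimiter + elide, delimiter)


# The argument is the text found between the triple-quote delimiters.
print(unelide('say ""&" then stop "&'))   # say """ then stop "
print(unelide('literal "&& pair'))        # literal "& pair
```
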
> My comments on this scheme are as follows:
> (0) When preparing a string for output, any occurrences of
> <delimiter><elide> *must* be replaced by <delimiter><elide><elide>;
> <delimiter> only needs to be elided when necessary to break up triple
> <delimiter> sequences in the source string, and when the final
> character of a string is <delimiter>.
> (1) It is invisible to the lexer, which will correctly find the string
> terminator characters without knowledge of the <elide> character used.
> (2) With appropriate choice of <elide>, there is a low likelihood of
> ever encountering a string that already contains <delimiter><elide>,
> which means transforming the string is necessary only where three or
> more delimiter characters are present in a row, or the string
> concludes with a delimiter character.
> (3) The <elide> is a post-elide, by which I mean it elides the
> preceding character, not the next character.  This is preferable to
> cover the case of an input string finishing with the <delimiter>
> character, in which case some non-<delimiter> character must appear
> after it to ensure the lexer does not consider the final <delimiter>
> character in the string as the first character of the terminating
> <delimiter><delimiter><delimiter> sequence.
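
The output rules in comments (0) to (3) can be sketched as follows. This is one possible writer, not a prescribed algorithm: the names `elide_for_output` and `unelide`, the `&` default, and the policy of inserting the fewest elides that keep the body safe are all assumptions. The final lines check that a plain search for the terminator, with no knowledge of <elide>, finds the correct end of the string.

```python
def elide_for_output(value: str, delimiter: str = '"', elide: str = "&") -> str:
    """Prepare a datavalue for output between triple-<delimiter> marks."""
    out = []
    run = 0  # consecutive unelided delimiters emitted so far
    for i, ch in enumerate(value):
        out.append(ch)
        if ch != delimiter:
            run = 0
            continue
        run += 1
        nxt = value[i + 1] if i + 1 < len(value) else None
        if (
            nxt == elide                        # protect a literal <delimiter><elide>
            or nxt is None                      # value ends with <delimiter>
            or (run == 2 and nxt == delimiter)  # a third delimiter would follow
        ):
            out.append(elide)
            run = 0
    return "".join(out)


def unelide(body: str, delimiter: str = '"', elide: str = "&") -> str:
    """Reading transformation: <delimiter><elide> becomes <delimiter>."""
    return body.replace(delimiter + elide, delimiter)


value = 'a """ b "& c "'
body = elide_for_output(value)
print(body)                          # a ""&" b "&& c "&

text = '"""' + body + '"""'
end = text.index('"""', 3)           # naive terminator search
assert text[3:end] == body           # the lexer stops in the right place
assert unelide(body) == value        # and the datavalue round-trips
```
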
> Finally, consider a general proposal D:
> Elided triple-<delimiter> strings are delimited by
> <char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>.
> The initial <char> defines the character to use to post-elide the
> contents of the string as per proposal C. <char> would initially be
> any non-alphanumeric ASCII character, with the set expanded in the
> future to include Unicode characters once most applications were
> Unicode-aware.
> Examples (LHS is string as written in CIF file, RHS is actual
> datavalue inside angle brackets)
>      &""" Bleg blah blah ""&"  and so forth "&"""    < Bleg blah blah """ and so forth">
>      $'''''$' AAABBB ''$' CCCDDD '$'''               <''' AAABBB ''' CCCDDD '>
> This allows the string writer to choose the elide character to
> minimise <delimiter><elide> occurrences in the source text.  Note that
> the need to choose and prepend a character to the string minimises the
> likelihood that somebody will do a naive cut and paste.
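
A hedged sketch of how a reader and writer might handle Proposal D. The function names `parse_proposal_d` and `choose_elide`, and the candidate set `&$%#`, are hypothetical; the proposal only says that <char> is a non-alphanumeric ASCII character and that the writer may choose it to minimise <delimiter><elide> occurrences.

```python
def choose_elide(value: str, delimiter: str, candidates: str = "&$%#") -> str:
    """Pick the candidate that follows <delimiter> least often in value."""
    return min(candidates, key=lambda c: value.count(delimiter + c))


def parse_proposal_d(text: str) -> str:
    """Return the datavalue of <char><d><d><d>...<d><d><d> (Proposal D)."""
    if len(text) < 7:
        raise ValueError("too short for <char> plus two triple delimiters")
    elide, delimiter = text[0], text[1]
    triple = delimiter * 3
    if not elide.isascii() or elide.isalnum():
        raise ValueError("<char> must be a non-alphanumeric ASCII character")
    if delimiter not in "\"'" or text[1:4] != triple:
        raise ValueError("expected a triple quote or triple apostrophe")
    end = text.find(triple, 4)  # the search needs no knowledge of <char>
    if end == -1:
        raise ValueError("unterminated string")
    # Post-elide as in Proposal C: <delimiter><char> becomes <delimiter>.
    return text[4:end].replace(delimiter + elide, delimiter)


print(parse_proposal_d("$'''''$' AAABBB ''$' CCCDDD '$'''"))
# ''' AAABBB ''' CCCDDD '
print(choose_elide('cost "& tax', '"'))   # $
```
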
> An even more general proposal would prepend a character to the string
> to indicate pre-elide (as per Proposal A in a separate email) or
> append a character to indicate post-elide.  I don't propose to
> consider this.
> Again, please indicate your views on including any of these proposals
> in the CIF standard.
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group