Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D

Dear Colleagues,

   There are many rational alternatives to Ralf's proposal,
but that misses the point -- there is a well-established,
well-supported mechanism for string quoting in Python,
and we are simply making a confusing mess out of CIF2
by not simply adopting the Python quoting mechanism in
toto.

   I propose that we do precisely what Ralf has suggested
for the triple-quoted strings -- follow the Python rules
as written.
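
For reference, a minimal sketch of the Python rules referred to above: inside a triple-quoted string, a backslash in front of a quote character keeps it from counting towards the closing delimiter, and the backslash is removed from the resulting value. The variable names are illustrative only.

```python
# Three consecutive quote characters inside a triple-quoted string:
# backslashes stop them from being read as the terminator.
inner_triple = """He wrote \"\"\"hello\"\"\" and left"""

# A value that ends with the delimiter character: the final quote is escaped
# so that it is not taken as part of the closing triple quote.
ends_in_quote = """ends with a quote\""""

# The same rules apply to triple-apostrophe strings.
inner_apostrophes = '''three \'\'\' apostrophes'''

print(inner_triple)
print(ends_in_quote)
print(inner_apostrophes)
```

Note that under these rules the backslash is itself special, so sequences such as \n are also interpreted inside the string.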

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 7 Jan 2011, James Hester wrote:

> Dear DDLm group members,
>
> Most of you will be aware that the CIF2 standard has been approved by
> COMCIFS, with one dissenting vote.  I propose to revisit the point
> raised by Ralf in his dissenting vote, in order to see if we can't
> improve this aspect of the standard.  The particular problem
> identified by Ralf, and this problem exists to a more limited extent
> with CIF1 as well, is that there is no mechanism to elide instances of
> the string delimiter sequence, meaning that certain pathological
> strings cannot be included in a CIF2 file.  A further issue is that
> CIF writing programs have to run through a long series of checks when
> determining how to delimit any given string. I propose that we revisit
> this problem, with the restriction proposed by Ralf that we consider
> only triple quote/triple apostrophe delimited strings.
>
> To get us back up to speed on this issue, you will recall some salient
> points from previous discussions, which taken together led to our
> failure to make any progress:
>
> (1) CIF files are often edited in text editors.  Working with CIF text
> in a text editor should not produce unexpected behaviour for a typical
> workflow.
> (2) CIF text may include LaTeX or other marked-up text, which will be
> cumbersome to insert in the file if it contains many instances of
> elide characters (see point (1))
> (3) IUCr "markup" for Greek letters uses backslash to introduce the
> special character combination
> (4) Any characters that function as elides must be removed from the
> string at parse time to avoid ambiguity in interpretation when
> returned to the calling application
>
> If we limit ourselves to triple quote/apostrophe delimited strings, as
> Ralf proposes, then we can construct an elide scheme that is invisible
> to the lexer, by simply breaking the trigraph appropriately.  I
> propose the following general scheme, where <delimiter> refers to one
> delimiter character, so the full string delimiter would be
> <delimiter><delimiter><delimiter>:
>
> Proposal C:
>
> When reconstructing the datavalue from an input triple-<delimiter>
> delimited string, the following simple transformation is performed:
> all occurrences of <delimiter><elide> are replaced by <delimiter>.
>
> My comments on this scheme are as follows:
> (0) When preparing a string for output, any occurrences of
> <delimiter><elide> *must* be replaced by <delimiter><elide><elide>;
> <delimiter> only needs to be elided when necessary to break up triple
> <delimiter> sequences in the source string, and when the final
> character of a string is <delimiter>
> (1) It is invisible to the lexer, which will correctly find the string
> terminator characters without knowledge of the <elide> character used.
> (2) With appropriate choice of <elide>, there is a low likelihood of
> ever encountering a string where transformation needs to be performed,
> which means transforming the string is necessary only where three or
> more delimiter characters are present in a row, or the string
> concludes with a delimiter character.
> (3) The <elide> is a post-elide, by which I mean it elides the
> preceding character, not the next character.  This is preferable to
> cover the case of an input string finishing with the <delimiter>
> character, in which case some non-<delimiter> character must appear
> after it to ensure the lexer does not consider the final <delimiter>
> character in the string as the first character of the terminating
> <delimiter><delimiter><delimiter> sequence.
>
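Proposal C and comments (0)-(3) above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names decode_c and encode_c are hypothetical, the default elide character "&" is borrowed from the Proposal D examples, and the elide character is assumed to differ from the delimiter.

```python
def decode_c(body, delim='"', elide="&"):
    """Recover the datavalue: every <delimiter><elide> becomes <delimiter>."""
    return body.replace(delim + elide, delim)


def encode_c(value, delim='"', elide="&"):
    """Prepare a datavalue for output between triple-<delimiter> markers."""
    out = []
    run = 0  # delimiters emitted since the last non-delimiter character
    for ch in value:
        if ch == delim:
            if run == 2:
                # A third delimiter in a row would look like a terminator,
                # so break the run with an elide character first.
                out.append(elide)
                run = 0
            out.append(ch)
            run += 1
        else:
            if ch == elide and run > 0:
                # A literal <delimiter><elide> pair is written as
                # <delimiter><elide><elide> so that decoding restores it.
                out.append(elide)
            out.append(ch)
            run = 0
    if run > 0:
        # The value ends with a delimiter: append an elide so the lexer does
        # not read it as the start of the closing triple delimiter.
        out.append(elide)
    return "".join(out)


if __name__ == "__main__":
    original = 'Bleg blah blah """ and so forth"'
    written = encode_c(original)
    print(written)
    print(decode_c(written) == original)
```

The encoded text never contains three delimiters in a row and never ends with a delimiter, so a lexer can locate the closing triple delimiter without knowing which elide character is in use.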
> Finally, consider a general proposal D:
>
> Elided triple-<delimiter> strings are delimited by
> <char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>.
> The initial <char> defines the character to use to post-elide the
> contents of the string as per proposal C. <char> would initially be
> any non-alphanumeric ASCII character, with the set expanded in the
> future to include Unicode characters once most applications were
> Unicode-aware.
>
> Examples (LHS is string as written in CIF file, RHS is actual
> datavalue inside angle brackets)
>
>      &""" Bleg blah blah ""&"  and so forth "&"""
>          <Bleg blah blah """ and so forth">
>      $'''''$' AAABBB ''$' CCCDDD '$'''
>          <''' AAABBB ''' CCCDDD '>
>
> This allows the string writer to choose the elide character to
> minimise <delimiter><elide> occurrences in the source text.  Note that
> the need to choose and prepend a character to the string minimizes the
> likelihood that somebody will do a naive cut and paste.
>
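A rough Python sketch of how a Proposal D string might be read, and how a writer might pick the elide character, is given below. The names read_proposal_d and choose_elide are hypothetical, and restricting the candidates to string.punctuation minus the delimiter is an assumption standing in for "any non-alphanumeric ASCII character".

```python
import string


def read_proposal_d(token):
    """Decode <char><delim><delim><delim>...<delim><delim><delim>."""
    if len(token) < 7:
        raise ValueError("too short to be an elided triple-delimited string")
    elide, delim = token[0], token[1]
    marker = delim * 3
    if delim not in "\"'" or elide.isalnum() or not token.startswith(marker, 1):
        raise ValueError("not an elided triple-delimited string")
    # The lexer needs no knowledge of the elide character: the body simply
    # ends at the first run of three delimiters after the opening marker.
    end = token.find(marker, 4)
    if end == -1:
        raise ValueError("missing closing delimiter")
    body = token[4:end]
    # Post-elide as in Proposal C: <delimiter><elide> becomes <delimiter>.
    return body.replace(delim + elide, delim)


def choose_elide(value, delim):
    """Pick the candidate that follows a delimiter least often in value."""
    # Assumption: candidates are ASCII punctuation other than the delimiter.
    candidates = [c for c in string.punctuation if c != delim]
    return min(candidates, key=lambda c: value.count(delim + c))


if __name__ == "__main__":
    print(read_proposal_d("$'''''$' AAABBB ''$' CCCDDD '$'''"))
    print(choose_elide('x"!y', '"'))
```

Choosing the character with the fewest <delimiter><elide> occurrences keeps the number of doubled elide characters in the written string as small as possible.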
> An even more general proposal would prepend a character to the string
> to indicate pre-elide (as per Proposal A in a separate email) or
> append a character to indicate post-elide.  I don't propose to
> consider this.
>
> Again, please indicate your views on including any of these proposals
> in the CIF standard.
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>