Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Fri, 7 Jan 2011 07:41:20 -0500 (EST)
- In-Reply-To: <AANLkTimWpd1kMZDGcTprEhcJw+uQE4_JtgJ4SbtMPVXt@mail.gmail.com>
- References: <AANLkTimWpd1kMZDGcTprEhcJw+uQE4_JtgJ4SbtMPVXt@mail.gmail.com>
Dear Colleagues,

There are many rational alternatives to Ralf's proposal, but that
misses the point -- there is a well-established, well-supported
mechanism for string quoting in Python, and we are simply making a
confusing mess out of CIF2 by not adopting the Python quoting
mechanism in toto. I propose that we do precisely what Ralf has
suggested for the triple-quoted strings -- follow the Python rules
as written.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 7 Jan 2011, James Hester wrote:

> Dear DDLm group members,
>
> Most of you will be aware that the CIF2 standard has been approved by
> COMCIFS, with one dissenting vote. I propose to revisit the point
> raised by Ralf in his dissenting vote, in order to see if we can't
> improve this aspect of the standard. The particular problem
> identified by Ralf, and this problem exists to a more limited extent
> with CIF1 as well, is that there is no mechanism to elide instances of
> the string delimiter sequence, meaning that certain pathological
> strings cannot be included in a CIF2 file. A further issue is that
> CIF writing programs have to run through a long series of checks when
> determining how to delimit any given string. I propose that we revisit
> this problem, with the restriction proposed by Ralf that we consider
> only triple quote/triple apostrophe delimited strings.
>
> To get us back up to speed on this issue, you will recall some salient
> points from previous discussions, which taken together led to our
> failure to make any progress:
>
> (1) CIF files are often edited in text editors. Working with CIF text
> in a text editor should not produce unexpected behaviour for a typical
> workflow.
> (2) CIF text may include LaTeX or other marked-up text, which will be
> cumbersome to insert in the file if it contains many instances of
> elide characters (see point (1))
> (3) IUCr "markup" for Greek letters uses backslash to introduce the
> special character combination
> (4) Any characters that function as elides must be removed from the
> string at parse time to avoid ambiguity in interpretation when
> returned to the calling application
>
> If we limit ourselves to triple quote/apostrophe delimited strings, as
> Ralf proposes, then we can construct an elide scheme that is invisible
> to the lexer, by simply breaking the trigraph appropriately. I
> propose the following general scheme, where <delimiter> refers to one
> delimiter character, so the full string delimiter would be
> <delimiter><delimiter><delimiter>:
>
> Proposal C:
>
> When reconstructing the datavalue from an input triple-<delimiter>
> delimited string, the following simple transformation is performed:
> all occurrences of <delimiter><elide> are replaced by <delimiter>.
>
> My comments on this scheme are as follows:
> (0) When preparing a string for output, any occurrences of
> <delimiter><elide> *must* be replaced by <delimiter><elide><elide>;
> <delimiter> only needs to be elided when necessary to break up triple
> <delimiter> sequences in the source string, and when the final
> character of a string is <delimiter>
> (1) It is invisible to the lexer, which will correctly find the string
> terminator characters without knowledge of the <elide> character used.
> (2) With appropriate choice of <elide>, there is a low likelihood of
> ever encountering a string where transformation needs to be performed,
> which means transforming the string is necessary only where three or
> more delimiter characters are present in a row, or the string
> concludes with a delimiter character.
> (3) The <elide> is a post-elide, by which I mean it elides the
> preceding character, not the next character.
> This is preferable to
> cover the case of an input string finishing with the <delimiter>
> character, in which case some non-<delimiter> character must appear
> after it to ensure the lexer does not consider the final <delimiter>
> character in the string as the first character of the terminating
> <delimiter><delimiter><delimiter> sequence.
>
> Finally, consider a general proposal D:
>
> Elided triple-<delimiter> strings are delimited by
> <char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>.
> The initial <char> defines the character to use to post-elide the
> contents of the string as per proposal C. <char> would initially be
> any non-alphanumeric ASCII character, with the set expanded in the
> future to include Unicode characters once most applications were
> Unicode-aware.
>
> Examples (LHS is string as written in CIF file, RHS is actual
> datavalue inside angle brackets)
>
> &""" Bleg blah blah ""&" and so forth "&"""
>     < Bleg blah blah """ and so forth">
> $'''''$' AAABBB ''$' CCCDDD '$'''
>     <''' AAABBB ''' CCCDDD '>
>
> This allows the string writer to choose the elide character to
> minimise <delimiter><elide> occurrences in the source text. Note that
> the need to choose and prepend a character to the string minimizes the
> likelihood that somebody will do a naive cut and paste.
>
> An even more general proposal would prepend a character to the string
> to indicate pre-elide (as per Proposal A in a separate email) or
> append a character to indicate post-elide. I don't propose to
> consider this.
>
> Again, please indicate your views on including any of these proposals
> in the CIF standard.
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
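
For readers who want to try the quoted Proposals C and D, the rules can
be sketched in a few lines of Python. This is only an illustration
under stated assumptions, not part of the CIF2 standard or of any
existing CIF library. The function names (decode_elided, encode_elided,
parse_proposal_d) are hypothetical. The encoder follows comment (0) by
eliding a delimiter only when it is followed by a literal elide
character, when a third delimiter in a row would otherwise follow, or
when it ends the string. The Proposal D parser assumes it is handed the
whole token with nothing before or after it.

```python
def decode_elided(content: str, delim: str, elide: str) -> str:
    """Proposal C: replace every <delimiter><elide> pair by <delimiter>."""
    return content.replace(delim + elide, delim)


def encode_elided(value: str, delim: str, elide: str) -> str:
    """Prepare a datavalue for output between triple delimiters.

    Hypothetical helper: inserts <elide> after a delimiter only where
    comment (0) of Proposal C says it is needed.
    """
    if elide == delim:
        raise ValueError("elide character must differ from the delimiter")
    out = []
    run = 0  # bare delimiters in a row at the end of the output so far
    for i, ch in enumerate(value):
        out.append(ch)
        if ch != delim:
            run = 0
            continue
        run += 1
        nxt = value[i + 1] if i + 1 < len(value) else None
        needs_elide = (
            nxt == elide                      # protect a literal <delim><elide>
            or nxt is None                    # value ends with a delimiter
            or (run == 2 and nxt == delim)    # a third delimiter would follow
        )
        if needs_elide:
            out.append(elide)
            run = 0
    return "".join(out)


def parse_proposal_d(token: str) -> str:
    """Proposal D: <char><d><d><d> ... <d><d><d>, with <char> as post-elide."""
    if len(token) < 7:
        raise ValueError("token too short for Proposal D delimiters")
    elide, delim = token[0], token[1]
    fence = delim * 3
    if delim not in "\"'" or token[1:4] != fence:
        raise ValueError("expected <char> followed by a triple delimiter")
    if elide.isalnum() or elide == delim:
        raise ValueError("elide must be non-alphanumeric and not the delimiter")
    # The lexer needs no knowledge of the elide character: it simply
    # looks for the first triple delimiter after the opening one.
    end = token.find(fence, 4)
    if end == -1 or end + 3 != len(token):
        raise ValueError("closing triple delimiter not found at end of token")
    return decode_elided(token[4:end], delim, elide)


if __name__ == "__main__":
    # Second example from the quoted message.
    print(parse_proposal_d("$'''''$' AAABBB ''$' CCCDDD '$'''"))
    # A value that ends with the delimiter, written with '&' as elide.
    print('&"""' + encode_elided('say "hi"', '"', '&') + '"""')
```

Applied literally, the same parser keeps the space before the final
quote in the first example of the quoted message, so the datavalue it
returns there ends in a space followed by a double quote.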