[ddlm-group] Eliding in triple-quoted strings: Proposals C and D
- To: ddlm-group <ddlm-group@iucr.org>
- Subject: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 7 Jan 2011 15:46:10 +1100
Dear DDLm group members,

Most of you will be aware that the CIF2 standard has been approved by COMCIFS, with one dissenting vote. I propose to revisit the point raised by Ralf in his dissenting vote, in order to see if we can't improve this aspect of the standard.

The particular problem identified by Ralf, and this problem exists to a more limited extent with CIF1 as well, is that there is no mechanism to elide instances of the string delimiter sequence, meaning that certain pathological strings cannot be included in a CIF2 file. A further issue is that CIF writing programs have to run through a long series of checks when determining how to delimit any given string.

I propose that we revisit this problem, with the restriction proposed by Ralf that we consider only triple quote/triple apostrophe delimited strings. To get us back up to speed on this issue, you will recall some salient points from previous discussions, which taken together led to our failure to make any progress:

(1) CIF files are often edited in text editors. Working with CIF text in a text editor should not produce unexpected behaviour for a typical workflow.

(2) CIF text may include LaTeX or other marked-up text, which will be cumbersome to insert in the file if it contains many instances of elide characters (see point (1)).

(3) IUCr "markup" for Greek letters uses backslash to introduce the special character combination.

(4) Any characters that function as elides must be removed from the string at parse time to avoid ambiguity in interpretation when returned to the calling application.

If we limit ourselves to triple quote/apostrophe delimited strings, as Ralf proposes, then we can construct an elide scheme that is invisible to the lexer, by simply breaking the trigraph appropriately.

I propose the following general scheme, where <delimiter> refers to one delimiter character, so the full string delimiter would be <delimiter><delimiter><delimiter>:

Proposal C: When reconstructing the datavalue from an input triple-<delimiter> delimited string, the following simple transformation is performed: all occurrences of <delimiter><elide> are replaced by <delimiter>.

My comments on this scheme are as follows:

(0) When preparing a string for output, any occurrences of <delimiter><elide> *must* be replaced by <delimiter><elide><elide>; <delimiter> only needs to be elided when necessary to break up triple <delimiter> sequences in the source string, and when the final character of a string is <delimiter>.

(1) It is invisible to the lexer, which will correctly find the string terminator characters without knowledge of the <elide> character used.

(2) With appropriate choice of <elide>, there is a low likelihood of ever encountering a string where transformation needs to be performed, which means transforming the string is necessary only where three or more delimiter characters are present in a row, or the string concludes with a delimiter character.

(3) The <elide> is a post-elide, by which I mean it elides the preceding character, not the next character. This is preferable to cover the case of an input string finishing with the <delimiter> character, in which case some non-<delimiter> character must appear after it to ensure the lexer does not consider the final <delimiter> character in the string as the first character of the terminating <delimiter><delimiter><delimiter> sequence.

Finally, consider a general proposal D: Elided triple-<delimiter> strings are delimited by <char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>. The initial <char> defines the character to use to post-elide the contents of the string as per proposal C.
<char> would initially be any non-alphanumeric ASCII character, with the set expanded in the future to include Unicode characters once most applications were Unicode-aware.

Examples (LHS is the string as written in the CIF file; RHS is the actual datavalue, shown inside angle brackets):

&""" Bleg blah blah ""&" and so forth "&"""    < Bleg blah blah """ and so forth ">

$'''''$' AAABBB ''$' CCCDDD '$'''    <''' AAABBB ''' CCCDDD '>

This allows the string writer to choose the elide character to minimise <delimiter><elide> occurrences in the source text. Note that the need to choose and prepend a character to the string minimises the likelihood that somebody will do a naive cut and paste.

An even more general proposal would prepend a character to the string to indicate pre-elide (as per Proposal A in a separate email) or append a character to indicate post-elide. I don't propose to consider this.

Again, please indicate your views on including any of these proposals in the CIF standard.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
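
As a rough illustration of Proposals C and D, the read and write transformations might be sketched in Python as follows. This is not part of any CIF specification or existing library: the function names (decode_elided, encode_elided, write_proposal_d, read_proposal_d) are hypothetical, and the writer's strategy of eliding only the second delimiter of a run, a delimiter followed by the elide character, and a trailing delimiter is one possible reading of comment (0) on Proposal C.

```python
def decode_elided(content: str, delimiter: str, elide: str) -> str:
    """Proposal C, read side: each <delimiter><elide> pair becomes <delimiter>."""
    return content.replace(delimiter + elide, delimiter)


def encode_elided(value: str, delimiter: str, elide: str) -> str:
    """Proposal C, write side (assumed strategy): insert <elide> only where needed."""
    if len(delimiter) != 1 or len(elide) != 1 or delimiter == elide:
        raise ValueError("delimiter and elide must be distinct single characters")
    out = []
    run = 0  # consecutive un-elided delimiters emitted so far
    for i, ch in enumerate(value):
        out.append(ch)
        if ch != delimiter:
            run = 0
            continue
        run += 1
        nxt = value[i + 1] if i + 1 < len(value) else None
        if (
            nxt is None                          # value ends with a delimiter
            or nxt == elide                      # literal <delimiter><elide> in the value
            or (run == 2 and nxt == delimiter)   # a third delimiter would follow
        ):
            out.append(elide)
            run = 0
    return "".join(out)


def write_proposal_d(value: str, delimiter: str = '"', elide: str = "&") -> str:
    """Proposal D: <char> followed by the triple-delimited, post-elided content."""
    if delimiter not in ('"', "'"):
        raise ValueError("delimiter must be a quote or an apostrophe")
    if len(elide) != 1 or not elide.isascii() or elide.isalnum():
        raise ValueError("elide must be one non-alphanumeric ASCII character")
    fence = delimiter * 3
    return elide + fence + encode_elided(value, delimiter, elide) + fence


def read_proposal_d(token: str) -> str:
    """Recover the datavalue from a complete Proposal D token."""
    if len(token) < 7:
        raise ValueError("token too short")
    elide, delimiter = token[0], token[1]
    fence = delimiter * 3
    if token[1:4] != fence or not token.endswith(fence):
        raise ValueError("token is not <char> plus a triple-delimited string")
    return decode_elided(token[4:-3], delimiter, elide)
```

Under these assumptions, read_proposal_d applied to the two example strings above returns the datavalues shown in angle brackets, and write_proposal_d applied to those datavalues (with & and $ respectively as the elide character) reproduces the strings as written.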