
Re: [ddlm-group] Alternative proposal for eliding

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Alternative proposal for eliding
  • From: Saulius Grazulis <grazulis@ibt.lt>
  • Date: Tue, 28 Jun 2011 17:30:57 +0300
  • Organization: Biotechnologijos institutas
Dear DDLm group members,
I would like to comment on some concerns regarding backwards compatibility of my proposal, the "prefixed <eol><semicolon> text fields". I think that in most cases the problems can be easily circumvented:
On Wed, Jun 8, 2011 at 6:17 AM, David Brown <idbrown@mcmaster.ca> wrote:
> The only place I see a possible problem is with a heritage CIF with the
> following sequence
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c were
> determined from powder patterns
> ;
>
> /.../
> A CIF reader would expect to find:
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c
>  a,b,c,were determined from powder patterns
> ;
>
> and strip off the a,b,c, /.../
No, that's actually NOT the way I supposed things would work. Under my proposal, the above sequence would not be interpreted as a prefix, since the final backslash is not followed by a newline (or by white space and a newline). Thus, the pattern would be interpreted literally, as it is done now, and no problem would occur with such legacy archived files.
To make "a,b,c" a prefix, one should write:
_publ_section_experimental
;a,b,c,\
a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
Which is different from above and should be equivalent, after prefix removal, to '\a,\b,\c were determined from powder patterns' in an unquoted string.
Note that the 'a,b,c,' string *may* be at the beginning of a line, even if it is a prefix:
_publ_section_experimental
;a,b,c,\
a,b,c,a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
would fold to 'a,b,c,\a,\b,\c were determined from powder patterns' single-quoted string after changing newlines to spaces.
Actually, the Perl RE in my previous proposal was not accurate; a more appropriate way to determine the prefix with Perl REs would be:
if( $text =~ /^([^\\]+)\\(\s+)?\n/ ) { # a text without backslashes,
                                       # then a backslash,
                                       # then maybe blank, then newline.
    my $prefix = $1;
    # \Q...\E quotes regexp metacharacters that may occur in the prefix.
    $text =~ s/^\Q$prefix\E\\(\s+)?\n//; # remove the prefix declaration line
    $text =~ s/^\Q$prefix\E//mg;         # remove the prefix from every line
}
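For readers who prefer Python, the prefix rule described above can be sketched roughly as follows. This is only an illustration of the idea, not a reference implementation: the names `PREFIX_DECL` and `strip_prefix` are made up for this sketch, `text` is assumed to be the content of the text field after the opening ';', and, unlike the Perl RE, the pattern here also refuses to let the prefix span a newline.

```python
import re

# Prefix declaration on the first line of the text field: some text
# without backslashes, then a backslash, then maybe blanks, then a newline.
# Assumption of this sketch: the prefix may not contain a newline either.
PREFIX_DECL = re.compile(r"^([^\\\n]+)\\[ \t]*\n")


def strip_prefix(text):
    """Return text with a declared prefix removed, or text unchanged."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text              # no declaration: interpret literally
    prefix = m.group(1)
    body = text[m.end():]        # drop the declaration line itself
    # Remove the prefix from the beginning of every remaining line;
    # re.escape protects regexp metacharacters in the prefix.
    return re.sub("^" + re.escape(prefix), "", body, flags=re.MULTILINE)


# The legacy value quoted above is left alone ...
legacy = " a,b,c,\\a,\\b,\\c were\ndetermined from powder patterns"
print(strip_prefix(legacy) == legacy)

# ... while a declared prefix is stripped from each line.
prefixed = ("a,b,c,\\\n"
            "a,b,c,\\a,\\b,\\c\n"
            "a,b,c,were determined from powder patterns")
print(strip_prefix(prefixed))
```

Folding the remaining newline into a space, as discussed for the examples above, would be a separate step.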
> I agree that misreading of a legacy file without incurring a parsing error> is practically impossible.
The only situation when the legacy files would be misinterpreted would be when they contain a *nonempty* text and a *trailing* backslash as the first line of the ';'-delimited text. Arguably, such files are rare, perhaps practically non-existent. For example, there are only two such files in the COD CIF collection out of 140k+ (which encompasses nearly all files from the IUCr journals and quite a few by other publishers):
saulius@tasmanijos-velnias cif/ > find ? -iname '*.cif' | xargs perl -ne 'print $ARGV, "\t", $_ if /^;([^\\]+)\\(\s+)?\n/'
2/2213918.cif	;{4,4'-Dibromo-2,2'-[1,2-phenylenebis(nitrilomethylidene)]diphenolato-\
2/2224012.cif	; \
and both are probably mis-represented folded long lines which should be corrected anyway; see the full files:
(Originals are at:
and they have the same syntax).
I can run the same check on the PDB mmCIF collection if needed.
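In case somebody wants to repeat the check on another collection without Perl, the one-liner above could be approximated in Python along these lines. This is a rough sketch: `suspect_lines` and `scan_tree` are names invented here, the directory argument is whatever tree holds the CIF files, and reading the files as Latin-1 is an assumption made only so that odd bytes never abort the scan.

```python
import re
from pathlib import Path

# Same pattern as in the perl one-liner: a line starting with ';',
# then text without backslashes, a backslash, maybe blanks, a newline.
SUSPECT = re.compile(r"^;([^\\]+)\\(\s+)?\n")


def suspect_lines(lines):
    """Yield the lines that would look like a prefix declaration."""
    for line in lines:
        if SUSPECT.match(line):
            yield line


def scan_tree(root):
    """Print 'file<TAB>line' for each suspect line in *.cif files under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".cif":
            # Assumption: Latin-1 decoding, so no byte sequence fails.
            with open(path, encoding="latin-1") as f:
                for line in suspect_lines(f):
                    print(path, line, sep="\t", end="")
```

Calling, for instance, `scan_tree("2")` from the top of the collection would list the suspect lines found under the directory `2`, in the same tab-separated form as the output shown above.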
Even if such files are encountered, in most cases this will not cause much harm -- a parser will not be able to strip away prefixes and will leave the rest of the value as is. This could (should?) trigger a warning.
> We should, however, make it possible in CIF2 to present multiline values
> containing a backslash before the first <eol> without risking a parsing
> error on read when this <backslash> is misunderstood as a prefix flag.
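One possible reading of this fallback, sketched in Python: if the first line looks like a prefix declaration but the following lines do not all carry the prefix, the value is returned untouched and a warning is issued. Keeping the whole value (including its first line) is a policy chosen for this sketch only; `read_text_field` and `PREFIX_DECL` are hypothetical names, and the prefix is again assumed not to span a newline.

```python
import re
import warnings

# First line: text without backslashes, a backslash, maybe blanks, newline.
PREFIX_DECL = re.compile(r"^([^\\\n]+)\\[ \t]*\n")


def read_text_field(text):
    """Strip a declared prefix; warn and keep the value if it does not fit."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text                      # ordinary text field
    prefix = m.group(1)
    lines = text[m.end():].split("\n")
    if not all(line.startswith(prefix) for line in lines):
        # Probably a legacy file with a trailing backslash on line one.
        warnings.warn("first line looks like a prefix declaration, "
                      "but not every line starts with the prefix; "
                      "value kept as is")
        return text
    return "\n".join(line[len(prefix):] for line in lines)
```

A stricter parser could instead treat the mismatch as an error; the warning-only variant is shown because it keeps legacy values readable.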
I think discarding the new line of the first ';' line is not necessary in case the line is not a prefix. The suggested prefix declarations are unique enough to be recognized without this rule.
# From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
# Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT):
> This would certainly be a worthy suggestion to consider in a
> CIF1 context.
Sure, the prefixed ';'-texts can be used in CIF1 as well, being mostly backwards compatible, and compatible with the CIF line folding rule.
> For CIF2, my own preference would be to solve this problem by adopting
> the full Python syntax and semantics for treble-quoted strings
My understanding is that, unless escape sequences like those in C or in Python or Perl are mandated in CIF strings ("The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character"[1]), the triple-quoted syntax does not solve the cif-in-cif problem -- as I have read in the recent CIF2 draft[2], 'Clearly, the string within cannot contain an ASCII """'. Thus again we will have non-representable values in CIF -- the ones that contain triple single quotes followed by a space, triple double quotes followed by a space, and a semicolon at the beginning of a line.
[1] http://docs.python.org/reference/lexical_analysis.html
[2] http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf
We do not need to go far to find such values -- the text of the cif2_syntax_changes_jrh20100705.pdf draft itself *is* an example of a non-representable value :). The prefixes could easily save the situation without adding much extra work for parsers.
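To illustrate how a writer could use prefixes to embed such a value, here is a small Python sketch of an encoder and a matching decoder. All names (`encode_text_field`, `decode_text_field`, the default prefix "> ") are hypothetical choices for this example, and the field is modelled simply as the string from the opening ';' to the closing ';' line.

```python
import re

# First line: text without backslashes, a backslash, maybe blanks, newline.
PREFIX_DECL = re.compile(r"^([^\\\n]+)\\[ \t]*\n")


def encode_text_field(value, prefix="> "):
    """Wrap value in a ';'-delimited field, prefixing every line."""
    if (not prefix or "\\" in prefix or "\n" in prefix
            or prefix.startswith(";")):
        raise ValueError("unsuitable prefix")
    lines = [prefix + "\\"]                       # prefix declaration
    lines += [prefix + line for line in value.split("\n")]
    return ";" + "\n".join(lines) + "\n;"


def decode_text_field(field):
    """Inverse of encode_text_field for fields produced by it."""
    text = field[1:-2]            # drop the opening ';' and closing '\n;'
    m = PREFIX_DECL.match(text)
    if m is None:
        return text
    prefix = m.group(1)
    return "\n".join(line[len(prefix):] if line.startswith(prefix) else line
                     for line in text[m.end():].split("\n"))


# A value holding an embedded text field and triple quotes.
value = 'data_inner\n_x\n;inner text\n;\n_y """triple""" \n'
field = encode_text_field(value)
print(field)
print(decode_text_field(field) == value)
```

Since every line inside the field starts with the prefix, no embedded line can begin with a semicolon, so the field cannot be terminated prematurely; embedded triple quotes need no special treatment at all.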
-- 
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
