Re: [ddlm-group] Alternative proposal for eliding

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Alternative proposal for eliding
  • From: Saulius Grazulis <grazulis@ibt.lt>
  • Date: Tue, 28 Jun 2011 17:30:57 +0300
  • Organization: Biotechnologijos institutas
Dear DDLm group members,
I would like to comment on some concerns regarding backwards compatibility of my proposal, the "prefixed <eol><semicolon> text fields". I think that in most cases the problems can be easily circumvented:
On Wed, Jun 8, 2011 at 6:17 AM, David Brown <idbrown@mcmaster.ca> wrote:
> The only place I see a possible problem is with a heritage CIF with the
> following sequence
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c were
> determined from powder patterns
> ;
>
> /.../
> A CIF reader would expect to find:
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c
>  a,b,c,were determined from powder patterns
> ;
>
> and strip off the a,b,c, /.../
No, that's actually NOT the way I intended things to work. Under my proposal, the above sequence would not be interpreted as a prefix, since the final backslash is not followed by a newline (or by white space and a newline). Thus, the pattern would be interpreted literally, as it is done now, and no problem would occur with such legacy archived files.
To make "a,b,c" a prefix, one should write:
_publ_section_experimental
;a,b,c,\
a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
Which is different from above and should be equivalent, after prefix removal, to '\a,\b,\c were determined from powder patterns' in an unquoted string.
Note that the 'a,b,c,' string *may* be at the beginning of a line, even if it is a prefix:
_publ_section_experimental
;a,b,c,\
a,b,c,a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
would fold to the single-quoted string 'a,b,c,\a,\b,\c were determined from powder patterns' after changing newlines to spaces.
Actually, the Perl RE in my previous proposal was not accurate; a more appropriate way to determine the prefix with Perl REs would be:
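if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) { # a first line without backslashes,
                                         # then a backslash,
                                         # then maybe blanks, then newline.
    my $prefix = $1;
    # \Q...\E keeps regex metacharacters in the prefix literal
    $text =~ s/^\Q$prefix\E\\[ \t]*\n//; # drop the prefix declaration line
    $text =~ s/^\Q$prefix\E//mg;         # strip the prefix from every line
}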
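For readers who want to try the rule on the examples in this message, here is a minimal Python sketch of the same idea. It is a rendering of the Perl logic above, not a reference implementation; the name strip_prefix is hypothetical, and it assumes that the prefix declaration is confined to the first line of the field and that "blank" means spaces or tabs only.

```python
import re

# Assumed reading of the rule: the first line holds text without backslashes,
# then one backslash, then optional spaces or tabs, then the newline.
PREFIX_RE = re.compile(r"^([^\\\n]+)\\[ \t]*\n")


def strip_prefix(text):
    """Return text with a declared prefix removed; return it unchanged otherwise."""
    match = PREFIX_RE.match(text)
    if match is None:
        return text  # no prefix declaration: interpret the value literally
    prefix = match.group(1)
    body = text[match.end():]  # drop the declaration line itself
    # Remove the prefix wherever it starts a line; re.escape keeps it literal.
    return re.sub("^" + re.escape(prefix), "", body, flags=re.MULTILINE)


# Field contents as they appear after the opening ';' (examples from this message).
legacy = " a,b,c,\\a,\\b,\\c were\ndetermined from powder patterns\n"
prefixed = "a,b,c,\\\na,b,c,\\a,\\b,\\c\na,b,c,were determined from powder patterns\n"
doubled = "a,b,c,\\\na,b,c,a,b,c,\\a,\\b,\\c\na,b,c,were determined from powder patterns\n"

for value in (legacy, prefixed, doubled):
    print(repr(strip_prefix(value)))
```

The legacy value comes back unchanged because none of the backslashes on its first line is followed by a newline, while the other two values lose the declaration line and the leading 'a,b,c,' on each remaining line.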
> I agree that misreading of a legacy file without incurring a parsing error
> is practically impossible.
The only situation in which legacy files would be misinterpreted is when they contain a *nonempty* text and a *trailing* backslash as the first line of the ';'-delimited text. Arguably, such files are rare, if not practically non-existent. For example, there are only two such files in the COD CIF collection out of 140k+ (which encompasses nearly all files from the IUCr journals and quite a few from other publishers):
saulius@tasmanijos-velnias cif/ > find ? -iname '*.cif' | xargs perl -ne 'print $ARGV, "\t", $_ if /^;([^\\]+)\\(\s+)?\n/'
2/2213918.cif	;{4,4'-Dibromo-2,2'-[1,2-phenylenebis(nitrilomethylidene)]diphenolato-\
2/2224012.cif	; \
and both are probably mis-represented folded long lines which should be corrected anyway; see the full files:
http://www.crystallography.net/2213918.cif
http://www.crystallography.net/2224012.cif
(Originals are at:
http://scripts.iucr.org/cgi-bin/sendcif?ng2268sup1
http://scripts.iucr.org/cgi-bin/sendcif?sj2654sup1
and they have the same syntax).
I can run the same check on the PDB mmCIF collection if needed.
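For collections where the find/xargs pipeline is inconvenient, the same check can be sketched in Python. This is only an illustration: the function name find_suspect_lines is hypothetical, the regular expression is copied from the one-liner above, and reading files as UTF-8 with replacement of undecodable bytes is an assumption about the input.

```python
import re
from pathlib import Path

# Same line-based pattern as in the Perl one-liner above.
LEGACY_RE = re.compile(r"^;([^\\]+)\\(\s+)?\n")


def find_suspect_lines(root):
    """Yield (path, line) for lines that open a text field with 'text\\<newline>'."""
    for path in sorted(Path(root).rglob("*")):
        # Mimic -iname '*.cif': case-insensitive match on the extension.
        if not path.is_file() or path.suffix.lower() != ".cif":
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if LEGACY_RE.match(line):
                    yield path, line.rstrip("\n")
```

A call such as list(find_suspect_lines("cif")) would then list the candidate files together with the offending first line of each text field.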
Even if such files are encountered, in most cases little harm will be done: a parser will find no prefixed lines to strip and will leave the rest of the value as is. This could (should?) trigger a warning.
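One possible shape for such a warning is sketched below in Python. It goes beyond what the proposal specifies: the function name strip_prefix_or_warn is hypothetical, and the sketch assumes that a genuine prefixed field carries the prefix on every line after the declaration, so that any line without it marks the value as a legacy one to be left untouched.

```python
import re
import warnings

# Assumed declaration syntax: first line without backslashes, one backslash,
# optional spaces or tabs, newline.
PREFIX_RE = re.compile(r"^([^\\\n]+)\\[ \t]*\n")


def strip_prefix_or_warn(text):
    """Strip a declared prefix, or warn and return the value as is."""
    match = PREFIX_RE.match(text)
    if match is None:
        return text  # nothing that looks like a prefix declaration
    prefix = match.group(1)
    body = text[match.end():]
    # Split into lines, keeping each terminating newline.
    lines = re.findall(r"[^\n]*\n|[^\n]+", body)
    if not all(line.startswith(prefix) for line in lines):
        warnings.warn(
            "first line looks like a prefix declaration, but not every "
            "line carries the prefix; value left as is"
        )
        return text
    return "".join(line[len(prefix):] for line in lines)
```

With this variant a legacy folded line such as the two COD cases is returned unchanged and reported, instead of silently losing its first line.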
> We should, however, make it possible in CIF2 to present multiline values
> containing a backslash before the first <eol> without risking a parsing
> error on read when this <backslash> is misunderstood as a prefix flag.
I think discarding the new line of the first ';' line is not necessary in case the line is not a prefix. The suggested prefix declarations are unique enough to be recognized without this rule.
# From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
# Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT):
> This would certainly be a worthy suggestion to consider in a
> CIF1 context.
Sure, the prefixed ';'-texts can be used in CIF1 as well, being mostly backwards compatible, and compatible with the CIF line folding rule.
> For CIF2, my own preference would be to solve this problem by adopting
> the full Python syntax and semantics for treble-quoted strings
My understanding is that, unless escape sequences like those in C or in Python or Perl are mandated in CIF strings ("The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character"[1]), the triple-quoted syntax does not solve the cif-in-cif problem -- as I have read in the recent CIF2 draft[2], 'Clearly, the string within cannot contain an ASCII """'. Thus, again, we will have non-representable values in CIF -- the ones that contain triple single quotes followed by a space, triple double quotes followed by a space, and a semicolon at the beginning of a line.
[1] http://docs.python.org/reference/lexical_analysis.html
[2] http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf
We do not need to go far to find such values -- the text of the cif2_syntax_changes_jrh20100705.pdf draft itself *is* an example of a non-representable value :). The prefixes could easily save the situation without adding much extra work for parsers.
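To illustrate how a prefix makes such a value representable, here is a small Python sketch of the writing side. It is an illustration under stated assumptions rather than part of the proposal: the function name to_prefixed_field and the default prefix '> ' are hypothetical choices, and the sketch assumes that the newline before the closing ';' is not part of the value, as in ordinary CIF text fields.

```python
def to_prefixed_field(text, prefix="> "):
    """Wrap arbitrary text as a prefixed <eol><semicolon> text field (sketch)."""
    if not prefix or "\\" in prefix or "\n" in prefix or prefix.startswith(";"):
        raise ValueError(
            "prefix must be non-empty, contain no backslash or newline, "
            "and must not start with ';'"
        )
    # Every content line gets the prefix, so no content line can start with ';'.
    lines = [prefix + line for line in text.split("\n")]
    return ";" + prefix + "\\\n" + "\n".join(lines) + "\n;"


# A CIF fragment that contains a text field and both kinds of triple quotes.
inner = "data_inner\n_note\n;\nsome '''quoted''' and \"\"\"triple\"\"\" text\n;\n"
field = to_prefixed_field(inner)
print(field)
```

In the resulting field only the opening and the closing line begin with a semicolon, so the embedded text field delimiters and triple quotes no longer terminate the outer value.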
Sincerely,
Saulius

--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
