Re: [ddlm-group] Alternative proposal for eliding

Thank you for clarifying - I see now that there are no real legacy issues.

Cheers

Simon


From: Saulius Grazulis <grazulis@ibt.lt>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 28 June, 2011 15:30:57
Subject: Re: [ddlm-group] Alternative proposal for eliding

Dear DDLm group members,

I would like to comment on some concerns regarding backwards compatibility of
my proposal, the "prefixed <eol><semicolon> text fields". I think that in
most cases the problems can be easily circumvented:

On Wed, Jun 8, 2011 at 6:17 AM, David Brown <idbrown@mcmaster.ca> wrote:

> The only place I see a possible problem is with a heritage CIF with the
> following sequence
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c were
> determined from powder patterns
> ;
>
> /.../
> A CIF reader would expect to find:
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c
>  a,b,c,were determined from powder patterns
> ;
>
> and strip off the a,b,c, /.../

No, that's actually NOT the way I intended things to work. Under my
proposal, the above sequence would not be interpreted as a prefix, since the
final backslash is not followed by a newline (or by white space and a
newline). Thus, the pattern would be interpreted literally, as it is done
now, and no problem would occur with such legacy archived files.

To make "a,b,c" a prefix, one should write:

_publ_section_experimental
;a,b,c,\
a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;

Which is different from above and should be equivalent, after prefix removal,
to '\a,\b,\c were determined from powder patterns' in an unquoted string.

Note that the 'a,b,c,' string *may* be at the beginning of a line, even if it
is a prefix:

_publ_section_experimental
;a,b,c,\
a,b,c,a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;

would fold to 'a,b,c,\a,\b,\c were determined from powder patterns'
single-quoted string after changing newlines to spaces.

Actually, the Perl RE in my previous proposal was not accurate; a more
appropriate determination of the prefix in Perl REs would be:

if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) { # a text without backslashes,
                                         # then a backslash,
                                         # then maybe blanks, then newline.
    my $prefix = $1;
    $text =~ s/^\Q$prefix\E\\[ \t]*\n//; # remove the prefix declaration line
    $text =~ s/^\Q$prefix\E//mg;         # strip the prefix from every line
}
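
For those who prefer to experiment outside Perl, here is a minimal Python
sketch of the same prefix rule, applied to the examples from this message.
The function name strip_prefix is made up for illustration, and the text
passed in is assumed to be the field content after the opening ';'. It is
only a sketch of the idea, not a reference implementation.

```python
import re

# Prefix declaration: non-empty text without backslashes on the first line,
# then a backslash, then optional blanks, then a newline.
PREFIX_DECL = re.compile(r'([^\\\n]+)\\[ \t]*\n')

def strip_prefix(text):
    """Return text with a declared prefix removed, or text unchanged."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text                  # no declaration: value is literal
    prefix = m.group(1)
    body = text[m.end():]            # drop the declaration line
    # Remove the prefix once at the start of every remaining line.
    return re.sub('^' + re.escape(prefix), '', body, flags=re.MULTILINE)

# Legacy value: the backslashes are not followed by a newline, so nothing
# is treated as a prefix and the value is kept literally.
legacy = ' a,b,c,\\a,\\b,\\c were\ndetermined from powder patterns\n'

# Prefixed value: 'a,b,c,' is declared on the first line and then stripped.
prefixed = ('a,b,c,\\\n'
            'a,b,c,\\a,\\b,\\c\n'
            'a,b,c,were determined from powder patterns\n')

print(strip_prefix(legacy) == legacy)
print(strip_prefix(prefixed))
```

Note that the prefix is removed only once per line, so a line that begins
with two copies of 'a,b,c,' keeps the second copy as part of the value.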

> I agree that misreading of a legacy file without incurring a parsing error
> is practically impossible.

The only situation when the legacy files would be misinterpreted would be when
they contain a *nonempty* text and a *trailing* backslash as the first line
of the ';'-delimited text. Arguably, such files are rare and practically
non-existent. For example, there are only two such files in the COD CIF
collection out of 140k+ (which encompasses nearly all files from the IUCr
journals and quite a few by other publishers):

saulius@tasmanijos-velnias cif/ > find ? -iname '*.cif' \
| xargs perl -ne 'print $ARGV, "\t", $_ if /^;([^\\]+)\\(\s+)?\n/'

2/2213918.cif    ;
{4,4'-Dibromo-2,2'-[1,2-phenylenebis(nitrilomethylidene)]diphenolato-\

2/2224012.cif    ; \

and both are probably mis-represented folded long lines which should be
corrected anyway; see the full files:

http://www.crystallography.net/2213918.cif
http://www.crystallography.net/2224012.cif

(Originals are at:

http://scripts.iucr.org/cgi-bin/sendcif?ng2268sup1
http://scripts.iucr.org/cgi-bin/sendcif?sj2654sup1

and they have the same syntax).

I can run the same check on the PDB mmCIF collection if needed.
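
If somebody wants to repeat the check on another collection without the
find/xargs/perl pipeline, a rough Python equivalent could look like the
sketch below. The function name find_risky_lines is hypothetical, and unlike
'find -iname' the '*.cif' match here is case-sensitive on most systems, so
treat it as an approximation of the command above.

```python
import re
from pathlib import Path

# Same pattern as in the perl one-liner: a ';' line carrying non-empty
# text without backslashes, then a trailing backslash.
LEGACY_RISK = re.compile(r'^;([^\\]+)\\(\s+)?\n')

def find_risky_lines(root):
    """Return (file name, line) pairs for lines matching LEGACY_RISK."""
    hits = []
    for path in sorted(Path(root).rglob('*.cif')):
        # errors='replace' keeps the scan going on odd encodings.
        with open(path, encoding='utf-8', errors='replace') as fh:
            for line in fh:
                if LEGACY_RISK.match(line):
                    hits.append((str(path), line))
    return hits

if __name__ == '__main__':
    for name, line in find_risky_lines('.'):
        print(name, line, sep='\t', end='')
```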

Even if such files are encountered, in most cases it will not cause much
harm -- a parser will simply be unable to strip away the prefix and will
leave the value as is. This could (should?) trigger a warning.
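
One possible way for a parser to behave in this situation is sketched below
in Python. The all-or-nothing policy (strip the prefix only if every line
carries it, otherwise warn and keep the value untouched) is my assumption
for the sake of illustration, and the name strip_prefix_or_warn is
hypothetical.

```python
import re
import warnings

# Non-empty backslash-free text, a backslash, optional blanks, a newline.
PREFIX_DECL = re.compile(r'([^\\\n]+)\\[ \t]*\n')

def strip_prefix_or_warn(text):
    """Strip a declared prefix, or warn and return text unchanged."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text                        # ordinary value, nothing to do
    prefix = m.group(1)
    lines = text[m.end():].splitlines(keepends=True)
    if all(line.startswith(prefix) for line in lines):
        return ''.join(line[len(prefix):] for line in lines)
    # Looks like a declaration, but some line lacks the prefix: most
    # likely a legacy folded line, so leave the value as is.
    warnings.warn('first line looks like a prefix declaration, but not '
                  'every line starts with the prefix; value kept as is')
    return text
```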

> We should, however, make it possible in CIF2 to present multiline values
> containing a backslash before the first <eol> without risking a parsing
> error on read when this <backslash> is misunderstood as a prefix flag.

I think discarding the newline of the first ';' line is not necessary in case
the line is not a prefix. The suggested prefix declarations are unique enough
to be recognized without this rule.

# From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
# Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT):

> This would certainly be a worthy suggestion to consider in a
> CIF1 context.

Sure, the prefixed ';'-texts can be used in CIF1 as well, being mostly
backwards compatible, and compatible with the CIF line folding rule.

> For CIF2, my own preference would be to solve this problem by adopting
> the full Python syntax and semantics for treble-quoted strings

My understanding is that, unless escape sequences like those in C or in Python
or Perl are mandated in CIF strings ("The backslash (\) character is used to
escape characters that otherwise have a special meaning, such as newline,
backslash itself, or the quote character"[1]), the triple-quoted syntax does
not solve the cif-in-cif problem -- as I have read in the recent CIF2
draft[2], 'Clearly, the string within cannot contain an ASCII """'. Thus
again we will have non-representable values in CIF -- the ones that contain
triple-single quotes followed by a space, triple double quotes followed by a
space and a semicolon at the beginning of a line.

[1]
http://docs.python.org/reference/lexical_analysis.html

[2]
http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf

We do not need to go far to find such values -- the text of the
cif2_syntax_changes_jrh20100705.pdf draft itself *is* an example of a
non-representable value :). The prefixes could easily save the situation
without adding much extra work for parsers.
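
To illustrate how little work is involved, here is a small Python sketch of
a writer and a reader for prefixed text fields. The names encode_prefixed
and decode_prefixed and the default prefix '> ' are made up for this
example; the point is only that once every line carries a prefix, no line of
the value can begin with ';', so a whole CIF (with its own ';' fields and
triple quotes) can be embedded.

```python
import re

# Non-empty backslash-free text, a backslash, optional blanks, a newline.
PREFIX_DECL = re.compile(r'([^\\\n]+)\\[ \t]*\n')

def encode_prefixed(text, prefix='> '):
    """Wrap arbitrary text into a ';'-delimited prefixed text field."""
    if (not prefix or '\\' in prefix or '\n' in prefix
            or prefix.startswith(';')):
        raise ValueError('unsuitable prefix')
    body = '\n'.join(prefix + line for line in text.split('\n'))
    # ';' + declaration line + prefixed lines + closing <eol>';'
    return ';' + prefix + '\\\n' + body + '\n;'

def decode_prefixed(field):
    """Recover the value from a field produced by encode_prefixed."""
    content = field[1:-2]            # drop leading ';' and trailing '\n;'
    m = PREFIX_DECL.match(content)
    if m is None:
        return content               # no prefix declared
    return re.sub('^' + re.escape(m.group(1)), '',
                  content[m.end():], flags=re.MULTILINE)

inner_cif = "data_inner\n_item\n;\ntext with ''' and \"\"\" inside\n;\n"
field = encode_prefixed(inner_cif)
print(field)
print(decode_prefixed(field) == inner_cif)
```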

Sincerely,
Saulius

--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group