
Re: [ddlm-group] Alternative proposal for eliding

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Alternative proposal for eliding
From: SIMON WESTRIP <[email protected]>
Date: Tue, 28 Jun 2011 15:58:52 +0100 (BST)
In-Reply-To: <[email protected]>
References: <[email protected]>

Thank you for clarifying - I see now that there are no real legacy issues.

Cheers

Simon

From: Saulius Grazulis <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Tuesday, 28 June, 2011 15:30:57
Subject: Re: [ddlm-group] Alternative proposal for eliding

Dear DDLm group members,

I would like to comment on some concerns regarding backwards compatibility of
my proposal, the "prefixed <eol><semicolon> text fields". I think that in
most cases the problems can be easily circumvented:

On Wed, Jun 8, 2011 at 6:17 AM, David Brown <[email protected]> wrote:

> The only place I see a possible problem is with a heritage CIF with the
> following sequence
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c were
> determined from powder patterns
> ;
>
> /.../
> A CIF reader would expect to find:
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c
> a,b,c,were determined from powder patterns
> ;
>
> and strip off the a,b,c, /.../

No, that's actually NOT the way I intended things to work. Under my
proposal, the above sequence would not be interpreted as a prefix, since the
final backslash is not followed by a newline (or by white space and a
newline). Thus, the pattern would be interpreted literally, as it is done
now, and no problem would occur with such legacy archived files.

To make "a,b,c" a prefix, one should write:

_publ_section_experimental
;a,b,c,\
a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;

Which is different from above and should be equivalent, after prefix removal,
to '\a,\b,\c were determined from powder patterns' in an unquoted string.

Note that the 'a,b,c,' string *may* be at the beginning of a line, even if it
is a prefix:

_publ_section_experimental
;a,b,c,\
a,b,c,a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;

would fold to the single-quoted string
'a,b,c,\a,\b,\c were determined from powder patterns'
after changing newlines to spaces.

Actually, the Perl RE in my previous proposal was not accurate; a more
appropriate way to determine the prefix with Perl REs would be:

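if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) { # a text without backslashes,
                                         # then a backslash,
                                         # then maybe blanks, then newline.
    my $prefix = $1;
    # \Q...\E quotes regex metacharacters that may occur in the prefix
    $text =~ s/^\Q$prefix\E\\[ \t]*\n//; # remove the prefix declaration line
    $text =~ s/^\Q$prefix\E//mg;         # remove the prefix from every line
}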
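
For readers who prefer Python, the rule above could be sketched roughly as
follows. This is only a transliteration of the Perl fragment, not part of the
proposal: the function name strip_prefix is hypothetical, and the sketch
assumes that `text` holds everything between the opening ';' and the closing
<eol><semicolon> of the field.

```python
import re

# Prefix declaration: a non-empty first line without backslashes,
# then a backslash, optional blanks, and a newline.
PREFIX_DECL = re.compile(r"([^\\\n]+)\\[ \t]*\n")


def strip_prefix(text: str) -> str:
    """Return the field content with a declared prefix removed, if any."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text  # no declaration: the value is taken literally
    prefix = m.group(1)
    body = text[m.end():]  # drop the declaration line itself
    # Remove one copy of the prefix from the start of every remaining line.
    return re.sub("^" + re.escape(prefix), "", body, flags=re.MULTILINE)


# The legacy sequence from David Brown's example: the backslashes are not
# followed by a newline, so nothing is treated as a prefix.
legacy = " a,b,c,\\a,\\b,\\c were\ndetermined from powder patterns\n"

# The same value written with an explicit 'a,b,c,' prefix declaration.
prefixed = (
    "a,b,c,\\\n"
    "a,b,c,\\a,\\b,\\c\n"
    "a,b,c,were determined from powder patterns\n"
)

print(repr(strip_prefix(legacy)))
print(repr(strip_prefix(prefixed)))
```

Only the first copy of the prefix on each line is removed, so a line whose
content itself begins with 'a,b,c,' keeps that content after stripping.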

> I agree that misreading of a legacy file without incurring a parsing error
> is practically impossible.

The only situation in which legacy files would be misinterpreted is when
they contain a *nonempty* text and a *trailing* backslash as the first line
of the ';'-delimited text. Arguably, such files are rare and probably
almost non-existent. For example, there are only two such files in the COD
CIF collection out of 140k+ (which encompasses nearly all files from the IUCr
journals and quite a few by other publishers):

saulius@tasmanijos-velnias cif/ > find ? -iname '*.cif' \
| xargs perl -ne 'print $ARGV, "\t", $_ if /^;([^\\]+)\\(\s+)?\n/'

2/2213918.cif ;
{4,4'-Dibromo-2,2'-[1,2-phenylenebis(nitrilomethylidene)]diphenolato-\

2/2224012.cif ; \

and both are probably mis-represented folded long lines which should be
corrected anyway; see the full files:

http://www.crystallography.net/2213918.cif
http://www.crystallography.net/2224012.cif

(Originals are at:

http://scripts.iucr.org/cgi-bin/sendcif?ng2268sup1
http://scripts.iucr.org/cgi-bin/sendcif?sj2654sup1

and they have the same syntax).

I can run the same check on the PDB mmCIF collection if needed.
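
Should the check need to be repeated on another collection, the one-liner
above could be sketched in Python along these lines. The function name
find_suspect_lines is hypothetical, and the sketch assumes the files can be
read as text; it is an illustration rather than the script used for COD.

```python
import re
from pathlib import Path

# Same idea as the Perl pattern: ';' at the start of a line, non-empty text
# without backslashes, one backslash, optional white space, end of line.
SUSPECT = re.compile(r";[^\\\n]+\\\s*$")


def find_suspect_lines(root):
    """Yield (path, line) for each matching line in *.cif files under root."""
    for path in sorted(Path(root).rglob("*")):
        # Case-insensitive suffix test, like `find -iname '*.cif'`.
        if not path.is_file() or path.suffix.lower() != ".cif":
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if SUSPECT.match(line):
                    yield path, line


if __name__ == "__main__":
    for path, line in find_suspect_lines("."):
        print(path, line, sep="\t")
```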

Even if such files are encountered, in most cases it will not cause much
harm -- a parser will not be able to strip away the prefix (the remaining
lines will not start with it) and will leave the rest of the value as is.
This could (should?) trigger a warning.
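
One way a parser might behave in that case is sketched below in Python. The
function name strip_prefix_or_warn and the choice to return the whole value
untouched when any line lacks the prefix are assumptions made for
illustration; the proposal itself does not fix these details.

```python
import re
import warnings

# A non-empty first line without backslashes, a backslash, optional blanks,
# and a newline.
PREFIX_DECL = re.compile(r"([^\\\n]+)\\[ \t]*\n")


def strip_prefix_or_warn(text: str) -> str:
    """Strip a declared prefix; if some line lacks it, warn and keep text."""
    m = PREFIX_DECL.match(text)
    if m is None:
        return text  # no prefix declaration at all
    prefix = m.group(1)
    lines = text[m.end():].splitlines(keepends=True)
    if not all(line.startswith(prefix) for line in lines):
        # Probably a legacy value whose first line merely ends in a backslash.
        warnings.warn(
            "first line looks like a prefix declaration, but not every "
            "line carries the prefix; value left as is"
        )
        return text
    return "".join(line[len(prefix):] for line in lines)


# A properly prefixed value is stripped silently.
print(repr(strip_prefix_or_warn("p>\\\np>one\np>two\n")))
```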

> We should, however, make it possible in CIF2 to present multiline values
> containing a backslash before the first <eol> without risking a parsing
> error on read when this <backslash> is misunderstood as a prefix flag.

I think discarding the newline of the first ';' line is not necessary when
that line is not a prefix declaration. The suggested prefix declarations are
unique enough to be recognized without this rule.

# From: "Herbert J. Bernstein" <[email protected]>
# Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT):

> This would certainly be a worthy suggestion to consider in a
> CIF1 context.

Sure, the prefixed ';'-texts can be used in CIF1 as well, being mostly
backwards compatible, and compatible with the CIF line folding rule.

> For CIF2, my own preference would be to solve this problem by adopting
> the full Python syntax and semantics for treble-quoted strings

My understanding is that, unless escape sequences like those in C or in Python
or Perl are mandated in CIF strings ("The backslash (\) character is used to
escape characters that otherwise have a special meaning, such as newline,
backslash itself, or the quote character"[1]), the triple-quoted syntax does
not solve the cif-in-cif problem -- as I have read in the recent CIF2
draft[2], 'Clearly, the string within cannot contain an ASCII """'. Thus
again we will have non-representable values in CIF -- the ones that contain
all of: triple single quotes followed by a space, triple double quotes
followed by a space, and a semicolon at the beginning of a line.

[1]
http://docs.python.org/reference/lexical_analysis.html

[2]
http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf

We do not need to go far to find such values -- the text of the
cif2_syntax_changes_jrh20100705.pdf draft itself *is* an example of a
non-representable value :). The prefixes could easily save the situation
without adding much extra work for parsers.
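
To illustrate the writer's side, a text can be made safe for a ';'-delimited
field by prefixing every line, roughly as in the Python sketch below. The
function name add_prefix, the default prefix 'cif> ' and the restrictions
checked on the prefix are assumptions of this sketch, not rules taken from
the proposal.

```python
def add_prefix(text: str, prefix: str = "cif> ") -> str:
    """Return the body of a prefixed ';' field that encodes `text`."""
    # Assumed restrictions: the prefix must be non-empty and free of
    # backslashes and newlines so that the declaration line is unambiguous,
    # and it must not start with ';' or prefixed lines would end the field.
    if (not prefix or "\\" in prefix or "\n" in prefix
            or prefix.startswith(";")):
        raise ValueError("unsuitable prefix")
    lines = text.split("\n")
    # Declaration line (prefix + backslash), then every line with the prefix.
    return prefix + "\\\n" + "\n".join(prefix + line for line in lines)


# An embedded CIF fragment with a line-initial ';' and both triple quotes.
inner = "_publ_section_comment\n;\ntext with ''' and \"\"\" inside\n;"
body = add_prefix(inner)
field = ";" + body + "\n;"
print(field)
```

After prefixing, no line of the body begins with a semicolon, so the
embedded text cannot terminate the enclosing field early.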

Sincerely,
Saulius

--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group
