From: Saulius Grazulis <grazulis@ibt.lt>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 28 June, 2011 15:30:57
Subject: Re: [ddlm-group] Alternative proposal for eliding
Dear DDLm group members,
I would like to comment on some concerns regarding backwards compatibility of
my proposal, the "prefixed <eol><semicolon> text fields". I think that in
most cases the problems can be easily circumvented:
On Wed, Jun 8, 2011 at 6:17 AM, David Brown <idbrown@mcmaster.ca> wrote:
> The only place I see a possible problem is with a heritage CIF with the
> following sequence
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c were
> determined from powder patterns
> ;
>
> /.../
> A CIF reader would expect to find:
>
> _publ_section_experimental
> ; a,b,c,\a,\b,\c
> a,b,c,were determined from powder patterns
> ;
>
> and strip off the a,b,c, /.../
No, that's actually NOT the way I intended things to work. Under my
proposal, the above sequence would not be interpreted as a prefix, since the
final backslash is not followed by a newline (or by white space and a
newline). Thus, the pattern would be interpreted literally, as it is done
now, and no problem would occur with such legacy archived files.
To make "a,b,c," a prefix, one should write:
_publ_section_experimental
;a,b,c,\
a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
This differs from the sequence above and, after prefix removal, should be
equivalent to '\a,\b,\c were determined from powder patterns' in an unquoted
string.
Note that the 'a,b,c,' string *may* also appear at the beginning of a line of
the value itself, even when it is used as the prefix:
_publ_section_experimental
;a,b,c,\
a,b,c,a,b,c,\a,\b,\c
a,b,c,were determined from powder patterns
;
would fold to the single-quoted string
'a,b,c,\a,\b,\c were determined from powder patterns'
after changing newlines to spaces.
Actually, the Perl RE in my previous proposal was not accurate. A more
appropriate determination of the prefix in Perl REs would be:
if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) { # a first line without backslashes,
                                         # then a backslash,
                                         # then maybe blanks, then newline.
    my $prefix = $1;
    # \Q...\E quotes regexp metacharacters that may occur in the prefix
    $text =~ s/^\Q$prefix\E\\[ \t]*\n//;   # drop the prefix declaration line
    $text =~ s/^\Q$prefix\E//mg;           # strip the prefix from every line
}
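For readers who prefer Python, the same prefix rule can be sketched as below.
This is only an illustration of the proposal as described in this message,
not code from any CIF library; the name strip_prefix is hypothetical, and
$text is assumed to hold the value without the opening and closing ';'.

```python
import re

# First line: text without backslashes, then a backslash,
# then optional blanks, then a newline.
PREFIX_LINE = re.compile(r"\A([^\\\n]+)\\[ \t]*\n")

def strip_prefix(text):
    """Return text with a declared line prefix removed, or text unchanged."""
    match = PREFIX_LINE.match(text)
    if match is None:
        return text  # no prefix declaration: the value is taken literally
    prefix = match.group(1)
    body = text[match.end():]  # drop the declaration line itself
    return "\n".join(
        line[len(prefix):] if line.startswith(prefix) else line
        for line in body.split("\n")
    )

# The legacy value quoted by David Brown: the backslash is not at the end
# of the first line, so nothing is stripped.
legacy = " a,b,c,\\a,\\b,\\c were\ndetermined from powder patterns\n"

# The prefixed form from this message.
prefixed = (
    "a,b,c,\\\n"
    "a,b,c,\\a,\\b,\\c\n"
    "a,b,c,were determined from powder patterns\n"
)

print(strip_prefix(legacy) == legacy)
print(repr(strip_prefix(prefixed)))
```

Only one copy of the prefix is removed per line, so a line whose own content
starts with 'a,b,c,' keeps that text after stripping.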
> I agree that misreading of a legacy file without incurring a parsing error
> is practically impossible.
The only situation in which legacy files would be misinterpreted is when
the first line of a ';'-delimited text contains *nonempty* text followed by a
*trailing* backslash. Arguably, such files are very rare. For example, there
are only two such files in the COD CIF collection out of 140k+ (which
encompasses nearly all files from the IUCr journals and quite a few from other
publishers):
saulius@tasmanijos-velnias cif/ > find ? -iname '*.cif' \
| xargs perl -ne 'print $ARGV, "\t", $_ if /^;([^\\]+)\\(\s+)?\n/'
2/2213918.cif ;
{4,4'-Dibromo-2,2'-[1,2-phenylenebis(nitrilomethylidene)]diphenolato-\
2/2224012.cif ; \
and both are probably mis-represented folded long lines which should be
corrected anyway; see the full files:
http://www.crystallography.net/2213918.cif
http://www.crystallography.net/2224012.cif
(Originals are at:
http://scripts.iucr.org/cgi-bin/sendcif?ng2268sup1
http://scripts.iucr.org/cgi-bin/sendcif?sj2654sup1
and they have the same syntax).
I can run the same check on the PDB mmCIF collection if needed.
Even if such files are encountered, in most cases this will not cause much
harm -- a parser will simply find no prefixes to strip and will leave the rest
of the value as is. This could (should?) trigger a warning.
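A check like the Perl one-liner above could also be built into a reader so
that it can issue such a warning. The sketch below is an assumption about how
that might look, not an existing tool: find_prefix_like_lines is a
hypothetical name, and the tracking of text fields is deliberately simplistic
(every line starting with ';' toggles the "inside a text field" state).

```python
import re

# Same shape as the one-liner: ';', nonempty text without backslashes,
# a backslash, optional blanks, end of line.
DECLARATION = re.compile(r"^;([^\\\n]+)\\[ \t]*$")

def find_prefix_like_lines(cif_text):
    """Yield (line_number, line) for lines that open a ';' text field and
    would be read as a prefix declaration under the proposal."""
    in_text_field = False
    for number, line in enumerate(cif_text.splitlines(), start=1):
        if line.startswith(";"):
            if not in_text_field and DECLARATION.match(line):
                yield number, line
            in_text_field = not in_text_field

sample = (
    "_a\n"
    ";legacy text\\\n"
    "more\n"
    ";\n"
    "_b\n"
    ";plain first line\n"
    "x\\\n"
    ";\n"
)

for number, line in find_prefix_like_lines(sample):
    print(f"warning: line {number} looks like a prefix declaration: {line}")
```

Only the opening line of a text field is tested, so a folded line with a
trailing backslash further down in the value is not reported.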
> We should, however, make it possible in CIF2 to present multiline values
> containing a backslash before the first <eol> without risking a parsing
> error on read when this <backslash> is misunderstood as a prefix flag.
I think discarding the newline of the first ';' line is not necessary when
that line is not a prefix declaration. The suggested prefix declarations are
unique enough to be recognized without this rule.
# From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
# Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT):
> This would certainly be a worthy suggestion to consider in a
> CIF1 context.
Sure, the prefixed ';'-texts can be used in CIF1 as well, being mostly
backwards compatible, and compatible with the CIF line folding rule.
> For CIF2, my own preference would be to solve this problem by adopting
> the full Python syntax and semantics for treble-quoted strings
My understanding is that, unless escape sequences like those in C, Python
or Perl are mandated in CIF strings ("The backslash (\) character is used to
escape characters that otherwise have a special meaning, such as newline,
backslash itself, or the quote character"[1]), the triple-quoted syntax does
not solve the cif-in-cif problem -- as I have read in the recent CIF2
draft[2], 'Clearly, the string within cannot contain an ASCII """'. Thus,
again, we will have non-representable values in CIF -- the ones that contain
triple single quotes followed by a space, triple double quotes followed by a
space, and a semicolon at the beginning of a line.
[1]
http://docs.python.org/reference/lexical_analysis.html
[2]
http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf
We do not need to go far to find such values -- the text of the
cif2_syntax_changes_jrh20100705.pdf draft itself *is* an example of a
non-representable value :). The prefixes could easily save the situation
without adding much extra work for parsers.
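As an illustration of how little work this is, here is a minimal Python
sketch of writing and reading a prefixed text field. The function names and
the default prefix '> ' are hypothetical choices made for this example; line
length limits and line folding are ignored.

```python
import re

def encode_text_field(text, prefix="> "):
    """Wrap text in a ';' text field, prefixing every line of the value."""
    if (not prefix or "\\" in prefix or "\n" in prefix
            or prefix.startswith(";")):
        raise ValueError("unsuitable prefix")
    lines = [prefix + line for line in text.split("\n")]
    # First line declares the prefix: ';', the prefix, a trailing backslash.
    return ";" + prefix + "\\\n" + "\n".join(lines) + "\n;"

def decode_text_field(field):
    """Recover the value from a field written by encode_text_field."""
    body = field[1:-2]  # drop the opening ';' and the closing '\n;'
    match = re.match(r"([^\\\n]+)\\[ \t]*\n", body)
    if match is None:
        return body  # no prefix declaration: the value is taken literally
    prefix = match.group(1)
    rest = body[match.end():]
    return "\n".join(
        line[len(prefix):] if line.startswith(prefix) else line
        for line in rest.split("\n")
    )

# A cif-in-cif value containing all three troublesome sequences.
inner = (
    "data_inner\n"
    "_publ_section_comment\n"
    ";\n"
    "text with ''' and \"\"\" in it\n"
    ";"
)

field = encode_text_field(inner)
print(field)
print(decode_text_field(field) == inner)
```

Because every line of the value starts with the prefix, no line inside the
field begins with a semicolon, so the embedded ';' delimiters cannot end the
outer text field.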
Sincerely,
Saulius
--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
Re: [ddlm-group] Alternative proposal for eliding
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Alternative proposal for eliding
- From: SIMON WESTRIP <simonwestrip@btinternet.com>
- Date: Tue, 28 Jun 2011 15:58:52 +0100 (BST)
- In-Reply-To: <201106281730.57294.grazulis@ibt.lt>
- References: <201106281730.57294.grazulis@ibt.lt>
Thank you for clarifying - I see now that there are no real legacy issues.
Cheers
Simon