Discussion List Archives


Re: [ddlm-group] Alternative proposal for eliding

I agree that misreading of a legacy file without incurring a parsing error is practically impossible.

We should, however, make it possible in CIF2 to present multiline values containing a backslash before the first <eol> without risking a parsing error on read when this <backslash> is misunderstood as a prefix flag.

I suggest the following rule be added to the Grazulis proposal:
(Rule one) The <eol> at the end of the first line of all <eol><semicolon> delimited values does not form part of the data value.

This works as follows: when encoding a data value inside an <eol><semicolon> delimited string, a simple output routine would always insert an <eol> immediately after the opening <semicolon>, unless it wishes to use the prefix and/or line-folding conventions. On reading an <eol><semicolon> delimited string, this first <eol> is always discarded.
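
A minimal Python sketch of Rule one, under stated assumptions: the field is handled as raw text from the opening <eol><semicolon> to the closing <eol><semicolon>, the <eol> before the closing semicolon is not part of the value, and <eol> is a single "\n". The function names are hypothetical, and the prefix and line-folding conventions are deliberately left out.

```python
def write_text_field(value: str) -> str:
    """Serialise `value` as an <eol><semicolon> delimited field.

    An <eol> is always written straight after the opening semicolon, so the
    first line of the field is empty and cannot carry a prefix or fold flag.
    """
    return "\n;\n" + value + "\n;"


def read_text_field(field: str) -> str:
    """Recover the value, discarding the <eol> that ends the first line (Rule one)."""
    if not (field.startswith("\n;") and field.endswith("\n;")):
        raise ValueError("not an <eol><semicolon> delimited field")
    body = field[2:-2]                      # text between the two delimiters
    first_line, _, rest = body.partition("\n")
    return first_line + rest                # the first <eol> is not part of the value


# A value whose first line ends in a backslash survives the round trip.
value = "a,b,c,\\\ndetermined from powder patterns"
assert read_text_field(write_text_field(value)) == value
```

Because the writer never puts anything on the first line, a backslash that belongs to the value always sits on the second or a later line of the field.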

On Wed, Jun 8, 2011 at 6:17 AM, David Brown <idbrown@mcmaster.ca> wrote:
The only place I see a possible problem is with a heritage CIF with the following sequence

_publ_section_experimental
; a,b,c,\a,\b,\c were
determined from powder patterns
;

Since this has already been written, there is no problem with a CIF writer.  A CIF reader would expect to find:

_publ_section_experimental
; a,b,c,\a,\b,\c
 a,b,c,were determined from powder patterns
;

and strip off the a,b,c, but if the supposed prefix is not present, the parser would presumably recognize this as a CIF1 file and ignore the supposed prefix.

To really screw up the parser one would need:

_publ_section_experimental
; a,b,c,\a,\b,\c were determined from powder patterns but accurate values of
 a,b,c, were determined from single crystals
;

This is not very likely.

David



James Hester wrote:
Dear DDLm-group,

Saulius Grazulis has submitted an alternative proposal for representing arbitrary strings in CIF2.  This proposal has grown out of his own concern around the inability of CIF to represent arbitrary strings, so I view this as further confirmation that a solution is needed.  In any case, please comment on the proposal given below.

James.

===========================
(below is from Saulius Grazulis)
1. In the current CIF specification, the only way to specify a
multi-line text value is to use a semicolon (';') delimited text
field. Since such a field is terminated by the first semicolon at the
beginning of a CIF line, the value may not contain semicolons at the
beginning of any line. As a consequence, a valid CIF file may not, in
general, be provided as a multi-line value of another valid CIF (thus
we may refer to this problem as the "cif-in-cif problem"). The problem
was briefly mentioned as "theoretical" in last year's DDLm group
discussions
(http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
in my experience, it surfaces as a potential bug each time
we print out a multi-line CIF value. Although the need to have
"nested" CIFs is marginal, a general-purpose CIF processor that
obtains text values from sources other than parsed, syntactically
correct CIFs has no good way of dealing with it: such values are not
guaranteed to be free of semicolons at the beginnings of lines,
and when such a value is encountered, there is no general algorithm
that would permit representing it in CIF (any
modifications, such as prepending whitespace or refolding lines,
can, in general, break the semantics of the value).

The newly proposed triple-quoted text fields (delimited with either
""" or ''' sequences) solve the problem for text lines that start with
a semicolon, at the cost of introducing two more kinds of delimited
strings. The "cif-in-cif" problem still remains, however, since a
value that contains all of the delimiters (newlines, quotes, <eol>;,
''' and """) still cannot be represented as a value in any kind of
quoted text field.

2. The 'cif-in-cif' problem might be solved in a general way by using
a "prefixed text field syntax":

a) a special starting sequence,
<eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>
would signal that all lines in this text field are prefixed with a
<text-field-prefix>. Here

<text-field-prefix> ::= {<OrdinaryChar> | <space>}+
<optional-trailing-whitespace> ::= <space>*
<space> ::= SP | HT

Each line of such a text field then MUST start with the specified text
field prefix. Neither the starting sequence nor the prefix belongs
to the value, and both should be removed by a prefix-aware parser before
returning the value.

For example, a CIF sample can be included in a text field like this:

data_providing_example
_example
;CIF>\
CIF>data_example
CIF>_text
CIF>;This is an embedded multiline value
CIF>;
; # here the field terminates.

Even more readable would be a blank prefix:

data_providing_example
_example
; \
 data_example
 _text
 ;This is an embedded multiline value
 ;
; # here the field terminates.
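
The removal of the starting sequence and prefix described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `strip_prefix` is a hypothetical name, the input is the text after the opening semicolon up to (but not including) the <eol> before the closing semicolon, <eol> is "\n", and the prefix is allowed to contain any character except a backslash or line terminator, which is looser than <OrdinaryChar>. Returning the text untouched when some line lacks the prefix is also an assumption; the proposal only says that each line MUST start with it.

```python
import re

# <text-field-prefix> "\" <optional-trailing-whitespace> <eol>
# Assumption: the prefix is any run of characters other than backslash, CR or LF.
_START = re.compile(r"([^\\\r\n]+)\\[ \t]*\n")


def strip_prefix(text: str) -> str:
    """Return the value of a text field, undoing the prefix convention if it is used."""
    match = _START.match(text)
    if match is None:
        return text                       # no starting sequence: ordinary text field
    prefix = match.group(1)
    lines = text[match.end():].split("\n")
    if not all(line.startswith(prefix) for line in lines):
        return text                       # assumption: treat as an unprefixed (legacy) value
    return "\n".join(line[len(prefix):] for line in lines)


field = ("CIF>\\\n"
         "CIF>data_example\n"
         "CIF>_text\n"
         "CIF>;This is an embedded multiline value\n"
         "CIF>;")
print(strip_prefix(field))
```

The same function handles the blank-prefix form and tolerates spaces or tabs after the backslash of the starting sequence; a line-folded field that begins with a bare backslash is passed through unchanged.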

I see numerous advantages of such a scheme:

a) it solves the "arbitrary value" a.k.a. "CIF-in-CIF" problem once and
for all;

b) it is simple to describe and to follow;

c) it is simple to implement in parsers: a parser, after obtaining a
multi-line text field value, would match for a starting sequence and,
if it is found, remove both the starting sequence and the prefix
obtained from it. In Perl it can be done as follows:

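# The prefix is whatever precedes the backslash on the first line.
if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) {
    my $prefix = $1;
    $text =~ s/^\Q${prefix}\E\\[ \t]*\n//;  # drop the starting sequence
    $text =~ s/^\Q${prefix}\E//mg;          # drop the (literal) prefix from every line
}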

d) it is easy to implement in value printers: a printer, recognising a
multi-line string with "problematic" characters (<eol>; and friends),
would print a starting sequence, then prepend a self-selected prefix
to each printed line, and then terminate the field with a
semicolon in the regular way:

my $prefix = " ";
print ";${prefix}\\\n";                   # starting sequence
print map { $prefix . $_ } @text_lines;   # each line is assumed to end with a newline
print ";\n";

implementations in other garbage-collected languages (Python, Java)
should be equally straightforward, and for manually allocating
languages (Fortran, C) a simple pair of subroutines would convert
between the prefixed and non-prefixed text forms (regexps are not
strictly necessary for the implementation);


e) the proposal is backwards-compatible with plain CIF1.x
parsers. Parsers that are not aware of the prefixing convention would
simply read and pass on the whole prefixed value. If such a value is
printed out without modification, the encapsulated information is
preserved correctly.

f) It is compatible with the current CIF1.x line folding notation, if
we first fold the lines, and then prefix them. In fact, it may be
viewed as an extension of the line folding convention. The prefix
would be added before the trailing backslash:

_long_text
;PFX>\\
PFX>long and folded\
PFX>prefixed line
PFX>;non-folded line
;

The parsing procedure would be the opposite: first unprefix (using the
algorithm in c) and then unfold in the usual way.

g) The method results in both machine- and human-readable CIFs, with
minimal additional markup if desired:

_example
; \
 # As an example, we provide a full, syntactically correct
 # CIF for your convenience
 data_I
 _text
 ;
  The nested values can be nicely indented using spaces or tabs
 ;
 _example # nested :)
 ; \
  ;Nesting the nested values is straightforward and unambiguous.
  ;
 ;
;

h) since the ";something\" is seldom if ever used in current CIFs,
practically all existing CIFs retain their original semantics under
the new convention. Line-folded text fields are easily recognised by
the absence of a prefix: they begin with ";\".
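
Since the proposal suggests that a Python implementation of the printer should be straightforward, here is one possible sketch. It is not part of the proposal: `format_text_field` is a hypothetical name, the test for "problematic" values (a line starting with a semicolon, or a first line ending in a backslash) is an assumed reading of "<eol>; and friends", and line folding is not handled.

```python
def format_text_field(value: str, prefix: str = " ") -> str:
    """Return `value` as a text field, starting at the opening semicolon.

    The prefix convention is used only when the plain form would be misread.
    """
    if not prefix or prefix.startswith(";") or any(c in prefix for c in "\\\r\n"):
        raise ValueError("unusable prefix")
    lines = value.split("\n")
    # A line starting with ";" would terminate the field early, and a first
    # line ending in a backslash would look like a starting sequence.
    risky = (any(line.startswith(";") for line in lines)
             or lines[0].rstrip(" \t").endswith("\\"))
    if not risky:
        return ";" + value + "\n;"
    body = "\n".join(prefix + line for line in lines)
    return ";" + prefix + "\\\n" + body + "\n;"


# Nesting: the formatted inner field becomes part of the outer value.
inner = format_text_field("Nesting the nested values\n;is straightforward.")
outer = format_text_field("data_I\n_example\n" + inner)
print(outer)
```

With the default single-space prefix, each level of nesting simply indents the embedded field by one more space, in the same way as the nested example above.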

3. A final note: I suggest permitting trailing whitespace at the end
of the starting sequence:

_text
;PFX>\              
PFX>The previous line has extra spaces at the end,
PFX>but we usually do not see them in text editors.
;

Such trailing space is difficult for humans to spot, and does no harm
to computers. It should be removed together with the starting
sequence by a parser. In this way we would eliminate a potential source of
frustrating errors.

4. It is interesting to note that a similar problem exists in other
formats as well; e.g. an XML CDATA section may not contain the terminating
]]> sequence. The same solution might apply to XML CDATA as well:

<![CDATA PREFIX: [
PREFIX: Another example of CDATA can be embedded with a prefix:
PREFIX: Anything goes here!
PREFIX: <![CDATA[
PREFIX:  Anything that goes here can be prefixed as well
PREFIX: ]]>
]]>

or, optionally, even nicer (note that the specified prefix is just a
space character):

<![CDATA [
 Another example of CDATA can be embedded with a prefix:
 Anything goes here!
 <![CDATA[
  Anything that goes here can be prefixed as well
 ]]>
]]>

Obviously, the same technique can be used for mmCIF as well.

5. If the prefixed text fields are implemented, arbitrary values can
be represented in CIFs at least as conveniently as text fields can in the
current CIF1.1 format. Thus, there is, strictly speaking, no need for the
"""/''' strings, and one could simplify CIF2.x by omitting them
altogether. However, the proposed method is orthogonal to the """/'''
string format, and thus both can be implemented simultaneously if
necessary.


Sincerely,
Saulius


-- 
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366



_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group







--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

International Union of Crystallography
