[ddlm-group] Alternative proposal for eliding

Dear DDLm-group,

Saulius Grazulis has submitted an alternative proposal for representing arbitrary strings in CIF2.  This proposal has grown out of his own concern around the inability of CIF to represent arbitrary strings, so I view this as further confirmation that a solution is needed.  In any case, please comment on the proposal given below.

James.

===========================
(below is from Saulius Grazulis)
1. In the current CIF specification, the only way to specify a
multi-line text value is to use a semicolon (';') delimited text
field. Since such a field is terminated by the first semicolon at the
beginning of the CIF line, the value may not contain semicolons at the
beginning of any line. As a consequence, a valid CIF file may not be,
in general, provided as a multi-line value of another valid CIF (thus
we may refer to this problem as the "cif-in-cif problem"). The problem
was briefly mentioned as "theoretical" in last year's DDLm group
discussions
(http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
in my experience, it surfaces as a lurking bug possibility each time
we print out a multi-line CIF value. Although the need to have
"nested" CIFs is marginal, a general purpose CIF processor that
obtains text values from sources other than the parsed syntactically
correct CIFs has no good way of dealing with it -- such values are not
guaranteed to be free of semicolons at the beginnings of the lines,
and when such a value is encountered, there is no general algorithm
that would permit representation of such a value in CIF (any
modifications such as prepending of whitespace or refolding of lines
can, in general, break the semantics of the value).

The newly proposed triple-quoted text fields (delimited with either
""" or ''' sequences) solve the problem for semicolon-starting text
lines, at the cost of introducing two more kinds of delimited
strings. The "cif-in-cif" problem still remains, however, since a
value that contains all of the delimiters (newlines, quotes, <eol>;,
''' and """) still cannot be represented as a value in any kind of
the quoted text fields.
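
To make the limitation concrete, here is a minimal Python sketch that
checks which delimiter styles could hold a given value verbatim. The
function name usable_delimiters is hypothetical, and the rules are
deliberately simplified assumptions (for instance, quoted strings are
taken to forbid any occurrence of their own quote character); it is
not a statement of the CIF grammar.

```python
def usable_delimiters(value):
    """Return the delimiter styles that could hold `value` verbatim.

    Simplified rules, assumed for illustration only.
    """
    usable = []
    multiline = "\n" in value
    # Single-line quoted strings: no newline, no occurrence of the quote.
    if not multiline and "'" not in value:
        usable.append("'")
    if not multiline and '"' not in value:
        usable.append('"')
    # Semicolon text field: no line after the first may start with ';'.
    if "\n;" not in value:
        usable.append(";")
    # Triple-quoted strings: the closing delimiter must not occur inside.
    if "'''" not in value:
        usable.append("'''")
    if '"""' not in value:
        usable.append('"""')
    return usable


# A small CIF that uses every delimiter style itself.
nested_cif = 'data_x\n_a\n;text\n;\n_b """q"""\n_c \'\'\'r\'\'\'\n'
print(usable_delimiters(nested_cif))
```

For nested_cif the returned list is empty: under these simplified
rules no existing delimiter style can carry the value unchanged.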

2. The 'cif-in-cif' problem might be solved in a general way by using
a "prefixed text field syntax":

a) a special starting sequence,
<eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>

would signal that all lines in this text field are prefixed with a
<text-field-prefix>. Here

<text-field-prefix> ::= {<OrdinaryChar> | <space>}+
<optional-trailing-whitespace> ::= <space>*
<space> ::= SP | HT

Each line of such text field then MUST start with the specified text
field prefix. Neither the starting sequence nor the prefix belongs
to the value; a prefix-aware parser should remove both before
returning the value.

For example, a CIF sample can be embedded in a text field like this:

data_providing_example
_example
;CIF>\
CIF>data_example
CIF>_text
CIF>;This is an embedded multiline value
CIF>;
; # here the field terminates.

Even more readable would be a blank prefix:

data_providing_example
_example
; \
 data_example
 _text
 ;This is an embedded multiline value
 ;
; # here the field terminates.
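
The unprefixing rule above can be sketched in Python as follows. This
is only an illustration under stated assumptions: the parser is
assumed to have already isolated the raw field content (the text
between the opening ';' and the closing <eol>;), the prefix is
assumed to contain no backslash, and the names strip_prefix and
_START are hypothetical. The combination with line folding is not
handled here.

```python
import re

# Starting sequence: <text-field-prefix> '\' <optional-trailing-whitespace> <eol>
# Assumption: the prefix itself contains no backslash.
_START = re.compile(r"([^\\\n]+)\\[ \t]*\n")


def strip_prefix(raw):
    """Return the value of a text field given its raw content.

    Fields without a starting sequence are returned unchanged.
    """
    m = _START.match(raw)
    if m is None:
        return raw
    prefix = m.group(1)
    lines = raw[m.end():].split("\n")
    for line in lines:
        # Every line of a prefixed field MUST start with the prefix.
        if not line.startswith(prefix):
            raise ValueError("line lacks the declared prefix: %r" % line)
    return "\n".join(line[len(prefix):] for line in lines)


raw = ("CIF>\\\n"
       "CIF>data_example\n"
       "CIF>_text\n"
       "CIF>;This is an embedded multiline value\n"
       "CIF>;")
print(strip_prefix(raw))
```

Applied to the first example, this returns the embedded CIF with its
own semicolon-delimited field intact. Trailing spaces or tabs after
the backslash of the starting sequence are discarded together with it.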

I see numerous advantages of such a scheme:

a) it solves the "arbitrary value" a.k.a "CIF-in-CIF" problem once and
forever;

b) it is simple to describe and to follow;

c) it is simple to implement in parsers: a parser, after obtaining a
multi-line text field value, would match for a starting sequence and,
if it is found, remove both the starting sequence and a prefix
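obtained from it. In Perl it can be done as follows:

# $text holds the raw text field value, without the leading semicolon.
# The prefix is taken to be any run of characters other than '\' and newline.
if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) {
    my $prefix = quotemeta $1;           # match the prefix literally
    $text =~ s/^${prefix}\\[ \t]*\n//;   # drop the starting sequence
    $text =~ s/^${prefix}//mg;           # strip the prefix from every line
}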

d) it is easy to implement in value printers: a printer, recognising a
multi-line string with "problematic" characters (<eol>; and friends),
would print a starting sequence, and then prepend each printed line
with a self-selected prefix, and then terminate the field with a
semicolon in a regular way:

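my $prefix = " ";
print ";${prefix}\\\n";                  # starting sequence
# each element of @text_lines is assumed to end with a newline
print map { $prefix . $_ } @text_lines;
print ";\n";                             # regular field terminator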

implementations in other garbage-collected languages (Python, Java)
should be equally straightforward, and for manually-allocating
languages (Fortran, C) a simple pair of subroutines would convert
between the prefixed and non-prefixed text forms (regexps are not
strictly necessary for the implementation);

e) the proposal is backwards-compatible with plain CIF1.x
parsers. Parsers that are not aware of the prefixing convention would
simply read and pass the whole prefixed value. If such a value is
printed out without modification, the encapsulated information is
preserved correctly.

f) it is compatible with the current CIF1.x line folding notation, if
we first fold the lines, and then prefix them. In fact, it may be
viewed as an extension of the line folding convention. The prefix
would be added before the trailing backslash:

_long_text
;PFX>\\
PFX>long and folded\
PFX>prefixed line
PFX>;non-folded line
;

The parsing procedure would be the reverse: first unprefix (using the
algorithm in c) and then unfold in the usual way.

g) the method results in both machine- and human-readable CIFs, with
minimal additional markup if desired:

_example
; \
 # As an example, we provide a full, syntactically correct
 # CIF for your convenience
 data_I
 _text
 ;
     The nested values can be nicely indented using spaces or tabs
 ;
 _example # nested :)
 ; \
  ;Nesting the nested values is straightforward and unambiguous.
  ;
 ;
;

h) since the ";something\" sequence is seldom, if ever, used in current
CIFs, practically all existing CIFs retain their original semantics
under the new convention. Line-folded CIFs are recognised easily by
the absence of a prefix, i.e. by a bare ";\" at the beginning of the
text field.
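
As a rough Python counterpart of the printer described in item d), the
sketch below prefixes a value only when it could not be written as an
ordinary text field, and shows that nesting works by applying the same
function twice. The names needs_prefix and format_text_field are
hypothetical. The extra check for a first line ending in a backslash
is my own assumption, added so that such a value is not mistaken for a
starting sequence or a fold marker. The caller is assumed to emit the
result at the beginning of a line.

```python
def needs_prefix(value):
    """True if `value` cannot be written as an ordinary text field."""
    first_line = value.split("\n", 1)[0]
    # A later line starting with ';' would end the field early; a first
    # line ending in '\' would look like a starting sequence (assumption).
    return "\n;" in value or first_line.rstrip(" \t").endswith("\\")


def format_text_field(value, prefix=" "):
    """Render `value` as a semicolon-delimited text field."""
    if not needs_prefix(value):
        return ";" + value + "\n;"
    body = "\n".join(prefix + line for line in value.split("\n"))
    # Starting sequence, prefixed lines, then the regular terminator.
    return ";" + prefix + "\\\n" + body + "\n;"


# Nesting: the inner field is produced first, then embedded again.
inner = format_text_field(";Nesting is unambiguous.\n;")
outer = format_text_field("data_I\n_example\n" + inner)
print(outer)
```

With the default single-space prefix, each level of nesting adds one
leading space, so no line inside the outer field starts with a
semicolon and only the final line terminates it.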

3. A final note: I suggest permitting trailing whitespace at the end
of the starting sequence:

_text
;PFX>\
PFX>The previous line has extra spaces at the end,
PFX>but we usually do not see them in text editors.
;

Such trailing space is difficult to spot for humans, and does not harm
computers. It should be removed together with the starting
sequence by a parser. In this way we would eliminate a potential source of
upsetting errors.

4. It is interesting to note that a similar problem exists in other
formats as well; e.g. XML CDATA value may not contain a terminating
]]> sequence. The same solution might apply to XML CDATA as well:

<![CDATA PREFIX: [
PREFIX: Another example of CDATA can be embedded with a prefix:
PREFIX: Anything goes here!
PREFIX: <![CDATA[
PREFIX: Anything that goes here can be prefixed as well
PREFIX: ]]>
]]>

or, optionally, even nicer (note that the specified prefix is just a
space character):

<![CDATA [
 Another example of CDATA can be embedded with a prefix:
 Anything goes here!
 <![CDATA[
 Anything that goes here can be prefixed as well
 ]]>
]]>

Obviously, the same technique can be used for mmCIF as well.

5. If the prefixed text fields are implemented, arbitrary values can
be represented in CIFs at least as conveniently as can text fields in the
current CIF1.1 format. Thus, there is strictly speaking no need for the
"""/''' strings, and one could simplify CIF2.x by omitting them
altogether. However, the proposed method is orthogonal to the """/'''
string format, and thus both can be implemented simultaneously if
necessary.


Sincerely,
Saulius

--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
