Re: [ddlm-group] Alternative proposal for eliding

This would certainly be a worthy suggestion to consider in a
CIF1 context.

For CIF2, my own preference would be to solve this problem by adopting
the full Python syntax and semantics for treble-quoted strings.

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 7 Jun 2011, James Hester wrote:

> Dear DDLm-group,
> Saulius Grazulis has submitted an alternative proposal for representing
> arbitrary strings in CIF2.  This proposal has grown out of his own concern
> around the inability of CIF to represent arbitrary strings, so I view this
> as further confirmation that a solution is needed.  In any case, please
> comment on the proposal given below.
> James.
> ===========================
> (below is from Saulius Grazulis)
> 1. In the current CIF specification, the only way to specify a
> multi-line text value is to use a semicolon (';') delimited text
> field. Since such a field is terminated by the first semicolon at the
> beginning of the CIF line, the value may not contain semicolons at the
> beginning of any line. As a consequence, a valid CIF file may not, in
> general, be provided as a multi-line value of another valid CIF (thus
> we may refer to this problem as the "cif-in-cif problem"). The problem
> was briefly mentioned as "theoretical" in last year's DDLm group
> discussions
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
> in my experience, it surfaces as a lurking bug possibility each time
> we print out a multi-line CIF value. Although the need to have
> "nested" CIFs is marginal, a general purpose CIF processor that
> obtains text values from sources other than the parsed syntactically
> correct CIFs has no good way of dealing with it -- such values are not
> guaranteed to be free of semicolons at the beginnings of the lines,
> and when such a value is encountered, there is no versatile algorithm
> that would permit representation of such a value in CIF (any
> modifications such as prepending of whitespace or refolding of lines
> can, in general, break the semantics of the value).
> The newly proposed triple-quoted text fields (delimited with either
> """ or ''' sequences) solve the problem for semicolon-starting text
> lines, at a cost of introducing yet another two kinds of delimited
> strings. The "cif-in-cif" problem still remains, however, since a
> value that contains all of the delimiters (newlines, quotes, <eol>;,
> ''' and """) still cannot be represented as a value in any kind of
> the quoted text fields.
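
The limitation described in point 1 can be illustrated with a minimal Python sketch. The function name usable_delimiters is hypothetical, and the checks assume the rules exactly as stated above: a semicolon field ends at the first <eol>; and a triple-quoted field ends at the first occurrence of its own delimiter, with no escape mechanism. Edge cases, such as a value ending in a quote character, are ignored.

```python
def usable_delimiters(value: str) -> list[str]:
    """List the multi-line delimiters that could hold `value` verbatim.

    Assumption: no escaping exists, so a delimiter is unusable as soon as
    its terminating sequence occurs anywhere inside the value.
    """
    usable = []
    # A semicolon text field ends at the first <eol>; sequence.
    if "\n;" not in value:
        usable.append(";")
    # A triple-quoted field ends at the first repeat of its delimiter.
    for quote in ('"""', "'''"):
        if quote not in value:
            usable.append(quote)
    return usable


# A value containing every terminator cannot be written with any of them.
print(usable_delimiters("text\n;more\n'''\n\"\"\""))
```

An empty result is exactly the "cif-in-cif" case: none of the three delimiters can carry the value unchanged.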
> 2. The 'cif-in-cif' problem might be solved in a general way by using
> a "prefixed text field syntax":
> a) a special starting sequence,
> <eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>
> would signal that all lines in this text field are prefixed with a
> <text-field-prefix>. Here
> <text-field-prefix> ::= {<OrdinaryChar> | <space>}+
> <optional-trailing-whitespace> ::= <space>*
> <space> ::= SP | HT
> Each line of such a text field then MUST start with the specified text
> field prefix. Neither the starting sequence nor the prefix belongs
> to the value; both should be removed by a prefix-aware parser before
> returning the value.
> For example, a CIF sample can be included into a text like this:
> data_providing_example
> _example
> ;CIF>\
> CIF>data_example
> CIF>_text
> CIF>;This is an embedded multiline value
> CIF>;
> ; # here the field terminates.
> Even more readable would be a blank prefix:
> data_providing_example
> _example
> ; \
>  data_example
>  _text
>  ;This is an embedded multiline value
>  ;
> ; # here the field terminates.
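
A minimal Python sketch of how a prefix-aware parser might decode the two examples above. The name strip_prefix is hypothetical. It assumes that `raw` holds the field content between the opening ";" and the closing <eol>; (both excluded), and it accepts any prefix free of backslashes and newlines, which is looser than the <OrdinaryChar> | <space> grammar given in the proposal.

```python
import re

# Starting sequence: <prefix> \ <optional spaces or tabs> <eol>.
# Assumption: the prefix is any run of characters other than backslash
# and newline (the proposal restricts it to <OrdinaryChar> and <space>).
_START = re.compile(r"\A([^\\\n]+)\\[ \t]*\n")


def strip_prefix(raw: str) -> str:
    """Return the value of a text field, removing a prefix if one is declared."""
    match = _START.match(raw)
    if match is None:
        return raw  # not a prefixed field: return the content unchanged
    prefix = match.group(1)
    lines = raw[match.end():].split("\n")
    for line in lines:
        if not line.startswith(prefix):
            raise ValueError(f"line does not start with prefix {prefix!r}")
    return "\n".join(line[len(prefix):] for line in lines)


raw = "CIF>\\\nCIF>data_example\nCIF>_text\nCIF>;This is an embedded multiline value\nCIF>;"
print(strip_prefix(raw))
```

Raising an error on an unprefixed line is one possible reading of the MUST in the proposal; a more lenient parser could instead fall back to returning the raw content.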
> I see numerous advantages of such a scheme:
> a) it solves the "arbitrary value" a.k.a "CIF-in-CIF" problem once and
> forever;
> b) it is simple to describe and to follow;
> c) it is simple to implement in parsers: a parser, after obtaining a
> multi-line text field value, would match for a starting sequence and,
> if it is found, remove both the starting sequence and a prefix
> obtained from it. In Perl it can be done as follows:
> if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) {
>     my $prefix = $1;
>     $text =~ s/^\Q$prefix\E\\[ \t]*\n//;   # drop the starting sequence
>     $text =~ s/^\Q$prefix\E//mg;           # strip the prefix from each line
> }
> d) it is easy to implement in value printers: a printer, recognising a
> multi-line string with "problematic" characters (<eol>; and friends),
> would print a starting sequence, then prepend each printed line
> with a self-selected prefix, and then terminate the field with a
> semicolon in the regular way:
> my $prefix = " ";
> print ";${prefix}\\\n";
> # each element of @text_lines is assumed to end with "\n"
> print map { $prefix . $_ } @text_lines;
> print ";\n";
> Implementations in other garbage-collected languages (Python, Java)
> should be equally straightforward, and for manually-allocating
> languages (Fortran, C) a simple pair of subroutines would convert
> between the prefixed and non-prefixed text forms (regexps are not
> strictly necessary for the implementation);
> c) the proposal is backwards-compatible with the plain CIF1.x
> parsers. Parsers that are not aware of the prefixing convention would
> simply read and pass on the whole prefixed value. If such a value is
> printed out without modification, the encapsulated information is
> preserved correctly.
> d) It is compatible with the current CIF1.x line folding notation, if
> we first fold the lines, and then prefix them. In fact, it may be
> viewed as an extension of the line folding convention. The prefix
> would be added before the trailing backslash:
> _long_text
> ;PFX>\\
> PFX>long and folded\
> PFX>prefixed line
> PFX>;non-folded line
> ;
> The parsing procedure would be the opposite: first unprefix (using the
> algorithm in item c above) and then unfold in the usual way.
> e) The method results in both machine- and human-readable CIFs, with
> minimal additional markup if desired:
> _example
> ; \
>  # As an example, we provide a full, syntactically correct
>  # CIF for your convenience
>  data_I
>  _text
>  ;
>   The nested values can be nicely indented using spaces or tabs
>  ;
>  _example # nested :)
>  ; \
>   ;Nesting the nested values is straightforward and unambiguous.
>   ;
>  ;
> ;
> f) since the ";something\" sequence is seldom if ever used in current CIFs,
> practically all existing CIFs retain their original semantics under
> the new convention. Line-folded text fields remain easy to recognise
> because they have no prefix: they begin with a bare ";\".
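
The printer side of the scheme can be sketched in Python along the same lines as the Perl fragment in the list above. The name add_prefix and its default prefix are hypothetical, line folding is not handled, and the returned string is assumed to be written at the start of a line.

```python
def add_prefix(value: str, prefix: str = " ") -> str:
    """Format `value` as a prefixed text field, from opening to closing ';'.

    Assumptions: the prefix is non-empty, contains no backslash or newline,
    and does not start with ';' (otherwise a body line could end the field).
    """
    if not prefix or "\\" in prefix or "\n" in prefix or prefix.startswith(";"):
        raise ValueError(f"unsuitable prefix: {prefix!r}")
    # Every line of the value, including empty ones, receives the prefix.
    body = "\n".join(prefix + line for line in value.split("\n"))
    # Starting sequence, prefixed body, then the ordinary terminator.
    return ";" + prefix + "\\\n" + body + "\n;"


embedded = "data_example\n_text\n;This is an embedded multiline value\n;"
print(add_prefix(embedded, "CIF>"))
```

A printer could restrict this form to values that contain an <eol>; sequence and write all other values as ordinary text fields.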
> 3. A final note: I suggest permitting trailing whitespace at the end
> of the starting sequence:
> _text
> ;PFX>\ 
> PFX>The previous line has extra spaces at the end,
> PFX>but we usually do not see them in text editors.
> ;
> Such trailing space is difficult to spot for humans, and does not harm
> computers. It should be removed together with the starting
> sequence by a parser. In this way we would eliminate a potential source of
> upsetting errors.
> 4. It is interesting to note that a similar problem exists in other
> formats as well; e.g. XML CDATA value may not contain a terminating
> ]]> sequence. The same solution might apply to XML CDATA as well:
> PREFIX: Another example of CDATA can be embedded with a prefix:
> PREFIX: Anything goes here!
> PREFIX:  Anything that goes here can be prefixed as well
> PREFIX: ]]>
> ]]>
> or, optionally, even nicer (note that the specified prefix is just a
> space character):
> <![CDATA [
>  Another example of CDATA can be embedded with a prefix:
>  Anything goes here!
>  <![CDATA[
>   Anything that goes here can be prefixed as well
>  ]]>
> ]]>
> Obviously, the same technique can be used for mmCIF as well.
> 5. If the prefixed text fields are implemented, arbitrary values can
> be represented in CIFs at least as conveniently as can text fields in the
> current CIF1.1 format. Thus, there is strictly speaking no need for the
> """/''' strings, and one could simplify CIF2.x by omitting them
> altogether. However, the proposed method is orthogonal to the """/'''
> string format, and thus both can be implemented simultaneously if
> necessary.
> Sincerely,
> Saulius
> -- 
> Dr. Saulius Gražulis
> Institute of Biotechnology, Graiciuno 8
> LT-02241 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
> mobile: (+370-684)-49802, (+370-614)-36366