Discussion List Archives

Re: [ddlm-group] Alternative proposal for eliding

This would certainly be a worthy suggestion to consider in a
CIF1 context.

For CIF2, my own preference would be to solve this problem by adopting
the full Python syntax and semantics for treble-quoted strings.
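
As a minimal illustration of what those semantics provide (plain Python
behaviour only; how CIF2 would adopt them is not settled here): inside a
triple-quoted literal, a backslash escape lets the value contain the
delimiter itself, so a string holding both ''' and """ still has a literal
form. The variable names are illustrative.

```python
import ast

# A value holding both triple-quote delimiters and a line starting with ';'.
value = "first line\n;second line with ''' and \"\"\" inside"

# repr() emits a Python literal with whatever escapes are needed, and
# ast.literal_eval() reads such a literal back without executing code.
literal = repr(value)
round_tripped = ast.literal_eval(literal)
print(round_tripped == value)  # True

# The same idea by hand: the embedded ''' is escaped with backslashes,
# so it does not terminate the triple-quoted literal.
by_hand = '''a \'\'\' b """ c'''
print(by_hand)  # a ''' b """ c
```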

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 7 Jun 2011, James Hester wrote:

> Dear DDLm-group,
> Saulius Grazulis has submitted an alternative proposal for representing
> arbitrary strings in CIF2.  This proposal has grown out of his own concern
> around the inability of CIF to represent arbitrary strings, so I view this
> as further confirmation that a solution is needed.  In any case, please
> comment on the proposal given below.
> James.
> ===========================
> (below is from Saulius Grazulis)
> 1. In the current CIF specification, the only way to specify a
> multi-line text value is to use a semicolon (';') delimited text
> field. Since such a field is terminated by the first semicolon at the
> beginning of the CIF line, the value may not contain semicolons at the
> beginning of any line. As a consequence, a valid CIF file may not be,
> in general, provided as a multi-line value of another valid CIF (thus
> we may refer to this problem as the "cif-in-cif problem"). The problem
> was briefly mentioned as "theoretical" in last year's DDLm group
> discussions
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
> in my experience, it surfaces as a potential bug each time
> we print out a multi-line CIF value. Although the need to have
> "nested" CIFs is marginal, a general purpose CIF processor that
> obtains text values from sources other than parsed, syntactically
> correct CIFs has no good way of dealing with it -- such values are not
> guaranteed to be free of semicolons at the beginnings of lines,
> and when such a value is encountered, there is no versatile algorithm
> that would permit representation of such a value in CIF (any
> modifications such as prepending whitespace or refolding lines
> can, in general, break the semantics of the value).
> The newly proposed triple-quoted text fields (delimited with either
> """ or ''' sequences) solve the problem for semicolon-starting text
> lines, at a cost of introducing yet another two kinds of delimited
> strings. The "cif-in-cif" problem still remains, however, since a
> value that contains all of the delimiters (newlines, quotes, <eol>;,
> ''' and """) still cannot be represented as a value in any kind of
> quoted text field.
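
The limitation described in point 1 can be sketched in a few lines of
Python. This is a simplified model, not the CIF grammar: it assumes that
none of the three multi-line forms has an escape mechanism, and the function
name usable_delimiters is made up for the illustration.

```python
def usable_delimiters(value: str) -> list[str]:
    """Return the multi-line delimiters that can hold `value` verbatim.

    Simplified rules (assumptions, not the full CIF grammar):
      - a semicolon text field ends at the first <eol>; so the value
        must not contain a newline followed by a semicolon;
      - a triple-quoted field ends at the first occurrence of its own
        delimiter, so the value must not contain that delimiter.
    """
    usable = []
    if "\n;" not in value:
        usable.append(";")
    if "'''" not in value:
        usable.append("'''")
    if '"""' not in value:
        usable.append('"""')
    return usable


plain = "line one\nline two"
with_semicolon = "data_x\n_text\n;embedded\n;\n"
everything = "a\n;b ''' c \"\"\" d"

print(usable_delimiters(plain))           # [';', "'''", '"""']
print(usable_delimiters(with_semicolon))  # ["'''", '"""']
print(usable_delimiters(everything))      # []
```

The last call returns an empty list: under these assumptions a value
containing every delimiter has no verbatim representation.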
> 2. The 'cif-in-cif' problem might be solved in a general way by using
> a "prefixed text field syntax":
> a) a special starting sequence,
> <eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>
> would signal that all lines in this text field are prefixed with a
> <text-field-prefix>. Here
> <text-field-prefix> ::= {<OrdinaryChar> | <space>}+
> <optional-trailing-whitespace> ::= <space>*
> <space> ::= SP | HT
> Each line of such a text field then MUST start with the specified text
> field prefix. Neither the starting sequence nor the prefix belongs
> to the value; both should be removed by a prefix-aware parser before
> returning the value.
> For example, a CIF sample can be included in a text field like this:
> data_providing_example
> _example
> ;CIF>\
> CIF>data_example
> CIF>_text
> CIF>;This is an embedded multiline value
> CIF>;
> ; # here the field terminates.
> Even more readable would be a blank prefix:
> data_providing_example
> _example
> ; \
>  data_example
>  _text
>  ;This is an embedded multiline value
>  ;
> ; # here the field terminates.
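
A minimal Python sketch of how a prefix-aware reader might recover the
embedded value from the two examples above. It assumes the input is the
text-field content as a prefix-unaware parser would return it (everything
between the opening ';' and the closing <eol>;), and that the prefix
contains no backslash. The names START and unprefix are illustrative, and
corner cases such as an empty value are left out.

```python
import re

# Starting sequence: <prefix> backslash <optional spaces/tabs> <eol>.
# Assumption: the prefix is non-empty and contains no backslash.
START = re.compile(r"([^\\\n]+)\\[ \t]*\n")


def unprefix(raw: str) -> str:
    """Strip the starting sequence and the per-line prefix from `raw`.

    Fields that do not begin with a starting sequence (ordinary or
    line-folded text fields) are returned unchanged.
    """
    match = START.match(raw)
    if match is None:
        return raw
    prefix = match.group(1)
    body = raw[match.end():]
    stripped = []
    for line in body.split("\n"):
        # Every line of a prefixed field MUST carry the prefix.
        if not line.startswith(prefix):
            raise ValueError(f"line lacks prefix {prefix!r}: {line!r}")
        stripped.append(line[len(prefix):])
    return "\n".join(stripped)


raw = (
    "CIF>\\\n"
    "CIF>data_example\n"
    "CIF>_text\n"
    "CIF>;This is an embedded multiline value\n"
    "CIF>;"
)
print(unprefix(raw))
```

Raising an error on an unprefixed line is a design choice of this sketch;
the proposal only states that each line must start with the prefix.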
> I see numerous advantages of such a scheme:
> a) it solves the "arbitrary value" a.k.a. "CIF-in-CIF" problem once and
> for all;
> b) it is simple to describe and to follow;
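> c) it is simple to implement in parsers: a parser, after obtaining a
> multi-line text field value, would check for a starting sequence and,
> if one is found, remove both the starting sequence and the prefix
> obtained from it. In Perl this can be done as follows:
> # Detect the starting sequence: <prefix>\<optional whitespace><eol>
> if( $text =~ /^([^\\\n]+)\\[ \t]*\n/ ) {
>     my $prefix = quotemeta($1);         # match the prefix literally
>     $text =~ s/^${prefix}\\[ \t]*\n//;  # remove the starting sequence
>     $text =~ s/^${prefix}//mg;          # remove the prefix from each line
> }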
> d) it is easy to implement in value printers: a printer, recognising a
> multi-line string with "problematic" characters (<eol>; and friends),
> would print a starting sequence, then prepend a self-selected prefix
> to each printed line, and then terminate the field with a
> semicolon in the regular way:
> my $prefix = " ";
> print ";${prefix}\\\n";
> print map { $prefix . $_ } @text_lines;  # each line keeps its own "\n"
> print ";\n";
> implementations in other garbage-collected languages (Python, Java)
> should be equally straightforward, and for manually-allocating
> languages (Fortran, C) a simple pair of subroutines would convert
> between the prefixed and non-prefixed text forms (regexps are not
> strictly necessary for the implementation);
> c) the proposal is backwards-compatible with the plain CIF1.x
> parsers. Parsers that are not aware of the prefixing convention would
> simply read and pass on the whole prefixed value. If such a value is
> printed out without modification, the encapsulated information is
> preserved correctly.
> d) It is compatible with the current CIF1.x line folding notation, if
> we first fold the lines, and then prefix them. In fact, it may be
> viewed as an extension of the line folding convention. The prefix
> would be added before the trailing backslash:
> _long_text
> ;PFX>\\
> PFX>long and folded\
> PFX>prefixed line
> PFX>;non-folded line
> ;
> The parsing procedure would be the opposite: first unprefix (using the
> parser algorithm in c) above) and then unfold in the usual way.
> e) The method results in both machine- and human-readable CIFs, with
> minimal additional markup if desired:
> _example
> ; \
>  # As an example, we provide a full, syntactically correct
>  # CIF for your convenience
>  data_I
>  _text
>  ;
>   The nested values can be nicely indented using spaces or tabs
>  ;
>  _example # nested :)
>  ; \
>   ;Nesting the nested values is straightforward and unambiguous.
>   ;
>  ;
> ;
> f) since ";something\" is seldom, if ever, used in current CIFs,
> practically all existing CIFs retain their original semantics under
> the new convention. Line-folded text fields are easily recognised by
> the absence of a prefix: they begin with ";\" alone.
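
The printer side of the scheme can be sketched in Python along the same
lines as the Perl fragment in the list above. The function name
prefix_field and its default prefix are illustrative. The sketch assumes
the prefix has no backslash or newline and that the value uses plain
newline line ends.

```python
def prefix_field(value: str, prefix: str = " ") -> str:
    """Render `value` as a prefixed, semicolon-delimited text field.

    Assumptions: `prefix` is non-empty and contains neither a backslash
    nor a newline; lines of `value` are separated by newline characters.
    """
    if not prefix or "\\" in prefix or "\n" in prefix:
        raise ValueError("prefix must be non-empty, without backslash or newline")
    body = "\n".join(prefix + line for line in value.split("\n"))
    # Opening ';' plus starting sequence, prefixed lines, closing <eol>;
    return f";{prefix}\\\n{body}\n;"


inner = "data_example\n_text\n;This is an embedded multiline value\n;"

# Reproduces the "CIF>" example from the proposal.
print(prefix_field(inner, "CIF>"))

# Nesting: a prefixed field is itself just a value, so it can be
# prefixed again. No body line of the result starts with ';'.
nested = prefix_field(prefix_field(inner))
print(nested)
```

Because every body line begins with the prefix, the only line-initial
semicolons left are the opening and closing delimiters, which is what makes
nesting unambiguous.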
> 3. A final note: I suggest permitting trailing whitespace at the end
> of the starting sequence:
> _text
> ;PFX>\ 
> PFX>The previous line has extra spaces at the end,
> PFX>but we usually do not see them in text editors.
> ;
> Such trailing whitespace is difficult for humans to spot and does not harm
> computers. It should be removed together with the starting
> sequence by a parser. In this way we would eliminate a potential source of
> frustrating errors.
> 4. It is interesting to note that a similar problem exists in other
> formats as well; e.g. an XML CDATA value may not contain a terminating
> ]]> sequence. The same solution might apply to XML CDATA as well:
> PREFIX: Another example of CDATA can be embedded with a prefix:
> PREFIX: Anything goes here!
> PREFIX:  Anything that goes here can be prefixed as well
> PREFIX: ]]>
> ]]>
> or, optionally, even nicer (note that the specified prefix is just a
> space character):
> <![CDATA [
>  Another example of CDATA can be embedded with a prefix:
>  Anything goes here!
>  <![CDATA[
>   Anything that goes here can be prefixed as well
>  ]]>
> ]]>
> Obviously, the same technique can be used for mmCIF as well.
> 5. If the prefixed text fields are implemented, arbitrary values can
> be represented in CIFs at least as conveniently as can text fields in the
> current CIF1.1 format. Thus, there is, strictly speaking, no need for the
> """/''' strings, and one could simplify CIF2.x by omitting them
> altogether. However, the proposed method is orthogonal to the """/'''
> string format, and thus both can be implemented simultaneously if
> necessary.
> Sincerely,
> Saulius
> -- 
> Dr. Saulius Gražulis
> Institute of Biotechnology, Graiciuno 8
> LT-02241 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
> mobile: (+370-684)-49802, (+370-614)-36366