Re: [ddlm-group] Alternative proposal for eliding
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] Alternative proposal for eliding
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <[email protected]>
This would certainly be a worthy suggestion to consider in a
CIF1 context.
For CIF2, my own preference would be to solve this problem by adopting
the full Python syntax and semantics for treble-quoted strings.
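
As a rough illustration of what Python semantics would allow here (a sketch only; the helper name to_triple_quoted is hypothetical and not part of any CIF2 draft): with backslash escapes available, a value containing line-initial semicolons and both triple-quote sequences can still be written as one treble-quoted literal and read back unchanged.

```python
import ast

def to_triple_quoted(value):
    """Write `value` as a Python triple-quoted literal (hypothetical helper)."""
    # Escape backslashes first, then every double quote, so the value can
    # never close the literal early or end on a dangling backslash.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"""' + escaped + '"""'

# A value with a line-initial semicolon, both triple-quote sequences
# and a trailing backslash.
value = "data_example\n_text\n;a line starting with a semicolon\n;\nboth ''' and \"\"\" appear here\\"

literal = to_triple_quoted(value)
print(literal)

# Reading the literal back with Python's own rules recovers the value.
assert ast.literal_eval(literal) == value
```

Carriage returns and other control characters would need escapes of their own; the sketch does not cover them.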
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Tue, 7 Jun 2011, James Hester wrote:
> Dear DDLm-group,
>
> Saulius Grazulis has submitted an alternative proposal for representing
> arbitrary strings in CIF2. This proposal has grown out of his own concern
> around the inability of CIF to represent arbitrary strings, so I view this
> as further confirmation that a solution is needed. In any case, please
> comment on the proposal given below.
>
> James.
>
> ===========================
> (below is from Saulius Grazulis)
>
> 1. In the current CIF specification, the only way to specify a
> multi-line text value is to use a semicolon (';') delimited text
> field. Since such a field is terminated by the first semicolon at the
> beginning of a CIF line, the value may not contain semicolons at the
> beginning of any line. As a consequence, a valid CIF file may not be,
> in general, provided as a multi-line value of another valid CIF (thus
> we may refer to this problem as the "cif-in-cif problem"). The problem
> was briefly mentioned as "theoretical" in last year's DDLm group
> discussions
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
> in my experience, it surfaces as a lurking bug possibility each time
> we print out a multi-line CIF value. Although the need to have
> "nested" CIFs is marginal, a general purpose CIF processor that
> obtains text values from sources other than parsed, syntactically
> correct CIFs has no good way of dealing with it -- such values are not
> guaranteed to be free of semicolons at the beginnings of lines,
> and when such a value is encountered, there is no versatile algorithm
> that would permit representation of such a value in CIF (any
> modifications such as prepending whitespace or refolding lines
> can, in general, break the semantics of the value).
> The newly proposed triple-quoted text fields (delimited with either
> """ or ''' sequences) solve the problem for semicolon-starting text
> lines, at the cost of introducing yet another two kinds of delimited
> strings. The "cif-in-cif" problem still remains, however, since a
> value that contains all of the delimiters (newlines, quotes, <eol>;,
> ''' and """) still cannot be represented as a value in any kind of
> the quoted text fields.
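
The point about a value that contains every delimiter can be made concrete with a small check. This is only a sketch under deliberately simplified rules (an assumption, not the CIF grammar): a delimiter is treated as unusable as soon as the value contains it. The function name usable_delimiters is hypothetical.

```python
def usable_delimiters(value):
    """List the delimiters that could enclose `value` verbatim.

    Simplified rules (an assumption, not the real CIF grammar):
    a delimiter is ruled out as soon as the value contains it.
    """
    usable = []
    if "\n" not in value:
        # Single-line quoted strings cannot span lines.
        usable += [q for q in ("'", '"') if q not in value]
    if "\n;" not in value:
        # A semicolon field ends at the first line-initial semicolon.
        usable.append(";")
    # Triple-quoted fields end at the first matching triple quote.
    usable += [q for q in ("'''", '"""') if q not in value]
    return usable

nested_cif = "data_x\n_text\n;\nsome text\n;\n"
worst_case = nested_cif + "''' and \"\"\"\n"

print(usable_delimiters(nested_cif))   # only the triple-quoted forms remain
print(usable_delimiters(worst_case))   # nothing is left
```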
> 2. The 'cif-in-cif' problem might be solved in a general way by using
> a "prefixed text field syntax":
>
> a) a special starting sequence,
> <eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>
> would signal that all lines in this text field are prefixed with a
> <text-field-prefix>. Here
>
> <text-field-prefix> ::= {<OrdinaryChar> | <space>}+
> <optional-trailing-whitespace> ::= <space>*
> <space> ::= SP | HT
> Each line of such a text field then MUST start with the specified text
> field prefix. Neither the starting sequence nor the prefix belongs
> to the value; both should be removed by a prefix-aware parser before
> returning the value.
> For example, a CIF sample can be included in a text field like this:
>
> data_providing_example
> _example
> ;CIF>\
> CIF>data_example
> CIF>_text
> CIF>;This is an embedded multiline value
> CIF>;
> ; # here the field terminates.
> Even more readable would be a blank prefix:
>
> data_providing_example
> _example
> ; \
>  data_example
>  _text
>  ;This is an embedded multiline value
>  ;
> ; # here the field terminates.
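
A minimal sketch of how such a field could be produced, assuming the value is held as a string with "\n" line separators. The function name write_prefixed_field and the sanity checks on the prefix are assumptions for illustration, not part of the proposal.

```python
def write_prefixed_field(value, prefix="CIF>"):
    """Render `value` as a semicolon-delimited text field with prefixed lines."""
    # Assumed sanity checks: the grammar requires a non-empty prefix, and a
    # prefix starting with ';' would reintroduce the original problem.
    if not prefix or "\n" in prefix or prefix.startswith(";"):
        raise ValueError("unsuitable prefix: %r" % prefix)
    lines = [";" + prefix + "\\"]                           # starting sequence
    lines += [prefix + line for line in value.split("\n")]  # prefixed body
    lines.append(";")                                       # regular terminator
    return "\n".join(lines) + "\n"

embedded = "data_example\n_text\n;This is an embedded multiline value\n;"
print(write_prefixed_field(embedded), end="")
```

Called with prefix=" " instead, the same function gives the blank-prefix form.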
> I see numerous advantages of such scheme:
> a) it solves the "arbitrary value" a.k.a. "CIF-in-CIF" problem once and
> for all;
> b) it is simple to describe and to follow;
> c) it is simple to implement in parsers: a parser, after obtaining a
> multi-line text field value, would match for a starting sequence and,
> if it is found, remove both the starting sequence and the prefix
> obtained from it. In Perl this can be done as follows:
> if( $text =~ /^([\w\s]+>)\\\n/ ) {
> my $prefix = $1;
> # \Q...\E keeps any regex metacharacters in the prefix literal
> $text =~ s/^\Q${prefix}\E\\\n//;
> $text =~ s/^\Q${prefix}\E//mg;
> }
> d) it is easy to implement in value printers: a printer, recognising a
> multi-line string with "problematic" characters (<eol>; and friends),
> would print a starting sequence, then prepend each printed line
> with a self-selected prefix, and then terminate the field with a
> semicolon in the regular way:
> my $prefix = " ";
> print ";${prefix}\\\n";
> print map { $prefix . $_ } @text_lines;
> print ";\n";
> implementations in other garbage-collected languages (Python, Java)
> should be equally straightforward, and for manually-allocating
> languages (Fortran, C) a simple pair of subroutines would convert
> between the prefixed and non-prefixed text forms (regexps are not
> strictly necessary for the implementation);
>
> c) the proposal is backwards-compatible with plain CIF1.x
> parsers. Parsers that are not aware of the prefixing convention would
> simply read and pass on the whole prefixed value. If such a value is
> printed out without modification, the encapsulated information is
> preserved correctly.
> d) It is compatible with the current CIF1.x line folding notation, if
> we first fold the lines, and then prefix them. In fact, it may be
> viewed as an extension of the line folding convention. The prefix
> would be added before the trailing backslash:
> _long_text
> ;PFX>\\
> PFX>long and folded\
> PFX>prefixed line
> PFX>;non-folded line
> ;
> The parsing procedure would be the opposite: first unprefix (using the
> parsing algorithm given above) and then unfold in the usual way.
> e) The method results in both machine- and human-readable CIFs, with
> minimal additional markup if desired:
> _example
> ; \
>  # As an example, we provide a full, syntactically correct
>  # CIF for your convenience
>  data_I
>  _text
>  ;
>  The nested values can be nicely indented using spaces or tabs
>  ;
>  _example # nested :)
>  ; \
>   ;Nesting the nested values is straightforward and unambiguous.
>   ;
>  ;
> ;
> f) since ";something\" is seldom if ever used in current CIFs,
> practically all existing CIFs retain their original semantics under
> the new convention. Line-folded text fields are easily recognised by the
> absence of a prefix: they start with ";\" at the beginning of the text field.
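
The parser side can be sketched in Python along the lines of the Perl fragment above, without regular expressions. The function name strip_prefix is hypothetical, and leaving the text untouched when some line lacks the prefix is an assumed leniency (the proposal says each line MUST carry it, so a strict parser might raise an error instead). The sketch does not attempt the combination with line folding.

```python
def strip_prefix(text):
    """Remove the starting sequence and line prefix from a text field value.

    `text` is the raw field content, i.e. everything between the opening
    ';' and the terminating <eol>;.
    """
    first, sep, rest = text.partition("\n")
    head = first.rstrip(" \t")          # tolerate trailing spaces and tabs
    # A lone backslash is the CIF1.1 line-folding marker, not a prefix.
    if not sep or len(head) < 2 or not head.endswith("\\"):
        return text
    prefix = head[:-1]
    lines = rest.split("\n")
    if not all(line.startswith(prefix) for line in lines):
        return text                     # assumed leniency, see lead-in
    return "\n".join(line[len(prefix):] for line in lines)

field = "CIF>\\\nCIF>data_example\nCIF>_text\nCIF>;This is an embedded multiline value\nCIF>;"
print(strip_prefix(field))
```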
> 3. A final note: I suggest permitting trailing whitespace at the end
> of the starting sequence:
> _text
> ;PFX>\
> PFX>The previous line has extra spaces at the end,
> PFX>but we usually do not see them in text editors.
> ;
> Such trailing space is difficult for humans to spot, and does not harm
> computers. It should be removed together with the starting
> sequence by a parser. In this way we would eliminate a potential source of
> upsetting errors.
> 4. It is interesting to note that a similar problem exists in other
> formats as well; e.g. an XML CDATA value may not contain the terminating
> ]]> sequence. The same solution might apply to XML CDATA as well:
> <![CDATA PREFIX: [
> PREFIX: Another example of CDATA can be embedded with a prefix:
> PREFIX: Anything goes here!
> PREFIX: <![CDATA[
> PREFIX: Anything that goes here can be prefixed as well
> PREFIX: ]]>
> ]]>
> or, optionally, even nicer (note that the specified prefix is just a
> space character):
> <![CDATA [
>  Another example of CDATA can be embedded with a prefix:
>  Anything goes here!
>  <![CDATA[
>  Anything that goes here can be prefixed as well
>  ]]>
> ]]>
> Obviously, the same technique can be used for mmCIF as well.
> 5. If the prefixed text fields are implemented, arbitrary values can
> be represented in CIFs at least as conveniently as text fields can in the
> current CIF1.1 format. Thus, there is strictly speaking no need for the
> """/''' strings, and one could simplify CIF2.x by omitting them
> altogether. However, the proposed method is orthogonal to the """/'''
> string format, and thus both can be implemented simultaneously if
> necessary.
> Sincerely,
> Saulius
>
> --
> Dr. Saulius Gražulis
> Institute of Biotechnology, Graiciuno 8
> LT-02241 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
> mobile: (+370-684)-49802, (+370-614)-36366
>
>
>

