Re: [ddlm-group] Alternative proposal for eliding
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Alternative proposal for eliding
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 7 Jun 2011 05:48:34 -0400 (EDT)
- In-Reply-To: <BANLkTim3-cXPknJHipPKGCqyZ_wp52s9JA@mail.gmail.com>
- References: <BANLkTim3-cXPknJHipPKGCqyZ_wp52s9JA@mail.gmail.com>
This would certainly be a worthy suggestion to consider in a CIF1
context. For CIF2, my own preference would be to solve this problem by
adopting the full Python syntax and semantics for treble-quoted strings.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Tue, 7 Jun 2011, James Hester wrote:

> Dear DDLm-group,
>
> Saulius Grazulis has submitted an alternative proposal for representing
> arbitrary strings in CIF2. This proposal has grown out of his own concern
> around the inability of CIF to represent arbitrary strings, so I view this
> as further confirmation that a solution is needed. In any case, please
> comment on the proposal given below.
>
> James.
>
> ===========================
> (below is from Saulius Grazulis)
>
> 1. In the current CIF specification, the only way to specify a
> multi-line text value is to use a semicolon (';') delimited text
> field. Since such a field is terminated by the first semicolon at the
> beginning of a CIF line, the value may not contain semicolons at the
> beginning of any line. As a consequence, a valid CIF file may not be,
> in general, provided as a multi-line value of another valid CIF (thus
> we can refer to this problem as the "cif-in-cif problem"). The problem
> was briefly mentioned as "theoretical" in last year's DDLm group
> discussions
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html,
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but
> in my experience, it surfaces as a lurking bug possibility each time
> we print out a multi-line CIF value.
>
> Although the need to have "nested" CIFs is marginal, a general purpose
> CIF processor that obtains text values from sources other than parsed,
> syntactically correct CIFs has no good way of dealing with it -- such
> values are not guaranteed to be free of semicolons at the beginnings
> of lines, and when such a value is encountered, there is no versatile
> algorithm that would permit representation of such a value in CIF (any
> modifications such as prepending of whitespace or refolding of lines
> can, in general, break the semantics of the value).
>
> The newly proposed triple-quoted text fields (delimited with either
> """ or ''' sequences) solve the problem for semicolon-starting text
> lines, at the cost of introducing yet another two kinds of delimited
> strings. The "cif-in-cif" problem still remains, however, since a
> value that contains all of the delimiters (newlines, quotes, <eol>;,
> ''' and """) still cannot be represented as a value in any kind of
> quoted text field.
>
> 2. The "cif-in-cif" problem might be solved in a general way by using
> a "prefixed text field syntax":
>
> a) a special starting sequence,
>
> <eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>
>
> would signal that all lines in this text field are prefixed with a
> <text-field-prefix>. Here
>
> <text-field-prefix> ::= {<OrdinaryChar> | <space>}+
> <optional-trailing-whitespace> ::= <space>*
> <space> ::= SP | HT
>
> Each line of such a text field then MUST start with the specified text
> field prefix. Neither the starting sequence nor the prefix belongs
> to the value; both should be removed by a prefix-aware parser before
> returning the value.
>
> For example, a CIF sample can be included into a text like this:
>
> data_providing_example
> _example
> ;CIF>\
> CIF>data_example
> CIF>_text
> CIF>;This is an embedded multiline value
> CIF>;
> ; # here the field terminates.
>
> Even more readable would be a blank prefix:
>
> data_providing_example
> _example
> ; \
>  data_example
>  _text
>  ;This is an embedded multiline value
>  ;
> ; # here the field terminates.
>
> I see numerous advantages of such a scheme:
>
> a) it solves the "arbitrary value" a.k.a. "CIF-in-CIF" problem once and
> forever;
>
> b) it is simple to describe and to follow;
>
> c) it is simple to implement in parsers: a parser, after obtaining a
> multi-line text field value, would match for a starting sequence and,
> if it is found, remove both the starting sequence and the prefix
> obtained from it. In Perl it can be done as follows:
>
>     if( $text =~ /^([\w\s]+>)\\\n/ ) {
>         my $prefix = $1;
>         $text =~ s/^${prefix}\\\n//;
>         $text =~ s/^${prefix}//mg;
>     }
>
> d) it is easy to implement in value printers: a printer, recognising a
> multi-line string with "problematic" characters (<eol>; and friends),
> would print a starting sequence, then prepend each printed line
> with a self-selected prefix, and then terminate the field with a
> semicolon in the regular way:
>
>     my $prefix = " ";
>     print ";${prefix}\\\n";
>     print map { $prefix . $_ } @text_lines;
>     print ";\n";
>
> Implementations in other garbage-collected languages (Python, Java)
> should be equally straightforward, and for manually-allocating
> languages (Fortran, C) a simple pair of subroutines would convert
> between the prefixed and non-prefixed text forms (regexps are not
> strictly necessary for the implementation);
>
> e) the proposal is backwards-compatible with plain CIF1.x
> parsers. Parsers that are not aware of the prefixing convention would
> simply read and pass on the whole prefixed value. If such a value is
> printed out without modification, the encapsulated information is
> preserved correctly.
>
> f) It is compatible with the current CIF1.x line folding notation, if
> we first fold the lines, and then prefix them. In fact, it may be
> viewed as an extension of the line folding convention. The prefix
> would be added before the trailing backslash:
>
> _long_text
> ;PFX>\\
> PFX>long and folded\
> PFX>prefixed line
> PFX>;non-folded line
> ;
>
> The parsing procedure would be the opposite: first unprefix (using the
> algorithm in c) and then unfold in the usual way.
>
> g) The method results in both machine- and human-readable CIFs, with
> minimal additional markup if desired:
>
> _example
> ; \
>  # As an example, we provide a full, syntactically correct
>  # CIF for your convenience
>  data_I
>  _text
>  ;
>   The nested values can be nicely indented using spaces or tabs
>  ;
>  _example # nested :)
>  ; \
>   ;Nesting the nested values is straightforward and unambiguous.
>   ;
>  ;
> ;
>
> h) since ";something\" is seldom if ever used in current CIFs,
> practically all existing CIFs retain their original semantics under
> the new convention. Line-folded CIFs are easily recognised because
> they have no prefix sequence: the text field begins with just ";\".
>
> 3. A final note: I suggest permitting trailing whitespace at the end
> of the starting sequence:
>
> _text
> ;PFX>\   
> PFX>The previous line has extra spaces at the end,
> PFX>but we usually do not see them in text editors.
> ;
>
> Such trailing space is difficult to spot for humans, and does not harm
> computers. It should be removed together with the starting
> sequence by a parser. In this way we would eliminate a potential source of
> upsetting errors.
>
> 4. It is interesting to note that a similar problem exists in other
> formats as well; e.g. an XML CDATA value may not contain a terminating
> ]]> sequence. The same solution might apply to XML CDATA as well:
>
> <![CDATA PREFIX: [
> PREFIX: Another example of CDATA can be embedded with a prefix:
> PREFIX: Anything goes here!
> PREFIX: <![CDATA[
> PREFIX: Anything that goes here can be prefixed as well
> PREFIX: ]]>
> ]]>
>
> or, optionally, even nicer (note that the specified prefix is just a
> space character):
>
> <![CDATA [
>  Another example of CDATA can be embedded with a prefix:
>  Anything goes here!
>  <![CDATA[
>  Anything that goes here can be prefixed as well
>  ]]>
> ]]>
>
> Obviously, the same technique can be used for mmCIF as well.
>
> 5. If the prefixed text fields are implemented, arbitrary values can
> be represented in CIFs at least as conveniently as text fields can in the
> current CIF1.1 format. Thus, there is strictly speaking no need for the
> """/''' strings, and one could simplify CIF2.x by omitting them
> altogether. However, the proposed method is orthogonal to the """/'''
> string format, and thus both can be implemented simultaneously if
> necessary.
>
> Sincerely,
> Saulius
>
> --
> Dr. Saulius Gražulis
> Institute of Biotechnology, Graiciuno 8
> LT-02241 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
> mobile: (+370-684)-49802, (+370-614)-36366
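For readers who would rather see the quoted proposal in Python than in
Perl, the following is a minimal sketch of the prefixing and unprefixing
steps, not a reference implementation. The function names add_prefix and
unprefix are hypothetical. It assumes the argument is the text between
the opening ";" and the terminating "<eol>;", that the prefix contains no
backslash, and that line folding is handled separately; none of these
assumptions comes from the proposal itself.

```python
import re

# Starting sequence: <prefix>\<optional trailing SP/HT><eol>.
# Assumption of this sketch: the prefix itself contains no backslash.
_START = re.compile(r"([^\\\n]+)\\[ \t]*\n")


def add_prefix(value: str, prefix: str = " ") -> str:
    """Return the prefixed body of a text field for an arbitrary value.

    The caller still writes the opening ';' before the result and the
    terminating '<eol>;' after it.
    """
    if not prefix or "\\" in prefix or "\n" in prefix or prefix[0] == ";":
        raise ValueError("unsuitable prefix for this sketch")
    # Every line of the value, including empty ones, gets the prefix,
    # so no line of the body can begin with ';'.
    body = "\n".join(prefix + line for line in value.split("\n"))
    return prefix + "\\\n" + body


def unprefix(field: str) -> str:
    """Undo add_prefix; fields without a starting sequence pass through."""
    match = _START.match(field)
    if match is None:
        return field  # ordinary (unprefixed) text field
    prefix = match.group(1)
    lines = field[match.end():].split("\n")
    if not all(line.startswith(prefix) for line in lines):
        raise ValueError("a line does not start with the declared prefix")
    return "\n".join(line[len(prefix):] for line in lines)


if __name__ == "__main__":
    embedded = "data_example\n_text\n;This is an embedded multiline value\n;"
    body = add_prefix(embedded, "CIF>")
    print(";" + body + "\n;")          # a complete prefixed text field
    assert unprefix(body) == embedded  # round trip restores the value
```

Raising an error when a line lacks the declared prefix is one possible
design choice; the proposal only says that each line MUST start with the
prefix and does not say how a parser should react when one does not.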
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group