
Re: [ddlm-group] Alternative proposal for eliding

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Alternative proposal for eliding
From: SIMON WESTRIP <[email protected]>
Date: Wed, 8 Jun 2011 12:40:28 +0100 (BST)
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]>

"(Rule one) The <eol> at the end of the first line of all <eol><semicolon> delimited values does not form part of the data value."

If I understand this correctly, this is a considerable departure from CIF1 syntax, e.g. the value of

_publ_section_experimental ; a,b,c,\a,\b,\c were determined from powder patterns ;

is

a,b,c,\a,\b,\c weredetermined from powder patterns

i.e. first line has been folded in CIF2?

My thoughts on the prefix proposal in general are that it seems to be worth exploring. In CIF1 we have used similar 'switches' to indicate how a datavalue should be parsed, i.e.

<semicolon>\<eol> for line folding
<semicolon>%T<eol> for tex content

These are rarely used conventions, but perhaps provide a precedent for the introduction of formal mechanisms along these lines.
Obviously, the trick is finding an unambiguous switch that respects the legacy of CIF1...

Simon
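
For illustration only, the 'switch' idea can be sketched in Python as a check on the text between the opening semicolon and the first <eol> of a text field. The function name classify_first_line and the labels it returns are hypothetical. The prefix case follows the start sequence described in Saulius's proposal quoted below; tolerating trailing whitespace on the folding and %T switches is an assumption, and the combined prefix-plus-folding form is not handled.

```python
def classify_first_line(first_line: str) -> str:
    """Guess which convention the first line of a text field signals.

    `first_line` is the text after the opening semicolon, up to but not
    including the first <eol>. The labels are hypothetical names.
    """
    # Assumption: trailing spaces and tabs after the switch are ignored.
    stripped = first_line.rstrip(" \t")
    if stripped == "\\":
        return "fold"    # <semicolon>\<eol>: line folding
    if stripped == "%T":
        return "tex"     # <semicolon>%T<eol>: tex content
    if len(stripped) > 1 and stripped.endswith("\\"):
        return "prefix"  # <semicolon><prefix>\<eol>: proposed prefix switch
    return "plain"       # ordinary text, no switch


if __name__ == "__main__":
    for line in ["\\", "%T", "CIF>\\", "a,b,c,\\a,\\b,\\c"]:
        print(repr(line), "->", classify_first_line(line))
```

A legacy first line that merely happens to end in a backslash would still be reported as "prefix" here, so a real reader would need a further check (for example, that every following line starts with the candidate prefix) before acting on the result.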

From: James Hester <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Wednesday, 8 June, 2011 4:01:19
Subject: Re: [ddlm-group] Alternative proposal for eliding

I agree that misreading of a legacy file without incurring a parsing error is practically impossible.

We should, however, make it possible in CIF2 to present multiline values containing a backslash before the first <eol> without risking a parsing error on read when this <backslash> is misunderstood as a prefix flag.

I suggest the following rule be added to the Grazulis proposal:
(Rule one) The <eol> at the end of the first line of all <eol><semicolon> delimited values does not form part of the data value.

This works as follows: when encoding a datavalue inside an <eol><semicolon> delimited string, a simple output routine would always insert an <eol> immediately after the <semicolon>, unless it wishes to use the prefix and/or line folding conventions. On reading an <eol><semicolon> string, this first <eol> is always discarded.
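
A minimal Python sketch of Rule one, for discussion only. The function names write_text_field and read_text_field are hypothetical, a "field" here means the text from the opening semicolon to the closing semicolon inclusive, and line folding and prefixes are deliberately left out.

```python
def write_text_field(value: str) -> str:
    """Encode `value` as a semicolon-delimited field under Rule one.

    The returned string starts at the opening semicolon; the caller is
    assumed to have written the <eol> that precedes it.
    """
    # A line starting with ';' would terminate the field early.
    if value.startswith(";") or "\n;" in value:
        raise ValueError("value has a line starting with ';'")
    # Always emit an <eol> straight after the opening semicolon.
    return ";\n" + value + "\n;"


def read_text_field(field: str) -> str:
    """Decode a field under Rule one: the first <eol> is discarded."""
    if not (field.startswith(";") and field.endswith("\n;")):
        raise ValueError("not a semicolon-delimited text field")
    body = field[1:-2]  # drop the opening ';' and the closing <eol>';'
    first_line, _, rest = body.partition("\n")
    return first_line + rest  # join without the first <eol>


if __name__ == "__main__":
    value = "first line\nsecond line"
    assert read_text_field(write_text_field(value)) == value
    # A field whose text starts on the semicolon line loses that line break:
    print(read_text_field(";first line\nsecond line\n;"))
```

The round trip is lossless for fields produced by this writer. A field whose text begins on the same line as the opening semicolon is read with its first two lines joined, which is the behaviour Rule one asks for.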

On Wed, Jun 8, 2011 at 6:17 AM, David Brown <[email protected]> wrote:

The only place I see a possible problem is with a heritage CIF with the following sequence:

_publ_section_experimental ; a,b,c,\a,\b,\c were determined from powder patterns :

Since this has already been written, there is no problem with a CIF writer. A CIF reader would expect to find:

_publ_section_experimental ; a,b,c,\a,\b,\c a,b,c,were determined from powder patterns :

and strip off the a,b,c, but if the supposed prefix is not present, the parser would presumably recognize this as a CIF1 file and ignore the supposed prefix. To really screw up the parser one would need:

_publ_section_experimental ; a,b,c,\a,\b,\c were determined from powder patterns but accurate values of a,b,c, were determined from single crystals ;

This is not very likely.
David

James Hester wrote:

Dear DDLm-group,

Saulius Grazulis has submitted an alternative proposal for representing arbitrary strings in CIF2. This proposal has grown out of his own concern around the inability of CIF to represent arbitrary strings, so I view this as further confirmation that a solution is needed. In any case, please comment on the proposal given below.

James.

===========================
(below is from Saulius Grazulis)

1. In the current CIF specification, the only way to specify a multi-line text value is to use a semicolon (';') delimited text field. Since such a field is terminated by the first semicolon at the beginning of a CIF line, the value may not contain semicolons at the beginning of any line. As a consequence, a valid CIF file may not, in general, be provided as a multi-line value of another valid CIF (thus we can refer to this problem as the "cif-in-cif problem"). The problem was briefly mentioned as "theoretical" in last year's DDLm group discussions (http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00839.html, http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00843.html); but in my experience, it surfaces as a lurking bug possibility each time we print out a multi-line CIF value.

Although the need to have "nested" CIFs is marginal, a general purpose CIF processor that obtains text values from sources other than parsed, syntactically correct CIFs has no good way of dealing with it -- such values are not guaranteed to be free of semicolons at the beginnings of lines, and when such a value is encountered, there is no versatile algorithm that would permit representation of such a value in CIF (any modifications such as prepending of whitespace or refolding of lines can, in general, break the semantics of the value).

The newly proposed triple-quoted text fields (delimited with either """ or ''' sequences) solve the problem for semicolon-starting text lines, at the cost of introducing yet another two kinds of delimited strings.
The "cif-in-cif" problem still remains, however, since a value that contains all of the delimiters (newlines, quotes, <eol>;, ''' and """) still cannot be represented as a value in any kind of the quoted text fields.

2. The 'cif-in-cif' problem might be solved in a general way by using a "prefixed text field syntax":

a) a special starting sequence,

    <eol>;<text-field-prefix>\<optional-trailing-whitespace><eol>

would signal that all lines in this text field are prefixed with a <text-field-prefix>. Here

    <text-field-prefix> ::= {<OrdinaryChar> | <space>}+
    <optional-trailing-whitespace> ::= <space>*
    <space> ::= SP | HT

Each line of such a text field then MUST start with the specified text field prefix. Both the starting sequence and the prefix do not belong to the value and should be removed by a prefix-aware parser before returning the value.

For example, a CIF sample can be included into a text like this:

data_providing_example
_example
;CIF>\
CIF>data_example
CIF>_text
CIF>;This is an embedded multiline value
CIF>;
; # here the field terminates.

Even more readable would be a blank prefix:

data_providing_example
_example
; \
 data_example
 _text
 ;This is an embedded multiline value
 ;
; # here the field terminates.

I see numerous advantages of such a scheme:

a) it solves the "arbitrary value" a.k.a. "CIF-in-CIF" problem once and forever;

b) it is simple to describe and to follow;

c) it is simple to implement in parsers: a parser, after obtaining a multi-line text field value, would match for a starting sequence and, if it is found, remove both the starting sequence and a prefix obtained from it.
In Perl it can be done:

    # Note: this pattern only recognises prefixes made of word characters
    # and whitespace that end in '>'.
    if( $text =~ /^([\w\s]+>)\\\n/ ) {
        my $prefix = $1;
        # \Q...\E keeps any regex metacharacters in the prefix literal.
        $text =~ s/^\Q${prefix}\E\\\n//;
        $text =~ s/^\Q${prefix}\E//mg;
    }

d) it is easy to implement in value printers: a printer, recognising a multi-line string with "problematic" characters (<eol>; and friends), would print a starting sequence, then prepend each printed line with a self-selected prefix, and then terminate the field with a semicolon in the regular way:

    my $prefix = " ";
    print ";${prefix}\\\n";
    # assumes each element of @text_lines already ends with a newline
    print map { $prefix . $_ } @text_lines;
    print ";\n";

Implementations in other garbage-collected languages (Python, Java) should be equally straightforward, and for manually-allocating languages (Fortran, C) a simple pair of subroutines would convert between the prefixed and non-prefixed text forms (regexps are not strictly necessary for the implementation);

e) the proposal is backwards-compatible with plain CIF1.x parsers. Parsers that are not aware of the prefixing convention would simply read and pass on the whole prefixed value. If such a value is printed out without modification, the encapsulated information is preserved correctly.

f) It is compatible with the current CIF1.x line folding notation, if we first fold the lines, and then prefix them. In fact, it may be viewed as an extension of the line folding convention. The prefix would be added before the trailing backslash:

_long_text
;PFX>\\
PFX>long and folded\
PFX>prefixed line
PFX>;non-folded line
;

The parsing procedure would be the opposite: first unprefix (using the algorithm in c) and then unfold in the usual way.

g) The method results in both machine- and human-readable CIFs, with minimal additional markup if desired:

_example
; \
 # As an example, we provide a full, syntactically correct
 # CIF for your convenience
 data_I
 _text
 ; The nested values can be nicely indented using spaces or tabs
 ;
 _example # nested :)
 ; \
  ;Nesting the nested values is straightforward and unambiguous.
  ;
 ;
;

h) since the ";something\" is seldom if ever used in current CIFs, practically all existing CIFs retain their original semantics under the new convention. The line-folding CIFs are recognised easily by not having a prefix sequence, just ";\" at the beginning of the text field.

3. A final note: I suggest permitting trailing whitespace at the end of the starting sequence:

_text
;PFX>\   
PFX>The previous line has extra spaces at the end,
PFX>but we usually do not see them in text editors.
;

Such trailing space is difficult for humans to spot, and does not harm computers. It should be removed together with the starting sequence by a parser. In this way we would eliminate a potential source of upsetting errors.

4. It is interesting to note that a similar problem exists in other formats as well; e.g. an XML CDATA value may not contain a terminating ]]> sequence. The same solution might apply to XML CDATA as well:

<![CDATA PREFIX: [
PREFIX: Another example of CDATA can be embedded with a prefix:
PREFIX: Anything goes here!
PREFIX: <![CDATA[
PREFIX: Anything that goes here can be prefixed as well
PREFIX: ]]>
]]>

or, optionally, even nicer (note that the specified prefix is just a space character):

<![CDATA [
 Another example of CDATA can be embedded with a prefix:
 Anything goes here!
 <![CDATA[
 Anything that goes here can be prefixed as well
 ]]>
]]>

Obviously, the same technique can be used for mmCIF as well.

5. If the prefixed text fields are implemented, arbitrary values can be represented in CIFs at least as conveniently as text fields can in the current CIF1.1 format. Thus, there is strictly speaking no need for the """/''' strings, and one could simplify CIF2.x by omitting them altogether. However, the proposed method is orthogonal to the """/''' string format, and thus both can be implemented simultaneously if necessary.

Sincerely,
Saulius

--
Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
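
The prefixing and unprefixing steps of the proposal can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the names add_prefix and unprefix are hypothetical, the prefix is simply taken to be any run of characters other than backslash and newline (the <OrdinaryChar> rule is not checked), the text passed in is the content between the opening semicolon and the closing <eol>;, and returning the text untouched when some line lacks the prefix is an assumed fallback for legacy values, in the spirit of David Brown's remark.

```python
import re

# Start sequence: <prefix> "\" optional spaces/tabs <eol>.
# Assumption: the prefix contains no backslash and no newline.
_START = re.compile(r"^(?P<prefix>[^\n\\]+)\\[ \t]*\n")


def add_prefix(value: str, prefix: str = " ") -> str:
    """Return `value` in prefixed form, without the delimiting semicolons."""
    if not prefix or "\\" in prefix or "\n" in prefix or prefix.startswith(";"):
        raise ValueError("unsuitable prefix")
    lines = value.split("\n")
    # Start sequence first, then every line of the value with the prefix.
    return prefix + "\\\n" + "\n".join(prefix + line for line in lines)


def unprefix(text: str) -> str:
    """Strip the start sequence and the per-line prefix, if present."""
    match = _START.match(text)
    if match is None:
        return text  # no start sequence: ordinary text field
    prefix = match.group("prefix")
    lines = text[match.end():].split("\n")
    if not all(line.startswith(prefix) for line in lines):
        return text  # assumed fallback: treat as an unprefixed legacy value
    return "\n".join(line[len(prefix):] for line in lines)


if __name__ == "__main__":
    embedded = "data_example\n_text\n;This is an embedded multiline value\n;"
    wrapped = add_prefix(embedded, "CIF>")
    print(";" + wrapped + "\n;")
    assert unprefix(wrapped) == embedded
```

With a prefix that does not start with a semicolon, no line of the wrapped text can begin with one, so the wrapped text can sit inside an ordinary semicolon-delimited field.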

_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

