Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

One quick question re:

>_name   ‘\’Be gone\’’      # better as “’Be gone’”

>The parser must return the string \’Be gone\’, that is it does not handle any of the elide characters.
>This is the responsibility of the downstream application.

I would have expected the parser to return 'Be gone' in this case?
i.e. the elide should be recognized as escaping a nested ' when within ' ... ',
otherwise ‘\’Be gone\’’ is not the same as “’Be gone’”

‘\’Be gone\’’ --> 'Be gone'   - parser recognizes the elides
"'Be gone'"   --> 'Be gone'
"\'Be gone\'" --> \'Be gone\'   - parser ignores elides as not relevant
'\\'Be gone\\'' --> \'Be gone\'  - parser ignores \\ but not \'
"\\'Be gone\\'" --> \\'Be gone\\'  - parser ignores elides

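The behaviour suggested in the table above can be written down as a minimal Python sketch. It assumes straight ASCII quotes, a token already delimited by the lexer, and a hypothetical helper name `unquote`; it illustrates the reading proposed here, not an agreed CIF2 rule. A reverse solidus is only consumed when it directly precedes the enclosing delimiter; every other reverse solidus is passed through unchanged.

```python
def unquote(token: str) -> str:
    """Strip the enclosing quotes from a single-line quoted token and
    resolve elides of the enclosing delimiter only (hypothetical helper)."""
    if len(token) < 2 or token[0] not in ("'", '"') or token[-1] != token[0]:
        raise ValueError("not a quote-delimited token")
    delim = token[0]
    body = token[1:-1]
    # A reverse solidus followed by the enclosing delimiter becomes the
    # delimiter itself; any other reverse solidus is left untouched.
    return body.replace("\\" + delim, delim)


print(unquote(r"'\'Be gone\''"))   # 'Be gone'
print(unquote(r'"\'Be gone\'"'))   # \'Be gone\'
```

Applied to the remaining rows of the table, the same function gives the results listed there.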


From: Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Thursday, 15 October, 2009 17:22:43
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Ok. I have formalised in my head the difference between whitespace as a part of a token, versus its presence as a separator.

I have copied out two threads from a paper I am drafting to propose changes.

Restricted character set.
The adoption of the compound data structures described in THREAD 2 necessitates a restriction on the character set that can be used for string types. Namely the token delimiter characters and token separator characters cannot be included in a non-quote delimited string.
(1) A non-quote delimited string can be comprised of the printable characters, excluding any of the ASCII characters " ' , : { }. The first character of the string cannot be an ASCII _ or ASCII $ and the string cannot exactly match any of the reserved keywords of STAR (loop_ global_ save_[.]* stop_ data_[.]*).
(2) For consistency, any of " ' , : { } are excluded from strings that form data names.
We further propose that
(3) a single-quote delimited string may not contain a single quote unless it is elided by ASCII reverse solidus (\).
(4) a double-quote delimited string may not contain a double quote unless it is elided by ASCII reverse solidus (\).
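Rule (1) for non-quote delimited strings can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the name `is_valid_bare_string` is hypothetical, "printable" is approximated by `str.isprintable()` with whitespace excluded (whitespace being a separator), the keyword match is case-sensitive, and `save_[.]*` and `data_[.]*` are read as "the prefix followed by anything".

```python
import re

# Token delimiter and separator characters excluded by rule (1).
FORBIDDEN = set("\"',:{}")

# Assumed reading of the reserved keywords of STAR.
RESERVED = re.compile(r"loop_|global_|stop_|save_.*|data_.*")


def is_valid_bare_string(s: str) -> bool:
    """Return True if s may appear as a non-quote delimited string."""
    if not s:
        return False
    if s[0] in "_$":               # first character may not be _ or $
        return False
    if RESERVED.fullmatch(s):      # must not exactly match a keyword
        return False
    return all(
        ch.isprintable() and not ch.isspace() and ch not in FORBIDDEN
        for ch in s
    )
```

For instance, `blue` and `x_y` pass, while `_name`, `loop_`, `data_x` and `a,b` are rejected.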
The reverse solidus syntax instructs the lexer that the immediately following character (provided it is allowed in the character set) is NOT to be interpreted as a token delimiter. For example
_name   ‘\’Be gone\’’      # better as “’Be gone’”

The parser must return the string \’Be gone\’, that is, it does not handle any of the elide characters. This is the responsibility of the downstream application.
The following example shows an illegal use of the reverse solidus;
_name    “Be gone \
they said”
A NEWLINE character in the double (or single) quoted strings is illegal.
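Rules (3) and (4), together with the NEWLINE restriction, can be illustrated with a small scanner. This is a sketch only, assuming straight ASCII quotes and a hypothetical function name `scan_quoted`. It returns the text between the delimiters with the elide characters left in place, as proposed above, plus the index just past the closing delimiter.

```python
from typing import Tuple


def scan_quoted(text: str, start: int = 0) -> Tuple[str, int]:
    """Scan one single- or double-quoted string beginning at text[start].

    Returns (raw_body, index_after_closing_delimiter). Reverse solidus
    characters are kept in raw_body; they only stop the following
    character from being treated as the closing delimiter.
    """
    delim = text[start]
    if delim not in ("'", '"'):
        raise ValueError("no opening quote at start position")
    escaped = False
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == "\n":
            # A NEWLINE is illegal in these strings, elided or not.
            raise ValueError("newline inside a quoted string")
        if escaped:
            escaped = False        # this character was elided
        elif ch == "\\":
            escaped = True         # the next character is not a delimiter
        elif ch == delim:
            return text[start + 1:i], i + 1
    raise ValueError("unterminated quoted string")
```

Scanning the `_name` value in the first example yields the body with both reverse solidus characters intact, while the second example raises an error at the line break.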

Terminating tokens.
The adoption of the proposals in THREAD 3 ensures that the delimited values are initiated and terminated by a single instance of the token character (digram in the case of semi-colon delimited strings and trigram for triple quote delimited strings). This removes the (unnecessary) requirement that the token character MUST be preceded by whitespace at initiation and followed by whitespace on termination.
However, an appropriate separator is required between tokens to unambiguously parse a CIF2 document. The appropriate separator is defined by the context in which it is used. For example, at the highest level, whitespace serves this purpose. In a List object the ASCII , serves this purpose. In the Associative Array object, the separators are ASCII : and ASCII ,. The absence of a separator or use of the incorrect separator will give rise to ambiguity and possible error. The coercion rules for these cases need to be argued by the “community”.

Consider the following coercion rules for when a separator is not present.

(1) Always generate an error message and die (we might be able to do better)
(2) Attempt to guess what is intended.

Example (at the zero level)

_name “butted “”strings”

Adopt the C/Python rule which returns “butted strings” as the lexeme. Splitting them doesn’t make sense because there is one data name that can have one data value. I would create an illegal STAR/CIF by splitting them.

Now this might be different

loop_ _name “butted “”strings”

Here I would argue that we should split into two data values. It will be a correct structure in the STAR/CIF sense and is the explicit enforcement of the token termination rules, even though the separator rule is violated.
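The two coercions for butted strings can be sketched in Python as follows. The function names and the `in_loop` flag are hypothetical, straight ASCII quotes are assumed, and elides are ignored to keep the example short.

```python
from typing import List


def split_butted(run: str) -> List[str]:
    """Split a run of quoted strings with no separators between them
    into the list of their contents."""
    values = []
    i = 0
    while i < len(run):
        delim = run[i]
        if delim not in ("'", '"'):
            raise ValueError("expected an opening quote")
        end = run.find(delim, i + 1)
        if end == -1:
            raise ValueError("unterminated quoted string")
        values.append(run[i + 1:end])
        i = end + 1
    return values


def coerce_butted(run: str, in_loop: bool) -> List[str]:
    """Outside a loop, concatenate (the C/Python rule); inside a loop,
    keep the pieces as separate data values."""
    values = split_butted(run)
    return values if in_loop else ["".join(values)]


print(coerce_butted('"butted ""strings"', in_loop=False))  # ['butted strings']
print(coerce_butted('"butted ""strings"', in_loop=True))   # ['butted ', 'strings']
```

The context decides the outcome: the same characters give one value for a single data name and two values inside a loop.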

For Brian’s examples in loops

loop_  _colour   'red'blue'green'     #  'red'     blue      'green'
loop_  _colour   'red' blue'green'    #  'red'     blue      'green'
loop_  _colour   'red'blue 'green'    #  'red'     blue      'green'
loop_  _colour   'red'''blue'green'   #  'red' ''  blue      'green'

These 4 (under the above rule) agree with what is intended.

loop_  _colour   'red''''blue'''green #  'red'    '''blue'''  green

This one does also, because in my lexer (and everyone should do this and Herb agrees) the triple quote rules have priority over any single-character quote rules.
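The splitting of these loop examples can be reproduced with a short tokenizer. This is a minimal Python sketch rather than the lexer referred to above: the name `tokenize_values` is hypothetical, elides are not handled, and a non-quoted token is assumed to end at whitespace or at any quote character, as the restricted character set implies. Triple quotes are tested before single quotes.

```python
from typing import List


def tokenize_values(text: str) -> List[str]:
    """Split text into tokens, tolerating missing whitespace separators."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith(("'''", '"""'), i):
            # Triple-quote rule has priority over the single-quote rule.
            delim = text[i:i + 3]
            end = text.find(delim, i + 3)
            if end == -1:
                raise ValueError("unterminated triple-quoted string")
            tokens.append(text[i:end + 3])
            i = end + 3
        elif ch in "'\"":
            end = text.find(ch, i + 1)
            if end == -1:
                raise ValueError("unterminated quoted string")
            tokens.append(text[i:end + 1])
            i = end + 1
        else:
            # Non-quoted token: runs until whitespace or a quote character.
            j = i
            while j < n and not text[j].isspace() and text[j] not in "'\"":
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


print(tokenize_values("'red'''blue'green'"))
# ["'red'", "''", 'blue', "'green'"]
print(tokenize_values("'red''''blue'''green"))
# ["'red'", "'''blue'''", 'green']
```

Because data names cannot contain quote characters either, the same loop also separates a data name from a value butted against it.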

Then Brian’s other examples. Given the above coercion rules and the restricted character set of data names, these would be
loop_  _colour'red' 'green' 'blue'            # loop_  _colour 'red' 'green' 'blue' [stop_] # added for clarity
loop_  _colour 'red' 'green' 'blue'_name Fred # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
loop_  _colour 'red''green''blue'_name Fred   # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred  
loop_  _colour 'red''green''blue' _name Fred  # Ditto
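The implied stop_ in these examples amounts to ending the loop at the first data name that follows the loop values. A minimal Python sketch, assuming the tokens after loop_ have already been separated and using a hypothetical function name `split_loop`:

```python
from typing import List, Tuple


def split_loop(tokens: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split the tokens following loop_ into (data names, values, rest).

    The loop ends at the first data name seen after the values, which is
    where the implied stop_ would sit. Quoted values start with a quote
    character, so they are never mistaken for data names.
    """
    i = 0
    names = []
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i])
        i += 1
    values = []
    while i < len(tokens) and not tokens[i].startswith("_"):
        values.append(tokens[i])
        i += 1
    return names, values, tokens[i:]


names, values, rest = split_loop(
    ["_colour", "'red'", "'green'", "'blue'", "_name", "Fred"]
)
print(names, values, rest)
# ['_colour'] ["'red'", "'green'", "'blue'"] ['_name', 'Fred']
```

The remaining tokens are then parsed as ordinary items outside the loop.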

Another coercion rule. The separator for lists is the comma. What if that is given as a space?

_name {{1 2 3}   # newlines mean nothing, so inserted for clarity/typesetting.
       {4 5 6}
       {7 8 9}}

We suggest this is a 3x3 matrix (which you would know from the dictionary anyway) and it should be coerced into

_name {{1,2,3},   # newlines mean nothing, so inserted for clarity/typesetting.

This is consistent with the loop rule above where we split. Similar rule for all lists.
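The list coercion can be sketched as a small recursive parser that accepts either a comma or whitespace between items and then writes the list back out with commas. This is an illustrative Python sketch only: `parse_list` and `to_cif` are hypothetical names, items are kept as plain strings, quoted items are not handled, and repeated separators are silently tolerated.

```python
def parse_list(text: str) -> list:
    """Parse a brace-delimited (possibly nested) list whose items are
    separated by commas, whitespace, or both."""
    if not text.startswith("{"):
        raise ValueError("a list must start with {")

    def parse(i: int):
        i += 1                      # step over the opening brace
        items = []
        while True:
            # Skip any mixture of whitespace and commas between items.
            while i < len(text) and (text[i].isspace() or text[i] == ","):
                i += 1
            if i >= len(text):
                raise ValueError("unterminated list")
            if text[i] == "}":
                return items, i + 1
            if text[i] == "{":
                item, i = parse(i)  # nested list
            else:
                j = i
                while j < len(text) and not text[j].isspace() and text[j] not in ",{}":
                    j += 1
                item, i = text[i:j], j
            items.append(item)

    items, _ = parse(0)
    return items


def to_cif(items: list) -> str:
    """Write a parsed list back out using the comma separator."""
    return "{" + ",".join(
        to_cif(x) if isinstance(x, list) else x for x in items
    ) + "}"


matrix = parse_list("{{1 2 3}\n       {4 5 6}\n       {7 8 9}}")
print(to_cif(matrix))  # {{1,2,3},{4,5,6},{7,8,9}}
```

Whether a real parser should accept this silently or emit a warning is part of the coercion question raised above.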



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
