
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.


On 16/10/09 2:45 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:

One quick question re:

>_name   ‘\’Be gone\’’      # better as “’Be gone’”

>The parser must return the string \’Be gone\’, that is it does not handle any of the elide characters. >This is the responsibility of the downstream application.

I would have expected the parser to return 'Be gone' in this case?

Why? There are a number of reasons why it would be difficult. We don’t interpret the elides because we don’t know what algorithm to use. Brian’s archive is littered with \n in strings for the Greek letter nu, where the standard algorithm would insert a single-byte NEWLINE character. Too many elides exist in strings for us to know what to do, unless you want to adopt a C/Python convention, and then we would break all the IUCr typesetting.

i.e. the elide should be recognized as escaping a
nested ' when within ' ... ' ,
otherwise ‘\’Be gone\’’ is not the same as “’Be gone’”
e.g.

The handling is left up to the downstream application. I know this seems strange, but the discipline decides what the elides mean and then defines the behaviour. The ONLY behaviour defined at the syntactic level is that whatever follows the elide is literal and is NOT considered as a delimiter character.

‘\’Be gone\’’ --> 'Be gone'   - parser recognizes the elides
"'Be gone'"   --> 'Be gone'
"\'Be gone\'" --> \'Be gone\'   - parser ignores elides as not relevant

Interesting you should argue this is ‘Be gone’, which is the C/Python interpretation.

'\\'Be gone\\'' --> \'Be gone\'  - parser ignores \\ but not \'

This is not correct. It doesn’t parse even with Python. In our suggested coercion it would be 4 string values -  \\ - Be - gone\\ - ‘’
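
For anyone who wants to check the C/Python reading of these examples, here is a small sketch using Python's own literal parser. The helper name python_reading is made up for illustration; it shows only what Python does with such text, not what a CIF2 parser should do.

```python
import ast

def python_reading(source_text):
    """Evaluate source_text as a Python string literal.

    Returns the resulting string, or None when Python cannot parse the text.
    """
    try:
        return ast.literal_eval(source_text)
    except (SyntaxError, ValueError):
        return None

# Python consumes the elides and keeps the quotes.
print(python_reading(r"'\'Be gone\''"))    # -> 'Be gone'
# The doubled reverse solidus closes the string early, so Python rejects it.
print(python_reading(r"'\\'Be gone\\''"))  # -> None
```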

"\\'Be gone\\'" --> \\'Be gone\\'  - parser ignores elides

Again this is not consistent? When do I strip the elides and when do I leave them?

The elide interpretation and stripping in C/Python is a consequence of typing/working in their execution environment. If you actually just read strings from a file, no manipulation is done. We’re meeting that halfway. As we read, the elides help us avoid early token termination, but otherwise the string is the unaltered value.
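
A minimal sketch of that behaviour, assuming ASCII quote characters and a hypothetical helper called read_quoted (neither the name nor the return shape comes from the proposal). The elide only stops the following character from terminating the token; the returned value is the text between the delimiters, unaltered.

```python
def read_quoted(text, start=0):
    """Read one quote-delimited string beginning at text[start].

    Returns (raw_value, index_after_closing_quote). Elides are kept in
    the value; they only protect the next character from being a delimiter.
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("expected a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            raise ValueError("NEWLINE inside a quoted string is illegal")
        if ch == "\\" and i + 1 < len(text) and text[i + 1] != "\n":
            i += 2  # step over the elide and the character it protects
            continue
        if ch == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")

value, _ = read_quoted(r"'\'Be gone\''")
print(value)  # -> \'Be gone\'
```

With the doubled reverse solidus, read_quoted(r"'\\'Be gone\\''") stops after the first two characters of content and returns \\ as the value, which is the first of the four values mentioned above.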

Cheers

Simon


From:
Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Thursday, 15 October, 2009 17:22:43
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Ok. I have formalised in my head the difference between whitespace as a part of a token, versus its presence as a separator.

I have copied out two threads from a paper I am drafting to propose changes.

---------------------------------------------
THREAD 3 (SYNTAX)
Restricted character set.
The adoption of the compound data structures described in THREAD 2 necessitates a restriction on the character set that can be used for string types. Namely the token delimiter characters and token separator characters cannot be included in a non-quote delimited string.
 
(1) A non-quote delimited string can be comprised of the printable characters, excluding any of the ASCII characters " ' , : { }. The first character of the string cannot be an ASCII _ or ASCII $ and the string cannot exactly match any of the reserved keywords of STAR (loop_ global_ save_[.]* stop_ data_[.]*).
(2) For consistency, any of " ' , : { } are excluded from strings that form data names.
 
We further propose that
 
(3) a single-quote delimited string may not contain a single quote unless it is elided by ASCII reverse solidus (\).
(4) a double-quote delimited string may not contain a double quote unless it is elided by ASCII reverse solidus (\).
 
The reverse solidus syntax instructs the lexer that the immediately following character (provided it is allowed in the character set) is NOT to be interpreted as a token delimiter. For example
 
_name   ‘\’Be gone\’’      # better as “’Be gone’”

The parser must return the string \’Be gone\’, that is, it does not handle any of the elide characters. This is the responsibility of the downstream application.
The following example shows an illegal use of the reverse solidus:
_name    “Be gone \
they said”
A NEWLINE character in the double (or single) quoted strings is illegal.
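
Rule (1) above can be sketched as a simple check on a candidate non-quote delimited string. This is only an illustration under stated assumptions: the name is_valid_bare_string is hypothetical, whitespace is rejected because it is a separator, the keywords are matched case-insensitively, and save_[.]* and data_[.]* are read as "save_ or data_ followed by anything".

```python
import re

# Token delimiter and separator characters excluded by rule (1).
FORBIDDEN = set("\"',:{}")

# Reserved STAR keywords; case-insensitive matching is an assumption here.
RESERVED = re.compile(r"loop_|global_|stop_|save_.*|data_.*",
                      re.IGNORECASE | re.DOTALL)

def is_valid_bare_string(s):
    """Return True if s is acceptable as a non-quote delimited string."""
    if not s:
        return False
    if s[0] in "_$":
        return False  # would be read as a data name or a frame reference
    if RESERVED.fullmatch(s):
        return False
    return all(ch.isprintable() and not ch.isspace() and ch not in FORBIDDEN
               for ch in s)

print(is_valid_bare_string("blue"))       # -> True
print(is_valid_bare_string("red,green"))  # -> False
print(is_valid_bare_string("data_xyz"))   # -> False
```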

THREAD 4 (SYNTAX)
Terminating tokens.
The adoption of the proposals in THREAD 3 ensures that the delimited values are initiated and terminated by a single instance of the token character (digram in the case of semi-colon delimited strings and trigram for triple quote delimited strings). This removes the (unnecessary) requirement that a token character MUST be preceded by a whitespace at initiation and followed by a whitespace on termination.
 
However, an appropriate separator is required between tokens to unambiguously parse a CIF2 document. The appropriate separator is defined by the context in which it is used. For example, at the highest level, a whitespace serves this purpose. In a List object the ASCII , serves this purpose. In the Associative Array object, the separators are ASCII : and ASCII ,. The absence of a separator or use of the incorrect separator will give rise to ambiguity and possible error. The coercion rules for these cases need to be argued by the “community”.
------------------------------------------------------

Consider the following coercion rules for when a separator is not present.

(1) Always generate an error message and die (we might be able to do better)
(2) Attempt to guess what is intended.

Example (at the zero level)

_name “butted “”strings”

Adopt the C/Python rule which returns “butted strings” as the lexeme. Splitting them doesn’t make sense because there is one data name that can have one data value. I would create an illegal STAR/CIF by splitting them.

Now this might be different

loop_ _name “butted “”strings”

Here I would argue that we should split into two data values. It will be a correct structure in the STAR/CIF sense and is the explicit enforcement of the token termination rules, even though the separator rule is violated.
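
The two cases above differ only in context, which a sketch can make explicit. The function name coerce_butted and its in_loop flag are hypothetical; the input is assumed to be the list of string values the lexer has already found butted together without a separator.

```python
def coerce_butted(values, in_loop):
    """Coerce butted string values according to where they appear.

    Outside a loop a data name takes one value, so the pieces are joined
    (the C/Python rule). Inside a loop each piece is kept as its own value.
    """
    if in_loop:
        return list(values)
    return ["".join(values)]

print(coerce_butted(["butted ", "strings"], in_loop=False))  # -> ['butted strings']
print(coerce_butted(["butted ", "strings"], in_loop=True))   # -> ['butted ', 'strings']
```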

For Brian’s examples in loops

                                                 INTENDED
loop_  _colour   'red'blue'green'     #  'red'     blue      'green'
loop_  _colour   'red' blue'green'    #  'red'     blue      'green'
loop_  _colour   'red'blue 'green'    #  'red'     blue      'green'
loop_  _colour   'red'''blue'green'   #  'red' ''  blue      'green'

These 4 (under the above rule) agree with what is intended.

loop_  _colour   'red''''blue'''green #  'red'    '''blue'''  green

This one does also, because in my lexer (and everyone should do this, and Herb agrees) the triple quote rules have priority over any single character quote rules.
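
Here is a rough sketch of a lexer that reproduces the splits above. It is not my actual lexer: the name tokenize_values is made up, only ASCII quotes are handled, semi-colon delimited strings and the NEWLINE check are left out, and the values are returned without their delimiters. The triple quote test comes first so that it has priority.

```python
def tokenize_values(text):
    """Split text into values, with or without whitespace between tokens."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith(("'''", '"""'), i):
            # Triple quote rules take priority over single character quotes.
            delim = text[i:i + 3]
            end = text.find(delim, i + 3)
            if end < 0:
                raise ValueError("unterminated triple-quoted string")
            tokens.append(text[i + 3:end])
            i = end + 3
        elif ch in "'\"":
            j = i + 1
            while j < n and text[j] != ch:
                # An elide makes the next character literal: step over both.
                j += 2 if text[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated quoted string")
            tokens.append(text[i + 1:j])
            i = j + 1
        else:
            # A non-delimited string ends at whitespace or at any quote.
            j = i
            while j < n and not text[j].isspace() and text[j] not in "'\"":
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens

print(tokenize_values("'red'blue'green'"))      # -> ['red', 'blue', 'green']
print(tokenize_values("'red'''blue'green'"))    # -> ['red', '', 'blue', 'green']
print(tokenize_values("'red''''blue'''green"))  # -> ['red', 'blue', 'green']
```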

Then Brian’s other examples. Given the above coercion rules and the restricted character set of data names, these would be
                                               INTERPRETED
loop_  _colour'red' 'green' 'blue'            # loop_  _colour 'red' 'green' 'blue' [stop_] # added for clarity
loop_  _colour 'red' 'green' 'blue'_name Fred # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
loop_  _colour 'red''green''blue'_name Fred   # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred  
loop_  _colour 'red''green''blue' _name Fred  # Ditto

Another coercion rule. The separator for lists is the comma. What if that is given as a space?


_name {{1 2 3}   # newlines mean nothing, so inserted for clarity/typesetting.
       {4 5 6}
       {7 8 9}}

We suggest this is a 3x3 matrix (which you would know from the dictionary anyway) and it should be coerced into

_name {{1,2,3},   # newlines mean nothing, so inserted for clarity/typesetting.
       {4,5,6},
       {7,8,9}}

This is consistent with the loop rule above where we split. Similar rule for all lists.
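
As a sketch of this list coercion, the following accepts either commas or whitespace between list items and writes the result back out with commas. The names parse_list and to_canonical are hypothetical, items are kept as plain strings, comments and quoted items are not handled, and repeated commas are silently accepted, which a real parser would probably want to flag.

```python
def parse_list(text):
    """Parse a brace-delimited list into nested Python lists of strings."""
    pos = 0

    def skip_separators():
        nonlocal pos
        while pos < len(text) and (text[pos].isspace() or text[pos] == ","):
            pos += 1

    def parse_item():
        nonlocal pos
        if text[pos] == "{":
            pos += 1
            items = []
            skip_separators()
            while pos < len(text) and text[pos] != "}":
                items.append(parse_item())
                skip_separators()
            if pos >= len(text):
                raise ValueError("unterminated list")
            pos += 1  # consume the closing brace
            return items
        start = pos
        while pos < len(text) and not text[pos].isspace() and text[pos] not in ",{}":
            pos += 1
        return text[start:pos]

    skip_separators()
    return parse_item()

def to_canonical(value):
    """Write a parsed value back out using the comma separator."""
    if isinstance(value, list):
        return "{" + ",".join(to_canonical(v) for v in value) + "}"
    return value

matrix = parse_list("{{1 2 3}\n {4 5 6}\n {7 8 9}}")
print(to_canonical(matrix))  # -> {{1,2,3},{4,5,6},{7,8,9}}
```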

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au




_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group

cheers

Nick