
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Thu, 15 Oct 2009 15:38:38 -0400 (EDT)
In-Reply-To: <C6FD975B.12107%nick@csse.uwa.edu.au>
References: <C6FD975B.12107%nick@csse.uwa.edu.au>

I agree with Nick.  There is no reason why APIs distributed with parsers 
cannot also include application support utilities that do 
reverse-solidus processing on strings for various purposes, such as 
Brian's type-setting codes, or that do line-folding.  But with so many 
conflicting approaches to reverse-solidus processing already in use 
with CIF, I don't know a good way to build full processing into the parser 
itself.
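
As a rough illustration of that split, here is a minimal Python sketch (not
code from any existing CIF parser).  The names read_quoted and
strip_quote_elides are made up for the example.  The lexer uses the reverse
solidus only to keep the next character from ending the token and returns
the string unaltered; a separate utility then applies one possible
convention, dropping the reverse solidus in front of the delimiting quote
and nothing else.

```python
def read_quoted(text, start):
    """Scan a quote-delimited string that begins at text[start].

    Returns (raw_value, next_index). A reverse solidus only stops the
    following character from terminating the token; it stays in raw_value.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("expected a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            raise ValueError("newline inside a quoted string")
        if ch == "\\" and i + 1 < len(text) and text[i + 1] != "\n":
            i += 2  # keep both characters, skip the delimiter check
            continue
        if ch == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def strip_quote_elides(raw, quote):
    """Hypothetical downstream utility: un-elide the given quote only."""
    return raw.replace("\\" + quote, quote)


raw, _ = read_quoted(r"'\'Be gone\''", 0)
print(raw)                           # \'Be gone\'
print(strip_quote_elides(raw, "'"))  # 'Be gone'
```

An application that wants IUCr typesetting codes or C-style escapes would
call a different utility on the same raw value.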

Regards,
   Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 16 Oct 2009, Nick Spadaccini wrote:

>
>
>
> On 16/10/09 2:45 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> One quick question re:
>>
>>>> _name   '\'Be gone\''      # better as "'Be gone'"
>>
>>>> The parser must return the string \'Be gone\', that is, it does not handle
>>>> any of the elide characters. This is the responsibility of the downstream
>>>> application.
>>
>> I would have expected the parser to return 'Be gone' in this case?
>>
> Why? There are a number of reasons why it would be difficult. We don't
> interpret the elides because we don't know what algorithm to use. Brian's
> archive is littered with \n in strings for the Greek letter nu; the standard
> algorithm would insert a single-byte NEWLINE character. Too many elides
> exist in strings for us to know what to do, unless you want to adopt a
> C/Python convention. Then we would break all the IUCr typesetting.
>>
>> i.e. the elide should be recognized as escaping a
>> nested ' when within ' ... ' ,
>> otherwise '\'Be gone\'' is not the same as "'Be gone'"
>> e.g.
>>
> The handling is left up to the downstream application. I know this seems
> strange but the discipline decides what the elides mean and then they define
> behaviour. The ONLY behaviour defined at the syntactic level is whatever
> follows the elide is literal and NOT in consideration as a delimiter
> character.
>>
>> '\'Be gone\'' --> 'Be gone'   - parser recognizes the elides
>> "'Be gone'"   --> 'Be gone'
>> "\'Be gone\'" --> \'Be gone\'   - parser ignores elides as not relevant
>>
> Interesting you should argue this is 'Be gone', which is the C/Python
> interpretation.
>>
>> '\\'Be gone\\'' --> \'Be gone\'  - parser ignores \\ but not \'
>>
> This is not correct. It doesn't parse even with Python. In our suggested
> coercion it would be 4 string values -  \\ - Be - gone\\ - ''
>>
>> "\\'Be gone\\'" --> \\'Be gone\\'  - parser ignores elides
>
> Again this is not consistent? When do I strip the elides and when do I leave
> them?
>
> The elide interpretation and stripping in C/Python is a consequence of
> typing/working in their execution environment. If you actually just read
> strings from a file no manipulation is done. We're meeting that half way. As
> we read, the elides help us avoid early token termination, but otherwise the
> string is the unaltered value.
>>
>> Cheers
>>
>> Simon
>>
>>
>> From: Nick Spadaccini <nick@csse.uwa.edu.au>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Thursday, 15 October, 2009 17:22:43
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Ok. I have formalised in my head the difference between whitespace as a part
>> of a token, versus its presence as a separator.
>>
>> I have copied out two threads from a paper I am drafting up for proposing
>> changes.
>>
>> ---------------------------------------------
>> THREAD 3 (SYNTAX)
>> Restricted character set.
>> The adoption of the compound data structures described in THREAD 2
>> necessitates a restriction on the character set that can be used for string
>> types. Namely the token delimiter characters and token separator characters
>> cannot be included in a non-quote delimited string.
>>
>> (1) A non-quote delimited string can be comprised of the printable characters,
>> excluding any of the ASCII characters, " ' , : { }. The first character of the
>> string cannot be an ASCII _  or ASCII $ and the string cannot exactly match
>> any of the reserved keywords of STAR (loop_ global_ save_[.]* stop_
>> data_[.]*).
>> (2) For consistency, any of " ' , : { } are excluded from strings that form
>> data names.
>>
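
Rule (1) above could be checked along these lines.  This is a sketch only:
is_bare_string is a hypothetical name, "printable" is assumed to mean ASCII
0x21 to 0x7E (so whitespace is out), and the keywords are assumed to match
case-insensitively.  The draft states neither assumption.

```python
import re

# Characters rule (1) excludes from a non-quote delimited string.
EXCLUDED = set("\"',:{}")

# Reserved STAR keywords; save_ and data_ may carry a trailing name.
# Case-insensitive matching is an assumption, not part of the quoted rule.
RESERVED = re.compile(r"loop_|global_|stop_|save_.*|data_.*", re.IGNORECASE)


def is_bare_string(token):
    """Return True if token may appear without quote delimiters."""
    if not token:
        return False
    if token[0] in "_$":
        return False
    for c in token:
        # Assumed alphabet: printable, non-space ASCII.
        if c in EXCLUDED or not ("!" <= c <= "~"):
            return False
    return RESERVED.fullmatch(token) is None


print(is_bare_string("blue"))       # True
print(is_bare_string("red,green"))  # False
print(is_bare_string("data_xyz"))   # False
```
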
>> We further propose that
>>
>> (3) a single-quote delimited string may not contain a single quote unless it
>> is elided by ASCII reverse solidus (\).
>> (4) a double-quote delimited string may not contain a double quote unless it
>> is elided by ASCII reverse solidus (\).
>>
>> The reverse solidus syntax instructs the lexer that the immediately following
>> character (provided it is allowed in the character set) is NOT to be
>> interpreted as a token delimiter. For example
>>
>> _name   '\'Be gone\''      # better as "'Be gone'"
>>
>> The parser must return the string \'Be gone\', that is it does not handle any
>> of the elide characters. This is the responsibility of the downstream
>> application.
>> The following example shows an illegal use of the reverse solidus;
>> _name    "Be gone \
>> they said"
>> A NEWLINE character in the double (or single) quoted strings is illegal.
>>
>> THREAD 4 (SYNTAX)
>> Terminating tokens.
>> The adoption of the proposals in THREAD 3 ensures that the delimited values
>> are initiated and terminated by a single instance of the token character
>> (digram in the case of semi-colon delimited strings and trigram for triple
>> quote delimited strings). This removes the (unnecessary) requirement that the
>> token character MUST be preceded by a whitespace at initiation and followed by
>> a whitespace on termination.
>>
>> However an appropriate separator is required between tokens to unambiguously
>> parse a CIF2 document. The appropriate separator is defined by the context in
>> which it is used. For example at the highest-level, a whitespace serves this
>> purpose. In a List object the ASCII , serves this purpose. In the Associative
>> Array object, the separators are ASCII : and ASCII ,. The absence of a
>> separator or use of the incorrect separator will give rise to ambiguity and
>> possible error. The coercion rules for these cases need to be argued by the
>> "community".
>> ------------------------------------------------------
>>
>> Consider the following coercion rules for when a separator is not present.
>>
>> (1) Always generate an error message and die (we might be able to do better)
>> (2) Attempt to guess what is intended.
>>
>> Example (at the zero level)
>>
>> _name "butted ""strings"
>>
>> Adopt the C/Python rule which returns "butted strings" as the lexeme.
>> Splitting them doesn't make sense because there is one data name that can have
>> one data value. I would create an illegal STAR/CIF by splitting them.
>>
>> Now this might be different
>>
>> loop_ _name "butted ""strings"
>>
>> Here I would argue that we should split into two data values. It will be a
>> correct structure in the STAR/CIF sense and is the explicit enforcement of the
>> token termination rules, even though the separator rule is violated.
>>
>> For Brian's examples in loops
>>
>>                                                   INTENDED
>> loop_  _colour   'red'blue'green'     #  'red'     blue      'green'
>> loop_  _colour   'red' blue'green'    #  'red'     blue      'green'
>> loop_  _colour   'red'blue 'green'    #  'red'     blue      'green'
>> loop_  _colour   'red'''blue'green'   #  'red' ''  blue      'green'
>>
>> These 4 (under the above rule) agree with what is intended.
>>
>> loop_  _colour   'red''''blue'''green #  'red'    '''blue'''  green
>>
>> This one does also, because in my lexer (and everyone should do this and Herb
>> agrees) the triple quote rules have priority over any single character quote
>> rules.
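
To make the splitting concrete, a small Python sketch of the termination
rule for loop values follows.  split_loop_values is a hypothetical helper
that takes only the values after the data names; it ends every quoted token
at its closing delimiter whether or not whitespace follows, and it tests for
a triple quote before a single one.  Treating a quote character as the end
of a bare value is taken from the restricted character set in THREAD 3.

```python
def split_loop_values(text):
    """Split loop values, ending each token at its closing delimiter."""
    values = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith("'''", i) or text.startswith('"""', i):
            # Triple-quote rule is tried first, so it has priority.
            delim = text[i:i + 3]
            end = text.find(delim, i + 3)
            if end < 0:
                raise ValueError("unterminated triple-quoted string")
            values.append(text[i:end + 3])
            i = end + 3
        elif ch in "'\"":
            j = i + 1
            while j < n and text[j] != ch:
                # A reverse solidus protects the character after it.
                j += 2 if text[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated quoted string")
            values.append(text[i:j + 1])
            i = j + 1
        else:
            # Bare value: runs until whitespace or a quote character.
            j = i
            while j < n and not text[j].isspace() and text[j] not in "'\"":
                j += 1
            values.append(text[i:j])
            i = j
    return values


print(split_loop_values("'red'blue'green'"))
# ["'red'", 'blue', "'green'"]
print(split_loop_values("'red''''blue'''green"))
# ["'red'", "'''blue'''", 'green']
```
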
>>
>> Then Brian's other examples. Given the above coercion rules, and the restricted
>> character set of data names, these would be
>>                                                INTERPRETED
>> loop_  _colour'red' 'green' 'blue'            # loop_  _colour 'red' 'green' 'blue' [stop_] # added for clarity
>> loop_  _colour 'red' 'green' 'blue'_name Fred # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_  _colour 'red''green''blue'_name Fred   # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_  _colour 'red''green''blue' _name Fred  # Ditto
>>
>> Another coercion rule. The separator for lists is the comma. What if that is
>> given as a space?
>>
>>
>> _name {{1 2 3}   # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4 5 6}
>>        {7 8 9}}
>>
>> We suggest this is a 3x3 matrix (which you would know from the dictionary
>> anyway) and it should be coerced into
>>
>> _name {{1,2,3},   # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4,5,6},
>>        {7,8,9}}
>>
>> This is consistent with the loop rule above where we split. Similar rule for
>> all lists.
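
A sketch of that list coercion in Python, under simplifying assumptions:
coerce_list_separators is a hypothetical helper, items are bare values only
(no quoted strings, no associative arrays), and all whitespace is dropped
from the result.

```python
import re

# Braces and commas are single tokens; anything else up to the next
# whitespace, brace or comma is one item.
TOKEN = re.compile(r"[{},]|[^\s{},]+")


def coerce_list_separators(text):
    """Insert a comma wherever only whitespace separates two list items."""
    out = []
    prev = None
    for tok in TOKEN.findall(text):
        starts_item = tok not in ("}", ",")
        ends_item = prev is not None and prev not in ("{", ",")
        if starts_item and ends_item:
            out.append(",")  # separator was missing, supply it
        out.append(tok)
        prev = tok
    return "".join(out)


matrix = """{{1 2 3}
 {4 5 6}
 {7 8 9}}"""
print(coerce_list_separators(matrix))  # {{1,2,3},{4,5,6},{7,8,9}}
```
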
>>
>> cheers
>>
>> Nick
>>
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>>
>> The University of Western Australia    t: +61 (0)8 6488 3452
>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>> <http://www.csse.uwa.edu.au/%7Enick>
>> MBDP  M002
>>
>> CRICOS Provider Code: 00126G
>>
>> e: Nick.Spadaccini@uwa.edu.au
>>
>>
>>
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
