Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

I agree with Nick.  There is no reason why APIs distributed with parsers 
cannot also have useful application support utilities to do 
reverse-solidus processing on strings for various purposes, such as 
Brian's type-setting codes, or to do line-folding, but with so many 
conflicting approaches to handling reverse-solidus processing already in use 
with CIF, I don't know a good way to build full processing into the parser 
itself.

Regards,
   Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 16 Oct 2009, Nick Spadaccini wrote:

>
>
>
> On 16/10/09 2:45 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> One quick question re:
>>
>>>> _name   '\'Be gone\''      # better as "'Be gone'"
>>
>>>> The parser must return the string \'Be gone\', that is it does not handle
>>> any of the elide characters. This is the responsibility of the downstream
>>> application.
>>
>> I would have expected the parser to return 'Be gone' in this case?
>>
> Why? There are a number of reasons why it would be difficult. We don't
> interpret the elides because we don't know what algorithm to use. Brian's
> archive is littered with \n in strings for the Greek letter nu; the standard
> algorithm would insert a single byte NEWLINE character. Too many elides
> exist in strings for us to know what to do, unless you want to adopt a
> C/Python convention. Then we would break all the IUCr typesetting.
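
To make the conflict concrete, here is a minimal Python sketch showing the same raw characters read under the two conventions. It is an illustration only, not code from any CIF tool: the `GREEK` table and the `iucr_typeset` helper are hypothetical names, and the table holds just the single `\n` -> nu mapping mentioned above.

```python
# The six characters exactly as they would sit in the file: \ n space = space 3
raw = r"\n = 3"

# C/Python convention: backslash-n collapses to one NEWLINE character.
c_style = raw.encode("ascii").decode("unicode_escape")

# Hypothetical typesetting convention: backslash-n stands for Greek nu.
GREEK = {"n": "\u03bd"}  # illustrative subset, assumed for this sketch


def iucr_typeset(text):
    """Replace backslash codes found in GREEK; leave everything else alone."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in GREEK:
            out.append(GREEK[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


typeset_style = iucr_typeset(raw)
```

Both results start from the identical `raw` value, which is why a parser that commits to either interpretation breaks the other.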
>>
>> i.e. the elide should be recognized as escaping a
>> nested ' when within ' ... ' ,
>> otherwise '\'Be gone\'' is not the same as "'Be gone'"
>> e.g.
>>
> The handling is left up to the downstream application. I know this seems
> strange but the discipline decides what the elides mean and then they define
> behaviour. The ONLY behaviour defined at the syntactic level is whatever
> follows the elide is literal and NOT in consideration as a delimiter
> character.
>>
>> '\'Be gone\'' --> 'Be gone'   - parser recognizes the elides
>> "'Be gone'"   --> 'Be gone'
>> "\'Be gone\'" --> \'Be gone\'   - parser ignores elides as not relevant
>>
> Interesting you should argue this is 'Be gone', which is the C/Python
> interpretation.
>>
>> '\\'Be gone\\'' --> \'Be gone\'  - parser ignores \\ but not \'
>>
> This is not correct. It doesn't parse even with Python. In our suggested
> coercion it would be 4 string values -  \\ - Be - gone\\ - ''
>>
>> "\\'Be gone\\'" --> \\'Be gone\\'  - parser ignores elides
>
> Again this is not consistent? When do I strip the elides and when do I leave
> them?
>
> The elide interpretation and stripping in C/Python is a consequence of
> typing/working in their execution environment. If you actually just read
> strings from a file no manipulation is done. We're meeting that halfway. As
> we read, the elides help us avoid early token termination, but otherwise the
> string is the unaltered value.
>>
>> Cheers
>>
>> Simon
>>
>>
>> From: Nick Spadaccini <nick@csse.uwa.edu.au>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Thursday, 15 October, 2009 17:22:43
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Ok. I have
>> formalised in my head the difference between whitespace as a part of a token,
>> versus its presence as a separator.
>>
>> I have copied out two threads from a paper I am drafting up for proposing
>> changes.
>>
>> ---------------------------------------------
>> THREAD 3 (SYNTAX)
>> Restricted character set.
>> The adoption of the compound data structures described in THREAD 2
>> necessitates a restriction on the character set that can be used for string
>> types. Namely the token delimiter characters and token separator characters
>> cannot be included in a non-quote delimited string.
>>
>> (1) A non-quote delimited string can be comprised of the printable characters,
>> excluding any of the ASCII characters, " ' , : { }. The first character of the
>> string cannot be an ASCII _  or ASCII $ and the string cannot exactly match
>> any of the reserved keywords of STAR (loop_ global_ save_[.]* stop_
>> data_[.]*).
>> (2) For consistency, any of " ' , : { } are excluded from strings that form
>> data names.
>>
>> We further propose that
>>
>> (3) a single-quote delimited string may not contain a single quote unless it
>> is elided by ASCII reverse solidus (\).
>> (4) a double-quote delimited string may not contain a double quote unless it
>> is elided by ASCII reverse solidus (\).
>>
>> The reverse solidus syntax instructs the lexer that the immediately following
>> character (provided it is allowed in the character set) is NOT to be
>> interpreted as a token delimiter. For example
>>
>> _name   '\'Be gone\''      # better as "'Be gone'"
>>
>> The parser must return the string \'Be gone\', that is it does not handle any
>> of the elide characters. This is the responsibility of the downstream
>> application.
>> The following example shows an illegal use of the reverse solidus;
>> _name    "Be gone \
>> they said"
>> A NEWLINE character in the double (or single) quoted strings is illegal.
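
Rules (3) and (4), together with the NEWLINE restriction, can be sketched as a small scanner. This is a minimal sketch under stated assumptions, not the lexer discussed in the thread: the function name `read_quoted` and its return shape are invented for illustration. The elide only stops the next character from closing the string; both characters are kept in the returned value.

```python
def read_quoted(text, start=0):
    """Return (raw_value, next_index) for the quoted string opening at text[start].

    Hypothetical helper: the value is returned unaltered, elides included.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("expected a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            raise ValueError("NEWLINE inside a quoted string is illegal")
        # An elide protects the following character from being read as a
        # delimiter. A NEWLINE is not protected, so it is caught above on
        # the next pass through the loop.
        if ch == "\\" and i + 1 < len(text) and text[i + 1] != "\n":
            i += 2
            continue
        if ch == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


value, end = read_quoted("'\\'Be gone\\''")
```

Here `value` still contains both reverse solidus characters; deciding what they mean is left to whatever consumes the string.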
>>
>> THREAD 4 (SYNTAX)
>> Terminating tokens.
>> The adoption of the proposals in THREAD 3 ensures that the delimited values
>> are initiated and terminated by a single instance of the token character
>> (digram in the case of semi-colon delimited strings and trigram for triple
>> quote delimited strings). This removes the (unnecessary) requirement that the token
>> character MUST be preceded by a whitespace at initiation and followed by a
>> whitespace on termination.
>>
>> However an appropriate separator is required between tokens to unambiguously
>> parse a CIF2 document. The appropriate separator is defined by the context in
>> which it is used. For example at the highest-level, a whitespace serves this
>> purpose. In a List object the ASCII , serves this purpose. In the Associative
>> Array object, the separators are ASCII : and ASCII ,. The absence of a
>> separator or use of the incorrect separator will give rise to ambiguity and
>> possible error. The coercion rules for these cases need to be argued by the
>> "community".
>> ------------------------------------------------------
>>
>> Consider the following coercion rules for when a separator is not present.
>>
>> (1) Always generate an error message and die (we might be able to do better)
>> (2) Attempt to guess what is intended.
>>
>> Example (at the zero level)
>>
>> _name "butted ""strings"
>>
>> Adopt the C/Python rule which returns "butted strings" as the lexeme.
>> Splitting them doesn't make sense because there is one data name that can have
>> one data value. I would create an illegal STAR/CIF by splitting them.
>>
>> Now this might be different
>>
>> loop_ _name "butted ""strings"
>>
>> Here I would argue that we should split into two data values. It will be a
>> correct structure in the STAR/CIF sense and is the explicit enforcement of the
>> token termination rules, even though the separator rule is violated.
>>
>> For Brian's examples in loops
>>
>>                                                   INTENDED
>> loop_  _colour   'red'blue'green'     #  'red'     blue      'green'
>> loop_  _colour   'red' blue'green'    #  'red'     blue      'green'
>> loop_  _colour   'red'blue 'green'    #  'red'     blue      'green'
>> loop_  _colour   'red'''blue'green'   #  'red' ''  blue      'green'
>>
>> These 4 (under the above rule) agree with what is intended.
>>
>> loop_  _colour   'red''''blue'''green #  'red'    '''blue'''  green
>>
>> This one does also, because in my lexer (and everyone should do this and Herb
>> agrees) the triple quote rules have priority over any single character quote
>> rules.
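
The splitting behaviour for the five loop examples above can be sketched as follows. This is a minimal sketch under assumptions of my own, not the lexer described in the thread: the name `split_loop_values` is hypothetical, tokens keep their delimiters so quoted and bare values can be told apart, and a quote is assumed to end a bare word. The triple-quote test is made before the single-quote test, which is what gives it priority.

```python
def split_loop_values(text):
    """Split butted loop values; every closing delimiter ends a token."""
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch in "'\"":
            triple = ch * 3
            if text.startswith(triple, i):
                # Triple-quote rule is tried first, so it has priority.
                end = text.find(triple, i + 3)
                if end == -1:
                    raise ValueError("unterminated triple-quoted string")
                tokens.append(text[i:end + 3])
                i = end + 3
                continue
            end = i + 1
            while end < n and text[end] != ch:
                # Step over an elided character so it cannot close the token.
                end += 2 if text[end] == "\\" else 1
            if end >= n:
                raise ValueError("unterminated quoted string")
            tokens.append(text[i:end + 1])
            i = end + 1
            continue
        # Bare word: assumed to stop at whitespace or at any quote character.
        end = i
        while end < n and not text[end].isspace() and text[end] not in "'\"":
            end += 1
        tokens.append(text[i:end])
        i = end
    return tokens
```

Called on `'red'''blue'green'` this yields four tokens, the second being the empty string `''`, while `'red''''blue'''green` yields three because the trigram is recognised once `'red'` has been closed.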
>>
>> Then Brian's other examples. Given the above coercion rules and the restricted
>> character set of data names, these would be INTERPRETED as shown in the
>> comment under each line ([stop_] added for clarity):
>>
>> loop_  _colour'red' 'green' 'blue'
>>    # loop_  _colour 'red' 'green' 'blue' [stop_]
>> loop_  _colour 'red' 'green' 'blue'_name Fred
>>    # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_  _colour 'red''green''blue'_name Fred
>>    # loop_  _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_  _colour 'red''green''blue' _name Fred
>>    # Ditto
>>
>> Another coercion rule. The separator for lists is the comma. What if that is
>> given as a space?
>>
>>
>> _name {{1 2 3}   # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4 5 6}
>>        {7 8 9}}
>>
>> We suggest this is a 3x3 matrix (which you would know from the dictionary anyway)
>> and it should be coerced into
>>
>> _name {{1,2,3},   # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4,5,6},
>>        {7,8,9}}
>>
>> This is consistent with the loop rule above where we split. Similar rule for
>> all lists.
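
The list coercion can be sketched as a parser that accepts either a comma or whitespace between items and then writes the list back out in comma form. This is a minimal sketch, assuming bare unquoted values only; `parse_list` and `to_cif` are hypothetical names, and treating a run of separators as a single separator is a simplification of mine rather than a rule from the proposal.

```python
def parse_list(text):
    """Parse a brace-delimited list, accepting ',' or whitespace as separator."""
    pos = 0

    def skip_separators():
        nonlocal pos
        while pos < len(text) and (text[pos].isspace() or text[pos] == ","):
            pos += 1

    def parse_value():
        nonlocal pos
        if text[pos] == "{":
            pos += 1
            items = []
            while True:
                skip_separators()
                if pos >= len(text):
                    raise ValueError("unterminated list")
                if text[pos] == "}":
                    pos += 1
                    return items
                items.append(parse_value())
        # Bare value: runs until a brace, a comma or whitespace.
        start = pos
        while pos < len(text) and text[pos] not in "{}," and not text[pos].isspace():
            pos += 1
        return text[start:pos]

    skip_separators()
    return parse_value()


def to_cif(value):
    """Write a parsed value back out using the comma separator."""
    if isinstance(value, list):
        return "{" + ",".join(to_cif(v) for v in value) + "}"
    return value


matrix = parse_list("{{1 2 3}\n {4 5 6}\n {7 8 9}}")
coerced = to_cif(matrix)
```

The space-separated and comma-separated spellings parse to the same nested structure, which is the sense in which the first is coerced into the second.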
>>
>> cheers
>>
>> Nick
>>
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>>
>> The University of Western Australia    t: +61 (0)8 6488 3452
>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>> <http://www.csse.uwa.edu.au/%7Enick>
>> MBDP  M002
>>
>> CRICOS Provider Code: 00126G
>>
>> e: Nick.Spadaccini@uwa.edu.au
>>
>>
>>
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group