Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 15 Oct 2009 15:38:38 -0400 (EDT)
- In-Reply-To: <C6FD975B.12107%nick@csse.uwa.edu.au>
- References: <C6FD975B.12107%nick@csse.uwa.edu.au>
I agree with Nick. There is no reason why APIs distributed with parsers
cannot also have useful application support utilities to do reverse-solidus
processing on strings for various purposes, such as Brian's type-setting
codes, or to do line-folding, but with so many conflicting approaches to
handling reverse-solidus processing already in use with CIF, I don't know a
good way to build full processing into the parser itself.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 16 Oct 2009, Nick Spadaccini wrote:

> On 16/10/09 2:45 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> One quick question re:
>>
>>> _name '\'Be gone\'' # better as "'Be gone'"
>>>
>>> The parser must return the string \'Be gone\', that is it does not handle
>>> any of the elide characters. This is the responsibility of the downstream
>>> application.
>>
>> I would have expected the parser to return 'Be gone' in this case?
>
> Why? There are a number of reasons why it would be difficult. We don't
> interpret the elides because we don't know what algorithm to use. Brian's
> archive is littered with \n in strings for the Greek letter nu, the standard
> algorithm would insert a single byte NEWLINE character. Too many elides
> exist in strings for us to know what to do, unless you want to adopt a
> C/Python convention. Then we would break all the IUCr typesetting.
>
>> i.e. the elide should be recognized as escaping a
>> nested ' when within ' ... ' ,
>> otherwise '\'Be gone\'' is not the same as "'Be gone'"
>> e.g.
>
> The handling is left up to the downstream application. I know this seems
> strange but the discipline decides what the elides mean and then they define
> behaviour.
> The ONLY behaviour defined at the syntactic level is that whatever
> follows the elide is literal and NOT in consideration as a delimiter
> character.
>
>> '\'Be gone\'' --> 'Be gone' - parser recognizes the elides
>> "'Be gone'" --> 'Be gone'
>> "\'Be gone\'" --> \'Be gone\' - parser ignores elides as not relevant
>
> Interesting you should argue this is 'Be gone', which is the C/Python
> interpretation.
>
>> '\\'Be gone\\'' --> \'Be gone\' - parser ignores \\ but not \'
>
> This is not correct. It doesn't parse even with Python. In our suggested
> coercion it would be 4 string values - \\ - Be - gone\\ - ''
>
>> "\\'Be gone\\'" --> \\'Be gone\\' - parser ignores elides
>
> Again this is not consistent? When do I strip the elides and when do I leave
> them?
>
> The elide interpretation and stripping in C/Python is a consequence of
> typing/working in their execution environment. If you actually just read
> strings from a file no manipulation is done. We're meeting that half way. As
> we read, the elides help us avoid early token termination, but otherwise the
> string is the unaltered value.
>
>> Cheers
>>
>> Simon
>>
>>
>> From: Nick Spadaccini <nick@csse.uwa.edu.au>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Thursday, 15 October, 2009 17:22:43
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Ok. I have formalised in my head the difference between whitespace as a
>> part of a token, versus its presence as a separator.
>>
>> I have copied out two threads from a paper I am drafting up for proposing
>> changes.
>>
>> ---------------------------------------------
>> THREAD 3 (SYNTAX)
>> Restricted character set.
>> The adoption of the compound data structures described in THREAD 2
>> necessitates a restriction on the character set that can be used for string
>> types.
>> Namely the token delimiter characters and token separator characters
>> cannot be included in a non-quote delimited string.
>>
>> (1) A non-quote delimited string can be comprised of the printable characters,
>> excluding any of the ASCII characters, " ' , : { }. The first character of the
>> string cannot be an ASCII _ or ASCII $ and the string cannot exactly match
>> any of the reserved keywords of STAR (loop_ global_ save_[.]* stop_
>> data_[.]*).
>> (2) For consistency, any of " ' , : { } are excluded from strings that form
>> data names.
>>
>> We further propose that
>>
>> (3) a single-quote delimited string may not contain a single quote unless it
>> is elided by ASCII reverse solidus (\).
>> (4) a double-quote delimited string may not contain a double quote unless it
>> is elided by ASCII reverse solidus (\).
>>
>> The reverse solidus syntax instructs the lexer that the immediately following
>> character (provided it is allowed in the character set) is NOT to be
>> interpreted as a token delimiter. For example
>>
>> _name '\'Be gone\'' # better as "'Be gone'"
>>
>> The parser must return the string \'Be gone\', that is it does not handle any
>> of the elide characters. This is the responsibility of the downstream
>> application.
>> The following example shows an illegal use of the reverse solidus;
>> _name "Be gone \
>> they said"
>> A NEWLINE character in the double (or single) quoted strings is illegal.
>>
>> THREAD 4 (SYNTAX)
>> Terminating tokens.
>> The adoption of the proposals in THREAD 3 ensures that the delimited values
>> are initiated and terminated by a single instance of the token character
>> (digram in the case of semi-colon delimited strings and trigram for triple
>> quote delimited strings). This removes the (unnecessary) requirement that the
>> token character MUST be preceded by a whitespace at initiation and followed
>> by a whitespace on termination.
>>
>> However an appropriate separator is required between tokens to unambiguously
>> parse a CIF2 document. The appropriate separator is defined by the context in
>> which it is used. For example at the highest-level, a whitespace serves this
>> purpose. In a List object the ASCII , serves this purpose. In the Associative
>> Array object, the separators are ASCII : and ASCII ,. The absence of a
>> separator or use of the incorrect separator will give rise to ambiguity and
>> possible error. The coercion rules for these cases need to be argued by the
>> "community".
>> ------------------------------------------------------
>>
>> Consider the following coercion rules for when a separator is not present.
>>
>> (1) Always generate an error message and die (we might be able to do better)
>> (2) Attempt to guess what is intended.
>>
>> Example (at the zero level)
>>
>> _name "butted ""strings"
>>
>> Adopt the C/Python rule which returns "butted strings" as the lexeme.
>> Splitting them doesn't make sense because there is one data name that can have
>> one data value. I would create an illegal STAR/CIF by splitting them.
>>
>> Now this might be different
>>
>> loop_ _name "butted ""strings"
>>
>> Here I would argue that we should split into two data values. It will be a
>> correct structure in the STAR/CIF sense and is the explicit enforcement of the
>> token termination rules, even though the separator rule is violated.
>>
>> For Brian's examples in loops
>>
>> INTENDED
>> loop_ _colour 'red'blue'green'     # 'red' blue 'green'
>> loop_ _colour 'red' blue'green'    # 'red' blue 'green'
>> loop_ _colour 'red'blue 'green'    # 'red' blue 'green'
>> loop_ _colour 'red'''blue'green'   # 'red' '' blue 'green'
>>
>> These 4 (under the above rule) agree with what is intended.
>>
>> loop_ _colour 'red''''blue'''green # 'red' '''blue''' green
>>
>> This one does also, because in my lexer (and everyone should do this and Herb
>> agrees) the triple quote rules have priority over any single character quote
>> rules.
>>
>> Then Brian's other examples. Given the above coercion rules, and the
>> restricted character set of data names, these would be
>> INTERPRETED
>> loop_ _colour'red' 'green' 'blue' # loop_ _colour 'red' 'green' 'blue' [stop_] # added for clarity
>> loop_ _colour 'red' 'green' 'blue'_name Fred # loop_ _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_ _colour 'red''green''blue'_name Fred # loop_ _colour 'red' 'green' 'blue' [stop_] _name Fred
>> loop_ _colour 'red''green''blue' _name Fred # Ditto
>>
>> Another coercion rule. The separator for lists is the comma. What if that is
>> given as a space?
>>
>> _name {{1 2 3} # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4 5 6}
>>        {7 8 9}}
>>
>> We suggest this is a 3x3 matrix (which you would know from the dictionary
>> anyway) and it should be coerced into
>>
>> _name {{1,2,3}, # newlines mean nothing, so inserted for clarity/typesetting.
>>        {4,5,6},
>>        {7,8,9}}
>>
>> This is consistent with the loop rule above where we split. Similar rule for
>> all lists.
>>
>> cheers
>>
>> Nick
>>
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>>
>> The University of Western Australia t: +61 (0)8 6488 3452
>> 35 Stirling Highway f: +61 (0)8 6488 1089
>> CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
>> MBDP M002
>>
>> CRICOS Provider Code: 00126G
>>
>> e: Nick.Spadaccini@uwa.edu.au
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
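
For illustration only, the syntactic-level elide rule from the quoted proposal, and the "split butted values inside a loop" coercion, can be sketched in Python. This is a minimal sketch under stated assumptions, not code from any CIF parser: the names `read_quoted`, `loop_values` and `DELIMS` are hypothetical, triple-quote and semicolon delimited strings are not handled, comments are not handled, and a reverse solidus in a non-delimited string is assumed to be an ordinary character.

```python
# Characters the proposal excludes from non-quote delimited strings.
DELIMS = set("\"',:{}")


def read_quoted(text, start=0):
    """Read one quote-delimited token beginning at text[start].

    Returns (value, next_index). The value is the raw content between the
    delimiters: reverse solidus characters are kept, they only stop the
    following character from being treated as a delimiter.
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("token does not start with a quote character")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            # Proposal: a NEWLINE inside a quoted string is illegal,
            # even when preceded by a reverse solidus.
            raise ValueError("newline inside quoted string")
        if ch == "\\" and i + 1 < len(text) and text[i + 1] != "\n":
            i += 2  # elided character: never a delimiter, left in place
            continue
        if ch == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def loop_values(text):
    """Split the values of a loop, coercing butted tokens into separate values."""
    values, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "'\"":
            value, i = read_quoted(text, i)
            values.append(value)
        else:
            j = i
            while j < len(text) and not text[j].isspace() and text[j] not in DELIMS:
                j += 1
            if j == i:
                raise ValueError("separator %r not handled in this sketch" % ch)
            values.append(text[i:j])
            i = j
    return values


print(read_quoted(r"'\'Be gone\''")[0])       # elides are returned untouched
print(loop_values("'red'blue'green'"))        # butted values are split
print(loop_values(r"'\\'Be gone\\''"))        # the "4 string values" case
```

Under these assumptions the reader never strips or interprets a reverse solidus; deciding what `\n` or `\'` means is left to the downstream application, as argued in the thread.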