Re: [ddlm-group] THREAD TRIPLE QUOTES - Specification
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD TRIPLE QUOTES - Specification
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Mon, 14 Sep 2009 09:26:15 +0800
- In-Reply-To: <20090912025123.B67330@epsilon.pair.com>
Herb presses on a number of issues here that require quite a bit of rethinking about the way things have been done versus how they will be done.

Firstly, I preface what I say below with the recognition that under the new DDLm and syntax there will have to be a large degree of remediation of archived CIFs. That is a given, because in the past un-delimited strings have been submitted which cannot be allowed under the new changes. I want people to think about how we can do things, and not necessarily about how what we have done fits into what we want to do. I don't want to get into the situation where Microsoft continually builds holes in its operating systems simply because it wants to be backwards compatible all the way back to DOS 1.0.

Secondly, un-delimited strings are a bad thing, and were from day one. I can understand they look "nice" because you read them the way you would in a journal paper, which is actually the underlying reason why CIF is the way it is. Back in 1987 the dominant view of data was through the eyes of a journal article. But un-delimited strings are a problem. We should increasingly "push" delimited strings.

(1) A previously perfectly legal string like

1,1-trans-,[Bis]Carbonato-Whatever

is no longer acceptable. Why? Because in allowing for compound data structures (Lists, Tables etc.) the , indicates that the next element is being defined, and the [] indicates a recursive form.

I favour not allowing un-delimited strings, however these have a foothold and I am unlikely to win that one. So un-delimited strings will HAVE TO HAVE a restricted alphabet. The details will come in a follow-up thread (I am in an airport lounge ready to head home, so they will come in the next day).

I will push for the restriction on un-delimited strings to hold for datanames also. We have gone through all existing small molecule CIFs, and with these restrictions we will be breaking few data names or values.
mmCIF will have more, because of its use of [] in data names, but these have easy workarounds.

(2) The parsing of CIF should be very easy to define through either an LALR(k) or an LL(k) formalism. However, the original proposal that every character be available in a string, even the token delimiters, has made it much more difficult. In any other language a quote-delimited string is initiated by a " and terminated by a " (notwithstanding support for any elides). Not so in STAR/CIF, where the delimiters are <whitespace(s)>" and "<whitespace(s)>. So whereas any other language considers whitespace a separator of tokens and otherwise meaningless, in STAR/CIF the whitespace IS the token (along with the ").

This becomes silly when you define a recursive compound data structure (or even its BNF). A list is initiated by a <whitespace(s)>[ and is recursive, BUT you can't reuse that same definition for the nested lists, because recursively a list DOESN'T need a leading whitespace(s).

Part of the follow-up threads will propose establishing token delimiters which are single characters or multiple characters. The only one I can't get around is the <newline>; construct. But that is enshrined and has to be supported.

What I will propose will make things much easier, both in defining a BNF and in building a parser using compiler-compiler tools. We will want people to continue to use a space before the token delimiter, but if they don't there is a recovery mechanism if we choose to invoke it.

James knows what is coming because I spent the last 2 weeks in a room with him banging out these issues. The three weeks I spent in Chester "jelled" some new ideas about how to use DDLm in quite different discipline domains. The ANSTO stay has made me think about how to be more rigorous to simplify building things.

Nick.

Those threads will be out shortly. They will be longish because I will try to cover all our thinking over the last several weeks.

On 12/09/09 3:17 PM, "Herbert J.
Bernstein" <yaya@bernstein-plus-sons.com> wrote:

> Dear Colleagues,
>
> Inasmuch as we have already generated many CIFs under the existing
> handling of the reverse solidus, which was to treat it as subordinate
> to the existing quoting schemes, I would suggest that we retain the
> same ordering of lexical scan for the treble quotes. This would
> allow us to keep the handling of embedded treble quotes within treble
> quotes without much special handling at all -- just break up any
> embedded treble quotes with a reverse solidus. To be more specific,
> here is what I would suggest:
>
> 1. On reads:
>    1.1. The reverse solidus is an ordinary character in the lexical
>         scan of a quoted string;
>    1.2. At the level of a CIF we retain the rule that no terminal
>         quote mark is recognized unless followed by whitespace
>    1.3. At the level of a CIF we strip trailing whitespace from all
>         lines prior to the lexical scan
>    1.4. That, on read, recognition of the reverse solidus is an
>         optional semantic interpretation (perhaps handled by a second level
>         lexical scan or handled in the application) following the same
>         rules as Brian laid out for comments, semi-colon delimited strings
>         and, now, for treble quoted strings.
>
> 2. On writes:
>    In writing a treble quoted string, if a treble quote is
>    encountered as part of the quoted text, a reverse-solidus-newline
>    digraph would be inserted after the third quote mark, i.e.
>
> """This is an example
> of a treble-quoted
> string"""
>
> might be written as
>
> """"""\
> This is an example of a treble-quoted
> string"""\
> """
>
> The interesting case to then consider is whether we need to
> do any quoting to protect the reverse soliduses (solidii?)
> in the example when quoting it
>
> The main advantage of this approach is that ordinary quoted
> cif-strings such as
>
> _mugwump "muddy big water"
>
> could then be treble quoted very directly as
>
> """_mugwump "muddy big water" """
>
> rather than as
>
> """_mugwump " muddy big water" """
>
> as would be required under Nick's suggestion
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Fri, 11 Sep 2009, Herbert J. Bernstein wrote:
>
>> The main value of the treble quoted string is that it allows a much
>> neater presentation of examples of chunks of CIFs and text in which
>> presenting such information within semi-colon quoted strings gets
>> somewhat confusing.
>>
>> For this reason, I would suggest that the most important test of
>> Nick's suggestions would be how faithfully a semi-colon delimited
>> example could be included with _no_ added or subtracted characters,
>> so that people reading dictionaries by eye will reproduce those
>> examples correctly.
>>
>> For that reason, I hope we can stay as close to """ and ''' delimiting
>> truly raw data as possible.
>>
>> Regards,
>> Herbert
>>
>> On Fri, 11 Sep 2009, Nick Spadaccini wrote:
>>
>>> Our last discussion on the implementation of triple quoted strings resulted
>>> in much to-ing and fro-ing and in the end the conclusion was that its
>>> behaviour was to be identical to the semi-colon delimited strings.
>>> I preferred a greater degree of parsing of the string but this was not
>>> popular.
>>>
>>> Now our illustrious chair, who sits next to me now, asks the question "what
>>> is the point of the triple quoted string", to which I can only shrug my
>>> shoulders. The triple quote string will be useful for containing strings
>>> that include ", ' and ; (in the first character in a record). They will of
>>> course fail when you attempt to include the sequence """. Hence they are no
>>> different to a ; delimited string that cannot include a ; as the first
>>> character of a line.
>>>
>>> Here is a suggestion. The triple quote string (delimited by """) will treat
>>> its contents as raw, except that
>>>
>>> (a) When writing the string, ALL quotes contained within will have a space
>>> inserted immediately after the " character. This will allow the triple
>>> quote to be contained within the string by breaking the sequence with
>>> spaces so the tokeniser is not fooled into terminating the string. Clearly
>>> the reverse operation is required in reading the string. In this way it is
>>> possible to include all manner of text, markup and programming scripts
>>> within a triple quoted string.
>>>
>>> (b) We will formally accept in this string the "eliding" of the newline
>>> character. Hence a reverse solidus (\) immediately prior to the record
>>> terminating character(s) will imply the \ and the record terminating
>>> characters are deleted from the stream, and the next line is wrapped
>>> around.
>>>
>>> To allow for the odd case when one wants to literally include the \ and
>>> the record terminating character(s) in the string then the required \ will
>>> be elided.
>>>
>>> In parsing the contents of a string the only things required are
>>> (1) delete one of the spaces after every "
>>> (2) treat \<newline> as a wrap around
>>> (3) treat \\<newline> as the raw string \<newline>
>>>
>>> All other characters are left as is.
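The write-side steps (a) and (b) and the three read-side rules quoted above can be sketched in Python roughly as follows. This is only an illustration of the proposal as worded in the quoted message, not an agreed specification: the function names `encode_triple_quoted` and `decode_triple_quoted` are hypothetical, the record terminator is assumed to be a single `\n`, and "a literal \<newline>" is assumed to be written by doubling the reverse solidus, as rule (3) implies.

```python
import re


def encode_triple_quoted(text: str) -> str:
    """Write `text` as a triple-quoted string under the quoted proposal."""
    # (b) protect a literal backslash-newline by doubling the backslash,
    #     so the reader does not treat it as a line wrap.
    body = text.replace("\\\n", "\\\\\n")
    # (a) insert a space immediately after every double quote, so the
    #     sequence """ can never occur inside the body.
    body = body.replace('"', '" ')
    return '"""' + body + '"""'


def decode_triple_quoted(token: str) -> str:
    """Recover the original text from a token produced by the encoder."""
    if len(token) < 6 or not (token.startswith('"""') and token.endswith('"""')):
        raise ValueError("not a triple-quoted token")
    body = token[3:-3]
    # (1) delete one space after every double quote.
    body = body.replace('" ', '"')

    # (3) \\<newline> becomes the raw string \<newline>;
    # (2) \<newline> is a wrap-around and is removed.
    # The longer alternative is tried first at each position.
    def unescape(match: re.Match) -> str:
        return "\\\n" if match.group(0) == "\\\\\n" else ""

    return re.sub(r"\\\\\n|\\\n", unescape, body)


if __name__ == "__main__":
    sample = '_mugwump "muddy big water"'
    token = encode_triple_quoted(sample)
    print(token)
    print(decode_triple_quoted(token) == sample)
```

Under these assumptions the encoder reproduces the `"""_mugwump " muddy big water" """` form that Herbert attributes to this suggestion, and the decoder reverses it. A real tokeniser would also have to locate the closing `"""` in the input stream, which this sketch takes as given.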
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
>>> MBDP M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group

cheers

Nick
- References:
- Re: [ddlm-group] THREAD TRIPLE QUOTES - Specification (Herbert J. Bernstein)