Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 24 Sep 2009 06:46:59 -0400 (EDT)
- In-Reply-To: <C6E123F5.11EB6%nick@csse.uwa.edu.au>
- References: <C6E123F5.11EB6%nick@csse.uwa.edu.au>
Dear Colleagues,

I, too, am a fan of consistency, but in this case, we have a huge
problem (in terms of numbers of existing data sets) if we drop the
established convention that the terminal quote marks in a CIF are only
effective if they are followed by whitespace. The PDB uses such
constructs as 'O'', which would be broken by making a consistent
improvement. Similar, but more subtle and difficult to debug, problems
arise if we drop the requirement for trailing whitespace to recognize a
valid trailing \n; text field close, requiring major changes both to the
context of CIFs and to the line folding logic.

CIF whitespace handling is different from whitespace handling in, say,
Python or Java or C, and in DDLm we expect to be quoting things that
involve both conventions. I think we are going to need to sacrifice
consistency in favor of being able to handle existing CIFs.

So, I would recommend having the lexical scan work in levels. At the top
level, in a CIF, even a ] should require trailing whitespace to be valid,
even though that is an unnecessary requirement, while within a bracketed
or quoted construct, different rules would apply depending on the
construct being used.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Thu, 24 Sep 2009, Nick Spadaccini wrote:

>
> Before addressing specifics let me say I am coming at this problem by
> exploiting 50 years of computer science's fascination with parsing and
> getting it right. In the end having structures that are consistent and
> closed under their repetitive operation is important. That is, a list is
> defined by a production rule, and a list of lists should just be a recursive
> call to that production rule. Not so in the current CIF/Star.
> One has to define a "different" list rule recursively since whitespace
> before [ and after ] are NOT required. If you did require that whitespace,
> it would not be sensible.
>
> Herb is pushing for (essentially) the status quo. I argue that simply is not
> possible. See below.
>
> On 22/09/09 5:55 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Dear Colleagues,
>>
>> I think it urgent to at least hear from the PDB and the IUCr journals
>> operation on the subject of remediating all existing CIFs, as well as
>> from the managers of the major graphics and data processing packages very
>> early in the discussion.
>>
>> However, never one to fail to rush in where angels fear to tread, here
>> are my comments on substance:
>>
>> I would prefer to retain the current CIF approach of recognizing
>> anything that can be whitespace delimited and which is not in a small
>> list of reserved items as a whitespace delimited value. I would suggest
>
> Does this mean whitespace delimited values are EXCLUDED from forming part of
> a compound data structure, such as a list? Because it will have to be if we
> don't adopt a restriction to the allowed character set of these values.
>
> If they can be part of a compound data structure then currently a legitimate
> whitespace delimited SINGLE value like x,y+1/2,-z is no longer acceptable.
>
> Hence the character set has to be restricted, or the ONLY whitespace
> delimited tokens are identifiers (datanames), keywords and numerical values,
> and everything else HAS to be in delimited strings.
>
> We have to change something (there is no argument there), and while we are
> at it let's change things to be more consistent and sensible all round.
>
>> that the reserved items be:
>>
>>    Any item beginning with an underscore ('_')
>>    Any item beginning with "data_" or "save_" (case insensitive)
>>    Any item consisting of "global_", "loop_", "stop_" (case insensitive)
>>    Any item beginning with one of the quote marks:
>>      '"' (double quote)
>>      '\'' (single quote)
>>      '\n;' (newline-semicolon) (where newline is system dependent)
>>      '[', '{', '(' (the three bracket constructs in the original DDLm
>>         proposal)
>>      '\'\'\'' or '"""' (the two treble quote marks used in other languages
>
> Agree with these, reiterating that I don't see a great need for ''', but I
> am not religious about that objection. HOWEVER, neither I nor anyone I
> discuss parsing and compiler writing with can see why the delimiter HAS to
> be preceded (and followed) by one or more whitespaces. Yes it is nice
> to, and 99.99999% of programs WILL, but it should not be an enforced
> requirement of the language (especially when it is totally unnecessary). In
> this way the grammar definition for a data type can be called recursively
> when injected into a compound data structure, rather than having to create
> a second (almost identical) production rule when the same data type is to be
> injected into a compound data structure. That just simply doesn't make
> sense.
>
>> When an item begins with one of the quote marks it would then have to
>> conform to the conventions specified for those quote marks, but in general
>> at the top level, the mating terminal quote mark would not be recognized
>> as a terminal quote mark unless followed by whitespace.
>
> Ditto for the argument as to why a trailing whitespace is not needed, and
> hence should not be specified as part of the language.
>
>> I would prefer to handle the elides one level down, i.e. not treating
>> '"""\\\n' as a terminal treble quote mark because the last '"' is followed
>> by a reverse solidus rather than by whitespace.
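
[Editorial illustration] The top-level convention Herbert describes, where a closing quote mark only counts when it is followed by whitespace, can be sketched in Python roughly as follows. This is a minimal sketch, not code from any CIF library: the function name scan_quoted is hypothetical, end of input is assumed to count the same as whitespace, and quoted strings are assumed not to span lines.

```python
def scan_quoted(text, start=0):
    """Scan a quoted value beginning at text[start].

    Returns (value, next_index). A quote mark closes the value only when
    it is followed by whitespace or by end of input (the latter is an
    assumption made for this sketch).
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("expected a quote mark at position %d" % start)
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            break  # assumption: a quoted string may not span lines
        at_end = (i + 1 == len(text))
        if ch == quote and (at_end or text[i + 1].isspace()):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


# The PDB-style construct 'O'' yields the value O' under this rule.
print(scan_quoted("'O'' next"))
# Embedded double quotes are allowed when not followed by whitespace.
print(scan_quoted('""Hello", he said"'))
```

Under this rule the embedded quote in 'O'' is kept as data because the character after it is another quote mark rather than whitespace, which is exactly the behaviour that would be lost if the closing delimiter were recognized unconditionally.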
>
> The approach usually taken does not involve what comes after a character but
> what happens before. So """ terminates the triple quote token irrespective
> of what follows. In my previous email on this topic (different thread) I
> suggested spaces, but that is actually introducing yet another non-standard
> approach. Let's stick with elides. As far as the scanner is concerned if I
> elide a character then it is NOT to be considered as part of a sequence to
> determine tokens and is to be interpreted as raw.
>
> Hence
>
> _ADataName """Here is a string including a \""" quote"""
>
> is OK. Note I only need to elide the first quote since the scanner would
> know to ignore the first quote, and the following two quotes don't mean
> anything in the grammar rule. The scanner should return the raw string when
> asked, that is
>
> Here is a string including a \""" quote
>
> It is then the consumer of this lexeme to deal with what to do. This is
> James' favoured approach. At the highest level the elides mean ignore the
> next character, and then return the contents as a raw string. It is then up
> to the downstream application to do what it needs to. I think this is a
> much neater approach, and certainly one more consistent with the way things
> are done in most parsing applications.
>
> In the end the easiest thing is to wrap all strings up in a triple quote.
> Then any single or double quote characters in the lexeme won't cause
> trouble. If a triple quote has to be included, elide the first quote.
>
> In this new approach to delimiters, the string
>
> ""Hello", he said"
>
> which is acceptable in the current specification (though any writer of
> parsers would be seriously concerned), would now have to be
>
> '"Hello", he said'
>
> or
>
> """"Hello", he said"""
>
> To date I have only discussed accepting elides in the new triple quote
> strings. I would prefer NOT to introduce eliding in the old single or double
> quote strings.
>
>> I would prefer to accept all UTF-8 text.
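
[Editorial illustration] The elide rule Nick proposes for triple-quoted strings can be sketched in Python along these lines. It is a minimal sketch under stated assumptions, not an implementation from the thread: the name scan_triple_quoted is hypothetical, the reverse solidus is taken as the elide character, and the raw contents are returned with the elides left in place for the consumer to interpret.

```python
def scan_triple_quoted(text, start=0):
    """Scan a triple-quoted value beginning at text[start].

    Returns (raw_value, next_index). An elided character (one preceded
    by a reverse solidus) never takes part in delimiter matching. The
    closing delimiter ends the token whatever follows it.
    """
    delim = text[start:start + 3]
    if delim not in ('"""', "'''"):
        raise ValueError("expected a triple quote at position %d" % start)
    i = start + 3
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2  # skip the elide and the character it protects
            continue
        if text.startswith(delim, i):
            # Raw contents: elides are left for the consumer to handle.
            return text[start + 3:i], i + 3
        i += 1
    raise ValueError("unterminated triple-quoted string")


raw, end = scan_triple_quoted('"""Here is a string including a \\""" quote"""')
print(raw)
raw, end = scan_triple_quoted('""""Hello", he said"""')
print(raw)
```

Only the first quote of the embedded triple needs eliding here: once it is skipped, the two quotes that follow cannot form a closing delimiter on their own.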
>
> Support for a wider character set is important, but I am not clear exactly
> what this preference means. Are you saying we are to accept the binary UTF-8
> character set, which would mean CIFs opened in a standard text editor would
> look like gobbledygook, or are you saying we will accept an ASCII-fied
> string representing the UTF-8? That is \uABCD or &#ABCD; or some such
> equivalent?
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group