Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Thu, 24 Sep 2009 13:26:13 +0800
- In-Reply-To: <20090922053327.P20642@epsilon.pair.com>
Before addressing specifics, let me say I am coming at this problem by exploiting 50 years of computer science's fascination with parsing and getting it right. In the end, having structures that are consistent and closed under their repetitive operation is important. That is, a list is defined by a production rule, and a list of lists should just be a recursive call to that production rule. Not so in the current CIF/Star. One has to define a "different" list rule recursively, since whitespace before [ and after ] is NOT required. If you did require that whitespace, it would not be sensible.

Herb is pushing for (essentially) the status quo. I argue that simply is not possible. See below.

On 22/09/09 5:55 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com> wrote:

> Dear Colleagues,
>
> I think it urgent to at least hear from the PDB and the IUCr journals
> operation on the subject of remediating all existing CIFs, as well as
> from the managers of the major graphics and data processing packages very
> early in the discussion.
>
> However, never one to fail to rush in where angels fear to tread, here
> are my comments on substance:
>
> I would prefer to retain the current CIF approach of recognizing
> anything that can be whitespace delimited and which is not in a small
> list of reserved items as a whitespace delimited value. I would suggest

Does this mean whitespace delimited values are EXCLUDED from forming part of a compound data structure, such as a list? Because they will have to be if we don't adopt a restriction on the allowed character set of these values. If they can be part of a compound data structure, then a currently legitimate whitespace delimited SINGLE value like x,y+1/2,-z is no longer acceptable. Hence either the character set has to be restricted, or the ONLY whitespace delimited tokens are identifiers (datanames), keywords and numerical values, and everything else HAS to be in delimited strings.
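To make the two points above concrete, here is a minimal sketch in Python. It is not the CIF/STAR or DDLm grammar: it assumes a comma-separated list in square brackets and a hypothetical bare-value alphabet (anything except whitespace, commas and square brackets), and the function names are made up for illustration. It shows one list rule being reused recursively for nested lists, with no whitespace required around the brackets, and what then happens to a value like x,y+1/2,-z.

```python
import re

# Assumption: a bare value is any run of characters other than
# whitespace, commas and square brackets (hypothetical alphabet).
BARE = re.compile(r"[^\s,\[\]]+")


def skip_ws(text, pos):
    """Advance pos past any whitespace."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_value(text, pos=0):
    """Parse one value (bare token or list) starting at text[pos].

    Returns (value, next_pos). Nested lists are handled by calling the
    same list rule again, so no second 'list inside a list' rule exists.
    """
    pos = skip_ws(text, pos)
    if pos < len(text) and text[pos] == "[":
        return parse_list(text, pos)
    match = BARE.match(text, pos)
    if not match:
        raise ValueError(f"expected a value at offset {pos}")
    return match.group(), match.end()


def parse_list(text, pos):
    """Parse '[' value (',' value)* ']' with optional whitespace."""
    items = []
    pos = skip_ws(text, pos + 1)  # step over '['
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    while True:
        value, pos = parse_value(text, pos)
        items.append(value)
        pos = skip_ws(text, pos)
        if pos >= len(text):
            raise ValueError("unterminated list")
        if text[pos] == "]":
            return items, pos + 1
        if text[pos] != ",":
            raise ValueError(f"expected ',' or ']' at offset {pos}")
        pos += 1  # step over ','
```

Under these assumptions, parse_value("[a,[b,c],d]") yields the nested list in one pass, while parse_value("[x,y+1/2,-z]") yields three items rather than one. At the top level, "x,y+1/2,-z".split() would still see a single whitespace delimited value, which is exactly the inconsistency described above.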
We have to change something (there is no argument there), and while we are at it let's change things to be more consistent and sensible all round.

> that the reserved items be:
>
> Any item beginning with an underscore ('_')
> Any item beginning with "data_" or "save_" (case insensitive)
> Any item consisting of "global_", "loop_", "stop_" (case insensitive)
> Any item beginning with one of the quote marks:
>   '"' (double quote)
>   '\'' (single quote)
>   '\n;' (newline-semicolon) (where newline is system dependent)
>   '[', '{', '(' (the three bracket constructs in the original DDLm
>   proposal)
>   '\'\'\'' or '"""' (the two treble quote marks used in other languages

Agree with these, reiterating that I don't see a great need for ''', but I am not religious about that objection.

HOWEVER, I can't see (nor can anyone I discuss parsing and compiler writing with) why the delimiter HAS to be preceded (and followed) by one or more whitespace characters. Yes, it is nice to, and 99.99999% of programs WILL, but it should not be an enforced requirement of the language (especially when it is totally unnecessary). In this way the grammar definition for a data type can be called recursively when injected into a compound data structure, rather than having to create a second (almost identical) production rule when the same data type is to be injected into a compound data structure. That just simply doesn't make sense.

> When an item begins with one of the quote marks it would then have to
> conform to the conventions specified for those quote marks, but in general
> at the top level, the mating terminal quote mark would not be recognized
> as a terminal quote mark unless followed by whitespace.

Ditto for the argument as to why a trailing whitespace is not needed, and hence should not be specified as part of the language.

> I would prefer to handle the elides one level down, i.e.
> not treating
> '"""\\\n' as a terminal treble quote mark because the last '"' is followed
> by a reverse solidus rather than by whitespace.

The approach usually taken does not involve what comes after a character but what comes before. So """ terminates the triple quote token irrespective of what follows. In my previous email on this topic (different thread) I suggested spaces, but that is actually introducing yet another non-standard approach. Let's stick with elides.

As far as the scanner is concerned, if I elide a character then it is NOT to be considered as part of a sequence to determine tokens and is to be interpreted as raw. Hence

   _ADataName """Here is a string including a \""" quote"""

is OK. Note I only need to elide the first quote, since the scanner would know to ignore the first quote, and the following two quotes don't mean anything in the grammar rule. The scanner should return the raw string when asked, that is

   Here is a string including a \""" quote

It is then up to the consumer of this lexeme to decide what to do with it. This is James' favoured approach. At the highest level the elides mean ignore the next character, and then return the contents as a raw string. It is then up to the downstream application to do what it needs to. I think this is a much neater approach, and certainly one more consistent with the way things are done in most parsing applications.

In the end the easiest thing is to wrap all strings up in a triple quote. Then any single or double quote characters in the lexeme won't cause trouble. If a triple quote has to be included, elide the first quote.

In this new approach to delimiters, the string

   ""Hello", he said"

which is acceptable in the current specification (though it would seriously concern any writer of parsers), would now have to be

   '"Hello", he said'

or

   """"Hello", he said"""

To date I have only discussed accepting elides in the new triple quote strings. I would prefer NOT to introduce eliding in the old single or double quote strings.
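The scanner behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function name scan_triple_quoted is hypothetical, the elide character is taken to be the reverse solidus, and only the """ form is handled. The closing delimiter is recognised irrespective of what follows it, an elided character is never considered as the start of a delimiter, and the contents come back raw with the elide still in place.

```python
def scan_triple_quoted(text, start=0):
    """Scan a triple-quoted string beginning at text[start].

    Returns (raw_contents, next_pos). A reverse solidus elides the next
    character: that character cannot begin the closing delimiter, and
    both characters are left untouched in the returned raw contents.
    """
    delim = '"""'
    if not text.startswith(delim, start):
        raise ValueError("expected opening triple quote")
    pos = start + len(delim)
    body_start = pos
    while pos < len(text):
        if text[pos] == "\\":
            pos += 2  # skip the elide and the character it protects
        elif text.startswith(delim, pos):
            # Terminate here regardless of what character comes next.
            return text[body_start:pos], pos + len(delim)
        else:
            pos += 1
    raise ValueError("unterminated triple-quoted string")
```

For example, scanning the value from the _ADataName line above returns the contents with the elide intact, and scanning """"Hello", he said""" returns "Hello", he said including its inner double quotes. What to do with the elide afterwards is left to the consumer of the lexeme, as argued above.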
> I would prefer to accept all UTF-8 text.

Support for a wider character set is important, but I am not clear exactly what this preference means. Are you saying we are to accept the binary UTF-8 character set, which would mean cifs opened in a standard text editor would look like gobbledygook, or are you saying we will accept an ascii-fied string representing the UTF-8? That is \uABCD or &#ABCD; or some such equivalent?

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (Herbert J. Bernstein)
- References:
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (Herbert J. Bernstein)