Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   I, too, am a fan of consistency, but in this case, we have a huge
problem (in terms of numbers of existing data sets) if we drop the
established convention that the terminal quote marks in a CIF are only
effective if they are followed by whitespace.  The PDB uses such
constructs as 'O'', which would be broken by making a consistent
improvement.
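
   As a minimal sketch of the convention at issue (Python, with the
function name scan_quoted chosen only for illustration), a scanner that
honours the trailing-whitespace rule might look like the following. It
assumes a quoted value stays on one line and that end of input counts as
whitespace.

```python
def scan_quoted(text, start=0):
    """Scan a quoted value beginning at text[start].

    The closing quote mark only terminates the value when it is followed
    by whitespace or by the end of the input.
    Returns (value, index just past the closing quote).
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("not at a quote mark")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            break  # assumption: a quoted value may not span lines
        if ch == quote and (i + 1 == len(text) or text[i + 1].isspace()):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted value")


# The PDB-style atom name from the text: the inner quote is kept as data.
value, end = scan_quoted("'O''  C")
print(value)  # O'
```

   A scanner that closed the string at the first matching quote mark
would instead return O and leave a stray ' behind, which is the breakage
described above.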

   Similar, but more subtle and difficult-to-debug, problems arise if
we drop the requirement for trailing whitespace to recognize a
valid trailing \n; text field close, requiring major changes both
to the content of CIFs and to the line-folding logic.

   CIF whitespace handling is different from whitespace handling in,
say, Python or Java or C, and in DDLm we expect to be quoting things
that involve both conventions.  I think we are going to need to
sacrifice consistency in favor of being able to handle existing CIFs.

   So, I would recommend having the lexical scan work in levels.  At the
top level, in a CIF, even a ] should require trailing white space to
be valid, even though that is an unnecessary requirement, while within
a bracketed or quoted construct, different rules would apply depending
on the construct being used.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 24 Sep 2009, Nick Spadaccini wrote:

>
> Before addressing specifics let me say I am coming at this problem by
> exploiting 50 years of computer science's fascination with parsing and
> getting it right. In the end having structures that are consistent and
> closed under their repetitive operation is important. That is, a list is
> defined by a production rule, and a list of lists should just be a recursive
> call to that production rule. Not so in the current CIF/Star. One has to
> define a "different" list rule recursively since whitespace before [ and
> after ] is NOT required. If you did require that whitespace, it would not be
> sensible.
>
> Herb is pushing for (essentially) the status quo. I argue that simply is not
> possible. See below.
>
> On 22/09/09 5:55 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Dear Colleagues,
>>
>>    I think it urgent to at least hear from the PDB and the IUCr journals
>> operation on the subject of remediating all existing CIFs, as well as
>> from the managers of the major graphics and data processing packages very
>> early in the discussion.
>>
>>    However, never one to fail to rush in where angels fear to tread, here
>> are my comments on substance:
>>
>>    I would prefer to retain the current CIF approach of recognizing
>> anything that can be whitespace delimited and which is not in a small
>> list of reserved items as a whitespace delimited value.  I would suggest
>
> Does this mean whitespace delimited values are EXCLUDED from forming part of
> a compound data structure, such as a list? Because it will have to be if we
> don't adopt a restriction to the allowed character set of these values.
>
> If they can be part of a compound data structure then currently a legitimate
> whitespace delimited SINGLE value like x,y+1/2,-z is no longer acceptable.
>
> Hence the character set has to be restricted, or the ONLY white space
> delimited tokens are identifiers (datanames), keywords and numerical values,
> and everything else HAS to be in delimited strings.
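
To illustrate the ambiguity described above, here is a minimal Python
sketch. It assumes that list elements are separated by commas inside
[ ], which the message implies but does not spell out; the data name
_example_symop and the helper split_list are hypothetical.

```python
# Hypothetical top-level fragment; the data name is made up for illustration.
top_level = "_example_symop x,y+1/2,-z"

# Whitespace-delimited scan: the symmetry operator is ONE value.
tokens = top_level.split()
print(tokens)  # ['_example_symop', 'x,y+1/2,-z']


def split_list(text):
    """Naive split of a bracketed list of bare (undelimited) values.

    Assumption: elements are separated by commas. Quoted elements are
    not handled; a real scanner would be needed for those.
    """
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("not a bracketed list")
    return [item.strip() for item in inner[1:-1].split(",")]


# The same characters inside a list fall apart into THREE values.
items = split_list("[x,y+1/2,-z]")
print(items)  # ['x', 'y+1/2', '-z']
```

Under that assumption the same character sequence is one value at the
top level and three values inside a list, so either the character set
of bare values is restricted or such values have to be delimited
strings.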
>
> We have to change something (there is no argument there), and while we are
> at it let's change things to be more consistent and sensible all round.
>
>> that the reserved items be:
>>
>>    Any item beginning with an underscore ('_')
>>    Any item beginning with "data_" or "save_" (case insensitive)
>>    Any item consisting of "global_", "loop_", "stop_" (case insensitive)
>>    Any item beginning with one of the quote marks:
>>       '"' (double quote)
>>       '\'' (single quote)
>>       '\n;' (newline-semicolon) (where newline is system dependent)
>>       '[', '{', '(' (the three bracket constructs in the original DDLm
>>          proposal)
>>       '\'\'\'' or '"""' (the two treble quote marks used in other languages
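
The reserved-item list above can be read as a simple classifier. The
following Python sketch is one possible reading, not an agreed rule:
the function name is_reserved and the at_line_start flag (standing in
for the newline-semicolon test) are assumptions made for illustration.

```python
RESERVED_WORDS = {"global_", "loop_", "stop_"}   # whole-item matches
RESERVED_PREFIXES = ("data_", "save_")           # prefix matches
OPENERS = ('"', "'", "[", "{", "(")              # also covers ''' and \"\"\"


def is_reserved(item, at_line_start=False):
    """Return True if a whitespace-delimited item is in the reserved list.

    at_line_start marks an item that immediately follows a newline, so
    that a leading ';' can be treated as the newline-semicolon opener.
    """
    lower = item.lower()
    if item.startswith("_"):
        return True
    if lower.startswith(RESERVED_PREFIXES):
        return True
    if lower in RESERVED_WORDS:
        return True
    if item.startswith(OPENERS):
        return True
    if at_line_start and item.startswith(";"):
        return True
    return False


for item in ["_cell_length_a", "DATA_block", "Loop_", "x,y+1/2,-z"]:
    print(item, is_reserved(item))
```

Anything for which is_reserved returns False would be accepted as a
whitespace-delimited value under this reading, including x,y+1/2,-z.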
>
> Agree with these, reiterating that I don't see a great need for ''', but I
> am not religious about that objection. HOWEVER, neither I nor anyone I
> discuss parsing and compiler writing with can see why the delimiter
> HAS to be preceded (and followed) by one or more whitespaces. Yes, it is nice
> to, and 99.99999% of programs WILL, but it should not be an enforced
> requirement of the language (especially when it is totally unnecessary). In
> this way the grammar definition for a data type can be called recursively
> when injected into a compound data structure, rather than having to create
> a second (almost identical) production rule when the same data type is to be
> injected into a compound data structure. That just simply doesn't make
> sense.
>
>> When an item begins with one of the quote marks it would then have to
>> conform to the conventions specified for those quote marks, but in general
>> at the top level, the mating terminal quote mark would not be recognized
>> as a terminal quote mark unless followed by whitespace.
>
> Ditto for the argument as to why a trailing white space is not needed, and
> hence should not be specified as part of the language.
>
>>    I would prefer to handle the elides one level down, i.e. not treating
>> '"""\\\n' as a terminal treble quote mark because the last '"' is followed
>> by a reverse solidus rather than by whitespace.
>
> The approach usually taken does not involve what comes after a character but
> what happens before. So """ terminates the triple quote token irrespective
> of what follows. In my previous email on this topic (different thread) I
> suggested spaces, but that is actually introducing yet another non-standard
> approach. Let's stick with elides. As far as the scanner is concerned, if I
> elide a character then it is NOT to be considered as part of a sequence to
> determine tokens and is to be interpreted as raw.
>
> Hence
>
> _ADataName """Here is a string including a \""" quote"""
>
> is OK. Note I only need to elide the first quote since the scanner would
> know to ignore the first quote, and the following two quotes don't mean
> anything in the grammar rule. The scanner should return the raw string when
> asked, that is
>
> Here is a string including a \""" quote
>
> It is then up to the consumer of this lexeme to decide what to do. This is
> James' favoured approach. At the highest level the elides mean ignore the
> next character, and then return the contents as a raw string. It is then up
> to the downstream application to do what it needs to. I think this is a
> much neater approach, and certainly one more consistent with the way things
> are done in most parsing applications.
>
> In the end the easiest thing is to wrap all strings up in a triple quote.
> Then any single or double quote characters in the lexeme won't cause
> trouble. If a triple quote has to be included, elide the first quote.
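
The elide rule described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the elide
character is the reverse solidus, only the """ form is handled, and the
function name scan_triple_quoted is hypothetical.

```python
def scan_triple_quoted(text, start=0):
    """Scan a triple-quoted string beginning at text[start].

    A reverse solidus elides the next character: that character takes no
    part in delimiter matching, and both characters stay in the raw
    result. The closing delimiter terminates the string irrespective of
    what follows it.
    Returns (raw_contents, index just past the closing delimiter).
    """
    delim = '"""'
    if not text.startswith(delim, start):
        raise ValueError("not at a triple quote")
    i = start + len(delim)
    body_start = i
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2  # skip the elide and the elided character, unchanged
            continue
        if text.startswith(delim, i):
            return text[body_start:i], i + len(delim)
        i += 1
    raise ValueError("unterminated triple-quoted string")


source = '"""Here is a string including a \\""" quote"""'
raw, end = scan_triple_quoted(source)
print(raw)  # Here is a string including a \""" quote
```

The raw contents still carry the reverse solidus; removing or
interpreting it is left to whatever consumes the lexeme.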
>
> In this new approach to delimiters, the string
>
> ""Hello", he said"
>
> which is acceptable in the current specification, but which would seriously
> concern any writer of parsers, would now have to be
>
> '"Hello", he said'
>
> or
>
> """"Hello", he said"""
>
> To date I have only discussed accepting elides in the new triple quote
> strings. I would prefer NOT to introduce eliding in the old single or double
> quote strings.
>
>>    I would prefer to accept all UTF-8 text.
>
> Support for a wider character set is important, but I am not clear exactly
> what this preference means. Are you saying we are to accept the binary UTF-8
> character set, which would mean CIFs opened in a standard text editor would
> look like gobbledygook, or are you saying we will accept an ASCII-fied
> string representing the UTF-8? That is \uABCD or &#ABCD; or some such
> equivalent?
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
