Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.


Before addressing specifics, let me say I am coming at this problem by
exploiting 50 years of computer science's fascination with parsing and
getting it right. In the end, having structures that are consistent and
closed under their repetitive operation is important. That is, a list is
defined by a production rule, and a list of lists should just be a recursive
call to that production rule. Not so in the current CIF/Star. One has to
define a "different" list rule recursively since whitespace before [ and
after ] is NOT required. If you did require that whitespace, it would not be
sensible.
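
To make the "one production rule, called recursively" point concrete, here
is a minimal sketch in Python. It is not taken from any CIF/STAR
specification: the comma separator, the atom character set, and the function
names (parse_value, parse_list, parse_atom, skip_ws) are all assumptions
made for illustration. The only thing it is meant to show is that a nested
list needs no second rule when whitespace around [ and ] is optional.

```python
def skip_ws(text, pos):
    """Advance past any whitespace; whitespace is allowed but never required."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_atom(text, pos):
    """Read a bare value: a run of characters up to whitespace, ',', '[' or ']'."""
    start = pos
    while pos < len(text) and text[pos] not in "[]," and not text[pos].isspace():
        pos += 1
    if pos == start:
        raise ValueError(f"expected a value at position {pos}")
    return text[start:pos], pos


def parse_list(text, pos):
    """list := '[' [ value { ',' value } ] ']'  (separator is an assumption)."""
    if text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos = skip_ws(text, pos + 1)
    items = []
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    while True:
        # The same value rule is reused here, so a list of lists is
        # nothing more than a recursive call.
        item, pos = parse_value(text, pos)
        items.append(item)
        pos = skip_ws(text, pos)
        if pos >= len(text):
            raise ValueError("unterminated list")
        if text[pos] == ",":
            pos += 1
        elif text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"unexpected {text[pos]!r} at position {pos}")


def parse_value(text, pos=0):
    """value := list | atom. Returns (parsed value, position after it)."""
    pos = skip_ws(text, pos)
    if pos < len(text) and text[pos] == "[":
        return parse_list(text, pos)
    return parse_atom(text, pos)
```

With this sketch, parse_value("[1,[2,3],[[4]]]") and
parse_value("[ 1 , [2, 3] ]") both go through the same two rules; the
brackets themselves are enough to delimit the nested values.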

Herb is pushing for (essentially) the status quo. I argue that simply is not
possible. See below.

On 22/09/09 5:55 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
wrote:

> Dear Colleagues,
> 
>    I think it urgent to at least hear from the PDB and the IUCr journals
> operation on the subject of remediating all existing CIFs, as well as
> from the managers of the major graphics and data processing packages very
> early in the discussion.
> 
>    However, never one to fail to rush in where angels fear to tread, here
> are my comments on substance:
> 
>    I would prefer to retain the current CIF approach of recognizing
> anything that can be whitespace delimited and which is not in a small
> list of reserved items as a whitespace delimited value.  I would suggest

Does this mean whitespace delimited values are EXCLUDED from forming part of
a compound data structure, such as a list? Because they will have to be if we
don't adopt a restriction on the allowed character set of these values.

If they can be part of a compound data structure then currently a legitimate
whitespace delimited SINGLE value like x,y+1/2,-z is no longer acceptable.

Hence the character set has to be restricted, or the ONLY white space
delimited tokens are identifiers (datanames), keywords and numerical values,
and everything else HAS to be in delimited strings.
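
The clash can be shown with a few lines of Python. This is only a sketch
under stated assumptions: it assumes list items are separated by commas, it
handles flat lists only, and the dataname _example_symop, the STRUCTURAL
character set and the function names are hypothetical, chosen purely for
illustration.

```python
# Characters assumed, for illustration only, to have structural meaning
# inside a compound value.
STRUCTURAL = set(",[]{}")


def top_level_values(text):
    """Current-style rule: anything between whitespace is a single value."""
    return text.split()


def flat_list_items(text):
    """Naive reading of a flat, comma-separated list such as [a,b,c]."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("not a bracketed list")
    return [item.strip() for item in inner[1:-1].split(",") if item.strip()]


def is_safe_bare_value(token):
    """True if the token contains none of the assumed structural characters."""
    return not any(ch in STRUCTURAL for ch in token)


# At the top level the symmetry operator is one value ...
print(top_level_values("_example_symop x,y+1/2,-z"))
# ... but inside a list the same characters read as three items.
print(flat_list_items("[x,y+1/2,-z]"))
# A restricted character set would flag it as needing delimiters.
print(is_safe_bare_value("x,y+1/2,-z"))
```

Under these assumptions the same text is one value at the top level and
three items inside a list, which is why either the bare-value character set
has to shrink or such values have to be written as delimited strings.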

We have to change something (there is no argument there), and while we are
at it let's change things to be more consistent and sensible all round.

> that the reserved items be:
> 
>    Any item beginning with an underscore ('_')
>    Any item beginning with "data_" or "save_" (case insensitive)
>    Any item consisting of "global_", "loop_", "stop_" (case insensitive)
>    Any item beginning with one of the quote marks:
>       '"' (double quote)
>       '\'' (single quote)
>       '\n;' (newline-semicolon) (where newline is system dependent)
>       '[', '{', '(' (the three bracket constructs in the original DDLm
>          proposal)
>       '\'\'\'' or '"""' (the two treble quote marks used in other languages

Agree with these, reiterating that I don't see a great need for ''', but I
am not religious about that objection. HOWEVER, I can't see the need (nor can
anyone I discuss parsing and compiler writing with) for a delimiter that HAS
to be preceded (and followed) by one or more whitespaces. Yes, it is nice to
have them, and 99.99999% of programs WILL, but it should not be an enforced
requirement of the language (especially when it is totally unnecessary).
Without that requirement, the grammar definition for a data type can be
called recursively when injected into a compound data structure, rather than
having to create a second (almost identical) production rule for the same
data type inside a compound data structure. Needing that second rule simply
doesn't make sense.
 
> When an item begins with one of the quote marks it would then have to
> conform to the conventions specified for those quote marks, but in general
> at the top level, the mating terminal quote mark would not be recognized
> as a terminal quote mark unless followed by whitespace.

Ditto for the argument as to why a trailing white space is not needed, and
hence should not be specified as part of the language.

>    I would prefer to handle the elides one level down, i.e. not treating
> '"""\\\n' as a terminal treble quote mark because the last '"' is followed
> by a reverse solidus rather than by whitespace.

The approach usually taken does not involve what comes after a character but
what comes before it. So """ terminates the triple quote token irrespective
of what follows. In my previous email on this topic (different thread) I
suggested spaces, but that is actually introducing yet another non-standard
approach. Let's stick with elides. As far as the scanner is concerned, if I
elide a character then it is NOT to be considered as part of a sequence that
determines tokens, and it is to be interpreted as raw.

Hence

_ADataName """Here is a string including a \""" quote"""

is OK. Note I only need to elide the first quote since the scanner would
know to ignore the first quote, and the following two quotes don't mean
anything in the grammar rule. The scanner should return the raw string when
asked, that is 

Here is a string including a \""" quote

It is then up to the consumer of this lexeme to decide what to do with it.
This is James' favoured approach. At the highest level the elides mean ignore the
next character, and then return the contents as a raw string. It is then up
to the down stream application to do what it needs to. I think this is a
much neater approach, and certainly one more consistent with the way things
are done in most parsing applications.
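
A minimal sketch of such a scanner, in Python, might look like the
following. It assumes the elide character is the reverse solidus and that
only the double-quote form of the triple quote is being scanned; the
function name scan_triple_quoted and its parameters are hypothetical and not
part of any proposal.

```python
def scan_triple_quoted(text, pos=0, elide="\\"):
    """Scan a triple-double-quoted token starting at pos.

    Returns (raw_lexeme, position_after_closing_quotes). The raw lexeme is
    handed back untouched, elide characters included, so that the consumer
    decides how to interpret them.
    """
    quote = '"""'
    if not text.startswith(quote, pos):
        raise ValueError(f"no opening triple quote at position {pos}")
    start = pos + len(quote)
    i = start
    while i < len(text):
        if text[i] == elide and i + 1 < len(text):
            # The elided character takes no part in delimiter detection;
            # both it and the elide stay in the raw lexeme.
            i += 2
            continue
        if text.startswith(quote, i):
            # The closing delimiter ends the token whatever follows it.
            return text[start:i], i + len(quote)
        i += 1
    raise ValueError("unterminated triple-quoted string")


source = '"""Here is a string including a \\""" quote"""'
raw, end = scan_triple_quoted(source)
print(raw)
```

Only the first quote of the embedded run needs eliding here: once it is
skipped, the two quotes that follow do not form a closing delimiter on their
own, and the scan carries on to the real terminator.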

In the end the easiest thing is to wrap all strings up in a triple quote.
Then any single or double quote characters in the lexeme won't cause
trouble. If a triple quote has to be included, elide the first quote.

In this new approach to delimiters, the string

""Hello", he said"

which is acceptable in the current specification (though it would seriously
concern any writer of parsers), would now have to be

'"Hello", he said'

or

""""Hello", he said"""

To date I have only discussed accepting elides in the new triple quote
strings. I would prefer NOT to introduce eliding in the old single or double
quote strings.

>    I would prefer to accept all UTF-8 text.

Support for a wider character set is important, but I am not clear exactly
what this preference means. Are you saying we are to accept the binary UTF-8
character set, which would mean CIFs opened in a standard text editor would
look like gobbledygook, or are you saying we will accept an ASCII-fied
string representing the UTF-8? That is, \uABCD or &#ABCD; or some such
equivalent?
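
For readers unsure of the difference between the two options, the short
Python sketch below encodes the same text both ways. The sample string is
made up for illustration, and the escape styles shown are simply the ones
Python's standard error handlers produce; they are not a proposal for what
CIF should adopt.

```python
# A made-up sample value: Greek small letter alpha followed by ASCII text.
text = "\u03b1-quartz"

# Option 1: the file holds the UTF-8 bytes themselves. How this looks in an
# editor depends on whether the editor decodes UTF-8.
raw_utf8 = text.encode("utf-8")

# Option 2: the file stays pure ASCII and non-ASCII characters are escaped.
escaped = text.encode("ascii", errors="backslashreplace")
char_ref = text.encode("ascii", errors="xmlcharrefreplace")

print(raw_utf8)   # UTF-8 bytes
print(escaped)    # backslash-u style escape
print(char_ref)   # numeric character reference style
```

Both forms carry the same information; the question is only whether the
decoding happens in the character encoding layer or in the CIF reader.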

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au





_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
