Re: [ddlm-group] Space as a list item separator

Dear all

One point I read in David's comments is that there are no legacy issues with respect to lists, associative arrays etc.
Does anyone disagree? Obviously it makes life easier when considering lists etc. if the 'legacy' word doesn't rear its head.

From: David Brown <idbrown@mcmaster.ca>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 30 November, 2009 19:56:30
Subject: Re: [ddlm-group] Space as a list item separator

Please forgive me, everyone, but what is all this CIF1.5 about?

Why do we need it?

If a DDLm application is presented with a CIF data file written using a DDL1 or DDL2 dictionary, which I assume uses CIF1.1 syntax, why can't we continue to use CIF1.1 since this works just fine for these files?  Why do we need CIF1.5?

CIF data files written using DDL1 and DDL2 dictionaries do not contain lists and arrays because lists and arrays were not invented when these files were written, and any data files written with these dictionaries in the future (and there may be many of them) will still use the CIF1.1 syntax.  There is no danger of arrays slipping into these data files unnoticed because they are not defined (and never will be) in DDL1 and DDL2 dictionaries (CIF1.1 does not allow it.)

Of course our DDLm application (if we ever get it off the ground) will need to be able to read data files written with CIF1.1 syntax because we are required to ensure that this application can read in any existing CIF data file.  It will also need to be able to read files written in CIF2 syntax because CIF2 will be needed for reading in the DDLm dictionaries (the only dictionaries that contain dREL) and the CIF2 data files (which may, unlike the CIF1.1 data files, also contain arrays and lists).

As I pointed out earlier (and it seems to have come as something of a shock or epiphany to some), the DDLm dictionaries include very nice lists of aliases that contain every data name that was ever used for a given item.  The data names in this alias list are, of course, quoted data values within the DDLm dictionary, and some contain characters that CIF2 would not recognize in a data name, but that is fine because they appear only in data values, and quoted data values no less.

When confronted with a data file written in CIF1.1, our hypothetical application would switch on its CIF1.1 lexer to read in the CIF1 data file, and pass the results into a preparser which would match the data name in the CIF1.1 data file with an alias name in the DDLm dictionary, and immediately substitute the DDLm data name for the original DDL1 or DDL2 data name.  Now all the problems with the old data names have disappeared.  The preparser might have to make other changes to the data value (I am not sure that there are any, perhaps adding delimiters to all strings so they could be stripped away by the parser?).  At this point you have a fully compliant CIF2-DDLm data set, which you can dREL to your heart's content.  In particular, if dREL calls for an array, the item associated with that array will contain a dREL method for assembling the array from the individual data items that were originally stored in the input CIF and are now stored under a DDLm defined name.  The only thing that would be difficult to do would be to reconstruct a DDL1 or DDL2 compliant data output file, but even this could be done if it was thought necessary.
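
A minimal sketch of the renaming step in such a preparser, assuming the CIF1.1 lexer has already produced a mapping of data names to values. The alias table and the function name rename_items are hypothetical and for illustration only; a real table would be built from the alias lists in the DDLm dictionary.

```python
# Hypothetical alias table: legacy (DDL1/DDL2) data name -> DDLm data name.
# The entries are illustrative; a real table would be read from the
# alias lists of a DDLm dictionary.
ALIASES = {
    "_cell_length_a": "_cell.length_a",
    "_atom_site_label": "_atom_site.label",
}


def rename_items(items, aliases=ALIASES):
    """Return a new dict keyed by DDLm names.

    Data names are matched case-insensitively; names with no alias
    entry are passed through unchanged.
    """
    renamed = {}
    for name, value in items.items():
        renamed[aliases.get(name.lower(), name)] = value
    return renamed


legacy = {"_cell_length_a": "5.431", "_local_note": "unchanged"}
print(rename_items(legacy))
```

Values are left untouched here; any further changes to the data values would be a separate step.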

Please let's not make this exercise more confusing than necessary. 

You guys need to get on with defining what you want in CIF2.  CIF1 can then look after itself using the existing tools together with the aliases for renaming the items.


Herbert J. Bernstein wrote:
Dear Colleagues,

  Instead of looking at the minimally disruptive approach as a modification to CIF 2, in order to in fact be minimally disruptive, I would suggest looking at CIF 1.5 in terms of what would need to be changed in CIF 1.1 in order to support DDLm.

  I think the following will do it:

  For data values, only, recognize three new initial string delimiters in addition to the existing single quote ("'"), double quote ("\"") and newline-semicolon ("\n;"):

  left brace ("{")
  left square bracket ("[")

Unless these are encountered in a left to right scan at a point at which the first character of a data value is expected, the parse remains the same as for CIF 1.1.

Once the left brace or left square bracket is encountered, then whatever the formally agreed rules for the CIF2 parse are would apply until the balancing terminal right brace or right square bracket.  It is only the top level terminal right brace or right square bracket that would be required to be followed by whitespace.
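
A rough sketch of the scan for the balancing terminal bracket, with the check that only the top-level close must be followed by whitespace. The function names are hypothetical, and quoted strings inside the brackets are deliberately ignored, since the inner rules are whatever is eventually agreed for CIF2.

```python
OPEN_TO_CLOSE = {"[": "]", "{": "}"}


def find_balancing_close(text, start):
    """Return the index of the bracket that balances text[start].

    Simplification: quoted strings inside the brackets are not handled.
    """
    if text[start] not in OPEN_TO_CLOSE:
        raise ValueError("not at an opening bracket")
    stack = []
    for i in range(start, len(text)):
        ch = text[i]
        if ch in OPEN_TO_CLOSE:
            stack.append(OPEN_TO_CLOSE[ch])
        elif ch in ("]", "}"):
            if not stack or stack.pop() != ch:
                raise ValueError(f"mismatched bracket at index {i}")
            if not stack:
                return i
    raise ValueError("unterminated bracket")


def top_level_close_ok(text, start):
    """True if the top-level close is followed by whitespace or end of input."""
    end = find_balancing_close(text, start)
    return end + 1 == len(text) or text[end + 1].isspace()
```

Inner closing brackets are not checked for trailing whitespace, matching the rule that only the top-level terminal bracket needs it.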

The new dictionaries would _not_ be written in CIF 1.5, only in full CIF 2, but parsers would be expected to process any CIF not clearly self-identifying as a CIF 2 file as a CIF 1.5 file.  This means that the only major use of CIF 2 constructs in CIF 1.5 would be to allow users to provide list, matrix and vector data values.

This also means, for example, as per David's suggestion, that the only way a tag with embedded square brackets or embedded braces would be handled in a new dictionary would be as an alias, but the formality of CIF 1.5 would give applications a clean way to make use of those aliases in parsing data files.

If we follow this approach, then we would be honoring the published commitment to be able to keep essentially all existing data files unchanged, and still be able to handle them with DDLm.  The only exception would be data files that happen to include data values that begin with '{' or '[', which would now have to be quoted. I do not believe that there are many such cases, and I believe that there would be acceptance of the need to add such quoting if encountered.

To summarize:

  Development of CIF 2 with DDLm support would continue and be used for
new dictionaries; and

  Development of CIF 1.5 to serve as a bridge between CIF 1.1 and DDLm would start, primarily giving users the ability to provide list, matrix and vector data values, to allow for a smooth transition to wider use of DDLm and CIF 2.


 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


On Sun, 29 Nov 2009, SIMON WESTRIP wrote:

Yes that summarizes the differences. Unfortunately, the single-byte
non-delimited strings have to be separated by
white space in this approach, which is perhaps counter-intuitive and might
have some legacy issues?

From: James Hester <jamesrhester@gmail.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Sunday, 29 November, 2009 3:45:18
Subject: Re: [ddlm-group] Space as a list item separator

Hi Simon: I'm trying to read between the lines here as to how the syntax we
have been discussing diverges from what you have described, and have come up
with the following list:

1. Presumably the []{} characters must be surrounded by whitespace in your
2. We have restricted the character sets of the non-delimited strings and
tags more than strictly necessary.
3. Comma might be included in the single-byte non-delimited string list

Are there any other differences that you would identify?

On Sat, Nov 28, 2009 at 10:58 PM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
      Dear all

      I was chatting with the man who 'writes the cheques' yesterday
      about some of the
      changes he might expect with CIF2, and based on this I feel I
      ought to at least have
      a go at exploring a 'minimally disruptive' approach, so at the
      risk of being shouted at,
      here goes at a slightly different way of looking at CIF:

      CIF contains a list of strings separated by whitespace.

      A string can be nondelimited or delimited.

      Nondelimited strings have a restricted character set (minimally
      whitespace is excluded)

      A nondelimited string cannot start with any of the delimiters

      Nondelimited strings can have special meaning governing what
      follows them:

          reserved words, e.g. loop_

          tags, e.g. data_ , _foo

          single-byte nondelimited strings, e.g. [ ] { } :

      All other strings are treated as raw data values

      There, least I can say I tried :-)
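
A minimal sketch of this view of CIF as whitespace-separated strings, assuming only nondelimited strings are present. The category names and the functions classify and tokenize are hypothetical; delimited (quoted) strings would need a real lexer and are not handled.

```python
SINGLE_BYTE = set("[]{}:")


def classify(token):
    """Classify one nondelimited string following the scheme above."""
    lower = token.lower()
    if lower == "loop_":
        return "reserved"
    if lower.startswith("data_") or token.startswith("_"):
        return "tag"
    if token in SINGLE_BYTE:
        return "single-byte"
    return "value"


def tokenize(text):
    """Split on whitespace only and label each string."""
    return [(tok, classify(tok)) for tok in text.split()]


for tok, kind in tokenize("data_x _foo [ 1 2 ] loop_"):
    print(f"{tok:8} {kind}")
```

Note that the brackets are only recognised here when surrounded by whitespace, which is the consequence of treating them as ordinary single-byte strings.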



From: SIMON WESTRIP <simonwestrip@btinternet.com>
To: Group finalising DDLm and associated dictionaries
Sent: Saturday, 28 November, 2009 10:01:38

Subject: Re: [ddlm-group] Space as a list item separator

I had been under the assumption that the separation of list items by a
comma was 'set in stone'
(and was one reason for dropping the CIF1 syntax of requiring space
after data values),
but if it's up for negotiation I would opt for using the space as a
separator as elsewhere in the CIF,
partly because then a list can essentially be treated much like a
single-item loop - i.e. same basic parsing
of <value><space><value><space>...
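
A minimal sketch of that idea: a space-separated list parsed with the same <value><space><value> scan, with nesting handled by recursion. The function name parse_list is hypothetical, only square brackets are covered, and quoted values are not handled.

```python
import re

# A token is either a single bracket or a run of characters that are
# neither whitespace nor brackets.
TOKEN = re.compile(r"[\[\]]|[^\s\[\]]+")


def parse_list(text):
    """Parse a space-separated, possibly nested, bracketed list."""
    tokens = TOKEN.findall(text)
    if not tokens or tokens[0] != "[":
        raise ValueError("expected '[' at start of list")
    pos = 0

    def parse():
        nonlocal pos
        pos += 1  # consume '['
        items = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                items.append(parse())
            else:
                items.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unterminated list")
        pos += 1  # consume ']'
        return items

    result = parse()
    if pos != len(tokens):
        raise ValueError("unexpected text after list")
    return result


print(parse_list("[1 2 [3 4] 5]"))
```

Values are kept as strings; converting them to numbers would be a later step, as for loop values.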



From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries
Cc: Nick.Spadaccini@uwa.edu.au
Sent: Friday, 27 November, 2009 11:43:10
Subject: Re: [ddlm-group] Space as a list item separator

Dear Colleagues,

  I have no objection to accepting either comma or whitespace
as a valid separator in a list.  I can't object -- I have been
coding to that standard since 1997, and now would only have to
remove the message generated for the case of the space.  We already
accept multiple glyphs as valid separators at all levels:

  whitespace itself is one of several character sequences in rather
complex combinations:  any number of blanks, tabs, newlines and
The comma itself is handled in a complex way.  We accept (or should
accept) any whitespace before and after a comma as valid, as in
{a,b} versus {a , b }.  Adding the option of leaving out the comma
itself and just having the whitespace as the separator makes just
as much sense.

  I see nothing to be gained by now forbidding the comma.  The meaning
of {a,,b,} is the same as {a,.,b,.} or {a,?,b,?} or, under this new
(and I think more sensible and realistic) approach, {a . b .} or {a ?
b ?}.
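
A small sketch of that reading, assuming a flat list body (the text between the outer braces) with no nesting or quoting. The function name split_list_body is hypothetical, and using '.' rather than '?' for an empty field is an arbitrary choice made for the sketch.

```python
def split_list_body(body):
    """Split the text between the outer brackets of a flat list.

    Both commas and whitespace separate items. An empty comma-delimited
    field (as in 'a,,b,') is read as the placeholder '.'.
    """
    if not body.strip():
        return []
    items = []
    for field in body.split(","):
        parts = field.split()
        # A field with no content between commas stands for a missing value.
        items.extend(parts if parts else ["."])
    return items


print(split_list_body("a,,b,"))
print(split_list_body("a . b ."))
print(split_list_body("a , b "))
```

Under this reading the comma form and the blank form of the same list produce the same items, so neither needs an error or a warning.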

  The blank reads particularly well in dealing with vectors and
matrices. The comma reads well when dealing with strings.

  I think we would do best with both as valid alternatives (no error,
no warning for either one).

    Herbert
Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 27 Nov 2009, SIMON WESTRIP wrote:

> At first glance, you're considering using space instead of commas as
> separators?
> which is not so far away from the CIF1 requirement of space following a
> delimiter?
> But I'm only on my first cup of coffee this morning :-)
> From: Nick Spadaccini <nick@csse.uwa.edu.au>
> To: Group finalising DDLm and associated dictionaries
> Sent: Friday, 27 November, 2009 7:46:44
> Subject: Re: [ddlm-group] Space as a list item separator
> On 27/11/09 2:32 PM, "James Hester" <jamesrhester@gmail.com> wrote:
> > See comments below:
> >
> > On Fri, Nov 27, 2009 at 3:09 PM, Nick Spadaccini wrote:
> >> Timely email, come in just after the one I sent.
> >>
> >> My position is if we specify the syntax then we encourage its correct use
> >> but acknowledge that there may be cases where one might be able to
> >> intent. But I wouldn't encourage those cases.
> >
> > Absolutely, which is why I would like to elevate space-separated items to
> > be correct syntax rather than 'wrong but intent is clear' syntax.
> >>
> >> You could say that token separator in lists are a or b or c, but just
> >> adds a level of complexity for very little gain. The choice of makes it
> >> seamless to translate from the raw CIF data straight in to most
> >> specific data declaration. The only language I know that accepts one or
> >> the other or both is MatLab.
> >
> > Re ease of translation: you speak as if a viable approach to a CIF file
> > is to take whole text chunks and throw them at some language
> > without doing your own parse.  Quite apart from being a rather
> > approach, this is impossible, as without parsing you won't know where the
> > list finishes.  If you do do your own parse, you can populate your
> > datastructures directly during the parse, and what list separator was
> > originally used in the data file is completely irrelevant.
> >
> > Re complexity: not sure how you are planning to deal with whitespace in
> > the formal grammar, but consider the following, where I have assumed that
> > each token 'eats up' the following whitespace.
> >
> > <dataitem> = <dataname><whitespace>+<datavalue>
> > <datavalue> = {<list>|<string>}<whitespace>+
> > <listdatavalue> = {<list>|<string>}<whitespace>*
> > <list> = '[' <whitespace>* {<listdatavalue>
> > {<comma><whitespace>*<listdatavalue>}*}* ']'
> >
> > If we make comma or whitespace possible separators, the last
> > becomes:
> > <list> =  '[' <whitespace>* {<listdatavalue> {<comma or
> > whitespace><listdatavalue>}*}* ']'
> >
> > This looks like no extra complexity, and from a user's point of
> > whitespace as an alternative separator is simple to understand and
> > consistent with space as a token separator used everywhere else in CIF.
> > Anyway, if reduction of grammar complexity is your goal, you can just
> > exclude commas as list separators!
> Why not? Make them spaces only, and you become consistent across the
> I have to think about the possibility of pathological cases where
> won't work. I can't think of any at the moment.
> >
> > Some questions about how commas behave:
> > 1: is a trailing comma e.g. [1,2,3,4,] a syntax error?
> > 2. are two commas in a row a syntax error? E.g. [1,2,3,,4]
> I would say yes to syntax error. I can easily determine they may need to be
> an additional list value, but can't determine what.
> > Note the above productions assume that the answer to both is yes.
> >
> >>
> >> What big advantage to a language is there to specify you can use a comma or
> >> whitespace as a token separator? Will you be happy with the first
> >> who interprets this as being ok
> >>
> >> loop_
> >>   _severalvalues 1,2,3,4,5,6,7 # these being the 7 values of severalvalues
> >>
> > Not sure what you are getting at here: I am proposing the
> >
> > _nicelist      [1 2 3 4 5 6 7]
> >
> > being the same as
> >
> > _nicelist      [1,2,3,4,5,6,7]
> >
> >  Don't see how this relates to loops.
> The point was, once you say a space and comma are equivalent token
> separators then will it be an interpretation that they are always so even in
> loops? My example was not a list, just 7 values that were separated
> commas not spaces.
> >
> > James.
> > ------
> >>
> >> On 27/11/09 11:41 AM, "James Hester" <jamesrhester@gmail.com> wrote:
> >>
> >>> Dear All: looking over the list I posted previously of items left to
> >>> resolve, I see only one serious one outstanding: whether or not to allow
> >>> space as a separator between list items.  Nick has stated:
> >>>
> >>> " I will propose it has to be a comma, but make the coercion rule that
> >>> space separated values in a list-type object be coerced into comma
> >>> values. That is, read spaces as you want, but don't encourage
> >>>
> >>> I would like to counter-propose, as Joe did originally, that
> >>> be elevated to equal status with comma as a valid list separator.  I see
> >>> no downside to this.  Would anyone else like to speak to this issue
> >>> we vote?  In particular, I would be interested to hear why Nick
> >>> want to encourage spaces.
> >>
> >> cheers
> >>
> >> Nick
> >>
> >> --------------------------------
> >> Associate Professor N. Spadaccini, PhD
> >> School of Computer Science & Software Engineering
> >>
> >> The University of Western Australia    t: +61 (0)8 6488 3452
> >> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> >> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3:
> >> <http://www.csse.uwa.edu.au/%7Enick>
> >> MBDP  M002
> >>
> >> CRICOS Provider Code: 00126G
> >>
> >> e: Nick.Spadaccini@uwa.edu.au
> >>
> >>
> >>
> >> _______________________________________________
> >> ddlm-group mailing list
> >> ddlm-group@iucr.org
> >> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >>
> >
> >
> cheers
> Nick
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA  w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
> CRICOS Provider Code: 00126G
> e: Nick.Spadaccini@uwa.edu.au
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group


