
Re: [ddlm-group] Revisiting list delimiters. .. .. .


On Thursday, April 07, 2011 7:27 PM James Hester wrote:

>I agree that focussing on the new information is appropriate.
>
>On Fri, Apr 8, 2011 at 3:31 AM, Bollinger, John C
><John.Bollinger@stjude.org> wrote:
>
>[edit]
>
>> I agree that it is not worthwhile to repeat the earlier discussion, but that is no reason to jump directly to a vote.  It seems reasonable to instead focus on any new information or insights that did not inform the previous discussion, and then to consider whether their combination with the considerations already discussed leads anyone to change their previous opinion.
>>
>> So, what is the new information we should consider?  James raised these points:
>>
>> JH> the only fully-functional software for processing DDLm domain dictionaries (Nick, Syd and Ian's demonstration software) expects a comma separator
>>
>> JH> [James's] understanding is that Syd and Nick (now) are strongly in favour of sticking with comma as the list separator for STAR2
>>
>> JH> other non-CIF domains are already using comma as a list separator in STAR2 data files.
>>
>> JH> for some, a comma may be a useful visual aid for distinguishing looped items and listed items.
>>
>> I responded to each of those points in my first message yesterday: http://www.iucr.org/__data/iucr/lists/ddlm-group/msg01244.html.  The short form is: (a) the direction of STAR2 is not persuasive (and I now add that James's proposal still diverges from STAR2), (b) the demo software will have to be changed anyway, including in this area, and (c) syntactically distinguishing looped and listed items has significant drawbacks directly associated with it.
>
>We are not in a situation where we would produce a standard that
>matched exactly either STAR2 or Nick's program.


Yes, that is part of the subtext of my arguments.


>  I am happy to drop
>the STAR2 conformance argument in (a), but for (b) there is practical
>value in reducing the mismatch with the current software.  For
>example, with commas reinstated I believe that it would be possible to
>write a CIF2 syntax, DDLm-based dictionary that could be processed by
>Nick et al.'s software as is.  There are alternative workarounds of
>course, such as preprocessing CIF2 syntax into STAR2 syntax.


I see minimal value in reducing the mismatch between CIF2 and the current DDLm processing software, because the mismatch cannot be reduced to zero in any case, and because we're talking about only one program.

It might be possible to write DDLm and DDLm-based dictionaries using only the common subset of CIF2 and STAR2, but I think it both unwise and unreasonable to *require* all DDLm-based dictionaries to be restricted to that subset.  If we nevertheless chose that route then I could only hope that the required restrictions on the dictionary-CIF2 dialect would be fully documented among the CIF2 specifications.

On the other hand, if we don't require DDLm-based dictionaries to use the common CIF2/STAR2 subset, then I see little advantage from the one existing program being able to read and process only some DDLm-based dictionaries.

If we suppose instead that we will have to modify the existing program to work with the final, full CIF2 specifications, then I see no reason to believe that the difference in one-time work required to adapt it one way or the other would be significant enough to compare with any of the continuing costs and benefits of the various specification alternatives.


>Regarding (c):  I reproduce your points from yesterday below:
>
>JB> I don't personally see a significant advantage in visually
>distinguishing looped items from list elements.  Indeed, there are
>disadvantages springing directly from such a distinction, among them:
>
>JB> a) any visual distinction places a burden on parsers to make the
>same distinction
>JB> b) a distinction here seems arbitrary and inconsistent.  Why
>should CIF use differing syntax for the same function (delimiting a
>sequence of values)?
>JB> c) if adding comma delimiters means further restricting the
>character set for whitespace-delimited values, then we thereby
>increase CIF2's incompatibility with CIF1
>
>The semantic meaning of a sequence of list values is fundamentally
>different from that of a sequence of looped values.  There is no
>inherent order for looped values, as the column and row order is
>completely arbitrary.  This is not true for lists.   There is
>admittedly a certain duality in the separation of values in a list: at
>the very basic level they are a sequence of tokens, so whitespace
>would be the CIF2 way of separating them (as I argued previously,
>before my road to Damascus moment); but on the other hand, unlike most
>(all?) other values in a CIF block, the actual order that they are
>presented in the data file must be preserved, so it would be desirable
>to indicate this.
>
>So, in response to (a) I would say: yes, but the parsers must
>distinguish loops and lists anyway, and may store them differently.
>For example, a loop value might go directly into a database table, but
>a list value must be accumulated somewhere first.


Parsers (human and electronic) already can distinguish loops from lists by the fact that lists are enclosed in square brackets.  I really don't see a significant advantage there to reintroducing comma separators, especially given that we would be retaining whitespace-only separators as an alternative.  I find Herbert's human-readability arguments more persuasive here, but not sufficient to outweigh the advantages to both humans and computers of syntactic consistency for lists, tables, and loops.


>In response to (b) I repeat that a sequence of looped values and a
>sequence of listed values are semantically different.  If anything,
>the tendency to see them as identical would suggest a comma is a
>useful reminder that they are not.


Yet you propose comma separators only as one alternative.  I expect that the very people who might most benefit from a reminder of the semantic difference are the same ones who are most likely to be confused by the fact that commas are allowed, but not required, as list item separators, yet disallowed as loop and table item separators.  When presented with a CIF containing a list with only whitespace separators, or with mixed separators, what are these semantically-uncertain people going to think?  I think the situation for them is *worse* if both separators are allowed.


>As for (c) I think we might want to in any case remove comma from the
>non-delimited string character set, because if we stick with the
>current CIF2 spec, the following is a legitimate single-element list:
>[1,2,3].


Indeed it is, but I think the confusion factor is small if the specifications are consistent in using whitespace as the only separator for value sequences.  If the proposed reintroduction of commas as list item separators is not accepted, however, then it would be reasonable to entertain the possibility of this change by itself.
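
To make the [1,2,3] case concrete, here is a minimal Python sketch of whitespace-only splitting.  It is an illustration only, not taken from any existing CIF or STAR software; the function name is hypothetical, and it assumes a flat list with no quoted strings, nested lists, tables, or comments.

```python
def split_whitespace_only(list_text):
    """Split the body of a flat bracketed list on whitespace only.

    Assumption: no quoting, nesting, tables, or comments are handled.
    A comma is treated as an ordinary character of an unquoted value.
    """
    body = list_text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed list")
    return body[1:-1].split()


print(split_whitespace_only("[1,2,3]"))  # ['1,2,3']
print(split_whitespace_only("[1 2 3]"))  # ['1', '2', '3']
```

Under whitespace-only separation the first input yields one element and the second yields three, which is the behaviour under discussion.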

With STAR2 in the picture, however, I'm curious: how is this handled there and in the DDLm demo software?


>JB> Are you?  I know you favor allowing both whitespace and comma
>separators, but I think you misread James' productions when you
>assert (elsewhere) that [,,] would match them.  I don't read them that
>way, and James previously wrote that it was not his intention to
>allow that sort of construct.
>
>You are right, it was not my intention (although I have no particular
>issue with allowing it). I think that can be cleared up during a
>semantic tidy-up phase rather than right now.
>
>JB> Furthermore, the productions as currently written are flawed at
>least because they do not permit tables as list items.  They also
>yield odd results for where whitespace is allowed relative to commas
>(allowed before, but not after).  Those issues can be addressed
>with relative ease, of course, but they're a good reason to defer
>voting on specific productions.
>
>Indeed. Try these:
>
><list> = '[' <whitespace>* {<listdatavalue> {<comma or whitespace><listdatavalue>}*}* ']'
><listdatavalue> = <whitespace>*{<list>|<string>|<table>}<whitespace>*


There remains the problem that Herbert pointed out, that <string> remains open to interpretation (i.e. does it mean just a possibly-empty sequence of characters, or does it rather mean a CIF data value that is neither a list nor a table?).  I admit that I am quibbling, but here is a form I would like better if this proposal were adopted:

list := '[' <whitespace>* {<datavalue> {<listitemseparator> <datavalue>}* <whitespace>*}? ']'

listitemseparator := {<whitespace>* ',' <whitespace>*} | <whitespace>+

Here the <datavalue> production matches all data values supported at top level, outside lists and tables.  I agree with Herbert that the <whitespace> production should accommodate comments, in the same way that comments constitute whitespace in other parts of the grammar.
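
For anyone who wants to experiment with the two productions above, here is a rough Python sketch of a recursive-descent reader for them.  It rests on several simplifying assumptions that are mine, not part of the proposal: <datavalue> is reduced to either a nested list or an unquoted token containing no whitespace, comma, or square bracket; quoted strings and tables are omitted; and <whitespace> does not yet accommodate comments.  The names parse_list and parse_value are hypothetical.

```python
import re

# Simplified <whitespace>*: spaces, tabs, and line ends only (no comments).
WS = re.compile(r"[ \t\r\n]*")
# listitemseparator := {<whitespace>* ',' <whitespace>*} | <whitespace>+
SEP = re.compile(r"[ \t\r\n]*,[ \t\r\n]*|[ \t\r\n]+")
# Assumed stand-in for an unquoted data value: comma is excluded here.
BARE = re.compile(r"[^\s,\[\]]+")


def parse_value(text, pos):
    """Parse a nested list or a bare token; return (value, next_pos)."""
    if text[pos:pos + 1] == "[":
        return parse_list(text, pos)
    match = BARE.match(text, pos)
    if match is None:
        raise ValueError(f"expected a data value at offset {pos}")
    return match.group(), match.end()


def parse_list(text, pos=0):
    """Parse one list starting at text[pos]; return (items, next_pos)."""
    if text[pos:pos + 1] != "[":
        raise ValueError(f"expected '[' at offset {pos}")
    pos = WS.match(text, pos + 1).end()
    items = []
    if text[pos:pos + 1] == "]":          # empty list
        return items, pos + 1
    while True:
        value, pos = parse_value(text, pos)
        items.append(value)
        # Optional trailing whitespace directly before the closing bracket.
        end = WS.match(text, pos).end()
        if text[end:end + 1] == "]":
            return items, end + 1
        sep = SEP.match(text, pos)
        if sep is None:
            raise ValueError(f"expected a separator or ']' at offset {pos}")
        pos = sep.end()


print(parse_list("[1 2, 3 ,4]")[0])   # ['1', '2', '3', '4']
print(parse_list("[a [b c] d]")[0])   # ['a', ['b', 'c'], 'd']
```

In this sketch a construct such as [,,] or a trailing comma raises ValueError, because each separator must be followed by a data value, and whitespace is accepted on both sides of a comma.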


Aside: I have never been enthusiastic about the term "whitespace-delimited data value", but if commas are allowed as list item delimiters and unquoted strings continue to be allowed as list items, then "whitespace-delimited" is suddenly a rather poor description.  This is not itself an argument against adopting comma delimiters, but if comma delimiters are adopted then I hope we can agree to change the term "whitespace-delimited" used in the specification to something else that better describes that type of value.


Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital




