
Re: [ddlm-group] Semantics of whitespace-delimited values

Dear all

I extended the mini survey of current applications a little and looked more closely at some of the less liberal parsers.

One of the applications I looked at did not complain when I included some non-ASCII text in the CIF, another complained about a data value constructed as '''z''' (valid CIF1), and one displayed rather quirky behaviour with regard to semicolon-delimited strings, rejecting the contained 'value' if it had a leading newline but not if it had a leading space. All of these particular applications complained about delimited numbers to the extent that they stopped processing.

Based on this (albeit limited) survey of some current well-known CIF applications, it would definitely be prudent, when introducing CIF2, to indicate that 'yes indeed' the CIF1.1 interpretation of whitespace-delimited values retains its significance in CIF2. However, if possible I think it would be in the interests of flexibility and unambiguity if we could somehow suggest that CIF applications start to turn to the dictionary rather than the syntax to determine the exact nature of a data item (after all, as I see it, that's one very strong motivation for developing CIF2 in the first place, and it is the preferred approach in CIF1 too). Thankfully (from my point of view) this isn't even an issue for the majority of applications I have looked at: they either simply grab the data however they find it and make use of it if they can, or they carefully validate the data against the dictionary. So what is challenging me is how we achieve this, i.e. nudging some applications to be a little more flexible (which in my experience is what many 'users' would most appreciate), while at the same time maintaining the convention that numbers especially are still presented in an undelimited (uncluttered) fashion. I've no convincing answer to this yet.

Regarding the ? and . 'null' values, I hesitate to suggest that we could take these out of the issue altogether by making them CIF key tokens. I hesitate because I suspect that some applications simply ignore their significance anyway and incorrect usage rarely presents a real problem (and also I haven't yet attempted to see if it's actually possible to define them in this way in any case :-). So it's probably unnecessary, and may even seem like a new complication to applications that were not particularly aware of, or bothered by, the significance of these tokens in the first place.

Cheers

Simon







From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>; SIMON WESTRIP <simonwestrip@btinternet.com>
Sent: Wednesday, 8 July 2015, 16:21
Subject: RE: [ddlm-group] Semantics of whitespace-delimited values

 
Thanks Simon, James, and John.  I am uncertain how many distinct parsers are represented by the reports so far, but it seems there must be at least five.
 
I think we agree that parsers and applications should be permitted, if not required, to distinguish between the values . and '.', and between the values ? and '?'.  We also seem to agree that it is not useful to insist that parsers or applications refuse to interpret quoted values as numbers, although some CIF 1.1 parsers in fact do so at their own discretion, and some warn instead of refusing (with even that relying on taking the position that numbers are not supposed to be quoted).
 
Not being enamored of special cases, and not wanting CIF 2.0 to rule out CIF interpretation practice that is accepted and common in CIF 1.1 applications, I find myself favoring CIF 2.0 taking the position that in general, it is permitted but not required to interpret any string value differently when it is presented in whitespace-delimited form than when it is presented in any of the other forms.  The conventions for the special values . and ? could then be taken to apply on a domain-wide basis, whereas the convention for the form of numbers could be taken to apply on a more selective basis (per-DDL, per-dictionary, or even per-definition).  An implication of this position, however, is that whether or not a value is presented whitespace-delimited becomes a property of that value that a fully general CIF 2.0 parser must make available to its clients.  Moreover, for better or for worse, future dictionaries could establish additional items or data types whose values are required to be presented unquoted.
 
We could perhaps characterize that more specifically, maybe by saying that the exact form of values presented in any of the quoted forms is significant, or something along those lines, whereas values presented in whitespace-delimited form may afford equivalent alternative expressions.  That doesn’t exactly fit . and ?, but perhaps some similar statement could do so better.
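
[Illustrative sketch, not part of the original message.] The implication noted above, that delimiter status becomes a property a general CIF 2.0 parser must expose to its clients, could look roughly like the following in Python. The names `CifValue`, `whitespace_delimited`, `is_unknown` and `is_null` are hypothetical and are not taken from any existing parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CifValue:
    """A parsed data value plus the form in which it was presented."""
    text: str                   # the characters of the value, delimiters removed
    whitespace_delimited: bool  # True only if no quoting delimiters were used


def is_unknown(value: CifValue) -> bool:
    # Only a bare ? carries the special "unknown" meaning; '?' is a string.
    return value.whitespace_delimited and value.text == "?"


def is_null(value: CifValue) -> bool:
    # Only a bare . carries the special "not applicable" meaning; '.' is a string.
    return value.whitespace_delimited and value.text == "."
```

A client that does not care about the distinction can simply read `text`, while a stricter client can consult the flag.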
 
 
John
 
--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]
 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Wednesday, July 08, 2015 4:44 AM
To: SIMON WESTRIP; Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values
 
Simon's quick survey is very useful.

I will confess that my PyCIFRW library completely ignores the delimiters and delivers every data value as a string.  The calling application is responsible for converting values to numbers if so required (but I do provide routines to do this).  If a dictionary is explicitly linked to a data block, PyCIFRW will attempt to return numeric values for data items that are specified in the dictionary as being of numeric type, regardless of the delimiters that were originally used.  Python software that uses PyCIFRW (at least PyMol and a few others) will therefore behave in this way.
On output PyCIFRW does not delimit numeric values.
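
[Illustrative sketch, not part of the original message.] The dictionary-driven behaviour described above can be approximated as follows. This is not PyCIFRW's API: `NUMERIC_ITEMS` and `typed_value` are hypothetical names, and the set of numeric data names is hard-coded here as a stand-in for what a real dictionary would declare.

```python
from typing import Set, Union

# Assumption: stand-in for the data names a dictionary declares as numeric.
NUMERIC_ITEMS: Set[str] = {"_cell_length_a", "_atom_site_fract_x"}


def typed_value(name: str, text: str, numeric_items: Set[str]) -> Union[float, str]:
    """Convert by dictionary type, ignoring how the value was delimited."""
    if name not in numeric_items:
        return text
    # Drop an appended standard uncertainty such as the (3) in 0.2501(3).
    mantissa = text.split("(", 1)[0]
    try:
        return float(mantissa)
    except ValueError:
        # Not parseable as a number (for example ? or .): hand back the string.
        return text
```

Because the delimiters never reach `typed_value`, the text `0.2501(3)` yields the same number whether or not it was quoted in the file.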
James.
 
On 8 July 2015 at 03:09, SIMON WESTRIP <simonwestrip@btinternet.com> wrote:
OLEX2, JANA, OpenBabel and Avogadro also seem not to care that the numbers are delimited by apostrophes, while enCIFer correctly warns that the values are not correctly formatted.
 

From: SIMON WESTRIP <simonwestrip@btinternet.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 7 July 2015, 17:07
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values
 
A quick test of some programs I have readily available with an 'invalid' CIF1.1 cif that contains delimited site coordinates:
 
checkCIF (powered by PLATON) - issues alerts but nevertheless processes the CIF using the delimited values as numbers
 
publCIF - warns that they should not have delimiters but reads the value as a number anyway (according to the dictionary)
 
Jmol - renders models as expected.
 
I'll test a few others in due course, but am pleased to see that these programs would not be scuppered by reading 'delimited numbers'. (NB obviously checkCIF/publCIF could fairly easily drop the alerts for CIF2, which are annoying in any case)
 
Cheers
 
Simon
 
 

From: James Hester <jamesrhester@gmail.com>
To: ddlm-group <ddlm-group@iucr.org>
Sent: Tuesday, 7 July 2015, 15:17
Subject: [ddlm-group] Semantics of whitespace-delimited values
 
Dear All,
One issue that has not been discussed in the context of the CIF2 syntax is the special interpretation of whitespace-delimited values.  In CIF1.1 as recorded in Volume G, a whitespace-delimited question mark and a whitespace-delimited period have a special interpretation as "unknown" and "default/not applicable/null" respectively.  Furthermore, only a whitespace-delimited value matching a specified syntax (which includes optional appended esd values) may be interpreted as a numeric value, and it would, strictly speaking, be a semantic error for a CIF processor to interpret a numeric value enclosed in delimiters as a number.
I have no issue with question mark or period, as these are necessary for semantic completeness. 
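
[Illustrative sketch, not part of the original message.] The CIF1.1 rules summarised above can be expressed as a small classifier. The regular expression is a simplified approximation of the numeric syntax (sign, digits, optional exponent, optional esd in parentheses), not the normative grammar from Volume G, and the names `Kind` and `classify_cif1` are hypothetical.

```python
import re
from enum import Enum


class Kind(Enum):
    UNKNOWN = "unknown"   # bare ?
    NULL = "null"         # bare .
    NUMBER = "number"
    STRING = "string"


# Simplified numeric pattern: optional sign, digits with an optional decimal
# point, optional exponent, optional esd in parentheses, e.g. 0.2501(3).
_NUMERIC = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?")


def classify_cif1(text: str, whitespace_delimited: bool) -> Kind:
    """Classify a value the strict CIF1.1 way, using its delimiter status."""
    if not whitespace_delimited:
        # Any quoted or semicolon-delimited value is a plain string.
        return Kind.STRING
    if text == "?":
        return Kind.UNKNOWN
    if text == ".":
        return Kind.NULL
    if _NUMERIC.fullmatch(text):
        return Kind.NUMBER
    return Kind.STRING
```

Under these rules `0.2501(3)` is a number only when it appears bare; the same characters inside apostrophes are classified as a string.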
What I would like to discuss for CIF2.0 is the following:
(i) The interpretation of a data value as numeric is determined solely by the dictionary, with no regard to the particular delimiters used in the CIF file;
(ii) A convention is encouraged for CIF writers whereby numeric values are not enclosed by delimiters;
(iii) The precise construction of numeric values is moved into the DDLm attribute dictionary.
 
The advantage of this simpler scheme is a clean separation between syntax and human-relevant semantics.  The only CIF applications that can have a use for the CIF1 scheme are those that are written without reference to a dictionary, most obviously pretty-printers that might want to tabulate numbers by lining up decimal points instead of left-justifying.  Even if such formatting applications get it wrong, they will not change the meaning of the file and so I would view point (ii) as sufficient support for such applications.  Conversely, any application that wishes to operate on a number as opposed to operating on the textual representation of the number will of necessity need to know what this number means and will therefore be written with reference to a dictionary, making it unnecessary to signal "numericness" using whitespace-delimited data values.
What do others think?  If there is a body of CIF1 applications out there that have been designed to raise errors when values expected to be numeric are enclosed by delimiters, this proposal would represent a further annoying change from CIF1, and it would be good to have some idea of how many such applications there are.  I speculate that many applications ignore the delimiter status, for reasons of laziness, the authority of the dictionary definitions, and the philosophy of writing liberal parsers.
all the best,
James.
 
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

 