Re: Specifying values 'less than something' in CIFs?

  • From: Peter Murray-Rust <pm286@cam.ac.uk>
  • Date: Sun, 29 Apr 2012 09:26:56 +0100
On Sun, Apr 29, 2012 at 9:09 AM, Saulius Grazulis <grazulis@ibt.lt> wrote:
A large amount of value type violations come from data items like this:

_refine_ls_shift/su_mean <0.001

(see e.g. http://www.crystallography.net/2232747.cif)

The data type in the core dictionary is specified as 'numb', but many
CIFs give string ('char') values, because of the attached "less than" sign.

This is not a trivial problem. It is difficult to build software that manages alternative datatypes for an item. XML Schema provides for this ("union") and I tried to use it for CML but gave up. I spent a lot of time on the logic of if...else...

For a human reader, the message in these data items seems more-or-less
clear: I interpret it as if the authors wanted to convey that they are
"pretty sure that the value is negligible and can be treated as 0 for all
practical purposes; with very high probability it is less than 0.001"

How do we express this in CIF dictionary-consistent way?

There is a social as well as a technical problem. Clearly authors are creating CIFs inconsistent with the standard. Draconian validation software would simply fail on these documents. This is a constant problem in standards - people bend the rules. Reacting to bent rules puts a huge burden on software developers.

One possibility would be to put in the value 0 (this is the lowest
possible value for the _refine_ls_shift/esd_mean and other such tags),
denoting that in computations, the values (shifts) can be neglected;
then we could reason that since the authors put '<0.001' they are pretty
sure about it, so the probabilities for this to be true are above 99%;
therefore, if the measured values were normally distributed around the
mean 0, 0.001 would be something like 3*sigma ("the three sigma rule"),
and thus the esd would be 0.001/3 approx. = 0.0003. This would yield the
CIF encoding:

Do we assume that there are other strings than "<0.001"?

But I like Saulius's approach for this particular problem.
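
A rough sketch of how software might apply that heuristic, in Python. This is only an illustration under stated assumptions: the function name interpret_shift, the regular expression and the n_sigma parameter are all hypothetical, and the code handles only plain decimal numbers with an optional leading "<". Anything else is reported as unparseable rather than guessed at.

```python
import re

# Hypothetical pattern: an optional "<" followed by a plain decimal number.
# Values with an appended su such as "0.002(1)", or exponent notation,
# are deliberately not handled in this sketch.
_BOUND = re.compile(r"^\s*(<)?\s*(\d+(?:\.\d*)?|\.\d+)\s*$")


def interpret_shift(raw, n_sigma=3.0):
    """Return (value, su) for a raw CIF string, or None if it is not understood.

    A plain number is returned unchanged with su=None.
    An upper bound "<X" is mapped to value 0 with su = X / n_sigma,
    following the "three sigma" reasoning (an assumption, not a rule
    taken from any CIF dictionary).
    """
    match = _BOUND.match(raw)
    if match is None:
        return None  # e.g. free text; leave it for a human to resolve
    number = float(match.group(2))
    if match.group(1) is None:
        return (number, None)
    return (0.0, number / n_sigma)
```

With this sketch, interpret_shift("<0.001") gives a value of 0 and an su of roughly 0.0003, while a free-text string returns None so that a validator can flag it instead of silently converting it.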

It gets more difficult with phrases such as "room temperature". I don't know whether things like this occur but they shouldn't!

There are other problems of automatic conversion, especially interpreting the absence of information. Only good author discipline and gentle firmness from IUCr will tackle these.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
