
Re: Specifying values 'less than something' in CIFs?

Hi, Peter,

many thanks for your fast reply and the comments!

On 04/29/2012 11:26 AM, Peter Murray-Rust wrote:

> The data type in the core dictionary is specified as 'numb', but
> many CIFs give string ('char') values, because of the attached "less 
> than" sign.
> This is not a trivial problem. It is difficult to build software
> that manages alternative datatypes for an item. XML Schema provides
> for this ("union") and I tried to use it for CML but gave up. I spent
> a lot of time on the logic of if...else...

I agree, it's not quite straightforward. Still, I feel like picking up
the challenge.

The point is that IUCr dictionaries are a very cool tool for hunting
down semantic errors in CIFs. Take, for example, a validation message
from my current run (I hope it is comprehensible):

../2/00/66/2006634.cif data_2006634:

NOTE, tag '_geom_torsion' value "177.89(0.20)" is of type 'UQSTRING'
while it should be numeric, i.e. 'FLOAT', or 'INT'

Clearly, the value has incorrect ESU syntax, but it has slipped through
all syntax and semantic checks so far, and would probably go unnoticed
by many programs that only take the numeric value and do not use the
ESU. Those that do interpret the ESU would either report an error or,
even worse, yield unpredictable and possibly incorrect results. And we
would not know this until we ran that specific program on that specific
CIF.
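For illustration, the distinction the validator draws can be sketched
with a regular expression (a simplified toy, not the actual COD
validator code): a CIF number carries its standard uncertainty as an
integer in parentheses, scaled to the last digits of the value, so
'177.89(20)' is valid while '177.89(0.20)' is not.

```python
import re

# Toy pattern for a CIF numeric value with optional standard uncertainty.
# The s.u. must be an unsigned integer in parentheses, e.g. "177.89(20)".
NUMERIC_WITH_SU = re.compile(r"""^
    [+-]?                    # optional sign
    (?:\d+\.?\d*|\.\d+)      # mantissa: 12, 12., 12.34 or .5
    (?:[eE][+-]?\d+)?        # optional exponent
    (?:\(\d+\))?             # optional s.u.: integer digits only
$""", re.VERBOSE)

def classify(value: str) -> str:
    """Return 'numeric' if the value parses as a CIF number, else 'UQSTRING'."""
    return "numeric" if NUMERIC_WITH_SU.match(value) else "UQSTRING"
```

With this sketch, "177.89(20)" classifies as numeric, while both
"177.89(0.20)" and "<0.001" fall back to 'UQSTRING'.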

A validator, on the other hand, catches such mistakes early and in a
completely regular way, along with many others! If we ran such a
validator on all available CIFs, we could find and fix, or at least
flag, the problems in a regular, semi-automatic way. I believe that
extensive use of validators would produce much better CIFs.

Of course, currently we have a lot of "benign" validation messages, like
the '<0.001' being non-numeric which I discussed. They distract
attention and put undue strain on human editors.

But we can identify them, and the ones that occur most often can then be
processed automatically. In this way, only serious semantic
discrepancies will be left for human reviewers.

In short, IUCr dictionaries are a very powerful thing, and indeed a huge
help for crystallographers and software developers, and we should use
them even more often. My praise to IUCr and COMCIFS for the dictionaries :)

> There is a social as well as a technical problem. Clearly authors
> are creating CIFs inconsistent with the standard. Draconian
> validation software would simply fail on these documents. This is a
> constant problem in standards - people bend the rules. Reacting to
> bent rules puts a huge burden on software developers

True. Most people do not read the CIF grammar or the dictionaries; they
apparently follow the conventions used in paper-based typography for
centuries. So, e.g., chemists give temperature values in degrees Celsius
and specify the units as 'C', '\%C' or whatever. Never mind that
cif_core.dic mandates kelvins with no unit suffix (very few seem to know
or care about this fact ;).

There might be several solutions to the problem, not mutually exclusive:

a) educate users/students to know what the CIF rules are; do not change
the CIF rules in an incompatible way in the future;

b) make software that detects and, when possible, corrects the most
common 'mistakes'; e.g., it is probably safe to change '100C' to
'373.15' (kelvins) with a benign warning (COD deposition tools do this
on the fly).

c) Probably (for COMCIFS consideration?), codify the most widespread
practices in future CIF dictionaries. For instance, all scientists are
trained to put units next to values, everywhere. This is good style at
the least, and often an absolute necessity. So why should CIF require a
different style? What if one introduced into cif_core.dic (with the
appropriate extensions of DDL1) something like this:

# ...
  units # specifies that units may be attached to this number
'C' 1.0     273.15 "degrees Celsius"
'F' .555555 459.67 "degrees Fahrenheit; 0.555... = 5/9"

And now programs could automagically figure out that:

_chemical_melting_point 100

is the same as

_chemical_melting_point 100K

or as

_chemical_melting_point -173.15C

Easy? Backwards compatibility is retained, along with the possibility of
automatically converting such CIFs to be readable by older programs :)
And chemists could just cut-and-paste values with units straight from
their papers.
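To make the idea concrete, here is a minimal sketch of how a program
could use the proposed per-unit rows. I read the two numeric columns as
(factor, offset) with kelvins = (value + offset) * factor; whether the
factor applies before or after the offset is my assumption about the
hypothetical DDL extension, and the function name is illustrative.

```python
import re

# Hypothetical unit table mirroring the proposed dictionary rows:
# kelvins = (value + offset) * factor.
UNITS = {
    "K": (1.0, 0.0),
    "C": (1.0, 273.15),        # (C + 273.15) * 1.0
    "F": (5.0 / 9.0, 459.67),  # (F + 459.67) * 5/9
}

def to_kelvin(raw: str) -> float:
    """Parse values like '100', '100K', '-173.15C' and return kelvins."""
    m = re.match(r"^\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*([CFK])?\s*$", raw)
    if m is None:
        raise ValueError(f"cannot interpret value: {raw!r}")
    value = float(m.group(1))
    factor, offset = UNITS[m.group(2) or "K"]  # bare numbers default to K
    return (value + offset) * factor
```

Under this sketch, '100', '100K' and '-173.15C' all come out as 100 K,
exactly as in the melting-point example above.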

> One possibility would be to put in the value 0 (this is the lowest 
> possible value for the _refine_ls_shift/esd_mean and other such
> tags), denoting that in computations, the values (shifts) can be
> neglected; then we could reason that since the authors put '<0.001'
> they are pretty sure about it, so the probabilities for this to be
> true are above 99%; therefore, if the measured values were normally
> distributed around the mean 0, 0.001 would be something like 3*sigma
> ("the three sigma rule"), and thus the esd would be 0.001/3 approx. =
> 0.0003. This would yield the CIF encoding:
> Do we assume that there are other strings than "<0.001"?

Oh, yes, for sure! The values for _refine_ls_shift/esd_mean range from
"<0.001" to "<0.0001", and probably even wider.

Also, there are data items like:

_diffrn_standards_decay_% '<1'

with values ranging from '<1' to '<5', and also having strings like
"none", "insignificant", "negligible" etc. :)

> But I like Saulius approach for this particular problem.

I accept this as encouragement to think in this direction... :)

> It gets more difficult with phrases such as "room temperature". I
> don't know whether things like this occur but they shouldn't!

I must confess that we have already put our hands on such "values"... We
have a script (written by my former student, Adriana Daskevic) that
automatically fixes the most widespread values (100K, 200C, '100 \%C',
"room temperature", etc., as determined by COD scans) from a
pre-compiled list.

For anything that we humans understand as "room temperature" (RT, "room
temp.", "ambient temp.", etc.), we assume the mean to be 22 deg. C
(comfort level in a lab) with an uncertainty of +/- 2 degrees: it seems
unlikely that human crystallographers would measure above 28 deg. C or
below 16 deg. C, so that +/- 6 degree range corresponds to roughly
3*sigma, giving a one-sigma (~68%) confidence interval of 2 degrees,
rounded to the broader side. We end up with a "justified wild guess" of
295(2) (kelvins).

> There are other problems of automatic conversion,

For sure, we cannot convert everything automatically. But the strategy
would be to fix the most obvious things first. If a peer-reviewed
scientific paper published data measured at "room temperature", then the
readers evidently understood what was meant by this statement and found
it good enough for publication. We can now apply the same understanding
and write it down in a form that software comprehends and that excludes
incorrect interpretation.

> especially interpreting the absence of information.

This is tricky... and goes back to what the special values '?' and '.'
mean.

I would interpret a data item with the value '?' the same way as if the
data item were missing altogether...
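A minimal sketch of that reading, assuming a CIF data block exposed as a
plain dict of tag to string (the helper name is mine, not an established
API):

```python
def lookup(block: dict, tag: str):
    """Return the value for tag, treating the CIF '?' (unknown) as absent.

    The '.' marker (inapplicable) is deliberately kept distinct, since it
    carries different information than a missing item."""
    value = block.get(tag)
    return None if value == "?" else value
```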

> Only good author discipline and gentle firmness from IUCr will tackle
> these.

ACK. Discipline is necessary, but I guess "gentle" is also important :)


Dr. Saulius Gražulis
Institute of Biotechnology, Graiciuno 8
LT-02241 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556
mobile: (+370-684)-49802, (+370-614)-36366
comcifs mailing list
