Re: Specifying values 'less than something' in CIFs?

  • To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
  • Subject: Re: Specifying values 'less than something' in CIFs?
  • From: Peter Murray-Rust <pm286@cam.ac.uk>
  • Date: Sun, 29 Apr 2012 12:11:47 +0100
  • In-Reply-To: <4F9D1273.8000901@ibt.lt>
  • References: <4F9CF725.9030608@ibt.lt><CAD2k14PJMtQcvhaQkwaBKb9Kf1UAb1d8J=NVe59COSm2bf_jng@mail.gmail.com><4F9D1273.8000901@ibt.lt>


On Sun, Apr 29, 2012 at 11:05 AM, Saulius Grazulis <grazulis@ibt.lt> wrote:
Hi, Peter,


The point is that IUCr dictionaries are very cool stuff to hunt down
semantic errors in CIFs. Take for example a validation message from my
current run (I hope it is comprehensible):

/home/saulius/src/cod-tools/trunk/perl-scripts/cif_validate:
../2/00/66/2006634.cif data_2006634:

NOTE, tag '_geom_torsion' value "177.89(0.20)" is of type 'UQSTRING'
while it should be numeric, i.e. 'FLOAT', or 'INT'

Clearly, the value has incorrect ESU syntax, but it has slipped through
all syntax and semantics checks until now, and would probably go
unnoticed by many programs that only take the numeric value but do not
use the ESU. And those that interpret the ESU would either report an
error or, even worse, yield unpredictable and possibly incorrect results.
And we do not know this until we run that specific program on the
specific CIF.

Yes - my JUMBOConverter system will do something similar.
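
As an illustration of the kind of check discussed above, here is a minimal Python sketch. It is not the cod-tools or JUMBOConverter code; the function name is_cif_numeric and the simplified pattern are assumptions made for this example. It accepts a number with an optional standard uncertainty written as integer digits in parentheses, so "177.89(20)" passes while "177.89(0.20)" is rejected.

```python
import re

# Simplified pattern for a CIF-style number (an assumption for this sketch,
# not the full CIF grammar): optional sign, mantissa, optional exponent,
# and an optional standard uncertainty given as digits in parentheses.
_CIF_NUMBER = re.compile(
    r"""[+-]?
        (?:\d+\.?\d*|\.\d+)     # mantissa, e.g. 177.89 or .5
        (?:[eE][+-]?\d+)?       # optional exponent
        (?:\(\d+\))?            # optional s.u.: integer digits only
    """,
    re.VERBOSE,
)


def is_cif_numeric(value: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return _CIF_NUMBER.fullmatch(value) is not None
```

A validator built this way would flag the value in the message above because the parenthesised part contains a decimal point.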
 

a) educate users/students to know what the CIF rules are; do not change
the CIF rules in an incompatible way in the future;

Necessary but not sufficient!

b) make software that detects and, when possible, corrects the most
common 'mistakes'; e.g. it is probably safe to change '100C' to '373.15'
(Kelvins), with a benign warning (COD deposition tools do this on the fly).

This is really difficult. I don't like changing things, and certainly not without metadata. Maybe add an additional local COD_ field that gives the heuristic value. That way we don't corrupt the past but also allow people to search and compute on "better" information.
 
c) Probably (for COMCIFS consideration?), codify the most widespread
practices in future CIF dictionaries. For instance, all scientists are
trained to put units next to values, everywhere. This is a good style at
least and an absolute necessity often. So why should CIF require a
different style? What if one introduces into cif_core.dic (with the
appropriate extensions of DDL1) something like this:

data_chemical_melting_point
# ...
loop_
_type_conditions
 esd
 units # specifies that units may be attached to this number
loop_
_type_unit_name
_type_unit_coefficient
_type_unit_offset
_type_unit_comment
'C' 1.0     273.15 "degrees Celsius"
'F' .555555 459.67 "degrees Fahrenheit; 0.555... = 5/9"

And now programs could automagically figure out that:

_chemical_melting_point 100

is the same as

_chemical_melting_point 100K

or

_chemical_melting_point -173.15C

Easy?
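
For illustration only, the conversion implied by the proposed table can be sketched in Python. The reading K = coefficient * (value + offset) is an assumption that happens to fit both rows of the table; the 'K' row, the bare-number default, and the name to_kelvin are also assumptions of this sketch. It uses 5/9 rather than the truncated .555555 and does not handle standard uncertainties.

```python
import re

# (coefficient, offset) per unit, applied as K = coefficient * (value + offset).
# 'C' and 'F' follow the proposed table; the 'K' row is an assumption.
UNITS = {
    "K": (1.0, 0.0),
    "C": (1.0, 273.15),
    "F": (5.0 / 9.0, 459.67),
}

# A plain decimal number followed by an optional unit suffix.
_VALUE = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+))\s*([A-Za-z]*)")


def to_kelvin(raw: str) -> float:
    """Convert a value such as '100', '100K' or '-173.15C' to kelvin."""
    match = _VALUE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised value: {raw!r}")
    number = float(match.group(1))
    unit = match.group(2) or "K"   # assumption: a bare number is in kelvin
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    coefficient, offset = UNITS[unit]
    return coefficient * (number + offset)
```

With these assumptions, to_kelvin("100"), to_kelvin("100K") and to_kelvin("-173.15C") all come out at 100 K up to floating-point rounding.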

Afraid not. CIF is a flat syntax and cannot easily manage complex objects with facets and attributes.
 
Backwards compatibility is guaranteed, along with the automatic
conversion possibility of CIFs to be readable by older programs :) And
chemists could just cut-n-paste values with units from their papers.

The harder challenge is backwards compatibility for the data. One example is disorder, which in a CIF that is, say, 10 years old is much harder to interpret than in a modern CIF.
>
> Do we assume that there are other strings than "<0.001"?

Oh, yes, for sure! The values for _refine_ls_shift/esd_mean range from
"<0.001" to "<0.0001", and probably even wider.

Also, there are data items like:

_diffrn_standards_decay_% '<1'

with values ranging from '<1' to '<5', and also having strings like
"none", "insignificant", "negligible" etc. :)


What about adding the field:

COD_diffrn_standards_decay_% '<1'

and removing the old one. Removal of an invalid field is better than throwing away the whole CIF (which is the only logical approach to error).
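
Before deciding whether a value should move to a local field, the kinds of value mentioned above ('<1', "<0.0001", "none", and so on) have to be told apart. A minimal Python sketch follows; the function name classify and the three category labels are assumptions of this example, not part of any CIF dictionary or COD tool.

```python
import re
from typing import Optional, Tuple

_NUMBER = r"(?:\d+\.?\d*|\.\d+)"
_PLAIN = re.compile(r"[+-]?" + _NUMBER)
_UPPER = re.compile(r"<\s*(" + _NUMBER + r")")


def classify(raw: str) -> Tuple[str, Optional[float]]:
    """Classify a raw value as 'number', 'upper_bound' or 'text'."""
    # Drop surrounding whitespace and CIF-style quote characters.
    text = raw.strip().strip("'\"").strip()
    bound = _UPPER.fullmatch(text)
    if bound:
        return ("upper_bound", float(bound.group(1)))
    if _PLAIN.fullmatch(text):
        return ("number", float(text))
    return ("text", None)
```

Only the "number" case would stay in the standard data item under this scheme; the other two are candidates for a local field.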


I must confess that we have already put our hands on such "values"... We
have a script (written by my former student, Adriana Daskevic) that
automatically fixes the most widespread (as determined by COD scans) values
(100K, 200C, '100 \%C', "room temperature", etc.), from a pre-compiled list.

For anything that is understood by us humans as "room temperature" (RT,
"room temp.", "ambient temp.", etc.), we assume that the average is
meant to be 22 deg. C (comfort level in a lab), and the uncertainty to
be +/- 2 degrees (assuming it is unlikely that human crystallographers
would measure above 28 degrees C or below 16 degrees, yielding a 60%
(1*sigma) confidence interval of 2 degrees, on a broader side), ending
up with a "justified wild guess" of 295(2) (Kelvins).
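
The room-temperature heuristic described above can be written down in a few lines of Python. This is only a sketch and not the COD script itself; the alias set contains just the spellings quoted in the text, and the matching rule (case-insensitive, quotes stripped) and the name room_temperature_guess are assumptions.

```python
# Spellings treated as "room temperature"; taken from the text above.
ROOM_TEMPERATURE_ALIASES = {"room temperature", "rt", "room temp.", "ambient temp."}

ROOM_TEMPERATURE_C = 22   # assumed mean, degrees Celsius
ROOM_TEMPERATURE_SU = 2   # assumed standard uncertainty, degrees


def room_temperature_guess(raw):
    """Return '295(2)' for a room-temperature phrase, otherwise None."""
    key = raw.strip().strip("'\"").strip().lower()
    if key not in ROOM_TEMPERATURE_ALIASES:
        return None
    kelvin = round(ROOM_TEMPERATURE_C + 273.15)   # 295.15 rounds to 295
    return f"{kelvin}({ROOM_TEMPERATURE_SU})"
```

Returning None for anything outside the list leaves other values untouched, so the guess is applied only where a human reader would also read "room temperature".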

Again - convert to
COD_temperature...

 
> There are other problems of automatic conversion,


This is tricky... And goes back to what the special values '?' and '.' mean.

I would interpret a data item with the value '?' the same way as if this
data item was missing altogether...

I have spent years trying to see if there is semantic consensus on these two happy characters. My best guess is:

"?" can be ignored (or better deleted). It is simply there for human authors and readers, perhaps to prompt them that they should notice there is nothing there.

"." exists to pad out tables/loops and stop the reader failing.
 
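Under the interpretation guessed at above, a program could handle the two special values as in the following Python sketch. It assumes the CIF has already been parsed into a dictionary of tag/value strings and that unquoted special values have been kept distinct from quoted strings; the function name apply_special_values is hypothetical.

```python
def apply_special_values(items):
    """Drop items whose value is '?' and map '.' to None as a placeholder."""
    cleaned = {}
    for tag, value in items.items():
        if value == "?":
            continue                      # treated as if the item were absent
        cleaned[tag] = None if value == "." else value
    return cleaned
```

Keeping the '.' entries as None preserves the shape of a loop row, which matches the padding role suggested for that character.
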
> Only good author discipline and gentle firmness from IUCr will tackle
> these.

ACK. Discipline is necessary, but I guess "gentle" is also important :)


--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069