Discussion List Archives


Re: Purely calculated structural data in CIF

Dear all,

the Crystallography Open Database (COD) maintainers have also encountered
a similar problem of identifying and marking purely calculated (theoretical) entries
that accidentally make it into the COD. Our approach is similar to the one proposed
by John -- we use a set of heuristics to semi-automatically identify potentially theoretical
entries and manually mark these entries using the '_cod_struct_determination_method'
data item from the COD CIF dictionary. This data item currently takes one of three enumerated
values ['single crystal', 'powder diffraction', 'theoretical'], so in a sense it can be viewed as
a rudimentary, COD-specific version of the '_exptl.method' data item. Having a more
standardised approach would be extremely helpful.

Actually, the latest DDLm version of the CIF CORE dictionary [1] already contains
the '_exptl.method' data item that John mentioned, although in a slightly different
form than the one in the mmCIF dictionary [2]. The main difference is that the
CIF CORE version is a free-form text field, while the mmCIF version is an
enumerated set of 13 values such as "X-RAY DIFFRACTION" and
"ELECTRON MICROSCOPY", one of which is "THEORETICAL MODEL".
I think that converting the CIF CORE version to an enumerated set would also
make sense, especially for the application discussed in this thread.

Several alternative approaches could also be explored, for example:
a) Introduce a new data item that marks a structure as theoretical
    (e.g. with yes/no values).
b) Introduce a new data item that specifies the *theoretical* method
    that was used (e.g. with values such as "Ab initio optimization",
    "Geometric modelling", "Molecular dynamics", etc.). This data item
    would only appear in theoretically calculated data files and, in combination
    with the '_exptl.method' data item, would make it possible to describe various
    situations such as "theoretical X-ray diffraction data calculated using
    geometric modelling", "powder diffraction experiment calculated using
    the Monte-Carlo method", etc. If I am not mistaken, a similar approach
    is adopted by the ICSD (see paper [3], especially section 4.2).

Also, I would like to add to the list of heuristics that John proposed.
Often in the files of theoretically calculated structures:
* Lattice parameters are provided with a very precise decimal part
  (more than 4 digits) and without standard uncertainties (no trailing
  parentheses with the s.u. values).
* The Z number ('_cell_formula_units_Z') is not provided.
* Atomic displacement parameters are either not provided at all or
  all values are set to 0 ('_atom_site_U_iso_or_equiv',
  'ATOM_SITE_ANISO' loop).
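
As a rough illustration, the three checks above could be sketched in Python
as follows. This is only a minimal sketch and not the code used by the COD:
the function names are hypothetical, the line-based scan is not a real CIF
parser (loops and multi-line values are ignored), only the cell lengths are
tested, and the "all ADP values are 0" case is left out.

```python
import re

CELL_LENGTH_TAGS = ("_cell_length_a", "_cell_length_b", "_cell_length_c")

def simple_items(cif_text):
    """Collect 'tag value' pairs that sit on a single line (loops are ignored)."""
    items = {}
    for line in cif_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].startswith("_"):
            items[parts[0].lower()] = parts[1].strip()
    return items

def theoretical_hints(cif_text):
    """Return a list of reasons why a data block looks calculated."""
    items = simple_items(cif_text)
    hints = []

    # More than 4 decimal digits and no s.u. in trailing parentheses.
    cell = [items[tag] for tag in CELL_LENGTH_TAGS if tag in items]
    if cell and all(re.fullmatch(r"\d+\.\d{5,}", value) for value in cell):
        hints.append("cell lengths are very precise and carry no s.u.")

    # Z is missing or given as a CIF null value ('?' or '.').
    if items.get("_cell_formula_units_z") in (None, "?", "."):
        hints.append("_cell_formula_units_Z is not provided")

    # No displacement parameter data names anywhere in the text.
    lowered = cif_text.lower()
    if ("_atom_site_u_iso_or_equiv" not in lowered
            and "_atom_site_aniso_" not in lowered):
        hints.append("no atomic displacement parameters are provided")

    return hints
```

An empty list means that none of the hints fired. A non-empty list is only a
prompt for manual inspection, not proof that the entry is theoretical.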

Finally, I invite you to take a look at the theoretical structures that
are already in the COD to expand the set of heuristics even further.
Note that the COD files have undergone some curation, so some of
the strange features might have been stripped out. However, all of
the files contain references to the original publication in case you
would like to take a more purist approach. The full list of theoretical
structures can be retrieved using the following MySQL query:

mysql -u cod_reader -h www.crystallography.net cod -e 'SELECT `file` FROM `data` WHERE `method`="theoretical"';

Feel free to write me a personal email in case you need further
advice on retrieving data from the COD.

Sincerely,
Antanas Vaitkus

On Fri, 25 Feb 2022 at 19:00, Bollinger, John C via coreDMG <coredmg@iucr.org> wrote:

Dear Mike,

 

As far as I am aware, we have no convention for this in Core CIF, but in mmCIF, it appears that one would be expected to use …

 

_exptl.method 'theoretical model'

 

… to flag a computed structure.  Other values of that data name supported by mmCIF provide for identifying various kinds of diffraction and NMR experiments by which the associated structure was determined.  We could consider adding a corresponding item to Core CIF to support such marking going forward, but of course that does not help with recognizing existing CIFs describing computed structures.

 

As for identifying existing core CIFs describing structures determined ab initio or from molecular modeling, I don’t see a better approach than heuristics such as the ones you describe already using.  Additional characteristics that such heuristics might check, especially in the context of checkCIF, would be the absence of non-null values for substantially all data names in the _diffrn*, _exptl*, _refine*, _refln* and _reflns* categories.  Exceptions that might be expected to be present include the proposed _exptl_method item; *_details items; and a handful of items, such as _exptl_crystal_absorpt_coefficient_mu, that are actually computed from the structure rather than being measured.
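
A minimal Python sketch of such a check, assuming the data block has already been parsed into a dictionary that maps data names to string values (the parsing step, the function names and the strict "no measured item at all" rule are assumptions made for illustration, not part of checkCIF):

```python
# Data name prefixes of the experimental categories; "_refln" also covers "_reflns".
EXPERIMENTAL_PREFIXES = ("_diffrn", "_exptl", "_refine", "_refln")

# Items that may legitimately appear in a computed structure.
ALLOWED = {
    "_exptl_method",
    "_exptl.method",
    "_exptl_crystal_absorpt_coefficient_mu",
}

def measured_items(items):
    """Return the experimental data names that carry a non-null value."""
    found = []
    for tag, value in items.items():
        name = tag.lower()
        if not name.startswith(EXPERIMENTAL_PREFIXES):
            continue
        if name in ALLOWED or name.endswith("_details"):
            continue
        # '?' and '.' are the CIF null values.
        if value.strip() not in ("?", ".", ""):
            found.append(tag)
    return sorted(found)

def looks_computed(items):
    """True when no experimental item has a non-null value."""
    return not measured_items(items)
```

A softer variant could tolerate a small number of populated items, in line with "substantially all" above; the threshold would have to be tuned on real files.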

 

Best regards,

 

John Bollinger

--

John C. Bollinger, Ph.D., RHCSA

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

From: coreDMG <coredmg-bounces@iucr.org> On Behalf Of Mike Hoyland via coreDMG
Sent: Thursday, February 24, 2022 11:03 AM
To: coredmg@iucr.org
Cc: Mike Hoyland <mh@iucr.org>
Subject: Purely calculated structural data in CIF

 


 

Dear All,

We are currently working on improving the checkCIF handling of powder diffraction CIFs, and have coincidentally come across an issue with handling purely calculated structural data, e.g. from a DFT calculation. So far we have relied on finding the string "DFT" in the values of various data names, e.g.

_computing_structure_solution
_diffrn_measurement_device_type

There is no guarantee of course that it would be present in this form.
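
For illustration only, a keyword check of this kind might look like the following Python sketch. It is not the checkCIF code: the function name, the dictionary of already-parsed data items and the case-insensitive whole-word match are all assumptions.

```python
import re

# The two data names mentioned above; others could be added.
CHECKED_TAGS = (
    "_computing_structure_solution",
    "_diffrn_measurement_device_type",
)

# Whole-word match, so that "DFT-D" matches but an unrelated longer word does not.
DFT_WORD = re.compile(r"\bDFT\b", re.IGNORECASE)

def mentions_dft(items, tags=CHECKED_TAGS):
    """Return the checked data names whose value mentions DFT."""
    return [tag for tag in tags if DFT_WORD.search(items.get(tag, ""))]
```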

Therefore, I would like to ask if anyone has any thoughts about how we would be able to simply identify or mark a particular structural datablock as containing calculated rather than experimental data.

With thanks for any thoughts or suggestions,

Mike Hoyland
Systems Developer
IUCr, Chester



_______________________________________________
coreDMG mailing list
coreDMG@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/coredmg


--
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology,
room C521, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania

