Re: Powder CIF Proposals

Subject: Re: Powder CIF Proposals
From: "ROBIN SHIRLEY (USER)" <R.Shirley@xxxxxxxxxxxx>
Date: Thu, 28 Sep 2000 15:55:13 +0100 (BST)
Brian and others

Let me begin by saying how pleased I am to see these proposals at
last being debated (I only received Brian's extract of February
correspondence yesterday).

They were originally submitted in 1993 at the time of the Beijing
Congress, but then seemed to go missing and had to be resubmitted
after Seattle, and so on.  It has been frustrating to see how slowly
such standardisation processes can move.  This is not meant as a
criticism of Brian but as a general comment.

The background to my four proposals is that I am heavily involved in
gathering the various mature indexing programs into an integrated
suite (Crysfire) and moving this towards becoming an expert system
for powder indexing.  In the course of this I wished to make it
possible for indexing programs to support Powder CIF standards.

This has not yet really been the case, because of specific omissions
from Powder CIF.  These reflect a failure to take the needs of the
indexing stage into account and, I think, also involve some
differences in underlying paradigm.  My proposals were an attempt to
address these needs, which I took up because I have worked on the
indexing problem since 1967 and am probably as well placed as anyone
to establish what would be needed.

I think that a difference in underlying paradigm may exist because
the principle on which CIF standards are built is that CIF files
exist primarily to record measured data, plus various derived
quantities that have been obtained from measured data by calculation.
Of course such derived quantities also introduce an element of
model-building and hence of hypothesis, but CIF definitions tend to
downplay this aspect.

The various quantities that are derived in the course of a structure
analysis such as positional parameters, and to a greater extent
thermal parameters and occupancies, reflect hypotheses concerning the
nature of what makes up the average structure - hypotheses that may
well not be unique.  However the default expectation within CIF seems
to be that, at each stage in a crystallographic investigation, both
measured and derived quantities have unique values.

Powder indexing moves further away from this assumption than does
structure analysis, since it is inherently an inductive,
multi-solution process - more so, for example, than the determination
of the phases of reflections in direct methods, since these converge
abruptly and convincingly to a definite set of values when the
correct point in solution space is found, at which nearly all the
calculated intensities agree with their measured values within the
reasonable error bounds of the measurements.

That is not the case with powder indexing, where the best that can be
hoped for is that the favoured solution will have a sufficiently low
set of obs-calc line-position differences to stand out from the
number (often large) of other trial cells that also account for the
set of measured line positions within their reasonable error bounds.

Regarding people's responses to my four specific proposals:

00-2-11.1) _pd_proc_quadr_Q  (or _pd_index_quadr_Q - see discussion
below)

I accept that if this could be derived directly from
_pd_peak_d_spacing, then the case for including it would be weak,
although it is actually the preferred measure used by most indexing
programs, being a linear function of the representation of the cell
as powder constants (in which Q(A) = 1/a_squared, etc, often scaled
up by 10000 for convenience).
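
As an illustration of the Q scale described above, the following
minimal sketch converts a d-spacing, or an angle-dispersive 2theta
position, to Q = 10000/d^2.  The function names and the Q_SCALE
constant are hypothetical, chosen only for this example; they are not
taken from any indexing program.

```python
import math

Q_SCALE = 1.0e4  # conventional scaling of Q; an assumption of this sketch


def q_from_d(d_angstrom: float) -> float:
    """Return Q = 10^4 / d^2 for a d-spacing given in Angstroms."""
    if d_angstrom <= 0:
        raise ValueError("d-spacing must be positive")
    return Q_SCALE / (d_angstrom * d_angstrom)


def q_from_two_theta(two_theta_deg: float, wavelength: float) -> float:
    """Return Q for a 2theta position (degrees), using Bragg's law
    d = wavelength / (2 sin theta), with wavelength in Angstroms."""
    theta = math.radians(two_theta_deg / 2.0)
    d = wavelength / (2.0 * math.sin(theta))
    return q_from_d(d)
```

For a cubic cell with a = 10 A, for example, q_from_d(10.0) gives
Q(100) = 100 and q_from_d(5.0) gives Q(200) = 400, i.e. Q scales
linearly with h^2 + k^2 + l^2.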

The objections raised in the discussion have clarified for me what
is actually intended by this item, or rather what is *not* intended.

At each stage during indexing, a particular set of observed Q values
will be used, which often remains the same throughout the indexing
process.  However, these Q values *need not* be derived in the same
way for each line.

Because apparently minor deficiencies in the dataset can lead to
disproportionately large increases in the time needed for successful
indexing, experienced practitioners often scrutinise the data
carefully first, and make adjustments to line positions on an
individual basis, for example by substituting a combined estimate
obtained from several runs which may well have been made under
different experimental conditions (such as different wavelengths
and/or instruments).

Similarly, because the positions of individual low-angle lines are
particularly important for some indexing methods and are often
harder to measure accurately than those at higher angles, it may be
beneficial to make adjustments based on lines at higher angles that
appear to be their higher orders.

Hence the Q values (or d-spacings, etc.) used during the indexing
stage need not and often will not all be derived directly and
consistently from _pd_peak_d_spacing, nor _pd_peak_2theta_centroid,
nor any other single measure of line position.  It remains important
to record them, otherwise whatever expertise has gone into adjusting
them will be lost and it will become impossible to reproduce the
reported indexing results.

That is in part why I originally proposed putting this item in the
_pd_proc_ section rather than _pd_peak_.

On reflection, I am now persuaded that, since these data are actually
specific to the indexing stage, it might be preferable to move it to
_pd_index_ so that it would become _pd_index_quadr_Q.  An equivalent
version might be _pd_index_d_spacing (I would prefer both to be
defined, but would settle for either).

00-2-11.2) _pd_index_appendix

Brian queried whether this might be accommodated within _pd_refln_.
I would argue strongly for a distinct section for indexing (and now
favour moving all indexing-specific items into it, such as quadr_Q
and index_merit - see below).

The sort of indexing history envisaged in my original proposal can
now be captured and updated automatically in the form of the Crysfire
logfile for that dataset - an example is attached.

00-2-11.3) _pd_proc_index_merit (or _pd_index_merit - see discussion
below)

Decisions concerning both this item and the next one need to reflect
the potentially very large number of trial cells that can index a
powder pattern within the reasonable error bounds of all its lines -
typically several hundred and often more than 5000.  Thus it can
become unrewarding to capture many parameters of each possible trial
cell on an individual basis, when sufficient basic information may
already be present within a cumulative logfile stored under
_pd_index_appendix (item 2 above).

Brian has suggested that a formal definition of the figure of merit
(FOM) would be required, but my experience in this field leads me to
doubt whether this is yet practicable - I am more inclined to support
Bob Von Dreele's suggestion to record instead the program that
calculated the FOM, at least for Q-based FOM belonging to the De
Wolff M20 family.

This precaution should not be necessary for the Smith & Snyder FN
measure, since that was well defined in their original 1979 paper
(as four distinct components).  However, in my judgement it is not
primarily an indexing FOM (one designed to indicate the plausibility
of a proposed *cell*) but rather one that indicates the quality of a
measured *pattern* by reference to an *assumed correct cell*.

Bob Snyder claims that FN is independent of crystal system, which is
true in its role as a measure of data quality, but not when used as a
measure of indexing success, since, for example, it does not favour
more parsimonious 1-parameter cubic models over 6-parameter triclinic
ones as I believe an indexing FOM should (Bob does not see it this
way and after many debates we have agreed to differ!).

Hence the four components of FN (N=number of observed lines used,
F=the actual FOM, D=mean absolute difference in 2theta, and
Nposs=number of possible calculated lines used) belong primarily
with the pattern, and so in _pd_proc_ (although F can indeed also be
used as a kind of indexing FOM).  An additional reason is that FN is
based on the more pattern-specific observable quantity 2theta, which
is available directly only for angle-dispersive data, while M20-type
FOM are based on less direct and more general Q values.

It would then be defined as _pd_proc_FN N F D Nposs, where N, F, D
and Nposs are as defined above.
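
As a rough sketch of how the four components fit together, the
example below computes F from N, Nposs and D using the Smith & Snyder
relation F = (1/D) * (N/Nposs), and formats an entry in the order
proposed above.  The function names and the number formatting are
assumptions made for illustration only.

```python
def smith_snyder_f(n_obs: int, n_poss: int, mean_abs_dtt: float) -> float:
    """Return F = (1 / D) * (N / Nposs), where D is the mean absolute
    difference in 2theta (degrees) between observed and calculated lines."""
    if n_obs <= 0 or n_poss <= 0 or mean_abs_dtt <= 0:
        raise ValueError("N, Nposs and D must all be positive")
    return n_obs / (n_poss * mean_abs_dtt)


def format_fn_entry(n_obs: int, n_poss: int, mean_abs_dtt: float) -> str:
    """Format the entry as '_pd_proc_FN N F D Nposs' (hypothetical layout)."""
    f = smith_snyder_f(n_obs, n_poss, mean_abs_dtt)
    return f"_pd_proc_FN {n_obs} {f:.1f} {mean_abs_dtt:.4f} {n_poss}"
```

With N = 30 observed lines, Nposs = 40 possible lines and D = 0.015
degrees, format_fn_entry(30, 40, 0.015) yields an F of 50.0.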

De Wolff's *original* M20 measure was also well-defined, but only if
*all* lines were considered to be indexed, which will not be the
case if any lines are excluded as "not indexed", since the criterion
for this was not defined but left to the user.  However, because of
implementation difficulties, I know of no indexing program that
actually uses De Wolff's original definition - not even Visser's ITO
which originated in De Wolff's own lab.

The implementation problem here is that De Wolff's original
definition excluded any "unindexed" lines from the 20 observed lines
that were used, so that, for example, a dataset with X20=2 unindexed
lines would require *22* observed lines to provide the 20 observed
*and indexed* lines needed for M20.  This means that for a dataset
containing fewer than 39 observed lines, the original M20 can be
calculated for some trial cells but not for others, depending on
whether or not their number X20 of excluded lines pushes the total
required above the number of lines that have actually been observed.

Thus *all* indexing programs that claim to calculate M20 actually
use some variant of Visser's "M20" which is based on just the first
20 observed lines, so that, for example, if X20=3 then only 17
observed lines would be used rather than 20.  Thus, although less
rigorous, it needs only 20 observed lines where the original M20
would have needed 23.
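
The line-count arithmetic of the two conventions can be sketched as
follows.  This is only a simplified illustration of the counting
described above, with hypothetical function names; it does not
compute the FOM itself.

```python
M_LINES = 20  # number of lines on which M20 is based


def lines_needed_original(x20: int) -> int:
    """Observed lines required by De Wolff's original definition:
    20 indexed lines plus the x20 lines excluded as unindexed."""
    return M_LINES + x20


def original_m20_computable(n_observed: int, x20: int) -> bool:
    """True if the dataset has enough observed lines for the original M20."""
    return n_observed >= lines_needed_original(x20)


def lines_indexed_visser(x20: int) -> int:
    """Indexed lines actually used by a Visser-style variant, which takes
    only the first 20 observed lines and drops the x20 unindexed ones."""
    return M_LINES - x20
```

So for X20=3, lines_needed_original(3) is 23 while
lines_indexed_visser(3) is 17, and a 21-line dataset with X20=2
cannot supply the original M20 at all.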

The bottom line here is that since all programs tackle such
implementation issues a bit differently, it is desirable to record
the program as well as the FOM measure.  Thus I propose that three
items are needed per FOM entry: first the value (M), then the name of
the FOM (usually M or M20), then the name of the program version
concerned (*not* that of an attributed FOM author such as Snyder or
Visser) which the Powder CIF standard obviously could not predefine.

Thus this would become:
   _pd_index_merit M FOM program

   (e.g. _pd_index_merit 21.7 M20 ITO12,
   or _pd_index_merit 54.215 M1 CRYS934h).

Some indexing programs calculate several FOM variants, so in theory
this sub-section could contain more than one entry for a single trial
cell.
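
To show how a CIF reader might handle such three-part entries, here
is a minimal sketch that splits a line of the proposed form into
value, FOM name and program.  The IndexMerit type and the parser are
hypothetical, and the sketch assumes that none of the three fields
contains embedded white space.

```python
from typing import NamedTuple


class IndexMerit(NamedTuple):
    value: float   # numerical value of the figure of merit
    fom: str       # name of the FOM, e.g. "M20"
    program: str   # program version that calculated it


def parse_index_merit(line: str) -> IndexMerit:
    """Parse '_pd_index_merit <value> <FOM> <program>' into its parts."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "_pd_index_merit":
        raise ValueError(f"not a _pd_index_merit entry: {line!r}")
    _, value, fom, program = fields
    return IndexMerit(float(value), fom, program)


# Several FOM variants recorded for the same trial cell
merits = [
    parse_index_merit("_pd_index_merit 21.7 M20 ITO12"),
    parse_index_merit("_pd_index_merit 54.215 M1 CRYS934h"),
]
```

Keeping the program name alongside each value lets FOM from different
implementations be told apart rather than compared directly.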

In principle I agree with Brian's suggestion that there could be a
loop containing _pd_index_trialid (for consistency, _trialid not
_cellid), ..trial_a, etc., but one needs to be aware, as discussed
above, of the fact that this could easily contain thousands of
entries, as is often the case with a Crysfire summary-file (see the
example attached, which is for the same dataset as the example
logfile), upon which it could indeed be based.  If several trial
cells were to be tabulated in a loop as suggested, then I think that
all the item names should be prefixed _pd_index_trial_ (including
the _merit entries).

Other items that are useful in such summaries and which might also
be included are: _pd_index_trial_Nobs, .._I20, .._volume, .._spgroup
(space-group, as far as determined - usually just a Bravais
lattice), .._date, .._time, .._pedigree (a program-specific audit
code for tracing how that solution was arrived at).  This list may
well require updating in due course, since the field of powder
indexing is still under active development.

00-2-11.4) _pd_peak_index_status (or _pd_index_peak_status - see
discussion below)

Since it would involve an entry for every line (perhaps 50-100) of
every trial cell (there could be thousands), I think that Brian's
option of looping this as a (cells x peaks) matrix is unattractive
and probably should not be attempted.

My intention was that this status list (like the list of trial hkl
indices) would be maintained only for the single, current front
runner among the trial cells.

However, for consistency with the emerging section schema, I now
think that the name should more logically become
_pd_index_peak_status.  In other words, all items that are
associated specifically with indexing operations should be gathered
in the _pd_index_ section.  That would make it easier for human
readers to locate what is relevant to indexing, and also for
automated CIF readers to include or exclude such material
systematically.

Such indexing status flags would most naturally be placed in the
loop that contains the trial indices for that cell - I have not yet
thought through the best way to cater for the common situation where
the calculated pattern yields multiple indexings for a single
observed line.  Maybe such complications aren't worth bothering
about, leaving the recorded information for that line confined to
just the indices for the closest calculated line plus a "mult"
index-status flag.
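
The simpler option just described could look roughly like the sketch
below: for each observed line, keep only the indices of the closest
calculated line and flag the line as "mult" when more than one
calculated line falls within tolerance.  The function name, the
tolerance argument and the "indexed" and "unindexed" flag values are
assumptions for illustration; only "mult" comes from the text above.

```python
from typing import List, Optional, Tuple

HKL = Tuple[int, int, int]


def index_line(q_obs: float,
               calc_lines: List[Tuple[HKL, float]],
               tol: float) -> Tuple[Optional[HKL], str]:
    """Return (hkl of the closest calculated line, status flag) for one
    observed Q value, given calculated (hkl, Q) pairs for a trial cell."""
    # Collect every calculated line within the error bound of the observed one
    matches = [(abs(q_calc - q_obs), hkl)
               for hkl, q_calc in calc_lines
               if abs(q_calc - q_obs) <= tol]
    if not matches:
        return None, "unindexed"
    # Keep only the indices of the closest calculated line
    _, best_hkl = min(matches, key=lambda m: m[0])
    status = "mult" if len(matches) > 1 else "indexed"
    return best_hkl, status


calc = [((1, 0, 0), 100.0), ((1, 1, 0), 200.0), ((0, 1, 1), 200.5)]
```

Here index_line(200.1, calc, 1.0) keeps the indices (1, 1, 0) and
sets the "mult" flag, since (0, 1, 1) also lies within the tolerance.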

With best wishes

Robin Shirley

---------------------------------------------------------------

> I have put my pdCIF hat back on again and have remembered that I
> would like to get some comments from you on the pdCIF proposals
> that you initiated a while back. I have attached a series of
> e-mail messages related to these questions. Please get back to me
> when you have the time.

> Thanks,

> Brian
 
> ********************************************************************
> Brian H. Toby, Ph.D.                    Leader, Crystallography Team
> [email protected]      NIST Center for Neutron Research, Stop 8562
> voice: 301-975-4297     National Institute of Standards & Technology
> FAX: 301-921-9847                        Gaithersburg, MD 20899-8562
>               http://www.ncnr.nist.gov/xtal
> ********************************************************************