(59) Powder indexing; closing on pdCIF

To: COMCIFS@iucr.ac.uk
Subject: (59) Powder indexing; closing on pdCIF
From: bm
Date: Wed, 12 Mar 1997 12:09:54 GMT
Dear Colleagues

In this mailing I describe how I have modified the powder dictionary in the
light of our recent discussions; the latest version is 0.995, and will be
posted on the web today. I have one substantial addition to propose; this is
the request for terms for indexing that Robin Shirley details in item D59.1
below. I'd like to add these terms, suitably modified to match the latest
version of the dictionary, if Brian T. agrees. I have agreed with Brian that
the order of topics shall be retained ("topics" are the thematic categories
into which the datanames are grouped, not the formal database-style _category
groupings); within each topic datanames shall be arranged alphabetically - I
haven't done this yet for the current revision.  Apart from these two items, I
consider the current review essentially complete, and invite formal approval
from those of our colleagues who have not yet responded (unless they retain any
active concerns).

I think this note from Peter Murray-Rust may already have gone out to everyone,
but I reproduce it here for the record. It's good to hear from you again,
Peter; and I look forward to trying out your sample software.

PMR> Dear COMCIFs colleagues,
PMR> 	I haven't posted recently but I haven't been completely idle on
PMR> CIF.  I have been developing a universal browser for molecular science
PMR> which is based on structured document technology, especially SGML and Java
PMR> and which is capable of reading a wide variety of formats.  I am delighted
PMR> with the new CIF and there has been a lot of valuable structuring,
PMR> including the category_, units_ and so on.  
PMR> 
PMR> 	I can read and display the components of toz.cif and also BrianT's
PMR> powder example (including the diffractogram, which is displayed in a
PMR> zoomable graphics tool). I am now working on linking the dictionary to the
PMR> data file, in a way that I started with tkCIF but is much better
PMR> structured with XML/Java (XML is the new easy SGML from W3).  This means
PMR> that the data file can download the dictionary items and act upon them
PMR> (apart from about 20 hardcoded items like the cell dimensions and atomic
PMR> attributes which have special routines.)
PMR> 
PMR> 	The tool does syntactic but not semantic validation (i.e. it does
PMR> not check that categories are all loop_ed together - it could if required).
PMR> It also sorts items and can act as a graphical editor and provides a
PMR> structured document search (on the text).  It will be freely available
PMR> quite shortly.  
PMR> 
PMR> 	I have not found any errors in the dictionaries or toz.cif.  This
PMR> is now at the stage where if people want to add methods to dictionary
PMR> items this could be done as java classes and downloaded over the WWW.  You
PMR> may remember that originally I suggested this in tcl, but Java is going to
PMR> be a better solution, I think.

---------------------

You may like to look at the HTML version of the DDL1 and core CIF
dictionaries at http://www.iucr.ac.uk/cif/home.html, if you haven't already
done so. These were done in part to link to Acta C Notes for Authors and
Data Validation criteria, but I'm quite happy to add the new dictionaries
when they're approved.

D59.1 Powder indexing
---------------------
I was asked some considerable time ago to introduce the attached proposal
from Robin Shirley for the introduction of a few terms for powder indexing,
and owe him an apology because it slipped my mind. I include his proposal
verbatim; it refers to a considerably earlier pd draft, and clearly some
items as proposed won't work - _pd_index_appendix should be
_pd_index_details, say. But I would appreciate an indication of whether
these suggestions are suitable in the spirit in which they were intended -
as a working minimum set of terms for a particular purpose. I would comment
also that the proposal of similarly well worked definitions from other members
of the community would be no bad thing.

                             CIF Powder Extensions (v.0.7)

                         Additions to include powder indexing

                                     Robin Shirley
Psychology Department, University of Surrey, & Clarendon Laboratory, Oxford
University, UK.

Introduction

While the existing powder extensions are strong on most aspects of data
measurement and characterisation, they appear to have little explicit provision
for recording powder indexing information.

This is a significant omission, because powder indexing is an essential stage
in the characterisation of powder data from unknown solid phases, and one that
is in some ways closer to experimental aspects like calibration, or
identification of lines from additional phases, than it is to the subsequent
interpretation of the data. Indexing is also gaining increasing importance
with the growth in the determination of ab initio structures from powder data.

Like calibration and the identification of spurious lines, it forms part of
the essential clarification of the raw observed data that must take place
before they can be used for many other purposes. Also, like them, the
indexing history of a powder pattern needs to be recorded, because of the
serious consequences if it goes wrong (and the frequency with which this
turns out to occur), and the need to be able to backtrack and re-examine the
indexing when this happens.

It can be argued that catchall categories like _pd_proc_appendix are available
for these purposes, but there is a danger that, unless explicit provision is
made, the proper inclusion of indexing information is liable to be overlooked.

For these reasons, I respectfully urge that explicit provision be made to
encourage the inclusion of at least basic indexing information in Powder CIFs.
The additional definitions proposed below provide a way of doing this that
would be in keeping with existing definitions, and would provide a structure
for basic indexing information that could readily be exploited for the
recording of more detailed information by users that wished to do so.

Proposed Additional Definitions

_pd_proc_quadr_Q                                                   (numb)
Quadratic form Q=1/d^2^
The derived measure of line position used by most indexing programs.

Appearance in list: both. The permitted range is 0.0 to infinity. The units
extensions are `(reciprocal Angstroms squared *1.0) 'QU' (Q-units /10000.)
                                                                  [pd_proc]
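
As an illustration of the definition above, a minimal Python sketch of the
conversion might look like the following. The function names and the `units`
keyword are hypothetical; the factor of 10000 is taken from the 'QU' units
extension quoted above, and the two-theta form assumes Bragg's law with the
wavelength in Angstroms.

```python
import math


def q_from_d(d_angstrom, units="A^-2"):
    """Return the quadratic form Q = 1/d^2 for a d-spacing in Angstroms.

    units="A^-2" gives reciprocal Angstroms squared; units="QU" gives
    Q-units, i.e. 10000/d^2 (hypothetical keyword values).
    """
    if d_angstrom <= 0.0:
        raise ValueError("d-spacing must be positive")
    q = 1.0 / d_angstrom ** 2
    if units == "QU":
        return q * 1.0e4
    if units == "A^-2":
        return q
    raise ValueError("unknown units: %r" % units)


def q_from_two_theta(two_theta_deg, wavelength_angstrom):
    """Return Q = 1/d^2 from a peak position, using d = lambda / (2 sin(theta))."""
    theta = math.radians(two_theta_deg / 2.0)
    d = wavelength_angstrom / (2.0 * math.sin(theta))
    return q_from_d(d)
```

For example, q_from_d(2.0) gives 0.25 and q_from_d(2.0, units="QU") gives
2500.0.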

_pd_index_appendix                                                (null)
This section contains descriptive information on the indexing history of the
pattern.
                                                                 [appendix]

_pd_proc_index_merit                                              (char)
Indexing figure of merit for the observed pattern and the chosen set of cell
constants. The type(s) of figure of merit used should be specified. Previous
values (and the associated cell, etc.) can be saved in _pd_index_appendix.

Example: `M20 = 26.72'
Appearance in list: no. E.s.d. expected: no.
                                                                 [pd_proc]

_pd_peak_index_status                                            (char)
A four-character code word showing the status of indexing (or non-indexing)
for the peak.  Some suggested values are `uniq' (uniquely indexed by a single
set of Miller indices), `mult' (multiply indexed by several possible sets of
Miller indices), `part' (singly indexed because a multiply-indexed peak
has been intensity-partitioned into its components; do not use `part' for
peaks that have been extracted by deconvolution, which may be flagged as
`uniq' or `mult' as the case may be), `omit' (peak omitted from indexing
because from a known secondary phase or otherwise flagged as unreliable on
a priori grounds - not mere failure to index), `fail' (failed to index).

Appearance in list: yes. If looped, _pd_peak_id must be present in the
same list.                                                      [pd_peak]
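
A reader of such a file might check the status codes along the following
lines. This is only a sketch: the table holds the suggested values listed
above, the function name is hypothetical, and since the proposal says "some
suggested values" an unrecognised four-character code is reported rather
than rejected.

```python
# Suggested codes from the proposal, with abbreviated descriptions.
INDEX_STATUS = {
    "uniq": "uniquely indexed by a single set of Miller indices",
    "mult": "multiply indexed by several possible sets of Miller indices",
    "part": "singly indexed after intensity partitioning of a multiple peak",
    "omit": "omitted from indexing on a priori grounds",
    "fail": "failed to index",
}


def describe_index_status(code):
    """Return the description for a suggested code, or None if unrecognised.

    Raises ValueError if the code is not a four-character word, which is
    the only hard constraint stated in the proposed definition.
    """
    if len(code) != 4:
        raise ValueError("status code must be four characters: %r" % code)
    return INDEX_STATUS.get(code)
```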


D59.2 Spelling of colours
-------------------------
This from Andy Hammersley:

AH> I showed the pdCIF definitions and example to Andy Fitch (one of our 
AH> local powder diffractionists) and he found the definitions clear and 
AH> covering a very wide range of possibilities. 
AH> 
AH> I approve the powder CIF dictionary.
AH> 
AH> Just a very minor comment. The _pd_char_colour definition contains 
AH> a mix of British and American spellings i.e. "colour" and "gray". 
AH> Is there a recommended system of spellings, where "English" words
AH> are spelt differently in different countries ?

'Gray' is acceptable to Chambers (admittedly a Scottish dictionary), though
'grey' is certainly more common in England. But I take it the ICDD terms are
offered as an enumeration list to standardise colour terms, and therefore we
should stick to their spelling. However, I do notice that the list of
suggested colours for crystals in Acta C Notes for Authors uses 'grey'.

D59.3 mmCIF: _atom_type.oxidation_number
----------------------------------------
D>         I have noticed in cifdic.m96 that the enumeration range for 
D> oxidation states has a maximum of 6.  This should be changed to 8 as in 
D> the core.
 
 
Ongoing Discussions
===================

(57)D56.1 Review of pdCIF - minor points
----------------------------------------
B> all changes seem fine (_pd_proc_intensity_total is correct)

D56.2 Categories
----------------
D>         Now that I have had a chance to look at the mmcif dictionary and 
D> get some sense of how DDL2 works, I am less impressed by the inevitable 
D> need for all the dictionaries eventually to convert to DDL2.  The 
D> flexibility 
D> that we have in DDL1 is an advantage that needs to be weighed heavily 
D> against the increased specificity in DDL2.  I can see that DDL2 makes 
D> sense for the macromolecularists who are working in a field that has a 
D> chemical structure which extends from crystallography to many other 
D> techniques, and where database searching based on this chemical structure 
D> is going to be essential for further progress.  Such is not the case in 
D> other crystallographic fields.  However, I could imagine a situation 
D> where, say, the organic chemists might wish to develop a similar 
D> structure.  That would be time enough to consider developing a 
D> specialised DDL2 version of a cif dictionary geared to that particular 
D> need.  mmCIF has managed to adapt the core dictionary to its own needs, 
D> and this could be done again for another field if necessary.  Therefore I 
D> agree that we do not need to be overly fussy about the definitions of 
D> categories and category names just because we want to smooth the 
D> transition to DDL2.  I find the practice in cifdic.p97 acceptable.

B> I was not very careful in the assignment of category names. I did make an
B> attempt to see that anything that one might want to include in a single loop
B> would be assigned to the same category. I was probably overzealous in
B> eliminating categories from the previous draft dictionary. I do find the task
B> of assigning categories rather difficult. In many cases one does not want to
B> require a one-to-one relation between two lists - for example, measured and
B> processed data points - but most frequently there will only be one list. It
B> is very difficult to explain to a user (or programmer who is not educated
B> in the art of database normalization [normalisation - for those not
B> afflicted by Webster]) that they cannot place these two related items in
B> a single loop because they have been assigned different categories.
B> 
B> We could realign the _category assignments for all data items that will not
B> be looped to match the item names - but as I have discussed before, we can't
B> fix the naming conflicts where items will be looped. I don't really have an
B> efficient way to explore all the possible ways that a scientist might want to
B> group together looped items.
B> 
B> (2) All items that I thought might be needed in a single data loop were
B> assigned to pd_data. I tried to differentiate descriptive information
B> from quantitative information with pd_meas_info and pd_meas_method but I may
B> not have done too careful a job.

I've left the current descriptive text as it is, though it might be useful
to include some additional explanation in any documentation that accompanies
the dictionary.

B> (3-5)_pd_calib_2theta_offset will likely be looped with _pd_calib_detector_id
B> for multidetector instruments, but frequently will not be looped for single
B> detector instruments or will be looped with _pd_calib_2theta_off_ but not
B> _pd_calib_detector_id. I do not foresee _pd_calib_std_ items in the same
B> loop_ as _pd_calib_detector_id or _pd_calib_2theta_
B> 
B> I would hope that _pd_calib_conversion_eqn would not be looped. Life is
B> already complex enough without having to choose from a list of equations.
B> 
B> I seem to recall that I wanted to loop _pd_instr items with _pd_calib_ items
B> but now I don't see why. It would make sense to set the category for the
B> _pd_calib_ items to pd_calib. It might make sense to split them into two
B> categories, as you suggested.

I have changed all the _pd_calib_ categories to 'pd_calib', except for
_pd_calib_conversion_eqn and *_special_details, which I have renamed as
_pd_calibration_conversion_eqn and *_special_details, and assigned to the
formal category 'pd_calibration'. This is slightly cumbersome, but does
allow the generation of loops for the 'pd_calib' items and non-looped
'pd_calibration' information. This is explained in the '_pd_calib_[pd]'
definition with the extra sentence "The _pd_calibration_ items, however,
are never looped."

B> (6) I do not foresee too many occasions where one would have more than one
B> _pd_proc_ block, but I could foresee the case where I might run a sample
B> for you here and send you the raw and processed data. You might then do a
B> more careful job of processing and create a new CIF with the original raw
B> data block but a new processed data block.

OK. As I mentioned last time, the current dictionary has possibilities for
extension to cover multiple datasets in a systematic way if and when there
is a pressing need to introduce the necessary additional list identifiers.
I'll take no action for now.

B> (7) _pd_proc_number_of_points can be assigned any convenient category

It's now in category pd_proc_info.

D57.1 E.s.d. -> s.u.
--------------------
B> I would prefer:
B>     "Datasets that are measured as counts, where a standard uncertainty
B>     can be considered equivalent to the standard deviation and where the
B>     standard deviation can be estimated as the square root of the number
B>     of counts recorded...
B> 
B>     "Standard uncertainties should not be quoted for these values.
B>     If the standard uncertainties differ from the square root of the
B>     number of counts, _pd_meas_intensity_ should be used."

I've changed the wordings as suggested.
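
The counting-statistics rule in that wording could be checked roughly as
follows. This is a sketch only: the function names and the tolerance are
hypothetical, and the test simply asks whether a quoted standard uncertainty
agrees with the square root of the counts.

```python
import math


def su_from_counts(counts):
    """Standard uncertainty implied by counting statistics: sqrt(counts)."""
    if counts < 0:
        raise ValueError("counts cannot be negative")
    return math.sqrt(counts)


def su_matches_counting_statistics(counts, su, rel_tol=1e-6):
    """True if the quoted s.u. agrees with sqrt(counts).

    When this is False, the wording above says the value should be
    recorded with the _pd_meas_intensity_ items instead.
    """
    return math.isclose(su, su_from_counts(counts), rel_tol=rel_tol)
```

For example, 400 counts imply a standard uncertainty of 20.0, so a quoted
value of 25.0 would not match.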

D57.2 Errors in the _pd_ example file
-------------------------------------
B> Thanks!
B> 
B> I think that we could allow spaces in the _pd_block_id, but life might be a
B> bit simpler if we do not. I will be happy either way.

Let's make life simpler. Spaces are not permitted.

D57.3 sample/specimen
---------------------
B> I would prefer that the _samp/* and _*/samp items be renamed to spec for
B> consistency.
B> 
B> The form  *_src/samp is better.

Done.

D57.4 _pd_calc_method appears in list
-------------------------------------
B> I do not foresee _pd_calc_method appearing in a loop.

I have deleted '_list yes'.

D57.5 Uniqueness criterion for _pd_refln_peak_id
------------------------------------------------
B> I have not worried about DDL rules too much so there may well be
B> discrepancies like this one in other places.

As I suggested, I've removed the '_list_uniqueness' reference here. I've
tried to check for DDL rules by eye, but perhaps one day we'll have a
program that will do the whole job.

D57.6 '?' vs '.'
----------------
B> Yes, '.' is more appropriate than '?'

Changed as appropriate.

D58.1 Enumeration constraints
-----------------------------
D>         _pd_spec_shape
D>                 The two shapes given here are the shapes into which 
D> virtually all powder samples are formed.  Any other shape is covered by 
D> 'irregular', but there is nothing to prevent us from adding to this 
D> enumeration list in later versions if a new shape suddenly becomes 
D> fashionable.  The list does not have to be exhaustive, providing there is 
D> an escape value such as 'irregular'.

B> D58.1 Enumeration constraints on various pd items
B> 
B> I used -180.0:360.0 as an enumeration range because some people prefer to
B> specify angles as -180:180 and others 0:360 and this covers either option.
B> -360.0:360.0 is OK but I don't see any advantage in that constraint.
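
The point about the two conventions can be seen with a small range check.
This is a sketch with a hypothetical function name; the default limits are
the -180.0:360.0 range quoted above.

```python
def in_enumeration_range(value, low=-180.0, high=360.0):
    """True if value lies within the inclusive range low:high."""
    return low <= value <= high


# The same direction written in each convention falls inside the range.
signed_convention = -90.0      # -180:180 style
unsigned_convention = 270.0    # 0:360 style
both_accepted = (in_enumeration_range(signed_convention)
                 and in_enumeration_range(unsigned_convention))
```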

D58.2 Example for _pd_calib_conversion_eqn
------------------------------------------
B> By popular request, here is an example equation that might be used.
B> 
B>    two-theta_actual = two-theta_setting + arctan(
B> 	     cos(P1) / (1/(P0 * (CC - CH0 - P2*CC**2)) - sin(P1)) )
B> 
B> This allows for the calibration of two-theta for a linear PSD where the PSD 
B> has been set so that the "center channel" (CH0) is located at
B> two-theta_setting as a function of the channel number (CC). In addition to
B> CH0, variables P0, P1 and P2 are calibration constants, where P0 is the
B> width of a PSD channel in degrees, P1 is the angle of the PSD with respect
B> to the perpendicular and P2 is a quadratic term for non-linearities in
B> the detector. Would anyone like to convert this to a CIF form?

I have incorporated this in the latest draft, both as an annotated example
to _pd_calib_[pd] and a standalone example in _pd_calib_conversion_eqn.
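
For readers who want to try the equation numerically, a literal Python
transcription might look like this. It is a sketch under stated assumptions:
the message does not give the unit conventions for the trigonometric
functions, so P1 is taken here to be in degrees and the arctan term is
converted to degrees before being added to the setting angle. The function
and argument names are hypothetical.

```python
import math


def two_theta_actual(two_theta_setting, cc, ch0, p0, p1_deg, p2):
    """Evaluate the quoted PSD calibration equation for channel cc.

    two_theta_setting: detector setting angle (degrees)
    cc, ch0: channel number and centre channel
    p0, p1_deg, p2: calibration constants as described in the message
    Assumption: p1 is in degrees; the arctan result is returned in degrees.
    """
    x = p0 * (cc - ch0 - p2 * cc ** 2)
    if x == 0.0:
        # 1/x is undefined here; the correction term tends to zero.
        return two_theta_setting
    p1 = math.radians(p1_deg)
    # Raises ZeroDivisionError if 1/x happens to equal sin(p1).
    offset = math.atan(math.cos(p1) / (1.0 / x - math.sin(p1)))
    return two_theta_setting + math.degrees(offset)
```

With P1 = 0 and P2 = 0 the correction reduces to arctan(P0 * (CC - CH0)),
which gives a quick sanity check on any implementation.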

D58.3 Systematic data naming
----------------------------
B> S> I would have thought that the data names....
B> S>                                '_pd_proc_intensity_calc_bkg'
B> S>                                '_pd_proc_intensity_fix_bkg'
B> S> would be more systematic as...
B> S>                                '_pd_proc_intensity_bkg_calc'
B> S>                                '_pd_proc_intensity_bkg_fix'
B> 
B> I agree

Changes incorporated in latest draft.

D58.4 R factor nomenclature
---------------------------
B> R~B~ is correct. R~exp~ is better too.


Best wishes
Brian