(57) Further review of pdCIF

To: [email protected]
Subject: (57) Further review of pdCIF
From: bm
Date: Wed, 12 Feb 1997 18:09:42 GMT
Dear Colleagues


D56.1 Review of pdCIF - minor points
------------------------------------

This item addressed several minor quibbles and typos.

B>>D> 2. _*_author_name should refer to the conventions in the core
B>>D> (_publ_author_name).  Similarly for _*_fax and _*_phone (this occurs in
B>>D> several places)
B> 
B> Yes. Could I ask you (Brian M.) to do this?

Done. This, and several of the other points mentioned below, are implemented
in revision 0.992, which I've posted on the web page and ftp server. You
need only collect this version if you're not confident I'll get things
right, or if you are a fanatic about detail.

B>>D> 4. _pd_calib_conversion_eqn
B>>D>         An example would be helpful here.  Does this data item have a
B>>D> syntax that would allow a program to check whether a conversion had been
B>>D> correctly made?  It would seem to be an item looking for a regular
B>>D> expression, but perhaps we are not ready yet for this step.
B>>D>
B> 
B> There are a number of equations in the dictionary where I would love to have
B> a regular expression -- but I am not sure I can be general enough and still
B> impose enough structure to allow for machine parsing. My current thoughts are
B> to leave this for the next version.

Agreed. I remind potential volunteers that the _type_construct feature of
DDL1.4 remains to be tested and implemented.

B>>D> 6. _pd_instr_cons_illum_len
B>>D> Last paragraph of definition should refer to _*_var_* not _*_cons_*
B> 
B> Yes.

Done.

B>>D> 7. _pd_proc_info_excluded_regions
B>>D>         Should this item be computer readable or is this not necessary
B>>D> since the excluded region will be marked in the diffractogram by zero
B>>D> weights?
B> 
B> My original thought on this was that it should be computer readable, but as
B> the dictionary evolved it became part of the documentation -- why did the
B> author remove observations from fit?

B>>D> 8. _pd_proc_intensity
B>>D>         The last paragraph referring to normalisation factors should
B>>D> include a reference to preferred orientation which I assume would be
B>>D> included.  It is mentioned elsewhere in the dictionary, but this seems to
B>>D> be the only place where the correction is recorded.
B> 
B> There is _pd_proc_ls_pref_orient_corr -- to describe the correction, but I do
B> not have a table for the correction values. It could be added if there are
B> strong feelings. Perhaps a reference is needed in the _pd_proc_intensity
B> definition.

Here are the other significant changes I've made in version 0.992:

-ization changed to -isation throughout, to make more consistent with the
core (though I'm not sure why this was done in the core; traditional British
setting uses -iz- for Greek derivations). Likewise analyze -> analyse, for
which there is etymological justification.

In _pd_block_id I changed
        "and possibly the data_ name. It may be a sample name."
to
        "and possibly the name of the current CIF data block (i.e. the string
        xxxx in a data_xxxx identifier). It may be a sample name."
in the hope that it would be clearer to a non-expert reader. In the same
definition, I think
        "As blocks are created in a CIF, the original <sample_id>
        should be retained, but the <creator_name>"
should say something like
        "As blocks are created in a CIF, the original sample identifier
        (i.e. <block_name>) should be retained, but the <creator_name>"

In _pd_calib_conversion_eqn, the reference to _pd_meas_distance_value should
be to _pd_meas_position.

In _pd_instr_[pd], the reference to _diffrn_radiation_polaris_ratio should
be to _diffrn_radiation_polarisn_ratio.

In _pd_proc_ls_prof_, the reference to _pd_proc_total should
be to _pd_proc_intensity_total (I think).

I added "in kelvins" to the definition of _pd_prep_temperature.

D56.2 Categories in pdCIF
-------------------------

B>>D> 3. I am a little unhappy about the category pd_data whose items seem to be
B>>D> scattered throughout a number of other categories.  I can understand why
B>>D> Brian has done this, but it violates our unwritten convention that the
B>>D> first part of a data name should be the same as the category name (to
B>>D> assist, inter alia, in any conversion to DDL2). 
B> 
B> I am caught between two rules -- (1) data in different categories may not
B> be in the same loop; implicit rule (2) "the first part of a data name should
B> be the same as the category name." I anticipate that _pd_meas and _pd_proc
B> items will frequently (but not always) be in the same loop (see my example
B> pdCIF file where I have a single loop_ with  _pd_meas_counts_total
B> _pd_proc_ls_weight _pd_proc_intensity_calc_bkg and
B> _pd_calc_intensity_total). This is why I was not enthusiastic about use of
B> categories for database normalization about two years back and was even more
B> unhappy about the DDL2 changes, for those that remember.
B> 
B> I am less than pleased about changing the names of many -- perhaps the
B> majority of the _pd_ items to _pd_data_. I propose we drop rule (1) and
B> allow items from different categories to be combined into a single loop.
B> This allows adoption of rational category names. Note that with rule (1)
B> in place it will never be possible to have a data item named _pd_ in a
B> loop with a core data item without breaking rule (2)!

I see things a little differently. Rule (2) is unwritten; rule (1), however,
is part of the published DDL1.4 standard. Infraction of rule (1) is counter
to the standard; infraction of rule (2) merits at most a warning (Herbert
Bernstein's new working of CYCLOPS will raise a warning if the initial part
of the data name differs from the category name; this is a useful check for
dictionary authors, but an inconvenience to most users). Further, rule (1)
is more important in normalising data tables than rule (2), and hence of
more importance in allowing a transition to DDL2 if this is considered
appropriate.

What is this all about? To recapitulate for the benefit of the more recent
members, a relational database can be considered as a series of tables. Each
row in a table must be unique, and must be identified by a unique key value
(this can be the value of one field, or of a number of fields taken
together). Some fields may have values which can be used to link to
information in other tables - these comprise "foreign keys", or pointers to
key values in the other tables. "Normalisation" of a database involves the
imposition of a data model that separates data items that are closely
related to each other into separate tables, linked by appropriate foreign
keys to tables bearing more distant relationships. The structuring of
information in the macromolecular CIF dictionary is strongly bound to such a
data model. This is enforced by the DDL version 2 that defines data
relationships in the mmCIF dictionary. The DDL version 1.4 used in the core
maps approximately to this way of representing data (it leads fairly readily
to "unnormalised" tables, where the organisation of tables is less clean, in
some sense). This permits the core definitions to be embedded in the full
relational model of mmCIF without undue difficulty; but it also permits the
construction of data files that are less tightly bound to the relational
database model. And sometimes that approach can have benefits.

Brian T. is particularly anxious to be able to use that freedom in
organising the contents of pdCIF files (particularly the tabulated data
sets) in flexible ways. Consider, for instance, the definition for
_pd_proc_wavelength: "Wavelength in angstroms for the incident radiation ...
This will be a single value for continuous wavelength methods or may vary for
each data point and be looped with the intensity values for energy-dispersive
measurements." The advantage of citing an unvarying value once, rather than
tens of thousands of times (within the list of measured data points) is
rather clear; but in a full relational model the data item must be assigned
to a table where it always occurs. There is just about enough freedom to do
what is wanted under DDL1.4, where every data item is assigned to a category
that has the following properties (from the DDL1.4 paper): "The _category
attribute is used to specify the group, or basis set, to which a data item
belongs. This attribute is specified as a character string which matches the
portion of the data name, following the leading underscore." [That sentence
ought to read "which usually matches..."]. "Data items in a list must be of
the same _category. Data items of a given category may exist in different
lists provided each list contains an appropriate 'reference' data item (see
_list_reference). Items belonging to different categories should not appear
in the same list."
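
As a rough illustration only (this is not CYCLOPS, nor any existing tool), the
two rules could be checked along the following lines in Python. The category
table below is an assumption made up for the example, following the pd_data
assignments discussed here; the function name check_loop is likewise
hypothetical.

```python
# Hypothetical mini-dictionary: data name -> category. Illustrative only,
# not copied from the pdCIF or core dictionaries.
CATEGORY = {
    "_pd_meas_counts_total": "pd_data",
    "_pd_proc_ls_weight": "pd_data",
    "_pd_calc_intensity_total": "pd_data",
    "_refln_index_h": "refln",
}


def check_loop(names, category=CATEGORY):
    """Return (errors, warnings) for the data names of one looped list."""
    errors, warnings = [], []
    cats = {category[n] for n in names}
    if len(cats) > 1:
        # Rule (1): items of different categories in one list (DDL1.4).
        errors.append("mixed categories in one list: " + ", ".join(sorted(cats)))
    for n in names:
        # Rule (2): unwritten naming convention, so only a warning.
        if not n.startswith("_" + category[n] + "_"):
            warnings.append(n + " does not begin with _" + category[n] + "_")
    return errors, warnings
```

With this table, a list holding _pd_meas_counts_total and _pd_proc_ls_weight
gives no errors but two warnings, which is the trade-off the pd_data category
accepts.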

As Brian T. observes, it's not possible to reconcile the conflicting desires
of loop membership and rigorous category naming. In particular, his data
names _pd_refln_peak_id, _pd_refln_phase_id and _pd_refln_wavelength_id
should all belong to the CORE category "refln", because they will be added
to the reflections list.

Hence the pd_data category gives the flexibility of including various
_pd_meas_, _pd_proc_ and _pd_calc_ data points together in the same looped
list, or of breaking them out into different lists if that's appropriate.
(Strictly speaking, the present model is defective, in that such a
separation should only be made if there are appropriate _list_reference items
that can be installed in each separate list, i.e.

loop_ _pd_data_id _pd_meas_foo _pd_calc_foo _pd_proc_foo
           1          000          aaa          AAA
           2          111          bbb          BBB

should be separated out as something like

loop_ _pd_meas_id _pd_meas_foo
           1          000
           2          111

loop_ _pd_calc_id _pd_calc_foo
           1          aaa
           2          bbb

loop_ _pd_proc_id _pd_proc_foo
           1          AAA
           2          BBB

But the missing list references, which I've indicated here as _pd_meas_id,
_pd_calc_id and _pd_proc_id, COULD be added in later if Brian were convinced
that this normalisation was mandatory.)
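
The separation described above can be sketched in Python as follows. This is
only a toy model, assuming each loop packet is held as a dictionary; the
helper names split_by_prefix and rejoin are hypothetical, and the *_id items
are the same hypothetical list references used in the example.

```python
# The combined list from the example, one dict per loop packet.
combined = [
    {"_pd_data_id": "1", "_pd_meas_foo": "000",
     "_pd_calc_foo": "aaa", "_pd_proc_foo": "AAA"},
    {"_pd_data_id": "2", "_pd_meas_foo": "111",
     "_pd_calc_foo": "bbb", "_pd_proc_foo": "BBB"},
]

PREFIXES = ("_pd_meas_", "_pd_calc_", "_pd_proc_")


def split_by_prefix(rows, prefixes=PREFIXES, key="_pd_data_id"):
    """Split one list into one list per prefix, each with its own id item."""
    out = {}
    for p in prefixes:
        out[p] = [
            {p + "id": row[key],
             **{k: v for k, v in row.items() if k.startswith(p)}}
            for row in rows
        ]
    return out


def rejoin(tables, key="_pd_data_id"):
    """Rebuild the combined list by matching the per-list id values."""
    merged = {}
    for prefix, rows in tables.items():
        for row in rows:
            packet = merged.setdefault(row[prefix + "id"], {})
            packet.update({k: v for k, v in row.items() if k != prefix + "id"})
    return [{key: i, **packet} for i, packet in merged.items()]
```

Here rejoin(split_by_prefix(combined)) gives back the combined list, which is
only possible because every separated list carries its own reference item.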

Anyway, accepting the current category assignments, I note the following
oddities:

(1) The _pd_refln_ items should go in category refln (I mentioned that above).

(2) Explanatory items such as _pd_meas_[pd] refer to elements in several
    categories (here, items belong to the categories pd_data, pd_meas_info
    and pd_meas_method). Perhaps some explanation of this should be added to
    the definitions.

(3) Should all the items beginning _pd_calib_ be assigned the category
    pd_calib rather than the existing pd_instr? They seem to belong not only
    in the formal category pd_calib, but to be related in the "logical"
    category pd_calib (i.e. they are related specifically to the instrumental
    calibration, which could fairly be regarded as a distinct topic from
    the other pd_instr items).

(4) _pd_calib_detector_id can act as an identifier in a list of multiple
    calibrations; but some of the other data items here
    (_pd_calib_2theta_offset, _pd_calib_std_external_) may also be looped.
    Should there actually be two or more categories here?

(5) For _pd_calib_conversion_eqn, should there be a _list both (i.e. "May
    appear in list" - presumably "containing _pd_calib_detector_id")? Or
    will this data item never be looped?

(6) The definition of _pd_proc_[pd] includes the sentence: "If the dataset
    is reprocessed, this section may be replaced (with addition of a new
    _pd_block_id entry)." This would imply that you will often have a
    _pd_meas_ dataset in one datablock, with several _pd_proc_ datasets
    in other blocks. This would seem to strengthen the case for having 
    multiple identifiers within the pd_data category (e.g. my hypothetical
    _pd_meas_id, _pd_proc_id above) to link particular data points across
    the various datablocks.

(7) _pd_proc_number_of_points appears to be the only occupant of category
    pd_proc_data. Is there a good reason for this, or should it be subsumed
    in pd_proc_info?

(8) Should not the _pd_proc_2theta_range_ data names be in a different
    category from pd_data, perhaps pd_proc_info (or even pd_proc_data)?


==============================================================================
 
D57.1 E.s.d. -> s.u.
--------------------

B>>D> 1.  E.s.d. should be replaced by s.u. throughout.
B>>
B>> I hesitated for a while over this, since there are discussions of counting
B>> statistics in the dictionary, and therefore cases where 'standard
B>> deviations' may indeed be intended, rather than 'standard uncertainties'.
B>> Brian, can you let me know if there are cases where this distinction does
B>> need to be made? Otherwise I'll make the requested change throughout.
B> 
B> To be honest, I am not sure which would be more appropriate for data. Is the
B> "estimated error" of SQRT(I) on an observation of I counts an ESD or an SU? I
B> would like to consult Ted Prince on this but I will be out of the office for
B> a week. Certainly derived quantities have an SU rather than an ESD. Anyone else
B> have an opinion?

I have gone through the dictionary searching for instances of the term, and
make the following suggestions:

_pd_meas_[pd]
    "Datasets that are measured as counts, where the estimated standard
    deviations (e.s.d.'s) are the square-root of the intensity,
    should be recorded in the _pd_meas_counts_* fields."
might be changed to
    "Datasets that are measured as counts, where a standard uncertainty
    might be estimated as the standard deviation, or square root of the
    number of counts recorded, should use the _pd_meas_counts_* fields."
Here there is no uncertainty in the value measured, which is a discrete
number; but it may reflect an uncertainty in the actual value that is
related to a typical Gaussian spread in the counting statistics.

_pd_meas_counts_
    "E.s.d.'s may not be specified for these values as they
    will be the square-root of the number of counts."
might read
    "Standard uncertainties should not be quoted for these values.
    Experimental uncertainty may be related to a standard deviation,
    or square root of the number of counts."

_pd_meas_intensity_
    "Use these entries for measurements where intensity values are not
    counts (use _pd_meas_counts_ for event counting measurements where the
    estimated standard deviation is the square-root of the number of counts)."
This might stand as is, or the last part might read
    "where the standard uncertainty is estimated as the square root of the
    number of counts)."

_pd_proc_intensity_
    "Inclusion of e.s.d.'s for these values is strongly recommended."
should change to
    "Inclusion of s.u.'s for these values is strongly recommended."

_pd_proc_ls_weight
    "Weight applied to each profile point. These values may be omitted if
    the weights are 1/\s^2^ where \s is the e.s.d. for the
    _pd_proc_intensity_net values."
should change (with the adoption of the u nomenclature in place of sigma) to
    "... These values may be omitted if the weights are 1/u^2^ where u is
    the s.u. for the _pd_proc_intensity_net values."

_pd_proc_ls_background_function
    "Include also the values used for the coefficients used in the background
    function with their e.s.d.'s."
should be
    "... with their s.u.'s."

_pd_proc_ls_pref_orient_corr
    "Include the value(s) used for the correction with e.s.d.'s."
should be
    "Include the value(s) used for the correction with s.u.'s."

_pd_proc_ls_profile_function
    "Include the values used for the profile function coefficients and their
    e.s.d.'s."
should be
    "... function coefficients and their s.u.'s."

_pd_peak_intensity
_pd_peak_pk_height
    "Good practice is to include e.s.d.'s for these values."
should likewise be
    "Good practice is to include s.u.'s for these values."

I won't implement these changes until we've had some expert guidance.
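
For readers who want the arithmetic behind the counting-statistics wording,
here is a minimal Python sketch. It assumes raw event counts with no further
processing, so that the s.u. is estimated as the square root of the count; the
function names are hypothetical. For processed intensities the s.u. would
have to be propagated properly and this shortcut would not apply.

```python
import math


def counting_su(counts):
    """s.u. estimated as the square root of the number of counts."""
    return math.sqrt(counts)


def ls_weight(u):
    """Weight 1/u^2 for a point with standard uncertainty u (u must be > 0)."""
    return 1.0 / (u * u)


# For unprocessed counts N the two steps combine to a weight of 1/N.
u = counting_su(100)   # 10.0
w = ls_weight(u)       # approximately 0.01
```

Note that a point with zero counts has no defined weight under this estimate,
which is one reason a dictionary might still want explicit weights recorded.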


D57.2 Errors in the _pd_ example file
-------------------------------------
Brian T.'s example file is a very useful accompaniment to the dictionary. 
However, it contains a few discrepancies.

 _diffrn_radiation_symbol   should be _diffrn_radiation_xray_symbol (l.191)
 _pd_instr_radiation_probe  should be _diffrn_radiation_probe (l.192)
 _pd_refln_i100_meas is not in the dictionary (l.279)
 _pk_width_2theta (l.216) should presumably be _pd_peak_width_2theta

Also, the definition for _pd_block_id and related id's in the dictionary
states "Blank spaces may also not be used.", but there is a space in the
"B. Toby" <creator_name> fragment of the examples. (A similar space has also
crept into the formatted dictionary; my typesetting software is trying to
be too clever in imposing the Acta house rules for initials!) Can I check
that Brian T. does wish to retain the "no spaces" rule?

D57.3 sample/specimen
---------------------
Brian has been very diligent in differentiating "sample" from "specimen",
e.g. in _pd_spec_[pd]: "This section contains information about the specimen
used for measurement of the diffraction dataset. Note that information
about the sample (the batch of material where the specimen was obtained), is
specified in _pd_prep_." However, in a plethora of definitions relating to
instrument geometry, e.g. _pd_instr_dist_mono/samp, the "samp" in the data
name jars against the correct "specimen" in the definition. Is it too late
to change these datanames to ...spec ?

An unrelated stylistic point in these definitions is the usage of
abbreviations, sometimes dist_src/samp, sometimes _src/samp. The convention
in the original CIF paper was to use constructs of the form *_src/samp, and
I'm happy to enforce this convention consistently, unless there are
objections.

D57.4 _pd_calc_method appears in list
-------------------------------------
Should it? [i.e. should "_list yes" appear in _pd_calc_method?]

D57.5 Uniqueness criterion for _pd_refln_peak_id
------------------------------------------------
The list attributes for _pd_refln_peak_id include:

    _list                       yes
    loop_ _list_reference      '_refln_index_h'
                               '_refln_index_k'
                               '_refln_index_l'
    _list_link_parent          '_pd_peak_id'
    loop_ _list_uniqueness     '_pd_refln_peak_id'
                               '_pd_refln_phase_id'
  
_list_uniqueness should only appear in a definition in which _list_mandatory
is set to "yes", and it is intended to define the complete set of data
values which must occur uniquely within each loop packet (or table row, if
you will). It doesn't make sense to have "loop_ _list_uniqueness
'_pd_refln_peak_id' '_pd_refln_phase_id'" here, because there are many
entries in the reflections list that have the same values of _pd_refln_peak_id
and _pd_refln_phase_id when, for instance, _pd_refln_peak_id is '.'. I
suggest deleting the _list_uniqueness entry here.
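
The problem can be shown with a small Python sketch. The reflection data
below are made up for illustration, and duplicated_keys is a hypothetical
helper, not part of any CIF software.

```python
from collections import Counter


def duplicated_keys(packets, key_items):
    """Return the key tuples that occur in more than one loop packet."""
    seen = Counter(tuple(p[k] for k in key_items) for p in packets)
    return [key for key, n in seen.items() if n > 1]


# Made-up reflections: two are unobserved and carry '.' as the peak id.
refln = [
    {"_refln_index_h": 1, "_refln_index_k": 0, "_refln_index_l": 0,
     "_pd_refln_peak_id": "pk1", "_pd_refln_phase_id": "A"},
    {"_refln_index_h": 1, "_refln_index_k": 1, "_refln_index_l": 0,
     "_pd_refln_peak_id": ".", "_pd_refln_phase_id": "A"},
    {"_refln_index_h": 2, "_refln_index_k": 0, "_refln_index_l": 0,
     "_pd_refln_peak_id": ".", "_pd_refln_phase_id": "A"},
]

HKL = ["_refln_index_h", "_refln_index_k", "_refln_index_l"]
PEAK_PHASE = ["_pd_refln_peak_id", "_pd_refln_phase_id"]
```

In this toy list the Miller indices are unique for every packet, whereas the
(peak id, phase id) pair repeats as soon as two reflections share '.' as the
peak id.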

D57.6 '?' vs '.'
----------------
Syd's STAR papers use the convention that the values '?' and '.' have
special meaning, respectively "value not known" and "value not appropriate".
It seems to me that there are two instances, in _pd_proc_intensity_ ("Use a
value of '?' for data points where a fixed background has not been defined.")
and _pd_refln_[pd] ("Reflections may also be included that are not observed;
use '?' for the _pd_refln_peak_id.") where '.' is more appropriate than '?'.

This ties in with the usage in the example file _refln_ list.
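
A program reading such a file might distinguish the two special values along
these lines. This is a sketch only, assuming the value is passed as the raw
unquoted token from the file; the function name and the returned labels are
hypothetical.

```python
def classify(value):
    """Classify a raw, unquoted CIF value token by the STAR convention."""
    if value == "?":
        return "unknown"        # value not known
    if value == ".":
        return "inapplicable"   # value not appropriate
    return "given"
```

Under this reading, an unobserved reflection with '.' for _pd_refln_peak_id is
reported as having no applicable peak, rather than a peak whose identity has
been lost.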


Regards
Brian