(57) Further review of pdCIF
- To: COMCIFS@iucr.ac.uk
- Subject: (57) Further review of pdCIF
- From: bm
- Date: Wed, 12 Feb 1997 18:09:42 GMT
Dear Colleagues

D56.1 Review of pdCIF - minor points
------------------------------------

This item addressed several minor quibbles and typos.

B>>D> 2. _*_author_name should refer to the conventions in the core
B>>D> (_publ_author_name). Similarly for _*_fax and _*_phone (this occurs in
B>>D> several places)
B>
B> Yes. Could I ask you (Brian M.) to do this?

Done. This, and several of the other points mentioned below, are
implemented in revision 0.992, which I've posted on the web page and ftp
server. You need only collect this version if you're not confident I'll get
things right, or if you are a fanatic about detail.

B>>D> 4. _pd_calib_conversion_eqn
B>>D> An example would be helpful here. Does this data item have a
B>>D> syntax that would allow a program to check whether a conversion had been
B>>D> correctly made? It would seem to be an item looking for a regular
B>>D> expression, but perhaps we are not ready yet for this step.
B>>D>
B>
B> There are a number of equations in the dictionary where I would love to have
B> a regular expression -- but I am not sure I can be general enough and still
B> impose enough structure to allow for machine parsing. My current thoughts are
B> to leave this for the next version.

Agreed. I remind potential volunteers that the _type_construct feature of
DDL1.4 remains to be tested and implemented.

B>>D> 6. _pd_instr_cons_illum_len
B>>D> Last paragraph of definition should refer to _*_var_* not _*_cons_*
B>
B> Yes.

Done.

B>>D> 7. _pd_proc_info_excluded_regions
B>>D> Should this item be computer readable or is this not necessary
B>>D> since the excluded region will be marked in the diffractogram by zero
B>>D> weights?
B>
B> My original thought on this was that it should be computer readable, but as
B> the dictionary evolved it became part of the documentation -- why did the
B> author remove observations from fit?

B>>D> 8. _pd_proc_intensity
B>>D> The last paragraph referring to normalisation factors should
B>>D> include a reference to preferred orientation which I assume would be
B>>D> included. It is mentioned elsewhere in the dictionary, but this seems to
B>>D> be the only place where the correction is recorded.
B>
B> There is _pd_proc_ls_pref_orient_corr -- to describe the correction, but I do
B> not have a table for the correction values. It could be added if there are
B> strong feelings. Perhaps a reference is needed in the _pd_proc_intensity
B> definition.

Here are the other significant changes I've made in version 0.992:

-ization changed to -isation throughout, to make more consistent with the
core (though I'm not sure why this was done in the core; traditional British
setting uses -iz- for Greek derivations). Likewise analyze -> analyse, for
which there is etymological justification.

In _pd_block_id I changed
    "and possibly the data_ name. It may be a sample name."
to
    "and possibly the name of the current CIF data block (i.e. the string
    xxxx in a data_xxxx identifier). It may be a sample name."
in the hope that it would be clearer to a non-expert reader.

In the same definition, I think
    "As blocks are created in a CIF, the original <sample_id> should be
    retained, but the <creator_name>"
should say something like
    "As blocks are created in a CIF, the original sample identifier (i.e.
    <block_name>) should be retained, but the <creator_name>"

In _pd_calib_conversion_eqn, the reference to _pd_meas_distance_value should
be to _pd_meas_position.

In _pd_instr_[pd], the reference to _diffrn_radiation_polaris_ratio should
be to _diffrn_radiation_polarisn_ratio.

In _pd_proc_ls_prof_, the reference to _pd_proc_total should be to
_pd_proc_intensity_total (I think).

I added "in kelvins" to the definition of _pd_prep_temperature.

D56.2 Categories in pdCIF
-------------------------

B>>D> 3. I am a little unhappy about the category pd_data whose items seem to be
B>>D> scattered throughout a number of other categories. I can understand why
B>>D> Brian has done this, but it violates our unwritten convention that the
B>>D> first part of a data name should be the same as the category name (to
B>>D> assist, inter alia, in any conversion to DDL2).
B>
B> I am caught between two rules -- (1) data in different categories may not
B> be in the same loop; implicit rule (2) "the first part of a data name should
B> be the same as the category name." I anticipate that _pd_meas and _pd_proc
B> items will frequently (but not always be) be in the same loop (see my example
B> pdCIF file where I have a single loop_ with _pd_meas_counts_total
B> _pd_proc_ls_weight _pd_proc_intensity_calc_bkg and
B> _pd_calc_intensity_total). This is why I was not enthusiastic about use of
B> categories for database normalization about two years back and was even more
B> unhappy about the DDL2 changes, for those that remember.
B>
B> I am less than pleased about changing the names of many -- perhaps the
B> majority of the _pd_ items to _pd_data_. I propose we drop rule (1) and
B> allow items from different categories to be combined into a single loop.
B> This allows adoption of rational category names. Note that with rule (1)
B> in place it will never be possible to have a data item named _pd_ in a
B> loop with a core data item without breaking rule (2)!

I see things a little differently. Rule (2) is unwritten; rule (1), however,
is part of the published DDL1.4 standard. Infraction of rule (1) is counter
to the standard; infraction of rule (2) merits at most a warning (Herbert
Bernstein's new working of CYCLOPS will raise a warning if the initial part
of the data name differs from the category name; this is a useful check for
dictionary authors, but an inconvenience to most users).
Further, rule (1) is more important in normalising data tables than rule
(2), and hence of more importance in allowing a transition to DDL2 if this
is considered appropriate.

What is this all about? To recapitulate for the benefit of the more recent
members, a relational database can be considered as a series of tables. Each
row in a table must be unique, and must be identified by a unique key value
(this can be the value of one field, or of a number of fields taken
together). Some fields may have values which can be used to link to
information in other tables - these comprise "foreign keys", or pointers to
key values in the other tables. "Normalisation" of a database involves the
imposition of a data model that separates data items that are closely
related to each other into separate tables, linked by appropriate foreign
keys to tables bearing more distant relationships.

The structuring of information in the macromolecular CIF dictionary is
strongly bound to such a data model. This is enforced by the DDL version 2
that defines data relationships in the mmCIF dictionary. The DDL version 1.4
used in the core maps approximately to this way of representing data (it
leads fairly readily to "unnormalised" tables, where the organisation of
tables is less clean, in some sense). This permits the core definitions to
be embedded in the full relational model of mmCIF without undue difficulty;
but it also permits the construction of data files that are less tightly
bound to the relational database model. And sometimes that approach can have
benefits.

Brian T. is particularly anxious to be able to use that freedom in
organising the contents of pdCIF files (particularly the tabulated data
sets) in flexible ways. Consider, for instance, the definition for
_pd_proc_wavelength:

    "Wavelength in angstroms for the incident radiation ... This will be a
    single value for continuous wavelength methods or may vary for each data
    point and be looped with the intensity values for energy-dispersive
    measurements."

The advantage of citing an unvarying value once, rather than tens of
thousands of times (within the list of measured data points) is rather
clear; but in a full relational model the data item must be assigned to a
table where it always occurs.

There is just about enough freedom to do what is wanted under DDL1.4, where
every data item is assigned to a category that has the following properties
(from the DDL1.4 paper):

    "The _category attribute is used to specify the group, or basis set, to
    which a data item belongs. This attribute is specified as a character
    string which matches the portion of the data name, following the leading
    underscore."

[That sentence ought to read "which usually matches..."].

    "Data items in a list must be of the same _category. Data items of a
    given category may exist in different lists provided each list contains
    an appropriate 'reference' data item (see _list_reference). Items
    belonging to different categories should not appear in the same list."

As Brian T. observes, it's not possible to reconcile the conflicting desires
of loop membership and rigorous category naming. In particular, his data
names _pd_refln_peak_id, _pd_refln_phase_id and _pd_refln_wavelength_id
should all belong to the CORE category "refln", because they will be added
to the reflections list.

Hence the pd_data category gives the flexibility of including various
_pd_meas_, _pd_proc_ and _pd_calc_ data points together in the same looped
list, or of breaking them out into different lists if that's appropriate.
(Strictly speaking, the present model is defective, in that such a
separation should only be made if there are appropriate _list_reference
items that can be installed in each separate list, i.e.

    loop_
      _pd_data_id  _pd_meas_foo  _pd_calc_foo  _pd_proc_foo
      1            000           aaa           AAA
      2            111           bbb           BBB

should be separated out as something like

    loop_
      _pd_meas_id  _pd_meas_foo
      1            000
      2            111

    loop_
      _pd_calc_id  _pd_calc_foo
      1            aaa
      2            bbb

    loop_
      _pd_proc_id  _pd_proc_foo
      1            AAA
      2            BBB

But the missing list references, which I've indicated here as _pd_meas_id,
_pd_calc_id and _pd_proc_id, COULD be added in later if Brian were convinced
that this normalisation was mandatory.)

Anyway, accepting the current category assignments, I note the following
oddities:

(1) The _pd_refln_ items should go in category refln (I mentioned that
above).

(2) Explanatory items such as _pd_meas_[pd] refer to elements in several
categories (here, items belong to the categories pd_data, pd_meas_info and
pd_meas_method). Perhaps some explanation of this should be added to the
definitions.

(3) Should all the items beginning _pd_calib_ be assigned the category
pd_calib rather than the existing pd_instr? They seem to belong not only in
the formal category pd_calib, but to be related in the "logical" category
pd_calib (i.e. they are related specifically to the instrumental
calibration, which could fairly be regarded as a distinct topic from the
other pd_instr items).

(4) _pd_calib_detector_id can act as an identifier in a list of multiple
calibrations; but some of the other data items here
(_pd_calib_2theta_offset, _pd_calib_std_external_) may also be looped.
Should there actually be two or more categories here?

(5) For _pd_calib_conversion_eqn, should there be a _list both (i.e. "May
appear in list" - presumably "containing _pd_calib_detector_id")? Or will
this data item never be looped?

(6) The definition of _pd_proc_[pd] includes the sentence: "If the dataset
is reprocessed, this section may be replaced (with addition of a new
_pd_block_id entry)." This would imply that you will often have a _pd_meas_
dataset in one datablock, with several _pd_proc_ datasets in other blocks.
This would seem to strengthen the case for having multiple identifiers
within the pd_data category (e.g. my hypothetical _pd_meas_id, _pd_proc_id
above) to link particular data points across the various datablocks.

(7) _pd_proc_number_of_points appears to be the only occupant of category
_pd_proc_data. Is there a good reason for this, or should it be subsumed in
pd_proc_info?

(8) Should not the _pd_proc_2theta_range_ data names be in a different
category from pd_data, perhaps pd_proc_info (or even pd_proc_data)?

==============================================================================

D57.1 E.s.d. -> s.u.
--------------------

B>>D> 1. E.s.d. should be replaced by s.u. throughout.
B>>
B>> I hesitated for a while over this, since there are discussions of counting
B>> statistics in the dictionary, and therefore cases where 'standard
B>> deviations' may indeed be intended, rather than 'standard uncertainties'.
B>> Brian, can you let me know if there are cases where this distinction does
B>> need to be made? Otherwise I'll make the requested change throughout.
B>
B> To be honest, I am not sure which would be more appropriate for data. Is the
B> "estimated error" of SQRT(I) on an observation of I counts an ESD or a SU? I
B> would like to consult Ted Prince on this but I will be out of the office for
B> a week. Certainly derived quantities have a SU rather than a ESD. Anyone else
B> have an opinion?

I have gone through the dictionary searching for instances of the term, and
make the following suggestions:

_pd_meas_[pd]
    "Datasets that are measured as counts, where the estimated standard
    deviations (e.s.d.'s) are the square-root of the intensity, should be
    recorded in the _pd_meas_counts_* fields."
might be changed to
    "Datasets that are measured as counts, where a standard uncertainty
    might be estimated as the standard deviation, or square root of the
    number of counts recorded, should use the _pd_meas_counts_* fields."
Here there is no uncertainty in the value measured, which is a discrete
number; but it may reflect an uncertainty in the actual value that is
related to a typical Gaussian spread in the counting statistics.

_pd_meas_counts_
    "E.s.d.'s may not be specified for these values as they will be the
    square-root of the number of counts."
might read
    "Standard uncertainties should not be quoted for these values.
    Experimental uncertainty may be related to a standard deviation, or
    square root of the number of counts."

_pd_meas_intensity_
    "Use these entries for measurements where intensity values are not
    counts (use _pd_meas_counts_ for event counting measurements where the
    estimated standard deviation is the square-root of the number of
    counts)."
This might stand as is, or the last part might read
    "where the standard uncertainty is estimated as the square root of the
    number of counts)."

_pd_proc_intensity_
    "Inclusion of e.s.d.'s for these values is strongly recommended."
should change to
    "Inclusion of s.u.'s for these values is strongly recommended."

_pd_proc_ls_weight
    "Weight applied to each profile point. These values may be omitted if
    the weights are 1/\s^2^ where \s is the e.s.d. for the
    _pd_proc_intensity_net values."
should change (with the adoption of the u nomenclature in place of sigma) to
    "... These values may be omitted if the weights are 1/u^2^ where u is
    the s.u. for the _pd_proc_intensity_net values."

_pd_proc_ls_background_function
    "Include also the values used for the coefficients used in the
    background function with their e.s.d.'s."
should be
    "... with their s.u.'s."

_pd_proc_ls_pref_orient_corr
    "Include the value(s) used for the correction with e.s.d.'s."
should be
    "... with s.u.'s."

_pd_proc_ls_profile_function
    "Include the values used for the profile function coefficients and
    their e.s.d.'s."
should be
    "... function coefficients and their s.u.'s."

_pd_peak_intensity
_pd_peak_pk_height
    "Good practice is to include e.s.d.'s for these values."
should likewise be
    "Good practice is to include s.u.'s for these values."

I won't implement these changes until we've had some expert guidance.

D57.2 Errors in the _pd_ example file
-------------------------------------

Brian T.'s example file is a very useful accompaniment to the dictionary.
However, it contains a few discrepancies.

    _diffrn_radiation_symbol should be _diffrn_radiation_xray_symbol (l.191)
    _pd_instr_radiation_probe should be _diffrn_radiation_probe (l.192)
    _pd_refln_i100_meas is not in the dictionary (l.279)
    _pk_width_2theta (l.216) should presumably be _pd_peak_width_2theta

Also, the definition for _pd_block_id and related id's in the dictionary
states "Blank spaces may also not be used.", but there is a space in the
"B. Toby" <creator_name> fragment of the examples. (A similar space has also
crept in to the formatted dictionary; my typesetting software is trying to
be too clever in imposing the Acta house rules for initials!) Can I check
that Brian T. does wish to retain the "no spaces" rule?

D57.3 sample/specimen
---------------------

Brian has been very diligent in differentiating "sample" from "specimen",
e.g. in _pd_spec_[pd]:

    "This section contains information about the specimen used for
    measurement of the diffraction dataset. Note that information about the
    sample (the batch of material where the specimen was obtained), is
    specified in _pd_prep_."

However, in a plethora of definitions relating to instrument geometry, e.g.
_pd_instr_dist_mono/samp, the "samp" in the data name jars against the
correct "specimen" in the definition. Is it too late to change these
datanames to ...spec ?

An unrelated stylistic point in these definitions is the usage of
abbreviations, sometimes dist_src/samp, sometimes _src/samp. The convention
in the original CIF paper was to use constructs of the form *_src/samp, and
I'm happy to enforce this convention consistently, unless there are
objections.
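
The data-name discrepancies listed under D57.2 are of a kind that a short
script can flag. What follows is only a minimal sketch in Python, not an
existing tool: the set DEFINED_NAMES is a hypothetical three-name stand-in
for the full core and pd dictionaries, the fragment is made up (values are
all '?'), and the tokenising ignores quoted strings and text fields.

```python
# Hypothetical stand-in for the names defined in the core and pd
# dictionaries; a real check would load every defined name.
DEFINED_NAMES = {
    "_diffrn_radiation_xray_symbol",
    "_diffrn_radiation_probe",
    "_pd_peak_width_2theta",
}

# Made-up fragment using three of the names queried under D57.2.
EXAMPLE_FRAGMENT = """\
_diffrn_radiation_symbol     ?
_diffrn_radiation_probe      ?
_pd_instr_radiation_probe    ?
loop_
  _pk_width_2theta
  ?
"""


def undefined_names(cif_text, defined):
    """Return (line_number, name) pairs for data names not in `defined`."""
    found = []
    for lineno, line in enumerate(cif_text.splitlines(), start=1):
        # Simplification: any whitespace-delimited token that starts with
        # an underscore is treated as a data name.
        for token in line.split():
            if token.startswith("_") and token.lower() not in defined:
                found.append((lineno, token))
    return found


if __name__ == "__main__":
    for lineno, name in undefined_names(EXAMPLE_FRAGMENT, DEFINED_NAMES):
        print(f"l.{lineno}: {name} is not in the dictionary")
```

Names are compared in lower case, on the assumption that the dictionary
names are themselves stored in lower case.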
D57.4 _pd_calc_method appears in list
-------------------------------------

Should it? [i.e. should "_list yes" appear in _pd_calc_method?]

D57.5 Uniqueness criterion for _pd_refln_peak_id
------------------------------------------------

The list attributes for _pd_refln_peak_id include:

    _list                     yes
    loop_ _list_reference     '_refln_index_h'
                              '_refln_index_k'
                              '_refln_index_l'
    _list_link_parent         '_pd_peak_id'
    loop_ _list_uniqueness    '_pd_refln_peak_id'
                              '_pd_refln_phase_id'

_list_uniqueness should only appear in a definition in which _list_mandatory
is set to "yes", and it is intended to define the complete set of data
values which must occur uniquely within each loop packet (or table row, if
you will). It doesn't make sense to have "loop_ _list_uniqueness
'_pd_refln_peak_id' '_pd_refln_phase_id'" here, because there are many
entries in the reflections list that have the same values of
_pd_refln_peak_id and _pd_refln_phase_id when, for instance,
_pd_refln_peak_id is '.'. I suggest deleting the _list_uniqueness entry
here.

D57.6 '?' vs '.'
----------------

Syd's STAR papers use the convention that the values '?' and '.' have
special meaning, respectively "value not known" and "value not appropriate".
It seems to me that there are two instances, in _pd_proc_intensity_ ("Use a
value of '?' for data points where a fixed background has not been
defined.") and _pd_refln_[pd] ("Reflections may also be included that are
not observed; use '?' for the _pd_refln_peak_id.") where '.' is more
appropriate than '?' This ties in with the usage in the example file _refln_
list.

Regards
Brian
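
As a footnote to D57.5, the uniqueness argument can be shown with a handful
of made-up reflection packets. This is only a sketch in Python: the packet
values, the column order and the helper duplicate_keys are all assumptions
for illustration, not part of any CIF software.

```python
from collections import Counter

# Hypothetical reflection-list packets: (h, k, l, peak_id, phase_id).
# '.' marks a reflection with no associated peak, as discussed in D57.5.
PACKETS = [
    (1, 0, 0, "pk1", "A"),
    (1, 1, 0, "pk2", "A"),
    (1, 1, 1, ".", "A"),
    (2, 0, 0, ".", "A"),
]


def duplicate_keys(packets, key_columns):
    """Return the key values that occur in more than one packet."""
    counts = Counter(tuple(p[i] for i in key_columns) for p in packets)
    return [key for key, n in counts.items() if n > 1]


# Key built from the _list_reference items (h, k, l).
hkl_duplicates = duplicate_keys(PACKETS, (0, 1, 2))

# Key built from the questioned _list_uniqueness items (peak_id, phase_id).
id_duplicates = duplicate_keys(PACKETS, (3, 4))

if __name__ == "__main__":
    print("duplicates on (h, k, l):", hkl_duplicates)
    print("duplicates on (peak_id, phase_id):", id_duplicates)
```

With these packets the (h, k, l) key has no duplicates, while the
(peak_id, phase_id) key repeats for the two reflections whose peak id is
'.', which is the reason given above for deleting the _list_uniqueness
entry.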