(56) pd data categories; new core data names; latest mmCIF draft; su's

To: COMCIFS@iucr.ac.uk
Subject: (56) pd data categories; new core data names; latest mmCIF draft; su's
From: bm
Date: Mon, 10 Feb 1997 14:30:13 GMT
Dear Colleagues

I am pleased to introduce Dr Otto Ritter, who joins us as a Consultant on
behalf of the Protein Data Bank, in a role sponsored by Joel Sussmann and
endorsed by Ted Baker, to ensure active and fruitful involvement by the PDB
in the acceptance of the mmCIF dictionary.

I welcome Otto to our number, and take the opportunity to remind old and more
recent members of the Committee of our mode of operation. All our business
is conducted by email, and active contributions are encouraged from both
full members and Consultants. Approval of dictionaries is in practice by
consensus, though in the event of dispute a formal vote may be taken among
the full members only. Members are required to give their endorsement to a
dictionary that has completed its review process.

We are currently reviewing the powder extension dictionary and the mmCIF
dictionary, both of which have been made available in draft form to the
crystallographic committee for some time. While the powder dictionary is
being discussed in-house in the manner we are accustomed to, the mmCIF
project, which is much larger in scale, has involved the establishment of
a public mailing list, through which many participants have engaged in
detailed discussions over the last couple of years. I refer COMCIFS members
to the archive of these discussions maintained at
      http://ndbserver.rutgers.edu/NDB/mmcif/resources/mail/index.html
for access to the many contributions that have been made from within the
crystallographic community. This will provide additional background to many
of the items in the current dictionary proposal.

It is hoped that both current dictionaries will be approved within the next
few weeks.

In addition to the proposals on the table, there are proposals for the
further extension of the core dictionary, and working groups involved in 
modulated structures, electron density and symmetry dictionaries, and I hope
that we shall soon be able to turn our attention to these.

Information on the current state of the CIF project is maintained at the
IUCr web site http://www.iucr.ac.uk/cif/home.html. I have recently placed
the latest mmCIF draft on the appropriate subsidiary page as detailed below.

The records of our past discussions, often extensive and detailed technical
debates, are available to COMCIFS members by making an ftp connection to
agate.iucr.ac.uk, connecting as user comcifs with password wheatear, and
retrieving the files cc.1 through cc.55.

==============================================================================

D56.1 Review of pdCIF - minor points
------------------------------------

David Brown writes:

D>         I have had a chance to review the pdCIF dictionary over the 
D> weekend and was surprised how straightforward the process was.  A lot of 
D> credit goes to Brian Toby for producing a document that is clear, 
D> unambiguous and about as readable as a cif dictionary can get.  My 
D> comments are largely minor and may have already been dealt with by you.
D> 
D> 1.  E.s.d. should be replaced by s.u. throughout.

I hesitated for a while over this, since there are discussions of counting
statistics in the dictionary, and therefore cases where 'standard
deviations' may indeed be intended, rather than 'standard uncertainties'. 
Brian, can you let me know if there are cases where this distinction does
need to be made? Otherwise I'll make the requested change throughout.

D> 2. _*_author_name should refer to the conventions in the core 
D> (_publ_author_name).  Similarly for _*_fax and _*_phone (this occurs in 
D> several places)

(Point 3., on data categories, has been broken out to the next item below.)

D> 4. _pd_calib_conversion_eqn
D>         An example would be helpful here.  Does this data item have a 
D> syntax that would allow a program to check whether a conversion had been 
D> correctly made?  It would seem to be an item looking for a regular 
D> expression, but perhaps we are not ready yet for this step.
D> 
D> 5. _pd_instr_cons_illum_flag
D>         In the definition 'know' should be 'known'
D> 
D> 6. _pd_instr_cons_illum_len
D>         Last paragraph of definition should refer to _*_var_* not _*_cons_*
D> 
D> 7. _pd_proc_info_excluded_regions
D>         Should this item be computer readable or is this not necessary 
D> since the excluded region will be marked in the diffractogram by zero 
D> weights?
D> 
D> 8. _pd_proc_intensity
D>         The last paragraph referring to normalisation factors should 
D> include a reference to preferred orientation which I assume would be 
D> included.  It is mentioned elsewhere in the dictionary, but this seems to 
D> be the only place where the correction is recorded.
D> 
D> 9. Printing conventions are needed for Angstrom (\%Angstr\"om) throughout.

This is taken into account in the formatted dictionary.

D> 10. _pd_proc_ls_prof
D>         In the expressions, 'sum' should be replaced by \S.  Is there a 
D> printing convention for SQRT?

The same applies as for (9.): I've manually introduced the math typesetting into the
formatted version which is derived from the dictionary source file. It is
possible to include this in the source file (as TeX commands), but that
decreases the readability of the source ASCII version. Whether this is a
constraint we should allow to bother us may be worth discussing, but at this
juncture the formatted version should be proofread independently of the
source file.

D>         To facilitate the approval, I give my vote in favour, subject to 
D> such other matters as may come up in discussion.

And this from Howard Flack:

H>   I would refer you to my remarks on the CORE dictionary [see D54.2 below]
H> concerning the use of the term standard uncertainty symbolized by u. These
H> remarks are entirely pertinent to both the pdCIF and mmCIF.
H> 
H> For pdCIF I suggest changing:
H>      esd      to       su
H>      e.s.d.   to       s.u.
H>      estimated standard deviation   to standard uncertainty
H>      \s       to       u

(See point 1. above.)
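
As a rough illustration of how Howard's substitution list might be applied to
definition text, here is a minimal Python sketch. The function name esd_to_su
and the data name _pd_proc_esd in the usage note are hypothetical, invented for
this example only. The sketch deliberately leaves data names alone and is
case-sensitive, and its output would still need checking by hand for the
counting-statistics cases where 'standard deviation' is really meant.

```python
import re

# Substitutions from Howard's list for pdCIF. Longer phrases come first so
# that they are not partially rewritten by the shorter patterns.
SUBSTITUTIONS = [
    (re.compile(r"\bestimated standard deviations\b"), "standard uncertainties"),
    (re.compile(r"\bestimated standard deviation\b"), "standard uncertainty"),
    (re.compile(r"\be\.s\.d\."), "s.u."),
    # "_" counts as a word character, so "esd" inside a data name such as
    # _xxx_esd is not matched and the data name is left unchanged.
    (re.compile(r"\besd\b"), "su"),
    # \s is the CIF code for sigma; only rewrite it when used as \s(...).
    (re.compile(r"\\s(?=\()"), "u"),
]


def esd_to_su(text: str) -> str:
    """Apply the e.s.d. -> s.u. substitutions to a piece of definition text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text
```

For example, esd_to_su(r"The e.s.d. of \s(I); see _pd_proc_esd.") rewrites the
prose and the symbol but keeps the (hypothetical) data name as it stands.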

D56.2 Categories in pdCIF
-------------------------

D> 3. I am a little unhappy about the category pd_data whose items seem to be
D> scattered throughout a number of other categories.  I can understand why
D> Brian has done this, but it violates our unwritten convention that the
D> first part of a data name should be the same as the category name (to
D> assist, inter alia, in any conversion to DDL2).  Two solutions to this
D> difficulty suggest themselves.  One is to insert the word _data after _pd
D> in the names of those data items that belong to this category, e.g. 
D> _pd_data_meas_counts_total instead of _pd_meas_counts_total.  This is a
D> little awkward, but it does make the category clear.  The other would be
D> to split the category pd_data into pd_meas and pd_proc since I do not
D> imagine that items in these subcategories would normally appear in a loop
D> together.  This is still not perfect because in both categories there
D> would be data names that do not include the category name.  On the whole I
D> would push for the first solution.  The question then also arises as to
D> whether the pd_data items should be listed together (as is our convention)
D> rather than split among other categories.  I am not sure which is the
D> least confusing. 

I'll leave Brian to give a full account of why he has worked in this way,
but I recall that there are reasons for preferring the looser definition of
category in DDL1.4 to allow more flexible presentation of data sets. We're
currently completing an exercise in database building for our in-house
journals production system, from which it's clear that there are often
alternative views of a data structure that it might be appropriate to
support. The mmCIF enterprise is imposing a very specific view (for what I
imagine are very good reasons in terms of managing the complexity of the
information they wish to store). It may be that powder data sets benefit
from being accessible through a less rigid model. At this point I am willing
to be persuaded either way.
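
The unwritten convention David refers to (the first part of a data name matches
its category name) is simple enough to check mechanically. The following Python
sketch is only an illustration of that check; the function name
name_matches_category is an assumption of mine and is not part of any existing
CIF software.

```python
def name_matches_category(data_name: str, category: str) -> bool:
    """Return True if data_name begins with an underscore plus the category name."""
    prefix = "_" + category
    # Accept an exact match, or the prefix followed by a further "_component".
    return data_name == prefix or data_name.startswith(prefix + "_")


# The case David raises: the name lacks "_data", so it fails for pd_data.
print(name_matches_category("_pd_meas_counts_total", "pd_data"))
# His first solution inserts "_data" after "_pd".
print(name_matches_category("_pd_data_meas_counts_total", "pd_data"))
# His second solution moves the item into a pd_meas category instead.
print(name_matches_category("_pd_meas_counts_total", "pd_meas"))
```

A check of this kind says nothing about which solution is preferable; it only
shows that either renaming or re-categorising would satisfy the convention.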

I raise the related, but less crucial, question of how the definitions
should be ordered in the dictionary: purely alphabetically; alphabetically
within each category ordered alphabetically (as in the Core); alphabetically
within each logical grouping; or left as is? I'd prefer at least some degree
of alphabetic ordering, as an aid to browsing either the source or the
formatted dictionary - it's not obvious that one should look at the bottom
of the file for _pd_char_special_details!
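
To make two of these ordering options concrete, here is a small Python sketch.
The data names are taken from the discussion above, but the category assigned
to each one is an assumption made purely for illustration and should not be
read as the dictionary's actual assignment.

```python
# (category, data name) pairs; category assignments are assumed for illustration.
definitions = [
    ("pd_data", "_pd_proc_intensity"),
    ("pd_char", "_pd_char_special_details"),
    ("pd_instr", "_pd_instr_cons_illum_flag"),
    ("pd_data", "_pd_meas_counts_total"),
    ("pd_calib", "_pd_calib_conversion_eqn"),
]


def order_alphabetically(defs):
    """Purely alphabetical ordering by data name."""
    return [name for _, name in sorted(defs, key=lambda d: d[1])]


def order_within_category(defs):
    """Categories in alphabetical order, names alphabetical within each."""
    return [name for _, name in sorted(defs, key=lambda d: (d[0], d[1]))]
```

With these assumed categories the two orderings differ only in where the
pd_data items fall: purely alphabetical ordering places them by their own
names, whereas category ordering groups them together ahead of pd_instr.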


D56.3 mmCIF
-----------
The modified version of the mmCIF dictionary that I promised last time has
been posted as version 0.9.01 at
     ftp://ftp.iucr.ac.uk/pub/cifdic.m96
Paula has promised that this will remain stable during our review.
I have also added a VERY PRELIMINARY draft formatted version in ps/pdf
formats accessible through the web page at
http://www.iucr.ac.uk/cif/mm/index.html. Much needs to be done before this
can be considered a beautiful document, and I don't propose to undertake any
further work on it until the final substance of the dictionary is approved;
but it's there for members who find it a more convenient format for reading
(158 pages instead of 584!). I remind you, however, that the ASCII source
should be retrieved also and consulted as the definitive version.

Howard has made the following statement regarding the adoption of s.u. here
also:

H> For mmCIF I suggest changing:
H>      esd      to       su
H>      e.s.d.   to       s.u.
H>      sigma and sig  need altering on a case by case basis. In many cases
H> sigma needs replacing by su (but not everywhere), sigma(I) becomes u(I),
H> but some sigma's in the dict do not refer to su's and do not need
H> altering.
H>      \s       to       u
H>  The 'weight' is sometimes used where 'mass' is intended.

However, I've discussed this with Paula who is unwilling to change the data
names at this point, since they have been in the public draft for so long.
She will agree to introducing data names that alias the "esd" ones when the
mmCIF dictionary is next aligned with the Core (i.e. at release 2.1 of the
Core). I shall look into the introduction of the term s.u. into the
definition text, in a way similar to that employed for Core 2.0.

==============================================================================

D54.2 Replacement data names in the core CIF dictionary
-------------------------------------------------------
There have been several messages from Howard, which I post here for
the record. Attention to some of these points may be deferred until the
pdCIF and mmCIF dictionaries are approved.

H>   At long last I have had time to study messages 51, 52, 53 and 54. I've
H> also (almost) finished adapting my diffractometer software which
H> converts manufacturer specific files into CIF raw-intensity data files.
H> For the latter, I studied all of the new Notes for Authors for Acta C
H> and some parts of cif_core_2.0.1.dic with a fine tooth comb.
H> 
H> In cif_core_2.0.1.dic, I suggest:
H>  (1) Put the superseded items at the END of the file.
H>  (2) Other than in the data names change all occurrences of greek
H> symbols written in clear (e.g. psi, omega) into their back slash
H> equivalent form.

These points have both been touched upon in a different context above. It is
useful to keep the definitions in alphabetic order (at least within
categories) to assist browsing either through the ASCII source file or
the printed dictionary. Likewise, the use of typographic symbols in the
definition text makes the typesetting easier, but browsing the source file
more puzzling to the uninitiated. However, I'm increasingly being urged to
adopt the viewpoint that users should be discouraged from ever browsing the
source file in its raw state. This is a dialogue we can usefully undertake
later.

H> D54.2 Replacement data names.
H>   Standard Uncertainty: The changes suggested by Syd are OK with me and
H> clearly the descriptive text has been modified to take account of the
H> change of name. Nevertheless the modifications are not complete. esd.
H> e.s.d., sigma, sig, \s turn up in many places in the dictionary files
H> and need modification. The official symbol for the standard uncertainty
H> of Y is u(Y). So as a MIME attachment I enclose a file containing my
H> detailed suggestions for the changes to purify the two ddl's and the
H> core dictionary for the use of standard uncertainty.
H> 
H>   Concerning the use of the word observed as applied to the core
H> dictionary, a search on 'obs' shows that it is used in two different
H> senses. I approve of the Journals Commission pressure to rid the
H> dictionary of the term as used to describe measurements that fall
H> outside some specified threshold. However I find the suggested names and
H> definitions unsatisfactory:
H> 
H> _reflns_number_gt          Number of reflections > sigma threshold (new)
H> _reflns_number_observed    Number of "observed" reflections 
H> 
H> _reflns_threshold_sigma_expression
H>                            Sigma expression for F, F2 or I threshold
H> (new)
H> _reflns_observed_criterion Sigma expression for "observed" F, F2 or I 
H> 
H>   We can not use "sigma" because that is not the symbol used for
H> standard uncertainty - u is used. Secondly the semantics are not clear. 
H> Why does it have to be a "sigma"_expression? 'I > 5' would seem to me to be
H> a perfectly good threshold criterion for some sorts of experiment. Then
H> with 'I > 5' as the sigma_expression, why is there another > in the
H> definition of _reflns_number_gt. The number counts the reflections
H> obeying the criterion not those that have something > I > 5. 
H>   If you would like to say that the "sigma"_expression for the above
H> case takes a value of '5', in the _number_gt you have no way of knowing
H> to what quantity the threshold applies. I think something along the
H> lines:
H> 
H> data_reflns_over-threshold_criterion
H>     _name                      '_reflns_over-threshold_criterion'
H>     _category                    reflns
H>     _type                        char
H>     _example                    'I > 2u(I)'
H>     _definition
H> ;             Reflection measurements obeying the criterion are said to be
H>               over-threshold. The criterion is often expressed
H>               in terms of I and u(I), |F|^2^ and u(|F|^2^) or |F| and u(|F|).
H> ;
H> 
H> data_reflns_number_
H>     loop_ _name                '_reflns_number_total'
H>                                '_reflns_number_over-threshold'
H>     _category                    reflns
H>     _type                        numb
H>     _enumeration_range           0:
H>     _definition
H> ;              The total number of reflections, and the number of
H>                over-threshold reflections, in the _refln_ list (not the
H>                _diffrn_refln_ list).
H>                The over-threshold reflections satisfy the
H>                _reflns_over-threshold_criterion.
H>                They may include Friedel equivalent reflections according
H>                to the nature of the structure and the procedures used. The
H>                item _reflns_special_details describes the reflection data.
H> ;
H> 
H> To say the least I think it was inadvisable to include the 'new' but
H> unapproved data names in the Acta Cryst C Notes for Authors. I can not
H> believe that the pressure from the Journals and Nomenclature Commissions
H> was so strong as to justify this half measure.
H> 
H> One more point I noticed. Some of the data names use the word "weight"
H> where "mass" is meant. The associated descriptive text has been
H> corrected and uses the word "mass". It's a great pity. Who knows what the
H> correct SI name for "density" is? The school children in Switzerland and
H> the post docs from France we have had in the lab all talk about "masse
H> volumetrique". In any case the scientific term "densite" in French is a
H> dimensionless quantity. I suppose it corresponds to the old English
H> "specific gravity".

Complete formal definitions have not yet been drafted for the proposed new data
names, so I am happy to adopt these as candidate definitions for inclusion
in version 2.1. Comments welcome. Howard's MIME attachment can be made
available to anyone who is interested. I note the requirement to change the
terms referring to "esd" in the DDL to "su", and ask Syd to consider these
as an inclusion in the forthcoming revision to DDL version 1.
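
To illustrate how the two candidate items relate, the Python sketch below
counts reflections for the example criterion 'I > 2u(I)'. It is only a sketch
under stated assumptions: the criterion item is free text in Howard's proposal,
so the form I > k*u(I) with k = 2 is hard-coded here rather than parsed, and
the function name and the sample values are invented for the example.

```python
from typing import Iterable, Tuple


def count_reflections(
    reflections: Iterable[Tuple[float, float]], k: float = 2.0
) -> Tuple[int, int]:
    """Return (total, over_threshold) for (I, u(I)) pairs using I > k*u(I)."""
    total = 0
    over_threshold = 0
    for intensity, su in reflections:
        total += 1
        # A reflection is over-threshold only if it satisfies the criterion.
        if intensity > k * su:
            over_threshold += 1
    return total, over_threshold


# Invented (I, u(I)) values, for illustration only.
sample = [(10.0, 2.0), (3.0, 2.0), (4.0, 2.0), (-1.0, 0.5)]
total, over = count_reflections(sample)
print(total, over)
```

The first number would be reported as _reflns_number_total and the second as
_reflns_number_over-threshold, so the count is tied directly to the stated
criterion, which is the point of Howard's objection to the earlier names.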

That's all I have time for just now. More soon.

Regards
Brian