(64) Response to pdCIF category critique

To: [email protected]
Subject: (64) Response to pdCIF category critique
From: bm
Date: Mon, 12 May 1997 13:39:03 +0100

Dear Colleagues

D61.1 pdCIF categories
----------------------
There follows Brian Toby's considered response to Paula's critique in
circular 61, and extracts from correspondence between Herbert Bernstein and
Syd Hall, who have independently been discussing this issue in connection
with the implementation of category checking routines in the CIFtbx library.

 From Brian:

I would like to thank Paula, John and Helen for their effort in
analyzing the structure of the pd dictionary. The objections that
Paula raises, though, are not easily resolved. In fact they are
nothing new.  Since the time when database normalization issues were
first raised (in Tarrytown?), I have kvetched about their impact on
the pd dictionary. In a way I am glad that she raised an objection as
it offers a chance to reopen the normalization discussion, which was
never resolved to my satisfaction.

While I did not put as much work into examining the loop and category
relationships as Paula et al. did, the relationships in pd_cif are not
haphazard. I will respond to Paula's questions, and perhaps my examples
will make these choices clearer.  I have to apologize that I am
working "on the road" this week, so I have not done a complete review
of looping rules and categories using her tables, and I imagine that
there is some "fine tuning" that I will want to do.

1) pd_phase
Thanks to Paula's analysis, I can now see that _pd_phase_block_id
should be list=both.  Other than this, the rest is correct.  When a
sample contains a single chemical phase, one can use _pd_phase_name to
specify a name or structure type for the material. For a single
component material, _pd_phase_id and _mass_% are not used. Use of
_pd_phase_block_id in this circumstance will probably be rare.

However, in the quite common case where a sample contains two or more
chemical phases, _pd_phase_name will be looped with names for each
phase. If any of the other _pd_phase_* items appear, they would be in
the same loop as _pd_phase_name.

2) pd_proc_info
Breaking up these items into two groups might be a good idea, but
_datetime follows the category of _author_. Why? Because if several
different people take turns at data analysis, there should be a
correspondence between dates and people. If the items will ever be in
a single loop, they must be in the same category. 

Does it make sense to have a loop only containing _datetime? Yes! If I
am the only person who looks at a set of data and I make several
passes at analysis, I will list my e-mail address etc once, but will
have several timestamps.  Further, _author_ and _datetime might show
up in different loops: if I work on the data on two occasions with a
grad student and a postdoc, one loop would list the three participants
and another the two dates.

3) pd_meas_method 
When I defined _pd_meas_rocking_axis as list=both, I felt it was
unlikely that anyone would ever rock a sample on more than one axis in
an experiment. As it turns out, I have just gotten a proposal from
someone who wants to use my instrument at NIST to do exactly that in
the next few weeks.

4) pd_instr
Paula is correct that it may make sense to break up these items into
more categories. There are also good reasons not to do this (see
below).

5) pd_calib
I have to give this area more thought after I get back from the
synchrotron and get some sleep.

6) pd_data 
It may make sense to change the category for _beam_size_ and
_rocking_angle, but I am sure that all the rest *must* be in the same
category under the existing rules. I would not want to change the
names to match the category either, as this would result in a
significant loss in clarity. Perhaps an example will make clear the
reasons for this:

In the most common pd experiment, someone measures a series of
_pd_meas_intensity_total values. S/he applies corrections to obtain a
_pd_proc_intensity_net value for each observation.  S/he determines a
crystallographic model, and from it the expected intensity is computed
for each data point. The CIF will contain a

	loop_ 
	   _pd_calc_intensity_net 
	   _pd_meas_intensity_total
	   _pd_proc_intensity_total

BTW, the difference in names here is important: _meas_ values are the
experimental settings or observables, _proc_ values are determined from
processing, and _calc_ values are derived from models.

That was the simple case. There is no requirement that one have a
processed point for every observed data point. Case in point: here at
NIST we have a 32 detector instrument. No one models each detector
separately. Instead, we normalize the data as if they had been collected
with a single detector instrument, which reduces the number of data
points by ~45%. So in this case there is no longer a 1:1 relation
between the original data (_pd_meas_intensity_total) and the reduced
data (_pd_proc_intensity_total), and they appear in different loops.

	loop_
	   _pd_meas_intensity_total
	loop_
	   _pd_proc_intensity_total
	   _pd_calc_intensity_net
	
Further, for the above examples, data are collected at constant
wavelength, so there is a single _pd_proc_wavelength value. However,
for TOF and energy-dispersive instruments, the wavelength varies for
each data point, so _pd_proc_wavelength appears inside a loop!

Perhaps the way that mmCIF would deal with these possible (but not
mandatory) relationships between data items would be to require each
set of items to appear in a different loop and to add a set of
pointers between the loops.  To do this would make the pdCIF
dictionary exceedingly complex and would probably delay the acceptance
of pdCIF substantially.

In summary: what are the conflicts?
-----------------------------------
A) Names don't follow categories. They can't. For example, names must
begin with _pd_, yet the _pd_refln_* items are in the refln category.

B) Different types of information (measurements vs processed values vs
derived values) appear in the same category. This is not my
preference, but the alternative is to prohibit very sensible groupings
of data -- as long as we require that loops contain only one category.

C) From a database normalization standpoint, data items are given different
uses depending on context.  For example, _info_author_ was used in
three different ways in (2) above. One could differentiate between
members of a research group who collaborate versus sequential
processing of data by different individuals by defining different data
items for the purpose. In my opinion, in the pd dictionary we have
already too many cases where we have different data items for the same
physical variable. For example, _pd_meas_2theta_fixed and
_pd_meas_2theta_scan and _pd_meas_2theta_range_* all record the same
experimental setting. The different names distinguish how this setting
varies through the course of the experiment.  This distinction may
make sense to a computer jock, but will confound most scientists.

D) Categories could be further divided to separate items that may be
looped from related items that will never be looped. This creates a
bit of work for Brian & Brian, but I don't object. On the other hand
this will fracture further the divisions between names and categories,
so I am not certain this work has value.  Why should we separate the
category assignment for items that are not looped when items that are
list=both will have to be in the same category as items that are
always looped?  I would like to get some input from COMCIFS as to why
this should or should not be done.

Conflicts (A), (B) and in part (D) arise from the requirement that
items in different categories cannot appear in a single loop.  At the
risk of giving voice to my annoyance, I have to repeat that I have
always objected to this rule. Conflict (C) can be resolved at a later
time by defining new entries, if this is actually a problem.
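
To make the single-category rule concrete, here is a minimal Python
sketch. It is an illustration only. The two category tables are
assumptions: the first places all three intensity items in pd_data, as
described under (6) above; the second is a hypothetical assignment in
which each category follows the data name. The function names are
invented for this example.

```python
# Category tables for illustration only; they are not extracts from a dictionary.
EXISTING = {
    "_pd_meas_intensity_total": "pd_data",
    "_pd_proc_intensity_total": "pd_data",
    "_pd_calc_intensity_net": "pd_data",
}
# Hypothetical assignment in which the category matches the data name.
NAME_MATCHED = {
    "_pd_meas_intensity_total": "pd_meas",
    "_pd_proc_intensity_total": "pd_proc",
    "_pd_calc_intensity_net": "pd_calc",
}


def loop_categories(loop_items, category_of):
    """Return the set of categories used by the data names in one loop."""
    return {category_of[name] for name in loop_items}


def obeys_single_category_rule(loop_items, category_of):
    """True if every data name in the loop belongs to the same category."""
    return len(loop_categories(loop_items, category_of)) == 1


# The loop from the simple case in (6).
simple_loop = [
    "_pd_calc_intensity_net",
    "_pd_meas_intensity_total",
    "_pd_proc_intensity_total",
]

print(obeys_single_category_rule(simple_loop, EXISTING))      # True
print(obeys_single_category_rule(simple_loop, NAME_MATCHED))  # False
```

Under the first table the loop is legal but the names do not follow the
category, which is conflicts (A) and (B). Under the second table the
names follow the categories but the loop is forbidden.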

Conclusions and possible resolutions: What are the choices?
-----------------------------------------------------------

I) We could have a wholesale renaming of the data names in the pd
dictionary to match categories. I would suggest to whoever takes on
this task that they should drop the _pd_ prefixes while they are at it
(unless _pd_ is needed for a particular category to differentiate it
from a core/mm/... category). Any volunteers?

or 
II) Paula could withdraw her objections. We would then make some minor
changes to list usage and perhaps one or two category reassignments
and punt.

or
III) We change the requirement that items in different categories
cannot appear in a single loop. This was always a foolish choice. A
better rule is that all items in a category, if looped, must appear in
a single loop. Two loops (with different categories) may be combined
when desired. 
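
The rule proposed in (III) could be checked over all the loops of a
data block roughly as follows. This is a sketch under stated
assumptions, not part of any existing software: the category table is
hypothetical (categories named after the _meas_/_proc_/_calc_ part of
each name) and the function name is invented for the example.

```python
# Hypothetical category assignment, for illustration only.
CATEGORY_OF = {
    "_pd_meas_intensity_total": "pd_meas",
    "_pd_meas_2theta_scan": "pd_meas",
    "_pd_proc_intensity_total": "pd_proc",
    "_pd_calc_intensity_net": "pd_calc",
}


def obeys_proposed_rule(loops, category_of):
    """True if no category has its looped items spread over several loops.

    A single loop may mix categories; that is the point of the proposal.
    """
    first_loop = {}  # category -> index of the loop where it first appeared
    for index, loop_items in enumerate(loops):
        for name in loop_items:
            category = category_of[name]
            if first_loop.setdefault(category, index) != index:
                return False
    return True


# Three categories combined in one loop: allowed under the proposal.
combined = [[
    "_pd_meas_intensity_total",
    "_pd_proc_intensity_total",
    "_pd_calc_intensity_net",
]]
# Measured items in one loop, processed and calculated in another: allowed.
split_ok = [
    ["_pd_meas_2theta_scan", "_pd_meas_intensity_total"],
    ["_pd_proc_intensity_total", "_pd_calc_intensity_net"],
]
# pd_meas items scattered over two loops: not allowed.
split_bad = [
    ["_pd_meas_intensity_total"],
    ["_pd_meas_2theta_scan", "_pd_proc_intensity_total"],
]

print(obeys_proposed_rule(combined, CATEGORY_OF))   # True
print(obeys_proposed_rule(split_ok, CATEGORY_OF))   # True
print(obeys_proposed_rule(split_bad, CATEGORY_OF))  # False
```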

This will not create database normalization problems, as the implied
pointers can be generated on the fly by any software that will parse
CIF into a relational db. Just as now, when a single value is specified
for a data item that can optionally appear looped with other values,
the relational model must expand it from a scalar to a vector of
identical elements. If really necessary to satisfy the database folks,
a new DDL entry could be defined that would identify the categories
that might end up in a single loop, along the lines of a friend function
in C++.
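
The scalar-to-vector expansion mentioned above might look like the
following when CIF data are loaded into a table. This is a minimal
sketch: the function name is invented and the numerical values are made
up for illustration.

```python
def expand_scalars(columns, row_count):
    """Return a table in which every column has row_count entries.

    A looped item (a list) must already have row_count values; an
    unlooped item (a single value) is repeated row_count times.
    """
    table = {}
    for name, value in columns.items():
        if isinstance(value, list):
            if len(value) != row_count:
                raise ValueError(
                    f"{name} has {len(value)} values, expected {row_count}"
                )
            table[name] = list(value)
        else:
            table[name] = [value] * row_count
    return table


# Constant-wavelength case: one wavelength, several intensities.
# The numbers are invented for the example.
block = {
    "_pd_proc_wavelength": 1.54,
    "_pd_proc_intensity_total": [120.0, 340.0, 95.0],
}
table = expand_scalars(block, row_count=3)
print(table["_pd_proc_wavelength"])  # [1.54, 1.54, 1.54]
```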

My proposed revision would allow categories to be assigned in a
rational fashion in the pd dictionary and thus most of Paula's points
would be satisfied. (III) is the most sensible choice.

I look forward to comments from Paula and the rest of COMCIFS.

----------------------------------------------------------------------------
In the extracts from the CIFtbx correspondence which I have reproduced below,
'H>' is Herbert Bernstein, 'BM>' is Brian McMahon and 'S>' is Syd Hall:

H> The dictionary change does produce more warnings ... I've been thinking
H> about the core messages.  The major problem seems to be a mismatch for a 
H> few old tokens between the category chosen and the initial characters
H> of the token.

BM> The powder dictionary has much less stringent rules about
BM> keeping a relationship between data names and the names of their parent
BM> category. It may be that this will change during the course of [COMCIFS]
BM> discussions, but it has been pointed out that under the rules of STAR and
BM> DDL1.4, it is not *necessary* to have such a correspondence between the
BM> leading characters of a data name and the category name. I would therefore
BM> suggest that you modify the DEFAULT behaviour of the checks against the
BM> dictionary to NOT report such mismatches for DDL1.4 dictionaries. There
BM> should ideally be a switch which says "perform this check in any case",
BM> because in practice many dictionaries will conform to the convention of
BM> matching the names, and in such circumstances it is advantageous to run the
BM> check.

H>   It seems sad that the powder dictionary won't share the elegance of the 
H> core and mm dictionaries.  It is nice when things have a regular 
H> structure.  Tends to avoid mistakes.  I'll add a switch to control the 
H> name against category checking in the next pass of ciftbx, but I'd like 
H> to bounce around the question of how to handle the default behavior.  
H> 
H> Clearly for the core and mm people the default of checking will help to 
H> catch garbles in dictionaries, so that is the right default for them, 
H> while for the powder people, it sounds like the right default is the 
H> other way.  Rather than visit the sins or virtues of one of the 
H> communities on the other, it sounds like it is time to introduce a choice 
H> of sets of defaults in a canned file (say a .tbh for tool-box header 
H> file) which would be a file the user could prepare with desired defaults
H> when working in a particular context.

(I have included the above part of the discussion to illustrate how
different CIF applications, based on different dictionaries, may well
require different behaviour from a common software package, say in the
strictness of validation or type checking. It may become part of the future
role of COMCIFS to provide more directives, or at least direction, to 
software developers on a per-application basis. The .tbh file approach
sounds to me a good one.)
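
The name-against-category check and its switch, as discussed above,
can be sketched in a few lines of Python. This is not the CIFtbx
interface; the function names and the check_names switch are
hypothetical, and the two dictionary entries are chosen only to
illustrate a matching and a non-matching name.

```python
def name_matches_category(data_name, category):
    """True if the data name begins with an underscore, the category name
    and another underscore, e.g. _cell_length_a in category cell."""
    return data_name.startswith("_" + category + "_")


def name_category_warnings(dictionary, check_names):
    """Return one warning per data name that does not begin with its category.

    check_names is the hypothetical switch: when False the check is
    skipped and no warnings are produced.
    """
    if not check_names:
        return []
    return [
        f"{name}: name does not begin with _{category}_"
        for name, category in sorted(dictionary.items())
        if not name_matches_category(name, category)
    ]


# Illustrative entries: data name -> category.
entries = {
    "_cell_length_a": "cell",
    "_pd_refln_peak_id": "refln",
}

print(name_category_warnings(entries, check_names=True))
print(name_category_warnings(entries, check_names=False))  # []
```

A per-application defaults file of the kind Herbert suggests would then
only need to supply the value of the switch.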

S> Herbert: This business of categories is a complicated business ...
S> linking the name and the category does appear to provide real
S> advantages and that philosophy has been adhered to in DDL1 and
S> DDL2 dictionaries in the past...and that's why I was nervous about
S> the deviations from this in the powder dictionary. However, would we
S> regret in the longer term a formal association between tag and category?

H> I know of no basis ... why it is desirable to have 
H> disorganized naming conventions.  Certainly it is undesirable to impose 
H> unnecessary semantics on any language, and I am all in favor of giving 
H> everyone all the freedom they need to express the nouns of each domain in 
H> terms that its practitioners understand, but doubt there are any powder 
H> practitioners other than BT who care about the names of tables internal to 
H> a relational database, which is the only information conveyed by the 
H> names chosen for categories, which could prefix rather arbitrary sets of
H> tokens.  I know there are people who are firmly convinced that one cannot 
H> make a solid assignment of a given token to a particular table, that some 
H> tokens must be assigned to multiple tables, and other tokens belong in 
H> splendid isolation, but, once you try to build a real database, you have 
H> to make the hard decision of where things go (as the old joke goes, 
H> "everybody got to be someplace"), and putting the same data item in two 
H> places, instead of making one instance the parent and other instances 
H> child pointers, gives you major headaches in then updating the database.

(I throw in here the reminder that the original CIF/DDL model was not
designed to equate to a relational database model; and the option of
introducing the STAR save_ and looped list features still provides
for hierarchical and other data models. While DDL2 is unashamedly
relational, DDL1 allows for more flexibility, though that has (arguably)
been compromised by the category assignment which effectively normalises the
permitted loop structures - see Brian T.'s discussion above.)

H>   Yes, you can achieve the same result by using the category as a token 
H> attribute, and that is all that the DDL requires, but that is an 
H> invitation to mistakes, and, as a matter of good style, should be kept to 
H> a minimum.  Think of some poor soul pulling together a loop.  He has to 
H> restrict the loop to tokens from a single category, and in normal cases, 
H> he will want to put all the tokens from that category into the loop.  
H> Life is just so much simpler if the tokens for a given category begin 
H> with the same characters. (And even simpler than that if you follow the 
H> DDL2 convention of using a period to unambiguously flag the category, but 
H> that is less important)
H>   Is the powder dictionary now at that necessary minimum?  That I don't 
H> know.  I would suggest that COMCIFS ask that specific question of the 
H> community (not just the powder folks, but the entire IUCr) and then make 
H> a decision on what to do informed by whatever comments they get.  The 
H> approach of asking the world at large for their thoughts on the difficult 
H> questions seems to have worked well for the mmCIF dictionary.  Why not do 
H> the same for the powder dictionary?
H>   ... if COMCIFS does not start enforcing some sort of 
H> stylistic consistency, we will weaken the impact of CIF and cause the 
H> entire concept to be blamed for problems which actually arise from a lack 
H> of editorial discipline.

H> ... there is much more to a good language than syntax. Semantics and
H> style are also very important. There are often good reasons to stretch 
H> the semantics of a language and to adopt unusual styles.  Thus do 
H> languages grow.  But each such use stands out in bold relief from the sea 
H> of uses which are consistent with accepted semantics and popular style, 
H> and, as any experienced editor knows, such things should not be overdone, 
H> or inelegance and confusion may result, with less information 
H> communicated.  BT may be right in this case.  He may also be wrong.  The 
H> truth, most likely, is not at the extremes, but somewhere in the realm of 
H> compromise.  I wish COMCIFS luck in finding the right compromise.

I have left in some of Herbert's opinions and philosophies, as well as his
technical points, because it's always good to have yet another point of
view. I have also added him to the mailing list for the present round of
discussions.

I would not wish to second-guess the outcome of this line of debate,
but I believe that whatever the outcome, COMCIFS has a major responsibility
to publish in as clear a way as possible the assumptions and constraints,
the similarities and differences, that apply to applications software
development under the DDL1 and DDL2 models.

Regards
Brian