(64) Response to pdCIF category critique
- To: COMCIFS@iucr.ac.uk
- Subject: (64) Response to pdCIF category critique
- From: bm
- Date: Mon, 12 May 1997 13:39:03 +0100
Dear Colleagues

D61.1 pdCIF categories
----------------------

There follows Brian Toby's considered response to Paula's critique in circular 61, and extracts from correspondence between Herbert Bernstein and Syd Hall, who have independently been discussing this issue in connection with the implementation of category checking routines in the CIFtbx library.

From Brian:

I would like to thank Paula, John and Helen for their effort in analyzing the structure of the pd dictionary. The objections that Paula raises, though, are not easily resolved. In fact they are nothing new. Since the time when database normalization issues were first raised (in Tarrytown?), I have kvetched about their impact on the pd dictionary. In a way I am glad that she raised an objection as it offers a chance to reopen the normalization discussion, which was never resolved to my satisfaction.

While I did not put as much work into looking at the loop and category relationships as Paula et al. did, the relationships in pd_cif are not haphazard. I will respond to Paula's questions and perhaps my examples will make these choices more clear. I have to apologize that I am working "on the road" this week, so I have not done a complete review of looping rules and categories using her tables and I imagine that there is some "fine tuning" that I will want to do.

1) pd_phase

Thanks to Paula's analysis, I can now see that _pd_phase_block_id should be list=both. Other than this, the rest is correct. When a sample contains a single chemical phase, one can use _pd_phase_name to specify a name or structure type for the material. For a single component material, _pd_phase_id and _mass_% are not used. Use of _pd_phase_block_id in this circumstance will probably be rare. However, in the quite common case where a sample contains two or more chemical phases, _pd_phase_name will be looped with names for each phase. If any of the other _pd_phase_* items appear, they would be in the same loop as _pd_phase_name.

2) pd_proc_info

Breaking up these items into two groups might be a good idea, but _datetime follows the category of _author_. Why? Because if several different people take turns at data analysis, there should be a correspondence between dates and people. If the items will ever be in a single loop, they must be in the same category. Does it make sense to have a loop only containing _datetime? Yes! If I am the only person who looks at a set of data and I make several passes at analysis, I will list my e-mail address etc once, but will have several timestamps. Further, _author_ and _datetime might show up in different loops: if I work on the data on two occasions with a grad student and a postdoc, one loop would list the three participants and another the two dates.

3) pd_meas_method

When I defined _pd_meas_rocking_axis as list=both, I felt it was unlikely that anyone would ever rock a sample on more than one axis in an experiment. As it turns out, I have just gotten a proposal from someone who wants to use my instrument at NIST to do exactly that in the next few weeks.

4) pd_instr

Paula is correct that it may make sense to break up these items into more categories. There are also good reasons to not do this (see below).

5) pd_calib

I have to give this area more thought after I get back from the synchrotron and get some sleep.

6) pd_data

It may make sense to change the category for _beam_size_ and _rocking_angle, but I am sure that all the rest *must* be in the same category under the existing rules. I would not want to change the names to match the category either, as this would result in a significant loss in clarity. Perhaps an example will make clear the reasons for this:

In the most common pd experiment, someone measures a series of _pd_meas_intensity_total values. S/he applies corrections to obtain a _pd_proc_intensity_net value for each observation.
S/he determines a crystallographic model and from that the expected intensity is computed from the model for each data point. The CIF will contain a

   loop_
      _pd_calc_intensity_net
      _pd_meas_intensity_total
      _pd_proc_intensity_total

BTW, the difference in names here is important: _meas_ values are the experimental settings or observables, _proc_ are determined from processing and _calc_ values are derived from models.

That was the simple case. There is no requirement that one have a processed point for every observed data point. Case in point: here at NIST we have a 32 detector instrument. No one models each detector. We then normalize the data as if it had been collected with a single detector instrument and reduce the number of data points by ~45%. So in this case there is no longer a 1:1 relation between the original data (_pd_meas_intensity_total) and the reduced data (_pd_proc_intensity_total) and they appear in different loops.

   loop_
      _pd_calc_intensity_net

   loop_
      _pd_meas_intensity_total
      _pd_proc_intensity_total

Further, for the above examples, data is collected at constant wavelength so there is a single _pd_proc_wavelength value. However, for TOF and energy-dispersive instruments, the wavelength varies for each data point so that _pd_proc_wavelength appears inside a loop!

Perhaps the way that mmCIF would deal with these possible (but not mandatory) relationships between data items would be to require each set of items to appear in a different loop and to add a set of pointers between the loops. To do this would make the pdCIF dictionary exceedingly complex and would probably delay the acceptance of pdCIF substantially.

In summary: what are the conflicts?
-----------------------------------

A) Names don't follow categories. They can't. For example, if names must begin with _pd_ since _pd_refln_* are in refln.

B) Different types of information (measurements vs processed values vs derived values) appear in the same category.
This is not my preference, but the alternative is to prohibit very sensible groupings of data -- as long as we require that loops contain only one category.

C) From a database normalization usage, data items are given different uses depending on context. For example, _info_author_ was used in three different ways in (2) above. One could differentiate between members of a research group who collaborate versus sequential processing of data by different individuals by defining different data items for the purpose. In my opinion, in the pd dictionary we have already too many cases where we have different data items for the same physical variable. For example, _pd_meas_2theta_fixed and _pd_meas_2theta_scan and _pd_meas_2theta_range_* all record the same experimental setting. The different names distinguish how this setting varies through the course of the experiment. This distinction may make sense to a computer jock, but will confound most scientists.

D) Categories could be further divided to separate items that may be looped from related items that will never be looped. This creates a bit of work for Brian & Brian, but I don't object. On the other hand this will fracture further the divisions between names and categories, so I am not certain this work has value. Why should we separate the category assignment for items that are not looped when items that are list=both will have to be in the same category as items that are always looped? I would like to get some input from COMCIFS as to why this should or should not be done.

Conflicts (A), (B) and in part (D) arise from the requirement that items in different categories cannot appear in a single loop. At the risk of giving voice to my annoyance, I have to repeat that I have always objected to this rule. Conflict (C) can be resolved at a later time by defining new entries, if this is actually a problem.

Conclusions and possible resolutions: What are the choices?
-----------------------------------------------------------

I) We could have a wholesale renaming of the data names in the pd dictionary to match categories. I would suggest to whoever takes on this task that they should drop the _pd_ prefixes while they are at it (unless _pd_ is needed for a particular category to differentiate it from a core/mm/... category). Any volunteers?

or

II) Paula could withdraw her objections. We would then make some minor changes to list usage and perhaps one or two category reassignments and punt.

or

III) We change the requirement that items in different categories cannot appear in a single loop. This was always a foolish choice. A better rule is that all items in a category, if looped, must appear in a single loop. Two loops (with different categories) may be combined when desired. This will not create database normalization problems, as the implied pointers can be generated on the fly by any software that will parse CIF into a relational db. Just as now, when a single value is specified for a data item that can optionally appear looped with other values, the relational model must expand it from a scalar to a vector of identical elements. If really necessary to satisfy the database folks, a new DDL entry could be defined that would identify the categories that might end up in a single loop, along the lines of a friend function in C++.

My proposed revision would allow categories to be assigned in a rational fashion in the pd dictionary and thus most of Paula's points would be satisfied. (III) is the most sensible choice.

I look forward to comments from Paula and the rest of COMCIFS.

----------------------------------------------------------------------------

In the extracts from the CIFtbx correspondence which I have reproduced below, 'H>' is Herbert Bernstein, 'BM>' is Brian McMahon and 'S>' is Syd Hall:

H> The dictionary change does produce more warnings ... I've been thinking
H> about the core messages.
H> The major problem seems to be a mismatch for a
H> few old tokens between the category chosen and the initial characters
H> of the token.

BM> The powder dictionary has much less stringent rules about
BM> keeping a relationship between data names and the names of their parent
BM> category. It may be that this will change during the course of [COMCIFS]
BM> discussions, but it has been pointed out that under the rules of STAR and
BM> DDL1.4, it is not *necessary* to have such a correspondence between the
BM> leading characters of a data name and the category name. I would therefore
BM> suggest that you modify the DEFAULT behaviour of the checks against the
BM> dictionary to NOT report such mismatches for DDL1.4 dictionaries. There
BM> should ideally be a switch which says "perform this check in any case",
BM> because in practice many dictionaries will conform to the convention of
BM> matching the names, and in such circumstances it is advantageous to run the
BM> check.

H> It seems sad that the powder dictionary won't share the elegance of the
H> core and mm dictionaries. It is nice when things have a regular
H> structure. Tends to avoid mistakes. I'll add a switch to control the
H> name against category checking in the next pass of ciftbx, but I'd like
H> to bounce around the question of how to handle the default behavior.
H>
H> Clearly for the core and mm people the default of checking will help to
H> catch garbles in dictionaries, so that is the right default for them,
H> while for the powder people, it sounds like the right default is the
H> other way. Rather than visit the sins or virtues of one of the
H> communities on the other, it sounds like it is time to introduce a choice
H> of sets of defaults in a canned file (say a .tbh for tool-box header
H> file) which would be a file the user could prepare with desired defaults
H> when working in a particular context.
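
The switchable name-against-category check discussed in this exchange can be sketched roughly as follows. This is only an illustration in Python and is not CIFtbx code; the function name category_mismatches, the check_names flag, the dictionary layout, and the sample data names and category assignments are all assumptions made for the example.

```python
def category_mismatches(dictionary, check_names=True):
    """Return (data_name, category) pairs whose name lacks the '_<category>_' prefix.

    `dictionary` maps data names to category names (a hypothetical layout).
    `check_names=False` models the proposed switch that suppresses the check,
    e.g. as a per-context default for dictionaries that do not tie names to
    categories.
    """
    if not check_names:
        return []
    return [
        (name, category)
        for name, category in dictionary.items()
        if not name.lower().startswith("_" + category.lower() + "_")
    ]


# Illustrative entries only; the category assignments are assumptions.
core_like = {"_refln_index_h": "refln", "_cell_length_a": "cell"}
powder_like = {"_pd_refln_example": "refln", "_pd_proc_wavelength": "pd_data"}

print(category_mismatches(core_like))
print(category_mismatches(powder_like))
print(category_mismatches(powder_like, check_names=False))
```

With the check enabled, the core-style names pass while both powder-style names are reported; with it disabled nothing is reported, which is the behaviour a per-context defaults file would select.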

(I have included the above part of the discussion to illustrate how different CIF applications, based on different dictionaries, may well require different behaviour from a common software package, say in the strictness of validation or type checking. It may become part of the future role of COMCIFS to provide more directives, or at least direction, to software developers on a per-application basis. The .tbh file approach sounds to me a good one.)

S> Herbert: This business of categories is a complicated business ...
S> linking the name and the category does appear to provide real
S> advantages and that philosophy has been adhered to in DDL1 and
S> DDL2 dictionaries in the past...and that's why I was nervous about
S> the deviations from this in the powder dictionary. However, would we
S> regret in the longer term a formal association between tag and category?

H> I know of no basis ... why it is desirable to have
H> disorganized naming conventions. Certainly it is undesirable to impose
H> unnecessary semantics on any language, and I am all in favor of giving
H> everyone all the freedom they need to express the nouns of each domain in
H> terms that its practitioners understand, but doubt there are any powder
H> practitioners other than BT who care about the names of tables internal to
H> a relational database, which is the only information conveyed by the
H> names chosen for categories, which could prefix rather arbitrary sets of
H> tokens.
H> I know there are people who are firmly convinced that one cannot
H> make a solid assignment of a given token to a particular table, that some
H> tokens must be assigned to multiple tables, and other tokens belong in
H> splendid isolation, but, once you try to build a real database, you have
H> to make the hard decision of where things go (as the old joke goes,
H> "everybody got to be someplace"), and putting the same data item in two
H> places, instead of making one instance the parent and other instances
H> child pointers, gives you major headaches in then updating the database.

(I throw in here the reminder that the original CIF/DDL model was not designed to equate to a relational database model; and the option of introducing the STAR save_ and looped list features still provides for hierarchical and other data models. While DDL2 is unashamedly relational, DDL1 allows for more flexibility, though that has (arguably) been compromised by the category assignment which effectively normalises the permitted loop structures - see Brian T.'s discussion above.)

H> Yes, you can achieve the same result by using the category as a token
H> attribute, and that is all that the DDL requires, but that is an
H> invitation to mistakes, and, as a matter of good style, should be kept to
H> a minimum. Think of some poor soul pulling together a loop. He has to
H> restrict the loop to tokens from a single category, and in normal cases,
H> he will want to put all the tokens from that category into the loop.
H> Life is just so much simpler if the tokens for a given category begin
H> with the same characters. (And even simpler than that if you follow the
H> DDL2 convention of using a period to unambiguously flag the category, but
H> that is less important)

H> Is the powder dictionary now at that necessary minimum? That I don't
H> know.
H> I would suggest that COMCIFS ask that specific question of the
H> community (not just the powder folks, but the entire IUCr) and then make
H> a decision on what to do informed by whatever comments they get. The
H> approach of asking the world at large for their thoughts on the difficult
H> questions seems to have worked well for the mmCIF dictionary. Why not do
H> the same for the powder dictionary?

H> ... if COMCIFS does not start enforcing some sort of
H> stylistic consistency, we will weaken the impact of CIF and cause the
H> entire concept to be blamed for problems which actually arise from a lack
H> of editorial discipline.

H> ... there is much more to a good language than syntax. Semantics and
H> style are also very important. There are often good reasons to stretch
H> the semantics of a language and to adopt unusual styles. Thus do
H> languages grow. But each such use stands out in bold relief from the sea
H> of uses which are consistent with accepted semantics and popular style,
H> and, as any experienced editor knows, such things should not be overdone,
H> or inelegance and confusion may result, with less information
H> communicated. BT may be right in this case. He may also be wrong. The
H> truth, most likely, is not at the extremes, but somewhere in the realm of
H> compromise. I wish COMCIFS luck in finding the right compromise.

I have left in some of Herbert's opinions and philosophies, as well as his technical points, because it's always good to have yet another point of view. I have also added him to the mailing list for the present round of discussions. I would not wish to second-guess the outcome of this line of debate, but I believe that whatever the outcome, COMCIFS has a major responsibility to publish in as clear a way as possible the assumptions and constraints, the similarities and differences, that apply to applications software development under the DDL1 and DDL2 models.

Regards
Brian