On Thursday, January 27, 2011 10:16 AM, Herbert J. Bernstein wrote:

>Let me just talk about the category join issue.  The current documentation
>is vague about the issue of how one should match up the keys, and John B.'s
>interpretation may well be what was intended, but I think for the join
>to actually be useful, it has to be extended to cover the normalization
>and denormalization cases, in which the choice of keys depends on
>the degree of normalization.

I don't think the draft is so vague, in that section 4.3 describes joined categories as being applicable to "categories have equivalent category keys", and it remarks that "the keys of joined categories may be used interchangeably in the instance document."  It may be that it would be useful to expand the scope of the feature as Herbert suggests, but I don't think the draft can be read to already define it that way.

I always thought that the origin of this feature was the old ATOM_SITE vs. ATOM_SITE_ANISO issue, where the choice of whether to join does not involve normalization.  (Instead, it involves whether null anisotropic displacement parameters are explicitly recorded for atoms refined isotropically, and it relates to the way small-molecule structural results have traditionally been tabulated.)  In fact, the DDLm draft refers specifically to that case.  For that and similar cases, the current definition is already useful.
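
To make that case concrete, here is a rough sketch of a join over equivalent keys.  It is only an illustration: the dictionaries stand in for two looped categories, the values are invented, and join_on_equivalent_keys is a hypothetical helper, not anything the draft defines.

```python
# Hypothetical in-memory stand-ins for two looped categories, each indexed
# by its category key (the atom label).  All values are invented.
atom_site = {
    "C1": {"U_iso_or_equiv": 0.021},
    "H1": {"U_iso_or_equiv": 0.035},   # refined isotropically
}
atom_site_aniso = {
    "C1": {"U_11": 0.020, "U_22": 0.023, "U_33": 0.019},
}


def join_on_equivalent_keys(parent, child):
    """Present parent and child rows that share a key value as one row.

    Parent rows with no matching child row get None (a null value)
    for every child item.
    """
    child_items = sorted({name for row in child.values() for name in row})
    joined = {}
    for key, row in parent.items():
        merged = dict(row)
        extra = child.get(key, {})
        for name in child_items:
            merged[name] = extra.get(name)
        joined[key] = merged
    return joined


joined = join_on_equivalent_keys(atom_site, atom_site_aniso)
for label, row in joined.items():
    print(label, row)
```

In the joined presentation H1 simply carries null anisotropic parameters; no key components are added or removed, which is why normalization does not enter into it.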

>  This actually gets back to an old
>disagreement between CCP4 and the PDB, which could finally be resolved
>with a liberal (i.e. denormalization-friendly) interpretation
>of category join.

Can you summarize this disagreement, please?  Is it still an issue, or has it effectively been settled?

>When you normalize a category, you often strip out several columns
>that were originally key components in the larger category, and put them
>entirely in the child category, so there is less repetition in the
>parent category.  If we are to allow the option of using the
>dictionary with the normalized categories with fewer key components to be
>presented as the original wider, flatter denormalized categories,
>then we need to interpret the _category.parent_join in a way
>that permits more key components in the denormalized presentation,


The fundamental question is whether we do want to allow a denormalized presentation in such cases.  What are the advantages?  I currently see this one:

() if denormalizing joins are allowed then some normalizations can be performed in existing dictionaries that otherwise could not be performed without invalidating existing instance documents.  At least in principle.

I do not, however, see any inherent advantage in multiplying the ways in which future instance documents can be written.

Herbert argues that a denormalized presentation is more convenient for "data harvest", but I'm not clear on what he means by that term as distinguished from "database loads," which he presents as an alternative use.  I'm also not clear whether merely _allowing_ denormalized presentation is sufficient to serve the data harvesting use case.  Once I understand this argument better, I may agree that there is an advantage here.

Are there other advantages?

I see this disadvantage:

() if denormalizing joins are allowed then that introduces a new type of validity error that CIF authors may inadvertently introduce into their files and that CIF validators must test for: duplicate parent-category keys with different parent-category attributes.  That's a reasonably complicated problem because "different" depends in part on the semantics of the non-key items' types.
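
As a rough sketch of the check a validator would need (the item names, the row layout, and find_conflicting_parent_rows are all hypothetical, and this assumes values arrive as the strings written in the file):

```python
def find_conflicting_parent_rows(rows, parent_key, parent_items, same=None):
    """Report denormalized rows that repeat a parent key but disagree
    on a parent (non-key) item.

    rows         -- list of dicts, one per row of the denormalized loop
    parent_key   -- names of the items forming the parent-category key
    parent_items -- names of the parent category's non-key items
    same         -- same(name, a, b) -> bool; defaults to exact equality
    """
    if same is None:
        same = lambda name, a, b: a == b
    first_seen = {}
    conflicts = []
    for index, row in enumerate(rows):
        key = tuple(row[k] for k in parent_key)
        if key not in first_seen:
            first_seen[key] = row
            continue
        for name in parent_items:
            if not same(name, first_seen[key][name], row[name]):
                conflicts.append((key, name, index))
    return conflicts


# Invented example data: parent_weight belongs to the parent category.
rows = [
    {"parent_id": "A", "parent_weight": "1.50", "child_id": "1"},
    {"parent_id": "A", "parent_weight": "1.5", "child_id": "2"},
    {"parent_id": "B", "parent_weight": "2.0", "child_id": "1"},
    {"parent_id": "B", "parent_weight": "3.0", "child_id": "2"},
]

# Comparing the raw strings flags "1.50" vs "1.5" as a conflict ...
naive = find_conflicting_parent_rows(rows, ["parent_id"], ["parent_weight"])

# ... whereas comparing them as numbers flags only the genuine disagreement.
numeric = find_conflicting_parent_rows(
    rows, ["parent_id"], ["parent_weight"],
    same=lambda name, a, b: float(a) == float(b),
)
print(naive)
print(numeric)
```

The two results differ only because of how "different" is decided, so a real validator would have to choose the comparison per item from the dictionary's type information.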

There might be other disadvantages, but I have not yet identified any.

If we decide we do want to allow denormalized presentation in such cases, then we can surely come up with suitable semantics.  Herbert presented one possibility, but before we discuss details let's first settle whether we even need to go there.



John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

