
Re: [ddlm-group] DDLm aliases (subject changed). .. .. .. .

Hi John and Herbert,

I do not wish to complicate the discussion, but I have a somewhat different perspective on
the issue of normalization.   Certainly in the development of the mmCIF dictionary,
anything approaching normalization was so at odds with familiar data organization that
it was not practical.   As a result, mmCIF has a highly denormalized organization in which
each category mirrors the organization of a typical data file.   To cope with this
data organization style, parent-child relationships were introduced between common
identifiers in key and non-key roles.   A further practical complication comes from
having to track multiple nomenclatures composed of natural keys, some of which have
unusual null-value rules.

To better address this in software, we have added DDL2 extensions to define parent/child linking groups.
See the categories pdbx_item_link_group and pdbx_item_link_group_list.

The groups defined in these categories allow validation of common items between categories
with multiple connecting relationships.   For instance, tables of bonds, angles and torsions
have multiple independent collections of natural keys times the number of nomenclatures.
In some cases the validation must make independent comparisons of each group against
the same group of parent items.
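
As a rough illustration of the kind of check described above, here is a minimal Python sketch. It is not the actual validation software, and it does not read the pdbx_item_link_group categories themselves. The item names, the sample rows and the null rule (skip a group when any of its child values is "?" or ".") are all assumptions made for the example. Each link group pairs a tuple of child items with a tuple of parent items, and every group is compared independently against the same parent table.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
NULLS = {"?", "."}  # assumed null rule: CIF unknown / inapplicable markers


def check_link_group(child_rows: Sequence[Row], child_items: Sequence[str],
                     parent_rows: Sequence[Row], parent_items: Sequence[str]
                     ) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (row index, values) for child rows with no matching parent."""
    parent_keys = {tuple(r[i] for i in parent_items) for r in parent_rows}
    missing = []
    for n, row in enumerate(child_rows):
        values = tuple(row[i] for i in child_items)
        if any(v in NULLS for v in values):
            continue  # this group is not populated in this row
        if values not in parent_keys:
            missing.append((n, values))
    return missing


# Hypothetical parent table: atoms named in two nomenclatures.
atoms = [
    {"label_comp": "ALA", "label_atom": "CA", "auth_comp": "ALA", "auth_atom": "CA"},
    {"label_comp": "ALA", "label_atom": "CB", "auth_comp": "ALA", "auth_atom": "CB"},
]

# Hypothetical child table: each bond names two atoms, each in two nomenclatures.
bonds = [
    {"a1_label_comp": "ALA", "a1_label_atom": "CA",
     "a2_label_comp": "ALA", "a2_label_atom": "CB",
     "a1_auth_comp": "ALA", "a1_auth_atom": "CA",
     "a2_auth_comp": "?", "a2_auth_atom": "?"},
    {"a1_label_comp": "ALA", "a1_label_atom": "CA",
     "a2_label_comp": "ALA", "a2_label_atom": "CG",
     "a1_auth_comp": "?", "a1_auth_atom": "?",
     "a2_auth_comp": "?", "a2_auth_atom": "?"},
]

# Two atoms times two nomenclatures gives four link groups,
# all pointing at the same parent items.
LINK_GROUPS = {
    1: (("a1_label_comp", "a1_label_atom"), ("label_comp", "label_atom")),
    2: (("a2_label_comp", "a2_label_atom"), ("label_comp", "label_atom")),
    3: (("a1_auth_comp", "a1_auth_atom"), ("auth_comp", "auth_atom")),
    4: (("a2_auth_comp", "a2_auth_atom"), ("auth_comp", "auth_atom")),
}

results = {
    gid: check_link_group(bonds, child, atoms, parent)
    for gid, (child, parent) in LINK_GROUPS.items()
}
```

In this sketch only group 2 reports a problem: the second bond refers to an atom (ALA, CG) that has no parent row. Groups 3 and 4 are skipped wherever the author nomenclature is null.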

I raise this issue because it is an unavoidable consequence of denormalization.  And,
as Herbert points out, the denormalized organization is important in data harvesting
and, more generally, in maintaining a connection to laboratory practice.

In the original design of DDLm there was an emphasis on adopting simple rather than
complex category keys.  This has been an issue of some concern for me, as it does
not map well to our data, which is rich in complex natural keys.


On 1/27/11 1:24 PM, Bollinger, John C wrote:
> On Thursday, January 27, 2011 10:16 AM, Herbert J. Bernstein wrote:
>> Let me just talk about the category join issue.  The current documentation
>> is vague about the issue of how one should match up the keys, and John B.'s
>> interpretation may well be what was intended, but I think for the join
>> to actually be useful, it has to be extended to cover the normalization
>> and denormalization cases, in which the choice of keys depends on
>> the degree of normalization.
> I don't think the draft is so vague, in that section 4.3 describes joined categories as being applicable to "categories have equivalent category keys", and it remarks that "the keys of joined categories may be used interchangeably in the instance document."  It may be that it would be useful to expand the scope of the feature as Herbert suggests, but I don't think the draft can be read to already define it that way.
> I always thought that the origin of this feature was the old ATOM_SITE vs. ATOM_SITE_ANISO issue, where the choice of whether to join does not involve normalization.  (Instead, it involves whether null anisotropic displacement parameters are explicitly recorded for atoms refined isotropically, and it relates to the way small-molecule structural results have traditionally been tabulated.)  In fact, the DDLm draft refers specifically to that case.  For that and similar cases, the current definition is already useful.
>>   This actually gets back to an old
>> disagreement between CCP4 and the PDB, which could finally be resolved
>> with a liberal (i.e. denormalization-friendly) interpretation
>> of category join.
> Can you summarize this disagreement, please?  Is it still an issue, or has it effectively been settled?
>> When you normalize a category, you often strip out several columns
>> that were originally key components in the larger category, and put them
>> entirely in the child category, so there is less repetition in the
>> parent category.  If we are to allow the option of using the
>> dictionary with the normalized categories with fewer key components to be
>> presented as the original wider, flatter denormalized categories,
>> then we need to interpret the _category.parent_join in a way
>> that permits more key components in the denormalized presentation,
> Agreed.
> The fundamental question is whether we do want to allow a denormalized presentation in such cases.  What are the advantages?  I currently see this one:
> () if denormalizing joins are allowed then some normalizations can be performed in existing dictionaries that otherwise could not be performed without invalidating existing instance documents.  At least in principle.
> I do not, however, see any special advantage inherent generally in multiplying the ways in which future instance documents can be written.
> Herbert argues that a denormalized presentation is more convenient for "data harvest", but I'm not clear on what he means by that term as distinguished from "database loads," which he presents as an alternative use.  I'm also not clear whether merely _allowing_ denormalized presentation is sufficient to serve the data harvesting use case.  Once I understand this argument better, I may agree that there is an advantage here.
> Are there other advantages?
> I see this disadvantage:
> () if denormalizing joins are allowed then that introduces a new type of validity error that CIF authors may inadvertently introduce into their files and that CIF validators must test for: duplicate parent-category keys with different parent-category attributes.  That's a reasonably complicated problem because "different" depends in part on the semantics of the non-key items' types.
> There might be other disadvantages, but I have not yet identified any.
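
The validity check named in the disadvantage above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names and rows are hypothetical, and the per-item `normalizers` argument is an invented stand-in for the type semantics that decide when two non-key values count as "different".

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Sequence, Set, Tuple

Row = Dict[str, str]


def find_conflicts(rows: Sequence[Row], key_items: Sequence[str],
                   attr_items: Sequence[str],
                   normalizers: Optional[Dict[str, Callable]] = None
                   ) -> Dict[Tuple, Set[Tuple]]:
    """Map each repeated parent key to its distinct attribute tuples,
    keeping only the keys that appear with more than one."""
    normalizers = normalizers or {}
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in key_items)
        # Values are compared as plain strings unless a normalizer is given.
        attrs = tuple(normalizers.get(i, str)(row[i]) for i in attr_items)
        seen[key].add(attrs)
    return {k: v for k, v in seen.items() if len(v) > 1}


# Hypothetical denormalized rows: the parent key (atom_id) and a parent
# attribute (occupancy) are repeated on every child row.
rows = [
    {"atom_id": "C1", "occupancy": "1.0", "bond_to": "C2"},
    {"atom_id": "C1", "occupancy": "1.00", "bond_to": "C3"},
    {"atom_id": "C2", "occupancy": "0.5", "bond_to": "C1"},
    {"atom_id": "C2", "occupancy": "0.7", "bond_to": "C3"},
]

as_text = find_conflicts(rows, ["atom_id"], ["occupancy"])
as_numbers = find_conflicts(rows, ["atom_id"], ["occupancy"],
                            normalizers={"occupancy": float})
```

Compared as text, both C1 and C2 are flagged, because "1.0" and "1.00" are different strings. Compared as numbers, only C2 is a real conflict. That gap is the type-dependence the quoted paragraph warns about.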
> If we decide we do want to allow denormalized presentation in such cases, then we can surely come up with suitable semantics.  Herbert presented one possibility, but before we discuss details let's first settle whether we even need to go there.
> Regards,
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

   John Westbrook, Ph.D.
   Rutgers, The State University of New Jersey
   Department of Chemistry and Chemical Biology
   610 Taylor Road
   Piscataway, NJ 08854-8087
   e-mail: jwest@rcsb.rutgers.edu
   Ph:  (732) 445-4290  Fax: (732) 445-4320