
Re: [ddlm-group] DDLm aliases (subject changed)

Dear Herbert,

On Monday, January 31, 2011 3:09 PM, Herbert J. Bernstein wrote:

>At 1:20 PM -0600 1/31/11, Bollinger, John C wrote:

[...]

>This discussion began with adding what we were then calling styles
>to group related sets of tags.  One tag could have multiple styles.
>In normalized form, that would mean creating a new relation with
>the tags and the styles as components of a composite key, so the
>same key could be repeated with multiple styles and the same
>style could be repeated with multiple keys.

Indeed so.  This is what the ALIAS_DEFINITION_SET category provides (by whichever name it's now going).

>Placing that directly in the alias category instead of
>in a separate relation _is_ a denormalization.

In a formal sense, I think you're saying that the result would not satisfy second normal form because _alias.dictionary_uri would depend on only part of the key (_alias.definition_id).  I agree.  That does rely on _alias.dictionary_uri not being part of a candidate key, but the current definition assumes that.

If the only attributes were _alias.definition_id and _alias.definition_set_id, however, and both were elements of the key, then the category would comply even with domain-key normal form.  One might in that case complain that the meaning of the ALIAS category was changed, and that would be true, but it would be as normalized as can be.
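
To make the second-normal-form point concrete, here is a minimal Python sketch that looks for partial dependencies in a small sample of ALIAS-like rows. The sample rows, the example.org URIs, and the helper names `determines` and `partial_dependencies` are all hypothetical, and a check over sample data can only suggest a dependency that the schema itself would have to declare.

```python
from itertools import combinations


def determines(rows, lhs, rhs):
    """True if, within rows, the attributes in lhs functionally determine rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """List (subset, attribute) pairs where a proper subset of the key
    determines a non-key attribute, i.e. apparent 2NF violations."""
    found = []
    for size in range(1, len(key)):
        for subset in combinations(key, size):
            for attr in non_key:
                if determines(rows, subset, attr):
                    found.append((subset, attr))
    return found


# Hypothetical sample data, not taken from any real dictionary.
rows = [
    {"definition_id": "_a.b", "definition_set_id": "core",
     "dictionary_uri": "http://example.org/d1.dic"},
    {"definition_id": "_a.b", "definition_set_id": "legacy",
     "dictionary_uri": "http://example.org/d1.dic"},
    {"definition_id": "_c.d", "definition_set_id": "core",
     "dictionary_uri": "http://example.org/d2.dic"},
]
key = ("definition_id", "definition_set_id")

# dictionary_uri depends on definition_id alone: a partial dependency.
violations = partial_dependencies(rows, key, ["dictionary_uri"])

# With no non-key attributes at all, nothing can be partially dependent.
key_only = partial_dependencies(rows, key, [])
```

In this sample, `violations` reports that `dictionary_uri` depends on `definition_id` alone, while the key-only variant reports nothing.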

>  You happen to
>have preferred to use the xref_code, but adding that to the
>alias category key is and was a denormalization.  In CIF, until
>now at least, COMCIFS has tried to maintain a global name space,
>with a given tag having one meaning across multiple dictionaries.
>That is why there is a prefix registration system, so adding
>the dictionary to the alias key should not be necessary.

So this is exactly one of the conversations I said we needed to have: "What is the entity being modeled, and what assumptions are being made about it?  [... T]his question could be framed as 'Should a dictionary identifier be added to the ALIAS category key?'"  Thank you for indulging me.

Xref_code, or some other dictionary identifier, is a different case than definition_set_id.  Whereas there is no viable argument for definition_set_id being part of a candidate key for ALIAS as that category is currently defined, there *are* arguments for xref_code being part of a candidate key.  We can choose how we want to model things, but the decision is not arbitrary: it has technical, semantic, and policy implications.

From a technical perspective, the question can be again reframed as "does a definition_id determine the dictionary in which its definition appears?"  Inasmuch as the definition does not presently include dictionary_uri in the category key, DDLm as currently constituted appears to say "yes."  I think that's erroneous.  At minimum, COMCIFS' intention seems to be to redefine many mmCIF data names in a DDLm dictionary, and Herbert has expressed plans to do similarly for imgCIF.  Herbert nevertheless offers a contrasting view:

>The idea in CIF is that you _don't_ use the same tag name with
>different meanings in different dictionaries, but with the introduction
>of DDL2 and mmCIF we ended up with 2 versions of the same core definitions
>having the same meanings but different tag names.  Thus we needed to
>have aliases to relate the DDL2 dotted notation versions of the
>tags to the DDL1 undotted notations of the tags.

I understand the original impetus for aliases.  Interpreting DDL2, however, I conclude that the concept was broadened during development, and that the assumption of data names having global scope was intentionally avoided.  Others here were closer to the process than I, but I observe that the description of the DDL2 ITEM_ALIASES category specifically says "Each alias name is *identified by* the name and version of the dictionary to which it belongs" (emphasis added).  Indeed, the category key is (_item_aliases.alias_name, _item_aliases.dictionary, _item_aliases.version).  That's even broader than anything currently under discussion for DDLm.  ITG remarks that "_item_aliases.dictionary [... is] provided to distinguish between dictionaries [...]," which would not be necessary if a given data name could be assumed to be defined in only one dictionary, or even to be defined equivalently in every dictionary where it appears.

As much as the idea may be to globally avoid data name clashes, it is not necessary to assume that they are successfully avoided.  Rejecting that assumption not only protects against failures and policy changes in the CIF community, but also makes DDLm a better candidate for adoption in disciplines with less central authority.  Furthermore, although we do not need to follow DDL2 here, it does establish a precedent for scoping aliases to specific dictionaries.  These are all good reasons to decide that, for DDLm's purposes, definition_id PLUS some form of dictionary identifier is required to uniquely identify an alias definition.  Are there good reasons to choose otherwise?

Supposing that we do adopt the view that unique identification of definitions requires at least definition_id and a dictionary identifier, ALIAS is not even a proper relation unless a dictionary identifier (such as xref_code) is added to the category key.
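
A small Python sketch of why the key matters under that view: if the same definition_id can legitimately appear in more than one dictionary, a key of definition_id alone admits duplicate tuples, whereas a key that also includes a dictionary identifier does not. The rows, the `dictA`/`dictB` identifiers, and the helper `duplicate_keys` are hypothetical illustrations only.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that occur in more than one row."""
    counts = Counter(tuple(row[a] for a in key) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical alias rows: the same data name is defined in two dictionaries.
alias_rows = [
    {"definition_id": "_cell.length_a", "xref_code": "dictA"},
    {"definition_id": "_cell.length_a", "xref_code": "dictB"},
    {"definition_id": "_cell_length_a", "xref_code": "dictA"},
]

# Key of definition_id alone: one key value identifies two different rows.
clashes_without_dictionary = duplicate_keys(alias_rows, ("definition_id",))

# Key extended with a dictionary identifier: every row is uniquely identified.
clashes_with_dictionary = duplicate_keys(
    alias_rows, ("definition_id", "xref_code")
)
```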

[...]

>I would be very happy having fully normalized DDLm dictionaries, but
>I can cope with denormalized dictionaries, just as I have to cope
>with denormalized datafiles -- indeed, for some search procedures,
>I deliberately denormalize dictionaries internally.  It
>sounds like John B. wants to stick to fully normalized DDLm dictionaries.

Hmm.  I would be happy to see dictionaries define data models that comply with higher normalization forms, but that is a design decision that should rest with their authors and maintainers.  I would in particular like DDLm itself to describe a highly normalized model for its own domain (dictionaries), though exactly which form would be most appropriate is an open question.  Ensuring that DDLm describes a well-normalized data model does not force other DDLm dictionaries to describe equally normalized models.  *Presentation* of these models, on the other hand, remains a separate issue, discussed next.

>While this has some impact on software developers, it has very little
>direct impact on users -- so what do people think:
>
>   Should all DDLm dictionaries be fully normalized (if so, to which level
>of normalization) or
>
>   Should DDLm dictionaries be allowed the same flexibility as
>data files in being denormalized?

I see no reason why DDLm instance documents (i.e. dictionaries) should have different presentation rules than the instance documents they themselves describe.  Given a valid, possibly-denormalized instance document and a dictionary with which it complies, it must be possible to programmatically normalize the instance to the form described by the dictionary (else the document contains inconsistencies and therefore is invalid).  DDLm dictionaries are instance documents of DDLm, so there is no need for different behavior with respect to them.
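
The claim about programmatic normalization can be sketched as follows, assuming a denormalized presentation in which every row carries both the set membership and the dictionary URI. The attribute names follow this discussion, but the row data, the example.org URIs, and the functions `project` and `normalize` are hypothetical. The sketch rejects an instance whose repeated values disagree, which is the inconsistency case described above.

```python
def project(rows, key, attrs):
    """Project rows onto key -> attrs, raising ValueError if two rows
    with the same key carry different attribute values."""
    table = {}
    for row in rows:
        k = tuple(row[a] for a in key)
        v = tuple(row[a] for a in attrs)
        if table.setdefault(k, v) != v:
            raise ValueError(
                f"inconsistent values for key {k}: {table[k]} vs {v}"
            )
    return table


def normalize(rows):
    """Split joined rows into a definitions relation and a memberships relation."""
    definitions = project(rows, ("definition_id",), ("dictionary_uri",))
    memberships = sorted(
        {(r["definition_id"], r["definition_set_id"]) for r in rows}
    )
    return definitions, memberships


# Hypothetical denormalized instance: dictionary_uri is repeated per row.
joined = [
    {"definition_id": "_a.b", "definition_set_id": "core",
     "dictionary_uri": "http://example.org/d1.dic"},
    {"definition_id": "_a.b", "definition_set_id": "legacy",
     "dictionary_uri": "http://example.org/d1.dic"},
    {"definition_id": "_c.d", "definition_set_id": "core",
     "dictionary_uri": "http://example.org/d2.dic"},
]

definitions, memberships = normalize(joined)
```

If a further row repeated `_a.b` with a different `dictionary_uri`, `normalize` would raise `ValueError` rather than silently pick one value.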

Although I think the same applies to DDLm's own presentation, I am concerned about what would happen if DDLm were presented in a denormalized form that contained inconsistencies.  Rather than expend continuing effort to ensure that a denormalized presentation of DDLm remains consistent, I would rather expend effort to express and maintain DDLm in its self-defined normalized form.  In any case, I emphasize again that allowing a denormalized presentation is not at all the same thing as defining a denormalized model.


None of the foregoing settles just what presentation rules DDLm should actually require with respect to joined categories.  Should denormalizing joins be permitted?  There is a cost/benefit analysis to be performed here, but I'm not up to attempting it at the moment.


John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


