Re: [ddlm-group] DDLm aliases (subject changed)

On Friday, January 28, 2011 9:23 PM, Herbert J. Bernstein wrote:

>   I am sorry to hear that John Bollinger "cannot agree to the expanded key
>for the ALIAS category" because he believes "This is not a philosophical
>question, but rather one of correctly modeling the data domain.
>Furthermore, this particular question also has nothing to do with
>macromolecular data processing.  DDLm is a language for writing
>*dictionaries*.  Macromolecular data CIFs will not contain items from the
>ALIAS, DICTIONARY_XREF, or IDENTIFIER_SET categories, nor from any other
>DDLm category."
>   This is remarkably strange coming from someone who was already pressing
>to expand the ALIAS category with the xref_code object, and who seems
>unaware that the only reason for any need to expand the key of ALIAS is to
>allow the denormalized form that he and David were originally asking for.

I strive always for precision and clarity in my writing, but apparently I have not been as successful as I had thought.  Allow me, then, to clarify:

1) Having a composite key does not automatically make a category denormalized, and in particular, using ( _alias.xref_code, _alias.definition_id ) as the key for category ALIAS is not a denormalization.  It represents modeling the ALIAS entity as a particular data name *as defined in a particular dictionary*.  The dictionary identifier in which the alias data name is defined thus serves as a namespace for the data name, which seems perfectly natural to me.  For what it's worth, there is a strong analogy with XML qualified data names.

If data names were assumed to be defined in one dictionary only, globally, then xref_code would not need to be part of the key.  Under the latest proposal's specification that every data name is implicitly aliased to itself, however, that assumption would prevent definitions from being aliased to the same data name defined in some other dictionary.  (Otherwise there would be a duplicate category key.)  Moreover, the assumption is patently unsafe, as anyone is free to write his own dictionary defining any data name he wants, any way he wants.  We could ignore that possibility, and *probably* it would not later cause us pain, but I'm a purist: I would rather the model be fundamentally correct and thus foreclose any possibility of future pain arising from faulty assumptions.
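The duplicate-key argument above can be illustrated with a minimal sketch.  It is not taken from any DDLm validator; the function name, the xref codes, and the data names are hypothetical, chosen only to show that a key of definition_id alone rejects a same-named alias from another dictionary, while the composite key ( xref_code, definition_id ) accepts it.

```python
def find_duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once in rows."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Hypothetical ALIAS rows attached to one definition, _cell.length_a.
# The xref codes ("this_dict", "other_dict", "core_1") are assumptions.
alias_rows = [
    # Implicit self-alias under the latest proposal.
    {"xref_code": "this_dict", "definition_id": "_cell.length_a"},
    # The same data name as defined in some other dictionary.
    {"xref_code": "other_dict", "definition_id": "_cell.length_a"},
    # An ordinary alias to a differently spelled name.
    {"xref_code": "core_1", "definition_id": "_cell_length_a"},
]

# Key = definition_id only: the first two rows collide.
single_key_dups = find_duplicate_keys(alias_rows, ["definition_id"])

# Key = (xref_code, definition_id): every row is distinct.
composite_key_dups = find_duplicate_keys(
    alias_rows, ["xref_code", "definition_id"]
)

print(single_key_dups)
print(composite_key_dups)
```

In this sketch the dictionary identifier acts as the namespace for the alias name, which is the modeling choice argued for in point 1.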

Alternatively, if globally all definitions of each data name were assumed equivalent, then the ALIAS category would not need and should not have xref_code or dictionary_uri at all, nor any other attribute specific to a particular dictionary in which the alias name is defined.  Such attributes would better go into a separate category.  Perhaps this is the assumption from which Herbert is working, as in this view my proposed structure of the ALIAS category is indeed denormalized.  If that were the view that the group chose to adopt, then I would have to withdraw my support for all versions of the proposal that so far have been offered, including my own.  However, the assumption underlying this view is exactly as unsafe as the one underlying the single global definition view above, so again, I prefer to reject it to foreclose any possibility of future pain arising from faulty assumptions.  Furthermore, the additional category bearing per-dictionary attributes for each alias would look very much like the proposed ALIAS category now does, so taking this approach saves us nothing yet requires a more complex dictionary model.

As for adding definition_set_id to the key, on the other hand, that *is* a denormalization.  An alias can be partially characterized by the dictionary in which it is defined, but the definition set(s) in which it appears are not defining characteristics.

Denormalizing a dictionary, including DDLm itself, creates practical problems.  If we assume the denormalized definition to represent the same underlying validity constraints as the normalized one, then software must implement special case code for affected categories to validate instance documents correctly.  To put it a different way: DDLm would not be complete if ALIAS or any other category were denormalized.  On the other hand, if we do not assume any special case logic then the resulting dictionary is not denormalized after all; instead, it represents a different data model, with different validation rules.
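One way to picture the special-case logic is the following sketch.  It assumes, purely for illustration, an ALIAS loop keyed on ( xref_code, definition_id, definition_set_id ) and carrying a deprecated attribute; the row values and the function name are hypothetical.  The expanded key is satisfied by every row, yet the same alias carries contradictory values, so a validator would need an extra check beyond key uniqueness to catch the problem.

```python
from collections import defaultdict


def inconsistent_aliases(rows):
    """Report aliases whose 'deprecated' value differs between rows.

    Rows are grouped by (xref_code, definition_id), i.e. by the alias
    itself, ignoring the definition set each row happens to mention.
    """
    values = defaultdict(set)
    for row in rows:
        alias = (row["xref_code"], row["definition_id"])
        values[alias].add(row["deprecated"])
    return sorted(alias for alias, seen in values.items() if len(seen) > 1)


# Hypothetical denormalized ALIAS rows; all identifiers are assumptions.
rows = [
    {"xref_code": "core_1", "definition_id": "_cell_length_a",
     "definition_set_id": "set_A", "deprecated": "no"},
    {"xref_code": "core_1", "definition_id": "_cell_length_a",
     "definition_set_id": "set_B", "deprecated": "yes"},
    {"xref_code": "core_1", "definition_id": "_cell_length_b",
     "definition_set_id": "set_A", "deprecated": "no"},
]

# The expanded three-part key is unique for every row ...
expanded_keys = [
    (r["xref_code"], r["definition_id"], r["definition_set_id"]) for r in rows
]
expanded_key_is_unique = len(set(expanded_keys)) == len(rows)

# ... but the alias-level attribute is contradictory for one alias.
conflicts = inconsistent_aliases(rows)

print(expanded_key_is_unique)
print(conflicts)
```

With the normalized key ( xref_code, definition_id ), the first two rows would simply be a duplicate-key error, and no category-specific check would be needed.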

>   It is even more remarkably strange to hear the view that "this
>particular question has nothing to do with macromolecular data
>processing."  The _only_ reason DDL2 exists at all was to allow for the
>creation of mmCIF.

2) Are there then no dictionaries other than mmCIF written using DDL2?  Is there no interest in DDLm having any applicability to fields other than macromolecular crystallography?  Are we shutting out even the small-molecule and powder communities?  DDLs 1, 2, and m are data modeling languages designed, to various degrees, to have rich semantics convenient for defining scientific data.  I have always thought it apropos that its initialism is the same as that of SQL's "Data Definition Language" subset, as the two are similar in purpose and scope.  Nothing in DDL2 or DDLm is inherently specific to macromolecular crystallography.

Let me restate my point, then: this particular question has nothing *directly* to do with macromolecular data processing.  I was writing in response to the discussion of using denormalized data presentation for data harvesting purposes.  DDLm defines the language in which the *dictionary* is written, not the language in which *data* are written, therefore denormalizing it as Herbert has lately proposed would have no effect on the validity of presenting macromolecular data in denormalized form.


>   The only point of having the dictionaries and the various DDLs is to
>support the data domains, and if we cannot ground features in the needs of
>those domains we really should consider dropping those features.

By that logic, then, we should not include definition_set_id in the ALIAS category key.  It does nothing to serve the needs of the data domains.  At best, it provides a convenience for a subset of dictionary authors.

I have already agreed that there is a more fundamental question of whether DDLm validation rules should generally support denormalized presentation, and if so, with what semantics.  If it ultimately does support that, then addition of definition_set_id to the ALIAS key serves no purpose whatever.  Either way, it is important to choose the right key for this and every category, for key choice embodies some of the validation rules.

>   So, returning to actually getting work done -- if David needs similar
>features to support definition sets up at the ALIAS category level then my
>proposal is a reasonable way to do both that and to support the more
>normalized form I will be using. If David is not going to be using such
>features for the core, we can leave out the ability to do the denormalized
>form for now.
>   So, there seeming to be nothing left of substance in this discussion
>other than matters of taste,

Again technical disagreement is discounted as an insubstantial difference in taste?  I assume that I did not previously present my technical position clearly, and I hope that I rectified that failing above.

> could we please choose one of three
>approaches to my introducing the definition sets:

We have at least three distinct, albeit related, areas under discussion here:

A) The attributes and structure of the ALIAS category.  This can be considered separately from definition sets and implemented either with or without them.  There seems to be broader interest here than in definition sets.

There are several questions to settle in this area:

1. What is the entity being modeled, and what assumptions are being made about it?  This directs suitable choices of category key, and the key choice could be discussed instead as a proxy.  In particular, this question could be framed as "Should a dictionary identifier be added to the ALIAS category key?"  This influences

2. whether to expand the ALIAS category to directly or indirectly provide additional attributes, such as those David named.  The several draft proposals from the end of last week all perform such an expansion.  The latter ones provide for all of David's attributes except dictionary version.  These additional questions depend on whether we do expand the ALIAS attributes:

  2a. If the ALIAS category is expanded more or less along the proposed lines, whether to deprecate _alias.dictionary_uri or to remove it.
  2b. If the ALIAS category is expanded more or less along the proposed lines, whether to deprecate _definition.xref_code or to remove it.
  2c. If the ALIAS category is expanded more or less along the proposed lines, should it have an attribute defining the earliest version of its dictionary in which each alias appears?

If I understand him correctly, Herbert's three closing questions can be couched in these terms (B and C together):

B) Whether DDLm validation rules should allow parent and child categories to be presented together in a denormalized joined form.  This has potentially far-reaching implications, much beyond aliases and definition sets.

C) The structure and adoption of definition sets.

The form and attributes of the proposed definition set categories seem relatively uncontroversial, so the primary question here is

1. Whether to adopt the definition set categories at all.

If we agree to adopt definition sets, however, then there is a significant ancillary question:

2. Should definition_set_id be added to the ALIAS category and to its key (whatever it may be)?  The decision on (B) may factor into opinions on this question.  If we do add definition_set_id to the ALIAS category key, then David has raised an interesting question:

  2a. Do we need or want a separate ALIAS_DEFINITION_SET category?


Many of these questions are separable to at least some degree, so I think the best way forward is probably to handle them as separately as possible.  John W. and David both seem to be seeking more information about the definition set concept, and I'm sure the normalization question would bear more discussion.  Perhaps, however, we would be prepared to decide on the alias-specific questions in group A above?

I think these aspects of those questions are so far agreed without dissent:

* The ALIAS category should be expanded with additional attributes sufficient to describe the properties David enumerated, +- dictionary version.

* ALIAS should refer to the existing DICTIONARY_XREF category to provide information about the dictionary(-ies) in which alias data names are defined.  Therefore, an attribute _alias.xref_code should be added.

* Adding _alias.xref_code to ALIAS makes _definition.xref_code and _alias.dictionary_uri superfluous.  Each of those should be at least deprecated.  (Some would prefer that they be removed.)

* It is reasonable and appropriate to add an attribute _alias.deprecated as described in the various proposals.

Does anyone object?  Is it needful or appropriate to call a vote on these?

These aspects seem still in doubt:

* Whether xref_code should be added to the ALIAS category key.  (This is independent of whether definition_set_id is added.)  I claim it should be.

* Whether _alias.dictionary_uri should be removed (rather than just deprecated).  I currently prefer that it be removed, but I'm open to the possibility that doing so would bring undue hardship on early adopters.  Who, specifically, would it harm?

* Whether _definition.xref_code should be removed (rather than just deprecated).  I currently prefer that it be removed, but I'm open to the possibility that doing so would bring undue hardship on early adopters.  Who, specifically, would it harm?

Is there any further discussion of these questions?  If not, then can we have a vote?


>P.S. "Data modeling is part and parcel of dictionary authorship, so there
>is every reason to expect that dictionary authors will be prepared to
>express their dictionaries in suitably normalized form, according to
>whatever presentation normalization rules ultimately are adopted for
>DDLm." is unrealistic for the macromolecular crystallographic community,
>which has vigorously rebelled against the strictures of DDL2 and seems
>likely to totally reject anything in DDLm that makes it any more complex
>and confusing.

Indeed?  Is it DDL2 that the macromolecular community finds troublesome, or is it the mmCIF dictionary?  If there is a sore spot here then it would help me, at least, to have more detail than "the community rebels".  In particular, the portion of the community to which DDLm (and DDL2) is directly relevant is dictionary authors and validator developers.  Do we not have multiple representatives of those constituencies within this very working group?  I cannot speak for dictionary authors, but as a validator developer I can certainly say that my life would be a bit easier if denormalized presentation is not considered valid.

I agree that controlling complexity and minimizing confusion are important objectives, but it is by no means obvious to me that allowing for instance documents to be provided in denormalized form would make DDLm or dictionaries based on it any less complex or confusing.  I'm inclined to think the opposite.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
