Discussion List Archives


Re: [ddlm-group] DDLm aliases (subject changed)

At 1:20 PM -0600 1/31/11, Bollinger, John C wrote:
>On Friday, January 28, 2011 9:23 PM, Herbert J. Bernstein wrote:
>>    I am sorry to hear that John Bollinger "cannot agree to the expanded key
>>for the ALIAS category" because he believes "This is not a philosophical
>>question, but rather one of correctly modeling the data domain.
>>Furthermore, this particular question also has nothing to do with
>>macromolecular data processing.  DDLm is a language for writing
>>*dictionaries*.  Macromolecular data CIFs will not contain items from the
>>ALIAS, DICTIONARY_XREF, or IDENTIFIER_SET categories, nor from any other
>>DDLm category."
>>    This is remarkably strange coming from someone who was already pressing
>>to expand the ALIAS category with the xref_code object, and who seems
>>unaware that the only reason for any need to expand the key of ALIAS is to
>>allow the denormalized form that he and David were originally asking for.
>I strive always for precision and clarity in my writing, but 
>apparently I have not been as successful as I had thought.  Allow 
>me, then, to clarify:
>1) Having a composite key does not automatically make a category 
>denormalized, and in particular, using ( _alias.xref_code, 
>_alias.definition_id ) as the key for category ALIAS is not a 
>denormalization.  It represents modeling the ALIAS entity as a 
>particular data name *as defined in a particular dictionary*.  The 
>dictionary identifier in which the alias data name is defined thus 
>serves as a namespace for the data name, which seems perfectly 
>natural to me.  For what it's worth, there is a strong analogy with 
>XML qualified data names.

This discussion began with adding what we were then calling styles
to group related sets of tags.  One tag could have multiple styles.
In normalized form, that would mean creating a new relation with
the tags and the styles as components of a composite key, so the
same tag could be repeated with multiple styles and the same
style could be repeated with multiple tags.
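
As an illustration only, the normalized form described above can be sketched as a separate relation whose composite key is the pair (tag, style). The tag and style names below are hypothetical, and Python is used purely for convenience; nothing here is taken from an actual dictionary.

```python
# Hypothetical tags and styles: a minimal sketch of a many-to-many relation.
# Each row is one (tag, style) pair, and the pair as a whole is the key,
# so a set of tuples enforces key uniqueness automatically.
tag_style = {
    ("_example.tag_a", "style_1"),
    ("_example.tag_a", "style_2"),  # same tag, second style
    ("_example.tag_b", "style_1"),  # same style, second tag
}


def styles_of(tag):
    """Return the sorted styles attached to a tag."""
    return sorted(s for t, s in tag_style if t == tag)


def tags_in(style):
    """Return the sorted tags grouped under a style."""
    return sorted(t for t, s in tag_style if s == style)
```

Because neither component alone is the key, a tag may repeat across styles and a style may repeat across tags without any row being duplicated.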

Placing that directly in the alias category instead of
in a separate relation _is_ a denormalization.  You happen to
have preferred to use the xref_code, but adding that to the
alias category key is and was a denormalization.  In CIF, until
now at least, COMCIFS has tried to maintain a global name space,
with a given tag having one meaning across multiple dictionaries.
That is why there is a prefix registration system, so adding
the dictionary to the alias key should not be necessary.
If you want to use CIF like XML, it is the prefix, rather than
the dictionary that gives you the equivalent of XML qualified
data names.

>If data names were assumed to be defined in one dictionary only, 
>globally, then xref_code would not need to be part of the key. 
>Under the latest proposal's specification that every data name is 
>implicitly aliased to itself, however, that assumption would prevent 
>definitions from being aliased to the same data name defined in some 
>other dictionary.  (Otherwise there would be a duplicate category 
>key.)  Moreover, the assumption is patently unsafe, as anyone is 
>free to write his own dictionary defining any data name he wants, 
>any way he wants.  We could ignore that possibility, and *probably* 
>it would not later cause us pain, but I'm a purist: I would rather 
>the model be fundamentally correct and thus foreclose any 
>possibility of future pain arising from faulty assumptions.
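
The duplicate-key concern in the quoted paragraph can be sketched as follows. This is a simplified model, not DDLm validation code: the rows, the data name `_example.name`, and the dictionary codes `dict_a` and `dict_b` are all hypothetical.

```python
def find_duplicate_keys(rows, key_fields):
    """Return the set of key tuples that occur more than once in rows."""
    seen, dups = set(), set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return dups


# Hypothetical ALIAS rows attached to one definition of _example.name
# in a dictionary with the (made-up) code "dict_a".
alias_rows = [
    # implicit self-alias of the data name
    {"definition_id": "_example.name", "xref_code": "dict_a"},
    # alias to the same data name as defined in another dictionary
    {"definition_id": "_example.name", "xref_code": "dict_b"},
]

# Key on the data name alone: the two rows collide.
single_key = find_duplicate_keys(alias_rows, ["definition_id"])

# Key on (xref_code, definition_id): the rows are distinct.
composite_key = find_duplicate_keys(alias_rows, ["xref_code", "definition_id"])
```

Under these assumptions `single_key` contains one colliding key while `composite_key` is empty, which is the situation the paragraph describes.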

The idea in CIF is that you _don't_ use the same tag name with
different meanings in different dictionaries, but with the introduction
of DDL2 and mmCIF we ended up with 2 versions of the same core definitions
having the same meanings but different tag names.  Thus we needed to
have aliases to relate the DDL2 dotted notation versions of the
tags to the DDL1 undotted notations of the tags.
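
A minimal sketch of that kind of alias lookup, assuming a simple in-memory table: the two name pairs are given only as examples of the dotted/undotted pattern, and the helper names `to_ddl1` and `to_ddl2` are hypothetical.

```python
# Illustrative alias table: DDL2 (dotted) name -> DDL1 (undotted) name.
ALIASES = {
    "_atom_site.fract_x": "_atom_site_fract_x",
    "_cell.length_a": "_cell_length_a",
}

# Reverse table, built once, for the opposite direction.
DDL1_TO_DDL2 = {ddl1: ddl2 for ddl2, ddl1 in ALIASES.items()}


def to_ddl1(tag):
    """Map a dotted tag to its undotted alias; unknown tags pass through."""
    return ALIASES.get(tag, tag)


def to_ddl2(tag):
    """Map an undotted tag to its dotted alias; unknown tags pass through."""
    return DDL1_TO_DDL2.get(tag, tag)
```

The table is one-to-one here because each tag is taken to have a single meaning, so only the spelling differs between the two notations.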

>Alternatively, if globally all definitions of each data name were 
>assumed equivalent, then the ALIAS category would not need and 
>should not have xref_code or dictionary_uri at all, nor any other 
>attribute specific to a particular dictionary in which the alias 
>name is defined.  Such attributes would better go into a separate 
>category.  Perhaps this is the assumption from which Herbert is 
>working, as in this view my proposed structure of the ALIAS category 
>is indeed denormalized.  If that were the view that the group chose 
>to adopt, then I would have to withdraw my support for all versions 
>of the proposal that so far have been offered, including my own. 
>However, the assumption underlying this view is exactly as unsafe as 
>the one underlying the single global definition view above, so 
>again, I prefer to reject it to foreclose any possibility of future 
>pain arising from faulty assumptions.  Furthermore, the additional 
>category bearing per-dictionary attributes for each alias would look 
>very much like the proposed ALIAS category now does, so taking this 
>approach saves us nothing yet requires a more complex dictionary.
>As for adding definition_set_id to the key, on the other hand, that 
>*is* a denormalization.  An alias can be partially characterized by 
>the dictionary in which it is defined, but the definition set(s) in 
>which it appears are not defining characteristics.

Yes, it is precisely the same type of denormalization as adding
xref_code to alias.

>Denormalizing a dictionary, including DDLm itself, creates practical 
>problems.  If we assume the denormalized definition to represent the 
>same underlying validity constraints as the normalized one, then 
>software must implement special case code for affected categories to 
>validate instance documents correctly.  To put it a different way: 
>DDLm would not be complete if ALIAS or any other category were 
>denormalized.  On the other hand, if we do not assume any special 
>case logic then the resulting dictionary is not denormalized after 
>all; instead, it represents a different data model, with different 
>validation rules.
>>    It is even more remarkably strange to hear the view that "this
>>particular question has nothing to do with macromolecular data
>>processing."  The _only_ reason DDL2 exists at all was to allow for the
>>creation of mmCIF.
>2) Are there then no dictionaries other than mmCIF written using 
>DDL2?  Is there no interest in DDLm having any applicability to 
>fields other than macromolecular crystallography?  Are we shutting 
>out even the small-molecule and powder communities?  DDLs 1, 2, and 
>m are data modeling languages designed, to various degrees, to have 
>rich semantics convenient for defining scientific data.  I have 
>always thought it apropos that its initialism is the same as that of 
>SQL's "Data Definition Language" subset, as the two are similar in 
>purpose and scope.  Nothing in DDL2 or DDLm is inherently specific 
>to macromolecular crystallography.

I gave the historical fact -- without the creation of mmCIF, we would
only have had DDL1, not DDL2.

>Let me restate my point, then: this particular question has nothing 
>*directly* to do with macromolecular data processing.  I was writing 
>in response to the discussion of using denormalized data 
>presentation for data harvesting purposes.  DDLm defines the 
>language in which the *dictionary* is written, not the language in 
>which *data* are written, therefore denormalizing it as Herbert has 
>lately proposed would have no effect on the validity of presenting 
>macromolecular data in denormalized form.

I would be very happy having fully normalized DDLm dictionaries, but
I can cope with denormalized dictionaries, just as I have to cope
with denormalized datafiles -- indeed, for some search procedures,
I deliberately denormalize dictionaries internally.  It
sounds like John B. wants to stick to fully normalized DDLm dictionaries.
While this has some impact on software developers, it has very little
direct impact on users -- so what do people think:

   Should all DDLm dictionaries be fully normalized (if so, to which level
of normalization) or

   Should DDLm dictionaries be allowed the same flexibility as
data files in being denormalized?

>>    The only point of having the dictionaries and the various DDLs is to
>>support the data domains, and if we cannot ground features in the needs of
>>those domains we really should consider dropping those features.
>By that logic, then, we should not include definition_set_id in the 
>ALIAS category key.  It does nothing to serve the needs of the data 
>domains.  At best, it provides a convenience for a subset of 
>dictionary authors.
>I have already agreed that there is a more fundamental question of 
>whether DDLm validation rules should generally support denormalized 
>presentation, and if so, with what semantics.  If it ultimately does 
>support that, then addition of definition_set_id to the ALIAS key 
>serves no purpose whatever.  Either way, it is important to choose 
>the right key for this and every category, for key choice embodies 
>some of the validation rules.

I agree that the matter goes back to whether to normalize or not, which
is why I have used the word denormalize so much.

>>    So, returning to actually getting work done -- if David needs similar
>>features to support definition sets up at the ALIAS category level then my
>>proposal is a reasonable way to do both that and to support the more
>>normalized form I will be using. If David is not going to be using such
>>features for the core, we can leave out the ability to do the denormalized
>>form for now.
>>    So, there seeming to be nothing left of substance in this discussion
>>other than matters of taste,
>Again technical disagreement is discounted as an insubstantial 
>difference in taste?  I assume that I did not previously present my 
>technical position clearly, and I hope that I rectified that failing 
>>  could we please choose one of three
>>approaches to my introducing the definition sets:
>We have at least three distinct, albeit related, areas under discussion here:
>A) The attributes and structure of the ALIAS category.  This can be 
>considered separately from definition sets and implemented either 
>with or without them.  There seems to be broader interest here than 
>in definition sets.
>There are several questions to settle in this area:
>1. What is the entity being modeled, and what assumptions are being 
>made about it?  This directs suitable choices of category key, and 
>the key choice could be discussed instead as a proxy.  In 
>particular, this question could be framed as "Should a dictionary 
>identifier be added to the ALIAS category key?"  This influences
>2. whether to expand the ALIAS category to directly or indirectly 
>provide additional attributes, such as those David named.  The 
>several draft proposals from the end of last week all perform such 
>an expansion.  The latter ones provide for all of David's attributes 
>except dictionary version.  These additional questions depend on 
>whether we do expand the ALIAS attributes:
>   2a. If the ALIAS category is expanded more or less along the 
>proposed lines, whether to deprecate _alias.dictionary_uri or to 
>remove it.
>   2b. If the ALIAS category is expanded more or less along the 
>proposed lines, whether to deprecate _definition.xref_code or to 
>remove it.
>   2c. If the ALIAS category is expanded more or less along the 
>proposed lines, should it have an attribute defining the earliest 
>version of its dictionary in which each alias appears?
>If I understand him correctly, Herbert's three closing questions can 
>be couched in these terms (B and C together):
>B) Whether DDLm validation rules should allow parent and child 
>categories to be presented together in a denormalized, joined form. 
>This has potentially far-reaching implications, much beyond aliases 
>and definition sets.
>C) The structure and adoption of definition sets.
>The form and attributes of the proposed definition set categories 
>seem relatively uncontroversial, so the primary question here is
>1. Whether to adopt the definition set categories at all.
>If we agree to adopt definition sets, however, then there is a 
>significant ancillary question:
>2. Should definition_set_id be added to the ALIAS category and to 
>its key (whatever it may be)?  The decision on (B) may factor into 
>opinions on this question.  If we do add definition_set_id to the 
>ALIAS category key, then David has raised an interesting question:
>   2a. Do we need or want a separate ALIAS_DEFINITION_SET category?
>Many of these questions are separable to at least some degree, so I 
>think the best way forward is probably to handle them as separately 
>as possible.  John W. and David both seem to be seeking more 
>information about the definition set concept, and I'm sure the 
>normalization question would bear more discussion.  Perhaps, 
>however, we would be prepared to decide on the alias-specific 
>questions in group A above?
>I think these aspects of those questions are so far agreed without dissent:
>* The ALIAS category should be expanded with additional attributes 
>sufficient to describe the properties David enumerated, +- 
>dictionary version.
>* ALIAS should refer to the existing DICTIONARY_XREF category to 
>provide information about the dictionary(-ies) in which alias data 
>names are defined.  Therefore, an attribute _alias.xref_code should 
>be added.
>* Adding _alias.xref_code to ALIAS makes _definition.xref_code and 
>_alias.dictionary_uri superfluous.  Each of those should be at least 
>deprecated.  (Some would prefer that they be removed.)
>* It is reasonable and appropriate to add an attribute 
>_alias.deprecated as described in the various proposals.
>Does anyone object?  Is it needful or appropriate to call a vote on these?
>These aspects seem still in doubt:
>* Whether xref_code should be added to the ALIAS category key. 
>(This is independent of whether definition_set_id is added.)  I 
>claim it should be.
>* Whether _alias.dictionary_uri should be removed (rather than just 
>deprecated).  I currently prefer that it be removed, but I'm open to 
>the possibility that doing so would bring undue hardship on early 
>adopters.  Who, specifically, would it harm?
>* Whether _definition.xref_code should be removed (rather than just 
>deprecated).  I currently prefer that it be removed, but I'm open to 
>the possibility that doing so would bring undue hardship on early 
>adopters.  Who, specifically, would it harm?
>Is there any further discussion of these questions?  If not, then 
>can we have a vote?

All fascinating questions -- orthogonal to the discussion of styles/sets/...
-- but fascinating.  We should have a meeting some time and discuss them.
Email is a terrible medium for this.

>>P.S. "Data modeling is part and parcel of dictionary authorship, so there
>>is every reason to expect that dictionary authors will be prepared to
>>express their dictionaries in suitably normalized form, according to
>>whatever presentation normalization rules ultimately are adopted for
>>DDLm." is unrealistic for the macromolecular crystallographic community,
>>which has vigorously rebelled against the strictures of DDL2 and seems
>>likely to totally reject anything in DDLm that makes it any more complex
>>and confusing.
>Indeed?  Is it DDL2 that the macromolecular community finds 
>troublesome, or is it the mmCIF dictionary?

Both.  DDLm should help with the complexity of parent-child relationships
in the dictionary by allowing the parent to be fairly ignorant of the
children.  At the heart of the resistance to mmCIF itself is the
complexity imposed by those parent-child relationships, which are
in truth scientifically important.

>  If there is a sore spot here then it would help me, at least, to 
>have more detail than "the community rebels".  In particular, the 
>portion of the community to which DDLm (and DDL2) is directly 
>relevant is dictionary authors and validator developers.  Do we not 
>have multiple representatives of those constituencies within this 
>very working group?  I cannot speak for dictionary authors, but as a 
>validator developer I can certainly say that my life would be a bit 
>easier if denormalized presentation is not considered valid.

I happen to favor normalized presentations, but I assure you that there
are people who are passionately committed to flatter presentations, e.g.
for data harvest.

>I agree that controlling complexity and minimizing confusion are 
>important objectives, but it is by no means obvious to me that 
>allowing for instance documents to be provided in denormalized form 
>would make DDLm or dictionaries based on it any less complex or 
>confusing.  I'm inclined to think the opposite.

>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>ddlm-group mailing list

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

International Union of Crystallography
