Re: [ddlm-group] Proposal to update dREL, part II

I can see nothing in this proposal that in any way undermines the rigorous relational nature of CIF data structures and the way that dREL handles them. Indeed, the proposals build on that relational structure.  If I have missed something, by all means point it out, as I am a big fan of the relational model.

On Tue, 9 Oct 2018 at 20:28, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear Colleagues,

  While there is great value in the approach proposed, there are also
risks.  When Codd first defined relations and relational databases,
many of us (including me) thought he was being overly fussy in rigidly
requiring that the values of a key uniquely specify particular tuples.
It took more than a decade for it to become clear that he was right,
but he was right.  I have no objection to being able to define filters
and other constructs, but somewhere in all of this, we need to be
crystal clear as to what columns uniquely specify rows, or we will find
ourselves back in the bad old days of hierarchical databases lacking
referential integrity instead of having reliable relational databases.
  I know everybody thinks that NoSQL databases are wonderful.  They
are.  But without relational databases to support them our data will
get corrupted.  We need to be sure that somewhere we define category
keys such that we can reliably use those keys to find particular rows
in tables.
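
The uniqueness requirement described above can be checked mechanically. What follows is only a minimal sketch, assuming a loop is held as a list of dictionaries; the function name `duplicate_keys`, the rows and the attribute names are made up for illustration and are not taken from any dictionary or dREL implementation.

```python
from collections import Counter


def duplicate_keys(rows, key_names):
    """Return the key tuples that identify more than one row."""
    counts = Counter(tuple(row[name] for name in key_names) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up loop: (label_1, label_2) is declared as the compound key.
bonds = [
    {"label_1": "O1", "label_2": "C1", "distance": 1.43},
    {"label_1": "O1", "label_2": "C2", "distance": 1.41},
    {"label_1": "O1", "label_2": "C1", "distance": 1.44},  # repeats a key
]

print(duplicate_keys(bonds, ["label_1", "label_2"]))  # [('O1', 'C1')]
```

An empty result means the declared key columns really do pick out single rows; anything else is a row that a key lookup could not find reliably.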

  Regards,
    Herbert
On Tue, Oct 9, 2018 at 2:44 AM James Hester <jamesrhester@gmail.com> wrote:
>
> Dear DDLm group,
>
> Somewhat belatedly, here are some answers to John's questions. I will update the proposal according to the response.
>
> On Thu, 20 Sep 2018 at 04:11, Bollinger, John C <John.Bollinger@stjude.org> wrote:
>>
>> Dear DDLm group,
>>
>> With respect to proposal 3, I agree in principle that the proposed syntax extension seems to yield an improvement, but the details are not completely clear to me.  Specifically,
>>
>> - May the _category.key_id be used in the expanded syntax?  Including if it is not named as a _category_key.name?
>
>
> _category_key.name always contains the data names forming the key of the category. _category.key_id identifies a data name whose value can also be used as a key. I propose that the data name given by _category.key_id is never used when implicitly resolving references in the new syntax. Otherwise both _category_key.name and _category.key_id could each contain a single data name, and it would be ambiguous which of them a single supplied value refers to.
>
>> - More generally, which attributes are permitted to be used to index a category?  Must they be among those whose names are listed in the category’s own _category_key.name attribute, or is this to become a more general facility?
>
>
> As far as syntax is concerned, there is no need to restrict the particular attributes that appear. When explicitly giving the attributes and their values, I was assuming that only key data names would be useful for indexing into a category, as otherwise more than one row could be identified and the result would not be a single packet.  On the other hand, allowing arbitrary attributes introduces the possibility of 'filtering' a category by values of attributes.  A category filtered in this way would remain a category object that could be looped over.  While there are some technical implementation complexities that would be introduced due to the result not always being a single row packet, this is probably heavily outweighed by the economy with which certain concepts could be expressed.  Any thoughts?
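
The difference between indexing and filtering can be sketched in a few lines of Python. This is an illustration only, assuming a category is a list of dictionaries; `select` and `single_packet` are hypothetical helper names, not part of dREL or of any existing implementation, and the rows are made up.

```python
def select(rows, **criteria):
    """Filter: keep every row whose attributes equal all the given values."""
    return [
        row for row in rows
        if all(name in row and row[name] == value
               for name, value in criteria.items())
    ]


def single_packet(rows, **criteria):
    """Index: like select(), but insist that exactly one row matches."""
    matches = select(rows, **criteria)
    if len(matches) != 1:
        raise LookupError(f"{criteria} matched {len(matches)} rows, expected 1")
    return matches[0]


# Made-up atom_site rows; 'label' plays the role of the category key.
atom_site = [
    {"label": "O1", "type_symbol": "O"},
    {"label": "O2", "type_symbol": "O"},
    {"label": "C1", "type_symbol": "C"},
]

oxygens = select(atom_site, type_symbol="O")  # two rows, still loopable
o1 = single_packet(atom_site, label="O1")     # exactly one packet
```

Filtering on a non-key attribute returns something that can still be looped over, whereas a single packet is only guaranteed when the attributes supplied form a key.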
>
>>
>> - Is it necessary to specify a complete key when this syntax is used for a category with a compound natural key?
>
>
> If arbitrary attributes are allowed, and therefore the result may not be a single packet, then the complete key would need to be specified whenever a single packet is wanted. If the idea of having arbitrary attributes is not attractive, then we can allow missing key attributes to be deduced based on the parent-child hierarchy. At the moment I am inclined to prefer explicitly identifying every attribute if any are specified.
>
>>
>> - In the proposed syntax, are the key names given as simple attribute names or as full CIF item names?
>
>
> As attribute names only.
>>
>> ----
>>
>> With respect to proposal 4, I agree with the general idea that dREL should prefer to avoid requiring method implementations to explicitly express category keys that can reliably be determined from context.  How that applies here depends to some extent on proposal 3, however.
>>
>> Additionally, before considering going forward with this proposal, I think we need to describe more formally the cases in which the key values can be conveyed implicitly.  For example, the description remarks that “this short cut is not possible where more than one data name is linked to the same category key”, but I’m not confident that I know how to recognize all such cases programmatically.
>
>
> How about "an attribute=value specification may be elided in those cases where a single attribute of the category within which the dREL method resides is linked to a key attribute of the category being referenced. Two attributes are "linked" to each other if a common attribute is reachable from both attributes by following '_name.linked_item_id' references."
>
> Programmatically I think it is just following '_name.linked_item_id' links as far as possible from both attributes, and if and only if they end up at the same attribute then the two are considered linked.
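
The link-following test described above might look like the following in Python. It is a sketch under stated assumptions: the dictionary `links` stands in for the `_name.linked_item_id` declarations (data name mapped to the name it points at), the function names are hypothetical, and the entries are illustrative rather than copied from a dictionary.

```python
def ultimate_parent(name, linked_item_id):
    """Follow linked_item_id references from `name` as far as possible."""
    seen = {name}
    while name in linked_item_id:
        name = linked_item_id[name]
        if name in seen:
            raise ValueError(f"circular link chain through {name}")
        seen.add(name)
    return name


def are_linked(name_a, name_b, linked_item_id):
    """Two data names are linked iff their link chains end at the same name."""
    return (ultimate_parent(name_a, linked_item_id)
            == ultimate_parent(name_b, linked_item_id))


# Illustrative stand-in for _name.linked_item_id declarations.
links = {
    "_geom_bond.atom_site_label_1": "_atom_site.label",
    "_geom_bond.atom_site_label_2": "_atom_site.label",
}

print(are_linked("_geom_bond.atom_site_label_1", "_atom_site.label", links))  # True
print(are_linked("_geom_bond.distance", "_atom_site.label", links))           # False
```

Because each data name has at most one `linked_item_id`, comparing the ends of the two chains gives the same answer as asking whether a common attribute is reachable from both.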
>>
>> Also relevant: are we assuming that linked items are always [components of] their categories’ keys?  Does anything break under this proposal if non-key attributes are linked?
>
>
> A key data name may only be elided if it is linked to an attribute of the category of the method that is performing the lookup, so the lack of a link does not break anything: in that case, the attribute used for lookup must be stated explicitly.
>>
>> Best,
>>
>> John
>>
>> --
>> John C. Bollinger, Ph.D.
>> Computing and X-Ray Scientist
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>> John.Bollinger@StJude.org
>> (901) 595-3166 [office]
>> www.stjude.org
>>
>> From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
>> Sent: Monday, September 17, 2018 5:39 PM
>> To: ddlm-group <ddlm-group@iucr.org>
>> Subject: [ddlm-group] Proposal to update dREL, part II
>>
>>
>>
>> It appears that after preparing part II I completely forgot to send it to the group.  The marked-up version of this second proposal is available at https://github.com/COMCIFS/dREL/blob/master/drel_changes_2.rst
>>
>> Proposed changes to dREL, part II
>> =================================
>>
>> Introduction
>> ------------
>>
>> dREL is a machine-actionable language describing data relationships
>> and designed to be embedded in DDLm dictionaries. The language is
>> defined both explicitly in the dREL publication [1] and implicitly by
>> the dREL code appearing in the DDLm core CIF dictionary. Note that
>> the code in the core CIF dictionary significantly expands the language
>> presented in the paper, for example, by adding category methods.
>>
>> The present changes were foreshadowed in the discussion about allowing
>> set methods to become looped [2].  They are aimed at removing the
>> current dREL-imposed requirement that all categories must have a
>> single data name that acts as a key.
>>
>> Proposal 3: compound key specification
>> --------------------------------------
>>
>> dREL as published permits a particular row in a loop to be specified
>> by providing the value of the key for that loop using the syntax
>> ``<category>[keyvalue]``, so for example, ``atom_site['O1']`` would be the
>> row in the atom_site loop for which ``_atom_site.label`` (the key data
>> name for category ``atom_site``) is 'O1'.  We propose expanding
>> this syntax to allow multiple key values to be specified:
>> ``<category>[name1=value1,name2=value2]`` would specify the row of
>> ``<category>`` for which category objects ``name1`` and ``name2`` take
>> values of ``value1`` and ``value2`` respectively.
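
As a rough illustration of the two indexing forms, here is a small Python model. It is not the transformation used by any dREL implementation: the `Category` class, its `row` method and the `model_site` data are assumptions made for this sketch. Python does not accept keyword arguments inside square brackets, so the proposed `[name1=value1,name2=value2]` form is modelled as a method call.

```python
class Category:
    """A looped category indexed by one or more key data names."""

    def __init__(self, key_names, rows):
        self.key_names = tuple(key_names)
        self._by_key = {}
        for row in rows:
            key = tuple(row[name] for name in self.key_names)
            if key in self._by_key:
                raise ValueError(f"duplicate key {key}")
            self._by_key[key] = row

    def __getitem__(self, keyvalue):
        """Published form: category[keyvalue], single-key categories only."""
        if len(self.key_names) != 1:
            raise KeyError("a bare value is ambiguous for a compound key")
        return self._by_key[(keyvalue,)]

    def row(self, **named):
        """Proposed form: category[name1=value1, name2=value2]."""
        if set(named) != set(self.key_names):
            raise KeyError(f"expected exactly the keys {self.key_names}")
        return self._by_key[tuple(named[name] for name in self.key_names)]


atom_site = Category(["label"], [{"label": "O1", "type_symbol": "O"}])
print(atom_site["O1"]["type_symbol"])  # O

# Hypothetical two-key category, e.g. after a second key has been acquired.
model_site = Category(
    ["model_id", "label"],
    [{"model_id": 1, "label": "O1", "fract_x": 0.25},
     {"model_id": 2, "label": "O1", "fract_x": 0.26}],
)
print(model_site.row(model_id=2, label="O1")["fract_x"])  # 0.26
```

Naming each key value removes any dependence on the order of the keys, which is why no synthetic single data name is needed in the two-key case.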
>>
>> Explanation
>> ~~~~~~~~~~~
>>
>> The current core CIF dictionary treats multi-key categories by
>> defining a synthetic data name for each such category. These synthetic
>> data names are currently just a list of the values of the multiple
>> keys. Having such single-dataname keys allows the dREL syntax to
>> be unambiguous for all Loop categories.
>>
>> This approach is suboptimal because:
>>
>> (1) The synthetic data names have no scientific relevance
>> (2) A considerable amount of DDLm machinery has been developed simply
>>     because of the resulting inhomogeneous lists. Without
>>     these synthetic data names, there would be *no* need in the current
>>     core dictionary for ragged and nested dimensions and multiple
>>     data types within a single list, and therefore no requirement
>>     for DDLm and dREL implementors to cope with such structures.
>> (3) dREL methods wishing to index into a multi-key category have to
>>     construct the synthetic keys from the individual values; the new
>>     syntax would save that line of boilerplate
>> (4) If a set category becomes looped, a number of looped categories
>>     will acquire a new key data name. If single-key loops remain a
>>     dREL requirement, previously single-key loops will require a new,
>>     synthetic data name to be created. Note that it could be argued
>>     that this is the way the system was designed to work.
>>
>> The previous syntax will still be acceptable in those situations where
>> there is a single key, or where the values of the remaining keys are
>> unambiguous in context (see next proposal).
>>
>> This proposed syntax has been included in the example EBNF for dREL
>> and the transformation to Python code implements the proposed semantics.
>>
>> Proposal 4: elide keys where they are clear from context
>> --------------------------------------------------------
>>
>> If category A contains data names which are parents or children of key
>> data names in category B, dREL methods in category A do not need to
>> explicitly specify the key values of category B when accessing rows of
>> category B.
>>
>> Explanation
>> ~~~~~~~~~~~
>>
>> If b.k1 and b.k2 are the keys of category B, and data names A.a1 and
>> A.a2 are linked through ``_name.linked_item_id`` DDLm declarations to
>> those keys, then any dREL method in category A can simply write ``b.d3``
>> to access a specific value of dataname ``d3`` in category ``b``.  This is
>> equivalent to writing ``b[k1=a.a1,k2=a.a2].d3`` under proposal 3.
>>
>> Note that this short cut is not possible where more than one data name
>> is linked to the same category key, for example, in ``geom_bond``
>> two data names are linked to ``atom_site.label``.
>>
>> Note that partial resolution of data names is also possible, so that
>> key references that are missing from the original form may be resolved
>> using linked data names.
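
The expansion of an elided reference into the explicit form could be sketched as below. The function `expand_reference` and its arguments are hypothetical: `links` is assumed to have been prepared beforehand (for example by following `_name.linked_item_id` declarations) and maps each key of the target category to the data names of the current category that are linked to it.

```python
def expand_reference(target_keys, links, current_row, explicit=None):
    """Return the full {key: value} specification for the target category.

    target_keys: key attribute names of the category being referenced
    links:       key name -> list of linked attributes in the current category
    current_row: the row the dREL method is being evaluated for
    explicit:    key values written out in the method, if any
    """
    resolved = dict(explicit or {})
    for key in target_keys:
        if key in resolved:
            continue  # stated explicitly, nothing to deduce
        sources = links.get(key, [])
        if len(sources) != 1:
            raise LookupError(
                f"key {key!r} cannot be elided: "
                f"{len(sources)} linked data names"
            )
        resolved[key] = current_row[sources[0]]
    return resolved


# b.d3 written inside category a, with a.a1 -> b.k1 and a.a2 -> b.k2.
row_a = {"a1": "x", "a2": 7}
links_ab = {"k1": ["a1"], "k2": ["a2"]}
print(expand_reference(["k1", "k2"], links_ab, row_a))
# {'k1': 'x', 'k2': 7}, i.e. b[k1=a.a1,k2=a.a2]

# Partial resolution: k1 given explicitly, k2 deduced from the link.
print(expand_reference(["k1", "k2"], links_ab, row_a, explicit={"k1": "y"}))
# {'k1': 'y', 'k2': 7}
```

When two data names are linked to the same key, as in the ``geom_bond`` case mentioned above, the list of sources has two entries and the sketch raises an error rather than guessing.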
>>
>> Discussion
>> ----------
>>
>> The net result of the above two proposals is to make looping Set
>> categories relatively painless. A dREL reference like ``cell.vector_a``
>> may remain untouched when multiple cells are present, as long as the
>> category within which the dREL method appears has only a single
>> data name that is a child of the single key data name of ``cell``.
>>
>> However, in situations where the ``<category>[value]`` syntax has
>> been used and ``<category>`` acquires a new key data name because
>> some other category has become looped, dREL methods will need
>> to be rewritten to explicitly specify the key data name that
>> ``value`` corresponds to.  Going forward, the ``[key=value]``
>> syntax should be preferred to minimise the need to rewrite
>> methods in advanced looping applications.
>>
>> We should also be aware that the dREL methods in our dictionaries are
>> curated, and therefore we can apply style guidelines to prefer the
>> explicit notation of proposal 3 as we see fit.
>>
>> [1] Spadaccini et al. (2012) *J. Chem. Inf. Model.* **52**(8) pp 1917-1925
>>
>> [2] https://github.com/COMCIFS/comcifs.github.io/blob/master/looping_proposal.md
>>
>> --
>>
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
