Re: [ddlm-group] Adding a DDLm attribute for uniqueness
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Adding a DDLm attribute for uniqueness
- From: James Hester <jamesrhester@gmail.com>
- Date: Mon, 17 Feb 2020 11:06:18 +1100
- In-Reply-To: <CALHYoX6St7GzJociroTv=DMpsMovpuot5+Zkq6Zk5dh2Gva-LA@mail.gmail.com>
- References: <CALHYoX6573gXqabRS0TwY5O0-wVtexjVrWs9KZi2jpH2u_Tm8A@mail.gmail.com> <CAM+dB2crUCGAD+fUVgG38OFujxfNLc-Rc5r-6Cbhc6FijsJBuw@mail.gmail.com> <CALHYoX6St7GzJociroTv=DMpsMovpuot5+Zkq6Zk5dh2Gva-LA@mail.gmail.com>
Dear DDLm group,
I think there may be a way to approach uniqueness that resolves enough issues to make it worthwhile. I suggest we view 'uniqueness' through a functional lens: what we really mean when we say that the values of a data name are unique is that there is a one-to-one mapping from that data name to the keys of the category (of course, there is already a one-to-one mapping from the keys to the data name by definition of "key"). So if we explicitly specify the data names whose collective values are unique for a given value of our 'unique' data name, we have specified 'uniqueness' in a way that is immune to expansions or changes in the category key. We are also not limited to relationships involving keys.
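To make the functional view concrete, here is a minimal sketch in Python (entirely illustrative; the function name and the toy loop data are invented for this sketch and are not part of any proposal) of what the one-to-one mapping would mean operationally: every value of the defined data name must correspond to exactly one combination of values of the listed data names.

# Hedged sketch: checks that each value of `defined` maps to exactly one
# combination of values of the `targets` columns, i.e. a one-to-one mapping
# from the defined data name onto those columns. Column names and row data
# are invented for illustration.

def maps_one_to_one(rows, defined, targets):
    """Return True if each value of `defined` corresponds to a single
    combination of values of the `targets` data names."""
    seen = {}
    for row in rows:
        value = row[defined]
        key = tuple(row[t] for t in targets)
        if seen.setdefault(value, key) != key:
            return False          # same value, different key combination
    return True

# Toy SPACE_GROUP_SYMOP-like loop: operation_xyz should pick out a unique id.
rows = [
    {"_space_group_symop.id": "1", "_space_group_symop.operation_xyz": "x,y,z"},
    {"_space_group_symop.id": "2", "_space_group_symop.operation_xyz": "-x,-y,-z"},
]
print(maps_one_to_one(rows, "_space_group_symop.operation_xyz",
                      ["_space_group_symop.id"]))   # True

Note that this check only fails when the same value of the defined data name points at two different key combinations; it does not itself require the defined data name to be a key.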
A rough draft of a definition for a new DDLm category and attribute might be:

save_ENUMERATION_UNIQUE

    _definition.id          ENUMERATION_UNIQUE
    _definition.scope       Category
    _definition.class       Loop
    _description.text
;
     This category lists the data names whose collective values are
     uniquely distinguished by the value of the defined data name.
;
    loop_
      _description_example.case
      _description_example.detail
;
     loop_
       _enumeration_unique.dataname
         '_space_group_symop.id'
;
;
     This fragment could be added to the '_space_group_symop.operation_xyz'
     definition to show that each symmetry operation corresponds to a unique
     _space_group_symop.id value.
;

save_

save_enumeration_unique.dataname

    _definition.id          '_enumeration_unique.dataname'
    _definition.class       Attribute
    _description.text
;
     The data names listed in this loop have collective values that are
     uniquely distinguished by the value of the defined data name. If these
     data names coincide with the key of the category, each value of the
     defined data name will be unique.
;
    _name.category_id       enumeration_unique
    _name.object_id         dataname
    _type.container         Single
    _type.contents          Tag

save_
What do we think about this?
On Sat, 15 Feb 2020 at 02:17, Antanas Vaitkus <antanas.vaitkus90@gmail.com> wrote:
Dear James,

thank you for the answer. I will do my best not to waste your time with never-ending discussions, but I feel that a few of my previous comments require some clarification.

On Fri, 14 Feb 2020 at 11:52, James Hester <jamesrhester@gmail.com> wrote:

See inline comments below.

On Tue, 11 Feb 2020 at 02:30, Antanas Vaitkus <antanas.vaitkus90@gmail.com> wrote:

Dear DDLm maintainers,

thank you for allowing me to join the discussion. I will combine my answers to the two previous posts in a single e-mail.

Great to have you in the discussion!
> The troubling part of this is "unique within a loop". The handling of
> relational keys is complex but clear, because categories are well-defined.
> The content of a loop beyond the relational model is not clear without much
> more information, especially for numeric data and unicode data, both of which
> come with major ambiguities in terms of uniqueness.
The proposed uniqueness constraint does not introduce any new ambiguities in terms of value uniqueness. The '_category_key.name' data item already allows data items of any type to be used and, as a result, requires the validating program to handle composite unique keys. In addition, in some cases even the '_category.key_id' data item references items that allow Unicode values (i.e. '_atom_site.label' in the ATOM_SITE category).

I agree with you that the same uniqueness challenges are already faced by the requirement that a key is unique within its column. A new uniqueness challenge not faced by keys is how to deal with missing ('?') data values. We do not know, in principle, whether these are unique or not. If we assume that they correspond to unique values, then there is not much point having validation checks for such a column in general, as we are prepared to accept uniqueness without checking. If we assume that they could be duplicates, then we can have no missing data in such a column. How we choose between these options would need to be part of the attribute definition, but may depend on the data name being defined. Perhaps similar considerations apply to '.', which is a particular value and so in theory should only occur once in a 'unique' column.

As far as I understand, DDLm does not explicitly forbid key data items from having unknown ('?') or inapplicable ('.') values, and, as a result, the challenge of handling these special values in the context of uniqueness still applies. For example, it was common practice (at least in some software) to place an inapplicable ('.') value instead of '1_555' for certain symmetry data items ('_geom_bond_site_symmetry_1', '_geom_angle_site_symmetry_3', etc.). This was probably done because '1_555' was the explicit default value and DDL1 did not consider these items part of the mandatory loop reference. However, DDLm now properly includes them in composite keys, i.e. the key of the GEOM_BOND category consists of the '_geom_bond.atom_site_label_1', '_geom_bond.atom_site_label_2', '_geom_bond.site_symmetry_1' and '_geom_bond.site_symmetry_2' data items. There are many legacy CIFs like that in the wild, so it would be really useful to have an official interpretation of how such values should be handled.

My current approach during a uniqueness check is to silently skip key values that contain at least one special-value component. As you mentioned, this approach does not guarantee total key uniqueness, but it at least allows duplicates without special values to be detected (still better than nothing). I would be happy to conform to any official guidelines, though, once these are established.
> The situation gets even more confusing when trying to make a database from
> multiple entries. We add keys precisely to allow for duplication of existing
> keys. How will we handle these new pseudo-keys? I would suggest that any
> proposal be presented with a clear view of how we will handle databases
> without breaking the new proposed constraints
Each CIF data block can be viewed as a small relational database. In order
to store data from several such data blocks in a single database, one would
still need a column which maps values to their original data blocks. For
example, in order to store atom information from multiple data blocks,
the table would need a column that references the original data block,
i.e. an integer key which acts as a foreign key to the file/entry table.
If such key indeed exists in the table, then it can be combined with the unique
column(s) ("pseudo-key") to produce a new unique key. This new unique key should
be used instead of the one defined in the dictionary when dealing with databases
(atom labels may not be unique across several data blocks, but the combination
of an atom label and a data block identifier still retains uniqueness).

I believe it is considered bad practice to use a natural key as the key of a relational database, because a future duplication of the natural key value that is consistent with its 'natural' semantics would become impossible; you would therefore lose robustness in the database. So I don't accept this particular argument for adding a uniqueness attribute. CIF has historically made this mistake a few times, with semantically significant atom labels, symmetry operators and hkl values being used as category keys. We do not want to repeat this mistake (I hope). Going forward, we would only use keys that are explicitly designated as such and are defined for the sole purpose of being keys.

I fully agree that proper relational tables should have artificial primary keys, and it's great to see this being implemented in DDLm. However, artificial primary keys do not prevent one from retaining the uniqueness of other columns -- that's what the SQL UNIQUE constraint is for. Just because a value is declared as unique does not automatically mean that this won't change in future schema releases or that the column should be used as a joining key. As a result, the uniqueness constraint can be dropped without compromising the robustness of the database or the validity of existing data.
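As a purely hypothetical illustration of the last two points -- a surrogate primary key used for joins, with the 'natural' uniqueness carried by a separate UNIQUE constraint that could later be dropped without touching the join structure -- here is a small sketch using Python's sqlite3 module (all table and column names are invented):

# Hedged sketch: surrogate integer primary key used for joins, while the
# combination of data block and atom label carries the 'natural' uniqueness
# via a UNIQUE constraint.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entry (
        entry_id   INTEGER PRIMARY KEY,
        block_code TEXT NOT NULL
    );
    CREATE TABLE atom_site (
        atom_site_id INTEGER PRIMARY KEY,            -- artificial key for joins
        entry_id     INTEGER NOT NULL REFERENCES entry(entry_id),
        label        TEXT NOT NULL,
        UNIQUE (entry_id, label)                     -- pseudo-key + block id
    );
""")
con.execute("INSERT INTO entry (entry_id, block_code) VALUES (1, 'block_A')")
con.execute("INSERT INTO atom_site (entry_id, label) VALUES (1, 'C1')")
try:
    con.execute("INSERT INTO atom_site (entry_id, label) VALUES (1, 'C1')")
except sqlite3.IntegrityError as err:
    print("duplicate rejected:", err)   # UNIQUE constraint failed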
A DDLm alternative to this would be:

1. Use dedicated data items as artificial primary keys (as is currently done). These items are intended to be used for category (table) joins and are promised to always remain unique;
2. Define some of the looped items as unique if needed. Do not use them as loop references anywhere else;
3. If required, the constraint can be dropped in several ways:
   3.1. By releasing a new version of the dictionary. This neither breaks the existing schema nor makes the existing files invalid.
   3.2. By overriding the category/item definition in an importing dictionary. This does not break the existing schema. Files created in accordance with the original dictionary would not be valid under the importing dictionary, but, well, that is to be expected -- they are different dictionaries after all.

> The proponent is aware of the currently available attributes for category keys.
> I believe this proposal is aimed at providing further checks in software for
> data names that are not category keys but are also supposed to be unique,
> the canonical example being symmetry operators. My objection is that expansion
> dictionaries can remove this uniqueness, e.g. listing magnetic symmetry operations
> as spatial symmetry operations + magnetic symmetry operations might involve
> repeating symmetry operations. We have developed an approach in DDLm to
> handle this for expanding category keys (the _audit.schema data name) but
> dealing with this for an independent uniqueness attribute seems to be a bit
> messy and I don't really see the benefit of that extra definitional work.
In general, the uniqueness constraint seems like a useful feature to have
when curating data or constructing ontologies. Most relational databases,
XML Schema and even the recently defined JSON Schema all have equivalent
constraints. I fully understand the fear that people will not track the removal
of such constraints across dictionaries. However, there is also no guarantee
that people will honour the '_audit.schema' data item. Hopefully, as long as
there are well-behaved open implementations of a DDLm validator, they can be
used as a reference by other programmers dabbling in DDL/CIF.
> The other thing I've pointed out is that ad-hoc uniqueness checks can be
> coded in dREL and placed in a dictionary of data names to be used for
> validation.
dREL is a powerful tool, but in this case it introduces slight complexity and does not really solve the underlying problem. The dREL methods can still (probably?) be overridden in other dictionaries, and although a dREL method delivers the desired final result, it does so in a slightly less standardised manner. Reading a fixed tag/keyword is much simpler than automatically analysing actual code.

It is true that a dREL-based check for uniqueness is more complex. It is also more flexible, as a particular validation suite of dREL checks can decide whether or not uniqueness should apply in a particular context, rather than have that uniqueness attached to a data name. It can also decide how to handle 'missing' on a case-by-case basis.

Yes, this allows a lot of fine tuning for applications that know how to interpret dREL. While this can reasonably be expected of CIF validators, it is not unusual for other CIF-handling programs to simply hardcode constraints extracted from a specific dictionary version, and this is where the complexity added by dREL really begins to matter.

It is probably not possible to automatically determine what a given dREL snippet is intended to do. Of course, for a person the intent is much clearer, but it still requires significantly more effort and dREL expertise to deduce the fact that the data item values should be unique.
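To make the 'fixed tag versus code analysis' contrast concrete, a program that does not interpret dREL could harvest declared constraints with something like the sketch below. The nested-dictionary representation of a parsed dictionary is an assumption made purely for illustration; a real implementation would work on whatever its CIF parser produces.

# Hedged sketch: collect declared uniqueness constraints from a parsed DDLm
# dictionary. The layout (definition id -> attribute -> value list) is an
# invented stand-in for real parser output.

def collect_unique_constraints(parsed_dictionary):
    """Map each defined data name to the data names it is declared to
    uniquely distinguish via _enumeration_unique.dataname."""
    constraints = {}
    for defined_name, attributes in parsed_dictionary.items():
        targets = attributes.get("_enumeration_unique.dataname")
        if targets:
            constraints[defined_name] = list(targets)
    return constraints

parsed_dictionary = {
    "_space_group_symop.operation_xyz": {
        "_enumeration_unique.dataname": ["_space_group_symop.id"],
    },
    "_space_group_symop.id": {},
}
print(collect_unique_constraints(parsed_dictionary))
# {'_space_group_symop.operation_xyz': ['_space_group_symop.id']}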
I understand that there are probably not that many IUCr-curated data items that would actually benefit from an additional uniqueness constraint, so the whole proposal may indeed seem somewhat excessive. However, my proposal was more in the spirit of bringing the constraint set supported by DDLm dictionaries closer to that of other popular schema/ontology formats and, in doing so, making it more applicable in situations outside of the IUCr-curated dictionaries.

The decision here comes down to finding the right balance between the extra effort involved in creating and maintaining a uniqueness attribute, and the benefits of knowing that a value is supposed to be unique so that we can validate and optimise. Regarding the other standards having uniqueness as an attribute: if those standards do not explicitly adopt the relational model, then 'uniqueness' is often their way of identifying keys of an underlying relational model that they don't always admit exists. For example, I think that uniqueness in DDL1 came before the proper relational nature of CIF was articulated, and that these unique data names did act as keys later on (usually unfortunately).

I am sure this is true for DDL1, but UNIQUE constraints are widely used in relational databases with the sole purpose of ensuring the uniqueness of certain data items. One artificial example would be a table describing a citizen. The table would have an artificial primary key, a forename, a surname and the national id number. The law dictates that the uniqueness of the national id number must be ensured. Sure, this could be done at the application level, but why not do it at the database level using the built-in constraints? Such constraints can be easily expressed as schema properties in XML and JSON schemas, but require writing and interpreting an ad-hoc dREL code snippet in DDLm.

Anyway, it really is for this group to assess that balance. I feel uniqueness is an unnecessary extra attribute that does not compensate for the extra work required to track it, but others here may differ.

Thank you for seriously considering this proposal. Please note that defining a new DDLm attribute does not necessitate it being extensively used in IUCr dictionaries, but it does allow other dictionary maintainers to use it.

Sincerely,
Antanas

All the best,
James.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
--
Antanas Vaitkus,
PhD student at Vilnius University Institute of Biotechnology,
room V325, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] Adding a DDLm attribute for uniqueness (Bollinger, John C)
- Re: [ddlm-group] Adding a DDLm attribute for uniqueness (Herbert J. Bernstein)
- References:
- Re: [ddlm-group] Adding a DDLm attribute for uniqueness (Antanas Vaitkus)
- Re: [ddlm-group] Adding a DDLm attribute for uniqueness (James Hester)
- Re: [ddlm-group] Adding a DDLm attribute for uniqueness (Antanas Vaitkus)