Re: [ddlm-group] Adding a DDLm attribute for uniqueness

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Adding a DDLm attribute for uniqueness
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Fri, 14 Feb 2020 20:46:37 +0000
  • In-Reply-To: <CALHYoX6St7GzJociroTv=DMpsMovpuot5+Zkq6Zk5dh2Gva-LA@mail.gmail.com>
  • References: <CALHYoX6573gXqabRS0TwY5O0-wVtexjVrWs9KZi2jpH2u_Tm8A@mail.gmail.com><CAM+dB2crUCGAD+fUVgG38OFujxfNLc-Rc5r-6Cbhc6FijsJBuw@mail.gmail.com><CALHYoX6St7GzJociroTv=DMpsMovpuot5+Zkq6Zk5dh2Gva-LA@mail.gmail.com>

Dear DDLM group,

 

With regard to the question of null values in keys (both simple and compound), and specifically with respect to Antanas’s comments that:

 

As far as I understand, the DDLm does not explicitly forbid key data items to have unknown (?) or inapplicable (.) values, and, as a result, the challenge of handling these special values in the context of uniqueness still applies. For example, it was common practice (at least by some pieces of software) to place an inapplicable ('.') value instead of '1_555' for certain symmetry data items […]

 

That’s slightly misleading.  The null value represented in CIF text as '.' has two distinct common interpretations:

 

1. There is no value because the item is not applicable, and

2. There *is* a value, and it is specifically a default value defined for the item.

 

Although I appreciate that we have a bit of an issue with naming here, one must be careful about calling '.' an “inapplicable value”.  Note also that there are many items for which neither of these is ever sensible.  Before we consider the uniqueness properties of keys containing such values, we need first to acknowledge that keys can be invalid for containing this value at all.

 

[…] There are many legacy CIFs like that in the wild, so it would be really useful to have an official interpretation on how such values should be handled. My current approach during a uniqueness check is to silently skip key values that contain at least one special value component. As you mentioned, this approach does not guarantee total key uniqueness, but it at least allows to detect duplicates without special values (still better than nothing). I would be happy to conform to any official guidelines, though, once these are established.

 

I’m not sure there are any official guidelines, but to me there doesn’t seem to be much uncertainty here.

 

In the first place, unknown values must not appear in data items that constitute partial or complete category keys because a putative key containing such a value does not fully identify the entity to which the associated data apply.  Take the geom_bond category, for example.  The category key consists of the items _geom_bond.atom_site_label_1, _geom_bond.atom_site_label_2, _geom_bond.site_symmetry_1, and _geom_bond.site_symmetry_2.  If the value of any one of those items is unknown for a particular record then the meaning of the overall record is impossible to interpret.  Even if it is possible to fill in values in such records such that the overall CIF is valid, it is not logically consistent to accept the records in question as valid as they stand, nor the CIF containing them as valid overall.
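
As a minimal sketch of that rule, a validator could flag such records before any uniqueness check is attempted. The function name and the representation of a record as a dict of raw CIF strings are assumptions made for illustration; they are not part of DDLm or of any particular CIF library. The sketch also assumes the parser has already distinguished an unquoted ? from a quoted string containing a question mark.

```python
# Data names forming the geom_bond category key, as listed in the text.
GEOM_BOND_KEY = (
    "_geom_bond.atom_site_label_1",
    "_geom_bond.atom_site_label_2",
    "_geom_bond.site_symmetry_1",
    "_geom_bond.site_symmetry_2",
)

UNKNOWN = "?"  # CIF text for an unknown value


def rows_with_unknown_key(rows, key_items=GEOM_BOND_KEY):
    """Return the indices of rows whose key has an unknown ('?') component.

    Each row is a hypothetical dict mapping data names to raw CIF strings.
    Items that are absent from a row are not flagged here; that case is
    left to other checks.
    """
    return [
        i
        for i, row in enumerate(rows)
        if any(row.get(name) == UNKNOWN for name in key_items)
    ]


rows = [
    dict(zip(GEOM_BOND_KEY, ("C1", "C2", ".", "."))),
    dict(zip(GEOM_BOND_KEY, ("C1", "?", ".", "."))),
]
print(rows_with_unknown_key(rows))  # indices of records to reject as invalid
```

Records reported this way would be rejected on their face, independently of whether their keys happen to collide with any others.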

 

The other null value yields a different story.  In the event that for a given item, that value serves as a default-value placeholder, key uniqueness should be validated by first filling in the appropriate default value where needed, and then validating the resulting set of keys.  If one is willing to risk assuming that a given CIF is consistent in its use of such placeholders, then one can shortcut by skipping the step of filling in the values, and treating the null as if it were a value that compared equal to itself and unequal to all non-null values.  On the other hand, if items where this null value conveys the sense of "inapplicable" are used in keys, then we must consider the implication of such an item being included in the key at all.  Supposing that the key contains enough items to uniquely identify entities in its category but no superfluous items, the only reasonable way to validate keys containing inapplicable-item values is as if (again) those values compared equal to each other and unequal to all non-null values.  If that seems not to make sense for some category, then that category’s definition is inconsistent.
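
The procedure described above might be sketched as follows. This is only an illustration under stated assumptions: the function name, the dict-per-record representation, and the `defaults` mapping are hypothetical, and the use of '1_555' as a default for the symmetry items is taken from the example quoted earlier rather than from any dictionary definition.

```python
from collections import defaultdict

NULL = "."  # CIF text for the null that means "inapplicable" or "use default"


def duplicate_keys(rows, key_items, defaults=None):
    """Group row indices by key and return the keys that occur more than once.

    rows      : list of dicts mapping data names to raw CIF strings
    key_items : data names forming the category key
    defaults  : optional dict of data name -> default value (assumed to come
                from the relevant dictionary)

    A '.' in an item that has a default is replaced by that default before
    comparison. Any other '.' is kept as-is, so it compares equal to other
    '.' values and unequal to every non-null value.
    """
    defaults = defaults or {}
    seen = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(
            defaults.get(name, NULL) if row[name] == NULL else row[name]
            for name in key_items
        )
        seen[key].append(i)
    return {key: idx for key, idx in seen.items() if len(idx) > 1}


key_items = ("label_1", "label_2", "symmetry_1", "symmetry_2")  # shortened names
rows = [
    dict(zip(key_items, ("C1", "C2", ".", "."))),
    dict(zip(key_items, ("C1", "C2", "1_555", "1_555"))),
]
assumed_defaults = {"symmetry_1": "1_555", "symmetry_2": "1_555"}

print(duplicate_keys(rows, key_items, assumed_defaults))  # defaults filled in
print(duplicate_keys(rows, key_items))                    # the shortcut
```

With the defaults filled in, the two records collide. With the shortcut they do not, which is the risk taken when a CIF mixes placeholders and explicit default values.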

 

Skipping keys with null components is never the right answer.  Keys with unknown-value nulls should be considered invalid on their face, without consideration of uniqueness.  Keys with default-value placeholders convey enough information (in the context of the relevant dictionary) for normal validation.  Keys with inapplicable-item values can and should be validated too, based on the presumption that the key definition is sensible to begin with.
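
To illustrate what skipping loses, the following sketch contrasts the two approaches on keys already extracted as tuples of raw CIF strings. Both function names are hypothetical, and keys containing '?' are simply left out of the second check on the assumption that they are reported as invalid elsewhere.

```python
from collections import Counter

SPECIAL = {"?", "."}


def duplicates_skipping_nulls(keys):
    """Skip-based check: ignore every key with a special-value component."""
    counts = Counter(k for k in keys if not SPECIAL & set(k))
    return {k for k, n in counts.items() if n > 1}


def duplicates_with_null_as_value(keys):
    """Treat '.' as a value equal only to itself; leave out keys with '?'."""
    counts = Counter(k for k in keys if "?" not in k)
    return {k for k, n in counts.items() if n > 1}


keys = [
    ("C1", "C2", ".", "."),
    ("C1", "C2", ".", "."),
    ("C1", "C3", "1_555", "1_555"),
]
print(duplicates_skipping_nulls(keys))      # the repeated key goes unnoticed
print(duplicates_with_null_as_value(keys))  # the repeated key is reported
```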

 

I reserve judgement on the actual proposal.  I can see reasonable arguments both pro and con, and I’m not prepared at this time to carry water for either side.

 

 

Regards,

 

John

 

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

 

 



