Discussion List Archives

Re: [ddlm-group] Restricting identifiers to integers: a good idea?

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Restricting identifiers to integers: a good idea?
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Mon, 13 Sep 2021 15:14:43 +0000
  • In-Reply-To: <CAM+dB2eyrh0zf_CNOha5Y3oR9RrDqLQ7KKxkO=kLBmnGWD5ENw@mail.gmail.com>
  • References: <CAM+dB2eyrh0zf_CNOha5Y3oR9RrDqLQ7KKxkO=kLBmnGWD5ENw@mail.gmail.com>


There is a popular school of thought in DB design that every entity’s primary key should be a single-column surrogate key. Such a key is typically assigned a numeric data type for various practical reasons, such as storage size and key generation strategy.  The proposal seems to align with that practice, but I am not convinced that the practice is necessarily a good strategy for data dictionaries.


Although we do have some surrogate keys in our dictionaries (mmCIF’s _symmetry_equiv.id, for example), we tend to use natural keys instead.  Where a suitable natural key exists, this both simplifies our dictionaries and makes CIF instance documents simpler and easier for humans to read.  Also, continuing with _symmetry_equiv.id as an example, I observe that its definition explicitly states that it is NOT numeric.  I think that is typical of our historic practice for such keys.


I am also skeptical of the proposal’s stated objective of simplifying input and storage, as this seems to fall outside the purpose of a CIF dictionary.  A CIF dictionary’s primary purpose is to define data semantics, and any consideration of its suitability as a storage model is secondary, at best, notwithstanding the relational characteristics of DDL2 and DDLm.


Moreover, it is not evident to me how input or storage would in fact be simplified without attributing some kind of additional significance to the key values, which would turn them into natural keys whose specific significance should be defined.  The dictionary authors might want to consider, for example, whether their input and storage plans would be foiled by key sets that were not contiguous, did not start at 0 or 1, included large gaps between keys, or included very large numbers.  I’m also suspicious that any perceived input advantages assume restrictions on the lexical form in which keys would be expressed.  For example, the dictionary authors should consider whether integers expressed in scientific notation (e.g. 0.10e+01 or 1e+00) or in other forms besides straight decimal digit sequences would defeat their goals.
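
The questions above can be made concrete with a small check. The following is only a minimal Python sketch, not part of any CIF tooling: the function names classify_key and key_set_report are hypothetical, and it assumes key values arrive as raw strings from a CIF file. It separates keys written as straight decimal digit sequences from keys that are integer-valued but written in another lexical form, and it reports whether a key set is contiguous, where it starts, and how large its gaps are.

```python
from decimal import Decimal, InvalidOperation
import re

# Assumption: "plain" means an unsigned run of decimal digits, nothing else.
PLAIN_DIGITS = re.compile(r"[0-9]+")


def classify_key(text):
    """Classify one key string by lexical form (hypothetical helper).

    Returns "plain" for a straight digit sequence, "integer-valued" for
    any other spelling of a whole number (e.g. 0.10e+01), and
    "not-integer" for everything else.
    """
    if PLAIN_DIGITS.fullmatch(text):
        return "plain"
    try:
        value = Decimal(text)
    except InvalidOperation:
        return "not-integer"
    # Reject NaN/Infinity, then test whether the value is a whole number.
    if value.is_finite() and value == value.to_integral_value():
        return "integer-valued"
    return "not-integer"


def key_set_report(keys):
    """Summarise properties an input or storage scheme might rely on.

    Assumes every key is integer-valued and the set is non-empty.
    """
    if not keys:
        raise ValueError("key set must not be empty")
    values = sorted(int(Decimal(k)) for k in keys)
    gaps = [b - a for a, b in zip(values, values[1:])]
    return {
        "start": values[0],
        "contiguous": all(g == 1 for g in gaps),
        "largest_gap": max(gaps, default=0),
        "max": values[-1],
    }
```

A reader that accepts only the "plain" class would reject 0.10e+01 and 1e+00 even though both equal 1, so a dictionary that wants that behaviour would need to state the lexical restriction explicitly. Likewise, any storage plan that depends on the "contiguous" or "start" fields is treating the key as more than an opaque identifier.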



Best regards,





John C. Bollinger, Ph.D., RHCSA

Computing and X-ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital



From: ddlm-group <ddlm-group-bounces@iucr.org> On Behalf Of James H
Sent: Monday, September 13, 2021 1:07 AM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Restricting identifiers to integers: a good idea?




Hello DDLm experts,


This time I have a relational model question.


One of our dictionary author groups would like to restrict the key data name of a category (an opaque identifier) to positive integers (instead of arbitrary text), to simplify input and storage. I have commented that this risks the integer acquiring some sort of meaning, such as specifying that the items in the category are arranged in a particular sequence. However, I think some of you have more experience in why integer identifiers may or may not be a good idea. Can any of you comment on the value of restricting/not restricting the form of an identifier?


Note this is a new dictionary so I'm not talking about changing an existing data name.




