
Re: [ddlm-group] On schema, syntax and semantics, was Preparing CIF for multi-block datasets

To: "Herbert J. Bernstein" <[email protected]>, "[email protected]"<[email protected]>, Group finalising DDLm and associated dictionaries<[email protected]>
Subject: Re: [ddlm-group] On schema, syntax and semantics, was Preparing CIF for multi-block datasets
From: "Bollinger, John C" <[email protected]>
Date: Thu, 2 Apr 2020 15:50:55 +0000

Herbert wrote:

In the same vein I propose that we unambiguously specify the mapping of all non-looped DDLm categories into the equivalent DDL2 CIF presentation. I know there are people who think there is something special and different about the unlooped categories, but I firmly believe that any information that cannot be presented as relations preserving referential integrity is a disaster waiting to happen and eventually will become an unsearchable garble.

From the use of the DDL1-associated term "looped", I interpret Herbert to be at least partially referring here to the significance DDL1 purports to attribute to the form in which the CIF representation of a category is presented -- that is, a single-packet loop_ construct vs. one or more scalar items. (By "DDL1" I mean first and foremost the dictionary definition language itself.) I agree that such significance is artificial, and indeed out of place. In practice, it served to simplify software development, but probably also misled some people.

As far as I am aware, however, DDLm (the dictionary definition language) does not provide for making such distinction, and therefore none of the dictionaries expressed in that language do so. What DDLm does have is the concept of Set categories, which are a natural fit for categories whose DDL1 definitions specify that they be unlooped-only. The validity of a CIF representation of a DDLm Set category does not depend on whether a loop_ construct is used. When talking about DDLm or dictionaries expressed in that language, then, the term "looped" can be understood only from an historical perspective. The term remains significant for CIF documents, of course, but it is unrelated to their interpretation with respect to dictionaries expressed in DDLm.
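The point about loop_ constructs can be illustrated with a minimal Python sketch. It is not a CIF parser: the in-memory forms, the helper names and the "at most one row" check are assumptions made for illustration, and the two data names are used only as examples. Under those assumptions, the scalar and single-packet looped presentations of a category reduce to the same row, so a Set-category check never needs to know which presentation was used.

```python
# Hypothetical in-memory forms of one category in one data block.
# Scalar presentation: each data name maps directly to its value.
scalar_form = {
    "_diffrn.ambient_temperature": "100",
    "_diffrn.ambient_pressure": "101",
}

# Looped presentation: a list of data names plus a list of packets (rows).
looped_form = {
    "names": ["_diffrn.ambient_temperature", "_diffrn.ambient_pressure"],
    "packets": [["100", "101"]],
}


def rows_from_scalars(items):
    """Treat a group of scalar items from one category as a single row."""
    return [dict(items)]


def rows_from_loop(loop):
    """Treat each packet of a loop as one row."""
    return [dict(zip(loop["names"], packet)) for packet in loop["packets"]]


def is_valid_set_category(rows):
    """Assumed rule for this sketch: a Set category has at most one row."""
    return len(rows) <= 1


# Both presentations yield the same relational content.
same_rows = rows_from_scalars(scalar_form) == rows_from_loop(looped_form)
```

A two-packet loop would fail `is_valid_set_category`, but because it has two rows, not because it is a loop.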

James wrote:

I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others.

Indeed, Set categories are not problematic from a relational perspective. Such a category simply corresponds to a relation having a candidate key drawn from a single-element domain. The key's domain having only one element, its specific value has no semantic significance, and we need not and do not define or represent the key explicitly.

That model is of course focused on data blocks (and save frames) serving as self-contained databases. If we want to consider its embedding into a broader relational model encompassing a flat representation of multiple data blocks, then those erstwhile unspecified Set category keys must each have a 1:1 relationship with their host block, and thus with its unique identifier. A natural choice in this broader model is therefore to _equate_ those keys with their host blocks' unique identifiers, which, in combination with expanding the keys' domains appropriately, brings us to exactly the same place that mmCIF reached (via a similar route, I suspect).

On the other hand, although a flat relational representation affords advantages that I'm sure motivated its choice for mmCIF, it is not semantically different from the trivial embedding we already have: the one with data block identifiers as keys, and _data_blocks_ as values (to the extent that data block identifiers are unique).
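A minimal Python sketch of the two embeddings described above, under stated assumptions: data blocks are modelled as plain dictionaries, and the column name `block_id` and the function names `flatten` and `unflatten` are hypothetical. Converting the block-keyed form to a flat relation and back loses nothing as long as block identifiers are unique, which is the sense in which the two are not semantically different.

```python
def flatten(blocks, category):
    """Return the rows of `category` from every block as one flat relation.

    `blocks` maps a unique block identifier to {category_name: [row, ...]}.
    The added "block_id" column plays the role of the formerly implicit
    Set category key, now equated with the host block's identifier.
    """
    flat = []
    for block_id, categories in blocks.items():
        for row in categories.get(category, []):
            flat.append({"block_id": block_id, **row})
    return flat


def unflatten(flat_rows, category):
    """Rebuild the block-keyed form from the flat relation."""
    blocks = {}
    for row in flat_rows:
        rest = {name: value for name, value in row.items() if name != "block_id"}
        blocks.setdefault(row["block_id"], {}).setdefault(category, []).append(rest)
    return blocks


# Two data blocks, each holding a single-row (Set) category.
blocks = {
    "sample_a": {"diffrn": [{"ambient_temperature": "100"}]},
    "sample_b": {"diffrn": [{"ambient_temperature": "293"}]},
}
flat = flatten(blocks, "diffrn")
round_trip = unflatten(flat, "diffrn")
```

In the flat form `block_id` is a candidate key for the category; in the block-keyed form the same information is carried by the dictionary key.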

Perhaps it would be worthwhile presenting something along the lines of the above discussion in an appropriate place (i.e. some place more official than this list), but it's unclear to me what more than that would be needful.

Regards,

John

John C. Bollinger, Ph.D.

Computing and X-ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

From: ddlm-group <[email protected]> on behalf of James Hester <[email protected]>
Sent: Wednesday, April 1, 2020 7:36 PM
To: Herbert J. Bernstein <[email protected]>
Cc: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: [ddlm-group] On schema, syntax and semantics, was Preparing CIF for multi-block datasets

Dear all,

See my comments inline below.

On Thu, 2 Apr 2020 at 10:23, Herbert J. Bernstein <[email protected]> wrote:

Dear Colleagues,

This issue is just another aspect to the matters we have discussed since 1995 on the relationships among database schema, syntax and semantics.

It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we use, so that we can work with interoperable presentations in CIF, XML, json, etc. DDL2 is very good at that. DDL1 was weak on that. DDLm is, frankly, a bit muddled in that regard when the syntax and semantics of pieces of various combined dictionaries can make it hard to trace what parts of which schema are intended to apply to which data.

I agree up until Herbert's final sentence. Is DDLm muddled because of the lack of decent documentation, or because the concepts are imperfect? As far as I can tell, DDLm in its current version provides a mechanism that is about as simple as it can be and still handle the enormous diversity of the powder/magnetism/modulated/Laue world and combinations thereof in a machine-actionable manner - machine-actionable means that dREL methods can be written that will work with whatever combination of dictionaries you have come up with (think combined neutron / X-ray powder diffraction on a mixture expressed using a dictionary). Doing this has relied on relational data structures.

Links are certainly useful, and I would favor adding them to CIF as a container language, just as we added the imgCIF binary data type to enable the creation of CBFs, but just as with that case we must be sure to precisely specify what the equivalent DDL2 CIF presentation is of the same information in a single file, so that the schema can be unambiguously extracted.

I believe Herbert is thinking here of links within a CIF data block pointing to items that are not straightforward DDLm-conforming CIF data blocks, thus necessitating a mapping between the pointed-to contents and the DDLm schema. Absolutely true that such a mapping is necessary. So perhaps Herbert is suggesting a further '_audit_link' data name that would identify the particular mapping to use? I agree. The lack of such mappings doesn't mean we can't define the data name. I would also add that, while one scenario might put such links into a 'global' block (like a Nexus master file) making a sort of container for other data blocks, another scenario might simply link one block with the next one along.

In the same vein I propose that we unambiguously specify the mapping of all non-looped DDLm categories into the equivalent DDL2 CIF presentation. I know there are people who think there is something special and different about the unlooped categories, but I firmly believe that any information that cannot be presented as relations preserving referential integrity is a disaster waiting to happen and eventually will become an unsearchable garble.

I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others.

By saying that a category is unlooped you are specifying the scope of a single data block (e.g. *one* compound, *one* sample), that is the significance of unlooped categories. DDL2 does exactly the same thing by specifying that the value of _entry.id is the data block identifier. So all children of _entry.id are single row i.e. Set categories. And there is no abandonment of relational integrity if you restrict some loops to having a single row as Herbert seems to be implying.

Just as to this day, COMCIFS has not pushed the binary data type into DDLm, it does not need to push links or looped sets into DDLm, but it does need to suggest a reasonable way to present the information involved in a DDLm equivalent that can be used by applications to deal with this information.

We already have 'looped sets' as a result of the _audit.schema discussions several years ago. The documentation might still be a bit sparse.

all the best,

James.

On Wed, Apr 1, 2020 at 6:46 PM James Hester <[email protected]> wrote:
I'll admit to being a bit mystified by John W's answer and so suspect that we might be talking about different things. For example "the semantics of the dictionary items has always been deferred to the application"...I thought the semantics were defined by the dictionary and the application had to conform to them? Also, there are no namespace issues as far as I can see, the present task is simply to identify which data blocks belong together and the way in which they combine is not the issue.

The goal of my post was finding a way to help a CIF processor identify all data blocks that belong to a given dataset. A minimalist, ad-hoc approach (that we already have) is to say that this is out of scope for CIF and external information will be used by the processor e.g. all files in a directory/file naming convention/everything in a zip file. The proposal below is an attempt to improve the robustness of this approach by providing ways to (optionally) verify membership in the dataset and to potentially link in e.g. calibration files that might be found separately.
On Thu, 2 Apr 2020 at 01:10, [email protected] <[email protected]> wrote:
Hi all,

In my view using dictionary names as proposed here would not be a useful direction. The semantics of the dictionary items has always been deferred to the application and to prescribe a special meaning to particular audit_* category items would not be a direction that would be useful in my domain. I would suggest using the datablock name as an optional namespace, and with some recommended protocol for combining the namespace with item names in the block. For example, the namespace may be simply prepended to data items in the block using some concatenation scheme. This way you don't have to start reading items in a datablock to understand how to name items in that block. This also avoids entangling dictionary level semantics with item access level conventions.
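One way to picture the prepending idea is the minimal Python sketch below. The message does not fix a concatenation scheme, so the `.` separator, the function names `qualify_names` and `combine_blocks`, and the block and item names are all assumptions made for illustration.

```python
def qualify_names(block_name, items, separator="."):
    """Prefix every data name in a block with the block name.

    The separator is an assumed convention; any unambiguous
    concatenation scheme would serve the same purpose.
    """
    return {f"{block_name}{separator}{name}": value for name, value in items.items()}


def combine_blocks(blocks, separator="."):
    """Merge several blocks into one flat mapping of qualified names."""
    combined = {}
    for block_name, items in blocks.items():
        combined.update(qualify_names(block_name, items, separator))
    return combined


# Two hypothetical data blocks that use the same data name.
blocks = {
    "phase_1": {"_cell.length_a": "5.43"},
    "phase_2": {"_cell.length_a": "4.05"},
}
combined = combine_blocks(blocks)
```

Because the qualified name is built from the block name alone, nothing inside a block has to be read before its items can be named, which is the property the proposal is after.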

John

On 4/1/20 9:41 AM, James Hester wrote:
This is precisely my point (2) in the discussion. Also, I do not think that we should too tightly specify particular behaviour from the CIF processor, but instead we should provide sufficient datanames to give it the best chance of success. Are the data names I have proposed sufficient?

On Wed, 1 Apr 2020 at 23:24, Herbert J. Bernstein <[email protected]> wrote:

Dear Colleagues,

Here is a practical issue to consider in designing the multi-block scheme, which already arises in practice in collecting HDF5-based NeXus datasets of Eiger images:

At the time you start your collection you are planning to collect, say, 3600 images in blocks of, say, 600 images. Before you start the collection you lay out the metadata for the entire collection in one big master file, with links to the planned 6 files of 600 images each. You close the master file and start collecting images into those 6 data files. A bit more than half-way through the collection, the collection stops, leaving you with only 3 and a fraction of your intended 600-image data block files. The 4th file does have some useful images, but not all of them. You want to process as much of the data as possible. The scheme should warn you about missing files or missing images, but it should recover gracefully and give you the data that is actually there.
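The behaviour asked for here can be sketched in Python as follows. This is only an illustration under stated assumptions: the master file is reduced to a mapping from file name to planned image count, the file names and the function name `recover` are hypothetical, and no HDF5 or NeXus library is involved.

```python
def recover(planned, found):
    """Compare planned data files with what actually exists.

    planned: {file_name: number of images the master file expects}
    found:   {file_name: number of images actually present}
    Returns (usable, warnings): the image counts that can be processed,
    and a human-readable warning for every missing or short file.
    """
    usable = {}
    warnings = []
    for name, n_planned in planned.items():
        n_found = found.get(name)
        if n_found is None:
            warnings.append(f"missing file: {name}")
            continue
        if n_found < n_planned:
            warnings.append(f"{name}: only {n_found} of {n_planned} images present")
        if n_found > 0:
            usable[name] = min(n_found, n_planned)
    return usable, warnings


# The scenario above: 6 planned files of 600 images, collection stops in file 4.
planned = {f"data_{i}.h5": 600 for i in range(1, 7)}
found = {"data_1.h5": 600, "data_2.h5": 600, "data_3.h5": 600, "data_4.h5": 250}
usable, warnings = recover(planned, found)
```

With these example counts, the three complete files and the partial fourth file remain usable, and one warning is raised for the short file and one for each of the two files that were never written.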

Regards,

    Herbert

On Wed, Apr 1, 2020 at 1:12 AM James Hester <[email protected]> wrote:

Dear DDLm group,

The time is coming when we need to have a good story for how datasets consisting of multiple blocks are handled within CIF.

As the de facto technical committee, can you:

(1) let me know what you think of the following heuristic for locating all data blocks associated with a dataset

(2) let me know what you think of the proposed _audit_link data block specification method

A: Heuristic for locating all data blocks associated with a data set, given a single data block or data set identifier

1. Collect all known data blocks that include the provided data set identifier in their _audit.dataset_id (this would be a new data name) loop

2. For each of the blocks found in 1, include any blocks referenced by _audit_link rows in those blocks (see below)

3. Repeat step 2 until no new blocks are obtained.

Note it is not an error for a data block to advertise a different dataset_id, as a single block could belong to multiple datasets (e.g. calibration data, reprocessed data), or it could have been incorporated into a larger dataset.
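Heuristic A amounts to a transitive closure over links, which a minimal Python sketch can make concrete. The in-memory layout (`dataset_ids` and `links` per block), the function name `locate_dataset_blocks` and the choice to report unresolvable links instead of failing are assumptions made for illustration; a real processor would obtain these values from _audit.dataset_id and _audit_link rows.

```python
from collections import deque


def locate_dataset_blocks(known_blocks, dataset_id):
    """Find every known block belonging to `dataset_id`.

    known_blocks: {block_code: {"dataset_ids": set of ids,
                                "links": list of linked block codes}}
    Returns (found, unresolved): the block codes in the dataset, and link
    targets that could not be located among the known blocks.
    """
    # Step 1: blocks that advertise the dataset identifier themselves.
    found = {
        code
        for code, block in known_blocks.items()
        if dataset_id in block.get("dataset_ids", ())
    }
    queue = deque(sorted(found))
    unresolved = set()
    # Steps 2 and 3: follow links until no new blocks are obtained.
    while queue:
        code = queue.popleft()
        for target in known_blocks[code].get("links", ()):
            if target in found:
                continue
            if target not in known_blocks:
                unresolved.add(target)
                continue
            found.add(target)
            queue.append(target)
    return found, unresolved


known = {
    "main": {"dataset_ids": {"ds1"}, "links": ["calib"]},
    "calib": {"dataset_ids": {"other"}, "links": ["dark", "gone"]},
    "dark": {"dataset_ids": set(), "links": ["main"]},
    "unrelated": {"dataset_ids": {"ds2"}, "links": []},
}
found, unresolved = locate_dataset_blocks(known, "ds1")
```

The `calib` block advertises a different dataset identifier but is still collected through the link from `main`, matching the note that this is not an error. The cycle from `dark` back to `main` terminates because already-found blocks are skipped.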

B: _audit_link data names

Currently _audit_link refers to data blocks using the _audit.block_code identifier. I would like to add more options:

_audit_link.URI A URI for the object containing the block (DOIs also possible here)

_audit_link.internal_address An opaque internal address (e.g. directory structure) for the object at the URI that will lead it to a data block with a CIF representation. Interpretation of this address will depend on information provided at the URI.

_audit_link.relationship An optional data name with a value drawn from an enumerated list. Some relationships would be non-machine-actionable, such as 'previous work'.
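As a rough picture of how a processor might hold one _audit_link row with the proposed data names, here is a minimal Python sketch. The class and function names are hypothetical, the only relationship value used is the 'previous work' example given above (the enumerated list itself is not defined in this proposal), and the URI in the usage lines is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditLink:
    """One row of _audit_link, using the existing and proposed data names."""
    block_code: Optional[str] = None        # existing block identifier
    uri: Optional[str] = None               # proposed _audit_link.URI
    internal_address: Optional[str] = None  # proposed _audit_link.internal_address
    relationship: Optional[str] = None      # proposed _audit_link.relationship


# Only the example from the proposal; the real enumeration is not yet defined.
NON_MACHINE_ACTIONABLE = {"previous work"}


def should_follow(link):
    """Assumed policy: follow a link unless its relationship is informational."""
    return link.relationship not in NON_MACHINE_ACTIONABLE


def describe_target(link):
    """Say where the linked block is expected to be found."""
    if link.uri is None:
        return f"block {link.block_code!r} in the current file"
    if link.internal_address is None:
        return f"object at {link.uri}"
    return f"{link.internal_address!r} within object at {link.uri}"


local = AuditLink(block_code="calib")
remote = AuditLink(uri="https://example.org/data.zip", internal_address="run1/calib.cif")
history = AuditLink(uri="https://example.org/old", relationship="previous work")
```

The internal address is kept as an opaque string here, since its interpretation is said to depend on information provided at the URI.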

Discussion

========

1. The essential problem is that any identifier for a final dataset may not be known at the time a file is produced (particularly calibrations), so we have to allow both top-down and bottom-up searches.

2. CIF, being relational, degrades gracefully. A missing data block means that loops are either shortened or absent. If a CIF processor does not find or have available all blocks, it will still be able to work coherently with what it has.

3. Until now we have tended to think of a 'dataset' as equivalent to a 'data block'. This is increasingly untenable as people use 'global' blocks or split composite structures over 3 or 4 blocks. We have a good story in DDLm for how to describe multi-block datasets in a machine-actionable way, all we need is a way of locating all of the blocks.

4. _audit_link.block_code is supposed to be confined to the current data file. I would like to expand this to cover any file. It may be advisable to create a whole new category e.g. _audit_related.

all the best,

James.


--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

-- 
John Westbrook
RCSB, Protein Data Bank
Rutgers, The State University of New Jersey
Institute for Quantitative Biomedicine at Rutgers
174 Frelinghuysen Rd
Piscataway, NJ 08854-8087
e-mail: [email protected]
Ph: (848) 445-4290 Fax: (732) 445-4320
