
Re: [ddlm-group] Preparing CIF for multi-block datasets

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Preparing CIF for multi-block datasets
From: James Hester <[email protected]>
Date: Fri, 3 Apr 2020 11:56:21 +1100
In-Reply-To: <[email protected]>
References: <CAM+dB2eFZ+-yUVWfNBVnKUaNNr9bUC9S3B8QJ9pYHNYk4ETnfA@mail.gmail.com><[email protected]>

Dear Brian,

The key difference between _audit_block_code and the proposed _audit.dataset_id is that the dataset id would take a single value for a whole collection of data blocks, so separate blocks could share the same _audit.dataset_id. The audit_link items could continue to use the existing block code data names, with the additional data names I proposed helping to locate those blocks. I don't think that the CIF system can cope with changing the meaning of _audit.block_code to encompass multiple blocks (which would be contrary to its name), and _audit.block_code remains useful in legacy data names as the target of pointers to particular blocks.

all the best

James.

On Thu, 2 Apr 2020 at 21:11, Brian McMahon <[email protected]> wrote:

Hi James

I feel that I am being obtuse, but please spell out what you think the
new _audit.dataset_id buys you that is absent from _audit.block_code.

I presume one possible distinction is that _audit.block_code is expected
to be a singleton identifier, while _audit.dataset_id is by design
multiple valued (so you can partition a single data set amongst many
different aggregates). But the core (DDL1) dictionary is silent about
the _list attribute for _audit.block_code, so it is in principle
loopable.

I see also that a new data name gives you more freedom to specify how
to construct its value, if that's thought to be a good thing. However,
there is a precedent for giving a recipe for constructing block codes
in a "helpful" way for at least one community of practice (the
AUDIT_LINK category in the msCIF dictionary).

[And by way of counterweight, pdCIF tries the approach of distinct
data names with _pd_block_id etc.]

In my view, the audit_link structure provides the right sort of
mechanism for capturing arbitrary relationships (I note, by the way,
that the msCIF dictionary explicitly states "The value of
_audit_block_code may be associated with a data block in the same
file or in a different file associated with the same data block.")
It may be that the category does need extension in the way you
suggest to ensure uniqueness of the datablocks, and I've no objection
to following through on that idea, but I'm still not clear on whether
for the block identifiers we need a new data name, or just better
guidelines on a per-community (per-dictionary) basis.

Brian

On 01/04/2020 06:11, James Hester wrote:
> Dear DDLm group,
>
> The time is coming when we need to have a good story for how datasets
> consisting of multiple blocks are handled within CIF.
>
> As the de facto technical committee, can you:
> (1) let me know what you think of the following heuristic for locating
> all data blocks associated with a dataset
> (2) let me know what you think of the proposed _audit_link data block
> specification method
>
> A: Heuristic for locating all data blocks associated with a data set,
> given a single data block or data set identifier
>
> 1. Collect all known data blocks that include the provided data set
> identifier in their _audit.dataset_id loop (this would be a new data name)
> 2. For each of the blocks found in 1, include any blocks referenced by
> _audit_link rows in those blocks (see below)
> 3. Repeat step 2 until no new blocks are obtained.
>
> Note it is not an error for a data block to advertise a different
> dataset_id, as a single block could belong to multiple datasets (e.g.
> calibration data, reprocessed data), or it could have been incorporated
> into a larger dataset.
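
The three steps of heuristic A amount to a transitive closure over _audit_link references, seeded by the blocks that advertise the dataset identifier. The following is only a minimal sketch under stated assumptions: the `Block` class, its field names, and `locate_dataset_blocks` are hypothetical stand-ins for whatever a real CIF reader provides, links are assumed to name their targets by block code only, and the search starts from a dataset identifier rather than from a single data block. Linked blocks that are not available are simply skipped.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical in-memory stand-in for a CIF data block."""
    block_code: str                                    # value of _audit.block_code
    dataset_ids: set = field(default_factory=set)      # values in the proposed _audit.dataset_id loop
    linked_codes: list = field(default_factory=list)   # block codes named in _audit_link rows


def locate_dataset_blocks(dataset_id, known_blocks):
    """Return the set of block codes reachable from dataset_id via steps 1-3."""
    by_code = {b.block_code: b for b in known_blocks}

    # Step 1: blocks that advertise the dataset identifier directly.
    found = {b.block_code for b in known_blocks if dataset_id in b.dataset_ids}
    queue = deque(sorted(found))

    # Steps 2 and 3: follow _audit_link references until no new blocks appear.
    while queue:
        code = queue.popleft()
        for target in by_code[code].linked_codes:
            if target in found:
                continue  # already collected
            if target not in by_code:
                continue  # linked block is not available; carry on without it
            found.add(target)
            queue.append(target)
    return found


# Illustrative data only: "calib" advertises a different dataset id but is
# still collected because "sample" links to it.
blocks = [
    Block("sample", {"ds1"}, ["calib"]),
    Block("calib", {"ds0"}, ["instrument"]),
    Block("instrument"),
    Block("unrelated", {"ds2"}),
    Block("partial", {"ds1"}, ["missing"]),
]
print(sorted(locate_dataset_blocks("ds1", blocks)))
```

The worklist queue means each block's links are examined once, so the search terminates even if blocks link to one another in a cycle.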
>
> B: _audit_link data names
>
> Currently _audit_link refers to data blocks using the
> _audit.block_code identifier. I would like to add more options:
> _audit_link.URI
>     A URI for the object containing the block (DOIs also possible here).
> _audit_link.internal_address
>     An opaque internal address (e.g. directory structure) for the object
>     at the URI that will lead it to a data block with a CIF
>     representation. Interpretation of this address will depend on
>     information provided at the URI.
> _audit_link.relationship
>     An optional data name with a value drawn from an enumerated list.
>     Some relationships would be non-machine-actionable, such as
>     'previous work'.
>
> Discussion
> ========
>
> 1. The essential problem is that any identifier for a final dataset may
> not be known at the time a file is produced (particularly calibrations),
> so we have to allow both top-down and bottom-up searches.
> 2. CIF, being relational, degrades gracefully. A missing data block
> means that loops are either shortened or absent. If a CIF processor
> cannot find, or does not have available, all of the blocks, it will
> still be able to work coherently with what it has.
> 3. Until now we have tended to think of a 'dataset' as equivalent to a
> 'data block'. This is increasingly untenable as people use 'global'
> blocks or split composite structures over 3 or 4 blocks. We have a good
> story in DDLm for how to describe multi-block datasets in a
> machine-actionable way; all we need is a way of locating all of the blocks.
> 4. audit_link.block_code is supposed to be confined to the current data
> file. I would like to expand this to cover any file. It may be advisable
> to create a whole new category e.g. _audit_related.
>
> all the best,
> James.
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
>