Discussion List Archives

Re: [ddlm-group] Preparing CIF for multi-block datasets

Hi Brian,

On Fri, 3 Apr 2020 at 20:45, Brian McMahon <bm@iucr.org> wrote:

> So, is it fair to characterise your suggested approach thus?:

> [1] The *ideal* way to tell an application what data blocks are related
> to each other, and with what purpose(s), is to have a fully populated
> AUDIT_LINK structure, able to link within and outside of a single file,
> listing data block codes and locational identifier schemes such as DOI.

I wasn't making any judgement that this was ideal, only that it was acceptable, but yes. 

> [2] For various reasons, such a manifest may be absent. In that case,
> data blocks which contain the same value of _audit.dataset_id may be
> considered to be related "in some way".

Yes, except that each data block simply adds further rows to existing loops, or adds completely new loops, so there is no need to specify any relationship: the dictionaries take care of that.

> In this reading, I would see the scope of _audit.dataset_id as extending
> only within the current file. Is that also how you see it? Also, since
> you admit to the possibility of this item being looped, one may observe
> that a file can contain (overlapping) sets of datablocks that are linked
> to each other with different intentions. I worry how useful such
> implicit relationships will be in attempting to design general-purpose
No, _audit.dataset_id has unlimited scope.  As I imagine it, data blocks that are grouped together into a dataset must be "compatible", that is, no data block provides contradictory information for any row in any loop for the same values of the key data names. This scheme also assumes the application of our _audit.schema principles for splitting a single data block across multiple data blocks.   
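
To make the "compatible" idea concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a loop is held as a list of dicts and that the key data names for the category have already been looked up in the dictionary. The function names and the sample data names are hypothetical, not part of any existing CIF library.

```python
def find_contradictions(rows_a, rows_b, key_names):
    """Return (key, data_name, value_a, value_b) for every pair of rows that
    share key values but disagree on some other data name."""
    index = {tuple(row[k] for k in key_names): row for row in rows_a}
    problems = []
    for row in rows_b:
        key = tuple(row[k] for k in key_names)
        other = index.get(key)
        if other is None:
            continue  # a new row: it simply extends the loop
        for name, value in row.items():
            if name in other and other[name] != value:
                problems.append((key, name, other[name], value))
    return problems


def merge_loops(rows_a, rows_b, key_names):
    """Merge two loops from compatible blocks; raise ValueError otherwise."""
    problems = find_contradictions(rows_a, rows_b, key_names)
    if problems:
        raise ValueError(f"incompatible blocks: {problems}")
    merged = {tuple(r[k] for k in key_names): dict(r) for r in rows_a}
    for row in rows_b:
        key = tuple(row[k] for k in key_names)
        merged.setdefault(key, {}).update(row)  # add columns or new rows
    return list(merged.values())


# Illustrative data only: two blocks contributing to the same loop.
block1 = [{"_atom_site.label": "C1", "_atom_site.fract_x": "0.1"}]
block2 = [
    {"_atom_site.label": "C1", "_atom_site.fract_y": "0.2"},
    {"_atom_site.label": "O1", "_atom_site.fract_x": "0.5"},
]
block3 = [{"_atom_site.label": "C1", "_atom_site.fract_x": "0.3"}]

merged = merge_loops(block1, block2, ["_atom_site.label"])
clash = find_contradictions(block1, block3, ["_atom_site.label"])
```

Here block1 and block2 merge cleanly (one gains a column, the other adds a row), while block3 gives a different _atom_site.fract_x for the same key and so would not be compatible with block1.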

> Perhaps you're willing to flesh out a little more how the new data
> item should be defined, and whether other companion items might help
> to reduce ambiguities in its interpretation.

Yes, I think I need to write a proper discussion paper as I might have confused quite a few people. This came about as I was putting the finishing touches on an 'Advanced Dictionary Usage' draft chapter of Volume G so a lot of things were in my head that are not necessarily at the forefront of anybody else's head at the moment.

> I understand your discussion point 1 that the ultimate use to which a
> dataset is put may not be known at the point of generation, but I have
> a feeling that the onus should be on the data management protocols to
> resolve this problem (eventually) using the existing linking machinery,
> rather than to seed the data files with "hints" that may turn out to be
> misleading if initial expectations are broken.

Yes, an alternative approach might be to say that the method for aggregation of a dataset from data blocks lies outside the scope of CIF data blocks, and we only provide optional mechanisms for verification. 

> I know Herbert has concerns particularly over high-data-rate data
> collection, but the earlier discussion has made clear that it is in
> principle possible to map out a complete linking manifest in advance,
> but be able to recover gracefully if the anticipated collection is
Yes. I also think that editing information into data blocks after production is potentially fraught. The more I think about it the more I like the idea of providing links, and the result of tracing through all of these links is a collection of data blocks which may or may not be 'complete' (it may be complete for a given moment in time e.g. for time series data). Ideally the nature of these relationships does not need to be spelled out due to the compatibility idea above. Would it be worthwhile putting together a list of criteria for when a data block is compatible?  For example (1) Same _audit.schema value (2) No contradictions (3) Same _audit.dataset_id value if present. 
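
A rough Python sketch of the link-tracing idea, together with criteria (1) and (3), might look like the following. Everything here is an assumption made for illustration: the in-memory layout of a block (a dict with "schema", "dataset_ids" and "links" fields), the function names, and the reading of criterion (3) as "the dataset id is among the block's ids, if it has any". Criterion (2) needs a row-by-row comparison of loops and is not shown.

```python
from collections import deque


def collect_dataset(dataset_id, blocks):
    """Return ids of blocks gathered for dataset_id, in discovery order.

    blocks maps a block id to a dict with hypothetical fields:
      "dataset_ids": set of _audit.dataset_id values (may be empty)
      "links": block ids referenced by that block's audit_link rows
    """
    # Start from every known block that advertises the dataset id.
    queue = deque(bid for bid, b in blocks.items()
                  if dataset_id in b["dataset_ids"])
    seen = []
    while queue:
        bid = queue.popleft()
        if bid in seen or bid not in blocks:
            continue  # already visited, or the link target is unavailable
        seen.append(bid)
        queue.extend(blocks[bid]["links"])  # follow links until nothing new
    return seen


def passes_criteria(block, dataset_id, schema):
    """Criteria (1) and (3) only; criterion (2) is checked elsewhere."""
    same_schema = block["schema"] == schema
    id_ok = not block["dataset_ids"] or dataset_id in block["dataset_ids"]
    return same_schema and id_ok


# Illustrative data only.
blocks = {
    "sample": {"schema": "Base", "dataset_ids": {"ds1"}, "links": ["calib"]},
    "calib": {"schema": "Base", "dataset_ids": set(), "links": ["missing"]},
    "other": {"schema": "Base", "dataset_ids": {"ds2"}, "links": []},
}
found = collect_dataset("ds1", blocks)
```

In this example the calibration block carries no dataset id and is only reached through a link, and the link to the unavailable block "missing" is skipped rather than treated as an error, so the result is whatever collection can be assembled at that moment.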

> I feel that I'm coming across as rather negative here, but I'm not
> yet convinced that Occam's razor shouldn't be applied, i.e. try to
> encourage more uptake of the existing solution.

You may be right. We can go slowly. I'll write a discussion paper.

On 03/04/2020 01:56, James Hester wrote:
> Dear Brian,
> The key difference between _audit_block_code and the proposed
> _audit.dataset_id is that the dataset id would take a single value for a
> whole collection of data blocks, so separate blocks could have the same
> _audit.dataset_id.  The audit_link items could use existing block code
> data names, with the additional data names I proposed helpful in
> locating those blocks. I don't think that the CIF system can cope with a
> change in the meaning of _audit.block_code to encompass multiple blocks
> (which would be contrary to its name) and _audit.block_code is still
> useful in legacy data names as the target of pointers to particular blocks.
> all the best
> James.
> On Thu, 2 Apr 2020 at 21:11, Brian McMahon <bm@iucr.org
> <mailto:bm@iucr.org>> wrote:
>     Hi James
>     I feel that I am being obtuse, but please spell out what you think the
>     new _audit.dataset_id buys you that is absent from _audit.block_code.
>     I presume one possible distinction is that _audit.block_code is expected
>     to be a singleton identifier, while _audit.dataset_id is by design
>     multiple valued (so you can partition a single data set amongst many
>     different aggregates). But the core (DDL1) dictionary is silent about
>     the _list attribute for _audit.block_code, so it is in principle
>     loopable.
>     I see also that a new data name gives you more freedom to specify how
>     to construct its value, if that's thought to be a good thing. However,
>     there is a precedent for giving a recipe for constructing block codes
>     in a "helpful" way for at least one community of practice (the
>     AUDIT_LINK category in the msCIF dictionary).
>     [And by way of counterweight, pdCIF tries the approach of distinct
>     data names with _pd_block_id etc.]
>     In my view, the audit_link structure provides the right sort of
>     mechanism for capturing arbitrary relationships (I note, by the way,
>     that the msCIF dictionary explicitly states "The value of
>     _audit_block_code may be associated with a data block in the same
>     file or in a different file associated with the same data block.")
>     It may be that the category does need extension in the way you
>     suggest to ensure uniqueness of the datablocks, and I've no objection
>     to following through on that idea, but I'm still not clear on whether
>     for the block identifiers we need a new data name, or just better
>     guidelines on a per-community (per-dictionary) basis.
>     Brian
>     On 01/04/2020 06:11, James Hester wrote:
>      > Dear DDLm group,
>      >
>      > The time is coming when we need to have a good story for how datasets
>      > consisting of multiple blocks are handled within CIF.
>      >
>      > As the de facto technical committee, can you:
>      > (1) let me know what you think of the following heuristic for locating
>      > all data blocks associated with a dataset
>      > (2) let me know what you think of the proposed _audit_link data block
>      > specification method
>      >
>      > A: Heuristic for locating all data blocks associated with a data set,
>      > given a single data block or data set identifier
>      >
>      > 1. Collect all known data blocks that include the provided data set
>      > identifier in their _audit.dataset_id (this would be a new data name) loop
>      > 2. For each of the blocks found in 1, include any blocks referenced by
>      > _audit_link rows in those blocks (see below)
>      > 3. Repeat step 2 until no new blocks are obtained.
>      >
>      > Note it is not an error for a data block to advertise a different
>      > dataset_id, as a single block could belong to multiple datasets (e.g.
>      > calibration data, reprocessed data), or it could have been incorporated
>      > into a larger dataset.
>      >
>      > B: _audit_link data names
>      >
>      > Currently _audit_link refers to data blocks using the
>      > _audit.block_code identifier. I would like to add more options:
>      > _audit_link.URI                A URI for the object containing the block
>      > (DOIs also possible here)
>      > _audit_link.internal_address       An opaque internal address (e.g.
>      > directory structure) for the object at the URI that will lead it to a
>      > data block with a CIF representation. Interpretation of this address
>      > will depend on information provided at the URI.
>      > _audit_link.relationship    An optional data name with a value drawn
>      > from an enumerated list. Some relationships would be
>      > non-machine-actionable, such as 'previous work'.
>      >
>      > Discussion
>      > ========
>      >
>      > 1. The essential problem is that any identifier for a final dataset may
>      > not be known at the time a file is produced (particularly calibrations),
>      > so we have to allow both top-down and bottom-up searches.
>      > 2. CIF, being relational, degrades gracefully. A missing data block
>      > means either loops are shortened or absent. If a CIF processor does not
>      > find or have available all blocks, it will still be able to work
>      > coherently with what it has.
>      > 3. Until now we have tended to think of a 'dataset' as equivalent to a
>      > 'data block'. This is increasingly untenable as people use 'global'
>      > blocks or split composite structures over 3 or 4 blocks. We have a good
>      > story in DDLm for how to describe multi-block datasets in a
>      > machine-actionable way, all we need is a way of locating all of the blocks.
>      > 4. audit_link.block_code is supposed to be confined to the current data
>      > file. I would like to expand this to cover any file. It may be advisable
>      > to create a whole new category e.g. _audit_related.
>      >
>      > all the best,
>      > James.
>      >
>      >
>      > --
>      > T +61 (02) 9717 9907
>      > F +61 (02) 9717 3145
>      > M +61 (04) 0249 4148
>      >
>      > _______________________________________________
>      > ddlm-group mailing list
>      > ddlm-group@iucr.org <mailto:ddlm-group@iucr.org>
>      > http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
>      >
>     _______________________________________________
>     ddlm-group mailing list
>     ddlm-group@iucr.org <mailto:ddlm-group@iucr.org>
>     http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

International Union of Crystallography