Discussion List Archives


Re: [ddlm-group] Preparing CIF for multi-block datasets

Hi all,

In my view, using dictionary names as proposed here would not be a useful direction. The semantics of dictionary items have always been deferred to the application, and prescribing a special meaning for particular audit_* category items would not be useful in my domain. I would suggest using the datablock name as an optional namespace, with some recommended protocol for combining the namespace with item names in the block. For example, the namespace may simply be prepended to data items in the block using some concatenation scheme. This way you don't have to start reading items in a datablock to understand how to name items in that block. This also avoids entangling dictionary-level semantics with item-access-level conventions.
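As an illustration only, the namespace idea above could be sketched as follows. The "/" separator, the function names and the sample block contents are all assumptions made for this example; no concatenation scheme has been agreed on the list.

```python
def qualify_names(block_name, items, separator="/"):
    """Return a copy of `items` with the block name prepended to each data name.

    The separator is a placeholder: the actual concatenation scheme
    would have to be fixed by a recommended protocol.
    """
    return {f"{block_name}{separator}{name}": value for name, value in items.items()}


def merge_blocks(blocks, separator="/"):
    """Merge several data blocks into one flat mapping of namespaced items."""
    merged = {}
    for block_name, items in blocks.items():
        merged.update(qualify_names(block_name, items, separator))
    return merged


# Hypothetical blocks: the same data name appears in both without clashing
# once the block name is used as a namespace.
blocks = {
    "sample_a": {"_cell.length_a": "5.43", "_cell.length_b": "5.43"},
    "sample_b": {"_cell.length_a": "3.57"},
}
merged = merge_blocks(blocks)
```

The names are derived from the block name alone, so nothing inside a block has to be read before its items can be named.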


On 4/1/20 9:41 AM, James Hester wrote:
This is precisely my point (2) in the discussion.  Also, I do not think that we should specify the behaviour of the CIF processor too tightly; instead we should provide sufficient data names to give it the best chance of success. Are the data names I have proposed sufficient?

On Wed, 1 Apr 2020 at 23:24, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear Colleagues,
  Here is a practical issue to consider in designing the multi-block scheme, which already arises in practice
in collecting HDF5-based NeXus datasets of Eiger images:

  At the time you start your collection you are planning to collect, say, 3600 images in blocks of, say, 600 images.
Before you start the collection you lay out the metadata for the entire collection in one big master file, with links
to the planned 6 files of 600 images each.  You close the master file and start collecting images into those
6 data files.  A bit more than halfway through the collection, the collection stops, leaving you with only 3 and
a fraction of your intended 600-image data block files.  The 4th file does have some useful images, but not
all of them.  You want to process as much of the data as possible.  The scheme should warn you about missing
files or missing images, but it should recover gracefully and give you the data that is actually there.
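A minimal sketch of the "warn but recover" behaviour described above, assuming the master file's plan and the actual state of the data files have already been reduced to simple mappings. The file names, the count of 250 images in the fourth file and the function name are hypothetical, and no HDF5 or NeXus library is involved.

```python
def summarize_collection(planned, available):
    """Compare the planned layout with what was actually collected.

    planned:   file name -> number of images the master file expects
    available: file name -> number of images actually present
               (files that were never written are simply absent)
    Returns (usable, warnings): the images that can be processed per file,
    and human-readable warnings about anything missing.
    """
    usable = {}
    warnings = []
    for name, expected in planned.items():
        found = available.get(name)
        if found is None:
            warnings.append(f"missing file: {name}")
            continue
        if found < expected:
            warnings.append(f"{name}: {expected - found} of {expected} images missing")
        if found > 0:
            usable[name] = min(found, expected)
    return usable, warnings


# Six planned files of 600 images; collection stopped part-way through the 4th.
planned = {f"data_{i:06d}.h5": 600 for i in range(1, 7)}
available = {
    "data_000001.h5": 600,
    "data_000002.h5": 600,
    "data_000003.h5": 600,
    "data_000004.h5": 250,
}
usable, warnings = summarize_collection(planned, available)
```

In this sketch missing files and short files produce warnings rather than errors, so processing can continue with the images that are actually there.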


On Wed, Apr 1, 2020 at 1:12 AM James Hester <jamesrhester@gmail.com> wrote:
Dear DDLm group,

The time is coming when we need to have a good story for how datasets consisting of multiple blocks are handled within CIF.  

As the de facto technical committee, can you:
(1) let me know what you think of the following heuristic for locating all data blocks associated with a dataset
(2) let me know what you think of the proposed _audit_link data block specification method

A: Heuristic for locating all data blocks associated with a data set, given a single data block or data set identifier

1. Collect all known data blocks that include the provided data set identifier in their _audit.dataset_id (this would be a new data name) loop
2. For each of the blocks found in 1, include any blocks referenced by _audit_link rows in those blocks (see below)
3. Repeat step 2 until no new blocks are obtained.

Note it is not an error for a data block to advertise a different dataset_id, as a single block could belong to multiple datasets (e.g. calibration data, reprocessed data), or it could have been incorporated into a larger dataset.
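The heuristic in A could be sketched roughly as below. This is only an illustration: it assumes every known block has already been parsed into a dictionary holding its _audit.dataset_id values and the block codes named in its _audit_link rows, and the keys "dataset_ids" and "links", the function name and the example blocks are all made up for the sketch.

```python
from collections import deque


def locate_dataset_blocks(blocks, dataset_id):
    """Return (found, unresolved) for a dataset identifier.

    blocks: block code -> {"dataset_ids": set of str, "links": list of block codes}
    found:      codes of all blocks reached by steps 1 to 3
    unresolved: linked block codes that are not among the known blocks
    """
    # Step 1: blocks that advertise the dataset identifier themselves.
    found = {code for code, block in blocks.items() if dataset_id in block["dataset_ids"]}
    queue = deque(sorted(found))
    unresolved = set()
    # Steps 2 and 3: follow _audit_link references until nothing new turns up.
    while queue:
        code = queue.popleft()
        for target in blocks[code]["links"]:
            if target not in blocks:
                unresolved.add(target)  # warn about it, but carry on
            elif target not in found:
                found.add(target)
                queue.append(target)
    return found, unresolved


blocks = {
    "main": {"dataset_ids": {"ds1"}, "links": ["calib", "lost"]},
    "calib": {"dataset_ids": {"cal2019"}, "links": ["detector"]},
    "detector": {"dataset_ids": set(), "links": []},
    "other": {"dataset_ids": {"ds2"}, "links": ["main"]},
}
found, unresolved = locate_dataset_blocks(blocks, "ds1")
```

Here "calib" is picked up even though it advertises a different dataset_id, "other" is not included because only outgoing links are followed, and the unknown block "lost" is reported rather than treated as an error.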

B: _audit_link data names

Currently _audit_link refers to data blocks using the _audit.block_code identifier. I would like to add more options:
_audit_link.URI                A URI for the object containing the block (DOIs also possible here)
_audit_link.internal_address       An opaque internal address (e.g. directory structure) for the object at the URI that will lead it to a data block with a CIF representation. Interpretation of this address will depend on information provided at the URI.
_audit_link.relationship    An optional data name with a value drawn from an enumerated list. Some relationships would be non-machine-actionable, such as 'previous work'.


1. The essential problem is that any identifier for a final dataset may not be known at the time a file is produced (particularly calibrations), so we have to allow both top-down and bottom-up searches.
2. CIF, being relational, degrades gracefully. A missing data block means that loops are either shortened or absent. If a CIF processor does not find or have available all blocks, it will still be able to work coherently with what it has.
3. Until now we have tended to think of a 'dataset' as equivalent to a 'data block'. This is increasingly untenable as people use 'global' blocks or split composite structures over 3 or 4 blocks. We have a good story in DDLm for how to describe multi-block datasets in a machine-actionable way; all we need is a way of locating all of the blocks.
4. audit_link.block_code is supposed to be confined to the current data file. I would like to expand this to cover any file. It may be advisable to create a whole new category e.g. _audit_related.

all the best,

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
ddlm-group mailing list

John Westbrook
RCSB, Protein Data Bank
Rutgers, The State University of New Jersey
Institute for Quantitative Biomedicine at Rutgers
174 Frelinghuysen Rd
Piscataway, NJ 08854-8087
e-mail: john.westbrook@rcsb.org
Ph: (848) 445-4290 Fax: (732) 445-4320

International Union of Crystallography
