Discussion List Archives


Re: [ddlm-group] Multi block principles

Dear DDLm Group,

First, let me update the specification of the "bare bones" option, taking into account John's comments:

(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).

(ii) The following backwards-compatible assumptions are made for DDLm:

   * A "Set" category is a "Loop" category for which only one row may be presented in a data block

   * The set of "Set" categories for a data block is determined by _audit.schema

   * The key data name of a "Set" category may be omitted from the dictionary if it is never referred to explicitly elsewhere in the dictionary (e.g. using _name.linked_item_id)

   * If a key data name of a "Set" category is omitted from the dictionary, child relationships with that key data name *must* be defined using a new category-level DDLm attribute in order for multi-block presentation of that Set category to be possible.

(iii) If a Set category key is absent from a data file, it and its child data names may be assigned a unique, arbitrary value. If any data blocks need to refer to Set category values in other data blocks, explicit values for the Set category key data names must be provided in both the dictionary and the data blocks.

(iv) If a Set category key data name is absent from both the dictionary and the data file, values are populated as in (iii), as if the key data name and its children existed and held arbitrary, unique values.
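To make assumptions (iii) and (iv) concrete, here is a minimal sketch in Python of how ingesting software might fill in an absent Set category key. The function name, the "auto" value scheme, and the dict-per-block representation are all my own illustrative assumptions, not part of any existing CIF library:

```python
import itertools

# Arbitrary values need only be unique across the aggregate, so a
# simple counter suffices for this sketch.
_auto_ids = itertools.count(1)

def assign_set_key(block, key_name, child_names=()):
    """block: dict mapping data names to values for one data block.
    If key_name is absent, assign an arbitrary value unique across the
    aggregate and propagate it to any listed child data names."""
    if key_name not in block:
        value = f"auto{next(_auto_ids)}"
        block[key_name] = value
        for child in child_names:
            block.setdefault(child, value)
    return block

# Two blocks lacking _cell.id receive distinct arbitrary values:
a = assign_set_key({"_cell.length_a": "6.000(2)"}, "_cell.id")
b = assign_set_key({"_cell.length_a": "5.999(2)"}, "_cell.id")
assert a["_cell.id"] != b["_cell.id"]
```

The point of the sketch is that no value needs to appear in the file: uniqueness within the aggregate is the only property the generated keys must have.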


And, for the sake of comparison, let me restate the "fallback" scheme which we originally envisaged:


(i) No aggregation mechanism was specified

(ii) No principle for distribution among data blocks was specified

(iii) All multi-row Set categories are explicitly provided with key data names by extension dictionaries

(iv) All categories with child data names of Set category key data names are explicitly provided with them


Our goal as I see it is to resolve (i) and (ii), as well as to minimise the work involved in (iii) and (iv).


The constraints we are operating under are:

(i) Current single-data-block CIF data files must remain valid

(ii) If possible the approaches adopted by pdCIF and msCIF should remain valid

(iii) Data spread between multiple data blocks must map into a strictly relational structure

(iv) dREL methods must either remain valid or be updated


Note that I haven't mentioned constraint (iv) previously as I had overlooked it. dREL as we have developed it allows references to values of data names in other categories to be resolved implicitly, based on parent-child relationships of key data names, so as long as these relationships are unambiguous, methods will generally not need to be redefined.


Going back to our "fallback" scheme, my assessment of our current situation is that there is approximate agreement on aggregation (+/- an extra data block) and on how best to distribute data between data blocks. We are less sure about the best way to minimise dictionary writing. The updated bare-bones approach above requires the use of extension dictionaries, but these are in any case unavoidable when completely new categories are defined (e.g. PD_PHASE in powder diffraction). Such dictionaries would additionally add dependencies on the new category(ies) to core categories, but would not always have to write out every key data name definition.


Further comments inline below.


On Wed, 24 Nov 2021 at 09:40, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm Group,

 

James and I seem to agree that it is essential for any multiblock aggregation mechanism to address how to define the relational structure of the aggregate data set.  I take this to be consistent with Herbert’s view as well.


Yes, I agree that having a well-defined relational structure for the aggregate data is essential.

 

I am not especially tied to the particular model mechanism I proposed, but I do think that working out a functional model aggregation mechanism is an important exercise for clarifying the issues involved and focusing on possible solutions.  Working out two such mechanisms would be even better.  From that perspective, I offer additional comments inline below.


Yes.

 

On Tuesday, November 23, 2021 1:08 AM, James H jamesrhester@gmail.com wrote:

[edit]

 

 

The _cif_multiblock proposal should be contrasted with the bare-bones proposal below, which requires dictionary authors to specify inter-category dependencies ahead of time. For example, in John's example multiple chemical formulae require multiple cells. Is this "universal"? Are there different perspectives that are contradictory? Without attempting to determine such implicit inter-category relationships in the core dictionary I'm not sure.

 

 

For the record, in the example, there is lexically one formula and multiple cells.  This might arise, for example, in a diffraction experiment where the specimen is a multicrystal of one compound, with each component having been indexed independently.  Supporting this sort of data de-duplication is one of the multiblock objectives that was presented earlier.  I acknowledge, however, it may not be apparent from my short-form description why the given procedure results in multiple CHEMICAL_FORMULA rows when the input contains only one.


Ah yes, my mistake. So my version of the example would *not* loop chemical_formula and instead become 

#####

data_flattened

_chemical_formula.sum C6H12O6

loop_
_cell.id    # arbitrary cell identifier
_cell.length_a
_cell.length_b
_cell.length_c
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
xyz 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
abc 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90

#######

 

I claim that we know from the coexistence of their definitions in the Core dictionary that there is or can be a formula that goes with each cell.  In that sense, yes, it is universal.  But on the other hand, what does it matter?  I infer that the concern is that it might be possible to express nonsensical relationships, but what if it is?  I am not much concerned with an opening for creating bad data as long as it is not unreasonably difficult to create good data.  I am much more interested in the ability to express all the information I want to convey.


The concern was not about expression of incorrect relationships but about being able to maximise the amount of information that is automatically deducible from the dictionaries. I also am not overly concerned with allowing bad data to be created, if it becomes possible to express complex but good data.

[edit]

 

I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:

 

1. It must identify all of the data blocks comprised by an aggregate data set.  This is the essential and most fundamental requirement.

2. It must designate all data names needed for forming or expanding category keys for the aggregate data set.

3. It must specify values for the added key attributes on a per-datablock basis.  Together with the previous, this is what makes a simple aggregate into a data set.

4. It should not depend on adding items to component data blocks.

5. It may define an overall data set identifier, though that is not required.

6. It should be machine actionable.  This is not an essential characteristic in any abstract sense, but I don’t see much scope for an aggregation scheme that is not machine actionable being of interest.

7. It may itself be based on CIF syntax, but that is not required.  I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.

Commenting on these points inline caused gmail to fiddle with the numbering, so I'll collect the comments here:

(1) Agreed. If I place a tar.gz file at the end of a DOI, I think I have identified all components of an aggregate data set (everything in the archive).

 

 

I agree that “all the contents of the archive” or “all the contents of the directory” or “all the contents of the CIF” are possible ways of identifying the components of a multiblock data set, but I do not accept that those are necessary interpretations of data delivered in an archive, or in a directory, or in a CIF containing multiple data blocks. Nor do I accept that the whole contents of X is always the best or most appropriate way to identify components of a multiblock set.  There is room for different mechanisms to do this differently, or for a given mechanism to allow multiple options.


Yes, indeed. I think it would be ideal if every such aggregate did include a "master" data object defining the contents. The default in the absence of such a master block would be everything in the aggregate, and if no contradictions (in the relational sense of two rows with the same key values but different attribute values) exist in either case, that is a valid dataset.  dREL expressions could also act as a further check.

My concern with the "master" block is that generating a correct master block for a large multi-object item might be quite difficult, whereas collecting a group of consistent files together is often much easier. So, for example, I might have 180 ADSC raw images, calibration files, XDS input and output files for those images, and a CIF file generated from a refinement of the integrated intensity data. I know that this forms a complete, consistent dataset, which someday in the future could be described relationally with data names that don't yet exist in CIF. Being unable to create a sensible cif_multiblock file should not preclude me from calling this aggregate a consistent dataset. Also, if I later add the calibration files but don't regenerate the cif_multiblock file, then the calibration files would not form part of the dataset, which is a concern.
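The "no contradictions" test mentioned above (two rows with the same key values but different attribute values) is mechanically checkable. A hypothetical sketch, with rows represented as plain dicts and the function name my own invention:

```python
def find_contradictions(rows, key_names):
    """Return pairs of rows that share key values but differ in some
    attribute.  rows: all presented rows of one category, gathered
    across every data block in the aggregate."""
    seen = {}
    clashes = []
    for row in rows:
        key = tuple(row.get(k) for k in key_names)
        if key in seen and seen[key] != row:
            clashes.append((seen[key], row))
        else:
            seen.setdefault(key, row)
    return clashes

# Same key 'xyz' but differing cell lengths: a relational contradiction
rows = [
    {"id": "xyz", "length_a": "6.000(2)"},
    {"id": "xyz", "length_a": "6.001(2)"},
]
assert find_contradictions(rows, ["id"])
```

Under the default "everything in the aggregate" rule, an empty result from such a check (for every category) is what would make the aggregate a valid dataset; identical duplicate rows are harmless and simply merge.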

 

 

(2) We can specify that either a Set category key data name is given explicitly (e.g. phase_id is provided in each data block), or else an arbitrary value distinct from all other values for that data name in the aggregate is chosen. In that case there is no need for a separate designation I would have thought.

 

 

At present, DDLm Set categories do not have category keys (unless we accept some kind of anonymous implicit key).  DDLm expressly specifies that _category_key.name gives a data name that is part of a category key for a Loop category, and the DDLm core dictionary accordingly does not define category keys for Set categories. There are no existing key data names available to be given for Set categories, so if we want any then we need either to generate them or to designate them.  And it’s not really any different with Loop categories.  Although these do have category keys, it is a natural use case that we would want to present multiblock data in which one or more loop categories’ keys are expanded with additional data names.  For example, a multiblock data set providing information about multiple distinct structures might need to present ATOM_SITE data for each one, which would require expanding the ATOM_SITE key to distinguish among the rows for the different structures and to avoid the risk of duplicate keys.


Yes, this is precisely how we would like to operate. We are looking for a way to conveniently expand the keys of the necessary categories, whether Set or Loop. 

 

But I feel like I may have lost the thread of the conversation.  I thought one of the objectives was to avoid creating permanent, formal extension dictionaries to support key extensions for the various multiblock scenarios that are not adequately supported for category keys as defined in our present dictionaries.  A combinatorial argument against that was raised, and an argument about how that would result in adding key data names that we would prefer to avoid including literally in multiblock CIF documents.  So what is the target, actually?


Yes, that is our goal. I've added in a discussion of our goals up the top of the post.

[edit out bare bones, see above for new version]

 

So again, (ii) supposes that there are Set category keys, and in particular, keys that can take multiple values.  I will understand this as implying that the relevant categories are actually Loop categories in a possibly-virtual extension dictionary obtained by converting some of the Set categories from a base dictionary.  Furthermore, this then seems to be proposing that contrary to our usual expectations for valid CIF, data for a category may be presented without presenting the (full) category key, at least in the special case where at most one value is presented for each attribute of the category in a given CIF.  This is supported in part by engaging a unique key generator (notionally, at least).


Yes, that is where I'm going with this. I've added in some assumptions to the bare-bones approach above to flesh that out.  I don't think I've bent DDLm too out of shape?

 

 

Requirement (2) is met by (ii) by not eliding any key data names whose values are not arbitrary.

 

 

I think we have a misunderstanding here.  By (2) I mean that the identities of all the data names composing each category key must be established deterministically by the aggregation scheme.  This supposes that these might be a superset of those defined by any given category’s dictionary definition.  The “bare bones” scheme seems to assume that these are always defined in an existing dictionary, no extension required.  That does satisfy (2), but that has little to do with the ability or duty to elide presentation of any data names.


I hope the rewritten "bare bones" explains how I imagine this would happen.

 

 

Note that such values a priori existed if the data made sense before output, regardless of any standard we might be talking about, because if the values are not arbitrary they must be referred to by a defined data name, and for that reference to make sense in software it must have a referent.

 

 

I do not accept that.  At minimum, collocation in the same CIF data block establishes relationships between data from different categories that is not expressed via any data name.  Those relationships need to be preserved when multiple blocks are combined into a larger data set, without also establishing unwanted relationships.  That will sometimes require new data names to be chosen and retained for which no data items exist in the dictionary or in component data blocks.

 

I think there are other plausible cases, too.


I am imagining that the aggregation into a single relational structure occurs virtually, rather than it being possible to print out a merged data block with agreed data names for all items. In the case of data values collocated in a single data block, the intention is that all of those categories would be linked by a web of implicit relationships, for which explicit key data names do not need to be defined (under some helpful assumptions not yet specified). This behaviour is visible with the PDB. If you think of the PDB database itself as the result of merging all mmCIF/PDBx single-data-block submissions (which I think is fair), it is not always the case that every category is supplied with a data name pointing back to _entry.id (e.g. the atom_site category), so some internal data names are notionally added to such categories when each separate mmCIF data block is merged into the PDB database.

I'm of course open to discussion as to why such implicit data names and values would be harmful, and it may be the case that we can have a hybrid solution where in simple cases we can use implicit key data names but in more complex situations we have to define them explicitly. I would like to see an example of such a complex situation as none comes to mind at the moment.

 

 

Suppose, in the scheme proposed by John, that the special _cif_multiblock data block is absent in an aggregate of data, but the requirements in (ii) above are met: as far as I can tell software is able to populate a full relational schema without problems.  So I'd like to see a case where the cif_multiblock block provides information that the bare bones approach would not.

 

 

The example I presented is already such a case, as understood in relation directly to the (DDLm) Core dictionary.  The CHEMICAL_FORMULA category does not have a category key containing any data names, so no key value for it can be expressed explicitly in the data, neither for the CHEMICAL_FORMULA data themselves, nor for the CELL data to refer to.  Additionally, the CELL category does not have a category key containing any data names either, and without that, the combined data set, expressing two rows of CELL data, cannot be valid against the Core dictionary.


So, in my (corrected) example for bare bones, the cif core dictionary would be equipped with information as to which categories CELL depends upon (if any).  CELL can now have two rows, as I'm proposing that a "Set" category is nothing more than a "Loop" category that can have only one row in a data block, and therefore does not need an explicit key data name in single-data-block presentations (at least). What I now don't understand in John's example is why there need to be two entries for chemical_formula: either the two components accidentally have the same chemical formula, or the chemical formula is independent of the cell and so only one line is required. In the former case, which would correspond to the composite structure dictionary, a separate category is defined, of which both of these categories would have child key data names. The key data names could be implicit if we simply provide this information in the CELL and CHEMICAL_FORMULA category definitions.

 

At minimum, to be comparable with the cif_multiblock scheme, the bare bones scheme needs to be extended with a description of how and under what circumstances category keys are created / expanded, and what the validity requirements are for the resulting combined data set.


Hopefully I have done this above?

[edit out original and incorrect bare-bones example]

 

No doubt a mechanism could be defined that does generate that result without reference to instructions included with the data, but the “bare bones” approach, as defined, does not seem to be such a mechanism.  More on that below.

 

Where the ingesting software has followed the following logic:

 

1. Set category "chemical_formula" is repeated, with differing values, and no category key value is provided, so arbitrary, unique values for the key data name are assigned (xyz and abc)

2. Set category "cell" is repeated. As it depends on "chemical_formula", and this is repeated, no further action is taken.

3. The list of dependent categories for "chemical_formula" is provided with values for the child data names of chemical_formula.id

 

Let me update that logic now that I've updated the "bare bones" example and taking into account John's points below:

1. Set category CELL is repeated, *with differing values*, triggering the bare bones logic. An arbitrary key data name and values are assigned to CELL.
2. The dictionary specifies all other categories that are directly dependent on CELL. They are also assigned child key data name values. In this example there are no such categories
3. All categories in all data blocks are now checked for repeated rows for the same key data names: CHEMICAL_FORMULA has such a row (zero key data names) which is identical, and so it is merged into one row.

So the key difference in the bare bones approach is that, because the dictionary does not specify any connection between cell and chemical formula, the chemical formula category is not expanded.  You could argue that the information that a particular chemical formula is associated with different cells in different data blocks is lost, but the alternative view is that the same information had been duplicated between data blocks when the "original" data were sliced.
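The three steps above can be sketched in Python. The representation (a list of dicts mapping category name to a single row) and the hard-coded "id" key name are simplifying assumptions of mine, not a proposed implementation:

```python
import itertools

_ids = itertools.count(1)

def merge_blocks(blocks, set_category, dependents=()):
    """blocks: list of dicts mapping category name -> row (a dict of
    data-name: value).  Returns merged rows per category."""
    rows = [b[set_category] for b in blocks if set_category in b]
    if len({tuple(sorted(r.items())) for r in rows}) > 1:
        # Step 1: the Set category is repeated with differing values,
        # so assign an arbitrary unique key to each block's row.
        for b in blocks:
            if set_category in b:
                key = f"auto{next(_ids)}"
                b[set_category]["id"] = key
                # Step 2: propagate child key values to any categories
                # the dictionary declares as dependent on set_category.
                for dep in dependents:
                    if dep in b:
                        b[dep][f"{set_category}_id"] = key
    # Step 3: merge rows across blocks, collapsing identical rows.
    merged = {}
    for b in blocks:
        for cat, row in b.items():
            merged.setdefault(cat, [])
            if row not in merged[cat]:
                merged[cat].append(row)
    return merged

blocks = [
    {"cell": {"length_a": "6.000"}, "chemical_formula": {"sum": "C6H12O6"}},
    {"cell": {"length_a": "5.999"}, "chemical_formula": {"sum": "C6H12O6"}},
]
m = merge_blocks(blocks, "cell")
assert len(m["cell"]) == 2              # keyed and kept separate
assert len(m["chemical_formula"]) == 1  # identical rows collapse
```

Because the dictionary (here, the empty `dependents` list) declares no connection between CELL and CHEMICAL_FORMULA, the formula category is never expanded, which is exactly the behaviour described above.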

I don’t see how this is consistent with “If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.”  Explicit values for the Set category keys are *not* provided (and cannot be, because Set categories have no key data names), yet it is desired to form cross-data-block associations.  It seems, then, like bare bones should not even be applicable to the example input.  Or at least I assume that that provision is meant as a constraint on input, for if it were meant instead as a description of the result then it would leave a big gap in the “getting from here to there” department.

With respect to the specific steps above,

(1) No, the CHEMICAL_FORMULA category is _not_ repeated in the original multiblock representation.  There is no key data name either, but if we suppose that a virtual one is synthesized then yes, we can choose a value for it to go with the one provided value of _chemical_formula.sum (just ‘xyz’, then).

My mistake, corrected.

(2) Yes, the CELL category is repeated, but why does it depend on CHEMICAL_FORMULA (see also below)?  How does it matter to the bare bones procedure, as given, which categories are repeated or have multiple values?

My mistake, it does not depend on chemical formula.

(3) If CHEMICAL_FORMULA did have multiple values, as supposed, then it’s unclear to me how would those be matched to the CELL data.

 

I think we could define some automatic matching for simple cases like this, but I expect that such a scheme would struggle to handle more complex cases.  I think it would also be difficult to handle cases where a relationship could be formed, but should not be.

 

 

Importantly, steps (2) and (3) depend on a list of dependent categories being provided by the dictionary.

 

 

I certainly agree that steps (2) and (3) have such a dependency, but I don’t see why a dictionary would define one of the two categories involved as being dependent on the other, unless specifically to serve this particular pattern of multiblock combination.  That was among the reasons that I chose these particular categories for the example.  A chemical formula is meaningful without a unit cell.  A unit cell is meaningful without a chemical formula.  Neither has a functional dependency on the other.


Exactly. So under the bare-bones scheme, if the chemical formulae were different in your example, it would be an example of inconsistent data, and the data blocks cannot be merged. However, if we introduce a new Set category "COMPONENT", describing a single component of a multi-component object, then both "CELL" and "CHEMICAL_FORMULA" will be dependent on this and they will have child data names implicitly present as long as items from the COMPONENT category are present in each data block. I like this behaviour as it allows rejection of some datasets that are incomplete. It also means that particular multi-block dataset presentations are linked to particular dictionaries.

 

And there is a variety of other pairs of categories having a bidirectional association that does not involve one being dependent on or subordinate to another.  If we are to provide for multiblock aggregation schemes that rely on per-category lists of associated categories, then we need to think out the details carefully.  For sure, “depends on” is not a sufficient predicate for deciding about the contents of such lists, unless we want only very narrowly scoped facilities, and probably only very few of them.


Hopefully my component example above gives an idea of how I think this could work - the bidirectionality of the two categories is resolved because they both depend on some other category. We have both powder diffraction and the composite structure dictionary as test beds, as well as theoretical examples with Laue diffraction, and combinations of those three. "A depends on B" does need to be precisely defined. Relationally, it means that a subset of A's key data names can be used to access B. How does that play out in practice?

Let's look at ATOM_SITE. With powder CIF, I know that it depends on which phase you are talking about. With msCIF, I know it depends on which component. But what about core CIF? _atom_site.fract_{x,y,z} are values for model parameters, so they do not depend on either CELL or SPACE_GROUP. However, a number of other quantities in ATOM_SITE require the space group in order to be calculated (e.g. multiplicity). So I conclude that ATOM_SITE depends on SPACE_GROUP but not CELL.
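The multiplicity point can be illustrated with a toy calculation: site multiplicity is just the number of distinct symmetry images of the fractional coordinates, so only the symmetry operations (SPACE_GROUP) enter, never the cell lengths or angles. A hypothetical sketch, using P-1 as the example group:

```python
from fractions import Fraction

def multiplicity(site, ops):
    """site: fractional coordinates; ops: symmetry operations given as
    functions (x, y, z) -> (x', y', z').  Images compared modulo 1."""
    images = {tuple(c % 1 for c in op(*site)) for op in ops}
    return len(images)

# Space group P-1: identity plus inversion through the origin
p1bar = [lambda x, y, z: (x, y, z),
         lambda x, y, z: (-x, -y, -z)]

f = Fraction
assert multiplicity((f(1, 4), f(1, 10), f(1, 3)), p1bar) == 2  # general position
assert multiplicity((f(0), f(0), f(0)), p1bar) == 1  # on the inversion centre
```

Nothing about the cell appears anywhere in the calculation, which is the sense in which ATOM_SITE's derived quantities depend on SPACE_GROUP but not CELL.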

I need to continue this exercise with a number of other categories to see (i) whether there are unambiguous conclusions and (ii) how introducing a new category, e.g. "PHASE", flows through.  Explicit specification via cif_multiblock might turn out to be the lowest-effort option.

all the best,
James.

 

 

Best regards,

 

John

 





--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
