Discussion List Archives

Re: [ddlm-group] Multi block principles

Dear DDLm Group,

Having had some time to think about this further, I've had the following, not entirely groundbreaking, thought: there is no need to simplify the work of dictionary authors, because we have developed a dictionary style guide. The style guide makes it possible to write automated tools that process dictionaries without introducing spurious changes in ordering or whitespace, which would otherwise confuse people looking for substantive differences. Therefore, rather than simplifying the task of dictionary authors by defining special behaviour and automatic, implicit parent-child key data name relationships, it would be sufficient to produce a tool that is fed a list of category relationships and a dictionary and outputs an updated dictionary with the relevant key data names defined appropriately. This in turn means that human dictionary readers can continue to simply look at the list of category keys and associated definitions to understand relationships.
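To make the idea of such a tool concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions: the dictionary is modelled as a plain Python mapping rather than a real DDLm file, and the function name add_child_keys, the "keys"/"links" layout, the naming scheme for new key data names and the data names _phase.id and _atom_site.label are all hypothetical.

```python
from copy import deepcopy


def add_child_keys(dictionary, relationships):
    """Return a copy of `dictionary` with child key data names added.

    dictionary    -- {category: {"keys": [data names], "links": {child: parent}}}
                     (a hypothetical in-memory model, not the DDLm file format)
    relationships -- list of (child category, parent category) pairs
    """
    updated = deepcopy(dictionary)
    for child, parent in relationships:
        links = updated[child].setdefault("links", {})
        # Only keys present in the input dictionary are propagated, so
        # chained relationships would need a second pass.
        for parent_key in dictionary[parent]["keys"]:
            # Hypothetical naming scheme: _<child>.<parent>_<object id>
            object_id = parent_key.split(".", 1)[1]
            child_key = f"_{child}.{parent}_{object_id}"
            if child_key not in updated[child]["keys"]:
                updated[child]["keys"].append(child_key)
            links[child_key] = parent_key
    return updated


dictionary = {
    "phase": {"keys": ["_phase.id"], "links": {}},
    "atom_site": {"keys": ["_atom_site.label"], "links": {}},
}
updated = add_child_keys(dictionary, [("atom_site", "phase")])
print(updated["atom_site"]["keys"])
```

Because the result is an ordinary, explicit list of keys and links, a human reader of the updated dictionary needs no knowledge of the tool that produced it.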

I was also operating under the misguided assumption that implicit key data names and their children could be deduced for all Set categories in our current dictionaries. I'm no longer sure that this is true, in which case explicit specification of category keys and links is going to be required. I note that the "cif_multiblock" proposal assumed this from the start.

So an updated "bare bones" proposal reads as follows:

(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).

(ii) The following backwards-compatible assumption is made for DDLm: a "Set" category is a "Loop" category for which only one row may be presented in a data block.

(iii) The set of "Set" categories for a data block is determined by _audit.schema.

(iv) A "Set" category may only be aggregated from multiple data blocks if a key data name for the "Set" category has been provided in the dictionary to which those data blocks conform.

(v) Where a "Set" category has been provided with a key data name in the dictionary as per (iv), all child data names must also be provided in the dictionary.

(vi) The value of a "Set" category key data name for a given data block may be explicitly stated; if it is missing, an arbitrary, unique value is assigned.

(vii) Values for child data names of "Set" category key data names may always be elided.

In this case the example provided by John would remain as two separate data blocks, as the appropriate "Set" category key has not been defined.
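Rules (iv) to (vii) can be sketched in Python as follows. This is an illustration only, not a reference implementation: data blocks are modelled as plain mappings from category to rows, the key and child definitions that would come from the dictionary are passed in as arguments, and the names aggregate, _phase.id and _atom_site.phase_id are hypothetical.

```python
import uuid


def aggregate(blocks, set_keys, children):
    """Merge data blocks into one data set.

    blocks   -- list of {category: [row, ...]}, each row a {data name: value} dict
    set_keys -- {Set category: key data name}, as defined in the dictionary (iv)
    children -- {key data name: [(category, child data name), ...]} (v)
    """
    merged = {}
    for block in blocks:
        # Work on copies so that the caller's blocks are left untouched.
        block = {cat: [dict(row) for row in rows] for cat, rows in block.items()}
        for cat, key in set_keys.items():
            if cat not in block:
                continue
            # A Set category has exactly one row per data block (ii);
            # this unpacking raises ValueError otherwise.
            (row,) = block[cat]
            # (vi) a missing key value is replaced by an arbitrary unique one.
            value = row.setdefault(key, uuid.uuid4().hex)
            # (vii) elided child values take the key value of their own block.
            for child_cat, child_name in children.get(key, []):
                for child_row in block.get(child_cat, []):
                    child_row.setdefault(child_name, value)
        for cat, rows in block.items():
            merged.setdefault(cat, []).extend(rows)
    return merged


blocks = [
    {"phase": [{"_phase.id": "A"}], "atom_site": [{"_atom_site.label": "C1"}]},
    {"phase": [{}], "atom_site": [{"_atom_site.label": "C1"}]},
]
data_set = aggregate(
    blocks,
    set_keys={"phase": "_phase.id"},
    children={"_phase.id": [("atom_site", "_atom_site.phase_id")]},
)
print(len(data_set["phase"]), len(data_set["atom_site"]))
```

In this sketch the two atom sites labelled C1 stay distinct after aggregation because each receives the key value of the block it came from. "Set" categories without a key in set_keys are not handled here; under rule (iv) their blocks would simply not be aggregated.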

Operation of dREL

The only dREL construct that still needs to be considered is the "Loop" over a category. This would normally consider every row of a category in turn. For example, one might loop over atom sites to determine the density of the compound. However, where a dictionary has equipped a category with extra key data names, the dREL routine needs to decide whether it should still consider all rows or only a subset. So for a multi-phase powder sample, we would want to calculate the density of each compound in turn, not simply use all atom sites as before. There are a number of ways of making this work. For example, a default rule that the loop ranges only over those key data names of the looped-over category that are *not* child or parent data names of the keys of the category in which the calculation takes place would cover most situations that I can see. This rule would not cover surrogate keys, so it might need to be expanded to cover all key data name relationships that can be deduced.
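One possible reading of this default rule is sketched below in Python, standing in for dREL. It assumes that key data names linked to the keys of the calculation category are held at the values of the current row, while all other keys are free to vary. The function rows_in_scope, the "mass" item and the data names used are hypothetical.

```python
def rows_in_scope(loop_rows, current_row, linked_keys):
    """Select the rows of a looped-over category that belong to `current_row`.

    linked_keys maps a key data name of the looped-over category to the
    key data name of the calculation category it is a child or parent of.
    Linked keys are held at the current row's values; all other keys vary.
    """
    return [
        row for row in loop_rows
        if all(row[loop_key] == current_row[calc_key]
               for loop_key, calc_key in linked_keys.items())
    ]


atom_sites = [
    {"_atom_site.phase_id": "A", "_atom_site.label": "C1", "mass": 12.0},
    {"_atom_site.phase_id": "A", "_atom_site.label": "O1", "mass": 16.0},
    {"_atom_site.phase_id": "B", "_atom_site.label": "C1", "mass": 12.0},
]
links = {"_atom_site.phase_id": "_phase.id"}

# Total mass of the listed sites, evaluated once per row of the phase category.
for phase in [{"_phase.id": "A"}, {"_phase.id": "B"}]:
    total = sum(site["mass"] for site in rows_in_scope(atom_sites, phase, links))
    print(phase["_phase.id"], total)
```

With an empty linked_keys mapping every row is selected, which reproduces the existing behaviour of a loop over the whole category.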

I'm wondering if the "bare bones" proposal above is acceptable?

all the best,
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148