Discussion List Archives

Re: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider

Dear John and group,

I've discussed John's points within his email, below.

On 26 May 2016 at 00:52, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear James and DDLm group,

 

I’m not sure that I have fully comprehended the proposal to alter the meaning of the 'Set' definition class, so let me try to summarize in my own words:

 

(*) Presently, DDLm categories defined as 'Sets' contain items that must not be looped, or at least must not appear in multi-packet loops.  Items in such categories take at most one value per data block or save frame.


Yes

 

(*) The choice between the 'Set' and 'Loop' category classes is made by dictionary developers based on the envisioned use of the category in data files.  For example, the SYMMETRY category in the DDLm version of the core dictionary is defined to be a ‘Set’ because the dictionary is structured around the idea that each data block or save frame in a data file describes at most one structure, and a structure has exactly one set of symmetry information.


Yes

 

(*) Substantially the same item may be relevant to different kinds of overall data sets, and the appropriate choice between 'Set' and 'Loop' (as they are presently defined) may vary between kinds of data sets.  This mismatch prevents some desired re-uses of definitions across dictionaries.


Exactly

 

(*) To enable the desired kinds of re-use, it is proposed that the 'Set' category class be redefined to require uniqueness only with respect to a category key.  New constraints are placed on the other categories that can appear in the same block or frame, so as to ensure that each datum can be associated with at most one value for any item in any 'Set' category.


This is correct, although note that it is already trivially the case that each datum is associated with at most one value for any item in the 'Set' category. The idea is now to allow *only* those categories that explicitly depend on the 'Set' category key to be included in the data block when the 'Set' category is looped, to maintain this unique association.  Thus existing users of the 'Set' category see no change (as the categories they use did not, and will not, have a key dataname pointing to the 'Set' category), and so the onus is on those who would loop the 'Set' category to come up with new categories that include the 'Set' category key within their loops.

 

Based on that understanding of the proposal:

 

1. I am concerned about the proposed new constraint on other categories that may appear in the same container with a 'Set' category.  I think I understand the purpose, but I also think this will be easier to get wrong and more complicated to validate.  Moreover, it introduces an unresolved conflict with categories that really ought to be 'Sets' as they currently are defined, as the proposal itself acknowledges with respect to the AUDIT category.


If a category key is not defined for a 'Set' category, then it behaves as it always has. So, if 'AUDIT' is a Set category, and no category key is defined, then no changes occur. The validation complexity is essentially "1. Is this 'Set' category looped? 2. If so, list the categories in the datablock. Do they all have a dataname that points to the 'Set' category?"  This cost in complexity (which is almost exclusively borne by those who have the unusual use-cases, as it only arises if step 1 gives 'yes') is what we should balance against the benefit of increased utility.
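
As a rough sketch of that two-step check (in Python; the in-memory layout, the function name and the `key_links` table are hypothetical illustrations, not part of any existing DDLm tool):

```python
def find_unlinked_categories(block, set_category, key_links):
    """Return the categories that break the proposed rule (empty list = valid).

    block:      hypothetical model of a data block, {category: [packet, ...]}
    key_links:  hypothetical table, {category: set of categories that its
                key datanames point to}, as read from the dictionary
    """
    packets = block.get(set_category, [])
    # Step 1: is the 'Set' category looped (more than one packet)?
    if len(packets) <= 1:
        return []  # unlooped: the category behaves as it always has
    # Step 2: every other category present must have a key dataname
    # pointing to the looped 'Set' category.
    return sorted(
        cat for cat in block
        if cat != set_category
        and set_category not in key_links.get(cat, set())
    )


# Example: space_group is looped; model_site has no key pointing to it.
block = {
    "space_group": [{"id": "1"}, {"id": "2"}],
    "atom_site_with_spacegroup": [{"label": "C1", "space_group_id": "1"}],
    "model_site": [{"label": "C1"}],
}
key_links = {"atom_site_with_spacegroup": {"space_group"}}
print(find_unlinked_categories(block, "space_group", key_links))
```

With the example data above, only model_site is reported, since it is the one category that does not declare a key pointing to the looped 'Set' category.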

 

2. The proposed change almost completely erases the distinction between 'Set' and 'Loop' categories.  I am not convinced that retaining the two as separate classes with such a fine distinction between them is the best course of action.


DDL2 adopted the position that 'everything is a loop', and indeed, in ontological terms any concept you can think of is 'looped': you can always imagine situations where multiple values are available. 'Set' categories are simply a space-saving device.  To see this, consider that the numbers that we assign to the fractional atomic coordinates depend on (i) which atom we are considering, (ii) the space group, and (iii) the cell parameters.  For the vast majority of structural descriptions, we only need a single space group and a single set of cell parameters. Instead of providing two extra columns in our atom_site table (one for space group, one for cell parameters) where every entry in each column repeats the first entry, we create the concept of 'overall space group' and 'overall cell parameters' and save lots of space.

DDL1 and DDLm enshrine this shortcut in machine-readable attributes.  DDL2 does not, so mmCIF instead produces exactly the same 'overall information' semantic effect as a 'Set' category through an exclusively human-readable definition (see the definition for entry.id) that simply states that there should be only one value per datablock for this item. The mmCIF aliases to the DDL1 datanames are thus precise. (The real progress in DDL2 in this context was syntactic: 'overall information' datanames may appear in loops, so programmers are prepared for this, unlike with DDL1.)

So the distinction is not so much 'Set' and 'Loop', but 'overall information' and 'per datum information'.  What the proposal describes is under what circumstances you can re-use the 'overall information' dataname as a 'per datum' dataname.
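
To make the 'space-saving device' point concrete, here is a small Python sketch of the two equivalent views of the atom_site example. The function names and the row layout are my own illustrative assumptions, not anything defined by DDLm or mmCIF:

```python
def expand_overall(overall, rows):
    """Fully looped view: repeat each 'overall' value in every per-datum row."""
    return [{**row, **overall} for row in rows]


def factor_out_overall(rows, names):
    """'Set' view: pull out the columns in `names`, provided each one holds
    a single repeated value across all rows."""
    overall = {}
    for name in names:
        values = {row[name] for row in rows}
        if len(values) != 1:
            raise ValueError(f"{name} varies per row, so it is not 'overall' information")
        overall[name] = values.pop()
    slim = [{k: v for k, v in row.items() if k not in names} for row in rows]
    return overall, slim


# Hypothetical atom_site rows with one overall space group.
atoms = [{"label": "C1", "fract_x": 0.1}, {"label": "O1", "fract_x": 0.4}]
looped = expand_overall({"space_group": "P 21/c"}, atoms)
overall, slim = factor_out_overall(looped, ["space_group"])
```

The second function only succeeds when the extra column really is constant, which is exactly the situation in which a 'Set' category can stand in for it.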

3. I am not fond of how conditional the proposed new definition text is.


Neither am I, but I don't believe it is possible to come up with anything simpler that meets the requirements.  Whether the requirements are justified may also be debated, of course.
 

4. It seems likely that all existing methods of current 'Set' items would be broken by the proposed change.

 
Actually, far fewer than you might expect. If you look at the definition for _cell.volume, it works just as well on a list of cells as on a single cell.  This is inherent in the design of dREL.  The methods that fail are those that expect overall ('Set') information in a Looped category. For example, when the model sites are calculated (see the category method for model_site), the method assumes a single value of _atom_sites_Cartn_transform.mat. This matrix will be looped if the cell parameters are looped, and so the model_site method will fail.  For this reason I have stipulated the condition that any category appearing with a looped 'Set' category must explicitly know that the 'Set' category can be looped, by defining a key that points to the 'Set' category key.  This ringfences the currently existing categories, such as model_site, that assume overall information, from any effect of this change.
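
To illustrate why a per-packet method such as the one for _cell.volume is indifferent to looping, here is a Python analogue (not the dREL method itself; the packet layout and function names are assumptions for illustration). It uses the standard triclinic cell volume formula:

```python
import math


def cell_volume(cell):
    """Volume of one cell packet from its lengths and angles (degrees)."""
    a, b, c = cell["length_a"], cell["length_b"], cell["length_c"]
    ca, cb, cg = (
        math.cos(math.radians(cell[name]))
        for name in ("angle_alpha", "angle_beta", "angle_gamma")
    )
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def cell_volumes(cell_packets):
    """The same calculation applied packet by packet to a looped cell category."""
    return [cell_volume(packet) for packet in cell_packets]


# One packet (the current 'Set' usage) or several (the looped usage):
cubic = {"length_a": 10.0, "length_b": 10.0, "length_c": 10.0,
         "angle_alpha": 90.0, "angle_beta": 90.0, "angle_gamma": 90.0}
ortho = dict(cubic, length_b=5.0)
volumes = cell_volumes([cubic, ortho])
```

Because the calculation only ever refers to values within its own packet, nothing changes when there is more than one packet. A method that reaches into another category and expects exactly one value there has no such guarantee.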

 

 

My present thinking is that changing specific 'Set' categories into bona fide 'Loop' categories would be better than making all 'Sets' loop-like without actually making them 'Loops'.  This could be reconciled with existing data files by introducing a mechanism for defaulting category key values or by allowing category keys to be omitted from category data when only one set of data from that category is presented.  I think an approach along these lines could solve the problem at hand while addressing my concerns 1-3.  I am uncertain whether a solution is possible that fully addresses my concern #4, but if we convert 'Sets' into 'Loops' only selectively, then at least we narrow the scope of the problems with methods, and perhaps also allow an incremental approach to be taken for updating dictionaries.

 

Unfortunately I don't believe that your suggested change would work, because (i) we need to take into account the response of existing software to datafiles with looped 'Set' categories, and (ii) we still need a way to indicate overall data in order to match DDL1 dataname meanings.

Regarding (i), while a system of defaulting key values and omitting them if only a single item is defined is a consistent description of currently-existing datafiles, the real issue is the opposite: what will happen when current software is faced with a file that *does* have multiple values in a 'Set' category?  Are we sure that it will not silently, for example, calculate too many atomic sites because we have listed symmetry operators from multiple space groups?

Regarding (ii), we gain nothing by use of the PDB 'entry.id' trick as we are once again left with single-valued categories that can't have more than one datum.

Regarding altering dREL methods, I think it is worth realising that the dREL methods express the mathematical relationships that are 'out there' in correctly-written user software. We can no more change a correctly-written dREL method than we can change all the software that expresses that method, and as soon as any proposal involves rewriting dREL methods, that proposal is effectively a non-starter for core_CIF.  The converse is that, if we somehow control all the software that expresses that method, then we can change it.

A clean alternative is simply to define a whole new set of datanames corresponding to a 'Set' category becoming looped.  This could be done economically with (i) a category definition using a single new DDLm attribute to say that all the datanames in the new category are derived from the datanames in the 'Set' category by just changing the name.category_id component, and (ii) a similar new category definition for each category which now needs a key dataname pointing to the newly-looped category.  To be honest, this would be my preferred option, although we would be stuck with space_group as a legacy category.  Anyway, under this alternative, machine transformation of datafiles between the 'Loop' and 'Set' versions would be straightforward, and would capture all the changes in dependent categories (so the space group column in the atom_site_with_spacegroup category would disappear upon transformation to _atom_site, and multiple data blocks, one for each space group, would be generated).
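
A minimal Python sketch of the 'Loop' to 'Set' direction of that transformation follows. The function name, the row layout and the `space_group_id` key dataname are hypothetical; only the overall behaviour (one data block per space group, with the space group column dropped) follows the description above:

```python
def split_by_set_key(set_packets, dependent_rows, set_key="id", link="space_group_id"):
    """Split a looped description into one block per packet of the looped category.

    set_packets:     packets of the newly-looped category (e.g. one per space group)
    dependent_rows:  rows of a dependent category whose `link` column points
                     at `set_key` in the looped category
    Returns {key value: {"set_packet": packet, "rows": [row without link column]}}.
    """
    blocks = {
        packet[set_key]: {"set_packet": packet, "rows": []}
        for packet in set_packets
    }
    for row in dependent_rows:
        # The linking column disappears: within its own block the
        # association with the single 'Set' packet is implicit.
        slim = {name: value for name, value in row.items() if name != link}
        blocks[row[link]]["rows"].append(slim)
    return blocks


space_groups = [{"id": "1", "name": "P 1"}, {"id": "2", "name": "P -1"}]
atom_sites = [
    {"label": "C1", "space_group_id": "1"},
    {"label": "C2", "space_group_id": "2"},
    {"label": "C3", "space_group_id": "1"},
]
blocks = split_by_set_key(space_groups, atom_sites)
```

The reverse direction would simply re-attach the block's key value to every dependent row and concatenate the blocks.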

What do you think of that?

all the best,
James.

 

Regards,

 

John

 

 

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

 

 


--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
