
Re: [ddlm-group] _enumerated_set.table_id

Hi John:

On Tue, Apr 21, 2015 at 1:19 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Hi James,

 

I agree that _enumeration_set.table_id seems a misfit.  Moreover, I observe that it is not documented in the 2008 DDLm paper.  That paper is aging a bit, but I take the attribute’s omission as an additional signal that it does not serve a role of any major import.


The canonical reference is now J. Chem. Inf. Model., 2012 52(8) pp 1907-1916.

 

Moreover, I agree that the particular usage you found is troublesome.  It might well be sensible to describe the allowed keys of a particular table via an enumerated set, but in that case those keys would be the *values* (states) expressed by the enumeration, hence the table_id attribute is superfluous.  (I guess that’s pretty much what you said, too; please bear with me as I get my DDLm brain engaged.)

 

I think we agree then that it is superfluous and can be dropped (or simply not picked up by COMCIFS).
 

More generally, I agree that there should be a mechanism for DDLm dictionaries to constrain, on a per-item basis, the form that tables may take.  The greatest expressive power in that area would involve being able to specify which keys are allowed (including the possibility of free-form keys), which of those are required, and what type of value must be associated with each key. To do that in full generality would require allowing the types of values inside a table to be defined in terms of other _type definitions in the dictionary, or something equivalently powerful.  Inasmuch as keys must be strings, I think the existing enumeration facility is probably strong enough to express constraints on keys.
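As a rough illustration of the fully general kind of constraint described above (allowed keys, required keys, a type per value, optionally free-form keys), here is a minimal Python sketch. Nothing in it is DDLm: the function name check_table, the predicates is_text and is_real, and the "scale" key are hypothetical and exist only for this example.

```python
def check_table(table, required, optional=None, allow_other_keys=False):
    """Return a list of problems found in `table` (an empty list means valid).

    `required` and `optional` map each key to a predicate on its value.
    """
    optional = optional or {}
    problems = []
    for key in required:
        if key not in table:
            problems.append(f"missing required key {key!r}")
    for key, value in table.items():
        if not isinstance(key, str):
            problems.append(f"key {key!r} is not a string")
            continue
        check = required.get(key) or optional.get(key)
        if check is None:
            # Key is not named anywhere: only acceptable for free-form tables.
            if not allow_other_keys:
                problems.append(f"unexpected key {key!r}")
        elif not check(value):
            problems.append(f"bad value for key {key!r}: {value!r}")
    return problems


def is_text(value):
    return isinstance(value, str)


def is_real(value):
    return isinstance(value, float)


# Hypothetical constraint: two required text keys, one optional real key.
required = {"file": is_text, "save": is_text}
optional = {"scale": is_real}

ok = check_table({"file": "core.dic", "save": "cell", "scale": 1.5},
                 required, optional)
bad = check_table({"file": "core.dic", "scale": "big", "extra": 1},
                  required, optional)
print(ok)
print(bad)
```

The second call reports three problems: a missing required key, a value of the wrong type, and a key that is not allowed.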


Judging from the demonstration DDLm dictionaries, CIF2 tables are quite rare, and strictly speaking superfluous, as they can be directly transformed into a CIF loop structure with a small loss in concision.  They are used once in cif_core.dic to carry the individual atom form factor contributions to each hkl reflection so that a separate loop keyed on h,k,l and atom type doesn't have to be defined for such intermediate values. I think that anything remotely complicated (e.g. optional keys) would be better described using looped datanames.  This policy would allow us to restrict ourselves to simple cases. Therefore, we could settle for Doug's solution (but see below), with the meaning that the keys given in the _type.contents entry must be present for the item to be valid.  I would however be unruffled if DDLm *didn't* have a mechanism to constrain the form that tables may take on a per-item basis, for the above reasons and those in my next paragraph below.
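The table-to-loop transformation mentioned above can be sketched in Python as follows. This is only an illustration under assumed data: the reflection indices, atom types, numeric values and the function name table_to_loop_rows are made up and are not taken from cif_core.dic.

```python
def table_to_loop_rows(reflections):
    """Flatten {(h, k, l): {atom_type: value}} into loop rows.

    Each row is (h, k, l, atom_type, value), i.e. a loop keyed on
    h, k, l and atom type instead of one table per reflection.
    """
    rows = []
    for (h, k, l), contributions in reflections.items():
        for atom_type, value in contributions.items():
            rows.append((h, k, l, atom_type, value))
    return rows


# Hypothetical data: one table of per-atom-type contributions per reflection.
reflections = {
    (1, 0, 0): {"C": 2.5, "O": 3.75},
    (1, 1, 0): {"C": 2.0, "O": 3.5},
}

rows = table_to_loop_rows(reflections)
for row in rows:
    print(*row)
```

The looped form repeats h, k, l on every row, which is the small loss in concision referred to above.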

Doug’s suggestion doesn’t provide the full expressiveness described above, but it may be reasonable and sufficient for the requirements of any dictionary we currently contemplate supporting.  It is limited at least in that it can express only mappings that _must_ be present or mappings that _may_ be present, but not both.  It appears also to be somewhat limited with regard to the constraints it allows to be placed on values in the defined table type.  Those may be limitations we can live with.

 

 
I have lately been contemplating the level of datavalue complexity we should actually cover in DDLm.  The initial assumption in writing DDLm was that the _type.contents and _type.dimension attributes should be able to describe arbitrarily complex datastructures.  I now think that this is unnecessary, because any inhomogeneous datastructure can be split into its component parts, each of which, I would assert, has a well-defined individual meaning.  The dictionary will necessarily need to describe those individual parts.  The *only* use case I can find (counterexamples welcome) for inhomogeneous datastructures in the demonstration DDLm dictionaries is to conveniently create single-dataitem keys for joined categories, but even this use case can be replaced by e.g. a simple string concatenation.  Any use of the composite structure can be replaced in dREL by access to the individual components, which must be happening already anyway, because the values are inhomogeneous and so must be treated differently.
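The string-concatenation alternative to a composite key might look like the following Python sketch. The separator character, the function names make_key and split_key, and the example component values are assumptions made for illustration; the one real requirement is that the separator never occurs inside a component, otherwise the key cannot be split back unambiguously.

```python
SEPARATOR = "|"  # assumed never to occur inside a component value


def make_key(*components):
    """Join the components of a composite key into a single string."""
    parts = [str(component) for component in components]
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"component {part!r} contains {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def split_key(key):
    """Recover the individual components (as strings) from a joined key."""
    return key.split(SEPARATOR)


# Hypothetical components of a key joining two categories.
key = make_key("C1", "1_555")
print(key)             # C1|1_555
print(split_key(key))  # ['C1', '1_555']
```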

I am therefore planning to suggest that COMCIFS adopt a dictionary authoring policy which explicitly avoids using inhomogeneous datastructures (i.e. Arrays and Tables with values of a single type are OK, mixtures and irregular nesting are not).
 

As a practical matter, though, does DDLm have a way to define that the value for item _type.contents is either a table or a member of an enumeration_set?  In other words, can we write a definition of the proposed extended _type.contents item that DDLm can validate, without changing or adding other definitions?  If not, then perhaps that’s a good reason to consider a more comprehensive solution.


Yes indeed, a _type.contents value which is a table with arbitrary keys, as suggested by Doug, can't be part of a (finite) enumerated list of _type.contents datavalues, and so the current approach to _type.contents wouldn't work. Frankly, however, I think that such tables are not something we particularly need to support (see above), so I would be happy for us to use 'Table' as the _type.contents of _import.get and either leave any detailed validation to software that wishes to execute the dREL method, or define a _type.contents_regex and do regular expression matching.
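One possible reading of the regular-expression option is sketched below in Python. The semantics of a _type.contents_regex attribute are not defined anywhere yet, so this assumes the pattern is applied to each key of each table in the list; the pattern itself, the function name keys_match and the example values are likewise assumptions.

```python
import re

# Hypothetical pattern that such an attribute might carry for _import.get.
KEY_PATTERN = re.compile(r"file|save|mode|dupl|miss")


def keys_match(tables, pattern=KEY_PATTERN):
    """True if every key of every table in the list fully matches `pattern`."""
    return all(pattern.fullmatch(key) is not None
               for table in tables
               for key in table)


good = [{"file": "core.dic", "save": "cell_length_a", "mode": "Full"}]
bad = [{"file": "core.dic", "frame": "cell_length_a"}]

print(keys_match(good))  # True
print(keys_match(bad))   # False
```

A pattern of this kind checks only key names; it says nothing about which keys are required or what values they carry.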

In passing, I note that the _enumeration_set.state values given for _type.contents do not actually correspond to the full list of possible values of _type.contents, because the listed strings can be combined using commas, boolean operators and the functions 'List' or 'Table'.  However, if we adopt a 'no inhomogeneous dataitem' approach, this problem almost completely disappears.

all the best,
James.

 

 

John

 

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

 

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Sunday, April 19, 2015 9:45 PM
To: ddlm-group
Subject: [ddlm-group] _enumerated_set.table_id

 

Dear DDLm group,

(originally sent Feb 5th)

I have been going through ddl.dic with an eye to writing automated dictionary checking routines and came across _enumeration_set.table_id.  This attribute is used precisely once in all the draft DDLm dictionaries (which include all of the previous DDL1 dictionaries), and that is in ddl.dic itself, in the definition of the DDLm _import.get attribute.  The table_id attribute is intended to specify in a machine-readable way the possible values of CIF2 Table keys. In this particular case the CIF2 tables are themselves within a List:

    _type.purpose                Import
    _type.source                 Assigned
    _type.container              List
    _type.contents               Table(Code)
    _type.dimension              [{}]
    loop_
    _enumeration_set.state
    _enumeration_set.detail
    _enumeration_set.table_id
              1             'filename/URI of source dictionary'      file
              2             'save framecode of source definition'    save
              3             'mode for including save frames'         mode
              4             'option for duplicate entries'           dupl
              5             'option for missing duplicate entries'   miss
    loop_
    _method.purpose
    _method.expression
     Evaluation   
;
     With  i  as  import

    _import.get = [{"file":i.file_id, "save":i.frame_id, "mode":i.mode,
                    "dupl":i.if_dupl, "miss":i.if_miss}]
;


Because it is in the _enumeration_set category, the category key _enumeration_set.state must be present when listing these table keys, but instead of listing the actual permitted values, _enumeration_set.state contains meaningless dummy values. table_id then lists table keys, not values, and so any constraints on the values associated with those keys are absent.  This looks like an abuse of the enumeration_set category, when the natural solution, as proposed by Doug du Boulay, is simply to enhance _type.contents, i.e.

 

_type.contents = {"file":URL "save":Code "mode":Code "dupl":Code "miss":Code}

Note that _type.contents is implicitly interpreted (in the demonstration DDLm dictionaries) to describe the elements of a List, not the List as a whole, so the above use is in line with this. I therefore suggest that we drop _enumeration_set.table_id from DDLm completely as there is no use case.
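To make the proposal concrete, here is a minimal Python sketch of how software might check an _import.get value against the table-valued _type.contents above, applying it to each element of the List. The type checks are simplistic stand-ins (real DDLm type validation is more involved), extra keys are silently ignored as an assumption, and the function names and example values are hypothetical.

```python
# The proposed _type.contents for _import.get, written as a Python dict.
CONTENTS = {"file": "URL", "save": "Code", "mode": "Code",
            "dupl": "Code", "miss": "Code"}


def is_nonempty_text(value):
    return isinstance(value, str) and len(value) > 0


def is_code(value):
    # Stand-in for 'Code': non-empty text without whitespace.
    return is_nonempty_text(value) and not any(c.isspace() for c in value)


# Stand-in checks for the two type names used in CONTENTS.
TYPE_CHECKS = {"URL": is_nonempty_text, "Code": is_code}


def element_is_valid(table, contents=CONTENTS):
    """Every key named in `contents` must be present with a value of its type."""
    return all(key in table and TYPE_CHECKS[type_name](table[key])
               for key, type_name in contents.items())


def list_is_valid(value, contents=CONTENTS):
    """_type.contents describes each element of the List, so check each one."""
    return all(element_is_valid(table, contents) for table in value)


# Hypothetical _import.get values.
complete = [{"file": "core.dic", "save": "cell", "mode": "Full",
             "dupl": "Exit", "miss": "Exit"}]
incomplete = [{"file": "core.dic", "save": "cell"}]

print(list_is_valid(complete))    # True
print(list_is_valid(incomplete))  # False
```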

Are we in agreement on this?

James.

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group




