Discussion List Archives


Re: [ddlm-group] _enumerated_set.table_id

Hi James,

 

Yes, we agree that _enumeration_set.table_id can be dropped.  I am less sure that we agree on whether it should be replaced with something else.

 

I am prepared to accept these limitations on the data types that can be defined by a DDLm dictionary (including DDLm itself), if indeed DDLm itself and the other existing DDLm dictionaries can be expressed adequately under such constraints:

 

- The allowed types of values within a list cannot depend on their position in the list

- The allowed types of values within a table cannot depend on their associated keys
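
As a rough illustration of the two limitations above, here is a minimal Python sketch of a homogeneity check on already-parsed values.  The function name `signature` and the coarse kinds "Number" and "Text" are assumptions made for this example only; they are not DDLm terms.  Lists are modelled as Python lists and tables as Python dicts.

```python
def signature(value):
    """Return a nested type signature, or None if the value is inhomogeneous.

    Hypothetical helper: 'Number' and 'Text' are coarse stand-ins for real
    DDLm content types, and dimensions are not checked at all.
    """
    if isinstance(value, dict):
        items, tag = list(value.values()), "Table"
    elif isinstance(value, list):
        items, tag = value, "List"
    elif isinstance(value, (int, float)):
        return "Number"
    else:
        return "Text"
    # Every element (or table value) must share one signature, regardless of
    # its position in the list or the key it is stored under.
    sigs = {signature(v) for v in items}
    if None in sigs or len(sigs) > 1:
        return None
    inner = sigs.pop() if sigs else "Empty"
    return f"{tag}({inner})"


print(signature([1, 2.5, 3]))
print(signature([1, "a"]))
print(signature({"file": "a.dic", "save": "x"}))
```

A value whose element types vary with position or key, such as `[[1], ["a"]]`, yields `None` under this sketch, which is the kind of struct-like usage the limitations would rule out.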

 

These assign primacy to categories / loops for defining complex, heterogeneous data, so that it is unnecessary (I think) to be able to define data types that use lists and / or tables analogously to C structs.

 

I am inclined to think that one of the greater weaknesses of the 2012 version of the DDLm dictionary is its provisions for defining complex data types.  They are somewhat inconsistent, and the provided definition text is unclear about exactly how one would go about defining complex data.  Moreover, if _type.dimension is intended to be the primary vehicle for defining complex internal structure then it must bear the weight of an entire schema language.  That seems to be exactly what it’s trying to do, but the details of that language are by no means adequately documented, and it seems an odd approach given that it’s hosted inside another language that itself can serve as a schema language.

 

This is what I think we should do:

 

1. Remove _enumeration_set.table_id.  It doesn’t work well for its intended purpose.

 

2. Redefine _type.dimension so that it is used only to specify the dimension(s) of values of items having _type.container in { 'List', 'Array', 'Matrix' }.  Relieve it of any responsibility for defining element types.  Possibly remove the ability to define ragged multi-dimensional arrays (which conflict with the proposed limitation that allowed types of values within a list cannot depend on their position in the list).

 

3. Clarify that when _type.container has value 'Table', _type.contents defines the characteristics of the *values* in the table.

 

4. Add a replacement mechanism to define constraints on table keys.  It might be sufficient, and consistent with the apparent intent of the current dictionary, to establish a parallel to the _enumeration_set category for constraining key values, maybe _key_enumeration_set.  It would be a smaller change at the dictionary level, however, to add a mechanism by which constraints on key type could be defined by reference to the type of another item (see also next).

 

5. Add a mechanism to allow items' content type to be defined by reference to another item.  This could be signaled by a new code for _type.contents, with a new attribute defining which other item’s type is to be used.  I don’t think that the existing contents code 'Inherited' can serve this purpose, but perhaps I’m mistaken.

 

Allowing types of keys / values to be defined by reference to the types of other items raises the possibility that dictionaries will occasionally want to define items solely for the purpose of defining their content type for reference by other definitions.  I don’t think this is harmful, but it might be best supported by a new value for _type.purpose, as demonstrated below.

 

If all those changes were implemented then the definition for DDLm_import.get might be revised like so:

 

    _type.purpose             'Import'
    _type.container           'List'
    _type.contents            'Text'
    _type.keys                'ByReference'
    _type.key_type_reference  'import.get_key_type'

 

That would require addition of a new attribute to category IMPORT, its definition containing the following (among other necessary attributes not shown):

 

save_import.get_key_type
    # ...
    _type.purpose             'Internal'  # New value
    _type.container           'Single'
    _type.contents            'Code'

    loop_
    _enumeration_set.state
    _enumeration_set.detail
        'file' 'filename/URI of source dictionary'
        'save' 'save framecode of source definition'
        'mode' 'mode for including save frames'
        'dupl' 'option for duplicate entries'
        'miss' 'option for missing duplicate entries'
save_

 

Additional attributes needed in category TYPE would be _type.keys (accepting the same values as _type.contents where those values describe string data), _type.key_type_reference (containing the _definition.id of the referenced item), and _type.contents_type_reference (not demonstrated; analogous to _type.key_type_reference).
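
To make the by-reference idea concrete, here is a minimal Python sketch of how validation software might resolve the proposed _type.key_type_reference.  It assumes definitions have already been parsed into plain dicts keyed by _definition.id; the names `DEFINITIONS`, `allowed_keys` and `invalid_keys` are invented for this example, and the attributes _type.keys and _type.key_type_reference exist only as proposed above.

```python
# Parsed definitions, reduced to the attributes shown in the proposal above.
DEFINITIONS = {
    "import.get": {
        "_type.purpose": "Import",
        "_type.container": "List",
        "_type.contents": "Text",
        "_type.keys": "ByReference",                          # proposed attribute
        "_type.key_type_reference": "import.get_key_type",    # proposed attribute
    },
    "import.get_key_type": {
        "_type.purpose": "Internal",                          # proposed new value
        "_type.container": "Single",
        "_type.contents": "Code",
        "_enumeration_set.state": ["file", "save", "mode", "dupl", "miss"],
    },
}


def allowed_keys(definitions, item_id):
    """Return the set of permitted table keys, or None if keys are unconstrained."""
    definition = definitions[item_id]
    if definition.get("_type.keys") != "ByReference":
        return None
    # Follow the reference and reuse the enumeration of the referenced item.
    target = definitions[definition["_type.key_type_reference"]]
    states = target.get("_enumeration_set.state")
    return set(states) if states is not None else None


def invalid_keys(definitions, item_id, table):
    """List the keys of one table that the referenced enumeration does not allow."""
    allowed = allowed_keys(definitions, item_id)
    if allowed is None:
        return []
    return sorted(key for key in table if key not in allowed)


print(invalid_keys(DEFINITIONS, "import.get", {"file": "core.dic", "bogus": "x"}))
```

The point of the sketch is only that the key constraint lives in an ordinary enumeration on a separate item, so no table_id column is needed.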

 

 

John

 

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Monday, April 20, 2015 10:33 PM
To: Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] _enumerated_set.table_id

 

Hi John:

 

On Tue, Apr 21, 2015 at 1:19 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Hi James,

 

I agree that _enumeration_set.table_id seems a misfit.  Moreover, I observe that it is not documented in the 2008 DDLm paper.  That paper is aging a bit, but I take the attribute’s omission as an additional signal that it does not serve a role of any major import.

 

The canonical reference is now J. Chem. Inf. Model., 2012 52(8) pp 1907-1916.

 

Moreover, I agree that the particular usage you found is troublesome.  It might well be sensible to describe the allowed keys of a particular table via an enumerated set, but in that case those keys would be the *values* (states) expressed by the enumeration, hence the table_id attribute is superfluous.  (I guess that’s pretty much what you said, too; please bear with me as I get my DDLm brain engaged.)

 

I think we agree then that it is superfluous and can be dropped (or simply not picked up by COMCIFS).
 

More generally, I agree that there should be a mechanism for DDLm dictionaries to constrain, on a per-item basis, the form that tables may take.  The greatest expressive power in that area would involve being able to specify which keys are allowed (including the possibility of free-form keys), which of those are required, and what type of value must be associated with each key. To do that in full generality would require allowing the types of values inside a table to be defined in terms of other _type definitions in the dictionary, or something equivalently powerful.  Inasmuch as keys must be strings, I think the existing enumeration facility is probably strong enough to express constraints on keys.

 

Judging from the demonstration DDLm dictionaries, CIF2 tables are quite rare, and strictly speaking superfluous, as they can be directly transformed into a CIF loop structure with a small loss in concision.  They are used once in cif_core.dic to carry the individual atom form factor contributions to each hkl reflection so that a separate loop keyed on h,k,l and atom type doesn't have to be defined for such intermediate values. I think that anything remotely complicated (e.g. optional keys) would be better described using looped datanames.  This policy would allow us to restrict ourselves to simple cases. Therefore, we could settle for Doug's solution (but see below), with the meaning that the keys given in the _type.contents entry must be present for the item to be valid.  I would however be unruffled if DDLm *didn't* have a mechanism to constrain the form that tables may take on a per-item basis, for the above reasons and those in my next paragraph below.
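
The table-to-loop transformation mentioned above can be sketched in a few lines of Python.  This is an illustration only: the function `table_to_loop`, the column names `atom_type` and `contribution`, and the numeric values are all made up for the example and are not taken from cif_core.dic.

```python
def table_to_loop(parent_key, table, key_name, value_name):
    """Flatten one table into loop rows keyed on (parent key, table key).

    parent_key: dict identifying the row that held the table (e.g. h, k, l).
    table:      the CIF2 table value, modelled here as a Python dict.
    """
    return [{**parent_key, key_name: key, value_name: value}
            for key, value in table.items()]


# One reflection carrying a table of per-atom-type values (made-up numbers).
rows = table_to_loop({"h": 1, "k": 0, "l": 0},
                     {"C": 2.5, "O": 3.1},
                     key_name="atom_type",
                     value_name="contribution")
for row in rows:
    print(row)
```

Each table entry becomes one row of a loop whose key is the original key plus the table key, which is the "small loss in concision" referred to above.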

 

Doug’s suggestion doesn’t provide the full expressiveness described above, but it may be reasonable and sufficient for the requirements of any dictionary we currently contemplate supporting.  It is limited at least in that it can express only mappings that _must_ be present or mappings that _may_ be present, but not both.  It appears also to be somewhat limited with regard to the constraints it allows to be placed on values in the defined table type.  Those may be limitations we can live with.

 

 

I have lately been contemplating the level of datavalue complexity we should actually cover in DDLm.  The initial assumption in writing DDLm was that the _type.contents and _type.dimension attributes should be able to describe arbitrarily complex datastructures.  I now think that this is unnecessary, because any inhomogeneous datastructure can be split into its component parts, each of which I would assert has a well-defined individual meaning.  The dictionary will necessarily need to describe those individual parts.  The *only* use-case I can find (counterexamples welcome) for inhomogeneous datastructures in the demonstration DDLm dictionaries is to conveniently create single-dataitem keys for joined categories, but even this use case can be replaced by e.g. a simple string concatenation.  Any use of the composite structure can be replaced in dREL by access to the individual components - which must be happening already anyway, because the values are inhomogeneous and so must be treated differently.

I am therefore planning to suggest that COMCIFS adopt a dictionary authoring policy which explicitly avoids using inhomogeneous datastructures (i.e. Arrays and Tables with values of a single type are OK, mixtures and irregular nesting are not).

 

As a practical matter, though, does DDLm have a way to define that the value for item _type.contents is either a table or a member of an enumeration_set?  In other words, can we write a definition of the proposed extended _type.contents item that DDLm can validate, without changing or adding other definitions?  If not, then perhaps that’s a good reason to consider a more comprehensive solution.

 

Yes indeed, a _type.contents value which is a table with arbitrary keys as suggested by Doug can't be part of a (finite) _type.contents enumerated list of datavalues, and so the current approach to _type.contents wouldn't work. Frankly, however, I think that such tables are not something we need to particularly support (see above), so I would be happy for us to use 'Table' as the _type.contents of _import.get and either leave any detailed validation to software that wishes to execute the dREL method, or define a _type.contents_regex and do regular expression matching.

In passing, I note that the _enumeration_set.state for _type.contents does not actually correspond to the list of possible values of _type.contents, because the listed strings can be combined with commas, boolean operators and functions 'List' or 'Table'.   However, if we adopt a 'no inhomogeneous dataitem' approach, this problem almost completely disappears.

all the best,

James.

 

 

John

 

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

 

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Sunday, April 19, 2015 9:45 PM
To: ddlm-group
Subject: [ddlm-group] _enumerated_set.table_id

 

Dear DDLm group,

(originally sent Feb 5th)

I have been going through ddl.dic with an eye to writing automated dictionary checking routines and came across _enumeration_set.table_id.  This attribute is used precisely once in all the draft DDLm dictionaries (which include all of the previous DDL1 dictionaries), and that is in ddl.dic itself, in the definition for the DDLm _import.get attribute.  This attribute is intended to specify in a machine-readable way the possible values of CIF2 Table keys. In this particular case the CIF2 tables are themselves within a List:

    _type.purpose                Import
    _type.source                 Assigned
    _type.container              List
    _type.contents               Table(Code)
    _type.dimension              [{}]
    loop_
    _enumeration_set.state
    _enumeration_set.detail
    _enumeration_set.table_id
              1   'filename/URI of source dictionary'      file
              2   'save framecode of source definition'    save
              3   'mode for including save frames'         mode
              4   'option for duplicate entries'           dupl
              5   'option for missing duplicate entries'   miss
    loop_
    _method.purpose
    _method.expression
     Evaluation   
;
     With  i  as  import

    _import.get = [{"file":i.file_id, "save":i.frame_id, "mode":i.mode,
                    "dupl":i.if_dupl, "miss":i.if_miss}]
;


Because it is in the _enumeration_set category, the category key _enumeration_set.state must be present when listing these table keys, but instead of _enumeration_set.state listing the actual permitted values, it contains meaningless dummy values; table_id then lists table keys, not values, and so the constraints on the values of the keys are absent.  This looks like an abuse of the enumeration_set category when the natural solution, as proposed by Doug du Boulay, is simply to enhance _type.contents, i.e.

 

_type.contents = {"file":URL "save":Code "mode":Code "dupl":Code "miss":Code}

Note that _type.contents is implicitly interpreted (in the demonstration DDLm dictionaries) to describe the contents of Lists, not the whole list, so the above use is in line with this. I therefore suggest that we drop _enumeration_set.table_id from DDLm completely as there is no use case.
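
For illustration, a checker for the table form of _type.contents suggested above might look like the following Python sketch.  It assumes the reading that every key named in _type.contents must be present; whether extra keys should be rejected is not settled here, so they are reported separately.  The names `IMPORT_GET_CONTENTS` and `check_table_keys` are invented for the example, and value types (URL, Code) are recorded but not checked.

```python
# The suggested _type.contents for _import.get, modelled as key -> type name.
IMPORT_GET_CONTENTS = {
    "file": "URL",
    "save": "Code",
    "mode": "Code",
    "dupl": "Code",
    "miss": "Code",
}


def check_table_keys(contents, table):
    """Compare one table against a key -> type mapping.

    Returns (missing, unexpected): keys required by `contents` but absent from
    `table`, and keys present in `table` but not named in `contents`.
    """
    missing = sorted(key for key in contents if key not in table)
    unexpected = sorted(key for key in table if key not in contents)
    return missing, unexpected


# _import.get is a List of Tables, so each element is checked in turn.
import_get = [{"file": "core.dic", "save": "atom_site", "mode": "Full"}]
for element in import_get:
    print(check_table_keys(IMPORT_GET_CONTENTS, element))
```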

Are we in agreement on this?

James.

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group




