
Re: [ddlm-group] _enumerated_set.table_id

That's an interesting and timely idea.

If "an instance of a category" is meant to comprise all data from a single data set and belonging to that category (i.e. all corresponding loop packets), then the most significant challenge I see involves keys.  CIF2 limits Table keys to quoted strings, but some categories in existing dictionaries have keys that do not conform to that type (compound keys), or whose interpretation by CIF processors may differ depending on whether they are quoted (numeric keys).  Providing for Tables encompassing whole category instances will require a solution to those issues.
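
To make the key problem concrete, here is a minimal Python sketch.  It is only an illustration under stated assumptions: a Python dict stands in for a CIF2 Table, the packet contents are invented, and the helper table_from_packets and its '|' joining convention are hypothetical, not anything defined by CIF2 or DDLm.

```python
# Hypothetical illustration: a Python dict stands in for a CIF2 Table,
# whose keys must be strings.

def table_from_packets(packets, key_names):
    """Build a Table-like dict from loop packets, keyed by the category key.

    Assumption: a compound key is flattened by joining its parts with '|',
    an ad hoc convention that CIF2 itself does not define.
    """
    table = {}
    for packet in packets:
        key = "|".join(str(packet[name]) for name in key_names)
        if key in table:
            raise ValueError(f"duplicate key {key!r}")
        table[key] = packet
    return table


# Numeric key: 1 and 1.0 are equal as numbers, but their string forms differ,
# so a string-keyed Table treats them as two distinct rows.
numeric_packets = [{"id": "1", "value": "a"}, {"id": "1.0", "value": "b"}]
by_string = table_from_packets(numeric_packets, ["id"])
numerically_equal = float("1") == float("1.0")

# Compound key: two distinct key tuples can collide once flattened.
compound_packets = [
    {"part1": "a|b", "part2": "c"},
    {"part1": "a", "part2": "b|c"},
]
try:
    table_from_packets(compound_packets, ["part1", "part2"])
    collided = False
except ValueError:
    collided = True
```

Any real solution would need an agreed, unambiguous encoding for compound keys and an agreed rule for comparing numeric keys; the sketch only shows why a naive string key is not enough.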

Additionally, although CIF2 does not constrain the type or form of the values inside a Table, DDLm provides no means to define Table types within which the values may have different types. Moreover, although it purports to provide a mechanism by which elements of a List value could have different types, that facility is not adequately developed or documented, and James and I have just agreed that we think it should be removed.  I don't see a good way around one or the other of those being needed for the elements of a top-level Table object representing a category.
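
As a rough sketch of the typing problem (again Python, with invented item names and values, and a deliberately simplified notion of "type"): a single contents type per definition, loosely in the manner of _type.contents, cannot describe a Table whose values mix text and numbers.

```python
# Hypothetical illustration: one declared contents type applies to all
# values of a Table, loosely modelled on a single _type.contents.

def conforms(table, contents_type):
    """Return True if every value in the dict is an instance of contents_type."""
    return all(isinstance(value, contents_type) for value in table.values())


# A Table of uniformly typed values is describable with one contents type.
import_like = {"file": "core.dic", "save": "atom_site", "mode": "Full"}

# One packet of a category viewed as a Table: the values differ in type.
# (The item names and values here are invented for the example.)
packet_like = {"label": "C1", "fract_x": 0.25, "occupancy": 1.0}

uniform_ok = conforms(import_like, str)
mixed_ok = conforms(packet_like, str) or conforms(packet_like, float)
```

Here uniform_ok is True while mixed_ok is False, which is the gap described above: either per-key value types or heterogeneous List elements would be needed.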
 
On the other hand, dealing with typing issues by relying on existing CIF typing is precisely what I have just been advocating, albeit not necessarily in as comprehensive a way as you may have in mind.  I think there is much to like about reusing what we already have in that area.

John


> -----Original Message-----
> From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of
> Herbert J. Bernstein
> Sent: Thursday, April 23, 2015 10:00 AM
> To: Group finalising DDLm and associated dictionaries
> Subject: Re: [ddlm-group] _enumerated_set.table_id
> 
> This gives rise to an interesting possible extension and simplification for the
> future:  really making a Table into a table, as a way to carry all the information
> in an instance of a category as manipulable data, then all the typing issues
> could be dealt with by existing CIF typing, and we would be able to carry
> multiple order-independent rows unambiguously.
> 
> On Thu, Apr 23, 2015 at 10:51 AM, Bollinger, John C
> <John.Bollinger@stjude.org> wrote:
> > For better or for worse, "Table" is the CIF2 term for this data structure.  I do
> > not think introducing an alias at this point would serve the interest of clarity,
> > but I will try to remember to capitalize when I use the word in the CIF2 sense.
> >
> > John
> >
> >> -----Original Message-----
> >> From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of
> >> Herbert J. Bernstein
> >> Sent: Thursday, April 23, 2015 4:52 AM
> >> To: Group finalising DDLm and associated dictionaries
> >> Subject: Re: [ddlm-group] _enumerated_set.table_id
> >>
> >> May I suggest maintaining a clear distinction, at least by
> >> capitalizing the CIF2 type, or better, by referring to it as a dictionary type,
> >> as in Python?
> >>
> >> On Wed, Apr 22, 2015 at 11:25 PM, James Hester
> >> <jamesrhester@gmail.com>
> >> wrote:
> >> > Hi Herbert - the very important point here is that we are talking
> >> > about the 'Table' type in CIF2 i.e. {"key1":value "key2":value},
> >> > and most certainly not 'table' in the sense of 'relational database table'
> >> > (although you will appreciate the very close relationship between
> >> > the two data structures).
> >> >
> >> > all the best,
> >> > James.
> >> >
> >> > On Thu, Apr 23, 2015 at 1:35 AM, Herbert J. Bernstein
> >> > <yayahjb@gmail.com>
> >> > wrote:
> >> >>
> >> >> Dear Colleagues,
> >> >>
> >> >>   I am puzzled by the idea of constraints on table keys distinct
> >> >> from the constraints on the values and types for table columns.
> >> >> From a database perspective, a table key is just a set of one or
> >> >> more columns that uniquely identify rows in a table by their
> >> >> contents.  If a column has been designated as a key or as a member
> >> >> of a composite key, the normal practice is to use the type and
> >> >> value constraints of the column as the only constraints on what you are
> >> >> allowed to use.  Please explain what is gained by having additional
> >> >> constraints specified?  I would suggest we keep as close to a relational
> >> >> model for CIF2 tables as possible.
> >> >>
> >> >>   Regards,
> >> >>     Herbert
> >> >>
> >> >> On Wed, Apr 22, 2015 at 11:14 AM, Bollinger, John C
> >> >> <John.Bollinger@stjude.org> wrote:
> >> >> > Hi James,
> >> >> >
> >> >> > Comments inline below.  ((Lack of) formatting thanks to stupid
> >> >> > Microsoft
> >> >> > limitations.)
> >> >> >
> >> >> >> > 4. Add a replacement mechanism to define constraints on table
> >> >> >> > keys.
> >> >> >> > It might be sufficient, and consistent with the apparent
> >> >> >> > intent of the current dictionary, to establish a parallel to
> >> >> >> > the _enumeration_set category for constraining key values,
> >> >> >> > maybe _key_enumeration_set.  It would be a smaller change at
> >> >> >> > the dictionary level, however, to add a mechanism by which
> >> >> >> > constraints on key type could be defined by reference to the
> >> >> >> > type of another item (see also next).
> >> >> >>
> >> >> >> What is the advantage of being able to validate key strings?
> >> >> >
> >> >> > What is the advantage of validating *anything*?  If there is a
> >> >> > constraint on document form and content then one would like to
> >> >> > be able to determine whether instance documents comply with that
> >> >> > constraint.  It can be useful to perform such validation for its
> >> >> > own sake, or programs can validate up front in order to minimize
> >> >> > or eliminate the need to sprinkle hand-rolled validity testing
> >> >> > throughout their implementation code.
> >> >> >
> >> >> > I suppose the real question is about the advantage of defining
> >> >> > constraints on table keys in the first place.  There are all
> >> >> > sorts of possible examples, but for now let's stick with
> >> >> > _import.get.  In each element (a table) of the list value of that
> >> >> > attribute, a few specific possible keys are meaningful, and all
> >> >> > others are meaningless / erroneous.  We might like to be able to
> >> >> > diagnose key misspellings in those tables.  We might like to be
> >> >> > able to process the values as lists of (key, value) pairs
> >> >> > without fear that any of the keys are invalid.  We might simply
> >> >> > like to provide a machine-readable definition of which keys are
> >> >> > meaningful / allowed.
> >> >> >
> >> >> >>  As outlined in my previous email, I don't see that validating
> >> >> >> the keys will have much benefit as tables are rarely used.
> >> >> >> That aside, simply introducing an extra DDLm attribute is OK,
> >> >> >> especially as we are dropping _enumeration_set.table_id we are
> >> >> >> not enlarging DDLm.
> >> >> >
> >> >> > If it were going to require a great deal of additional work and
> >> >> > complexity to provide for constraints on table keys then I would
> >> >> > hesitate to suggest doing so.  I don't think that's the case.
> >> >> >
> >> >> > As it is, the current DDLm dictionary provides a mechanism
> >> >> > intended to support constraining table keys, and it uses it,
> >> >> > albeit only once.  Removing that ability without replacement
> >> >> > would not only delete the ability it supports, it would also
> >> >> > change the semantics of the DDLm item that currently
> >> >> > *uses* that ability.
> >> >> >
> >> >> > I am inclined to suppose that one reason tables are rarely used
> >> >> > in the current dictionaries is that the item descriptions in the
> >> >> > 2012 DDLm dictionary do a poor job of explaining how to define
> >> >> > items taking tables as their values, especially with respect to
> >> >> > constraints.  Furthermore, all of the current dictionaries --
> >> >> > even DDLm -- spring from a history and dictionary development
> >> >> > tradition that hadn't table values to rely on until now, so it
> >> >> > is not surprising that DDLm versions of those dictionaries have
> >> >> > little reliance on tables.  That does not mean that tables
> >> >> > cannot serve more prominently in future dictionaries, or future
> >> >> > versions of the current dictionaries.
> >> >> >
> >> >> >> > 5. Add a mechanism to allow items' content type to be defined
> >> >> >> > by reference to another item.  This could be signaled by a
> >> >> >> > new code for _type.contents, with a new attribute defining
> >> >> >> > which other item’s type is to be used.  I don’t think that
> >> >> >> > the existing contents code 'Inherited' can serve this
> >> >> >> > purpose, but perhaps I’m mistaken.
> >> >> >
> >> >> >> This is an intriguing idea.  As it happens, the demonstration
> >> >> >> DDLm dictionaries introduce setting the type of an item based
> >> >> >> on the type of a different item using a dREL-like function
> >> >> >> (although I have replaced these with explicit types in the
> >> >> >> latest version of the new cif_core dictionary).
> >> >> >> Your suggestion replaces this by a non-dREL approach, which is
> >> >> >> in general desirable for simple applications.  To check that
> >> >> >> I've understood your
> >> >> >> (corrected) example:
> >> >> >> (1) the elements of the _import.get List are items of the same
> >> >> >> type as _import.get_contents_type
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> >> (2) _import.get_contents_type is a Table, so _type.contents for
> >> >> >> it is the type of values in the table i.e. Text
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> >> (3) The possible key values are given by the possible values
> >> >> >> taken by the _type.key_type_reference dataname
> >> >> >
> >> >> > Yes, in this case.  My idea is that _type.keys would be parallel
> >> >> > to _type.contents, so that, for example, it might also take the
> >> >> > value 'Code' or 'Date' or 'Text' or an extension type, and in
> >> >> > that case not rely on a reference to a separate item definition.
> >> >> >
> >> >> >> We have two new 'internal' DDLm attributes as a result, as well
> >> >> >> as the new _type.keys, _type.key_content_reference and
> >> >> >> _type.key_type_reference datanames for a total of 5 new
> >> >> >> attributes.
> >> >> >
> >> >> > Those aren't exactly the data names I proposed, but yes, that's
> >> >> > the way my proposal plays out for DDLm.
> >> >> >
> >> >> >>  If we put the key list into the definition to which it
> >> >> >> relates, we can cut down on the number of new attributes, e.g:
> >> >> >> save_import.get_contents_type
> >> >> >>    # ...
> >> >> >>    _type.purpose             'Internal'
> >> >> >>    _type.container           'Table'
> >> >> >>    _type.contents            'Text'
> >> >> >>    loop_
> >> >> >>      _table_key_set.state
> >> >> >>      _table_key_set.detail
> >> >> >>        'file' 'filename/URI of source dictionary'
> >> >> >>        'save' 'save framecode of source definition'
> >> >> >>        'mode' 'mode for including save frames'
> >> >> >>        'dupl' 'option for duplicate entries'
> >> >> >>        'miss' 'option for missing duplicate entries'
> >> >> >> save_
> >> >> >
> >> >> > Yes, that would be a viable alternative to support the needs of
> >> >> > DDLm itself.  It would reduce the number of new items needed
> >> >> > from 3 to 2 (the two other proposed new items being related to
> >> >> > defining table *contents* by reference, which is a separate
> >> >> > issue).  The statistics look different for dictionaries other than
> >> >> > DDLm itself.
> >> >> >
> >> >> > Your alternative appears to be roughly what I described in
> >> >> > passing as "to establish a parallel to the _enumeration_set
> >> >> > category for constraining key values."  Although it serves
> >> >> > DDLm's own needs just fine, it may be too restrictive for other
> >> >> > dictionaries that want to define (and constrain) tables, as it
> >> >> > supports only enumerable sets of keys.  In some other uses one
> >> >> > might instead want to constrain keys to the same form that (for
> >> >> > values) is represented by _type.contents = 'Date' or 'Version'
> >> >> > or some extension type, where it is not possible to enumerate all
> >> >> > possible keys.
> >> >> >
> >> >> >> which results in new attributes _type.key_content_reference,
> >> >> >> _table_key_set.state and _table_key_set.detail with one
> >> >> >> internal attribute _import.get_contents_type, and also reduces
> >> >> >> the non-locality of the definition - that is, one less
> >> >> >> reference to track through the file.
> >> >> >> _import.get is admittedly an extreme example, because it is the
> >> >> >> only occurrence of a list of tables rather than just a table,
> >> >> >> which is what requires the creation of the 'internal' data attribute.
> >> >> >
> >> >> > Yes and no.  The creation of the new 'Internal' value for
> >> >> > _type.purpose and of items that use it are more a consequence of
> >> >> > my approach to lightening the load on _type.dimension, whose
> >> >> > current description and use appear to task it with providing a
> >> >> > complete layout of values of the item being defined.  Note in
> >> >> > particular the dimension specified in the current definition of
> >> >> > _import.get: '[{}]'.  I don't think we want to continue in that direction.
> >> >> >
> >> >> > The structure of _import.get's values does not inherently
> >> >> > require internal types to be defined under my proposed
> >> >> > structure.  If there were an ordinary item in the dictionary
> >> >> > that had the wanted type of the elements of an _import.get list,
> >> >> > then that type could be referenced instead of an internal one.
> >> >> > I can imagine circumstances under which such a reference would
> >> >> > even be sensible.
> >> >> >
> >> >> >>  It is, however, a nice demonstration of how the attributes
> >> >> >> might work for future dictionary writers.  The new 'internal'
> >> >> >> dataname does have some meaning along the lines of 'a single
> >> >> >> import instruction' so a better dataname might be _import.single.
> >> >> >
> >> >> > Sure, that name would be fine with me.
> >> >> >
> >> >> >>  Is there any reason that you introduced a reference in order
> >> >> >> to specify the table keys?
> >> >> >
> >> >> > I introduced a reference in order to specify table keys so as to
> >> >> > provide for more alternatives than an enumeration of possible
> >> >> > keys, while minimizing the number of new DDLm items required.
> >> >> > Also, inasmuch as I was already proposing type-by-reference for
> >> >> > values, it seemed consistent to follow a parallel approach for key
> >> >> > constraints.
> >> >> >
> >> >> >>  And do you agree that the alternative I've proposed above
> >> >> >> would also be sufficient?
> >> >> >
> >> >> > I agree that your alternative would be sufficient *for DDLm
> >> >> > itself*, but I would prefer more flexibility to be available to
> >> >> > other dictionaries.
> >> >> > Because DDLm itself will be harder to change than other DDLm
> >> >> > dictionaries, I would like to avoid it being overly restrictive.
> >> >> > At the same time, I don't think we need to go crazy by trying to
> >> >> > make DDLm capable of defining completely arbitrary CIF2 data
> >> >> > structures.  I have tried to choose a happy medium that is
> >> >> > minimally disruptive for existing DDLm dictionaries and software.
> >> >> >
> >> >> >> On a final note for _import.get, the dREL is broken as it
> >> >> >> assumes that there is only one value for each of the
> >> >> >> constituent _import datanames, which would make a list
> >> >> >> superfluous (only one element), but what it really wants to do
> >> >> >> is to create a list from a loop of _import.file etc. values.
> >> >> >> To do this it needs a sequence number, which isn't defined.
> >> >> >> Once this *is* defined, we could alternatively present the
> >> >> >> import instructions as a loop over _import.sequence and
> >> >> >> _import.single, or else _import.sequence, _import.file etc.
> >> >> >
> >> >> > I can't say I'm much surprised.  _import.get shows evidence of
> >> >> > having gone through a change at some point, and I don't think
> >> >> > that was fully and consistently implemented.  I note in
> >> >> > particular that its description (in the
> >> >> > 2012 version) is "A table of attributes [...]", not "A list of
> >> >> > tables of attributes [...]" or similar.  I also note that its
> >> >> > _type.container is given as 'List[Table]', which is not among
> >> >> > the enumerated alternatives for values of that attribute.
> >> >> >
> >> >> > As for the dREL, though, why do you need a sequence number, and
> >> >> > / or why can the dREL not generate one itself as it iterates
> >> >> > over the values of _import.get?  Each value is a table providing
> >> >> > the attributes describing one import, so co-occurrence in the same
> >> >> > table already associates the various attributes of each import
> >> >> > together.
> >> >> >
> >> >> >> To wrap up, I like the suggestion of a _type.contents that can
> >> >> >> work by reference to another dataname.  I don't see a
> >> >> >> particular need for a similar reference for table keys, nor do
> >> >> >> I particularly think explicitly specifying the keys is likely
> >> >> >> to be that useful, but I'm not against adding this capability.
> >> >> >> We envisage adding quite a few other attributes later on to
> >> >> >> improve DDL2 - DDLm translation anyway.
> >> >> >
> >> >> > I'm glad you like the idea of defining content type by reference.
> >> >> > I hope I've persuaded you about the keys, but even if not, I
> >> >> > still think that the ability to define machine-readable
> >> >> > specifications of allowed keys is important.  I'm not hung up on
> >> >> > the exact implementation I proposed, however.
> >> >> >
> >> >> >
> >> >> > Cheers,
> >> >> >
> >> >> > John
> >> >> >
> >> >> > --
> >> >> > John C. Bollinger, Ph.D.
> >> >> > Computing and X-Ray Scientist
> >> >> > Department of Structural Biology St. Jude Children's Research
> >> >> > Hospital John.Bollinger@StJude.org
> >> >> > (901) 595-3166 [office]
> >> >> > www.stjude.org
> >> >> >
> >> >> >
> >> >> >
> >> >> > _______________________________________________
> >> >> > ddlm-group mailing list
> >> >> > ddlm-group@iucr.org
> >> >> > http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > T +61 (02) 9717 9907
> >> > F +61 (02) 9717 3145
> >> > M +61 (04) 0249 4148
> >> >
> >> >