Discussion List Archives


Re: [ddlm-group] Making ddl2 <-> ddlm translation a reality

Dear DDLm group,

I have put together a DDLm attribute extension dictionary in a separate branch on Github. You can see it here: https://github.com/COMCIFS/cif_core/blob/ddl_extension/ddl_ddl2_extension.dic

All it does is define a dictionary-level category to contain ddl2 type code names and regular expressions, and an extra definition attribute to state the ddl2 type.  Thus it mirrors the item_type_code category in DDL2.

There is also a separate addition to the main DDL dictionary to allow conformance to a dictionary with a given DOI to be asserted. That discussion is still ongoing but I thought it could be usefully bundled into this branch.

Feel free to raise issues and suggestions here or on Github.

The next step is to actually get some draft dictionary translations working within this framework.  I plan to write them as dREL expressions.

James.





On Mon, 27 Apr 2020 at 16:27, James Hester <jamesrhester@gmail.com> wrote:
Dear DDLm group,

It appears that there are no objections, and as the only response (from John B) favours option 2 which I am also inclined towards, I will start things moving in this direction.

The most important technical issue that immediately arises is what to do about _dictionary.ddl_conformance in the case of an extension to DDLm. This attribute holds the version number of the DDL dictionary to which a given dictionary conforms. Clearly, if a domain dictionary is based on an extension to DDLm, there is no longer a simple linear version number. I suggest that we introduce a new data name, something like '_dictionary.ddl_conformance_url', which gives the URL of the DDL to which the dictionary conforms. _dictionary.ddl_conformance would then be the version number of that dictionary, although that would be largely redundant, as the dictionary at the URL would provide the version number anyway.

I will shortly be starting a discussion in the core discussion group around assigning DOIs to dictionaries via a service like Zenodo, BTW, and perhaps we would switch from URL to DOI in case of a successful outcome.

all the best,
James.



On Fri, 17 Apr 2020 at 00:05, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Dear DDLm group,

There are (at least) two kinds of type information that need somehow to be preserved in order to successfully round-trip a DDL2 dictionary through a DDLm representation: (i) the components of the item_type_list itself, and (ii) the type assignment for each item.  Although it appears that in practice, most DDL2 dictionaries use a subset of mmCIF's item_type_list, it is not safe in a general sense to take that list as universally applicable.  Moreover, taking it that way would moot the point of the category, inasmuch as I take that to be that dictionaries define their own data types.  Thus, I do not favor James's option (1).

Perhaps James has something more specific in mind for his (3) than I gather from his description, but I agree that the general idea seems fraught.  Pretty much anything ought to be doable in this general way, though, because it should be possible to write a meta-CIF dictionary in DDLm, with which we could then represent arbitrary CIF conforming to arbitrary dictionaries.  I emphasize, however, that I am not recommending this approach.

Of James's suggestions, that leaves (2).  I am inclined to think that it would be easiest to implement, too, and relatively clear.  The main disadvantage I see is that it would be specific to the DDL2-mapping case, but I could live with that, given the additional attributes being defined in an extension dictionary, as James suggests, not in DDLm proper.

It occurs to me to consider whether a DDLm definition of DDL2 itself (not necessarily round-trippable) would be useful here.  It's an interesting thought experiment, and the exercise might even be worthwhile, but I'm inclined to think that it has nothing new to offer with respect to the matter presently at hand.


John



From: ddlm-group <ddlm-group-bounces@iucr.org> on behalf of James Hester <jamesrhester@gmail.com>
Sent: Thursday, April 16, 2020 12:33 AM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Making ddl2 <-> ddlm translation a reality
 

Dear DDLm group,

imgCIF has not yet been incorporated into the DDLm world, which I think is essential for us to take advantage of its excellent raw data descriptors. As we all know, raw data is becoming increasingly important, and even if data are not stored in CIF format, CIF descriptors can be adapted to any format. In order to support a DDLm version of the imgCIF dictionary, imgCIF maintainers want a DDL2 -> DDLm -> DDL2 dictionary round trip to preserve the important information. This will allow a single version of imgCIF to be maintained and be available to both the macromolecular community and to the communities covered by non-DDL2 dictionaries.  

As part of a preliminary investigation I have analysed how the translation would work in both directions by writing (but not testing!!) dREL methods that operate on dictionary data. I'll make this document available on Github shortly.  Most of the fundamental relationships and data name information are simple to transform.

However, as a result of this investigation I have come up against a key problem that was identified long ago, related to DDL2 types. So: item_type_list is a category that tabulates all possible DDL2 "types", by linking a type name with a regular expression, a primitive type (char/uchar/numb/null) and an explanation. In contrast to DDLm, these types are defined in the domain dictionary instead of semi-baked into the DDL.
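
To make the shape of an item_type_list row concrete, here is a minimal Python sketch. The class name, field names and both example rows are illustrative assumptions rather than entries copied from any real dictionary, and Python's `re` module only approximates the regular expression dialect used in DDL2 dictionaries.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemType:
    """One row of a DDL2-style item_type_list (field names are hypothetical)."""
    code: str            # the DDL2 type name
    primitive_code: str  # one of char / uchar / numb / null
    construct: str       # regular expression constraining the value
    detail: str          # human-readable explanation

    def accepts(self, value: str) -> bool:
        # The whole value must match the regular expression.
        return re.fullmatch(self.construct, value) is not None


# Hypothetical rows, defined by the domain dictionary rather than by the DDL.
ITEM_TYPE_LIST = {
    t.code: t
    for t in [
        ItemType("int", "numb", r"[+-]?[0-9]+", "integer value"),
        ItemType("code", "char", r"[A-Za-z0-9_.-]+", "token without whitespace"),
    ]
}
```

A dictionary is free to add further rows with their own names and expressions, which is exactly the information that has no home in DDLm.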

While a mapping from DDL2 types to DDLm types is largely straightforward, in doing this a lot of the DDL2 imgCIF/mmCIF information is lost, particularly the highly detailed distinctions between various textual formats in imgCIF/mmCIF that are captured in regular expressions.  This means that sensible translation back to DDL2 is impossible, most fundamentally because the DDL2 names of the types are not preserved - DDLm has no dictionary-definable types.

Here are some options that I see for solving this, and by extension the basic DDLm -> DDL2 translation problem:
(1) When translating DDLm -> DDL2, the _item_type_list found in mmCIF/PDBx is consulted for a matching regex and the corresponding code is used. If none is found, an arbitrary code is generated.
(2) We create an extension dictionary to DDLm which defines a few extra attributes specifically for preserving DDL2 information (e.g. type codes).
(3) We create a new DDLm category for "foreign" attributes, where arbitrary foreign attributes with values can be listed.
(4) Your suggestions?

Option (1) is not that unnatural, as imgCIF (and as I understand it any mmCIF extension dictionary) should harmonise its units and item type lists with mmCIF. So the translation is not DDLm -> DDL2, but instead DDLm -> DDL2 -> mmCIF extension dictionary. However, in this case we would be using the regex as a natural key and so the DDL2->mmCIF extension step is a bit fragile e.g. there might be multiple ways to express the same text constraints using a regex and therefore matches might be missed if either DDL2 or DDLm sides update a regex.
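
The fragility of using the regex as a natural key can be sketched in a few lines of Python. The lookup table and the function name are hypothetical, and the single row is an invented stand-in for an mmCIF item_type_list entry.

```python
import re

# Hypothetical excerpt of a type list, keyed by the regex text itself.
TYPE_CODE_BY_REGEX = {
    r"[+-]?[0-9]+": "int",
}


def lookup_type_code(regex):
    """Return the type code whose regex text is identical, else None."""
    return TYPE_CODE_BY_REGEX.get(regex)


original = r"[+-]?[0-9]+"
reworded = r"[-+]?[0-9]+"  # accepts the same strings, spelled differently

samples = ["12", "-7", "+3", "1.5", "abc", ""]
same_behaviour = all(
    bool(re.fullmatch(original, s)) == bool(re.fullmatch(reworded, s))
    for s in samples
)
```

Here `same_behaviour` is true for the sample values, yet `lookup_type_code(reworded)` finds nothing, so a purely cosmetic change to a regex on either side would silently lose the type code.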

Option (2) is easy enough to create, and has the advantage of extensibility if and when more things that are dropped in translation are desired. It also serves to "define" the differences in information granularity between DDL2 and DDLm.  It allows "pure" DDLm users to work in DDLm, and then if somebody wishes to incorporate that ontology into the mmCIF world, the list of necessary attributes to add to the definitions is available.  It does however create (yet) another DDL, although one that could be said to come under the DDLm umbrella.  Additionally, it may serve as a model for integrating with other ontologies beyond DDL2. If this group sees merit in this approach, we would probably organise formal approval in COMCIFS.
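
As a rough illustration of how Option (2) would make the round trip lossless, consider the following Python sketch. The extension attribute name `_ddl2_type.code`, the mapping table and the example data name are all assumptions made for illustration; the real attribute would be whatever the extension dictionary ends up defining.

```python
# Illustrative, deliberately lossy mapping from DDL2 type codes to DDLm content types.
DDL2_TO_DDLM_CONTENTS = {
    "int": "Integer",
    "float": "Real",
    "line": "Text",
    "text": "Text",
}

# Hypothetical attribute supplied by the extension dictionary.
EXT_TYPE_ATTR = "_ddl2_type.code"


def ddl2_to_ddlm(item):
    """Translate a DDL2 item to a DDLm definition, keeping the DDL2 type code."""
    code = item["_item_type.code"]
    return {
        "_definition.id": item["_item.name"],
        "_type.contents": DDL2_TO_DDLM_CONTENTS[code],
        EXT_TYPE_ATTR: code,  # preserved only because of the extension
    }


def ddlm_to_ddl2(definition):
    """Translate back, reading the type code from the extension attribute."""
    return {
        "_item.name": definition["_definition.id"],
        "_item_type.code": definition[EXT_TYPE_ATTR],
    }


line_item = {"_item.name": "_example.title", "_item_type.code": "line"}
text_item = {"_item.name": "_example.details", "_item_type.code": "text"}
```

Both example items map to the same DDLm content type, so without the extra attribute the reverse translation could not tell "line" from "text".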

As far as I can see, Option (3) would only work in a non-clunky way for non-looped attributes, which is fine for the particular case of item_type but is not extensible.

What I would like from this group is for us to consider the above options and for us to arrive at a preferred approach.

Thoughts?

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group


