Discussion List Archives

Re: [ddlm-group] On schema, syntax and semantics, was Preparing CIF for multi-block datasets

On Thu, 2 Apr 2020 at 12:12, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear James,
  I support the concept of external links.  I cannot support the current proposal because I do not understand it: it is a minimally documented
concept, rather than a clear specification with syntax, semantics and examples. 

Before some group spends days and weeks formulating all of those things, I wanted to see if there was a general agreement with the direction and get some feeling for issues that might arise.  Nobody is going to claim that you agreed to the detail, only that you think it is a productive direction. I will at some point work up a full example as I haven't seen any general objections raised.
 
For me the most important thing that is missing is a fully
agreed mechanism to translate any DDLm dictionary into a DDL2 dictionary, i.e. a clear algorithmic definition of the schema to be read out of
the DDLm dictionary, first without this new feature, then with it.

This linking feature is not intended to change the schema covering a collection of data files in any way. The schema would continue to be specified using DDLm category definitions. The schema to be read out of a DDLm dictionary would therefore be identical before and after this change. I might be missing something, but I don't see why the DDLm/DDL2 relationship is somehow supposed to make or break this proposal.
 

  The core to my confusion is conveyed in
  "I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF)
is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly
agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others. "

  I am worried about what DDLm means, not what any particular data dictionary as now written means, and how that relates to what DDL2 means, not what
any particular DDL2 data dictionary means, so I can write code to go between DDLm and DDL2, with or without the new proposal.    Yes, it is possible
to write code from the information presented so far, but there is a great risk that the code I think conforms will not do similar things to the code you
think conforms and neither will do similar things to the code John W. thinks conforms.  It is time to algorithmically specify the "untangling" required
by DDLm so we can always move reliably from a DDL2 world to a DDLm world and back reproducibly.

I know, Herbert, that you have been keen for this to happen for a long time. However, it is pretty clear to me that the main DDL2 users are quite happy to live in a DDLm-free world and will not be devoting any time to tools related to DDLm. My limited time is currently devoted to polishing DDLm and the DDLm dictionaries and related chapters in Volume G.  While a translation tool between DDLm and DDL2 dictionaries is entirely plausible, it changes nothing in the actual data files, and I don't know why it should be a roadblock to DDLm progress at this time. I would be very willing to join in discussions around such a tool, but I don't have the time to create it myself.

That said, if Herbert could collect together a list of the translation issues that he is aware of, that would be a good reference. From my point of view, DDLm and DDL2 dictionaries share identical concepts of categories and category keys, so the basic structure of the dictionaries aligns.


  We agree "that the fundamental underlying structure of any scientific data is relational".  Good.  Now let's make that a reality for all of CIF by
extending DDLm with all the infrastructure needed to ensure that every DDLm-conforming dictionary will have an easy-to-untangle path to an
equivalent DDL2-conforming dictionary.  If we cannot do that we have not made the relational presentation of the DDLm-conforming dictionaries
clear.

Herbert, you propose that DDLm needs to be extended. What is DDLm missing? 

  We have gained a year to slow down, be careful and present a really well-documented, well-understood CIF next summer, a "Gold Standard" CIF, if you
will.

ddl.dic is the foundational documentation. What is missing?

On Wed, Apr 1, 2020 at 8:36 PM James Hester <jamesrhester@gmail.com> wrote:
Dear all, 

See my comments inline below.

On Thu, 2 Apr 2020 at 10:23, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear Colleagues,
  This issue is just another aspect to the matters we have discussed since 1995 on the relationships among database schema, syntax and semantics.
It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we
use, so that we can work with interoperable presentations in CIF, XML, JSON, etc.  DDL2 is very good at that.   DDL1 was weak on that.  DDLm is,
frankly, a bit muddled in that regard when the syntax and semantics of pieces of various combined dictionaries can make it hard to trace what
parts of which schema are intended to apply to which data.
 
I agree up until Herbert's final sentence. Is DDLm muddled because of the lack of decent documentation, or because the concepts are imperfect? As far as I can tell, DDLm in its current version provides a mechanism that is about as simple as it can be and still handle the enormous diversity of the powder/magnetism/modulated/Laue world and combinations thereof in a machine-actionable manner - machine-actionable means that dREL methods can be written that will work with whatever combination of dictionaries you have come up with (think combined neutron / X-ray powder diffraction on a mixture expressed using a dictionary). Doing this has relied on relational data structures.
 
  Links are certainly useful, and I would favor adding them to CIF as a container language, just as we added the imgCIF binary data type to enable
the creation of CBFs, but, just as in that case, we must be sure to precisely specify what the equivalent DDL2 CIF presentation is of the same information in
a single file, so that the schema can be unambiguously extracted.

I believe Herbert is thinking here of links within a CIF data block pointing to items that are not straightforward DDLm-conforming CIF data blocks, thus necessitating a mapping between the pointed-to contents and the DDLm schema.  Absolutely true that such a mapping is necessary. So perhaps Herbert is suggesting a further '_audit_link' data name that would identify the particular mapping to use?  I agree. The lack of such mappings doesn't mean we can't define the data name. I would also add that, while one scenario might put such links into a 'global' block (like a Nexus master file) making a sort of container for other data blocks, another scenario might simply link one block with the next one along.

  In the same vein I propose that we unambiguously specify the mapping of
all non-looped DDLm categories into the equivalent DDL2 CIF presentation.  I know there are people who think there is something special and
different about the unlooped categories, but I firmly believe that any information that cannot be presented as relations preserving referential
integrity is a disaster waiting to happen and eventually will become an unsearchable garble.

I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others.  

By saying that a category is unlooped you are specifying the scope of a single data block (e.g. *one* compound, *one* sample); that is the significance of unlooped categories. DDL2 does exactly the same thing by specifying that the value of _entry.id is the data block identifier. So all children of _entry.id are single-row, i.e. Set, categories. And, contrary to what Herbert seems to be implying, there is no abandonment of relational integrity if you restrict some loops to having a single row.
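
As a rough illustration of the correspondence described above (an unlooped Set category read as a one-row relation keyed on the data block identifier), here is a minimal Python sketch. It is not an agreed DDLm-to-DDL2 translation algorithm. The function name set_category_as_table, the key column name entry_id, the block name and the sample values are all assumptions made for illustration only.

```python
from typing import Dict, List


def set_category_as_table(
    block_id: str, block_items: Dict[str, str], category: str
) -> List[Dict[str, str]]:
    """Present the unlooped items of one category as a single-row table.

    The row is keyed on the data block identifier. The column name
    'entry_id' is an assumption for this sketch, echoing the role that
    _entry.id plays in the paragraph above.
    """
    # CIF data names are case-insensitive, so compare in lower case.
    prefix = f"_{category.lower()}."
    row = {"entry_id": block_id}
    for name, value in block_items.items():
        if name.lower().startswith(prefix):
            # Strip the category prefix to get the column (object) name.
            row[name[len(prefix):]] = value
    # A Set category always yields exactly one row per data block.
    return [row]


# Illustrative data block contents (values are made up for this sketch).
block = {
    "_diffrn.ambient_temperature": "293",
    "_diffrn.ambient_pressure": "101",
    "_cell.length_a": "5.43",
}

table = set_category_as_table("sample_1", block, "diffrn")
print(table)
```

Because every row carries the block identifier as its key, several such one-row tables from different data blocks could be concatenated into an ordinary multi-row relation without losing track of which block each row came from.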
 
  Just as to this day, COMCIFS has not pushed the binary data type into DDLm, it does not need to push links or looped sets into DDLm, but it
does need to suggest a reasonable way to present the information involved in a DDLm equivalent that can be used by applications to deal
with this information.

We already have 'looped sets' as a result of the _audit.schema discussions several years ago. The documentation might still be a bit sparse.

all the best,
James.

