Discussion List Archives


Re: [ddlm-group] On schema, syntax and semantics, was Preparing CIF for multi-block datasets

Dear James,

  I am delighted with your last line. 

ddl.dic is the foundational documentation. What is missing?

You are right.  I have been pushing this for many years, so I can get cif_img.dic converted.  I need a few
extensions to DDLm.  Let's get started.

My request is simple -- I wish to add to ddl.dic whatever is necessary to completely support the DDL2 dictionaries, i.e. the PDB
dictionaries and cif_img.dic.  Let us start with one simple basic request, direct support for the specification of
data types as regular expressions.  I believe this goal can be achieved by adding to the _type.contents _enumeration_set
the ability to specify regular expressions, so we can do machine parsable validations.  Any objections to the concept?

   loop_
    _enumeration_set.state
    _enumeration_set.detail
   ...
   ByRegex
                            """The contents have the form specified by
                            _type.contents_regex."""

and adding _type.contents_regex using the DDL2 variant of regular expressions

save_type.contents_regex
    _definition.id                 '_type.contents_regex'
    _definition.update          2020-04-02
    _definition.class            Attribute
    _description.text
;
     A regular expression giving the syntax of the type of this item.
     Meaningful only when this item's _type.contents attribute has
     value 'ByRegex'.

     The regular expressions defined here are not compliant
     with the POSIX 1003.2 standard as they include the
     '\n' and '\t' special characters.  These regular expressions
     have been tested using version 0.12 of Richard Stallman's
    GNU regular expression library in POSIX mode.
    In order to allow presentation of a regular expression
    in a text field concatenate any line ending in a backslash
    with the following line, after discarding the backslash.
 
    A formal definition of the '\n' and '\t' special characters
    is most properly done in the DDL, but for completeness, please
    note that '\n' is the line termination character ('newline')
    and '\t' is the horizontal tab character.  There is a formal
    ambiguity in the use of '\n' for line termination, in that
    the intention is that the equivalent machine/OS-dependent line
    termination character sequence should be accepted as a match, e.g.
 
       '\r' (control-M) under MacOS
       '\n' (control-J) under Unix
       '\r\n' (control-M control-J) under DOS and MS Windows
 

;
    _name.category_id       type
    _name.object_id           contents_regex
    _type.purpose               Encode
    _type.container             Single
    _type.contents              Text
     save_
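
To make the intent of the proposed attribute concrete, here is a minimal sketch of how a validator might apply a _type.contents_regex value. It is written in Python, whose `re` module is not the GNU regular expression library in POSIX mode, so bracket expressions and some escapes can behave differently. The function names are hypothetical, and matching the whole value (rather than a substring) is an assumption on my part.

```python
import re


def join_continuations(text: str) -> str:
    """Join any line ending in a backslash with the following line,
    after discarding the backslash (the rule stated in the definition)."""
    joined, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            pending += line[:-1]
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)
    return "\n".join(joined)


def normalize_newlines(value: str) -> str:
    """Map the DOS/Windows ('\\r\\n') and old MacOS ('\\r') line
    terminations to '\\n', so that '\\n' in a pattern accepts any of them."""
    return value.replace("\r\n", "\n").replace("\r", "\n")


def matches_regex(value: str, contents_regex: str) -> bool:
    """Assumption: the whole value must match the regular expression."""
    pattern = join_continuations(contents_regex)
    return re.fullmatch(pattern, normalize_newlines(value)) is not None


if __name__ == "__main__":
    # The integer pattern from the DDL2 list, split over two lines
    # with a trailing backslash purely to exercise the joining rule.
    split_int = "-?\\\n[0-9]+"
    print(matches_regex("-42", split_int))
    print(matches_regex("4.2", split_int))
```

Normalizing the value before matching is only one way to resolve the line-termination ambiguity; rewriting '\n' in the pattern to accept all three sequences would be another.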

For imgCIF, the DDL2 list is given below.  If you are agreeable to the above addition to
DDLm, with a few similar additions to DDLm, I can rapidly convert cif_img.dic to
that extended DDLm.  I would also include your link proposal.  It would allow a neat structural
match between NeXus/HDF5 Eiger datasets and CBF Eiger datasets.

####################
## ITEM_TYPE_LIST ##
####################
#
#
#  The regular expressions defined here are not compliant
#  with the POSIX 1003.2 standard as they include the
#  '\n' and '\t' special characters.  These regular expressions
#  have been tested using version 0.12 of Richard Stallman's
#  GNU regular expression library in POSIX mode.
#  In order to allow presentation of a regular expression
#  in a text field concatenate any line ending in a backslash
#  with the following line, after discarding the backslash.
#
#  A formal definition of the '\n' and '\t' special characters
#  is most properly done in the DDL, but for completeness, please
#  note that '\n' is the line termination character ('newline')
#  and '\t' is the horizontal tab character.  There is a formal
#  ambiguity in the use of '\n' for line termination, in that
#  the intention is that the equivalent machine/OS-dependent line
#  termination character sequence should be accepted as a match, e.g.
#
#      '\r' (control-M) under MacOS
#      '\n' (control-J) under Unix
#      '\r\n' (control-M control-J) under DOS and MS Windows
#
     loop_
    _item_type_list.code
    _item_type_list.primitive_code
    _item_type_list.construct
    _item_type_list.detail
               code      char
               '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words ...
;
               ucode      uchar
               '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words (case insensitive) ...
;
               line      char
               '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items ...
;
               uline     uchar
               '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items (case insensitive)...
;
               text      char
             '[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              text item types / multi-line text ...
;
               binary    char
;\n--CIF-BINARY-FORMAT-SECTION--\n\
[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*\
\n--CIF-BINARY-FORMAT-SECTION----
;
;              binary items are presented as MIME-like ascii-encoded
               sections in an imgCIF.  In a CBF, raw octet streams
               are used to convey the same information.
;
               int       numb
               '-?[0-9]+'
;              int item types are the subset of numbers that are the negative
               or positive integers.
;
               float     numb
          '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
;              float item types are the subset of numbers that are the floating
               point numbers.
;
               any       char
               '.*'
;              A catch all for items that may take any form...
;
               yyyy-mm-dd  char
;\
[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9]?[0-9]\
((T[0-2][0-9](:[0-5][0-9](:[0-5][0-9](.[0-9]+)?)?)?)?\
([+-][0-5][0-9]:[0-5][0-9]))?
;
;
               Standard format for CIF date and time strings (see
               http://www.iucr.org/iucr-top/cif/spec/datetime.html),
               consisting of a yyyy-mm-dd date optionally followed by
               the character 'T' followed by a 24-hour clock time,
               optionally followed by a signed time-zone offset.

               The IUCr standard has been extended to allow for an optional
               decimal fraction on the seconds of time.

               Time is local time if no time-zone offset is given.

               Note that this type extends the mmCIF yyyy-mm-dd type
               but does not conform to the mmCIF yyyy-mm-dd:hh:mm
               type that uses a ':' in place of the 'T' specified
               by the IUCr standard.  For reading, both forms should
               be accepted,  but for writing, only the IUCr form should
               be used.

               For maximal compatibility, the special time zone
               indicator 'Z' (for 'zulu') should be accepted on
               reading in place of '+00:00' for GMT.
;
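
As an illustration of the machine-parsable validation this list makes possible, the sketch below checks values against two of the constructs above, 'int' and 'float', copied verbatim. It uses Python's `re` module rather than the GNU library in POSIX mode, so it is only indicative. The table name `ITEM_TYPES` and the function `check_type` are hypothetical, and whole-value matching is assumed.

```python
import re

# Constructs copied from the item type list above (int and float only).
ITEM_TYPES = {
    "int": r"-?[0-9]+",
    "float": r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))"
             r"([(][0-9]+[)])?([eE][+-]?[0-9]+)?",
}


def check_type(code: str, value: str) -> bool:
    """Return True if the whole value matches the construct for `code`."""
    return re.fullmatch(ITEM_TYPES[code], value) is not None


if __name__ == "__main__":
    for code, value in [
        ("int", "007"),
        ("int", "+7"),
        ("float", "1.5(3)e-4"),
        ("float", ".5"),
        ("float", "1.5.2"),
    ]:
        print(code, repr(value), check_type(code, value))
```

The character-class constructs (code, line, text) are left out on purpose: they rely on POSIX bracket-expression rules, such as a literal backslash inside brackets, that Python interprets differently.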


On Thu, Apr 2, 2020 at 12:52 AM James Hester <jamesrhester@gmail.com> wrote:
On Thu, 2 Apr 2020 at 12:12, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear James,
  I support the concept of external links.  I cannot support the current proposal because I do not understand it: it is a minimally documented
concept, rather than a clear specification with syntax, semantics and examples. 

Before some group spends days and weeks formulating all of those things, I wanted to see if there was a general agreement with the direction and get some feeling for issues that might arise.  Nobody is going to claim that you agreed to the detail, only that you think it is a productive direction. I will at some point work up a full example as I haven't seen any general objections raised.
 
For me the most important thing that is missing is a fully
agreed mechanism to translate any DDLm dictionary into a DDL2 dictionary, i.e. a clear algorithmic definition of the schema to be read out of
the DDLm dictionary, first without this new feature, then with it.

This linking feature is not intended to change the schema covering a collection of data files in any way. The schema would continue to be specified using DDLm category definitions. The schema to be read out of a DDLm dictionary would therefore be identical before and after this change. I might be missing something, but I don't see why the DDLm/DDL2 relationship is somehow supposed to make or break this proposal.
 

  The core to my confusion is conveyed in
  "I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF)
is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly
agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others. "

  I am worried about what DDLm means, not what any particular data dictionary as now written means, and how that relates to what DDL2 means, not what
any particular DDL2 data dictionary means, so I can write code to go between DDLm and DDL2, with or without the new proposal.    Yes, it is possible
to write code from the information presented so far, but there is a great risk that the code I think conforms will not do similar things to the code you
think conforms and neither will do similar things to the code John W. thinks conforms.  It is time to algorithmically specify the "untangling" required
by DDLm so we can always move reliably from a DDL2 world to a DDLm world and back reproducibly.

I know Herbert that you have been keen for this to happen for a long time. However, it is pretty clear to me that the main DDL2 users are quite happy to live in a DDLm free world and will not be devoting any time to tools related to DDLm. My limited time is currently devoted to polishing DDLm and the DDLm dictionaries and related chapters in Volume G.  While a translation tool between DDLm and DDL2 dictionaries is entirely plausible, I don't have time to do it myself, it changes nothing in the actual data files, and I don't know why it should be a roadblock to DDLm progress at this time. I would be very willing to join in discussions around such a tool, but I don't have the time to create it myself. 

That said, if Herbert could collect together a list of the translation issues that he is aware of that would be a good reference. From my point of view, DDLm and DDL2 dictionaries share identical concepts of categories and category keys, so the basic structure of dictionaries align.


  We agree that "the fundamental underlying structure of any scientific data is relational".  Good.  Now let's make that a reality for all of CIF by
extending DDLm with all the infrastructure needed to ensure that every DDLm-conforming dictionary will have an easy-to-untangle path to an
equivalent DDL2-conforming dictionary.  If we cannot do that, we have not made the relational presentation of the DDLm-conforming dictionaries
clear.

Herbert, you propose that DDLm needs to be extended. What is DDLm missing? 

  We have gained a year to slow down, be careful and present a really well-documented, well-understood CIF next summer, a "Gold Standard" CIF, if you
will.

ddl.dic is the foundational documentation. What is missing?

On Wed, Apr 1, 2020 at 8:36 PM James Hester <jamesrhester@gmail.com> wrote:
Dear all, 

See my comments inline below.

On Thu, 2 Apr 2020 at 10:23, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
Dear Colleagues,
  This issue is just another aspect to the matters we have discussed since 1995 on the relationships among database schema, syntax and semantics.
It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we
use, so that we can work with interoperable presentations in CIF, XML, json, etc.  DDL2 is very good at that.   DDL1 was weak on that.  DDLm is,
frankly, a bit muddled in that regard when the syntax and semantics of pieces of various combined dictionaries can make it hard to trace what
parts of which schema are intended to apply to which data.
 
I agree up until Herbert's final sentence. Is DDLm muddled because of the lack of decent documentation, or because the concepts are imperfect? As far as I can tell, DDLm in its current version provides a mechanism that is about as simple as it can be and still handle the enormous diversity of the powder/magnetism/modulated/Laue world and combinations thereof in a machine-actionable manner - machine-actionable means that dREL methods can be written that will work with whatever combination of dictionaries you have come up with (think combined neutron / X-ray powder diffraction on a mixture expressed using a dictionary). Doing this has relied on relational data structures.
 
  Links are certainly useful, and I would favor adding them to CIF as a container language, just as we added the imgCIF binary data type to enable
the creation of CBFs, but just as with that case we must be sure to precisely specify what the equivalent DDL2 CIF presentation is of the same information in
a single file, so that the schema can be unambiguously extracted.

I believe Herbert is thinking here of links within a CIF data block pointing to items that are not straightforward DDLm-conforming CIF data blocks, thus necessitating a mapping between the pointed-to contents and the DDLm schema.  Absolutely true that such a mapping is necessary. So perhaps Herbert is suggesting a further '_audit_link' data name that would identify the particular mapping to use?  I agree. The lack of such mappings doesn't mean we can't define the data name. I would also add that, while one scenario might put such links into a 'global' block (like a Nexus master file) making a sort of container for other data blocks, another scenario might simply link one block with the next one along.

  In the same vein I propose that we unambiguously specify the mapping of
all non-looped DDLm categories into the equivalent DDL2 CIF presentation.  I know there are people who think there is something special and
different about the unlooped categories, but I firmly believe that any information that cannot be presented as relations preserving referential
integrity is a disaster waiting to happen and eventually will become an unsearchable garble.

I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others.  

By saying that a category is unlooped you are specifying the scope of a single data block (e.g. *one* compound, *one* sample), that is the significance of unlooped categories. DDL2 does exactly the same thing by specifying that the value of _entry.id is the data block identifier. So all children of _entry.id are single row i.e. Set categories. And there is no abandonment of relational integrity if you restrict some loops to having a single row as Herbert seems to be implying. 
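
The correspondence described above, an unlooped Set category being equivalent to a single-row table scoped by the data block, can be sketched as follows. This is only an illustration under stated assumptions, not an agreed DDLm/DDL2 mapping: the function name, the `entry_id` key column and the item name in the example are all hypothetical.

```python
def set_category_to_single_row(block_id, category, items, key_name="entry_id"):
    """Present an unlooped (Set) category as a one-row relational table.

    `items` maps object ids to values for one data block. The key column
    (hypothetical name `entry_id`) carries the data block identifier, so
    the row can be related to other tables without losing integrity.
    """
    columns = [key_name] + list(items)
    row = [block_id] + list(items.values())
    return {"category": category, "columns": columns, "rows": [row]}


if __name__ == "__main__":
    # Illustrative item name only; not taken from any dictionary.
    table = set_category_to_single_row(
        "xyz", "diffrn", {"ambient_temperature": "293"}
    )
    print(table["columns"])
    print(table["rows"])
```

The table always has exactly one row, which is the sense in which restricting a loop to a single row keeps the presentation relational.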
 
  Just as to this day, COMCIFS has not pushed the binary data type into DDLm, it does not need to push links or looped sets into DDLm, but it
does need to suggest a reasonable way to present the information involved in a DDLm equivalent that can be used by applications to deal
with this information.

We already have 'looped sets' as a result of the _audit.schema discussions several years ago. The documentation might still be a bit sparse.

all the best,
James.


--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148