Re: [ddlm-group] On schema, syntax and semantics,was Preparing CIF for multi-block datasets
- To: James Hester <james.r.hester@gmail.com>
- Subject: Re: [ddlm-group] On schema, syntax and semantics,was Preparing CIF for multi-block datasets
- From: "Herbert J. Bernstein" <yayahjb@gmail.com>
- Date: Thu, 2 Apr 2020 08:22:36 -0400
- Cc: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- In-Reply-To: <CAM+dB2dMh=Pfe3kWWyZZ0aiKv5B7hVQPZL2tcVd5ya4jCssWgA@mail.gmail.com>
- References: <CAM+dB2eFZ+-yUVWfNBVnKUaNNr9bUC9S3B8QJ9pYHNYk4ETnfA@mail.gmail.com><CABcsX26hg1KG+1P08W=GbjjV-upjKtbgyzbH4WW+qDhwZQR4zA@mail.gmail.com><CAM+dB2dBTdoXj_VegOibsFaKowy-+kXT6OQ2MxaVA=wOcD1akg@mail.gmail.com><1ffdd7d8-f29f-4c7b-e6b7-0bff08358484@rcsb.org><CAM+dB2fOodbuyMFhRnY5EZebYtPP3+RWh9pRLbAQvYmxvHYBrw@mail.gmail.com><CABcsX27tt801DdX8cmFwuBFY5JmMcm2T3od-VgnNMygP29TfLQ@mail.gmail.com><CAM+dB2dsQd3wU69bmeRZxbV5v+bK=Q841=Wgr53aJ3nHudhr6Q@mail.gmail.com><CABcsX24x0AwPw4gaUEaPo0O+A0bhYrGpB-+ssiW+oPw+NQp+9Q@mail.gmail.com><CAM+dB2dMh=Pfe3kWWyZZ0aiKv5B7hVQPZL2tcVd5ya4jCssWgA@mail.gmail.com>
Dear James,
I am delighted with your last line:
"ddl.dic is the foundational documentation. What is missing?"
You are right. I have been pushing this for many years, so I can get cif_img.dic converted. I need a few
extensions to DDLm. Let's get started.
My request is simple -- I wish to add to ddl.dic whatever is necessary to completely support the DDL2 dictionaries, i.e. the PDB
dictionaries and cif_img.dic. Let us start with one simple, basic request: direct support for the specification of
data types as regular expressions. I believe this goal can be achieved by adding to the _type.contents _enumeration_set
the ability to specify regular expressions, so we can do machine-parsable validations. Any objections to the concept?
loop_
_enumeration_set.state
_enumeration_set.detail
...
ByRegex
"""The contents have the form specified by
_type.contents_regex."""
and adding _type.contents_regex using the DDL2 variant of regular expressions
save_type.contents_regex
_definition.id '_type.contents_regex'
_definition.update 2020-04-02
_definition.class Attribute
_description.text
;
A regular expression giving the syntax of the type of this item.
Meaningful only when this item's _type.contents attribute has
value 'ByRegex'.
The regular expressions defined here are not compliant
with the POSIX 1003.2 standard as they include the
'\n' and '\t' special characters. These regular expressions
have been tested using version 0.12 of Richard Stallman's
GNU regular expression library in POSIX mode.
In order to allow presentation of a regular expression
in a text field concatenate any line ending in a backslash
with the following line, after discarding the backslash.
A formal definition of the '\n' and '\t' special characters
is most properly done in the DDL, but for completeness, please
note that '\n' is the line termination character ('newline')
and '\t' is the horizontal tab character. There is a formal
ambiguity in the use of '\n' for line termination, in that
the intention is that the equivalent machine/OS-dependent line
termination character sequence should be accepted as a match, e.g.
'\r' (control-M) under MacOS
'\n' (control-J) under Unix
'\r\n' (control-M control-J) under DOS and MS Windows
;
_name.category_id type
_name.object_id contents_regex
_type.purpose Encode
_type.container Single
_type.contents Text
save_
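The two rules in the description of _type.contents_regex above (folding lines that end in a backslash, and treating '\n' as any of the common line terminators) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, Python's `re` module stands in for the GNU POSIX-mode library mentioned in the description, and whole-string matching is assumed because the proposal does not say whether a partial match should count.

```python
import re


def fold_continuations(regex_text: str) -> str:
    # Join each line ending in a backslash with the following line,
    # discarding the backslash (and the line break after it).
    return re.sub(r"\\\r?\n", "", regex_text)


def normalise_newlines(value: str) -> str:
    # Map the DOS/Windows (CR LF) and old MacOS (CR) terminators onto LF,
    # so that '\n' in a regular expression accepts all three forms.
    return value.replace("\r\n", "\n").replace("\r", "\n")


def matches_contents_regex(value: str, regex_text: str) -> bool:
    # Assumption: the whole value must match the folded expression.
    pattern = fold_continuations(regex_text)
    return re.fullmatch(pattern, normalise_newlines(value)) is not None


# A hypothetical two-line expression, as it might appear in a text field.
folded = fold_continuations("-?[0-9]+\\\n([.][0-9]+)?")
print(folded)  # -?[0-9]+([.][0-9]+)?

# '\n' in the expression accepts a CR LF pair in the data.
print(matches_contents_regex("first\r\nsecond", r"first\nsecond"))  # True
```

Normalising the data rather than rewriting the expression keeps the stored regular expression untouched, which may matter if several tools read the same dictionary.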
For imgCIF, the DDL2 list is given below. If you are agreeable to the above addition to
DDLm, then with a few similar additions I can rapidly convert cif_img.dic to
that extended DDLm. I would also include your link proposal. It would allow a neat structural
match between NeXus/HDF5 Eiger datasets and CBF Eiger datasets.
####################
## ITEM_TYPE_LIST ##
####################
#
#
# The regular expressions defined here are not compliant
# with the POSIX 1003.2 standard as they include the
# '\n' and '\t' special characters. These regular expressions
# have been tested using version 0.12 of Richard Stallman's
# GNU regular expression library in POSIX mode.
# In order to allow presentation of a regular expression
# in a text field concatenate any line ending in a backslash
# with the following line, after discarding the backslash.
#
# A formal definition of the '\n' and '\t' special characters
# is most properly done in the DDL, but for completeness, please
# note that '\n' is the line termination character ('newline')
# and '\t' is the horizontal tab character. There is a formal
# ambiguity in the use of '\n' for line termination, in that
# the intention is that the equivalent machine/OS-dependent line
# termination character sequence should be accepted as a match, e.g.
#
# '\r' (control-M) under MacOS
# '\n' (control-J) under Unix
# '\r\n' (control-M control-J) under DOS and MS Windows
#
loop_
_item_type_list.code
_item_type_list.primitive_code
_item_type_list.construct
_item_type_list.detail
code char
'[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
; code item types/single words ...
;
ucode uchar
'[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
; code item types/single words (case insensitive) ...
;
line char
'[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
; char item types / multi-word items ...
;
uline uchar
'[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
; char item types / multi-word items (case insensitive)...
;
text char
'[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
; text item types / multi-line text ...
;
binary char
;\n--CIF-BINARY-FORMAT-SECTION--\n\
[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*\
\n--CIF-BINARY-FORMAT-SECTION----
;
; binary items are presented as MIME-like ascii-encoded
sections in an imgCIF. In a CBF, raw octet streams
are used to convey the same information.
;
int numb
'-?[0-9]+'
; int item types are the subset of numbers that are the negative
or positive integers.
;
float numb
'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
; float item types are the subset of numbers that are the floating
point numbers.
;
any char
'.*'
; A catch all for items that may take any form...
;
yyyy-mm-dd char
;\
[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9]?[0-9]\
((T[0-2][0-9](:[0-5][0-9](:[0-5][0-9](.[0-9]+)?)?)?)?\
([+-][0-5][0-9]:[0-5][0-9]))?
;
;
Standard format for CIF date and time strings (see
http://www.iucr.org/iucr-top/cif/spec/datetime.html),
consisting of a yyyy-mm-dd date optionally followed by
the character 'T' followed by a 24-hour clock time,
optionally followed by a signed time-zone offset.
The IUCr standard has been extended to allow for an optional
decimal fraction on the seconds of time.
Time is local time if no time-zone offset is given.
Note that this type extends the mmCIF yyyy-mm-dd type
but does not conform to the mmCIF yyyy-mm-dd:hh:mm
type that uses a ':' in place if the 'T' specified
by the IUCr standard. For reading, both forms should
be accepted, but for writing, only the IUCr form should
be used.
For maximal compatibility, the special time zone
indicator 'Z' (for 'zulu') should be accepted on
reading in place of '+00:00' for GMT.
;
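As a rough check of how three of the constructs listed above behave, the sketch below copies the int, float and yyyy-mm-dd expressions (the last one already folded across its backslash continuations) and tries them against a few sample values. The dictionary name `ITEM_TYPE_CONSTRUCTS`, the helper `conforms` and the sample values are hypothetical, and Python's `re.fullmatch` is assumed as the matching rule; the POSIX-mode library named in the comments may differ in detail.

```python
import re

# Constructs transcribed from the list above; names here are illustrative.
ITEM_TYPE_CONSTRUCTS = {
    "int": r"-?[0-9]+",
    "float": r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?",
    "yyyy-mm-dd": (
        r"[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9]?[0-9]"
        r"((T[0-2][0-9](:[0-5][0-9](:[0-5][0-9](.[0-9]+)?)?)?)?"
        r"([+-][0-5][0-9]:[0-5][0-9]))?"
    ),
}


def conforms(type_code: str, value: str) -> bool:
    # Assumption: the entire value must match the construct.
    return re.fullmatch(ITEM_TYPE_CONSTRUCTS[type_code], value) is not None


print(conforms("int", "-42"))                                # True
print(conforms("int", "4.2"))                                # False
print(conforms("float", "1.234(5)e-3"))                      # True
print(conforms("yyyy-mm-dd", "2020-04-02"))                  # True
print(conforms("yyyy-mm-dd", "2020-04-02T08:22:36-04:00"))   # True
print(conforms("yyyy-mm-dd", "2020-04-02T08:22:36"))         # False
```

Under this whole-string reading, the yyyy-mm-dd construct as transcribed accepts a time only when a time-zone offset follows it, because the offset group sits inside the same optional group as the time. A search-style matcher that accepts a matching prefix would behave differently.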
On Thu, Apr 2, 2020 at 12:52 AM James Hester <jamesrhester@gmail.com> wrote:
> On Thu, 2 Apr 2020 at 12:12, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
>> Dear James,
>>
>> I support the concept of external links. I cannot support the current proposal, because I do not understand it because it is a minimally documented concept, rather than a clear specification with syntax, semantics and examples.
>
> Before some group spends days and weeks formulating all of those things, I wanted to see if there was a general agreement with the direction and get some feeling for issues that might arise. Nobody is going to claim that you agreed to the detail, only that you think it is a productive direction. I will at some point work up a full example as I haven't seen any general objections raised.
>
>> For me the most important thing that is missing is a fully agreed mechanism to translate any DDLm dictionary into a DDL2 dictionary, i.e. a clear algorithmic definition of the schema to be read out of the DDLm dictionary, first without this new feature, then with it.
>
> This linking feature is not intended to change the schema covering a collection of data files in any way. The schema would continue to be specified using DDLm category definitions. The schema to be read out of a DDLm dictionary would therefore be identical before and after this change. I might be missing something, but I don't see why the DDLm/DDL2 relationship is somehow supposed to make or break this proposal.
>
>> The core to my confusion is conveyed in "I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others."
>>
>> I am worried about what DDLm means, not what any particular data dictionary as now written means, and how that relates to what DDL2 means, not what any particular DDL2 data dictionary means, so I can write code to go between DDLm and DDL2, with or without the new proposal. Yes, it is possible to write code from the information presented so far, but there is a great risk that the code I think conforms will not do similar things to the code you think conforms and neither will do similar things to the code John W. thinks conforms. It is time to algorithmically specify the "untangling" required by DDLm so we can always move reliably from a DDL2 world to a DDLm world and back reproducibly.
>
> I know, Herbert, that you have been keen for this to happen for a long time. However, it is pretty clear to me that the main DDL2 users are quite happy to live in a DDLm-free world and will not be devoting any time to tools related to DDLm. My limited time is currently devoted to polishing DDLm and the DDLm dictionaries and related chapters in Volume G. While a translation tool between DDLm and DDL2 dictionaries is entirely plausible, I don't have time to do it myself, it changes nothing in the actual data files, and I don't know why it should be a roadblock to DDLm progress at this time. I would be very willing to join in discussions around such a tool, but I don't have the time to create it myself.
>
> That said, if Herbert could collect together a list of the translation issues that he is aware of, that would be a good reference. From my point of view, DDLm and DDL2 dictionaries share identical concepts of categories and category keys, so the basic structures of the dictionaries align.
>
>> We agree that "that the fundamental underlying structure of any scientific data is relational". Good. Now let's make that a reality for all of CIF by extending DDLm with all the infrastructure needed to ensure that every DDLm-conforming dictionary will have an easy-to-untangle path to an equivalent DDL2-conforming dictionary. If we cannot do that we have not made the relational presentation of the DDLm-conforming dictionaries clear.
>
> Herbert, you propose that DDLm needs to be extended. What is DDLm missing?
>
>> We have gained a year to slow down, be careful and present a really well-documented, well-understood CIF next summer, a "Gold Standard" CIF, if you will.
>
> ddl.dic is the foundational documentation. What is missing?
>
>> On Wed, Apr 1, 2020 at 8:36 PM James Hester <jamesrhester@gmail.com> wrote:
>>> Dear all,
>>>
>>> See my comments inline below.
>>>
>>> On Thu, 2 Apr 2020 at 10:23, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
>>>> Dear Colleagues,
>>>>
>>>> This issue is just another aspect to the matters we have discussed since 1995 on the relationships among database schema, syntax and semantics. It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we use, so that we can work with interoperable presentations in CIF, XML, json, etc. DDL2 is very good at that. DDL1 was weak on that. DDLm is, frankly, a bit muddled in that regard when the syntax and semantics of pieces of various combined dictionaries can make it hard to trace what parts of which schema are intended to apply to which data.
>>>
>>> I agree up until Herbert's final sentence. Is DDLm muddled because of the lack of decent documentation, or because the concepts are imperfect? As far as I can tell, DDLm in its current version provides a mechanism that is about as simple as it can be and still handle the enormous diversity of the powder/magnetism/modulated/Laue world and combinations thereof in a machine-actionable manner - machine-actionable means that dREL methods can be written that will work with whatever combination of dictionaries you have come up with (think combined neutron / X-ray powder diffraction on a mixture expressed using a dictionary). Doing this has relied on relational data structures.
>>>
>>>> Links are certainly useful, and I would favor adding them to CIF as a container language, just as we added the imgCIF binary data type to enable the creation of CBFs, but just as with that case we should be sure to precisely specify what the equivalent DDL2 CIF presentation is of the same information in a single file, so that the schema can be unambiguously extracted.
>>>
>>> I believe Herbert is thinking here of links within a CIF data block pointing to items that are not straightforward DDLm-conforming CIF data blocks, thus necessitating a mapping between the pointed-to contents and the DDLm schema. Absolutely true that such a mapping is necessary. So perhaps Herbert is suggesting a further '_audit_link' data name that would identify the particular mapping to use? I agree. The lack of such mappings doesn't mean we can't define the data name. I would also add that, while one scenario might put such links into a 'global' block (like a Nexus master file) making a sort of container for other data blocks, another scenario might simply link one block with the next one along.
>>>
>>>> In the same vein I propose that we unambiguously specify the mapping of all non-looped DDLm categories into the equivalent DDL2 CIF presentation. I know there are people who think there is something special and different about the unlooped categories, but I firmly believe that any information that cannot be presented as relations preserving referential integrity is a disaster waiting to happen and eventually will become an unsearchable garble.
>>>
>>> I don't see why Herbert thinks that specifying the relationship between DDLm (I assume he means core CIF) and DDL2 (I assume he means mmCIF) is difficult. If the DIFFRN category is a set category in default core CIF, then it corresponds to a single-row DIFFRN category in mmCIF. I thoroughly agree that the fundamental underlying structure of any scientific data is relational, some data presentations require more untangling than others.
>>>
>>> By saying that a category is unlooped you are specifying the scope of a single data block (e.g. *one* compound, *one* sample), that is the significance of unlooped categories. DDL2 does exactly the same thing by specifying that the value of _entry.id is the data block identifier. So all children of _entry.id are single row i.e. Set categories. And there is no abandonment of relational integrity if you restrict some loops to having a single row as Herbert seems to be implying.
>>>
>>>> Just as to this day, COMCIFS has not pushed the binary data type into DDLm, it does not need to push links or looped sets into DDLm, but it does need to suggest a reasonable way to present the information involved in a DDLm equivalent that can be used by applications to deal with this information.
>>>
>>> We already have 'looped sets' as a result of the _audit.schema discussions several years ago. The documentation might still be a bit sparse.
>>>
>>> all the best,
>>> James.
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
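The remark in the quoted thread above, that an unlooped (Set) category corresponds to a single-row category in mmCIF, can be pictured with a minimal Python sketch. It is not an agreed DDLm-to-DDL2 translation algorithm: the function name is hypothetical, the item values are made up for illustration, and any key item that a DDL2 dictionary would require (for example one tied to _entry.id) is deliberately left out.

```python
from typing import Dict, List, Tuple


def set_category_as_single_row(
    category: str, items: Dict[str, str]
) -> Tuple[List[str], List[List[str]]]:
    # Present the key-value items of an unlooped (Set) category as the
    # column names and the single row of an equivalent looped category.
    columns = [f"_{category}.{object_id}" for object_id in items]
    row = [items[object_id] for object_id in items]
    return columns, [row]


# Illustrative values only.
columns, rows = set_category_as_single_row(
    "diffrn", {"ambient_temperature": "293", "ambient_pressure": "101"}
)
print(columns)  # ['_diffrn.ambient_temperature', '_diffrn.ambient_pressure']
print(rows)     # [['293', '101']]
```

The point of the sketch is only that no information is lost or invented in this direction; the harder questions raised in the thread concern keys and multi-block scope, which it does not address.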
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group