Discussion List Archives

Re: [ddlm-group] Adding Regular Expressions to DDLm, was On schema,syntax and semantics

Dear John B. and others:

On Sat, 4 Apr 2020 at 02:49, Bollinger, John C <John.Bollinger@stjude.org> wrote:
(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.

For generality, yes, and maybe more than that.  But not necessarily to support the specific patterns in use in existing dictionaries.  That is, at least some of the patterns now used would work correctly on UTF-8 byte streams encoding some non-ASCII characters, and I imagine that others could be adjusted to work.  If any are intended to continue limiting values to ASCII characters, then they also may work fine as-is.  We probably do want to address this, but it would be worthwhile to survey current DDL2 dictionaries to see where we actually stand.

OK. 
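
To make the Unicode point concrete, here is a minimal Python sketch of how one pattern can behave differently on code points and on a UTF-8 byte stream, and why normalisation matters. The pattern ASCII_CODE and the helpers matches_text and matches_bytes are illustrative assumptions, not taken from any DDL2 dictionary.

```python
import re
import unicodedata

# Hypothetical ASCII-oriented pattern of the kind a dictionary might use.
ASCII_CODE = r"[A-Za-z0-9_.-]+"


def matches_text(pattern: str, value: str) -> bool:
    """Match against Unicode code points after NFC normalisation."""
    normalised = unicodedata.normalize("NFC", value)
    return re.fullmatch(pattern, normalised) is not None


def matches_bytes(pattern: str, value: str) -> bool:
    """Match the same (ASCII-only) pattern against the UTF-8 bytes of value."""
    return re.fullmatch(pattern.encode("ascii"), value.encode("utf-8")) is not None


# A pure-ASCII value is accepted either way.
print(matches_text(ASCII_CODE, "abc_1"), matches_bytes(ASCII_CODE, "abc_1"))

# '.' is one code point for str but one byte for bytes, so length limits
# differ: U+00E9 is a single code point but two UTF-8 bytes.
print(matches_text(r".", "\u00e9"), matches_bytes(r".", "\u00e9"))

# Without normalisation, 'e' + combining acute (two code points) would not
# match a single '.', although it is canonically equivalent to U+00E9.
print(matches_text(r".", "e\u0301"), re.fullmatch(r".", "e\u0301") is not None)
```

An ASCII-restricting pattern is unaffected by the choice, which is consistent with the remark that such patterns may work as-is; patterns using '.' or length bounds are the ones a survey would need to look at.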

(2) DDLm should be applicable to both text and non-text data formats. So any new DDLm machinery for regexs would need to clarify that the regular expression applies only in those cases where the data value is represented as text.
I think this is drawing a weak point of DDLm into the spotlight.  As we position CIF dictionaries as general ontological objects, as opposed to objects dedicated specifically to CIF-format files, we find that DDLm and data dictionaries are straddling that divide a bit uncomfortably.  I see that in how DDLm is inconsistent about specifying text formats for various _type.contents alternatives.  It is quite specific about some, consistent with a role in defining details of CIF-format data representation, yet entirely silent about others.

Perhaps, then, this is an opportunity to sort that out somewhat.  Yes, we do clarify that the regexs specify the allowed *CIF-format* text representations of the various types. And we also provide a default regex for each _type.contents alternative, moving those details out of _type.contents's enumeration_set.

Absolutely we should be clear that any description of text formats is only applicable where a data value is represented as text. We have tried to remove any presumption of a textual representation (e.g. for integers) in ddl.dic and should reword any remaining items that do make this assumption. If you could note any remaining places where this occurs, e.g. as a GitHub issue, that would be great.

I'm not enthusiastic about providing a regex for every _type.contents alternative. In the case of integers, the representation of an integer is covered by the CIF syntax specifications, and other text formats may have other approaches which we do not want to preclude.  I instead see regexes as a way to provide further granularity for validation of particular data items that are already declared as having a textual (i.e. sequence of Unicode code points) value.


Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section which has a particular string at the beginning and end.
That's not the only imgCIF (or mmCIF) data type that needs regex.  Consider imgCIF's 'code' type.  In the first place, it is not congruent with DDLm's 'Code' because the former is case-sensitive but the latter not.  imgCIF and mmCIF distinguish this from 'ucode', which is case-insensitive.  It looks like a DDLm-based version of imgCIF's 'code' would need to be a regex-restricted version of Text.  Additionally, if items of these imgCIF types need to remain restricted to ASCII characters, as their present regexes would do, then for this reason also they would need regex.  Similar considerations apply to several other mmCIF and imgCIF data types.

Indeed. There are an impressive number of types in mmCIF.  I would advocate mostly providing these as regexes in the definition of each particular data name that needs them.  'code30' in mmCIF/PDBx is a sequence of no more than 30 characters, which sounds like a legacy PDB issue (none in imgCIF, needless to say). For any type that DDLm does not have but DDL2 does (i.e. most of them), the regex is a restriction on the general type provided by DDLm. The translation DDLm -> DDL2 can insert its own hardcoded regexes for types in common between DDL2 and DDLm.
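
As a rough illustration of a regex acting as a restriction on the general Text type, the following Python sketch checks a 'code30'-like value and contrasts case-sensitive and case-insensitive comparison. The pattern CODE30 is an assumption made for the example (1 to 30 printable, non-blank ASCII characters); the actual DDL2 construct should be taken from the dictionary itself.

```python
import re

# Assumed stand-in for a 'code30'-style restriction: 1..30 printable,
# non-whitespace ASCII characters. Not copied from any DDL2 dictionary.
CODE30 = re.compile(r"[\x21-\x7e]{1,30}")


def is_code30(value: str) -> bool:
    """True if the whole value satisfies the assumed code30 pattern."""
    return CODE30.fullmatch(value) is not None


def same_code(a: str, b: str) -> bool:
    """Case-sensitive comparison, as for imgCIF/mmCIF 'code'."""
    return a == b


def same_ucode(a: str, b: str) -> bool:
    """Case-insensitive comparison, as for 'ucode'."""
    return a.casefold() == b.casefold()


print(is_code30("DETECTOR_1"))        # within the length and character limits
print(is_code30("x" * 31))            # too long
print(same_code("Abc", "abc"), same_ucode("Abc", "abc"))
```

Note that the regex only restricts which values are allowed; whether two values compare equal regardless of case is a separate property that a pattern alone does not express.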


DDLm's present type system is a bit eclectic and a little inconsistent, especially with regard to text-based data types.  It has no general-purpose case-insensitive text data type, but it has several special-purpose ones.  It has the highly domain-specific 'Symop'.  And several of the types that are not inherently textual nevertheless specify details of values' text representation, whereas others leave that ambiguous.  I'm not sure how much we can improve that at this point, but I think we can make at least some progress.

As I said above, if you could note any remaining things in a GitHub issue we can work through them individually.

As a digression, 'Symop' is a wart that is there purely to support legacy representations. Clearly the value is encoding information that should be presented as separate data names, and we have done this in the topological dictionary. Actually, with regex changes we could drop the 'Symop' type completely from DDLm and replace it with a regular expression in the definitions of the relevant data names - no semantics would change. What do we think about that?   
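
A minimal Python sketch of what replacing 'Symop' with a regular expression might look like, and of how the encoded value splits into the separate pieces of information mentioned above. The pattern is a simplified assumption (an operator number optionally followed by '_' or a space and exactly three digits, with 5 meaning no translation); it is not the wording of any current dictionary.

```python
import re

# Simplified, assumed form of a symmetry-operation code: "n" or "n_klm".
SYMOP = re.compile(r"([1-9][0-9]*)(?:[_ ]([0-9])([0-9])([0-9]))?")


def split_symop(value: str):
    """Return (operator number, (ta, tb, tc)) for a value such as '4_655'."""
    m = SYMOP.fullmatch(value)
    if m is None:
        raise ValueError(f"not a symmetry operation code: {value!r}")
    number = int(m.group(1))
    if m.group(2) is None:
        # No translation part given: treat as the untranslated operation.
        return number, (0, 0, 0)
    # Each digit is offset by 5, so '555' is zero translation.
    translation = tuple(int(d) - 5 for d in m.group(2, 3, 4))
    return number, translation


print(split_symop("4_655"))
print(split_symop("2"))
```

The regex validates the legacy text form, while split_symop shows the values that separate data names would carry directly.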
I therefore think we should seriously consider adding an 'Image' data type to DDLm; when represented as ascii, the regular expression applies.
We should avoid being so specific here.  What would be gained by defining an 'Image' type at the DDLm level instead of a more general 'Binary' type?  Yes, though, we can use the proposed regex mechanism to describe an external representation of such values.  It must of course be understood that that would apply within the framework of the data file's container format (CIF 1.1 or CIF 2.0, for example), and not override that format's requirements.

Yes, I agree that we should just have a 'Binary' type which is a sequence of bytes. As I suggested in a separate message on Friday 3rd, we can provide a built-in DDLm function 'Decode' that would take something of this type and return an array of integers.
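
To give a feel for what such a function could do, here is a heavily simplified Python sketch. The name decode and its parameters (elsize, byteorder, signed) are hypothetical, and the sketch assumes the bytes are already uncompressed fixed-width integers; real imgCIF data also involves text encodings and compression schemes that are not handled here.

```python
from typing import List


def decode(data: bytes, elsize: int = 4, byteorder: str = "little",
           signed: bool = True) -> List[int]:
    """Turn an opaque byte sequence into a flat list of integers.

    Assumes uncompressed, fixed-width elements; the parameters would in
    practice be drawn from the values of other data names.
    """
    if elsize <= 0 or len(data) % elsize != 0:
        raise ValueError("data length is not a multiple of the element size")
    return [
        int.from_bytes(data[i:i + elsize], byteorder, signed=signed)
        for i in range(0, len(data), elsize)
    ]


raw = (1).to_bytes(4, "little", signed=True) + (-1).to_bytes(4, "little", signed=True)
print(decode(raw))
print(decode(b"\x00\x01\x00\x02", elsize=2, byteorder="big", signed=False))
```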
 The whole machinery already in img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition,
I take you to mean the details presented in the definition of _array_data.data (https://www.iucr.org/__data/iucr/cifdic_html/2/cif_img.dic/Iarray_data.data.html).  Most of that seems appropriate, but not all.  Unless we are also going to loosen the CIF format specifications to accommodate CBF (as opposed to imgCIF), it's unclear to me that we would want to include the CBF details.  There may be other bits that should be omitted.

Yes you are correct, thanks for putting the link in. My thinking is that CBF is a data format like any other and should be describable by a DDLm dictionary.  However, the value of _array_data.data for a CBF and a CIF are clearly different. Options:
(1) We could incorporate the decoding of _array_data.data into the CBF *format* specification, in which case _array_data.data would be delivered as an array of integers. This however still leaves _array_data.data as an opaque character string for imgCIF, for which we cannot adjust the CIF format specification (would that be CIF2.1?)
(2) We can do (1) and further state that _array_data.data in the case of imgCBF actually corresponds to a different dictionary data name, e.g. _array_data.as_integers. This is possible but would appear very obtuse and unexpected to most people.
(3) We can in both cases state that the value of _array_data.data is an opaque stream of bytes which, if interpreted as ASCII text*, always matches the regex provided. We provide the built-in function 'Decode' to turn this opaque stream into an integer array associated with a different data name, providing it with the appropriate parameters drawn from the values of other data names.

So I think (3) is the only option that is sensible and practical given the legacy we find ourselves with.

*: where non-ASCII bytes are substituted with '.'.
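
Option (3) and its footnote can be sketched in Python as follows. The helper ascii_view and the pattern SECTION are assumptions for illustration; in particular the marker strings are written from memory of the imgCIF convention and should be checked against cif_img.dic before being relied on.

```python
import re


def ascii_view(data: bytes) -> str:
    """Interpret bytes as ASCII text, substituting '.' for non-ASCII bytes."""
    return "".join(chr(b) if b < 0x80 else "." for b in data)


# Assumed start and end markers of a binary section (to be verified).
SECTION = re.compile(
    r"\s*--CIF-BINARY-FORMAT-SECTION--.*--CIF-BINARY-FORMAT-SECTION----\s*",
    re.DOTALL,
)


def looks_like_binary_section(data: bytes) -> bool:
    """Apply the regex to the ASCII view of an opaque byte stream."""
    return SECTION.fullmatch(ascii_view(data)) is not None


payload = (
    b"--CIF-BINARY-FORMAT-SECTION--\r\n"
    b"Content-Type: application/octet-stream\r\n\r\n"
    b"\x0c\x1a\x04\xd5\xff\x00\x80\r\n"
    b"--CIF-BINARY-FORMAT-SECTION----"
)
print(ascii_view(b"ab\xffc"))
print(looks_like_binary_section(payload))
print(looks_like_binary_section(b"\xff\xfe\x00\x01"))
```

Because the substitution is applied before matching, the same regex can be stated once in the dictionary and checked against either a text-based or a binary container, without claiming that the value is 'Text'.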


From: James Hester <jamesrhester@gmail.com>
Sent: Thursday, April 2, 2020 7:07 PM
To: Bollinger, John C <John.Bollinger@STJUDE.ORG>
Cc: Herbert J. Bernstein <yayahjb@gmail.com>; Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Adding Regular Expressions to DDLm, was On schema, syntax and semantics
 

Hello again everybody:

I have taken the liberty of once again splitting this thread out into a separate topic for the benefit of future readers. I will also make a summary of this discussion in an issue in the core CIF github repository so that those not on this list can contribute.

I also don't have any great in-principle objections to adding regular expressions. I am vaguely concerned that regular expressions will be seen as a license to embed information into a data value instead of defining separate data names, but that danger can be managed by dictionary groups. 

As we all seem happy with the idea, moving on to technical comments:

(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.
(2) DDLm should be applicable to both text and non-text data formats. So any new DDLm machinery for regexs would need to clarify that the regular expression applies only in those cases where the data value is represented as text. 
(3) Herbert's original suggestion of a 'ByRegex' addition to _type.contents fits in with (2) in that it explicitly stops any other _type.contents types from having regular expressions assigned, and so is a minimal change to the DDLm architecture.
(4) John B's suggestion that a new attribute be created (which we have to do anyway to hold the regular expression) is also workable in that we could specify in the attribute definition that the Regexp is only relevant for certain _type.contents values (Text/Code/Tag?)

Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section which has a particular string at the beginning and end. Under suggestion (3), the definition for _array_data.data (which uses this) would have _type.contents 'ByRegex' and a regular expression supplied in a separate attribute. Under suggestion (4), the regular expression would be supplied in a separate attribute *and* the _type.contents set to 'Text'.

Both obviously work in this case. Now, if we consider imgCBF (or indeed nxMX), which is an alternate format for holding CIF data, _array_data.data is delivered from the format as binary, not text.  While our caveat of only applying the regular expression if the data value is text applies, and therefore validation trivially passes for the imgCBF data value, it is not correct to say that the imgCBF value is 'Text'.  I therefore think we should seriously consider adding an 'Image' data type to DDLm; when represented as ASCII, the regular expression applies. The whole machinery already in img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition, and the dREL value for a data name with _type.contents 'Image' would be the array of integers that resulted.

So in conclusion, I don't see problems with the addition of regexs, and prefer John's suggestion (4) with the addition of an 'Image' _type.contents. We can discuss the details of the Image type in a separate thread or on the img_CIF github issues page.

all the best
James.

On Fri, 3 Apr 2020 at 08:12, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Herbert wrote:

My request is simple -- I wish to add to ddl.dic whatever is necessary to completely support the DDL2 dictionaries, i.e. the PDB dictionaries and cif_img.dic.  Let us start with one simple basic request, direct support for the specification of data types as regular expressions.  I believe this goal can be achieved by adding to the _type.contents _enumeration_set the ability to specify regular expressions, so we can do machine parsable validations.  Any objections to the concept?
This is not what I was commenting upon in my previous message.  I agree, however, that DDLm should be extended as necessary to conveniently express the contents of all the current DDL2 dictionaries.  I say "conveniently" because I recognize that quite a lot can, in principle, be done via DDLm methods (such as validating item values against regexes), yet some of those things would be better expressed via for-purpose dictionary structures.  I say "the contents of current DDL2 dictionaries" because I do not accept a need to match any expressive capability of DDL2 that is not actually used in practice in current dictionaries, if in fact any such capabilities exist.

I think regular-expression-based data validation is a fine place to start.  I take it that on the DDL2 side we are talking about the item_type_list category and especially its _item_type_list.construct attribute.  On the DDLm side, I interpret the suggestion to be to recognize a new value for _type.contents, which would indicate that item values are to be validated via a regex (presumably given by some other attribute) instead of as directed by one of the other codes.  Inasmuch as we would need a new attribute for the regex anyway, however, why add anything to _type.contents?  I would be inclined to let the regex apply *in addition to* the value constraints conveyed by _type.contents.  That would be a bit simpler, and also a bit more analogous to DDL2, with DDLm _type.contents having a role similar to DDL2 _item_type_list.primitive_code.
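
A small Python sketch of the "in addition to" idea: the _type.contents check runs first, and the regex, if one is given, further restricts values that are represented as text. BASE_CHECKS and validate are hypothetical names, and the base checks are rough stand-ins for a few _type.contents codes rather than their actual definitions.

```python
import re
from typing import Optional

# Rough, assumed stand-ins for a few _type.contents codes.
BASE_CHECKS = {
    "Text": lambda v: isinstance(v, str),
    "Code": lambda v: isinstance(v, str) and not any(c.isspace() for c in v),
    "Integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def validate(value, contents: str, regex: Optional[str] = None) -> bool:
    """Apply the base type check, then the optional regex on text values."""
    if not BASE_CHECKS[contents](value):
        return False
    # The regex constrains only textual representations; other
    # representations pass this stage untouched.
    if regex is not None and isinstance(value, str):
        return re.fullmatch(regex, value) is not None
    return True


print(validate("ABC", "Code", r"[A-Z]+"))   # passes both stages
print(validate("abc", "Code", r"[A-Z]+"))   # fails the regex
print(validate("a b", "Code"))              # fails the base check
print(validate(5, "Integer", r"[A-Z]+"))    # regex not applicable
```

In this arrangement no new _type.contents value is needed: the regex lives in its own attribute and narrows whatever the base type already allows.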


John



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148




_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
