Discussion List Archives

Re: [ddlm-group] Adding Regular Expressions to DDLm, was On schema,syntax and semantics

(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.

For generality, yes, and maybe more than that.  But not necessarily to support the specific patterns in use in existing dictionaries.  That is, at least some of the patterns now used would work correctly on UTF-8 byte streams encoding some non-ASCII characters, and I imagine that others could be adjusted to work.  If any dictionaries intend to continue to limit values to ASCII characters, then their patterns also may work fine as-is.  We probably do want to address this, but it would be worthwhile to survey current DDL2 dictionaries to see where we actually stand.
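
As a rough illustration of that distinction, here is a minimal Python sketch. Both patterns are hypothetical stand-ins, not constructs taken from any DDL2 dictionary. It suggests that a pattern built on a negated character class may behave the same on decoded text and on the UTF-8 byte stream, whereas a pattern that counts characters may not.

```python
import re

# Hypothetical ASCII-era construct: one or more non-whitespace characters.
NON_BLANK = r"[^ \t\r\n]+"
# Hypothetical construct that counts characters: exactly one character.
ONE_CHAR = r"."

def matches_text(pattern: str, value: str) -> bool:
    """Match against decoded text (Unicode code points)."""
    return re.fullmatch(pattern, value) is not None

def matches_utf8(pattern: str, value: str) -> bool:
    """Match the same pattern against the value's UTF-8 byte stream."""
    return re.fullmatch(pattern.encode("ascii"), value.encode("utf-8")) is not None

angstrom = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE, two bytes in UTF-8

# The negated class accepts the value in both views.
print(matches_text(NON_BLANK, angstrom), matches_utf8(NON_BLANK, angstrom))
# The counting pattern sees one character as text but two bytes as UTF-8.
print(matches_text(ONE_CHAR, angstrom), matches_utf8(ONE_CHAR, angstrom))
```

A survey of existing dictionaries along these lines would show which constructs fall into which group.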

(2) DDLm should be applicable to both text and non-text data formats. So any new DDLm machinery for regexs would need to clarify that the regular expression applies only in those cases where the data value is represented as text.

I think this is drawing a weak point of DDLm into the spotlight.  As we position CIF dictionaries as general ontological objects, as opposed to objects dedicated specifically to CIF-format files, we find that DDLm and data dictionaries are straddling that divide a bit uncomfortably.  I see that in how DDLm is inconsistent about specifying text formats for various _type.contents alternatives.  It is quite specific about some, consistent with a role in defining details of CIF-format data representation, yet entirely silent about others.

Perhaps, then, this is an opportunity to sort that out somewhat.  Yes, we should clarify that the regexes specify the allowed *CIF-format* text representations of the various types.  We could also provide a default regex for each _type.contents alternative, moving those details out of the enumeration set of _type.contents.

Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section which has a particular string at the beginning and end.

That's not the only imgCIF (or mmCIF) data type that needs regex.  Consider imgCIF's 'code' type.  In the first place, it is not congruent with DDLm's 'Code' because the former is case-sensitive but the latter not.  imgCIF and mmCIF distinguish this from 'ucode', which is case-insensitive.  It looks like a DDLm-based version of imgCIF's 'code' would need to be a regex-restricted version of Text.  Additionally, if items of these imgCIF types need to remain restricted to ASCII characters, as their present regexes would do, then for this reason also they would need regex.  Similar considerations apply to several other mmCIF and imgCIF data types.
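
To make the ASCII point concrete, here is a small Python sketch. The character class is a hypothetical stand-in; the real imgCIF and mmCIF constructs differ and should be taken from those dictionaries. It also shows one pitfall that a Unicode-aware regex engine can introduce for a case-insensitive type: in Python, `[a-z]` combined with `re.IGNORECASE` matches a few non-ASCII letters unless `re.ASCII` is also given.

```python
import re

# Hypothetical stand-in for an ASCII-only 'code' construct.
CODE = r"[A-Za-z0-9_.:-]+"
UCODE = r"[a-z0-9_.:-]+"

def is_code(value: str) -> bool:
    """Case-sensitive, ASCII-only check (a regex-restricted Text)."""
    return re.fullmatch(CODE, value) is not None

def is_ucode_naive(value: str) -> bool:
    """Case-insensitive check without re.ASCII; admits a few non-ASCII letters."""
    return re.fullmatch(UCODE, value, re.IGNORECASE) is not None

def is_ucode(value: str) -> bool:
    """Case-insensitive check that stays within ASCII."""
    return re.fullmatch(UCODE, value, re.IGNORECASE | re.ASCII) is not None

def same_code(a: str, b: str) -> bool:
    """Values of a case-sensitive type compare exactly."""
    return a == b

def same_ucode(a: str, b: str) -> bool:
    """Values of a case-insensitive ASCII type compare after lowercasing."""
    return a.lower() == b.lower()

kelvin = "\u212a"  # KELVIN SIGN, case-folds to 'k'
print(is_code("Pilatus_6M"), is_code("\u00c5"))
print(is_ucode_naive(kelvin), is_ucode(kelvin))
print(same_code("ABC", "abc"), same_ucode("ABC", "abc"))
```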

DDLm's present type system is a bit eclectic and a little inconsistent, especially with regard to text-based data types.  It has no general-purpose case-insensitive text data type, but it has several special-purpose ones.  It has the highly domain-specific 'Symop'.  And several of the types that are not inherently textual nevertheless specify details of values' text representation, whereas others leave that ambiguous.  I'm not sure how much we can improve that at this point, but I think we can make at least some progress.

I therefore think we should seriously consider adding an 'Image' data type to DDLm; when represented as ascii, the regular expression applies.

We should avoid being so specific here.  What would be gained by defining an 'Image' type at the DDLm level instead of a more general 'Binary' type?  Yes, though, we can use the proposed regex mechanism to describe an external representation of such values.  It must of course be understood that that would apply within the framework of the data file's container format (CIF 1.1 or CIF 2.0, for example), and not override that format's requirements.

The whole machinery already in img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition,

I take you to mean the details presented in the definition of _array_data.data (https://www.iucr.org/__data/iucr/cifdic_html/2/cif_img.dic/Iarray_data.data.html).  Most of that seems appropriate, but not all.  Unless we are also going to loosen the CIF format specifications to accommodate CBF (as opposed to imgCIF), it's unclear to me that we would want to include the CBF details.  There may be other bits that should be omitted.

and the dREL value for a data name with type.contents 'Image' would be the array of integers that resulted.

Makes sense.


Best regards,

John


From: James Hester <jamesrhester@gmail.com>
Sent: Thursday, April 2, 2020 7:07 PM
To: Bollinger, John C <John.Bollinger@STJUDE.ORG>
Cc: Herbert J. Bernstein <yayahjb@gmail.com>; Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Adding Regular Expressions to DDLm, was On schema, syntax and semantics

Hello again everybody:

I have taken the liberty of once again splitting this thread out into a separate topic for the benefit of future readers. I will also make a summary of this discussion in an issue in the core CIF github repository so that those not on this list can contribute.

I also don't have any great in-principle objections to adding regular expressions. I am vaguely concerned that regular expressions will be seen as a license to embed information into a data value instead of defining separate data names, but that danger can be managed by dictionary groups. 

As we all seem happy with the idea, moving on to technical comments:

(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.
(2) DDLm should be applicable to both text and non-text data formats. So any new DDLm machinery for regexs would need to clarify that the regular expression applies only in those cases where the data value is represented as text. 
(3) Herbert's original suggestion of a 'ByRegex' addition to _type.contents fits in with (2) in that it explicitly stops any other _type.contents types from having regular expressions assigned, and so is a minimal change to the DDLm architecture.
(4) John B's suggestion that a new attribute be created (which we have to do anyway to hold the regular expression) is also workable in that we could specify in the attribute definition that the Regexp is only relevant for certain _type.contents values (Text/Code/Tag?)

Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section which has a particular string at the beginning and end. Under suggestion (3), the definition for _array_data.data (which uses this) would have _type.contents 'ByRegex' and a regular expression supplied in a separate attribute. Under suggestion (4), the regular expression would be supplied in a separate attribute *and* the _type.contents set to 'Text'.

Both obviously work in this case. Now, if we consider imgCBF (or indeed nxMX), which is an alternate format for holding CIF data, _array_data.data is delivered from the format as binary, not text.  While our caveat of only applying the regular expression if the data value is text applies and therefore validation trivially passes for the imgCBF data value, it is not correct to say that the imgCBF value is 'Text'.  I therefore think we should seriously consider adding an 'Image' data type to DDLm; when represented as ascii, the regular expression applies. The whole machinery already in img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition, and the dREL value for a data name with type.contents 'Image' would be the array of integers that resulted.
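
The "regex applies only to text" caveat can be sketched in Python as follows. This is an illustrative assumption about how a validator might behave, not an existing DDLm tool. The delimiter strings in the pattern are assumed from the imgCIF convention for text-encoded binary sections and should be checked against the img_CIF dictionary.

```python
import re
from typing import Optional, Union

# Assumed delimiters for a text-encoded binary section; verify against img_CIF.
BINARY_SECTION = (
    r"\s*--CIF-BINARY-FORMAT-SECTION--"
    r".*"
    r"--CIF-BINARY-FORMAT-SECTION----\s*"
)

def regex_check(value: Union[str, bytes], pattern: Optional[str]) -> bool:
    """Apply the regex only when the value arrives as text.

    A value delivered as binary (as from imgCBF) passes trivially, as does
    any value whose definition supplies no pattern.
    """
    if pattern is None or not isinstance(value, str):
        return True
    return re.fullmatch(pattern, value, re.DOTALL) is not None

text_value = (
    "--CIF-BINARY-FORMAT-SECTION--\n"
    "(headers and encoded data would go here)\n"
    "--CIF-BINARY-FORMAT-SECTION----\n"
)
print(regex_check(text_value, BINARY_SECTION))             # well-formed text
print(regex_check("no delimiters here", BINARY_SECTION))   # malformed text
print(regex_check(b"\x00\x01\x02", BINARY_SECTION))        # binary, not checked
```

Note that the trivial pass for binary values says nothing about whether the value is well formed, which is part of the argument for a distinct non-text type.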

So in conclusion, I don't see problems with the addition of regexs, and prefer John's suggestion (4) with the addition of an 'Image' _type.contents. We can discuss the details of the Image type in a separate thread or on the img_CIF github issues page.

all the best
James.

On Fri, 3 Apr 2020 at 08:12, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Herbert wrote:

My request is simple -- I wish to add to ddl.dic whatever is necessary to completely support the DDL2 dictionaries, i.e. the PDB
dictionaries and cif_img.dic.  Let us start with one simple basic request, direct support for the specification of
data types as regular expressions.  I believe this goal can be achieved by adding to the _type.contents _enumeration_set
the ability to specify regular expressions, so we can do machine parsable validations.  Any objections to the concept?

This is not what I was commenting upon in my previous message.  I agree, however, that DDLm should be extended as necessary to conveniently express the contents of all the current DDL2 dictionaries.  I say "conveniently" because I recognize that quite a lot can, in principle, be done via DDLm methods (such as validating item values against regexes), yet some of those things would be better expressed via for-purpose dictionary structures.  I say "the contents of current DDL2 dictionaries" because I do not accept a need to match any expressive capability of DDL2 that is not actually used in practice in current dictionaries, if in fact any such capabilities exist.

I think regular-expression-based data validation is a fine place to start.  I take it that on the DDL2 side we are talking about the item_type_list category and especially its _item_type_list.construct attribute.  On the DDLm side, I interpret the suggestion to be to recognize a new value for _type.contents, which would indicate that item values are to be validated via a regex (presumably given by some other attribute) instead of as directed by one of the other codes.  Inasmuch as we would need a new attribute for the regex anyway, however, why add anything to _type.contents?  I would be inclined to let the regex apply *in addition to* the value constraints conveyed by _type.contents.  That would be a bit simpler, and also a bit more analogous to DDL2, with DDLm _type.contents having a role similar to DDL2 _item_type_list.primitive_code.
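
A minimal Python sketch of the "in addition to" reading follows. The per-type checks are hypothetical and heavily simplified stand-ins for what _type.contents implies, and the `regex` argument stands for the value of the yet-to-be-named new attribute.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical, simplified stand-ins for the constraints of _type.contents.
CONTENTS_CHECKS: Dict[str, Callable[[str], bool]] = {
    "Text": lambda v: True,
    "Integer": lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None,
}

def validate(value: str, contents: str, regex: Optional[str] = None) -> bool:
    """Accept a value only if it satisfies the _type.contents constraint
    and, when a regex is supplied, the regex as well."""
    if not CONTENTS_CHECKS[contents](value):
        return False
    return regex is None or re.fullmatch(regex, value) is not None

print(validate("42", "Integer"))               # type constraint only
print(validate("42", "Integer", r"[0-9]{3}"))  # regex narrows the type
print(validate("abc", "Integer", r".*"))       # regex cannot widen the type
```

Under this reading a regex can only narrow what _type.contents allows, never widen it, which is what makes the new attribute independent of any change to the enumeration of _type.contents.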


John



_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group