

[ddlm-group] Adding Regular Expressions to DDLm, was On schema,syntax and semantics

Hello again everybody:

I have taken the liberty of once again splitting this thread out into a separate topic for the benefit of future readers. I will also make a summary of this discussion in an issue in the core CIF github repository so that those not on this list can contribute.

I also don't have any great in-principle objections to adding regular expressions. I am vaguely concerned that regular expressions will be seen as a license to embed information into a data value instead of defining separate data names, but that danger can be managed by dictionary groups. 

As we all seem happy with the idea, moving on to technical comments:

(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.
(2) DDLm should be applicable to both text and non-text data formats, so any new DDLm machinery for regexes would need to make clear that the regular expression applies only in those cases where the data value is represented as text.
(3) Herbert's original suggestion of a 'ByRegex' addition to _type.contents fits in with (2) in that it explicitly stops any other _type.contents types from having regular expressions assigned, and so is a minimal change to the DDLm architecture.
(4) John B's suggestion that a new attribute be created (which we have to do anyway to hold the regular expression) is also workable, in that we could specify in the attribute definition that the regular expression is only relevant for certain _type.contents values (Text/Code/Tag?).
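
As a rough illustration of points (1) and (2), here is a minimal Python sketch of a regex check that normalises text values first and skips non-text values entirely. The function name, the choice of NFC as the normalisation form, and the assumption that dictionary authors write their patterns in NFC are all mine, not anything agreed on this list.

```python
import re
import unicodedata


def regex_valid(value, pattern):
    """Return True if `value` passes the regex check or the check does not apply.

    Hypothetical helper: assumes `pattern` is already written in NFC.
    """
    # Point (2): the regex is only meaningful for values represented as text.
    if not isinstance(value, str):
        return True
    # Point (1): normalise so canonically equivalent strings match alike.
    # NFC is an assumption; the actual form would be fixed by the definition.
    normalised = unicodedata.normalize("NFC", value)
    return re.fullmatch(pattern, normalised) is not None
```

With this sketch, a decomposed "e" plus combining acute accent matches a pattern written with the precomposed character, and a bytes value is passed through without being tested.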

Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section, which has a particular string at the beginning and at the end. Under suggestion (3), the definition for _array_data.data (which uses this) would have _type.contents 'ByRegex' and a regular expression supplied in a separate attribute. Under suggestion (4), the regular expression would be supplied in a separate attribute *and* the _type.contents set to 'Text'.
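
To make the difference between (3) and (4) concrete, here is a small Python sketch of how a validator might treat the two styles of definition for a text value. The attribute name `_type.regex` is hypothetical (no name has been chosen), the marker strings in the pattern are placeholders rather than the real img_CIF strings, and only the 'Text' case of _type.contents is modelled.

```python
import re

REGEX_ATTR = "_type.regex"  # hypothetical attribute name, not part of DDLm

# Placeholder pattern: a fixed string at the start and at the end of the value.
PATTERN = r"BEGIN-MARKER.*END-MARKER"

# Suggestion (3): a new _type.contents value plus the regex attribute.
definition_3 = {"_type.contents": "ByRegex", REGEX_ATTR: PATTERN}
# Suggestion (4): an existing _type.contents value plus the regex attribute.
definition_4 = {"_type.contents": "Text", REGEX_ATTR: PATTERN}


def contents_ok(value, contents):
    """Stand-in for the existing _type.contents checks (only 'Text' modelled)."""
    if contents == "Text":
        return isinstance(value, str)
    raise ValueError(f"contents type not modelled in this sketch: {contents}")


def validate(value, definition):
    """Validate a text value against either style of definition."""
    contents = definition["_type.contents"]
    pattern = definition.get(REGEX_ATTR)
    if contents == "ByRegex":
        # (3): the regex is the only constraint on the value.
        return re.fullmatch(pattern, value, re.DOTALL) is not None
    # (4): the usual contents check runs first, then the regex in addition.
    if not contents_ok(value, contents):
        return False
    if pattern is None:
        return True
    return re.fullmatch(pattern, value, re.DOTALL) is not None
```

Both definitions accept a value that carries the two markers and reject one that does not; the difference is only in whether the ordinary _type.contents check also runs.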

Both obviously work in this case. Now consider imgCBF (or indeed nxMX), an alternative format for holding CIF data, in which _array_data.data is delivered from the format as binary, not text. Our caveat of applying the regular expression only when the data value is text still holds, so validation trivially passes for the imgCBF data value, but it is not correct to say that the imgCBF value is 'Text'. I therefore think we should seriously consider adding an 'Image' data type to DDLm; when the value is represented as ASCII, the regular expression applies. The whole machinery already in the img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition, and the dREL value for a data name with _type.contents 'Image' would be the resulting array of integers.
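
A very rough Python sketch of what an 'Image' value might look like to software follows. Everything specific here is an assumption for illustration: the marker strings are placeholders, base64 stands in for whatever text encodings the img_CIF definitions actually describe, and one integer per byte stands in for the real element size.

```python
import base64
import re

# Placeholder framing; the real start and end strings come from img_CIF.
IMAGE_PATTERN = r"BEGIN-MARKER\n(?P<body>.*)\nEND-MARKER"


def image_value(raw):
    """Return the integer array for an 'Image' item (hypothetical helper)."""
    if isinstance(raw, (bytes, bytearray)):
        # Binary container (e.g. imgCBF): the regex does not apply.
        return list(raw)
    # Text container: the regex applies before any decoding is attempted.
    match = re.fullmatch(IMAGE_PATTERN, raw, re.DOTALL)
    if match is None:
        raise ValueError("text-encoded image does not match the regex")
    # base64 is a stand-in for the decoding rules in the definition.
    return list(base64.b64decode(match.group("body")))
```

The point of the sketch is only that the same data name yields the same array of integers whichever container delivered it, with the regular expression exercised solely on the text path.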

So in conclusion, I don't see problems with the addition of regexes, and prefer John's suggestion (4) with the addition of an 'Image' _type.contents. We can discuss the details of the Image type in a separate thread or on the img_CIF github issues page.

all the best
James.

On Fri, 3 Apr 2020 at 08:12, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Herbert wrote:

My request is simple -- I wish to add to ddl.dic whatever is necessary to completely support the DDL2 dictionaries, i.e. the PDB dictionaries and cif_img.dic.  Let us start with one simple basic request, direct support for the specification of data types as regular expressions.  I believe this goal can be achieved by adding to the _type.contents _enumeration_set the ability to specify regular expressions, so we can do machine parsable validations.  Any objections to the concept?

This is not what I was commenting upon in my previous message.  I agree, however, that DDLm should be extended as necessary to conveniently express the contents of all the current DDL2 dictionaries.  I say "conveniently" because I recognize that quite a lot can, in principle, be done via DDLm methods (such as validating item values against regexes), yet some of those things would be better expressed via for-purpose dictionary structures.  I say "the contents of current DDL2 dictionaries" because I do not accept a need to match any expressive capability of DDL2 that is not actually used in practice in current dictionaries, if in fact any such capabilities exist.

I think regular-expression-based data validation is a fine place to start.  I take it that on the DDL2 side we are talking about the item_type_list category and especially its _item_type_list.construct attribute.  On the DDLm side, I interpret the suggestion to be to recognize a new value for _type.contents, which would indicate that item values are to be validated via a regex (presumably given by some other attribute) instead of as directed by one of the other codes.  Inasmuch as we would need a new attribute for the regex anyway, however, why add anything to _type.contents?  I would be inclined to let the regex apply *in addition to* the value constraints conveyed by _type.contents.  That would be a bit simpler, and also a bit more analogous to DDL2, with DDLm _type.contents having a role similar to DDL2 _item_type_list.primitive_code.


John



