Re: [ddlm-group] Adding Regular Expressions to DDLm, was On schema,syntax and semantics
- To: "james.r.hester@gmail.com" <james.r.hester@gmail.com>
- Subject: Re: [ddlm-group] Adding Regular Expressions to DDLm, was On schema,syntax and semantics
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Fri, 3 Apr 2020 15:49:15 +0000
- Cc: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
For generality, yes, and maybe more than that, but not necessarily to support the specific patterns in use in existing dictionaries. At least some of the patterns now used would work correctly on UTF-8 byte streams encoding some non-ASCII characters, and I imagine that others could be adjusted to work. Any that are intended to keep limiting values to ASCII characters may also work fine as-is. We probably do want to address this, but it would be worthwhile to survey current DDL2 dictionaries to see where we actually stand.
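To make the byte-stream point concrete, here is a minimal Python sketch. Both patterns are hypothetical stand-ins, not taken from any DDL2 dictionary, and `matches_bytes` is an illustrative helper. The sketch assumes a byte-oriented regex engine applied to the UTF-8 encoding of the value: a pattern that only excludes specific ASCII bytes keeps working on non-ASCII input, while a pattern that counts characters does not.

```python
import re

# Hypothetical pattern: one or more non-whitespace bytes.
NO_WHITESPACE = rb"[^ \t\r\n]+"
# Hypothetical pattern: exactly one "character" (really one byte here).
SINGLE_CHAR = rb"."


def matches_bytes(pattern: bytes, value: str) -> bool:
    """Match a byte-oriented pattern against the UTF-8 encoding of value."""
    return re.fullmatch(pattern, value.encode("utf-8")) is not None


# UTF-8 lead and continuation bytes are never ASCII whitespace,
# so this pattern accepts non-ASCII text unchanged.
ascii_code_ok = matches_bytes(NO_WHITESPACE, "P21/c")
non_ascii_code_ok = matches_bytes(NO_WHITESPACE, "Å-phase")

# "Å" (U+00C5) encodes to two bytes, so a one-byte pattern rejects it.
single_ascii_ok = matches_bytes(SINGLE_CHAR, "A")
single_non_ascii_ok = matches_bytes(SINGLE_CHAR, "Å")
```

A survey of the existing dictionaries would amount to sorting their patterns into these two groups.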
I think this is drawing a weak point of DDLm into the spotlight. As we position CIF dictionaries as general ontological objects, as opposed to objects dedicated specifically to CIF-format files, we find that DDLm and data dictionaries are straddling that divide a bit uncomfortably. I see that in how DDLm is inconsistent about specifying text formats for various _type.contents alternatives. It is quite specific about some, consistent with a role in defining details of CIF-format data representation, yet entirely silent about others.
Perhaps, then, this is an opportunity to sort that out somewhat. Yes, we do clarify that the regexes specify the allowed *CIF-format* text representations of the various types. And we also provide a default regex for each _type.contents alternative, moving those details out of _type.contents's enumeration_set.
That's not the only imgCIF (or mmCIF) data type that needs regex. Consider imgCIF's 'code' type. In the first place, it is not congruent with DDLm's 'Code', because the former is case-sensitive but the latter is not. imgCIF and mmCIF distinguish this from 'ucode', which is case-insensitive. It looks like a DDLm-based version of imgCIF's 'code' would need to be a regex-restricted version of Text. Additionally, if items of these imgCIF types need to remain restricted to ASCII characters, as their present regexes would do, then for this reason also they would need regex. Similar considerations apply to several other mmCIF and imgCIF data types.
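As a rough Python illustration of the two distinctions above (case sensitivity and ASCII restriction), consider the sketch below. The character class in `CODE_PATTERN` is a hypothetical simplification, not the regex from the imgCIF or mmCIF dictionaries, and the function names are made up for the example.

```python
import re

# Hypothetical ASCII-only pattern standing in for a 'code'-like type.
CODE_PATTERN = re.compile(r"[A-Za-z0-9_.\-]+")


def is_valid_code(value: str) -> bool:
    """Accept only values made entirely of the listed ASCII characters."""
    return CODE_PATTERN.fullmatch(value) is not None


def same_code(a: str, b: str) -> bool:
    """Case-sensitive comparison, as for imgCIF's 'code'."""
    return a == b


def same_ucode(a: str, b: str) -> bool:
    """Case-insensitive comparison, as for 'ucode' or DDLm's 'Code'."""
    return a.casefold() == b.casefold()


valid_ascii = is_valid_code("Det_1")
valid_non_ascii = is_valid_code("Dét_1")  # rejected: 'é' is not in the class
codes_equal = same_code("ABC", "abc")
ucodes_equal = same_ucode("ABC", "abc")
```

The regex handles the ASCII restriction, but the comparison rule is a property of the type rather than of the pattern, which is why a regex-restricted Text is not the same thing as DDLm's 'Code'.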
DDLm's present type system is a bit eclectic and a little inconsistent, especially with regard to text-based data types. It has no general-purpose case-insensitive text data type, but it has several special-purpose ones. It has the highly domain-specific 'Symop'. And several of the types that are not inherently textual nevertheless specify details of values' text representation, whereas others leave that ambiguous. I'm not sure how much we can improve that at this point, but I think we can make at least some progress.
We should avoid being so specific here. What would be gained by defining an 'Image' type at the DDLm level instead of a more general 'Binary' type? Yes, though, we can use the proposed regex mechanism to describe an external representation of such values. It must of course be understood that that would apply within the framework of the data file's container format (CIF 1.1 or CIF 2.0, for example), and not override that format's requirements.
I take you to mean the details presented in the definition of _array_data.data (https://www.iucr.org/__data/iucr/cifdic_html/2/cif_img.dic/Iarray_data.data.html). Most of that seems appropriate, but not all. Unless we are also going to loosen the CIF format specifications to accommodate CBF (as opposed to imgCIF), it's unclear to me that we would want to include the CBF details. There may be other bits that should be omitted.
Makes sense.
Best regards,
John
From: James Hester <jamesrhester@gmail.com>
Sent: Thursday, April 2, 2020 7:07 PM
To: Bollinger, John C <John.Bollinger@STJUDE.ORG>
Cc: Herbert J. Bernstein <yayahjb@gmail.com>; Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Adding Regular Expressions to DDLm, was On schema, syntax and semantics
Hello again everybody:
I have taken the liberty of once again splitting this thread out into a separate topic for the benefit of future readers. I will also make a summary of this discussion in an issue in the core CIF github repository so that those not on this list can contribute.
I also don't have any great in-principle objections to adding regular expressions. I am vaguely concerned that regular expressions will be seen as a license to embed information into a data value instead of defining separate data names, but that danger can be managed by dictionary groups.
As we all seem happy with the idea, moving on to technical comments:
(1) DDLm handles Unicode, so some adjustment to the corresponding DDL2 attribute definition will be necessary, perhaps around normalisation.
(2) DDLm should be applicable to both text and non-text data formats. So any new DDLm machinery for regexes would need to clarify that the regular expression applies only in those cases where the data value is represented as text.
(3) Herbert's original suggestion of a 'ByRegex' addition to _type.contents fits in with (2) in that it explicitly stops any other _type.contents types from having regular expressions assigned, and so is a minimal change to the DDLm architecture.
(4) John B's suggestion that a new attribute be created (which we have to do anyway to hold the regular expression) is also workable in that we could specify in the attribute definition that the Regexp is only relevant for certain _type.contents values (Text/Code/Tag?)
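Points (1), (2) and (4) can be read together as a validation rule, sketched below in Python. Everything here is an assumption made for illustration: the function name `validate`, the set `REGEX_CONTENTS` (taken from the tentative "Text/Code/Tag?" above), and the choice of NFC as the normalisation form, which the discussion has not settled.

```python
import re
import unicodedata
from typing import Optional, Union

# Assumed set of _type.contents values for which a regex is relevant.
REGEX_CONTENTS = {"Text", "Code", "Tag"}


def validate(value: Union[str, bytes], contents: str,
             pattern: Optional[str]) -> bool:
    """Apply a dictionary-supplied regex only where it is meaningful."""
    # No regex defined, or the type is outside the assumed set: nothing to check.
    if pattern is None or contents not in REGEX_CONTENTS:
        return True
    # Point (2): the regex applies only when the value is represented as text.
    if not isinstance(value, str):
        return True
    # Point (1): normalise before matching (NFC is an assumption).
    normalised = unicodedata.normalize("NFC", value)
    return re.fullmatch(pattern, normalised) is not None


text_ok = validate("abc", "Text", r"[a-z]+")
text_bad = validate("ABC", "Text", r"[a-z]+")
binary_passes = validate(b"\x00\x01", "Text", r"[a-z]+")
other_type_passes = validate("123", "Integer", r"[a-z]+")
# Decomposed 'e' + combining acute matches a precomposed pattern after NFC.
normalised_ok = validate("e\u0301", "Text", "\u00e9")
```

Under suggestion (3) the only difference would be that `REGEX_CONTENTS` collapses to the single value 'ByRegex'.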
Looking at img_CIF, it would seem that the data item that needs regex capability is the CIF binary format section, which has a particular string at the beginning and end. Under suggestion (3), the definition for _array_data.data (which uses this) would have _type.contents 'ByRegex' and a regular expression supplied in a separate attribute. Under suggestion (4), the regular expression would be supplied in a separate attribute *and* the _type.contents set to 'Text'.
Both obviously work in this case. Now, if we consider imgCBF (or indeed nxMX), which is an alternate format for holding CIF data, _array_data.data is delivered from the format as binary, not text. Our caveat of only applying the regular expression if the data value is text applies, and therefore validation trivially passes for the imgCBF data value, but it is not correct to say that the imgCBF value is 'Text'. I therefore think we should seriously consider adding an 'Image' data type to DDLm; when represented as ASCII, the regular expression applies. The whole machinery already in img_CIF definitions describing how to decode text-encoded images would be incorporated into the DDLm definition, and the dREL value for a data name with _type.contents 'Image' would be the array of integers that resulted.
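A small Python sketch of the begin/end-string check for a text-encoded binary section follows. The marker strings are placeholders invented for the example, not the actual strings from the img_CIF definition, and `image_value_is_valid` is a hypothetical name. Decoding the payload into an array of integers is out of scope here.

```python
import re
from typing import Union

# Placeholder markers; the real strings are defined by img_CIF.
BEGIN_MARKER = "--BEGIN-BINARY--"
END_MARKER = "--END-BINARY--"

# DOTALL lets '.' span the line breaks inside the encoded payload.
SECTION_PATTERN = re.compile(
    re.escape(BEGIN_MARKER) + r".*" + re.escape(END_MARKER),
    re.DOTALL,
)


def image_value_is_valid(value: Union[str, bytes]) -> bool:
    """Check the markers for text values; binary values pass trivially."""
    if isinstance(value, bytes):
        return True
    return SECTION_PATTERN.fullmatch(value.strip()) is not None


text_section_ok = image_value_is_valid(
    "--BEGIN-BINARY--\nAAAA\nBBBB\n--END-BINARY--"
)
bare_payload_ok = image_value_is_valid("AAAA")
binary_value_ok = image_value_is_valid(b"\x00\x01\x02")
```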
So in conclusion, I don't see problems with the addition of regexes, and prefer John's suggestion (4) with the addition of an 'Image' _type.contents. We can discuss the details of the Image type in a separate thread or on the img_CIF github issues page.
all the best
James.