
Re: [ddlm-group] How to specify syntax of a number in CIF2

Hi John,

I believe your approach is to separate the use of CIF files into non-dictionary-aware and dictionary-aware, and to adjust the text accordingly.  I'm happy with that approach.  While I'm generally in agreement with your proposed alterations below, I'm wondering if you could critique how my proposed changes would go against objective (2), as I'm not seeing it.

See interstitial comments:

On 5 August 2015 at 01:32, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Hi James,


I support your enumerated objectives, but your proposed changes seem at odds with objective (2), with widespread DDL1 practice, and possibly even with standard DDL2 practice.  I’m up for clarifying and for making recommendations, but not for making changes that invalidate significant bodies of current software or practice.  Moreover, the proposed additions don’t address all the issues.

I am glad that you support those objectives and I certainly don't want to invalidate current software and practice, with the exception that I do want to allow non whitespace delimited numbers to be acceptable (which must be allowed by my second objective).


Without getting into specific language, this is what I think I would like to see:


() a clarification for ITVG (10) explaining that "values that are to be interpreted as numbers" refers specifically to values interpreted, for whatever reason, according to the data type 'numb' described in section


() a clarification for ITVG that explicitly narrows its scope to items not defined in a dictionary.

So the approach is to limit the use of 'numb' to situations in which no other information about the item is known to the application writer.  This seems like a reasonable approach.


() a clarification that the CIF 1.1 <Numeric> production and its related component productions provide the details of the conventional data type 'numb', as opposed to being the only allowed form for numeric data values, regardless of actual data type.



() a clarification that a dictionary *can*, without restriction, ascribe any significance to whether a value is presented quoted, paired with a recommendation that they *not* do so, and perhaps a description of the limited ways in which the current DDLs and dictionaries do do so.

Not sure that this is the way I would do it - see next comment.


() a secondary recommendation that dictionaries that do ascribe significance to whether a value is presented quoted do so as broadly and uniformly as possible.  Examples of broad and uniform would be overall dictionary-level, or even DDL-level recognition of the conventional CIF null values as distinct from their quoted analogs, and similarly-scoped specifications that numbers be presented unquoted.  We especially want to discourage such distinctions being drawn on an item-by-item basis, but I don’t think that’s a major problem because none of our DDLs has a means to express that.

I believe it is open to us to decree (in 'common semantic features') that the 'char' datavalue referred to by DDL definitions is the datavalue with delimiters removed, with the exception of '.' and '?' - I'm not sure if that was stated explicitly in Vol G.
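To make that rule concrete, here is a minimal sketch, assuming a parser that hands over each value's characters with the delimiters already stripped plus a flag recording whether delimiters were present. The names `RawValue`, `char_value`, `INAPPLICABLE` and `UNKNOWN` are hypothetical and are not taken from any existing CIF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawValue:
    """Hypothetical parser output for one data value."""
    text: str        # characters with any delimiters already removed
    delimited: bool  # True if the value was quoted in the file


# Sentinels for the two conventional CIF null values.
INAPPLICABLE = object()  # unquoted '.'
UNKNOWN = object()       # unquoted '?'


def char_value(raw):
    """Return the 'char' value seen by dictionary-aware software.

    Delimiters carry no meaning, except that an unquoted '.' or '?'
    is a null value while the quoted forms are ordinary strings.
    """
    if not raw.delimited:
        if raw.text == ".":
            return INAPPLICABLE
        if raw.text == "?":
            return UNKNOWN
    return raw.text
```

Under this sketch `'1.5'` and `1.5` give the same string, so any later numeric interpretation cannot depend on quoting.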


() an adjustment to the prose definition of DDL1's '_type' attribute, which is anyway either incomplete or inconsistent in version 1.4.1 of that dictionary, as it pertains to type numb.  This could provide format details for the general case, to be narrowed where necessary by other definition attributes.

There are many issues with DDL1, and the first time I raised them (about 10 years ago) I was advised that it would be better to focus my energies on DDLm, as DDL1 was a dead end.  Insofar as this definition has stood for 20 years, we might consider that it has been relied upon by a generation of DDL1 dictionary and software authors, thereby making it true "by definition" in all its flawed glory.


() a recommendation to CIF authors (but mostly to their proxies, authors of software that outputs CIF) that numeric data values be presented unquoted wherever their data types permit.


() a recommendation to authors of software that reads CIF to accept quoted numeric data values, even when their data types do not actually allow it.  This is not meant to preclude software issuing diagnostic messages warning about malformed numeric values in the event that values are presented quoted when their items' definitions demand otherwise.

This might be confusing as I think we are already allowing quoted numbers as long as there is a dictionary available that says that the datavalue is numeric.  Also, I would like to see all discussion of the concrete representation of datatypes removed from the dictionary (as per objective (1)).


() a recommendation to CIF dictionary authors that the defined format for numeric data types be consistent with the ITVG numeric syntax wherever possible.

I guess you have in mind DDL2 here as DDLm doesn't give anybody this option.



Looking forward to the next edition of ITVG -- and with apologies to the section 2.2 authors, many of whom I know are receiving this -- I think section 2.2 would benefit from a thorough rewrite.  Minor tweaks here and there won’t really suffice.  The current version is a concatenation of two distinct documents with overlapping subject matter coverage, drawing on document history and lineage that extends to a time before that of some of the material it is intended to specify.  It is needlessly repetitive, and it emphasizes some details that these days are of minimal importance.  As the discussion here has shown, it is also tricky to interpret in places, and it struggles a bit to accommodate both DDL1 and DDL2 practices.  It will face an even bigger challenge in the next edition, with the addition of CIF 2.0 syntax and DDLm (albeit probably in a different section).  I suggest, therefore, that we not worry at this point about prose for that edition, but instead work on making the best we can of the current edition by providing written interpretations and, if necessary, corrigenda.

I'm glad that you're looking forward to Vol G.  As one of the co-editors, I am looking forward to your contribution(s) (TBA)!  Meanwhile, I am keen to hash out these issues now so that they are ready for writing up into Vol G.
My mention of Vol G was merely to make explicit that I didn't want this to be seen as material for the current paper, but nevertheless that it was important material for the overall specification.  I do agree with your assessment of the current section 2.2, but I suspect that what we see in section 2.2 is the result of a lot of harmonising of different but valuable viewpoints.  Whether the next edition will fare any better now we have demoted DDL1 remains to be seen.



All the best,


From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Monday, August 03, 2015 9:13 PM
To: ddlm-group
Subject: [ddlm-group] How to specify syntax of a number in CIF2


Dear All,

The preceding discussion around possible semantic distinctions between whitespace and non-whitespace delimited strings has thrown up an unresolved semantic issue in CIF2.  In a nutshell, a programmer wishing to write a number in CIF2 currently has no specification anywhere as to how that number should be presented, and neither do CIF2 readers know how to interpret strings as numbers.

In CIF1.1, the syntax description is included in the BNF, and the DDL2 system additionally permits each dictionary to specify the text syntax of the types used in that particular dictionary using _item_type_list.construct.

In making this specification, I think we should preserve the following behaviour:

(1) DDL dictionaries are format agnostic (i.e. they could be used to define ontologies for other file formats) - our DDLs are advanced and potentially useful to other communities

(2) DDL dictionaries determine whether or not a value should be interpreted as a number (as they define the nature of a dataitem)

In a practical sense, software written in consultation with a dictionary is happy to specify that it expects a number when it calls an API routine to obtain a datavalue, as this knowledge is available at program writing time.  So the onus is on the API routine to look at the sequence of characters that form the requested datavalue and decide if it can return something that the calling software understands as a number.

So I would suggest the following be inserted into "Common semantic features" in our online specs and the next edition of Vol G:


A datavalue may only be interpreted as a real number if it conforms to the following syntax:

<insert delimiter-agnostic CIF1 syntax expressions here>

A datavalue may only be interpreted as an integer if it conforms to the following syntax:

<insert suitable delimiter-agnostic integer EBNF expressions here>
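As a rough illustration of what such delimiter-agnostic expressions might look like in practice, the sketch below approximates the CIF 1.1 <Integer> and <Float> productions with regular expressions and applies them to the characters of a datavalue after any delimiters have been removed. The patterns are an assumption, not agreed text, and the function names `as_real` and `as_integer` are hypothetical. Allowing a parenthesised standard uncertainty on reals but not on integers is likewise just one possible choice.

```python
import re

# Assumption: these patterns approximate the CIF 1.1 numeric productions;
# they are not taken from any agreed CIF2 specification text.
_UNSIGNED = r"[0-9]+"
_EXPONENT = r"[eE][+-]?" + _UNSIGNED
_INTEGER = r"[+-]?" + _UNSIGNED
_FLOAT = (
    r"(?:" + _INTEGER + _EXPONENT
    + r"|[+-]?(?:[0-9]*\." + _UNSIGNED + r"|[0-9]+\.)(?:" + _EXPONENT + r")?)"
)
# Optional standard uncertainty in parentheses, e.g. 1.234(5).
_SU = r"(?:\((" + _UNSIGNED + r")\))?"

INTEGER_RE = re.compile(_INTEGER)
REAL_RE = re.compile(r"(" + _FLOAT + r"|" + _INTEGER + r")" + _SU)


def as_real(text):
    """Return (value, su_digits) if text is a real number, else None.

    text is the datavalue with delimiters already stripped, so quoted
    and unquoted presentations are treated identically.
    """
    m = REAL_RE.fullmatch(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


def as_integer(text):
    """Return an int if text is a signed run of digits, else None."""
    m = INTEGER_RE.fullmatch(text)
    return int(m.group(0)) if m else None
```

An API routine asked for a number would call one of these on the stripped characters and report an error to the caller when the result is `None`.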

What do you think?




T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

ddlm-group mailing list

