
Re: [ddlm-group] How to specify syntax of a number in CIF2

Hi James,

Comments in-line below.

On Wednesday, August 05, 2015 1:09 AM, James Hester wrote:
> I believe your approach is to separate the use of CIF files into non-dictionary-aware and dictionary-aware, and to adjust the text accordingly.  I'm happy with that approach.  While I'm generally in agreement with your proposed alterations below, I'm wondering if you could critique how my proposed changes would go against objective (2), as I'm not seeing it.

Objective (2) is that dictionaries determine whether values are interpreted as numbers.  Your proposal is not wholly inconsistent with that, but it needlessly circumscribes dictionaries' capabilities.  For example, supposing that the specified numeric formats are roughly equivalent to the CIF 1.1 <Numeric> production, they would not permit a dictionary to define a numeric item that was intended to be expressed in hexadecimal format.  Perhaps more relevantly, your proposal prevents DDL1 dictionaries from specifying that values presented quoted are not numbers, as many applications now interpret those dictionaries to do.
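To make the hexadecimal example concrete, here is a minimal Python sketch.  The regular expression is only a rough approximation of an ITVG-style numeric syntax, written for illustration; it is not the normative CIF 1.1 <Numeric> production.  The function names and the hexadecimal item are hypothetical.

```python
import re

# ASSUMPTION: a rough approximation of an ITVG-style number, with an
# optional exponent and an optional standard uncertainty in parentheses.
# This is not the normative CIF 1.1 <Numeric> grammar.
NUMB_LIKE = re.compile(r'[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?')


def matches_numb(text: str) -> bool:
    """Return True if the whole string fits the approximate 'numb' syntax."""
    return NUMB_LIKE.fullmatch(text) is not None


def parse_hex_item(text: str) -> int:
    """Parse a value for a hypothetical dictionary-defined hexadecimal item."""
    return int(text, 16)


# A conventional number with a standard uncertainty fits the global format.
print(matches_numb("1.234(5)"))
# A hexadecimal value does not fit it, although a dictionary that is free
# to define its own numeric format can still interpret it as a number.
print(matches_numb("1F"), parse_hex_item("1F"))
```

Under a single global numeric format, the second value could never be treated as a number, whatever its dictionary definition said.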

We agree, I think, that a CIF application operating without guidance from a dictionary is pretty limited in what it can safely do.  It doesn't help such applications to place additional constraints that provide no actual guidance.  Applications operating with guidance from a dictionary, on the other hand, don't need global constraints such as you have proposed, at least not directly.  Rather, they need their dictionaries to provide sufficient information to interpret values correctly.

The overall intent of my proposals is to empower dictionaries to the greatest extent that is consistent with current practice and existing DDLs.  I did not say so before, but I would like to characterize the common semantic features as operating *through* dictionaries, more or less as a baseline, especially when it comes to data typing.  Among other things, that justifies DDL2 and DDL1 applications approaching data typing differently, as DDL1 can and should be interpreted as incorporating the common base data types, whereas DDL2 explicitly rejects them in favor of its own approach.  I haven't thought through specifically how that would fit with DDLm, but I don't see any inherent conflict.

That's why I suggest explicitly leaving it to dictionaries how to interpret quoting status for all values.  If dictionaries are fully authoritative for type determination then the widespread, well-justified understanding among DDL1 applications that numbers are supposed to be presented unquoted must be characterized as proceeding through dictionaries, albeit implicitly.  If dictionaries have the ability to specify, even implicitly, that some types of values should be presented unquoted, then I'd rather make that a general capability than try to set up rules about when they can and can't.  And a future dictionary may find that useful for more than numbers and the null values.  In the meantime, however, I think we want to discourage such distinctions where they are not already established.  Certainly COMCIFS has the power to police that for officially approved dictionaries, and none of the current DDLs has a mechanism for explicitly defining such a distinction in a machine-readable way, so I don't think there's much risk.

> On 5 August 2015 at 01:32, Bollinger, John C <John.Bollinger@stjude.org> wrote:
>> Hi James,
>> I support your enumerated objectives, but your proposed changes seem at odds with objective (2), with widespread DDL1 practice, and possibly even with standard DDL2 practice.  I’m up for clarifying and for making recommendations, but not for making changes that invalidate significant bodies of current software or practice.  Moreover, the proposed additions don’t address all the issues.
> I am glad that you support those objectives and I certainly don't want to invalidate current software and practice, with the exception that I do want to allow non whitespace delimited numbers to be acceptable (which must be allowed by my second objective).
>> Without getting into specific language, this is what I think I would like to see:
>> () a clarification for ITVG (10) explaining that "values that are to be interpreted as numbers" refers specifically to values interpreted, for whatever reason, according to the data type 'numb' described in section
>> () a clarification for ITVG that explicitly narrows its scope to items not defined in a dictionary.
> So the approach is to limit the use of 'numb' to situations in which no other information about the item is known to the application writer.  This seems like a reasonable approach.

Basically, yes.  To reconcile that with what I wrote above, I also take DDL1 data typing as including the ITVG data types by reference.  I don't think that's a stretch, given their history and shared naming.

>> () a clarification that the CIF 1.1 <Numeric> production and its related component productions provide the details of the conventional data type 'numb', as opposed to being the only allowed form for numeric data values, regardless of actual data type.
> Fine 
>> () a clarification that a dictionary *can*, without restriction, ascribe any significance to whether a value is presented quoted, paired with a recommendation that they *not* do so, and perhaps a description of the limited ways in which the current DDLs and dictionaries do do so.
> Not sure that this is the way I would do it - see next comment. 
>> () a secondary recommendation that dictionaries that do ascribe significance to whether a value is presented quoted do so as broadly and uniformly as possible.  Examples of broad and uniform would be overall dictionary-level, or even DDL-level recognition of the conventional CIF null values as distinct from their quoted analogs, and similarly-scoped specifications that numbers be presented unquoted.  We especially want to discourage such distinctions being drawn on an item-by-item basis, but I don’t think that’s a major problem because none of our DDLs has a means to express that.
> I believe it is open to us to decree (in 'common semantic features') that the 'char' datavalue referred to by DDL definitions is the datavalue with delimiters removed, with the exception of '.' and '?' - I'm not sure if that was stated explicitly in Vol G. 

Although it has not often been presented in these terms, data values expressed in CIF format have two essential properties that affect their interpretation: a string of characters and the presence or absence of non-whitespace delimiters around that string.  All current CIF applications depend on both, at least to distinguish '.' and '?' from the special not-applicable and unknown-value values.  DDL1 applications also ascribe significance to quoting for the purpose of distinguishing numbers from simple strings of characters, as indeed ITVG directs them to do.  Even applications that ultimately accept quoted values as numbers are ascribing significance to quoting if they emit a warning when they do so.  I much prefer to give simpler, more general rules about this than to try to tailor rules to the exact contours of current practice.
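The two-property view of a data value can be sketched in Python as follows.  This is only an illustration under my own assumptions: the class, field, and function names are hypothetical and do not come from any CIF specification or existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CifValue:
    """A parsed data value: its characters plus its delimiter status."""
    text: str        # the string of characters, delimiters removed
    delimited: bool  # True if non-whitespace delimiters (quotes) were present


def classify(value: CifValue) -> str:
    """Distinguish the special null values from ordinary data.

    Only the *undelimited* '.' and '?' are special; their quoted analogs
    are ordinary one-character strings.
    """
    if not value.delimited and value.text == ".":
        return "not-applicable"
    if not value.delimited and value.text == "?":
        return "unknown"
    return "data"


def may_be_numb(value: CifValue, lenient: bool = False) -> bool:
    """Hypothetical DDL1-style policy: 'numb' values are presented unquoted.

    With lenient=True, a reader accepts quoted values as candidates too;
    it might still warn, which is itself a use of the delimiter status.
    """
    return lenient or not value.delimited


print(classify(CifValue(".", delimited=False)))
print(classify(CifValue(".", delimited=True)))
print(may_be_numb(CifValue("1.5", delimited=True)))
```

Keeping the delimiter status alongside the text lets any later, dictionary-driven interpretation decide for itself whether quoting matters.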

>> () an adjustment to the prose definition of DDL1's '_type' attribute, which is anyway either incomplete or inconsistent in version 1.4.1 of that dictionary, as it pertains to type numb.  This could provide format details for the general case, to be narrowed where necessary by other definition attributes.
> There are many issues with DDL1, and the first time I raised them (about 10 years ago) I was advised that it would be better to focus my energies on DDLm, as DDL1 was a dead end.  Insofar as this definition has stood for 20 years, we might consider that it has been relied upon by a generation of DDL1 dictionary and software authors, thereby making it true "by definition" in all its flawed glory. 

I am not proposing to alter the definition to make it mean something different than it now does.  We are not at liberty to do that.  Rather, I am proposing to alter the descriptive text to express its meaning more clearly and consistently, as it has come to be interpreted.  Of course, that presumes that we can determine what that meaning is, but if we can't determine that then the definition isn't of much use.

>> () a recommendation to CIF authors (but mostly to their proxies, authors of software that outputs CIF) that numeric data values be presented unquoted wherever their data types permit.
> Absolutely 
>> () a recommendation to authors of software that reads CIF to accept quoted numeric data values, even when their data types do not actually allow it.  This is not meant to preclude software issuing diagnostic messages warning about malformed numeric values in the event that values are presented quoted when their items' definitions demand otherwise.
> This might be confusing as I think we are already allowing quoted numbers as long as there is a dictionary available that says that the datavalue is numeric.  Also, I would like to see all discussion of the concrete representation of datatypes removed from the dictionary (as per objective (1)).

We also have specifications, a body of practice, and (as I characterize it) dictionaries that do not allow quoted strings to be interpreted as numbers.  I had not previously absorbed the full meaning of objective (1), but unless you propose to reject historic specifications and practice, you need a place to hang those quoting-dependent rules.  That doesn't have to be dictionaries/DDLs, but you are proposing incompatible general rules, so if it's not dictionaries then you need something new.  If you are proposing something that cannot be reconciled with current specifications and practice, then it needs COMCIFS approval to be effective.

Do also consider that DDLm contains data value formats (see _type.contents), though there is no indication that they are sensitive to quoting.  Dictionaries expressed in DDLm do therefore specify the format of data values, some in more detail than others.  Indeed, I think DDLm, as presented in the 2012 paper, is a bit flawed in that its apparent intent is to specify data value formats, but it is inconsistent in the level of detail it uses to specify them, with some barely being specified at all.  Effective data archiving and interchange require that all formats be precisely described *somewhere*.

>> () a recommendation to CIF dictionary authors that the defined format for numeric data types be consistent with the ITVG numeric syntax wherever possible.
> I guess you have in mind DDL2 here as DDLm doesn't give anybody this option.

DDLm already defines numeric formats that are not consistent with ITVG numeric syntax.  It has _type.contents values 'Hexadecimal', 'Octal', 'Binary', and 'Complex', all of which specify formats inconsistent with ITVG numeric syntax.  It also has a mechanism for defining additional extension types, presumably including numeric ones.  Furthermore, it is my understanding that there is ongoing work -- or at least the intention of work -- to extend DDLm to provide all attributes needed to transliterate DDL2 dictionaries into DDLm format without loss of fidelity.  When that work is complete, DDLm will give everybody all the options that DDL2 does.
With that said, "wherever possible" is probably too strong, and perhaps this whole point should be tossed.  By the time it is watered down to an appropriate level, it probably doesn't offer enough guidance to be helpful.


