Discussion List Archives


Re: [ddlm-group] CIF2 semantics


On Wednesday, July 27, 2011 10:09 PM, James Hester wrote:
>I've edited and interspersed comments below:
>On Thu, Jul 28, 2011 at 12:00 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:
>
>>(James wrote):
>>
>>>Note the following consequences of the CIF1 approach, which I hope we
>>>all accept:

[...]

>>>(3) Dictionary-blind pretty-printers as hypothesised by John B below may
>>>make mistakes in their pretty-printing if they assume 'numb' wrongly.
>>>Likewise, other dictionary-blind software cannot rely on apparent 'numb'
>>>values really being 'numb'. Successful behaviour after assuming 'numb'
>>>type is likely, but not guaranteed.
>>I agree that this is a consequence of the CIF 1.1 specifications.  I'm not
>>certain that it's desirable, however, given the substantial body of
>>software that ignores that result in favor of CIF 1.0 behavior.  I think
>>it might be more consistent for the specifications to actually require
>>numeric strings to be quoted to be interpreted as 'char' values (or at
>>least to be faithfully preserved), which now is only a practical
>>consideration.
>
>I'm not sure what the difference between CIF1.1 and CIF1.0 behaviour is -
>I thought that 'numb' type had always been the same?


I have probably aggrandized it by calling it "CIF 1.0" behavior, but a lot of software follows the specifications of the 1991 CIF paper, which state:

"4. A data item is assumed to be a number if it starts with a digit '0'-'9', plus '+', minus '-' or a period '.' and it is not bounded by matching single or double quotes or semicolons as the first character on a line.

"5. A number may be supplied as an integer, as a floating-point number, or in scientific notation. When concatenated with an integer in parentheses, that integer is assumed to be the estimated standard deviation in the final digit(s) of the number. [...]

"7. A data item is assumed to be of data type character if it is not a number or text."

Much other software adapts (4) to something like "A data item is assumed to be a number if it can be parsed as a number," but otherwise does the same thing.
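The difference between the literal 1991 rule and the "can be parsed as a number" adaptation can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and Python's float() only stands in for a real CIF number grammar (it accepts inputs such as "inf" and does not understand an s.u. in parentheses).

```python
def is_number_1991(token: str, delimited: bool = False) -> bool:
    """Rule 4 of the 1991 paper, read literally: only the first character matters."""
    return (not delimited) and token[:1] in set("0123456789+-.")


def is_number_by_parse(token: str, delimited: bool = False) -> bool:
    """The common adaptation: a number is whatever parses as one.

    float() is a stand-in here (an assumption), not the CIF number syntax.
    """
    if delimited:
        return False
    try:
        float(token)
    except ValueError:
        return False
    return True


# Undelimited tokens on which the two conventions agree or disagree.
for token in ["1.5", "-x,-y,-z", "2011-07-27", "P21/c"]:
    print(token, is_number_1991(token), is_number_by_parse(token))
```

A token such as -x,-y,-z starts with a minus sign, so the literal rule assumes it is a number, while the parse-based variant does not. Neither function consults a dictionary, which is the point of contrast with CIF 1.1.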

Note in all cases the use of "*is* assumed" (as opposed to "can be assumed" or "may be assumed"), and that even though the paper refers to CIF dictionaries, it does not condition the specified assumptions on items' definitions, or lack thereof.  Those are the primary differences from the CIF 1.1 version.

Basically, this is from the time before computer-readable dictionaries and dictionary-driven CIF software.  Even though I agree that the CIF 1.1 specifications remove the original CIF-level implicit data typing, much CIF software has not followed along with that, whether accidentally or intentionally.


>>>Is everyone happy with my analysis above?  Are we OK with accepting
>>>the same semantics for CIF2?
>>
>>Reserving judgment on the advantages of 'numb', I agree that the rest
>>of the analysis is accurate for CIF 1.1.  I could probably accept the
>>same semantics for CIF2, but I think there may be room to do better.
>>Consider this:
>>
>>a) Undelimited values that lex as numbers according to the CIF
>>specifications are in fact handled as numbers (possibly with associated
>>s.u.s) in the abstract data model.
>>
>>b) CIF2 specifies a standard number format for use in coercing numbers
>>to strings in the event that a dictionary specifies a 'char' subtype for
>>such a value, limits its values via a regex, or otherwise needs a string
>>representation.
>>
>>It is implicit in that approach that the literal character sequence of
>>unquoted numeric values will not be faithfully preserved in some cases,
>>but that is the practical reality today, CIF specifications notwithstanding.
>>The main advantage to be gained is better agreement with the behavior of
>>existing software.  Although it is likely that the formatting existing software
>>performs would not agree exactly with whatever format might be chosen for
>>(b), I anticipate that for many programs and libraries it would be a lot easier
>>to change formats than to change the underlying data model.
>>
>>That would also make for more flexible use of regexes in validating 'numb'
>>values, in that the regexes used in such contexts could safely make some
>>assumptions about the form of the strings matched against them.
>
>I'm not convinced that existing software ever gets into trouble with the current
>'numb' definition.


Herbert's cif2cif example is an existing program that has this sort of trouble.


>  Any programs that really do require numbers (e.g. structure
>display programs) are not affected by the need to keep the string representation.


Agreed.


>The only programs that will have trouble are those which input and manipulate
>strings that look like numbers, if the parsing part of the program aggressively
>converts to numbers before the part that expects strings can stop it.


Yes.
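A small Python sketch of the failure mode described above, assuming a hypothetical reader (eager_read is an invented name, not taken from any existing CIF library) that converts to a number as early as it can:

```python
def eager_read(token: str):
    """Hypothetical reader: return an int or float whenever the token parses as one."""
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # not number-like, so the original text survives


# Show what a later, string-expecting stage would get back from each token.
for text in ["0012", "1.50", "+7", "P21"]:
    value = eager_read(text)
    print(repr(text), "->", repr(value), "->", repr(str(value)))
```

Once "0012" has become the integer 12, or "1.50" the float 1.5, the part of the program that wanted the original characters can only recover "12" or "1.5".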


>  Strings
>in datafiles are mostly meant for human consumption, so I would have thought that
>such bugs would long ago have been discovered and dealt with.


To a significant extent, this has been dealt with by users and software recognizing that, when writing CIF, string values that could be misinterpreted as numbers must be quoted.  This is a consequence of the 1991 specifications, which in that sense are more restrictive than the CIF 1.1 specs.  In other words, it has been addressed in CIF instances, but not necessarily in CIF software.
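The writer-side convention can be sketched as follows. Both function names are hypothetical, the numeric check is deliberately conservative, and the sketch assumes a non-empty value with no whitespace, quote characters, or line breaks; a real CIF writer has to handle those cases and other reserved forms as well.

```python
def looks_numeric(text: str) -> bool:
    """Conservative check: the 1991 first-character rule, or anything float() accepts."""
    if text[:1] in set("0123456789+-."):
        return True
    try:
        float(text)  # stand-in for a CIF number grammar (an assumption)
    except ValueError:
        return False
    return True


def write_char_value(text: str) -> str:
    """Return the token for a string value, quoting it if a reader might assume a number.

    Assumes a non-empty value without whitespace, quotes, or line breaks.
    """
    return f"'{text}'" if looks_numeric(text) else text


for value in ["0012", "-x,-y,-z", "P21/c"]:
    print(write_char_value(value))
```

Quoting more often than strictly necessary is harmless here, since a delimited value is never assumed to be a number under either the 1991 rules or CIF 1.1.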

In general, I am a bit uncomfortable with specifications that allow for that kind of problem at all, regardless of how few programs *we anticipate* may suffer from it.  It feels like we are unnecessarily narrowing CIF's scope.


>While the explicit
>data model I've described may sound overly complex, I suspect that it is actually
>what pragmatic programmers have long ago implemented (e.g. CIFtbx).


No, I think the explicit data model you have described sounds attractively simple.  I largely agree that it is what CIF 1.1 formally demands.  I would like to be able to agree to it for CIF 2.0, and I have not yet concluded that I can't.  On the other hand, I think much existing software fails to consistently comply with the CIF 1.1 specifications here, and I am considering whether that should be taken as a signal that CIF 2.0 should fall back.


>So I don't think specifying a coercion format is a solution, because I
>don't think there is a problem,


Whether there is a real problem is indeed one of the issues I am trying to decide.


>and I expect it to be highly unlikely that a set of rules could be found
>such that a number that looks like a float would reliably be coerced back
>to its original form, with appropriate number of significant digits,
>identically on all platforms.


There you misunderstand, however.  The idea is not that numbers would be converted back to their original form, as I explicitly acknowledged.  Rather, numbers would be converted at need to a standard string form, which I think can indeed be done consistently across platforms.  I claim this as an advantage.  Note also that the "at need" here appears purely to be in support of value-space restriction via regex, which, although convenient to retain, is not essential (and is a bit strange).  Any other number formatting I can think of is an application-level concern.
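A minimal sketch of what "a standard string form" could look like, in Python. The format chosen here is purely hypothetical (no CIF specification defines one), the function name is invented, s.u. values are ignored, and the sketch assumes IEEE 754 doubles with correctly rounded conversions, as CPython provides on common platforms.

```python
def canonical_number_text(value) -> str:
    """Hypothetical standard string form; NOT a format defined by any CIF specification."""
    if isinstance(value, int):
        return str(value)
    # repr() of a float is the shortest decimal string that reads back as the
    # same double, so the result depends on the value, not on the original token.
    return repr(float(value))


# Different spellings of the same number all map to one string.
for token in ["1.50", "+1.5", "1.5e0", "15e-1"]:
    print(token, "->", canonical_number_text(float(token)))
```

The original spelling is not recovered, and is not meant to be; what a dictionary regex could rely on is that every spelling of the same value yields the same string.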


>I don't think the current specification is broken, rather it is a somewhat
>subtle solution which allows:
>(1) Non-delimited number-like strings to still be available as strings,
>increasing the robustness of the system
>(2) Disallowing delimited number-like strings to be interpreted as numbers,
>giving a further avenue for safety and matching human readability expectations


I agree that the CIF 1.1 specifications are technically sound, at least as far as they go.  The problems are that

(a) They are poorly adapted to items for which no data type is known.
(b) In practice, they are not universally followed in this area, and failure to follow them here is not universally accepted as buggy.

I find I am leaning away from supporting a solution that requires applications to know items' data types independently in order to handle them correctly.  I am beginning to think that there is inherent value in the abstract data model providing a numeric data type.

In fact, I am inclined to think that CIF2's abstract data model and dREL's data model need to be the same, or at least need to be carefully aligned.  dREL sports four distinct numeric data types, three of which can be matched with CIF's traditional 'numb' type.  That seems to me a good reason to have that type in CIF2's abstract model.


John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


