

Re: [ddlm-group] CIF2 semantics

I've edited and interspersed comments below:

On Thu, Jul 28, 2011 at 12:00 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:

(James wrote):

>Note the following consequences of the CIF1 approach, which I hope we
>all accept:
>(1) A delimited numerical value is invalid if the dictionary specifies
>that 'numb' is expected


>(2) A delimited numerical value is a valid number if the dictionary/DDL
>allows numbers to be derived from character strings (e.g. by giving a
>POSIX regex in the DDL2 _item_list_type.construct and a primitive code
>of 'char')

I'm not sure I follow.  Certainly a delimited numerical value is valid for a given data item if it matches the dictionary definition, which could be formulated as you describe, but how do you go from there to the value formally being a number as far as CIF is concerned?

My apologies for the lack of clarity.  I simply meant that a dictionary or DDL could define a type, e.g. "float", which it defines to act like a number (or a butterfly - that's up to the dictionary).  This is formally different from the CIF 'numb' type.  In any case, consequence (2) above is probably a red herring, as no dictionaries actually do that at present.

>(3) Dictionary-blind pretty-printers as hypothesised by John B below may
>make mistakes in their pretty-printing if they assume 'numb' wrongly.
>Likewise, other dictionary-blind software cannot rely on apparent 'numb'
>values really being 'numb'. Successful behaviour after assuming 'numb'
>type is likely, but not guaranteed.

I agree that this is a consequence of the CIF 1.1 specifications.  I'm not certain that it's desirable, however, given the substantial body of software that ignores that result in favor of CIF 1.0 behavior.  I think it might be more consistent for the specifications to actually require numeric strings to be quoted to be interpreted as 'char' values (or at least to be faithfully preserved), which now is only a practical consideration.

I'm not sure what the difference between CIF 1.1 and CIF 1.0 behaviour is - I thought that the 'numb' type had always been the same?


>Is everyone happy with my analysis above?  Are we OK with accepting the same semantics for CIF2?

Reserving judgment on the advantages of 'numb', I agree that the rest of the analysis is accurate for CIF 1.1.  I could probably accept the same semantics for CIF2, but I think there may be room to do better.  Consider this:

a) Undelimited values that lex as numbers according to the CIF specifications are in fact handled as numbers (possibly with associated s.u.s) in the abstract data model.

b) CIF2 specifies a standard number format for use in coercing numbers to strings in the event that a dictionary specifies a 'char' subtype for such a value, limits its values via a regex, or otherwise needs a string representation.

It is implicit in that approach that the literal character sequence of unquoted numeric values will not be faithfully preserved in some cases, but that is the practical reality today, CIF specifications notwithstanding.  The main advantage to be gained is better agreement with the behavior of existing software.  Although it is likely that the formatting existing software performs would not agree exactly with whatever format might be chosen for (b), I anticipate that for many programs and libraries it would be a lot easier to change formats than to change the underlying data model.

That would also make for more flexible use of regexes in validating 'numb' values, in that the regexes used in such contexts could safely make some assumptions about the form of the strings matched against them.

I'm not convinced that existing software ever gets into trouble with the current 'numb' definition.  Any programs that really do require numbers (e.g. structure display programs) are not affected by the need to keep the string representation.  The only programs that will have trouble are those which input and manipulate strings that look like numbers, if the parsing part of the program aggressively converts to numbers before the part that expects strings can stop it.  Strings in datafiles are mostly meant for human consumption, so I would have thought that such bugs would long ago have been discovered and dealt with.  While the explicit data model I've described may sound overly complex, I suspect that it is actually what pragmatic programmers have long ago implemented (e.g. CIFtbx).

So I don't think specifying a coercion format is a solution, because I don't think there is a problem. I also think it highly unlikely that a set of rules could be found such that a number that looks like a float would reliably be coerced back to its original form, with the appropriate number of significant digits, identically on all platforms.
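
A minimal sketch of the round-trip concern, assuming Python's default float formatting stands in for a coercion rule (the sample values and the `round_trip` helper are hypothetical and appear in no CIF specification). It only shows that parsing number-like text and formatting it again can drop trailing zeros or change the notation; it says nothing about what a carefully designed CIF2 format could achieve.

```python
# Hypothetical sample values that lex as numbers when undelimited.
samples = ["1.50", "0.10", "1.0e2", "12.340"]


def round_trip(text):
    """Parse the text as a float, then format it with Python's default str()."""
    return str(float(text))


# Collect every sample whose text is not reproduced exactly by the round trip.
changed = {s: round_trip(s) for s in samples if round_trip(s) != s}

for original, coerced in changed.items():
    print(f"{original!r} -> {coerced!r}")
```

Here "1.50" comes back as "1.5", losing the trailing zero that indicated precision, and "1.0e2" comes back as "100.0".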

I don't think the current specification is broken; rather, it is a somewhat subtle solution which:
(1) Keeps non-delimited number-like strings available as strings, increasing the robustness of the system
(2) Prevents delimited number-like strings from being interpreted as numbers, giving a further avenue for safety and matching human readability expectations
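
The two properties above can be sketched as a small data model. This is only an illustration under stated assumptions: the `CifValue` class, its method names, and the `NUMB_RE` pattern are hypothetical, and the pattern is a simplified approximation of CIF numeric syntax (optional sign, digits, optional exponent, optional s.u. in parentheses), not the formal grammar. It is not taken from CIFtbx or any other existing library.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Simplified approximation of CIF numeric syntax (an assumption, not the
# formal grammar): a number, optionally followed by an s.u. in parentheses.
NUMB_RE = re.compile(
    r"^(?P<value>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)"
    r"(?:\((?P<su>\d+)\))?$"
)


@dataclass(frozen=True)
class CifValue:
    raw: str         # characters exactly as they appeared in the file
    delimited: bool  # True if the value was quoted or given as a text field

    def as_string(self) -> str:
        """The original text is always available, even for number-like values."""
        return self.raw

    def as_number(self) -> Optional[Tuple[float, Optional[str]]]:
        """Return (value, s.u. digits) for undelimited numeric text, else None."""
        if self.delimited:
            return None  # delimited values are never treated as numbers
        match = NUMB_RE.match(self.raw)
        if match is None:
            return None
        return float(match.group("value")), match.group("su")
```

With this sketch, `CifValue("1.50(3)", delimited=False)` can be read either as the string "1.50(3)" or as the number 1.5 with s.u. digits "3", while `CifValue("1.50", delimited=True)` yields only the string. The s.u. is left as its raw digits here to avoid committing to a scaling rule.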


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

