
Re: [ddlm-group] CIF2 semantics

On Monday, July 25, 2011 10:25 PM, James Hester wrote:
>In order to minimise the number of issues we have to discuss in Madrid to clean up CIF2, I would like to turn discussion to those semantic issues which are relevant to the syntax.  I believe that there are three possible types of datavalue: "inapplicable", "unknown" and "string", represented by <full point> (commonly called a "full stop" or "period"), <question mark> and everything else, respectively.
>Do we all agree with the following assertion regarding full point and question mark?
>(1) A full point/question mark inside string delimiters is *not* equivalent to an undelimited full point/question mark
>Numbers: I believe that strings that could be interpreted as numbers are nevertheless (in a formal sense) just strings in the context of the post-parse abstract data model.  Therefore, whether or not a numerical string is delimited does not change its value: 4.5 and "4.5" are identical values.
>Note that this latter assertion does *not* require that CIF-conformant software must always handle numbers as strings; I am making these statements in order to clarify the abstract data model on which the various DDLs and domain dictionaries operate, not to dictate software design.  If your software can manage any potential need to swap between string and number representation of your data value, then more power to you.
>Please state whether you agree or disagree with the above.

I agree that a CIF data value comprising only a full point or question mark character is a place-holder value where it is whitespace-delimited, but is an ordinary string value otherwise.  No other data values are place-holders in the CIF sense.  CIF 1.1 distinguishes between the meanings of these place-holders, and that distinction may occasionally be useful.
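To make that rule concrete, here is a minimal sketch in Python. The names `ValueKind` and `classify_value`, and the `was_delimited` flag, are hypothetical; the sketch assumes a tokenizer that reports whether a value appeared inside string delimiters, and it is not taken from any existing CIF library.

```python
from enum import Enum


class ValueKind(Enum):
    INAPPLICABLE = "inapplicable"  # bare full point
    UNKNOWN = "unknown"            # bare question mark
    STRING = "string"              # everything else


def classify_value(text: str, was_delimited: bool) -> ValueKind:
    """Classify one parsed data value.

    `was_delimited` is True when the value appeared inside string
    delimiters; it is assumed that the tokenizer supplies this flag.
    """
    if not was_delimited:
        if text == ".":
            return ValueKind.INAPPLICABLE
        if text == "?":
            return ValueKind.UNKNOWN
    # Delimited "." and "?" fall through: they are ordinary strings.
    return ValueKind.STRING
```

Under this sketch a bare ? is a place-holder while '?' is an ordinary one-character string, and the two place-holder kinds remain distinguishable.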

From before the advent of CIF dictionaries, CIF 1 specified that data values of certain forms were of numeric type, and values of all other forms were of string type.  Although CIF 1.1 describes this among the common semantic features rather than the syntax specifications, I am uncertain whether that should be interpreted as an intentional technical decision.  Certainly many computer languages treat data typing for literal values as a syntactic issue, but others are very successful with a more freewheeling approach.

I agree with James and Brian that it comes down to the practical advantages of making a distinction, and from that perspective I assert

1) The distinction is useful only where the appropriate data type would otherwise be unknown, AND the data type is needed for decision making.

Knowledge of the appropriate data type could be dynamically derived from a dictionary, but I suspect that most CIF software simply encodes its data type requirements algorithmically (e.g. programs know that _cell_length_a must be numeric).  Since Herbert raises PDB software in particular, I am curious whether there is any practical ambiguity there: what are some of the CIF data items whose data type that software needs but cannot determine other than from their lexical form?  What is a specific consequence that could arise from the software choosing the wrong data type for those items?

One of the areas that would be affected is general-purpose CIF tools, such as pretty printers, that rely only on the content of the CIFs presented to them.  Such programs may safely reformat numbers (e.g. switch among pure decimal form and various recognized forms of scientific notation, or convert s.u.s from rule-of-29 to rule-of-19) only if they can reliably recognize them as numbers.
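As an illustration of what "recognizing a number" from lexical form alone might involve, here is a rough Python sketch. The regular expression only approximates the numeric forms of CIF 1.1 (optional sign, decimal mantissa, optional exponent, optional s.u. in parentheses); it is an assumption made for illustration, not a transcription of the formal grammar, and `parse_numeric` is a hypothetical name. Whether a delimited value should be passed to it at all is exactly the question under discussion, so the sketch leaves that decision to the caller.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Approximate numeric form: mantissa, optional exponent, optional s.u.
_NUMERIC = re.compile(
    r"""(?P<mantissa>[+-]?(?:\d+\.?\d*|\.\d+))
        (?:[eE](?P<exp>[+-]?\d+))?
        (?:\((?P<su>\d+)\))?""",
    re.VERBOSE,
)


def parse_numeric(text: str) -> Optional[Tuple[Decimal, Optional[Decimal]]]:
    """Return (value, s.u.) if `text` has numeric form, else None."""
    m = _NUMERIC.fullmatch(text)
    if m is None:
        return None
    mantissa = m.group("mantissa")
    exponent = int(m.group("exp") or 0)
    value = Decimal(mantissa).scaleb(exponent)
    su = None
    if m.group("su") is not None:
        # The s.u. digits apply to the last digits of the mantissa.
        frac_digits = len(mantissa.partition(".")[2])
        su = Decimal(int(m.group("su"))).scaleb(exponent - frac_digits)
    return value, su
```

A pretty printer built along these lines could re-emit the value in another notation only when `parse_numeric` succeeds, and would have to pass every other value through unchanged.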

2) The distinction may be practical where it isn't otherwise useful, especially in the sense that it may be built in to a lot of existing software.

I know it's built into most CIF software I've ever written.  I'm not sure offhand how significant the impact would be of lifting the distinction.

Overall, I am apprehensive about lifting the formal distinction for CIF 1.x, but I am open to considering it for CIF 2.0.  I am not yet persuaded that it would be advantageous, but neither am I persuaded that it would be harmful.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
