Discussion List Archives


Re: [ddlm-group] CIF2 semantics

Let me see if I can clarify the CIF1.1 approach to the 'numb'/'char' distinction.  Here is a compilation from IT Vol G of information about this distinction:

==============================================================================================
Data typing: "..type numb encompasses all data values that are interpretable as numeric values...any CIF reader may encounter data names that are not defined in a public or accompanying dictionary. It is therefore appropriate to adopt a strategy of interpreting as a number any data value that looks like one...Therefore, in the absence of a specific counter-indication (from a dictionary definition), the data value in the following example may be taken as the numeric (integer) value 1:

_unknown_data_name 1

On the other hand, if _unknown_data_name were explicitly defined in a dictionary with a data type of 'char', then the value should be stored as the literal character 1...Note that numbers within a quoted string or a text block are not interpreted as type 'numb' but as type 'char'."

(10): "A simple data value ... may optionally be delimited by any of the same set of delimiting character strings, *except* for data values that are to be interpreted as numbers"

"Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type 'numb'.  However, an explicit dictionary declaration of type will override such an assumption"

4.9 DDL1 _type: "Type 'numb' identifies items which must have values that are identifiable numbers.  The acceptable syntax for these numbers is application-dependent."

4.9 DDL1 _type_conditions: "'su' permits a number string to contain an appended standard uncertainty number enclosed within parentheses"

4.10 DDL2 _item_type_list.construct, _item_type_list.primitive_code: "When a data value can be defined as a pre-determined sequence of characters...it is specified as a construction"

I think the above extracts are consistent with Herbert's summary of the CIF1 situation.  I attempt to rephrase the situation in terms of the abstract datamodel in the following.

The section quoted above states that a non-delimited string that is interpretable as a number may actually have 'char' type if a dictionary specifies this.  If we wish to allow modular separation between CIF parsing applications and CIF dictionary applications (to allow CIF parsers to be developed independently of particular domain dictionaries, for example), the parser must therefore preserve *all* undelimited strings as character sequences, to allow for the possibility that those datavalues that appear to be numbers will turn out to be character strings.  So, 'numb' values in the formal datamodel are actually objects containing two values: the original string and the alternative numerical value.  Note that if you defer the determination of what is and isn't a 'numb' datavalue to a later stage, when you no longer have information about the string delimiters used, you may end up accepting delimited strings as 'numb' type, even though this violates the syntax (and the BNF).
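To make the two-valued 'numb' object concrete, here is a minimal Python sketch. It is an illustration only: the class name DataValue, its fields and the number pattern are hypothetical, and are not taken from the CIF specification or from any existing parser. The pattern is deliberately strict (sign, digits, optional fraction, optional exponent, optional s.u. in parentheses).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for "looks like a number"; a real parser would
# follow the CIF BNF rather than this simplified expression.
_NUMBER = re.compile(
    r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?"
)


@dataclass(frozen=True)
class DataValue:
    text: str        # characters exactly as they appeared in the file
    delimited: bool  # True if the value was quoted or in a text block

    def looks_numeric(self) -> bool:
        # Delimited values are never candidates for 'numb'.
        return not self.delimited and _NUMBER.fullmatch(self.text) is not None

    def as_number(self) -> Optional[float]:
        # The numerical alternative; None if the value cannot be 'numb'.
        if not self.looks_numeric():
            return None
        # Drop an appended s.u. such as "(5)" before converting.
        return float(re.sub(r"\(\d+\)$", "", self.text))


bare = DataValue("0038", delimited=False)
quoted = DataValue("0038", delimited=True)
print(bare.text, bare.as_number())      # string kept alongside the number
print(quoted.text, quoted.as_number())  # quoted value has no numeric reading
```

Because the original text and the delimiter flag travel with the value, a dictionary application running later can still decide that the item is 'char' and take the string unchanged.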

The above interpretation is actually consistent with the DDL2 practice of explicitly describing the syntax of integers and floats using POSIX regular expressions on the 'numb' primitive datatype. Conceptually, this operates on the "string" aspect of the 'numb' datatype, and in this way excludes quoted strings from interpretation as numbers, even if they match the POSIX expression.

OK: the only justification I can see for the existence of the 'numb' primitive type as described above is so that delimited number strings aren't interpretable as numbers, because that would be an unexpected outcome for a human reader. As a fan of human readability, I think that CIF2 could usefully continue with the CIF1 approach to 'numb'. However, in written documentation (with all due respect to the Vol G authors) we should do a better job of describing the formal meaning of 'numb', as well as the practical outcome.

Note the following consequences of the CIF1 approach, which I hope we all accept:
(1) A delimited numerical value is invalid if the dictionary specifies that 'numb' is expected
(2) A delimited numerical value is a valid number if the dictionary/DDL allows numbers to be derived from character strings (e.g. by giving a POSIX regex in the DDL2 _item_type_list.construct and a primitive code of 'char')
(3) Dictionary-blind pretty-printers as hypothesised by John B below may make mistakes in their pretty-printing if they assume 'numb' wrongly.  Likewise, other dictionary-blind software cannot rely on apparent 'numb' values really being 'numb'. Successful behaviour after assuming 'numb' type is likely, but not guaranteed.  The only advantage of 'numb' is human-readability, as described above.
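Consequences (1) and (2) can be sketched as a small dictionary-aware check. This is a rough Python illustration under my own assumptions: the function name check_value and its arguments are hypothetical, the integer pattern is just an example and is not copied from any DDL2 dictionary, and Python's re module stands in for POSIX regular expressions.

```python
import re
from typing import Optional, Union

# Example construct only; real dictionaries define their own expressions.
INTEGER = r"[+-]?[0-9]+"


def check_value(text: str, delimited: bool, primitive: str,
                construct: Optional[str] = None) -> Union[float, str]:
    """Return the accepted value, or raise ValueError if it is invalid."""
    if construct is not None and re.fullmatch(construct, text) is None:
        raise ValueError(f"{text!r} does not match construct {construct!r}")
    if primitive == "numb":
        if delimited:
            # Consequence (1): a delimited value cannot be 'numb'.
            raise ValueError(f"delimited value {text!r} is not 'numb'")
        return float(text)  # raises ValueError if not numeric
    if primitive == "char":
        # Consequence (2): the string is valid as given; the application
        # may derive a number from it if the construct guarantees the form.
        return text
    raise ValueError(f"unknown primitive code {primitive!r}")


print(check_value("4.5", False, "numb"))           # accepted as a number
print(int(check_value("0038", True, "char", INTEGER)))  # number derived from a string
```

A dictionary-blind tool has no primitive code to pass in, which is the situation in (3): it can only guess, and the guess may be wrong.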

Is everyone happy with my analysis above?  Are we OK with accepting the same semantics for CIF2?

(Comment on Herbert's example inserted below)

On Wed, Jul 27, 2011 at 4:41 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
To understand the problem with conflating strings and numbers, look at the
following tags and values:

_citation.journal_id_ISSN           0036-8075
_citation.journal_id_CSD            0038

If you have a dictionary, you know both items are strings, not numbers
and you will reliably keep the leading zeros and not treat the first
as 36*10^(-8075).  If you don't have a dictionary and are just using,
say, CIFtbx, you might treat both values as numbers.  Under current
rules you can protect the values from the numeric interpretation
even without a dictionary by saying

_citation.journal_id_ISSN           "0036-8075"
_citation.journal_id_CSD            "0038"

and all is well.  Without that mechanism, you need a dictionary.

This example does not support the need for a 'numb' type, as the section quoted above implies that non-delimited strings that look like numbers must always be available as 'char' type as well, so there is no danger of the above mistake occurring.  I believe that CIFtbx allows the caller to decide the type of a dataname, so numb('_citation.journal_id_CSD') will return a number, but char('_citation.journal_id_CSD') will return a character string in the top example.  This would imply that CIFtbx is maintaining the string representation internally.  Or can you give a little program chunk where the 'numb'-ness of the top example leads to problems?
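To illustrate what I mean by the caller deciding the type, here is a small Python sketch. It is not CIFtbx code and does not reproduce the CIFtbx interface: the ValueStore class and its char/numb methods are hypothetical names chosen to mirror the calls mentioned above, and the number pattern is a strict one of my own choosing. A more lenient reader might accept strings that this pattern rejects.

```python
import re

# Hypothetical strict number pattern: sign, digits, optional fraction
# and optional exponent introduced by 'e' or 'E'.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


class ValueStore:
    """Keeps every undelimited value as text; converts only on request."""

    def __init__(self, items):
        self._items = dict(items)

    def char(self, name):
        # The original characters are always available.
        return self._items[name]

    def numb(self, name):
        text = self._items[name]
        if NUMBER.fullmatch(text) is None:
            raise ValueError(f"{name}: {text!r} is not interpretable as a number")
        return float(text)


store = ValueStore({
    "_citation.journal_id_ISSN": "0036-8075",
    "_citation.journal_id_CSD": "0038",
})

print(store.char("_citation.journal_id_CSD"))  # leading zeros preserved
print(store.numb("_citation.journal_id_CSD"))  # numeric reading on request
```

Since the string is never discarded, a caller that knows (from a dictionary or otherwise) that the item is 'char' simply asks for char() and the leading zeros survive.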

At 10:23 AM -0500 7/26/11, Bollinger, John C wrote:
>On Monday, July 25, 2011 10:25 PM, James Hester wrote:
>>In order to minimise the number of issues we have to discuss in
>>Madrid to clean up CIF2, I would like to turn discussion to those
>>semantic issues which are relevant to the syntax.  I believe that
>>there are three possible types of datavalue: "inapplicable",
>>"unknown" and "string", represented by <full point> (commonly
>>called a "full stop" or "period"), <question mark> and everything
>>else, respectively.
>>Do we all agree with the following assertion regarding full point
>>and question mark?
>>(1) A full point/question mark inside string delimiters is *not*
>>equivalent to an undelimited full point/question mark
>>Numbers: I believe that strings that could be interpreted as
>>numbers are nevertheless (in a formal sense) just strings in the
>>context of the post-parse abstract data model.  Therefore, whether
>>or not a numerical string is delimited does not change its value:
>>4.5 and "4.5" are identical values.
>>Note that this latter assertion does *not* require that
>>CIF-conformant software must always handle numbers as strings; I am
>>making these statements in order to clarify the abstract data model
>>on which the various DDLs and domain dictionaries operate, not to
>>dictate software design.  If your software can manage any potential
>>need to swap between string and number representation of your data
>>value, then more power to you.
>>Please state whether you agree or disagree with the above.
>I agree that a CIF data value comprising only a full point or
>question mark character is a place-holder value where it is
>whitespace-delimited, but is an ordinary string value otherwise.  No
>other data values are place-holders in the CIF sense.  CIF 1.1
>distinguishes between the meanings of these place-holders, and that
>distinction may occasionally be useful.
>>From before the advent of CIF dictionaries, CIF 1 specified that
>>data values of certain forms were of numeric type, and values of
>>all other forms were of string type.  Although CIF 1.1 describes
>>this among the common semantic features rather than the syntax
>>specifications, I am uncertain whether that should be interpreted
>>as an intentional technical decision.  Certainly many computer
>>languages treat data typing for literal values as a syntactic
>>issue, but others are very successful with a more freewheeling
>I agree with James and Brian that it comes down to the practical
>advantages of making a distinction, and from that perspective I
>1) The distinction is useful only where the appropriate data type
>would otherwise be unknown, AND the data type is needed for decision
>Knowledge of the appropriate data type could be dynamically derived
>from a dictionary, but I suspect that most CIF software simply
>encodes its data type requirements algorithmically (e.g. programs
>know that _cell_length_a must be numeric).  Since Herbert raises PDB
>software in particular, I am curious about whether there the
>practical ambiguity there: what are some of the CIF data items whose
>data type that software needs but cannot determine other than from
>their lexical form?  What is a specific consequence that could arise
>from the software choosing the wrong data type for those items?
>One of the areas that would be affected is general-purpose CIF
>tools, such as pretty printers, that rely only on the content of the
>CIFs presented to them.  Such programs may safely reformat numbers
>(e.g. switch among pure decimal form and various recognized forms of
>scientific notation, convert s.u.s from rule-of-29 to rule of 19)
>only if they can reliably recognize them as numbers.
>2) The distinction may be practical where it isn't otherwise useful,
>especially in the sense that it may be built in to a lot of existing
>I know it's built into most CIF software I've ever written.  I'm not
>sure offhand how significant the impact would be of lifting the
>Overall, I am apprehensive about lifting the formal distinction for
>CIF 1.x, but I am open to considering it for CIF 2.0.  I am not yet
>persuaded that it would be advantageous, but neither am I persuaded
>that it would be harmful.
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>ddlm-group mailing list

 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


