Discussion List Archives

Re: [ddlm-group] CIF2 semantics


On Tuesday, July 26, 2011 10:47 PM, James Hester wrote:
>Let me see if I can clarify the CIF1.1 approach to the 'numb'/'char'
>distinction.  Here is a compilation from IT Vol G of information about
>this distinction:
>
>==============================================================================================
>2.2.5.2 Data typing: "..type numb encompasses all data values that are
>interpretable as numeric values...any CIF reader may encounter data names
>that are not defined in a public or accompanying dictionary. It is
>therefore appropriate to adopt a strategy of interpreting as a number any
>data value that looks like one...Therefore, in the absence of a specific
>counter-indication (from a dictionary definition), the data value in the
>following example may be taken as the numeric (integer) value 1:
>
>_unknown_data_name 1
>
>On the other hand, if _unknown_data_name were explicitly defined in a
>dictionary with a data type of 'char', then the value should be stored
>as the literal character 1...Note that numbers within a quoted string
>or a text block are not interpreted as type 'numb' but as type 'char'."
>
>2.2.7.1.4 (10): "A simple data value ... may optionally be delimited by
>any of the same set of delimiting character strings, *except* for data
>values that are to be interpreted as numbers"
>
>2.2.7.4.7.1(17): "Where the attributes of a data value are not available
>in a dictionary listing, it may be assumed that a character string
>interpretable as a number should be taken to represent an item of type
>'numb'.  However, an explicit dictionary declaration of type will override
>such an assumption"
>
>4.9 DDL1 _type: "Type 'numb' identifies items which must have values that
>are identifiable numbers.  The acceptable syntax for these numbers is
>application-dependent."
>
>4.9 DDL1 _type_conditions: "'su' permits a number string to contain an
>appended standard uncertainty number enclosed within parentheses"
>
>4.10 DDL2 _item_type_list.construct, _item_type_list.primitive_code: "When
>a data value can be defined as a pre-determined sequence of characters...
>it is specified as a construction"
>=================================================================================================
>
>I think the above extracts are consistent with Herbert's summary of the
>CIF1 situation.  I attempt to rephrase the situation in terms of the
>abstract datamodel in the following.


I agree.


>Section 2.2.5.2 above states that a non-delimited string that is
>interpretable as a number may actually have 'char' type if a
>dictionary specifies this.  If we wish to allow modular separation
>between CIF parsing applications and CIF dictionary applications (to
>allow CIF parsers to be developed independently of particular domain
>dictionaries, for example), the parser must therefore preserve *all*
>undelimited strings as character sequences, to allow for the possibility
>that those datavalues that appear to be numbers will turn out to be
>character strings.  So, 'numb' values in the formal datamodel are actually
>objects containing two values, the original string and the numerical
>alternative value.  Note that if you defer the determination of what is
>and isn't a 'numb' datavalue to a later stage, when you no longer have
>information about the string delimiters used, you may allow delimited
>strings to be accepted as 'numb' type, despite the fact that this is a
>violation of the syntax (2.2.7.1.4(10)) and the BNF.
>
>The above interpretation is actually consistent with the DDL2 practice of
>explicitly describing the syntax of integers and floats using POSIX regular
>expressions on the 'numb' primitive datatype - what this is actually doing
>conceptually is operating on the "string" aspect of the 'numb' datatype,
>and in this way excludes quoted strings from interpretation as numbers,
>even if they match the POSIX expression.


OK.
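
To make the two-valued view of 'numb' concrete, here is a minimal Python sketch of a parser-level value that always keeps the original character sequence and whether it was delimited, and only offers a numeric reading on request. The names RawValue and interpret and the NUMBER_RE pattern are assumptions made for illustration; the pattern only approximates the CIF number syntax and is not the normative grammar.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical approximation of a CIF-style number with an optional
# standard uncertainty in parentheses; NOT the normative CIF grammar.
NUMBER_RE = re.compile(
    r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?(?:\([0-9]+\))?"
)


@dataclass(frozen=True)
class RawValue:
    """What a dictionary-blind parser hands on: text plus delimiter info."""
    text: str        # original character sequence, always preserved
    delimited: bool  # True if the value was quoted or came from a text block

    def as_number(self) -> Optional[float]:
        """Numeric reading, or None if delimited or not number-like."""
        if self.delimited or NUMBER_RE.fullmatch(self.text) is None:
            return None
        # Drop any "(su)" suffix before conversion; the s.u. is ignored here.
        return float(self.text.split("(")[0])


def interpret(value: RawValue,
              dictionary_type: Optional[str] = None) -> Union[str, float]:
    """Apply a dictionary type ('char', 'numb') or the no-dictionary default."""
    if dictionary_type == "char":
        return value.text                      # stored as literal characters
    number = value.as_number()
    if dictionary_type == "numb":
        if number is None:
            raise ValueError(f"{value.text!r} is not a valid 'numb' value")
        return number
    # No dictionary definition: treat as a number anything that looks like one.
    return value.text if number is None else number
```

With this sketch, interpret(RawValue("1", False)) gives the number 1.0, the same value with a 'char' definition stays the string "1", and a quoted '1' is rejected when the dictionary expects 'numb'.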


>OK: the only justification I can see for the existence of the 'numb'
>primitive type as described above is so that delimited number strings
>aren't interpretable as numbers, because that would be an unexpected
>outcome for a human reader. As a fan of human readability, I think that
>CIF2 could usefully continue with the CIF1 approach to 'numb'; however, in
>written documentation (with all due respect to the Vol G authors) we should
>do a better job of describing the formal meaning of 'numb', as well as the
>practical outcome.


I think the justification for disallowing quoted strings from being interpreted as numbers goes rather deeper, exactly to the point that Herbert has lately been championing.  That is, it is not just for human readability, but also for ensuring correct computer interpretation.  It may be that that justification is redundant with the implications of other specifications, but that does not negate it.

I do agree, however, both that the concept is useful and that it would bear a better formal description.


>Note the following consequences of the CIF1 approach, which I hope we
>all accept:
>(1) A delimited numerical value is invalid if the dictionary specifies
>that 'numb' is expected


Agreed.


>(2) A delimited numerical value is a valid number if the dictionary/DDL
>allows numbers to be derived from character strings (e.g. by giving a
>POSIX regex in the DDL2 _item_type_list.construct and a primitive code
>of 'char')


I'm not sure I follow.  Certainly a delimited numerical value is valid for a given data item if it matches the dictionary definition, which could be formulated as you describe, but how do you go from there to the value formally being a number as far as CIF is concerned?
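
The distinction in question can be pictured with a small Python sketch. It assumes a hypothetical integer construct and uses Python's re module as a stand-in for the POSIX regular expressions that DDL2 uses; validate_char_item and INTEGER_CONSTRUCT are illustrative names, not part of any CIF library.

```python
import re

# Hypothetical construct for an item whose primitive code is 'char'.
INTEGER_CONSTRUCT = r"[+-]?[0-9]+"


def validate_char_item(text: str, construct: str) -> bool:
    """True if the whole character string matches the dictionary construct."""
    return re.fullmatch(construct, text) is not None


# The content of a quoted value such as '42', as delivered by the parser.
value = "42"

is_valid = validate_char_item(value, INTEGER_CONSTRUCT)

# Validation succeeds, but the value is still a string. Turning it into a
# number is a separate step that some application chooses to take.
derived = int(value) if is_valid else None
```

In this picture, matching the construct only establishes that the string is acceptable for the data item; nothing in the check itself changes the type of the value.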


>(3) Dictionary-blind pretty-printers as hypothesised by John B below may
>make mistakes in their pretty-printing if they assume 'numb' wrongly.
>Likewise, other dictionary-blind software cannot rely on apparent 'numb'
>values really being 'numb'. Successful behaviour after assuming 'numb'
>type is likely, but not guaranteed.


I agree that this is a consequence of the CIF 1.1 specifications.  I'm not certain that it's desirable, however, given the substantial body of software that ignores that result in favor of CIF 1.0 behavior.  I think it might be more consistent for the specifications to actually require numeric strings to be quoted if they are to be interpreted as 'char' values (or at least to be faithfully preserved); at present that is only a practical consideration.


>  The only advantage of 'numb' is
>human-readability, as described above.


I don't wholly agree there (see above), but for the moment I think that's a side issue.


>Is everyone happy with my analysis above?  Are we OK with accepting the same semantics for CIF2?


Reserving judgment on the advantages of 'numb', I agree that the rest of the analysis is accurate for CIF 1.1.  I could probably accept the same semantics for CIF2, but I think there may be room to do better.  Consider this:

a) Undelimited values that lex as numbers according to the CIF specifications are in fact handled as numbers (possibly with associated s.u.s) in the abstract data model.

b) CIF2 specifies a standard number format for use in coercing numbers to strings in the event that a dictionary specifies a 'char' subtype for such a value, limits its values via a regex, or otherwise needs a string representation.

It is implicit in that approach that the literal character sequence of unquoted numeric values will not be faithfully preserved in some cases, but that is the practical reality today, CIF specifications notwithstanding.  The main advantage to be gained is better agreement with the behavior of existing software.  Although it is likely that the formatting existing software performs would not agree exactly with whatever format might be chosen for (b), I anticipate that for many programs and libraries it would be a lot easier to change formats than to change the underlying data model.

That would also make for more flexible use of regexes in validating 'numb' values, in that the regexes used in such contexts could safely make some assumptions about the form of the strings matched against them.
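
As a rough illustration of (b), the following Python sketch coerces an undelimited numeric token to a single string form. The format chosen here (plain decimal notation, no exponent, no trailing zeros) is purely an assumption for the example, as is the name canonical_number_string; no such format has been specified for CIF2, and standard uncertainties are left out entirely.

```python
from decimal import Decimal


def canonical_number_string(token: str) -> str:
    """Coerce a numeric token to one hypothetical standard string form.

    Decimal accepts a few spellings that CIF does not (for example "NaN"),
    so a real implementation would first lex the token with the CIF number
    grammar. Raises decimal.InvalidOperation for non-numeric input.
    """
    value = Decimal(token)
    # normalize() strips trailing zeros; the "f" format avoids exponents.
    return format(value.normalize(), "f")
```

Under this assumed format, the tokens 1.5, 1.50 and 0.15e1 all become "1.5", which shows both the loss of the original character sequence and why a regex applied afterwards could rely on a single spelling.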


John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


Email Disclaimer:  www.stjude.org/emaildisclaimer

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
