Discussion List Archives


Re: [ddlm-group] CIF2 semantics

To understand the problem with conflating strings and numbers, look at the
following tags and values:

_citation.journal_id_ISSN           0036-8075
_citation.journal_id_CSD            0038

If you have a dictionary, you know both items are strings, not numbers,
and you will reliably keep the leading zeros and not treat the first
as 36*10^(-8075).  If you don't have a dictionary and are just using,
say, CIFtbx, you might treat both values as numbers.  Under current
rules you can protect the values from the numeric interpretation
even without a dictionary by saying

_citation.journal_id_ISSN           "0036-8075"
_citation.journal_id_CSD            "0038"

and all is well.  Without that mechanism, you need a dictionary.
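
A minimal sketch of that behaviour, in Python rather than in any real CIF
library: interpret() and its deliberately lenient numeric pattern are
hypothetical, written only to mimic a reader that accepts an exponent
without the letter "e".  It is not how CIFtbx is implemented.

```python
import re

# Hypothetical lenient numeric pattern (an assumption for illustration):
# a mantissa followed by an optional exponent whose "e" may be omitted,
# so that "0036-8075" reads as 36 * 10^(-8075).
_NUMERIC = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:[eE]?([+-]\d+)|[eE](\d+))?$"
)


def interpret(token):
    """Interpret one raw token as it appears in the file, without a dictionary."""
    # Matching quote delimiters force the string interpretation.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    match = _NUMERIC.match(token)
    if match is None:
        return token  # not numeric-looking, so keep it as a string
    mantissa = match.group(1)
    exponent = match.group(2) or match.group(3)
    if exponent is None and re.fullmatch(r"[+-]?\d+", mantissa):
        return int(mantissa)  # the leading zeros are lost here
    return float(mantissa + ("e" + exponent if exponent else ""))


for raw in ["0038", '"0038"', "0036-8075", '"0036-8075"']:
    print(raw, "->", repr(interpret(raw)))
```

With this sketch the bare 0038 comes back as the integer 38, the bare
0036-8075 underflows to 0.0, and the delimited forms come back unchanged
as strings.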




At 10:23 AM -0500 7/26/11, Bollinger, John C wrote:
>On Monday, July 25, 2011 10:25 PM, James Hester wrote:
>>In order to minimise the number of issues we have to discuss in 
>>Madrid to clean up CIF2, I would like to turn discussion to those 
>>semantic issues which are relevant to the syntax.  I believe that 
>>there are three possible types of datavalue: "inapplicable", 
>>"unknown" and "string", represented by <full point> (commonly 
>>called a "full stop" or "period"), <question mark> and everything 
>>else, respectively.
>>
>>Do we all agree with the following assertion regarding full point 
>>and question mark?
>>(1) A full point/question mark inside string delimiters is *not* 
>>equivalent to an undelimited full point/question mark
>>
>>Numbers: I believe that strings that could be interpreted as 
>>numbers are nevertheless (in a formal sense) just strings in the 
>>context of the post-parse abstract data model.  Therefore, whether 
>>or not a numerical string is delimited does not change its value: 
>>4.5 and "4.5" are identical values.
>>
>>Note that this latter assertion does *not* require that 
>>CIF-conformant software must always handle numbers as strings; I am 
>>making these statements in order to clarify the abstract data model 
>>on which the various DDLs and domain dictionaries operate, not to 
>>dictate software design.  If your software can manage any potential 
>>need to swap between string and number representation of your data 
>>value, then more power to you.
>>
>>Please state whether you agree or disagree with the above.
>
>
>I agree that a CIF data value comprising only a full point or 
>question mark character is a place-holder value where it is 
>whitespace-delimited, but is an ordinary string value otherwise.  No 
>other data values are place-holders in the CIF sense.  CIF 1.1 
>distinguishes between the meanings of these place-holders, and that 
>distinction may occasionally be useful.
>
>
>From before the advent of CIF dictionaries, CIF 1 specified that 
>data values of certain forms were of numeric type, and values of 
>all other forms were of string type.  Although CIF 1.1 describes 
>this among the common semantic features rather than the syntax 
>specifications, I am uncertain whether that should be interpreted 
>as an intentional technical decision.  Certainly many computer 
>languages treat data typing for literal values as a syntactic 
>issue, but others are very successful with a more freewheeling 
>approach.
>
>I agree with James and Brian that it comes down to the practical 
>advantages of making a distinction, and from that perspective I 
>assert
>
>
>1) The distinction is useful only where the appropriate data type 
>would otherwise be unknown, AND the data type is needed for decision 
>making.
>
>Knowledge of the appropriate data type could be dynamically derived 
>from a dictionary, but I suspect that most CIF software simply 
>encodes its data type requirements algorithmically (e.g. programs 
>know that _cell_length_a must be numeric).  Since Herbert raises PDB 
>software in particular, I am curious about the practical ambiguity 
>there: what are some of the CIF data items whose 
>data type that software needs but cannot determine other than from 
>their lexical form?  What is a specific consequence that could arise 
>from the software choosing the wrong data type for those items?
>
>One of the areas that would be affected is general-purpose CIF 
>tools, such as pretty printers, that rely only on the content of the 
>CIFs presented to them.  Such programs may safely reformat numbers 
>(e.g. switch among pure decimal form and various recognized forms of 
>scientific notation, convert s.u.s from rule-of-29 to rule-of-19) 
>only if they can reliably recognize them as numbers.
>
>
>2) The distinction may be practical where it isn't otherwise useful, 
>especially in the sense that it may be built in to a lot of existing 
>software.
>
>I know it's built into most CIF software I've ever written.  I'm not 
>sure offhand how significant the impact would be of lifting the 
>distinction.
>
>
>Overall, I am apprehensive about lifting the formal distinction for 
>CIF 1.x, but I am open to considering it for CIF 2.0.  I am not yet 
>persuaded that it would be advantageous, but neither am I persuaded 
>that it would be harmful.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group
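
The place-holder rule quoted above (a bare full point or question mark is
a place-holder, a delimited one is an ordinary string) can be sketched as
follows.  The Placeholder enum and the classify() function are
hypothetical names chosen for illustration, not part of any existing CIF
software.

```python
from enum import Enum


class Placeholder(Enum):
    """The two CIF place-holder values (names are illustrative)."""
    INAPPLICABLE = "."
    UNKNOWN = "?"


def classify(token):
    """Map one raw, whitespace-delimited token to a place-holder or a string."""
    if token == ".":
        return Placeholder.INAPPLICABLE
    if token == "?":
        return Placeholder.UNKNOWN
    # A token wrapped in matching quotes is an ordinary string,
    # even when its content is a lone "." or "?".
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    return token


for raw in [".", '"."', "?", "'?'", "4.5"]:
    print(raw, "->", repr(classify(raw)))
```

Keeping the place-holders as a distinct type, rather than as the strings
"." and "?", is one way to stop the delimited and undelimited forms from
being confused after parsing.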


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
