
Re: [ddlm-group] CIF2 semantics

To understand the problem with conflating strings and numbers, look at the
following tags and values:

_citation.journal_id_ISSN           0036-8075
_citation.journal_id_CSD            0038

If you have a dictionary, you know both items are strings, not numbers,
and you will reliably keep the leading zeros and not treat the first
as 36*10^(-8075).  If you don't have a dictionary and are just using,
say, CIFtbx, you might treat both values as numbers.  Under current
rules you can protect the values from the numeric interpretation
even without a dictionary by saying

_citation.journal_id_ISSN           "0036-8075"
_citation.journal_id_CSD            "0038"

and all is well.  Without that mechanism, you need a dictionary.
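
As a rough illustration of the point above, here is a minimal Python
sketch of dictionary-free typing by lexical form.  The names classify
and _LENIENT_NUMBER are hypothetical, and the lenient pattern (which
lets the exponent letter be omitted, so that 0036-8075 reads as a
mantissa followed by an exponent) is only an assumption made to
reproduce the numeric reading described above; it is not taken from
CIFtbx or from the CIF specification.

```python
import re

# ASSUMPTION: a deliberately lenient numeric pattern, not CIFtbx's or the
# CIF 1.1 grammar.  Optional sign, digits with an optional fraction, then an
# optional exponent whose letter may be omitted when the exponent is signed.
_LENIENT_NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE]?[+-]\d+|[eE]\d+)?$")


def classify(token):
    """Guess the type of a raw CIF value token without a dictionary.

    Returns a (kind, text) pair where kind is "string" or "number".
    Delimited (quoted) tokens are always strings; undelimited tokens are
    numbers only if they match the lenient numeric pattern.
    """
    if len(token) >= 2 and token[0] in "'\"" and token[-1] == token[0]:
        # Quoting protects the value from the numeric interpretation.
        return ("string", token[1:-1])
    if _LENIENT_NUMBER.match(token):
        return ("number", token)
    return ("string", token)


# Undelimited values are taken as numbers by this dictionary-free reader.
issn_bare = classify("0036-8075")
csd_bare = classify("0038")

# The same values in quotes stay strings, leading zeros intact.
issn_quoted = classify('"0036-8075"')
csd_quoted = classify('"0038"')

# Once a value is handled as a number, the leading zeros are gone.
csd_round_trip = str(int(csd_bare[1]))
```

In this sketch the quoted forms come back as strings with their text
unchanged, while the bare 0038 round-trips through an integer as 38.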




At 10:23 AM -0500 7/26/11, Bollinger, John C wrote:
>On Monday, July 25, 2011 10:25 PM, James Hester wrote:
>>In order to minimise the number of issues we have to discuss in 
>>Madrid to clean up CIF2, I would like to turn discussion to those 
>>semantic issues which are relevant to the syntax.  I believe that 
>>there are three possible types of datavalue: "inapplicable", 
>>"unknown" and "string", represented by <full point> (commonly 
>>called a "full stop" or "period"), <question mark> and everything 
>>else, respectively.
>>
>>Do we all agree with the following assertion regarding full point 
>>and question mark?
>>(1) A full point/question mark inside string delimiters is *not* 
>>equivalent to an undelimited full point/question mark
>>
>>Numbers: I believe that strings that could be interpreted as 
>>numbers are nevertheless (in a formal sense) just strings in the 
>>context of the post-parse abstract data model.  Therefore, whether 
>>or not a numerical string is delimited does not change its value: 
>>4.5 and "4.5" are identical values.
>>
>>Note that this latter assertion does *not* require that 
>>CIF-conformant software must always handle numbers as strings; I am 
>>making these statements in order to clarify the abstract data model 
>>on which the various DDLs and domain dictionaries operate, not to 
>>dictate software design.  If your software can manage any potential 
>>need to swap between string and number representation of your data 
>>value, then more power to you.
>>
>>Please state whether you agree or disagree with the above.
>
>
>I agree that a CIF data value comprising only a full point or 
>question mark character is a place-holder value where it is 
>whitespace-delimited, but is an ordinary string value otherwise.  No 
>other data values are place-holders in the CIF sense.  CIF 1.1 
>distinguishes between the meanings of these place-holders, and that 
>distinction may occasionally be useful.
>
>
>From before the advent of CIF dictionaries, CIF 1 specified that 
>data values of certain forms were of numeric type, and values of 
>all other forms were of string type.  Although CIF 1.1 describes 
>this among the common semantic features rather than the syntax 
>specifications, I am uncertain whether that should be interpreted 
>as an intentional technical decision.  Certainly many computer 
>languages treat data typing for literal values as a syntactic 
>issue, but others are very successful with a more freewheeling 
>approach.
>
>I agree with James and Brian that it comes down to the practical 
>advantages of making a distinction, and from that perspective I 
>assert
>
>
>1) The distinction is useful only where the appropriate data type 
>would otherwise be unknown, AND the data type is needed for decision 
>making.
>
>Knowledge of the appropriate data type could be dynamically derived 
>from a dictionary, but I suspect that most CIF software simply 
>encodes its data type requirements algorithmically (e.g. programs 
>know that _cell_length_a must be numeric).  Since Herbert raises PDB 
>software in particular, I am curious about the practical ambiguity 
>there: what are some of the CIF data items whose 
>data type that software needs but cannot determine other than from 
>their lexical form?  What is a specific consequence that could arise 
>from the software choosing the wrong data type for those items?
>
>One of the areas that would be affected is general-purpose CIF 
>tools, such as pretty printers, that rely only on the content of the 
>CIFs presented to them.  Such programs may safely reformat numbers 
>(e.g. switch among pure decimal form and various recognized forms of 
>scientific notation, convert s.u.s from rule-of-29 to rule-of-19) 
>only if they can reliably recognize them as numbers.
>
>
>2) The distinction may be practical where it isn't otherwise useful, 
>especially in the sense that it may be built in to a lot of existing 
>software.
>
>I know it's built into most CIF software I've ever written.  I'm not 
>sure offhand how significant the impact would be of lifting the 
>distinction.
>
>
>Overall, I am apprehensive about lifting the formal distinction for 
>CIF 1.x, but I am open to considering it for CIF 2.0.  I am not yet 
>persuaded that it would be advantageous, but neither am I persuaded 
>that it would be harmful.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
