Re: [ddlm-group] CIF2 semantics

I am not proposing a change to CIF1.1 behaviour, as I have stated before, so any 'asking for trouble' is purely CIF1.1 asking for trouble.

The cif2cif example has focused my thinking. Given that I am not actually proposing anything new, there are no consequences for such programs in clarifying that CIF1.1 'numb' datavalues have a dual 'number'/'char' datatype.
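
One way to picture this dual datatype is as a value that keeps both readings side by side. The following is a minimal Python sketch, not taken from CIFtbx or any other CIF library: the class name DataValue is hypothetical, and the regular expression is only an approximation of the CIF1.1 number syntax.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: an approximation of the CIF1.1 number syntax, not the normative grammar.
NUMBER_LIKE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


@dataclass(frozen=True)
class DataValue:
    """Hypothetical post-parse value: the characters as written, plus a flag."""
    text: str        # characters exactly as they appeared in the file
    delimited: bool  # True if the value was quoted or in a text block

    @property
    def as_char(self) -> str:
        # The 'char' reading is always available.
        return self.text

    @property
    def as_number(self) -> Optional[float]:
        # The 'numb' reading exists only for undelimited, number-like strings.
        if self.delimited or not NUMBER_LIKE.fullmatch(self.text):
            return None
        # Drop an appended standard uncertainty such as "(5)" before converting.
        return float(self.text.split("(")[0])
```

With this model the undelimited value 0038 is available both as the number 38 and as the string "0038", while the delimited "0038" is available only as a string.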

Where a program has no access to a dictionary (by user choice or programmer design), the result that the program produces is the equivalent of saying 'If everything that looks like a number is in fact a number and not a character string, this is the result'.  It may however be appropriate for the program to alert the user to this assumption.
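
A rough sketch of that behaviour in Python follows. The function name interpret, the shape of the dictionary_types mapping and the number pattern are all assumptions made for illustration; no existing program is being described.

```python
import re
import warnings

# Assumption: an approximation of the CIF1.1 number syntax, not the normative grammar.
NUMBER_LIKE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


def interpret(dataname, text, dictionary_types=None):
    """Type an undelimited value; returns ('numb', float) or ('char', str).

    dictionary_types is a hypothetical mapping of dataname to 'numb' or 'char'.
    """
    declared = (dictionary_types or {}).get(dataname)
    if declared == "char":
        return ("char", text)  # an explicit dictionary declaration wins
    if not NUMBER_LIKE.fullmatch(text):
        return ("char", text)  # does not look like a number
    if declared is None:
        # No dictionary information: proceed on the assumption, but say so.
        warnings.warn(f"{dataname}: no dictionary type, assuming 'numb' for {text!r}")
    return ("numb", float(text.split("(")[0]))
```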

A further consequence is that dictionary designers need to be wary of defining enumerated values that look like numbers - these enumerated values are machine-interpretable and so any cif2cif-type program that transforms number-like datavalues could inadvertently change the meaning of such a dataname that takes an enumerated value.
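
A dictionary author could check for this risk mechanically. The sketch below is illustrative only: the function name, the input mapping and the datanames in the usage note are hypothetical, and the number pattern approximates the CIF1.1 syntax.

```python
import re

# Assumption: an approximation of the CIF1.1 number syntax, not the normative grammar.
NUMBER_LIKE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


def number_like_enumerations(enumerations):
    """Return, per dataname, the enumerated values that look like numbers.

    enumerations maps a dataname to the list of its allowed values.
    """
    flagged = {}
    for dataname, values in enumerations.items():
        hits = [value for value in values if NUMBER_LIKE.fullmatch(value)]
        if hits:
            flagged[dataname] = hits
    return flagged
```

For example, given {"_example.setting": ["1", "2", "abc"], "_example.flag": ["yes", "no"]}, only _example.setting is reported, with the values "1" and "2". A reformatting program that rewrote 1 as 1.0 would produce a string that no longer matches the enumeration.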

John B. has raised the possibility of returning to the original CIF formulation where any non-delimited number-like string was to be interpreted solely as a number.  Would anybody like to comment on this proposal?  I would note that this would e.g. sometimes force the telephone numbers appearing in current CIFs to be enclosed in delimiters, and some other number-like constructions as well, and definitely require subtle changes to existing software to pick up number-like strings and delimit them.
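
The software change in question would look roughly like the following Python sketch. The function name is hypothetical and the number pattern approximates the CIF1.1 syntax. The sketch deliberately ignores the other reasons a value may need delimiters (embedded whitespace, reserved words, a lone full point or question mark).

```python
import re

# Assumption: an approximation of the CIF1.1 number syntax, not the normative grammar.
NUMBER_LIKE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


def write_char_value(text):
    """Return the text to write for a value known to be of type 'char'.

    Number-like strings are enclosed in single quotes so that a reader
    applying the 'looks like a number' rule cannot take them as numbers.
    """
    if NUMBER_LIKE.fullmatch(text):
        return "'" + text + "'"
    return text
```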

On Thu, Jul 28, 2011 at 7:27 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
cif2cif in the CIFtbx package is the utility program that does the rule of 19 to rule of 9 (e.g.) conversion (among many other things).  It can, of course, use a dictionary (or multiple dictionaries), but does still work
without a dictionary using the CIF rules for recognizing the numb type.

As CIF2 shifts, I do, of course, try to work out the updates to CIFtbx to
get it to adapt.  This particular change seems to me to simply be asking
for trouble with no benefit to any users.


=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Thu, 28 Jul 2011, James Hester wrote:

Herbert: you dispute that section 2.2.5.2 requires that 'non-delimited strings that look like
numbers must always be available as the char type as well'.  Please explain where I have gone wrong
in deriving this conclusion (see my previous email) and I will be happy to adjust my statements. 
Note that (i) I am now not arguing for conflating 4.5 and "4.5", (ii) I am arguing for the status
quo and (iii) that I am imposing no direct requirements on particular software - I'm simply trying
to pin down the abstract data model.

Note also that dictionary-aware software in this context would mean that it is sufficient for the
program author to have looked up dictionaries when writing the software.  The software does not
have to process dictionaries at runtime at all. Do you know of applications which do *not*
hard-code datanames, do *not* dynamically read dictionaries, and *do* go through the long chains of
numeric manipulations that you mention below?  Please describe their function to us as they are an
important use case.

On Thu, Jul 28, 2011 at 7:30 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
     James's comments:


     >This example does not support the need for a 'numb' type, as section
     >2.2.5.2 quoted above implies that non-delimited strings that look
     >like numbers must always be available as 'char' type as well, so
     >there is no danger of the above mistake occurring.  I believe that
     >CIFtbx allows the caller to decide the type of a dataname, so
     >numb('_citation.journal_id_CSD') will return a number, but
     >char('_citation.journal_id_CSD') will return a character string in
     >the top example.  This would imply that CIFtbx is maintaining the
     >string representation internally.  Or can you give a little program
     >chunk where the 'numb'-ness of the top example leads to problems?
     >

The example given was not given to "support the need for a 'numb'
type".  It was given in support of the proposition that it is a bad
idea to conflate "4.5" with 4.5.  Section 2.2.5.2 does not require
that "non-delimited strings that look like numbers must always be
available as the char type as well."  Certainly it is a good practice
within the API for a CIF application to provide that capability while
handling a particular CIF, which is what both CIFtbx and CBFlib do,
but in the more global sense of a workflow of multiple applications
processing CIFs, the precise original string used to represent the
equivalence class of numbers tends to get lost.  One example is the
reprocessing of all numbers for journals requiring rule of 9 versus
rule of 19, etc. on esd's (or what we now call su's) to meet
standards on scientific notation.  We have established 2 decades of
practice in which the  original strings for given numbers are very
intentionally _not_ preserved through a CIF application workflow.
The proposed change would require all such workflows to be updated to
consistently reference dictionaries before reformatting numeric
values.
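
To make the kind of reformatting concrete, here is a minimal Python sketch of re-rounding a value(su) string. It is an illustration only, not code from CIFtbx or cif2cif. The function name reformat_su is hypothetical, and the sketch assumes that "rule of 19" means the su digits in parentheses may not exceed 19 (and "rule of 9" that they may not exceed 9). It handles only plain decimal values without exponents, and it can only reduce the number of digits, never restore them.

```python
from decimal import Decimal, ROUND_HALF_UP


def reformat_su(text, limit):
    """Re-round a 'value(su)' string so the su digits do not exceed limit."""
    mantissa, su_digits = text.rstrip(")").split("(")
    value = Decimal(mantissa)
    exponent = value.as_tuple().exponent      # power of ten of the last written digit
    su = Decimal(su_digits).scaleb(exponent)  # su expressed in the units of the value
    while True:
        if exponent > 0:
            raise ValueError("this sketch only handles an su in the units or decimal places")
        quantum = Decimal(1).scaleb(exponent)
        su_int = int((su / quantum).to_integral_value(rounding=ROUND_HALF_UP))
        if su_int <= limit:
            break
        exponent += 1                         # drop one more trailing digit and retry
    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    return f"{rounded}({su_int})"
```

Here reformat_su("1.2347(123)", 19) gives "1.235(12)" and reformat_su("1.2347(123)", 9) gives "1.23(1)". In both cases the string originally written in the file cannot be recovered from the output.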

Now, as it happens, I agree that it would be a good idea to be able to
retain all variants of both representations and values, but conflating
"4.5" and 4.5 does not seem to help in achieving that desirable goal.
On the contrary it seems to work in the other direction.

I see no benefit to anybody actually using CIF to make such a drastic
change in CIF semantics with so many existing data sets using the
CIF 1 numb data type and so much software already written to conform
to the CIF 1 rules.   I believe this change will pointlessly
delay use of CIF2.



At 1:47 PM +1000 7/27/11, James Hester wrote:
>Let me see if I can clarify the CIF1.1 approach to the 'numb'/'char'
>distinction.  Here is a compilation from IT Vol G of information
>about this distinction:
>
>==============================================================================================
>2.2.5.2 Data typing: "..type numb encompasses all data values that
>are interpretable as numeric values...any CIF reader may encounter
>data names that are not defined in a public or accompanying
>dictionary. It is therefore appropriate to adopt a strategy of
>interpreting as a number any data value that looks like
>one...Therefore, in the absence of a specific counter-indication
>(from a dictionary definition), the data value in the following
>example may be taken as the numeric (integer) value 1:
>
>_unknown_data_name 1
>
>On the other hand, if _unknown_data_name were explicitly defined in
>a dictionary with a data type of 'char', then the value should be
>stored as the literal character 1...Note that numbers within a
>quoted string or a text block are not interpreted as type 'numb' but
>as type 'char'."
>
>2.2.7.1.4 (10): "A simple data value ... may optionally be delimited
>by any of the same set of delimiting character strings, *except* for
>data values that are to be interpreted as numbers"
>
>2.2.7.4.7.1(17): "Where the attributes of a data value are not
>available in a dictionary listing, it may be assumed that a
>character string interpretable as a number should be taken to
>represent an item of type 'numb'.  However, an explicit dictionary
>declaration of type will override such an assumption"
>
>4.9 DDL1 _type: "Type 'numb' identifies items which must have values
>that are identifiable numbers.  The acceptable syntax for these
>numbers is application-dependent."
>
>4.9 DDL1 _type_conditions: "'su' permits a number string to contain
>an appended standard uncertainty number enclosed within parentheses"
>
>4.10 DDL2 _item_type_list.construct, _item_type_list.primitive_code:
>"When a data value can be defined as a pre-determined sequence of
>characters...it is specified as a construction"
>=================================================================================================

>
>I think the above extracts are consistent with Herbert's summary of
>the CIF1 situation.  I attempt to rephrase the situation in terms of
>the abstract datamodel in the following.
>
>Section 2.2.5.2 above states that a non-delimited string that is
>interpretable as a number may actually have 'char' type if a
>dictionary specifies this.  If we wish to allow modular separation
>between CIF parsing applications and CIF dictionary applications (to
>allow CIF parsers to be developed independently of particular domain
>dictionaries, for example), the parser must therefore preserve *all*
>undelimited strings as character sequences, to allow for the
>possibility that those datavalues that appear to be numbers will
>turn out to be character strings.  So, 'numb' values in the formal
>datamodel are actually objects containing two values, the original
>string and the numerical alternative value.  Note that if you defer
>the determination of what is and isn't a 'numb' datavalue to a later
>stage, when you no longer have information about the string
>delimiters used, you may allow delimited strings to be accepted as
>'numb' type, despite the fact that this is a violation of the syntax
>(2.2.7.1.4(10) and the BNF).
>
>The above interpretation is actually consistent with the DDL2
>practice of explicitly describing the syntax of integers and floats
>using POSIX regular expressions on the 'numb' primitive datatype -
>what this is actually doing conceptually is operating on the
>"string" aspect of the  'numb' datatype, and in this way excludes
>quoted strings from interpretation as numbers, even if they match
>the POSIX expression.
>
>OK: the only justification I can see for the existence of the 'numb'
>primitive type as described above is so that delimited number
>strings aren't interpretable as numbers, because that would be an
>unexpected outcome for a human reader. As a fan of human
>readability, I think that CIF2 could usefully continue with the CIF1
>approach to 'numb', however in written documentation (with all due
>respect to the Vol G authors) we should do a better job of
>describing the formal meaning of 'numb', as well as the practical
>outcome.
>
>Note the following consequences of the CIF1 approach, which I hope
>we all accept:
>(1) A delimited numerical value is invalid if the dictionary
>specifies that 'numb' is expected
>(2) A delimited numerical value is a valid number if the
>dictionary/DDL allows numbers to be derived from character strings
>(e.g. by giving a POSIX regex in the DDL2 _item_type_list.construct
>and a primitive code of 'char')
>(3) Dictionary-blind pretty-printers as hypothesised by John B below
>may make mistakes in their pretty-printing if they assume 'numb'
>wrongly.  Likewise, other dictionary-blind software cannot rely on
>apparent 'numb' values really being 'numb'. Successful behaviour
>after assuming 'numb' type is likely, but not guaranteed.  The only
>advantage of 'numb' is human-readability, as described above.
>
>Is everyone happy with my analysis above?  Are we OK with accepting
>the same semantics for CIF2?
>
>(Comment on Herbert's example inserted below)
>
>On Wed, Jul 27, 2011 at 4:41 AM, Herbert J. Bernstein
><yaya@bernstein-plus-sons.com> wrote:
>
>To understand the problem with conflating strings and numbers, look at the
>following tags and values:
>
>_citation.journal_id_ISSN           0036-8075
>_citation.journal_id_CSD            0038
>
>If you have a dictionary, you know both items are strings, not numbers
>and you will reliably keep the leading zeros and not treat the first
>as 36*10^(-8075).  If you don't have a dictionary and are just using,
>say, CIFtbx, you might treat both values as numbers.  Under current
>rules you can protect the values from the numeric interpretation
>even without a dictionary by saying
>
>_citation.journal_id_ISSN           "0036-8075"
>_citation.journal_id_CSD            "0038"
>
>and all is well.  Without that mechanism, you need a dictionary.
>
>
>This example does not support the need for a 'numb' type, as section
>2.2.5.2 quoted above implies that non-delimited strings that look
>like numbers must always be available as 'char' type as well, so
>there is no danger of the above mistake occurring.  I believe that
>CIFtbx allows the caller to decide the type of a dataname, so
>numb('_citation.journal_id_CSD') will return a number, but
>char('_citation.journal_id_CSD') will return a character string in
>the top example.  This would imply that CIFtbx is maintaining the
>string representation internally.  Or can you give a little program
>chunk where the 'numb'-ness of the top example leads to problems?
>
>
>
>
>
>At 10:23 AM -0500 7/26/11, Bollinger, John C wrote:
>>On Monday, July 25, 2011 10:25 PM, James Hester wrote:
>>>In order to minimise the number of issues we have to discuss in
>>>Madrid to clean up CIF2, I would like to turn discussion to those
>>>semantic issues which are relevant to the syntax.  I believe that
>>>there are three possible types of datavalue: "inapplicable",
>>>"unknown" and "string", represented by <full point> (commonly
>>>called a "full stop" or "period"), <question mark> and everything
>>>else, respectively.
>>>
>>>Do we all agree with the following assertion regarding full point
>>>and question mark?
>>>(1) A full point/question mark inside string delimiters is *not*
>>>equivalent to an undelimited full point/question mark
>>>
>>>Numbers: I believe that strings that could be interpreted as
>>>numbers are nevertheless (in a formal sense) just strings in the
>>>context of the post-parse abstract data model.  Therefore, whether
>>>or not a numerical string is delimited does not change its value:
>>>4.5 and "4.5" are identical values.
>>>
>>>Note that this latter assertion does *not* require that
>>>CIF-conformant software must always handle numbers as strings; I am
>>>making these statements in order to clarify the abstract data model
>>>on which the various DDLs and domain dictionaries operate, not to
>>>dictate software design.  If your software can manage any potential
>>>need to swap between string and number representation of your data
>>>value, then more power to you.
>>>
>>>Please state whether you agree or disagree with the above.
>>
>>
>>I agree that a CIF data value comprising only a full point or
>>question mark character is a place-holder value where it is
>>whitespace-delimited, but is an ordinary string value otherwise.  No
>>other data values are place-holders in the CIF sense.  CIF 1.1
>>distinguishes between the meanings of these place-holders, and that
>>distinction may occasionally be useful.
>>
>>
>>From before the advent of CIF dictionaries, CIF 1 specified that
>>data values of certain forms were of numeric type, and values of
>>all other forms were of string type.  Although CIF 1.1 describes
>>this among the common semantic features rather than the syntax
>>specifications, I am uncertain whether that should be interpreted
>>as an intentional technical decision.  Certainly many computer
>>languages treat data typing for literal values as a syntactic
>>issue, but others are very successful with a more freewheeling
>>approach.
>>
>>I agree with James and Brian that it comes down to the practical
>>advantages of making a distinction, and from that perspective I
>>assert
>>
>>
>>1) The distinction is useful only where the appropriate data type
>>would otherwise be unknown, AND the data type is needed for decision
>>making.
>>
>>Knowledge of the appropriate data type could be dynamically derived
>>from a dictionary, but I suspect that most CIF software simply
>>encodes its data type requirements algorithmically (e.g. programs
>>know that _cell_length_a must be numeric).  Since Herbert raises PDB
>>software in particular, I am curious about whether there is
>>practical ambiguity there: what are some of the CIF data items whose
>>data type that software needs but cannot determine other than from
>>their lexical form?  What is a specific consequence that could arise
>>from the software choosing the wrong data type for those items?
>>
>>One of the areas that would be affected is general-purpose CIF
>>tools, such as pretty printers, that rely only on the content of the
>>CIFs presented to them.  Such programs may safely reformat numbers
>>(e.g. switch among pure decimal form and various recognized forms of
>>scientific notation, convert s.u.s from rule-of-29 to rule of 19)
>>only if they can reliably recognize them as numbers.
>>
>>
>>2) The distinction may be practical where it isn't otherwise useful,
>>especially in the sense that it may be built in to a lot of existing
>>software.
>>
>>I know it's built into most CIF software I've ever written.  I'm not
>>sure offhand how significant the impact would be of lifting the
>>distinction.
>>
>>
>>Overall, I am apprehensive about lifting the formal distinction for
>>CIF 1.x, but I am open to considering it for CIF 2.0.  I am not yet
>>persuaded that it would be advantageous, but neither am I persuaded
>>that it would be harmful.
>>
>>
>>Regards,
>>
>>John
>>--
>>John C. Bollinger, Ph.D.
>>Department of Structural Biology
>>St. Jude Children's Research Hospital
>>
>>
>>Email Disclaimer:
>> www.stjude.org/emaildisclaimer
>>
>>_______________________________________________
>>ddlm-group mailing list
>>ddlm-group@iucr.org
>>http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>--
>
>=====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
>=====================================================
>_______________________________________________
>
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>
>
>
>--
>T +61 (02) 9717 9907
>F +61 (02) 9717 3145
>M +61 (04) 0249 4148
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group


--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
