Discussion List Archives

Re: [ddlm-group] CIF2 semantics

> Do we all agree with the following assertion regarding full point
> and question mark?
> (1) A full point/question mark inside string delimiters is *not*
> equivalent to an undelimited full point/question mark

I agree with this assertion. The ITG description "The special values
of '.' and '?' represent data that are inapplicable and unknown,
respectively." is at pains to stress that there is a semantic
distinction, albeit a subtle one, between the two cases. As James
notes, there is often a "relatively harmless" confusion between the
two, though it can be significant. CIFs generated from the CCDC
typically have ? for the symmetry operators in geometry loops. I
think this is correct: the database has not recorded the symmetry
of these positions (probably because they were not supplied in the
original publication from which the information was abstracted),
and while you'll probably get away with guessing that's because
they are 1_555, it's not guaranteed.
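
As a rough illustration of assertion (1), the sketch below classifies a
single, already isolated token. It is not a CIF tokenizer: it assumes
single-line values delimited by matching apostrophes or double quotes,
and the names Kind and classify are invented here for the example.

```python
from enum import Enum


class Kind(Enum):
    INAPPLICABLE = "inapplicable"   # undelimited full point
    UNKNOWN = "unknown"             # undelimited question mark
    STRING = "string"               # everything else


DELIMITERS = ("'", '"')


def classify(token):
    """Return (kind, text) for one raw token taken from a CIF line."""
    # A delimited token is always a string, even if its content is . or ?
    if len(token) >= 2 and token[0] in DELIMITERS and token[-1] == token[0]:
        return Kind.STRING, token[1:-1]
    if token == ".":
        return Kind.INAPPLICABLE, None
    if token == "?":
        return Kind.UNKNOWN, None
    return Kind.STRING, token
```

Under these assumptions a bare ? in a geometry loop comes back as
Kind.UNKNOWN with no text, whereas '?' comes back as a one-character
string, so the two cannot be confused after parsing.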

However, I do have some concern over the way that the unquoted literal
full point is also advanced as an enumeration value with a specific
(implied) value in a number of coreCIF definitions (for example, the
symmetry operation associated with an atom site). Volume G explains
it thus:

    The substitution of the full-point character `.' in place of a
    CIF data value serves two similar, but not identical, purposes. If it
    is used in looped lists of data it is normally a signal that a value
    in a particular packet (i.e. a value in the row of the table) is
    `inapplicable' or `inappropriate'. In some CIF applications involving
    access to a data dictionary it is used to signal that the default
    value of the item is defined in its definition in the
    dictionary. Consequently, the interpretation of this signal is an
    application-specific matter and its use must be determined according
    to the application. For example, in a CIF submitted for publication in
    Acta Crystallographica the presence of a `.' value for the item
    _geom_bond_site_symmetry_1 is predetermined as the default value 1_555
    (as per the dictionary definition). Note that, in this instance, it is
    also equivalent to `no additional symmetry' or `inapplicable'.

That phrase "no ADDITIONAL symmetry" feels somewhat forced. Given the
number of existing CIF1 data files, I propose that we live with this,
but I would be interested if anyone could come up with a cleaner or
clearer rationalisation.
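
The "default value" reading of the full point that Volume G describes
could be sketched along the following lines. The DEFAULTS table, the
two sentinel objects and the resolve function are assumptions made for
this example; only the 1_555 default for _geom_bond_site_symmetry_1
comes from the passage quoted above.

```python
# Hypothetical excerpt of dictionary defaults, keyed by data name.
DEFAULTS = {"_geom_bond_site_symmetry_1": "1_555"}

# Sentinels standing in for the undelimited . and ? values.
INAPPLICABLE = object()
UNKNOWN = object()


def resolve(tag, value, defaults=DEFAULTS):
    """Replace an inapplicable value by the dictionary default, if any."""
    if value is INAPPLICABLE and tag in defaults:
        return defaults[tag]
    # Unknown values and ordinary strings are passed through untouched.
    return value
```

In this sketch an unknown value is deliberately never replaced by the
default, which mirrors the point that guessing 1_555 for a ? is not
guaranteed to be right.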

Note how the extracts from Volume G unhelpfully place the characters
under discussion within quote marks, though the eagle-eyed will notice
that in the printed volume the quote marks are in Times font!

----

>>> Numbers: I believe that strings that could be interpreted as numbers
>>> are nevertheless (in a formal sense) just strings in the context of
>>> the post-parse abstract data model.  Therefore, whether or not a
>>> numerical string is delimited does not change its value: 4.5 and
>>> "4.5" are identical values.

I guess the distinction is that in isolation you don't know whether 4.5
means the quantity halfway between four and five, or the software revision
preceding 4.5.1 (or even 4.5.2beta). The assumption behind seeking to
differentiate these cases with syntactic quoting is that you're not
relying on type declarations in a dictionary to tell an application
how to treat this - as an unalterable string or as a quantity that can
be subjected to arithmetic manipulations.

I do wonder how maintaining the distinction actually does help
non-dictionary-based software. I can see that fixed-format FORTRAN
i/o benefits from knowing that columns 27-32 represent a floating-point
number, but I suppose that even FORTRAN CIF parsers must account for the
free-format nature of the CIF by isolating the string value and
subsequently determining how to convert it to a number. If the
decisions on how to do so are based only on hand-coding according
to known tags, then I see no reason why one cannot add a
"delimiter-stripper" function to the necessary routines. I'm
genuinely curious here. I don't have a strong a priori prejudice
against or in favour of maintaining the formal distinction
between non-quoted and quoted numbers.
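
A "delimiter-stripper" of the kind suggested above might look roughly
like this. It is only a sketch: NUMERIC_TAGS stands in for a hand-coded
list of tags known to be numeric, _example_software_version is a
made-up data name, and standard uncertainties in parentheses are not
handled.

```python
# Hypothetical hand-coded list of tags the application treats as numbers.
NUMERIC_TAGS = {"_cell_length_a"}


def strip_delimiters(raw):
    """Remove one pair of matching quote delimiters, if present."""
    if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
        return raw[1:-1]
    return raw


def read_value(tag, raw):
    """Convert by tag alone, ignoring whether the value was quoted."""
    text = strip_delimiters(raw)
    if tag in NUMERIC_TAGS:
        return float(text)
    return text
```

With this approach 4.5 and "4.5" give the same result for a given tag:
a float for _cell_length_a and the string 4.5 for
_example_software_version. The null markers . and ? would have to be
recognised before the delimiters are stripped, since quoting does
change their meaning.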

Regards
Brian

On Tue, Jul 26, 2011 at 11:24:15PM +1000, James Hester wrote:
> I take it from the comment below that Herbert agrees to continue with the IT
> Vol G descriptions of the meanings of . and ?.  I am aware that one often
> finds a relatively harmless confusion between the two, most obviously when ?
> is used as a placeholder in a loop instead of the usually more appropriate
> <full point>.  This confusion should encourage us to provide clarification
> in the formal specification.
> 
> Regarding numbers, could Herbert or others who wish 4.5 and "4.5" to have
> different abstract types, whereas kkkkk and "kkkkk" have the same abstract
> type, please explain why this behaviour is preferable, how it allows useful
> work to be done, etc. Meanwhile I'll prepare a post describing my reasoning
> for more uniform behaviour.
> 
> On Tue, Jul 26, 2011 at 10:13 PM, Herbert J. Bernstein <
> yaya@bernstein-plus-sons.com> wrote:
> 
>> On null values, I believe "." and "?" are different in meaning from
>> their unquoted versions, but that unquoted . and ? are both essentially
>> equivalent null values.
>>
>> On numbers, past practice has been to treat 4.5 and "4.5" as very
>> different, the former being a type numb value and the latter being
>> a type char value.  This was an important and significant early
>> difference between CIF and STAR and has been used in the handling of
>> the number-like strings that arise in PDB bib entries, e.g.
>> 1234-5678 is the number 1234e-5678, but "1234-5678" is a string
>>
>>
>> At 1:24 PM +1000 7/26/11, James Hester wrote:
>>> Dear DDLm group,
>>> 
>>> In order to minimise the number of issues we have to discuss in
>>> Madrid to clean up CIF2, I would like to turn discussion to those
>>> semantic issues which are relevant to the syntax.  I believe that
>>> there are three possible types of datavalue: "inapplicable",
>>> "unknown" and "string", represented by <full point> (commonly called
>>> a "full stop" or "period"), <question mark> and everything else,
>>> respectively.
>>> 
>>> Do we all agree with the following assertion regarding full point
>>> and question mark?
>>> (1) A full point/question mark inside string delimiters is *not*
>>> equivalent to an undelimited full point/question mark
>>> 
>>> Numbers: I believe that strings that could be interpreted as numbers
>>> are nevertheless (in a formal sense) just strings in the context of
>>> the post-parse abstract data model.  Therefore, whether or not a
>>> numerical string is delimited does not change its value: 4.5 and
>>> "4.5" are identical values.
>>> 
>>> Note that this latter assertion does *not* require that
>>> CIF-conformant software must always handle numbers as strings; I am
>>> making these statements in order to clarify the abstract data model
>>> on which the various DDLs and domain dictionaries operate, not to
>>> dictate software design.  If your software can manage any potential
>>> need to swap between string and number representation of your data
>>> value, then more power to you.
>>> 
>>> Please state whether you agree or disagree with the above.
>>> 
>>> James.
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>> 
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>>
>> --
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>> =====================================================