Discussion List Archives

Re: CIF Infoset

On Aug 19 2004, Herbert J. Bernstein wrote:

> At 2:26 PM +0100 8/19/04, Dr P. Murray-Rust wrote:
> >On Aug 19 2004, Herbert J. Bernstein wrote:
> >
> ...
> 
> >The difficulty is not preserving the data type, but the semantics of 
> >downstream decisions. If one author writes _my_phone "123-45678" 
> >they are announcing this is not a number while if another writes 
> >_my_phone 123-45678 they are announcing it is a number. The 
> >discussion so far seems to suggest that these statements overrule 
> >the datatypes specified in the dictionary entries. There is a 
> >particular problem in loop_s, where it is then possible to have 
> >different data types within a column:
> >
> >loop_ _atom_site_occupancy
> >1.0
> >0.3
> >"not refined"
> >"0.3"
> >"."
> >
> >which makes the implementation very difficult. I believe that a 
> >programmer should be able to look up the data type in the dictionary 
> >entry and write a routine that relies on a value being of the 
> >correct data type and throws an exception if not.
> >
> 
> If there is a dictionary, so the type is known, there are no downstream 
> decisions to be made. If the data type is numeric, the non-numeric 
> strings are an error.

Good. This makes things much easier. 

> If the data type is a character type, all the data 
> values are valid. 

Again no problem.

> If there is no dictionary, then the parser designer has 
> to make some context-sensitive typing decisions. The choice in CIFtbx is 
> to infer the typing from the first instance of the data. Other choices 
> could be made, including postponing the typing decision until an entire 
> column is read, but whatever the decision, once it is made, the right 
> thing to do is to report to the user conflicts between the type of the 
> data and the type chosen for the tag.
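
To make the "first instance" choice concrete, here is a minimal sketch in Python. It is not CIFtbx code: the number pattern NUMB, the helper names, and the decision to report conflicts only when the chosen type is numb (since any value is a valid char) are all assumptions made for illustration.

```python
import re

# Assumed pattern for a CIF-style number: optional sign, digits with an
# optional decimal point, optional exponent, optional uncertainty "(5)".
NUMB = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


def infer_type(token):
    """Return "numb" if the whole token matches the assumed pattern, else "char"."""
    return "numb" if NUMB.fullmatch(token) else "char"


def type_by_first_value(values):
    """Type a column from its first value and list the values that conflict.

    Returns (chosen_type, conflicts), where conflicts is a list of
    (index, value) pairs that cannot be read as the chosen type.
    """
    if not values:
        raise ValueError("empty column")
    chosen = infer_type(values[0])
    conflicts = []
    if chosen == "numb":
        # Only a numb column can contain values that fail to decode.
        conflicts = [(i, v) for i, v in enumerate(values)
                     if infer_type(v) != "numb"]
    return chosen, conflicts
```

With the column 1.0, 0.3, not refined, the sketch chooses numb and reports the third value as a conflict for the user, rather than silently retyping the column.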

I understand the logic of this. It is probably manageable if there are only 
char and numb - but becomes impossible if there are many. I am happy to go 
along with any interpretation as long as it's general across the community. 
I understand your proposal as:

Author:
- if it's quoted, it's a char. (Note there are some strings that have to be 
  quoted, but they can only be chars anyway.)
- if it's not quoted, no datatype is stated.

Reader:
- if there is a dictionary the type is defined by that:
  - if dictType = char, no problem
  - if dictType = numb and authorType is char, then error
  - if dictType = numb and authorType is not stated, try to decode as numb
    - if impossible, throw an error
- if there is no dictType
  - if an item, try to decode as numb; if successful treat as numb, else char
  - if in a loop_ use this logic to decide data type of first value
    - if all types are numb, decide the column is a numb
    - if any types cannot be decoded as numb, make all of them chars
    - never throw any dataType errors
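
As a rough check that these rules are implementable, here is a minimal Python sketch. Everything named in it is hypothetical: the NUMB pattern, DataTypeError, resolve_type, and the representation of each value as a (text, was_quoted) pair. It also assumes that, without a dictionary, a quoted value counts as char even if its text looks numeric, which the rules above do not state explicitly.

```python
import re

# Assumed pattern for a CIF-style number (sign, digits, optional exponent,
# optional standard uncertainty in parentheses).
NUMB = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


class DataTypeError(ValueError):
    """Raised when a value contradicts the dictionary data type."""


def can_decode_numb(text):
    return NUMB.fullmatch(text) is not None


def resolve_type(values, dict_type=None):
    """Decide the type of an item (one value) or a loop_ column (many).

    values    : list of (text, was_quoted) pairs
    dict_type : "char", "numb", or None when there is no dictionary
    """
    if dict_type == "char":
        return "char"  # every value is a valid char
    if dict_type == "numb":
        for text, quoted in values:
            if quoted:
                raise DataTypeError(f"quoted value {text!r} in numb item")
            if not can_decode_numb(text):
                raise DataTypeError(f"cannot decode {text!r} as numb")
        return "numb"
    # No dictionary: never raise; numb only if every value decodes as numb.
    if all(not quoted and can_decode_numb(text) for text, quoted in values):
        return "numb"
    return "char"
```

For the _atom_site_occupancy column quoted earlier (1.0, 0.3, "not refined", "0.3", "."), this returns char when no dictionary is present and raises DataTypeError when the dictionary says numb.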

I can live with this (as I expect that many authors will make up their own 
data types without dictionaries). However, I think this (and other recent 
discussions) needs formalising in the spec. It is unlikely that implementers 
will work this out consistently!

P.

> It is a bit like the problem of 
> working with an XML dataset without the DTD. You have to guess a bit on 
> what is legal where, and sometimes you guess wrong. 

Yes, but XML only has one dataType (string) if a DTD is not provided.

> It is best to have 
> the dictionaries in CIF just as it is best to have DTDs or schema in XML.

I agree. I think it's almost essential.

P.

>    -- Herbert
> 

