Discussion List Archives


Re: CIF Infoset

On Aug 19 2004, Herbert J. Bernstein wrote:

> At 2:26 PM +0100 8/19/04, Dr P. Murray-Rust wrote:
> >On Aug 19 2004, Herbert J. Bernstein wrote:
> >
> ...
> >The difficulty is not preserving the data type, but the semantics of 
> >downstream decisions. If one author writes _my_phone "123-45678" 
> >they are announcing this is not a number while if another writes 
> >_my_phone 123-45678 they are announcing it is a number. The 
> >discussion so far seems to suggest that these statements overrule 
> >the datatypes specified in the dictionary entries. There is a 
> >particular problem in loop_s, where it is then possible to have 
> >different data types within a column:
> >
> >loop_ _atom_site_occupancy
> >1.0
> >0.3
> >"not refined"
> >"0.3"
> >"."
> >
> >which makes the implementation very difficult. I believe that a 
> >programmer should be able to look up the data type in the dictionary 
> >entry and write a routine that relies on a value being of the 
> >correct data type and throws an exception if not.
> >
> If there is a dictionary, so the type is known, there are no downstream 
> decisions to be made. If the data type is numeric, the non-numeric 
> strings are an error.

Good. This makes things much easier. 

> If the data type is a character type, all the data 
> values are valid. 

Again no problem.

> If there is no dictionary, then the parser designer has 
> to make some context-sensitive typing decisions. The choice in CIFtbx is 
> to infer the typing from the first instance of the data. Other choices 
> could be made, including postponing the typing decision until an entire 
> column is read, but whatever the decision, once it is made, the right 
> thing to do is to report to the user conflicts between the type of the 
> data and the type chosen for the tag.

I understand the logic of this. It is probably manageable if there are only 
char and numb - but becomes impossible if there are many. I am happy to go 
along with any interpretation as long as it's general across the community. 
I understand your proposal as:

Author:
- if it's quoted, it's a char. (Note there are some strings that have to be
  quoted, but they can only be chars anyway.)
- if it's not quoted, no datatype is stated.

- if there is a dictionary the type is defined by that:
  - if dictType is a char, no problem
  - if dictType = numb and authorType is char, then error
  - if dictType = numb and authorType is not stated, try to decode as numb
    - if impossible, throw an error
- if there is no dictType
  - if an item, try to decode as numb; if successful treat as numb, else char
  - if in a loop_ use this logic to decide data type of first value
    - if all types are numb, decide the column is a numb
    - if any types cannot be decoded as numb, make all of them chars
    - never throw any dataType errors
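
The rules above can be sketched in Python roughly as follows. This is only a
minimal illustration of the logic as I have stated it, not code from CIFtbx or
any other CIF library. The names RawValue, CifTypeError, resolve_item and
resolve_column are hypothetical, the numeric pattern is a simplified
assumption, and the special values "." and "?" are deliberately not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a simplified pattern for a number with an optional standard
# uncertainty in parentheses, e.g. 1.0, -0.3, 1.5e3, 0.312(4).
_NUMB = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


class CifTypeError(ValueError):
    """Raised when a value conflicts with the dictionary data type."""


@dataclass
class RawValue:
    text: str
    quoted: bool = False  # True if the author wrote the value in quotes


def looks_numeric(value: RawValue) -> bool:
    # A quoted value is declared char by the author, so it is never numb.
    return not value.quoted and _NUMB.match(value.text) is not None


def resolve_item(value: RawValue, dict_type: Optional[str] = None) -> str:
    """Return 'numb' or 'char' for a single value."""
    if dict_type == "char":
        return "char"  # every value is a valid char
    if dict_type == "numb":
        if value.quoted:
            raise CifTypeError(f"quoted value {value.text!r} for a numb item")
        if not looks_numeric(value):
            raise CifTypeError(f"cannot decode {value.text!r} as numb")
        return "numb"
    # No dictionary: guess from the value and never raise.
    return "numb" if looks_numeric(value) else "char"


def resolve_column(values: List[RawValue],
                   dict_type: Optional[str] = None) -> str:
    """Return one data type for a whole loop_ column."""
    if dict_type is not None:
        for v in values:
            resolve_item(v, dict_type)  # raises on the first conflict
        return dict_type
    # No dictionary: numb only if every value decodes as numb.
    return "numb" if all(looks_numeric(v) for v in values) else "char"


# The _atom_site_occupancy column from earlier in this thread.
occupancy = [
    RawValue("1.0"),
    RawValue("0.3"),
    RawValue("not refined", quoted=True),
    RawValue("0.3", quoted=True),
    RawValue(".", quoted=True),
]
```

With no dictionary, resolve_column(occupancy) makes the whole column char and
raises nothing. With dict_type="numb" it raises CifTypeError at the first
quoted value, which is the behaviour I would want a programmer to be able to
rely on.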

I can live with this (as I expect that many authors will make up their own 
data types without dictionaries). However, I think this (and other recent 
discussions) needs formalising in the spec. It is unlikely that implementers 
will work this out consistently!


> It is a bit like the problem of 
> working with an XML dataset without the DTD. You have to guess a bit on 
> what is legal where, and sometimes you guess wrong. 

Yes, but XML only has one dataType (string) if a DTD is not provided.

> It is best to have 
> the dictionaries in CIF just as it is best to have DTDs or schema in XML.

I agree. I think it's almost essential.


>    -- Herbert
comcifs mailing list
