(62) pdCIF category semantics

  • To: COMCIFS@iucr.ac.uk
  • Subject: (62) pdCIF category semantics
  • From: bm
  • Date: Wed, 7 May 1997 11:09:58 +0100
Dear Colleagues

D61.1 pdCIF categories
----------------------
In response to the previous circular, David Brown has contributed the
following mission statement for the current round of discourse:

D> 	I have had a chance to look at Paula's comments and find her way
D> of listing the data names and categories is very revealing.  Some of the
D> detailed concerns she raises Brian T will be able to answer, but the
D> general impression is that a lot of tidying up is needed.  The dictionary
D> seems to regard categories only as a convenient way to lay out the
D> dictionary, presumably because the powder diffraction community does not
D> see any advantage in using categories to organise the data.  However, the
D> purpose of the comcifs approval is to ensure that all dictionaries conform
D> to common data standards so that they can be read by common software and
D> used in ways that we cannot envisage at the present.  Of course, they must
D> also provide for the needs of the user community whose immediate
D> requirements may be more modest. 
D> 
D> 	Before approving the powder dictionary, comcifs has to make a
D> decision about which way it is going, because every approved dictionary
D> acts as a precedent for the future.  We can decide that loose data
D> structures are OK if that is what the community wants, but if we do that,
D> we should do it with our eyes open and realise that we may be limiting our
D> options later.  If on the other hand we decide that current conventions
D> are likely to harden into rules as we exploit the possibilities that
D> tight structures offer for computer manipulation of the data, then we
D> should insist on all dictionaries conforming, whether or not the community
D> feels that such a tight structure is necessary for their purposes.  
D> 
D> 	I would like to hear from all our members on this question.  It is
D> an important question that needs to be resolved before we give final
D> approval to the powder dictionary. 


Peter Murray-Rust makes the following points:

PMR> I think the problem over categories is because there are implied semantics
PMR> in the names that - if not very carefully described - will cause software
PMR> implementors a lot of problems.  (I have certainly found this for some of
PMR> the CIF-related software I have written.)  
PMR> 
PMR> I have implemented software that manages coreCIF - dictionaries and
PMR> data - and it's considerably harder than it might appear because of the
PMR> semantics.  Amongst other things I have been developing glossary
PMR> technology and have been using ISO12620 which is a terminology for
PMR> terminology.  You might find this useful - it's included in the
PMR> software I mention, I think.
PMR> 	For a simple glossary (and CIF is more - it's a data dictionary)
PMR> there are three things that seem relevant:
PMR> 	- ID - the string you use to retrieve the entry
PMR> 	- term - the 'name' you use for the entry
PMR> 	- concept - which corresponds to CIF category
PMR> In some dictionaries these have well-defined semantics within the names
PMR> (e.g.  1.2 is the parent concept (category) of 1.2.3).  However it's also
PMR> possible to have completely unrelated strings to manage these (and most
PMR> concepts are not substrings of the terms they contain, or which are
PMR> derived from them).  It would appear in the powder dictionary that there
PMR> is no formal relationship between term and parent concept - which would
PMR> appear to be
PMR> allowed.  This is no problem for my software, which uses the category as a
PMR> concept, but it is clearly a problem for humans, or for software which
PMR> makes assumptions without being clearly warned of the problem.
PMR> 	I am not suggesting that the CIF syntax is redesigned because
PMR> that's not feasible, but it needs to be stated whether terms have any
PMR> formal semantics.  If they *have* you must be consistent, and if they
PMR> haven't you must educate the implementors to treat terms as unique strings
PMR> and nothing more.  IOW you must write very clearly that term (_name) is
PMR> NOT parsable, however tempting it looks.  
PMR> 	If names are not parsable, the leading substring has questionable
PMR> value and might be omittable in future terms, so long as it's humanly
PMR> understandable.  If it's any consolation, namespace is a very difficult
PMR> problem and I'm going through this in other disciplines.  IMO any future
PMR> developments should include the possibility of a formal hierarchical
PMR> namespace where the term started with one (or more) categories like:
PMR> 	core.cell.length.a
PMR> and where the dots represent formal syntactic constructs. But we have been
PMR> over this several times already.
PMR> 
PMR> If it helps, my software can be downloaded from:
PMR> http://www.vsms.nottingham.ac.uk/vsms/java/jumbo
PMR> and it includes (I think) a complete hierarchical parsing of the core CIF
PMR> by category.  This can be browsed and navigated with the JUMBO browser and 
PMR> runs under any Java-enabled WWW browser such as Netscape.  My experience
PMR> has been that it's important to develop software at the same rate as
PMR> languages (e.g. CIF) are developed because they are usually much more
PMR> complex than they appear on human reading.

This is very nicely put. It seems to me that DDL2 is a formalism that does
imbue the datanames with semantic content (in my ciftex printouts for mmCIF
I suppress the '_category' field because it is redundant against the dot in
the data name). DDL1 does not; the '_category' must be defined in all cases,
though for the sake of tidiness (and to assist in conversion to DDL2) the
existing datanames in the Core (with a few historic exceptions) include the
name of their parent category as their first portion. Thus, formally, there
is no reason why the powder dictionary shouldn't include '_pd_refln_peak_id'
in (Core) category REFLN, and both '_pd_proc_intensity_net' and
'_pd_meas_counts_total' in category PD_DATA. Such an approach to naming
is not acceptable in DDL2, and the powder dictionary would need to be
modified substantially in a migration to DDL2 formalism.
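
To make the distinction concrete, here is a minimal sketch in Python. The
table and function names are invented for illustration, and the category
assignments are simply the hypothetical ones mentioned above, not the
contents of any approved dictionary. It contrasts an explicit DDL1-style
'_category' lookup with the DDL2 convention of reading the category from
the text before the dot, and shows where guessing from the leading
substring goes wrong.

```python
# Hypothetical DDL1-style table: the category is recorded explicitly for
# every data name (assignments taken from the examples in the text above).
DDL1_CATEGORY = {
    "_pd_refln_peak_id": "refln",
    "_pd_proc_intensity_net": "pd_data",
    "_pd_meas_counts_total": "pd_data",
    "_cell_length_a": "cell",
}


def category_ddl1(name):
    """DDL1: look the category up; never parse it out of the name."""
    return DDL1_CATEGORY.get(name.lower())


def category_ddl2(name):
    """DDL2: the text between the leading underscore and the dot."""
    if not name.startswith("_") or "." not in name:
        return None
    return name[1:].split(".", 1)[0].lower()


def naive_prefix_guess(name, known_categories):
    """The tempting approach: treat a leading substring as the category."""
    body = name[1:].lower()
    matches = [c for c in known_categories if body.startswith(c + "_")]
    return max(matches, key=len) if matches else None


if __name__ == "__main__":
    known = set(DDL1_CATEGORY.values())
    for name in DDL1_CATEGORY:
        print(name, category_ddl1(name), naive_prefix_guess(name, known))
    print("_cell.length_a", category_ddl2("_cell.length_a"))
```

With these assumed assignments the prefix guess agrees with the lookup
for '_cell_length_a' but finds nothing for the three '_pd_' names, which
is the kind of trap Peter describes.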

There is, however, a point to be made about 'looseness', as referred to by
David. Controlled 'looseness' can be construed as 'flexibility', and there
may be applications that benefit more from flexibility than from rigidity.
The more flexible the data structure, the more complex it will be to
program completely, of course.

I have here a concern that I would like to air about the introduction of
local data names to a CIF. It's one of the early principles of CIF that
developers can invent or introduce their own data names for their own
purposes, and such undefined data names should not break CIF reading
software. Is this principle preserved in the mmCIF software that Paula is
familiar with? Which - if any - of the following cases would
be considered to invalidate the CIF (where "SOMETHING" is not a known
category in the mmCIF dictionary, and "MMTHING" is a known category)?

(a)     _something_unknown   "on its own"

(b)     _something.unknown   "on its own, but with a dot"

(c)     _mmthing_unknown     "undefined dataname" 

(d)     _mmthing.unknown     "category is defined, dataname isn't"

and how about

(f)    loop_ _mmthing.defined_1 _mmthing.defined_2 _something_unknown

(g)    loop_ _mmthing.defined_1 _mmthing.defined_2 _something.unknown

(h)    loop_ _mmthing.defined_1 _mmthing.defined_2 _mmthing_unknown

(i)    loop_ _mmthing.defined_1 _mmthing.defined_2 _mmthing.unknown
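
The cases above differ in only three respects: whether the name contains
a dot, whether its leading portion matches a known category, and whether
the full name is defined. The Python sketch below merely tabulates those
three properties for each name; it does not say which combinations a
given piece of software would accept. The dictionary contents are the
placeholder names used in the cases, and the function name is invented
for illustration.

```python
# Placeholder dictionary contents, matching the cases listed above.
KNOWN_CATEGORIES = {"mmthing"}
KNOWN_ITEMS = {"_mmthing.defined_1", "_mmthing.defined_2"}


def describe(name):
    """Report the three properties that distinguish the cases."""
    lname = name.lower()
    has_dot = "." in lname
    if has_dot:
        # DDL2 style: the category is the text before the dot.
        prefix = lname[1:].split(".", 1)[0]
    else:
        # No dot: see whether the name merely starts with a known category.
        prefix = next(
            (c for c in KNOWN_CATEGORIES if lname[1:].startswith(c + "_")),
            None,
        )
    return {
        "has_dot": has_dot,
        "prefix_is_known_category": prefix in KNOWN_CATEGORIES,
        "item_defined": lname in KNOWN_ITEMS,
    }


if __name__ == "__main__":
    for name in ("_something_unknown", "_something.unknown",
                 "_mmthing_unknown", "_mmthing.unknown",
                 "_mmthing.defined_1"):
        print(name, describe(name))
```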

This has some relevance to the question of future handling of data files; if
a powder file is presented in DDL1 formalism, DDL2 software can map most of
the Core datanames to their DDL2 equivalents as aliases, leaving the 
_pd_ items (for which there may be no DDL2 equivalents) as "foreign"
datanames which don't compromise the syntactic integrity of the file, but
cannot be validated  semantically. If this scheme works, pdCIF data files
can still be handled by advanced software, while not being able to enjoy the
full power of such software.
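
As a rough illustration of that scheme, the Python sketch below splits
the data names found in a file into those with a DDL2 alias and those
left as "foreign". The alias table is a two-entry sample for illustration
only, not a complete mapping, and the function name is invented here.

```python
# Tiny illustrative alias table (DDL1 Core name -> DDL2 equivalent).
DDL1_TO_DDL2 = {
    "_cell_length_a": "_cell.length_a",
    "_refln_index_h": "_refln.index_h",
}


def translate(names):
    """Split data names into (aliased, foreign).

    Aliased names map to their DDL2 equivalents; foreign names are kept
    untouched so the file stays readable, but they cannot be validated.
    """
    aliased, foreign = {}, []
    for name in names:
        target = DDL1_TO_DDL2.get(name.lower())
        if target is None:
            foreign.append(name)
        else:
            aliased[name] = target
    return aliased, foreign


if __name__ == "__main__":
    aliased, foreign = translate(
        ["_cell_length_a", "_pd_meas_counts_total", "_refln_index_h"]
    )
    print("aliased:", aliased)
    print("foreign:", foreign)
```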

I retain a fairly open mind on this matter; I still think I could approve
the dictionary as presented (with some amendments to take account of the
particular anomalies Paula has noted), though the accompanying documentation
should very clearly describe the limitations of the category structure (or
lack of structure) adopted. But I have no fundamental objection to a more
normalised category structure.

Regards
Brian