(62) pdCIF category semantics
- To: COMCIFS@iucr.ac.uk
- Subject: (62) pdCIF category semantics
- From: bm
- Date: Wed, 7 May 1997 11:09:58 +0100
Dear Colleagues

D61.1 pdCIF categories
----------------------

In response to the previous circular, David Brown has contributed the
following mission statement for the current round of discourse:

D> I have had a chance to look at Paula's comments and find her way
D> of listing the data names and categories is very revealing. Some of the
D> detailed concerns she raises Brian T will be able to answer, but the
D> general impression is that a lot of tidying up is needed. The dictionary
D> seems to regard categories only as a convenient way to layout the
D> dictionary, presumably because the powder diffraction community does not
D> see any advantage in using categories to organise the data. However, the
D> purpose of the comcifs approval is to ensure that all dictionaries conform
D> to common data standards so that they can be read by common software and
D> used in ways that we cannot envisage at the present. Of course, they must
D> also provide for the needs of the user community whose immediate
D> requirements may be more modest.
D>
D> Before approving the powder dictionary, comcifs has to make a
D> decision about which way it is going, because every approved dictionary
D> acts as a precedent for the future. We can decide that loose data
D> structures are OK if that is what the community wants, but if we do that,
D> we should do it with our eyes open and realise that we may be limiting our
D> options later. If on the other hand we decide that current conventions
D> are likely to harden into rules as we exploit the possibilities that
D> tight structures offer for computer manipulation of the data, then we
D> should insist on all dictionaries conforming, whether or not the community
D> feels that such a tight structure is necessary for their purposes.
D>
D> I would like to hear from all our members on this question. It is
D> an important question that needs to be resolved before we give final
D> approval to the powder dictionary.

Peter Murray-Rust makes the following points:

PMR> I think the problem over categories is because there are implied semantics
PMR> in the names that - if not very carefully described - will cause software
PMR> implementors a lot of problems. (I have certainly found this for some of
PMR> the CIF-related software I have written.)
PMR>
PMR> I have implemented software that manages coreCIF - dictionaries and
PMR> data - and it's considerably harder than it might appear because of the
PMR> semantics. Amongst other things I have been developing glossary
PMR> technology and have been using ISO12620 which is a terminology for
PMR> terminology. You might find this useful - it's included in the
PMR> software I mention, I think.
PMR> For a simple glossary (and CIF is more - it's a data dictionary)
PMR> there are three things that seem relevant:
PMR> - ID - the string you use to retrieve the entry
PMR> - term - the 'name' you use for the entry
PMR> - concept - which corresponds to CIF category
PMR> In some dictionaries these have well-defined semantics within the names
PMR> (e.g. 1.2 is the parent concept (category) of 1.2.3). However it's also
PMR> possible to have completely unrelated strings to manage these (and most
PMR> concepts are not substrings of the terms they contain, or which are
PMR> derived from them). It would appear in the powder dictionary that there
PMR> is no formal relationship between term and parent concept - which would
PMR> appear to be allowed. This is no problem for my software, which uses the
PMR> category as a concept, but it is clearly a problem for humans, or for
PMR> software which makes assumptions without being clearly warned of the
PMR> problem.
PMR> I am not suggesting that the CIF syntax is redesigned because
PMR> that's not feasible, but it needs to be stated whether terms have any
PMR> formal semantics.
PMR> If they *have* you must be consistent, and if they
PMR> haven't you must educate the implementors to treat terms as unique strings
PMR> and nothing more. IOW you must write very clearly that term (_name) is
PMR> NOT parsable, however tempting it looks.
PMR> If names are not parsable, the leading substring has questionable
PMR> value and might be omittable in future terms, so long as it's humanly
PMR> understandable. If it's any consolation, namespace is a very difficult
PMR> problem and I'm going through this in other disciplines. IMO any future
PMR> developments should include the possibility of a formal hierarchical
PMR> namespace where the term started with one (or more) categories like:
PMR>      core.cell.length.a
PMR> and where the dots represent formal syntactic constructs. But we have been
PMR> over this several times already.
PMR>
PMR> If it helps, my software can be downloaded from:
PMR>      http://www.vsms.nottingham.ac.uk/vsms/java/jumbo
PMR> and it includes (I think) a complete hierarchical parsing of the core CIF
PMR> by category. This can be browsed and navigated with the JUMBO browser and
PMR> runs under any Java-enabled WWW browser such as Netscape. My experience
PMR> has been that it's important to develop software at the same rate as
PMR> languages (e.g. CIF) are developed because they are usually much more
PMR> complex than appear on human reading.

This is very nicely put. It seems to me that DDL2 is a formalism that does
imbue the datanames with semantic content (in my ciftex printouts for mmCIF I
suppress the '_category' field because it is redundant given the dot in the
data name). DDL1 does not; the '_category' must be defined in all cases,
though for the sake of tidiness (and to assist in conversion to DDL2) the
existing datanames in the Core (with a few historic exceptions) include the
name of their parent category as their first portion.
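
The distinction between the two formalisms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table is a tiny hypothetical extract rather than a real dictionary file, and the function names (category_ddl1, category_ddl2, naive_prefix_guess) are invented for the example. It contrasts looking up a declared '_category' (DDL1), splitting a name at the dot (DDL2), and the tempting but unsafe habit of guessing the category from the leading substring of a DDL1 name.

```python
# Hypothetical, tiny extract of a DDL1-style dictionary: every data name
# carries an explicit _category, so the name itself is just an opaque key.
DDL1_CATEGORIES = {
    "_cell_length_a": "cell",
    "_atom_site_label": "atom_site",
    # A placement that DDL1 would formally permit, as discussed in this thread.
    "_pd_refln_peak_id": "refln",
}


def category_ddl1(name, dictionary=DDL1_CATEGORIES):
    """Return the declared category, or None if the name is not defined."""
    return dictionary.get(name)


def category_ddl2(name):
    """Read the category from a DDL2-style name of the form '_category.item'.

    Returns None when the name is not in that form (for example, no dot).
    """
    if not name.startswith("_") or "." not in name:
        return None
    category, _, item = name[1:].partition(".")
    return category if category and item else None


def naive_prefix_guess(name, known_categories):
    """Unsafe for DDL1: guess the category from the longest matching prefix."""
    body = name[1:] if name.startswith("_") else name
    matches = [c for c in known_categories if body.startswith(c + "_")]
    return max(matches, key=len) if matches else None


if __name__ == "__main__":
    known = set(DDL1_CATEGORIES.values())
    for name in DDL1_CATEGORIES:
        print(name, category_ddl1(name), naive_prefix_guess(name, known))
    print("_cell.length_a", category_ddl2("_cell.length_a"))
```

For the two Core names the prefix guess happens to agree with the declared category, but for '_pd_refln_peak_id' it finds nothing although the table declares 'refln'. Under DDL1 only the lookup can be relied upon.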

Thus, formally, there is no reason why the powder dictionary shouldn't include
'_pd_refln_peak_id' in (Core) category REFLN, and both '_pd_proc_intensity_net'
and '_pd_meas_counts_total' in category PD_DATA. Such an approach to naming is
not acceptable in DDL2, and the powder dictionary would need to be modified
substantially in a migration to DDL2 formalism.

There is, however, a point to be made about 'looseness', as referred to by
David. Controlled 'looseness' can be construed as 'flexibility', and there may
be applications that benefit more from flexibility than from rigidity. The
more flexible the data structure, the more complex it will be to program
completely, of course.

I have here a concern that I would like to air about the introduction of local
data names to a CIF. It's one of the early principles of CIF that developers
can invent or introduce their own data names for their own purposes, and such
undefined data names should not break CIF reading software. Is this principle
preserved in the mmCIF software that Paula is familiar with? Which - if any -
of the following cases would be considered to invalidate the CIF (where
"SOMETHING" is not a known category in the mmCIF dictionary, and "MMTHING" is
a known category)?

  (a)   _something_unknown     "on its own"
  (b)   _something.unknown     "on its own, but with a dot"
  (c)   _mmthing_unknown       "undefined dataname"
  (d)   _mmthing.unknown       "category is defined, dataname isn't"

and how about

  (f)   loop_
          _mmthing.defined_1
          _mmthing.defined_2
          _something_unknown

  (g)   loop_
          _mmthing.defined_1
          _mmthing.defined_2
          _something.unknown

  (h)   loop_
          _mmthing.defined_1
          _mmthing.defined_2
          _mmthing_unknown

  (i)   loop_
          _mmthing.defined_1
          _mmthing.defined_2
          _mmthing.unknown

This has some relevance to the question of future handling of data files; if a
powder file is presented in DDL1 formalism, DDL2 software can map most of the
Core datanames to their DDL2 equivalents as aliases, leaving the _pd_ items
(for which there may be no DDL2 equivalents) as "foreign" datanames which
don't compromise the syntactic integrity of the file, but cannot be validated
semantically. If this scheme works, pdCIF data files can still be handled by
advanced software, while not being able to enjoy the full power of such
software.

I retain a fairly open mind on this matter; I still think I could approve the
dictionary as presented (with some amendments to take account of the
particular anomalies Paula has noted), though the accompanying documentation
should very clearly describe the limitations of the category structure (or
lack of structure) adopted. But I have no fundamental objection to a more
normalised category structure.

Regards
Brian