Hi all,

Last week, Peter (Murray-Rust)'s post to the PDB mailing list reminded me
of a problem with core cifdic compatibility which still hasn't been dealt
with in mmCIF. This is the problem of how to cope with unit extensions to
data names:

On Thu, 16 Nov 1995, Murray-Rust Dr P wrote:

> One of the issues I have been trying to resolve in CIF is the
> distinction between syntax and semantics. In simple terms syntax means
> parsing the file without attempting to interpret its meaning or content.
> In general CIF (note that the core CIF dictionary does not use the full
> STAR language) has a well-defined syntax and can be expressed by a BNF.
> There are - I think - still a few concerns such as how to escape certain
> characters and what to do when 'including' files.
>
> Semantics determines what meaning you put on the contents. For example,
>
>     _cell_length_a_pm 1490.3
>
> is meaningless until semantics is applied to it. CIF deals with this by
> saying "Go and look up _cell_length_a_pm in a dictionary and take further
> action depending on what you find". The use of dictionaries in CIF is
> IMO a major advance and one that the crystallographic community can be
> proud of.
>
> Unfortunately there are places where the syntax and semantics are
> confused - in the above case the _pm suffix (which is not parsable) has
> the implied message "Divide the following number by 100 because that is
> what the crystallographic community uses"
>
> In my view the syntax and semantics of CIF must be clearly separated, so
> that the language can be rigorously parsed without having to add semantic
> content (especially by implication).

There is nothing in mmDDL which corresponds to the core dictionary term
_units_extension, so at the moment, any attempt to use a data item such as
'_cell_length_a_pm' in an mmCIF-based application would fail. (The core ddl
does not seem to have kept pace with this either, but that is another
issue.)

For the reasons which Peter explains very well, it would not be a good
idea to allow unit extensions to mmCIF data names in the same way as they
are in the core dictionary. Apart from this more fundamental problem, it
would cripple any application's attempt to speed up access to dictionary
terms by converting mmCIF to a form in which a fast lookup algorithm can
be used. As John pointed out at Montreal, the (increasing) size of mmCIF
makes this an important issue.

Here is my proposal of how this could be resolved. It would need a new DDL
item in the ITEM_ALIASES category - _item_aliases.units_code, and the
relevant part of the DDL would change to:

save__item_units_list.code
    _item_description.description
;              The code specifying the name of the unit of measure.
;
    loop_
    _item.name
    _item.category_id
    _item.mandatory_code
          '_item_units_list.code'             item_units_list        yes
          '_item_units.code'                  item_units             yes
          '_item_units_conversion.from_code'  item_units_conversion  yes
          '_item_units_conversion.to_code'    item_units_conversion  yes
### Units of _item_aliases.alias_name are implicitly the same as
### those of _item_aliases.name:
          '_item_aliases.units_code'          item_aliases           implicit
    _item_type.code                           code
    loop_
    _item_linked.child_name
    _item_linked.parent_name
          '_item_units.code'                  '_item_units_list.code'
          '_item_units_conversion.from_code'  '_item_units_list.code'
          '_item_units_conversion.to_code'    '_item_units_list.code'
          '_item_aliases.units_code'          '_item_units_list.code'
     save_

As for an example of how this would be used in mmCIF, take the
_cell.length_a data item:

save__cell.length_a
    _item_description.description
;              Unit-cell length a corresponding to the structure reported.
;
    _item.name                  '_cell.length_a'
    _item.category_id             cell
    _item.mandatory_code          no
    _item_sub_category.id       'cell_length'
### _item_aliases.name is determined implicitly for each row:
    loop_
    _item_aliases.alias_name
    _item_aliases.units_code
    _item_aliases.dictionary
    _item_aliases.version
          '_cell_length_a'     'angstroms'   'cifdic.c94'  '2.0'
          '_cell_length_a_pm'  'picometres'  'cifdic.c94'  '2.0'
          '_cell_length_a_nm'  'nanometres'  'cifdic.c94'  '2.0'
          .....
    _item_units.code            'angstroms'
     save_

Now, on finding

    _cell_length_a_pm 1490.3

in a CIF, an mmCIF-based application can find out from the dictionary that
the units of the data value _in_the_file_ are picometres. It can also
determine that the units of _item_aliases.name ('_cell.length_a') are
angstroms, and can find the appropriate conversion from the
ITEM_UNITS_CONVERSION category:

    loop_
    _item_units_conversion.from_code
    _item_units_conversion.to_code
    _item_units_conversion.operator
    _item_units_conversion.factor
          ...
          'picometres'  'angstroms'  '*'  1.0E-02

I think that this answers the semantic vs. syntactic point, since the
'_pm' suffix does not need to be 'noticed' at the time the data name is
parsed - '_cell_length_a_pm' is just another data name at this point. The
assignment of units to the following data value is cleanly separated from
the syntax checking.

If it seems a little odd at first to put something about units in the
ITEM_ALIASES category, bear in mind that this category's raison d'etre is
compatibility with the core CIF dictionary anyway.

Comments anyone?

Peter.

========================================================================
Peter Keller.                 \ "We kill the cows to make jackets out of
Dept. of Biology and          \  them, and then we kill each other for the
   Biochemistry,              \  jackets we made out of the cows."
University of Bath,           \              --- Denis Leary
Bath, BA2 7AY, UK.            \
------------------------------\-----------------------------------------
Tel. (+44/0)1225 826826 x 4302 | Email: P.A.Keller@bath.ac.uk (Internet)
Fax. (+44/0)1225 826449        |        P.A.Keller%bath.ac.uk@UKACRL (BITNET)
========================================================================
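
P.S. For what it is worth, the lookup described above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions:
the tables ALIASES, ITEM_UNITS and CONVERSIONS and the function resolve()
are hypothetical names, the rows are copied by hand from the dictionary
fragments in this message rather than read from a real dictionary, and
only the '*' operator that appears in the example is handled.

```python
# Hypothetical in-memory copies of the dictionary rows quoted in this
# message; a real application would load them from the mmCIF dictionary.

# _item_aliases.alias_name -> (_item_aliases.name, _item_aliases.units_code)
ALIASES = {
    "_cell_length_a": ("_cell.length_a", "angstroms"),
    "_cell_length_a_pm": ("_cell.length_a", "picometres"),
    "_cell_length_a_nm": ("_cell.length_a", "nanometres"),
}

# _item.name -> _item_units.code
ITEM_UNITS = {"_cell.length_a": "angstroms"}

# (from_code, to_code) -> (operator, factor); only the row given in the
# message is present, so other unit pairs raise KeyError.
CONVERSIONS = {("picometres", "angstroms"): ("*", 1.0e-02)}


def resolve(alias_name, value):
    """Map a core-CIF data name and value to the mmCIF name and units.

    The alias is treated as an opaque key: the '_pm' suffix is never
    inspected, so the units come from the tables alone.
    """
    name, from_units = ALIASES[alias_name]
    to_units = ITEM_UNITS[name]
    if from_units == to_units:
        return name, value
    operator, factor = CONVERSIONS[(from_units, to_units)]
    if operator != "*":
        raise ValueError("operator not handled in this sketch: " + operator)
    return name, value * factor


if __name__ == "__main__":
    print(resolve("_cell_length_a_pm", 1490.3))
```

Because ALIASES is an ordinary hash table keyed on the complete data name,
the fast-lookup concern is not affected by the unit suffixes.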