(9) Review of the CIFtools workshop
- To: COMCIFS@uk.ac.iucr
- Subject: (9) Review of the CIFtools workshop
- From: bm@uk.ac.iucr (Brian McMahon)
- Date: Tue, 2 Nov 93 17:05:23 GMT
Dear Colleagues

Attached are my impressions of last month's workshop. This isn't intended as a complete report of the meeting - I've made no reference to several interesting talks that weren't directly relevant to our concerns. I believe Phil Bourne will make the proceedings available over the network, and I'll supply details of how to obtain them when they're distributed. I'm also circulating copies of this to Nick Spadaccini and Phil Bourne, who are invited to correct any errors of fact and debate any matters of interpretation (mail me at bm@iucr.ac.uk, guys).

CIFtools Workshop Tarrytown NY 15-18 October 1993
-------------------------------------------------

This workshop was funded by the National Science Foundation to promote the development of tools for handling macromolecular data in CIF format. It was a sequel to a workshop in York in April 1993, on the theme of data validation in macromolecular crystallography. Several participants at both meetings characterised the York workshop as an event of some significance in the acceptance and development of the CIF format for data archive and transfer in this field.

It was felt that many participants had come to York with no particular enthusiasm for CIF. However, the (often animated) discussions at York had persuaded many of the participants whose particular interests lay in the sphere of database operations that CIF could be used as a valuable data transfer mechanism if it were enhanced to indicate explicit relationships between data items. The recent Workshop was characterised by broad acceptance of the CIF modifications brought about since York, and an eagerness to promote applications using the extended CIF format.

The major changes to the CIF format following the York meeting were extensions to the dictionary definition language (DDL) that is used to define terms in CIF dictionaries.
In particular, the term _category was introduced to associate data items possessing some common relationship [such relationships were previously implicit in the hierarchical name elements of a CIF data name]; and _list_link_child and _list_link_parent terms were introduced to describe relationships between data appearing in different lists [thus, the atoms partaking in bonds are labelled by _geom_bond_atom_site_label_1 and *_2; these must take the values of atom site labels as listed in the coordinates loop, and so they have a _list_link_parent of _atom_site_label].

Other DDL terms were also introduced to constrain or relate data that should appear together in a list: _list_reference is the data item present in a list which allows references to that list; _list_mandatory signals whether the data item must be present in a list of items of the current _category; and _list_uniqueness describes those data items which must (singly or in combination) appear uniquely in a valid list. These changes taken together permit a detailed description of list structures that may appear in a CIF, and fulfil most of the requirements of relating data access across tables in a relational database representation.

However, there was some discussion over the way in which _category should be defined or understood. In the current definition, if the data item belongs in a looped list then it may only be grouped with items from the same category. Therefore, data items from different categories *may not* appear together in the same list. This was felt by Nick Spadaccini to be unduly restrictive, as there might reasonably be cause for mixing items of separate categories in the same list. It was also opposed strongly by Brian Toby, whose powder extension dictionary currently *requires* lists containing data items of different _category.

There is also on occasion a requirement to represent data belonging to the same category across more than one list.
Currently, this is impossible to do if there is a _list_reference, because all looped data in the category must contain the data name which acts as the _list_reference, and this cannot appear in more than one loop.

Nick Spadaccini suggested a more flexible approach to list building. Consider the _atom_site_ and _atom_site_aniso_ loops, which currently have different categories (atom_site and atom_site_aniso, respectively). The _atom_site_fract_ items have a _list_reference of _atom_site_label, but _atom_site_aniso_U items have a _list_reference of _atom_site_aniso_label. This implies that the two sets of data items will normally be given in two separate lists. However, the anisotropic U's *could* be placed in the same list if the _list_reference value may be derived transitively through the _list_link_parent pointer. That is, a parser finds an anisotropic U in a list of atom site labels and fractional coordinates. The parser looks for the _list_reference value for the U's, which should be _atom_site_aniso_label, and fails to find it. However, _atom_site_label is the parent of _atom_site_aniso_label, and *is* found in the list; so the list (containing x, y, z and U) is valid.

Michael Scharf argued that the changes to the DDL still did not go far enough towards supplying data in a fully machine-parsable form. For instance, dates in CIF format are given as strings with verbal instructions for parsing the strings ('in yy-mm-dd format'), where a date could sensibly be split into atomic elements (year, month, day) and each such element could be given a separate data name. There was some discussion over this issue, and it was pointed out that it was not always reasonable to split an item into its smallest possible constituent elements: a crystallographer seeking a bibliographic reference is very unlikely to be interested in, say, '_citation_date_day'.
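
To make Nick Spadaccini's transitive list-reference proposal described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anything presented at the workshop: the dictionary is a hypothetical in-memory table holding just the _list_reference and _list_link_parent attributes, the function names are invented, and _atom_site_fract_x and _atom_site_aniso_U_11 merely stand in for the coordinate and anisotropic U items.

```python
# Hypothetical miniature dictionary: for each data name, the name that acts as
# its _list_reference and (if any) its _list_link_parent. Illustrative only.
DICTIONARY = {
    "_atom_site_label": {
        "list_reference": "_atom_site_label", "list_link_parent": None},
    "_atom_site_fract_x": {
        "list_reference": "_atom_site_label", "list_link_parent": None},
    "_atom_site_aniso_label": {
        "list_reference": "_atom_site_aniso_label",
        "list_link_parent": "_atom_site_label"},
    "_atom_site_aniso_U_11": {
        "list_reference": "_atom_site_aniso_label", "list_link_parent": None},
}


def reference_satisfied(item, loop_names, dictionary):
    """True if the item's _list_reference, or an ancestor of that reference
    reached through _list_link_parent, is present in the loop."""
    name = dictionary[item]["list_reference"]
    seen = set()  # guards against a circular parent chain
    while name is not None and name not in seen:
        if name in loop_names:
            return True
        seen.add(name)
        name = dictionary.get(name, {}).get("list_link_parent")
    return False


def loop_is_valid(loop_names, dictionary):
    """A loop passes if every item in it can resolve its list reference."""
    names = set(loop_names)
    return all(reference_satisfied(n, names, dictionary) for n in loop_names)
```

With this table, a loop holding _atom_site_label, _atom_site_fract_x and _atom_site_aniso_U_11 passes, because the missing _atom_site_aniso_label resolves to its parent _atom_site_label. The same loop without _atom_site_label fails.
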
Shoshana Wodak emphasised that the primary requirement was to guarantee that the maximum amount of data was present in the file. Subsequent post-processing strategies could always be devised to extract the information, if required. However, it was considered desirable to make the file as fully machine readable as possible, and one approach to identifying structure within a string might be to constrain the form of the string to some regular expression value. This might help in extracting data from the string, and would certainly help in validating the contents of the string. For instance, if the 'type' of a date were defined by a regular expression (such as [0-9][0-9]-[0-9][0-9]-[0-9][0-9]) this would mean that a supposed date of 93-AB-X@ would be detected as invalid. The idea of such validation types was further developed by Peter Murray-Rust and others (see below).

Another suggestion of Michael Scharf was the development of a formalism allowing data items to be referenced *from within free-text fields*. This would be a style of hypertext, and Scharf suggested that such links could be established by use of the SGML dialect already employed in the World-Wide Web hypertext system. The suggestion of a hypertext link generated no particular enthusiasm (although it was generally regarded as a good idea in principle); but there appears to be a pressing need for standard means of referring to data outside the current data block, and outside the current file. Brian Toby described the powder diffractionist's desire to have globally unique identifiers for data sets referenced in a file; ideally such global identifiers could actually be used as *locators* of the relevant file.
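
As an aside on the regular-expression date type mentioned earlier in this discussion, the idea can be sketched in a few lines of Python. This is only an illustration: the pattern is the one quoted above, while the function names are hypothetical and no such validator was shown at the workshop.

```python
import re

# The date 'type' quoted in the discussion: two digits, hyphen, two digits,
# hyphen, two digits (yy-mm-dd).
DATE_TYPE = re.compile(r"[0-9][0-9]-[0-9][0-9]-[0-9][0-9]")


def is_valid_date_string(value):
    """Validate the form of the string: the whole value must match the type."""
    return DATE_TYPE.fullmatch(value) is not None


def split_date(value):
    """Extract the atomic elements (year, month, day) from a validated string."""
    if not is_valid_date_string(value):
        raise ValueError("not in yy-mm-dd format: %r" % value)
    yy, mm, dd = value.split("-")
    return int(yy), int(mm), int(dd)
```

Here is_valid_date_string("93-11-02") is true and is_valid_date_string("93-AB-X@") is false. Note that the pattern constrains only the form of the string: 93-13-45 would still pass, so a range check on the month and day would need something beyond the regular expression.
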
The macromolecular dictionary developers were also keen to develop a method of providing external references to files containing lists of allowed keywords or standard dictionaries of chemical properties [these latter requirements might be met to some extent by enumerations within the dictionary or by introducing the concept of 'save frames' into CIF. However, it is preferred to have a relatively volatile set of keywords maintained outside of the dictionary itself, and 'save frames' might constitute a very mixed blessing, as evidenced by the experience to date with MIF].

The 'external reference file pointer' discussed above is one aspect of what can be called the 'name space' problem. In its original format, the CIF is seen essentially as a self-contained and local data file. There is now a requirement for locating CIF data in the crystallographic information universe. How can one reliably refer to data in other data blocks in the file, or in other files? Often it is sensible to impose some structure on those parts of a CIF that could be modified to hold location information. Thus the data block name, which is a purely arbitrary string, is likely to be constructed locally in such a way that it labels an individual data set collected, or an independent refinement result. Should there be an external specification of ways to do this? If data blocks were ordered so as to archive a sequence of refinements (or experiments), would it not be useful to introduce a global_ block (as permitted in the full STAR syntax, and used in the Dictionaries)? This then would restrict the arbitrariness with which data can be placed in a CIF; it would not be permissible to concatenate two CIFs (since the scope of a global_ declaration is from the point of declaration onwards). Many of these points were raised in general discussion, but without any resolution.

Peter Murray-Rust described a possible extension to the DDL where a specific field could contain an algorithm relating the data item currently defined to other data items. One application of this could be a precise description of the arithmetic relationship between, say, U and B factors (thus replacing the rather vague "_related_function constant" entry in the CIF dictionary). The idea is that the algorithm would be stated either in a machine-parsable pseudo-code, or, at least for prototyping purposes, in an expressive language such as the freely-available tcl. So there might be an entry for a data item, say "_construction", which lists the generating algorithm in a free-text field. This idea might also be extended to allow for validation of the contents of a data field - e.g. _symmetry_Int_Tables_number might be related algorithmically to _symmetry_space_group_H-M.

A DDL working group, including Murray-Rust, Scharf and others interested in extending the scope of the DDL, discussed a strategy for seeking further extensions to the DDL. Given that data dictionaries are themselves STAR files, they should benefit from the CIF principle of *extensibility*. It is therefore suggested that further modifications to the DDL be undertaken as local extension dictionaries that will be blended into the official dictionaries for the purpose of prototyping further developments. The principles to be followed by the working group are:

(-1) All existing CIFs must remain valid.
(0) This is considered a local exercise in prototyping dictionaries.
(1) Changes will be introduced by merging in extensions from external files.
(2) Dictionary validation will be separated operationally from other evaluation of data (i.e. consistency between bonds and coordinates is not - at least at present - determinable from the dictionary relationships alone).
(3) More than one group will work *independently* on extending the DDL.
(4) A "_validation_type" will be developed. Possible values for this will include Boolean, 'multi-valued Boolean' ("yes", "no", "maybe"), enum, cif_name, date, regexp, and a "smart" type which contains an executable procedure.
(5) A file is to be considered a dictionary iff it begins with global_ and _dictionary_name.
(6) Algorithms will be stated in tcl.

Some other contributions to the meeting illustrated the extent to which the macromolecular community is eagerly awaiting CIF. Dave Stampf explained how the Protein Data Bank is preparing to change its operation from PDB file format to CIF all at once, with no transition period. This will require staff retraining, but is considered a better option than trying to introduce a sliding transition.

John Westbrook described how the Nucleic Acid Data Bank has already for some time used CIF as its internal data exchange format. The NDB has developed several tools for merging and editing CIFs, although these depend largely on the syntactic characteristics of the file, and do not validate fully against _list_link_ style relationships.

Phil Bourne described an approach to a high-order visualisation tool, to which he proposed to give the name CIFbook. Apparently there has been some interest by Macromolecular Structures in investigating this approach as a companion to their current hard-copy data compendia.

Following Nick Spadaccini's description of the MIF project, involving data exchange files with the full STAR syntax, there was some discussion over the possibility of introducing multiple-level loops into CIF. By and large, those present seemed convinced by Nick's argument that the mmCIF project would not be so far advanced if it had to handle the complexities of the full STAR syntax. However, it was felt that, as the community developed tools of increasing sophistication and power, future extensions to include save frames and nested loops could well be envisaged.

-----
Brian