Crystallographic data

CODATA 2002

Montreal, 29 September - 3 October 2002

The biennial CODATA meeting, just concluded in Montreal, set out its agenda under six headings intended to emphasise cross-disciplinary concerns relevant to all the CODATA participating organisations:
  • Preservation and Archiving of Scientific and Technical Data;
  • Legal Issues in Using and Sharing Scientific and Technical Data;
  • Interoperability and Data Integration;
  • Information Economics for Scientific and Technical Data;
  • Emerging Tools and Techniques for Data Handling;
  • Ethics in the Creation and Use of Scientific and Technical Data.
In an interesting keynote lecture on preservation and archiving, Kevin Ashley of the University of London Computing Centre struck some encouraging notes. The technical problems of archiving digital data are well understood and pose no severe challenges; the real challenges are economic, social and managerial. The sheer volume of data available is daunting, but growth in bulk storage capacity continues to accelerate, and the difficulty in preserving large volumes will increasingly be one of locating the desired data within the collection. Of course, one should not aim to store absolutely everything. Selection is important, and professional archivists are skilled in selection. It rapidly becomes apparent that collectors of raw data are often not skilled in the critical assessment of what should be stored to maximise the usefulness of an archive for future exploitation. Hence there is merit in the current trend among funding bodies in Europe and Australasia of supporting the digital archiving and libraries communities because of their experience in archiving and preservation; but these communities are not necessarily attuned to the specific requirements of scientific data management, and collaboration is needed to ensure that those requirements are met when scientific data are transferred to non-specialised repositories.

It is clear that the more detail accompanies the data collections (i.e. the richer the metadata at source), the more value will emerge from the archive. Some anecdotes pointed out the need to retain 'bad' data (in Ashley's talk the example was given of statistical demographic data that was known to be flawed, but which had influenced political arguments; the availability of the bad data was valuable to historical analysts). In the Canadian Virtual Observatory, astronomical data from US sources are recalibrated on the fly when requested; this dynamic processing with the best current techniques improves the usefulness of the data at the time they are served. On the other hand, NASA demands that its archives retain the original raw data (after all, the latest recalibration might contain systematic errors). So, while the point was not made explicitly, it is apparent that archives may not be static repositories, but may be called upon to reflect changes overlaid upon the data they contain. This emphasises the importance of audit trails accompanying the data as a further level of metadata. (There is a parallel with our electronic journals, where errata should be combined with the content of a paper to assist current researchers, while the initial form must also remain accessible as a document of historical record.)

The Open Archival Information System (OAIS) reference model appears to have been well accepted. It is not clear how widespread its adoption is, but where it has been taken up as a working reference it has proven effective, whether in the actual generation of code from its formal UML representation (as has been done at the Jet Propulsion Laboratory), or as a more traditional blueprint for software engineering using XML, SOAP and other web-services tools, as practised by the Centre National d'Etudes Spatiales (CNES). The CNES experience showed that its use promoted easy interoperability between different databases, and it seems to me that its level of abstraction makes it one of the most effective tools to date for working towards proper cross-disciplinary interoperability.
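
To give a flavour of what the reference model prescribes, the sketch below sets out the main parts of an OAIS Archival Information Package as simple Python classes. This is my own illustrative simplification, not code generated from the standard's UML; the class and field names are assumptions chosen for readability rather than terms taken verbatim from the standard.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative names only; the real OAIS information model is far richer.

@dataclass
class ContentInformation:
    data_object: bytes            # the archived bit stream itself
    representation_info: str      # what is needed to interpret it (e.g. a format specification)

@dataclass
class PreservationDescription:
    reference: str                                        # persistent identifier
    provenance: List[str] = field(default_factory=list)   # audit trail of processing steps
    fixity: str = ""                                      # e.g. a checksum of the data object
    context: str = ""                                     # relationship to other holdings

@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    preservation: PreservationDescription
    descriptive_info: str = ""    # metadata used for discovery and access
```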

Among other contributions related to archiving, the point was repeatedly made that original data must be retained, whatever subsequent processing they might undergo. The US Geological Survey distinguished the 'migration' of data to different physical media, which was easy, from their 'transcription' into different formats, which might be required for subsequent reprocessing or even storage. One difficulty with transcription strategies applied to very large volumes of data was that the lifetime of the target format was often short compared with the time taken for the transcription operation, so that one was forever chasing one's tail. The astronomy community well understands the value of archived data: over 600 papers a year are published from old observational data retrieved from data stores, and data from the Hubble Space Telescope are being extracted for research at a rate four times greater than they are being added to the archives.

The Principal Director of the Erpanet project (a European-funded venture somewhat focused on cultural digital objects) reiterated the archivist's principle of selectivity: it is better to collect little but document it well than to aggregate huge amounts of poorly documented material. Librarians' acquisition strategies are developed hand in hand with disposal strategies (though this is perhaps driven by concerns about storage space that are less pressing in the current digital environment). One point he made that I thought worth pondering is that archiving resources should be allocated less through cost/benefit analysis than through risk analysis: what do you stand to lose?

The interoperability thread was introduced by a keynote talk full of ideas (about 120 slides' worth!) by Robert Robbins of the Fred Hutchinson Cancer Research Center, Seattle. His theme was that interoperability between databases in the life/molecular biology sciences alone was hampered, partly by scale, but also by obstacles to technical, semantic and social connectivity. In practice it was found that people were more willing to tackle the semantic and, to some degree, the social obstacles as technical connectivity improved. The problems that he saw at the technical level had to do with the fact that current relational database management systems are optimised for business databases. But business and science differ: business is concerned with a closed universe and deductive logic; science deals in an open universe of observations with inductive logic. Nevertheless, relational database systems were attractive inasmuch as they had a sound theoretical basis: their behaviour and properties were tractable to set-theoretic analysis. Object-oriented databases with local methods were attractive in terms of efficiency of manipulation, but tended to be designed ad hoc to match the problem in hand; the difficulties of integrating such ad hoc solutions are more severe. In practice biological databases will form at best a 'loosely coupled federation' within a formal taxonomy of databases. Earlier attempts to analyse such systems rigorously foundered on the impossibility of synchronising loosely coupled structures, but Robbins believes that a formal theory of 'read-only' loosely coupled federated databases is possible, and is essential to provide a sound basis for the design and implementation of the desired integration of very large-scale biological databases. One thing he identified as essential was some sort of resource registry, acting as an analogue of the domain name service, to direct structured queries to the appropriate server within a WWW technical model.
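
The registry idea is easy to caricature in a few lines of Python. The sketch below is purely illustrative - the resource names and URLs are hypothetical - but it captures the DNS-like role Robbins described: a client first resolves a logical resource type to a server, then dispatches its read-only query there.

```python
# Minimal sketch of a 'resource registry': a lookup service that, like DNS,
# maps a logical resource name to the server able to answer queries about it.
# All names and URLs below are hypothetical.

REGISTRY = {
    "protein-sequence": "https://sequences.example.org/query",
    "gene-expression":  "https://expression.example.org/query",
    "taxonomy":         "https://taxa.example.org/query",
}

def resolve(resource_type: str) -> str:
    """Return the endpoint responsible for a given class of resource."""
    try:
        return REGISTRY[resource_type]
    except KeyError:
        raise LookupError(f"No registered server for resource type {resource_type!r}")

# A federated client would resolve first, then send its read-only query:
#   endpoint = resolve("protein-sequence")
#   ... issue the query against endpoint ...
```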

The other talks in the interoperability thread seemed to illustrate that interoperability needs to - or at least tends to - start at the technical level. The oceanographic OPeNDAP protocol, defining syntactic metadata, has been successful in bringing together data sets in a number of different formats behind a common front end. It is an open-source network data access protocol that sits, rather like a format translation layer, on top of TCP/IP in a network transfer process. Mechanical procedures are employed for translating between formats, and the amount of semantic metadata required by the search and retrieval applications is rather low. Indeed, the point was made that the format translation layer itself reduces the amount of metadata needed to facilitate meaningful data transport. It was also pointed out that one can get a lot of functionality out of 'smart' clients, but the more intelligent the client, the lower (in general) its capability for interoperability. OPeNDAP appears to be sufficiently low-level that it has been used to good effect by the oceanography, earth sciences and solar-terrestrial communities.
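
As a rough illustration of how low-level the protocol is, the Python sketch below fetches a dataset's structural description and attribute metadata from a DAP server as plain text over HTTP. The server URL is hypothetical, and the '.dds'/'.das' suffixes follow my understanding of the DAP convention rather than anything presented at the meeting.

```python
import urllib.request

# Sketch of a minimal OPeNDAP (DAP) client interaction: the structural
# description (.dds) and attribute metadata (.das) of a dataset are fetched
# as plain text over HTTP. The server URL below is hypothetical.

BASE = "http://data.example.org/opendap/sst_monthly.nc"

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

dds = fetch(BASE + ".dds")   # variable names, types and array shapes
das = fetch(BASE + ".das")   # attributes (units, missing values, history, ...)

print(dds)
print(das)
```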

The OpenGIS Consortium demonstrated some impressive overlaying of map-based information from different geographic sources using a number of web-compatible services. Among the tools used in XML-based data-transfer applications, UDDI (Universal Description, Discovery and Integration) was mentioned by a number of speakers, and may go some way towards fulfilling the role of the 'semantic DNS' mentioned in the keynote presentation.
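
For readers unfamiliar with these services, the sketch below shows the kind of Web Map Service (WMS) GetMap request that underlies such overlays: the same bounding box is requested from two servers and the transparent layer is composited over the base map. The server URLs and layer names are invented; the parameter names follow the WMS 1.1.1 convention as I understand it.

```python
from urllib.parse import urlencode

# Sketch of a WMS 1.1.1 GetMap request. Server URLs and layer names below
# are hypothetical; only the query-parameter names follow the specification.

def getmap_url(server: str, layers: str, bbox: str, transparent: bool = False) -> str:
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layers,
        "STYLES": "",
        "SRS": "EPSG:4326",          # plain latitude/longitude
        "BBOX": bbox,                # minx,miny,maxx,maxy
        "WIDTH": "800",
        "HEIGHT": "600",
        "FORMAT": "image/png",
        "TRANSPARENT": "TRUE" if transparent else "FALSE",
    }
    return server + "?" + urlencode(params)

# Overlaying: request the same bounding box from two services and composite
# the transparent layer over the base map.
base    = getmap_url("https://maps.example.org/wms", "topography", "-75,45,-73,46")
overlay = getmap_url("https://geology.example.net/wms", "bedrock", "-75,45,-73,46", transparent=True)
```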

Impressive though some of the working examples were, they are still largely restricted to one discipline or to related disciplines with rather similar data descriptions. Cross-discipline interoperability still seems a long way off.

A speaker from the Open Archives Initiative (OAI) discussed its protocol for metadata harvesting (OAI-PMH), which is designed to collect metadata across disciplines. It is based on Dublin Core metadata, and may provide a way to aggregate disparate metadata from different sources. This sounded interesting, but unfortunately the speaker disappeared before I could chat with him; there might be more on this at the forthcoming CERN meeting.
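
For the curious, the sketch below shows roughly what an OAI-PMH harvesting request looks like in Python: a plain HTTP GET with a 'verb' parameter, returning XML whose records carry Dublin Core metadata. The repository URL is hypothetical and the code is a minimal illustration rather than anything demonstrated in the talk.

```python
import urllib.request
from urllib.parse import urlencode

# Minimal sketch of an OAI-PMH ListRecords request; the repository URL
# below is hypothetical.

REPOSITORY = "https://repository.example.org/oai"

def list_records(metadata_prefix="oai_dc", resumption_token=None):
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Flow control: continue an earlier, incomplete list response.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    with urllib.request.urlopen(REPOSITORY + "?" + urlencode(params)) as response:
        return response.read().decode("utf-8")

# The returned XML would then be parsed for <record> elements and for a
# <resumptionToken>, which is passed back in to fetch the next batch.
xml = list_records()
```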

The keynote talk in the 'Emerging Tools' thread was on text mining, by Stan Matwin of the University of Ottawa. This is the process of analysing natural-language text to uncover new knowledge (that is, to extract structured information from an unstructured source). A distinction was made between 'uncovering' and 'discovering' new knowledge - an example of the latter would be the recognition that references to birds in Grimms' fairy tales are always metaphors for death. (To my mind this sounded far more interesting, but it doesn't appear to be anywhere near realisation yet!) Today's text-mining projects combine linguistic analysis with machine learning. Linguistic analysis includes word stemming, tagging, and rule-based parsing of the grammatical structures of a natural-language text. The objective is to work towards a semantic analysis, which for scientific texts is imaginable because the formal language of scientific discourse uses relatively direct mappings between syntax and semantics (unlike, say, metaphor-rich literary text). The machine-learning component involves feeding the system, in advance, portions of text that experts have tagged as relevant or not relevant to a particular type of query; this is seen as an effective way to generate thesauri for a topic area. It was claimed that early projects on the automatic categorisation of documents in genomics, and on the detection of email spam, were showing promise.
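
As a toy illustration of the relevance-tagging approach, the Python sketch below trains a simple bag-of-words naive Bayes classifier on a few expert-labelled snippets and applies it to new text. The training examples are invented and far too small to be meaningful; the sketch merely shows the shape of the machine-learning step, using scikit-learn rather than any system mentioned in the talk.

```python
# Toy sketch of relevance tagging: experts label a handful of text fragments,
# a bag-of-words model is fitted, and new text is classified. The snippets
# are invented and far too few for real use.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

documents = [
    "expression of the gene was upregulated in tumour cells",
    "protein binding site identified in the genome sequence",
    "win a free holiday, click here now",
    "cheap offers on prescription drugs",
]
labels = ["relevant", "relevant", "spam", "spam"]   # expert tags

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
features = vectorizer.fit_transform(documents)

classifier = MultinomialNB()
classifier.fit(features, labels)

new_text = ["the sequence flanking the binding site was analysed"]
print(classifier.predict(vectorizer.transform(new_text)))   # -> ['relevant']
```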

Among the contributions to this thread, Henry Kehiaian presented SELF, a standard file format for physicochemical data, as a technique for publishing, retrieving and exchanging such data. A speaker from Oracle discussed some of the innovations within Oracle databases for storing spatial data, extending the database query language to include operations on spatial data types and introducing optimised spatial indexing. I gather that they are working with SDSC on applications in protein structure representation and that they plan to provide biospatial types that are compliant with mmCIF. This sounds to me like very good news, because I am sure the integration of mmCIF objects in a commercial product of this importance will be very welcome. If I understood correctly, he is collaborating, or proposing to collaborate - I presume with Phil Bourne's group? - on PDB-mmCIF conversion tools. Unfortunately, he also vanished before I could talk to him.

Because of the structure of the conference I could not follow the other topic threads in detail, but their keynotes were all of interest.

Masamitsu Negishi of the Japanese National Institute of Informatics (NII) described Japan's current heavy investment in information technology: the e-Japan strategy is a government-driven initiative to become the world's most advanced IT nation by 2005. As its contribution, NII hosts 2892 databases (only 5% of them scientific) and provides a national portal to them. It is building a citation index of Japanese publications (a Japanese-language equivalent of ISI). Japanese electronic libraries are exploring consortial subscription models, and Japanese interest in the SPARC initiative for low-cost academic publication is high.

Pamela Samuelson of the University of California at Berkeley discussed some topical legal concerns and emphasised the need for the scientific community to uphold the value of the public domain in safeguarding access to information and ideas, fertilising new ideas, and upholding the general principles of scientific openness. Current legislation on intellectual property rights provides strong safeguards for the owner of data, but at the potential cost of eroding the doctrine of 'fair use' for educational and research purposes.

Such restrictions on access were also referred to by M. G. K. Menon (Dr Vikram Sarabhai Distinguished Professor of Space and President, LEAD, India) in an eloquent address on the ethical problems that will certainly arise in the globalisation of information science and technology. There is already an economic divide between the rich and poor nations, and a digital divide exists too and is growing. The poor cannot afford computers; telecommunications infrastructure in the developing world is weak; illiterate people cannot use keyboards; the Internet is dominated by the English language; and the developing nations have difficulty enough in meeting their energy needs. The problem of access to data particularly concerns the poor nations. However, CODATA is active in involving members of the developing nations in its activities, and shows by its promotion of ethics-related sessions at this meeting and elsewhere that it is an active participant in the quest for a proper balance between ethical and economic values.

Among the entertainments provided for delegates were a pair of public lectures and a session to predict the future. The public lectures were: a bilingual presentation by Guy Baillargeon on biodiversity and the Global Biodiversity Information Facility front-end to a collection of interoperable taxonomy and specimen databases; and a presentation of high-definition television satellite images of geographic, geologic and meteorological phenomena by Fritz Hasler. The prophets in the CODATA 2015 session were Paul Ginsparg (who envisaged very effective text mining through optimisation of the simple algorithms that now power Google; and who suggested that future trends would favour the publisher whose income per article were nearer the $1-5 of the arXiv preprint server than the $10,000-20,000 of certain commercial publishers); Werner Martienssen (who foresaw developments in the understanding of the natural laws of physics that encompassed fractal concepts and elaboration of knowledge from models that included evolutionary competition); and David Thomas (who saw the progression in understanding of molecular biology through gene function and cell function into the complete mapping of the cell, with deep understanding of the organism and populations still to come).

Finally, I enjoyed a number of presentations illustrating the Virtual Human Project, although the one with the most intellectual content had to be abandoned because the speaker could not make his Mac talk to the data projector!

Brian McMahon
CODATA Representative
7 October 2002