Crystallographic data

CODATA 2004

The Information Society: New Horizons for Science

Berlin, 7-10 November 2004

The title of the 2004 biennial CODATA conference reflected the growing emphasis within CODATA on data science and scientific data management as crucial components of the "information society". An important session during the conference programme was devoted to presentations and discussions on CODATA and ICSU activities leading to Phase II of the World Summit on the Information Society in Tunis in 2005 (see below). However, the relationship between society and scientific data was a recurring theme running through many of the conference presentations, sometimes explicitly, sometimes implicitly.

Keynote lectures

The Berlin Declaration of October 2003 is an initiative to encourage open access to knowledge in the sciences and humanities, with the goal of disseminating knowledge widely through society. It has been signed by several representatives of national and international academic institutions, and is strongly promoted by the Max Planck Society (MPS). Jurgen Renn of the MPS described his vision of a web of culture and science, arising from strenuous efforts to expose scholarly knowledge on the web. Without a concerted effort, many artefacts and components of cultural heritage - art, literature, languages, oral traditions - will lose visibility as they become the preserve of specialist scholars. The alternative is to use the power and universality of the web to provide access for all. He argued that there is presently a crisis of dissemination, linked to spiralling costs of journals and books. The current standard solutions for web-based distribution are flawed: the "big player" model tries to secure exclusive rights for commercial exploitation, but fails to create an adequate access and retrieval infrastructure and promotes the Digital Divide; the "scout" model of transfer of content through pilot ventures lacks self-sustaining dynamics. What is needed is a self-sustaining infrastructure built through an "agora" solution - a support programme arising from the contribution of all citizens towards the common good. In this vision, the web of the future will be constituted by informed peer-to-peer interactions. The engine for this will be dynamic ontologies engendering a self-organising mechanism, so that semantic linking runs deeper than the current linking between predefined metadata collections. The Berlin Declaration is seen as a starting point; it encourages scientists to publish the results of their research through open-access vehicles, and calls upon the holders of cultural heritage collections to make them available via the web. 
Projects such as European Cultural Heritage Online are informed by this vision and demonstrate its potential. A major short-term goal of the signatories to the Berlin Declaration is to raise the awareness of learned societies and to lay out a roadmap that systematically addresses the issues of legal obstruction, economics and the Digital Divide.

For all the current interest in scientific data and society, it remains true that science itself is built on data, and Johann Gasteiger of the University of Erlangen-Nurnberg described recent developments in chemical informatics that drew new knowledge from the mass of accumulating data. His theme was that chemistry is more interested in properties than in compounds; although a million new compounds are described each year in over 800,000 publications, the sheer volume can make it difficult to solve new problems. On the other hand, there is often not enough data. While there are in excess of 41 million recorded chemical compounds, only a quarter of a million well-defined crystal structures are known. Sometimes it is opportune for academia, which pioneers new methods, to work closely with industry, which has the capacity to provide large numbers of new compounds, and hence data points, required for a specific study. It was noteworthy that industry was poorly represented among CODATA members. The new discipline of chemoinformatics allowed the application of informatics and mathematical methods to solve chemical problems. Neural networks are one of a growing number of powerful new tools. Case studies were presented of industrial/academic collaborations in which self-organising two-dimensional neural networks had been applied to problems of solubility prediction, infrared spectra characterisation and drug discovery.

Gladys Cotter, of the US Geological Survey, discussed biodiversity studies, another area where informatics is challenged by the volume of data, by new techniques for collecting and processing data, and by the demands of organisation and knowledge management. Recent developments in the biological sciences have produced close cooperation between participants at many levels of scale, from UN-sponsored programmes through regional and national organisations, local institutions and individual field workers, all trying to communicate through increasingly interoperable channels. This is an area of science where the levels of the hierarchy do seem to work quite well together, despite (but perhaps also helped by) the proliferation of new data discovery techniques such as personal digital assistants (PDAs) equipped with Global Positioning System (GPS) locators, field computers, unmanned aerial survey vehicles and lidar. Effective metadata schemes within the Global Biodiversity Information Facility (GBIF) project allowed the exposure of 35 million taxonomic records within a year using the DiGIR software framework. New data models are moving away from purely descriptive taxonomy towards a more predictive function.

Yoshiyuki Sakaki, Director of the RIKEN Genomic Sciences Center, described the recent work on "finishing" the euchromatic sequence of the human genome, first published in draft in 2001. The motivation was to produce data of the highest quality to form the foundation for future medical research. The result is the elucidation of some 20-25,000 protein-encoding genes. As well as providing insight into genetic function within man, the complete genome provides the raw data for the new study of comparative genomics, where comparison of highly conserved sequences across many species provides clues to evolution. Current bioinformatics techniques applied to the genome have the potential to map phylogenetic relationships.

Data and society

Addressing the general theme of the conference, the plenary session "Data and Society" provided two wide-ranging reviews. Rene Deplanque, from FIZ Chemie Berlin, surveyed "The Use of Scientific and Technological Data in Today's Society" in the context of the development of the web and related information sciences. Search engines like Google are working towards a paradigm of intuitive searching without the need for formal query languages, but are limited by the depth of information they crawl and by the promiscuous nature of the returned results. Structured information will certainly help, but there is still a vast challenge in integrating databases from very different domains of science. Ontology management software is needed, and is gradually evolving; perhaps a suitable machinery for software development in this area will be based on languages such as Prolog that have significant logical inference capability. Exciting new developments continue to emerge: examples are e-learning systems, Grid technology, distributed virtual reality and ever more powerful supercomputing. Despite all this, the processing efficiency of the human brain is still far beyond anything we can currently envisage.

If that was an upbeat and technically optimistic review, Roberta Balstad Miller, Director of the Earth Institute at Columbia University, rang warning bells over the potential to abuse demographic and other human-population data gathered for scientific research purposes. Science had contributed much to warfare and human oppression in the 20th century through atomic and chemical weapons and bioterrorism; without responsible management, social data held the potential for large-scale harm in the 21st century. Existing controls, such as the 72-year embargo on census data in the US, are well intentioned but inadequate as the combination of multiple databases can allow data mining and discovery of detailed personal information about individuals. She argued for a widespread educational programme to raise awareness of these concerns, and the establishment of protocols allowing independent academic advisory committees to work alongside Government bodies in the collection and management of large social data sets. CODATA had an important role to play in leading the necessary educational programme and in defining appropriate checks and balances. Technology-driven monitoring solutions would be helpful, but the problem needed to be brought into the global spotlight and might need to be addressed through international treaties.

Mark-up languages

Brian Matthews of CCLRC Rutherford Labs, which hosts the UK office of W3C, surveyed the tools developed for and promoted by W3C as essential components of "The Semantic Web and Science Communities". These were promoted as the standards for implementing the vision of the Web of Culture and Science presaged by Jurgen Renn's keynote address. Current thinking on the semantic web uses a layered model: Unicode and URIs provide the base layer, on which are overlaid: XML with namespaces and schemas as a transport layer; the resource description framework (RDF) for metadata; ontology vocabularies managed by languages such as OWL to express formal relationships; and above that, layers of logic, proof and trust that were still to be addressed. Notable among emerging projects to develop languages suitable for supporting thesauri is SKOS.

Haruki Nakamura of the Institute for Protein Research at Osaka University presented PDBML as an example of an XML language built on a formal ontology (the mmCIF dictionary) and now used as the standard exchange mechanism between the components of the Worldwide Protein Data Bank (wwPDB). Standard XML tools can be used to manage the data in this format, such as the use of XPath searches in quite complex queries. The PDBjViewer is an alternative to rasmol for protein structure visualisation, and can be distributed as a Java applet, demonstrating the platform independence essential for sustained progress. The presentation also described a biomolecular simulation markup language (BMSML) that was being developed under a grid architecture to allow biosimulations at multiple size scales simultaneously.
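
As an illustration of the kind of XPath query mentioned above, the following is a minimal sketch using only the Python standard library. The XML fragment is invented and heavily simplified for the purpose: real PDBML files use the PDBx namespace and the full mmCIF-derived schema, and the entry identifier, atom records and values here are assumptions, not real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment loosely modelled on atom-site records.
# Element names and values are illustrative only, not the real PDBML schema.
DOC = """
<entry id="EXAMPLE">
  <atom_siteCategory>
    <atom_site id="1"><label_atom_id>N</label_atom_id><label_comp_id>MET</label_comp_id></atom_site>
    <atom_site id="2"><label_atom_id>CA</label_atom_id><label_comp_id>MET</label_comp_id></atom_site>
    <atom_site id="3"><label_atom_id>C</label_atom_id><label_comp_id>MET</label_comp_id></atom_site>
    <atom_site id="4"><label_atom_id>N</label_atom_id><label_comp_id>ALA</label_comp_id></atom_site>
    <atom_site id="5"><label_atom_id>CA</label_atom_id><label_comp_id>ALA</label_comp_id></atom_site>
  </atom_siteCategory>
</entry>
"""

root = ET.fromstring(DOC)

# XPath (the subset ElementTree supports): every atom_site element, at any
# depth, that has a label_atom_id child whose text is exactly "CA".
ca_atoms = root.findall(".//atom_site[label_atom_id='CA']")

# Collect the record ids and residue names of the alpha-carbon atoms found.
ca_ids = [atom.get("id") for atom in ca_atoms]
ca_residues = [atom.findtext("label_comp_id") for atom in ca_atoms]

print(ca_ids, ca_residues)
```

The point is only that a generic XML toolkit, with no knowledge of crystallography, can select records by content once the data are expressed in a structured markup.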

Peter Murray-Rust described his ongoing work with CML and presented his collaborative work with Acta as an example of the ease of interoperability between structured data representations such as CIF. He also presented his standard appeal for explicit licensing declarations in machine-readable format to promote data reuse, and his advocacy of the need for community cooperation.

Data archiving

CODATA has had an active interest for some time in long-term preservation and access, and there were a number of sessions and presentations on this topic. Increasingly, archival solutions are designed under the influence of the Open Archival Information System (OAIS) Reference Model; but, although this provides an essential conceptual framework for the management of large systems, its richness and complexity can be overwhelming for small organisations. In a very nice presentation of her doctoral research project, Jacqueline Spence of the University of Wales at Aberystwyth demonstrated a questionnaire-based approach to scoring small organisations' performance within the OAIS framework. The objective is not so much to rank by merit, but to demonstrate the areas where work is needed (and possibly to highlight areas where work is not needed, according to the requirements of the organisation). The scorecard is useful especially for allowing organisations to work together collaboratively to ensure that the archiving function is delegated and managed at an appropriate level. I am not sure that the actual scoring methodology is optimal (numeric scores assigned to risk and perceived requirements are added, where multiplication might seem a better weighting); but the idea suggests how small(ish) organisations can present their actual archiving abilities and status in a reasonably understandable and standard way. This could be very helpful, for instance, in my long-standing desire to record the crystallographic databases' status as archives.
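
To make the remark about weighting concrete, here is a minimal sketch comparing additive and multiplicative scoring. The area names, the 1-5 scale and the scores themselves are invented for illustration; they are not taken from the presentation or from the actual scorecard.

```python
# Hypothetical scorecard areas: (risk, requirement), each scored from 1 to 5.
areas = {
    "format migration": (5, 1),   # high risk, but barely required
    "off-site backup": (3, 3),    # moderate on both counts
    "access logging": (1, 4),     # low risk, fairly strongly required
}


def additive(risk, requirement):
    """Score obtained by adding the two numbers."""
    return risk + requirement


def multiplicative(risk, requirement):
    """Score obtained by multiplying the two numbers."""
    return risk * requirement


def ranked(score):
    """Area names ordered from highest to lowest score (ties keep input order)."""
    return sorted(areas, key=lambda name: score(*areas[name]), reverse=True)


add_scores = {name: additive(*pair) for name, pair in areas.items()}
mul_scores = {name: multiplicative(*pair) for name, pair in areas.items()}

print(add_scores, ranked(additive))
print(mul_scores, ranked(multiplicative))
```

With these invented numbers, addition cannot separate "format migration" from "off-site backup" (both total 6), whereas multiplication places the area that is moderately important on both counts (9) above the one that is extreme on only one (5).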

The US National Committee for CODATA has been working with an Archiving Task Group in collaboration with ICSTI to create a portal site for resources connected with the archiving of scientific data. The prototype site (http://stills.nap.edu/shelves/codata/index.html) demonstrates the potential uses of the portal, although its development is hampered by the content management system used in this prototype. It is hoped that the fully developed portal will be hosted in a developing country as a capacity-building exercise. Note that this fits in well with my suggestion some time back to provide information about domain-specific data resources through CODATA (perhaps with archiving activities measured through a scorecard of the type mentioned above).

The new Digital Curation Centre was introduced by David Giaretta, its Associate Director (Development). The DCC (http://www.dcc.ac.uk) was established following a recommendation in the JISC Continuing Access and Digital Preservation Strategy (October 2002) to establish a UK centre to help solve challenges which could not be solved by any single institution or discipline, including generic services, development activity, and research. It does not seek to be a repository of primary research data, but might nevertheless be a useful establishment for providing us with advisory services, ideas, tools and access to standards. The DCC development site is at http://dev.dcc.rl.ac.uk and includes some demonstration projects (see e.g. the astronomy FITS example, which has some parallels with our CIF development).

A German equivalent is nestor, a distributed partnership of German libraries and museums, www.langzeitarchieverung.de.

Among other points of interest to emerge from the presentations in these sessions I noted the following.

China recognises long-term preservation and access as an objective specifically listed in the WSIS draft plan of action. Chinese receiving stations for the NASA MODIS (imaging spectroradiometer) satellite programme can distribute received data online within an hour, but during the same time frame the data are entered into a long-term storage system.

The OAIS idea of a "designated user community" is important in designing archive systems, but developers must be aware that there may well be unanticipated demands for use by a broader user community. Some principles of good practice follow - define a user community with allowance for outreach (but within reason); engage non-technical authors to write the documentation for data centres (obviously in collaboration with the technicians); design architectures that rely on transparency, interoperability, extensibility, and storage or transaction economy; ensure that uncertainties in data are properly documented.

These principles are being applied in metadata and ontology development for a German project concerned with the *very* long-term preservation of digital information (specifically, that relating to nuclear waste disposal sites where the design goal is to make information available for at least 100,000 years). An important component of this is seen to be crafting ontologies that are aware of IT infrastructure (the principles of storage, database formats, communications channels and security), so that these can also be migrated to new platforms over time. A useful backup mechanism is the HD-Rosetta approach of etching text or other analogue information microscopically on a hardened nickel substrate (e.g. http://www.norsam.com/hdrosetta.htm).

NASA itself is building more complex archiving applications on top of the OAIS model, and increasingly integrating these into live projects. The motivation behind well-characterised software systems is to create complex systems that self-adjust with the loss of one or more components in a network of satellites and receiving stations. The NASA view is that archiving and e-science together are essential for 21st-century science and technology.

Open scientific communications/publication and citation of scientific data

Norman Paskin of the International DOI Foundation discussed the use of digital object identifiers (DOIs) for scientific data sets. DOIs are used in publishing to identify literature articles and, through searching of associated bibliographic metadata, to provide a linking service for publishers through the CrossRef registration agency. Similar functionality is possible for scientific data sets. DOIs are intended as persistent identifiers, and allow for more reliable long-term access than ad hoc and frequently transient URLs. Two case studies were presented of projects employing interesting DOI applications with science data. One is the "Names for Life" project, which proposes DOIs as persistent identifiers of taxonomic definitions. Because taxonomic definitions change over time, the unambiguous identification of a species can be difficult. Assignment of a DOI to a specific definition, and the provision of forward linking to synonyms or other related resources, will provide an audit trail of taxonomic changes, and allow both the unambiguous identification of a cited species and an understanding of the contemporary definition in its historical context. Note the distinction between an identifier for a specific data record (a taxonomic description) and an identifier for a concept (the taxon itself). DOIs are most likely to be used for the former purpose, since concept identifiers tend to be domain-specific (e.g. genus/species scientific names, INChIs, phase identifiers, chemical element symbols...). Nonetheless, the use of DOIs as concept identifiers is not entirely ruled out, especially if there is no existing systematic identification scheme in place.
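
The audit-trail idea can be sketched in a few lines. What follows is an assumption-laden illustration, not the Names for Life design or any DOI registration-agency API: the identifiers use an invented "10.9999" prefix, the taxon names are made up, and the `superseded_by` field is a hypothetical stand-in for the forward links described above.

```python
# Hypothetical registry: persistent identifier -> taxonomic definition record.
# Identifiers, names and years are invented for illustration.
records = {
    "10.9999/taxon-def.1": {
        "name": "Examplea alpha", "year": 1950,
        "superseded_by": "10.9999/taxon-def.2",
    },
    "10.9999/taxon-def.2": {
        "name": "Examplea beta", "year": 1987,
        "superseded_by": "10.9999/taxon-def.3",
    },
    "10.9999/taxon-def.3": {
        "name": "Newgenus beta", "year": 2003,
        "superseded_by": None,
    },
}


def audit_trail(identifier):
    """Follow forward links from one definition to the most recent one."""
    trail = []
    seen = set()  # guards against accidental cycles in the links
    while identifier is not None and identifier not in seen:
        seen.add(identifier)
        record = records[identifier]
        trail.append((identifier, record["name"], record["year"]))
        identifier = record["superseded_by"]
    return trail


trail = audit_trail("10.9999/taxon-def.1")
for step in trail:
    print(step)
```

A citation that quotes the first identifier still resolves to the definition its author had in mind, while the chain of links shows how that definition was later revised.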

Paskin's second example was the assignment of DOIs to climate data from the World Data Center for Climate (WDCC) in Hamburg. The German National Library for Science and Technology (TIB, Hannover) is acting as the registration agency in this case, and the WDCC application is a pilot within a longer-term project to define metadata suitable for different disciplines. TIB has an objective of becoming the central registration agency for scientific primary data by 2006. Michael Lautenberger of the Hamburg WDCC gave more details of the pilot project, and made it clear that one of their objectives was to promote academic credit associated with the "publication" of primary data sets identified by DOIs, together with integration of data sets into library catalogues and their appearance in the Science Citation Index.

I chatted to Paskin about these developments, and mentioned that I thought they were filling an important need, one that CrossRef had declared itself unwilling to take on board when we spoke with them some years ago. Subsequently, however, I discovered that CrossRef have been discussing with the PDB the assignment of DOIs for protein structures, and so the field appears to be opening up. There are a number of considerations that will come into play: will CrossRef or TIB create the better set of metadata for characterising scientific data? is there a case for distinguishing between "primary" data and "supplementary" data associated with publications? what will be the financial model for scientific data publication?

In a presentation on "Open Access to Data and the Berlin Declaration", Jens Klump of the GeoForschungsZentrum Potsdam also proposed that data centres could act as the equivalents of data publishers within an open-access environment. He proposed that the Berlin Declaration, and its effective endorsement by Governments in the OECD Final Communique of January 2004 (http://www.oecd.org/document/15/0,2340,en_21571361_21590465_25998799_1_1_1_1,00.html) should apply also to data. The key components of such a model would be: irrevocable free access, worldwide; licences to copy, use or distribute; licences for derivative works; and availability through at least one long-term archival gateway. At this point, a major difficulty was in formulating principles of "fair use" for applications of openly-accessible scientific data.

Heinrich Behrens presented a paper considering the growth in the number of publications in scientific literature and data since the seventeenth century. Growth curves rise very rapidly over this period, but without any models, the best way to fit such curves is through statistical analysis of best-fit functions. Often growth curves are fitted by exponentials, sometimes by a succession of exponentials when the curve exhibits changes in growth rate over time. Behrens demonstrated that statistical residuals could be much smaller if multiple quadratics were fitted through the same empirical data points. While the differences in fitting past curves were small, future growth predictions will of course differ markedly depending on whether exponentials or polynomials are extrapolated. It would be interesting to predict growth in CCDC or PDB by extrapolating quadratic fits into the future.
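
The divergence between the two kinds of extrapolation is easy to demonstrate. The sketch below is not Behrens's analysis: it fits a single quadratic rather than a succession of them, and the data are synthetic, generated from a quadratic purely to show how the fitted curves part company outside the fitted range. They say nothing about real publication or database counts.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(ts, ys, degree):
    """Least-squares polynomial coefficients, lowest order first (normal equations)."""
    size = degree + 1
    a = [[sum(t ** (i + j) for t in ts) for j in range(size)] for i in range(size)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(size)]
    return solve(a, b)


def fit_exponential(ts, ys):
    """Fit y = amp * exp(k t) via a straight-line fit to ln(y) (log-space residuals)."""
    ln_amp, k = fit_polynomial(ts, [math.log(y) for y in ys], 1)
    return math.exp(ln_amp), k


# Synthetic "counts" for ten time steps; an assumption, not measured data.
ts = list(range(10))
ys = [100 + 20 * t + 3 * t * t for t in ts]

c0, c1, c2 = fit_polynomial(ts, ys, 2)
amp, k = fit_exponential(ts, ys)


def quad(t):
    return c0 + c1 * t + c2 * t * t


def expo(t):
    return amp * math.exp(k * t)


# Residual sums of squares over the fitted range, then predictions well beyond it.
quad_rss = sum((quad(t) - y) ** 2 for t, y in zip(ts, ys))
expo_rss = sum((expo(t) - y) ** 2 for t, y in zip(ts, ys))
quad_30, expo_30 = quad(30), expo(30)

print(quad_rss, expo_rss)
print(quad_30, expo_30)
```

Within the fitted range both curves track the points reasonably closely, but at three times the fitted span the exponential prediction is several times larger than the quadratic one, which is the practical consequence of choosing one functional form over the other.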

A paper that wasn't in fact presented nevertheless had an interesting abstract demonstrating the close synergy between data and publications in astronomy. (http://www.codata.org/04conf/abstracts/OpenSciComm/Genova-Informationnetworking.htm)

Data quality

Ronald G. Munro of the Ceramics Division of NIST gave a talk on "Data Evaluation as a Scientific Discipline", which presented a mathematical model for assessing quality, but also made a number of interesting general points. One was that the objective of data evaluation should be considered as ascertaining the credibility of data. Another was the benefit of classifying quality indicators into functional groups - at NIST a useful scheme (roughly in ascending order) was: Unacceptable / Research / Commercial / Validated / Unevaluated / Typical / Qualified / Certified.

Volkmar Vill of the University of Hamburg demonstrated some applications of SciDex, an object-oriented database allowing 2D and 3D data sets as data types. The system was developed for implementation of LiqCryst, a liquid crystals database, and hence contains some rather general chemical validation methods (such as substructure comparison) that fit it for other purposes. It has been used to create a search engine for the index of Springer's Landolt-Bornstein Online, as well as a number of other scientific databases: 29Si-NMR, Phytobase, Hazardous Substances...

World Summit on the Information Society

The World Summit on the Information Society (WSIS) is being held in two phases, in Geneva in December 2003 and in Tunis in November 2005, organised by the International Telecommunication Union under the patronage of the UN Secretary-General. It aims to bring together Heads of State, Executive Heads of United Nations Agencies, industry leaders, non-governmental organizations, media representatives and civil society in a single high-level event, to discuss the broad range of questions concerning the Information Society and move towards a common vision and understanding of this societal transformation.

ICSU and CODATA worked closely together to raise the visibility of science as a contributor to the information society at the first leg of the Summit. Now ICSU wishes to delegate to CODATA more involvement in the run up to the Tunis event. The WSIS Session during the CODATA conference is part of that involvement.

The first phase of the summit produced an Agenda for Action that includes a number of charges related to science. The most relevant single item is

22. E-science

a) Promote affordable and reliable high-speed Internet connection for all universities and research institutions to support their critical role in information and knowledge production, education and training, and to support the establishment of partnerships, cooperation and networking between these institutions.
b) Promote electronic publishing, differential pricing and open access initiatives to make scientific information affordable and accessible in all countries on an equitable basis.
c) Promote the use of peer-to-peer technology to share scientific knowledge and pre-prints and reprints written by scientific authors who have waived their right to payment.
d) Promote the long-term systematic and efficient collection, dissemination and preservation of essential scientific digital data, for example, population and meteorological data in all countries.
e) Promote principles and metadata standards to facilitate cooperation and effective use of collected scientific information and data as appropriate to conduct scientific research.

The CODATA session aimed specifically to highlight the initiatives currently under way in the scientific community relating to the Agenda Action items, and to identify particular outstanding problems. A round-table discussion was structured around five questions that had previously been distributed to attendees. Below I give terse summaries of some of the points raised.

1. What are the major challenges regarding scientific data management and access?

  • 20,000 petabytes of data are being produced annually. The problem is not just of access, but of usability of such amounts.
  • Access and connectivity are essential, but as a first step. We also need new techniques for knowledge discovery, which depend on an ability to integrate knowledge at different scales.
  • New forms of dissemination are potentially useful in helping policy makers and the general public to understand scientific issues.
  • Funding is a common problem - how to persuade governments to finance data management as well as the basic science?
  • There is a lack of resources (and interest) in digitising heritage data (e.g. astronomical photographic plates).
  • There remains a mismatch in the collection of environmental data between what is being gathered and what is actually of most use, particularly in the developing world.
  • The International Mathematical Union is working on the goal of digitising all mathematical publications to produce a complete digital library of mathematics.
  • Geodiversity needs to be emphasised.
  • WSIS should emphasise the need for common data standards.
  • Personnel in the developing world need to become more involved. There are issues of language and training; and specifically a lack of awareness of the need for archives.
  • INASP emphasised the need for improved access as a first step, and can provide many examples of how the benefits of increased bandwidth to developing institutions are very quickly realised.
  • The Third World Academy of Sciences acknowledges the need for archiving, but their priority is rapid access to the latest information.

2. What issues and accomplishments should be highlighted at Tunis?

  • Need to discriminate among different types (i.e. quality) of data.
  • Want to see more new horizons for science arising from WSIS, and a proper respect for, and understanding of, the role of science within the broader Information Society.
  • The IAU wants to see *better* science coming out of WSIS, and a culture change. Data should be taken seriously; the science is not finished until the associated data have been publicly posted.
  • NASA looks forward to the emergence of a common language of science, with more collaborations in scientific endeavour.

3. Activities relating to e-Science

  • The International Polar Year of 2007/8 (marking the 50th anniversary of the International Geophysical Year) demonstrates the role of science in promoting international cooperation.
  • The World Data Centres offer another good example.
  • The forthcoming "Electronic Geophysical Year" will contribute towards the new horizon of taking data and information seriously.
  • A project is under way to create a 1:1,000,000 digital map of the entire world, with eight layers of sustainable development. The best input so far has come from the developing world.
  • The OAI-PMH transport mechanism for metadata in the provision of open access is a noteworthy achievement.

4/5. What outcomes and actions are expected?

  • Renewed efforts towards the provision of electricity and power globally - no data if no power!
  • Much science is based on relationships, and initiatives promoting interpersonal contacts should be encouraged.
  • ICT developments may lead to an entirely different structure for science in the future - CODATA should paint the picture of what science will be like in 15 years' time.
  • The exercise of producing an inventory of specific activities is very important, but should not end with the Tunis summit.
  • Scientists need to engage more with policy makers on issues of relevance. Internet governance is one such area.
  • The summit is an opportunity to emphasise the non-monetary value of sharing knowledge. This is understood intrinsically within scientific culture, but may need to be spelled out to the world at large.
  • Intellectual property rights must be managed sensitively in cooperation with WIPO.
  • Open access to data and equitable access to publications remain specific goals that should emerge from the WSIS summit.

Summary

CODATA 2004 billed itself as the first major interdisciplinary conference addressing new horizons for science in the information society. The organisers believed that it had merited that description. There were 260 participants from 28 countries, and activities of most of the scientific unions were represented. The participation by representatives from ICSU, UNESCO, IIASA and the African Academy of Languages was taken as evidence of the growth of interest in CODATA.
Brian McMahon
CODATA Representative
25 November 2004