Crystallographic data

CODATA 2006

Scientific Data and Knowledge within the Information Society

Beijing, 22-25 October 2006

The 20th International CODATA Conference, held in Beijing, China, 23-25 October 2006, had the title "Scientific Data and Knowledge Within the Information Society", continuing the emphasis on CODATA's role in the Information Society expressed by the title of the previous conference ("The Information Society: New Horizons for Science") and the organisation's involvement in the World Summit on the Information Society and related activities over the past few years. Also running through the programme was a sense of CODATA's evaluation of its own purpose, reflected in a retrospective view of its past 40 years of existence, and a session looking to its future directions.

CODATA past, present and in the future

In the Key Session "CODATA: 40 Years of Bringing Data to the World", David Lide traced the organisation's early development in a post-war world of burgeoning scientific research and rapidly developing experimental apparatus and instrumentation. With the thermodynamicist Frederick Rossini as the prime mover, and an initial Executive Committee of six members from the US, France, UK, West Germany, Japan and the USSR, the expertise of CODATA's founding fathers lay solidly in the realm of the physical and chemical sciences. From the beginning, however, the organisation's constitution covered the biological and geosciences, and the representation of these sciences was already apparent by the time of the second conference, in 1970. In the early years, of course, the participants were scientists pure and simple: none would have been considered "information technologists". But early demonstrations of computerised data management, again in the second conference, formed the starting point for the enormous growth in information technologies that underpin all of today's scientific data management. From his longstanding involvement with CODATA, Lide felt that its early achievements were: to provide a necessary forum for scientists from different disciplines to work together; to foster international cooperation, most importantly during the Cold War; to reach international agreement on key physical and chemical data sets; to conduct educational activities; and to contribute to the computer-based use of data. He felt that efforts to improve the presentation of published data had been less successful, relying as they did on voluntary adoption of the CODATA guidelines. Nevertheless, he believed that these guidelines had indeed had some beneficial impact on journals; and his advice for future directions in CODATA was to give high priority to issues of data quality.

John Rumble (winner of the 2006 CODATA Prize and a former president of CODATA) took up the history of the organisation's development through the 1980s. This period was characterised by: rapid technological developments, including the rise of the personal computer; the maturing of scientific data activity; and the globalisation of scientific and technological data, involving new disciplines, new countries and new people in a growing community. It also saw the start of the transformation of scientific and technological data work from the hands of specialists to those of practically every working scientist. During this period, CODATA formulated its first strategic agenda, which concentrated on: critical evaluation and improvement of data quality; the accessibility and dissemination of data; the structure and format of data files; the use of computers and telecommunications in data dissemination; and the propagation of CODATA output. A particular consequence of this was the formal process for nominating, reviewing and approving the Task Groups that undertake CODATA's scientific activities. The period ended just as the Internet was about to be revolutionised by the invention of the World Wide Web.

Krishan Lal brought the history of CODATA up to the present, covering the last decade and a half, during which the scope of CODATA has expanded greatly: it now has vibrant and active Task Groups across a very wide range of the biological, physical, chemical and geosciences, and active involvement of scientists from around the world, including the rapidly growing economies of Asia and a significant number of developing nations. There is also growing engagement with the human and social sciences, and a growing awareness of the relationships between science and society as a whole. These trends have been analysed in a recent wide-ranging ICSU Priority Area Assessment (PAA) review of Scientific Data and Information, which included a number of recommendations to CODATA. Among the elements of CODATA's response to the ICSU PAA review was a commitment to a new mission statement and Strategic Plan. CODATA has also been an active contributor to the World Summit on the Information Society, which has given the organisation high-level international visibility and an opportunity to launch publicly a Global Information Commons for Science Initiative.

In an accompanying Key Session on the "CODATA Vision of the Future", Professor Lal and Alexei Gvishiani, the incumbent Vice Presidents of CODATA, developed further the initiatives and policies that were evolving as CODATA responded actively to the ICSU PAA and worked on its future Strategic Plan. Although the state of international data science and information technologies seems very healthy, there are concerns that CODATA itself is losing the formal membership of many National Member organisations. This appears to reflect in part the structural difficulty of raising funding from national governments for scientific activities that support long-term data management and development rather than specific projects; nevertheless, the loss of such National Members weakens CODATA, both by reducing the funding that secures the financial health of the organisation and by removing their active voices from the setting of its policy objectives and action agenda. There is also concern over losing contact with developing nations that cannot afford to pay membership dues, and over the lack of involvement, for a variety of reasons, from a number of geographic regions. It was also felt important to engage more young scientists in the work of the organisation, and to build up the Data Science Journal as an important instrument for developing data science. The new Strategic Plan would try to address these shortcomings; but it would also be ambitious and progressive, developing a number of projects to address directly such concerns as the Digital Divide, universal and equitable access to data, and the encouragement and exploitation of new scientific and technological developments.

Neatly straddling the history of CODATA and its plans for the future, Tony Hey's Keynote Lecture on "e-Science and Cyberinfrastructure" reviewed several of the latest technical developments in information handling and management. Many of these were reflected in presentations or sessions at this conference; they included the astronomical Virtual Observatory (an astronomy data grid); the Comb-e-Chem project (linking high-throughput experimental data from a chemistry wet lab to automated data collection and analysis); crystallographic e-Prints (applying open-architecture e-publication techniques to provide access to scientific data sets); Grid middleware services running on top of high-bandwidth research networks to support a growing number of research projects; open-access publication sources (ranging from preprint archives to fully peer-reviewed journals); and social communication technologies (RSS feeds, wikis, blogs), perhaps leading to new forms of "live" journal. Hey, Vice-President for Technical Computing at Microsoft Corporation, explained how Microsoft was becoming active in improving interoperability, whether by releasing its next generation of office productivity software on top of an open, royalty-free file-format specification, or by working with the technical developers of the Open Archives Initiative to investigate searching across institutional repositories. Future projects on data integration would include a web of remote sensors coupled to the geographic information of the Microsoft Virtual Earth project.

An information commons for science

A Key Session on the Global Information Commons for Science Initiative (GICSI) returned to one of CODATA's new initiatives, and Paul Uhlir (co-author with P. David of the proposal) explained in detail the idea of an "information commons": digital information, originating principally from government or publicly funded sources, made freely available for common use online, either in the public domain or with only limited rights reserved, and typically organised thematically. The advantage of an information commons lies in its facilitation of information transfer in many directions: geographically between North and South (and indeed between South and South, promoting capacity building as it does so), between disciplines, across sectors and amongst institutions. It would also promote international research and development activities. There are obstacles to such a commons: it remains necessary to assess and communicate the value of the commons approach, and to develop adequate incentives; there are issues of long-term financial sustainability, legitimate legal restrictions and the need to develop effective technical and organisational implementations. In practice it would not be possible to overcome all of these, and compromises would need to be made; but CODATA should aim to approach this ideal as closely as possible by improving understanding and awareness of the idea, promoting the broad adoption of successful models, encouraging and helping to coordinate the efforts of stakeholders, and establishing an online open-access knowledge base.

An important contributing element to GICSI is the family of machine-readable rights licences developed by Science Commons and its parent organisation, Creative Commons. Chunyan Wang of Creative Commons China described the organisation's activities in China, and pointed out that the idea of the information commons found a natural home with the traditional Chinese approach of a society sharing its knowledge, with reasonable guidelines. One example of a successful implementation in the area of Chinese science was QiJi, a counterpart to arXiv, which has a translation project for open-access journals that makes use of the Creative Commons attribution licences.

John Wilbanks, Director of Science Commons (US), pointed out that such attribution licences, traceable through their machine-readable expression in metadata, provide an excellent indicator of the actual re-use of a scientific idea, one that conveys more information about the significance of a piece of work than existing metrics such as citation counts, impact factors or even numbers of downloads.

A wealth of topics

If the sessions reported above reflect the broad context of CODATA's mission and activities, the remainder of the Conference reflected the full richness of data-centric activities in science and technology in which CODATA members and their scientific communities are involved. The conference comprised 4 keynote lectures, 13 Key Sessions, 64 contributed sessions and opportunities to present posters. Among the session titles, chosen almost at random to indicate the diversity of topics, one might list: Disaster Data; Computational Informatics; Data Role in Promoting Public Understanding of Science; Solar-Terrestrial Data; e-Science; Virtual Observatories in the Geosciences; International Polar Year Activities; Chemical and Physical Data; Bioinformatics/Biodiversity; Social Science Data Issues. The problem with such a wealth of diverse topics is that, with up to 10 sessions running in parallel, it is impossible for an individual to take full advantage of the interdisciplinary opportunities afforded by such a gathering. Hence the reports of sessions and presentations that follow do not in any sense offer a representative cross-section of activities; they are simply the sessions in which I had a particular interest, or happened to attend.

Data archiving

A topic of growing concern in recent years has been the long-term preservation (archiving) and curation of digital data. Chuang Liu presented the current state of digital preservation in China, a country with a good record of archiving traditional scholarship (stretching back some 2000 years to a set of geography books). A long-term programme of digital archiving commenced in 2003. As a first step, a project is under way to survey how many data sets exist that need to be integrated into this programme. So far nearly 2500 databases have been identified in the earth sciences, environment, public health and the physical sciences, comprising about 500 TB of archived data. This is a substantial amount, but China is preparing itself for the huge archiving effort needed to handle the satellite data, gene banks and biodiversity projects that will soon come on-stream.

David Giaretta (Digital Curation Centre, UK) reported on early results from the CASPAR study, an effort to test many of the principles and techniques that would be required in effective long-term curation activities. Focusing initially on three test data sets (from the astronomy, cultural-heritage and performing-arts areas), the project was designed to subject the widely adopted Open Archival Information System (OAIS) reference model to real-world testing, exploring in particular the specification of the requirements of the "Designated Community". This is the entity in the OAIS model that would potentially re-use the archived data, and whose knowledge determines the granularity of the metadata describing the data that would be necessary to guarantee effective re-use over a very long period of time, during which there would be inevitable changes in information technologies and in common data-handling formats and methods. Since the OAIS reference model underlies many large-scale archiving initiatives, this rigorous testing programme would appear to be of the highest importance.
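For readers unfamiliar with the OAIS vocabulary, the following minimal Python sketch caricatures the central idea: an archived data set must travel with enough "representation information" that the Designated Community can still interpret it far in the future. The class and attribute names are our own illustrative shorthand, not terms defined by the standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Illustrative sketch only: OAIS (ISO 14721) defines these concepts abstractly;
# the class and attribute names below are our own shorthand, not the standard's.

@dataclass
class RepresentationInformation:
    """What a future user needs in order to interpret the archived bits."""
    format_spec: str        # e.g. a reference to a FITS or CIF format definition
    semantics: str = ""     # units, conventions, domain vocabulary

@dataclass
class ArchivalInformationPackage:
    """One archived data set plus everything required for long-term re-use."""
    content_data: bytes
    data_formats: List[str]                               # formats the content uses
    representation_info: List[RepresentationInformation] = field(default_factory=list)
    provenance: str = ""
    fixity_checksum: str = ""
    persistent_id: Optional[str] = None                   # e.g. a DOI or handle

def reusable_by(aip: ArchivalInformationPackage, community_knowledge: Set[str]) -> bool:
    """Caricature of the Designated Community test: every format the content uses
    must either be known to the community already or be documented by
    representation information packaged alongside the data."""
    documented = {ri.format_spec for ri in aip.representation_info}
    return all(fmt in community_knowledge or fmt in documented
               for fmt in aip.data_formats)
```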

Practical archiving activities were also reviewed in the session organised by the Task Group on Preservation of and Access to Scientific and Technical Data in Developing Countries, where a report on the conferences, publications and activities of the Task Group was followed by an account of efforts to establish a national database of non-profit organisations and their activities in South Africa. Also presented in this session was an enthralling account of a variety of biodiversity and sustainable development activities in the multi-faceted ecology of Thailand.

Data and the scientific literature

Publication archives were described in the "Electronic Journal Production" session by Newman Yan, who presented China Academic Journals, the flagship of Chinese e-publishing. This journal aggregator platform offers over 18 million articles from over 7500 titles published during the period 1994-2006; an extensive digitisation project has also converted over seven and a half million articles from 3664 journals published between 1887 and 1993. The articles are all full-text searchable, indexed by subject area, viewable in a proprietary document format that offers greater functionality than PDF, and accessible through common library and information standard protocols. The entire collection (of Chinese-language content only) has over 6000 institutional subscribers and records over 1.2 billion downloads annually.

Myung-Seok Choi described KISTI-ACOMS, a web-based article submission and review system distributed free of charge by KISTI (the Korea Institute of Science and Technology Information) to 225 academic societies. The service offers modules for both journal articles and conference proceedings.

Another national service provided to academic societies is the J-STAGE electronic journal publishing platform in Japan (which currently hosts CODATA's own publication, the Data Science Journal). More than 330 journals are hosted at present, and features of the service include full-text searching, linking, pay-per-view and the provision of COUNTER-compliant usage statistics; OpenURL and OAI-PMH interfaces are planned.
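As a hedged illustration of what an OpenURL interface adds to such a platform, the short Python sketch below builds an OpenURL 0.1-style article link of the kind a library link resolver consumes. The resolver base URL, ISSN and article details are placeholders, not real J-STAGE values.

```python
import urllib.parse

def openurl_article_link(resolver_base: str, issn: str, volume: str,
                         spage: str, date: str, atitle: str) -> str:
    """Build an OpenURL 0.1-style link using the familiar keys
    (genre, issn, volume, spage, date, atitle)."""
    params = {"genre": "article", "issn": issn, "volume": volume,
              "spage": spage, "date": date, "atitle": atitle}
    return f"{resolver_base}?{urllib.parse.urlencode(params)}"

# Placeholder resolver and article details, for illustration only:
print(openurl_article_link("https://resolver.example.edu/openurl",
                           issn="1234-5678", volume="5", spage="1",
                           date="2006", atitle="An example data paper"))
```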

S. Mitra of the American Physical Society described how the society's journals continue to flourish in the electronic age, and how they are becoming increasingly international in a better-connected world. Now only a third of the papers come from American authors; a third come from Western Europe and the remaining third from the rest of the world. Among these other countries, China has shown a dramatic rise in the number of articles published, although the trend is in line with China's growth in GDP. Among the challenges facing the journals is the continuing increase in the number of submissions, coupled with the increased editorial effort needed to handle articles from authors whose native language is not English. There is rising competition from other journals, including open-access journals for which no subscription is required. [On the other hand, subscriptions to APS journals are holding up well despite the early availability in arXiv of much of the research they report: this is taken as an indication that the added value of peer review is indeed appreciated.] There is also the challenge of making scholarly journals accessible to non-specialists in the field, in order to promote cross-disciplinary communication.

All of these presentations focused on the traditional role of journals as publication vehicles for scholarly research articles, even though the execution of this role has been profoundly affected by new information technologies. But Ed Pentz of CrossRef, the provider of cross-publisher reference linking services, challenged journals to use the underlying techniques of the digital object identifier (DOI) and its Handle System resolver to link not only to articles but also to research data sets. As journals and databases converge on common storage, management and dissemination methods, so the linking, citation and dissemination of data sets come closer to the publishing model.
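The persistence Pentz appealed to comes from the resolution step underneath the DOI: the identifier is resolved at access time to wherever the object currently lives. A minimal sketch of that step, using the public proxy at doi.org, is shown below; the DOI in the comment is a placeholder, not a real data set.

```python
import urllib.request

def resolve_doi(doi: str) -> str:
    """Ask the doi.org proxy where a DOI currently points and return the
    landing URL reached after redirection. Because resolution happens at
    access time, the same DOI keeps working if the object moves."""
    request = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.url

# Placeholder identifier, for illustration only:
# print(resolve_doi("10.1000/xyz123"))
```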

Interoperability through cross-disciplinary metadata

The use of DOIs to supply a permanent and citable identifier for data sets was also described in the session on "Supporting Sustainable Access to Scientific Data through Metadata" by Peter Löwe (representing Jens Klump of the GeoForschungsZentrum Potsdam). The GFZ registers DOIs for its data sets with the Technische Informationsbibliothek Hannover, which acts as a DOI registration agency for scientific data, the role that CrossRef fulfils for publications. An important motivation behind this approach is to make data citable, and so to encourage its recognition in the academic credit process.

Citing data was also a topic addressed by Chris Lenhardt, of CIESIN, who described a style guide developed for citing data (http://gking.harvard.edu/files/cite.pdf), and a code of good practice for database providers to allow such citations.
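A sketch of what such a citation might look like when assembled programmatically is given below. The field order (authors, year, title, version, archive, identifier) follows a common pattern for data citations and is not taken verbatim from the style guide Lenhardt described; the data set shown is invented.

```python
def format_data_citation(authors, year, title, version, archive, doi):
    """Assemble a human-readable data citation from its component fields.
    The layout is a generic authors/year/title/version/archive/DOI pattern."""
    return (f"{'; '.join(authors)} ({year}). {title}, version {version}. "
            f"{archive}. https://doi.org/{doi}")

# Invented example, for illustration only:
print(format_data_citation(
    ["Example, A.", "Sample, B."], 2006,
    "Global gridded temperature anomalies", "1.0",
    "Example Data Centre", "10.1000/xyz123"))
```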

Registering identifiers for data sets will certainly facilitate citation, but discovery and searching require rich metadata describing the data sets so registered. Xiaolin Zhang (China Digital Science and Technology Museum) noted the existence of data providers such as e-Bank that are already exposing metadata describing scientific data sets, and outlined the requirements for metadata interoperability methodologies, among which was a proposal for an Open Metadata development project. These ideas were reinforced in the presentation by Jian Qin entitled "Metadata as the Underpinning of Sustainable and Effective Access to Data", which specifically recommended that CODATA involve itself in projects such as the construction of a metadata directory service providing an inventory of domain metadata standards.
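Much of the metadata exposure mentioned here is done through the OAI-PMH protocol, in which a repository answers simple HTTP requests with XML records. A minimal harvesting sketch in Python is shown below; the endpoint URL is hypothetical, and a real harvester would also follow resumption tokens and handle error responses.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url: str, metadata_prefix: str = "oai_dc"):
    """Fetch one page of ListRecords from an OAI-PMH repository and yield
    (identifier, title) pairs from the Dublin Core records it exposes."""
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{base_url}?{query}") as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI}record"):
        identifier = record.findtext(f".//{OAI}identifier", default="")
        title = record.findtext(f".//{DC}title", default="(no title)")
        yield identifier, title

# Hypothetical endpoint; substitute any repository's OAI-PMH base URL:
# for oai_id, title in harvest_titles("https://repository.example.org/oai"):
#     print(oai_id, "-", title)
```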

This was a timely, focused and important session that underlined the importance of structured cross-domain metadata development to effectively promote interoperability between data providers. In a final thought-provoking contribution to the session, Raed Sharif made a plea to improve interoperability between different native-language communities by including multilingual descriptions of data sets in their accompanying metadata.

Challenges in bioinformatics and astronomy

An organisation with which we have close ties, and which has already taken the step of registering DOIs for its data sets, is the Protein Data Bank; and in the session on "Primary Biological Databases" Helen Berman of the Research Collaboratory for Structural Bioinformatics (RCSB) described the Worldwide Protein Data Bank (wwPDB). The member organisations of this collaboration (RCSB, the European Bioinformatics Institute, PDBj and BioMagResBank) worked together to maintain a single archive of macromolecular structural data, freely and openly available to the community. While the RCSB component currently had the responsibility of acting as the master archive site, data sets were reliably exchanged among all the members through the use of standard formats, guaranteeing that each member site held the same data while retaining the freedom to provide its own value-added services and interfaces. The wwPDB was committed to the highest standards of annotation and quality control, and had just completed a labour-intensive remediation of the entire archive.

Other contributions in this session described the activities and services provided by UniProt, the Universal Protein Resource (Claire O'Donovan); the EMBL Nucleotide Sequence Database (Guy Cochrane); and the quality of service of the primary nucleotide sequence databases, particularly the DNA Data Bank of Japan (Hideaki Sugawara).

Common to all these presentations were the high level of professionalism of the organisations, their commitment to data quality, and their belief in the value of open and unhindered access to authoritative and properly annotated data.

Astronomy is another scientific discipline with a good record of collecting, analysing and managing large amounts of data, and in the session on "Managing Astronomical Data" the presentations by Wenping Chen and Yongheng Zhao described the challenges posed by the high volumes of data generated by modern astronomical projects. Already the Taiwan-American Occultation Survey (which searches for dim comets through the momentary drop in brightness of stars in front of which they pass) is generating a few hundred GB of data every night, which must be processed almost in real time to allow for coincidence checking. The projected Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), which will use four telescopes each equipped with a 1.4-gigapixel camera, will generate a few TB of data every night. On a slightly more modest scale, China's new Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) will generate something in the region of 15 GB per night; but considerable thought is already being given to the requirements for proper archiving of these data and their integration into the international Virtual Observatory project.
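To put these nightly rates into perspective, a back-of-envelope calculation of annual archive growth is sketched below; the nightly figures are the rough values quoted in the talks, not precise instrument specifications.

```python
# Rough nightly data rates quoted in the talks (illustrative values only).
nightly_rate_gb = {
    "TAOS occultation survey": 300,     # "a few hundred GB" per night
    "Pan-STARRS":              2000,    # "a few TB" per night
    "LAMOST":                  15,      # "around 15 GB" per night
}

for project, gb_per_night in nightly_rate_gb.items():
    tb_per_year = gb_per_night * 365 / 1024   # GB per night -> TB per year
    print(f"{project}: roughly {tb_per_year:,.0f} TB per year")
```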

Ray Norris (CSIRO, Australia) pointed out that many such large-scale astronomy projects were exemplary in their management of data, but that this was not necessarily the case throughout astronomy. In the "Astronomer's Data Manifesto" he and colleagues were challenging the community to address consistently and thoughtfully the areas where improvements could be made, such as: depositing all data supporting published tables, images and spectra; placing in the public domain all data from publicly funded observatories; incorporating effective data management policies into plans for new instruments and observatories; addressing issues of the Digital Divide; preserving legacy data in digital form in data centres; and working through the IAU with other international organisations to achieve common goals.

Access and quality

The session on "Data Access Policy" included a number of practical applications of open or openly-encouraged access to a variety of data sources. These ranged from a community-based information system designed to enhance effective e-government of the widely dispersed communities of Indonesia (Muhammad Suryanegara, Indonesia), through the free and open exchange of seismological data promoted by the IRIS Consortium (Ray Willemann, USA), and a number of aspects of global sustainability, spanning fields of natural environment, artefacts, social science and economic data collections (Masaru Yarime, Japan). A presentation by Robert Clark (University College Dublin, Ireland) reviewed the legal framework and possible challenges to the free exchange of scientific data that arose through implementation of new legislation such as the European Database Directive.

Within this rather disparate collection of topics, the IUCr presentation on "Improved Reporting of Crystal Structures: the Impact of Publishing Policy on Data Quality" struck a note of contrast by exploring the results of applying a specific policy in practice. The IUCr journals have for many years offered open availability of deposited machine-readable structural data sets - certainly a good example of data accessibility. What is relatively new is the effect of enforcing standards of evaluation for the data supporting publications, reinforced by openly documented algorithms and objective tests through a public service, checkCIF, which has been adopted as a community standard for evaluating structural data, whether published or not. In the context of many of the other themes and topics of this conference, there were a number of pleasing resonances. First, the linking of the deposited data sets with their primary publications is done through digital object identifiers. Second, journal publication is no longer an inevitable result for small-unit-cell structure determinations, but the development of common metadata standards allows interoperability between structural journals and data collections at repositories such as e-Bank and the Reciprocal Net. Third, the continuing rapid rise in the number of structures determined is placing stress on the traditional subscription-based journal pricing model, and encouraging the IUCr to explore open-access publication strategies. Fourth, the efforts of the IUCr to make structural data freely accessible through its journals, through the activities of affiliated database organisations, and through relaxation of a subscription levy on educational, nomenclature-based and some other types of article all fit well into the Global Information Commons initiative. Finally, the effect of journal policy on evaluating supporting data sets has indeed had a positive impact on improving the overall quality of published structures, and is likely to have raised standards of data quality generally in the field of small-molecule and inorganic structure determination. In this way, reflecting on David Lide's survey of CODATA's history, we have nicely repaid CODATA's investment in early efforts to improve publication quality, and we suggest that our efforts to maintain the highest achievable quality of associated data sets fit well with Lide's view of CODATA's future priorities.
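As a concrete, if much simplified, illustration of the kind of data items such checks start from, the sketch below pulls a few standard CIF core data names (_cell_length_a, _refine_ls_R_factor_gt and friends) out of a structure file and applies a single illustrative threshold test. This is emphatically not the checkCIF implementation, which runs a far richer battery of tests over a full CIF parse; the file name and threshold here are placeholders.

```python
def read_cif_items(path, wanted):
    """Very small CIF reader: picks up simple 'tag value' pairs only, ignoring
    loops, multi-line text fields and the other syntax a real parser handles."""
    items = {}
    with open(path) as cif:
        for line in cif:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] in wanted:
                items[parts[0]] = parts[1].strip()
    return items

wanted = {"_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_refine_ls_R_factor_gt"}
data = read_cif_items("example.cif", wanted)   # placeholder file name

# One illustrative threshold check, far simpler than the real checkCIF tests:
r_factor = data.get("_refine_ls_R_factor_gt")
if r_factor and float(r_factor.split("(")[0]) > 0.10:
    print("ALERT: conventional R factor above 0.10 - worth a closer look")
```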

Summary

Once again the CODATA Conference has brought together an astonishing variety of topics and speakers covering every aspect of data science. With over 600 participants, this was the largest CODATA conference to date. Abstracts of the presentations are available on the Web at http://www.codataweb.org/06conf/prog-glance.html, and there are plans to publish in the Data Science Journal many papers developing the contents of these presentations. It is sobering to reflect on the enormous number of productive data centres, and the huge volumes of scientific data they collect, manage and disseminate in every scientific discipline. At the same time, it is encouraging to see how the IUCr's information dissemination activities fulfil many of the requirements that CODATA identifies for the best possible management and curation of data. Participation in CODATA continues to be beneficial for the IUCr, and it is our hope that the best practices of crystallography will encourage and inspire other scientific communities.

Brian McMahon
CODATA Representative

This article is licensed under a Creative Commons Attribution-NonCommercial licence.