ICSTI Winter Workshop 2012: Delivering Data in Science

Nowadays there are many high-level activities focusing on the management of research data sets, their archiving and their re-use - in effect, their publication. On 5 March 2012 ICSTI convened a one-day workshop, held at ICSU headquarters in Paris, on "Delivering Data in Science" to survey some of the most pressing issues.

The session on Data and the Policy Makers opened with an account by Ray Harris of SCCID, the ICSU Strategic Coordinating Committee on Information and Data, which he chaired between 2009 and 2011. As ICSU is an interdisciplinary and international body, the importance of the Committee's recommendations lies in their representing the priorities for science worldwide. The Committee's six main recommendations related to scientific data were: (1) ICSU National and Union Members should adopt a guide to best practice (presented in the SCCID report) covering aspects of data policy, governance, planning and organisation, standards and tools, management and stewardship, and data access. This should help to foster a common view of the significance of these issues across all domains. (2) ICSU Members should explore and agree the terms used under the umbrella of "Open Access" to clarify a very muddled terminology and in consequence help to distinguish and prioritise factors leading to universal and equitable access to publications (guided by ICSTI and INASP) and data (CODATA). (3) ICSU Members should improve the whole process of creating data as a publication, with increased academic recognition, appropriate behaviour modification, and a possible role for legal deposit libraries in providing infrastructure or services. (4) ICSU should use its affiliated organisations, CODATA and the World Data System (WDS), more actively in managing large-scale data activities. (5) Practical help needs to be given to less economically developed countries, again using the existing networks of ICSU and its affiliated bodies. (6) There should be greater interaction with the private sector to use commercial expertise and resources for mutual benefit.

One potential weakness of SCCID's analysis of data management is that it does not consider separate strategies for "raw" versus "processed" data. In part this is a philosophical decision - many of the technical challenges of handling electronic information do not depend on the nature of that information within the scientific experiment/publication life cycle. Nevertheless, several later presentations did demonstrate how different strategies needed to be applied in different scientific fields to data that had undergone various stages of processing. In crystallography, IUCr Journals and Commissions have long promoted the exemplary position of requiring coordinates and structure factor amplitudes (our "processed data") to be deposited. The IUCr's Diffraction Data Deposition Working Group is now wrestling with the possible next step of archiving the "raw data".

Four succeeding presentations at the Workshop gave a survey of policy and funding support available from national and regional funding organisations, who will have a key role to play in realising the vision laid out by ICSU.

Stefan Winkler-Nees discussed the recommendations on data of the Alliance of German Science Organisations. Research funding in Germany is shaped by a division of responsibilities between the Federal government and the regional Länder, and in part by German research institutions' traditionally strong relationships with private industry. Nevertheless, common principles for archiving and free access to publicly-funded research data have emerged that are similar to those of other countries, and there is a significant investment in funding to assist German science organisations to realise these principles. The speaker referred to a frequent perception amongst some authors that suitable data archiving was to 'stuff a CD or DVD in a desk drawer' - clearly not easily accessible to a wide readership and subject to the author remaining alive during the lifetime of the publication. Within the activities of the IUCr DDD WG we have learnt that universities, at least in the UK, are beginning to wake up to the need to provide their staff with centralised archives, not only as good practice but also to avoid inadvertent research malpractice.

Carlos Morais-Pires of the European Commission described the preparations for the next European framework programme for research development and innovation (Horizon 2020) and emphasised the positive commitment to research and development, mirrored by a likely increase in funding of 40-45% over the coming 7-year period. Much of the Commission's emphasis will be towards Open Science, in the belief that open content, open infrastructures and an open culture will work together to create optimal sharing of research results and tools. The impetus for data management strategy in this programme comes from the influential "Riding the Wave" report of October 2010.

Rob Pennington of the US National Science Foundation described the Cyberinfrastructure for the 21st Century programme (CIF21). This will have some $200M available for data infrastructure investment, but NSF has been very keen to assess how best to distribute funding within the new and still poorly understood paradigm of data-driven science. He described the detailed consultation and review processes that have informed CIF21, built around several Grand Challenges but seeking to provide multidisciplinary and multi-scale integration to draw real and useful science out of the sea of data. While NSF feels that it has been "behind the curve" in this area, it is moving forward with a very strong and focused commitment. Already, funded PIs are required to provide a data archiving plan in grant proposals and to account for themselves in their annual reports to the NSF and at the end of their grant awards. This 'policing' of its policy is an important step in itself.

Runda Liu, deputising for Peng Jie of the Institute of Scientific and Technical Information of China (ISTIC), described the Chinese Scientific Data Sharing Project in which ISTIC is an active participant. China wishes to follow the western model of data sharing and reuse across research institutes and end users, and has been building up a national distributed network that now includes ten data centres and over 100 branches and nodes, covering over 3000 databases. China is an active participant in CODATA activities and is enthusiastic about participation in the World Data System. ISTIC works with the Wanfang Data Agency to provide DOI registration for scientific data sets in China, and there is significant investment in developing scientific data classification and navigation systems, and in building an Internet platform for scientific data resource information.

A session on "Data in Practice" brought some ground-level perspectives to these high-minded policy objectives. Todd Vision described DRYAD, a system within the life sciences that allows authors to deposit their supporting data sets at the same time as they submit a research article for publication. Currently there are 25 journals with deposition/submission integration in this field; each deposited data set has a unique DataCite-supplied digital object identifier (DOI). The philosophy behind the system has been to make it easy and low-cost for authors to deposit their data, and this strategy is broadly working. The downside, however, is that the deposited data sets are described by limited metadata. There is some encouraging evidence of re-use of deposited data sets by other researchers; but there is also some concern that providing too easy a route for deposition might hurt existing curated data centres by diverting material away from them.

If DRYAD handles "long-tail" scientific data, the opposite extreme is faced at the particle physics facility CERN, as described by Tim Smith. In 2011 over 22 petabytes of data were recorded on the Large Hadron Collider (LHC), although this is only a fraction of the amount that can be generated by the experiments. CERN must invest heavily in data filtering procedures to trap only the small fraction of the experimental results that may be of interest to specific experiments. Even then, the large volume of data (most of which is reduced and analysed in research institutions outside CERN) requires very large data storage facilities distributed around the world, and dedicated high-bandwidth optical private networks to transport the data between nodes. An interesting feature of particle-physics "information" was that, as one moved along the data pyramid from large volumes of raw data through smaller volumes of processed data to the relatively small volume of published results, the proliferation of multiple copies of more highly processed information actually amplified the data management problem. It was estimated that the 22 petabytes of raw data collected in a year gave rise to a total of 70 petabytes of duplicated and derivative data that needed to be tracked, verified and reconciled. One beneficial aspect of the data explosion was that for each generation, the archiving of previous generations' output (including content migration to new-generation media) became progressively less burdensome. Another feature of the LHC's data output is that the data are not really digestible by many other research workers; it is almost as if all those who could digest the data are already co-authors on the publication(s)!

Toby Green of OECD demonstrated the visualization and access gateways to data sets published by the Organisation for Economic Cooperation and Development. Where there is a significant holding of well-characterised and homogeneous data, it becomes cost-effective to develop tools to make it easier for end-users to access and visualise those data. For the OECD data sets, simple web-based applications allowed the extraction and combination of data sets in many ways. Very granular dataset DOIs facilitated linking statistical tables to publications, and the potentially difficult issues of tracking time-variable data sets were being tackled initially by detailed versioning.

The after-lunch session on Global Initiatives began with a description by Michael Diepenbroek of the ICSU World Data System, the federation of data centres largely in the earth sciences. Much of the impetus behind this system is the establishment of common norms of quality and interoperability across a very diverse spread of activities, and early attention is focusing on organisational aspects, including the establishment in Japan of a coordinating International Programme Office. Among the technical aspects of the new system are the orderly registration of DOIs and linking to associated publications in a way that will give due credit to those collecting and curating the data.

Jan Brase, Managing Agent of DataCite, and overall Chair of the Workshop, explained how DataCite acted to register data DOIs across the sciences. Member bodies of DataCite (typically national science libraries) provided local support services to their research communities, but collaborated to provide a uniform level of service. Currently more than 1.3 million DOIs had been registered. DataCite was also interested in improving data citation practices, and as part of this quest was also a leading participant in a CODATA Task Group on Data Citation Standards and Practice.

Geoffrey Boulton previewed the forthcoming Royal Society policy report "Science as an open enterprise", which would discuss the major policy issues surrounding research data management, drawing on recent cases such as the "Climategate" affair and on the perception that the data deluge offers both challenge, in the scale of handling vast quantities of data, and opportunity to involve a wider research community, and indeed the citizen scientist. The report would recommend that open data should be the default for scientific research, rather than the exception; and that learned societies should promote open-science priorities and opportunities in their discipline and in its application. The four criteria that would constitute "open" scientific data were that it be accessible, intelligible, assessable (i.e. well characterised and open to validation and reanalysis), and usable by others. One slightly worrying trend was that, as private companies seem to be moving towards greater openness in their data handling, so universities are in danger of becoming more closed.

Françoise Genova closed this session with an account of the Astronomical Virtual Observatory, a good example of a discipline-wide and international approach to data handling and linking to publications. Driven by common scientific objectives, astronomers have found ways to handle very heterogeneous data sets and to work towards a common data policy. Early standardisation efforts and engagement with new information technologies allowed the community to develop its own persistent identifiers and remote query systems even before the development of DOIs and the Web; and now the emphasis was on extending these approaches to provide seamless access to all the disparate systems that currently exist as a loose federation. The main challenge is in the ongoing standards development exercises: agreeing on well-characterised international standards is a difficult process. Nevertheless, the growth of the International Virtual Observatory Alliance was steady and organic, and gave great hope for realising all the benefits of universal access to open data. During question time, Françoise Genova stressed the importance of establishing standards so as to minimise the chances of things going wrong later in the data archiving. In effect, agreeing 'what is to be archived' is vital to future success.

In the final session, Publishers and Data, three academic publishers gave their perspectives on the integration of data management and archiving with the much longer-established business of learned journals.

Eefke Smit of the International Association of STM Publishers (representing over 100 publisher members) described some individual journal initiatives to enhance scientific articles through linking to associated data sets, and spoke also of the PARSE-Insight survey ("Permanent Access to the Records of Science in Europe") that had identified the current patchy distribution of scientific data archived in orderly and accessible ways. In a new European project (ODE: Opportunities for Data Exchange), STM had developed a model of the data publication pyramid, represented by a vast base of raw research data, narrowing through successive layers of dataset collections, processed data, and databases, to the relatively small apex represented by the published literature. This metaphor mirrored discussions throughout the workshop on different classes (and volumes) of variously processed information. It was provided as a reference model around which publishers could consciously build systems that would distinguish between, and handle appropriately, data that was "just data" and data that more directly supported a published scientific argument.

Fred Dylla of the American Institute of Physics (AIP) preferred to emphasise the traditional added value of the publishing enterprise and see integration with supporting data as a simple extension of the existing paradigm. As he stated it, publishing is the logistical support of information exchange, adding a useful degree of uniformity to that information, and making order out of chaos. From his perspective, the urgency for publishers was to tackle the "data pyramid" model of STM by starting at the top - i.e. by ensuring close linkage between the published article and the data sets that directly support its argument. This did not need to be unduly complicated - AIP was working on pilot projects that turned published figures and tables into accessible digital data objects by linking them directly to the numerical data that they represented, often as simple spreadsheets. While this was a low-energy-threshold approach to providing access to data, it would also permit a critical survey of the community's reaction, and so determine whether authors in that community could be won over to valuing it as a new publishing paradigm. As he saw it, the main challenge in establishing the value proposition was not a technical one, but the need to change aspects of the existing culture.

Alicia Wise of Elsevier described some initiatives within the Elsevier stable of journals to enhance linking between articles and data sets. In many cases authors upload supplementary material with the article; in other cases they link directly to curated entries in discipline-specific databases such as the Cambridge Structural Database or Protein Data Bank. Enhancement of articles to provide such linking was done either by tagging added by experts (a laborious and expensive task) or through textmining and the overlay of implied semantics. In experimenting with novel layout and structuring of articles, as in the "Article of the Future" project, Elsevier was exploring the more dynamic nature of online publications. They were integrating online knowledge bases such as Scirus and Scopus and, by providing open application programming interfaces to SciVerse applications, hoped to encourage developers in the wider community to increase the usability and usefulness of their published articles. Already Elsevier published some data journals that could include executable code, and they were keen to work in partnership with other stakeholders, and to provide opportunities for research activities involving textmining.

Overall, this Workshop provided a helpful snapshot of the state of play in making research data available within the framework of the record of science. There are encouraging signs that public policy is well informed and is moving towards encouraging orderly curation of data across many disciplines. Within this framework, public funding is available for well defined data management activities, and this may provide some resources for individual disciplines to address any needs they have that cannot be met by existing academic funding. There is, of course, huge disparity both in the types of data across different disciplines, and in the sophistication of different communities with the management of their data. This does provide a continuing challenge to publishers, especially large organisations publishing across many different scientific fields. As yet, the ability of publishers to take advantage of specific data handling opportunities seems rather limited. Initiatives such as DOI, which now provides persistent unique identifiers in both the publishing and data worlds, do of course facilitate linking and citation, which are important first steps. But there is still a great deal to do before there is routine validation, visualisation and reuse of data across the whole field of science. It is very beneficial that organisations such as CODATA and ICSTI are both aware of the problems, and well placed to work together with the many relevant stakeholders to bring this vision closer to reality.

John R. Helliwell, Representative to ICSTI
Brian McMahon, Representative to CODATA

