Managing Data for Science

ICSTI 2009 Conference

June 9-10, 2009, Ottawa, Ontario, Canada

The 2009 Conference of the International Council for Scientific and Technical Information (ICSTI) took place in the Library and Archives Canada building in Ottawa on June 9 and 10. The Conference theme was 'Managing Data for Science', and separate sessions explored: the foundations of scientific data management; the role of libraries; the practical realities of existing data services; and the development of 'semantic science'.

The conference was polished and efficiently organised, and the organisers propose to post all the presentations and video recordings of the talks on the Web (PowerPoint presentations are now available at http://www.icsti2009.org/02-program-abs_e.shtml). Rather than give a complete synopsis of all the presentations, I shall therefore highlight comments or topics that were of particular interest to the IUCr.

Lee Dirks of Microsoft gave an opening lecture, eResearch, Semantic Computing and the Cloud, that surveyed many recent developments in computing infrastructure and web development that were relevant to the management of data - especially of the large volumes of data that are produced by many scientific experiments and observations. Like many subsequent speakers, he emphasised the enormous volume of digital information that is generated annually (now exceeding the total available storage capacity of the world). Much of it is transient, but there are still heavy burdens in storing and using the rest. Commercial vendors such as Amazon and Google are providing useful infrastructure by renting storage or computing power, and some scientific projects find it easier to rent low-cost computation from such sources rather than requisitioning dedicated hardware.

An important new initiative is the planned DuraCloud project of DuraSpace, a body formed by merger of the Fedora and DSpace repository developers. DuraCloud would provide a service layer for managing data backup and storage conforming to best digital preservation packages, and implemented on commercial Cloud infrastructure. Thus a local data provider would sign a service-level agreement with DuraCloud, and all the details of managing long-term preservation with associated format migration, redundant backup, integrity checking and so on would be managed by DuraCloud through service level agreements it contracted with Amazon, Google, Yahoo or whatever other commercial providers were relevant.

Asked by a member of the audience about the wisdom of relying on commercial companies to archive important data, the speaker responded that DuraCloud would always attempt to guard against a single failure through redundant storage arrangements; but it was emphasised that this solution should be seen as additional to the provider's own local archiving efforts, and not as a replacement. There were also issues of confidentiality and security, but it was felt that these could be handled through appropriate agreements, since many of the providers already provided services for sensitive data (e.g. medical records) with appropriate safeguards.

A particular comment of interest was the need to develop enhanced interoperability protocols (SWORD and OAI-ORE were mentioned) to improve the ability to manage data across distributed services.

Francine Berman of San Diego Supercomputer Center (Mobilising the Deluge of Data) gave some examples of projects handling data on a large scale (earthquake modeling, astronomy Virtual Observatories, the Protein Data Bank), but was especially eloquent on the topic of developing a coherent plan for the managing of large and important data sets, and especially for their long-term preservation. Such plans should be developed by individual communities of interest, for it is the communities themselves who can best judge what data is most valuable, and needs or benefits most from careful stewardship. The communities can also assess what intervention is needed at each level of data handling - acquisition, use, access and storage - to provide an efficient and holistic mechanism for handling data throughout its lifecycle. Research communities by themselves do not know enough about the best practices of digital preservation, and should collaborate with archivists and librarians wherever appropriate.

While technical problems were well defined and often surmountable, the economics of long-term data preservation remain uncertain, and the research community may need to open up to new ideas for funding the necessary activities. It was also helpful for communities to carry out data censuses of existing holdings; this, combined with triage, would help to determine an appropriate level of stewardship. However, the calculations of cost and benefit should always factor in the question of what level of data loss would be considered acceptable, since some loss was practically inevitable.

Many of the important issues are well addressed by recent or forthcoming reports from the Blue Ribbon Task Force on Sustainable Data Preservation and Access. However, while it was most desirable for communities to formulate a coherent data management/preservation plan, it was essential that they did not delay practical implementation of useful measures while awaiting such a plan!

Richard Boulderstone of the British Library (Managing Data Science - a Customer-Based Approach) described some of the difficulties that conventional large libraries faced in the new digital world, and showed how one response was to engage closely with the emerging technologies to provide services that were very different from traditional static archiving and cataloguing. The British Library, for example, leads UK PubMedCentral, has worked with Nature Publishing in experiments with the virtual world environment "Second Life", runs a "Talk Science" forum for researchers, commissions and participates in research studies with JISC, the Research Information Network and others, is a partner in WorldWideScience.org, and belongs to the consortium established to assign DOIs to scientific data sets.

The British Library is developing a strong infrastructure for secure digital archiving, with four major nodes in England, Wales and Scotland that offer continuous peer/peer validation and correction of file errors. These will offer a very large archiving capacity, but it will be nowhere near large enough to archive large research data sets of the size already managed by discipline-specific data centres. The Library feels that in any case discipline experts will have the particular knowledge and skills necessary for appropriate stewardship of the data that they use within their own disciplines. Nevertheless, the British Library wants to provide services for people to find and link to such external data also, through such measures as assignment of persistent identifiers and data catalogues.
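The talk did not describe how the peer/peer validation works in detail. As a minimal sketch of one common approach, assuming each node holds a full copy of a file and that a majority of copies can be trusted, the nodes could compare checksums and overwrite any copy that disagrees with the majority. The node names and the function validate_and_repair are hypothetical and are not taken from the British Library's system.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def validate_and_repair(copies: dict) -> list:
    """Compare the copies held by each node and repair any that differ.

    `copies` maps a (hypothetical) node name to the bytes that node holds.
    The checksum shared by a strict majority of nodes is taken as correct;
    nodes holding anything else are overwritten with a good copy.
    Returns the names of the nodes that were repaired.
    """
    digests = {node: digest(data) for node, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        # Without a strict majority there is no safe automatic choice.
        raise ValueError("no majority copy; manual intervention needed")

    good = next(data for node, data in copies.items()
                if digests[node] == majority_digest)
    repaired = []
    for node in list(copies):
        if digests[node] != majority_digest:
            copies[node] = good
            repaired.append(node)
    return repaired


# Four illustrative nodes, one of which holds a corrupted copy.
copies = {
    "node_a": b"archived file contents",
    "node_b": b"archived file contents",
    "node_c": b"archived file c0ntents",
    "node_d": b"archived file contents",
}
repaired = validate_and_repair(copies)
print("repaired:", repaired)
```

A production archive would probably also record the expected checksum at ingest, so that a copy can be validated even when most replicas are unavailable.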

Chris Greer, of the Federal Networking and Information Technology R&D Program, described in his talk Science in 5 Dimensions: Digital Data for Cyberscholarship the outcome of a project sponsored by the US Government to develop a framework within which federal agencies could operate open interoperable data preservation solutions. Involvement of the full range of federal agencies provided exposure to these ideas across all Government-sponsored scientific activities. An important conclusion of the resultant report was that separate communities of practice were essential features of the digital landscape. As with the British Library's perception, it was recognised that different scientific disciplines had their own requirements, quality criteria and expertise for curation of data to the extent they demanded. One specific corollary of this was that not all available data needed to be preserved, and not all that was preserved needed to be preserved indefinitely, so that dynamic strategies were necessary. This point echoed Fran Berman's that triage and staged expiry of preserved data would both be necessary, and were best managed by the communities with the most intimate knowledge of their data.

The Report (Harnessing the Power of Digital Data for Science and Society) recommended that the national importance of digital preservation be recognised by the formation of a full Subcommittee of the National Science and Technology Council. It was also hoped that the proposals made to guide federal agencies would also be adopted as a model by state governments and by individual institutions (universities, academic libraries) that had similar digital preservation responsibilities.

Jim Mullins of Purdue University introduced the session on the role of libraries with a case study of scientific data management practice at Purdue, where librarians were all faculty members and where there was active involvement with other faculties in establishing appropriate data curation mechanisms. The University had developed a Distributed Data Curation Center (D2C2) that undertook initiatives in developing interdisciplinary metadata and ontologies to harmonise the practices of diverse faculties. As a consequence the library staff were not simply service providers, but active developers of effective curation strategies and implementations. The University as a whole was particularly committed to promoting interdisciplinary research, and the HubZero system, designed as a "Web 2.0 for Scientists" framework and first implemented as the highly respected NanoHub, would become an open-source product in 2010.

Liz Lyon of UKOLN (Libraries and "Team Science") addressed the role of academic libraries as participants in the social activity of science. She listed ten areas in which libraries had an important role to play, illustrating many of her points with experiences drawn from the collaboration of the eBank and eCrystals projects with the UK National Crystallography Service. Her 'top ten' areas were: leadership; policy; planning; audit; engagement; repositories; sustainability; access and re-use; training and skills; and community building. The UK Digital Curation Centre plays an active role in many of these areas. As examples of the usefulness of engaging the library community with scientific practitioners, she mentioned three reports by her co-worker Manjula Patel, published or in preparation, that were relevant to future developments in archiving crystallographic information: Preservation Planning for Crystallography Data, Preservation Metadata for Crystallography Data, and Representation Information for Crystallography Data. She also referred to moves towards a crystallography data commons undertaken with the Australian TARDIS project.

Jan Brase of the German National Library for Science and Technology (Access to research data: what can libraries do to help?) concentrated on the role of registering persistent identifiers in the form of DOIs for scientific data sets. The DOI system can easily connect research articles with their underlying data, and can provide citability of data sets, with subsequent improvements in the visibility of such data sets, opportunities for re-use and verification, the enhancement of the scientific reputation of the data collector, the avoidance of duplication, and other such benefits. DOIs assigned by TIB and partner libraries in a new consortium have associated metadata which relate them to parent publications, describe relationships with other data sets ('parent/child' or 'also known as'), and indicate their technical format, typically through a MIME type. This allows the system to be agnostic towards particular data types - in a sense, anything is 'data' - but provides handles allowing downstream applications to dispose of the content in accordance with its type.
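To make the last point concrete, here is a small sketch of how a downstream application might use the MIME type in such metadata to decide what to do with a data set. The record fields, the DOI strings and the handler table are all invented for illustration; they do not reproduce the consortium's actual metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetRecord:
    """Hypothetical metadata attached to a data-set DOI."""
    doi: str
    mime_type: str
    parent_publication: str = ""                    # DOI of the related article
    relations: dict = field(default_factory=dict)   # e.g. {"parent": "<doi>"}


# Illustrative handlers, keyed by MIME type.
HANDLERS = {
    "text/csv": lambda rec: f"open {rec.doi} as a table",
    "image/png": lambda rec: f"display {rec.doi} as an image",
}


def dispatch(record: DataSetRecord) -> str:
    """Choose an action from the MIME type; fall back to a plain download."""
    handler = HANDLERS.get(record.mime_type)
    if handler is None:
        return f"offer {record.doi} for download"
    return handler(record)


# Made-up DOIs used purely as examples.
table = DataSetRecord(
    doi="10.1234/example.dataset.1",
    mime_type="text/csv",
    parent_publication="10.1234/example.article.1",
)
unknown = DataSetRecord(
    doi="10.1234/example.dataset.2",
    mime_type="application/octet-stream",
    relations={"parent": "10.1234/example.dataset.1"},
)
print(dispatch(table))
print(dispatch(unknown))
```

The registry itself never needs to understand the content; only the handlers do, which is what allows the system to remain agnostic about data types.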

The session on Data Services began with a talk from Ellsworth LeDrew, University of Waterloo, on The Enduring Legacy of the International Polar Year (IPY). The IPY, spanning March 2007-March 2009, was the fourth major international and interdisciplinary programme to survey comprehensively the Polar regions (the others had been in 1882, 1932-33 and 1957-58), and from the start there was an awareness of the need to establish procedures to ensure that the collected data were retained and remained accessible for the benefit of future projects of this type: very little of the primary data from the earlier IPYs was available to the current investigators. With this objective, a data management policy was written for all IPY projects based on ICSU guidelines. The policy operates within a concept of an information commons (using the same principles developed for eGY), and encourages the maximum sharing of collected data. The 'encouragement' took the form of a requirement that all projects had to provide basic descriptive metadata within one year, or risk delay in obtaining their next funding tranche - an approach that seemed to work well in practice. IPY managers are trying to implement citation of data sets in published journals, are building a network of polar data catalogues, and are trying to make best use of existing World Data Centres to provide federated long-term holding of the data sets. Feeding into this are proposals to inventory and harvest related data sets that appear on private web sites or are otherwise vulnerable. Although there are still issues within the areas of interoperability, equitable access to data, intellectual property and data ownership rights (the usual suspects!) the overall picture is of a substantial effort to operate a major programme within the framework of a data management plan adopted by all the participants that is informed by the principles increasingly identified as best practice in scientific data management.

Tim Smith of CERN spoke about the Digital Library Services and Data at the high-energy physics laboratory. The enormous volumes of data generated by the particle accelerator challenge the data processing procedures themselves - the 300 GB/s of raw data generated by the collider must be reduced in silicon and in situ even before the resulting 2 GB/s is delivered into the CERN data storage infrastructure; the resulting redistribution of data in real time worldwide is an epic endeavour for information services. But CERN wishes the data to be reusable, and so they must be stored in a structured archive; and the CERN Document Server (CDS) is designed as a one stop shop to access the complete archive of the experiment, including its derived publications, experiment specifications, theses, collaborations and so on. CDS uses open-source INVENIO software to manage the front-end access to data, and the CDS managers are finding that although the archive must be managed according to the best practices of library science, their users increasingly require different entrance pathways to the material. Information retrieval and presentation mechanisms similar to popular Web services such as YouTube or 'Amazon-style' recommendations are in demand, and there is also great demand for Web 2.0 collaborative tools. The result is that users have access to informative visualizations of publication relationships (citation statistics, co-citation analysis, coauthorship networks) and to image or multimedia data in a range of formats and resolutions. They now have similar high expectations of being able to view and analyse the scientific data. However, although the expectations are high, the experience of the CDS is that by engaging the users in this way they are encouraged to throw more effort into managing and annotating their own input to the system, and so contribute directly to the efficient functioning of the CDS.

Paula Hurtubise (Carleton University) gave a presentation on <odesi>, an integrated data portal providing access to social science and statistical data holdings across all Ontario universities. One practical result of this initiative has been increased cost efficiency and savings arising from elimination of unnecessary duplication of dataset holdings. The portal itself provides powerful mechanisms for searching across all holdings based on their aggregated metadata, and has intuitive interfaces allowing easy formulation of complex queries and consequently unexpected knowledge discovery. The project seems a model of well-integrated data management that greatly benefits end-user and data provider alike.

The session on Semantic Science was launched by Peter Fox of Rensselaer Polytechnic Institute with a presentation (Xinformatics, Data Science and the Full Life Cycle of Data Information and Knowledge) that showcased some of the real applications in Earth and Space Sciences that were being driven by semantic web technologies. He highlighted some of the important principles behind planning and developing effective applications: it was important to develop use cases for the problem under consideration before making any commitment to a technical solution. Too often there was a tendency to use one's favourite tool and make it fit the problem in hand. There was a great deal of emphasis on the need for new graduates to be well versed in informatics from the start of their training; and he also emphasised the need to integrate domain knowledge and experience into the information technology development process - the most effective practitioners of science informatics had 'multilingual' skills. He made much of the fact that areas of earth and space sciences have effective data journals and the ability to cite data, have effective data validation, and are increasingly promoting free access to and reuse of data. Nevertheless it was crucially important to add semantic information at every level of the data pipeline. Not only does this spread the practical cost of tagging, but it catalyses a positive feedback loop. He adopted the meme 'data → information → knowledge → wisdom' (repeated frequently throughout this conference, as at every similar meeting) and suggested that invoking this paradigm at the data acquisition stage allowed one to better design the data processing stage; enhancing the semantic content of available data through the processing stage informed and improved the data analysis process, and so forth.

Katy Börner of Indiana University discussed Computational Scientometrics to Inform Science Policy, or the best way to develop the science of science itself using scientometric or bibliometric concepts and approaches. She demonstrated how the linkages between different fields of science (as measured, for example, by literature citations) could be graphed as a topological map, usually in practice with a generally ring-like topology. Such visualization suggested an intuitive way to grasp the connectedness of underlying scientific ideas, and one could overlay other information on the map and so perhaps come to understand the interdependencies that existed when trying to formulate global scientific policy. Some of these visualisations could be seen in the scientific maps project at scimaps.org. She produced among many other stimulating ideas the metaphor that the current practice of science funding through grant cycles was akin to trying to raise a baby by giving birth, then visiting it only at annual intervals.

Jan Velterop of the Concept Web Alliance (Beyond Open Access: Maximising the Use of Scientific Knowledge) described work to address information exchange standards through the use of 'concept triples'. By extension of RDF formalism, concept triples embody formal machine-readable resource descriptions of the type

     <source concept> <relation> <target concept>

where each element of such a triple represents a distinct concept that survives translation - that is, they are unique identifiers of a term and all its possible synonyms. (A simple example might be a 'personId' that identifies a person regardless of the form of that person's name - with initials or full forenames, familiar or nicknames etc.)
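As a rough illustration of the idea, the sketch below maps different surface forms of a term to a single concept identifier and then builds triples from those identifiers, so that two statements using different synonyms yield the same triple. The identifiers, the synonym table and the relation name are all invented for this example and are not drawn from the Concept Web Alliance's own vocabulary.

```python
# Hypothetical synonym table: every surface form of a term maps to one
# concept identifier. Relations are treated as concepts in the same way.
SYNONYMS = {
    "aspirin": "concept:0001",
    "acetylsalicylic acid": "concept:0001",
    "asa": "concept:0001",
    "headache": "concept:0002",
    "cephalalgia": "concept:0002",
    "treats": "relation:0001",
    "is used to treat": "relation:0001",
}


def concept_id(term: str) -> str:
    """Return the identifier for a term, ignoring case and outer whitespace."""
    return SYNONYMS[term.strip().lower()]


def make_triple(source: str, relation: str, target: str) -> tuple:
    """Build a (source concept, relation, target concept) triple of identifiers."""
    return (concept_id(source), concept_id(relation), concept_id(target))


# Two differently worded statements reduce to the same triple.
t1 = make_triple("Aspirin", "treats", "headache")
t2 = make_triple("acetylsalicylic acid", "is used to treat", "Cephalalgia")
print(t1)
print(t1 == t2)
```

Because the triple stores identifiers rather than words, it is unaffected by the choice of synonym or of language, which is the sense in which a concept 'survives translation'.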

The Concept Web Alliance was a growing community of interested parties contributing to disambiguation strategies through this technology. An example of a practical application on the SpringerLink web site was the appearance of hyperlinked terms in, for example, an article abstract. If the reader clicked on such a hyperlink, a popup appeared with information relevant to the selected term: links to synonyms, definitions, papers by the same author, related articles, links to relevant books or commercial suppliers, all of which led the interested reader to further information. The fact that the popup could lead readers to selected books or commercial materials allowed an economic model of 'click-through' payments from suppliers, and as a result the service could be provided free to publishers.

The final keynote address of the meeting was given by Paul Uhlir of the US National Research Council. Entitled Revolution and Evolution in Scientific Communication, this discussed the benefits to society of maximising access to digital information through open access and the ideas of information commons. Among the most relevant to a science like crystallography were: promotion of interdisciplinary, inter-institutional and international research; enabling automated knowledge discovery; avoidance of inefficiency such as unnecessary duplication of research; permitting verification of previous results; promoting new research and new types of research; and promoting capacity building. There was also immense direct benefit from educational projects such as MIT Open Courseware. He was particularly keen to promote a vision of Open Knowledge Environments (OKEs) at universities, where concerted efforts could be made to return the process of disseminating scholarly information to the academic world. It was likely that this would be structured within focused thematic areas, and individual universities might act as centres of expertise for particular disciplines, thus sharing the burden of different tasks among consortia. He suggested prototypical models such as the Genomics Standards Consortium and the CAMERA metagenomics resource.


In surveying current practices of data management and preservation, the conference showed that the many challenges and difficulties - identified, for example, in successive CODATA conferences - are becoming more generally recognised, and that there is even some progress in addressing them in certain fields of science, and increasingly in the formulation of science policy at national levels. The compelling message of the conference was the need to develop a coherent plan for managing data as an essential component of any research project, and there were encouraging signs that the technology was available to allow such plans to be carried out. It was also significant, to my mind, that due recognition was given to the differing needs of different communities of practice, so that individual disciplines had a need to assess their own requirements for management and long-term preservation; and that strategies for preservation needed to include critical analyses of exactly what needed to be preserved and for how long.

It is noteworthy that this was a conference sponsored by ICSTI, since the content would have seemed entirely natural in a CODATA meeting. It is clear that the traditional data/publication dichotomy is no longer appropriate to the processes of dissemination of scientific and technical information, and we may look forward to many future synergies between these two aspects.

Brian McMahon
CODATA Representative

12 June 2009