Crystallographic data

JISC Managing Research Data (International) programme workshop

Aston University, Birmingham, UK, 28-29 March 2011

JISC (formerly the Joint Information Systems Committee) is charged with providing world-class leadership in the innovative use of Information and Communications Technology to support education, research and institutional effectiveness, primarily in the UK Higher and Further Education Sector. It funds a number of research projects to further this cause; several of these come under the umbrella of Managing Research Data (JISCMRD).

A two-day conference at Aston University, Birmingham, on 28-29 March 2011, surveyed the activities and achievements of many of the projects funded under this framework during its two-year lifecycle (from October 2009, with activities continuing to the end of July 2011). Simon Hodson, the JISCMRD Project Manager, gave an introductory view of the programme, which had disbursed some GBP 4M among some 30 projects across five distinct strands: Research Data Management Infrastructure; Research Data Management Planning; Support and Tools; Citing, Linking, Integrating and Publishing Research Data; and Research Data Management Training Materials.

The IUCr has an interest in a number of these projects, either as a formal partner (e.g. in Peter Murray-Rust's Project XYZ) or through ongoing engagement with active structural science practitioners such as Simon Coles (Projects I2S2 and Webtracks), and so this seemed a good opportunity to gain an overview of how this JISC initiative has been faring. I therefore went along to the conference (along with some 120 other delegates), and found it a very useful survey of data activities in many sectors of the UK Higher Education community. Full programme details and subsequent documentation can be found at the conference website.

Of the projects mentioned above, I2S2 (Infrastructure for integration in structural sciences) is a collaboration including Simon Coles (Southampton U.), Brian Matthews (STFC, Rutherford), Martin Dove (Cambridge U.) and Liz Lyon (UKOLN). The work reported builds on the STFC Core Scientific Metadata model, integrating it with a general process model for chemistry using the OREChem framework. Starting with the ICAT system for cataloguing raw data at the ISIS facility, an 'ICAT-Lite' pilot implementation of the I2S2 informational model was under development. The format is XML, with <process> and <dataset> elements describing various experimental outputs. The information model tracks much of the complexity of data reduction, calibration, scaling etc. in real-world processing workflows. The interest in this lies in its potential for formalising experimental procedures in chemistry experiments, ultimately providing seamless information transfer from laboratory information management systems (LIMS) into structure solution and refinement procedures, structural model building and publication. There is also the more general topic of what metadata to capture alongside raw data cataloguing, for example using the NeXus format in neutron, X-ray and muon science. This is of considerable interest to the people developing imgCIF, and is likely to come up again at the I2S2 workshop I shall attend at Rutherford on 1 April.
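As a purely illustrative sketch of how a record built from <process> and <dataset> elements might be used to trace a processing workflow, the following Python fragment walks back from a final data set to the steps that produced it. Only the two element names come from the presentation; the <experiment> root element, the attribute names (id, kind, name, input, output) and all values are assumptions invented for this example, and are not the actual I2S2 or ICAT-Lite schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the <process> and <dataset> element names come
# from the report; the root element, attributes and values are invented.
RECORD = """
<experiment id="exp-001">
  <dataset id="raw" kind="raw-frames"/>
  <process id="reduce" name="data reduction" input="raw" output="reduced"/>
  <dataset id="reduced" kind="reduced-intensities"/>
  <process id="scale" name="scaling" input="reduced" output="scaled"/>
  <dataset id="scaled" kind="scaled-intensities"/>
</experiment>
"""


def provenance_chain(xml_text, dataset_id):
    """Return the names of the processes that led to dataset_id, earliest first."""
    root = ET.fromstring(xml_text)
    # Index each process by the id of the dataset it produces.
    produced_by = {p.get("output"): p for p in root.iter("process")}
    chain = []
    current = dataset_id
    # Step backwards from output to input until we reach a dataset
    # that no recorded process produced (i.e. the raw data).
    while current in produced_by:
        step = produced_by[current]
        chain.append(step.get("name"))
        current = step.get("input")
    chain.reverse()
    return chain


print(provenance_chain(RECORD, "scaled"))
```

Under these assumptions, asking for the history of the "scaled" data set yields the reduction step followed by the scaling step, which is the kind of workflow record the information model is intended to capture.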

Steve Androulakis, of the TARDIS project for federating raw data storage on Australian repository platforms, also gave a presentation on the current state of TARDIS, and its move towards abstraction by developing a generic object model. This had entailed replacing an original specific protein crystallography ontology by something similar to the STFC ICAT model. Placing this on top of a central indexing facility permits the development of different 'publication' modules to export content in a variety of target formats. Steve will also be at the Rutherford meeting on 1 April; and in my view the closer the information models used by disparate facilities, the easier it will be to create an effective distributed data store for raw diffraction data - one of the directions in which the IUCr would like to see the community progress.
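The general pattern of a generic object model with pluggable 'publication' modules can be sketched in a few lines of Python. This is not the TARDIS data model or code: the Experiment and Dataset classes, their fields, the two exporter functions and the sample values are all hypothetical, chosen only to show how one domain-neutral description can be exported in several target formats.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Dataset:
    # Hypothetical generic dataset: a description plus a list of file names.
    description: str
    files: list = field(default_factory=list)


@dataclass
class Experiment:
    # Hypothetical generic experiment holding any number of datasets.
    title: str
    datasets: list = field(default_factory=list)


def to_json(experiment):
    """'Publication' module 1: serialise the whole object tree as JSON."""
    return json.dumps(asdict(experiment), sort_keys=True)


def to_text(experiment):
    """'Publication' module 2: a plain-text summary, one line per dataset."""
    lines = [experiment.title]
    for ds in experiment.datasets:
        lines.append(f"  {ds.description}: {len(ds.files)} file(s)")
    return "\n".join(lines)


# Registry of export modules; a new target format is added by registering
# another function, without touching the object model itself.
EXPORTERS = {"json": to_json, "text": to_text}


def publish(experiment, fmt):
    """Export an experiment using the module registered for fmt."""
    return EXPORTERS[fmt](experiment)


example = Experiment(
    title="Lysozyme test crystal",
    datasets=[Dataset("raw diffraction frames", ["frame_0001.img", "frame_0002.img"])],
)
print(publish(example, "text"))
```

The design point is that the exporters depend only on the generic model, so nothing in them is specific to protein crystallography.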

Nick England reported for the XYZ project on ongoing initiatives in data publication, exploring issues of sustainability (such as the hosting of the CrystalEye aggregator by the IUCr in the longer term), extraction, automated analysis and repurposing of published open content (e.g. in the form of an overlay journal of chemical experimental methodology), data deposition and validation. He also described Peter Murray-Rust's recent involvement in the development of Scholarly HTML as a paradigm for structured authoring of semantic content using HTML. This was likely to work by embedding ontological terms using RDFa, facilitated on platforms such as WordPress by the development of modules and add-ons. (The Knowledge Blog, or kblog, project of Phil Lord at Newcastle University was also of interest in demonstrating the development of WordPress as a scholarly authoring platform.) Web-savvy applications such as Mendeley could, in principle, be tailored to deposit structured bibliographic information from citation databases, for example; and an appropriate ontology for this purpose might be CiTO.

CiTO, it transpires, is being used and developed by David Shotton at Oxford U. in the JISC Open Citation Project (not, strictly, under the JISCMRD umbrella, but relevant because of its connections with the Dryad UK Project, and as a companion to the Murray-Rust group's Open Bibliography Project). There are also efforts to develop a parallel ontology for expressing formally the relationships between publications and their associated data sets. This has great relevance, again, to the linking of data sets as components of a compound publication, federated datastore, or research object.

Dryad UK is a JISCMRD project that aims to further develop Dryad - an international repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields. Dryad is acting as a repository of biological data sets, supported by many publishers because it means they do not have to store supplementary data themselves. It affords a reasonable level of integration with publisher submission systems such as ScholarOne, and is increasingly allowing peer reviewers access to supporting data in the pre-publication stages.

Cameron Neylon (STFC) described the Webtracks project, aimed at developing a peer-to-peer protocol to underpin the construction of a web of linked data. The idea is that Webtrack servers allow the establishment of bidirectional links between data sets and derived 'publications', based somewhat on the 'pingback' methodology of blogs, which reports back to an original posting any subsequent instances of its reuse. The ensuing set of semantically annotated links between data resources forms a graph of citation and provenance.
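The pingback idea can be illustrated with a toy in-memory registry in Python. This is a sketch under stated assumptions and not the Webtracks protocol: there is no networking, and the LinkRegistry class, its method names, the relation label "isDerivedFrom" and the resource identifiers are all hypothetical. It shows only how recording each link at both ends yields a graph that can be read either as citation (who reused this?) or as provenance (what does this derive from?).

```python
from collections import defaultdict


class LinkRegistry:
    """Toy stand-in for a server that records typed, two-way links."""

    def __init__(self):
        self.outgoing = defaultdict(list)  # source -> [(relation, target)]
        self.incoming = defaultdict(list)  # target -> [(relation, source)]

    def notify(self, source, relation, target):
        """Record that source reuses target, storing the link at both ends."""
        self.outgoing[source].append((relation, target))
        self.incoming[target].append((relation, source))

    def reused_by(self, target):
        """Resources that have directly reported reusing target."""
        return [source for _, source in self.incoming[target]]

    def provenance(self, resource):
        """Follow outgoing links to collect everything resource derives from."""
        seen, stack = [], [resource]
        while stack:
            for _, target in self.outgoing[stack.pop()]:
                if target not in seen:
                    seen.append(target)
                    stack.append(target)
        return seen


registry = LinkRegistry()
registry.notify("paper:1", "isDerivedFrom", "dataset:reduced")
registry.notify("dataset:reduced", "isDerivedFrom", "dataset:raw")

print(registry.reused_by("dataset:raw"))
print(registry.provenance("paper:1"))
```

In a real deployment the notify step would be a message between servers, as with blog pingbacks; here it is a single method call so that the resulting graph is easy to inspect.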

Other presentations of interest were Gudmundur Thorisson's account of progress with the ORCID project (Open Researcher and Contributor ID), which is moving towards a beta implementation in late 2011; and an account by Eefke Smit, of the International Association of Scientific, Technical and Medical Publishers (STM), of existing approaches and attitudes towards data publication by publishers and their authors and readers. A recent survey showed that, at best, about 15% of authors of scientific research articles deposited their data sets with journals or in subject-based datastores; but a large majority of respondents expressed a preference for being able to do so.

Other noteworthy trends across the conference were the growing number of institutions and services that used the DataCite organisation to assign digital object identifiers (DOIs) to data collections; and a sense that the need for institutions and for scientific projects to have research management policies was at last becoming widely recognised.

So, these are the highlights of the programme that I have singled out as being particularly relevant to our own activities, but there were many other interesting presentations in a meeting that demonstrated effective vision, management and control by JISC, and that itself provided excellent interdisciplinary networking opportunities.

Brian McMahon
Research and Development Officer