Crystallographic data

International Data Week 2016

Denver, Colorado, 12-17 September 2016

The theme of this landmark event was 'From Big Data to Open Data: Mobilizing the Data Revolution'. It was intended to bring together data scientists, researchers, industry leaders, entrepreneurs, policy makers and data stewards to explore how best to exploit the data revolution to improve our knowledge and benefit society through data-driven research and innovation. It included the biennial SciDataCon organised jointly by CODATA and the World Data System (WDS), the Eighth Plenary Meeting of the Research Data Alliance, and an International Data Forum convened by all three organisations to provide an overview of 'Data for the Public Good'.

The IUCr was represented by John R. Helliwell, University of Manchester, IUCr Representative to CODATA (JRH) and Brian McMahon, IUCr R&D Officer (BM). 

SciDataCon 2016

Monday 12 – Tuesday 13 September 2016

This was the first part of a back-to-back conference: SciDataCon (2 days), organised by CODATA jointly with the WDS, was followed by the Research Data Alliance (RDA) Plenary Conference (2½ days), with the two sharing a common extra day (the International Data Forum) on 14 September 2016.

We organised the session entitled "Crystallography and Structural Data Bases" with an introductory talk by John Helliwell on raw diffraction data archiving and reuse, followed by talks on specific databases, namely: Crystallography Open Database (Saulius Gražulis), Cambridge Structural Database (Ian Bruno), International Centre for Diffraction Data (Soorya Kabekkodu) and the Protein Data Bank (John Westbrook). The session was well attended, with approximately 40 to 50 people from a wide range of disciplines. As well as specific questions to the speakers, there were general remarks, e.g. from John Rumble, former CODATA President, who described the crystallography community as an exemplar, over many decades, of data sharing and openness.

In the CODATA portion of the conference, quite possibly the most significant outcome was the encouragement of a uniform description system for soils, akin to the CODATA-led Uniform Description System (UDS) for Nanomaterials run by John Rumble, in which the IUCr was an active participant.

JRH attended the following very memorable sessions:

  • Coordination of Data Management Policy and Practice across ICSU Unions/Disciplines in an Open Data World.
  • Defining data professions (i.e. the education required for the different categories of data professional, such as data scientist, data clerk and data architect).

BM participated in two sessions to develop best practice in scientific data handling through the development of disciplinary standards: a set of presentations and panel discussion on "Building a Disciplinary, Worldwide Data Infrastructure" (Figure 1) and the session on "Coordination of Data Management Policy and Practice across ICSU Unions/Disciplines in an Open Data World". The former session led to an invitation to co-author a paper for Data Science Journal comparing crystallographic practices with those in astronomy, materials science, humanities, linguistics and earth sciences. The latter was associated with CODATA’s desire to promote more input from Scientific Unions.

Figure 1. Speaker panel for the Disciplinary Worldwide Data Infrastructure session; at left, Christophe Arviset presenting the International Virtual Observatory Alliance (IVOA). Courtesy CODATA International

Other noteworthy sessions attended by BM are summarised below.

"Semantic Enrichment, Metadata and Data Packaging" had a number of relevant talks on metadata and ontology development. Simon Cox enlarged upon his enlightening Keynote Lecture on controlled vocabularies and vocabulary services by describing SKOS (Simple Knowledge Organization System), a W3C recommendation for representation of thesauri, taxonomies and other forms of controlled vocabulary. John Kunze's presentation on "a vocabulary for persistence" described a voting system to allow community consensus to emerge on conflicting definitions of terms. Natasha Simons discussed the issues around open licensing of metadata (for which the CC0 licence was generally preferred). Daniel Foster's "Towards a Frictionless Data Future" emphasised that many researchers need a lightweight standard for characterizing data (for these communities the CIF/STAR approach could prove useful).

"Sustainable Business Models for Data Repositories" included a stimulating economics-based analysis by Cameron Neylon of possible behavioural models for making best use of open data as a common good. Bob Downs described how scientific data centres operated by NASA and Columbia University needed to embrace a "portfolio" approach taking account of existing business models for important stakeholders. Martie VanDeventer emphasised that low- and middle-income countries (LMICs) might not have the capacity to build all aspects of a data infrastructure, and so needed to buy in repository services – and the external suppliers of such services must be sufficiently trustworthy. In return, LMICs had to accept their responsibility to contribute data of real value, and to accelerate their learning path.

International Data Forum

Wednesday 14 September 2016

This was a common day jointly organised by CODATA, WDS and the RDA. It was advertised as a day-long International Data Forum, with the theme "Data for the Public Good: Responsibilities, Opportunities and Dangers in a Data Aware Society", debating potential data-contingent transformations in civil society, government, health, education, and science.

Talks and panel discussions took place within the following session themes:

  • Maintaining scientific rigour and enhancing discovery
  • Open data as a public good and the responsibilities of scientists
  • Data stories in citizen science; earth sciences; Médecins Sans Frontières
  • Responsible openness
  • Data for the public good: a next generation vision

The presentation by Phil Bourne, "Making biomedical research more like Airbnb", was of particular interest, describing a proposed biomedical commons platform to be funded by the NIH as an evolution of its "Big Data to Knowledge" (BD2K) project. Digital objects accessible through this platform needed to comply with the FAIR principles (findable, accessible, interoperable and reusable) championed by the Force11 organisation (and rooted in the "Beyond the PDF" activities that Phil was heavily involved with).

The closing panel discussion with young career scientists was also particularly memorable.

Figure 2. Panel discussion “Data for the Public Good – A Next-generation Vision”. Francine Berman (Moderator); D. Sarah Stamps, Virginia Tech; Henri Tonnang, Global Young Academy; Xiaogang (Marshall) Ma, University of Idaho; Candice Lanius, Rensselaer Polytechnic Institute, Alliance of Digital Humanities Organizations. Courtesy Simon Hodson, @simonhodson99

RDA Plenary 8

Thursday 15 – Saturday 17 September 2016

A very useful event was the Newcomers to RDA information session, comprising a 1½-hour basic description of the RDA’s history, activities and governance. There are over 4000 individual members (IMs), mainly from the USA and Europe; approximately 7% are from Asia/Oceania. On joining the RDA an IM pledges to hold to the RDA’s seven guiding principles, which include a ‘non-profit’ pledge to ‘not promote, endorse, or sell commercial products, technologies or services’. There is also a sizeable set of corporate members comprising funding agencies, data archives, charities and commercial companies (such as Elsevier and Wiley).

JRH joined the RDA as an IM during the meeting and in that capacity joined the following Interest Groups (IGs), whose sessions he also attended during the RDA Plenary: Photon and Neutron Science Data; Chemistry Data; Materials Data; Research Data Archives; and finally Reproducibility of Science.

The Reproducibility IG focused initially on an issue much discussed in the media in recent times, that of work not being reproducible (perceived uniformly as a bad thing), splitting the discussion into three sections: empirical reproducibility; statistical reproducibility; and computational reproducibility. JRH led a discussion of the beneficial aspects of Popperian falsification as a methodology for progress, and indeed as an underpinning philosophical basis of science. The falsification of an important body of previous results can thereby represent the Kuhnian paradigm shift indicative of major progress in science. JRH recommended that a Working Group (WG) be proposed to the RDA Assembly to lay out a 'white paper' on the proper philosophical understanding of scientific progress, as briefly described above, and to describe practical situations where any unreliability of research data would undermine both the philosophical basis of scientific progress and public trust.

The RDA IGs and Working Groups (WGs) that JRH attended had an ad hoc membership rather than reflecting a systematic attempt to populate a topic or challenge with a full range of experts. This contrasted with the CODATA Task Groups, which are more systematic in their membership. The strong presence of the International Scientific Unions at CODATA provides the bedrock for that systematic approach. Nevertheless it was interesting to see a very clearly organised template from the RDA Council for IGs to spawn WGs, approved by the RDA Assembly, to tackle and hopefully remove any 'road blocks' to effective dissemination and sharing of data in any research domain. In turn there was an impressive 'grass roots' energy visible among the attendees at the IG and WG sessions JRH attended.

BM attended RDA sessions on Photon and Neutron Science Data and Chemistry Data with JRH, but also sessions on Data Publishing (Data Usability Certification Services), Metadata (including the development of a Metadata Standards Catalogue) and Legal Interoperability. The last-named has produced a document on "Principles and Implementation Guidelines for the Legal Interoperability of Research Data" which has relevance to the IUCr, and which is being examined with interest by JRH and BM. The metadata groups are relevant to the desire of the IUCr’s Diffraction Data Deposition Working Group to construct a catalogue of metadata standards in crystallography and related fields, and may also have a role to play in the new CODATA Inter-Union Task Group on disciplinary data standards.

Posters, networking and literature

There were separate poster sessions for the CODATA portion (12-13 September) and the RDA portion (15-16 September) of the International Data Week. There were very memorable posters describing:

  • Archiving and reuse of neutron research data at the Oak Ridge National Laboratory neutron facilities
  • The plans for Open Science and research in Finland (Figure 3)
  • The linking of research data to PhD theses in the Netherlands

Figure 3. Plans for Open Science and research in Finland. Courtesy Finland Ministry of Education and Culture. Similar plans exist within the EU’s OpenAIRE project.

Elsevier had a notable presence overall, with staff giving three talks (two within IDW and one at ICSTI) and presenting a nice poster on the research data life cycle, as well as distributing their impressive booklet entitled "Research Elements: Publish Data, Software and Methods in Brief, Citable Articles".

The table of handouts for participants revealed the following interesting developments:

  • A leaflet describing Biosharing, which "enables researchers to make an informed decision as to which standard or database is appropriate".
  • An NIH request for information on metrics to assess the value and impact of biomedical digital repositories.
  • The China Scientific Data journal, "a bilingual open access journal publishing papers in multidisciplinary fields in English and Chinese", published by the Chinese Academy of Sciences.

John R. Helliwell and Brian McMahon