Crystallographic data


Scientific Data and Sustainable Development

Stellenbosch, Cape Town, South Africa, 24-27 October 2010

Opening the 22nd International CODATA Conference in Stellenbosch, Cape Town, the CODATA President Krishan Lal described it as the fulfilment of a dream. Having such a meeting in Africa provided an opportunity to bring the work of the organisation to many Africans, and in turn to allow many more scientists from the continent to engage with its operation than had previously been possible. In line with the venue, the main theme of the conference had been chosen to emphasise CODATA's commitment to work vigorously against the Digital Divide, and to foster education and maximum access to scientific data for the benefit of society, all key aspects of CODATA's evolving strategy.

Science and policy

The conference was immediately galvanised by a powerful address from the South African Minister for Science, Naledi Pandor. She described the conference title as 'intriguing', but as expressing a critically important link. She also rallied delegates to action; whilst all were 'accomplished conference-goers', it was not sufficient to formulate good theories of bringing about advances in science and in societal welfare, and then to move on to some different considerations at the next conference. Each meeting should reflect on how well it had actually achieved goals and tackled problems identified in the last one. She identified four key ideas relevant to the link between scientific data and sustainable development: (1) the notion that in the developing world, poverty is aggravated by intellectual poverty; (2) intellectual property rights are of crucial concern in capitalising on things like traditional medicines and indigenous knowledge with due credit to the communities wherein they reside; (3) the setting up of international interdisciplinary networks is imperative to share ideas effectively and to increase capacity; (4) governments and scientific bodies must make strong and directed investment in cutting-edge communications and storage technology, again to increase capacity but also to bring projects through to an effective conclusion.

In a keynote address, Lidia Brito, Head of the Science Policy Division of UNESCO, considered the implications of the new age of data-intensive science in which we live, which is global, diverse, valuable, and complex. Where the previous generation had found it difficult to address many problems because we did not have enough data, today there is sometimes the risk of drowning in an excess of disordered or poorly analysed data. Good and improving global communications facilitated new ways of working - the 'collaboratory' concept that was so beneficial to places like Africa, allowing communication and exchange of services with other countries and regions. Increasingly, scientific practice is becoming not merely interdisciplinary, but truly trans-disciplinary. But the implication of these interconnected complexities is not only that good scientific data are essential in support of policy decision making; those data must be analysed and their implications presented to policy-makers in a comprehensible way. This is not easy - politics and science have different cultures, and the well-delimited uncertainties of scientific data may not match the desire for certainty that a particular policy demands. In the subsequent question-and-answer session, she deprecated the concept of the 'information society', if the phrase was taken to indicate flooding people with information but without the proper knowledge or understanding of these facts. Such a practice is downright unethical. Despite the positive impact in Africa of the 2003/2005 World Summit on the Information Society, UNESCO preferred to turn from the concept of information society to the promotion of a new trend: 'can societies become wise?'

Another keynote address, by Kari Raivio, Vice-President for Scientific Planning and Review of the International Council for Science, laid out ICSU's vision of working in pursuit of the universality of science through providing scientific leadership to world policy makers. The ICSU General Assembly in Maputo had confirmed that ICSU should continue to assert such a strategic leadership role, and towards that end the ongoing structural changes were designed to streamline and facilitate effective use of the limited resources available. These structural changes included the establishment of a World Data System (WDS) to promote cooperation and interoperability between data centres; the establishment of an ad hoc Strategic Coordination Committee on Information and Data (SCCID); and efforts to encourage CODATA in its development of a Strategic Plan. Among the challenges facing ICSU were the problem of increasingly limited public funding and the need to accommodate an expanded repertoire of disciplines, as the role of social sciences and health sciences databases and information services comes increasingly to the fore. The possibility was raised that CODATA and the WDS together might evolve to take over the role currently assumed by the ad hoc SCCID committee.

In an accompanying keynote, Jean-Bernard Minster, Chair of the World Data System Scientific Committee, described how the WDS envisaged by ICSU was actually taking shape. Its organisational structure had been elaborated, criteria for membership formulated, and the process of accepting actual membership applications from the hundred or so bodies that had already expressed interest was about to begin. The mission of the WDS had been formalised as providing an international framework for the long-term stewardship of quality-assessed data and information; interdisciplinary integration; adapting to changing technologies; and implementing a community of excellence.

In a subsequent high-level session, the two candidates for the CODATA Presidency, Guo Huadong and Mikhail Zgurovsky, described their manifestos for leading CODATA through its Strategic Plan to assume the challenges and demands being made of it to become the world leader of data communities within this framework of global science policy. While there were differences of emphasis and approach, they shared many common themes, including the importance of the three pillars of the Strategic Plan - enhancing global commons for better data access; tackling the digital divide through key partnerships with scientific organisations and with key development agencies, nongovernmental organisations, universities, research institutes etc.; and promoting advanced data methods and information technologies for research and education. Both favoured exploring the idea of a CODATA-sponsored International Academy for Data Science, and they set forth various ideas for strengthening CODATA's financial and resource base.

In the final high-level policy presentation, Hilary Inyang, President of the International Society for Environmental Technology, brought the focus of discussion back to Africa, by presenting a framework for a science-based decision support system in development programmes within the continent. His proposals included the establishment of a continental research foundation, fulfilling a similar role to America's National Science Foundation; an expansion and rationalisation of the African Academy of Sciences; the establishment of advisory boards for the sciences; development of an African continental university system to foster world-class universities; the orchestration of support funds from many sources to establish an endowment fund for African science and technology; and the establishment of regional scientific, technical and entrepreneurship parks across the continent.

Data publishing and standards

Two sessions of particular interest to the IUCr touched on the related topics of validating published research against its underlying data; and making the underlying data available as a published work in its own right.

The session Data management to support research integrity - the 'data behind the graph' problem was organised largely as a round-table discussion by Mark Thorley, who is responsible for coordinating and supporting the management of scientific data and information within the UK Natural Environment Research Council (NERC). The discussion was spurred by recent controversies where difficulties in accessing underlying research data sets had raised questions over the integrity of the research, even leading to some loss in public trust of the integrity of the research process. There is some awareness of the importance of this topic by funding bodies: Research Councils UK now requires that underpinning data should normally be preserved and accessible for 10-20 years. I presented the situation in crystallography, where extensive validation is applied by many journals and curated databases to the final data (explaining how published structures were always freely available in standard CIF format, even for articles that were behind subscription-based access); and, increasingly, the final data are validated against the underpinning processed experimental data (structure factors). Progress is also being made in validating the structure factors themselves. There is also growing debate within the community as to the need for, and practicalities of, validating against raw data (typically diffraction images) and providing systematic archival services for those raw data. There was substantial debate over the difficulties in identifying and linking to related data sets within different disciplines, and a recognition that 'one size fits all' solutions were impractical. However, there was some support for a suggestion of abstracting the ideas into a high-level description of workflow and relationships between the key elements in the process (i.e. published articles and supporting data), along the lines of the Open Archival Information System (OAIS) reference model. Such a reference model had effectively provided a common language to different communities needing to archive data, and a similar initiative could usefully identify the important issues and help focus on solving them in the matter of validating research results and linking to the data available for such validation.

The high-level parallel session Data publishing in the context of the ICSU World Data System was introduced by Michael Diepenbroeck, Center for Marine Environmental Sciences, who described the ways in which cross-linking between journal and data archives was routinely carried out in the Pangaea system. Individual data sets were assigned digital object identifiers (DOIs) by the DataCite organisation, and the provenance and descriptive metadata associated with these DOIs provided a solid base for citing the data sets (and thereby providing the potential for accruing academic merit by those responsible for their creation), as well as referencing any literature articles making use of these data sets. Roy Lowry, of the British Oceanographic Data Centre, demonstrated some specific projects linking data, appropriately packaged in a citable form and held by data centres, to articles stored in library DSpace repositories or other subject-based document collections. Again, the glue binding very disparate data collections was typically a DOI registered with DataCite. Jan Brase of the German National Library of Science and Technology, and Chair of the DataCite consortium, provided more information on the system now in place to provide distributed DOI registration services amongst a number of national libraries. My presentation, Integrating data with publication: greater interactivity and challenges for long-term preservation of the scientific record, complemented these efforts to create a data-publishing framework within the diversity of earth and environmental sciences journals and data centres, by showing how crystallography also makes use of DOIs for linking articles to data holdings (for example in the Protein Data Bank). More than that, however, the publishing workflow of IUCr structural journals such as Acta Crystallographica Sections C, E and F allows the article, the structural data sets it describes, and the supporting experimental data from which they are derived, all to be cross-validated, archived and published as components of a single complex information package. I also made the point that the ability to couple data and article so closely provides new potential, such as the ability to interact directly with the structure within the online article, using interactive visualisation technology such as Jmol.

A simultaneous session on Coordinating Data Standards: the Perspective of Scientific Unions included contributions from IUGS, IUGG, IAU and IUPAC (the Unions representing Geological Sciences, Geodesy and Geophysics, Astronomy and Chemistry). A presentation by Lesley Wyborn, A new role for CODATA in harmonizing the data standards activities of the International Science Bodies, gave rise in subsequent discussions to a proposal placed before, and endorsed by, the CODATA General Assembly. This proposal envisaged the establishment of a register of data standards sponsored by or associated with Unions or other scientific societies (to which if necessary DOIs would be assigned by DataCite). The IUCr agreed to participate in this initiative.

Data at risk

Another initiative of this meeting that may be of interest to crystallographers was a series of presentations and a round-table discussion of Data at risk, a term under which the organisers included data sources that are not 'born digital' (such as photographic plates in astronomy), as well as the potential for loss of the large amount of digital information that has been collected in a variety of establishments where there is no active data management policy or the resources to implement one. Also relevant to these concerns is the vulnerability of unsupported software programs (see IUCr leading articles on 'Age Concern' and the transfer of the Crystallographic Software Museum to the care of the IUCr Computing Commission). In crystallography, part of the purpose of the short-format electronic journal Acta Crystallographica Section E was to capture completed structural data that were not being submitted for publication because of the effort involved in working up a full scientific paper; but there is still plenty of anecdotal evidence that large quantities of crystallographic data reside in filing cabinets and magnetic media archives of individual scientists, with no provision for their retention or further use once those scientists have retired.

Other scientific sessions

As with all CODATA conferences, a wide variety of themes and topics is pursued across multiple parallel sessions, and it is impossible to form a complete impression of the areas of greatest interest and importance. Among sessions that caught my attention were:

  • several exploring the structure, function and purpose of the new World Data System, including relevant technical sessions on interoperability and access models;
  • e-Science and e-Infrastructure in many countries and regions, including a Workshop and subsequent presentation session on GRDI2010, a 10-year vision for Global Research Data Infrastructures;
  • sustainable development, the keynote theme of the conference, represented additionally by a couple of presentation sessions, and a panel discussion of CODATA's potential role in promoting the United Nations Millennium Development Goals;
  • aspects of biological, environmental and health data projects, with significant focus on programmes in Africa;
  • Young Scientist presentations and round-table discussions, designed specifically to provide a high profile for the work and aspirations of almost 30 young African scientists across a range of disciplines;
  • reports of the activities of CODATA Task Groups in the preceding biennium.

There were more than 30 poster presentations, and for the first time at a CODATA meeting these were eligible for a prize recognising the best poster, judged in terms of its creativity, scientific merit, appearance and clarity of exposition. The judges also took into account the presenters' performance in 'One-Minute Madness' sessions, where they each had one minute (and one visual aid) to capture the essence of their work, a task that most achieved very impressively. Fittingly, the prize and a high commendation were awarded to two young African scientists.

The full scientific programme can be found on the CODATA Conference web site.

CODATA Prize 2010

The CODATA Prize was awarded to Paul Uhlir, Director of International Scientific and Technical Information Programs at The National Academies in Washington, DC. It acknowledged his extensive work in the areas of scientific and technical data management and policy, and on the relationship of intellectual property law in digital data and information to research and development policy. He has been active within CODATA for many years, leading efforts to challenge national and regional laws that threaten access to scientific data, and sponsoring initiatives such as the Global Information Commons for Science Initiative (GICSI). He is well known as an energetic champion of international cooperation in matters of data archiving, management and intellectual property rights.


In part because of its venue, this was a smaller meeting than other biennial CODATA conferences I have attended (about 260 people attended all or part of it). There was also - rightly - a greater focus on the conference theme and its relevance to the local hosts than is often the case. So this may not have been the most wide-ranging and scientifically varied meeting that CODATA has organised; but it was nevertheless a satisfying showcase of data management and analysis activities across many disciplines and topic areas. And the rare opportunity for so many African scientists to mingle with a large group of their international colleagues will surely have sown the seeds of many ideas, collaborations and friendships that will last for years to come.

Brian McMahon
CODATA Representative
1 November 2010

[Creative Commons By NC licence]