CODATA 2008

Scientific information for society - from today to the future

National Technical University of Ukraine - Kyiv Polytechnic Institute
Kyiv, Ukraine, 5-8 October 2008

[CODATA 2008 opening ceremony] The Opening Ceremony in the Centre of Arts, Plenary Session Hall.
The theme of the 21st International CODATA Conference continued the emphasis on the information society that has emerged in the last few biennial meetings. But if the last conference in Beijing focused on the maturity of CODATA after 40 years of promoting and representing international data science, the 2008 meeting took as its keynote the importance of engaging the younger generation of scientists to lead future developments in a world community increasingly dependent upon information and scientific data.

Plenary lectures

[Bohdan Hawrylyshyn] Bohdan Hawrylyshyn presenting a plenary lecture.
A sombre assessment of the problems that need to be tackled was provided in the plenary lecture of Bohdan Hawrylyshyn (Information and knowledge as a tool in facing global challenges); put bluntly, his thesis was "the world is sick in its main components". Focusing in turn on the primary areas of demography, ecology, economy, geopolitics and the failure of the institutions of society (schools, churches, even family), he provided a reasoned but unsettling picture of the state of the modern world. But the advantage of a rational analysis is that it suggests rational responses, and of course scientific research and analysis can help to fuel sensible and carefully-judged responses. For each aspect of his analysis, he provided pointers to how science could help to address the decisions that needed to be made. It was important that "social wisdom" should grow in an effort to keep pace with rapidly developing technologies; and he cited the example of Scandinavian countries that, in his view, better demonstrated approaches to development that had a sustained emphasis on social justice.

The plenary lecture of Michael Zgurovsky (Interdisciplinary scientific data for sustainable development global simulation) suggested a practical approach to developing analytic tools for modeling sustainable development across the nations of the world. A numeric model of suitable metrics to describe the stability and security of individual nations could be built from a matrix of factors describing performance along the economic, ecologic, social and institutional dimensions - the "sustainable development gauging matrix" (SDGM). Sustainability, based on the UN declaration of 1996, is considered an important metric in characterising economic and political stability, and a number of entries in the SDGM demonstrated how many economically strong nations were still ranked low in terms of security of their societies. It was argued that global modeling in this multidimensional way was important in making well informed policy decisions.

These global views were accompanied by a special presentation from Wataru Iwamoto, representing UNESCO, who described many of the initiatives in which UNESCO is playing a part to promote the information society. These include the promotion of open access or differential pricing for access to scientific information; the development of metadata to facilitate long-term archiving; the promotion of evidence-based decision making in national policies; and the recruitment of young scientists and other workers in these tasks.

While supranational agencies are promoting evidence-based policy making, there is a practical need for high-quality technical structures to support the management and analysis of the large amounts of data involved. In his plenary lecture, "The EGEE infrastructure and its support for European scientific collaboration", Robert Jones described a particular collaborative effort to provide such a structure. EGEE is entering its third two-year phase of operation to provide and increase the capacity of a production computing Grid infrastructure. With support from over 50 countries in and beyond Europe, EGEE includes over 300 sites linked together in a collaborative model. Applications cover many fields, including high-energy physics, earth sciences and life sciences, and the system provides not only high-capacity, highly resilient hardware, but middleware linking contributing centres in coherent "virtual" organisations. It was acknowledged that data management across the various applications is still rudimentary compared with the hardware and middleware provision, but the quality of service provided is very high, and is promoting an enormous amount of new and exciting science. Although the current approach is still project-based, the ultimate goal of EGEE is to provide a long-term sustainable Grid infrastructure throughout Europe and collaborating partners.

In the final plenary lecture, "Curating data? What about curating services and workflows?", Carole Goble presented a complementary approach to linking together complex scientific data-driven inquiries. In the life sciences, over a thousand databases are regularly used by bioinformaticians. They are increasingly disparate in structure and architecture, but are usually accessed through Web services. This allows the construction of workflows that combine, integrate, link, process, derive and curate data resources from any combination of these database sources. The workflows are instantiated as discrete modules within a computational framework, and these modules can be exchanged, extended and linked as required. Workflows do have advantages, in that the choice of modules automatically documents the processes involved in managing data from a number of disparate sources. On the other hand, the individual modules are constantly evolving as living program segments, and so it is essential to capture the particular versions used in any application. Curation of such rapidly changing components is not easy. Neither is validation, especially as a community of authors contributes workflow modules to a common pool. At this stage in their development, workflows are being generated by an active community that is equally active in quality assessment and validation. They are carried along on the wave of enthusiasm for social computing and networking that underlies the "Web 2.0" approach. The "bottom-up" approach to building solutions is in some ways at the opposite pole from the large-scale integrated architecture that can be seen in network infrastructures like EGEE; but it has a real potential to solve problems and perhaps to catalyse the development of a completely new approach to computer-assisted problem solving.

Oral sessions

If the plenary sessions provided an opportunity to state and develop the overall theme of the conference, the multiplicity of parallel sessions provided ample evidence for the diversity of activities embraced by CODATA. Just a few examples of the session topics will illustrate this: information society, global climate change, Grid infrastructure, geophysical data systems and analysis, biodiversity, scientific capacity building, repositories for scientific data, materials: data exchange, nanotechnology, natural disasters and risk, e-science collaboration, International Polar Year, biological and genetics data, etc. The full programme can be reviewed on the CODATA web site. However, a definite disadvantage of so many oral presentation sessions (up to 11 in parallel) is that the interdisciplinary nature of the conference becomes diluted, as each session focuses on a particular discipline, and it is impossible to see at one time how different communities face and tackle the same problems in their different environments. I would certainly recommend that future programme committees reduce drastically the number of parallel sessions, and work harder to ensure that each session explores topics of interest across subject boundaries. There would be merit in expanding greatly the number of posters presented, since there is clearly an enthusiasm for presenting the results of research, and a large poster session can generate much discussion and excitement.

Among the sessions that I attended, almost at random given the choice, were a number of exciting astronomy sessions that reviewed many of the collaborative initiatives contributing to the Virtual Observatory projects characterising much contemporary astronomical work. A keynote talk by George Djorgovski was particularly good at demonstrating how the virtual observatories of astronomy sat within the broader context of e-science. Modern information technology hardware can - just about - keep up with the explosive growth in data volumes (large digital sky surveys currently collect 10 or 100 terabytes of data, and forthcoming ones will collect petabytes; the latest generation of telescopes can collect 30 TB per day). There are now real problems in keeping up with real-time data analysis, and the science is challenged not only by the data volume, but increasingly by its complexity, such as with panchromatic (multi-wavelength) views of the Universe, and the additional computational challenges of simulations. A particular point of note was the increasing reliance on computational modeling, so that computer science is in many areas becoming the "new mathematics" of scientific discovery.

Other astronomy talks covered a range of large-scale observational projects, including Russian, Armenian, Ukrainian and European ventures. There were also discussions of the benefits of common data formats (FITS, VOTables), common interrogation languages and a common data model in unifying the discipline and increasing the synergy of collaborative projects. There was also a very nice presentation by Fabien Chéreau of Stellarium and VirGO, open-source desktop planetarium applications that tap into the large databases of astronomical objects that are openly available, and allow both amateur and professional access to fundamental data.

[D. Grodzinsky] D. Grodzinsky describing the effects from the Chernobyl nuclear reactor fallout.
Two sessions on biological responses to low dose radiation illustrated the rather more mundane, but extraordinarily practical, benefits of careful collection and comparison of data from individual incidents - in this case the widespread exposure of human and other biological populations to radiation from the atomic bomb detonations in Japan and the Chernobyl reactor incident in Ukraine. A number of careful studies were reported, building up a more complete picture of long-term health effects from a (thankfully) very small number of direct observations; and variations in the epidemiology of various forms of leukaemia between the two cases provide an example of the new knowledge that can be gained.

The session on long-term data and knowledge management surveyed a number of large-scale and successful approaches to archiving, such as those of the Earth Sciences Sector of Natural Resources Canada, NASA's Planetary Data System, and the data management and publishing activities of the Canada Institute for Scientific and Technical Information (CISTI). Bob Chen of Columbia University made the very important point that governance and organisational sustainability are at least as important in building durable archives as the technical infrastructure and data storage capacity that is most often discussed. Arrangements to provide long-term archiving for data collected by the Center for International Earth Science Information Network (CIESIN) involve lengthy discussion with Columbia University Libraries to guarantee the preservation of existing data long after CIESIN itself may have disappeared. Other contributions in this session looked at the prospects for peer-reviewed data publication, to confer appropriate academic credit on data generators, managers and analysts, and to provide citable records; and what could be learned from the policies and norms for collaborative production and dissemination of scientific data sets from the activities of the open-source software developer community.

[V. Ezhela] V. Ezhela.
A stimulating session on physical science: data quality and databases, which I was privileged to co-chair with Fedor Kuznetsov, included excellent surveys of science data quality, especially in applied sciences, as managed in China through application of national and international standards (Hu Lianglin); of the extensive and careful programmes of standard reference data evaluation as carried out by the National Center for Standard Reference Data in Korea (Chang Geung Kim, H. S. Suh et al.); of the many important physical databases throughout the Russian Federation (T. Golashvili), in which attention was drawn to the need to differentiate carefully between reference, recommended and standard data values; and in nuclear data science activities in India (S. Ganesan). This latter presentation included a vivid illustration of the importance of continuously updating working practices and associated documentation to reflect revised values of physical data, as failure to do so had led to an over-power transient incident (Level-2 event on the INES scale) in an Indian nuclear reactor. Vigorous attempts to redress this problem, and energetic efforts to practise the highest quality of nuclear science in the power industry, demonstrate a maturity in Indian nuclear science that is reflected in the growth of international collaborative projects in nuclear science and technology, and in high-energy physics. The session also included a warning from a Russian high-energy physicist, Vladimir Ezhela, that physics journals needed to provide full machine-readable copies of numerical measured data as reported in their publications, to allow adequate refereeing and quality assurance. He provided the example of negative eigenvalues in the correlation matrix of certain combinations of the fundamental physical constants that would be obtained if the published values of the constants were used, and not their full-precision values. 
The IUCr of course requires deposition of experimental data to allow numerical peer review (and by the nature of our subject we can conduct most routine validation automatically); in discussions it was suggested that the International Union of Pure and Applied Physics (IUPAP) should be engaged to explore similar policies in physics; or that a CODATA Task Group might be a useful way to approach this. The session concluded with a challenging paper presented by Dong Bong Yang, Gun Woong Bahang and Sang Zee Lee that suggested a new natural units system to define all physical constants as well as the SI units by dimensionless numerical values.
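
The effect Ezhela described can be reproduced with a toy example. The Python sketch below is not his calculation and uses no real physical constants: it assumes three hypothetical quantities x, y and z = x + y + 0.01w (with x, y and w independent and of unit variance), builds their exact correlation matrix, and then rounds the coefficients to two decimal places, much as a printed table might. The function names, the choice of quantities and the rounding precision are all illustrative assumptions.

```python
import math


def symmetric_3x3_eigenvalues(m):
    """Eigenvalues (ascending) of a real symmetric 3x3 matrix.

    Uses the standard trigonometric closed form for symmetric 3x3 matrices.
    """
    a, b, c = m[0][0], m[1][1], m[2][2]
    d, e, f = m[0][1], m[0][2], m[1][2]
    off = d * d + e * e + f * f
    if off == 0.0:
        # Already diagonal: the eigenvalues are the diagonal entries.
        return sorted([a, b, c])
    q = (a + b + c) / 3.0
    p = math.sqrt(((a - q) ** 2 + (b - q) ** 2 + (c - q) ** 2 + 2.0 * off) / 6.0)
    # Entries of the shifted and scaled matrix B = (M - q*I) / p.
    ba, bb, bc = (a - q) / p, (b - q) / p, (c - q) / p
    bd, be, bf = d / p, e / p, f / p
    det_b = (ba * (bb * bc - bf * bf)
             - bd * (bd * bc - bf * be)
             + be * (bd * bf - bb * be))
    # Clamp against rounding error before taking the arccosine.
    r = max(-1.0, min(1.0, det_b / 2.0))
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest  # the trace is preserved
    return [smallest, middle, largest]


def round_matrix(m, digits):
    """Round every entry, as a printed table of coefficients would."""
    return [[round(value, digits) for value in row] for row in m]


# Hypothetical quantities: z = x + y + 0.01*w, so var(z) = 2.0001 and
# corr(x, z) = corr(y, z) = 1 / sqrt(2.0001), while corr(x, y) = 0.
rho = 1.0 / math.sqrt(2.0001)
full_precision = [
    [1.0, 0.0, rho],
    [0.0, 1.0, rho],
    [rho, rho, 1.0],
]
published = round_matrix(full_precision, 2)  # assumed two-decimal table

full_eigs = symmetric_3x3_eigenvalues(full_precision)
rounded_eigs = symmetric_3x3_eigenvalues(published)

print("smallest eigenvalue, full precision:", full_eigs[0])
print("smallest eigenvalue, rounded values:", rounded_eigs[0])
```

With these assumed inputs the full-precision matrix has only positive eigenvalues, whereas the rounded one acquires a small negative eigenvalue and so is no longer a valid correlation matrix. A referee working only from the printed values could not tell whether such a defect came from rounding or from a genuine error, which is the case for depositing machine-readable, full-precision data.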

Finally, the session on data visualization approaches, which promised an interesting variety of examples, was disappointing because many speakers failed to show up. Nevertheless, Jean-Jacques Royer presented to the many local students present an excellent overview of the three-dimensional subsurface modelling carried out by his group at the GOCAD project, University of Nancy. I also demonstrated the IUCr approach to the interactive visualization of data as a feature of online crystallography journal articles.

Awards

[Liu Chuang] Professor Liu Chuang giving the 2008 CODATA Prize Lecture.
The CODATA Prize was awarded this year to Liu Chuang, Professor and Director of the Global Change Information and Research Center at the Institute of Geography and Natural Resources, Chinese Academy of Sciences, who has been very actively involved as Co-Chair of the CODATA Task Group on the Preservation and Archiving of Scientific and Technical Data in Developing Countries, and who served on the ICSU Priority Area Assessment (PAA) Panel on Scientific Data and Information. Her lecture on receiving the CODATA Prize was entitled "A worldwide solution for bridging the digital divide for innovative research and development", and ranged widely over her many activities within CODATA and other organisations to promote archiving, development and innovation. Among the highlights were the creation and active development of the CODATA Task Group on Preservation and Archiving, various workshops, the development of an Open Data policy in China, the presentation by CODATA at the World Summit on the Information Society (WSIS) meeting in Tunis, the Berlin Declaration on open access, and the identification of bridging the digital divide as a strategic goal for CODATA following the ICSU PAA. She concluded by looking forward to the activities of the newly created United Nations Global Alliance for ICT and Development (UN-GAID).

At the same prize-giving ceremony, the Sangster Award 2008 for a young Canadian scientist was awarded to Sabrina Fortin, who subsequently presented a paper on "Normative models to manage collective research resources - from commons to contracts: the case of human populational databases" in the parallel session on biomedical data sharing and informatics.

In this, and in many other ways, the CODATA conference made strong efforts to showcase young talent. A number of presentations were singled out as contributions from young scientists. A Young Scientist Roundtable was held, from which came the idea that a CODATA Working Group should be formed by young scientists, with a longer-term goal of establishing a full Task Group. The idea of a CODATA Prize for Young Scientists was floated. For me, however, the most direct way to reach out to young scientists was to engage directly with the many students and young researchers who were able to attend sessions, and who helped out as part of the local organisation. This was a real benefit of holding such a conference in a university environment, and the cheerful hospitality and enthusiasm of the local students were greatly appreciated, and will not easily be forgotten.

Summary

[CODATA 2008 closing ceremony] Closing ceremony, Academic Council Hall.
As always, I found the CODATA conference a stimulating meeting, providing a useful cross-disciplinary survey of progress in data science. The IUCr has benefited from hearing many of the presentations, and I hope it has also provided stimulus and input to participants through our involvement. I certainly took advantage of many informal opportunities to make new contacts, open up new possibilities for collaboration, and indeed make new friendships. I hope that the next conference will be structured with fewer parallel sessions, in order to maximise the opportunities for exploring interdisciplinary themes, and I also hope that CODATA will continue to value the contributions of the more laboratory-based sciences in emphasising the necessity for quality assurance, critical peer review and proper annotation and management of scientific data.
Brian McMahon
CODATA Representative


Photographs courtesy of the local organising committee.
[Creative Commons By NC licence]