Scientific information for society - from today to the future
National Technical University of Ukraine - Kyiv Polytechnic Institute
Kyiv, Ukraine, 5-8 October 2008
The plenary lecture of Michael Zgurovsky (Interdisciplinary scientific data for sustainable development global simulation) suggested a practical approach to developing analytic tools for modeling sustainable development across the nations of the world. A numeric model of suitable metrics describing the stability and security of individual nations could be built from a matrix of factors measuring performance along the economic, ecological, social and institutional dimensions - the "sustainable development gauging matrix" (SDGM). Sustainability, based on the UN declaration of 1996, is considered an important metric in characterising economic and political stability, and a number of entries in the SDGM demonstrated how many economically strong nations were still ranked low in terms of the security of their societies. It was argued that global modeling in this multidimensional way was important in making well-informed policy decisions.
These global views were accompanied by a special presentation from Wataru Iwamoto, representing UNESCO, who described many of the initiatives through which UNESCO is promoting the information society. These include the promotion of open access or differential pricing for access to scientific information; the development of metadata to facilitate long-term archiving; the promotion of evidence-based decision making in national policies; and the recruitment of young scientists and other workers to these tasks.
While supranational agencies are promoting evidence-based policy making, there is a practical need for high-quality technical structures to support the management and analysis of the large amounts of data involved, and in a plenary lecture The EGEE infrastructure and its support for European scientific collaboration, Robert Jones described a particular collaborative effort to provide such a structure. EGEE is entering its third two-year phase of operation to provide and increase the capacity of a production computing Grid infrastructure. With support from over 50 countries in and beyond Europe, EGEE includes over 300 sites linked together in a collaborative model. Applications cover many fields, including high-energy physics, earth sciences and life sciences, and the system provides not only high-capacity, highly resilient hardware, but middleware linking contributing centres in coherent "virtual" organisations. It was acknowledged that data management across the various applications is still rudimentary compared with the hardware and middleware provision, but the quality of service provided is very high, and is promoting an enormous amount of new and exciting science. Although the current approach is still project-based, the ultimate goal of EGEE is to provide a long-term sustainable Grid infrastructure throughout Europe and collaborating partners.
In the final plenary lecture, Curating data? What about curating services and workflows?, Carole Goble presented a complementary approach to linking together complex scientific data-driven inquiries. In the life sciences, over a thousand databases are regularly used by bioinformaticians. They are increasingly disparate in structure and architecture, but are usually accessed through Web services. This allows the construction of workflows that combine, integrate, link, process, derive and curate data resources from any combination of these database sources. The workflows are instantiated as discrete modules within a computational framework that can be exchanged, extended and linked as required. Workflows have the advantage that the choice of modules automatically documents the processes involved in managing data from a number of disparate sources. On the other hand, the individual modules are constantly evolving as living program segments, and so it is essential to capture the particular versions used in any application. Curation of such rapidly changing components is not easy. Neither is validation, especially as a community of authors contributes workflow modules to a common pool. At this stage in their development, workflows are being generated by an active community that is equally active in quality assessment and validation. They are carried along on the wave of enthusiasm for social computing and networking that underlies the "Web 2.0" approach. This "bottom-up" approach to building solutions is in some ways at the opposite pole from the large-scale integrated architecture seen in network infrastructures like EGEE; but it has real potential to solve problems, and perhaps to catalyse the development of a completely new approach to computer-assisted problem solving.
If the plenary sessions provided an opportunity to state and develop the overall theme of the conference, the multiplicity of parallel sessions provided ample evidence for the diversity of activities embraced by CODATA. Just a few examples of the session topics will illustrate this: information society, global climate change, Grid infrastructure, geophysical data systems and analysis, biodiversity, scientific capacity building, repositories for scientific data, materials: data exchange, nanotechnology, natural disasters and risk, e-science collaboration, International Polar Year, biological and genetics data, etc. The full programme can be reviewed on the CODATA web site. However, a definite disadvantage of so many oral presentation sessions (up to 11 in parallel) is that the interdisciplinary nature of the conference becomes diluted, as each session focuses on a particular discipline, and it is impossible to see at one time how different communities face and tackle the same problems in their different environments. I would certainly recommend that future programme committees reduce drastically the number of parallel sessions, and work harder to ensure that each session explores topics of interest across subject boundaries. There would be merit in expanding greatly the number of posters presented, since there is clearly an enthusiasm for presenting the results of research, and a large poster session can generate much discussion and excitement.
Among the sessions that I attended, almost at random given the choice, were a number of exciting astronomy sessions that reviewed many of the collaborative initiatives contributing to the Virtual Observatory projects characterising much contemporary astronomical work. A keynote talk by George Djorgovski was particularly good at demonstrating how the virtual observatories of astronomy sit within the broader context of e-science. Modern information technology hardware can - just about - keep up with the explosive growth in data volumes (large digital sky surveys currently collect tens to hundreds of terabytes of data, and forthcoming ones will collect petabytes; the latest generation of telescopes can collect 30 TB per day). There are now real problems in keeping up with real-time data analysis, and the science is challenged not only by the data volume, but increasingly by its complexity, such as with panchromatic (multi-wavelength) views of the Universe, and by the additional computational challenges of simulations. A particular point of note was the increasing reliance on computational modeling, so that computer science is in many areas becoming the "new mathematics" of scientific discovery.
Other astronomy talks covered a range of large-scale observational projects, including Russian, Armenian, Ukrainian and European ventures, together with discussions of the benefits of common data formats (FITS, VOTables), common interrogation languages and a common data model in unifying the discipline and increasing the synergy of collaborative projects. There was also a very nice presentation by Fabien Chéreau of Stellarium and VirGO, open-source desktop planetarium applications that tap into the large databases of astronomical objects that are openly available, and allow both amateurs and professionals access to fundamental data.
The session on long-term data and knowledge management surveyed a number of large-scale and successful approaches to archiving, such as those of the Earth Sciences Sector of Natural Resources Canada, NASA's Planetary Data System, and the data management and publishing activities of the Canada Institute for Scientific and Technical Information (CISTI). Bob Chen of Columbia University made the very important point that governance and organisational sustainability are at least as important in building durable archives as the technical infrastructure and data storage capacity that is most often discussed. Arrangements to provide long-term archiving for data collected by the Center for International Earth Science Information Network (CIESIN) have involved lengthy discussions with Columbia University Libraries to guarantee the preservation of existing data long after CIESIN itself may have disappeared. Other contributions in this session looked at the prospects for peer-reviewed data publication, to confer appropriate academic credit on data generators, managers and analysts, and to provide citable records; and at what could be learned from the policies and norms for collaborative production and dissemination of scientific data sets developed by the open-source software community.
Finally, the session on data visualization approaches, which promised an interesting variety of examples, was disappointing because many speakers failed to show up. Nevertheless, Jean-Jacques Royer presented to the many local students present an excellent overview of the three-dimensional subsurface modelling carried out by his group at the GOCAD project, University of Nancy. I also demonstrated the IUCr approach to the interactive visualization of data as a feature of online crystallography journal articles.
At the same prize-giving ceremony, the Sangster Award 2008 for a young Canadian Scientist was awarded to Sabrina Fortin, who subsequently presented a paper on Normative models to manage collective research resources - from commons to contracts: the case of human populational databases in the parallel session on biomedical data sharing and informatics.
In this, and in many other ways, the CODATA conference made strong efforts to showcase young talent. A number of presentations were singled out as contributions from young scientists. A Young Scientist Roundtable was held, from which came the idea that a CODATA Working Group should be formed by young scientists, with the longer-term goal of establishing a full Task Group. The idea of a CODATA Prize for Young Scientists was floated. For me, however, the most direct way to reach out to young scientists was to engage directly with the many students and young researchers who were able to attend sessions, and who helped out as part of the local organisation. This was a real benefit of holding such a conference in a university environment, and the cheerful hospitality and enthusiasm of the local students were greatly appreciated, and will not easily be forgotten.
Photographs courtesy of the local organising committee.