Report on CODATA 2006 Conference

To members of the Electronic Publishing and Database Committees
(copy to IUCr staff for information)

Herewith my digest of the highlights of the recent CODATA meeting in
Beijing.

Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England                   bm@iucr.ac.uk

==========================================================================
     Scientific Data and Knowledge within the Information Society
     ----------------------------------------------------------------

           CODATA 2006 - Beijing, 22-25 October 2006

The 20th International CODATA Conference, held in Beijing, China, 23-25
October 2006, had the title "Scientific Data and Knowledge Within the
Information Society", continuing the emphasis on CODATA's role in the
Information Society expressed by the title of the previous conference
("The Information Society: New Horizons for Science") and the
organisation's involvement in the World Summit on the Information Society
and related activities over the past few years. Also running through the
programme was a sense of CODATA's evaluation of its own purpose,
reflected in a retrospective view of its past 40 years of existence, and
a session looking to its future directions.

CODATA past, present and in the future
--------------------------------------
In the Key Session "CODATA: 40 Years of Bringing Data to the World",
David Lide traced the organisation's early development in a post-war
world of burgeoning scientific research and rapidly developing
experimental apparatus and instrumentation. With the thermodynamicist
Frederick Rossini as the prime mover, and an initial Executive Committee
of six members from the US, France, UK, West Germany, Japan and the USSR,
the expertise of CODATA's founding fathers lay solidly in the realm of
the physical and chemical sciences. From the beginning, however, the
organisation's constitution covered the biological and geosciences, and the
representation of these sciences was already apparent by the time of the
second conference, in 1970. In the early years, of course, the
participants were scientists pure and simple: none would have been
considered "information technologists". But early demonstrations of
computerised data management, again in the second conference, formed the
starting point for the enormous growth in information technologies that
underpin all of today's scientific data management. From his longstanding
involvement with CODATA, Lide felt that its early achievements were: to
provide a necessary forum for scientists from different disciplines to
work together; to foster international cooperation, most importantly
during the Cold War; to reach international agreement on key physical
and chemical data sets; to conduct educational activities; and to
contribute to the computer-based use of data. He felt that efforts to
improve the presentation of published data had been less successful,
relying as they did on voluntary adoption of the CODATA
guidelines. Nevertheless, he believed that these guidelines had indeed
had some beneficial impact on journals; and his advice for future
directions in CODATA was to give high priority to issues of data quality.

John Rumble (winner of the 2006 CODATA Prize and a former president of
CODATA) took up the history of the organisation's development through the
1980s. This period was characterised by: rapid technological
developments, including the rise of the personal computer; the maturing
of scientific data activity; and the globalisation of scientific and
technological data, involving new disciplines, new countries and new
people in a growing community. It also saw the start of the
transformation of scientific and technological data work from the hands of
specialists to those of practically every working scientist. During this
period, CODATA formulated its first strategic agenda, which concentrated
on: critical evaluation and improvement of data quality; the
accessibility and dissemination of data; the structure and format of
data files; the use of computers and telecommunications in data
dissemination; and the propagation of CODATA output. A particular
consequence of this was the formal process for nominating, reviewing and
approving the Task Groups that undertake CODATA's scientific
activities. This period ended around the time that the Internet was
about to be revolutionised by the invention of the World Wide Web.

Krishan Lal brought the history of CODATA up to the present, covering the
period of the last decade and a half, during which the scope of CODATA
has expanded greatly, with vibrant and active Task Groups across a very
wide range of biological, physical, chemical and geosciences, and with
active involvement of scientists from around the world, including the new
rapidly growing economies of Asia, but also a significant number of
developing nations. There is also growing engagement with the human and
social sciences, and a growing awareness of the relationships between
science and society as a whole. These trends have been analysed in a
recent far-ranging ICSU Priority Area Assessment (PAA) review of
Scientific Data and Information, which included a number of
recommendations to CODATA. Among the elements of CODATA's response to the
ICSU PAA review was a commitment to a new mission statement and Strategic
Plan. CODATA had also been an active contributor to the World Summit on
the Information Society, providing the organisation with high-level
international visibility and an opportunity to launch publicly a Global
Information Commons for Science Initiative.

In an accompanying Key Session on the "CODATA Vision of the Future",
Professor Lal and Alexei Gvishiani, the incumbent Vice Presidents of
CODATA, developed further the initiatives and policies that were evolving
as CODATA responded actively to the ICSU PAA and worked on its future
Strategic Plan. Although the state of international data science and
information technologies seems very healthy, there are concerns that
CODATA itself is losing the formal membership of many National Member
organisations. This appears to reflect in part the structural
difficulties of raising funding from national governments for scientific
activities that are supportive of long-term management or development,
rather than project-based; nevertheless, the loss of such National
Members weakens CODATA by reducing the funding that they bring to secure
the financial health of the organisation, together with the loss of their
active voices in setting the policy objectives and the action agenda of
the organisation. There is also concern over losing contact with
developing nations that cannot afford to raise membership dues, and over
the lack of involvement from a number of geographic regions, for a variety
of reasons. It was also important to engage more young scientists in the
work of the organisation, and to build up the Data Science Journal as an
important instrument for developing data science. The new Strategic Plan
would try to address these shortcomings; but it would also be ambitious
and progressive, developing a number of projects to address directly
concerns such as the Digital Divide, universal and equitable access to
data, and the encouragement and exploitation of new scientific and
technological developments.

Neatly straddling the history of CODATA and its plans for the future,
Tony Hey's Keynote Lecture on "e-Science and Cyberinfrastructure"
reviewed several of the latest technical developments in information
handling and management. Many of these were reflected in presentations or
sessions at this conference; they included the astronomical Virtual
Observatory (an astronomy data grid); the Comb-e-Chem project (linking
high-throughput experimental data from a chemistry wet lab to automated
data collection and analysis); crystallographic e-Prints (applying
open-architecture e-publication techniques to provide access to scientific
data sets); Grid middleware services running on top of high-bandwidth
research networks to support a growing number of research projects;
open-access publication sources (ranging from preprint archives through
to fully peer-reviewed journals); and social communication technologies
(RSS feeds, Wikis, blogs), perhaps leading to new forms of "live"
journals. Hey is Vice-President for Technical Computing at Microsoft
Corporation, and explained how Microsoft was becoming active in augmenting
interoperability, whether by releasing its next generation of office
productivity software on top of an open royalty-free file format
specification, or by actively working with the technical developers of
the Open-Archives Initiative to investigate cross-searching across
institutional repositories. Future projects on data integration would
include a web of remote sensors coupled to the geographic information of
the Microsoft Virtual Earth project.

An information commons for science
----------------------------------
A Key Session on the Global Information Commons for Science Initiative
(GICSI) returned to one of CODATA's new initiatives, and Paul Uhlir
(co-author with P. David of the proposal) explained in detail the idea of
an "information commons": digital information, originating principally
from government or publicly funded sources, made freely available for
common use online, either in the public domain or with only limited
rights reserved, and typically organised thematically. The advantages of
an information commons lay in its facilitation of information transfer in
many directions: geographically between North and South (and indeed
between South and South, also promoting capacity building as it did so);
between disciplines, across sectors and amongst institutions. It would
also promote international research and development activities. There are
obstacles to such a commons: it remains necessary to assess and communicate
the values of the commons approach, and to develop adequate
incentives. There are issues of long-term financial sustainability,
legitimate legal restrictions and the need to develop effective technical
and organisational implementation. In practice it would not be possible to
overcome all of these, and compromises would need to be made; but CODATA
should aim to approach this ideal as closely as possible by improving
understanding and awareness of the idea, promoting the broad adoption of
successful models, encouraging and helping to coordinate the efforts
of stakeholders, and establishing an online open access knowledge base. 

An important contributing element to GICSI is the family of
machine-readable rights licences developed by Science Commons and its
parent organisation, Creative Commons. Chunyan Wang of Creative Commons
China described the organisation's activities in China, and pointed out
that the idea of the information commons found a natural home with the
traditional Chinese approach of a society sharing its knowledge, with
reasonable guidelines. One example of a successful implementation in the
area of Chinese science was QiJi, a counterpart to arXiv, which has a
translation project for open-access journals that makes use of the
Creative Commons attribution licences.

John Wilbanks, Director of Science Commons (US), pointed out that such
attribution licences, traceable through their machine-readable expression
in metadata, provided an excellent indicator of the actual re-use of a
scientific idea, one that gave more information about the significance of
a piece of work than existing metrics such as citation index, impact
factor, or even number of downloads.

A wealth of topics
------------------
If the sessions reported on above reflect the broad context of CODATA's
mission and activities, the remainder of the Conference reflected the
full richness of data-centric activities in science and technology in
which CODATA members and their scientific communities are involved. The
conference comprised 4 keynote lectures, 13 Key Sessions, 64 contributed
sessions, and opportunities to present posters. Among the session titles,
chosen almost at random to indicate the diversity of topics, one might
list: Disaster Data; Computational Informatics; Data Role in Promoting
Public Understanding of Science; Solar-Terrestrial Data; e-Science;
Virtual Observatories in the Geosciences; International Polar Year
Activities; Chemical and Physical Data; Bioinformatics/Biodiversity;
Social Science Data Issues. The problem with such a wealth of diverse
topics is that, running in parallel at up to 10 sessions simultaneously,
it is impossible for an individual to take full advantage of the
interdisciplinary opportunities afforded by such a gathering. Hence my
reports of sessions and presentations following do not in any sense offer
a representative cross-section of activities; they are simply the topics
I had a particular interest in, or happened to attend. 

Data archiving 
---------------
A topic of growing concern in recent years has been the long-term
preservation (archiving) and curation of digital data. Chuang Liu
presented the current state of digital preservation in China, a country with
a good record of archiving traditional scholarship (going as far into the
past as 2000 years for a set of geography books). A long-term project of
digital archiving commenced in 2003. As a first step, a project was under
way to survey how many data sets existed that needed to be integrated
into this programme. So far nearly 2500 databases have been identified
in earth sciences, environment, public health and the physical sciences,
comprising about 500 TB of archived data. This is a substantial amount,
but China is preparing itself for the huge data archiving effort to handle
satellite data, gene banks and biodiversity projects that will soon come
on-stream. 

David Giaretta (Digital Curation Centre, UK) reported on early results
from the CASPAR study, an effort to test many of the principles and
techniques that would be required in effective long-term curation
activities. Focusing initially on three test data sets (from astronomy,
cultural-heritage and performing-arts areas), the project was designed to
subject the widely adopted Open Archival Information System (OAIS)
reference model to real-world testing, exploring in particular the
specification of the requirements of the "Designated Community". This is
the entity in the OAIS model that would potentially re-use the archived
data, and which established the granularity of metadata describing the
data that would be necessary to guarantee effective re-use over a very
long period of time, during which there would be inevitable changes
in information technologies and common data handling formats and methods.
Since the OAIS Reference Model underlies many large-scale archiving
initiatives, this rigorous testing programme would appear to be of the
highest importance.

Practical archiving activities were also reviewed in the session
organised by the Task Group on Preservation of and Access to Scientific
and Technical Data in Developing Countries, where a report on the
conferences, publications and activities of the Task Group was followed
by an account of efforts to establish a national database of non-profit
organisations and their activities in South Africa. Also presented in
this session was an enthralling account of a variety of biodiversity and
sustainable development activities in the multi-faceted ecology
of Thailand.

Data and the scientific literature
----------------------------------
In the "Electronic Journal Production" session, publication archives were
described by Newman Yan, who presented China Academic Journals, the
flagship of Chinese e-publishing. This journal aggregator platform offers
over 18 million articles published during the period 1994-2006 from over
7500 titles; an extensive digitisation project has also converted over
seven and a half million articles from 3664 journals published during the
period 1887-1993. The articles are all full-text searchable, indexed by
subject area, viewable in a proprietary document format that offers
greater functionality than PDF, and are accessible through common
library and information standard protocols. The entire collection (of
Chinese-language content only) has over 6000 institutional subscribers
and records over 1.2 billion downloads annually.

Myung-Seok Choi described the KISTI-ACOMS web-based article submission
and review system that is distributed free of charge to 225 academic
societies by KISTI. The service offers modules for journal articles
and for conference proceedings.

Another national service provided to academic societies is the J-STAGE
electronic journal publishing platform in Japan (which currently hosts
CODATA's own publication, the Data Science Journal). More than 330
journals are hosted at present, and features of the service include
full-text searching, linking, pay-per-view and the provision of
COUNTER-compliant usage statistics; OpenURL and OAI-PMH interfaces
are planned.

S. Mitra of the American Physical Society described how the society's
journals continued to flourish in the electronic age; and how they were
becoming increasingly international in a better-connected world. Now only
a third of the papers are from American authors; a third come from
Western Europe and the remaining third from the rest of the world. Among
these other countries, China has demonstrated a dramatic rise in the
number of articles published, although the trend is in line with China's
growth in GDP. Among the challenges facing the journals were the
continuing increase in the number of submissions, coupled with increased
editorial effort to handle articles from authors whose native language
was not English. There is rising competition from other journals,
including open-access journals for which subscriptions were not required.
[On the other hand, subscriptions to APS journals were holding up well
despite the early availability of much of the research they report in
arXiv: this is taken as an indicator that the added-value of peer review
is indeed valued.] There is also the challenge of making scholarly
journals accessible to non-specialists in the field, in order to promote
cross-disciplinary talk.

All of these presentations focused on the traditional role of the
journals as publication vehicles for scholarly research articles,
although the execution of this role was profoundly affected by new
information technologies. But Ed Pentz of CrossRef, the provider of
cross-publisher reference linking services, challenged journals to make
use of the underlying techniques of digital object identifier (DOI) and
handle resource resolver to link not only to articles, but to research
data sets also. As journals and databases converged to common storage,
management and dissemination methods, so linking to, citing and
disseminating of data sets came closer to the publishing model. 
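
As a hedged illustration of the linking model Pentz described, the
sketch below turns a DOI into a resolvable URL, so that an article and a
data set can be linked in exactly the same way. It assumes the public DOI
proxy at https://doi.org/; the function name and both example DOIs are
hypothetical and are not taken from any presentation.

```python
from urllib.parse import quote

DOI_PROXY = "https://doi.org/"  # public DOI resolver (an assumption of this sketch)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI written as '10.xxxx/suffix' or 'doi:10.xxxx/suffix'."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'
    return DOI_PROXY + quote(doi, safe="/")


# Hypothetical identifiers: one for an article, one for a data set.
links = {
    "article": doi_to_url("10.1234/example.article.1"),
    "dataset": doi_to_url("doi:10.1234/example.dataset.1"),
}
```

The point of the sketch is that nothing in the resolution step
distinguishes a publication from a data set: the identifier alone decides
where the link leads.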

Interoperability through cross-disciplinary metadata
----------------------------------------------------
The use of DOIs to supply a permanent and citable identifier for data sets
was also described in the session on "Supporting Sustainable Access to
Scientific Data through Metadata" by Peter Löwe (representing Jens Klump
of the GeoForschungsZentrum Potsdam). The GFZ register DOIs for their
data sets with the Technische Informationsbibliothek Hannover, who
play the same role as a DOI registration agency for scientific data that
CrossRef fulfils for publications. An important motivation behind this
approach is to make data citable, to encourage its recognition in the
academic credit process.

Citing data was also a topic addressed by Chris Lenhardt, of CIESIN, who
described a style guide developed for citing data
(http://gking.harvard.edu/files/cite.pdf),
and a code of good practice for database providers to allow such citations.

Registering identifiers for data sets will certainly facilitate citation,
but discovery and searching require rich metadata describing the data
sets so registered. Xiaolin Zhang (China Digital Science and Technology
Museum) noted the existence of data providers such as e-Bank that were
already exposing metadata describing scientific data sets, and outlined
the requirements for metadata interoperability methodologies, among which
was a proposal for an Open Metadata development project.
These ideas were emphasised in the presentation by Jian Qin entitled
"Metadata as the Underpinning of Sustainable and Effective Access to
Data", which specifically recommended that CODATA involve itself in
projects such as the construction of a metadata directory service, which
would provide an inventory of domain metadata standards.

This was a timely, focused and important session that underlined the
importance of structured cross-domain metadata development to effectively
promote interoperability between data providers. In a final
thought-provoking contribution to the session, Raed Sharif made a plea to
improve interoperability between different native-language communities by
including multilingual descriptions of data sets in their accompanying
metadata. 
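
One way to picture Sharif's suggestion is a metadata record that carries
a description per language and falls back to a default when the reader's
language is missing. This is only a minimal sketch: the class, field
names, language tags and the example record are all hypothetical and do
not correspond to any metadata standard discussed in the session.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    identifier: str
    # Descriptions keyed by language tag (e.g. "en", "fr"); illustrative only
    descriptions: dict = field(default_factory=dict)

    def describe(self, preferred, fallback="en"):
        """Return the description in the first preferred language that is available."""
        for lang in list(preferred) + [fallback]:
            if lang in self.descriptions:
                return self.descriptions[lang]
        raise KeyError("no description available")


record = DatasetMetadata(
    identifier="example-dataset-1",  # hypothetical identifier
    descriptions={
        "en": "Daily rainfall measurements",
        "fr": "Mesures quotidiennes des précipitations",
    },
)
```

A reader asking for French would receive the French text, while a request
for a language not present in the record would fall back to English.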

Challenges in bioinformatics and astronomy
------------------------------------------
An organisation with which we have close ties, and that has already taken
the step of registering DOIs for its data sets, is the Protein Data Bank;
and in the session on "Primary Biological Databases" Helen Berman of the
Research Collaboratory for Structural Bioinformatics (RCSB) described the
Worldwide Protein Data Bank (wwPDB). The member organisations of this
collaboration (RCSB, the European Bioinformatics Institute, PDBj and
BioMagResBank) worked together to maintain a single archive of
macromolecular structural data freely and openly available to the
community. While the RCSB component currently had the responsibility of
acting as the master archive site, data sets were reliably exchanged
among all the members by the use of standard formats, guaranteeing that
each member site was in possession of the same data, but retained the
freedom to provide its own value-added services and interfaces. The wwPDB
was committed to the highest standards of annotation and quality control,
and had just completed a labour-intensive remediation of the entire
archive. 

Other contributions in this session described the activities and services
provided by UniProt, the Universal Protein Resource (Claire O'Donovan);
the EMBL Nucleotide Sequence Database (Guy Cochrane); and the Quality of
Services of the Primary Nucleotide Sequences Databases, particularly the
DNA Database of Japan (Hideaki Sugawara). 

Common to all these presentations was the high level of professionalism
of the organisations, their commitment to data quality, and their belief
in the efficiency of open and unhindered access to authoritative and
properly automated data.

Astronomy is another scientific discipline that has a good record of
collecting, analysing and managing large amounts of data, and in the
session on "Managing Astronomical Data" the presentations by Wenping Chen
and Yongheng Zhao described the challenges of high volumes of data
generated by modern astronomical projects.  Already the Taiwan-American
Occultation Survey (which searches for dim comets by the momentary drop
in brightness of stars in front of which they pass) is generating a few
hundred GB of data every night, which must be processed almost in real
time to allow for coincidence checking. The projected Panoramic Survey
Telescope and Rapid Response System (Pan-STARRS), which will use four 1.4-gigapixel
telescopes, will generate a few TB of data every night. On a slightly
more modest scale, China's new Large Sky Area Multi-Object Fiber
Spectroscopic Telescope (LAMOST) will generate something in the region of
15 GB per night; but considerable thought is already being given to the
requirements for proper archiving of this data, and its integration into
the international Virtual Observatory project.
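
To give a feel for these volumes, the following back-of-envelope sketch
converts a nightly data volume into the average rate at which it must be
handled. Several inputs are assumptions rather than figures from the
presentations: a 10-hour observing night, decimal units (1 GB = 1000 MB),
and 300 GB and 3 TB standing in for "a few hundred GB" and "a few TB".
Only the 15 GB figure for LAMOST comes from the report above.

```python
def sustained_rate_mb_per_s(gb_per_night: float, night_hours: float = 10.0) -> float:
    """Average rate in MB/s needed to keep pace with one night's data (decimal units)."""
    seconds = night_hours * 3600
    return gb_per_night * 1000 / seconds


# Nightly volumes in GB; the first two values are assumed, not reported.
nightly_gb = {
    "TAOS (assumed 300 GB)": 300,
    "Pan-STARRS (assumed 3 TB)": 3000,
    "LAMOST (15 GB)": 15,
}
rates = {name: round(sustained_rate_mb_per_s(gb), 2) for name, gb in nightly_gb.items()}
```

Under these assumptions the three projects span more than two orders of
magnitude in sustained rate, which is why near-real-time processing is a
design constraint for the larger surveys but not for LAMOST.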

Ray Norris (CSIRO, Australia) pointed out that many such large-scale
astronomy projects were exemplary in their management of data; but this
was not necessarily the case throughout astronomy. In the "Astronomer's
Data Manifesto" he and colleagues were challenging the community to
address consistently and thoughtfully the areas where improvements could
be made, such as: the deposit of all data supporting any published
tables, images and spectra; placing in the public domain all data from
publicly funded observatories; incorporating effective data management
policies into plans for new instruments and observatories, addressing
issues of the Digital Divide; preserving legacy data in digital form in
data centres; and working through the IAU with other international
organisations to achieve common goals.

Access and quality
------------------
The session on "Data Access Policy" included a number of practical
applications of open or openly-encouraged access to a variety of data
sources. These ranged from a community-based information system designed
to enhance effective e-government of the widely dispersed communities of
Indonesia (Muhammad Suryanegara, Indonesia), through the free and open
exchange of seismological data promoted by the IRIS Consortium (Ray
Willemann, USA), and a number of aspects of global sustainability,
spanning fields of natural environment, artefacts, social science and
economic data collections (Masaru Yarime, Japan). A presentation by
Robert Clark (University College Dublin, Ireland) reviewed the legal
framework and possible challenges to the free exchange of scientific data
that arose through implementation of new legislation such as the European
Database Directive.

Within this rather disparate collection of topics, the IUCr presentation
on "Improved Reporting of Crystal Structures: the Impact of Publishing
Policy on Data Quality" struck a note of contrast by exploring the
results of applying a specific policy in practice. The IUCr journals have
for many years offered open availability of deposited machine-readable
structural data sets - certainly a good example of data accessibility.
What is relatively new is the effect of enforcing standards of evaluation
for data supporting publications, reinforced by openly documented
algorithms and objective tests through a public service,
checkCIF, that has become adopted as a community standard for
evaluating structural data, whether published or not.  In the context of
many of the other themes and topics of this conference, there were a
number of pleasing resonances. First, the linking of the deposited data
sets with their primary publications is done through digital object
identifiers. Second, journal publication is no longer an inevitable
result for small-unit-cell structure determinations, but the development
of common metadata standards allows interoperability between structural
journals and data collections at repositories such as e-Bank and the
Reciprocal Net. Third, the continuing rapid rise in the number of
structures determined is placing stress on the traditional
subscription-based journal pricing model, and encouraging the IUCr to
explore open-access publication strategies. Fourth, the efforts of the
IUCr to make structural data freely accessible through its journals,
through the activities of affiliated database organisations, and through
relaxation of a subscription levy on educational, nomenclature-based and
some other types of article all fit well into the Global Information
Commons initiative. Finally, the effect of journal policy on evaluating
supporting data sets has indeed had a positive impact on improving the
overall quality of published structures, and is likely to have raised
standards of data quality generally in the field of small-molecule and
inorganic structure determination. In this way, reflecting on David
Lide's survey of CODATA's history, we have nicely repaid CODATA's
investment in early efforts to improve publication quality, and suggest
that our efforts to maintain the highest achievable quality of associated
data sets fit well with Lide's view of CODATA's future priorities.


Summary
-------

Once again the CODATA Conference has brought together an astonishing
variety of topics and speakers covering every aspect of data science.
Over 600 participants made this the largest conference to
date. Abstracts of the presentations are available on the Web at
http://www.codataweb.org/06conf/prog-glance.html, and there are plans
to publish many papers developing the contents of these presentations in
the Data Science Journal. It is sobering to reflect on the enormous number
of productive data centres, and the huge volumes of scientific data they
collect, manage and disseminate in every scientific discipline. At the
same time, it is encouraging to see how the IUCr's information
dissemination activities fulfil many of the requirements that CODATA
identifies for the best possible management and curation of
data. Participation in CODATA continues to be beneficial for the IUCr,
and it is our hope that the best practices of crystallography encourage
and inspire other scientific communities.

==========================================================================
