Arising from previous meetings, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a technical method for disseminating descriptive metadata about resources that an institution wishes to advertise or make available. Such resources are typically e-prints (theses, conference proceedings, informal reports, preprints), but might be anything at all (database records, physical museum specimens...). The protocol is designed to allow a client (the "harvester") to query a server (the "data provider") for the metadata formats it supports, and then for individual records or sets of records in the desired formats. As a base level for interoperability, servers are required to provide metadata conforming to the Dublin Core standard, but the negotiation facility allows metadata of arbitrary complexity to be exchanged between capable servers and clients.
Version 2.0 of the OAI-PMH standard has just been released, as a stable release version built on the experience of pilot schemes running the earlier version 1.0.
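The exchange between harvester and data provider can be illustrated with a minimal sketch in Python. The verb names (ListMetadataFormats, ListRecords) and the oai_dc metadata prefix are those of OAI-PMH version 2.0; everything else is an assumption made for illustration: the base URL http://example.org/oai is hypothetical, the function names are invented here, and the sample response is hand-written rather than taken from any real repository. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_request(base_url, verb, **args):
    """Return the GET URL for one OAI-PMH request (illustrative helper)."""
    params = {"verb": verb}
    params.update(args)
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response in oai_dc."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iter("{%s}record" % OAI_NS):
        ident = record.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = record.findtext(".//{%s}title" % DC_NS)
        pairs.append((ident, title))
    return pairs


# Hand-written sample of what a data provider might return (hypothetical data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-10-17</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example preprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "http://example.org/oai"  # hypothetical data provider
    # Step 1: ask which metadata formats the provider supports.
    print(build_request(base, "ListMetadataFormats"))
    # Step 2: ask for records in the mandatory Dublin Core format.
    print(build_request(base, "ListRecords", metadataPrefix="oai_dc"))
    # Step 3: read identifiers and titles out of the response.
    print(parse_titles(SAMPLE_RESPONSE))
```

A real harvester would also need to handle resumption tokens for long result lists and the protocol's error responses, both of which are omitted here.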
It is expected that service providers will emerge who collect the metadata
offered by contributing institutions, and who then may layer value-added
services on top of the harvested metadata to link together and impose
organisation on the distributed resources made visible in this way. One
element of the protocol designed to facilitate this is the
The majority of presentations at this workshop concentrated on the
establishment of institutional repositories of digital documents
interconnecting via the OAI-PMH. In most cases, the institution would
be a University, and the maintainer of the repository the University
Library. In part this is because many of the resources considered suitable
for management in this realm have by tradition been curated by the
libraries. They feel that this new role is one that they are well
qualified to perform, though in practice not all university libraries have
sufficient technical resources at their disposal to implement even the
modest programming requirements.
The most highly developed of the library-based schemes have front-end
applications to allow direct submission of electronic content by the
organising faculty staff; some also have hooks that could permit peer
review and the development of home-grown journal publishing
operations. The DSpace project developed at MIT in collaboration with
Hewlett-Packard is an impressive implementation, and the software engine
will shortly be released as an Open Source package. Other infrastructure
packages, such as the established eprints.org software of the University
of Southampton and the powerful system for document organisation,
translation and serving built over many years at CERN, are also, or will
soon be, available for Open Source download and development. The intention
is to encourage the adoption and use of powerful and reasonably
standardised tools in building properly federated data repositories. The
presentation of the DSpace project drew attention to the need for a
vertical integration by discipline of the horizontally federated
institutional resources, a distinction that was implicit in many other
presentations.
In some of the breakout sessions to discuss general concerns, this
dichotomy again became apparent. One approach towards its resolution was
to charge professional and learned societies with the organisation of
their discipline-specific content through construction and dissemination
of their own controlled vocabularies and relevant metadata formats.
Through its interoperability, the OAI-PMH can certainly support
value-added discipline-specific metadata records; but the learned society
is faced in the first instance with the problem of locating the records in
which it has an interest from the large volume of metadata that is
harvestable across a wide range of providers, much of which may be
incomplete or of a low standard. The answer to this problem seemed to be
the creation of middle-tier service providers to harvest promiscuously and
annotate the records they retrieve, prior to re-export. In fact this is a
similar function to that provided by "traditional" abstracting and
indexing services, the distinction being that such agencies would be ready
(one supposes as a matter of honour) to re-export the metadata that they
had freely harvested.
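The middle-tier role described above (harvest widely, discard unusable records, annotate the rest, re-export) can be sketched as follows. This is only an illustration under stated assumptions: records are represented as plain Python dictionaries rather than OAI-PMH XML, the field names, the completeness rule and the keyword-matching annotation are all hypothetical, and a real service would use a properly curated controlled vocabulary rather than substring matching.

```python
def is_complete(record, required=("identifier", "title", "creator")):
    """Treat a record as usable only if every required field is non-empty."""
    return all(record.get(field) for field in required)


def annotate(record, vocabulary, service_name):
    """Return an enriched copy of the record, or None if no vocabulary term matches."""
    text = (record.get("title", "") + " " + record.get("description", "")).lower()
    subjects = sorted(term for term in vocabulary if term in text)
    if not subjects:
        return None  # outside the discipline of interest
    enriched = dict(record)
    enriched["subject"] = subjects
    # Record where the metadata came from and who annotated it.
    enriched["provenance"] = {
        "harvested_from": record.get("source"),
        "annotated_by": service_name,
    }
    return enriched


def aggregate(harvested, vocabulary, service_name="example-aggregator"):
    """Filter and annotate harvested records, ready for re-export."""
    exported = []
    for record in harvested:
        if not is_complete(record):
            continue  # incomplete or low-standard metadata is dropped
        enriched = annotate(record, vocabulary, service_name)
        if enriched is not None:
            exported.append(enriched)
    return exported


if __name__ == "__main__":
    # Hypothetical records from three hypothetical data providers.
    records = [
        {"identifier": "oai:a:1", "title": "Powder diffraction of zeolites",
         "creator": "A. Author", "source": "http://a.example/oai"},
        {"identifier": "oai:b:2", "title": "",
         "creator": "B. Author", "source": "http://b.example/oai"},
        {"identifier": "oai:c:3", "title": "Medieval trade routes",
         "creator": "C. Author", "source": "http://c.example/oai"},
    ]
    for rec in aggregate(records, {"diffraction", "crystal"}):
        print(rec["identifier"], rec["subject"])
```

Keeping the provenance alongside the added subject terms is one way such a service could re-export what it had harvested while still crediting the original data provider.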
It is difficult for institutions - at least universities - to
self-publish the results of scholarly research as traditional
peer-reviewed articles, since the institution cannot call upon outsiders
to provide the reviewing service. (A manifesto on academic independence by
Professor J.-C. Guedon of the University of Montreal looked ahead to the
days of confederal editorial boards organised by groups of universities -
but while an Ivy League board might flourish, one wonders about the
academic credibility of boards convened by lesser-known and less respected
institutions.) Yet the institutions want all the publications of their
faculty members to be available (free of cost) from their own web
servers. The preferred approach was to retain copyright or negotiate with
journal publishers a copyright waiver that allowed the institutions to
host (and deliver) such articles. While the IUCr, for example, has been
happy to allow authors to mount their published articles on their own web
pages, a consistent policy of allowing free redistribution by federated
institutional servers would severely threaten the journals' basis for
subscription.
It is also clear that in designing their institutional repositories,
universities do not wish to become real publishers and to shoulder the
costs of administering peer review and document markup. Yet they do want
cost-free access to high-quality articles. There is a confidence in some
disciplines that authors are sufficiently skilled in editorial tasks to be
fully entrusted with document markup; hence the SPARC journal Documenta
Mathematica can claim annual production costs for its 700-page journal of
around EUR 200. It is fortunate that its Managing Editor, an academic,
gives his time freely. Of course, such apparent philanthropy hides the
true costs of production, but the community represented here felt this to
be acceptable since the requirement to publish is an integral part of the
academic endeavour.
The open-access archive was, however, seen by Guedon in his elegant essay
as an important development that could restore the openness and continuity
of scientific communication that is sometimes characterised by the
idealised Age of Letters. New web technologies allow and indeed encourage
post-publication feedback and recommendations (the model for this is the
amazon.com retailer site). Indeed, the possibility arises to move away
from the discrete article-based method of contributing ideas to a
distributed forum of discussion towards which any qualified person may
contribute. The model here is of distributed open-source software
development projects, where small and large contributions are made, but
access control and detailed logging permit open scrutiny and evaluation
of the contributions. In this Utopian ideal, extended and continuous
evaluation of open contributions restores to scholars full independence,
unfettered by the controlling power of commercial publishers who provide
resources for the dissemination of information, coupled with a certain
amount of control.
Absent from this meeting was a sense of real concern about the long-term
preservation of (and access to) the resources within disparate autonomous
repositories, though the various national-level funding and coordinating
bodies represented demonstrated that this is increasingly a matter of
concern at national levels, at least in some countries. An innovative role
for "service providers" was seen to be the possibility of polling
open-archive repositories and notifying them when it was deemed
appropriate to migrate the digital objects they held to another format or
representation in an effort to prolong their useful life. Indeed, such
services could in principle perform the migration function and return to
the repository the new representation of the object (presumably at some
cost).
The reluctance of institutional repositories to become fully-fledged
self-publishers left the SPARC (Scholarly Publication and Academic
Resources Coalition) presentations appearing as a complementary rather
than integral initiative. The Budapest Open Access Initiative to
promote new publishing models to allow open access to information also
appeared as a rather loosely-associated development. Both SPARC and the
BOAI do however see the metadata harvesting protocol as an important
technical facilitator of their goals.
The Elsevier Scirus science-centric web search engine was presented as an
application that could certainly exploit OAI-distributed metadata to good
effect, allowing the construction of a science resource finder able to
span both formal journal publications and less formal web documents.
I feel that the IUCr should certainly consider implementing an OAI-PMH
based data server, and perhaps also run harvester software. Among the
possible applications are:
Brian McMahon, CODATA Representative
The presentations of the workshop (including slides and video recordings
of the talks) are available at
http://doc.cern.ch/age?a02333
21 October 2002 - IUCr CODATA Representative - Copyright © International Union of Crystallography