Crystallographic data

Second Workshop on the Open Archives Initiative: gaining independence with e-Prints archives and OAi

Geneva, 17-19 October 2002

The Open Archives Initiative (OAI) is a movement growing out of the success of the arXiv.org preprint server, set up initially for high-energy physics, and out of a desire to disseminate scholarly publications such as theses, conference proceedings and courseware from the web servers of the institutions sponsoring such publications. The possibilities of self-publication of refereed articles are also attractive, especially where library budgets are continually squeezed in the face of ever-increasing subscription rates charged by what are considered high-cost commercial publishers.

Arising from previous meetings, the Protocol for Metadata Harvesting (OAI-PMH) is a technical method for disseminating descriptive metadata about resources that an institution wishes to advertise or make available. Such resources are typically e-prints (theses, conference proceedings, informal reports, preprints), but might be anything at all (database records, physical museum specimens...). The protocol is designed to allow a client (the "harvester") to query a server (the "data provider") for the metadata formats it supports, and for individual records or sets of records in the desired formats. As a base level for interoperability, servers are required to provide metadata conforming to the Dublin Core standard, but the negotiation facility allows metadata of arbitrary complexity to be exchanged between capable servers and clients.
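The exchange between harvester and data provider described above can be sketched as a handful of HTTP requests. The following is a minimal illustration in Python, not a complete harvester: the verbs and argument names (ListMetadataFormats, GetRecord, ListRecords, metadataPrefix, identifier, set, from) are those of OAI-PMH 2.0, but the base URL, the record identifier and the set name are invented for the example.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a data provider; not a real endpoint.
BASE_URL = "https://repository.example.org/oai"


def oai_request(base_url, verb, **arguments):
    """Build the URL of an OAI-PMH request for the given verb."""
    params = {"verb": verb}
    for name, value in arguments.items():
        # "from" is a Python keyword, so callers pass from_ instead.
        params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# 1. Ask which metadata formats the data provider supports.
formats_url = oai_request(BASE_URL, "ListMetadataFormats")

# 2. Fetch one record in the mandatory Dublin Core format (oai_dc).
#    The identifier is an assumption made for illustration.
record_url = oai_request(
    BASE_URL,
    "GetRecord",
    identifier="oai:repository.example.org:1234",
    metadataPrefix="oai_dc",
)

# 3. Harvest a whole set of records changed since a given date.
#    The set name "theses" is likewise hypothetical.
records_url = oai_request(
    BASE_URL,
    "ListRecords",
    metadataPrefix="oai_dc",
    set="theses",
    from_="2002-01-01",
)

for url in (formats_url, record_url, records_url):
    print(url)
```

A harvester able to handle a richer format would simply substitute another metadataPrefix, chosen from those advertised in the ListMetadataFormats response.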

Version 2.0 of the OAI-PMH standard has just been released as a stable version, built on the experience of pilot schemes running the earlier version 1.0.

It is expected that service providers will emerge who collect the metadata offered by contributing institutions, and who may then layer value-added services on top of the harvested metadata to link together and impose organisation on the distributed resources made visible in this way. One element of the protocol designed to facilitate this is the optional "friends" container, which a repository can include in its response to an Identify request to point to related repositories.
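As an illustration of how a service provider might discover further repositories, the sketch below extracts the addresses listed in such a container. It assumes that the container meant here is the optional "friends" description container of OAI-PMH 2.0. The sample Identify response is invented and abbreviated (a real response carries further required elements), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the "friends" container (assumed to be the one meant above).
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"

# Invented, abbreviated Identify response used only for illustration.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-10-19T12:00:00Z</responseDate>
  <request verb="Identify">https://repository.example.org/oai</request>
  <Identify>
    <repositoryName>Example Institutional Repository</repositoryName>
    <baseURL>https://repository.example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>https://theses.example.edu/oai</baseURL>
        <baseURL>https://preprints.example.net/oai</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>"""


def friend_base_urls(identify_xml):
    """Return the base URLs of related repositories named in a response."""
    root = ET.fromstring(identify_xml)
    path = f".//{{{FRIENDS_NS}}}friends/{{{FRIENDS_NS}}}baseURL"
    return [element.text.strip() for element in root.findall(path)]


for url in friend_base_urls(SAMPLE_IDENTIFY):
    print(url)
```

A harvester could add each address found in this way to its own list of repositories to visit.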

The majority of presentations at this workshop concentrated on the establishment of institutional repositories of digital documents interconnecting via the OAI-PMH. In most cases, the institution would be a University, and the maintainer of the repository the University Library. In part this is because many of the resources considered suitable for management in this realm have by tradition been curated by the libraries. They feel that this new role is one that they are well qualified to perform, though in practice not all university libraries have sufficient technical resources at their disposal to implement even the modest programming requirements.

The most highly developed of the library-based schemes have front-end applications to allow direct submission of electronic content by the organising faculty staff; some also have hooks that could permit peer review and the development of home-grown journal publishing operations. The DSpace project developed at MIT in collaboration with Hewlett-Packard is an impressive implementation, and the software engine will shortly be released as an Open Source package. Other infrastructure packages, such as the established eprints.org software of the University of Southampton and the powerful system driving document organisation, translation and serving built over many years at CERN, are also, or will soon be, available for Open Source download and development. The intention is to encourage the adoption and use of powerful and reasonably standardised tools in building properly federated data repositories. The presentation of the DSpace project drew attention to the need for a vertical integration by discipline of the horizontally federated institutional resources, a distinction that was implicit in many other presentations.

In some of the breakout sessions to discuss general concerns, this dichotomy again became apparent. One approach towards its resolution was to charge professional and learned societies with the organisation of their discipline-specific content through construction and dissemination of their own controlled vocabularies and relevant metadata formats.

Through its interoperability, the OAI-PMH can certainly support value-added discipline-specific metadata records; but the learned society is faced in the first instance with the problem of locating the records in which it has an interest from the large volume of metadata that is harvestable across a wide range of providers, much of which may be incomplete or of a low standard. The answer to this problem seemed to be the creation of middle-tier service providers to harvest promiscuously and annotate the records they retrieve, prior to re-export. In fact this is a similar function to that provided by "traditional" abstracting and indexing services, the distinction being that such agencies would be ready (one supposes as a matter of honour) to re-export the metadata that they had freely harvested.
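The annotation step performed by such a middle-tier service might look something like the following sketch, which keeps only those harvested Dublin Core records that match a discipline's controlled vocabulary and adds the matching subject terms before re-export. The vocabulary, the matching rule (simple substring search in title and description) and the sample records are all assumptions made for illustration; a real service would need something considerably more careful.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical controlled vocabulary: phrase to look for -> subject term.
VOCABULARY = {
    "crystal structure": "structure-determination",
    "diffraction": "diffraction-methods",
    "powder": "powder-diffraction",
}

# Invented Dublin Core records standing in for harvested metadata.
RELEVANT = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Powder diffraction study of a layered oxide</dc:title>
  <dc:creator>A. N. Author</dc:creator>
</oai_dc:dc>"""

IRRELEVANT = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Minutes of the library committee</dc:title>
</oai_dc:dc>"""


def annotate(record_xml, vocabulary=VOCABULARY):
    """Add dc:subject terms to a relevant record; return None otherwise."""
    record = ET.fromstring(record_xml)
    searchable = (DC + "title", DC + "description")
    text = " ".join(
        (element.text or "") for element in record.iter()
        if element.tag in searchable
    ).lower()
    terms = sorted({term for phrase, term in vocabulary.items() if phrase in text})
    if not terms:
        return None  # not of interest to this discipline
    for term in terms:
        ET.SubElement(record, DC + "subject").text = term
    return record


annotated = annotate(RELEVANT)
subjects = [element.text for element in annotated.iter(DC + "subject")]
print(subjects)
```

The annotated records could then be offered for harvesting in turn, which is the re-export envisaged above.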

It is difficult for institutions - at least universities - to self-publish the results of scholarly research as traditional peer-reviewed articles, since the institution cannot call upon outsiders to provide the reviewing service. (A manifesto on academic independence by Professor J.-C. Guedon of the University of Montreal looked ahead to the days of confederal editorial boards organised by groups of universities - but while an Ivy League board might flourish, one wonders about the academic credibility of boards convened by lesser-known and less respected institutions.) Yet the institutions want all the publications of their faculty members to be available (free of cost) from their own web servers. The preferred approach was to retain copyright or negotiate with journal publishers a copyright waiver that allowed the institutions to host (and deliver) such articles. While the IUCr, for example, has been happy to allow authors to mount their published articles on their own web pages, a consistent policy of allowing free redistribution by federated institutional servers would severely threaten the journals' basis for subscription.

It is also clear that in designing their institutional repositories, universities do not wish to become real publishers and to shoulder the costs of administering peer review and document markup. Yet they do want cost-free access to high-quality articles. There is a confidence in some disciplines that authors are sufficiently skilled in editorial tasks to be fully entrusted with document markup; hence the SPARC journal Documenta Mathematica can claim annual production costs for its 700-page journal of around EUR 200. It is fortunate that its Managing Editor, an academic, gives his time freely. Of course, such apparent philanthropy hides the true costs of production, but the community represented here felt this to be acceptable since the requirement to publish is an integral part of the academic endeavour.

The open-access archive was, however, seen by Guedon in his elegant essay as an important development that could restore the openness and continuity of scientific communication sometimes associated with the idealised Age of Letters. New web technologies allow and indeed encourage post-publication feedback and recommendations (the model for this is the amazon.com retailer site). Indeed, the possibility arises to move away from the discrete article-based method of contributing ideas to a distributed forum of discussion to which any qualified person may contribute. The model here is of distributed open-source software development projects, where small and large contributions are made, but access control and detailed logging permit open scrutiny and evaluation of the contributions. In this Utopian ideal, extended and continuous evaluation of open contributions restores to scholars full independence, unfettered by the controlling power of commercial publishers, who provide resources for the dissemination of information but couple this with a certain amount of control.

Absent from this meeting was a sense of real concern about the long-term preservation of (and access to) the resources within disparate autonomous repositories, though the various national-level funding and coordinating bodies represented demonstrated that this is increasingly a matter of concern at national levels, at least in some countries. An innovative role for "service providers" was seen to be the possibility of polling open-archive repositories and notifying them when it was deemed appropriate to migrate the digital objects they held to another format or representation in an effort to extend their useful life. Indeed, such services could in principle perform the migration function and return to the repository the new representation of the object (presumably at some cost).
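One way to picture such a polling service is sketched below: it scans the Dublin Core "format" element of each harvested record and reports those objects whose format appears on a watch list. The watch list and its suggested migration targets are purely illustrative assumptions, not a statement about which formats are actually at risk, and the sample ListRecords response is invented.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical watch list: current format -> suggested migration target.
WATCH_LIST = {
    "application/postscript": "application/pdf",
    "application/x-dvi": "application/pdf",
}

# Invented ListRecords response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
        <datestamp>2002-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An old preprint</dc:title>
          <dc:format>application/postscript</dc:format>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:repository.example.org:2</identifier>
        <datestamp>2002-10-02</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A recent thesis</dc:title>
          <dc:format>application/pdf</dc:format>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def migration_notices(list_records_xml, watch_list=WATCH_LIST):
    """Return (identifier, current format, suggested format) tuples."""
    root = ET.fromstring(list_records_xml)
    notices = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(f"{OAI}header/{OAI}identifier")
        for fmt in record.iter(DC + "format"):
            current = (fmt.text or "").strip()
            if current in watch_list:
                notices.append((identifier, current, watch_list[current]))
    return notices


for notice in migration_notices(SAMPLE_RESPONSE):
    print(notice)
```

How the repository would be notified, and who would carry out the migration, are left open here, as they were at the meeting.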

The reluctance of institutional repositories to become fully-fledged self-publishers left the SPARC (Scholarly Publishing and Academic Resources Coalition) presentations appearing as a complementary rather than integral initiative. The Budapest Open Access Initiative (BOAI), which promotes new publishing models to allow open access to information, also appeared as a rather loosely-associated development. Both SPARC and the BOAI do however see the metadata harvesting protocol as an important technical facilitator of their goals.

The Elsevier Scirus science-centric web search engine was presented as an application that could certainly exploit OAI-distributed metadata to good effect, allowing the construction of a science resource finder able to span both formal journal publications and less formal web documents.

I feel that the IUCr should certainly consider implementing an OAI-PMH based data server, and perhaps also run harvester software. Among the possible applications are:

  • By offering metadata records in the PubMed (and other) formats we could optimise the transfer of our metadata to arbitrary linking partners.
  • We could harvest non-published materials such as theses, providing in the first instance a web catalogue of theses in crystallography, subsequently perhaps providing links from article reference lists to theses and other such reports.
  • It could provide a possible route for limited access to databases such as CSD, which do not currently offer web access.
  • More speculatively, it is a technique we might persuade generators of crystallographic data sets (synchrotron laboratories, service crystallography facilities) to adopt so as to auto-catalogue and provide access to such primary data.

Brian McMahon
CODATA Representative