Feature article

Update on Protein Data Bank activities

The first annual report for the Protein Data Bank (PDB) describes the purpose and functions of the PDB, details accomplishments during the first year of Research Collaboratory for Structural Bioinformatics (RCSB) management, and outlines plans for 2001. As the sole international repository for three-dimensional structure data of biological macromolecules, the PDB is an important resource for the entire life-science research community. The RCSB’s mission is to enable new science. Each of the RCSB partner sites contributes to the operation and development of the PDB: Rutgers, data deposition and processing; the Nat’l Inst. of Standards and Technology (NIST), data uniformity, exploring issues specific to nuclear magnetic resonance (NMR), and data archiving; and the San Diego Supercomputer Center (SDSC) at the U. of California, San Diego (UCSD), data query, reporting, and distribution. The RCSB has made great strides in enhancing the PDB due to its unique personnel, hardware, software and network infrastructure. The seamless transition of the PDB from Brookhaven Nat’l Laboratory was completed three months ahead of schedule, a large number of files were processed with a rapid turnaround time, legacy data were reprocessed and cross-referenced to ensure reliability, and an average of 90,000 hits per day have been accommodated by the main PDB web site alone. Plans for the future are to enhance the PDB with new capabilities, including a higher, faster throughput of deposited data; a greater number of query capabilities, including more complex and accurate queries; and a more uniform archive.

Data Deposition

From July 1999 to June 2000, 2292 structures were deposited at the RCSB, and 468 backlog entries and 456 'layer 1' entries were processed. The deposition rate during the period averaged 44 per week, and the average time to fully process an entry was less than 12 days. Of the depositions, 81% are from X-ray experiments and 15% are from NMR experiments. By region, 61% of the data deposited are from North America, 25% from Europe, 11% from Asia, and 2.4% from Australia. Proteins make up 89% of the depositions, while 11% are nucleic acids. Of the structures deposited, 20% were indicated by the author to be held until a particular date, 57% were indicated to be held until the publication of the corresponding article, and 23% were indicated to be released immediately.
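As a rough check on the figures above, the short Python sketch below recomputes the weekly deposition rate and the approximate number of entries behind each release-status percentage. It uses only numbers quoted in the report. Treating the 12-month period as 52 weeks is an assumption, and the entry counts are implied estimates rather than figures from the report.

```python
# Figures quoted in the report for July 1999 to June 2000.
depositions = 2292
weeks_in_period = 52  # assumption: the 12-month period is treated as 52 weeks

# Average number of depositions per week over the period.
weekly_rate = depositions / weeks_in_period

# Release-status shares indicated by depositors, in percent.
release_status = {
    "hold until a set date": 20,
    "hold until publication": 57,
    "release immediately": 23,
}

# Approximate entry counts implied by each share (rounded to whole entries).
implied_counts = {
    status: round(depositions * share / 100)
    for status, share in release_status.items()
}

print(f"Average depositions per week: {weekly_rate:.1f}")
for status, count in implied_counts.items():
    print(f"{status}: about {count} entries")
```

The computed rate rounds to the 44 depositions per week stated in the report, and the three release-status shares account for all depositions.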

A popular PDB feature, the Validation Server (ADIT), allows depositors to check a structure at any time during structure determination and refinement. It checks the format consistency of coordinates during the Precheck step, and creates validation reports about a structure before deposition using ADIT (http://pdb.rutgers.edu/). Once deposited, entries are processed to completion, returned to the author for review, and released on the PDB site (www.pdb.org/) and its mirrors. The PDB staff continues to enhance and upgrade the capabilities of the PDB searching and reporting tools. As part of the Data Uniformity project, PDB members have curated the R-factor, resolution data, and primary citation data for all entries in the PDB, and have incorporated this information into the database. These fields are available for improved searching, and the updated data are available via database reports.

Data Distribution

The site’s 'Get Educated' page includes an introduction to proteins for general audiences and materials for undergraduates on topics such as nucleic acids, principles of protein structure, and electron microscopy. Tutorials are available on how to query the PDB and on how to use two popular molecular graphics viewing programs, RasMol and the Swiss-PDB Viewer (Guex, Peitsch 1997). Links are frequently added to this resource, which also includes papers on the PDB, animated presentations about the PDB, and VRML 'protein documentaries' developed by students. The electronic help desk at [email protected] is available to answer all types of questions about the PDB, usually within a 24-hour period.

Other developments in query and reporting include expanded ligand searching and reporting capabilities, improved access to dynamic links using the Molecular Information Agent (http://mia.sdsc.edu), the accurate query of enzymes, the incorporation of cross-links to sequence databases, and improved graphics options. The PDB can now be queried based on source, by number of chains, and by the availability of experimental data.

Each month a key biological molecule is profiled as the Molecule of the Month. Beautiful images of the molecule are provided by D. Goodsell of the Scripps Research Inst. and featured on the PDB home page; links provide additional information about the structure and function of the molecule at a general level.

Usage, which grew in the initial months of operation, has now leveled off at about 90,000 Web hits per day and 70,000 PDB files downloaded per day.

The RCSB PDB Team (left to right): Phoebe Fagan, Dorothy Kegler, Haiyan Cheng, John Westbrook, Zukang Feng, Phil Bourne, Gary Gilliland, Diane Hancock, T.N. Bhat, Brad Kroeger, David Padilla, Victoria Colesh, Helge Weissig, Narmada Thanki, Gnanesh Patel, Bohdan Schneider, Helen M. Berman, Nita Deshpande, Wolfgang Bluhm, Kyle Burkhardt, Lisa Iype, Ward Fleri, Christine Zardecki, Tammy Battistuz.

The RCSB-Rutgers site also maintains two other help desks: [email protected], for general deposition and processing questions; and [email protected], for ADIT information. Furthermore, the [email protected] discussion list provides a forum for users to interact and collaborate. The PDB staff attends conferences, hosting exhibit booths, demonstrations, and user group meetings to gather feedback from the community and to provide information about the PDB’s capabilities and growth. In order to better serve the needs of the scientific community, the RCSB is collaborating with the BioMagResBank (BMRB), the Cambridge Crystallographic Data Centre (CCDC), the European Bioinformatics Inst. (EBI), the Inst. for Protein Research, Osaka Univ., the Nat’l Ctr for Biotechnology Information (NCBI), the NCI-Frederick Cancer Research & Development Foundation, and the Swiss Inst. for Bioinformatics/Glaxo.

It is estimated that the PDB could grow to approximately 35,000 structures by 2005, nearly tripling its size. A major factor in this growth is structural proteomics, the determination of the structures of as many of the proteins as possible, in the shortest time possible. This increased volume will present a challenge to the PDB. As technology advances, the PDB’s user base will also expand. In order to accommodate this demand, the RCSB plans to enhance the robustness of the PDB’s query capabilities. The RCSB is proceeding with the next phase of archiving the physical data, which involves scanning and electronically storing all documents associated with the PDB. Data uniformity work will continue by focusing on structure classification, compound records, chain ID fields, refinement parameters, coordinates, sequence records, and the biological unit.
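To give a sense of scale for the projection above, the Python sketch below works out what "nearly tripling" to 35,000 structures would imply. The report gives only the 2005 estimate. The growth factor of exactly 3, the five-year horizon from 2000, and the compound-growth model are all assumptions made for illustration, so the outputs are rough bounds rather than reported figures.

```python
# Projection quoted in the report.
projected_2005 = 35_000

# Assumptions (not from the report): "nearly tripling" is treated as a
# factor of exactly 3, over five years, with steady compound growth.
growth_factor = 3.0
years = 5

# Archive size implied by the assumed factor; since the report says
# "nearly" tripling, the actual size in 2000 would be somewhat larger.
implied_current = projected_2005 / growth_factor

# Constant yearly growth rate that yields the assumed factor over the period.
annual_growth = growth_factor ** (1 / years) - 1

print(f"Implied archive size in 2000: about {implied_current:,.0f} structures")
print(f"Implied compound growth: about {annual_growth:.1%} per year")
```

Under these assumptions the archive would need to grow by roughly a quarter each year to reach the projected size.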

(taken from the PDB annual report July 1999-June 2000)

IUCr Newsletter
