PDBase: A Relational Database Expression
of Macromolecular Structure Data

The OpenMMS Toolkit
Release V_1_5_1

Introduction

PDBase is a toolkit for mapping the mmCIF ontology onto a relational database. Additional columns described below have been added the the database tables to make searching through the data fast and efficient.

This document describes principles used in PDBase schemas and provides a few simple example SQL queries that highlight some of the ways in which the data may be searched. Whenever possible these queries were designed to be compatible with the SQL-92 standard. However in some cases, vendor specific commands are needed. These cases are noted in the text where appropriate.

Setting Up the Database and Loading the Tables

Vendor Specific Commands

The procedures for setting up an DB account, running the command line interface and describing the table metadata vary between database vendors.

Some PDBase documentation is available concerning commands specific to particular vendors:

Oracle

Different vendors also sometimes use different keywords to describe the same data type. For example, to define large text strings such as those required for the PDB_REMARK fields, Oracle uses the keyword "LONG" while Sybase uses the keyword "TEXT". For this reason, a slightly different set of commands is required to create the schema for different vendors. SQL scripts to create the MmsDB schema are available for the following vendors:

Initialization

Once one of the above scripts has been used to create the schema, the MMS system tables must be initialized by running the script InitMms.sql. This script assigns category and item keys and loads the category and item definitions from the mmCIF dictionary into the database.

After the schema has been initialized, entries can be read from mmCIF data files and loaded into the database using the main program defined in PDBase.java (in the package org.rcsb.openmms.apps.rdb). This program is run by the script run/unix/pdbase.

Several indices are defined in the schema to make retrievals and queries run faster. However when loading a large number of structures it is usually a good idea to remove the indices before loading and then create them again afterward. For this purpose the scripts DropIndices.sql and CreateIndices.sql are also automaticly generated from the metamodel.

The PDBase System Tables

PDBase uses five MMS systems tables to store information from the mmCIF dictionary, keep track of entries that have been loaded and record which categories are present in each entry. MMS system tables are identified by using the prefix "MMS_" in the table name.

MMS_SYSTEM

The MMS_SYSTEM table will usually only contain a single record containing the data_bank, model and version number that can be used by SW to check that a compatible PDBase system has been installed and initialized.

MMS_ENTRY

This table contains a record for each Entry that has been loaded into the database. Each record contains the PDB ID, the entry_key, the deposition date and the date the Entry was loaded. As described below, the integer entry_key is used to identify the Entry (unique PDB_ID) in all of the data tables.

MMS_CATEGORY

The MMS_CATEGORY table contains a record for each mmCIF category that is used in this database. The record contains the mmCIF category name, the corresponding table name, an integer category_key (unique identifier) and the mmCIF description of the category.

MMS_ITEM

This table contains a record for every mmCIF item type in every category that is used in this database. Each record contains the mmCIF item name, the category_key for the table in which this item occurs, the corresponding column name for the item, an integer item_key and the mmCIF description for the item.

MMS_ENTRY_CATEGORIES

Records in the MMS_ENTRY_CATEGORIES table are used to describe the many-to-many relationship between entries and categories. Each record in this table describes a category that is present in a particular data file. The entry_key column is a foreign key into the MMS_ENTRY table that identifies the Entry. The category_key column is a foreign key into the MMS_CATEGORY table that identifies the Category. A length column provides the number of records in that category that are present in a particular datafile. The length field is set equal to one for categories that are not inside an mmCIF "_loop" construct.

Some Simple SQL Queries Examples

The following examples queries illustrate how to retrieve information from the MMS data tables and the five PDBase system tables described above. The first several examples use only the MMS system tables and are provided to help illustrate the PDBase schema design.

Each example query is preceded by a simple English question it is intended address. The sample queries themselves as they would be entered into a SQL command line interpreter are shown in Bold Typeface. The results that are generated are shown in Monospaced Typeface For convenience some of the longer SQL queries are contained in a file with the filename noted in parentheses after the query text.

Although only the results using an SQL command line interpreter are shown here, essentially these same commands could be used for numerous SQL programming language interfaces, many of the commercial tools for dynamic web based queries, and many Java packages such as JSP and JDBC.

How do I find the the entry_key for a given PDB ID?

The integer entry_key's are used to make searching the database fast and efficient. The following command retrieves the appropriate record for PDB_ID = 1rv5.

Note that all string constants such as '1rv5' must be surrounded by single quotes.


select entry_key from MMS_ENTRY where id='1rv5';


 ENTRY_KEY
----------
       492

Note that all identifiers ending with a "_key" suffix e.g. entry_key are assigned sequential integer values as data is loaded into the database. These provide an consistent and efficient method for searching and cross-referencing between tables within any given PDBase database. However, they will not in general have the same value across different instances of the database.

What mmCIF categories are in the database? What is the table name for each category and what is corresponding category_key?

In most cases the table name is just the upper case version of the category name, but since the STAR format permits characters in names that are not allowed in SQL, table and column names two separate columns must be used. Note that ellipses "..." are used in the output listings to denote missing sections of long results.


select id, table_name, category_key from MMS_CATEGORY;


ID                               TABLE_NAME                       CATEGORY_KEY
-------------------------------- -------------------------------- ------------
atom_site                        ATOM_SITE                                   0
atom_site_anisotrop              ATOM_SITE_ANISOTROP                         1
atom_type                        ATOM_TYPE                                   2
chem_comp                        CHEM_COMP                                   3

...

database_PDB_tvect               DATABASE_PDB_TVECT                        137
computing                        COMPUTING                                 138
software                         SOFTWARE                                  139

How do I find out what columns are in a particular PDBase data table?

The following is a generic SQL-92 command that uses the MMS_ITEM system table to list the "mmCIF-item" columns in a data table. Most relational databases also have vendor specific commands for describing tables that could also be used for this purpose. A vendor specific command to describe the table would also show the "entry_key" and "<table_name>_key" columns that are used for fast and efficient table access.

The following SQL command lists the "mmCIF-item" column names that are present in the DIFFRN_RADIATION data table. Note the use of a subquery in parentheses to determine the category_key from the MMS_CATEGORY table.


select MMS_ITEM.id from MMS_ITEM 
  where category_key =
    ( select category_key from MMS_CATEGORY where id = 'diffrn_radiation');


ID
--------------------------------
collimation
filter_edge
inhomogeneity
monochromator
polarisn_norm
polarisn_ratio
probe
type
xray_symbol

Is there any on-line information available that describes each tables and provides the scientific definition of the data found in each of the columns?

Yes. The _category.description and _item_description definitions from the mmCIF dictionary are stored in the MMS_CATEGORY and MMS_ITEM tables.

The following query retrieves the general description of the "diffrn_radiation" table.


select description from MMS_CATEGORY where id='diffrn_radiation';


DESCRIPTION
--------------------------------------------------------------------------------
              Data items in the DIFFRN_RADIATION category describe
              the radiation used in measuring diffraction intensities,
              its collimation and monochromatisation before the sample.
...

This query retrieves the description of the "monochromator" column in the diffrn_radiation table.


select MMS_ITEM.description from MMS_ITEM
   where id='monochromator' and category_key =
     ( select category_key from MMS_CATEGORY where id = 'diffrn_radiation');

sql_examples/ColumnDescription.sql


DESCRIPTION
--------------------------------------------------------------------------------
              The method used to obtain monochromatic radiation. If a mono-
              chromator crystal is used the material and the indices of the
              Bragg reflection are specified.

In terms of the number of atoms it contains, what is the smallest, largest and average size all entries in the database?

Note that the results given here were based on a query from a database containing about 1500 entries.


select min(len), max(len), avg(len) from MMS_ENTRY_CATEGORIES
    where category_key =
    ( select category_key from MMS_CATEGORY where id='atom_site' );

sql_examples/MaxAtoms.sql


  MIN(LEN)   MAX(LEN)   AVG(LEN)
---------- ---------- ----------
        55      69188 3385.53649

How can I find the PDB ID's for the smallest entries (less that 100 atoms) along with how many atom they contain?

Note that when doing "joins" with multiple tables it is essential to include the relevant constraints. In this case:
"MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key".


select id, len from MMS_ENTRY, MMS_ENTRY_CATEGORIES
    where len < 100 and MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key
    and category_key =
    ( select category_key from MMS_CATEGORY where id='atom_site' ) ;

sql_examples/SmallEntries.sql


ID                                      LEN
-------------------------------- ----------
1dn8                                     91
279d                                     55
415d                                     66
416d                                     55

How do I find the PDB ID's for entries that contain a particular mmCIF category of interest?

The following query whose form is similar to the one given above, can be used to find those entries that contain information about non-crystallographic symmetry operations.


select id, len from MMS_ENTRY, MMS_ENTRY_CATEGORIES
    where MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key
    and category_key =
    ( select category_key from MMS_CATEGORY where id='struct_ncs_oper' ) ;

sql_examples/FindCategories.sql


ID                                      LEN
-------------------------------- ----------
1a34                                     14
1a3q                                      2
1a74                                      1
1a9n                                      3
...

How do I determine the set of values a column contains and how often each value occurs?

A SQL "GROUP BY" clause can be used to group records together that have the same value. The COUNT(*) aggregate function can be used to report how many times this value occurs. For example the following query determines which refinement software is used and how often. The Oracle COLUMN command below is used to truncate the text output to 32 characters to make the result more readable.


column refinement_software format a32 trunc;
select count(*), structure_refinement as refinement_software
  from computing
  group by structure_refinement ;

sql_examples/WhatMethods.sql


  COUNT(*) REFINEMENT_SOFTWARE
---------- --------------------------------
         1 ?
        57 CNS
         1 CNS 0.1
         4 CNS 0.3
         1 CNS 0.3A

...

         7 SHELXL-96
        55 SHELXL-97
         1 SHELXL93
        44 TNT
         1 TNT & CNS
        87 TNT 5-D
         1 TNT 5-E
         4 TNT 5D
         2 TNT 5EB
       212 X-PLOR
         1 X-PLOR 0.3
         5 X-PLOR 3.0
        85 X-PLOR 3.1
...

How do I find entries with citations by a particular author?

The query given below will find all the entries with "Bancroft" in the author field. The percent sign '%' is a SQL-92 wildcard that matches any string of zero or more characters. The underscore '_' can be used as a wildcard to match any single character. Because the wildcard is used in the query, multiple authors with different initials are reported. Note that the keyword "like" is used rather than the equal sign in declaring the string matching constraint.


column PDB_ID format a10 trunc;
column name format a40 trunc;
select MMS_ENTRY.id as PDB_ID, name
    from MMS_ENTRY, CITATION_AUTHOR
    where MMS_ENTRY.entry_key=CITATION_AUTHOR.entry_key
    	and CITATION_AUTHOR.name like '%Bancroft%'
    group by MMS_ENTRY.id, name ;

sql_examples/Citation.sql


PDB_ID     NAME
---------- ----------------------------------------
131d       Bancroft, D.
1cwp       Bancroft, J.B.

How can I get a list of the polymer entities contained in a particular protein?

The query below lists the entities in PDB_ID '1a5h' that have type polymer. Note that an entity may occur multiple times in the asymmetric unit and that information is stored in the STRUCT_ASYM table.


column PDB_ID format a10 trunc;
column ENTITY format a10 trunc;
column DETAILS format a50 trunc;
select MMS_ENTRY.id as PDB_ID, ENTITY.id as ENTITY,
	 ENTITY.details as DETAILS
    from MMS_ENTRY, ENTITY
    where MMS_ENTRY.entry_key=ENTITY.entry_key
	and ENTITY.type='polymer' and MMS_ENTRY.id = '1a5h';

sql_examples/Polymers.sql


PDB_ID     ENTITY     DETAILS
---------- ---------- --------------------------------------------------
1a5h       1           Protein chain: C, D
1a5h       2           Protein chain: A, B

How do I find the center of a molecule?

The following SQL command prints the number of atoms and the mean x,y and z coordinate position for all of the atoms in the entry "1hwt".


column PDB_ID format a10 trunc;
select MMS_ENTRY.id as PDB_ID, count(*) as atoms,
	avg(cartn_x), avg(cartn_y), avg(cartn_z)
    from MMS_ENTRY, ATOM_SITE
    where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
	and MMS_ENTRY.id = '1hwt'
    group by MMS_ENTRY.id ;

sql_examples/Center.sql


PDB_ID          ATOMS AVG(CARTN_X) AVG(CARTN_Y) AVG(CARTN_Z)
---------- ---------- ------------ ------------ ------------
1hwt             4062   10.8431743   43.0713272   23.6939114

How do I print a table with an estimated size for every molecule in the database?

Note that this query examines every atom record in the database and will take some time to finish. The "group by" clause is used to compute the aggregate functions (min and max) over each entry.


select MMS_ENTRY.id, count(*) as atoms,
	sqrt(power(max(cartn_x)-min(cartn_x),2) +
	power(max(cartn_y)-min(cartn_y),2) +
	power(max(cartn_z)-min(cartn_z),2)) as Diagonal_3D
    from MMS_ENTRY, ATOM_SITE
    where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
    group by MMS_ENTRY.id ;

sql_examples/Diagonal.sql


...

ID                                    ATOMS DIAGONAL_3D
-------------------------------- ---------- -----------
1eyg                                   4244   108.25681
1eyu                                   3400  100.228327
1f1t                                    860  68.5101871
1f3i                                   4808  138.651836
1f5t                                   5621  188.478184
1f69                                    262  36.9467493
1fdg                                    271  45.8646174
1ffk                                  64281  352.701736
1ffy                                   9386  176.260785
1ffz                                  10725  226.875971
1fg0                                  10704   227.42758
1fix                                    484  51.7029837
1fjf                                  51927  338.826822

...

What table is used to store the residue sequence?

The ENTITY_POLY_SEQ table is used to record the sequence of monomers in a polymer. The following query ranks all of the entries in the database by the number of Cysteine residues they contain.

Note that this query only counts the residues in each entity once, even though the entity may occur mulitple times in the asymmetric unit.


column PDB_ID format a10 trunc;
select MMS_ENTRY.id as PDB_ID, count(*) as CYS_RESIDUES
    from MMS_ENTRY, ENTITY_POLY_SEQ
    where MMS_ENTRY.entry_key=ENTITY_POLY_SEQ.entry_key
	and ENTITY_POLY_SEQ.mon_id='CYS'
    group by MMS_ENTRY.id 
    order by CYS_RESIDUES ;

sql_examples/CountResidues.sql


...

PDB_ID     CIS_RESIDUES
---------- ------------
1a8e                 16
1bmv                 16
1a8f                 16
1a9x                 17
19hc                 18
1dsz                 18
2nll                 19
1ffk                 21
1a8x                 22
1a4y                 38

How do I find which entries that contain "special" elements?

"Special" is of course a subjective term and for the purposes of this query simply means any element which is not Hydrogen, Carbon, Nitrogen, Oxygen, Phosphorus or Sulfur. Some entries such as PDB_ID '1dcr' shown in the results below contains multiple types of metals. In such cases, the query given lists the entry multiple times with each special element on a separate row along with the number of atoms of that type.

Note that this query must also examine every atom record and may take some time to run.


column PDB_ID format a10 trunc;
column ELEMENT format a10 trunc;
select MMS_ENTRY.id as PDB_ID, ATOM_SITE.type_symbol_id as ELEMENT,
	count(*) as ATOMS
    from MMS_ENTRY, ATOM_SITE
    where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
	and type_symbol_id not in ('H', 'C', 'N', 'O', 'P', 'S')
    group by MMS_ENTRY.id, type_symbol_id ;

sql_examples/Elements.sql


...

PDB_ID     ELEMENT         ATOMS
---------- ---------- ----------
1dcr       BR                  2
1dcr       MG                  3
1dcr       NA                  1
1dct       CA                  2
1dct       F                   2
1dcw       NA                  1
1ddn       NI                  8
1ddy       CO                  4
1de9       MN                  2
1dfm       CA                  2
1dfm       SE                 10
1dfu       MG                  5
...

PDBase: A Relational Database Expression of Macromolecular Structure Data

The OpenMMS ToolkitRelease V_1_5_1