PDBase is a toolkit for mapping the mmCIF ontology onto a relational database. Additional columns described below have been added the the database tables to make searching through the data fast and efficient.
This document describes principles used in PDBase schemas and provides a few simple example SQL queries that highlight some of the ways in which the data may be searched. Whenever possible these queries were designed to be compatible with the SQL-92 standard. However in some cases, vendor specific commands are needed. These cases are noted in the text where appropriate.
The procedures for setting up an DB account, running the command line interface and describing the table metadata vary between database vendors.
Some PDBase documentation is available concerning commands specific to particular vendors:
Different vendors also sometimes use different keywords to describe the same data type. For example, to define large text strings such as those required for the PDB_REMARK fields, Oracle uses the keyword "LONG" while Sybase uses the keyword "TEXT". For this reason, a slightly different set of commands is required to create the schema for different vendors. SQL scripts to create the MmsDB schema are available for the following vendors:
Once one of the above scripts has been used to create the schema, the MMS system tables must be initialized by running the script InitMms.sql. This script assigns category and item keys and loads the category and item definitions from the mmCIF dictionary into the database.
After the schema has been initialized, entries can be read from
mmCIF data files and loaded into the database using the main program
defined in PDBase.java
(in the package org.rcsb.openmms.apps.rdb
).
This program is run by the script run/unix/pdbase
.
Several indices are defined in the schema to make retrievals
and queries run faster. However when loading a large number of
structures it is usually a good idea to remove the indices before
loading and then create them again afterward.
For this purpose the scripts DropIndices.sql
and CreateIndices.sql
are also automaticly
generated from the metamodel.
PDBase uses five MMS systems tables to store information from the mmCIF dictionary, keep track of entries that have been loaded and record which categories are present in each entry. MMS system tables are identified by using the prefix "MMS_" in the table name.
The MMS_SYSTEM table will usually only contain a single record containing the data_bank, model and version number that can be used by SW to check that a compatible PDBase system has been installed and initialized.
This table contains a record for each Entry that has been loaded into the database. Each record contains the PDB ID, the entry_key, the deposition date and the date the Entry was loaded. As described below, the integer entry_key is used to identify the Entry (unique PDB_ID) in all of the data tables.
The MMS_CATEGORY table contains a record for each mmCIF category that is used in this database. The record contains the mmCIF category name, the corresponding table name, an integer category_key (unique identifier) and the mmCIF description of the category.
This table contains a record for every mmCIF item type in every category that is used in this database. Each record contains the mmCIF item name, the category_key for the table in which this item occurs, the corresponding column name for the item, an integer item_key and the mmCIF description for the item.
Records in the MMS_ENTRY_CATEGORIES table are used to describe the many-to-many relationship between entries and categories. Each record in this table describes a category that is present in a particular data file. The entry_key column is a foreign key into the MMS_ENTRY table that identifies the Entry. The category_key column is a foreign key into the MMS_CATEGORY table that identifies the Category. A length column provides the number of records in that category that are present in a particular datafile. The length field is set equal to one for categories that are not inside an mmCIF "_loop" construct.
The following examples queries illustrate how to retrieve information from the MMS data tables and the five PDBase system tables described above. The first several examples use only the MMS system tables and are provided to help illustrate the PDBase schema design.
Each example query is preceded by a simple English question
it is intended address. The sample queries themselves as
they would be entered into a SQL command line interpreter
are shown in Bold Typeface.
The results that are generated are shown in Monospaced Typeface
For convenience some of the longer SQL queries are contained in a
file with the filename noted in parentheses after the query text.
Although only the results using an SQL command line interpreter are
shown here, essentially these same commands could be used for
numerous SQL programming language interfaces,
many of the commercial tools for dynamic web based queries,
and many Java packages such as JSP and JDBC.
The integer entry_key's are used to make searching the database fast and efficient. The following command retrieves the appropriate record for PDB_ID = 1rv5.
Note that all string constants such as '1rv5' must be surrounded by single quotes.
select entry_key from MMS_ENTRY where id='1rv5';
ENTRY_KEY
----------
492
Note that all identifiers ending with a "_key" suffix e.g. entry_key are assigned sequential integer values as data is loaded into the database. These provide an consistent and efficient method for searching and cross-referencing between tables within any given PDBase database. However, they will not in general have the same value across different instances of the database.
In most cases the table name is just the upper case version of the category name, but since the STAR format permits characters in names that are not allowed in SQL, table and column names two separate columns must be used. Note that ellipses "..." are used in the output listings to denote missing sections of long results.
select id, table_name, category_key from MMS_CATEGORY;
ID TABLE_NAME CATEGORY_KEY
-------------------------------- -------------------------------- ------------
atom_site ATOM_SITE 0
atom_site_anisotrop ATOM_SITE_ANISOTROP 1
atom_type ATOM_TYPE 2
chem_comp CHEM_COMP 3
...
database_PDB_tvect DATABASE_PDB_TVECT 137
computing COMPUTING 138
software SOFTWARE 139
The following is a generic SQL-92 command that uses the MMS_ITEM system table to list the "mmCIF-item" columns in a data table. Most relational databases also have vendor specific commands for describing tables that could also be used for this purpose. A vendor specific command to describe the table would also show the "entry_key" and "<table_name>_key" columns that are used for fast and efficient table access.
The following SQL command lists the "mmCIF-item" column names that are present in the DIFFRN_RADIATION data table. Note the use of a subquery in parentheses to determine the category_key from the MMS_CATEGORY table.
select MMS_ITEM.id from MMS_ITEM
where category_key =
( select category_key from MMS_CATEGORY where id = 'diffrn_radiation');
ID
--------------------------------
collimation
filter_edge
inhomogeneity
monochromator
polarisn_norm
polarisn_ratio
probe
type
xray_symbol
Yes. The _category.description and _item_description definitions from the mmCIF dictionary are stored in the MMS_CATEGORY and MMS_ITEM tables.
The following query retrieves the general description of the "diffrn_radiation" table.
select description from MMS_CATEGORY where id='diffrn_radiation';
DESCRIPTION
--------------------------------------------------------------------------------
Data items in the DIFFRN_RADIATION category describe
the radiation used in measuring diffraction intensities,
its collimation and monochromatisation before the sample.
...
This query retrieves the description of the "monochromator" column in the diffrn_radiation table.
select MMS_ITEM.description from MMS_ITEM
where id='monochromator' and category_key =
( select category_key from MMS_CATEGORY where id = 'diffrn_radiation');
(file:
sql_examples/ColumnDescription.sql)
DESCRIPTION
--------------------------------------------------------------------------------
The method used to obtain monochromatic radiation. If a mono-
chromator crystal is used the material and the indices of the
Bragg reflection are specified.
Note that the results given here were based on a query from a database containing about 1500 entries.
select min(len), max(len), avg(len) from MMS_ENTRY_CATEGORIES
where category_key =
( select category_key from MMS_CATEGORY where id='atom_site' );
(file: sql_examples/MaxAtoms.sql)
MIN(LEN) MAX(LEN) AVG(LEN)
---------- ---------- ----------
55 69188 3385.53649
Note that when doing "joins" with multiple tables it is essential
to include the relevant constraints. In this case:
"MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key".
select id, len from MMS_ENTRY, MMS_ENTRY_CATEGORIES
where len < 100 and MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key
and category_key =
( select category_key from MMS_CATEGORY where id='atom_site' ) ;
(file:
sql_examples/SmallEntries.sql)
ID LEN
-------------------------------- ----------
1dn8 91
279d 55
415d 66
416d 55
The following query whose form is similar to the one given above, can be used to find those entries that contain information about non-crystallographic symmetry operations.
select id, len from MMS_ENTRY, MMS_ENTRY_CATEGORIES
where MMS_ENTRY.entry_key=MMS_ENTRY_CATEGORIES.entry_key
and category_key =
( select category_key from MMS_CATEGORY where id='struct_ncs_oper' ) ;
(file:
sql_examples/FindCategories.sql)
ID LEN
-------------------------------- ----------
1a34 14
1a3q 2
1a74 1
1a9n 3
...
A SQL "GROUP BY" clause can be used to group records together that have the same value. The COUNT(*) aggregate function can be used to report how many times this value occurs. For example the following query determines which refinement software is used and how often. The Oracle COLUMN command below is used to truncate the text output to 32 characters to make the result more readable.
column refinement_software format a32 trunc;
select count(*), structure_refinement as refinement_software
from computing
group by structure_refinement ;
(file:
sql_examples/WhatMethods.sql)
COUNT(*) REFINEMENT_SOFTWARE
---------- --------------------------------
1 ?
57 CNS
1 CNS 0.1
4 CNS 0.3
1 CNS 0.3A
...
7 SHELXL-96
55 SHELXL-97
1 SHELXL93
44 TNT
1 TNT & CNS
87 TNT 5-D
1 TNT 5-E
4 TNT 5D
2 TNT 5EB
212 X-PLOR
1 X-PLOR 0.3
5 X-PLOR 3.0
85 X-PLOR 3.1
...
The query given below will find all the entries with "Bancroft" in the author field. The percent sign '%' is a SQL-92 wildcard that matches any string of zero or more characters. The underscore '_' can be used as a wildcard to match any single character. Because the wildcard is used in the query, multiple authors with different initials are reported. Note that the keyword "like" is used rather than the equal sign in declaring the string matching constraint.
column PDB_ID format a10 trunc;
column name format a40 trunc;
select MMS_ENTRY.id as PDB_ID, name
from MMS_ENTRY, CITATION_AUTHOR
where MMS_ENTRY.entry_key=CITATION_AUTHOR.entry_key
and CITATION_AUTHOR.name like '%Bancroft%'
group by MMS_ENTRY.id, name ;
(file: sql_examples/Citation.sql)
PDB_ID NAME
---------- ----------------------------------------
131d Bancroft, D.
1cwp Bancroft, J.B.
The query below lists the entities in PDB_ID '1a5h' that have type polymer. Note that an entity may occur multiple times in the asymmetric unit and that information is stored in the STRUCT_ASYM table.
column PDB_ID format a10 trunc;
column ENTITY format a10 trunc;
column DETAILS format a50 trunc;
select MMS_ENTRY.id as PDB_ID, ENTITY.id as ENTITY,
ENTITY.details as DETAILS
from MMS_ENTRY, ENTITY
where MMS_ENTRY.entry_key=ENTITY.entry_key
and ENTITY.type='polymer' and MMS_ENTRY.id = '1a5h';
(file: sql_examples/Polymers.sql)
PDB_ID ENTITY DETAILS
---------- ---------- --------------------------------------------------
1a5h 1 Protein chain: C, D
1a5h 2 Protein chain: A, B
The following SQL command prints the number of atoms and the mean x,y and z coordinate position for all of the atoms in the entry "1hwt".
column PDB_ID format a10 trunc;
select MMS_ENTRY.id as PDB_ID, count(*) as atoms,
avg(cartn_x), avg(cartn_y), avg(cartn_z)
from MMS_ENTRY, ATOM_SITE
where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
and MMS_ENTRY.id = '1hwt'
group by MMS_ENTRY.id ;
(file: sql_examples/Center.sql)
PDB_ID ATOMS AVG(CARTN_X) AVG(CARTN_Y) AVG(CARTN_Z)
---------- ---------- ------------ ------------ ------------
1hwt 4062 10.8431743 43.0713272 23.6939114
Note that this query examines every atom record in the database and will take some time to finish. The "group by" clause is used to compute the aggregate functions (min and max) over each entry.
select MMS_ENTRY.id, count(*) as atoms,
sqrt(power(max(cartn_x)-min(cartn_x),2) +
power(max(cartn_y)-min(cartn_y),2) +
power(max(cartn_z)-min(cartn_z),2)) as Diagonal_3D
from MMS_ENTRY, ATOM_SITE
where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
group by MMS_ENTRY.id ;
(file: sql_examples/Diagonal.sql)
...
ID ATOMS DIAGONAL_3D
-------------------------------- ---------- -----------
1eyg 4244 108.25681
1eyu 3400 100.228327
1f1t 860 68.5101871
1f3i 4808 138.651836
1f5t 5621 188.478184
1f69 262 36.9467493
1fdg 271 45.8646174
1ffk 64281 352.701736
1ffy 9386 176.260785
1ffz 10725 226.875971
1fg0 10704 227.42758
1fix 484 51.7029837
1fjf 51927 338.826822
...
The ENTITY_POLY_SEQ table is used to record the sequence of monomers in a polymer. The following query ranks all of the entries in the database by the number of Cysteine residues they contain.
Note that this query only counts the residues in each entity once, even though the entity may occur mulitple times in the asymmetric unit.
column PDB_ID format a10 trunc;
select MMS_ENTRY.id as PDB_ID, count(*) as CYS_RESIDUES
from MMS_ENTRY, ENTITY_POLY_SEQ
where MMS_ENTRY.entry_key=ENTITY_POLY_SEQ.entry_key
and ENTITY_POLY_SEQ.mon_id='CYS'
group by MMS_ENTRY.id
order by CYS_RESIDUES ;
(file:
sql_examples/CountResidues.sql)
...
PDB_ID CIS_RESIDUES
---------- ------------
1a8e 16
1bmv 16
1a8f 16
1a9x 17
19hc 18
1dsz 18
2nll 19
1ffk 21
1a8x 22
1a4y 38
"Special" is of course a subjective term and for the purposes of this query simply means any element which is not Hydrogen, Carbon, Nitrogen, Oxygen, Phosphorus or Sulfur. Some entries such as PDB_ID '1dcr' shown in the results below contains multiple types of metals. In such cases, the query given lists the entry multiple times with each special element on a separate row along with the number of atoms of that type.
Note that this query must also examine every atom record and may take some time to run.
column PDB_ID format a10 trunc;
column ELEMENT format a10 trunc;
select MMS_ENTRY.id as PDB_ID, ATOM_SITE.type_symbol_id as ELEMENT,
count(*) as ATOMS
from MMS_ENTRY, ATOM_SITE
where MMS_ENTRY.entry_key=ATOM_SITE.entry_key
and type_symbol_id not in ('H', 'C', 'N', 'O', 'P', 'S')
group by MMS_ENTRY.id, type_symbol_id ;
(file: sql_examples/Elements.sql)
...
PDB_ID ELEMENT ATOMS
---------- ---------- ----------
1dcr BR 2
1dcr MG 3
1dcr NA 1
1dct CA 2
1dct F 2
1dcw NA 1
1ddn NI 8
1ddy CO 4
1de9 MN 2
1dfm CA 2
1dfm SE 10
1dfu MG 5
...