pdb2cif

Translating PDB Entries into mmCIF

Based on Translating PDB Entries into mmCIF, mmCIF workshop, IUCr meeting, Seattle, Washington, August 1996, Abstract E0719

Philip E. Bourne
San Diego Supercomputer Center, PO Box 85608, San Diego, CA 92186-9784, USA and
Department of Pharmacology, University of California, San Diego, CA 92093-0365, USA
phone: 1-619-534-8301 email: bourne@sdsc.edu

Frances C. Bernstein
Bernstein+Sons, 5 Brewster Lane, Bellport, NY 11713-2803, USA
phone: 1-516-286-1339 email: fcb@bernstein-plus-sons.com

Herbert J. Bernstein
Bernstein+Sons, 5 Brewster Lane, Bellport, NY 11713-2803, USA
phone: 1-516-286-1339 email: yaya@bernstein-plus-sons.com

Work supported in part by IUCr (for HJB), US NSF, PHS, NIH, NCRR, NIGMS, NLM and DOE (for FCB prior to 1998), and US NSF grant no. BIR 9310154 (for PEB).


Before using this software, please read the NOTICE and please read the IUCr Policy on the Use of the Crystallographic Information File (CIF)


Introduction



The essential steps needed to map Protein Data Bank (PDB) entries into valid mmCIF data sets are discussed. Examples of converting both routine and complex structures using actual PDB entries with the program pdb2cif [BBB98] are given.

The Protein Data Bank format [PDB77, PDB95, PDB96] has been used for over 20 years to archive macromolecular data, is produced by many refinement programs, and is used as an input format by many applications. As the number of structures continues to grow exponentially, there is a need to represent explicitly a larger amount of data in a form that can be parsed by computer. The pending adoption of the mmCIF dictionary [FBB96] by the IUCr, in response to this need, has made translation from PDB format to mmCIF format a pressing issue.

In this talk we review the techniques needed to move from structures represented in PDB format to mmCIF format. Some data items, such as author names and journal references, have a direct mapping with minor syntactic adjustment. Other data items, however, require us to recast our thinking along new lines. For example, the PDB format works with chains and HET groups, while mmCIF uses entities (discrete chemical components). Proper identification of entities in a PDB entry may require looking for sequence homologies. As another example, consider beta sheets. The PDB format treats a bifurcated sheet as two distinct sheets which happen to have certain strands in common, while mmCIF allows all the strands involved to be represented as a single sheet. Going from PDB format to mmCIF therefore requires strand matching and alignment. What has currently been automated in pdb2cif and what still requires human intervention will be discussed.

Relationship Between mmCIF and PDB format

The Protein Data Bank [PDB96] uses a format that has fixed fields and is order-dependent. Here is part of the atomic coordinate list from the PDB entry 4INS [DHH89] in the 1989 format and the 1996 format:

1989 format:

ATOM      1  N   GLY A   1      -8.863  16.944  14.289  1.00 21.88   1  4INS 235
ATOM      2  CA  GLY A   1      -9.929  17.026  13.244  1.00 22.85   1  4INS 236
ATOM      3  C   GLY A   1     -10.051  15.625  12.618  1.00 43.92   1  4INS 237
ATOM      4  O   GLY A   1      -9.782  14.728  13.407  1.00 25.22   1  4INS 238
ATOM      5  N   ILE A   2     -10.333  15.531  11.332  1.00 26.28   1  4INS 239
ATOM      6  CA  ILE A   2     -10.488  14.266  10.600  1.00 20.84   1  4INS 240
ATOM      7  C   ILE A   2      -9.367  13.302  10.658  1.00 11.81   1  4INS 241
ATOM      8  O   ILE A   2      -9.580  12.092  10.969  1.00 20.31   1  4INS 242
ATOM      9  CB  ILE A   2     -10.883  14.493   9.095  1.00 40.00   1  4INS 243
ATOM     10  CG1 ILE A   2     -11.579  13.146   8.697  1.00 36.74   1  4INS 244

1996 format:

ATOM      1  N   GLY A   1      -8.863  16.944  14.289  1.00 21.88   1       N  
ATOM      2  CA  GLY A   1      -9.929  17.026  13.244  1.00 22.85   1       C  
ATOM      3  C   GLY A   1     -10.051  15.625  12.618  1.00 43.92   1       C  
ATOM      4  O   GLY A   1      -9.782  14.728  13.407  1.00 25.22   1       O  
ATOM      5  N   ILE A   2     -10.333  15.531  11.332  1.00 26.28   1       N  
ATOM      6  CA  ILE A   2     -10.488  14.266  10.600  1.00 20.84   1       C  
ATOM      7  C   ILE A   2      -9.367  13.302  10.658  1.00 11.81   1       C  
ATOM      8  O   ILE A   2      -9.580  12.092  10.969  1.00 20.31   1       O  
ATOM      9  CB  ILE A   2     -10.883  14.493   9.095  1.00 40.00   1       C  
ATOM     10  CG1 ILE A   2     -11.579  13.146   8.697  1.00 36.74   1       C  


The new mmCIF format is one of a family of STAR (Self-Defining Text Archive and Retrieval File [HS94]) formats which uses a tag-value style of presentation and has very little sensitivity to the ordering of the information. Here is an extract from an mmCIF conversion of PDB entry 4INS:

loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol 
_atom_site.label_atom_id 
_atom_site.label_comp_id 
_atom_site.label_asym_id 
_atom_site.auth_seq_id 
_atom_site.label_alt_id 
_atom_site.cartn_x 
_atom_site.cartn_y 
_atom_site.cartn_z 
_atom_site.occupancy
_atom_site.B_iso_or_equiv 
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1 ATOM  N  N    GLY A    1  .  -8.863  16.944  14.289  1.00 21.88   1   1      1
1 ATOM  C  CA   GLY A    1  .  -9.929  17.026  13.244  1.00 22.85   1   1      2
1 ATOM  C  C    GLY A    1  . -10.051  15.625  12.618  1.00 43.92   1   1      3
1 ATOM  O  O    GLY A    1  .  -9.782  14.728  13.407  1.00 25.22   1   1      4
2 ATOM  N  N    ILE A    2  . -10.333  15.531  11.332  1.00 26.28   1   1      5
2 ATOM  C  CA   ILE A    2  . -10.488  14.266  10.600  1.00 20.84   1   1      6
2 ATOM  C  C    ILE A    2  .  -9.367  13.302  10.658  1.00 11.81   1   1      7
2 ATOM  O  O    ILE A    2  .  -9.580  12.092  10.969  1.00 20.31   1   1      8
2 ATOM  C  CB   ILE A    2  . -10.883  14.493   9.095  1.00 40.00   1   1      9
2 ATOM  C  CG1  ILE A    2  . -11.579  13.146   8.697  1.00 36.74   1   1     10

Because tags are always given, the same information can be presented in different orderings. Note that the mmCIF format does not depend on the columns shown here, just on a consistent ordering of tags versus data values.
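The contrast between fixed columns and tag-value presentation can be sketched in a few lines of Python. This is only an illustration, not the code used by pdb2cif: the function names `parse_atom_record` and `atom_site_loop` are hypothetical, only some of the tags shown above are filled in, and the use of "." for a blank chain identifier is an assumption of the sketch.

```python
def parse_atom_record(record):
    """Split one fixed-column PDB ATOM/HETATM record into named fields."""
    record = record.ljust(80)  # pad short records so every slice is safe
    return {
        "group_PDB": record[0:6].strip(),
        "id": int(record[6:11]),
        "label_atom_id": record[12:16].strip(),
        "label_alt_id": record[16].strip() or ".",
        "label_comp_id": record[17:20].strip(),
        "label_asym_id": record[21].strip() or ".",  # placeholder choice of this sketch
        "auth_seq_id": int(record[22:26]),
        "cartn_x": float(record[30:38]),
        "cartn_y": float(record[38:46]),
        "cartn_z": float(record[46:54]),
        "occupancy": float(record[54:60]),
        "B_iso_or_equiv": float(record[60:66]),
    }


def atom_site_loop(atoms, tags):
    """Write a loop_ whose columns follow the caller's tag order."""
    lines = ["loop_"] + ["_atom_site." + tag for tag in tags]
    for atom in atoms:
        lines.append(" ".join(str(atom[tag]) for tag in tags))
    return "\n".join(lines)
```

Calling `atom_site_loop` twice with the same atoms but differently ordered tag lists gives two loops that carry the same information, which is the point made above about ordering.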

Syntax

The major differences in syntax are as follows:

mmCIF:

tag-value definitions, little order dependence, strict table structure, upper/lower case, yyyy-mm-dd dates, last-name-first author names, related items may have to appear in disjoint tag-value lists.

PDB:

fixed fields, strongly order dependent, some information non-tabular, upper-case only, dd-mmm-yy dates (dd-mmm-yyyy in some REMARKS), last-name-last author names.
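Two of the syntactic adjustments listed above, dates and author names, can be sketched as follows. The function names are hypothetical, the pivot used to expand two-digit years is an assumption of the sketch, and the simple capitalization rule will mishandle names with internal capitals or suffixes; real entries need more care.

```python
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def pdb_date_to_cif(date, pivot=70):
    """Convert dd-mmm-yy (or dd-mmm-yyyy) to yyyy-mm-dd.

    Two-digit years at or above `pivot` are read as 19yy, others as 20yy;
    the pivot value is an assumption, not part of either format.
    """
    day, month, year = date.strip().split("-")
    year = int(year)
    if year < 100:
        year += 1900 if year >= pivot else 2000
    return "%04d-%02d-%02d" % (year, MONTHS[month.upper()], int(day))


def pdb_author_to_cif(name):
    """Turn a last-name-last PDB author such as F.C.BERNSTEIN into
    the last-name-first form Bernstein, F.C."""
    name = name.strip()
    cut = name.rfind(".") + 1  # the initials end at the last period
    initials, family = name[:cut], name[cut:]
    family = family.capitalize()  # naive mixed-case conversion
    return "%s, %s" % (family, initials) if initials else family
```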

Content

The major differences in content are:

mmCIF:

extensive normalization, works with entities (discrete chemical entities).

PDB:

less normalization, works with chains and HET groups.

PDB and mmCIF formats agree simply and directly for some data items, such as cell parameters, and admit a simple tabular mapping, as shown by this extract from the concordance [B96] which is available as part of pdb2cif [BBB98]:

PDB Field          Content  Type of Transformation and Related mmCIF Field
CRYST1[1-6]        CRYST1   NA
CRYST1[7-15]       a        equivalent to   _cell.length_a
CRYST1[16-24]      b        equivalent to   _cell.length_b
CRYST1[25-33]      c        equivalent to   _cell.length_c
CRYST1[34-40]      alpha    equivalent to   _cell.angle_alpha
CRYST1[41-47]      beta     equivalent to   _cell.angle_beta
CRYST1[48-54]      gamma    equivalent to   _cell.angle_gamma
CRYST1[56-66]      sGroup   equivalent to   _symmetry.space_group_name_H-M
CRYST1[67-70]      z        equivalent to   _cell.Z_PDB


while other important macromolecular data descriptors, because of the very different views of the same data, require complex transformations.
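For the simple case, the CRYST1 rows of the concordance can be applied almost mechanically. The sketch below assumes only the column ranges and tags listed in the extract; the table name `CRYST1_FIELDS` and the function `cryst1_to_cif` are hypothetical, and the use of "?" for an empty field is an assumption of the sketch.

```python
# (first column, last column, mmCIF tag); columns are 1-based as in the concordance
CRYST1_FIELDS = [
    (7, 15, "_cell.length_a"),
    (16, 24, "_cell.length_b"),
    (25, 33, "_cell.length_c"),
    (34, 40, "_cell.angle_alpha"),
    (41, 47, "_cell.angle_beta"),
    (48, 54, "_cell.angle_gamma"),
    (56, 66, "_symmetry.space_group_name_H-M"),
    (67, 70, "_cell.Z_PDB"),
]


def cryst1_to_cif(record):
    """Return (tag, value) pairs for one CRYST1 record."""
    record = record.ljust(70)  # pad so every slice is safe
    pairs = []
    for first, last, tag in CRYST1_FIELDS:
        value = record[first - 1:last].strip() or "?"  # "?" marks a missing value
        if " " in value:
            value = "'%s'" % value  # quote values with embedded blanks
        pairs.append((tag, value))
    return pairs
```

A table-driven layout like this keeps the mapping readable next to the concordance itself; the complex transformations discussed next do not reduce to such a table.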

For example, in mmCIF, sheets are built up out of strands. All the strands in all sheets are listed in one STRUCT_SHEET_RANGE table. The relative ordering and orientation of all strands in all sheets are given in one STRUCT_SHEET_ORDER table. The hydrogen-bonding among all strands in all sheets is listed in one STRUCT_SHEET_HBOND table. The general characteristics of all sheets per se are given in one STRUCT_SHEET table. In PDB format, sheets are described by one set of sheet records per simple, non-bifurcated sheet. To convert from PDB format to mmCIF format, a list of all strands must be extracted from the SHEET records, sorted to remove duplicates, and the information placed in a STRUCT_SHEET_RANGE table. All strand-to-strand relationships are extracted and placed in a STRUCT_SHEET_ORDER table, etc. Here is a diagram of PDB entry 2ACE [HRSS96] showing strands forming sheets:

[2ace sheets]

This is presented in the PDB entry as:

SHEET    1   A 3 LEU     6  THR    10  0
SHEET    2   A 3 GLY    13  MET    16 -1  N  VAL    15   O  VAL     8
SHEET    3   A 3 VAL    57  ALA    60  1  N  TRP    58   O  LYS    14
SHEET    1   B11 MET    16  PRO    21  0
SHEET    2   B11 HIS    26  PRO    34 -1  O  ALA    29   N  THR    18
SHEET    3   B11 TYR    96  PRO   102 -1  N  ILE    99   O  PHE    30
SHEET    4   B11 VAL   142  SER   147 -1  N  LEU   143   O  TRP   100
SHEET    5   B11 THR   109  TYR   116  1  N  MET   112   O  VAL   142
SHEET    6   B11 THR   193  GLU   199  1  O  THR   195   N  VAL   113
SHEET    7   B11 ARG   220  SER   226  1  N  ILE   223   O  ILE   196
SHEET    8   B11 GLN   318  ASN   324  1  N  GLY   322   O  LEU   224
SHEET    9   B11 GLY   417  PHE   423  1  N  TYR   421   O  LEU   321
SHEET   10   B11 PHE   502  LEU   505  1  N  ILE   503   O  LEU   420
SHEET   11   B11 MET   510  GLN   514 -1  N  HIS   513   O  PHE   502

Here is the same information converted to mmCIF format by pdb2cif:

loop_
_struct_sheet.id
_struct_sheet.number_strands
  A     3
  B    11

loop_
_struct_sheet_hbond.sheet_id
_struct_sheet_hbond.range_id_1
_struct_sheet_hbond.range_id_2
_struct_sheet_hbond.range_1_beg_label_seq_id
_struct_sheet_hbond.range_1_beg_label_atom_id
_struct_sheet_hbond.range_2_beg_label_seq_id
_struct_sheet_hbond.range_2_beg_label_atom_id
_struct_sheet_hbond.range_1_end_label_seq_id
_struct_sheet_hbond.range_1_end_label_atom_id
_struct_sheet_hbond.range_2_end_label_seq_id
_struct_sheet_hbond.range_2_end_label_atom_id
  A   1_A   2_A    8   O   15  N   8  O  15  N  
  A   2_A   3_A   14   O   58  N  14  O  58  N  
  B   1_B   2_B   18   N   29  O  18  N  29  O  
  B  10_B  11_B  502   O  513  N 502  O 513  N  
  B   2_B   3_B   30   O   99  N  30  O  99  N  
  B   3_B   4_B  100   O  143  N 100  O 143  N  
  B   4_B   5_B  142   O  112  N 142  O 112  N  
  B   5_B   6_B  113   N  195  O 113  N 195  O  
  B   6_B   7_B  196   O  223  N 196  O 223  N  
  B   7_B   8_B  224   O  322  N 224  O 322  N  
  B   8_B   9_B  321   O  421  N 321  O 421  N  
  B   9_B  10_B  420   O  503  N 420  O 503  N  

loop_
_struct_sheet_order.sheet_id
_struct_sheet_order.range_id_1
_struct_sheet_order.range_id_2
_struct_sheet_order.offset
_struct_sheet_order.sense
  A     1_A     2_A +1 anti-parallel
  A     2_A     3_A +1 parallel
  B     1_B     2_B +1 anti-parallel
  B    10_B    11_B +1 anti-parallel
  B     2_B     3_B +1 anti-parallel
  B     3_B     4_B +1 anti-parallel
  B     4_B     5_B +1 parallel
  B     5_B     6_B +1 parallel
  B     6_B     7_B +1 parallel
  B     7_B     8_B +1 parallel
  B     8_B     9_B +1 parallel
  B     9_B    10_B +1 parallel

loop_
_struct_sheet_range.sheet_id
_struct_sheet_range.id
_struct_sheet_range.beg_label_comp_id
_struct_sheet_range.beg_label_asym_id
_struct_sheet_range.beg_label_seq_id
_struct_sheet_range.end_label_comp_id
_struct_sheet_range.end_label_asym_id
_struct_sheet_range.end_label_seq_id
  A     1_A LEU *    6  THR *   10 
  A     2_A GLY *   13  MET *   16 
  A     3_A VAL *   57  ALA *   60 
  B     1_B MET *   16  PRO *   21 
  B    10_B PHE *  502  LEU *  505 
  B    11_B MET *  510  GLN *  514 
  B     2_B HIS *   26  PRO *   34 
  B     3_B TYR *   96  PRO *  102 
  B     4_B VAL *  142  SER *  147 
  B     5_B THR *  109  TYR *  116 
  B     6_B THR *  193  GLU *  199 
  B     7_B ARG *  220  SER *  226 
  B     8_B GLN *  318  ASN *  324 
  B     9_B GLY *  417  PHE *  423 
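The first step of this transformation, splitting SHEET records into range and order rows, can be sketched as below. This is a simplified illustration under stated assumptions and not the pdb2cif implementation: the function names are hypothetical, only the strand, sheet, residue and sense columns of the SHEET record are read, hydrogen-bond registration is ignored, and no attempt is made to merge the PDB sheets that together describe a bifurcated sheet.

```python
SENSE = {1: "parallel", -1: "anti-parallel"}


def parse_sheet_record(record):
    """Read the strand-defining columns of one PDB SHEET record."""
    record = record.ljust(80)
    return {
        "strand": int(record[7:10]),
        "sheet_id": record[11:14].strip(),
        "beg": (record[17:20].strip(), int(record[22:26])),
        "end": (record[28:31].strip(), int(record[33:37])),
        "sense": int(record[38:40]),
    }


def sheets_to_tables(records):
    """Return (ranges, order) in the spirit of STRUCT_SHEET_RANGE and
    STRUCT_SHEET_ORDER."""
    ranges = {}  # keyed by (sheet id, range id), so a repeated strand is kept once
    order = []
    for rec in map(parse_sheet_record, records):
        range_id = "%d_%s" % (rec["strand"], rec["sheet_id"])
        ranges[(rec["sheet_id"], range_id)] = (rec["beg"], rec["end"])
        if rec["sense"] != 0:  # sense 0 marks the first strand of a sheet
            previous_id = "%d_%s" % (rec["strand"] - 1, rec["sheet_id"])
            order.append((rec["sheet_id"], previous_id, range_id,
                          SENSE[rec["sense"]]))
    return ranges, order
```

Fixed-column slicing matters here: in the records for sheet B the sheet identifier and strand count run together as "B11", so splitting on white space would misread them.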
 

pdb2cif

pdb2cif [BBB98] is a program which converts PDB entries into mmCIF data sets. Most, but not all, common PDB record types are converted. The program cannot resolve some of the ambiguities involved in the conversion. The program has gone through extensive changes since 1993 as both mmCIF and the PDB format have evolved. The program, which was initially written as an awk script, is now available as an m4 macro document which produces either perl or awk versions. The perl version is recommended.

The pdb2cif.m4 document contains approximately 6500 lines of text, which generates either a similar sized awk script or over 10,000 lines of perl code (due to in-lining of certain critical functions). On modern processors with sufficient memory (32 to 64 MB available RAM), the conversion takes from several seconds to a few minutes (e.g. for large NMR entries) depending on the size of the PDB entry. The mmCIF data sets produced are approximately the same size as the original PDB entries. Here are the statistics for some conversions done on an SGI R8000 Indigo:


                 Size in Characters (* 1000)     Conversion
         PDB Entry   PDB           mmCIF         Time (secs.)
         4INS        117            130             2.7
         1CTJ        170            179             2.7
         2ACE        393            433             7.3
         4HIR      1,753          1,896            28.8

The time is approximately linear in the file size, dominated by the processing time for the atom list. The times given are real times and approximate the processor time on larger machines, but for large NMR entries processed on small machines, the real time can become very large due to extensive page swapping for the arrays used to hold the atom list.
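The claim of approximate linearity can be checked against the table with a little arithmetic. The sketch below uses only the four rows given above; the names `CONVERSIONS` and `throughput` are hypothetical.

```python
# PDB entry -> (PDB size in thousands of characters, conversion time in seconds),
# copied from the table above
CONVERSIONS = {
    "4INS": (117, 2.7),
    "1CTJ": (170, 2.7),
    "2ACE": (393, 7.3),
    "4HIR": (1753, 28.8),
}


def throughput(conversions):
    """Thousands of input characters converted per second, per entry."""
    return {entry: size / seconds
            for entry, (size, seconds) in conversions.items()}


if __name__ == "__main__":
    for entry, rate in sorted(throughput(CONVERSIONS).items()):
        print("%s %5.1f" % (entry, rate))
```

The rates fall between roughly 43 and 63 thousand characters per second across a fifteen-fold range of file sizes, which is consistent with a time that is approximately, though not exactly, linear in file size.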

The program produces summary warnings as comments at the end of each output CIF. Unconverted records are captured in the AUDIT category warnings, and converted records should be examined carefully, especially for the following record types:

COMPND, SOURCE, TITLE and CAVEAT are merged into _struct.title without further parsing. A great deal of information could be derived from the entries which follow the PDB 1995 format description when sufficient information for mapping of MOL_ID to entities is available.

One of the most challenging parts of the conversion done by pdb2cif is the identification of chemical entities. pdb2cif does this by scanning SEQRES and ATOM list information for sequence homologies. Doubtful cases are reported by warning comments in the mmCIF output.
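The idea of deriving entities from chains can be illustrated with a deliberately simplified sketch. It groups chains only when their residue sequences are exactly identical, which is a much cruder test than the homology scan described above, and it says nothing about HET groups or doubtful cases. The function name `assign_entities` and its input layout are hypothetical.

```python
def assign_entities(chain_sequences):
    """Map each chain ID to an entity number.

    `chain_sequences` maps a chain ID to its list of residue names.
    Chains with identical sequences share one entity number; numbers are
    handed out in the order in which new sequences are first seen.
    """
    entity_of_sequence = {}
    entity_of_chain = {}
    for chain_id, residues in chain_sequences.items():
        key = tuple(residues)
        if key not in entity_of_sequence:
            entity_of_sequence[key] = len(entity_of_sequence) + 1
        entity_of_chain[chain_id] = entity_of_sequence[key]
    return entity_of_chain
```

A tolerant comparison (for example, allowing a few mismatches or missing terminal residues) would replace the exact tuple key; that is where a warning for doubtful cases would naturally be raised.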

pdb2cif is used to produce mmCIF output from a Browser [APMS96] available on the PDB home page since August 1996 at: http://www.pdb.bnl.gov

The program accepts all current PDB record types. Here is the DBREF information from the PDB entry 1CTJ [S95].

DBREF  1CTJ      1    89  SWS    Q09099   CYC6_MONBR       1     89


which is converted into two tables by pdb2cif:

loop_
_struct_ref.id 
_struct_ref.entity_id 
_struct_ref.biol_id 
_struct_ref.db_name 
_struct_ref.db_code 
_struct_ref.seq_align 
_struct_ref.seq_dif 
_struct_ref.details 
     1    1    *      SWS  'Q09099 CYC6_MONBR'  partial   no .

loop_
_struct_ref_seq.align_id 
_struct_ref_seq.ref_id 
_struct_ref_seq.seq_align_beg 
_struct_ref_seq.seq_align_end
_struct_ref_seq.db_align_beg 
_struct_ref_seq.db_align_end 
_struct_ref_seq.details 
     1     1     '1'    '89'     '1'    '89' . 
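The split of one DBREF record into two tables can be sketched as follows. The column positions are those of the DBREF record in the PDB format description; the function names and the dictionary layout are hypothetical, and only the items that come directly from the record are filled in (entity and biological-unit identifiers, and the alignment flags, are left out).

```python
def parse_dbref(record):
    """Read the fixed columns of one PDB DBREF record."""
    record = record.ljust(80)
    return {
        "id_code": record[7:11].strip(),
        "chain_id": record[12].strip(),
        "seq_begin": int(record[14:18]),
        "seq_end": int(record[20:24]),
        "database": record[26:32].strip(),
        "db_accession": record[33:41].strip(),
        "db_id_code": record[42:54].strip(),
        "db_seq_begin": int(record[55:60]),
        "db_seq_end": int(record[62:67]),
    }


def dbref_to_tables(records):
    """Return rows in the spirit of STRUCT_REF and STRUCT_REF_SEQ."""
    struct_ref, struct_ref_seq = [], []
    for number, rec in enumerate(map(parse_dbref, records), start=1):
        struct_ref.append({
            "id": number,
            "db_name": rec["database"],
            # accession and ID code are joined, as in the extract above
            "db_code": "%s %s" % (rec["db_accession"], rec["db_id_code"]),
        })
        struct_ref_seq.append({
            "align_id": number,
            "ref_id": number,  # links the alignment row back to struct_ref
            "seq_align_beg": rec["seq_begin"],
            "seq_align_end": rec["seq_end"],
            "db_align_beg": rec["db_seq_begin"],
            "db_align_end": rec["db_seq_end"],
        })
    return struct_ref, struct_ref_seq
```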

1CTJ has anisotropic U's:

ATOM      1  N  AGLU     1       4.127  26.179  -7.903  0.49 57.53           N  
ANISOU    1  N  AGLU     1     9336   7394   4591      4   2737   2771       N  
ATOM      2  N  BGLU     1       3.535  25.488 -12.889  0.51 54.52           N  
ANISOU    2  N  BGLU     1     8406   5015   6783   -887   3093    161       N  
ATOM      3  CA AGLU     1       5.490  26.607  -8.207  0.49 52.50           C  
ANISOU    3  CA AGLU     1     9283   5563   4611   -256   2331   1241       C  
ATOM      4  CA BGLU     1       2.754  26.395 -12.051  0.51 51.27           C  
ANISOU    4  CA BGLU     1     7663   5124   6212   -653   2258    184       C  
ATOM      5  C  AGLU     1       5.550  27.734  -9.233  0.49 47.55           C  
ANISOU    5  C  AGLU     1     8593   4752   4275   -880   1820    625       C 


pdb2cif inserts the necessary tags and values into the ATOM_SITE table, but uses a different ordering. Also note the change in scaling, because the values for aniso U in mmCIF are not multiplied by 10000 as in PDB entries.

loop_
_atom_site.label_seq_id
_atom_site.auth_asym_id
_atom_site.group_PDB
_atom_site.type_symbol 
_atom_site.label_atom_id 
_atom_site.label_comp_id 
_atom_site.label_asym_id 
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
_atom_site.aniso_U[1][1]
_atom_site.aniso_U[1][2]
_atom_site.aniso_U[1][3]
_atom_site.aniso_U[2][2]
_atom_site.aniso_U[2][3]
_atom_site.aniso_U[3][3]
1 '    '
ATOM  N    N    GLU *    1  A   4.127  26.179  -7.903  0.49 57.53  .    1      1
                        0.9336  0.0004  0.2737  0.7394  0.2771  0.4591 
1 '    '
ATOM  N    N    GLU *    1  B   3.535  25.488 -12.889  0.51 54.52  .    1      2
                        0.8406 -0.0887  0.3093  0.5015  0.0161  0.6783 
1 '    '
ATOM  C    CA   GLU *    1  A   5.490  26.607  -8.207  0.49 52.50  .    1      3
                        0.9283 -0.0256  0.2331  0.5563  0.1241  0.4611 
1 '    '
ATOM  C    CA   GLU *    1  B   2.754  26.395 -12.051  0.51 51.27  .    1      4
                        0.7663 -0.0653  0.2258  0.5124  0.0184  0.6212 
1 '    '
ATOM  C    C    GLU *    1  A   5.550  27.734  -9.233  0.49 47.55  .    1      5
                        0.8593  -0.088   0.182  0.4752  0.0625  0.4275 
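The reordering and rescaling of the anisotropic U values can be sketched as follows. The ANISOU record holds the six values in the order U11, U22, U33, U12, U13, U23, each multiplied by 10000, in columns 29 to 70; the function name `anisou_to_cif` is hypothetical and the sketch is not taken from pdb2cif itself.

```python
def anisou_to_cif(record):
    """Convert one PDB ANISOU record to unscaled aniso_U items in the
    tag order used in the extract above."""
    record = record.ljust(70)
    # six 7-column integer fields in columns 29-70 (1-based)
    scaled = [int(record[start:start + 7]) for start in range(28, 70, 7)]
    u11, u22, u33, u12, u13, u23 = (value / 10000.0 for value in scaled)
    return [
        ("_atom_site.aniso_U[1][1]", u11),
        ("_atom_site.aniso_U[1][2]", u12),
        ("_atom_site.aniso_U[1][3]", u13),
        ("_atom_site.aniso_U[2][2]", u22),
        ("_atom_site.aniso_U[2][3]", u23),
        ("_atom_site.aniso_U[3][3]", u33),
    ]
```

Applied to the first ANISOU record above, this gives 0.9336, 0.0004, 0.2737, 0.7394, 0.2771 and 0.4591, the values shown for the first atom in the mmCIF extract.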

As we gain more experience with the new PDB format and with mmCIF we hope to extend the mapping of record types into the internal fields of COMPND and SOURCE and of the newer, more structured remarks. Ultimately we hope to be able to do conversions from PDB format to mmCIF in sufficient detail to extract all information for which mmCIF tokens exist and for which information was provided in an entry, while preserving the names and relationships which existed in the PDB entry, so that all records of the original entry can be reconstructed from the new mmCIF data set.

References

Useful WWW URLs

There are many useful sites on the World Wide Web where information, tools and software related to CIF, mmCIF and the PDB can be found. The following are good starting points for exploration:

The International Union of Crystallography (IUCr) provides access to software, dictionaries, policy statements and documentation relating to CIF and mmCIF at:

with mirror sites at:

Information and Software for STAR and CIF can be found at:

The Nucleic Acid Database Project provides access to its entries, software and documentation, with an mmCIF page giving access to the dictionary and mmCIF software tools at:

with mirror sites at:

The Protein Data Bank provides access to entries, software and documentation with a browser, and an on-line PDB format description at:

with mirror sites at many locations (see http://www.pdb.bnl.gov/pdb-docs/mirror_sites.html).

Tutorials on mmCIF and the relationship to PDB format can be found at: http://www.sdsc.edu/pb/cif/tutorials.html


Here are direct links to copies of the IUCr CIF home page, the NDB's mmCIF home page, pdb2cif, cif2pdb and CIFtbx (with Cyclops and cif2cif).

United States
NDB, Rutgers, NJ mmCIF pdb2cif cif2pdb CIFtbx...
SDSC, San Diego, CA CIF mmCIF pdb2cif cif2pdb CIFtbx...
United Kingdom
IUCr, Chester CIF   pdb2cif cif2pdb CIFtbx...
EBI, Hinxton   mmCIF pdb2cif cif2pdb CIFtbx...
France
U. P. et M. Curie, Paris CIF pdb2cif cif2pdb CIFtbx...
Sweden
U. of Stockholm CIF   pdb2cif cif2pdb CIFtbx...
South Africa
U. of the Witwatersrand CIF   pdb2cif cif2pdb CIFtbx...
Japan
NIBH, Ibaraki   mmCIF pdb2cif cif2pdb CIFtbx...
Australia
UWA, Nedlands STAR/CIF   pdb2cif cif2pdb CIFtbx...


Updated 24 June 1998

Herbert J. Bernstein (yaya@bernstein-plus-sons.com)