README for pdb2cif.pl, pdb2cif.oawk, pdb2cif.awk

produced from pdb2cif.m4 version 2.3.7 7 March 1998

Scripts to filter a PDB entry and produce a CIF file.

by
Philip E. Bourne, Herbert J. Bernstein and Frances C. Bernstein

For a discussion of the rationale behind this software, see Translating PDB Entries into mmCIF

Work supported in part by IUCr (for HJB), US NSF, PHS, NIH, NCRR, NIGMS, NLM and DOE (for FCB prior to 1998), and US NSF grant no. BIR 9310154 (for PEB).

Before using this software, please read the

and please read the IUCr

on the Use of the Crystallographic Information File (CIF)

THE CONVERSION FROM PDB FORMAT TO CIF FORMAT IS COMPLEX
******* USE WITH CAUTION *******
COMMENTS AND SUGGESTIONS APPRECIATED

If you like the basic approach, thank Phil Bourne. He did the real work of creating pdb2cif. If you have problems with the adaptation to cif_mm.dic or any other aspects of pdb2cif, send email to pdb2cif@bernstein-plus-sons.com

The Authors

Philip E. Bourne, San Diego Supercomputer Center, PO Box 85608, San Diego, CA 92186-9785 USA
email: bourne@sdsc.edu

Herbert J. Bernstein, Bernstein+Sons, P.O. Box 177, Bellport, NY 11713
email: yaya@bernstein-plus-sons.com

Frances C. Bernstein, Bernstein+Sons, P.O. Box 177, Bellport, NY 11713
email: fcb@bernstein-plus-sons.com

Where to Get pdb2cif

Current versions are available via http from: http://www.bernstein-plus-sons.com/software/pdb2cif,
http://www.sdsc.edu/pb/pdb2cif/pdb2cif/ and
http://ndbserver.rutgers.edu/NDB/mmcif/software
It is available as a compressed shar pdb2cif.shar.Z (2.8 megabytes), a compressed C-shell shar pdb2cif.cshar.Z (2.8 megabytes) or as individual files, as given in the MANIFEST.

If your system cannot handle a Unix-style compressed file, you may wish to download an uncompressed shar pdb2cif.shar or an uncompressed cshar pdb2cif.cshar.

If you need a later version and are willing to work with code that is changing, you may wish to try the next_test_version (not always present).

Recent Changes

Release 2.3.7 makes the alignment of the ATOM list to SEQRES more robust, making use of the ATOM list connectivity to identify segments that should have closely related alignment. All important summary diagnostics now begin with the string "#=#" to simplify extraction of this information with grep. Non-standard charges on ATOM records are now converted to blank or to a single digit followed by a sign.

Release 2.3.6 corrected some comments and documentation.

Release 2.3.5 corrected the handling of long references and JRNL PUBL records. Residue names which had been quoted with a single quote mark are now quoted with a double quote mark.

Release 2.3.4 corrected the handling of two blank fields in SEQADV and some typos in STRUCT_MON_PROT tags.

Release 2.3.3.2 corrected a spurious header generated in the CIF when a PDB entry has SSBOND records and no secondary structure.

Release 2.3.3.1 was a minor revision to the web pages of version 2.3.3. URLs in comments in the program were also updated. Changes were made in the m4 script for the gnu m4 handling of format.

Release 2.3.3 was an interim revision to pdb2cif to support the changes in tokens introduced with the mmCIF dictionary 0.8.10. The only change done at this stage was to remap the names currently in use. Additional changes will be needed in the future to support parsing to make use of the additional tokens.

Release 2.3.2 has several changes for compliance with the mmCIF dictionary version 0.8.02, in response to some problems discovered by John Westbrook and the checking provided by his ciflib routines. The most visible changes are the listing of the standard residues used in an entry in the CHEM_COMP category, changing use of a quoted blank field as a value for _atom_site.auth_asym_id to a period, and moving some data items common to a loop into the loop itself.

Release 2.3.1 corrects some minor problems in release 2.3.0. In particular a problem with a bad item count and a bad date on machines running some older versions of perl has been corrected. Extra warnings for NMR entries with unusual uses of B-values or occupancies have been added.

Release 2.3.0 was an update to Release 2.2.7 correcting some minor problems with data item types, long publication names, and a failure to report CSD codens.

Release 2.2.7 was the first pdb2cif release after PDB entries compliant with the February 1996 V2.0 PDB format became available. The format of data items in ATOM_SITE lists derived from V2.0 entries was corrected, and the mapping of HETNAM and HETSYN moved from ENTITY_NAME_SYS to ENTITY_NAME_COM.

For more information and prior revisions, see CHANGES .

Compliance

This version is intended to produce mmCIF files conforming to mmCIF version 0.8.02 and above. Full compliance is not possible in some areas. In particular, most of the values used for _exptl.method, and some of the values used for _struct_conf_type.id do not conform to the enumerations in the dictionary. Full compliance would require agreement between the PDB and COMCIFS on equivalent lists of values.

The following definitions would have to be appended to the mmCIF dictionary for validation of pdb2cif output:

save__struct_conn.ptnr1_atom_site_id 
    _item_description.description
;              The id of an atom site for the first partner in a bond

               This data item is a pointer to _atom_site.id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_conn.ptnr1_atom_site_id'
    _item.mandatory_code          no
    _item.category_id             struct_conn
    _item_linked.child_name     '_struct_conn.ptnr1_atom_site_id'
    _item_linked.parent_name    '_atom_site.id'
     save_

save__struct_conn.ptnr2_atom_site_id 
    _item_description.description
;              The id of an atom site for the second partner in a bond

               This data item is a pointer to _atom_site.id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_conn.ptnr2_atom_site_id'
    _item.mandatory_code          no
    _item.category_id             struct_conn
    _item_linked.child_name     '_struct_conn.ptnr2_atom_site_id'
    _item_linked.parent_name    '_atom_site.id'
     save_

save__atom_site.label_model_id
    _item_description.description
;              A component of the macromolecular identifier for this atom site.
               The value of _atom_site.label_model_id associates the atom
               site with a particular nmr model.
               
;
    _item.name                  '_atom_site.label_model_id'
    _item.mandatory_code          no
    _item.category_id            'atom_site'
    _item_type.code               code
     loop_
    _item_linked.child_name
    _item_linked.parent_name
         '_struct_mon_prot.label_model_id'         '_atom_site.label_model_id'
         '_struct_mon_prot_cis.label_model_id'     '_atom_site.label_model_id'
     save_


save__struct_mon_prot.label_model_id
    _item_description.description
;              This data item is a pointer to _atom_site.label_model_id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_mon_prot.label_model_id'
    _item.mandatory_code          no
    _item.category_id             struct_mon_prot
     save_

save__struct_mon_prot_cis.label_model_id
    _item_description.description
;              This data item is a pointer to _atom_site.label_model_id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_mon_prot_cis.label_model_id'
    _item.mandatory_code          no
    _item.category_id             struct_mon_prot_cis
     save_

save__struct_ref_seq_dif.db_seq_num
    _item_description.description
;              The sequence position in the referenced database entry
               corresponding to this point difference position.

               The use of . for _struct_ref_seq_dif.db_seq_num when
               a value has been given for _struct_ref_seq_dif.seq_num
               indicates that there has been an insertion at this
               position.

               The use of . for _struct_ref_seq_dif.seq_num when
               a value is given for _struct_ref_seq_dif.db_seq_num
               indicates that there has been a deletion at this
               position. 
;
    _item.name                  '_struct_ref_seq_dif.db_seq_num'
    _item.mandatory_code          no
    _item.category_id             struct_ref_seq_dif
     loop_
    _item_range.maximum           
    _item_range.minimum           
                                  .    1
                                  1    1
    _item_type.code               int

     save_

Conversion Notes and Known Problems

This program produces summary warnings as comments at the end of each output CIF. Each diagnostic begins with the string "#=#", so that a summary may be extracted using grep. Unconverted records are captured in the AUDIT category warnings and should be examined carefully.
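
The "#=#" convention also makes it easy to collect the diagnostics programmatically. The following is a minimal Python sketch, not part of pdb2cif. The function name extract_diagnostics is hypothetical, and the sketch assumes the marker sits at the very start of a line.

```python
def extract_diagnostics(cif_text):
    """Return the lines of a pdb2cif output CIF that start with '#=#'."""
    return [line for line in cif_text.splitlines() if line.startswith("#=#")]


# Illustrative input only; the diagnostic wording here is made up.
sample = "data_EXAMPLE\n#=# example warning\n_entry.id EXAMPLE\n#=# another warning\n"
for line in extract_diagnostics(sample):
    print(line)
```

From the command line, grep '^#=#' entry.cif gives the same result.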

COMPND, SOURCE, TITLE and CAVEAT are merged into _struct.title without further parsing. A great deal of information could be derived from the entries which use the PDB 1995 format description when sufficient information for mapping of MOL_ID to entities is available.

REMARK records currently are mapped without parsing. There is a great deal of information in these records which can be parsed in more recent entries. It should be noted that only columns 12-70 of REMARKs are mapped to mmCIF.
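
To make the column range concrete, here is a small Python sketch of the slice involved. It is an illustration only, not the code pdb2cif uses. The function name remark_text is hypothetical, and whether pdb2cif trims trailing blanks from the extracted text is not stated here, so the sketch returns the columns unchanged.

```python
def remark_text(record):
    """Return columns 12-70 (1-based, inclusive) of a PDB REMARK record."""
    if not record.startswith("REMARK"):
        raise ValueError("not a REMARK record")
    # Pad to 80 columns so that short lines still slice cleanly.
    padded = record.rstrip("\n").ljust(80)
    # Columns 12-70 correspond to the 0-based slice [11:70], 59 characters.
    return padded[11:70]
```

Anything in columns 71-80 of a REMARK record is therefore not carried into the CIF.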

EXPDTA records use values which do not have a direct mapping to the enumerated values for _exptl.method.

ATOM/HETATM records in newer PDB entries have a field for the XPLOR segment id. The field is mapped to _atom_site.auth_asym_id, but the data type used in the dictionary does not permit embedded blanks, which may occur in the field. The problem is side-stepped for totally blank fields by mapping them to a period.
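
The blank-field rule can be sketched as follows in Python. This is an illustration under stated assumptions, not the pdb2cif implementation: the function name map_segment_id is hypothetical, the segment id is assumed to occupy columns 73-76 of the record, and stripping the surrounding blanks from a non-blank field is a choice made for this sketch only.

```python
def map_segment_id(record):
    """Map the segment id field of an ATOM/HETATM record to a CIF value."""
    padded = record.rstrip("\n").ljust(80)
    seg_id = padded[72:76]  # assumed columns 73-76 (1-based, inclusive)
    if seg_id.strip() == "":
        # A totally blank field becomes a period.
        return "."
    # Assumption: outer blanks are dropped. Interior blanks would remain,
    # which is the case the dictionary data type does not permit.
    return seg_id.strip()
```

A value such as "A  B" still comes back with embedded blanks, which is the unresolved case described above.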

Additional data items for categories like _struct_topol will need to be added as they evolve.

The output produced is in fairly close compliance with mmCIF 0.8.2. However, we have introduced a few additional tokens via the PUBL_MANUSCRIPT_INCL category. Though we have done so in a manner which conforms to the example in the dictionary, the result is not, strictly speaking, proper syntax, since the entry_id, which is the category key, is given outside the loop.

The definitive documentation of the program is, of course, the program itself. However, for those interested in the background relationship between the PDB format and mmCIF, we have included a partial concordance.

This program is distributed as an m4 macro script "pdb2cif.m4" from which three executable scripts have been made: "pdb2cif.pl", "pdb2cif.oawk", "pdb2cif.awk". A makefile is provided to show how the executable scripts were made, but you need not rebuild them. They are current. If you attempt to rebuild the perl script you may have difficulty with the awk to perl conversion program a2p, which fails for this script on many systems. A properly configured a2p is provided in the distribution directory in perl5.001_sgi.built.tar.Z.

Installing pdb2cif

If your system is sufficiently similar to ours, then you may be able to install the program simply by making one of the three versions executable:

On most unix systems, you can make the script into an executable program by executing one of the following sets of commands, depending on whether you want the perl, awk, or old-awk version to be pdb2cif:

chmod 755 pdb2cif.pl
ln -s pdb2cif.pl pdb2cif

chmod 755 pdb2cif.oawk
ln -s pdb2cif.oawk pdb2cif

chmod 755 pdb2cif.awk
ln -s pdb2cif.awk pdb2cif

after which pdb2cif may be executed directly.

NOTE

On some systems, you may need to use "gawk" instead of "awk". pdb2cif.awk uses features which are _not_ found in the original Aho, Kernighan, Weinberger, "Awk - a pattern scanning and processing language," but which have since been added on most systems: functions and the call to "system". If the use of function or system generates a syntax error, you may wish to obtain the gnu version of awk, "gawk", to be able to run pdb2cif. The other system dependency you may have is in the use of a system call to "date". Some systems do not support the 4-digit year format code %Y, and others do not support format codes at all. In the first case, you can change the %Y to 19%y (just remember to fix this in the year 2000), but in the second case, you should just comment out the offending call. The call is marked with a WARNING comment in the m4 script.

If your system is different, you may have to rebuild from the pdb2cif.m4. You do this with the program make and Makefile. The first thing you need to know is where you have a working version of perl or gnu-awk. Edit Makefile to show the correct path to at least one of them. Be warned that rebuilding the perl version from a standard perl release may fail. Before you do so, you may wish to save pdb2cif.pl elsewhere. If you have a good version of perl with a version of the utility a2p built with a very large OPSMAX, then execute the command

make perl_pdb2cif

If you have a good version of gnu-awk, then execute the command

make awk_pdb2cif

instead.

You can test your installation with

make tests

Flags

The operation of this program is controlled by the following flags, which may be set by statements of the form

#define variable value

in the entry or by including header files with definitions in the list of arguments before the entry.

The following flag is used to produce a more complete CIF entry, i.e. data items are given, but with the value "?".

#define verbose [yes|no]

where "yes" implies verbose output.

The following flag controls conversion of text fields using the type-setting codes used in some PDB entries

#define convtext [yes|no]

where "yes" implies the use of the 1992 PDB format description typesetting conventions.

The following flags control conversion of author and editor names

#define auth_convtext [yes|conditional|no]
#define junior_on_last [yes|no]

where "yes" for auth_convtext implies the conversion of names independent of the setting of convtext, "conditional" implies "yes" only if convtext is "yes" and "no" means to pass through the PDB style name unchanged. If conversion is done, then "yes" for junior_on_last will follow the COMCIFS convention of keeping "dynastic" modifiers, such as "Junior," "Senior," "II," etc. with the family name. The typesetting used differs slightly from the 1992 PDB format description, by forcing capitalization after "'" and "-". If the translations done are not satisfactory, special cases may be handled by including

#define name PDB_form name_value

where the PDB_form is the form of the name expected in the PDB and name_value is the form to be used by this program. All blanks in either form must be replaced by "_". For example, you can give the following

#define name E.F.MEYER_JUNIOR Meyer_Junior,_E.F.

If the same name is defined multiple times, only the last translation given will be used. The PDB_form is not case-sensitive, but the name_value is.
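
The rules for name definitions (blanks written as "_", a case-insensitive PDB_form, and the last definition winning) can be sketched in Python as below. This is an illustration, not the pdb2cif code. The function names are hypothetical, and turning "_" back into blanks in the translated name is an assumption made for this sketch.

```python
def parse_name_definitions(header_lines):
    """Collect '#define name PDB_form name_value' lines into a lookup table."""
    table = {}
    for line in header_lines:
        fields = line.split()
        if len(fields) == 4 and fields[0] == "#define" and fields[1] == "name":
            pdb_form, name_value = fields[2], fields[3]
            # PDB_form is not case-sensitive, so store it upper-cased.
            # A later definition simply overwrites an earlier one.
            table[pdb_form.upper()] = name_value.replace("_", " ")
    return table


def translate_name(table, pdb_name):
    """Return the translated name, or None if no definition applies."""
    key = pdb_name.replace(" ", "_").upper()
    return table.get(key)


table = parse_name_definitions(
    ["#define name E.F.MEYER_JUNIOR Meyer_Junior,_E.F."]
)
print(translate_name(table, "E.F.MEYER JUNIOR"))
```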

The following flag controls the distribution of label_seq_id to all atom site lines. Select the value "yes" if you do _not_ want this distribution done, but want denser atom lists

#define dense_list [yes|no]

The following flag controls the printing of TER records

#define print_ter [yes|no]
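
Before a long run it can help to check header files for misspelled flags or values. The following Python sketch does that for the flags listed above. It is not part of pdb2cif: the function name check_flags is hypothetical, and the sketch assumes that flag names and values are case-sensitive and that a later definition overrides an earlier one.

```python
# Flags and permitted values, as listed in this section.
ALLOWED = {
    "verbose": {"yes", "no"},
    "convtext": {"yes", "no"},
    "auth_convtext": {"yes", "conditional", "no"},
    "junior_on_last": {"yes", "no"},
    "dense_list": {"yes", "no"},
    "print_ter": {"yes", "no"},
}


def check_flags(header_lines):
    """Return (flags, problems) for the '#define variable value' lines."""
    flags, problems = {}, []
    for line in header_lines:
        fields = line.split()
        # Skip anything that is not a flag definition, including name definitions.
        if len(fields) < 3 or fields[0] != "#define" or fields[1] == "name":
            continue
        variable, value = fields[1], fields[2]
        if variable not in ALLOWED:
            problems.append(f"unknown flag: {variable}")
        elif value not in ALLOWED[variable]:
            problems.append(f"bad value for {variable}: {value}")
        else:
            flags[variable] = value  # assumption: the last definition wins
    return flags, problems
```

For example, check_flags(open("default.pdbh")) returns the accepted settings together with a list of lines that need attention.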

Running the Program

You should put any flag definitions that will be used for most entries into a file named default.pdbh (a sample is included in the distribution directory), and any definitions required by a particular entry into a file with the name of the entry and the extension "pdbh". The program and the header files should be in your current working directory. If you wish, you may put the program into another directory and modify your path to point to it, but the header files must be local, or you will need to give rooted paths for each of them.

Then you can convert a single file named entry.ent by executing

pdb2cif default.pdbh entry.pdbh entry.ent > entry.cif

for example

pdb2cif default.pdbh 4ins.pdbh 4ins.ent > 4ins.cif

To run with a directory of pdb files such that *.ent -> *.cif:

foreach i (*.ent)
set head = ($i:r)
touch $head.pdbh
pdb2cif default.pdbh $head.pdbh $i > $head.cif
end
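
The loop above is written for the C shell. Where csh is not available, the same batch run can be sketched in Python. This is an illustration only: the function names build_command and convert_directory are hypothetical, and the sketch assumes that pdb2cif is on your path and that default.pdbh is in the current working directory.

```python
from pathlib import Path
import subprocess


def build_command(ent_path, defaults="default.pdbh"):
    """Build the pdb2cif argument list for one entry file."""
    ent_path = Path(ent_path)
    header = ent_path.with_suffix(".pdbh")
    return ["pdb2cif", defaults, str(header), str(ent_path)]


def convert_directory(directory="."):
    """Convert every *.ent file in a directory to a *.cif file."""
    for ent_path in sorted(Path(directory).glob("*.ent")):
        # Create an empty per-entry header if none exists, like touch.
        ent_path.with_suffix(".pdbh").touch()
        with open(ent_path.with_suffix(".cif"), "w") as out:
            # Redirect standard output to the .cif file; stop on failure.
            subprocess.run(build_command(ent_path), stdout=out, check=True)
```

Calling convert_directory() from the directory that holds the entries mirrors the csh loop, except that it stops at the first entry for which pdb2cif returns a non-zero exit status.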

Notes on the m4 script

If you are reading the m4 script, please note the macro definitions used for the build. If you modify this program, please note the following:

You cannot use the m4 substr or index
The quotation marks used are: \036 and \037
The version for PERL is obtained by defining "PERL"
Do not use "split" directly; use "dosplit"
Defining "NOLOWER" replaces calls to the built-in "tolower" or "toupper" with loops
Defining "NOFUNCS" causes the functions we define to be expanded in-line
Defining "BADSPLIT" includes code to correct for a PERL field miscount

Updated 7 March 1999

yaya@bernstein-plus-sons.com, fcb@bernstein-plus-sons.com, bourne@sdsc.edu