This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

Molecular structure definitions

Eldon Ulrich (elu@gir.nmrfam.wisc.edu)
Wed, 6 Sep 1995 11:54:23 -0500


Molecular Structure Descriptions:

I want to add my support to the comments Dale Tronrud has made concerning
the need to describe a molecule and its structure.  I feel it is worthwhile
to revisit the overall goal of describing molecules or entities, and a
proposed set of tokens is outlined below.  Within the current mmCIF, it
appears that only single-chain polymers and non-polymers can be described
with tokens whose definitions do not assume that a structure has been
determined.  In most cases, the complexity of the system being studied is
known to a higher degree than this before the data are collected (multimeric
molecular structures or complex molecules are known to be present).  To
describe these structures, as Dale has pointed out, additional tokens are
needed to define bonds that link polymers.  Tokens are also needed to
describe the polymers that associate to form a quaternary molecular
structure, to name those structures, and to provide references to
nomenclature systems and other databases.

I am developing a data deposition form to be used by NMR spectroscopists for
submitting information to BioMagResBank and a flat-file format for
distributing data from the databank.  We are using the STAR format and want
to use data tokens compatible with the mmCIF tokens, if not identical,
wherever possible.  In designing the form, it is useful to define, using one
supercategory of tokens, the complete chemical structure for each molecule in
the system studied.  Unfortunately, the _entity, _chem_bond, and _chem_link
categories do not appear to have all of the tokens needed.  Below (starting
with the comment "Biological system definition") are listed a large number of
tokens in the format of the deposition form I am constructing.  Many of these
tokens or equivalent ones are available in mmCIF, but many are not.  I
propose that constructs be added to mmCIF that will allow higher-order
molecular structures to be described without presupposing experimental data.
I am not concerned that the actual constructs adopted follow the outline
listed below.  However, I would be interested in comments on this outline, as
we plan to make it available to the NMR community in the near future.  Please
remember that this is a draft and is in the format of a deposition form, not
a dictionary.

With these tokens, the intent is to be able to describe hemoglobin, hexose
kinase, lipoproteins, peptidoglycans, enzyme inhibitor complexes, and hopefully
a wide variety of other structures. The _system tokens are included, because I
need to describe solution systems that may involve the transient interaction
between two or more entities.

All of the structural information has been unified under the 'entity'
umbrella.  I would prefer to use the term 'mol' instead of entity for two
reasons:  1) entity is a very broad term that can cover a wide variety of
objects;  2) entity has a relatively well defined meaning in the database
world that causes confusion when discussing schemas and file formats.  The
_chem_link_bond tokens in mmCIF are now listed as _entity_chem_link.
Link_bond seemed redundant, but this is also a minor issue.  Also, the term
'label' has been used in many tokens where `id' is used in mmCIF.  Within our
database, we have reserved the term `id' for numeric values that simply
identify a particular instance of a relationship or row in a table and carry
no implied meaning, unlike `VAL' used as an `id' for a chemical compound
structure in mmCIF.  I think in mmCIF `id' tokens at times do carry
contextual information and at other times are intended as just row markers,
and that this can be confusing.

A category `_bond_site' has been constructed similar to `_atom_site' for use
in loops where data relevant to individual bonds may be listed.


Other Considerations:

1.  The abbreviation `comp' is used both for compounds and computer.
Possibly `cmptr' could be used in computer related tokens.

2.  `exptl' with some fonts appears to mean `experiment one'.  Could this be
abbreviated further to `expt' without causing many problems?

3.  Additional tokens:
        _citation.CAS_AN       Chemical Abstracts number
        _citation.book_city    Is in publisher token, but I think
                               would be useful on its own.
        _citation.keywords     Often useful for searching

4.  The term monomer or abbreviation `mon' can be confused as meaning either
the monomeric unit of a polymer or a monomer in a dimeric or higher order
structure. Since `mon' is used relatively infrequently in the current mmCIF,
`residue' might be substituted.

5.  I would recommend using Chemical Abstracts abbreviations for journals as a
standard.


Best Regards,
Eldon



###########################
#  Biological system definition  #
###########################

#  This section defines the macromolecules and small molecules that form the
#  system reported on in this entry.  The system may consist of a single
#  molecule, such as ribonuclease, but may also be defined by several
#  molecules, as in the case of a study involving the tryptophan repressor -
#  DNA operator complex formed in the presence of tryptophan.  Include the
#  molecules for which NMR data is provided and those molecules that are
#  significant for the study.  Do not include buffers, salts, solvents, etc.,
#  as these should be described in the section where the contents of each
#  sample are listed.


save_system

	_System.name
	_System.detail

	loop_
		_System_constit.name
		_System_constit.label
		_System_constit.function

	stop_

save_
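
As an illustration only, the frame above might be filled in as follows for
the tryptophan repressor study mentioned in the comments.  The labels, the
function strings, and the detail text are hypothetical placeholders chosen
for this sketch, not values from an actual deposition.

```star
save_system

	_System.name      'trp repressor - operator DNA - tryptophan complex'
	_System.detail    'hypothetical example entry'

	loop_
		_System_constit.name
		_System_constit.label
		_System_constit.function

		# one row per molecule that is significant for the study
		'trp repressor'   trp_repressor   'DNA-binding protein'
		'operator DNA'    trp_operator    'repressor binding site'
		'L-tryptophan'    TRP             corepressor

	stop_

save_
```

Each value in the `_System_constit.label' column is assumed to name the
saveframe in which that molecule is defined.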


######################
#  Molecule definitions  #
######################

#  Three classes of molecular structure are defined in this section:
#  multimeric molecules, complex molecules, and simple or homogeneous
#  molecules.  Multimeric molecules are defined as macromolecules that have
#  quaternary structure and are formed by the association of two or more
#  subunits, each of which may have various degrees of complexity.  Complex
#  molecules are constructed of covalently linked homo- or hetero-polymers, of
#  complexes involving tightly associated but non-covalently linked molecules,
#  or of non-polymer compounds bound to one or more polymers.  Homogeneous
#  molecules are either polymeric or non-polymeric.  Polymeric molecules of
#  this class are constructed of one type of monomer (e.g. amino acids,
#  deoxyribonucleotides, ribonucleotides, or carbohydrates).  Examples of
#  non-polymeric molecules would include free amino acids, enzyme prosthetic
#  groups, substrates, inhibitors, and other small molecules.

#  A category, `_entity_chem_struct', is available for defining the atoms and
#  bonds that make up a small molecule, a monomer found in a polymer, or a
#  molecular fragment that is part of a complex molecule.

#  Create a `saveframe' block for each molecule that comprises the system
#  being studied.


######################
#  Multimeric molecules  #
######################


save_<_entity_multimeric.label>

	_entity_multimeric.label
	_entity_multimeric.name
	_entity_multimeric.details

	loop_
		_entity_multimeric.subunit_unique_label
		_entity_multimeric.subunit_unique_name
		_entity_multimeric.subunit_label

	stop_

	loop_
		_entity_multimeric.reference.database_label
		_entity_multimeric.reference.database_code

	stop_

	loop_
		_entity_multimeric.class_system_name	#  Enzyme Commission; CAS registry
		_entity_multimeric.class_system_code	#  EC number; CAS registry number

	stop_

	loop_
		_entity_multimeric.synonym

	stop_

save_


#  The following saveframe is used to declare the molecules that are present
#  in each subunit of a multimeric macromolecule.  Each subunit found in a
#  multimeric macromolecule must be assigned a unique identifier so that the
#  locations of individual atoms can be described when specific data are
#  listed later in the form.


save_<_entity_multimeric.subunit_label>

	_entity_multimeric.subunit_label
	_entity_multimeric.subunit_name

	loop_
		_entity_multimeric.subunit_member_unique_label
		_entity_multimeric.subunit_member_label

	stop_

	loop_
		_entity_multimeric.subunit_synonym

	stop_

save_
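
To show how the two multimeric saveframes might fit together, here is a
sketch for hemoglobin, one of the target structures named earlier.  It
reflects one reading of the draft tokens: each `subunit_unique_label' names
one copy of a subunit, while `subunit_label' points to the frame describing
that kind of subunit.  All labels are hypothetical, and the reference,
classification, and synonym loops are left out for brevity.

```star
save_hemoglobin

	_entity_multimeric.label      hemoglobin
	_entity_multimeric.name       hemoglobin
	_entity_multimeric.details    'alpha2 beta2 tetramer'

	loop_
		_entity_multimeric.subunit_unique_label
		_entity_multimeric.subunit_unique_name
		_entity_multimeric.subunit_label

		# four subunit copies, two subunit types
		alpha_1   'alpha subunit 1'   alpha_subunit
		alpha_2   'alpha subunit 2'   alpha_subunit
		beta_1    'beta subunit 1'    beta_subunit
		beta_2    'beta subunit 2'    beta_subunit

	stop_

save_

save_alpha_subunit

	_entity_multimeric.subunit_label   alpha_subunit
	_entity_multimeric.subunit_name    'hemoglobin alpha subunit'

	loop_
		_entity_multimeric.subunit_member_unique_label
		_entity_multimeric.subunit_member_label

		# the alpha chain carries a heme group, so it would itself be
		# defined in a complex-molecule frame
		alpha_chain   hemoglobin_alpha_chain

	stop_

save_
```

A matching `save_beta_subunit' frame would be needed to complete the
description.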



#############################
#  Complex Molecular structures  #
#############################

#  Two types of complex molecules are defined: those that consist of multiple
#  covalently linked polymer structures (for instance, insulin, CD2, and
#  erythropoietin), and those that are defined by prosthetic groups or other
#  molecules and atoms covalently or tightly associated with a polymer
#  (cytochrome c, flavodoxin, calmodulin with bound calcium, the alpha chain
#  of hemoglobin, etc.).


save_<_entity_complex.label>

	_entity_complex.label
	_entity_complex.common.name
	_entity_complex.formula_weight
	_entity_complex.details

	loop_
		_entity_complex.member_unique_label
		_entity_complex.member_label
	stop_

	loop_
		_entity_complex_reference.database_name
		_entity_complex_reference.database_code
		_entity_complex_reference.details

	stop_

	loop_
		 _entity_complex.class_system_name
		 _entity_complex.class_system_code

	stop_

	loop_
		_entity_complex.synonym

	stop_

save_


#######################################################
#  Single chain simple polymeric molecules and small molecules  #
#######################################################

#  Molecules are complete chemical structures of either polymer or
#  non-polymer type.  Ribonuclease, lysozyme, water, acetone, and dioxane are
#  a few examples.


save_<_entity.label>

	_entity.label
	_entity.common.name
	_entity_chem_struct.label
	_entity.type
	_entity_poly.type
	_entity.formula_weight
	_entity.details

	loop_
		_entity_reference.database_name
		_entity_reference.database_code
		_entity_reference.details

	stop_

	loop_
		 _entity.class_system_name
		 _entity.class_system_code

	stop_

#  The sequence of a polymeric molecule is provided in the following loop.
#  Standard one-letter or three-letter nomenclature for amino acids and
#  nucleotides will be assumed unless indicated otherwise.  Any non-standard
#  monomers should be given unique labels and should have their chemical
#  structure and linkage to the polymer described in the section following
#  this loop where unique chemical compounds are described.

	loop_
		_entity_poly_seq.num
		_entity_poly_seq.position_label		# Author-defined sequence position label.

		_entity_poly_seq.mol_chem_struct.label

	stop_

	loop_
		_entity.synonym

	stop_

save_


#  Structures for complete chemical compounds that are non-polymer entities,
#  or fragments of chemical compounds that are monomers used to form polymers,
#  and their linkage within the polymer.

save_<_entity_chem_struct.label>

	_entity_chem_struct.label
	_entity_chem_struct.name
	_entity_chem_struct.detail

#  List the atoms that comprise the compound, their chirality and formal
#  charge.  Include protons in the atom list.

	loop_
		_entity_chem_struct_atom.atom_label
		_entity_chem_struct_atom.chirality
		_entity_chem_struct_atom.charge

	stop_

#  List the bonds that link the atoms in the compound, and the type of each
#  bond.  Include bonds to protons in the list.

	loop_
		_entity_chem_struct_bond.label
		_entity_chem_struct_bond.atom_label_atom_one
		_entity_chem_struct_bond.atom_label_atom_two
		_entity_chem_struct_bond.value_order

	stop_

save_
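
For a concrete picture of the atom and bond loops, a minimal sketch for
water (one of the small molecules named above) could look like this.  The
frame label `HOH', the atom labels, the use of `.' for an inapplicable
chirality, and the bond-order word `single' are all assumptions made for
the example; the draft does not yet fix a vocabulary for these items.

```star
save_HOH

	_entity_chem_struct.label     HOH
	_entity_chem_struct.name      water
	_entity_chem_struct.detail    'hypothetical example entry'

	loop_
		_entity_chem_struct_atom.atom_label
		_entity_chem_struct_atom.chirality
		_entity_chem_struct_atom.charge

		# protons are listed explicitly, as the form requests
		O    .   0
		H1   .   0
		H2   .   0

	stop_

	loop_
		_entity_chem_struct_bond.label
		_entity_chem_struct_bond.atom_label_atom_one
		_entity_chem_struct_bond.atom_label_atom_two
		_entity_chem_struct_bond.value_order

		1   O   H1   single
		2   O   H2   single

	stop_

save_
```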


			#########################
			#  Molecule chemical links  #
			#########################

#  The covalent bonds that link non-standard monomers within a polymer, one
#  polymer to another, or chemical compounds that are covalently attached to
#  a polymer are listed here.

save_molecule_chemical_links

	loop_
		_entity_chem_link.label

		_entity_chem_link.mol_multimeric.label_atom_one
		_entity_chem_link.mol_multimeric.subunit_unique_label_atom_one
		_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_one
		_entity_chem_link.mol_complex.label_atom_one
		_entity_chem_link.mol_complex.member_unique_label_atom_one
		_entity_chem_link.mol.label_atom_one
		_entity_chem_link.mol_poly_seq.num_atom_one
		_entity_chem_link.mol_chem_struct.label_atom_one
		_entity_chem_link.mol_chem_struct_atom.atom_label_atom_one

		_entity_chem_link.mol_multimeric.label_atom_two
		_entity_chem_link.mol_multimeric.subunit_unique_label_atom_two
		_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_two
		_entity_chem_link.mol_complex.label_atom_two
		_entity_chem_link.mol_complex.member_unique_label_atom_two
		_entity_chem_link.mol.label_atom_two
		_entity_chem_link.mol_poly_seq.num_atom_two
		_entity_chem_link.mol_chem_struct.label_atom_two
		_entity_chem_link.mol_chem_struct_atom.atom_label_atom_two

		_entity_chem_link.value_order

	stop_

save_
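
As a sketch of how one row of this loop might be filled in, consider the
disulfide bond between Cys 7 of the insulin A chain and Cys 7 of the B
chain.  Insulin is treated here as a complex molecule with two polymer
members, so the three multimeric columns do not apply and are given as `.'
(an assumed convention for inapplicable items).  All labels and the
bond-order word `single' are hypothetical.

```star
save_molecule_chemical_links

	loop_
		_entity_chem_link.label

		_entity_chem_link.mol_multimeric.label_atom_one
		_entity_chem_link.mol_multimeric.subunit_unique_label_atom_one
		_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_one
		_entity_chem_link.mol_complex.label_atom_one
		_entity_chem_link.mol_complex.member_unique_label_atom_one
		_entity_chem_link.mol.label_atom_one
		_entity_chem_link.mol_poly_seq.num_atom_one
		_entity_chem_link.mol_chem_struct.label_atom_one
		_entity_chem_link.mol_chem_struct_atom.atom_label_atom_one

		_entity_chem_link.mol_multimeric.label_atom_two
		_entity_chem_link.mol_multimeric.subunit_unique_label_atom_two
		_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_two
		_entity_chem_link.mol_complex.label_atom_two
		_entity_chem_link.mol_complex.member_unique_label_atom_two
		_entity_chem_link.mol.label_atom_two
		_entity_chem_link.mol_poly_seq.num_atom_two
		_entity_chem_link.mol_chem_struct.label_atom_two
		_entity_chem_link.mol_chem_struct_atom.atom_label_atom_two

		_entity_chem_link.value_order

		# one link: label, nine items for atom one, nine for atom two, order
		disulfide_A7_B7
		.   .   .   insulin   A_chain   insulin_A   7   CYS   SG
		.   .   .   insulin   B_chain   insulin_B   7   CYS   SG
		single

	stop_

save_
```

The length of a single row is a fair indication of why the short atom
identification labels are worth having.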



			################################
			#  Unique atom identification labels  #
			################################

#  The following loop can be used to define an identification label for
#  unique atoms within a molecular structure.  In constructing the data tables
#  found below, the atom identification label can be used instead of
#  repeating the large number of tokens required to define specific atoms.

save_atom_identification_labels

  loop_
	_Atom_site.id
	_Atom_site.mol_multimeric.label
	_Atom_site.mol_multimeric.subunit_unique_label
	_Atom_site.mol_multimeric.subunit_member_unique_label
	_Atom_site.mol_complex.label
	_Atom_site.mol_complex.member_unique_label
	_Atom_site.mol.label
	_Atom_site.mol_poly_seq.num
	_Atom_site.mol_chem_struct.label
	_Atom_site.mol_chem_struct_atom.atom_label

  stop_

save_
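
Two hypothetical rows may help show how the hierarchy of labels collapses
into a single `_Atom_site.id'.  The first row identifies an atom in a
single-chain polymer, where only the last four columns apply; the second
identifies an atom reached through every level, from multimeric molecule
down to chemical structure.  The use of `.' for inapplicable items and all
of the labels are assumptions made for this sketch.

```star
save_atom_identification_labels

  loop_
	_Atom_site.id
	_Atom_site.mol_multimeric.label
	_Atom_site.mol_multimeric.subunit_unique_label
	_Atom_site.mol_multimeric.subunit_member_unique_label
	_Atom_site.mol_complex.label
	_Atom_site.mol_complex.member_unique_label
	_Atom_site.mol.label
	_Atom_site.mol_poly_seq.num
	_Atom_site.mol_chem_struct.label
	_Atom_site.mol_chem_struct_atom.atom_label

	# alpha carbon of residue 1 in a single-chain polymer
	1   .   .   .   .   .   ribonuclease   1   LYS   CA

	# iron atom of the heme group in one alpha subunit of hemoglobin
	2   hemoglobin   alpha_1   alpha_chain   hemoglobin_alpha_chain   heme_group   heme   .   HEM   FE

  stop_

save_
```

Later data tables could then refer to atoms `1' and `2' instead of
repeating all nine identifying items.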


			################################
			#  Unique bond identification labels  #
			################################

#  As for atoms, this loop can be used to define an identification label for
#  unique bonds in a molecular structure.

save_bond_identification_labels

  loop_
	_Bond_site.id

	_Bond_site.mol_multimeric.label_atom_one
	_Bond_site.mol_multimeric.subunit_unique_label_atom_one
	_Bond_site.mol_multimeric.subunit_member_unique_label_atom_one
	_Bond_site.mol_complex.label_atom_one
	_Bond_site.mol_complex.member_unique_label_atom_one
	_Bond_site.mol.label_atom_one
	_Bond_site.mol_poly_seq.num_atom_one
	_Bond_site.mol_chem_struct.label_atom_one
	_Bond_site.mol_chem_struct_atom.atom_label_atom_one

	_Bond_site.mol_multimeric.label_atom_two
	_Bond_site.mol_multimeric.subunit_unique_label_atom_two
	_Bond_site.mol_multimeric.subunit_member_unique_label_atom_two
	_Bond_site.mol_complex.label_atom_two
	_Bond_site.mol_complex.member_unique_label_atom_two
	_Bond_site.mol.label_atom_two
	_Bond_site.mol_poly_seq.num_atom_two
	_Bond_site.mol_chem_struct.label_atom_two
	_Bond_site.mol_chem_struct_atom.atom_label_atom_two

  stop_

save_