Molecular Structure Descriptions:

I want to add my support to the comments Dale Tronrud has made concerning the need to describe a molecule and its structure. I feel it is worthwhile to revisit the overall goal of describing molecules or entities, and a proposed set of tokens is outlined below.

Within the current mmCIF, it appears that only single-chain polymers and non-polymers can be described with tokens whose definitions do not assume that a structure has been determined. In most cases, more than this is known about the complexity of the system being studied before the data are collected (multimeric molecular structures or complex molecules are known to be present). To describe these structures, as Dale has pointed out, additional tokens are needed to define bonds that link polymers. Tokens are also needed to describe the polymers that associate to form a quaternary molecular structure, to name those structures, and to provide references to nomenclature systems and other databases.

I am developing a data deposition form to be used by NMR spectroscopists for submitting information to BioMagResBank and a flat-file format for distributing data from the databank. We are using the STAR format and want to use data tokens compatible with the mmCIF tokens, if not identical, wherever possible. In designing the form, it is useful to define, using one supercategory of tokens, the complete chemical structure for each molecule in the system studied. Unfortunately, the _entity, _chem_bond, and _chem_link categories do not appear to have all of the tokens needed.

Below (starting with the comment "Biological system definition") are listed a large number of tokens in the format of the deposition form I am constructing. Many of these tokens or equivalent ones are available in mmCIF, but many are not. I propose that constructs be added to mmCIF that will allow higher-order molecular structures to be described before experimental data is implied.
I am not concerned that the actual constructs adopted follow the outline listed below. However, I would be interested in comments on this outline, as we plan to make it available to the NMR community in the near future. Please remember that this is a draft and is in the format of a deposition form, not a dictionary.

With these tokens, the intent is to be able to describe hemoglobin, hexose kinase, lipoproteins, peptidoglycans, enzyme inhibitor complexes, and hopefully a wide variety of other structures. The _system tokens are included because I need to describe solution systems that may involve the transient interaction between two or more entities.

All of the structural information has been unified under the 'entity' umbrella. I would prefer to use the term 'mol' instead of entity for two reasons:

1) entity is a very broad term that can cover a wide variety of objects.
2) entity has a relatively well defined meaning in the database world that causes confusion when discussing schemas and file formats.

The _chem_link_bond tokens in mmCIF are now listed as _entity_chem_link. Link_bond seemed redundant, but this is also a minor issue.

Also, the term 'label' has been used in many tokens where `id' is used in mmCIF. Within our database, we have reserved the term `id' for numeric values that simply identify a particular instance of a relationship or a row in a table. Such values have no implied meaning, unlike `VAL' used as an `id' for a chemical compound structure in mmCIF. I think that in mmCIF `id' tokens at times carry contextual information and at other times are intended as just row markers, and that this can be confusing.

A category `_bond_site' has been constructed similar to `_atom_site' for use in loops where data relevant to individual bonds may be listed.

Other Considerations:

1. The abbreviation `comp' is used both for compounds and computer. Possibly `cmptr' could be used in computer related tokens.

2. `exptl' with some fonts appears to mean `experiment one'. Could this be abbreviated further to `expt' without causing many problems?

3. Additional tokens:

   _citation.CAS_AN       Chemical Abstracts number
   _citation.book_city    Is in publisher token, but I think would be useful on its own.
   _citation.keywords     Often useful for searching

4. The term monomer or abbreviation `mon' can be confused as meaning either the monomeric unit of a polymer or a monomer in a dimeric or higher order structure. Since `mon' is used relatively infrequently in the current mmCIF, `residue' might be substituted.

5. I would recommend using Chemical Abstracts abbreviations for journals as a standard.

Best Regards,
Eldon

################################
# Biological system definition #
################################

# This section defines the macromolecules and small molecules that form the
# system reported on in this entry. The system may consist of a single molecule,
# such as ribonuclease, but may also be defined by several molecules as in the case of a study
# involving the tryptophan repressor - DNA operator complex formed in the presence of
# tryptophan. Include the molecules for which NMR data is provided and those molecules
# that are significant for the study. Do not include buffers, salts, solvents, etc., as these should
# be described in the section where the contents of each sample are listed.

save_system
   _System.name
   _System.detail

   loop_
      _System_constit.name
      _System_constit.label
      _System_constit.function
   stop_
save_

########################
# Molecule definitions #
########################

# Three classes of molecular structure are defined in this section: multimeric molecules,
# complex molecules, and simple or homogeneous molecules. Multimeric molecules are defined
# as macromolecules that have quaternary structure and are formed by the association of two
# or more subunits each of which may have various degrees of complexity.
# Complex molecules are constructed of covalently linked homo- or hetero-polymers, of
# complexes involving tightly associated but non-covalently linked molecules, or of
# non-polymer compounds bound to one or more polymers. Homogeneous molecules are either
# polymeric or non-polymeric. Polymeric molecules of this class are constructed of one type
# of monomer (e.g. amino acids, deoxyribonucleotides, ribonucleotides, carbohydrates, etc.).
# Examples of non-polymeric molecules would include free amino acids, enzyme prosthetic
# groups, substrates, inhibitors, and other small molecules. A category,
# `_entity_chem_struct', is available for defining the atoms and bonds that make up a small
# molecule, a monomer found in a polymer, or a molecular fragment that is part of a complex
# molecule. Create a `saveframe' block for each molecule that comprises the system being
# studied.

########################
# Multimeric molecules #
########################

save_<_entity_multimeric.label>
   _entity_multimeric.label
   _entity_multimeric.name
   _entity_multimeric.details

   loop_
      _entity_multimeric.subunit_unique_label
      _entity_multimeric.subunit_unique_name
      _entity_multimeric.subunit_label
   stop_

   loop_
      _entity_multimeric.reference.database_label
      _entity_multimeric.reference.database_code
   stop_

   loop_
      _entity_multimeric.class_system_name     # Enzyme Commission; CAS registry
      _entity_multimeric.class_system_code     # EC number; CAS registry number
   stop_

   loop_
      _entity_multimeric.synonym
   stop_
save_

# The following saveframe is used to declare the molecules that are present in each
# subunit of a multimeric macromolecule. Each subunit found in a multimeric macromolecule
# must be assigned a unique identifier so that the locations of individual atoms can be
# described when specific data are listed later in the form.
save_<_entity_multimeric.subunit_label>
   _entity_multimeric.subunit_label
   _entity_multimeric.subunit_name

   loop_
      _entity_multimeric.subunit_member_unique_label
      _entity_multimeric.subunit_member_label
   stop_

   loop_
      _entity_multimeric.subunit_synonym
   stop_
save_

################################
# Complex Molecular structures #
################################

# Two types of complex molecules are defined: those that consist of multiple covalently
# linked polymer structures (for instance, insulin, CD2, and erythropoietin), and those that
# are defined by prosthetic groups or other molecules and atoms covalently or tightly
# associated with a polymer (cytochrome c, flavodoxin, calmodulin with bound calcium, the
# alpha chain of hemoglobin, etc.).

save_<_entity_complex.label>
   _entity_complex.label
   _entity_complex.common.name
   _entity_complex.formula_weight
   _entity_complex.details

   loop_
      _entity_complex.member_unique_label
      _entity_complex.member_label
   stop_

   loop_
      _entity_complex_reference.database_name
      _entity_complex_reference.database_code
      _entity_complex_reference.details
   stop_

   loop_
      _entity_complex.class_system_name
      _entity_complex.class_system_code
   stop_

   loop_
      _entity_complex.synonym
   stop_
save_

###############################################################
# Single chain simple polymeric molecules and small molecules #
###############################################################

# Molecules are complete chemical structures of either polymer or non-polymer type.
# Ribonuclease, lysozyme, water, acetone, and dioxane are a few examples.

save_<_entity.label>
   _entity.label
   _entity.common.name
   _entity_chem_struct.label
   _entity.type
   _entity_poly.type
   _entity.formula_weight
   _entity.details

   loop_
      _entity_reference.database_name
      _entity_reference.database_code
      _entity_reference.details
   stop_

   loop_
      _entity.class_system_name
      _entity.class_system_code
   stop_

# The sequence of a polymeric molecule is provided in the following loop.
# Standard one-letter or three-letter nomenclature for amino acids and nucleotides will be
# assumed unless indicated otherwise. Any non-standard monomers should be given unique
# labels and should have their chemical structure and linkage to the polymer described in the
# section following this loop where unique chemical compounds are described.

   loop_
      _entity_poly_seq.num
      _entity_poly_seq.position_label          # Author defined sequence position label.
      _entity_poly_seq.mol_chem_struct.label
   stop_

   loop_
      _entity.synonym
   stop_
save_

# Structures for complete chemical compounds that are non-polymer entities or fragments of
# chemical compounds that are monomers used to form polymers and their linkage within the
# polymer.

save_<_entity_chem_struct.label>
   _entity_chem_struct.label
   _entity_chem_struct.name
   _entity_chem_struct.detail

# List the atoms that comprise the compound, their chirality and formal charge. Include
# protons in the atom list.

   loop_
      _entity_chem_struct_atom.atom_label
      _entity_chem_struct_atom.chirality
      _entity_chem_struct_atom.charge
   stop_

# List the bonds and their type that link the atoms in the compound. Include bonds to protons
# in the list.

   loop_
      _entity_chem_struct_bond.label
      _entity_chem_struct_bond.atom_label_atom_one
      _entity_chem_struct_bond.atom_label_atom_two
      _entity_chem_struct_bond.value_order
   stop_
save_

###########################
# Molecule chemical links #
###########################

# The covalent bonds that link non-standard monomers within a polymer, one polymer to
# another, or chemical compounds that are covalently attached to a polymer are listed here.
save_molecule_chemical_links
   loop_
      _entity_chem_link.label
      _entity_chem_link.mol_multimeric.label_atom_one
      _entity_chem_link.mol_multimeric.subunit_unique_label_atom_one
      _entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_one
      _entity_chem_link.mol_complex.label_atom_one
      _entity_chem_link.mol_complex.member_unique_label_atom_one
      _entity_chem_link.mol.label_atom_one
      _entity_chem_link.mol_poly_seq.num_atom_one
      _entity_chem_link.mol_chem_struct.label_atom_one
      _entity_chem_link.mol_chem_struct_atom.atom_label_atom_one
      _entity_chem_link.mol_multimeric.label_atom_two
      _entity_chem_link.mol_multimeric.subunit_unique_label_atom_two
      _entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_two
      _entity_chem_link.mol_complex.label_atom_two
      _entity_chem_link.mol_complex.member_unique_label_atom_two
      _entity_chem_link.mol.label_atom_two
      _entity_chem_link.mol_poly_seq.num_atom_two
      _entity_chem_link.mol_chem_struct.label_atom_two
      _entity_chem_link.mol_chem_struct_atom.atom_label_atom_two
      _entity_chem_link.value_order
   stop_
save_

#####################################
# Unique atom identification labels #
#####################################

# The following loop can be used to define an identification label for unique atoms within a
# molecular structure. In constructing the data tables found below the atom identification
# label can be used instead of repeating the large number of tokens required to define specific
# atoms.
save_atom_identification_labels
   loop_
      _Atom_site.id
      _Atom_site.mol_multimeric.label
      _Atom_site.mol_multimeric.subunit_unique_label
      _Atom_site.mol_multimeric.subunit_member_unique_label
      _Atom_site.mol_complex.label
      _Atom_site.mol_complex.member_unique_label
      _Atom_site.mol.label
      _Atom_site.mol_poly_seq.num
      _Atom_site.mol_chem_struct.label
      _Atom_site.mol_chem_struct_atom.atom_label
   stop_
save_

#####################################
# Unique bond identification labels #
#####################################

# As for atoms, this loop can be used to define an identification label for unique bonds in a
# molecular structure.

save_bond_identification_labels
   loop_
      _Bond_site.id
      _Bond_site.mol_multimeric.label_atom_one
      _Bond_site.mol_multimeric.subunit_unique_label_atom_one
      _Bond_site.mol_multimeric.subunit_member_unique_label_atom_one
      _Bond_site.mol_complex.label_atom_one
      _Bond_site.mol_complex.member_unique_label_atom_one
      _Bond_site.mol.label_atom_one
      _Bond_site.mol_poly_seq.num_atom_one
      _Bond_site.mol_chem_struct.label_atom_one
      _Bond_site.mol_chem_struct_atom.atom_label_atom_one
      _Bond_site.mol_multimeric.label_atom_two
      _Bond_site.mol_multimeric.subunit_unique_label_atom_two
      _Bond_site.mol_multimeric.subunit_member_unique_label_atom_two
      _Bond_site.mol_complex.label_atom_two
      _Bond_site.mol_complex.member_unique_label_atom_two
      _Bond_site.mol.label_atom_two
      _Bond_site.mol_poly_seq.num_atom_two
      _Bond_site.mol_chem_struct.label_atom_two
      _Bond_site.mol_chem_struct_atom.atom_label_atom_two
   stop_
save_
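
The idea behind the atom and bond identification labels can be sketched in a few lines of Python. This is only an illustration, assuming that an `id' is a meaning-free integer assigned in order of first use; it is not part of the proposed form. The class names (AtomAddress, SiteRegistry), the field names, and the example values are all hypothetical and were chosen only to loosely mirror the _Atom_site and _Bond_site tokens.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class AtomAddress:
    """Full address of one atom; fields loosely mirror the _Atom_site.* tokens."""
    mol_label: str
    chem_struct_label: str
    atom_label: str
    poly_seq_num: Optional[int] = None
    multimeric_label: Optional[str] = None
    subunit_unique_label: Optional[str] = None
    subunit_member_unique_label: Optional[str] = None
    complex_label: Optional[str] = None
    complex_member_unique_label: Optional[str] = None


class SiteRegistry:
    """Hands out meaning-free integer ids for atoms and bonds (hypothetical helper)."""

    def __init__(self) -> None:
        self._atom_ids: Dict[AtomAddress, int] = {}
        self._atoms: Dict[int, AtomAddress] = {}
        self._bonds: Dict[int, Tuple[AtomAddress, AtomAddress]] = {}

    def atom_id(self, address: AtomAddress) -> int:
        """Return the id of an address, assigning the next integer on first use."""
        if address not in self._atom_ids:
            new_id = len(self._atoms) + 1
            self._atom_ids[address] = new_id
            self._atoms[new_id] = address
        return self._atom_ids[address]

    def bond_id(self, atom_one: AtomAddress, atom_two: AtomAddress) -> int:
        """Record a bond between two fully addressed atoms and return its id."""
        new_id = len(self._bonds) + 1
        self._bonds[new_id] = (atom_one, atom_two)
        return new_id

    def atom(self, atom_id: int) -> AtomAddress:
        """Recover the full address that a short atom id stands for."""
        return self._atoms[atom_id]

    def bond(self, bond_id: int) -> Tuple[AtomAddress, AtomAddress]:
        """Recover the two atom addresses that a short bond id stands for."""
        return self._bonds[bond_id]


# Illustrative values only: two cysteine sulfur atoms in a single-chain molecule.
registry = SiteRegistry()
sg_26 = AtomAddress("ribonuclease", "CYS", "SG", poly_seq_num=26)
sg_84 = AtomAddress("ribonuclease", "CYS", "SG", poly_seq_num=84)

first = registry.atom_id(sg_26)
second = registry.atom_id(sg_84)
disulfide = registry.bond_id(sg_26, sg_84)

# Later data tables can cite the short ids and recover the full address on demand.
print(first, second, disulfide, registry.atom(second).poly_seq_num)
```

Each row of a later data table then needs only one integer per atom or bond, and the full set of multimeric, complex, molecule, sequence, and atom labels is stored once in the identification loop.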