There is no convenient way to provide systematic nomenclature, sources, etc. for chemical and biological units which consist of ensembles of entities. The STRUCT_BIOL_ENS category allows the structural representations of such ensembles to be defined, but there is no parallel subcategory for ENTITY to handle the reference information. As a solution we propose the following changes to the ENTITY category and the addition of an ENTITY_ENS sub-category.

Please note that this is not a theoretical problem. For many structures, the meaningful entities for which sources, systematic names, etc., are used are on a higher level than the individual chains or components to which such information must, at present, be linked. This creates the unappealing possibility of having to repeat such information many times in an entry, associating the same information with each chain in an ensemble to ensure completeness for searches. It would seem preferable to create a higher level of structure. At the very least this would provide a means to state sources for complete structures of multiple chains when the sources are most meaningfully given at that level.

The major issue to consider is whether to do this sort of thing at the level of ENTITY, to add more items to STRUCT_BIOL, or to create a new category distinct from both. As long as one is careful not to create loops, it seems sufficient for the purpose at hand to simply augment ENTITY in the manner suggested below and to avoid a major expansion of the dictionary with a large number of new items similar to the ones already well-defined in ENTITY and its sub-categories.

What we propose below is to allow an additional _entity.type value of 'ensemble'. Any of the entity types for which substructure is meaningful, i.e. 'polymer', 'non-polymer' or 'ensemble', could be given sub-structure with entries in the ENTITY_ENS sub-category, and in some cases the higher-level ensemble, rather than a lower-level chain, may appear as the entity_id in an atom_site list.

Consider, for example, a family of viruses with three well-defined protein domains, 1, 2 and 3, which are normally expressed as a chain in the order 3-2-1 and then cleaved. However, some members of this family retain 3-2 as a single chain. It is useful to identify each of the domains as an entity, but for the structure with the uncleaved 3-2 chain, it is the ensemble polymer, 3-2, for which the entity_poly_seq chain would be mapped to the atom_site list.

Also consider the case of an engineered structure spliced together from well-known major sub-components and with strong homologies to some natural product. It would be useful to include the entity source information for each of the sub-components, for the engineered structure, and for the entity to which the strong homology applies.

As one last example, consider the case of structures containing immunoglobulin Fab fragments, which could require a single chain to be linked to two different PIR and/or SWISS-PROT entries. These databases represent sequences for the various immunoglobulin domains as separate entries (see DBREF in the PDB format description). In that case, suppose the full chain were entity 1. We could then define two sub-chains with entity ids 1-1 and 1-2, each of which would be given the appropriate entity_poly_seq information (duplicating portions of the entity_poly_seq sequences for entity 1 but with new entity ids and nums), and entity 1 would be declared to be an ensemble derived from entities 1-1 and 1-2. Only entity 1 would appear in the atom_site list, and the database citations would be given for the sub-chains to which they apply using entity_reference.
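The Fab fragment case, together with the earlier caution about loops, can be sketched in Python. This is only an illustrative sketch and not part of the proposed dictionary: it assumes the ENTITY_ENS records have already been read into (ens_entity_id, entity_id, multiplicity) tuples, and both the function name find_loop and the reading of "loop" as a chain of ensemble memberships that returns to its starting entity are assumptions made for this sketch.

```python
from collections import defaultdict


def find_loop(records):
    """Return a list of entity ids forming a membership loop, or None.

    `records` is an iterable of (ens_entity_id, entity_id, multiplicity)
    tuples, mirroring the three key items of the proposed ENTITY_ENS
    category. This representation is an assumption of the sketch.
    """
    # Map each ensemble to the component entities it is built from.
    children = defaultdict(list)
    for ens_id, entity_id, _multiplicity in records:
        children[ens_id].append(entity_id)

    in_progress, finished = set(), set()

    def visit(node, path):
        # Depth-first search; meeting a node already on the current
        # path means the ensemble definitions are circular.
        in_progress.add(node)
        path.append(node)
        for child in children.get(node, ()):
            if child in in_progress:
                return path[path.index(child):] + [child]
            if child not in finished:
                found = visit(child, path)
                if found:
                    return found
        path.pop()
        in_progress.discard(node)
        finished.add(node)
        return None

    for node in list(children):
        if node not in finished:
            found = visit(node, [])
            if found:
                return found
    return None


# Fab example from the text: entity 1 is an ensemble of sub-chains 1-1 and 1-2.
fab_records = [("1", "1-1", "1"), ("1", "1-2", "1")]
```

With these definitions, find_loop(fab_records) finds nothing to report, whereas adding a record that declares 1-1 to be an ensemble containing entity 1 would be reported as a loop.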
We have not created an explicit link to STRUCT_BIOL, but such links may eventually prove useful. For the moment it seems sufficient to make an implicit link where needed by use of related names and to keep the changes to the dictionary minimal.

What follows is a marked-up revision to the ENTITY category and then a completely new ENTITY_ENS sub-category. We have marked changes with "#" at the end of the relevant lines in the ENTITY category.

    -- HJB+FCB

############
## ENTITY ##
############

save_ENTITY
    _category.description
;              Data items in the ENTITY category record details (such as
               chemical composition, name, and source) about the molecular
               entities that are present in a crystallographic structure.

               Items in the various ENTITY sub-categories provide a full
               chemical description of these molecular entities.

               Entities are of three basic types: 'polymer', 'non-polymer'  #
               and 'water', and one composite type, 'ensemble'.  #
               #
               Note that the water category includes only water; ordered
               solvent such as a sulfate ion or acetone would be described as  #
               individual non-polymer entities.  #

               The ENTITY category is specific to macromolecular CIF
               applications, and replaces the function of the CHEMICAL
               category in the CIF core.

               It is important to remember that the ENTITY data are not the
               result of the crystallographic experiment; those results are
               represented in the ATOM_SITE data items. ENTITY data items
               describe the chemistry of the molecules under investigation,
               and can most usefully be thought of as the ideal groups to
               which the structure is restrained or constrained during
               refinement.

               Ensemble type entities are the ideal representations of the  #
               chemical or biological units formed from more than one  #
               entity, such as biological units described in the category  #
               STRUCT_BIOL_GEN. However, any entity type, other than water,  #
               may have appropriate subcomponents defined by entries in the  #
               ENTITY_ENS sub-category.
               #

               It is also important to remember that entities do not
               correspond directly to the enumeration of the contents of the
               asymmetric unit. Entities are described only once, even in
               those structures that contain multiple observations of an
               entity. The STRUCT_ASYM data items, which reference the entity
               list, describe and label the contents of the asymmetric unit.
;
    _category.id                  entity
    _category.mandatory_code      no
    _category_key.name          '_entity.id'
     loop_
    _category_group.id           'inclusive_group'
                                 'entity_group'
     loop_
    _category_examples.detail
    _category_examples.case
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    Example 1 - based on PDB entry 5HVP and laboratory records for the
                structure corresponding to PDB entry 5HVP
;
;
    loop_
    _entity.id
    _entity.type
    _entity.formula_weight
    _entity.details
      1     polymer      10916
    ;              The enzymatically competent form of HIV protease is a
                   dimer. This entity corresponds to one monomer of an
                   active dimer.
    ;
      2     non-polymer  'need number here'  'Acetyl-Pepstatin'  #
      3     water        18                  '.'
      hivp  ensemble     20832               'HIV protease dimer'  #
      5hvp  ensemble     'need number here'
    ;              The complete entry is cited as 5hvp  #
    ;
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     save_

save__entity.details
    _item_description.description
;              A description of special aspects of the entity.
;
    _item.name                  '_entity.details'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               text
     save_

save__entity.formula_weight
    _item_description.description
;              Formula mass in daltons of the entity.
;
    _item.name                  '_entity.formula_weight'
    _item.category_id             entity
    _item.mandatory_code          no
     loop_
    _item_range.maximum
    _item_range.minimum            .    1.0
                                  1.0   1.0
    _item_type.code               float
     save_

save__entity.id
    _item_description.description
;              The value of _entity.id must uniquely identify a record in the
               ENTITY list.

               Note that this item need not be a number; it can be any unique
               identifier.
;
     loop_
    _item.name
    _item.category_id
    _item.mandatory_code
               '_entity.id'                      entity               yes
               '_atom_site.entity_id'            atom_site            no
               '_entity_ens.ens_entity_id'       entity_ens           yes  #
               '_entity_ens.entity_id'           entity_ens           yes  #
               '_entity_keywords.entity_id'      entity_keywords      yes
               '_entity_name_com.entity_id'      entity_name_com      yes
               '_entity_name_sys.entity_id'      entity_name_sys      yes
               '_entity_poly.entity_id'          entity_poly          yes
               '_entity_poly_seq.entity_id'      entity_poly_seq      yes
               '_entity_poly_seq_dif.entity_id'  entity_poly_seq_dif  yes
               '_entity_reference.entity_id'     entity_reference     yes
               '_entity_src_gen.entity_id'       entity_src_gen       yes
               '_entity_src_nat.entity_id'       entity_src_nat       yes
               '_struct_asym.entity_id'          struct_asym          yes
    _item_type.code               char
     loop_
    _item_linked.child_name
    _item_linked.parent_name
               '_atom_site.entity_id'            '_entity.id'
               '_entity_ens.ens_entity_id'       '_entity.id'  #
               '_entity_ens.entity_id'           '_entity.id'  #
               '_entity_keywords.entity_id'      '_entity.id'
               '_entity_name_com.entity_id'      '_entity.id'
               '_entity_name_sys.entity_id'      '_entity.id'
               '_entity_poly.entity_id'          '_entity.id'
               '_entity_poly_seq.entity_id'      '_entity_poly.entity_id'
               '_entity_poly_seq_dif.entity_id'  '_entity_poly_seq.entity_id'
               '_entity_reference.entity_id'     '_entity.id'
               '_entity_src_gen.entity_id'       '_entity.id'
               '_entity_src_nat.entity_id'       '_entity.id'
               '_struct_asym.entity_id'          '_entity.id'
     save_

save__entity.src_method
    _item_description.description
;              The method by which the sample for the entity was produced.
               Entities isolated directly from natural sources (tissues, soil
               samples, etc.) are expected to have further information in the
               ENTITY_SRC_NAT category. Entities isolated from genetically
               manipulated sources are expected to have further information
               in the ENTITY_SRC_GEN category.
;
    _item.name                  '_entity.src_method'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               ucode
     loop_
    _item_enumeration.value
    _item_enumeration.detail      nat
;                                 entity isolated from a natural source
;
                                  man
;                                 entity isolated from a genetically
                                  manipulated source
;
     save_

save__entity.type
    _item_description.description
;              Defines the type of the entity.

               Polymer entities are expected to have corresponding
               ENTITY_POLY and associated entries. Polymer entities  #
               may have subcomponents described with ENTITY_ENS entries.  #

               Non-polymer entities are expected to have corresponding
               CHEM_COMP and associated entries. Non-polymer entities  #
               may have subcomponents described with ENTITY_ENS entries.  #

               Water entities are not expected to have corresponding
               entries in the ENTITY category.

               Ensemble entities are normally expected to have  #
               corresponding STRUCT_BIOL_GEN entries, but this is not  #
               mandatory, since reference may be needed to an ensemble  #
               not structurally determined. If an ensemble is more properly  #
               identified as a polymer or coherent non-polymer component  #
               then the types 'polymer' or 'non-polymer' should be used. In  #
               any of these cases entities which are ensembles of significant  #
               subcomponents are expected to have entries in the ENTITY_ENS  #
               sub-category.  #
;
    _item.name                  '_entity.type'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               ucode
     loop_
    _item_enumeration.value
    _item_enumeration.detail      polymer      'entity is a polymer'
                                  non-polymer  'entity is not a polymer'
                                  water        'water in the solvent model'
                                  ensemble     'ensemble of entities'
     save_

##### ***** what follows from here down is new ********

################
## ENTITY_ENS ##
################

save_ENTITY_ENS
    _category.description
;              Data items in the ENTITY_ENS category specify ensembles of
               entities which may themselves be viewed as meaningful
               chemical entities.
               A given polymeric or non-polymeric entity or water may
               participate more than once, a fractional number of times, or
               an indeterminate number of times in an ensemble of entities,
               and may participate in more than one ensemble.

               Entity ensembles will not normally be cited by
               _atom_site.entity_id if it is appropriate to cite a simpler
               entity, but there are structures for which a polymer which is
               properly identified as an ensemble of polymers should be cited
               by the more complex entity rather than the simpler components.

               Ensembles often will correspond to data items defined in the
               STRUCT_BIOL category, in which case it is recommended that
               similar or related names be used. A strict relationship is
               not enforced, to permit generality of references to ensemble
               entities for which structural information is not available.

               ENTITY_ENS should be used to construct _chemical_ entities,
               rather than particular structural occurrences of entities. A
               particular ensemble should be defined only once, no matter how
               many times it occurs in the structure. Thus if a structure
               consists of two homologous chains forming a dimer and two
               copies of the dimer occur in the asymmetric unit, we could
               define one entity for the chain and one entity described using
               ENTITY_ENS for the dimer. The dimer need not be defined as an
               entity unless it has chemical significance beyond
               consideration of the specifics of the structural
               determination.
               The ensemble created need not contain exact or complete
               replicas of the component entities, but must be derived from
               those components by some well-defined mechanism which can be
               explained in _entity_ens.detail.
;
    _category.id                  entity_ens
    _category.mandatory_code      no
     loop_
    _category_key.name          '_entity_ens.ens_entity_id'
                                '_entity_ens.entity_id'
                                '_entity_ens.multiplicity'
     loop_
    _category_group.id           'inclusive_group'
                                 'entity_group'
     loop_
    _category_examples.detail
    _category_examples.case
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    Example 1 - based on PDB entry 5HVP and laboratory records for the
                structure corresponding to PDB entry 5HVP
;
;
    loop_
    _entity_ens.ens_entity_id
    _entity_ens.entity_id
    _entity_ens.multiplicity
    _entity_ens.detail
      1     1  2    'complete dimer of the HIV-1 protease'
      hivp  1  2    'dimer'
      2     2  1    'acetyl-pepstatin to complete the complex'
      5HVP  1  2    'full entity, beginning with dimer'
      5HVP  2  1    'inhibitor'
      5HVP  3  2    'CL'
      5HVP  4  162  'HOH'
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     save_

save__entity_ens.ens_entity_id
    _item_description.description
;              This data item is a pointer to _entity.id in the ENTITY
               category.

               The value of _entity_ens.ens_entity_id must uniquely identify
               the ensemble of entities being described by a group of records
               in the ENTITY_ENS list.

               While this item is _not_ a pointer to _struct_biol.id, it is
               recommended that the values chosen be related in a consistent
               manner to avoid confusion.

               Note that this item need not be a number; it can be any unique
               identifier.
;
    _item.name                  '_entity_ens.ens_entity_id'
    _item.category_id             entity_ens
    _item.mandatory_code          yes
    _item_type.code               text
     loop_
    _item_examples.case          '1'
                                 '5HVP'
     save_

save__entity_ens.entity_id
    _item_description.description
;              This data item is a pointer to _entity.id in the ENTITY
               category.
               This data item is used to list the component entities of
               which an ensemble is composed.
;
    _item.name                  '_entity_ens.entity_id'
    _item.mandatory_code          yes
     save_

save__entity_ens.detail
    _item_description.description
;              Details describing the entity ensemble.
;
    _item.name                  '_entity_ens.detail'
    _item.category_id             entity_ens
    _item.mandatory_code          no
    _item_type.code               text
     loop_
    _item_examples.case          'dimer'
                                 'complete entity studied'
     save_

save__entity_ens.multiplicity
    _item_description.description
;              The value of _entity_ens.multiplicity is a text string which
               describes the number of times the entity given by
               _entity_ens.entity_id for the same record participates in the
               ensemble.

               Note that this item need not be a number. For a multimeric
               entity, this value will normally be an integer, but for cases
               where fractions of an entity participate in an ensemble and
               for polymers with indefinite repeats, non-integral or even
               non-numeric values may appear.
;
    _item.name                  '_entity_ens.multiplicity'
    _item.category_id             entity_ens
    _item.mandatory_code          yes
    _item_type.code               text
     loop_
    _item_examples.case
    _item_examples.detail
      .
;              the entity appears an indeterminate number of times
;
      1     'the entity appears once'
      .5    'half the entity is used in the ensemble'
      x
;              the entity appears an indeterminate number of times
               (different symbols such as x and y can be used to show that
               there is no relationship between the number of occurrences of
               the different entities)
;
      2x
;              the entity appears an indeterminate number of times, but
               there is a dependence among the multiplicities of different
               entities, so that, say, one entity might have this
               multiplicity of 2x and another of 3x to show a 2:3 ratio
;
     save_
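
The example values above suggest a simple grammar for _entity_ens.multiplicity: an optional number followed by an optional symbol, with '.' standing for a wholly indeterminate count. A minimal Python sketch of a reader for such values follows. The names Multiplicity, parse_multiplicity and ratio are hypothetical, and the grammar itself is an assumption inferred from the examples, since the dictionary defines the item only as free text.

```python
import re
from typing import NamedTuple, Optional


class Multiplicity(NamedTuple):
    coefficient: Optional[float]  # None when the count is wholly indeterminate
    symbol: Optional[str]         # e.g. "x"; None for a plain number


# Assumed grammar: an optional number, then an optional alphabetic symbol.
_PATTERN = re.compile(r"(\d+\.?\d*|\.\d+)?([A-Za-z]+)?")


def parse_multiplicity(text: str) -> Multiplicity:
    """Split a multiplicity string such as '2x' into number and symbol."""
    text = text.strip()
    if text == ".":
        # '.' is listed in the examples as an indeterminate count.
        return Multiplicity(None, None)
    match = _PATTERN.fullmatch(text)
    if match is None or not text:
        raise ValueError(f"unrecognised multiplicity: {text!r}")
    number, symbol = match.groups()
    # A bare symbol such as 'x' is read as 1 * x.
    coefficient = float(number) if number is not None else 1.0
    return Multiplicity(coefficient, symbol)


def ratio(a: str, b: str) -> Optional[float]:
    """Return the ratio a:b when both values share the same symbol."""
    first, second = parse_multiplicity(a), parse_multiplicity(b)
    if first.coefficient is None or not second.coefficient:
        return None  # indeterminate, or division by zero
    if first.symbol != second.symbol:
        return None  # e.g. x versus y: no stated relationship
    return first.coefficient / second.coefficient
```

Under these assumptions, ratio("2x", "3x") recovers the 2:3 relationship described for the 2x example, while ratio("x", "y") yields None because different symbols carry no stated relationship.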