This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

entity ensembles

herbert_bernstein (yaya@aip.org)
Wed, 20 Dec 95 12:38:50 EST


There is no convenient way to provide systematic nomenclature, sources, etc.
for chemical and biological units which consist of ensembles of entities.
The STRUCT_BIOL_ENS category allows the structural representations of such 
ensembles to be defined, but there is no parallel subcategory for ENTITY
to handle the reference information.  As a solution we propose the following
changes to the ENTITY category and the addition of an ENTITY_ENS sub-category.

Please note that this is not a theoretical problem.  For many structures, the
meaningful entities to which sources, systematic names, etc. apply are at a
higher level than the individual chains or components to which such
information must, at present, be linked.  This creates the unappealing
prospect of repeating the same information many times in an entry,
once for each chain in an ensemble, to ensure completeness for searches.
It would seem preferable to create a higher level of structure.  At the very
least this would provide a means to state sources for complete structures of
multiple chains when the sources are most meaningfully given at that level.

The major issue to consider is whether to do this sort of thing at the
level of ENTITY, to add more items to STRUCT_BIOL, or to create a new category
distinct from both.  As long as one is careful not to create loops, it
seems sufficient for the purpose at hand to simply augment ENTITY in the
manner suggested below and to avoid a major expansion of the dictionary
with a large number of new items similar to the ones already well-defined
in ENTITY and its sub-categories.
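
To make the caution about loops concrete, here is a minimal sketch in
Python of how a validator might check that ensemble definitions never
refer back to themselves.  It is not part of the proposed dictionary.  It
assumes each ENTITY_ENS record has been reduced to a pair of
(_entity_ens.ens_entity_id, _entity_ens.entity_id); the function name
find_ensemble_loop and that pair layout are hypothetical.

```python
# Hypothetical helper, not part of the proposal: look for loops among
# ENTITY_ENS records given as (ens_entity_id, entity_id) pairs.

def find_ensemble_loop(records):
    """Return one loop as a list of ids, or None if the links are loop-free."""
    # Map each ensemble id to the ids of its direct components.
    children = {}
    for ens_id, component_id in records:
        children.setdefault(ens_id, []).append(component_id)

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        # Depth-first walk; meeting an ACTIVE node again means a loop.
        state[node] = ACTIVE
        path.append(node)
        for child in children.get(node, []):
            child_state = state.get(child, UNSEEN)
            if child_state == ACTIVE:
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                loop = visit(child, path)
                if loop:
                    return loop
        path.pop()
        state[node] = DONE
        return None

    for start in list(children):
        if state.get(start, UNSEEN) == UNSEEN:
            loop = visit(start, [])
            if loop:
                return loop
    return None
```

Under these assumptions, a set of records in which ensemble "a" contains
"b" and "b" in turn contains "a" would be reported as the loop a, b, a,
while a strictly layered set of ensembles would return None.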

What we propose below is to allow an additional _entity.type value of
'ensemble'.  Any of the entity types for which substructure is meaningful,
i.e. 'polymer', 'non-polymer' or 'ensemble', could be given sub-structure
with entries in the ENTITY_ENS sub-category, and in some cases the
higher level ensemble may appear as the entity_id in an atom_site list,
rather than a lower level chain.  Consider, for example, a family of
viruses with three well-defined protein domains:  1, 2 and 3, which
are normally expressed as a chain in the order 3-2-1 and then cleaved.
However, there are some members of this family of viruses which retain
3-2 as a single chain.  It is useful to identify each of the domains as an 
entity, but for the structure with the uncleaved 3-2 chain, it is the 
ensemble polymer, 3-2, for which the entity_poly_seq chain would be mapped to 
the atom_site list.  Also consider the case of an engineered structure spliced
together from well-known major sub-components and with strong homologies to some
natural product.  It would be useful to include the entity source
information for each of the sub-components, for the engineered structure, and
for the entity to which the strong homology applies.

As one last example, consider the case of structures containing immunoglobulin
Fab fragments, which could require a single chain to be linked to two
different PIR and/or SWISS-PROT entries.  These databases represent sequences
for the various immunoglobulin domains as separate entries (see DBREF in
the PDB format description).  In that case, suppose the full chain were
entity 1, then we could define two sub-chains with entity id's 1-1 and 1-2,
each of which would be given the appropriate entity_poly_seq information
(duplicating portions of the entity_poly_seq sequences for entity 1 but
with new entity id's and num's) and then entity 1 would be declared to
be an ensemble derived from entity 1-1 and 1-2.  Only entity 1 would
appear in the atom_site list, and the database citations would be
given for the sub-chains to which they apply using entity_reference.
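
The Fab arrangement just described can be sketched in Python as follows.
This is only an illustration under stated assumptions: the entity ids 1,
1-1 and 1-2 come from the text above, but the tuple layouts, the function
name references_for, and the accession codes PLACEHOLDER-A and
PLACEHOLDER-B are hypothetical and stand in for real database entries.

```python
# Illustrative records only.  Ids follow the text; the database codes
# are placeholders, not real PIR or SWISS-PROT accession codes.

# (ens_entity_id, entity_id, multiplicity): entity 1 is built from 1-1 and 1-2
ens_rows = [
    ("1", "1-1", "1"),
    ("1", "1-2", "1"),
]

# (entity_id, database name, database code): citations sit on the sub-chains
reference_rows = [
    ("1-1", "PIR", "PLACEHOLDER-A"),
    ("1-2", "SWISS-PROT", "PLACEHOLDER-B"),
]


def references_for(entity_id, ens_rows, reference_rows, _seen=None):
    """Collect citations for an entity and, recursively, its components."""
    seen = set() if _seen is None else _seen
    if entity_id in seen:          # guard against accidental loops
        return []
    seen.add(entity_id)
    # Citations attached directly to this entity.
    found = [(ref_id, db, code)
             for ref_id, db, code in reference_rows
             if ref_id == entity_id]
    # Citations attached to any component of this entity.
    for ens_id, component_id, _multiplicity in ens_rows:
        if ens_id == entity_id:
            found.extend(
                references_for(component_id, ens_rows, reference_rows, seen))
    return found


# Entity 1 is the only one cited in atom_site, yet a search starting from
# it reaches both database citations through its sub-chains.
fab_references = references_for("1", ens_rows, reference_rows)
```

The point of the sketch is that the citations are stated once, on the
sub-chains to which they apply, and are still reachable from the full
chain that appears in the atom_site list.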

We have not created an explicit link to STRUCT_BIOL, but such links
may eventually prove useful.  For the moment it seems sufficient to
make an implicit link where needed by use of related names and keep the
changes to the dictionary minimal.

What follows is a marked up revision to the ENTITY category and then
a completely new ENTITY_ENS sub-category.

We have marked changes with "#" at the end of the relevant lines in the
ENTITY category.

    -- HJB+FCB

############
## ENTITY ##
############

save_ENTITY
    _category.description
;              Data items in the ENTITY category record details (such as
               chemical composition, name, and source) about the molecular
               entities that are present in a crystallographic structure.

               Items in the various ENTITY sub-categories provide a full
               chemical description of these molecular entities.

               Entities are of three basic types:  'polymer', 'non-polymer'    #
               and 'water', and one composite type, 'ensemble'.                #
                                                                               #
               Note that the water category includes only water;  ordered
               solvent such as a sulfate ion or acetone would be described as  #
               individual non-polymer entities.                                #

               The ENTITY category is specific to macromolecular CIF
               applications, and replaces the function of the CHEMICAL category.
               It is important to remember that the ENTITY data are not the
               result of the crystallographic experiment;  those results are
               represented in the ATOM_SITE data items.  ENTITY data items
               describe the chemistry of the molecules under investigation,
               and can most usefully be thought of as the ideal groups to which
               the structure is restrained or constrained during refinement.

               Ensemble type entities are the ideal representations of the     #
               chemical or biological units formed from more than one          #
               entity, such as biological units described in the category      #
               STRUCT_BIOL_GEN.  However, any entity type, other than water,   #
               may have appropriate subcomponents defined by entries in the    #
               ENTITY_ENS sub-category.                                        #

               It is also important to remember that entities do not correspond
               directly to the enumeration of the contents of the asymmetric
               unit.  Entities are described only once, even in those structures
               that contain multiple observations of an entity.  The
               STRUCT_ASYM data items, which reference the entity list,
               describe and label the contents of the asymmetric unit.
;
    _category.id                  entity
    _category.mandatory_code      no
    _category_key.name          '_entity.id'
     loop_
    _category_group.id           'inclusive_group'
                                 'entity_group'
     loop_
    _category_examples.detail
    _category_examples.case
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    Example 1 - based on PDB entry 5HVP and laboratory records for the
                structure corresponding to PDB entry 5HVP
;
;
     loop_
    _entity.id
    _entity.type
    _entity.formula_weight
    _entity.details
      1  polymer       10916
    ;                  The enzymatically competent form of HIV protease is a
                       dimer.  This entity corresponds to one monomer of an
                       active dimer.
    ;
      2  non-polymer  'need number here'  'Acetyl-Pepstatin'                   #
      3  water         18  '.'
      hivp  ensemble   20832 'HIV protease dimer'                              #
      5hvp  ensemble   'need number here'
    ;                  The complete entry is cited as 5hvp                     #
    ; 

;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     save_

save__entity.details
    _item_description.description
;              A description of special aspects of the entity.
;
    _item.name                  '_entity.details'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               text
     save_

save__entity.formula_weight
    _item_description.description
;              Formula mass in daltons of the entity.
;
    _item.name                  '_entity.formula_weight'
    _item.category_id             entity
    _item.mandatory_code          no
     loop_
    _item_range.maximum
    _item_range.minimum
                                  .    1.0
                                 1.0   1.0
    _item_type.code               float
     save_

save__entity.id
    _item_description.description
;              The value of _entity.id must uniquely identify a record in the
               ENTITY list.

               Note that this item need not be a number;  it can be any unique
               identifier.
;
     loop_
    _item.name
    _item.category_id
    _item.mandatory_code
               '_entity.id'                      entity               yes
               '_atom_site.entity_id'            atom_site            no
               '_entity_ens.ens_entity_id'       entity_ens           yes      #
               '_entity_ens.entity_id'           entity_ens           yes      #
               '_entity_keywords.entity_id'      entity_keywords      yes
               '_entity_name_com.entity_id'      entity_name_com      yes
               '_entity_name_sys.entity_id'      entity_name_sys      yes
               '_entity_poly.entity_id'          entity_poly          yes
               '_entity_poly_seq.entity_id'      entity_poly_seq      yes
               '_entity_poly_seq_dif.entity_id'  entity_poly_seq_dif  yes
               '_entity_reference.entity_id'     entity_reference     yes
               '_entity_src_gen.entity_id'       entity_src_gen       yes
               '_entity_src_nat.entity_id'       entity_src_nat       yes
               '_struct_asym.entity_id'          struct_asym          yes
    _item_type.code               char
     loop_
    _item_linked.child_name
    _item_linked.parent_name
               '_atom_site.entity_id'            '_entity.id'
               '_entity_ens.ens_entity_id'       '_entity.id'                  #
               '_entity_ens.entity_id'           '_entity.id'                  #
               '_entity_keywords.entity_id'      '_entity.id'
               '_entity_name_com.entity_id'      '_entity.id'
               '_entity_name_sys.entity_id'      '_entity.id'
               '_entity_poly.entity_id'          '_entity.id'
               '_entity_poly_seq.entity_id'      '_entity_poly.entity_id'
               '_entity_poly_seq_dif.entity_id'  '_entity_poly_seq.entity_id'
               '_entity_reference.entity_id'     '_entity.id'
               '_entity_src_gen.entity_id'       '_entity.id'
               '_entity_src_nat.entity_id'       '_entity.id'
               '_struct_asym.entity_id'          '_entity.id'
     save_

save__entity.src_method
    _item_description.description
;              The method by which the sample for the entity was produced.
               Entities isolated directly from natural sources (tissues, soil
               samples, etc.) are expected to have further information in the
               ENTITY_SRC_NAT category.  Entities isolated from genetically
               manipulated sources are expected to have further information in
               the ENTITY_SRC_GEN category.
;
    _item.name                  '_entity.src_method'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               ucode
     loop_
    _item_enumeration.value
    _item_enumeration.detail      nat
;                                 entity isolated from a natural source
;
                                  man
;                                 entity isolated from a genetically
                                  manipulated source
;
     save_

save__entity.type
    _item_description.description
;              Defines the type of the entity.

               Polymer entities are expected to have corresponding
               ENTITY_POLY and associated entries.  Polymer entities           #
               may have subcomponents described with ENTITY_ENS entries.       #

               Non-polymer entities are expected to have corresponding
               CHEM_COMP and associated entries.  Non-polymer entities         #
               may have subcomponents described with ENTITY_ENS entries.       #

               Water entities are not expected to have corresponding
               entries in the ENTITY category.

               Ensemble entities are normally expected to have                 #
               corresponding STRUCT_BIOL_GEN entries, but this is not          #
               mandatory, since reference may be needed to an ensemble         #
               not structurally determined.  If an ensemble is more properly   #
               identified as a polymer or coherent non-polymer component       #
               then the types 'polymer' or 'non-polymer' should be used.  In   #
               any of these cases entities which are ensembles of significant  #
               subcomponents are expected to have entries in the ENTITY_ENS    #
               sub-category.                                                   #

;
    _item.name                  '_entity.type'
    _item.category_id             entity
    _item.mandatory_code          no
    _item_type.code               ucode
     loop_
    _item_enumeration.value
    _item_enumeration.detail      polymer      'entity is a polymer'
                                  non-polymer  'entity is not a polymer'
                                  water        'water in the solvent model'
                                  ensemble     'ensemble of entities'
     save_

##### ***** what follows from here down is new ******** 
#####################
## ENTITY_ENS      ##
#####################

save_ENTITY_ENS
    _category.description
;              Data items in the ENTITY_ENS category specify ensembles of
               entities which may themselves be viewed as meaningful
               chemical entities.  A given polymeric or non-polymeric
               entity or water may participate more than once, a
               fractional number of times, or an indeterminate number
               of times in an ensemble of entities, and may participate
               in more than one ensemble.  Entity ensembles will not normally
               be cited by _atom_site.entity_id if it is appropriate to cite
               a simpler entity, but there are structures for which a polymer
               which is properly identified as an ensemble of polymers should
               be cited by the more complex entity rather than the simpler
               components. Ensembles often will correspond to data items
               defined in the STRUCT_BIOL category, in which case it
               is recommended that similar or related names be used.  A
               strict relationship is not enforced to permit generality
               of references to ensemble entities for which structural
               information is not available.

               ENTITY_ENS should be used to construct _chemical_ entities,
               rather than particular structural occurrences of entities.
               A particular ensemble should be defined only once, no matter
               how many times it occurs in the structure.  Thus if a structure
               consists of two homologous chains forming a dimer and two copies
               of the dimer occur in the asymmetric unit, we could define one
               entity for the chain and one entity described using ENTITY_ENS
               for the dimer.  The dimer need not be defined as an entity unless
               it has chemical significance beyond consideration of the
               specifics of the structural determination.

               The ensemble created need not contain exact or complete
               replicas of the component entities, but must be derived
               from those components by some well-defined mechanism
               which can be explained in _entity_ens.detail.

;
    _category.id                  entity_ens
    _category.mandatory_code      no
     loop_
    _category_key.name          '_entity_ens.ens_entity_id'
                                '_entity_ens.entity_id'
                                '_entity_ens.multiplicity'
     loop_
    _category_group.id           'inclusive_group'
                                 'entity_group'
     loop_
    _category_examples.detail
    _category_examples.case
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    Example 1 - based on PDB entry 5HVP and laboratory records for the
                structure corresponding to PDB entry 5HVP
;
;
     loop_
    _entity_ens.ens_entity_id
    _entity_ens.entity_id
    _entity_ens.multiplicity
    _entity_ens.detail
      1     1     2    'complete dimer of the HIV-1 protease'
      hivp  1     2    'dimer'
      2     2     1    'acetyl-pepstatin to complete the complex'
     5HVP   1     2    'full entity, beginning with dimer'
     5HVP   2     1    'inhibitor'
     5HVP   3     2    'CL'
     5HVP   4     162  'HOH'
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     save_


save__entity_ens.ens_entity_id
    _item_description.description
;              This data item is a pointer to _entity.id in the ENTITY category.

               The value of _entity_ens.ens_entity_id must uniquely identify
               the ensemble of entities being described by a group of records
               in the ENTITY_ENS list.  While this item is _not_ a
               pointer to _struct_biol.id, it is recommended that
               the values chosen be related in a consistent manner to
               avoid confusion.

               Note that this item need not be a number;  it can be any unique
               identifier.
;
    _item.name                  '_entity_ens.ens_entity_id'
    _item.category_id             entity_ens
    _item.mandatory_code          yes
    _item_type.code               text
     loop_
    _item_examples.case          '1'
                                 '5HVP'
     save_

save__entity_ens.entity_id
    _item_description.description
;              This data item is a pointer to _entity.id in the ENTITY category.
 
               This data item is used to list the component entities of which an
               ensemble is composed.
;
    _item.name                  '_entity_ens.entity_id'
    _item.mandatory_code          yes
     save_

save__entity_ens.detail
    _item_description.description
;              Details describing the entity ensemble.
;
    _item.name                  '_entity_ens.detail'
    _item.category_id             entity_ens
    _item.mandatory_code          no
    _item_type.code               text
     loop_
    _item_examples.case          'dimer'
                                 'complete entity studied'
     save_

save__entity_ens.multiplicity
    _item_description.description
;              The value of _entity_ens.multiplicity is a text string which
               describes the number of times the entity given by
               _entity_ens.entity_id for the same record participates in
               the ensemble.

               Note that this item need not be a number.  For a multimeric
               entity, this value will normally be an integer, but for cases
               where fractions of an entity participate in an ensemble and
               for polymers with indefinite repeats, non-integral or
               even non-numeric values may appear.
;
    _item.name                  '_entity_ens.multiplicity'
    _item.category_id             entity_ens
    _item.mandatory_code          yes
    _item_type.code               text
     loop_
    _item_examples.case 
    _item_examples.detail
                                 . 
;                                      the entity appears an indeterminate
                                       number of times
;
                                 1    'the entity appears once'
                                 .5   'half the entity is used in the ensemble'
                                 x
;                                      the entity appears an indeterminate
                                       number of times  (different symbols such
                                       as x and y can be used to show that
                                       there is no relationship between the
                                       number of occurrences of the different
                                       entities)
;
                                 2x
;                                      the entity appears an indeterminate
                                       number of times, but there is a
                                       dependence among the multiplicities of
                                       different entities, so that, say, one
                                       entity might have this multiplicity of
                                       2x and another of 3x to show a 2:3 ratio
;
    save_
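
As an aid to reading the _entity_ens.multiplicity examples above, here is
a minimal sketch in Python of one way software might interpret such
strings.  The proposal itself leaves the value as free text, so the
"coefficient followed by an optional one-letter symbol" reading, and the
names parse_multiplicity and ratio, are assumptions made only for this
illustration.

```python
import re
from fractions import Fraction

# Assumed grammar (not defined by the proposal): an optional decimal
# coefficient followed by an optional single-letter symbol, e.g. '2x'.
_MULTIPLICITY = re.compile(r"^(\d+(?:\.\d+)?|\.\d+)?([A-Za-z]?)$")


def parse_multiplicity(text):
    """Return (coefficient, symbol); (None, None) means indeterminate '.'."""
    text = text.strip()
    if text == ".":
        return (None, None)
    match = _MULTIPLICITY.match(text)
    if text == "" or match is None:
        raise ValueError("unrecognised multiplicity: %r" % text)
    number, symbol = match.groups()
    if number is None:
        coefficient = Fraction(1)          # a bare symbol such as 'x'
    else:
        if number.startswith("."):
            number = "0" + number          # '.5' is read as '0.5'
        coefficient = Fraction(number)
    return (coefficient, symbol or None)


def ratio(first, second):
    """Ratio of two multiplicities, or None when they are not comparable."""
    coeff_a, symbol_a = parse_multiplicity(first)
    coeff_b, symbol_b = parse_multiplicity(second)
    if coeff_a is None or coeff_b is None:
        return None                        # at least one is indeterminate
    if symbol_a != symbol_b or coeff_b == 0:
        return None                        # unrelated symbols, or no divisor
    return coeff_a / coeff_b
```

Under this reading, the values 2x and 3x compare as a 2:3 ratio, while x
and y are treated as unrelated, which matches the intent stated in the
examples.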