******************************
** READ AND COMMENT, PLEASE **
******************************

Hello folks -

OK, all of the little stuff in the current round has been dealt with.
Now for some of the meatier issues.

At the end of this message, I summarize the whole thread on PDB Z values
versus the contents of _cell.formula_units_Z.  I know that including it
all here makes this message unduly long, but this is how I am trying to
do my own bookkeeping in managing this process.

That having been said, I would like to quote from a couple of these
messages.  First Fran, in her original posting:

> I do not think there is a need to have a discussion about the meaning
> or the usefulness of this parameter; it has been present in PDB entries
> as long as the present format has existed.  What would be appreciated
> would be the addition of a field in the mmCIF dictionary that would
> hold this quantity and would facilitate the translation of PDB to mmCIF
> and vice versa.

Then Lynn, in a response, speaking of the definition of
_cell.formula_units_Z:

> I don't
> think we should revise the definitions in a way
> that will make mmCIF less compatible with CIF.

These two messages summarize the conclusion of the committee on this
issue - that we really ought not to mess with the core definition in
something as fundamental as this, and that we are happy to add a data
item to give Fran the convertibility that she needs.

BUT...there are some points that were raised in this dialog that need a
reply.  At one point, Herb said:

> I don't think that one would actually be revising the core CIF
> definition in any way, just allowing the mmCIF definition to
> cover a case which is exactly parallel to the small molecule
> definition, by taking the entity definitions as the functional
> equivalents of formulae.  The macromolecular case is fuzzier
> than the small molecule case, but the PDB Z happens to match
> the wording of the _cell.formula_units_Z if one adds entity
> definitions as formulae.

The key point for me here is in the last sentence, "..if one adds entity
definitions as formulae."  This may already be clear to everyone, but
just to be sure I want to stress that the ENTITY category only tells us
that the asymmetric unit contains apples and oranges and pears, or
grapes and apples and bananas, or whatever.  The relevant category for
the discussion of Z is STRUCT_ASYM, the contents of which tell us that
the asymmetric unit contains 2 apples and 3 oranges and 6 pears, or 24
grapes and 1 apple and 2 bananas.

That is the point I would like to stress as we add Fran's new data item,
and hence I have written the definition as follows:

save__cell.Z_PDB
    _item_description.description
;              The number of the polymeric chains in a unit cell. In the
               case of heteropolymers, Z is the number of occurrences of
               the most populous chain.

               This data item is provided for compatibility with the
               original Protein Data Bank format, and only for that
               purpose.

               This is not a very satisfactory definition, as the
               multiplicity of macromolecular structures can be different
               for the different components of the cell. A more useful
               measure of a "macromolecular Z" could be obtained by
               counting the number of times each molecular entity appears
               in the STRUCT_ASYM list and multiplying by the number of
               equivalent positions in the unit cell. There would be one
               such number for each type of molecular entity.
;

Comments?

Paula

- - - - -

Lynn Ten Eyck writes:

There he goes again, spoiling a good discussion with facts . . .

Herbert Bernstein (yaya@aip.org) writes
>Here, for those who don't have it handy, is the 1992 PDB commentary
>on Z, which shows, I think, how close to the existing mmCIF
>Z value definition it is:
>
>"Confusion over the value to use for Z (number of molecules per cell)
>arises because of different conceptions of the meaning of 'molecule'.
>We have adopted the (crystallographic) convention that Z should equal
>the number of times the same polymeric chain is contained in the cell.
>In case of different numbers of chains per cell this will be explained
>in the REMARK section and Z will denote the number of the more
>populous species per cell."

This is moderately reasonable; I think in the case of, say, a
heterotrimer A2B I would prefer Z=1 instead of Z=2, because it would
seem to me that the "molecule" is A2B.  However, I believe these cases
are rare and there is not much point arguing over them further.

>That being said, it would seem the definition of _cell.formula_units_Z
>would work with the addition of the following:
>
>For macromolecular structures, the value of _cell.formula_units_Z
>is the number of occurrences of the entity defined by _entity_poly_seq
>in the cell.  For heterogeneous combinations of polymers for which
>the populations of distinct polymeric entities within a cell differ,
>the value for the most populous ones will be used.

This works, unless there is a holdout for applying a superstructure of
NCS definitions . . .  I think actually the heterogeneous multimeric
problems are covered in the requirement that the biological unit be
defined.

- - - - -

Herbert Bernstein writes:

Here, for those who don't have it handy, is the 1992 PDB commentary
on Z, which shows, I think, how close to the existing mmCIF
Z value definition it is:

"Confusion over the value to use for Z (number of molecules per cell)
arises because of different conceptions of the meaning of 'molecule'.
We have adopted the (crystallographic) convention that Z should equal
the number of times the same polymeric chain is contained in the cell.
In case of different numbers of chains per cell this will be explained
in the REMARK section and Z will denote the number of the more
populous species per cell."

That being said, it would seem the definition of _cell.formula_units_Z
would work with the addition of the following:

For macromolecular structures, the value of _cell.formula_units_Z
is the number of occurrences of the entity defined by _entity_poly_seq
in the cell.
For heterogeneous combinations of polymers for which
the populations of distinct polymeric entities within a cell differ,
the value for the most populous ones will be used.

This is not a critical issue, but it would seem helpful to users of
future multidisciplinary data base searches to use the same name for a
common concept where possible.

- - - - -

Herbert Bernstein writes:

I don't think that one would actually be revising the core CIF
definition in any way, just allowing the mmCIF definition to
cover a case which is exactly parallel to the small molecule
definition, by taking the entity definitions as the functional
equivalents of formulae.  The macromolecular case is fuzzier
than the small molecule case, but the PDB Z happens to match
the wording of the _cell.formula_units_Z if one adds entity
definitions as formulae.

- - - - -

Lynn Ten Eyck writes:

Frances has raised the issue of the PDB Z value and a place to put it in
mmCIF.  Personally I do not see a lot of use for the quantity -- it
doesn't give the crystallographic multiplicity of asymmetric units, and
for heteropolymers seems to me to be actively misleading.

_cell.formula_units_Z is rooted in small molecule crystallography, as is
essentially all of the chemical_formula material.  It seems to me that
the ENTITY data items were defined because macromolecular crystals are
not as well defined chemically as small molecule crystals, and if we
need a placeholder for the PDB Z value we should put one in.  I don't
think we should revise the definitions in a way that will make mmCIF
less compatible with CIF.  Would revision of the definition of
_cell.formula_units_Z do this?

- - - - -

Frances Bernstein writes:

This is a follow-on message to the one I sent yesterday about needing a
token for the PDB Z value, based on information provided by Herbert
Bernstein.

_cell.formula_units_Z is defined as:

;              The number of the formula units in the unit cell as
               specified by _chemical_formula.structural,
               _chemical_formula.moiety or _chemical_formula.sum.
;

But when I look at _chemical_formula I find:

;              Data items in the CHEMICAL_FORMULA category would not, in
               general, be used in a macromolecular CIF. See instead the
               ENTITY data items.

which seems to imply that one should not be using _cell.formula_units_Z
in mmCIF.

Herbert pointed out that if one were to revise the definition of
_cell.formula_units_Z, it could be used to hold the PDB Z value.

Apparently Phil Bourne assumes that _cell.formula_units_Z is meant to
hold the PDB Z value and he uses it for that purpose in pdb2cif.

- - - - -

Frances Bernstein writes:

The PDB has a field on CRYST1 records that is called Z but it is not the
same as the crystallographic Z.  Here is the definition from the PDB
format description:

     The Z-value is the number of polymeric chains in a unit cell.
     In the case of heteropolymers, Z is the number of occurrences
     of the most populous chain.

I do not think there is a need to have a discussion about the meaning
or the usefulness of this parameter; it has been present in PDB entries
as long as the present format has existed.  What would be appreciated
would be the addition of a field in the mmCIF dictionary that would
hold this quantity and would facilitate the translation of PDB to mmCIF
and vice versa.

- - - - -

********************************************************************************
Dr. Paula M. D. Fitzgerald  ______________ voice and FAX: (908) 594-5510
Merck Research Laboratories ______________ email: paula_fitzgerald@merck.com
P.O. Box 2000, Ry50-105     ______________    or  bean@merck.com
Rahway, NJ 07065 USA
(for express mail use 126 E. Lincoln Ave. instead of P. O. Box 2000)
********************************************************************************
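
Appended illustration: the two counts contrasted in the draft
_cell.Z_PDB definition in this message can be sketched in a few lines of
Python.  This is only a sketch under stated assumptions: the function
names, the representation of STRUCT_ASYM as a plain list of entity
labels, and the choice of 4 equivalent positions are all made up for
illustration and are not part of any dictionary or program.

```python
from collections import Counter


def entity_counts_per_cell(struct_asym_entities, n_equiv_positions):
    """Per-entity copies in the unit cell.

    struct_asym_entities: one entity label per STRUCT_ASYM row
    (hypothetical flat representation of that category).
    n_equiv_positions: number of equivalent positions in the unit cell.
    """
    per_asym_unit = Counter(struct_asym_entities)
    return {entity: n * n_equiv_positions
            for entity, n in per_asym_unit.items()}


def z_pdb(struct_asym_entities, n_equiv_positions, polymer_entities):
    """PDB-style Z: the count of the most populous polymeric chain."""
    counts = entity_counts_per_cell(struct_asym_entities, n_equiv_positions)
    polymer_counts = [n for entity, n in counts.items()
                      if entity in polymer_entities]
    return max(polymer_counts) if polymer_counts else 0


# The "2 apples and 3 oranges and 6 pears" asymmetric unit from the
# message, with an assumed 4 equivalent positions in the cell.
asym = ["apple"] * 2 + ["orange"] * 3 + ["pear"] * 6
per_cell = entity_counts_per_cell(asym, 4)
pdb_value = z_pdb(asym, 4, {"apple", "orange", "pear"})
```

With these inputs the per-entity numbers are 8, 12 and 24, while the
PDB-style value collapses them to the single number 24, which is the
loss of information the draft definition complains about.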