A number of letters have requested an expansion of the structure description hierarchy in mmCIF. This letter is a long description of the issues surrounding the implementation of these additional levels.

The current mmCIF description of the contents of the asymmetric unit of a macromolecular crystal can be written in the form

    chain/residue/atom.variant

mmCIF uses different terminology (asym, seq_id, atom_id, and ???), but since I find the mmCIF phrasing very confusing I will use a more URL-like syntax for this discussion. (I am not proposing that mmCIF change.)

Putting together suggestions from Frances and Herbert Bernstein along with comments from others, a fuller description of the contents of an mmCIF file would be

    model//chain.variant/residue.variant/atom.variant

I think this form would fill all the needs expressed in the newsgroup so far.

The "model//" level would allow several independent models to be placed in a single mmCIF file. In its simplest use it could label all the individual NMR models proposed for a molecule. The chain variants allow for models which have been refined with multiple copies of each chain in an attempt to model disorder. The difference between chain variants and models is that all the chain variants are presumed to coexist in time or space, whereas different models are simply different ways of interpreting the observations. Residue variants are used to model sequence heterogeneity. Atom variants are used to model discretely disordered atoms. The latter are reasonably described by the current version of mmCIF.

Model Issues

I can see the case for placing a whole series of NMR models in one mmCIF file. It would be quite convenient for the mmCIF reader to understand the organization of the models and their relationship to one another. It is also very useful as a program data structure to have several models described at one time.
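
As an illustration of the URL-like form used in this letter, here is a minimal Python sketch of how a program might hold one fully qualified atom label. It is only a sketch under stated assumptions: the names Level, AtomPath, parse_level and parse_atom_path are hypothetical, the "model//" prefix and every ".variant" suffix are treated as optional, and names are assumed to contain neither "." nor "/". None of this is mmCIF syntax or TNT code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Level:
    """One level of the path: a name plus an optional variant label."""
    name: str
    variant: Optional[str] = None  # None means the default (unlabeled) variant


@dataclass(frozen=True)
class AtomPath:
    model: Optional[str]  # None when the path has no "model//" prefix
    chain: Level
    residue: Level
    atom: Level


def parse_level(text: str) -> Level:
    """Split "name.variant" into its parts; a bare "name" has no variant."""
    name, dot, variant = text.partition(".")
    if not name or (dot and not variant):
        raise ValueError(f"malformed level: {text!r}")
    return Level(name, variant if dot else None)


def parse_atom_path(path: str) -> AtomPath:
    """Parse "model//chain.variant/residue.variant/atom.variant"."""
    model, sep, rest = path.partition("//")
    if not sep:  # no model level was given (assumed optional)
        model, rest = None, path
    elif not model:
        raise ValueError(f"empty model label: {path!r}")
    parts = rest.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected chain/residue/atom, got: {rest!r}")
    chain, residue, atom = (parse_level(p) for p in parts)
    return AtomPath(model, chain, residue, atom)
```

For example, parse_atom_path("3//A1.2/123.1/CG.1") would describe atom CG, variant 1, of residue 123.1 in chain copy A1.2 of model 3, while parse_atom_path("A1/123/CG") leaves the model and every variant at its default.
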
In fact I intend to implement this level in TNT's data structures when time presents itself. However, the number of problems posed by this level of description for mmCIF rapidly multiplies.

Even for NMR models the problems begin. The first problem is that the statistics for the agreement of the model and the observations must be "loop_"ed. I don't know how NMR people measure model quality, but the equivalent for X-ray people would be to loop the R value and all stereochemistry agreement statistics. Since model quality will certainly be required data for any deposition, this elaboration will be unavoidable.

It rapidly becomes worse. Take the case of someone comparing two refinement protocols. It would be most convenient to store both models in a single file. However, now the refinement parameters and refinement history would have to be looped over the models. In addition, the target stereochemistry libraries must be looped because one refinement might be with EREF and another with PROLSQ.

Now consider a study where an X-ray data set and a neutron data set are available. It is reasonable to presume that one model could be refined against the X-ray data alone, another against the neutron data, and a third against both. Very interesting things could be learned by comparing the three models. Now the data set table must be looped. (I know the current mmCIF does not include back links to the structure factor files, but it should, and I am still lobbying.)

As you can see, one can construct very reasonable pairings of models which would result in the duplication of practically any table in mmCIF. The extreme would be to try to place the models for T4 lysozyme and thermolysin in one mmCIF file because they were both solved in the same lab. In addition, the pairings would depend on the interests of the authors of the mmCIF file. Others might want to pair the same models in different ways.

If the mmCIF committee decides to implement a "model" level in the mmCIF definition, I suggest they place strict limits on its scope. The models should be determined using the same analysis procedure on the same data. The structure of the models must be identical -- there cannot be one sequence in one model and a different one in another. All fields describing the agreement of a model with the data must be loop-able. I think these are the minimal changes required to allow NMR models and parallel SA runs to be stored in a single mmCIF file. Trying to go beyond this would be incredibly difficult.

In addition, I suggest that you allow the possibility of defining, in the group describing research articles, a table of other mmCIF data blocks (in other files) used by that paper. With this tool the authors could cross-connect the different mmCIF files referenced in a paper, allowing the related models to be retrieved without appending them all in a single file. This scheme would allow different papers to refer to different sets of mmCIF data blocks.

Chain Issues

The chain variants are introduced to allow for the description of models where several copies of each chain are included in the refinement in an attempt to model discrete disorder and anisotropic B-factors. For example, you could have a hemoglobin molecule in the asymmetric unit, resulting in two alpha chains (A1 and A2) and two beta chains (B1 and B2). If a model was refined with 8 copies of each chain, the model would contain A1.1, A1.2, ... B1.1, ... and B2.8.

This addition is fairly straightforward, but there are a few problems to watch.

First, it should be a requirement that all the chain variants be of exactly the same type. A1.1 should be the same "entity" as A1.3. In fact the loop defining the entities for each chain (asym) should not mention the variants at all.

Second, it should be possible to ignore this level without having to mark something in the mmCIF file.
Most models will not use this level of description -- their mmCIF files should not be cluttered up by being forced to state that all atoms in chain A1 are in variant 1 if there is no other. I think even including the "." in one column is too confusing. I do not know the details of mmCIF well enough to know how to construct the default variant.

Third, if an mmCIF file uses the default variant then the use of a nondefault variant for that same chain should be illegal. For example, if there is a chain A1 (no variant indicator) there cannot also be an A1.1.

Residue Issues

The residue variant field is suggested to handle the problem of heterogeneity of peptide sequence. If residue 123 is sometimes a SER and other times an ARG, one would define a 123.1 which is identified as SER and a 123.2 which is identified as ARG. It would be more chemically proper to define two chains A1 and A2 which are different entities but are constrained to have identical parameters everywhere except for occupancies and residue 123. While this is a more complete description of the model, the complexity of the constraint makes it impractical.

If my previous suggestion for the explicit definition of connectivity of residues in entities is adopted, describing the sequence of these things (along with a great number of other things) becomes quite easy. In the table of residue types there would simply be an entry for 123.1 and one for 123.2. In the connectivity table, links would be defined between 122 and 123.1, 122 and 123.2, 123.1 and 124, and 123.2 and 124.

The first caveat (not in the PDB sense I hope) is that once again we need a default for the residue variant so that this field does not lead to confusion for the vast majority of models which have no need of this construction. If one version of a residue is declared to be a variant then the default cannot be used for that residue in another place. You should not allow both 123 and 123.1 in the same chain. They must be 123.1 and 123.2.
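
The rule against mixing a default with a labeled variant has the same shape for chains (A1 versus A1.1) and for residues (123 versus 123.1), so a single check could serve both. The following is a minimal Python sketch, assuming the labels have already been split into (name, variant) pairs with None standing for the default variant. The function name find_mixed_variants is hypothetical and is not part of mmCIF or of any existing program.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple

# Each label is (name, variant); variant None stands for the default variant.
Label = Tuple[str, Optional[str]]


def find_mixed_variants(labels: Iterable[Label]) -> List[str]:
    """Return the names that appear both with and without a variant label."""
    seen = defaultdict(set)
    for name, variant in labels:
        seen[name].add(variant)
    # A name is in error when its variant set holds None plus something else.
    return sorted(
        name for name, variants in seen.items()
        if None in variants and len(variants) > 1
    )
```

Given the chain labels A1, A1.1, A2.1 and A2.2, written as [("A1", None), ("A1", "1"), ("A2", "1"), ("A2", "2")], the function reports only A1. For residues the check would have to be run separately within each chain, since the same residue number may legitimately appear in several chains.
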

In addition, it should be recognized that residue variant 123.1 goes with variant 124.1. This definition is parallel to that of the atom variant case already implemented.

My final point is a matter of usage. My example above was that 123.1 is SER and 123.2 is ARG. Suppose that the ARG side chain makes a salt bridge to 224 (GLU) and that 224 assumes one conformation for 123.1 (SER) and another for 123.2 (ARG). The temptation is to model 224 (GLU) with discrete disorder using the atom variants. This would not be a proper description of the model. The proper description is to define a 224.1 (GLU) and a 224.2 (GLU). Provision must be made in the definition of heterogeneous sequences for such cases where both "variants" have the same residue type.

Atom Issues

Discrete disorder is modeled quite well in the current mmCIF. However, I would prefer that there be a way to ignore the atom variant column in mmCIF files which contain no discretely disordered components. If possible, there should be a check that one does not define both a CG and a CG.1. If there is a disordered atom then all copies of the atom must be labeled disordered.

Dale Tronrud