(21) Draft Dictionaries: matters arising
- To: COMCIFS@uk.ac.iucr
- Subject: (21) Draft Dictionaries: matters arising
- From: bm@uk.ac.iucr (Brian McMahon)
- Date: Thu, 31 Mar 94 16:25:59 BST
Dear Colleagues

D21.1 Distribution of draft dictionaries
-----------------------------------------

Thank you for your votes to circulate the draft revised Core and mmCIF
dictionaries. These are now available for transfer from the Chester office
as follows:

   by e-mail - send to sendcif@iucr.ac.uk a message of the form
                  send cifdic.C94
                  send cifdic.M94
   by ftp    - anonymous ftp to ftp.iucr.ac.uk (192.70.242.1),
               directory pub OR directory cifdics
   by gopher - gopher.iucr.ac.uk, port 70
   by WWW    - URL http://www.iucr.ac.uk/welcome.html

The machine names are all synonyms for diamond.iucr.ac.uk, and are
interchangeable.

The dictionaries bear the following statements at the head of each file.
For the Core:

##############################################################################
# NOTE: This version of the CIF Core Dictionary is a draft of the            #
#       revised version being considered for approval by the IUCr            #
#       Committee for the Maintenance of the CIF Standard. It is             #
#       hereby released to the general community for comment and             #
#       criticism. Production software should not rely on the new            #
#       features within this draft being available in the future.            #
#                                                                            #
#       Please direct any comments on the definitions in this version        #
#       to the Coordinating Secretary of COMCIFS, Brian McMahon, at          #
#       IUCr, 5 Abbey Square, Chester CH1 2HU, England (bm@iucr.ac.uk).      #
##############################################################################

and for the mm file:

##############################################################################
# NOTE: This version of the mmCIF Dictionary is a draft of the               #
#       version being considered for approval by the IUCr Committee          #
#       for the Maintenance of the CIF Standard. It is hereby released       #
#       to the general community for comment and criticism. Production       #
#       software should not rely on all the features within this draft       #
#       being available. In particular, the Dictionary Definition            #
#       Language (DDL) employed to state the formal elements of the          #
#       definitions will change in accordance with the final published       #
#       DDL description (A. P. F. Cook & S. R. Hall, manuscript              #
#       submitted for publication in J. Chem. Inform. Comput. Sci. 1994).    #
#                                                                            #
#       Please direct any comments on the definitions in this version        #
#       to Paula Fitzgerald, Merck Research Laboratories, PO Box 2000,       #
#       Ry50-105, Rahway, NJ 07065, USA (paula_fitzgerald@merck.com), or     #
#       to the Coordinating Secretary of COMCIFS, Brian McMahon, at          #
#       IUCr, 5 Abbey Square, Chester CH1 2HU, England (bm@iucr.ac.uk).      #
##############################################################################

Please feel free to distribute these through other relevant ftp sites; but
please ensure that the preliminary nature of these documents is made known
to any prospective taker.

D21.2 Differences between released drafts and versions circulated to you
-------------------------------------------------------------------------

With the exception of the header, the Core dictionary is unchanged from the
version I sent on 2 March, though David has made some valuable suggestions
for modification which should be implemented in due course. I have been too
pressed for time to make these changes as yet; and since I shall be on
holiday next week I deemed it appropriate to let the current versions take
wing. But here are David's comments for the record, and the action that
will be taken:

D> I worked through the whole core, including the stuff already
D> approved and have a number of corrections to suggest. These are of two
D> kinds, editorial and substantive. I will send details of the editorial
D> corrections later, but I assume that, while the data names and
D> definitions are fixed, the way in which we describe the definition can be
D> improved and clarified. I further believe that the dictionaries should
D> contain good clear English, so infelicities should be removed.
D> Firstly to mention some general points under editorial corrections.
D> * I am convinced that 'and/or' is just a lazy way of saying either 'and'
D> or else 'or' and that in almost all contexts only one of these
D> conjunctions actually is correct. I would like to see this idiom removed.

Will check on this, though sometimes it is appropriate, and less turgid
than "and or or".

D> * 'diffraction data' means 'intensities' to someone working with single
D> crystals but can mean 'd values' to someone working with powders. We
D> should be quite explicit everywhere this or a similar phrase occurs. The
D> common expression 'collecting data' suggests images of pollsters knocking
D> on people's doors asking them how they intend to vote.
D> * Are the escape characters for greek letters etc. available in the
D> cifdics or are they part of the cif definition? The dictionaries could
D> be printed in more elegant form if the escape characters were switched on.

They are a convention employed within CIF, and I have permitted their use
in the existing Core Dictionary. I think I prefer to see them remain as an
accepted convention (not a Commandment), but encourage their use in the
dictionaries.

D> Now to more substantive matters.
D> * Should _atom_sites_cartn_tran_matrix and *_fract_tran_matrix be
D> designated as related items?
D> * _atom_type_oxidation_number: The enumeration range should be increased
D> to at least 8 (as in OsO4)
D> * _atom_type_radius: There is no particular reason to put an upper limit
D> on the enumeration range. 4.0 A may be adequate for almost all purposes,
D> but one day someone will want to choose a value of 4.1 A

This last point is developed at greater length below (point D21.3).

D> * In the category 'chemical_formula' there are a couple of references to
D> 'appendix' that need to be changed to [].
D> * There seems to be no provision for _citation_journal_coden_CSD.
D> Have they abandoned their codens in favour of some other system?
D> Curiously they are available as _database_journal_csd. The ASTM coden
D> now can appear either as _citation_journal_coden_astm or
D> _database_journal_astm. Is there a difference in these two fields? If
D> there is, should the _database category be expanded to cover all the
D> other possible codens?
D> * _diffrn_ambient_environment: The default for this field (not specified
D> as a default but given in the definition) should be 'air' not 'air or
D> vacuum'. These environments are different and can result
D> in differences in the sample.
D> * _diffrn_refln_detect_slit: The role of horizontal and vertical will be
D> reversed if the diffractometer is turned on its side as is the case with
D> some kinds of diffractometer (particularly powder diffractometers). These
D> definitions would better be described as 'in the diffraction plane' and
D> 'perpendicular to the diffraction plane' respectively.
D> * We should add an item: _publ_section_exptl_solution to describe how the
D> structure was solved (e.g. direct methods, Patterson function).

Seems reasonable. Syd, do you agree with this (it's mostly relevant to
Acta)?

D> * In _geom_ we have a number to indicate the symmetry operation applied
D> to an atom. There is a real source of ambiguity here. If
D> _sym_equiv_posn_as_xyz is given, this number depends on the order in
D> which these terms are given which is something that cannot be assumed in
D> star files. We need a _sym_equiv_posn_id field to label these symmetry
D> operators. We have a more serious problem if the file relies on software
D> to generate the symmetry operators from the space group symbol since
D> different programs will generate the operators in a different order.
D> Maybe we need to require that if _geom_ is used, the *_as_xyz loop must
D> also be given.

Paula and I have discussed this before, and Phil has also been agitating
to have it implemented.
I have been reticent about this, because it introduces a data name which
has no physical meaning other than as a pointer into a list; and if the
list is consistently generated, the value of the pointer is identical to
the position, a value that can be obtained by counting (not a high
overhead, as the file is parsed sequentially anyhow). I accept that there
is some merit in guarding against shuffling of the order, though I haven't
previously been won over to the argument. Are there any comments on this
from other members? Paula has already given her response in these terms:

P> Aha - So David spotted this as well. I know we have talked about this
P> before, and decided to punt for the time, but I really think we can't
P> avoid dealing with identifying the operations explicitly.

Paula has been much more assiduous than I in following up David's
suggestions for modification to the mm dictionary, though she has also left
some matters in abeyance for the present. Again for the record, here is the
relevant dialogue:

D> * The following sections could be transferred to core:
D>       all _diffrn_measure_ items
D>       all _diffrn_rad_ items
D>       all _reflns_data_ items
D>       all items in the category 'struct' (but not the other struct_
D>       categories)

P> *** Fine by me, but not something I feel strongly about either way.

This was another issue we discussed in Chester at the beginning of the
year. The decision to leave _diffrn_measurement_ items in the Core but
_diffrn_measure_ in mmCIF had more to do with the argument that
macromolecular folk are more interested in storing the details of the
experiment in a finer-grained databank than the small-molecule guys, than
any conviction that the quantities described were specific to mm studies.
We'll shift these unless anyone else cares to put up a strong case for
maintaining the status quo.

D> * In the _entity_mon_angle_ example, the item name should read
D>       _entity_mon_angle_value_angle instead of
D>       _entity_mon_angle_value
D> * In the _entity_npol_bond category the data name:
D>       _entity_npol_bond_value should read
D>       _entity_npol_bond_value_dist (see the example given for the
D>       category)
D> * In the _entity_npol_tor_value category most of the data names
D>   have 'nonp' substituted for 'npol'. Maybe this is deliberate
D>   but the section should at least be self-consistent.

P> Fixed.

D> * In _phasing_MIR_der_shell_for and _phasing_MIR_shell_for
D>   the figure of merit should be defined, i.e. an expression should
D>   be included to say how it is calculated.

P> Right, but I haven't dealt with this yet.

D> * In _phasing_MIR_site_ there seems to be no definition of the
D>   nature of the heavy atom, i.e. what its chemical species is or
D>   how many electrons it contains. This is a curious omission
D>   since the occupancy is given, and without knowing how many
D>   electrons the heavy atom contained, the scattering from a
D>   given site cannot be determined.

P> Again correct, but also not yet dealt with.

D> * The data name _refine_ls_restr_model is not particularly clear.
D>   I would suggest _refine_ls_restr_dev_model as being more
D>   transparent. The wording of this definition is also rather
D>   opaque. I would suggest:
D>       'For the given parameter type, the rms deviation between
D>       the ideal values and the values obtained by least squares
D>       refinement of the model'
D>   The wording of the definition of _refine_ls_restr_target
D>   should be revised also to read:
D>       'For the given parameter type, the target rms deviation
D>       between the ideal values and the values obtained by least
D>       squares refinement of the model'
D>   Likewise the name should be _refine_ls_restr_dev_target

P> Also deferred.

D> * In _refine_ls_restr_type enumeration list, should not:
D>       'p_xhangle_d' be 'p_xhangle_a'?
D>   Or alternatively should not its definition be
D>       'x-h bond angle expressed as a distance'?

P> I implemented the alternative suggestion.

D> * In the _struct_asym category, the definition is not clear. In
D>   any crystals I have dealt with there is only one asymmetric
D>   unit. This must be true for all crystals. Maybe the mm
D>   people use asymmetric unit in a different sense, but if so,
D>   this should be made clear by saying, e.g., 'that form the
D>   asymmetric unit of each entity'. Note also that 'comprise' is
D>   used incorrectly - the whole comprises various parts, not the
D>   other way around. This is also found in the definition of the
D>   _struct_biol_ category.

P> Reworded in both cases

D> * _struct_biol_gen_symmetry needs an improved definition starting
D>   with:
D>       'Describes the symmetry required to generate the
D>       component from ?? (what does it act on? This is not
D>       clear). The symmetry code comprises the symmetry
D>       operation "n" and the cell translation ...'
D>   The default should be specified as 1_555
D>   The same changes apply to _struct_conn_symmetry_* and
D>   _structure_site_gen_symmetry

P> Added the appropriate extension to the definition in all three places.
P> Added the enumeration default in all three places
P>
P> In the course of doing this, I noted that we still had an inconsistent
P> construction for _struct_conn_role_ptnr* and _struct_conn_symmetry_ptnr*
P> compared to the rest of the data items in that category. I changed these
P> two items to four so that the names were consistent.

D> * _struct_conn_conn_type_id has a list_link_child with the same
D>   name. This seems a little unusual if not meaningless. I do
D>   not know what is intended.

P> Fixed this - the list_link_child didn't belong there and was moved to
P> where it did belong.

D> * A promised note on matrices.
D>
D> Vectors X and Y are related by matrix M as follows:
D>
D>      Y = MX
D>
D> where Y and X are column vectors, i.e. written as:
D>
D>      X = | x1 |
D>          | x2 |
D>          | x3 |  etc.
D>
D> and M is a matrix written:
D>
D>      M = | m11 m12 m13 ... m1n |
D>          | m21 m22 m23 ... m2n |
D>          | ................... |
D>          | ml1 ml2 ml3 ... mln |
D>
D> and y1 = m11.x1 + m12.x2 + m13.x3 + ... + m1n.xn
D>     etc.
D>
D> Y is the result obtained by multiplying the vector X by the matrix
D> M. I suggest that we express all the vectors as column vectors,
D> with the product (Y) on the left hand side and the matrix and the
D> vector being transformed (X) on the right hand side. For example
D> the two sides of the matrix equation in
D> _atom_sites_cart_trans_matrix should be interchanged. In other
D> cases row vectors should be changed to column vectors.

P> I'm all for consistency. I checked, and there is only one matrix left in
P> the mm dictionary. I haven't changed it yet, but I will if you decide to
P> go ahead and impose this throughout the core.

This change should be implemented throughout for consistency. Thanks,
David, for going to the trouble of spelling this out.

D21.3 Enumeration ranges
-------------------------

Some time ago, Phil sent some of us a list of criticisms and comments from
the EMBL group who are interested in mmCIF development. One of their
requirements is for tighter validation against the CIF dictionaries, so
that numeric values falling outside _enumeration_range values would
invalidate the file (or at least the entry). Paula and I gave this some
thought, and we came up with four categories of possible enumeration
ranges:

(1) There exists a finite set of possible integer values. So
    _symmetry_Int_Tables_number could have an enumeration range 1:230.
    This represents a 'hard' range of validity.

(2) There is a well-bounded set of values, often by convention. Hence,
    _cell_measurement_theta_ values may be given in the range 0.0:90.0.
    With angles in particular, there are often alternative but overlapping
    conventions, so that the EMBL group suggested validation ranges of
    -180.0:360.0 for quantities such as _diffrn_refln_angle_, presumably
    because these may either be quoted in the range -180->+180, or 0->360.

(3) There is a definite limit imposed by the laws of physics, but which
    may be exceeded in practice because of round-off errors. Thus
    _exptl_absorpt_correction_T_ items have an enumeration range 0.0:1.0,
    but it may be legitimate to have a value slightly in excess of unity.
    Paula and I concluded that it was best to leave the enumeration range
    as stated, but validation software should be prepared to accept values
    outside the range within some epsilon value (where epsilon might be
    left to the application to set). Question: should this property be
    indicated within the DDL? Or is responsibility for this to be left
    completely to the application writers?

(4) There is no physical limit on the values that may be found, though in
    practice there is a range outside which wide variations are never
    expected. This is the case with David's _atom_type_radius_ range of
    0.0:4.0. One may argue that no upper limit should be given (because
    there is no specific physical limit); or one might argue that some
    arbitrary upper limit is nonetheless useful, since a value of 27.3 is
    certainly (or at least almost certainly) wrong. What are your views on
    this? And if we adopt the principle that a guideline upper limit
    should be set, how should one decide the value? "Think of a number and
    double it" is what is involved, but the arbitrariness is somewhat
    disconcerting.

D21.4 Occupancies
------------------

D> When Frank and I put together the original core dictionary we
D> spent some time worrying about how to represent sites that were occupied
D> by more than one element.
D> The solution that we adopted is that used by the ICSD and is not very
D> elegant, namely we repeated the coordinates for the same site but with a
D> different atom and the appropriate occupation number.
D>
D> The issue is now raising its head again in BR1070, a paper on the
D> mineral tourmaline by Frank Hawthorne. Typically in minerals several
D> elements can occupy the same site, in this case one site is occupied by 4
D> different atoms Li 0.40, Al 0.55, Mn 0.012 and Cu 0.032. Representing
D> this with four different atom records for this site is at best cumbersome
D> and requires a program to check each set of coordinates to see if they
D> are identical.
D>
D> The approach taken in the MM dictionary, and indeed the approach
D> adopted by mineralogists (including Hawthorne), is to use a separate list
D> with the parent-child link. What I would like to propose is a category
D> that might be entitled _atom_site_occupancy that would have an
D> _atom_site_occupancy_label that matches the _atom_site_label and lists
D> the contents of the site, something along the lines of:
D>
D>      loop_
D>      _atom_site_occupancy_label
D>      _atom_site_occupancy_element
D>      _atom_site_occupancy_value
D>       x   Na   0.541
D>       x   Ca   0.048
D>       y   Li   0.40
D>       y   Al   0.55
D>       y   Mn   0.012
D>       y   Cu   0.032
D>       z   Al   1.000
D>
D> This would provide a much more elegant way of handling this problem with
D> the advantage of clearly separating out the properties of the site from
D> the properties of the atoms that occupy this site.
D>
D> Even better might be to define a category 'element' in which any
D> elemental properties (e.g. scattering factors, oxidation state) could be
D> given. It is still useful to retain the present system for sites that
D> have a single element occupying them. The new system would be an
D> alternative and some hierarchy would be needed to determine which took
D> precedence if both were given.
D> (Or is this built into the STAR defaults - the last encountered
D> information is used - or perhaps the first?)
D>
D> If you think this a good thing, maybe we could incorporate the
D> idea into the present round of core updates.

This sounds fine to me - any dissenting voices or additional comments?
Neither CIF nor STAR currently has any provision for discriminating between
items of data that apparently conflict by assigning a precedence order.
Should there be?

D21.5 Non-crystallographic symmetry
------------------------------------

Here is a comment a little while back from Phil, which I am sure Paula is
looking at. Mostly this type of debate will be with the relevant dictionary
maintainer direct, but this one again I feel to be of sufficient interest
to place on record (and it shows Phil that I at least read his messages!).

PEB> I have some concerns, which were initially brought to my attention by
PEB> Alexei Vagin and Jean Richelle in Shoshana Wodak's group. I am still
PEB> thinking about a number of these but the first relates to non-
PEB> crystallographic symmetry. As far as I can see from the cifdic.m94 of
PEB> 2/25 there is no explicit way to declare this. Beyond the specification
PEB> of the non-crystallographic symmetry elements there needs to be an
PEB> indication of what _struct_asym_id's are related by what non-cryst.
PEB> symmetry elements. I can't remember, but I assume the policy was to
PEB> provide all the coordinates of non-cryst. related parts of the structure
PEB> and not to provide a partial list and generate the full list by
PEB> the application of non-cryst. symm elements. That would not seem
PEB> to be explained in what constitutes an _atom_site_ category even though
PEB> if you follow the definitions of the _struct_asym_ category and the
PEB> _entity category it must be so, but I suspect it will lead to confusion
PEB> for those without the benefit of years with mmCIF.
PEB>
PEB> My concern then is two-fold (i) without tighter definitions folks will
PEB> prepare incomplete entries which only include a partial list of atomic
PEB> coordinates and a list of non-cryst. symmetry operators that cannot be
PEB> parsed and hence used by a computer program (ii) non-cryst symmetry is
PEB> a valuable piece of structure information that cannot be used by a
PEB> program.

D21.6 Illegal ways of solving a structure?
-------------------------------------------

I have just had the privilege and pleasure of attending one of George
Sheldrick's workshops to support his new refinement program. One item that
surfaced there was the problem of describing unusual methods of solving a
structure. For instance, _atom_sites_solution_ items have an enumeration
list that does not include 'Patterson' (or provision for description of
modified Patterson methods). There are two problems here - one is the
addition of specific terms to the enumeration list, which should cause us
no difficulty; but the other is the need to allow some indication that the
structure was solved by some 'other' method; for as matters stand, one is
"not allowed" to solve a structure other than by difmap, vecmap, heavy,
direct, geom, disper or isomor methods!!

D21.7 New dictionary structures with _include_file
---------------------------------------------------

Having introduced the _include_file term at popular request, Syd has now
extended it (logically) to cover the structure of STAR dictionaries, as he
outlines below in this recent e-mail:

S> ...let you see the latest constructions of these dictionaries which I
S> would like to see adopted across the board. As you are aware the DDL and
S> MIF definitions do not currently fall under the umbrella of COMCIFS, but
S> they probably will in the future as they are obviously closely connected
S> via the DDL. The use of _include_file in the DDL can be used to "connect"
S> a dictionary to dependent definitions.
S> This hierarchy of definitions is clearly important to all of us. Peter MR
S> and I have been discussing various aspects of this wrt DDL, and here is
S> an excerpt from an email to him about this.
S> ----------------------------------------
S> Dictionaries. This hierarchy list I drew up for you but the concept has
S> been discussed mainly with Brian. The naming of dictionaries is a bit
S> tricky as they are imbedded in the literature (e.g. "cifdic.c91"). But
S> I agree with you, the increasing number of dictionaries does require a
S> broader class of dictionary names. Please note that dictionaries are
S> STAR files not CIF's ... they are an application of the star file
S> format -- the "dic" application. As you know I am very keen on generic
S> references to files because this obviates the problem of having to
S> continually update general references -- such as are needed for
S> dictionaries.
S>
S> My construction for dictionary file names will be:
S>
S>      <dictionary type>_<dictionary class>.dic
S>
S> So for the DDL core dictionary the generic name is "ddl_core.dic". I
S> know that Brian will be keen to place this in the dictionary as
S> <ddl_core.dic> to emphasise its generic nature but this is a unixism.
S>
S> -------------------------------------------
S>
S>      ddl_ext.dic       [ddl extension file(s)]
S>      ddl_core.dic      [ddl core definitions]
S>      star_core.dic     [primitive definitions common to all star
S>                         dictionaries]
S>
S> -------- additional dictionaries are application specific (e.g. CIFmm)
S>
S>      cif_core_ext.dic  [cif extensions to core definitions]
S>      cif_core.dic      [cif core definitions]
S>      cif_mm_ext.dic    [cif extensions to macromolecular definitions]
S>      cif_mm.dic        [cif macromolecular definitions]
S>
S> So a dictionary would always contain at the front an _include_file
S> which inserts the dictionary file higher in the tree.
S>
S> For example...
S> ### MACROMOLECULAR CIF DICTIONARY
S>
S> data_include_dependent_dictionaries
S>     _include_file             cif_mm_ext.dic
S>
S> data_in_this_dictionary
S>     _dictionary_name          cif_mm.dic
S>     ..
S>     etc. etc.
S>
S> and the included file looks like this....
S> ### MACROMOLECULAR CIF EXTENSION DICTIONARY
S>
S> data_include_dependent_dictionaries
S>     _include_file             cif_core.dic
S>
S> data_in_this_dictionary
S>     _dictionary_name          cif_mm_ext.dic
S>     ..
S>     etc. etc.
S>
S> and the included file looks like this....
S> ### CORE CIF DICTIONARY
S>
S> data_include_dependent_dictionaries
S>     _include_file             cif_core_ext.dic
S>
S> data_in_this_dictionary
S>     _dictionary_name          cif_core.dic
S>     ..
S>     etc. etc.
S>
S> and the included file looks like this....
S> ### CORE CIF EXTENSION DICTIONARY
S>
S> data_include_dependent_dictionaries
S>     _include_file             star_core.dic
S>
S> data_in_this_dictionary
S>     _dictionary_name          cif_core_ext.dic
S>     ..
S>     etc. etc.
S>
S> and the included file looks like this....
S> ### PRIMITIVE STAR DICTIONARY
S>
S> data_include_dependent_dictionaries
S>     _include_file             ddl_core.dic
S>
S> data_in_this_dictionary
S>     _dictionary_name          star_core.dic
S>     ..
S>     etc. etc.
S>
S> and the included file looks like this....
S>
S> OK, you get the idea. This way you really do not have to be even aware
S> of the hierarchy for your application -- just the dictionary above you
S> in the tree. But of course the parsing software must be able to handle
S> nested inclusions!

The fact that the core dictionaries have an _include_file value that points
to extensions seems at first counter-intuitive, but is explained by the
fact that the core dictionaries will come first (so will initially not have
this '_include_file' line); but as extensions are written they will become
known to the relevant core file. This doesn't seem to me to be a
controversial development, though it will require dictionary parsing and
validation software to be able to follow the '_include_file' pointers.
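
As a rough illustration of what following the '_include_file' pointers
might involve, here is a minimal Python sketch. It assumes that each
dictionary names at most one file, on a line of the form
"_include_file <name>", and that the files can be read through a simple
lookup. The function names included_file and include_chain and the
in-memory FILES table are hypothetical, invented for this example; they are
not part of the DDL or of any existing CIF software, and real STAR parsing
is considerably more involved.

```python
# Sketch only: follow nested _include_file pointers between dictionaries.
# Assumes one "_include_file <name>" line per dictionary at most.

def included_file(text):
    """Return the file named by the first _include_file item, or None."""
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] == "_include_file":
            return tokens[1]
    return None


def include_chain(name, read):
    """List dictionary names starting at `name`, following each include.

    `read` is a hypothetical callable mapping a file name to its text.
    Raises ValueError if the pointers loop back on themselves.
    """
    chain = []
    while name is not None:
        if name in chain:
            raise ValueError("circular _include_file reference: " + name)
        chain.append(name)
        name = included_file(read(name))
    return chain


# Hypothetical in-memory stand-ins for the files in Syd's example.
FILES = {
    "cif_mm.dic": "_include_file cif_mm_ext.dic\n_dictionary_name cif_mm.dic\n",
    "cif_mm_ext.dic": "_include_file cif_core.dic\n_dictionary_name cif_mm_ext.dic\n",
    "cif_core.dic": "_include_file cif_core_ext.dic\n_dictionary_name cif_core.dic\n",
    "cif_core_ext.dic": "_include_file star_core.dic\n_dictionary_name cif_core_ext.dic\n",
    "star_core.dic": "_include_file ddl_core.dic\n_dictionary_name star_core.dic\n",
    "ddl_core.dic": "_dictionary_name ddl_core.dic\n",
}

chain = include_chain("cif_mm.dic", FILES.__getitem__)
print(" -> ".join(chain))
```

The loop stops at the first dictionary with no '_include_file' line, and
the membership check guards against two dictionaries that include each
other, which the scheme as described does not otherwise rule out.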

Syd is anxious to formalise this for his MIF and general STAR applications,
and so is seeking a star_core.dic that will include certain items that are
already in the CIF core dictionary. These are, in brief, the _audit_,
_citation_, _journal_ and _publ_ categories, none of which is specific to
crystallography (though some of the particular items, say
_citation_Medline_AN, are discipline-specific and might well remain in the
CIF dictionary). Again, time didn't permit me to check this through
thoroughly and implement it for this release of the draft dictionaries, but
I am in favour of making this structural rearrangement in principle. Are
there any problems you can foresee with (a) this re-stratification of
dictionary dependencies and (b) the modified naming scheme Syd suggests?

-----------------

Now I am heading off for a week in the country, so will be offline until
April 11.

Best wishes to all,

Brian