(34) _units simplification; history of DDL
- To: COMCIFS@iucr.ac.uk
- Subject: (34) _units simplification; history of DDL
- From: bm
- Date: Thu, 11 May 1995 14:56:18 +0100
Dear Colleagues

It occurs to me that I haven't posted news of another move: Phil Bourne has been at the San Diego Supercomputing Center for a few months now - his current e-mail address is bourne@sdsc.edu. Sorry not to have mentioned it before, Phil!

Call for Agreement
==================

D34.1 Simplification of units descriptions in DDL1.4
-----------------------------------------------------

I call for agreement from all full members of COMCIFS on the following proposal. Consultants are encouraged to give an 'aye' or 'nay' if they see this decision affecting their own work and interests.

I mentioned briefly in D33.5 that Syd was proposing to drop the _units_extension, _units_description and _units_conversion attributes from the published DDL1.4 specification, and to replace them with _units and _units_detail. Some of us have already been involved in correspondence over this, and I include a few extracts from that correspondence below.

The DDL1.4 specification is currently being revised post-review, so this is the last opportunity to modify it before publication. You will see from my later article on the development of the DDL that there is precedent for dropping unpublished DDL attributes. We would need to promote the derivative data names (such as _cell_length_a_pm) to "full" dataname status to ensure compatibility with existing uses of the _units_extension idea, but I see no fundamental problems.

S> I am in the final throes of returning an expanded DDL specification paper to
S> JCICS. In doing this expansion I pondered on the "_units_" attributes
S> again. They are an increasing source of irritation. This is the last
S> opportunity to expunge them painlessly from the record.
S>
S> I raise the possibility with you of replacing the attributes
S> _units_extension, _units_description and _units_conversion with two new
S> ones _units and _units_detail. The first will be a parsible code as
S> in ..._extension and second the description of the units.
S>
S> This new usage implies that each data item will have a fixed units, which I
S> think will make most people happy.

BM> Syd: In a sense it would be only a "small" job to fix the archived CIFs -
BM> the whole point of CIF is that you have well defined tags which are trivial
BM> to find (and thus amend). There might be a lot of them, and it might take
BM> many CPU cycles, but our machines are patient and forbearing enough...
BM>
BM> But I don't see that they need to be fixed. Just add _cell_length_a_pm etc
BM> to the existing core (as extra, full, datanames with new definitions
BM> specifying the units - or, rather, with new _units). This preserves the
BM> integrity of any existing files (outside of the archive) which conform to
BM> core version 1991. Rather a lot of clutter, it's true. But at least in DDL2,
BM> these old items could all be put into a supercategory (what does John call
BM> it - _category_group ?) called something like 'compatibility' that will tend
BM> to keep them out of the way.

PMR> I have no problems in principle with simplifying the _units_. There is a
PMR> general problem which I think the CIF community has to face which is how
PMR> and when there can be stable software for the language. I originally
PMR> wrote quite extensive software for the _units_* hierarchy and this now
PMR> would need to be rewritten if anyone is going to use it.
PMR> I can't see a simple way round this in the CIF language, but I do
PMR> think it's important that we get enough stability to start writing code.
PMR> Until that happens there it will be possible for software to check the
PMR> basic CIF syntax but not, say, to check entries against the dictionary.
PMR> It's not trivial at present to see whether cell_length_a_pm is a
PMR> dictionary entry since there is no simple clue to the parser that _pm is
PMR> a unit and not part of the name.
PMR> I believe that the appropriate way forward is through attributes
PMR> such are possible in SGML (no - I'm not deserting CIF, but I think we may
PMR> get help from this direction). Here I would write something like:
PMR> <CELL><LENGTH><A UNIT="pm" ESD=0.3>1412.3</A></LENGTH></CELL> which shows
PMR> the structure more clearly for the parser. An alternative, which could
PMR> map easily onto CIF, could be:
PMR> <FLOAT ID="_cell_length_a" UNITS="pm" ESD=0.3>1412.3</FLOAT>
PMR>
PMR> At the risk of being branded a heretic, this has the merit of
PMR> parsability and extensibility. A CIF could be translated into that form
PMR> by using a translation table.

JW> I would welcome the elimination of the current manner in which units
JW> attributes are expressed. In the new DDL we have assigned a code for
JW> each unit type using the attribute _item_units.code. The unit codes
JW> are then defined/described once in a separate category item_units_list.
JW> We have also included the category item_units_conversion which provides
JW> the correspondence and conversion information between unit types. This has
JW> been implemented in the mmCIF dictionary and I think that it has
JW> worked out rather well.

S> Peter's call for stability is of course appropriate ... he is right and
S> it is most definitely what I am trying to achieve with the publication
S> of DDL1.4. The problem Peter cites about identifying what is an extension
S> to a dataname or a new dataname is in my view the crucial one -- it can be
S> done, but....!

Continuing discussions
======================

(33)D28.2, D28.3 R Factors
--------------------------

G> I must congratulate Paula on achieving what I failed to achieve in a long
G> exchange of emails with Syd when the original CIF was being developed,
G> namely to change the definition of the R-factor ! (if I had succeeded, life
G> might now be easier for authors and co-editors of Acta C, but it was never
G> the intention of CIF to make life easy).
G>
G> I feel strongly that there are two separate pieces of information that should
G> be kept logically separate:
G>
G> (a) A 'conventional' R-factor for the purposes of comparing structures
G> that may have been refined by different procedures by different people,
G> including the many structures published before the CIF revolution. This
G> should be what most people have always understood as an R-factor, e.g.
G> the formula given by Paula for reflections defined by the resolution
G> ranges and 'observed' criteria exactly as she has specified, but without
G> the clause 'and that were included in the refinement'.
G>
G> (b) The procedure used for refining the structure, including the quantity
G> minimized (well, almost) and the specification of which reflections were
G> used in this procedure. For example we quite happily refine proteins
G> against F-squared for all data whereas it appears that Paula is refining
G> against F for a specified subset of the data. It is not the responsibility
G> of COMCIFS to define one particular refinement strategy as correct and to
G> force everyone to use it, and it would stifle scientific progress to do so.
G>
G> Given the R-factor defined as in (a), the maximum resolution and the
G> completeness of the data, any experienced crystallographer has a feel for
G> the quality of a particular structure determination. This only works if
G> we keep it simple and in keeping with generally accepted practice. For
G> (b), we must be as flexible as possible to allow for progress. Both sets
G> of information are essential in the CIF file, but must be kept logically
G> separate.

D30.3 The New DDL
-----------------

In the last mailing (in section D33.1) David Brown posed a few questions about the relationship between DDL1 and DDL2, and asked for a history of the DDL. To meet that request, I have scribbled down a few notes which explain the position as I see it.
I am still far from sure in my own mind how future (and indeed, current draft) dictionaries should be written. This worries me slightly, for I cannot clearly articulate why I am uneasy at the suggestion to discard DDL1 for CIF dictionary purposes. Perhaps I am just growing irrationally conservative in my old age!

A BRIEF HISTORY OF THE DDL

1. The STAR file
================

The initial intention of a STAR ("Self-Defining Text Archive and Retrieval") file was to provide a set of simple syntactic rules for storing and retrieving text strings in a flexible and extensible manner.

The rules are indeed simple. Here they are, in essence:

The file is divided into tokens separated by white space. A token may include white-space characters provided (a) it is surrounded by matching single or double quotes and contains no <newline> characters; or (b) it is surrounded by <newline><semicolon> character pairs (digraphs).

A token beginning with an underscore character is a dataname that MUST have an accompanying value. For non-looped data, this value is the following token.

Looped data are permitted. A loop structure contains a header and a body. The header is introduced by the reserved word "loop_" and contains a list of datanames. The values in the loop body are associated with matching datanames in the header, in strict rotation. In STAR (but not in CIF), nesting of loops is permitted. In the header, each level of nesting below the first is introduced by a "loop_" keyword and terminated by "stop_".

Other reserved words are "data_xxx" where xxx is an arbitrary string that must be unique within the file; "global_"; "save_xxx" and "save_". These are used to partition the file into distinct data cells with specific scoping rules. Of these, only "data_xxx" is used in CIF. All datanames and associated values must occur within one of these data cells.

Comments (introduced by the "#" character and terminated by <newline>) are permitted anywhere that a token is valid.
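The rules above are compact enough to sketch in a few lines of Python. What follows is a minimal illustration under stated assumptions, not the sb implementation: the names tokenize, is_reserved and pair_items are hypothetical; a quote is taken to open a string only at the start of a token and to close it only when followed by white space (the CIF convention, assumed here for STAR too); and nested loops, "stop_" handling and the scoping of "global_" and "save_" cells are ignored.

```python
RESERVED_EXACT = {"loop_", "stop_", "global_"}


def tokenize(text):
    """Split STAR-style text into (kind, value) pairs (simplified sketch)."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        at_line_start = i == 0 or text[i - 1] == "\n"
        if ch.isspace():
            i += 1
        elif ch == "#":
            # Comment: discard everything up to the end of the line.
            while i < n and text[i] != "\n":
                i += 1
        elif ch == ";" and at_line_start:
            # Text field: ends at the next <newline><semicolon> digraph.
            end = text.find("\n;", i + 1)
            if end == -1:
                raise ValueError("unterminated text field")
            tokens.append(("text", text[i + 1:end]))
            i = end + 2
        elif ch in "'\"":
            # Quoted string (assumption): closes at a matching quote that is
            # followed by white space or end of input, on the same line.
            j = i + 1
            while True:
                j = text.find(ch, j)
                if j == -1 or "\n" in text[i:j]:
                    raise ValueError("unterminated quoted string")
                if j + 1 == n or text[j + 1].isspace():
                    break
                j += 1
            tokens.append(("quoted", text[i + 1:j]))
            i = j + 1
        else:
            # Bare token: runs to the next white-space character.
            j = i
            while j < n and not text[j].isspace():
                j += 1
            word = text[i:j]
            tokens.append(("name" if word.startswith("_") else "word", word))
            i = j
    return tokens


def is_reserved(token):
    """True for unquoted reserved words such as loop_ or data_xxx."""
    kind, value = token
    low = value.lower()
    return kind == "word" and (
        low in RESERVED_EXACT or low.startswith(("data_", "save_"))
    )


def pair_items(tokens):
    """Pair datanames with values; single-level loops only, as in CIF."""
    items = {}
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "word" and value.lower() == "loop_":
            i += 1
            names, values = [], []
            while i < len(tokens) and tokens[i][0] == "name":
                names.append(tokens[i][1])
                i += 1
            while (i < len(tokens) and tokens[i][0] != "name"
                   and not is_reserved(tokens[i])):
                values.append(tokens[i][1])
                i += 1
            if names and len(values) % len(names):
                raise ValueError("loop body does not fill whole rows")
            # Values are dealt out to the header names in strict rotation.
            for k, name in enumerate(names):
                items[name] = values[k::len(names)]
        elif kind == "name":
            if (i + 1 >= len(tokens) or tokens[i + 1][0] == "name"
                    or is_reserved(tokens[i + 1])):
                raise ValueError(f"dataname {value} has no value")
            items[value] = tokens[i + 1][1]
            i += 2
        else:
            i += 1  # data_xxx and similar cell delimiters are skipped here
    return items
```

Calling pair_items(tokenize(text)) on a small data block returns a plain dictionary: single values for non-looped datanames and lists of column values for looped ones. Everything stays a text string, which matches the purely syntactic character of the rules.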
There is no restriction on the nature of the information conveyed by these tokens - the STAR rules are purely syntactic, and allow a specified token to be retrieved by an application - i.e. given a dataname, the matching data value or values should be retrieved. The application starbase (sb) works entirely at this syntactic level, and guarantees to return requested data in fully compliant STAR format.

This syntactic format was chosen as the base layer for the CIF, but with some simplifications to aid programming. The only recognised data cell delimiter is the data_xxx block code; nesting of loops is prohibited; block codes and datanames are restricted to 32 characters in length; lines are restricted to 80 characters.

The initial semantic layering (i.e. the imposition of "meaning" onto the abstract tokens) was done by devising datanames that were self-expressive, and by imposing some basic data types on the associated values. Effectively only two data types were permitted - "numb" for numerical values (with a permitted standard uncertainty value in trailing parentheses); and "char" for textual information (though some applications might choose to differentiate between multi-line text extending over several lines and bracketed by newlines, and the single-line-or-less character strings with quote marks or no delimiters). [No, Paula, I haven't forgotten "null" - but this is still In The Beginning...]

Hence, this could be considered a totally valid CIF:

    data_proto
    _my.crystal's.habit     'turns green in daylight'

But there is a problem here: the word 'habit', used here in the sense of 'customary behaviour', is used in crystallography with a specialised meaning. This ambiguity is fatal to the purpose of devising a universal exchange mechanism.

2. A Dictionary of Universal Terms
==================================

So the next step was to devise a dictionary of datanames which represented very specific terms and definitions in crystallography.
The dictionary would list all datanames with a universal meaning, together with a definition of that meaning, an indication of whether the associated value was numeric or textual, and any constraints that could be applied to that value. The datanames were constructed in a way designed to illustrate their relationship with each other, through a hierarchy of subcomponents separated by underscores - hence _atom_site_symmetry_multiplicity etc. And this was indeed how the original CIF dictionary was submitted to Acta - as a MS-Word file with the definitions laid out just as in a lexicographic dictionary. An appendix listed permitted codes for certain values (what we habitually call the 'enumeration lists'), and some general elements of the definitions were also included in the main text of the paper. All this information was available only to the human reader. However, Tony Cook, who was working with Syd on the chemical (MIF) applications of STAR files, pointed out that much of the stored information on each dataname could be extracted by computer if it were presented in an appropriate way - and what way would be more appropriate than as a STAR file, so that the same software being written to extract information from a CIF could be used to extract information from the dictionary? And so it came to pass - by the time the original CIF paper went to press, the dictionary had been recast as a STAR file (with the same syntax restrictions as a CIF). The dictionary information was associated with a new set of datanames, which form the vocabulary of the Dictionary Definition Language, or DDL. Note that this formalism is only mentioned in passing in the CIF paper: the typeset version of the dictionary translates the DDL names into sentences, so that the dictionary again resembles a lexicographic one in its layout of entries. 
Here is an example of an entry in the core dictionary:

data_atom_site_attached_hydrogens
    _name                      '_atom_site_attached_hydrogens'
    _type                      numb
    _list                      yes
    _list_identifier           '_atom_site_label'
    _enumeration_range         0:4
    _enumeration_default       0
    loop_ _example
          _example_detail      2     'water oxygen'
                               1     'hydroxyl oxygen'
                               4     'ammonium nitrogen'
    _definition
;              The number of hydrogen atoms attached to the atom at this site
               excluding any H atoms for which coordinates (measured or
               calculated) are given.
;

The DDL used for this version is documented only in comments at the end of the dictionary file cifdic.C91. I reproduce it here for historical interest, and shall refer to this version as DDL0.

##############################################################################
#
#                      DDL Data Name Descriptions
#                      --------------------------
#
# _compliance          The dictionary version in which the item is defined.
#
# _definition          The description of the item.
#
# _enumeration         A permissible value for an item. The value 'unknown'
#                      signals that the item can have any value.
#
# _enumeration_default The default value for an item if it is not specified
#                      explicitly. 'unknown' means default is not known.
#
# _enumeration_detail  The description of a permissible value for an item.
#                      Note that that the code '.' normally signals a null
#                      or 'not applicable' condition.
#
# _enumeration_range   The range of values for a numerical item. The
#                      construction is 'min:max'. If 'max' is omitted then the
#                      item can have any value greater than or equal to 'min'.
#
# _esd                 Signals if an estimated standard deviation is
#                      expected to be appended (enclosed within brackets)
#                      to a numerical item. May be 'yes' or 'no'.
#
# _esd_default         The default value for the esd of a numerical item
#                      if a value is not appended.
#
# _example             An example of the item.
#
# _example_detail      A description of the example.
#
# _list                Signals if an item is expected to occur in a looped
#                      list. Possible values 'yes','no' or 'both'.
#
# _list_identifier     Identifies a data item that MUST appear in the list
#                      containing the currently defined data item.
#
# _name                The data name of the item defined.
#
# _type                The data type 'numb' or 'char' (latter includes 'text').
#
# _units_extension     The data name extension code used to specify the units
#                      of a numerical item.
#
# _units_description   A description of the units.
#
# _units_conversion    The method of converting the item into a value based
#                      on the default units. Each conversion number is
#                      preceded by an operator code *, /, +, or - which
#                      indicates how the conversion number is applied.
#
# _update_history      A record of the changes to this file.
#
#-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof

I know of two applications which are able to read the CIF core and validate CIFs against that dictionary, expressed in DDL0: CIFCHK from the CCDC, which we use to check CIFs as we archive them (but this program has never been released to the public domain); and sb (which has gaps - it can't handle _units_extension). The benefit of this approach is, of course, that you do not need to write a new subroutine to handle each new entry in the dictionary - no small consideration, now that we have more than 1200 entries.

By the time of the Veszprem computing school in mid-1992, Peter Murray-Rust, for instance, was already working on a C++ object library for CIF. The ambition was to provide a library of routines (written in C++ but callable from C (and maybe even Fortran?)) for extracting CIF data and validating it against the dictionary.

3. Relationships Between Data Names - the DDL Extended
======================================================

At the York meeting in April '93, the viewpoint was advanced that CIFs could usefully be treated as implementations of various formal data models.
As I have described its evolution so far, the CIF data model has no explicit mechanism for recording relationships between items of data contained in the file. But there were already certain implicit conventions. The dataname scheme embodied a hierarchy of definitions: consider the three data names _atom_site_label, _atom_site_fract_x and _atom_type_oxidation_number. "Obviously" (to a human), these all have something to do with atoms; the first two have something to do with the positions occupied by atoms in a cell; and one would expect _atom_site_fract_x to have corresponding _y and _z equivalents (which it has). In this hierarchical picture, the hierarchical relationships can be unravelled to a large extent by parsing the name components in the dataname. But only "to a large extent". The hierarchies as set up in the core dictionary are not rigid. The CIF paper lists _chemical_ and _chemical_conn_ as separate "category" parents. (I put the word category in quotes here because it was not a well defined idea. Most people have a feel for the type of categorisation being attempted, but the rules for assigning categories were for a long time difficult to pin down.) In like manner, the _atom_site_ and _atom_site_aniso_ groups of datanames were clearly related, but yet distinct. The relationship was not so clean as the similar form of datanames suggested. In practice, related data are grouped together - conventionally in the form of tables; in CIFs within loops. A 'relational' data model is one which can represent data as tables obeying certain definite rules: each line of the table must be indexed by a unique key. It was argued that the CIF tables could be mapped onto a relational data model, provided that the necessary rules were enforced through the relationships between data names stored in the dictionary. In other words, the DDL should be revised to allow valid relational tables to be constructed and validated. 
This would make it much easier to load a relational database with the relevant information from a CIF; or, conversely, one supposes, to borrow the tools of relational database management for managing CIFs.

So, between the York meeting and the workshop in Tarrytown of October '93, there was a great deal of e-mail correspondence and a consequent evolution in the DDL to meet these objectives.

Each dataname was now assigned formally to a category (the name of which was usually the same as the dataname stem, e.g. atom_site - but not necessarily so). All the items in a table (i.e. looped list) belonged to the same category. Items not in a table were assigned a different category.

Certain uniqueness constraints could be applied to one or more of the datanames. This had the effect of defining the unique 'key' for the table. For instance, in the atom_site category, _atom_site_label had to be unique (in crystallography-speak, this just means that every atomic site had to have a unique label).

Additional DDL names were introduced to record parent-child relationships between data names in different categories. Atom labels occur in the geometry tables, for instance _geom_bond_atom_site_label_1. These labels must match corresponding labels of atoms in the atom_site list, so the 'parent' of _geom_bond_atom_site_label_1 is _atom_site_label, and _geom_bond_atom_site_label_1 is in turn recorded as a 'child' of _atom_site_label.

To meet all these requirements, the original DDL was extended and modified, and is now in press as version 1.4.
I append below a list of the datanames defined in that version; those prefixed by * are extensions to the DDL0 set:

  * _category
    _definition
  * _dictionary_history
  * _dictionary_name
  * _dictionary_update
  * _dictionary_version
    _enumeration
    _enumeration_default
    _enumeration_detail
    _enumeration_range
    _example
    _example_detail
    _list
  * _list_level
  * _list_link_child
  * _list_link_parent
  * _list_mandatory
  * _list_reference       (replaces _list_identifier)
  * _list_uniqueness
    _name
  * _related_item
  * _related_function
    _type
  * _type_conditions
  * _type_construct
    _units_extension      |
    _units_description    |- under review - see below
    _units_conversion     |

The following DDL0 terms were dropped:

    _compliance           (replaced by the _dictionary_ attributes)
    _esd                  (incorporated in _type_conditions)
    _esd_default
    _list_identifier      (replaced by _list_reference)
    _update_history       (replaced by _dictionary_history)

The _units_ items are currently under review by Syd, because they pose a problem in reading a dictionary. The idea is that _cell_length_a is defined in the dictionary as a quantity in angstroms; but it has a _units_extension code _pm, which means that the dataname _cell_length_a_pm is to be understood as a cell length in picometres (the numeric conversion factor is embodied in _units_conversion). But a straightforward dictionary lookup doesn't return _cell_length_a_pm as a valid dataname.

Two sets of items have involved especially protracted labours. The _list_ items define the attributes and relationships between items in a looped list. The _list_level allows for data names to be assigned to deeper levels in nested loops. As such, it has no use in CIF applications, where nested loops are forbidden; but it allows nested loop structures to be defined in other STAR applications which do permit it. In my view this is important, for it permits the definition of hierarchical data models, where the existence of certain data items is dependent on the existence of others at a higher level in the hierarchy (you can't have a loop at level 2 unless it is nested within a loop at level 1). I don't know enough to say whether this is a necessary or sufficient condition for mapping STAR data structures onto hierarchical data models, but I have a feeling that it may be important for doing this, and I'd be glad to hear any informed commentary on this.

In CIF loops, which are always at level 1, the _list_ attributes describe the relations between the data values in a table, and so this is an attempt to map onto a relational data model. If one wants to identify the key to entries in such a table (the data name or names which must have unique values within the table), one must collect together entries with _list_mandatory set to 'yes' and the complete set of _list_uniqueness pointers.

Here's a simple example, somewhat adapted from the MIF paper, to show what's going on. The following loop defines a table of bonds in a chemical structure:

    loop_
        _bond_id_1
        _bond_id_2
        _bond_type
           C1   C2   double
           C2   C3   single
           C3   C4   double
           C1   C7   single

The MIF dictionary entry for _bond_type includes the line

    loop_ _list_reference     '_bond_id_1'  '_bond_id_2'

meaning that both bond id values must be present for the table entry to make sense. In the entries for _bond_id_1 (and _2) are found the lines

    _list_mandatory           yes
    loop_ _list_uniqueness    '_bond_id_1'  '_bond_id_2'

meaning that these datanames must be present in the loop, and must together be unique (C1 appears twice as _bond_id_1 in the example, but that's OK: on one occasion it's teamed with C2, on another with C7).
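The check implied by _list_mandatory and _list_uniqueness in the bond example can be sketched in Python. This is only an illustration under stated assumptions: the function name check_loop and the representation of a loop as a list of dictionaries are hypothetical, and a real validator would first have to gather the mandatory and uniqueness lists by hunting through the dictionary entries.

```python
def check_loop(rows, mandatory, unique_together):
    """Report rows that lack a mandatory dataname or repeat a composite key.

    rows            -- list of dicts mapping dataname -> value (assumed layout)
    mandatory       -- datanames whose entries carry _list_mandatory 'yes'
    unique_together -- datanames collected from _list_uniqueness
    """
    # Every key component must be present as well as every mandatory item.
    required = list(dict.fromkeys(list(mandatory) + list(unique_together)))
    problems = []
    seen = {}
    for n, row in enumerate(rows, start=1):
        missing = [name for name in required if name not in row]
        if missing:
            problems.append(f"row {n}: missing {', '.join(missing)}")
            continue
        # The key is the combination of values, not any single column.
        key = tuple(row[name] for name in unique_together)
        if key in seen:
            problems.append(f"row {n}: key {key} repeats row {seen[key]}")
        else:
            seen[key] = n
    return problems


# The table of bonds from the example above.
BONDS = [
    {"_bond_id_1": "C1", "_bond_id_2": "C2", "_bond_type": "double"},
    {"_bond_id_1": "C2", "_bond_id_2": "C3", "_bond_type": "single"},
    {"_bond_id_1": "C3", "_bond_id_2": "C4", "_bond_type": "double"},
    {"_bond_id_1": "C1", "_bond_id_2": "C7", "_bond_type": "single"},
]
KEY = ["_bond_id_1", "_bond_id_2"]
```

With these definitions check_loop(BONDS, KEY, KEY) returns an empty list, because C1 is paired with a different partner each time; appending a second C1, C2 row would produce one reported problem.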
These relationships do all hang together; but it's necessary to do a certain amount of hunting through the dictionary to ensure that you've got all the relevant information (and it requires a lot of work to make sure that the dictionary yields this information in a consistent manner). The other set of items I marked for interest are the _type_ group. Because the STAR philosophy is to store and deliver text strings, it was felt that the assignment of data types was largely unnecessary. Integers, floats, double-precision complex numbers and booleans did not need to be stored in any different manner. They can all be coded as text strings: "3", "2.76", "-1.30000988765876 + 0.00022456342987i", "true". But there have been many voices raised to counter this view, and the solution adopted in DDL1.4 is to permit three fundamental types, with _type values of "numb", "char" and "null". ("null" is a device adopted to allow additional information to be stored in dictionaries for humans to read, but machine parsers to ignore.) _type_conditions extends this, so that 1.23(4) is understood as a number with associated standard uncertainty (e.s.d.) in CIF applications, and 1:4 is understood as a range of allowed integer values in certain MIF applications. _type_construct is a more general device which allows a data value to be compared with a pattern (in regular expression notation). Hence a date quantity could be described in the dictionary with a _type_construct of [0-9][0-9]:[0-9][0-9]:[0-9][0-9], meaning any triplet of two-digit integers separated by colons (not the format used in CIF!). The regex notation is very powerful, and in principle this allows very tight control over acceptable patterns in the string representing a data value. A final comment on category. The category assigned to each data item defines which loop (table) it may appear in. If it doesn't normally appear in a loop, its category is more generally defined to encompass related items with the same status. 
Note that this linking of categories to loops results in rather a large number of distinct category assignments. It is not, perhaps, an effective taxonomic classification.

4. Version 2 Unleashed
======================

The DDL2 version developed by John Westbrook followed on yet another CIF workshop, that at Brussels in October '94. Its philosophy was to use the same mechanism of supplying machine-readable attributes of data names in a STAR dictionary, the attributes themselves labelled by STAR data names. But the intention was more specifically to provide a representation better suited for mapping onto a relational data model than the original DDL. Macromolecular data are increasingly manipulated in relational databases, and it is noteworthy that the PDB has recently published a schema for its proposed relational database implementation of its stored data.

In mailing 30 of 15 Feb this year, I already described how (I think) DDL2 is intended to work. Here I shall just pick up a few threads to contrast with specific remarks I made above on DDL1.

First, the taxonomy employed allows a classification hierarchy: there are categories, defined in the same way as the categories in DDL1, but there are also category_groups (clusters of categories, or supercategories), and subcategories. Hence _cell.length_a, _cell.length_b and _cell.length_c are the (only) three members of the subcategory "cell_length"; they are all members of the "cell" category; they are all members of the "cell_group" supercategory (as is, say, _cell_measurement_refln.index_h); and they are also members of the "inclusive_group" supercategory (as is everything in the mmCIF dictionary).

Second, the properties of a category are listed separately from the properties of its constituent members. If I were to extend my MIF example into DDL2, you would have an entry something like

    save_BOND
        _category.id            bond
        loop_
        _category_key.name      '_bond.id_1'
                                '_bond.id_2'
    save_

which defines the key of the 'bond' category.
None of the entries for _bond.id_1, _bond.id_2 or _bond.type (as they would become in DDL2) would contain anything about key values, except insofar as they indicate membership of the 'bond' category. The DDL2 approach also allows other general properties to be described in the dictionary in (arguably) a more structured way. The _units_conversion entries in DDL1 appear in individual definition blocks, so that the conversion factor from angstroms to picometres is given for every data name which may have associated data names and different units. In the mmCIF2 dictionary, all units required are gathered together in a single table at the beginning of the dictionary. Individual data names have associated with them the unit to attach to the quantity described, and any other units may be generated by consulting the conversion table. (Note, however, that the mmCIF2 does not permit a quantity to be expressed in other units in the CIF: the units of _cell.length_a are angstroms, and there is no _cell.length_a_pm or equivalent.) The mmCIF2 dictionary also contains a larger spread of what are effectively user-defined types: a list of type codes is established at the beginning of the dictionary using the equivalent of _type_construct. In each data name definition, an _item_type.code value is listed, which indexes into this table of user-defined types. The original DDL1 types are preserved (as _item_type_list.primitive_code and _item_type_conditions.code, I think), but there is more freedom to define application-specific types for the dependent application to handle. Both the units and type lists were dropped from my ciftex'd version of the mmCIF dictionary because of technical difficulties in printing them, but they are present in the dictionary file itself. 
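The contrast between the two units schemes described above can be made concrete with a short Python sketch. All of it is illustrative: the table layouts, the unit codes 'angstroms' and 'picometres', the conversion string '/100' and the function names are assumptions, and only the general shape (a per-item extension with an operator-prefixed conversion number in DDL1, one shared conversion table in DDL2) is taken from the text. The single hard fact used is that 1 angstrom equals 100 picometres.

```python
import operator

# Operator codes as listed for _units_conversion: *, /, + or -.
OPS = {"*": operator.mul, "/": operator.truediv,
       "+": operator.add, "-": operator.sub}

# DDL1 style (illustrative layout): each definition block carries its own
# extension codes and the conversion back to the default units.
DDL1_ENTRIES = {
    "_cell_length_a": {"extensions": {"_pm": "/100"}},  # pm -> angstroms
}


def resolve_ddl1(name):
    """Find the defining entry for a dataname that may carry an extension."""
    if name in DDL1_ENTRIES:
        return name, None  # plain lookup succeeds: default units
    # Plain lookup failed, so try every known base name plus extension.
    for base, entry in DDL1_ENTRIES.items():
        for ext, conversion in entry["extensions"].items():
            if name == base + ext:
                return base, conversion
    raise KeyError(name)


def apply_ddl1(conversion, value):
    """Apply an operator-prefixed conversion number such as '/100'."""
    if conversion is None:
        return value
    return OPS[conversion[0]](value, float(conversion[1:]))


# DDL2 style (illustrative layout): one table for the whole dictionary,
# rows of (from_code, to_code, operator, factor).
UNITS_CONVERSION = [
    ("angstroms", "picometres", "*", 100.0),
    ("picometres", "angstroms", "/", 100.0),
]
ITEM_UNITS = {"_cell.length_a": "angstroms"}  # each item has one fixed unit


def express(name, value, to_code):
    """Convert a stored value from the item's fixed unit to another unit."""
    from_code = ITEM_UNITS[name]
    if from_code == to_code:
        return value
    for src, dst, op, factor in UNITS_CONVERSION:
        if (src, dst) == (from_code, to_code):
            return OPS[op](value, factor)
    raise KeyError((from_code, to_code))
```

In the DDL1 sketch the parser has to guess that _cell_length_a_pm is _cell_length_a plus an extension, which is the lookup difficulty raised in the correspondence; in the DDL2 sketch the dataname is always found directly and other units are derived only on request.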
Note that datanames described by DDL2 dictionaries will have an embedded dot character to separate their "category" part from their "instance" part: this will be the most obvious difference between CIFs containing datanames described by DDL2 dictionaries and existing CIFs. The DDL2 proposal endeavours to honour existing CIFs through an aliasing mechanism.

For the sake of completeness, I attach here a simple list of all the data names used in the DDL2 set:

    _block.description
    _block.id
    _category.description
    _category.id
    _category.mandatory_code
    _category.method_id
    _category_examples.case
    _category_examples.detail
    _category_examples.id
    _category_group.category_id
    _category_group.id
    _category_group_list.description
    _category_group_list.id
    _category_group_list.parent_id
    _category_key.id
    _category_key.name
    _dictionary.block_id
    _dictionary.title
    _dictionary.version
    _dictionary_history.revision
    _dictionary_history.update
    _dictionary_history.update_day
    _dictionary_history.update_month
    _dictionary_history.update_year
    _dictionary_history.version
    _item.category_id
    _item.mandatory_code
    _item.sub_category_id
    _item_aliases.alias_name
    _item_aliases.name
    _item_default.name
    _item_default.value
    _item_dependent.dependent_name
    _item_dependent.name
    _item_description.description
    _item_description.name
    _item_enumeration.detail
    _item_enumeration.name
    _item_enumeration.value
    _item_examples.case
    _item_examples.detail
    _item_examples.name
    _item_linked.child_name
    _item_linked.parent_name
    _item_range.maximum
    _item_range.minimum
    _item_range.name
    _item_related.function_code
    _item_related.name
    _item_related.related_name
    _item_structure.code
    _item_structure.name
    _item_structure_list.code
    _item_structure_list.dimension
    _item_structure_list.index
    _item_type.code
    _item_type.name
    _item_type_conditions.code
    _item_type_conditions.name
    _item_type_list.code
    _item_type_list.construct
    _item_type_list.detail
    _item_type_list.primitive_code
    _item_units.code
    _item_units.name
    _item_units_conversion.factor
    _item_units_conversion.from_code
    _item_units_conversion.operator
    _item_units_conversion.to_code
    _item_units_list.code
    _item_units_list.detail
    _method.id
    _method.name
    _method_list.code
    _method_list.detail
    _method_list.id
    _method_list.inline
    _method_list.language
    _sub_category.description
    _sub_category.id
    _sub_category.method_id
    _sub_category_examples.case
    _sub_category_examples.detail
    _sub_category_examples.id

5. The Present
==============

So now we have two formalisms that can be used to describe the information content of a STAR file: DDL1.4, which is used in the MIF core dictionary, the CIF core, powder and modulated structures extensions, and in minor applications like the WDC9 and ACA abstracts dictionaries; and DDL2.0.x, which is used in the mmCIF dictionary. The current mmCIF also includes all the core definitions reworked in the new formalism.

DDL1.4 has evolved in the way I described above, and is a general mechanism for defining data names. It is not tied to any specific data model, but can be mapped onto a relational model, and possibly onto a hierarchical one.

DDL2 was developed as a relational model, and is better structured and more consistent in that respect. But it enforces this model on data files that it describes: if a CIF had a table representing some raw experimental data, it is possible that some lines of that table might be repeated (if the same reflection were re-measured, for instance). The relational viewpoint forces those lines to be distinguished, even if only by the addition of a new data name whose only purpose is to number the rows of the table! The stricter categorisation rules may also make more problematic the handling of external data names (i.e. those introduced by a user which have no definition in an official dictionary).

At this stage, the two formalisms are very closely compatible - DDL2 was designed to achieve this. But a whole-hearted implementation of all the DDL2 ideas may well lead to a divergence between data files described by DDL1 and DDL2 dictionaries. At present, an application reading a DDL2 dictionary is intended to be able to read and verify existing CIFs (currently described by DDL1 dictionaries) through an in-built aliasing mechanism. We have yet to see it demonstrated how well this will work; but it's unlikely that such a mechanism will track any changes that are made to the DDL1 language and its dependent data files in the future.

So we need to consider carefully whether all existing dictionaries should be recast in the DDL2 formalism, and whether that will affect the future evolution of CIF vis-a-vis parallel developments such as MIF.

=====

Two last small points: DDL stands for "Dictionary Definition Language" (there have been numerous other descriptions over the last few years!). And DDL is a STAR application (i.e. a set of data names and "values" in STAR format). It's not "the description of the STAR standard". It's of interest to us because it's the language in which CIF definitions are written; but I suppose that, if all else were to fail, we could go back to MS-Word and the language of Shakespeare :-)

Regards
Brian