Discussion List Archives

(69) mmCIF dictionary approved

  • To: COMCIFS@iucr.ac.uk
  • Subject: (69) mmCIF dictionary approved
  • From: bm
  • Date: Sun, 8 Jun 1997 11:58:43 +0100

Dear Colleagues

A69.1 Approval of mmCIF draft dictionary
----------------------------------------
COMCIFS has unanimously endorsed the draft macromolecular CIF dictionary,
version 0.9.01, as posted on 31 January 1997, subject to editorial
corrections.

This may be taken as the formal adoption by the IUCr of mmCIF as the
preferred archive file format for macromolecular data and structural
reports. The official release version will be made available to the
community upon completion of editorial and proofing stages; I shall discuss
with Paula a likely timescale for this process. Software developers may be
directed to the draft version 0.9.01 as a good indicator of the overall
structure and content of the final version, but should be warned of the
possibility of some change between this and the official published
version.

I shall report on any editorial revisions and the manner of presentation of
the dictionary and its associated documentation as we work through the
'publication' process with Paula. Thanks, as always, go to all those who
have contributed to the work involved in this very substantial project.

I include below (as discussion item D69.1) some comments of Otto Ritter
on the dictionary and its adoption by the PDB.


Ongoing discussions
===================

D67.1 Approval of pdCIF dictionary
----------------------------------
At the time of writing, I have received an "aye" vote from all the members
on motion D67.1, with the exception of Paula's reservations about the
incomplete category structure in the pd_data section of the dictionary. From
the nature and tone of the responses, I have the very strong impression
that there is no desire to impede the formal adoption of the dictionary, and
I am therefore working offline with Paula, Brian Toby and David Brown, in an
attempt to resolve the outstanding technical issues. I hope that this is
felt to be acceptable; I shall willingly open up the discussion to any other
individuals who wish to participate, but I don't think it's necessary to
trouble everyone with all the ongoing correspondence.

One opinion that I do wish to place on the record here is that of Gotzon
Madariaga, who is working on the modulated-structures dictionary:

G> However I would like to point out that COMCIFS must establish, as soon as
G> possible, strict rules to avoid the type of discussion that delays the
G> approval and the official use of the dictionaries. I think that the
G> expression "this is an application matter" should not be applied in the
G> future. The problem is that people involved in CIFs belong to two
G> different categories. 
G> 
G> a) Dictionary writers. They try to collect and define those items that are
G> necessary to transmit and archive information relative to some scientific
G> field. For them DDLs are enough. In fact mmCIF and pdCIF dictionaries
G> fulfil the requisites (perhaps with some minor corrections) of DDL2.0 and
G> DDL1.4 respectively. 
G> 
G> b) Software developers. They need additional restrictions that would make
G> their work (parsers, databases, etc.) easier. Perhaps a closer connection
G> between data-name prefixes and categories, a full exploitation of
G> _type_construct (including the addition of additional data names that
G> could be superfluous or redundant for dictionary writers), etc. 
G> 
G> In addition COMCIFS (or at least myself) would need a definite policy
G> about the future DDL. Are the current DDL1.4-based dictionaries
G> provisional? Would additional (naming) rules favour a smooth migration to
G> DDL2.0? 
G> 
G> These matters should be analyzed carefully perhaps before the approval of
G> any new dictionary, even when I am finishing a new draft of msCIF. 

I think the (very) protracted discussions about DDLs, categories and the pd
and mm dictionaries have been of value in opening up the full technical
debate to all our members. I think that there has been enough volume of
discourse to allow us to extract some general principles and desires for the
direction in which the standard should evolve. Some of these principles need
to be stated and formalised, and probably the future role of COMCIFS is
indeed to give much clearer advice to the next generation of dictionary
authors, and to the community of software developers.

Now that the first two extension CIF dictionaries have been (in effect)
approved, the role and nature of COMCIFS are ripe for review. David Brown
has prepared a discussion paper on this topic, which is intended to promote
discussion amongst ourselves on our future role. This will lead to a report
that David will present to the Executive at the Lisbon meeting in August,
together with any appropriate recommendations that emerge from our
debate.

I shall circulate David's paper as the next COMCIFS communication, to
try to separate cleanly the policy issues involved from the technical
considerations that we have been enjoying of late. Gotzon's views should 
be seen as one contribution to that upcoming debate.

D69.1 Remarks on mmCIF by O. Ritter
-----------------------------------
O> 2-June-97
O> Dear COMCIFS members,
O> 
O> Here are my 2 cents to the discussion on mmCIF. It has taken me
O> quite some time to go over the great volume of documentation and
O> related communications.
O> 
O> 1. The mmCIF dictionary as a data language
O> ------------------------------------------
O> In my view (my background is not in crystallography),
O> mmCIF is a relatively well organized dictionary providing basic format
O> and description of data elements within the domain (X-ray crystallography).
O> The dictionary represents a comprehensive system of syntactic and semantic
O> definitions for domain data, including controlled vocabularies,
O> relationships, and basic integrity constraints. It comes with a
O> data exchange language, the mmCIF format, for declaring and
O> defining concrete data sets.
O> 
O> The dictionary itself is not (and in principle cannot be) complete
O> and agreed upon by everybody. I myself am not qualified to assess the domain
O> science elements of mmCIF. It seems to me, however, that it is the best
O> currently available such dictionary (based on multiple discussions
O> with domain scientists).
O> 
O> The dictionary is quite large and quite "flat". There are only a limited
O> number of layers in its abstraction hierarchy (category_group, category,
O> sub_category, and item).
O> 
O> The language of mmCIF has a sound abstract syntax (basically that of
O> simple relations with equivalents to "keys" in the standard relational
O> model). The concrete syntax of mmCIF is rather idiosyncratic, but this
O> shouldn't matter much as long as humans don't mind or get used to it,
O> and programs can parse it unambiguously.
O> 
O> The mmCIF language is also limited in its expressiveness. One can
O> define and declare data elements, but there are no language constructs
O> for expressing integrity constraints, other than simple ones, or for
O> manipulating the data (e.g., query, insert, update, delete).

I might just remark that CIF was designed essentially as an archive format,
and so did not begin life with requirements for insertion and updating of
data. One possible benefit of the well-defined relations embodied in the 
current mmCIF formulation is that the data may conveniently be loaded into
relational database management systems that more readily support data
manipulation. However, there are some programs that support manipulation of
the native CIFs themselves, such as Hall & Spadaccini's Star_Base (sb),
which has a rich query language for extracting data from any STAR file, not
just a particular dialect of CIF; and John Westbrook's extensive software
suite designed and implemented at Rutgers. No doubt other examples will be
discussed at the St Louis workshop this summer.
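
To make the point about relational loading concrete, here is a minimal
sketch in Python. It is not Star_Base or the Rutgers software, and it is
not a CIF parser: it handles only a single simple loop_ block. The sample
coordinates, the function names parse_single_loop and load_into_sqlite,
and the choice of the first item in the loop as the primary key are all
assumptions made for illustration.

```python
import shlex
import sqlite3

# Hypothetical fragment in mmCIF style: one loop_ over the atom_site category.
SAMPLE = """\
loop_
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.Cartn_x
1 N N  25.369
2 C CA 25.970
3 C C  25.569
"""


def parse_single_loop(text):
    """Return (category, columns, rows) for one simple loop_ block.

    Assumes whitespace-separated values on single lines; multi-line text
    fields, save frames and multiple data blocks are not handled.
    """
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not values:
            names.append(line)                 # still in the header of the loop
        else:
            values.extend(shlex.split(line))   # honours quoted values
    category = names[0][1:].split(".", 1)[0]   # "_atom_site.id" -> "atom_site"
    columns = [name.split(".", 1)[1] for name in names]
    width = len(columns)
    rows = [tuple(values[i:i + width]) for i in range(0, len(values), width)]
    return category, columns, rows


def load_into_sqlite(conn, category, columns, rows):
    """Create one table per category; the first column is assumed to be the key."""
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(
        f'CREATE TABLE "{category}" ({col_defs}, PRIMARY KEY ("{columns[0]}"))'
    )
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{category}" VALUES ({placeholders})', rows)


category, columns, rows = parse_single_loop(SAMPLE)
conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, category, columns, rows)

# Once loaded, ordinary SQL supplies the query facilities that CIF itself lacks.
carbons = conn.execute(
    "SELECT label_atom_id FROM atom_site WHERE type_symbol = ? ORDER BY id",
    ("C",),
).fetchall()
print(carbons)
```

The category name maps onto a table and each item onto a column, which is
the sense in which the relations in the dictionary make such loading
convenient. A real loader would take the key items and data types from
the dictionary rather than guessing them as this sketch does.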

O> There are some, but not many, tools to manipulate mmCIF data streams.
O> Having more (standard) software tools to inspect and manipulate the data
O> would be of great advantage. The abstract data model behind the mmCIF
O> Dictionary Description Language (DDL), i.e. the system of data abstractions,
O> their properties and relationships, enables one to easily translate mmCIF
O> to/from other more standard representations for which tools exist.
O> E.g., it is perfectly possible to translate (here the meaning is close to
O> transliterate) the mmCIF dictionary and data stream into/from an equivalent
O> ASN.1 stream without loss of information.
O>
O> 2. PDB plans regarding mmCIF
O> ----------------------------
O> The mmCIF format is clearly preferable to the current PDB one, as
O> it is easily parseable, does not have some of the arbitrary limitations
O> on field length as found in PDB, and better supports incomplete or
O> variant data instances. It is very important that the mmCIF DDL can be
O> parsed and managed as structured data, as opposed to the PDB format
O> definition which is a text document.
O> 
O> We plan to support full mmCIF input to, and output from, PDB asap. We
O> are not able to do that shortly with the current resources but we are in
O> the process of getting additional funding for this. If all goes well
O> we'd have a prototype by the year's end and a production version by
O> March '98.
O> 
O> For obvious reasons, mmCIF in its form today cannot, and will not, be
O> the ONLY I/O format of PDB (cf, e.g., NMR data). We have to support
O> the traditional PDB format as well, but we do not plan any substantial
O> further work on it other than basic editorial maintenance and, perhaps,
O> some smaller changes to accommodate for better mmCIF interoperability.
O> 
O> For language-theoretic reasons, mmCIF cannot be taken as
O> a definition of the PDB contents and behavior. We will provide a formal
O> definition of both (contents and behavior) in the meta-database, to which
O> mmCIF will be an important source of information.
O> 
O> 3. Miscellanea
O> --------------
O> I'm curious to hear other people's comments on mmCIF and the DDL.
O> There has been virtually no traffic on mmCIF on this board since I joined,
O> and the mmCIF mail archive at Rutgers has its last communication
O> of 3-Feb-97.

This may reflect a natural hiatus on the part of the community as they await
formal ratification of the dictionary; the dictionary submitted to COMCIFS
does, after all, represent the culmination of several years of very active
involvement by a large number of participants. I should imagine that the
lists will again become active after the summer meetings of the ACA and ECM,
and I expect that the nature of the discussions will become more slanted
towards the practical issues of implementation that will really be starting
to affect users.

O> As Joel Sussman points out, it would be worthwhile to suggest that input
O> should be received from scientists outside the X-ray crystallographic
O> community, as the new exchange/archive format is likely to be used widely
O> by them too, e.g. NMR, molecular biologists, genomics, ...

There are a number of initiatives in train for the coordination of
structural databases and the development of molecular bioinformatics.
I should certainly appreciate news of such developments for posting to the
COMCIFS list, for it is important that we play an active role in such
initiatives as appropriate.

O> 4. Summary
O> ----------
O> I don't see any technical reasons why the adoption of mmCIF, in its current
O> version, should be delayed or blocked, based on its underlying data
O> model and concrete syntax (data exchange format). It should be noted,
O> however, that the mmCIF DDL is a rather low-level language in terms of
O> abstraction, expressivity, and manipulation constructs; and that more
O> powerful languages (or more user-friendly tools) are needed on both
O> ends of the communication process, i.e., the writer (typically a domain
O> scientist or a laboratory software tool) and the receiver (typically
O> an information archive or processing tool).
O> 
O> In my personal opinion, the real value of the mmCIF dictionary lies in
O> its comprehensive system of definitions and controlled vocabularies,
O> assuming these are correct and reasonably complete from the domain point of
O> view. The whole system of mmCIF items, categories, and other definitions
O> could be equivalently expressed in a more standard data specification
O> language, independent of the application domain, for which there already
O> exist proven industrial-strength software tools for data management (editing,
O> validation, querying, etc.). This would save community resources to
O> develop and maintain such software tools for mmCIF. I understand, though,
O> that there are historical and sociological reasons behind this concrete
O> mmCIF DDL.

There certainly are. One reason that the IUCr actively supports CIF is that
it is a file format in established practical use in the small-molecule
field, and it contributes to the efficient production of Acta Cryst. C.
But our experiences in developing the typesetting software for Acta
suggest that the file format is less important than the fact that the data
are well defined and that there are clear (even if flexible) relationships
between specific items of data. Hence the Union should be alive to
developments in other disciplines, and not seek to adhere to a particular
file structure if it is counterproductive to do so. For the time being,
though, CIF will offer a development environment in which crystallographers
can work on a foundation of solid experience and growing expertise, and its
successes may well lead to its adoption in other domains. Apart from the NMR
structuralists, we have had expressions of interest in such an approach from
fields as diverse as quantum chemistry, astronomy and taxonomy (and I am
still trying to promote its use in cookery books :-). If I may borrow a
comment from Herbert Bernstein, who is working with the IUCr on a project to
develop software tools for the publication of macromolecular structure
reports,

H>   I know there are always better ways to do things, but I know the mmCIF 
H> dictionary is in more than good enough shape to be the basis for sound 
H> publication tools in macromolecular crystallography.  I hope to have a 
H> good new cifdif in a couple of months and the first version of the small 
H> cifed by the end of the year, and I am adding request list support to 
H> cif2cif.  CIF-based publication is a known, working approach to 
H> publication for small molecules.  If we keep going along the same lines 
H> for other areas, we could save years over what it would take to adopt any 
H> different, theoretically better, representation.

O> 5. Acknowledgments
O> ------------------
O> I am grateful to a number of people for their comments, explanations, and
O> discussions wrt mmCIF. These include Enrique Abola, Helen Berman, Frances
O> and Herbert Bernstein, Axel Brunger, Paula Fitzgerald, Kim Henrick, Peter
O> Keller, Phil McNeil, Stuart Moody, Jaime Prilusky, Joel Sussman, and
O> John Westbrook.


Regards
Brian