Discussion List Archives


Re: Self-described CIF proposal

Joe Krahn's request for a mechanism for defining local datanames within the CIF itself may be easier to implement than might be imagined, but only in the DDLm dictionaries that are now being prepared and evaluated.  DDLm is a new dictionary definition language that tidies up a lot of the loose ends that have become apparent in the DDL1 and DDL2 dictionaries in current use. DDLm is cleaner and more flexible, and programs designed to be used with CIF dictionaries written in DDLm will still be able to read all the legacy CIFs.  Without changing the way we write CIFs we will be able to access the advanced DDLm features such as methods.

One characteristic of the DDLm dictionaries is that they are likely to be smaller and more specialized, and will be assembled into a customized virtual dictionary each time a CIF is read.  DDLm contains straightforward procedures for importing and merging dictionaries, and these can include private dictionaries as well as the IUCr approved dictionaries.  This process is controlled by a head dictionary that contains _import statements listing the different component dictionaries that are to be imported and assembled.
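As a rough illustration of the assembly step described above, the following Python sketch merges component dictionaries in the order a head dictionary lists them. It is not DDLm itself: the in-memory model (a dictionary as a mapping from dataname to definition), the function name assemble_virtual_dictionary, the component names and the rule that a later import overrides an earlier one are all assumptions made for the example.

```python
# Hypothetical model: each dictionary maps a dataname to its definition,
# and a head dictionary is an ordered list of component dictionary names.

def assemble_virtual_dictionary(head_imports, available):
    """Merge the component dictionaries named in head_imports, in order."""
    virtual = {}
    for name in head_imports:
        if name not in available:
            raise KeyError(f"component dictionary not found: {name}")
        # Assumption for this sketch: a later import overrides an earlier
        # definition of the same dataname. Real DDLm rules may differ.
        virtual.update(available[name])
    return virtual


# Example: one approved-style component plus one private component.
available = {
    "core": {"_atom_site.label": {"type": "Text"}},
    "private": {"_atom_site.mass": {"type": "Real", "units": "dalton"}},
}
virtual = assemble_virtual_dictionary(["core", "private"], available)
print(sorted(virtual))
```

The point of the sketch is only that the private dictionary takes part in the merge on the same footing as the approved ones.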

For most CIFs a standard head dictionary would be retrieved from the IUCr web site (or a local directory where a copy is stored), but the head dictionary could be a custom dictionary stored on a local web site, though in this case its URI would have to be stable as long as the CIF itself remained archived or the CIF could become unreadable.  Alternatively such a customized head dictionary could be included as a text item in the CIF itself.

In order to locate the correct CIF dictionary, each CIF will include either a number of _audit items identifying the head dictionary, its version and location, or an _audit text item whose value is the head dictionary itself.  While most CIFs will opt for one of the IUCr templates, specialty CIFs can create their own custom virtual dictionary using the embedded head dictionary.   Such a head dictionary could also include one-off definitions if desired, though such items would only apply to the CIF in which the head dictionary is embedded.

Programs that are DDLm compatible will read in a CIF, look for _audit items to locate the head dictionary, load and assemble the virtual dictionary and then interpret the CIF.  It will be possible to use a dictionary written in DDLm to read in legacy CIFs, and the software to do this would be able to exploit the new features of DDLm.  Existing CIFs could, of course, still be read by currently available software designed to work with DDL1 and DDL2 dictionaries.  It will not, however, be possible to use existing software to read CIFs written with a DDLm dictionary.  For this reason DDLm dictionaries will initially be more of a programming language than a language for writing CIFs.
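The lookup step described above might be sketched in Python as follows. The datanames _audit.head_dictionary_text and _audit.head_dictionary_uri are placeholders invented for this example (the actual _audit item names are not given here), as are the function name, the example.org URI and the idea of passing in a fetch callable.

```python
def locate_head_dictionary(cif_items, fetch):
    """Return the head dictionary text for an already parsed CIF.

    cif_items: mapping of dataname -> value (hypothetical parsed form).
    fetch: callable that takes a URI and returns the dictionary text.
    """
    # Hypothetical dataname: an embedded head dictionary takes priority.
    embedded = cif_items.get("_audit.head_dictionary_text")
    if embedded is not None:
        return embedded
    # Hypothetical dataname: otherwise fall back to the stated location.
    uri = cif_items.get("_audit.head_dictionary_uri")
    if uri is None:
        raise ValueError("CIF does not identify a head dictionary")
    return fetch(uri)


# Example with a stand-in for the web site or local directory.
store = {"https://example.org/head.dic": "data_HEAD"}
cif = {"_audit.head_dictionary_uri": "https://example.org/head.dic"}
head_text = locate_head_dictionary(cif, store.__getitem__)
```

Once the head dictionary text is in hand, the program would go on to load and assemble the virtual dictionary before interpreting the CIF.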

To come back to Joe's point, it is currently possible to include privately defined items in CIFs written with the DDL1 and DDL2 dictionaries, but it is awkward.  There is nothing to stop a dictionary definition being included as a text field in a CIF (provided semicolon delimiters are not used) but there is no protocol for extracting this information and including it as part of the dictionary.  There are protocols for merging DDL1- and DDL2-based dictionaries but they are external to the DDLs and the dictionaries.  On the other hand DDLm expects that dictionaries will be routinely merged and the machinery to do this is built into DDLm.

We are hoping that DDLm will receive COMCIFS approval before the end of the year, along with the first dictionaries.  Other dictionaries will follow and DDLm software can be brought into use as the dictionaries are approved.

Those of us who are putting together the first round of DDLm dictionaries and programs would welcome any comments.

David Brown




Joe Krahn wrote:
CIF relies on dictionaries to parse data correctly. The underlying STAR
format does not have a well-defined system for representing
general-purpose data, and leaves these details to a higher-level
specification.

My proposal is to define a "self-described CIF" format. I mentioned this
before, but there was not a lot of interest. I assume that this is
because most CIF developers are working with standardized databases,
where dealing with non-standard self-described data is difficult.
Experimentalists often need to store general-purpose data that cannot
always be handled by trying to create a dictionary that covers all
possible needs. In my opinion, STAR should be flexible enough to
represent data in a manner similar to NetCDF.

The general syntax can be that a CIF data block can contain save-frames
that represent data in the same manner as save-frames within a
dictionary. Dictionary data that is not in a save-frame will have to be
contained in a special save frame, which could be named "dictionary", or
some form of 'un-named' tag such as a single underscore.

As a simple example of user-defined data, this could be inserted in a data
block that includes a mass for each atom, but also uses the dictionary
for everything else. To avoid conflicts, non-standard values used in the
context of a standard dictionary could all require a "[user]" prefix.

data_XXX
save__atom_site.[user]mass
    _item_description.description 'Atomic mass for this atom.'
    _item_type.code float
    _item_units.code 'unified_atomic_mass'
    save_
...
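
A reader following the "[user]" prefix convention proposed above could tell locally defined items from dictionary items with a check along these lines. This is only a sketch in Python: the function names are invented for the example, and it assumes tags of the form _category.item as in the save frame above.

```python
USER_PREFIX = "[user]"  # prefix suggested in the proposal above


def split_tag(tag):
    """Split a tag of the form '_category.item' into (category, item)."""
    if not tag.startswith("_") or "." not in tag:
        raise ValueError(f"not a category.item tag: {tag!r}")
    category, item = tag[1:].split(".", 1)
    return category, item


def is_user_defined(tag):
    """True if the item part of the tag carries the user prefix."""
    _, item = split_tag(tag)
    return item.startswith(USER_PREFIX)


print(is_user_defined("_atom_site.[user]mass"))
print(is_user_defined("_atom_site.label"))
```

Items flagged this way would be looked up in the save frames of the data block, and everything else in the standard dictionary.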


For dictionary-oriented data, this idea can still be useful for tagging
a data block with the matching dictionary, for example:

data_XXX
save_dictionary
    _dictionary.title           mmcif_std.dic
    _dictionary.version         2.0.10
    save_
...

Current mmCIF files contain "_audit_conform" entries, but it seems more
useful to have a general mechanism rather than identifying the
dictionary within dictionary-defined fields. Of course, this could also
be done with some sort of formatted comment on the first or second line
of the file.

I think this should be a fairly simple extension to CIF. If CIF
developers don't want to change CIF, this idea could also be implemented
as an alternative STAR implementation, or it could be explicitly defined
as a CIF extension rather than a change to CIF itself.

Joe Krahn
_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers

begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:idbrown@mcmaster.ca
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard
