(39) IT volume?; image standard; front of dictionary; data collection

To: [email protected]
Subject: (39) IT volume?; image standard; front of dictionary; data collection
From: bm
Date: Wed, 20 Dec 1995 14:25:11 GMT
Dear Colleagues

Some correspondence on my general introductory review in message (38):

BNF
===

Syd Hall remarked:
S> Good news about the BNF upgrade being on its way.

David Brown inquired:
D> Can someone enlighten me on BNF? 

My apologies for not being more explicit. BNF stands for Backus-Naur
form. It's a formal representation of the syntax of a computer language
(which is what CIF is, in these terms). Increasingly it is the expectation
of software engineers that a formal data representation should be supported
by machine-readable syntax descriptions, applications programming interfaces
and other devices to facilitate mechanical software development. It's for this
reason that some of the COMCIFS discussions have taken on more of a
computer-science slant than might have first been expected.

The motivation behind the current work was to give an accurate and formal
description of the STAR syntax in a form suitable for parser design using
tools familiar to compiler writers, such as lex and yacc. In addition, I
became aware when writing my CIF syntax checker that there were possible
ambiguities in the published STAR descriptions, so a further aim was to
remove or clarify these grey areas.
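
As a rough illustration of how a formal grammar drives a parser, here is a
minimal sketch in Python. The grammar in the comments is a deliberately
simplified toy and is NOT the STAR or CIF BNF: it covers only data block
headings, comments, and tag/value pairs with bare or single-quoted values.
The function name parse and the token names are assumptions made for this
example.

```python
import re

# Simplified, illustrative grammar (an assumption, NOT the STAR or CIF BNF):
#   <file>  ::= <block>*
#   <block> ::= <heading> <item>*
#   <item>  ::= <tag> <value>
#   <value> ::= <quoted> | <bare>
TOKEN = re.compile(
    r"(?P<comment>#.*)"
    r"|(?P<heading>data_\S+)"
    r"|(?P<tag>_\S+)"
    r"|(?P<quoted>'[^'\n]*')"
    r"|(?P<bare>\S+)"
)


def parse(text: str) -> dict[str, dict[str, str]]:
    """Parse the toy grammar into {block_name: {tag: value}}."""
    blocks: dict[str, dict[str, str]] = {}
    current = None       # dictionary of the block being filled
    pending_tag = None   # tag that is still waiting for its value
    for match in TOKEN.finditer(text):
        kind, token = match.lastgroup, match.group()
        if kind == "comment":
            continue
        if kind == "heading":
            if pending_tag is not None:
                raise SyntaxError(f"{pending_tag} has no value")
            current = blocks.setdefault(token[len("data_"):], {})
        elif kind == "tag":
            if current is None or pending_tag is not None:
                raise SyntaxError(f"unexpected tag {token}")
            pending_tag = token
        else:  # a quoted or bare value
            if pending_tag is None:
                raise SyntaxError(f"value {token} has no tag")
            current[pending_tag] = token.strip("'") if kind == "quoted" else token
            pending_tag = None
    if pending_tag is not None:
        raise SyntaxError(f"{pending_tag} has no value")
    return blocks
```

Each grammar rule corresponds to one branch of the loop, which is the
property that makes a formal description attractive: an ambiguity in the
grammar shows up immediately as a branch whose behaviour cannot be decided.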

Progress on revising dictionaries
=================================

From Syd:
S> Even better news about the powder and mm dictionaries. Brian T and Paula
S> deserve enormous congrats for completing these quite massive tasks... those 
S> of us who have done the lesser exercise with the core definitions truly
S> understand what must have been involved. 

Peter Murray-Rust asks:
PMR> Is there a sample of the new dictionary and the DDL that it uses?

The DDL is the canonical version 1.4 dictionary: it's at
ftp://ftp.iucr.ac.uk/pub/ddldic.c95 (I realise the CIF home page needs
radical updating now, but the link under "Current CIF Dictionaries" to the
DDL1.4 file is working).

I hope to be able to release the dictionary itself in the next few weeks.

On publishing the CIF standards
===============================

S> As you are aware, and have stated, there are a range of possible publication
S> options for the cif dictionaries and supporting tools, in addition to those
S> already available on the net. I think it worth emphasising that the primary
S> notification of these efforts must still be the normal publication modes.
S> And Paula, Brian, JohnW etc. should be encouraged to get these underway asap.
S> [eg. John is going to put his SIFLIB into JAC at the same time as Herb 
S> Bernstein and I am submitting the CIFtbx2 paper, and I have recommended that 
S> the DDL2 paper goes into JCICS as did DDL and the STAR papers]. The
S> publication options you have mentioned are the follow up material that will
S> give detailed reference and archive data.

Right. At the request of the IUCr President, I have drawn up a very sketchy
outline proposal for a volume of International Tables that might include
these reference and archive materials. Note that it includes provision for
documentation of SIFLIB in greater depth than would be appropriate for the
JAC paper describing its functionality. It may be debatable whether such
program-specific documentation is appropriate for International Tables, but I
would argue that it is, given that it effectively defines an application
programming interface for mmCIF handling (see my comments above on the
requirements for defining a full CIF environment).

| Proposal for International Tables Volume on Crystallographic Information
| ------------------------------------------------------------------------
| A proposal was advanced during the AsCA'95 meeting in Bangkok for the
| publication of the official CIF dictionaries and attendant documentation in
| International Tables format. This would represent a permanent record of the
| official definitions recognised globally for use in Crystallographic
| Information Files. The dictionaries are held as electronic files
| (themselves in CIF-like format), which may easily be converted to text
| format in the style of the published Core dictionary (Acta Cryst. (1991)
| A47, 655-685).
| 
| The volume should contain, as a minimum:
| 
|         DDL1.4 (base Dictionary Definition Language as used in Core): 3pp.
|         DDL2.1 (relational DDL as used in mmCIF): 7pp
|         Core Dictionary, version 1996: 33pp.
|         Powder extensions: 12 pp.
|         Macromolecular (mmCIF) Dictionary: 102pp.
|         (Molecular Information File core dictionary: 3pp. ?)
| 
|         (Page estimates are based on current dictionary versions)
| 
|         Primary papers on STAR and DDL (reprinted from J. Chem. Inf. Comput.
|         Sci.) and CIF.
| 
|         Commentary papers on CIF and/or the associated dictionaries.
| 
| And possibly also:
| 
|         Software documentation for standard CIF software libraries
|         (essentially this is to define an API (applications programming
|         interface) to ensure consistent CIF handling, rather than to
|         supply descriptions of individual software packages).
| 
|         Handbooks of usage for the standard crystallographic databases.
| 
|         Other tabulations of importance in crystallographic data
|         representation (standard image file formats, data compression
|         algorithms...?)
| 
| Future editions will also include further dictionaries:
| 
|         Incommensurate structures
|         Symmetry
|         Charge density
| 
| The volume would naturally be complemented by a CD-ROM including the
| dictionaries in machine-readable format; and perhaps also the standard
| libraries, compiled for various platforms. Indeed a library of standard CIF
| i/o and validation public-domain software could be included.


D37.1 Lengths of data names
---------------------------

S> I want it here for the record that I was not the advocate for restricting
S> the line lengths and data name lengths for CIF. My original proposal was that
S> the cif would have a star syntax. But I was told by other programmers of the 
S> day (circa 1988, but some of whom still flourish!) that cif would not catch
S> on if it were so open-ended. This, and the dropping of save frames and nested
S> loops, seemed to be a compromise necessary for their cooperation in this
S> new venture. At the time I think it was the right decision ...but now?
S> Perhaps it's time for an expansion to the full star syntax...PROVIDED that
S> our main collaborators are in agreement! COMCIFS must keep in mind that the 
S> success of cif has been because the whole software community has been able
S> to participate in this standard...it must not become simply a comsci
S> plaything. Nick and I are in agreement about the need for this expansion
S> and I think his summary of this is excellent. Perhaps the only thing we
S> differ on is how does one get the whole community and software culture
S> into applying these upgrades. I believe this has to be handled very
S> carefully and with much advance warning.

However, such a development should not take place until after the current
mmCIF dictionary and DDL are unleashed in their present form. How do people
view the next phase of development?

The project to develop a standard image data format that I described in the
last circular has generated a substantial amount of discussion. Let me, for
indexing purposes, retrospectively assign this thread the title

D38.1 A standard for image data
-------------------------------

Peter Murray-Rust has drawn to my attention another initiative for the
transmission of multidimensional data:

PMR> Another initiative which might be worth following up is the development 
PMR> of standards to send 3- (and multidimensional) data fields using the MIME 
PMR> technology.  The prime mover in this is Scott Nelson at lanl.gov.  Scott 
PMR> and colleagues proposed a MIME type mesh/* for the transmission of such
PMR> data (I'm not sure whether HDF was one of the components).  I'm sure that 
PMR> crystallographers should investigate this as a way of exchanging 3-D 
PMR> fields such as electron density maps.

In respect of the imgCIF proposal (now rechristened imageNCIF), I asked Andy
Hammersley to define what he saw as its goals. He replied:

AH> The aim of "imageNCIF" is to "standardize" the passing of image (and other) 
AH> crystallographic experimental data from: one institute to another;
AH> one make of computer system to another; and from one computer program 
AH> (acquisition or analysis) to another. If a sufficiently large number
AH> of institutes/ programmers/ and producers of detector equipment can
AH> agree on a common format then the present task of having to support 
AH> numerous new and different image (data) formats will be at least 
AH> lessened.

This of course reads like a manifesto of CIF, and so why not have a common
format for including image data as well as all the other types of
crystallographic data that we can represent in CIF?

Well, one sticking point has been the requirement for image data to be
transported in binary format, for compactness and efficiency. This is in
absolute breach of the requirement that STAR files be (ASCII) text based.
Syd is particularly emphatic in maintaining that it is not appropriate to
breach an accepted standard at such a fundamental level:

S> ImageNCIF: why have a syntax that is ALMOST star compliant? Why not, if
S> only for the sake of star compliant browsers, editors, etc. put delimiters
S> for the "_image_data" value. I'm not concerned about the portability of
S> binary data (presumably it's encoded externally) but it seems stupid to not
S> encapsulate such data so that it can be handled without reference to a
S> field-length value. I think we must say so...and perhaps more!
S> Here is what I would suggest.
S> 
S>          Since star does not impose line length constraints, why not simply
S>          encapsulate the binary string in semicolons. Ie.
S> 
S>          _image_data
S>          ;
S>          w+fh;auweh8y2! dchtgKth xcdluyqgldjy ZGXnJZHGClUAHs;diuydq ipwye..
S>          ;
S>          
S>          The string is not ascii text but this is a smaller crime than
S>          [requiring a field-length value]

However, there are technical problems with this - a pure binary stream can
contain successive bytes that might be interpreted as <newline><semicolon>
(and in different ways by different operating systems!), and so generate
spurious delimiters.
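
To make the problem concrete, here is a minimal Python sketch that scans a
raw byte stream for positions where a semicolon directly follows a
line-feed or carriage-return byte, i.e. where a reader might see the end of
a semicolon-delimited text field. The function name is an assumption for
this example, and the check is only an approximation of how a real STAR
parser would treat line ends on a given operating system.

```python
def spurious_delimiters(payload: bytes) -> list[int]:
    """Return offsets of ';' bytes that directly follow a CR or LF byte."""
    line_ends = (0x0A, 0x0D)   # LF and CR
    semicolon = 0x3B           # ';'
    return [
        i
        for i in range(1, len(payload))
        if payload[i] == semicolon and payload[i - 1] in line_ends
    ]


# Six arbitrary "image" bytes that happen to contain LF ';' and CR ';'.
raw = bytes([0x12, 0x0A, 0x3B, 0xFF, 0x0D, 0x3B])
print(spurious_delimiters(raw))
```

Any non-empty result means the payload could not safely be placed between
semicolon delimiters without some form of encoding or escaping.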

S> The IUCr has control over the STAR format and on more than one occasion 
S> this has been invoked to prevent violations of the syntax. We should
S> obviously welcome other disciplines using our interchange methods but it
S> will certainly do long term damage if there are mutant star and 
S> cif formats in common use. 

I have suggested to the imgcif-l discussion list the following three ways of
handling image data files without resorting to STAR modifications:

(1) ASCII encode the data within the CIF, and accept the performance penalty
    in extracting and decoding it.

(2) Maintain the image data as external files, and have the header
    information in separate CIF files. Then you need to have well defined
    protocols for referencing the data files from within the CIFs; and you
    incur the penalty of maintaining two files for every image.

(3) Have a completely separate standard for image data. Use an existing
    implementation (HDF, FITS, etc). Does this really matter?
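
For option (1), a minimal sketch in Python, assuming base64 as the ASCII
encoding (the discussion has not settled on any particular encoding, so
this choice, the function names, and the 76-character line width are
assumptions for illustration). Because the base64 alphabet contains no
semicolon, no encoded line can begin with one, so the field can be wrapped
in semicolon delimiters safely. The cost is the decoding step and a size
increase of roughly one third.

```python
import base64


def encode_image_field(name: str, payload: bytes, width: int = 76) -> str:
    """Wrap binary data as a semicolon-delimited, base64-encoded text field."""
    text = base64.b64encode(payload).decode("ascii")
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return "\n".join([name, ";", *lines, ";"]) + "\n"


def decode_image_field(field: str) -> bytes:
    """Recover the binary data from a field written by encode_image_field."""
    lines = field.splitlines()
    body = lines[2:-1]   # skip the data name and opening ';', drop closing ';'
    return base64.b64decode("".join(body))


pixels = bytes(range(256))                 # stand-in for real image data
field = encode_image_field("_image_data", pixels)
assert decode_image_field(field) == pixels
print(len(pixels), "bytes become", len(field), "characters")
```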

Discussions on the relative merits of these approaches are continuing (there
seems some consensual drift towards (2)), but Andy Hammersley has also
proposed a mechanism of embedding STAR fields in a larger file of some more
general format:

AH> 4: Break with existing CIF to at least some extent, probably because a
AH>    file contains binary data. This could be "CIF" sections within a binary
AH>    file, or could be any other non-CIF format.
AH> 
AH> The difference between proposal 2 and 4. 
AH> 
AH>    Proposal 2 involves a separate "header" file, and a separate "data" 
AH>    (binary) file. Proposal 4 (in at least one scenario) would store both 
AH>    "header" information and "data" in the same file.
AH> 
AH>    An advantage of proposal 2 (compared to 4) would be that existing CIF
AH>    tools would be able to work directly on the "header" file. For 4, if the
AH>    "header" section is very similar to CIF, then an extraction tool could be
AH>    used to extract the header section and create an ASCII file which could 
AH>    then be used with standard CIF tools. If the format used is well
AH>    defined and simply related to CIF (whilst being "binary"), this
AH>    extraction tool could be very simple to write, and be very portable.
AH> 
AH>    A disadvantage of proposal 2 is that a single "image" would be stored in 
AH>    two separate files. This provides the opportunity for the "header"
AH>    information to be separated from the "data". (Remember Murphy's law.)
AH>    Depending on how a sequence of n "images" was stored, you might have
AH>    2*n files or n+1 files.
AH> 
AH>    Image formats of type 2 do exist (e.g. the Hamburg OTOKO / BSL format), but 
AH>    the vast majority of existing image formats fall into the category
AH>    defined by proposal 4.

Is this approach something that COMCIFS could sanction (i.e. the embedding of
CIF "data streams" within a different file format)? The extracted CIF data
would, of course, need to be fully compliant with a standalone CIF file.
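
To indicate how simple the extraction tool mentioned for proposal 4 might
be, here is a minimal Python sketch. It assumes a purely hypothetical file
layout in which an ASCII CIF-like header is followed by a fixed marker line
and then the binary data. The marker string, the function name, and the
_image_width data name in the usage lines are inventions for this example
and do not correspond to any agreed format.

```python
# Hypothetical marker separating the ASCII header from the binary section.
HEADER_END = b"\n--END-OF-CIF-HEADER--\n"


def split_header(blob: bytes) -> tuple[str, bytes]:
    """Split a combined file into (CIF header text, binary image data)."""
    head, marker, data = blob.partition(HEADER_END)
    if not marker:
        raise ValueError("no header terminator found")
    # The header must be pure ASCII to be usable by standard CIF tools.
    return head.decode("ascii") + "\n", data


combined = b"data_image_1\n_image_width 2" + HEADER_END + bytes([0, 255, 10, 59])
header, image = split_header(combined)
print(header)
print(len(image), "bytes of image data")
```

The extracted header could then be written to a file and passed to existing
CIF tools unchanged, provided it is itself a compliant CIF.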

##############################################################################

A couple of new threads connected with the Core revision.

D39.1 Introductory data blocks
------------------------------
We have considered various ways of organising the preliminary data in a CIF
dictionary (i.e. those that refer to the dictionary itself). See, for
example, D25.7. As the proposals for STAR primitives and their use in
dictionaries have evolved, the time seems right to have another look at this.
John Westbrook and I agreed the following protocol in Bangkok for DDL1
compliant dictionaries. It broadly follows the structure in D25.7, but with a
few differences which I shall indicate.

We shall use "standard" introductory datablock names:

 data_on_this_dictionary   # will contain the dictionary identifier and history
        _dictionary_name     cif_core.dic
        _dictionary_version  2.0
        _dictionary_history  '1999-99-99  blah-blah-blah etc'

 data_include_dependent_dictionaries   # contains references to other dicts
  #\include  http://www.iucr.ac.uk/cif/ddl_core.dic

Note the #\include preprocessor directive in this example. But note also the
constraints imposed by this construction - ddl_core.dic will be automatically
included in any dictionary that includes the Core. I have an open mind
whether to put the data_include_dependent_dictionaries block into the current
Core revision at all; but by default I *shall*, unless I hear objections.

The global_ declarations will be OMITTED from the CIF Core dictionary. John
is very worried about the implementation of global_'s and unspecified
#\include's together in the same file, and it seems safest to leave out at
least the global_ block, since its function is accomplished by recognition
and implementation of the _enumeration_default values in the DDL dictionary.
In case I've lost you (again), the earlier proposal was to have
     global_
         _list                  no
         _list_mandatory        no
         _list_level            1
         _enumeration           ?
         _enumeration_default   ?
         _type_conditions       none

However, all of these (with the exception of _type_conditions) have
_enumeration_default values in the DDL1.4 dictionary that match the
values given ('no', 'no', '1' etc). There seems nothing to be gained by
duplicating this information. Because _type_conditions has no
_enumeration_default, I need to add an explicit _type_conditions field to
each definition block in the CIF dictionary.
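
The lookup that replaces the global_ block can be sketched as follows, in
Python. This is an illustration of the logic only, not code from any
existing CIF library. The table holds just the three default values quoted
above; the function name and the _example_item definition are assumptions
made for this example.

```python
# Defaults quoted above from the DDL1.4 dictionary (_enumeration_default).
DDL_DEFAULTS = {
    "_list": "no",
    "_list_mandatory": "no",
    "_list_level": "1",
}


def resolve(definition: dict[str, str], attribute: str) -> str:
    """Return an attribute of a definition, falling back to the DDL default."""
    if attribute in definition:
        return definition[attribute]
    if attribute in DDL_DEFAULTS:
        return DDL_DEFAULTS[attribute]
    # No default exists (as for _type_conditions), so the definition
    # block itself must state the value explicitly.
    raise KeyError(f"{attribute} has no default and is not given explicitly")


# Hypothetical definition block, reduced to the attributes it states itself.
definition = {"_name": "_example_item", "_type_conditions": "none"}
print(resolve(definition, "_list"))             # taken from the DDL default
print(resolve(definition, "_type_conditions"))  # taken from the definition
```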

D39.2 Additional items for data collection strategies
-----------------------------------------------------

I've received the following message from Keith Watenpaugh:

KW>    I have been making a valiant effort to convince our small molecule
KW> crystallographers to archive their structural data in a CIF format and 
KW> have that format carry over to entry into our internal databases and
KW> graphics software. Since they want to store some of the details about
KW> the data collection in a machine-parseable form (something Helen was
KW> strongly wanting, as well) I started looking for the keywords for items
KW> such as scan-speed and background time. I was surprised that they are
KW> missing, while items such as:  '_diffrn_refln_counts_bg_1'
KW>                                '_diffrn_refln_counts_bg_2'
KW>                                '_diffrn_refln_counts_net'
KW>                                '_diffrn_refln_counts_peak'
KW>                                '_diffrn_refln_counts_total'
KW> are there, but would not be usable if different scan rates are used. In
KW> fact, this is usually the case with the Siemens diffractometer data
KW> collection. Also, unless one knows the time for which a background was
KW> collected, you can't correct the "...peak" by the "...bg_1/2" items.
KW> Even if using constant scan rates and background counting times, putting
KW> these numbers away in '_diffrn_measurement_details' as character strings
KW> is probably not the way to go. Have I missed something or are items such
KW> as '_diffrn_measurement_scan_rate' and '_diffrn_refln_bg_time', etc. needed?

This seems entirely reasonable. Does anyone wish to proffer _definition's
for the *_scan_rate and *_bg_time proposals; or, better, is anyone able to
supply a more complete set of definitions that will cover all the
requirements for a variable-rate scan? Should I encourage Keith to supply an
extended set?
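
To show why the timing items matter, here is a minimal Python sketch of a
background correction for a single reflection. It assumes the simplest
case: the peak is counted during a scan of known width at a known rate, and
each of the two backgrounds is counted for the same fixed time. The
function and parameter names are assumptions for this example, and the
units (degrees, degrees per second, seconds) are chosen for illustration
rather than taken from any dictionary definition.

```python
def net_counts(peak: float, bg_1: float, bg_2: float,
               scan_width: float, scan_rate: float, bg_time: float) -> float:
    """Scale the background counts to the peak counting time and subtract."""
    peak_time = scan_width / scan_rate      # seconds spent scanning the peak
    scale = peak_time / (2.0 * bg_time)     # two background measurements
    return peak - scale * (bg_1 + bg_2)


# A 2.5 degree scan at 0.125 degrees per second takes 20 s; each background
# is counted for 5 s, so the backgrounds are scaled by 20 / 10 = 2.
print(net_counts(peak=1000, bg_1=40, bg_2=60,
                 scan_width=2.5, scan_rate=0.125, bg_time=5.0))
```

Without the scan rate and background time the scale factor is unknown, and
the stored peak and background counts cannot be combined, which is the
point of Keith's message.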

########################################################################

With that, I am now going to run for cover until January 8.

Best wishes for a merry Christmastide
Brian