Crystallographic Binary File

This is an archive copy of the IUCr web site dating from 2008. For current content please visit https://www.iucr.org.

Crystallographic Binary File: Final Discussions

Below is a summary of the final discussions at the ImageNCIF meeting held at the Biology Department of Brookhaven National Laboratory on Monday-Wednesday, 20-22 October 1997. These notes are intended to clarify the handouts the attendees assembled just before the end of the meeting.

Tasks to Accomplish
- Capability of getting and putting single data items:
  probably accomplished by the existing CIFLIB or CIFPARSE APIs, or by minor modifications thereof:
  errcode = cbf$get_element("array_structure_list.dimension", &detwidth);
  . . . assuming the dictionary items already exist, of course.
- Multi-element gets and puts
  - Do we defer this capability till later?
  - Do we return explicit pointers to internal structures? No.
  - Set up (or co-opt) facility for returning an enumerated list of already-vetted data items.
- Arrays
  How do we do these? Probably existing codes in CIFPARSE will do it.
- Qualitative and non-machine-generated information
  GUIs and other I/O software need to provide easy ways of generating this kind of information. Mechanisms for requiring the user to input some of it will be useful, so the archive won't be missing important information later. Something equivalent to what an HTML forms interface does would help: the user's data-run information won't be accepted by the GUI unless the necessary items have been explicitly filled in.
- What code do we ourselves need to write?
  - Minor modifications or extensions of CIFLIB or CIFPARSE so that the CIF-like portions of the CBF can be read without error and with some level of dictionary-checking. Can the existing codes work on files that contain binary info? We don't know... ask John Westbrook.
  - Code to interpret that header information so that we can read the first binary segment.
  - Code to interpret that header information so that we can read the nth binary segment.
  - Code to read a binary segment into memory.
    How smart should this particular API be? General consensus: not very. It should limit its activity to the actual I/O, correcting the endian-ness of the data, and performing any necessary decompression. Thus not very many header items will need to be accessible to this routine: the total size of the disc file to read, a minimal understanding of the organizational structure of those data (width vs. height), the type of compression done, and the type of data present in the file. Anything beyond this should be the responsibility of a downstream API that looks at this memory buffer and pulls relevant items out of it. This routine should be general enough to handle arbitrary (two-dimensional?) binary data.
  - Code to write a binary memory buffer onto disc.
    Again, this one shouldn't be terribly smart: it should handle endian-ness and compression, and not much else.
  - Populate appropriate arrays or objects with contents of the binary buffer(s).
    This is the more difficult and application-specific step.
  - Memory-freeing routines.
    This is trivial if we're in C++, not quite so trivial in C. But it needs to be done in either case.
  - Read defined subsection of the image.
    A typical application of this concept arises if one wishes to extract a shoebox in (X,Y,Z) from a small group of images. This is relatively efficient if the data aren't compressed; if the data are compressed, it'll probably be slow.
- Data-item ordering:
  The question here is whether we wish to specify that the dictionary-entry items in the header will appear in some specific order. Jim Pflugrath reports that his users urged him to put the items in alphabetic order so they could find them more easily.
  - Unspecified on read.
  - Specified on writing?
  The consensus is that we won't specify it in the CBF standard (whatever that is!). The organization of the data will be the responsibility of the code that manages the ascii buffer, not the code that does file input-output operations.
- How many binary data types will we support in V0.1?
  - data type of input == data type of output
    ... where input might be 16-bit integers (signed or unsigned), 32-bit integers, 64-bit integers, IEEE 32-bit floating point? Consensus appears to be to delay support of floating-point input till later.
  - Compressed disc data could be uncompressed into memory.
  - 32-bit to unsigned 16-bit conversions. In this case the marker for 16-bit overflows would be 65535.
- Human reading of header
  1. If (see above) we specify an order for the dictionary-data items, then when we run `more' on the file, the output will come out in a defined order.
  2. The alternative is to leave the data in whatever order the header-generating code wishes to produce, and then produce a tool called cbf_beautifier that reformats the ascii of the header into a format the user wants. We could even make that pretty flexible: the user could produce a .cbfrc file that contains a list of mmCIF dictionary items that he/she wants to see, in the user's desired order; the code would then extract only those items from the header, sort them into the user's ordering, and print them.
- Preserving a history of how the image is manipulated
  - We need to remember whether the image has been dark-current-subtracted, spatially corrected, sensitivity-corrected, dezingered, . . .
  - mmCIF already provides for audit records; these will help a lot. But unless we're going to have multiple data blocks within a single header (undesirable!?) it'll be hard to preserve the whole record with the standard mmCIF audit formalism. Therefore, we should include data names that indicate specific types of manipulations of the image. The overall list isn't all that long; the cases mentioned above make up nearly the entire list.
- Spectra and other 2-D plottable data
  - This could be done with binary data blocks containing 2-D plot coordinate values (X,Y).
  - This could also be done in ascii, wherein the mmCIF dictionary itself would include data names for ascii (X,Y) pairs of data, along with control items like label_X, label_Y, range_X, range_Y, graph_title, log/linear. Consensus is that this should be an mmCIF issue, not a CBF issue. Jim Fait will discuss this with John Westbrook and the mmCIF community.
- Data names for experimental controls
  Specifying these names is associated with Bob Sweet's goal of having the header contain all the information needed to characterize the experiment. The aliasing mechanisms of DDL 2.1 allow us to fully populate specialized experimental control data categories even if some of those names already exist in other categories of the mmCIF dictionary.
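
The endian-ness correction mentioned above for the binary-segment reader could look something like the following. This is only a minimal sketch for 16-bit pixels: the function names (swap16, host_is_little_endian, fix_endianness16) and the file_is_little_endian flag are made up for illustration and are not part of CIFLIB, CIFPARSE, or any agreed CBF API. It assumes the header has already told us the byte order of the file.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of one 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Return 1 if this machine stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/*
 * Hypothetical helper: convert n 16-bit pixels in place from the byte
 * order of the file to the byte order of the host. The flag
 * file_is_little_endian is assumed to come from the header.
 */
void fix_endianness16(uint16_t *pixels, size_t n, int file_is_little_endian)
{
    size_t i;

    if ((file_is_little_endian != 0) == host_is_little_endian())
        return;                     /* byte orders agree: nothing to do */
    for (i = 0; i < n; i++)
        pixels[i] = swap16(pixels[i]);
}
```

Keeping this step inside the low-level reader is in the spirit of the "not very smart" consensus: the routine needs only the data type and byte order from the header, and downstream code never sees foreign byte order.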
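
For the "read defined subsection of the image" task, the uncompressed case is cheap because each row of the window is a contiguous run of pixels. The sketch below copies a rectangular window out of one image held in memory; a shoebox in (X,Y,Z) would repeat this for each image in the group. The name extract_window, the row-major layout, and the 16-bit pixel type are assumptions made for the example, not anything decided at the meeting.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy a box_w x box_h window whose first pixel is
 * (x0, y0) out of a row-major image of img_w x img_h pixels.
 * out must have room for box_w * box_h pixels.
 * Returns 0 on success, -1 if the window does not fit inside the image.
 */
int extract_window(const uint16_t *img, size_t img_w, size_t img_h,
                   size_t x0, size_t y0, size_t box_w, size_t box_h,
                   uint16_t *out)
{
    size_t row;

    /* Bounds checks written to avoid unsigned overflow. */
    if (x0 > img_w || box_w > img_w - x0 ||
        y0 > img_h || box_h > img_h - y0)
        return -1;

    /* One memcpy per row of the window. */
    for (row = 0; row < box_h; row++)
        memcpy(out + row * box_w,
               img + (y0 + row) * img_w + x0,
               box_w * sizeof *out);
    return 0;
}
```

With compressed data the same call would first need the whole segment (or at least every row up to the last one wanted) decompressed, which is why that case is expected to be slow.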
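
The 32-bit to unsigned 16-bit conversion with 65535 as the overflow marker might be sketched as below. The function name narrow_32_to_u16 is hypothetical. Two choices here are assumptions that the notes do not settle: negative inputs are also replaced by the marker, and an input of exactly 65535 is counted as an overflow because it cannot be told apart from the marker afterwards.

```c
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_MARKER_16 65535u   /* marker value named in the notes */

/*
 * Hypothetical helper: narrow n signed 32-bit pixels to unsigned 16-bit.
 * Values that do not fit below the marker (and, by assumption, negative
 * values) are replaced by OVERFLOW_MARKER_16.
 * Returns the number of pixels that were replaced by the marker.
 */
size_t narrow_32_to_u16(const int32_t *src, uint16_t *dst, size_t n)
{
    size_t i;
    size_t overflowed = 0;

    for (i = 0; i < n; i++) {
        if (src[i] < 0 || src[i] >= (int32_t)OVERFLOW_MARKER_16) {
            dst[i] = (uint16_t)OVERFLOW_MARKER_16;
            overflowed++;
        } else {
            dst[i] = (uint16_t)src[i];
        }
    }
    return overflowed;
}
```

Returning the overflow count lets the caller decide whether the true values need to be kept somewhere else, which this sketch does not attempt.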

Assignments of Tasks for CBF V0.1

N.B. I've re-ordered these tasks relative to the grotty-looking overhead we produced into an ordering that makes a bit more sense.
- Overall shepherding of the project: Bob Sweet, Andy Hammersley.
- Report on what we did at this meeting: Bob Sweet, perhaps by Thanksgiving.
- Maintaining a CBF homepage at NDB: John Westbrook, with contributions from many folks.
- Coordinate systems: Jim Pflugrath, via overall report from meeting.
- Further honing of file-structure syntax: Andy Hammersley.
- Assembling a list of data names needed for processing steps: Bob Sweet.
- Adding Dictionary Names: Andy Howard and Paula Fitzgerald, remembering to send the results quickly to Paul Ellis so he can use them!
- Maintaining integrity of dictionary additions according to DDL 2.1 syntax and current mmCIF names: John Westbrook.
- First coding of header data: Paul Ellis, to handle his MAR system.
- Data compression/decompression: Paul Ellis, Andy Hammersley.
- Publicizing what we're doing: IUCr via Brian McMahon; John Westbrook for NDB website; IUCr news; Synchrotron Radiation News.
- Subsequent code-writing: Yves Epelboin, Jim Fait, Andy Howard, many others.
- Testing of code: Yves Epelboin, Paul Ellis, Chris Nielsen, Andy Hammersley, Jim Fait, John Skinner.
- COMCIFS comments: Brian McMahon, David Brown.
- D*Trek implementation: Jim Pflugrath.
- Bruker software implementation: Jim Fait, Bruker employees.
- ADSC implementation: Chris Nielsen.
- MOSFLM interface: Paul Ellis.
- X-GEN interface: Andy Howard.
- Powder-diffraction interface: Andy Hammersley, Brian Toby, perhaps Carlo Segre.
- CBF-to-CIF translation tools: Herb Bernstein.
- DDL1.4 aliases: to be determined.
- Recorder of final discussions: Andy Howard.
