Some suggestions on image data files

  • To: Multiple recipients of list <imgcif-l@bnl.gov>
  • Subject: Some suggestions on image data files
  • From: "J.W. Pflugrath" <JWP@msc.com>
  • Date: Tue, 9 Jan 1996 10:47:23 -0500 (EST)

I have not aired any thoughts on this list due to other pressing matters,
but Andy Hammersley prodded me a little bit to participate, so here goes.
Maybe this will also inject some momentum into the discussion.  Some
of this is old, but here goes anyway.  I summarize my conclusion at the bottom,
so jump straight there to save reading time.

>BASIC IDEA
>----------
>
>There seems to be general consensus on the idea of an image format based on 
>a binary file, in which the first part consists of an ASCII "header"; pairs 
>of keywords and values. This "header" describes the binary data which 
>follows at some point after the header. 

Ok, this is a consensus.  Can we assume from now on that we will have
an ASCII header with binary data that follows?

>The idea is to extend the CIF core dictionary to include the necessary 
>information for describing the images and the nature of the binary data. 
>This could be done within the existing CIF concept, but with the obvious 
>exception that the file would be binary and not of variable length ASCII 
>records.

So we use a CIF(-like) format for the ASCII header.  We use binary for the 
data that follows.

>SPECIFICATIONS
>--------------
>
>1. Simplicity: The format should be easy to understand and program.

ASCII header + binary data (AH+BD).  Now that's easy.

>2. General Availability: Should be suitable for programming in "C", 
>Fortran, and other programming languages. Should be suitable for all
>common operating systems.

AH+BD causes no problems there. I have used AH+BD successfully on a variety 
of architectures (Unix, VMS, DOS) with Fortran, C and C++ languages.
Despite the discussion about problems with binary data, there are really
none as far as programming is concerned.

>3. Should allow storage of "raw" detector values.

It should also allow storage of compressed, modified and/or subsets of the
raw detector values.

>4. Should be extensible e.g. future possibility to cover spiral read-out 
>   detectors.

An ASCII header is extensible.  If the header describes the binary data, then
since the header is extensible, its description of the binary data is
extensible and hence the binary data is extensible.

>KEY QUESTIONS
>-------------

>1. Start very simple and limited, but leave the necessary mechanisms in place
>for future development.

>2. Concentrate on the image data, and necessary descriptors.

Some comments on the above items.  First, detector image raw data in binary 
form is not much different from other data that crystallographers work with.
If it is not compressed, to read the data one only needs to know the number 
of data points (pixels), their data type (integer, floatIEEE, floatVMS, 
floatOther), the pixel size (1, 2, 3, 4, n-bytes), and byte-order (big, 
little, other-endian).  Of course that supposes no inserting or padding of 
extra bytes anywhere.  If there is compression, then it is more complicated.
Indeed many image formats are compressed: Siemens (pixels >= 255 are in
a separate lookup table), Enraf-Nonius FAST (it is a Real*2 format),
Rigaku R-AXIS (pixels above 32767 are compressed), and others as well.
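
As a rough illustration of the uncompressed case, the following Python sketch
unpacks a pixel stream given only the dimensions, data type, pixel size and
byte order.  The function name read_pixels, the STRUCT_CODES table and the
type labels in it are hypothetical, chosen for this example only; the sketch
assumes no padding bytes and no compression.

```python
import struct
from math import prod

# Hypothetical mapping from (data type, bytes per pixel) to struct codes.
STRUCT_CODES = {
    ("unsigned int", 1): "B",
    ("unsigned int", 2): "H",
    ("unsigned int", 4): "I",
    ("signed int", 2): "h",
    ("signed int", 4): "i",
    ("float IEEE", 4): "f",
}


def read_pixels(raw, dims, data_type, pixel_bytes, endian):
    """Unpack an uncompressed, unpadded pixel stream into a flat tuple."""
    count = prod(dims)  # total number of data points
    expected = count * pixel_bytes
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    order = ">" if endian == "big" else "<"
    code = STRUCT_CODES[(data_type, pixel_bytes)]
    return struct.unpack(f"{order}{count}{code}", raw)


# A 2 x 2 image of 2-byte unsigned pixels, stored big-endian.
raw = bytes([0x01, 0x02, 0xFF, 0xFF, 0x00, 0x00, 0x80, 0x00])
print(read_pixels(raw, (2, 2), "unsigned int", 2, "big"))
```

Reading the same bytes with the wrong byte order gives different pixel
values without raising any error, which is why the byte order has to be
recorded in the header.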

The above information is also needed for electron density map files.
In fact, there are few, if any, real differences between what is in an image
file and what is in a map file.  There is not much difference between a
1D, 2D, 3D and nDimensional image if they are all treated as a stream of
pixels.

So if we have a header that describes the data, we will need keywords
to describe the following items.  Pick your own keywords (I have capitalized
mine), your own separator between keyword and attributes (I use =, but you
can use a space or a : or anything else), and your own terminator for the
attributes (I use ;, but you could use something else).

If we use a single keyword with an array of attributes for the number
of DataPoints, as in:
  DATA_SIZE= NumDimensions SizeinDim1 SizeinDim2 SizeinDim3 
             ... SizeinDim(NumDimensions);
then we leave open whether images are 1D, 2D, or nDimensional.
So a simple header might contain (in any order):

  DATA_SIZE= 2 3076 3076;                                  
  DATA_TYPE= unsigned short int;
  DATA_ENDIAN= big endian;
  COMPRESSION= None;

with some extra header delimiters indicating the start of the header and
the end of the header.  For example, a format similar (not identical) to one
used by MADNES and some other programs might use the following header
for a 3076 x 3076 pixel image with 2-byte pixels that go from 0-65535.

{
HEADER_BYTES=  512;
DATA_SIZE= 2 3076 3076;
DATA_TYPE= unsigned short int;
DATA_ENDIAN= big endian;
COMPRESSION= None;
}
^L                                                          
... padding with spaces out to 512 total characters

(Newlines are not shown above, ^L is a formfeed character).  One reason
to pad out to 512 bytes or to have a minimum size header (no maximum
specified) is to allow fast reading of images on VMS platforms with QIOs, 
but this is a nicety not a requirement.  (I would suggest it be a requirement).
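
A minimal Python sketch of writing such a header follows.  It assumes the
layout shown above ("{", keyword lines, "}", a formfeed, then spaces) and
rounds the length up to a multiple of 512.  The names build_header and BLOCK
and the fixed five-character width of the HEADER_BYTES value are choices made
for this example, not part of any agreed format.

```python
BLOCK = 512  # header length is rounded up to a multiple of this


def build_header(items):
    """Return an ASCII header padded with spaces to a multiple of BLOCK bytes.

    items is a sequence of (keyword, value string) pairs.
    """
    body = "".join(f"{key}= {value};\n" for key, value in items)
    # HEADER_BYTES gets a fixed-width value so that the header length is
    # known before the value itself is filled in (limits headers to 99999 bytes).
    unpadded = len("{\nHEADER_BYTES=") + 5 + len(";\n") + len(body) + len("}\n\f")
    total = -(-unpadded // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    text = "{\n" + f"HEADER_BYTES={total:5d};\n" + body + "}\n\f"
    return text.ljust(total).encode("ascii")


header = build_header([
    ("DATA_SIZE", "2 3076 3076"),
    ("DATA_TYPE", "unsigned short int"),
    ("DATA_ENDIAN", "big endian"),
    ("COMPRESSION", "None"),
])
print(len(header))
```

The binary data would then be written immediately after these bytes.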

The above format is simple and has some features not immediately apparent.
For example, if you have a program that looks at the first few characters of
a file to determine what type it is and how to treat it, the characters

{
HEADER_BYTES=  

tell you immediately.  It is like a magic number.

A word on simplicity: Note that there is no clutter about image orientation
or color lookup table values.  This image data could not be processed by
a diffraction analysis program without more information.  BUT it is simple
and lets you read all kinds of binary data.
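
The magic-number idea might be sketched in Python as below.  The sketch
assumes the header starts with exactly "{", a newline (LF) and HEADER_BYTES=;
the names MAGIC, header_size and split_file are hypothetical.

```python
import re

# Assumed layout: the header opens with "{", a newline, then HEADER_BYTES=.
MAGIC = b"{\nHEADER_BYTES="
HEADER_SIZE_RE = re.compile(rb"\{\nHEADER_BYTES=\s*(\d+)\s*;")


def header_size(first_bytes):
    """Return the declared header length, or None if the magic is absent."""
    if not first_bytes.startswith(MAGIC):
        return None
    match = HEADER_SIZE_RE.match(first_bytes)
    return int(match.group(1)) if match else None


def split_file(data):
    """Split file contents into (header text, binary data)."""
    size = header_size(data[:64])
    if size is None:
        raise ValueError("not an ASCII header + binary data file")
    return data[:size].decode("ascii"), data[size:]


sample = b"{\nHEADER_BYTES=  512;\nDATA_SIZE= 1 2;\n}\n\f".ljust(512) + b"\x00\x01\x00\x02"
header_text, pixels = split_file(sample)
print(len(header_text), pixels)
```

A program only needs the first few dozen bytes to decide whether it can
handle the file and where the binary data begins.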

To get this going, one just needs to decide what keywords everyone wants
to use, and what attributes should be allowed for the keywords.

Note that to describe a subimage, you will also need to know the origin
of the subimage.  (Think of an electron density map!)

>3. Note possible future extensions and make sure that the format does not
>preclude these extensions. Simple mechanisms for existing software to
>recognise whether it knows how to handle the data need to be implemented from
>the start.

Well, I would suggest that
COMPRESSION=
be used by existing software from Siemens, Enraf-Nonius, Rigaku and MSC.
You could have, for example,
COMPRESSION=R-AXIS 8 32767;

where the first token in the values indicates the type of compression and
subsequent tokens are specific to that kind of compression.  Or you could
have

COMPRESSION=R-AXIS;
R-AXIS_COMPRESSION= 8 32767;

or even
COMPRESSION=Siemens_1;
SIEMENS_LOOKUP_TABLE= 324 455 12312 ...;

The point here is that keywords do not have to be followed by a single value.
The number of values could be determined by the first value (or from context).
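
Multi-valued keywords of this kind could be read along the following lines.
This Python sketch assumes "=" separates keyword from values, ";" ends the
values, and whitespace separates tokens; parse_statements and
compression_scheme are hypothetical names, and treating a missing
COMPRESSION keyword as "None" is an assumption of the example.

```python
def parse_statements(text):
    """Map each keyword to its list of whitespace-separated value tokens."""
    fields = {}
    for statement in text.split(";"):
        key, sep, value = statement.partition("=")
        if sep:  # skip fragments that contain no keyword=values pair
            fields[key.strip()] = value.split()
    return fields


def compression_scheme(fields):
    """Return (scheme name, scheme-specific parameter tokens)."""
    tokens = fields.get("COMPRESSION") or ["None"]  # assumed default
    return tokens[0], tokens[1:]


fields = parse_statements("DATA_SIZE= 2 3076 3076;\nCOMPRESSION=R-AXIS 8 32767;\n")
print(compression_scheme(fields))
```

Software that does not recognise the first token can stop there and report
that it cannot handle the data, without needing to understand the remaining
tokens.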

>4. Make as much use as possible of previous initiatives e.g. the 1985 EEC
>workshops.

>WORK PLAN
>---------
>
>Define limited goal phases:
>
>Phase 1:  Simple 2-D images and associated data. This would lead to a draft
>-------   document for COMCIFS.
>                             
>Only Cartesian 2-D detectors. No image compression algorithms considered. 
>Only integer image data stored. Associated data mainly limited to existing 
>image format header information.

I think you have to treat more than 2D and image compression from the
start in this phase as these issues arise from the beginning.  For example,
if you are doing 3D profile-fitting, you might want to create individual
3D profiles and view them.  Why not use the image file format along with
an image display software to view them?  Actually, why not use existing
electron density map display software to view them?  You see, images and
maps are not that different!

>Sub-Phase 1A: Only considers the image data itself, and not any experiment
>parameters or other associated data.

This is a good idea.  It lets you accomplish something right away.
Wait a minute ... Didn't I finish off Sub-Phase 1A in the above discussion? :)

>Sub-Phase 1B: Data items and definitions relevant to particular types
>of experiments need to be defined.

There are lots and lots of keywords/values that can go here, so yes, it
is best to wait on this, but not wait too long.  In software I am writing, 
all geometric and experimental information (or pointers to where this
information can be found) is stored in the image header, so that the user 
should not need to input any commands to process an entire dataset.  This 
follows the spirit of what Bob Sweet wrote:
    >... From there on, the data should include everything necessary about 
    >the experiment for rather complete data reduction.  This will be possible 
    >with a comprehensive header and this is why it might be better to make the 
    >process simple ....

>Phase 2: Image compression: (Important, but likely to be more complicated
>-------  to reach agreement), other data types (e.g. IEEE Reals ?)

But put it in Sub-Phase 1A anyway.

>Phase 3, 4, etc.: (For the future) Other items, multiple data objects,
>-------           and relationships, Non-Cartesian "images". e.g. spiral 
>                  read-out


>POSSIBLE OBJECTIONS
>-------------------

>Q. What happens if the commercial manufacturers ignore "imageCIF" ?

A.  Speaking as an employee of a commercial manufacturer, they will
    not, especially if they have a chance to provide input during the
    design phase and have extensions for their instruments.

>Q. What happens if the authors of the main analysis programs ignore
>"imageCIF"

A.  If many detector images were written in imageCIF, the authors could
    not ignore the format.

>Q. CIF is not binary, so the extension is contrary to CIF !

A. So what?

========================================================================

Here is my conclusion:

I suggest that we all agree on ASCII header + Binary data as
the image format.  I suggest that the ASCII header 
   - have a minimum of 512 characters and be a multiple of 512 characters
     in length (for easy reading on VMS systems with QIOs),
   - begin with a defined character sequence, that contains the total number
     of characters in the header,
   - that it end with a defined character sequence that contains a form feed
     character so that programs like Unix more and VMS type/page pause
     before the binary data appears
   - that it have keyword=values, where values can represent numbers or
     strings and have more than one element
   - that we agree on keywords for:
        data_dimensions  (including value for number of dimensions)
        data_type        (integer or float, number of bytes)
        data_endianness  
        data_compression
   - that dimensions of 0 are allowed (that is, the image file contains only 
      a header and no binary data)
   - that words like row, column, vertical, and horizontal are not used in
     any keywords.  I am averse to X, Y and Z also, but may be swayed with
     good arguments.

As an added bonus, I might suggest that we decide on a default value for
any keyword that is missing.  For example, if data_dimensions is missing,
then the image dimensions are 0 and the file contains only a header and no
binary data.
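
The default-value idea could look something like this Python sketch.  It
uses the DATA_SIZE keyword from the earlier example in place of
data_dimensions.  Only the zero-dimension default comes from the suggestion
above; the other entries in DEFAULTS, and all function names, are
assumptions made for illustration.

```python
from math import prod

# Keyword -> default value tokens.  Only the DATA_SIZE default is taken
# from the text; the other two are assumed for the sake of the example.
DEFAULTS = {
    "DATA_SIZE": ["0"],
    "DATA_ENDIAN": ["big", "endian"],
    "COMPRESSION": ["None"],
}


def with_defaults(fields):
    """Return a copy of fields with missing keywords filled from DEFAULTS."""
    merged = dict(DEFAULTS)
    merged.update(fields)
    return merged


def dimensions(fields):
    """Return the size in each dimension; the first token is the count."""
    tokens = with_defaults(fields)["DATA_SIZE"]
    ndim = int(tokens[0])
    sizes = tuple(int(token) for token in tokens[1:])
    if len(sizes) != ndim:
        raise ValueError(f"expected {ndim} sizes, got {len(sizes)}")
    return sizes


def pixel_count(fields):
    """Number of data points; 0 means the file holds only a header."""
    sizes = dimensions(fields)
    return prod(sizes) if sizes else 0


print(pixel_count({}))
print(pixel_count({"DATA_SIZE": ["2", "3076", "3076"]}))
```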

Jim
