Discussion List Archives


status of v1 of the imgcif library

Hi everyone,

This is to let you all know the current status of the first version of
the imgcif library and to get comments on the way I have implemented a
few of the points you have been discussing.

1.  The compression/decompression and cif/cbf file parsing code is
running.  The full library should be finished next week or the week
after. When it is complete and tested, I'll let you know.

2.  The compression I have implemented is lossless pixel-to-pixel
differencing followed by modified Huffman encoding.  Using Andy and Bob's
terminology, this corresponds to the simplest type of "Predictor
Huffman" algorithm.

This scheme produces a bitstream (which is then encoded in 8-bit bytes)
so the little-endian/big-endian difference disappears.  It also has the
advantage that the compressed image depends only on the pixel values and
not on the number of bytes occupied by a pixel on the computer doing
the compression or on whether the original pixels were signed or unsigned.
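
As an illustration only (this is not the library's code), the two
ideas above can be sketched in C: a predictor that stores each pixel as
its difference from the previous one, and a bit writer that packs
variable-length codes into 8-bit bytes in a fixed bit order.  The names
bit_writer, bw_init, bw_put, delta_encode and delta_decode are
hypothetical, the first pixel is assumed to be predicted as 0, and the
modified Huffman step itself is not shown because its details are not
given here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit writer: packs codes into bytes, most significant bit first. */
typedef struct {
    unsigned char *buf;   /* output buffer, zeroed by bw_init */
    size_t cap;           /* capacity of buf in bytes */
    size_t nbits;         /* number of bits written so far */
} bit_writer;

static void bw_init(bit_writer *w, unsigned char *buf, size_t cap)
{
    memset(buf, 0, cap);
    w->buf = buf;
    w->cap = cap;
    w->nbits = 0;
}

/* Append the low `count` bits of `value`; returns -1 if the buffer is full. */
static int bw_put(bit_writer *w, unsigned long value, int count)
{
    int i;
    assert(count >= 0 && count <= 32);
    if (w->nbits + (size_t)count > w->cap * 8)
        return -1;
    for (i = count - 1; i >= 0; i--) {
        if ((value >> i) & 1UL)
            w->buf[w->nbits / 8] |= (unsigned char)(0x80u >> (w->nbits % 8));
        w->nbits++;
    }
    return 0;
}

/* Predictor step: each pixel becomes its difference from the previous pixel.
   The first pixel is differenced against 0 (an assumption of this sketch). */
static void delta_encode(const long *pix, long *diff, size_t n)
{
    long prev = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        diff[i] = pix[i] - prev;
        prev = pix[i];
    }
}

/* Inverse of delta_encode: a running sum restores the original values. */
static void delta_decode(const long *diff, long *pix, size_t n)
{
    long prev = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        prev += diff[i];
        pix[i] = prev;
    }
}
```

Because the output is built one bit at a time from pixel values, the
resulting bytes do not depend on the byte order or integer width of the
machine that wrote them.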

Compression and decompression are fairly fast and the compression ratio
is good.  Compressing or decompressing a 2000*2000 pixel 18-bit image
typically takes less than 2 seconds on a 300MHz Pentium II or an R10000
SGI and less than 1.5 seconds on a 500MHz Alpha.  With typical images
from SSRL, each pixel takes around 5-6 bits in the compressed image.
This corresponds to a compression ratio of around 3:1 (16/5.5 is about
2.9) compared to the 16-bit-per-pixel scheme with overflow table used by
MAR for uncompressed images.

I don't think there should be any copyright/patent problems as the
modifications to the basic Huffman algorithm and all the code are mine.

3.  The type of encoding is stored within the binary section (as well as
in the CIF header) so additional compression schemes can be added in the
future.
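
One way to picture this, purely as a sketch: if each binary section
begins with a small identifier naming its compression scheme, a reader
can look the identifier up in a table and new schemes only need a new
table entry.  Everything here is an assumption made for illustration:
the 4-byte identifier stored most significant byte first, the COMP_*
values, and the function names.  The actual layout of the binary
section is not described in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scheme identifiers; not the library's actual values. */
enum { COMP_NONE = 0, COMP_PREDICTOR_HUFFMAN = 1 };

typedef int (*decode_fn)(const unsigned char *in, size_t in_len,
                         unsigned char *out, size_t out_cap, size_t *out_len);

/* Trivial "scheme": the payload is stored uncompressed. */
static int decode_none(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap, size_t *out_len)
{
    if (in_len > out_cap)
        return -1;
    memcpy(out, in, in_len);
    *out_len = in_len;
    return 0;
}

typedef struct {
    unsigned long id;
    decode_fn decode;
} codec;

/* Table of known schemes; a future scheme would add one entry here. */
static const codec codecs[] = {
    { COMP_NONE, decode_none }
};

/* Read the identifier byte by byte so host byte order does not matter. */
static unsigned long read_id(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

/* Returns 0 on success, -1 on a malformed section, -2 on an unknown scheme. */
static int decode_section(const unsigned char *sec, size_t len,
                          unsigned char *out, size_t out_cap, size_t *out_len)
{
    size_t i;
    unsigned long id;
    assert(sec != NULL && out != NULL && out_len != NULL);
    if (len < 4)
        return -1;
    id = read_id(sec);
    for (i = 0; i < sizeof codecs / sizeof codecs[0]; i++)
        if (codecs[i].id == id)
            return codecs[i].decode(sec + 4, len - 4, out, out_cap, out_len);
    return -2;
}
```

A reader built this way rejects a section whose scheme it does not know
instead of misreading it, which is what makes later additions safe.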

4.  The binary sections are stored as ';'-delimited strings in an
otherwise pure CIF file, e.g.:

#
# Array data
#

loop_
_array_data.array_id
_array_data.data
image_1
;
START OF BINARY
(binary data)
END OF BINARY
;

The start and end of the binary section are structured in a way similar
to that described in section 6 (Binary sections) of the OVERVIEW OF THE
FORMAT in the draft proposal.

###_START_OF_HEADER, ###_END_OF_HEADER, ###_START_OF_BIN,
###_END_OF_BINARY, ###_END_OF_CBF are no longer necessary.

5.  Because the binary sections are encoded simply as an extra data
type, a file can contain any number of binary sections appearing in any
order.  There is no restriction to a single binary section.

This can work with very large files with multiple binary sections
because a binary section is read into memory only when that data is
requested by the calling program rather than when the imgcif file is
first parsed.  This lets a program access all of the pure-CIF data very
rapidly and then access the binary data as needed.
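
The deferred reading described above might look roughly like the
following sketch.  The parser only records where a binary section
starts and how long it is; the bytes are read the first time they are
requested.  The lazy_section type and the lazy_note, lazy_get and
lazy_release functions are hypothetical names, not part of the library,
and the file is assumed to have been opened in binary mode.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record kept for each binary section found while parsing. */
typedef struct {
    FILE *fp;             /* file the section lives in */
    long offset;          /* position of the first byte of binary data */
    size_t length;        /* number of bytes of binary data */
    unsigned char *data;  /* NULL until the data is first requested */
} lazy_section;

/* Called by the parser: remember the location, read nothing yet. */
static void lazy_note(lazy_section *s, FILE *fp, long offset, size_t length)
{
    assert(s != NULL && fp != NULL);
    s->fp = fp;
    s->offset = offset;
    s->length = length;
    s->data = NULL;
}

/* Called by the application: load on first use, then reuse the buffer. */
static const unsigned char *lazy_get(lazy_section *s)
{
    unsigned char *buf;
    assert(s != NULL);
    if (s->data != NULL)
        return s->data;
    buf = (unsigned char *)malloc(s->length ? s->length : 1);
    if (buf == NULL)
        return NULL;
    if (fseek(s->fp, s->offset, SEEK_SET) != 0 ||
        fread(buf, 1, s->length, s->fp) != s->length) {
        free(buf);
        return NULL;
    }
    s->data = buf;
    return s->data;
}

/* Drop the in-memory copy; the section can be loaded again later. */
static void lazy_release(lazy_section *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
}
```

With this arrangement the memory in use at any moment is bounded by the
sections the program has actually asked for, not by the size of the
file.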

6.  The library is pure ANSI C and should run on any computer with an
ANSI C compiler without having to define any system-specifics.  The only
assumption made in the code is that an int is at least 32 bits.  This
assumption may disappear by the final version.  If someone needs the
library to work on a system with 16-bit ints, please let me know.
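
For reference, one common way to remove such an assumption in ANSI C is
to choose the integer type from <limits.h> at compile time, since the
standard guarantees that long holds at least 32 bits while int is only
guaranteed 16.  The sketch below shows that general technique; the
names pixel_int and check_pixel_width are hypothetical and this is not
necessarily how the library will handle it.

```c
#include <assert.h>
#include <limits.h>

/* Use int where it is wide enough, otherwise fall back to long,
   which ANSI C guarantees to be at least 32 bits. */
#if INT_MAX >= 2147483647
typedef int pixel_int;
#else
typedef long pixel_int;
#endif

/* Run-time sanity check of the width chosen above. */
static void check_pixel_width(void)
{
    assert(sizeof(pixel_int) * CHAR_BIT >= 32);
}
```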

    Does anyone have any comments?

        Paul Ellis



International Union of Crystallography
