Discussion List Archives

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

prototype CBF format

  • To: Multiple recipients of list <imgcif-l@bnl.gov>
  • Subject: prototype CBF format
  • From: Andy Hammersley <hammersl@esrf.fr>
  • Date: Tue, 16 Apr 1996 05:36:54 -0400 (EDT)

Here's my attempt to define a CBF format; please excuse the present
inconsistencies and incompleteness. I've produced this as a
basis for discussion, and to give an idea of how the overall format
would look. Please do not assume any of the data names, concepts, etc. 
are decided; this is only an attempt at a prototype definition. Many 
points of discussion are still undecided, and a large number of new 
concepts are raised within this definition.

Q. Is this an appropriate way forward ?

I separate the definition from comments on discussion items by using 
round brackets to refer to notes kept separate from the main text, 
e.g. (1) refers to point 1 in the notes section. I suggest that I try 
to update and redistribute this definition as new suggestions are sent
in or consensus is reached.

Q. Is this a suitable basis for developing a CBF definition which would 
then be proposed to the COMCIFS/IUCr ?

My apologies if important points which have been raised in discussions
have been left out of the present document. Please remind me if you
feel that something is missing.


   Andy Hammersley


-----------------------------------------------------------------------------


                    ---------------------------------------
                    The Crystallographic Binary File Format
                    ---------------------------------------


ABSTRACT
--------

This document describes the Crystallographic Binary File (CBF) (0) format: 
a simple self-describing binary format for efficient transport and 
archiving of experimental data for the crystallographic community. The 
format consists of a "CIF-compatible" header section contained within
a binary file. The format of the binary file, and the new CIF data
items, are defined.


1.0 INTRODUCTION
----------------

The Crystallographic Binary File (CBF) format is a complementary format 
to the Crystallographic Information File (CIF) [1], supporting efficient
storage of large quantities of experimental data in a self-describing 
binary format (1).

The initial aim is to support efficient storage of raw experimental data
from area-detectors (images) with no loss of information compared to
existing formats. The format should be both efficient in terms of
writing and reading speeds, and in terms of stored file sizes, and
should be simple enough to be easily coded, or ported to new computer 
systems.

Flexibility and extensibility are required, and later the storage of
other forms of data may be added without affecting the present definitions.

The aims are achieved by a simple binary file format, consisting of a
variable length header section followed by a binary data section. The
binary data is fully described by data name/ value pairs within the
header section. The header section may also contain other auxiliary 
information. CIF data name and item pairs are used in the header
section to describe the binary data.

The present version of the format only tries to deal with simple Cartesian 
2-D detector data. This is essentially the "raw" data from detectors 
that is typically stored in commercial formats or individual formats
internal to particular institutes, but could be other forms
of data. It is hoped that CBF can replace individual laboratory or 
institute formats for "home" built detector systems, be used as a 
inter-program data exchange format, and may be offered as an output
choice by a number of commercial detector manufacturers specialising in
X-ray detector systems.

This format does not imply any particular demands on processing software
nor on the manner in which such software should work. Definitions of units,
coordinate systems, etc. may be quite different. The clear, precise
definition within CIF, and hence CBF, should help, when necessary, to 
convert from one system to another. Whilst no strict demands are made,
it is clearly to be hoped that software will make as much use as is 
reasonable of information relevant to the processing which is stored 
within the file.


2.0 OVERVIEW OF THE FORMAT
--------------------------

The following describes the major "components" of the CBF format.

1. CBF is a binary file, containing self-describing "image" and
   auxiliary data.

2. It is an exact number of blocks of 512 bytes in length, and may be 
   considered in a block structure (2).

3. The very start of the file has an identification item (3). This item
   also describes the CBF version or level. e.g.

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

The first hash means that this line is a comment line for CIF, but the 
three hashes mean that this is a line describing the binary file layout 
for CBF (4). No whitespace may precede the first hash sign.

The version number is defined as a major version number and minor
version number separated by the decimal point. A change in the major 
version may well mean that a program for the previous version cannot
input the new version, as some major change has occurred to CBF (5). A
change in the minor version may also mean incompatibility, if the CBF
has been written using some new feature. e.g. a new form of linearity 
scaling may be specified and this would be considered a minor version
change. A file containing the new feature would not be readable by a
program supporting only an older version of the format.
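As an illustration, a reader might check the identification line and version number like this. This is a hypothetical sketch, not part of the definition: the function name and error handling are my own, and only the magic string and the major.minor version syntax are taken from the text above.

```python
# Sketch of checking the CBF identification line (point 3). The magic
# string and version syntax follow the draft; everything else is
# illustrative.

MAGIC = b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION "

def check_cbf_version(path, supported_major=1):
    """Return (major, minor) if the file starts with the CBF magic line."""
    with open(path, "rb") as f:
        first = f.readline(128)          # identification item is one line
    if not first.startswith(MAGIC):
        raise ValueError("not a CBF file")
    version = first[len(MAGIC):].split()[0].decode("ascii")
    major, minor = (int(x) for x in version.split("."))
    if major != supported_major:
        # A major version change may mean the file cannot be input (point 3).
        raise ValueError("unsupported CBF major version %d" % major)
    return major, minor
```

A program need only read the first few tens of bytes to know whether the file is of the right type.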


4a. The header section, including the identification items which delimit
it, uses only ASCII characters, and is divided into "lines". The "line
separator" symbol(s) is/are the same regardless of the operating system
on which the file is written (6). (This is an important difference from
CIF, but must be so: because the file contains binary data, it cannot
be translated from one O.S. to another in the way that ASCII text
files can be.)

4b. The header section within the delimiting identification items
obeys all CIF rules [1], with the exception of the line separators.

e.g.

o "Lines" are a maximum of 80 characters long.

o All data names start with an underscore character and are a maximum 
  of 32 characters long.

o The hash symbol (outside a character string) means that all text
  up to the line separator is a comment.

o Whitespace outside of character strings is not significant.

o Data names are case insensitive.

o The data item follows the data name separator, and may be of one of
  two types: text string (char) or number (numb). (The type is
  specified for each data name.)

o Text strings may be delimited with single or double quotes, or blocks of
  text may be delimited by semi-colons occurring as the first character on
  a line.

o The 'loop_' mechanism allows a data name to have multiple values.

Any CIF data name may occur within the header section.

5a. The end of the header section is delimited by the following special
identifier (4), (7):

###_END_OF_CBF_HEADER

The "line" is terminated by the "line separator" immediately after the
final "R" of "HEADER". No whitespace can be added at this point.

5b. Whitespace (blank characters and lines) may be used to reserve space
in the header section (for undefined later use), but this white space must
occur before the end of header delimiter item.

6. The header section must contain sufficient data names to fully
describe the binary data section.

7. The binary data starts at the beginning of the next free data block
after the last character of the line separator (8).

8. After the end of the binary data, the last block is fully output
i.e. the file is an exact integer multiple of the block size. The 
values of the extra bytes are undefined.
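The block rule in points 2 and 8 can be sketched as follows, assuming the suggested 512-byte block size (see note (2); another size may be chosen). The helper names are illustrative, and zero padding bytes are an arbitrary choice since the values of the extra bytes are undefined.

```python
# Sketch of padding a CBF out to an exact number of blocks (point 8).
# The 512-byte block size is only the suggested value from note (2).

BLOCK_SIZE = 512

def padded_length(n_bytes, block_size=BLOCK_SIZE):
    """Smallest multiple of block_size that holds n_bytes."""
    return ((n_bytes + block_size - 1) // block_size) * block_size

def pad_file(f, block_size=BLOCK_SIZE):
    """Pad an open binary file out to a whole number of blocks.
    The padding values are undefined by the format; zeros are used here."""
    end = f.tell()
    f.write(b"\0" * (padded_length(end, block_size) - end))
```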

9. The recommended file extension for a CBF is: cbf
This allows users to recognise file types easily, and gives programs a 
chance to "know" the file type without having to prompt the user.

10. CBF format files are binary files and when ftp is used to transfer
files between different computer systems "binary" or "image" mode
transfer should be selected.

(11. A recommended standard icon could also be defined, e.g. a little 
diffraction pattern.)

3.0 DATA NAME CATEGORIES
------------------------

Two new data name categories are proposed:

_binary_

_image_

The '_binary_' category is used for data items describing the general 
storage of binary data. At present only the '_binary_data_class' data
name exists in this category.

The '_image_' category covers all data names concerned with the storage
of "image" or regular array data (14).

Data names from any of the existing categories may be relevant as
auxiliary information in the header section, but data names from the
'_diffrn_' category are likely to be the most relevant, and a few extra
data names in this category are necessary. 


4.0 DESCRIBING THE BINARY DATA
------------------------------

The type of binary data stored in the file is defined by the
'_binary_data_class' data item. This is of type 'character', and may have 
one of two values in version 1.0 of CBF: 'none' or 'image'.

If the value is 'none' there is no binary data section in the file.

The value 'image' means that the binary data section contains binary
data of class "image".


4.1 The "image" Class of Binary Data (14)
-----------------------------------------

The "image" class is used to store regular arrays of data values, such 
as 1-D histograms, area-detector data, series of area-detector data, and
volume data. Normally such data is regularly spaced in space or time;
however, spatially distorted data could nevertheless be stored in such a format.
There is only one data "value" stored per lattice position, although that
value may be of type complex (15).

The "image" class implies that the data items '_image_size_dimensionality', 
'_image_size_dimension_1', '_image_element_data_type',
'_image_intensities_linearity', and '_image_data_compression_type' must 
be defined. The values of these items may in turn require other data
items to be defined.

e.g. If the '_image_size_dimensionality' data item is greater than 1,
then '_image_size_dimension_2' and maybe other '_image_size_dimension_?'
items must be defined up to the dimensionality of the array.


4.2 "Image" Element Rastering and Orientation (17)
--------------------------------------------------

Fundamental to treating a long line of data values as a 2-D image or
series of 2-D images is the knowledge of the manner in which the values 
need to be wrapped. For the raster orientation to be meaningful we
define the sense of the view:

The sense of the view is defined as that looking from the crystal
towards the detector (12). 

(For the present we consider only an equatorial plane geometry, with
2-theta = 0, and the detector vertically mounted. (16)) The 
raster orientation describes in which corners of the detector the data 
value stream starts and ends, and whether the rastering is carried out 
horizontally or vertically. We define a preferred rastering orientation,
which is the default if the keyword is not defined. This is with the 
start in the lower left-hand corner and the fastest changing direction 
of the rastering horizontal (Type 1). The rastering type is defined 
by an integer value which may take a value from 1 to 8. (All eight possible
methods are defined to allow support for existing systems, which may
have a natural rastering order that is difficult to change owing to 
the large size of the data involved. However, whenever possible, type
1 is encouraged.)

(Note: With off-line scanners the rastering type depends on which way 
round the imaging plate or film is entered into the scanner. Care may 
need to be taken to keep this consistent.)

For 1-D detector data only the first two orderings are relevant (the
orientation of the detector would be defined separately.)

Below are shown the 8 possible ways of rastering the element stream,
illustrated with the element values 1 to 9 (13):

_image_element_ordering 1

# This is the preferred method, and should be used, if possible.

#          ^
#          | 
# 7 8 9    slow
# 4 5 6    |
# 1 2 3    o- fast ->

_image_element_ordering 2

#                   ^
#                   | 
# 9 8 7          slow
# 6 5 4             |
# 3 2 1    <- fast -o

_image_element_ordering 3

#          ^
#          | 
# 3 6 9    fast
# 2 5 8    |
# 1 4 7    o- slow ->

_image_element_ordering 4

#                   ^
#                   | 
# 9 6 3          fast
# 8 5 2             |
# 7 4 1    <- slow -o

_image_element_ordering 5

#          o- fast ->
#          | 
# 1 2 3    slow
# 4 5 6    |
# 7 8 9    v

_image_element_ordering 6

#          <- fast -o
#                   | 
# 3 2 1          slow
# 6 5 4             |
# 9 8 7             v


_image_element_ordering 7

#          o- slow ->
#          | 
# 1 4 7    fast
# 2 5 8    |
# 3 6 9    v

_image_element_ordering 8

#          <- slow -o
#                   | 
# 7 4 1          fast
# 8 5 2             |
# 9 6 3             v

In the case of series of images, subsequent images follow in the element
stream with the same raster ordering.
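The eight orderings above can be expressed as a small index-mapping routine. This is a hypothetical illustration, not part of the format definition: the function `unpack_stream` and its output convention (row 0 at the bottom, column 0 at the left, in the viewing sense defined above) are my own.

```python
# Illustrative sketch: unpack an element stream into grid[row][col],
# row 0 at the bottom, col 0 at the left, for orderings 1-8 as drawn
# in the diagrams above.

def unpack_stream(stream, ncols, nrows, ordering=1):
    grid = [[None] * ncols for _ in range(nrows)]
    for k, value in enumerate(stream):
        if ordering in (1, 2, 5, 6):          # fast axis horizontal
            fast, slow = k % ncols, k // ncols
        else:                                 # orderings 3, 4, 7, 8: fast vertical
            fast, slow = k % nrows, k // nrows
        if   ordering == 1: col, row = fast, slow
        elif ordering == 2: col, row = ncols - 1 - fast, slow
        elif ordering == 3: col, row = slow, fast
        elif ordering == 4: col, row = ncols - 1 - slow, fast
        elif ordering == 5: col, row = fast, nrows - 1 - slow
        elif ordering == 6: col, row = ncols - 1 - fast, nrows - 1 - slow
        elif ordering == 7: col, row = slow, nrows - 1 - fast
        elif ordering == 8: col, row = ncols - 1 - slow, nrows - 1 - fast
        else: raise ValueError("ordering must be 1 to 8")
        grid[row][col] = value
    return grid
```

For example, a 3x3 stream of the values 1 to 9 with ordering 1 reproduces the first diagram above, with 1 2 3 along the bottom row.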

4.3 "Image" Element Intensity Scaling
-------------------------------------

Existing data storage formats use a wide variety of methods for storing
physical intensities as element values. The simplest is a linear
relationship, but square root and logarithm scaling methods have 
attractions and are used. Additionally some formats use a lower dynamic 
range to store the vast majority of element values, and use some other 
mechanism to store the elements which over-flow this limited dynamic 
range. The problem of limited dynamic range storage is solved by the data
compression method 'byte_offsets' (see next Section), but the
possibility of defining non-linear scaling must also be provided.

The '_image_intensities_linearity' data item specifies how the intensity 
scaling is defined. Apart from linear scaling, which is specified by the
value 'linear', two other methods are available to specify the scaling.

One is to refer to the detector system, and then knowledge of the
manufacturers method will either be known or not by a program. This has
the advantage that any system can be easily accommodated, but requires 
external knowledge of the scaling system.

The recommended alternative is to define a number of standard intensity
linearity scaling methods, with additional data items when needed. A
number of standard methods are defined by '_image_intensities_linearity'
values: 'offset', 'scaling_offset', 'sqrt_scaled', and
'logarithmic_scaled' (11).

The "offset" methods require the data item '_image_intensities_offset' 
to be defined, and the "scaling" methods require the data item 
'_image_intensities_scaling' to be defined.

The above scaling methods allow the element values to be converted to a 
linear scale, but do not necessarily relate the linear intensities to
physical units. When appropriate the data item '_image_intensities_gain'
can be defined. Dividing the linearised intensities by the value of
'_image_intensities_gain' should produce counts.

Two special optional data flag values may be defined which both refer to
the values of the "raw" stored intensities in the file, and not to the
linearized scale values. 
'_image_intensities_undefined' specifies a value which indicates that
the element value is not known. This may be due to data missing e.g. a
circular image stored in a square array, or where the data values are
flagged as missing e.g. behind a beam-stop. 
'_image_intensities_overload' indicates the intensity value at and 
above which values are considered unreliable. This is usually due to 
saturation.
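As an illustration, the linearisation implied by the standard methods might be coded as below. This is a hedged sketch: the function name and keyword parameters are mine, and the inverse transforms are assumed from the dictionary definitions in Appendix A.

```python
# Sketch of recovering linear-scale intensities from stored element
# values, following the '_image_intensities_linearity' definitions.
# Dividing the linearised result by '_image_intensities_gain' should
# then produce counts.

def linearise(value, method="linear", offset=0.0, scaling=1.0):
    """Recover a linear-scale intensity from a stored element value."""
    if method == "linear":
        return value
    if method == "offset":                 # add '_image_intensities_offset'
        return value + offset
    if method == "scaling_offset":         # scale applied before offset
        return value * scaling + offset
    if method == "sqrt_scaled":            # stored = sqrt(raw) * scaling
        return (value / scaling) ** 2
    if method == "logarithmic_scaled":     # stored = log10(raw) * scaling
        return 10.0 ** (value / scaling)
    raise ValueError("unknown linearity: " + method)
```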


5.0 DATA COMPRESSION (16)
-------------------------

One of the primary aims of CBF is to allow efficient storage, and
efficient reading and writing of data, so data compression is of great 
interest. Despite the extra CPU over-heads it can very often be faster 
to compress data prior to storage, as much smaller amounts of data need 
to be written to disk, and disk I/O is relatively slow. However, optimum
data compression can result in complicated algorithms, and be highly
data specific. 

At present one simple loss-less integer compression algorithm is
defined. This is referred to as 'byte_offsets' in the 
'_image_data_compression_type' data item. This algorithm will typically 
result in close to a factor of two reduction in data storage size 
relative to typical 2-byte diffraction images. It should give similar 
gains in disk I/O and network transfer. It also has the advantage that 
integer values up to 32 bits may be stored efficiently without the need 
for special over-load tables. It is a fixed algorithm which does not 
need to calculate any image statistics, so is fast.

The algorithm works because of the following property of almost all
diffraction data and much other image data: The value of one element
tends to be close to the value of the adjacent elements, and the vast
majority of the differences use little of the full dynamic range.
However, noise in experimental data means that run-length encoding is 
not useful (unless the image is separated into different bit-planes). If
a variable length code is used to store the differences, with the number
of bits used being inversely proportional to the probability of
occurrence, then compression ratios of 2.5 to 3.0 may be achieved. 
However, the optimum encoding becomes dependent on the exact properties 
of the image, and in particular on the noise. Here a lower compression 
ratio is achieved, but the resulting algorithm is much simpler and more 
robust.

The 'byte_offsets' algorithm is the following:

1. The first element of the image is stored as a 4-byte signed integer
   regardless of the raw image element type. The byte order for this and
   any subsequent multi-byte integers is that defined in 
   '_image_byte_order' (9).

2. For every subsequent element the value of the previous element is
   subtracted to produce the difference. For the first element on a line 
   the value to subtract is the value of the first element of the previous
   line. For the first element of a subsequent image the value to subtract
   is the value of the first element of the previous image.

3. If the difference is within +-127, then one byte is used to store
   the difference; otherwise the byte is set to -128 (80 in hex) and,
   if the difference is within +-32767, the next two bytes are
   used to store the difference; otherwise -32768 (8000 in hex) is
   written and the following 4 bytes store the difference as a full 
   signed 32-bit integer.

4. The image element order follows the normal ordering as defined by
   '_image_element_ordering'.

It may be noted that one element value may require up to 7 bytes of
storage; however, for almost all 16-bit experimental data the vast 
majority of element values will be within +-127 units of the previous
element, so only require 1 byte for storage, and a compression factor of
close to 2 is achieved.
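A minimal sketch of the 'byte_offsets' scheme is given below, assuming a flat element stream with simple previous-element differencing (the per-line and per-image rules of step 2 are omitted for brevity) and "highbytefirst" byte order. The function names are illustrative, not part of the definition.

```python
# Sketch of the 'byte_offsets' compression above (steps 1-3), for a flat
# element stream. Big-endian ('highbytefirst') is assumed throughout.

import struct

def byte_offsets_compress(elements):
    out = struct.pack(">i", elements[0])        # first element: 4-byte signed int
    for prev, cur in zip(elements, elements[1:]):
        delta = cur - prev
        if -127 <= delta <= 127:
            out += struct.pack(">b", delta)                  # 1-byte difference
        elif -32767 <= delta <= 32767:
            out += struct.pack(">bh", -128, delta)           # flag + 2 bytes
        else:
            out += struct.pack(">bhi", -128, -32768, delta)  # flags + 4 bytes
    return out

def byte_offsets_decompress(data, n_elements):
    pos, value = 4, struct.unpack_from(">i", data, 0)[0]
    elements = [value]
    while len(elements) < n_elements:
        delta = struct.unpack_from(">b", data, pos)[0]; pos += 1
        if delta == -128:                                    # escaped: 2 bytes
            delta = struct.unpack_from(">h", data, pos)[0]; pos += 2
            if delta == -32768:                              # escaped: 4 bytes
                delta = struct.unpack_from(">i", data, pos)[0]; pos += 4
        value += delta
        elements.append(value)
    return elements
```

Note that an element costing the maximum 1 + 2 + 4 bytes corresponds to the 7-byte worst case mentioned above.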

There are practical disadvantages to such data compression: the value of
a particular element cannot be obtained without calculating the values of
all previous elements, and there is no simple relationship between element
position and stored bytes. If, in general, the whole image is required, this
disadvantage does not apply. These disadvantages can be reduced by
separately compressing different regions of the image, an
approach available in TIFF, but this adds to the complexity of reading and 
writing images. An alternative is an optional data item, which defines a
look-up table of element addresses, values, and byte positions within
the compressed data (10). 

6.0 APPLICATION SOFTWARE AND SUPPORT TOOLS
------------------------------------------

The CBF format does not (and cannot) make any strict demands on
applications programs which use CBF files. 

Clearly, any program which outputs a CBF should create only valid CBF's. 
It is desirable that the output file contains as much useful auxiliary 
data as is possible, but no "compulsory" demands on the auxiliary 
information output in the file are made.

Programs inputting CBF's have a more difficult task, in that
theoretically the input CBF could be any size, and use any of the possible
data types, etc. Programs are not required to input all possible format
types, but it is highly recommended to output a clear output message to 
the user explaining why a particular file cannot be input, when this
occurs. e.g. A simple image analysis program may be written to work
with integer element images, so will not allow input of images with 
complex or floating point elements. Clearly, an application program 
will be most useful if it does input CBF's which are appropriate to its 
function. The use of any auxiliary data information is also optional. 
The aim is to provide analysis programs with useful information to help 
process the image data, but the manner in which to use this information is 
the decision of the application programmer. It is, nevertheless, 
desirable that this auxiliary information is used. e.g. An application 
program could use the information in the CBF as default values, but 
allow the user to over-ride these values interactively or through scripts.

Source code for routines to input CBF's will be made available. This
code may be modified, provided the original copyright notice is
maintained.

A program to extract the header section of a CBF and output it in
CIF format will be available. Note: This program would not in itself
check the validity of the CIF, but would simply recognize the 
"CIF-compatible" header section and output it, in the correct ASCII
text format for the operating system.

7.0 EXAMPLE OF A CBF HEADER SECTION
-----------------------------------

This is an example of how the header section might look for a file
storing an image taken using the X-ray image intensifier/CCD read-out 
detector system on beam-line BM-14 at the ESRF, and subsequently 
corrected for detector distortions:


###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

# Data name
data_Protein_X_1

# Data creation and history details
_audit_creation_date '96-03-27 09:55.05' # (!!!! Time added to date item)
_audit_creation_method 'Raw data output by CAMERA'
_audit_update_record 
;
96-03-27 10:05.31 Detector distortions corrected by FIT2D
;
_computing_data_collection 'spec and CAMERA'
_computing_data_reduction FIT2D            # Distortion correction

# Sample details
_chemical_name_common 'Protein X'

# Experimental details
_diffrn_measurement_method Oscillation
_diffrn_measurement_distance 0.15 # (!!!! New data name)
_diffrn_radiation_wavelength 0.76 # Monochromatic
_diffrn_radiation_source 'ESRF BM-14'
_diffrn_radiation_detector  'ESRF Be XRII/CCD'

# Image data details
_binary_data_class image
_image_size_dimensionality   2      # Single image in the file 
_image_size_dimension_1      1300
_image_size_dimension_2      1200
_image_element_ordering 1           # Default rastering used
_image_element_data_type unsigned_16_bit_integer # 16-bit ADC
_image_byte_order highbytefirst     # Written on a Sun-4 workstation
_image_intensities_linearity linear # Simple linear intensity scaling
_image_intensities_gain      1.2(0.1)
_image_intensities_overload  65535  # Saturation level
_image_intensities_undefined 0      # No defined data value flag

_image_data_compression_type byte_offsets # Save disk space
_image_element_size_1          122e-6 
_image_element_size_2          121e-6

###_END_OF_CBF_HEADER

(Binary image data follows starting at the first free block after the
line separator)


8.0 APPENDIX A: CIF EXTENSIONS DICTIONARY
-----------------------------------------

8.1 Phase 1A: Abstract Image Names Only
---------------------------------------

data_binary_data_class
    _name                     '_binary_data_class'
    _type                     'char'
    _enumeration_range        'none', 'image'
    _list                     'no'
    _definition
;              
              Type of binary data stored in the binary data section. 
              'none' means that no binary data is stored after the 
              header section. 'image' means that an array of values
              is stored.
;

data_image_byte_order
    _name                     '_image_byte_order'
    _type                     'char'
    _enumeration_range        'highbytefirst', 'lowbytefirst'
    _list                     'no'
    _definition
;
               The order of bytes for integer values which require more
               than 1 byte. 'highbytefirst' means that the first byte in
               the byte stream of the bytes which make up an integer
               value is the most significant byte of the integer. This
               is often referred to as "big endian". 'lowbytefirst'
               means that the last byte in the byte stream of the bytes
               which make up an integer value is the most significant
               byte of the integer. This is often referred to as
               "little endian". (IBM PCs and compatibles, and DEC VAXes
               use low byte first ordered integers, whereas
               Hewlett-Packard 700 series, Sun-4, and Silicon Graphics
               use high byte first ordered integers. DEC Alphas can
               produce/use either depending on a compiler switch.)
;

data_image_data_compression_type
    _name                     '_image_data_compression_type'
    _type                     'char'
    _enumeration_range        'none', 'byte_offsets'
    _list                     'no'
    _definition
;              Type of data compression method used to compress the
               binary data. At present only 'none' (no compression) and
               one simple method are defined. It is important that
               output software writes this data name, and that input
               software checks its value, in case further algorithms
               become defined. 

               'none' means that the data is stored in normal format as
               defined by '_image_element_data_type' and
               '_image_byte_order'.

               'byte_offsets' means that the data is stored using the 
               compression scheme defined in Section 5.0.
;

data_image_size_dimensionality
    _name                     '_image_size_dimensionality'
    _type                     'numb'
    _enumeration_range        1:
    _list                     'no'
    _definition
;              The number of dimensions of the data.
;

data_image_element_data_type
    _name                     '_image_element_data_type'
    _type                     'char'
    _enumeration_range        'unsigned_8_bit_integer', 
                              'signed_8_bit_integer',
                              'unsigned_16_bit_integer', 
                              'signed_16_bit_integer', 
                              'unsigned_32_bit_integer', 
                              'signed_32_bit_integer', 
                              '32_bit_real_ieee',
                              '64_bit_real_ieee',
                              '32_bit_complex_ieee' (15)
    _list                     'no'
    _definition
;
               Data type of a single element value. 
;

8.2 Phase 1B: Experiment Data Items
-----------------------------------

data_image_intensities_gain
    _name                     '_image_intensities_gain'
    _type                     'numb'
    _enumeration_range        0.0:
    _list                     'no'
    _definition
;              Detector "gain". The factor by which linearized 
               intensity values should be divided to produce
               counts.
;


data_image_intensities_linearity
    _name                     '_image_intensities_linearity'
    _type                     'char'
    _enumeration_range        'linear',           
                              'offset',           
                              'scaling_offset',   
                              'sqrt_scaled',      
                              'logarithmic_scaled'
    _list                     'no'
    _definition
;
               The intensity linearity scaling used from raw intensity
               to the stored element value. 'linear' is obvious. 'offset'
               means that the value defined by 
               '_image_intensities_offset' should be added to each
               element value. 'scaling' means that the value defined by
               '_image_intensities_scaling' should be multiplied
               with each element value. 'scaling_offset' is the
               combination of the two previous cases, with the scale 
               factor applied before the offset value.
               'sqrt_scaled' means that the square root of the raw 
               intensities multiplied by '_image_intensities_scaling' is
               calculated and stored, perhaps rounded to the nearest 
               integer. Thus, linearization involves dividing the stored
               values by '_image_intensities_scaling' and squaring the 
               result. 
               'logarithmic_scaled' means that the base-10 logarithm of
               the raw intensities multiplied by
               '_image_intensities_scaling' is calculated and stored,
               perhaps rounded to the nearest integer. Thus,
               linearization involves dividing the stored values by
               '_image_intensities_scaling' and calculating 10
               to the power of this number.
;

data_image_intensities_offset
    _name                     '_image_intensities_offset'
    _type                     'numb'
    _enumeration_range        
    _list                     'no'
    _definition
;
               Offset value to be added to element values. (See 
               '_image_intensities_linearity'.)
;

data_image_intensities_scaling
    _name                     '_image_intensities_scaling'
    _type                     'numb'
    _enumeration_range        
    _list                     'no'
    _definition
;
               Scaling value to be multiplied with element values. (See 
               '_image_intensities_linearity'.)
;

data_image_element_size_*
    _name                     '_image_element_size_*'
    _type                     'numb'
    _units_description        'Metres'
    _enumeration_range        0.0: 
    _definition
;
               The size in metres of an image element along the given
               dimension. (This supposes that the elements are on a
               regular 2-D grid.)
;


# From the draft revised CIF core dictionary:
#
# Data items in the _diffrn_measurement_ category record details about
# the device used to orient and/or position the crystal during the data
# measurement and the manner in which the diffraction data were measured.
#

# We need more precisely defined items than '_diffrn_measurement_details'

data_diffrn_measurement_distance
    _name                     '_diffrn_measurement_distance'
    _type                     'numb'
    _units_description        'Metres'
    _enumeration_range        0.0: 
    _list                     'no'
    loop_ _example            0.25 # 25cm sample to detector 
    _definition
;
               The distance between the sample and the intersection of
               the direct beam with the detector; the detector being at
               the 2-theta = 0 position with equatorial geometry (17, 18).
;

9.0 REFERENCES
--------------

1. S~R~Hall, F~H~Allen, and I~D~Brown, "The Crystallographic Information
File (CIF): a New Standard Archive File for Crystallography",
Acta Cryst., A47, 655-685 (1991)

----------------------------------------------------------------------------

10.0 NOTES
----------

(0) Crystallographic Binary File and CBF are working titles. As a name
for the format, this appears reasonably appropriate, but maybe a better
name will be suggested.

(1) A pure CIF-based format has been considered inappropriate given the 
enormous size of many raw experimental data-sets and the desire for
efficient storage, reading, and writing.

(2) The block size of 512 is only a suggestion at present, and another
size may be considered preferable.

Jim suggested a 512-byte block size, for efficiency reasons on
OpenVMS, but there was objection to this. I think that we need to
support the concept of a "record length" for Fortran direct access I/O and 
for certain O.S.'s. For other O.S.'s which don't have file structures,
all this necessarily means is that the files are some exact multiple of 
some number of bytes. A program written in "C" or a similar language 
would pad out the end of the file to the right number of bytes.
Such a concept may also be generally useful for efficiency reasons.

In our choice of this number, we should not especially favour VMS for 
efficiency reasons, but then again ideally we want the reading and writing 
of the files to be as efficient as possible on ALL possible O.S.'s. 
If we can we should avoid building in inefficiency, and if possible
leave the opportunity for memory mapping and similar techniques.

A 512 byte block size is probably also a good size for Un*x pipes (which
may or may not be considered relevant.)

I suggest either 512 or 1024 byte block size, but maybe other numbers 
make more sense for other O.S.'s.
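
As an illustration, the padding described above might be done as follows
in "C". This is a sketch only: the 512-byte block size and the function
names are assumptions for the example, not part of the definition.

```c
/* Sketch only: pad a file out to an exact multiple of a block size.
   The 512-byte size and the function names are assumptions.          */
#include <stdio.h>

#define BLOCK_SIZE 512L

/* Number of pad bytes needed to reach the next block boundary. */
static long pad_bytes_needed(long file_length)
{
    long remainder = file_length % BLOCK_SIZE;
    return (remainder == 0L) ? 0L : BLOCK_SIZE - remainder;
}

/* Append NUL bytes so that the total file length becomes an exact
   multiple of BLOCK_SIZE.  Returns 0 on success, -1 on write error. */
static int pad_to_block(FILE *fp, long file_length)
{
    long i, n = pad_bytes_needed(file_length);
    for (i = 0L; i < n; i++)
        if (fputc('\0', fp) == EOF)
            return -1;
    return 0;
}
```

For an O.S. without file structures this is all the "record length"
amounts to; on systems with real record-oriented I/O the run-time would
do the equivalent itself.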

(3) I would like some simple method of checking whether the file really is
a CBF or not. Ideally this would be right at the start of the file. Thus, a 
program only needs to read in n bytes and should then know immediately
if the file is of the right type or not. I think this identifier should
be some straightforward and clear ASCII string. 

cf. PostScript level I and II. Initially, a restricted format is
probably the most practical to define and implement e.g. only one header
and binary section per file. However, later on we may want to extend the
format to cover multiple header/binary sections. Such an important
change could be communicated to a program through this version/level
number. 

The underscore character has been used to avoid any ambiguity over
space characters.

(Such an identifier should be long enough that it is highly unlikely to
occur randomly, and if it is ASCII text, should be very slightly
obscure, again to reduce the chances that it is found accidentally. Hence 
I added the three hashes, but some other form may be equally valid.)
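
A sketch in "C" of such a check follows. The identifier string is
deliberately left as a parameter, since no actual string has been agreed
yet, and 'is_cbf_header' is a hypothetical name.

```c
/* Sketch only: check whether a buffer begins with the CBF identifier
   string.  The identifier itself is passed in, since no actual string
   has yet been agreed; 'is_cbf_header' is a hypothetical name.        */
#include <string.h>

/* Return 1 if the first bytes of 'buffer' match 'magic', 0 otherwise. */
static int is_cbf_header(const char *buffer, long length, const char *magic)
{
    size_t n = strlen(magic);
    return ((size_t) length >= n) && (memcmp(buffer, magic, n) == 0);
}
```

A program would read just the first n bytes of the file and apply this
test before attempting to parse anything further.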

(4) cf. PostScript and PostScript document structuring conventions.
    
Maybe some other identifier would be better, but also starting with a
hash e.g. #!! 

(The end of header section marker also uses this mechanism.)

(5) The format should maintain backward compatibility e.g. a version 1.0
file can be read in by a version 1.1, 3.0, etc.  program, but to allow 
future extensions the reverse cannot be guaranteed to be true.

(6) The exact manner in which to define the line separation is a subject
of discussion. Either using a single line-feed character (as is done by
Un*x), or using the combination of a carriage-return character followed
by a line-feed character (as is done by MS-DOS and related systems), are
the likely candidates. 
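
One possibility is for reading programs simply to tolerate both
conventions, whichever is chosen for writing. A sketch in "C" (the
function name 'line_length' is hypothetical):

```c
/* Sketch only: find the length of a header line whether it ends in a
   single line-feed (Un*x) or in carriage-return line-feed (MS-DOS).
   Returns the line length excluding the terminator, or -1 if no
   line-feed occurs within 'length' bytes.                             */
static long line_length(const char *text, long length)
{
    long i;
    for (i = 0L; i < length; i++) {
        if (text[i] == '\n')
            return (i > 0L && text[i - 1L] == '\r') ? i - 1L : i;
    }
    return -1L;
}
```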

(7) Some clear identifier signalling the end of the header section and 
where the binary section begins, or some equivalent method for
achieving the same is vital. Here a clear identifier is proposed, but an
alternative method could also work.

(8) If normal computer data e.g. 2-byte integers, or IEEE reals are being 
stored in essentially native format then word boundaries should be 
respected. Given that higher "quadruple" precision data types and 
complex data types may potentially be wanted, I suggest that at least 
32-byte boundaries are respected, but maybe for efficiency or simplicity 
reasons it's desirable to use the full block boundaries. 

(9) It would also be possible to define the algorithm so that multi-byte
integer byte ordering is not important.
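
For example, a reader can assemble a multi-byte integer from the
individual bytes in a defined order, so that the host machine's native
byte ordering never matters. A sketch in "C", assuming 2-byte
two's-complement integers stored low byte first (the function name is
hypothetical, and low-byte-first is only an assumption for the example):

```c
/* Sketch only: decode a 2-byte two's-complement integer stored low
   byte first, independently of the host machine's byte ordering.      */
static long decode_int16_le(const unsigned char b[2])
{
    long value = (long) b[0] | ((long) b[1] << 8);
    if (value >= 0x8000L)         /* restore the sign of negative values */
        value -= 0x10000L;
    return value;
}
```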

(10) It is not proposed to try defining this data item at present, but
it could be added if the demand arises to input only sub-sets of the
data efficiently.

(11) More types may well need to be defined. This list doesn't for
example cover the present Mac-Science intensity scaling scheme.
However, it may be viewed as too complicated to support too large
a range of scaling schemes.

(12) Some may prefer to define the view as the "camera-man's" view, and
maybe this is better as part of an overall consistent co-ordinate
system for lab/crystal/detector. I note that MADNES defines the view
from the camera-man's point of view.

     Which definition of the viewing direction should we use ?
     Is there an IUCr standard co-ordinate system ?

(13) Other orderings are clearly possible; this choice is fairly arbitrary.

(14) I still wonder about the use of "image". 

     Should we change the word "image" to "array", which I feel is more
     consistent with the uses which I have defined ?

     Or should we restrict "image" to refer only to a 2-D array type
     data object, and eventually define other data classes such as
     "histogram", "images", "volume", etc ?

     Or leave the class "image" to refer to a whole variety of 
     N-Dimensional arrays ?

(15) How should complex arrays be stored ?

     Should pairs of real and imaginary values be stored as alternating
     values in the element stream ?

     Or should a whole array (e.g. an image) of the real components be
     stored separately from an identical array containing the imaginary
     components ?
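
To illustrate the two options (purely as an illustration, not a
proposal), the element index arithmetic for each layout would be:

```c
/* Index arithmetic for the two candidate layouts of a complex array of
   n elements.  All function names are hypothetical.                   */

/* Option 1 -- interleaved: re0 im0 re1 im1 ... in one element stream. */
static long interleaved_real_index(long i)    { return 2L * i; }
static long interleaved_imag_index(long i)    { return 2L * i + 1L; }

/* Option 2 -- planar: all n real components first, then all n
   imaginary components as a second identical array.                   */
static long planar_real_index(long i)         { return i; }
static long planar_imag_index(long i, long n) { return n + i; }
```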
 
(16) Data compression is a subject with no simple best answer. 
     Different algorithms may be considered best in terms of compression
     ratio, read/write times, complexity, or applicability. The 
     algorithm proposed is simple and should be very widely applicable, 
     but certainly does not attempt to obtain optimum compression.

(17) More general detector orientation information has deliberately been
     avoided in the first stage of defining the CBF format, but even to
     describe the sense of an image from an area-detector a certain
     amount of external geometrical information is necessary. 

     Does the IUCr have a standard coordinate system to define arbitrary 
     detector position and orientation ?

(18) Other definitions of the sample to detector distance are possible,
     and in use.

     How should the sample to detector distance be defined ?

(19) From the CIF core dictionary, '_audit_creation_date' defines only the
date of creation. The time of creation needs to be defined to
a precision of fractions of a second. Either the '_audit_creation_date' 
definition needs to be extended to cover times, or an 
'_audit_creation_time' data name is needed, or a new 
'_image_creation_date_time' name is needed.

     How is the time of creation of data best stored ?

