Discussion List Archives


CBF file structuring


Hello Everyone,

    Following the imgcif workshop I've started updating my CBF definition
document. Here is the part which refers to the overall file structuring
and internal identifiers. This reworking follows the ideas discussed at
the workshop, e.g. the structure now allows for multiple binary sections.
A detailed proposal is provided for all the identifiers including the
start of the binary sections.

Please read this over and let me have your comments, both on the clarity,
or otherwise, of the text and on the proposed structure.

Best Regards,

             Andy

-------------------------------------------------------------------------------


2.0 OVERVIEW OF THE FORMAT
--------------------------

The following describes the major "components" of the CBF format.

1. CBF is a binary file, containing self-describing array data e.g. one
   or more images, and auxiliary data e.g. describing the experiment.

2. It consists of pseudo-ASCII header sections, which are "lines" of 
   ASCII characters separated by the carriage return and line-feed
   characters (ASCII 13, ASCII 10), followed by zero, one, or more binary
   sections. This structure may be repeated.

3. The very start of the file has an identification item (2). This item
   also describes the CBF version or level. The identifier is:

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION

which must always be present so that a program can easily identify
whether or not a file is a CBF, by simply inputting the first 41 
characters. (The space is a blank (ASCII 32) and not a tab. All
identifier characters are uppercase only.)

(QUESTION: Should all identifiers be case sensitive or not ? Presently
I'm assuming that they're all upper-case only.)

The first hash means that this line within a CIF would be a comment
line, but the three hashes mean that this is a line describing the 
binary file layout for CBF. (All CBF internal identifiers start with 
the three hashes.) No whitespace may precede the first hash sign.

Following the file identifier is the version number of the file, e.g.
the full line might appear as:

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

The version number must be separated from the file identifier
characters by whitespace e.g. a blank (ASCII 32).

The version number is defined as a major version number and minor
version number separated by the decimal point. A change in the major 
version may well mean that a program for the previous version cannot
input the new version as some major change has occurred to CBF (3). A
change in the minor version may also mean incompatibility, if the CBF
has been written using some new feature. e.g. a new form of linearity 
scaling may be specified and this would be considered a minor version
change. A file containing the new feature would not be readable by a
program supporting only an older version of the format.
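
As a rough illustration of item 3, a reading program might test for the
identifier and split off the version number along the following lines.
This is only a Python sketch: the function name parse_cbf_identifier is
hypothetical, and it assumes that the first line ends with the carriage
return, line-feed pair like the other lines.

```python
MAGIC = b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION"  # 41 characters


def parse_cbf_identifier(data: bytes):
    """Return (major, minor) if data starts with the CBF identifier, else None."""
    # No whitespace may precede the first hash, so compare from byte 0.
    if data[:len(MAGIC)] != MAGIC:
        return None
    # Assumption: the version runs up to the first CR LF pair.
    end = data.find(b"\r\n", len(MAGIC))
    if end < 0:
        return None
    rest = data[len(MAGIC):end]
    # The version must be separated from the identifier by whitespace.
    if not rest[:1].isspace():
        return None
    major, sep, minor = rest.strip().partition(b".")
    if not (sep and major.isdigit() and minor.isdigit()):
        return None
    return int(major), int(minor)
```

A program would typically refuse, or at least warn about, a file whose
major version is newer than the one it supports.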

4a. The start of a header section is delimited by the following special
identifier:

###_START_OF_HEADER

followed by the carriage return, line-feed pair.

4b. A header section, including the identification items which delimit
it, uses only ASCII characters, and is divided into "lines". The "line
separator" symbols  (carriage return, line-feed) are the same regardless 
of the operating system on which the file is written. (This is an 
important difference from CIF, but must be so, as the file contains 
binary data and so cannot be translated from one O.S. to another, as is 
done for ASCII text files.) 

4c. The header section within the delimiting identification items
obeys all CIF rules [1], with the exception of the line separators.

e.g.

o "Lines" are a maximum of 80 characters long. (For CBF it is probably
   best to allow for this maximum to be larger.)

o All data names start with an underscore character.

o The hash symbol (#) (outside a character string) means that all text
  up to the line separator is a comment.

o Whitespace outside of character strings is not significant.

o Data names are case insensitive.

o The data item follows the data name separator, and may be of one of
  two types: text string (char) or number (numb). (The type is
  specified for each data name.)

o Text strings may be delimited with single or double quotes, or blocks of
  text may be delimited by semi-colons occurring as the first character on
  a line.

o The 'loop_' mechanism allows a data name to have multiple values.

Any CIF data name may occur within the header section.

4d. A single header section may contain one or more data blocks (CIF 
    terminology).

4e. The end of the header section is delimited by the following special
identifier:

###_END_OF_HEADER

followed by carriage return, line-feed.

6. The header section must contain sufficient data names to fully
describe the binary data section(s) which follow(s).

7a. The start of a binary data section is identified by a special
identifier of 32 bytes which mixes ASCII and binary such that the start 
of the identifier can easily be found using string search methods. 

The ASCII part is:

###_START_OF_BIN

The full identifier is:

 Byte No. ASCII Symbol                  Byte Value (unsigned) (decimal)
          ------------

    1          #                               35
    2          #                               35
    3          #                               35
    4          _                               95 
    5          S                               83
    6          T                               84
    7          A                               65
    8          R                               82
    9          T                               84
   10          _                               95
   11          O                               79
   12          F                               70
   13          _                               95
   14          B                               66
   15          I                               73
   16          N                               78
   17      Form-feed                           12
   18     Substitute  (Control-Z)              26
   19     End of Transmission (Control-D)      04
   20                                         213
   21                    }    Bytes 21 - 24 define the binary section 
   22                    }    identifier. This is a 32-bit unsigned
   23                    }    little-endian integer. The number relates this
   24                    }    section to data defined in the header section.
   25                      }   
   26                      }   Bytes 25 - 32 define the length of
   27                      }   the following binary section in bytes
   28                      }   as a 64-bit unsigned little-endian
   29                      }   integer. (The value 0 means the
   30                      }   size is unknown, and no other
   31                      }   pseudo-ASCII nor binary sections may
   32                      }   follow.)

The binary characters serve specific purposes:

   o The form feed will separate the ASCII lines from the binary 
     sections if the file is listed on most operating systems.

   o The Control-Z will stop the listing of the file on MS-DOS
     type operating systems.

   o The Control-D will stop the listing of the file on Unix
     type operating systems.

   o The unsigned byte value 213 (decimal) is binary 11010101.
     This has the eighth bit set so can be used for error checking
     on 7-bit transmission. The bit pattern is also asymmetric, with
     the first bit set as well, so that a reversed bit order could be
     detected (which is not a known concern).

   o (The carriage return, line-feed pair at the end of the first
     and other lines can also be used to check that the file has not
     been corrupted e.g. by being sent by ftp in ASCII mode.)

   o Bytes 21-24 define the binary id of the binary data. This id is
     also used within the header sections, so that binary data
     definitions can be matched to the binary data sections. 32-bits
     allows many many more binary data sections to be addressed than can
     conceivably be needed.
   
   o Bytes 25-32 define the length of the binary section. This provides
     for enormous expansion from present image sizes, but volume and 
     higher dimensional data may need more than 32-bit sizes in the
     future. 

     This value may be set to zero if this is the last binary section or 
     header section in the file. This allows a program writing, for
     example, a single compressed image to avoid having to rewind the
     file to write the size of the compressed data. (For small files
     compression within memory may be practical, and this may not be an
     issue. However very large files exist where writing the compressed
     data "on the fly" may be the only realistic method.)

     Since the data may have been compressed, knowing the number of
     elements and the size of each element does not necessarily tell a
     program how many bytes to jump over, so here it is stored
     explicitly. This also means that the reading program does not
     have to decode information in the header section to move through
     the file.
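
The 32-byte layout tabulated above can be sketched in Python with the
standard struct module. The function names are hypothetical; the byte
values and the little-endian 32-bit and 64-bit fields are taken from
the table.

```python
import struct

# 16 ASCII characters, then form-feed, Control-Z, Control-D and byte 213.
START_OF_BIN = b"###_START_OF_BIN" + bytes([12, 26, 4, 213])
# "<" = little-endian without padding; I = 32-bit unsigned (bytes 21-24);
# Q = 64-bit unsigned (bytes 25-32).
ID_AND_LENGTH = struct.Struct("<IQ")


def pack_start_of_binary(binary_id: int, length: int) -> bytes:
    """Build the full 32-byte start-of-binary identifier."""
    return START_OF_BIN + ID_AND_LENGTH.pack(binary_id, length)


def unpack_start_of_binary(ident: bytes):
    """Return (binary_id, length); a length of 0 means 'size unknown'."""
    if len(ident) != 32 or ident[:20] != START_OF_BIN:
        raise ValueError("not a start-of-binary identifier")
    return ID_AND_LENGTH.unpack(ident[20:])
```

A reader searching a file for the ASCII part can pass the 32 bytes
starting at the match to unpack_start_of_binary, which also checks the
four binary marker bytes.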

(QUESTION: To fit this into 32 bytes I cut "BINARY" to "BIN". I think
this should be some even number of characters and maybe multiple of 4 or
8, and probably a power of two has advantages. Should we leave this as 
I've defined, or use more characters ?)

7b. The "start of binary identifier" must be separated from all other
    identifiers by white space. Usually the "line separator" will
    immediately precede the "start of binary identifier", but blank
    spaces are also allowed.

7c. The binary data does not have to completely fill the bytes defined
    by the byte length value, but clearly cannot be greater than this
    value (except when the value zero has been stored, which means that
    the size is unknown, and no other headers follow). The values of
    any unused bytes are undefined.

7d. At exactly the byte following the full binary section as defined by
    the length value is the end of binary section identifier:

    ###_END_OF_BINARY 

    followed by the carriage return / line feed pair.

    This identifier is in a sense redundant since the binary section
    length value tells a program how many bytes to jump over to
    the end of the binary section. However, this redundancy has been
    deliberately added for error checking, and for possible file
    recovery in the case of a corrupted file.
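
A minimal sketch of how a reader might use this redundancy is given
below, in Python. The function name skip_binary_section is hypothetical,
and the file object is assumed to be positioned on the first byte after
the 32-byte start of binary identifier.

```python
import io

END_OF_BINARY = b"###_END_OF_BINARY\r\n"


def skip_binary_section(f, length: int) -> None:
    """Jump over a binary section and check the end identifier follows it."""
    if length == 0:
        # Length value 0 means the size is unknown, so there is nothing to skip by.
        raise ValueError("length unknown: the section cannot be skipped")
    f.seek(length, io.SEEK_CUR)
    marker = f.read(len(END_OF_BINARY))
    if marker != END_OF_BINARY:
        raise ValueError("end of binary identifier not found; file may be corrupted")
```

On failure, a recovery tool could fall back to a string search for the
next identifier beginning with the three hashes.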

8. Whitespace may be used within the pseudo-ASCII sections prior to the
   "start of binary section" identifier to align the start of binary
   data sections to word or block boundaries. Similar use may be made of
   unused bytes in binary sections. 

   However, in general no guarantee is made of block nor word alignment
   in a CBF of unknown origin.
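
As an illustration only, a writer could compute the padding as below
(Python). The function name is hypothetical, and the sketch assumes it
is the first byte after the padding that should land on the boundary;
a writer that wants the binary data itself aligned would add the 32
bytes of the identifier to the offset first.

```python
def alignment_padding(offset: int, boundary: int) -> bytes:
    """Blanks to write at `offset` so the next byte falls on a multiple of `boundary`."""
    # -offset % boundary is the distance from offset up to the next multiple
    # of boundary, and is 0 when offset is already aligned.
    return b" " * (-offset % boundary)
```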

9. The end of the file is explicitly indicated by the:

###_END_OF_CBF

identifier (including the carriage return, line-feed pair)

10. All binary sections in a single header section must follow the
    header section prior to another header section, or the end of the
    file. 

    The binary identifier values used within a header section, and 
    hence in the immediately following binary section(s), must be unique.

    A different header section may reuse binary identifier values.

    (This allows concatenation of files without renumbering the
    binary identifiers, and provides a certain level of localisation
    of data within the file, to avoid programs having to search 
    potentially huge files for missing binary sections.)

11. The recommended file extension for a CBF is: cbf
This allows users to recognise file types easily, and gives programs a 
chance to "know" the file type without having to prompt the user.

12. CBF format files are binary files and when ftp is used to transfer
files between different computer systems "binary" or "image" mode
transfer should be selected.

2.1 SIMPLE EXAMPLE OF THE ORDERING OF IDENTIFIERS
-------------------------------------------------

Here only the ASCII part of the file structuring identifiers is shown.
The CIF data items are not shown, apart from the 'data_' identifier
which indicates the beginning of a data block.

This shows the structuring of a simple example, i.e. one header section
followed by one binary section, such as could be used to store a
single image.

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

###_START_OF_HEADER

data_

###_END_OF_HEADER

###_START_OF_BIN



###_END_OF_BINARY

###_END_OF_CBF
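
For illustration, the same ordering could be written out byte for byte
as in the Python sketch below. The function name build_simple_cbf and
the data block name used in the usage note are hypothetical. The sketch
assumes that the binary data starts immediately after the 32-byte
identifier and that the caller supplies header text which already uses
carriage return, line-feed line separators.

```python
import struct

CRLF = b"\r\n"


def build_simple_cbf(header_text: bytes, binary_id: int, payload: bytes) -> bytes:
    """Assemble one header section followed by one binary section."""
    start_of_bin = (
        b"###_START_OF_BIN"
        + bytes([12, 26, 4, 213])  # form-feed, Control-Z, Control-D, 213
        + struct.pack("<IQ", binary_id, len(payload))  # id, length in bytes
    )
    return b"".join([
        b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0", CRLF,
        b"###_START_OF_HEADER", CRLF,
        header_text,
        b"###_END_OF_HEADER", CRLF,
        start_of_bin,
        payload,
        b"###_END_OF_BINARY", CRLF,
        b"###_END_OF_CBF", CRLF,
    ])
```

For example, build_simple_cbf(b"data_image_1\r\n", 1, pixels) would
give a file with a single binary section whose identifier is 1.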


2.2 MORE COMPLICATED EXAMPLE OF THE ORDERING OF IDENTIFIERS
-----------------------------------------------------------

Here only the ASCII part of the file structuring identifiers is shown.
The CIF data items are not shown, apart from the 'data_' identifier
which indicates the beginning of a data block.

This shows a possible structuring of a more complicated example. There
are two header sections, and the first contains two data blocks and
defines three binary sections. CIF comment lines, starting with a hash
(#), are used to explain the structure.

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

# A comment cannot appear before the file identifier, but can appear
# anywhere else, except within the binary sections.

###_START_OF_HEADER

# Here the first data block starts
data_

# The 'data_' identifier finishes the first data block and starts the
# second
data_

###_END_OF_HEADER

# The first header section is finished, but the first binary section
# does not start until the 'start of binary' identifier is found. This
# part of the file is still pseudo-ASCII.

###_START_OF_BIN



###_END_OF_BINARY

# Following the 'end of binary' identifier the file is pseudo-ASCII
# again, so comments are valid up to the next 'start of binary'
# identifier.

# Second binary section. 

###_START_OF_BIN



###_END_OF_BINARY

# Third binary section.

###_START_OF_BIN



###_END_OF_BINARY

# Second Header section

###_START_OF_HEADER

data_

###_END_OF_HEADER

# Since this is the last binary section in the file, the byte length could
# optionally be set to zero, which indicates it is undefined. (All the
# other binary sections must have these values defined to allow the
# reader software to jump over sections.)

###_START_OF_BIN



###_END_OF_BINARY

###_END_OF_CBF







International Union of Crystallography
