Discussion List Archives


RE: Simple file header


Here I'm responding to the points raised by Jim.

> Presumably 100.5e-6 by 99.5e-6 is the nominal pixel size in meters.

The data items I've used come directly from John Westbrook's DDL2-based
imgCIF dictionary, although in a few cases I slightly changed the names
and noted this. John has made the dictionary available on the web. The
document can be found via a link from this URL:

http://ndbserver.rutgers.edu/NDB/mmcif/index.html

(I think it's one level down, but the network's too saturated at present
for me to check, or get the actual URL)

'_array_element_size.size' is defined in the dictionary. With a few 
reasonably obvious exceptions the CIF data items use SI units.


> Help me out please.  What are the 32 bytes again?  Also, the length of
> the binary section was computed, but wasn't there supposed to be a length
> identifier?  I think it would be difficult to compute the size of compressed 
> binary data from any info in a header.  Or do we just read until 
> '###_END_OF_BIN'?  Or until '<cr> <lf>###_END_OF_BIN'?  Or until ...?

The start of the binary section is a crucial point, so here I'll repeat
the appropriate section. (I'll try to make the document available on the
web, so that it can be consulted easily, but this will not happen this week
since we have the "50 years of synchrotron radiation" conference here, and
things are reasonably chaotic at the ESRF as a result.)

So here's the identifier I proposed:

--------------------------------------

7a. The start of a binary data section is identified by a special
identifier of 32 bytes which mixes ASCII and binary such that the start 
of the identifier can easily be found using string search methods. 

The ASCII part is:

###_START_OF_BIN

The full identifier is:

 Byte No. ASCII Symbol                  Byte Value (unsigned) (decimal)
          ------------

    1          #                               35
    2          #                               35
    3          #                               35
    4          _                               95 
    5          S                               83
    6          T                               84
    7          A                               65
    8          R                               82
    9          T                               84
   10          _                               95
   11          O                               79
   12          F                               70
   13          _                               95
   14          B                               66
   15          I                               73
   16          N                               78
   17      Form-feed                           12
   18     Substitute  (Control-Z)              26
   19     End of Transmission (Control-D)      04
   20                                         213
   21                    }    Bytes 21 - 24 define the binary section
   22                    }    identifier, a 32-bit unsigned little-endian
   23                    }    integer. The number is used to relate the
   24                    }    binary data to its definition in the header.
   25                      }   
   26                      }   Bytes 25 - 32 define the length of
   27                      }   the following binary section in bytes
   28                      }   as a 64-bit unsigned little-endian
   29                      }   integer. (The value 0 means the
   30                      }   size is unknown, and no other
   31                      }   pseudo-ASCII nor binary sections may
   32                      }   follow.)
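As an illustration only, the layout in the table above can be expressed
with Python's standard struct module. This is a minimal sketch, not part of
the proposal: the names MAGIC, CONTROL, LAYOUT, pack_identifier and
unpack_identifier are made up for the example.

```python
import struct

MAGIC = b"###_START_OF_BIN"           # bytes 1-16, the ASCII part
CONTROL = bytes([12, 26, 4, 213])     # bytes 17-20: FF, Ctrl-Z, Ctrl-D, 213

# "<" selects little-endian with no padding;
# I is a 32-bit unsigned integer, Q is a 64-bit unsigned integer.
LAYOUT = struct.Struct("<16s4sIQ")


def pack_identifier(section_id, length=0):
    """Build the 32-byte identifier; length 0 means 'size unknown'."""
    return LAYOUT.pack(MAGIC, CONTROL, section_id, length)


def unpack_identifier(raw):
    """Return (section_id, length) from a 32-byte identifier."""
    magic, control, section_id, length = LAYOUT.unpack(raw)
    if magic != MAGIC or control != CONTROL:
        raise ValueError("not a start-of-binary identifier")
    return section_id, length
```

For the uncompressed 768 by 512 example with 2 bytes per element,
pack_identifier(1, 786432) would give the identifier for binary section 1.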

The binary characters serve specific purposes:

   o The form feed will separate the ASCII lines from the binary 
     sections if the file is listed on most operating systems.

   o The Control-Z will stop the listing of the file on MS-DOS
     type operating systems.

   o The Control-D will stop the listing of the file on Unix
     type operating systems.

   o The unsigned byte value 213 (decimal) is binary 11010101.
     This has the eighth bit set, so it can be used for error checking
     on 7-bit transmission. It is also asymmetric, yet has the first
     bit set as well, in case the bit order is ever reversed
     (which is not a known concern).

   o (The carriage return, line-feed pair at the end of the first
     and other lines can also be used to check that the file has not
     been corrupted e.g. by being sent by ftp in ASCII mode.)

   o Bytes 21-24 define the binary id of the binary data. This id is
     also used within the header sections, so that binary data
     definitions can be matched to the binary data sections. 32-bits
     allows many many more binary data sections to be addressed than can
     conceivably be needed.
   
   o Bytes 25-32 define the length in bytes of the binary section. The
     64-bit field provides for enormous expansion from present image
     sizes, since volume and higher-dimensional data may need more than
     32-bit sizes in the future.

     This value may be set to zero if this is the last binary section or 
     header section in the file. This allows a program writing, for
     example, a single compressed image to avoid having to rewind the
     file to write the size of the compressed data. (For small files
     compression within memory may be practical, and this may not be an
     issue. However very large files exist where writing the compressed
     data "on the fly" may be the only realistic method.) It is however
     recommended that this value be set, as it permits concatenation of
     files.

     Since the data may have been compressed, knowing the numbers
     of elements and size of each element does not necessarily tell a
     program how many bytes to jump over, so here it is stored explicitly.
     This also means that the reading program does not have to decode
     information in the header section to move through the file.
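To illustrate the last point, here is a rough sketch of how a reader might
step through a file using only the identifier, without decoding any header
information. It assumes the whole file has been read into memory as bytes;
the function name list_binary_sections is invented for the example.

```python
import struct

MAGIC = b"###_START_OF_BIN"
ID_SIZE = 32  # total size of the start-of-binary identifier


def list_binary_sections(data):
    """Return (section_id, payload_offset, length) for each binary section.

    length is None when the stored size is 0, i.e. unknown.
    """
    sections = []
    pos = data.find(MAGIC)
    while pos != -1:
        if pos + ID_SIZE > len(data):
            raise ValueError("truncated identifier at offset %d" % pos)
        # Bytes 21-24: section id; bytes 25-32: length (both little-endian).
        section_id, length = struct.unpack_from("<IQ", data, pos + 20)
        payload = pos + ID_SIZE
        if length == 0:
            # Size unknown: nothing else may follow, so stop here.
            sections.append((section_id, payload, None))
            break
        sections.append((section_id, payload, length))
        # Jump over the payload without looking inside it.
        pos = data.find(MAGIC, payload + length)
    return sections
```

Because the search resumes only after the stored length, binary data that
happens to contain the ASCII string '###_START_OF_BIN' is not mistaken for
a new section.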

(QUESTION: To fit this into 32 bytes I cut "BINARY" to "BIN". I think
the identifier length should be an even number of bytes, maybe a multiple
of 4 or 8, and a power of two probably has advantages. Should we leave
this as I've defined it, or use more characters?)

--------------------------------------

I agree with Jim that I've changed the binary length information from
being an imgCIF data item to being something which is defined within the
CBF structure. The former was assumed at the meeting. I don't feel strongly
about it, but it seems to me that this information belongs more to CBF and
less to CIF: if you took such a file and translated it to a pure CIF, such
a data item would lose any purpose. Having the binary byte length information
in a fixed position relative to the binary data seems slightly easier to
program than having it at a variable position relative to the binary
data. With the information defined within the CBF identifier a program
could be written to check the integrity of a CBF, but which knew nothing 
about CIF, other than to ignore the insides of multi-line strings.

Are there strong feelings about this? Should the binary byte length
be stored in a CIF data item?

In the example I calculated the size from info in the header i.e.

> # 768*512*2 = 786432 bytes (or more, ...

However, this was just for this simple example, which didn't use any data
compression. (This was done for simplicity, and also because I want to
rewrite the data compression part of the proposal, but later.) In the
case of a compressed image, a writing program would either compress in
memory, and therefore know what value to set prior to outputting the
compressed data, or would compress on the fly, and would then have to
seek back ("fseek") to the position in the file where this information
is stored.
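The "compress on the fly, then seek back" case might look roughly like the
following sketch. Note the assumptions: zlib is used purely as a stand-in
compressor (it is not the compression scheme of the proposal), the output
must be seekable, and the name write_binary_section is invented here.

```python
import io
import struct
import zlib


def write_binary_section(f, section_id, chunks):
    """Write one compressed binary section to a seekable binary file f."""
    start = f.tell()
    f.write(b"###_START_OF_BIN" + bytes([12, 26, 4, 213]))
    # Length is written as 0 (unknown) for now and patched afterwards.
    f.write(struct.pack("<IQ", section_id, 0))

    compressor = zlib.compressobj()   # stand-in compressor, an assumption
    written = 0
    for chunk in chunks:
        out = compressor.compress(chunk)
        f.write(out)
        written += len(out)
    out = compressor.flush()
    f.write(out)
    written += len(out)

    end = f.tell()
    f.seek(start + 24)                # bytes 25-32 hold the length
    f.write(struct.pack("<Q", written))
    f.seek(end)                       # continue writing after the data
    return written
```

A program writing to a pipe or other non-seekable output could not do the
final patch, which is the situation where leaving the length as zero is
useful.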

So no, the size should be written within the identifier (or a data item)
so any reading program knows exactly where it SHOULD find the 'end of binary'
identifier. That identifier is then an added, redundant safety check.
Only in the case of a corrupted file should a program resort to reading
through the binary section until it finds the identifier.

I suggested the identifier as '###_END_OF_BINARY<cr> <lf>', but I think 
Jim's right in suggesting '<cr> <lf>###_END_OF_BINARY<cr> <lf>' as a 
better alternative. (Unless I hear protests, I'll change the proposal to 
include the "line separator" prior to the '###'.)
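A reader's use of the end identifier as a redundant check could be sketched
as follows. This assumes the '<cr> <lf>###_END_OF_BINARY<cr> <lf>' form
with the carriage return and line feed adjacent (no space between them),
and data held in memory as bytes; find_end_of_binary is an invented name.

```python
END_MARKER = b"\r\n###_END_OF_BINARY\r\n"   # assumed form of the identifier


def find_end_of_binary(data, payload_start, length):
    """Return (marker_offset, at_expected_position).

    payload_start is the offset of the first byte of binary data and
    length is the value stored in the start identifier (0 = unknown).
    """
    expected = payload_start + length
    if length and data.startswith(END_MARKER, expected):
        return expected, True
    # Stored size missing or wrong (e.g. corrupted file): fall back to
    # scanning the binary section for the marker.
    found = data.find(END_MARKER, payload_start)
    if found == -1:
        raise ValueError("end-of-binary identifier not found")
    return found, False
```

The fallback scan is only a best effort, since compressed data could in
principle contain the same byte sequence by chance.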


Regards,


      Andy







International Union of Crystallography
