CBF file structuring
- To: imgcif-l@bnl.gov
- Subject: CBF file structuring
- From: Andy Hammersley <hammersl@esrf.fr>
- Date: Wed, 5 Nov 97 17:04:28 +0100
Hello Everyone,

Following the imgcif workshop I've started updating my CBF definition
document. Here is the part which refers to the overall file structuring
and internal identifiers. This reworking follows the ideas discussed at
the workshop, e.g. the structure now allows for multiple binary sections.
A detailed proposal is provided for all the identifiers, including the
start of the binary sections.

Please read over this and let me have your comments, both on the clarity,
or otherwise, of the text, and on the proposed structure.

Best Regards,

Andy

-------------------------------------------------------------------------------

2.0 OVERVIEW OF THE FORMAT
--------------------------

The following describes the major "components" of the CBF format.

1.  CBF is a binary file, containing self-describing array data, e.g. one
    or more images, and auxiliary data, e.g. describing the experiment.

2.  It consists of pseudo-ASCII header sections, which are "lines" of
    ASCII characters separated by the carriage return and line-feed
    characters (ASCII 13, ASCII 10), followed by zero, one, or more
    binary sections. This structure may be repeated.

3.  The very start of the file has an identification item (2). This item
    also describes the CBF version or level. The identifier is:

    ###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION

    which must always be present so that a program can easily identify
    whether or not a file is a CBF, by simply inputting the first 41
    characters. (The space is a blank (ASCII 32) and not a tab. All
    identifier characters are uppercase only.)

    (QUESTION: Should all identifiers be case sensitive or not?
    Presently I'm assuming that they're all upper-case only.)

    The first hash means that this line within a CIF would be a comment
    line, but the three hashes mean that this is a line describing the
    binary file layout for CBF. (All CBF internal identifiers start with
    the three hashes.) No whitespace may precede the first hash sign.
    Following the file identifier is the version number of the file, e.g.
    the full line might appear as:

    ###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

    The version number must be separated from the file identifier
    characters by whitespace, e.g. a blank (ASCII 32).

    The version number is defined as a major version number and minor
    version number separated by the decimal point. A change in the major
    version may well mean that a program for the previous version cannot
    input the new version, as some major change has occurred to CBF (3).
    A change in the minor version may also mean incompatibility, if the
    CBF has been written using some new feature. For example, a new form
    of linearity scaling may be specified, and this would be considered a
    minor version change. A file containing the new feature would not be
    readable by a program supporting only an older version of the format.

4a. The start of a header section is delimited by the following special
    identifier:

    ###_START_OF_HEADER

    followed by the carriage return, line-feed pair.

4b. A header section, including the identification items which delimit
    it, uses only ASCII characters, and is divided into "lines". The
    "line separator" symbols (carriage return, line-feed) are the same
    regardless of the operating system on which the file is written.
    (This is an important difference from CIF, but must be so, as the
    file contains binary data and so cannot be translated from one O.S.
    to another, as ASCII text files can.)

4c. The header section within the delimiting identification items obeys
    all CIF rules [1], with the exception of the line separators, e.g.

    o "Lines" are a maximum of 80 characters long. (For CBF it is
      probably best to allow for this maximum to be larger.)

    o All data names start with an underscore character.

    o The hash symbol (#) (outside a character string) means that all
      text up to the line separator is a comment.

    o Whitespace outside of character strings is not significant.

    o Data names are case insensitive.
    o The data item follows the data name separator, and may be of one
      of two types: text string (char) or number (numb). (The type is
      specified for each data name.)

    o Text strings may be delimited with single or double quotes, or
      blocks of text may be delimited by semi-colons occurring as the
      first character on a line.

    o The 'loop_' mechanism allows a data name to have multiple values.

    Any CIF data name may occur within the header section.

4d. A single header section may contain one or more data blocks (CIF
    terminology).

4e. The end of the header section is delimited by the following special
    identifier:

    ###_END_OF_HEADER

    followed by carriage return, line-feed.

6.  The header section must contain sufficient data names to fully
    describe the binary data section(s) which follow(s).

7a. The start of a binary data section is identified by a special
    identifier of 32 bytes which mixes ASCII and binary such that the
    start of the identifier can easily be found using string search
    methods. The ASCII part is:

    ###_START_OF_BIN

    The full identifier is:

    Byte No.  ASCII Symbol                      Byte Value
                                                (unsigned, decimal)
    --------  --------------------------------  -------------------
     1        #                                  35
     2        #                                  35
     3        #                                  35
     4        _                                  95
     5        S                                  83
     6        T                                  84
     7        A                                  65
     8        R                                  82
     9        T                                  84
    10        _                                  95
    11        O                                  79
    12        F                                  70
    13        _                                  95
    14        B                                  66
    15        I                                  73
    16        N                                  78
    17        Form-feed                          12
    18        Substitute (Control-Z)             26
    19        End of Transmission (Control-D)     4
    20                                          213
    21 - 24   Bytes 21 - 24 define the binary section identifier. This
              is a 32-bit unsigned little-endian integer. The number is
              used to relate the binary section to the data defined in
              the header section.
    25 - 32   Bytes 25 - 32 define the length of the following binary
              section in bytes as a 64-bit unsigned little-endian
              integer. (The value 0 means the size is unknown, and no
              other pseudo-ASCII or binary sections may follow.)

    The binary characters serve specific purposes:

    o The form feed will separate the ASCII lines from the binary
      sections if the file is listed on most operating systems.
    o The Control-Z will stop the listing of the file on MS-DOS type
      operating systems.

    o The Control-D will stop the listing of the file on Unix type
      operating systems.

    o The unsigned byte value 213 (decimal) is binary 11010101. This has
      the eighth bit set, so it can be used for error checking on 7-bit
      transmission. It is also asymmetric, but with the first bit also
      set in case the bit order could be reversed (which is not a known
      concern).

    o (The carriage return, line-feed pair at the end of the first and
      other lines can also be used to check that the file has not been
      corrupted, e.g. by being sent by ftp in ASCII mode.)

    o Bytes 21-24 define the binary id of the binary data. This id is
      also used within the header sections, so that binary data
      definitions can be matched to the binary data sections. 32 bits
      allows far more binary data sections to be addressed than can
      conceivably be needed.

    o Bytes 25-32 define the length of the binary section. 64 bits
      provides for enormous expansion from present image sizes; volume
      and higher-dimensional data may need more than 32-bit sizes in the
      future.

      This value may be set to zero if this is the last binary section
      or header section in the file. This allows a program writing, for
      example, a single compressed image to avoid having to rewind the
      file to write the size of the compressed data. (For small files
      compression within memory may be practical, and this may not be an
      issue. However, very large files exist where writing the
      compressed data "on the fly" may be the only realistic method.)

      Since the data may have been compressed, knowing the number of
      elements and the size of each element does not necessarily tell a
      program how many bytes to jump over, so here it is stored
      explicitly. This also means that the reading program does not have
      to decode information in the header section to move through the
      file.

    (QUESTION: To fit this into 32 bytes I cut "BINARY" to "BIN".
    I think this should be some even number of characters and maybe a
    multiple of 4 or 8, and probably a power of two has advantages.
    Should we leave this as I've defined, or use more characters?)

7b. The "start of binary identifier" must be separated from all other
    identifiers by white space. Usually the "line separator" will
    immediately precede the "start of binary identifier", but blank
    spaces are also allowed.

7c. The binary data does not have to completely fill the bytes defined
    by the byte length value, but clearly cannot be greater than this
    value (except when the value zero has been stored, which means that
    the size is unknown, and no other headers follow). The value of any
    unused bytes is undefined.

7d. At exactly the byte following the full binary section, as defined by
    the length value, is the end of binary section identifier:

    ###_END_OF_BINARY

    followed by the carriage return / line feed pair.

    This identifier is in a sense redundant, since the binary section
    length value tells a program how many bytes to jump over to the end
    of the binary section. However, this redundancy has been
    deliberately added for error checking, and for possible file
    recovery in the case of a corrupted file.

8.  Whitespace may be used within the pseudo-ASCII sections prior to the
    "start of binary section" identifier to align the start of binary
    data sections to word or block boundaries. Similar use may be made
    of unused bytes in binary sections. However, in general no guarantee
    is made of block or word alignment in a CBF of unknown origin.

9.  The end of the file is explicitly indicated by the:

    ###_END_OF_CBF

    identifier (including the carriage return, line-feed pair).

10. All binary sections defined in a single header section must follow
    that header section, prior to another header section or the end of
    the file. The binary identifier values used within a header section,
    and hence the immediately following binary section(s), must be
    unique.
    A different header section may reuse binary identifier values. (This
    allows concatenation of files without renumbering the binary
    identifiers, and provides a certain level of localisation of data
    within the file, to avoid programs having to search potentially huge
    files for missing binary sections.)

11. The recommended file extension for a CBF is: cbf

    This allows users to recognise file types easily, and gives programs
    a chance to "know" the file type without having to prompt the user.

12. CBF format files are binary files, and when ftp is used to transfer
    files between different computer systems "binary" or "image" mode
    transfer should be selected.

2.1 SIMPLE EXAMPLE OF THE ORDERING OF IDENTIFIERS
-------------------------------------------------

Here only the ASCII part of the file structuring identifiers is shown.
The CIF data items are not shown, apart from the 'data_' identifier
which indicates the beginning of a data block.

This shows the structuring of a simple example, i.e. one header section
followed by one binary section, such as could be used to store a single
image.

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0
###_START_OF_HEADER
data_
###_END_OF_HEADER
###_START_OF_BIN
###_END_OF_BINARY
###_END_OF_CBF

2.2 MORE COMPLICATED EXAMPLE OF THE ORDERING OF IDENTIFIERS
-----------------------------------------------------------

Here only the ASCII part of the file structuring identifiers is shown.
The CIF data items are not shown, apart from the 'data_' identifier
which indicates the beginning of a data block.

This shows a possible structuring of a more complicated example. There
are two header sections; the first contains two data blocks and defines
three binary sections. CIF comment lines, starting with a hash (#), are
used to explain the structure.

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0
# A comment cannot appear before the file identifier, but can appear
# anywhere else, except within the binary sections.
###_START_OF_HEADER
# Here the first data block starts
data_
# The 'data_' identifier finishes the first data block and starts the
# second
data_
###_END_OF_HEADER
# The first header section is finished, but the first binary section
# does not start until the 'start of binary' identifier is found. This
# part of the file is still pseudo-ASCII.
###_START_OF_BIN
###_END_OF_BINARY
# Following the 'end of binary' identifier the file is pseudo-ASCII
# again, so comments are valid up to the next 'start of binary'
# identifier.
# Second binary section.
###_START_OF_BIN
###_END_OF_BINARY
# Third binary section.
###_START_OF_BIN
###_END_OF_BINARY
# Second Header section
###_START_OF_HEADER
data_
###_END_OF_HEADER
# Since this is the last binary section in the file, the byte length
# could optionally be set to zero, which indicates it is undefined. (All
# the other binary sections must have these values defined to allow the
# reader software to jump over sections.)
###_START_OF_BIN
###_END_OF_BINARY
###_END_OF_CBF
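
To make points 3 and 7a concrete, here is a minimal Python sketch of how a
reader might check the file identifier line and decode the 32-byte "start
of binary" identifier. The constant and function names (FILE_ID, BIN_MAGIC,
parse_file_version, pack_bin_header, parse_bin_header) are illustrative
assumptions and not part of the proposal, and the layout simply follows the
draft as written above, so it would change if "BIN" were lengthened.

```python
import struct

# Point 3: the 41-character file identifier (hypothetical constant name).
FILE_ID = b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION"

# Point 7a, bytes 1-20: ASCII part, then form-feed, Control-Z, Control-D, 213.
BIN_MAGIC = b"###_START_OF_BIN" + bytes([12, 26, 4, 213])

# "<" = little-endian, no padding; 20s = bytes 1-20;
# I = 32-bit unsigned id (bytes 21-24); Q = 64-bit unsigned length (25-32).
BIN_HEADER = struct.Struct("<20sIQ")


def parse_file_version(first_line):
    """Return (major, minor) from the first line of a CBF, else ValueError."""
    if not first_line.startswith(FILE_ID):
        raise ValueError("not a CBF: file identifier missing")
    rest = first_line[len(FILE_ID):]
    if not rest[:1].isspace():
        raise ValueError("version must be separated by whitespace")
    major, _, minor = rest.strip().partition(b".")
    return int(major), int(minor)


def pack_bin_header(section_id, length=0):
    """Build the 32-byte identifier; length 0 means 'size unknown'."""
    return BIN_HEADER.pack(BIN_MAGIC, section_id, length)


def parse_bin_header(buf, offset=0):
    """Return (section_id, length) for the identifier starting at offset."""
    if len(buf) - offset < BIN_HEADER.size:
        raise ValueError("truncated start-of-binary identifier")
    magic, section_id, length = BIN_HEADER.unpack_from(buf, offset)
    if magic != BIN_MAGIC:
        raise ValueError("start-of-binary identifier missing or corrupted")
    return section_id, length
```

A writer would call pack_bin_header(7, 4096) before a 4096-byte binary
section with id 7; a reader can locate the identifier with an ordinary
string search for b"###_START_OF_BIN", pass that offset to
parse_bin_header, and then skip the returned number of bytes to reach the
"end of binary" identifier.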