RE: Simple file header
- To: imgcif-l@bnl.gov
- Subject: RE: Simple file header
- From: Andy Hammersley <hammersl@esrf.fr>
- Date: Mon, 17 Nov 97 19:02:48 +0100
Here I'm responding to the points raised by Jim.

> Presumably 100.5e-6 by 99.5e-6 is the nominal pixel size in meters.

The data items I've used are taken directly from John Westbrook's DDL2 based imgCIF dictionary, although in a few cases I slightly changed the names, and noted this. John has made this available on the web. The document can be found from a link from URL:

http://ndbserver.rutgers.edu/NDB/mmcif/index.html

(I think it's one level down, but the network's too saturated at present for me to check, or get the actual URL)

'_array_element_size.size' is defined in the dictionary. With a few reasonably obvious exceptions the CIF data items use SI units.

> Help me out please. What are the 32 bytes again? Also, the length of
> the binary section was computed, but wasn't there supposed to be a length
> identifier? I think it would be difficult to compute the size of compressed
> binary data from any info in a header. Or do we just read until
> '###_END_OF_BIN'? Or until '<cr> <lf>###_END_OF_BIN'? Or until ...?

The start of the binary section is a crucial point, so here I'll repeat the appropriate section. (I'll try to get the document available on the web, so that it can be consulted easily, but this will not happen this week since we have the "50 years of synchrotron radiation" conference here, and things are reasonably chaotic at the ESRF as a result.)

So here's the identifier I proposed:

--------------------------------------

7a. The start of a binary data section is identified by a special identifier of 32 bytes which mixes ASCII and binary such that the start of the identifier can easily be found using string search methods. The ASCII part is:

###_START_OF_BIN

The full identifier is:

Byte No.   ASCII Symbol                       Byte Value (unsigned, decimal)
--------   --------------------------------   ------------------------------
    1      #                                   35
    2      #                                   35
    3      #                                   35
    4      _                                   95
    5      S                                   83
    6      T                                   84
    7      A                                   65
    8      R                                   82
    9      T                                   84
   10      _                                   95
   11      O                                   79
   12      F                                   70
   13      _                                   95
   14      B                                   66
   15      I                                   73
   16      N                                   78
   17      Form-feed                           12
   18      Substitute (Control-Z)              26
   19      End of Transmission (Control-D)      4
   20                                         213
   21  }   Bytes 21 - 24 define the binary section
   22  }   identifier. This is a 32-bit unsigned
   23  }   little-endian integer. The number is used to relate
   24  }   the binary section to data defined in the header section.
   25  }
   26  }   Bytes 25 - 32 define the length of
   27  }   the following binary section in bytes
   28  }   as a 64-bit unsigned little-endian
   29  }   integer. (The value 0 means the
   30  }   size is unknown, and no other
   31  }   pseudo-ASCII nor binary sections may
   32  }   follow.)

The binary characters serve specific purposes:

o  The form feed will separate the ASCII lines from the binary sections if the file is listed on most operating systems.

o  The Control-Z will stop the listing of the file on MS-DOS type operating systems.

o  The Control-D will stop the listing of the file on Unix type operating systems.

o  The unsigned byte value 213 (decimal) is binary 11010101. This has the eighth bit set, so it can be used for error checking on 7-bit transmission. It is also asymmetric, but with the first bit also set in case the bit order could be reversed (which is not a known concern).

o  (The carriage return, line-feed pair at the end of the first and other lines can also be used to check that the file has not been corrupted e.g. by being sent by ftp in ASCII mode.)

o  Bytes 21-24 define the binary id of the binary data. This id is also used within the header sections, so that binary data definitions can be matched to the binary data sections. 32 bits allows many more binary data sections to be addressed than can conceivably be needed.

o  Bytes 25-32 define the length in bytes of the binary section. This provides for enormous expansion from present image sizes, but volume and higher dimensional data may need more than 32-bit sizes in the future.
   This value may be set to zero if this is the last binary section or header section in the file. This allows a program writing, for example, a single compressed image to avoid having to rewind the file to write the size of the compressed data. (For small files compression within memory may be practical, and this may not be an issue. However, very large files exist where writing the compressed data "on the fly" may be the only realistic method.) It is however recommended that this value be set, as it permits concatenation of files.

   Since the data may have been compressed, knowing the number of elements and the size of each element does not necessarily tell a program how many bytes to jump over, so here it is stored explicitly. This also means that the reading program does not have to decode information in the header section to move through the file.

(QUESTION: To fit this into 32 bytes I cut "BINARY" to "BIN". I think this should be some even number of characters and maybe a multiple of 4 or 8, and probably a power of two has advantages. Should we leave this as I've defined, or use more characters?)

--------------------------------------

I agree with Jim that I've changed the binary length information from being an imgCIF data item to being something which is defined within the CBF structure. The former was assumed at the meeting. I don't feel strongly about it, but it seems to me that this information belongs more to CBF and less to CIF, e.g. if you took such a file and translated it to a pure CIF, such a data item would lose any purpose. Having the binary byte length information in a fixed position relative to the binary data seems slightly easier to program than having it at a variable position relative to the binary data. With the information defined within the CBF identifier, a program could be written to check the integrity of a CBF while knowing nothing about CIF, other than to ignore the insides of multi-line strings.

Are there strong feelings about this?
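The idea of moving through a file using only the identifier can be illustrated with a minimal sketch. It assumes Python and its standard `struct` module; the names `MAGIC`, `pack_identifier`, `parse_identifier` and `find_binary_sections` are hypothetical and not part of the proposal. It packs and parses the 32-byte identifier as laid out in 7a and uses the stored length to jump from one binary section to the next.

```python
import struct

# Bytes 1-16: ASCII marker; bytes 17-20: form feed, Ctrl-Z, Ctrl-D, 213.
MAGIC = b"###_START_OF_BIN" + bytes([12, 26, 4, 213])
# Bytes 21-24: 32-bit id; bytes 25-32: 64-bit length.
# Both are unsigned little-endian; "<" also disables padding.
TAIL = struct.Struct("<IQ")
IDENT_LEN = len(MAGIC) + TAIL.size  # 20 + 12 = 32 bytes


def pack_identifier(binary_id, length=0):
    """Build the 32-byte identifier; length 0 means 'size unknown'."""
    return MAGIC + TAIL.pack(binary_id, length)


def parse_identifier(buf, offset=0):
    """Return (binary_id, length) for the identifier starting at offset."""
    if buf[offset:offset + len(MAGIC)] != MAGIC:
        raise ValueError("no start-of-binary identifier at offset %d" % offset)
    if len(buf) - offset < IDENT_LEN:
        raise ValueError("truncated identifier at offset %d" % offset)
    return TAIL.unpack_from(buf, offset + len(MAGIC))


def find_binary_sections(data):
    """Yield (binary_id, start, length) for each binary section in data.

    length is None when the stored value is 0 (size unknown); by the
    proposal nothing else may follow such a section, so the walk stops.
    """
    pos = 0
    while True:
        pos = data.find(MAGIC, pos)
        if pos < 0:
            return
        binary_id, length = parse_identifier(data, pos)
        start = pos + IDENT_LEN
        if length == 0:
            yield binary_id, start, None
            return
        yield binary_id, start, length
        pos = start + length  # skip the binary data without decoding it
```

This sketch searches the whole buffer for the marker and does not skip the insides of multi-line strings, which a real integrity checker would need to do.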
Should the binary byte length be stored in a CIF data item?

In the example I calculated the size from info in the header, i.e.

> # 768*512*2 = 786432 bytes (or more, ...

However, this was just for this simple example, which didn't use any data compression. (This was done for simplicity, and also because I want to rewrite the data compression part of the proposal, but later.) In the case of a compressed image, a writing program would either compress in memory, and therefore know what value to set prior to outputting the compressed data, or would compress on the fly, and then would have to rewind ("fseek") back to the position in the file where this information is stored.

So no, the size should be written within the identifier (or a data item) so that any reading program knows exactly where it SHOULD find the 'end of binary' identifier. This identifier is an added redundant safety check. Only in the case of a corrupted file should a program resort to reading through the binary section until it finds the identifier.

I suggested the identifier as '###_END_OF_BINARY<cr> <lf>', but I think Jim's right in suggesting '<cr> <lf>###_END_OF_BINARY<cr> <lf>' as a better alternative. (Unless I hear protests, I'll change the proposal to include the "line separator" prior to the '###'.)

Regards,

Andy
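As an illustration of the "compress on the fly, then rewind" approach described above, here is a minimal sketch in Python. Several things in it are assumptions rather than part of the proposal: `zlib` stands in for whatever compression scheme is eventually specified, the function name `write_binary_section` is hypothetical, and the end marker is written as a CR LF pair, the marker text, then another CR LF pair. The output file must be seekable.

```python
import io
import struct
import zlib

# Bytes 1-20 of the start identifier, as listed in 7a.
MAGIC = b"###_START_OF_BIN" + bytes([12, 26, 4, 213])
LENGTH_OFFSET = 24  # the length field occupies bytes 25-32 of the identifier
END_MARKER = b"\r\n###_END_OF_BINARY\r\n"  # assumed byte form of the end marker


def write_binary_section(f, binary_id, chunks):
    """Write identifier and compressed chunks, then patch in the length."""
    ident_pos = f.tell()
    # The length is not known yet, so write 0 as a placeholder.
    f.write(MAGIC + struct.pack("<IQ", binary_id, 0))
    compressor = zlib.compressobj()  # stand-in compression (assumption)
    written = 0
    for chunk in chunks:
        written += f.write(compressor.compress(chunk))
    written += f.write(compressor.flush())
    end_pos = f.tell()
    # Rewind to the length field and store the real compressed size.
    f.seek(ident_pos + LENGTH_OFFSET)
    f.write(struct.pack("<Q", written))
    # Return to the end of the data and add the redundant end marker.
    f.seek(end_pos)
    f.write(END_MARKER)
    return written


# Usage: two chunks compressed on the fly into an in-memory file.
buf = io.BytesIO()
buf.write(b"header text\r\n")
size = write_binary_section(buf, 1, [b"\x00" * 4096, b"\x01" * 4096])
```

A writer that cannot seek (for example when writing to a pipe) would instead leave the length as 0, which the proposal allows only for the last section in the file.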