prototype CBF format
- To: Multiple recipients of list <imgcif-l@bnl.gov>
- Subject: prototype CBF format
- From: Andy Hammersley <hammersl@esrf.fr>
- Date: Tue, 16 Apr 1996 05:36:54 -0400 (EDT)
Here's my attempt to define a CBF format; please excuse present inconsistencies and incompleteness. I've produced this as a basis for discussion, and to give an idea of how the overall format would look. Please do not assume any of the data names, concepts, etc. are decided; this is only an attempt at a prototype definition. Many points of discussion are still undecided, and a large number of new concepts are raised within this definition.

Q. Is this an appropriate way forward?

I separate the definition from comments on discussion items by using round brackets to refer to notes kept separate from the main text, e.g. (1) refers to point 1 in the notes section. I suggest that I try to update and redistribute this definition as new suggestions are sent or consensus is reached.

Q. Is this a suitable basis for developing a CBF definition which would then be proposed to the COMCIFS/IUCr?

My apologies if important points which have been raised in discussions have been left out of the present document. Please remind me if you feel that something is missing.

Andy Hammersley

-----------------------------------------------------------------------------

                ---------------------------------------
                The Crystallographic Binary File Format
                ---------------------------------------

ABSTRACT
--------

This document describes the Crystallographic Binary File (CBF) (0) format: a simple self-describing binary format for efficient transport and archiving of experimental data for the crystallographic community. The format consists of a "CIF-compatible" header section contained within a binary file. The format of the binary file, and the new CIF data-items, are defined.

1.0 INTRODUCTION
----------------

The Crystallographic Binary File (CBF) format is a complementary format to the Crystallographic Information File (CIF) [1], supporting efficient storage of large quantities of experimental data in a self-describing binary format (1).
The initial aim is to support efficient storage of raw experimental data from area-detectors (images) with no loss of information compared to existing formats. The format should be efficient both in terms of writing and reading speeds and in terms of stored file sizes, and should be simple enough to be easily coded, or ported to new computer systems. Flexibility and extensibility are required, and later the storage of other forms of data may be added without affecting the present definitions.

The aims are achieved by a simple binary file format, consisting of a variable-length header section followed by a binary data section. The binary data is fully described by data name/value pairs within the header section. The header section may also contain other auxiliary information. CIF data name and item pairs are used in the header section to describe the binary data.

The present version of the format only tries to deal with simple Cartesian 2-D detector data. This is essentially the "raw" data from detectors that is typically stored in commercial formats or individual formats internal to particular institutes, but could be other forms of data. It is hoped that CBF can replace individual laboratory or institute formats for "home"-built detector systems, be used as an inter-program data exchange format, and may be offered as an output choice by a number of commercial manufacturers specialising in X-ray detector systems.

This format does not imply any particular demands on processing software nor on the manner in which such software should work. Definitions of units, coordinate systems, etc. may be quite different. The clear, precise definition within CIF, and hence CBF, should help, when necessary, to convert from one system to another. Whilst no strict demands are made, it is clearly to be hoped that software will make as much use as is reasonable of information relevant to the processing which is stored within the file.
2.0 OVERVIEW OF THE FORMAT
--------------------------

The following describes the major "components" of the CBF format.

1. CBF is a binary file, containing self-describing "image" and auxiliary data.

2. It is an exact number of blocks of 512 bytes in length, and may be considered in a block structure (2).

3. The very start of the file has an identification item (3). This item also describes the CBF version or level, e.g.

   ###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

   The first hash means that this line is a comment line for CIF, but the three hashes mean that this is a line describing the binary file layout for CBF (4). No whitespace may precede the first hash sign.

   The version number is defined as a major version number and minor version number separated by the decimal point. A change in the major version may well mean that a program for the previous version cannot input the new version, as some major change has occurred to CBF (5). A change in the minor version may also mean incompatibility, if the CBF has been written using some new feature; e.g. a new form of linearity scaling may be specified, and this would be considered a minor version change. A file containing the new feature would not be readable by a program supporting only an older version of the format.

4a. The header section, including the identification items which delimit it, uses only ASCII characters, and is divided into "lines". The "line separator" symbol(s) is/are the same regardless of the operating system on which the file is written (6). (This is an important difference from CIF, but must be so: as the file contains binary data, it cannot be translated from one O.S. to another, as is the case for ASCII text files.)

4b. The header section within the delimiting identification items obeys all CIF rules [1], with the exception of the line separators, e.g.:

   o "Lines" are a maximum of 80 characters long.
   o All data names start with an underscore character and are a maximum of 32 characters long.
   o The hash symbol (outside a character string) means that all text up to the line separator is a comment.
   o Whitespace outside of character strings is not significant.
   o Data names are case insensitive.
   o The data item follows the data name separator, and may be of one of two types: text string (char) or number (numb). (The type is specified for each data name.)
   o Text strings may be delimited with single or double quotes, or blocks of text may be delimited by semi-colons occurring as the first character on a line.
   o The 'loop_' mechanism allows a data name to have multiple values.

   Any CIF data name may occur within the header section.

5a. The end of the header section is delimited by the following special identifier (4), (7):

   ###_END_OF_CBF_HEADER

   The "line" is terminated by the "line separator" immediately after the "R" of "HEADER". No whitespace can be added at this point.

5b. Whitespace (blank characters and lines) may be used to reserve space in the header section (for undefined later use), but this whitespace must occur before the end-of-header delimiter item.

6. The header section must contain sufficient data names to fully describe the binary data section.

7. The binary data starts at the beginning of the next free data block after the last character of the line separator (8).

8. After the end of the binary data, the last block is fully output, i.e. the file is an exact integer multiple of the block size. The values of the extra bytes are undefined.

9. The recommended file extension for a CBF is: cbf
   This allows users to recognise file types easily, and gives programs a chance to "know" the file type without having to prompt the user.

10. CBF format files are binary files, and when ftp is used to transfer files between different computer systems, "binary" or "image" mode transfer should be selected.

[11. A recommended standard icon could also be defined, e.g. a little diffraction pattern.]
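As an illustration of points 3 and 10 above, a reading program might check the identification item before doing anything else. The following is only a sketch, not the promised support routines; the function name is hypothetical:

```python
def check_cbf_magic(first_line: bytes):
    """Return the (major, minor) version numbers if the line is a valid
    CBF identification item, else raise ValueError.  No whitespace may
    precede the first hash sign, so startswith() is the right test."""
    prefix = b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION "
    if not first_line.startswith(prefix):
        raise ValueError("not a Crystallographic Binary File")
    # Major and minor version numbers are separated by a decimal point.
    version = first_line[len(prefix):].strip().decode("ascii")
    major, minor = version.split(".")
    return int(major), int(minor)
```

A program supporting version 1.x could then refuse files whose major version is greater than 1, in line with the backward-compatibility rule of note (5).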
3.0 DATA NAME CATEGORIES
------------------------

Two new data name categories are proposed:

   _binary_
   _image_

The '_binary_' category is used for data items describing the general storage of binary data. At present only the '_binary_data_class' data name exists in this category. The '_image_' category covers all data names concerned with the storage of "image" or regular array data (14).

Data names from any of the existing categories may be relevant as auxiliary information in the header section, but data names from the '_diffrn_' category are likely to be the most relevant, and a few extra data names in this category are necessary.

4.0 DESCRIBING THE BINARY DATA
------------------------------

The type of binary data stored in the file is defined by the '_binary_data_class' data item. This is of type 'character', and may have one of two values in version 1.0 of CBF: 'none' or 'image'. If the value is 'none' there is no binary data section in the file. The value 'image' means that the binary data section contains binary data of class "image".

4.1 The "image" Class of Binary Data (14)
-----------------------------------------

The "image" class is used to store regular arrays of data values, such as 1-D histograms, area-detector data, series of area-detector data, and volume data. Normally such data is regularly spaced in space or time; however, spatially distorted data could nevertheless be stored in such a format. There is only one data "value" stored per lattice position, although that value may be of type complex (15).

The "image" class implies that the data items '_image_size_dimensionality', '_image_size_dimension_1', '_image_element_data_type', '_image_intensities_linearity', and '_image_data_compression_type' must be defined. The values of these items may in turn require other data items to be defined, e.g. if the '_image_size_dimensionality' data item is greater than 1, then '_image_size_dimension_2' and maybe other '_image_size_dimension_?'
items must be defined, up to the dimensionality of the array.

4.2 "Image" Element Rastering and Orientation (17)
--------------------------------------------------

Fundamental to treating a long line of data values as a 2-D image, or a series of 2-D images, is knowledge of the manner in which the values need to be wrapped. For the raster orientation to be meaningful we define the sense of the view:

The sense of the view is defined as that looking from the crystal towards the detector (12). (For the present we consider only an equatorial plane geometry, with 2-theta = 0, and the detector vertically mounted. (16))

The raster orientation describes in which corners of the detector the data value stream starts and ends, and whether the rastering is carried out horizontally or vertically. We define a preferred rastering orientation, which is the default if the keyword is not defined. This is with the start in the lower left-hand corner and the fastest-changing direction of the rastering horizontal (Type 1).

The rastering type is defined by an integer value which may take the values 1 to 8. (All eight possible methods are defined to allow support for existing systems, which may have a natural rastering system which is difficult to change owing to the large size of the data involved. However, whenever possible, type 1 is encouraged.)

(Note: With off-line scanners the rastering type depends on which way round the imaging plate or film is entered into the scanner. Care may need to be taken to make this consistent.)

For 1-D detector data only the first two values are relevant (the orientation of the detector would be defined separately).

Below are shown the 8 possible ways of rastering the element stream 1, 2, 3, 4, 5, 6, 7, 8, 9 (13):

_image_element_ordering 1    # This is the preferred method, and should be used, if possible.
#          ^
#          |
#  7 8 9  slow
#  4 5 6   |
#  1 2 3   o- fast ->

_image_element_ordering 2

#          ^
#          |
#  9 8 7  slow
#  6 5 4   |
#  3 2 1   <- fast -o

_image_element_ordering 3

#          ^
#          |
#  3 6 9  fast
#  2 5 8   |
#  1 4 7   o- slow ->

_image_element_ordering 4

#          ^
#          |
#  9 6 3  fast
#  8 5 2   |
#  7 4 1   <- slow -o

_image_element_ordering 5

#  o- fast ->
#          |
#  1 2 3  slow
#  4 5 6   |
#  7 8 9   v

_image_element_ordering 6

#  <- fast -o
#          |
#  3 2 1  slow
#  6 5 4   |
#  9 8 7   v

_image_element_ordering 7

#  o- slow ->
#          |
#  1 4 7  fast
#  2 5 8   |
#  3 6 9   v

_image_element_ordering 8

#  <- slow -o
#          |
#  7 4 1  fast
#  8 5 2   |
#  9 6 3   v

In the case of series of images, subsequent images follow in the element stream with the same raster ordering.

4.3 "Image" Element Intensity Scaling
-------------------------------------

Existing data storage formats use a wide variety of methods for storing physical intensities as element values. The simplest is a linear relationship, but square root and logarithmic scaling methods have attractions and are used. Additionally, some formats use a lower dynamic range to store the vast majority of element values, and use some other mechanism to store the elements which overflow this limited dynamic range.

The problem of limited dynamic range storage is solved by the data compression method 'byte_offsets' (see next Section), but the possibility of defining non-linear scaling must also be provided.

The '_image_intensities_linearity' data item specifies how the intensity scaling is defined. Apart from linear scaling, which is specified by the value 'linear', two other methods are available to specify the scaling. One is to refer to the detector system; knowledge of the manufacturer's method will then either be known or not by a program. This has the advantage that any system can be easily accommodated, but requires external knowledge of the scaling system. The recommended alternative is to define a number of standard intensity linearity scaling methods, with additional data items when needed.
A number of standard methods are defined by '_image_intensities_linearity' values: 'offset', 'scaling_offset', 'sqrt_scaled', and 'logarithmic_scaled' (11). The "offset" methods require the data item '_image_intensities_offset' to be defined, and the "scaling" methods require the data item '_image_intensities_scaling' to be defined.

The above scaling methods allow the element values to be converted to a linear scale, but do not necessarily relate the linear intensities to physical units. When appropriate, the data item '_image_intensities_gain' can be defined. Dividing the linearised intensities by the value of '_image_intensities_gain' should produce counts.

Two special optional data flag values may be defined, which both refer to the values of the "raw" stored intensities in the file, and not to the linearised scale values. '_image_intensities_undefined' specifies a value which indicates that the element value is not known. This may be due to data missing, e.g. a circular image stored in a square array, or to data values flagged as missing, e.g. behind a beam-stop. '_image_intensities_overload' indicates the intensity value at and above which values are considered unreliable. This is usually due to saturation.

5.0 DATA COMPRESSION (16)
-------------------------

One of the primary aims of CBF is to allow efficient storage, and efficient reading and writing of data, so data compression is of great interest. Despite the extra CPU overheads it can very often be faster to compress data prior to storage, as much smaller amounts of data need to be written to disk, and disk I/O is relatively slow. However, optimum data compression can result in complicated algorithms, and be highly data-specific.

At present one simple lossless integer compression algorithm is defined. This is referred to as 'byte_offsets' in the '_image_data_compression_type' data item.
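The intensity scaling schemes of Section 4.3 amount to a single linearisation step per element. The sketch below follows the definitions given in Appendix A; the function name is hypothetical, and 'scaling' and 'offset' stand in for '_image_intensities_scaling' and '_image_intensities_offset':

```python
def linearise(stored, linearity, scaling=1.0, offset=0.0):
    """Convert a stored element value back to a linear intensity,
    following the '_image_intensities_linearity' schemes (sketch)."""
    if linearity == "linear":
        return stored
    if linearity == "offset":
        return stored + offset                # add the offset value
    if linearity == "scaling_offset":
        return stored * scaling + offset      # scale applied before offset
    if linearity == "sqrt_scaled":
        return (stored / scaling) ** 2        # stored = scaling * sqrt(raw)
    if linearity == "logarithmic_scaled":
        return 10.0 ** (stored / scaling)     # stored = scaling * log10(raw)
    raise ValueError("unknown linearity: %s" % linearity)
```

Dividing the result by '_image_intensities_gain' would then give counts, as described above.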
The 'byte_offsets' algorithm will typically result in close to a factor of two reduction in data storage size relative to typical 2-byte diffraction images. It should give similar gains in disk I/O and network transfer. It also has the advantage that integer values up to 32 bits may be stored efficiently without the need for special overload tables. It is a fixed algorithm which does not need to calculate any image statistics, so is fast.

The algorithm works because of the following property of almost all diffraction data and much other image data: the value of one element tends to be close to the value of the adjacent elements, and the vast majority of the differences use little of the full dynamic range. However, noise in experimental data means that run-length encoding is not useful (unless the image is separated into different bit-planes). If a variable-length code is used to store the differences, with the number of bits used being inversely proportional to the probability of occurrence, then compression ratios of 2.5 to 3.0 may be achieved. However, the optimum encoding then becomes dependent on the exact properties of the image, and in particular on the noise. Here a lower compression ratio is achieved, but the resulting algorithm is much simpler and more robust.

The 'byte_offsets' algorithm is the following:

1. The first element of the image is stored as a 4-byte signed integer regardless of the raw image element type. The byte order for this and any subsequent multi-byte integers is that defined in '_image_byte_order' (9).

2. For every subsequent element the value of the previous element is subtracted to produce the difference. For the first element on a line the value to subtract is the value of the first element of the previous line. For the first element of a subsequent image the value to subtract is the value of the first element of the previous image.

3.
If the difference is within +-127, then one byte is used to store the difference; otherwise the byte is set to -128 (80 in hex), and if the difference is within +-32767, then the next two bytes are used to store the difference; otherwise -32768 (8000 in hex) is written and the following 4 bytes store the difference as a full signed 32-bit integer.

4. The image element order follows the normal ordering as defined by '_image_element_ordering'.

It may be noted that one element value may require up to 7 bytes of storage; however, for almost all 16-bit experimental data the vast majority of element values will be within +-127 units of the previous element, and so only require 1 byte for storage, and a compression factor of close to 2 is achieved.

There are practical disadvantages to such data compression: the value of a particular element cannot be obtained without calculating the values of all previous elements, and there is no simple relationship between element position and stored bytes. If generally the whole image is required, this disadvantage does not apply. These disadvantages can be reduced by compressing different regions of the images separately, which is an approach available in TIFF, but this adds to the complexity of reading and writing images. An alternative is an optional data item which defines a look-up table of element addresses, values, and byte positions within the compressed data (10).

6.0 APPLICATION SOFTWARE AND SUPPORT TOOLS
------------------------------------------

The CBF format does not (and cannot) make any strict demands on application programs which use CBF files. Clearly, any program which outputs a CBF should create only valid CBF's. It is desirable that the output file contains as much useful auxiliary data as possible, but no "compulsory" demands on the auxiliary information output in the file are made.
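To make the 'byte_offsets' scheme of Section 5.0 concrete, here is a sketch of a matching compressor and decompressor. It is illustrative only: the function names are hypothetical, the element stream is treated as flat (the per-line rule of step 2 is ignored for brevity), and little-endian byte order is assumed where a real file would follow '_image_byte_order':

```python
import struct

def byte_offsets_compress(values):
    """Compress a flat stream of integers with the byte_offsets scheme."""
    out = bytearray(struct.pack("<i", values[0]))  # step 1: 4-byte signed int
    prev = values[0]
    for v in values[1:]:
        d = v - prev                               # step 2: difference
        if -127 <= d <= 127:
            out += struct.pack("<b", d)            # 1-byte difference
        elif -32767 <= d <= 32767:
            out += struct.pack("<bh", -128, d)     # flag (80 hex), 2-byte diff
        else:
            out += struct.pack("<bhi", -128, -32768, d)  # flags, 4-byte diff
        prev = v
    return bytes(out)

def byte_offsets_decompress(data, count):
    """Recover 'count' element values from a byte_offsets stream."""
    values = [struct.unpack_from("<i", data, 0)[0]]
    pos = 4
    while len(values) < count:
        d = struct.unpack_from("<b", data, pos)[0]; pos += 1
        if d == -128:                              # escape to a 2-byte diff
            d = struct.unpack_from("<h", data, pos)[0]; pos += 2
            if d == -32768:                        # escape to a 4-byte diff
                d = struct.unpack_from("<i", data, pos)[0]; pos += 4
        values.append(values[-1] + d)
    return values
```

A typical 16-bit diffraction image, where most differences fit in one byte, then compresses to a little over half its raw size, as claimed above.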
Programs inputting CBF's have a more difficult task, in that theoretically the input CBF could be any size, and use any of the possible data types, etc. Programs are not required to input all possible format types, but it is highly recommended that a clear message is output to the user explaining why a particular file cannot be input, when this occurs. E.g. a simple image analysis program may be written to work with integer-element images, and so will not allow input of images with complex or floating-point elements. Clearly, an application program will be most useful if it does input CBF's which are appropriate to its function.

The use of any auxiliary data information is also optional. The aim is to provide analysis programs with useful information to help process the image data, but the manner in which to use this information is the decision of the application programmer. It is, nevertheless, desirable that this auxiliary information is used. E.g. an application program could use the information in the CBF as default values, but allow the user to override these values interactively or through scripts.

Source code for routines to input CBF's will be made available. This code may be modified, provided the original copyright notice is maintained. A program to extract the header section of a CBF and output it in CIF format will also be available. Note: this program would not in itself check the validity of the CIF, but would simply recognise the "CIF-compatible" header section and output it in the correct ASCII text format for the operating system.
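Such a header-extraction tool might, in outline, look like the following sketch (the function name is hypothetical, and a single line-feed line separator is assumed, a point still under discussion in note (6)):

```python
def extract_cbf_header(data: bytes):
    """Return the header section of a CBF byte stream as a list of
    ASCII text lines, identification items included."""
    start = b"###_CRYSTALLOGRAPHIC_BINARY_FILE"
    end = b"###_END_OF_CBF_HEADER"
    if not data.startswith(start):
        raise ValueError("no CBF identification item at start of file")
    stop = data.find(end)
    if stop < 0:
        raise ValueError("no end-of-header delimiter found")
    # The header is pure ASCII; the binary section after it is not touched.
    header = data[:stop + len(end)]
    return header.decode("ascii").split("\n")
```

A real tool would additionally rewrite the line separators into the native form for the operating system, as described above.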
7.0 EXAMPLE OF A CBF HEADER SECTION
-----------------------------------

This is an example of how the header section might look for a file which stores an image taken using the X-ray image intensifier/CCD read-out detector system on beam-line BM-14 at the ESRF, subsequently corrected for detector distortions:

###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0
# Data name
data_Protein_X_1

# Data creation and history details
_audit_creation_date   '96-03-27 09:55.05' # (!!!! Time added to date item)
_audit_creation_method 'Raw data output by CAMERA'
_audit_update_record
; 96-03-27 10:05.31 Detector distortions corrected by FIT2D
;
_computing_data_collection 'spec and CAMERA'
_computing_data_reduction  FIT2D # Distortion correction

# Sample details
_chemical_name_common 'Protein X'

# Experimental details
_diffrn_measurement_method   Oscillation
_diffrn_measurement_distance 0.15 # (!!!! New data name)
_diffrn_radiation_wavelength 0.76 # Monochromatic
_diffrn_radiation_source     'ESRF BM-14'
_diffrn_radiation_detector   'ESRF Be XRII/CCD'

# Image data details
_binary_data_class image
_image_size_dimensionality 2 # Single image in the file
_image_size_dimension_1 1300
_image_size_dimension_2 1200
_image_element_ordering 1 # Default rastering used
_image_element_data_type unsigned_16_bit_integer # 16-bit ADC
_image_byte_order highbytefirst # Written on a Sun-4 workstation
_image_intensities_linearity linear # Simple linear intensity scaling
_image_intensities_gain 1.2(0.1)
_image_intensities_overload 65535 # Saturation level
_image_intensities_undefined 0 # No defined data value flag
_image_data_compression_type byte_offsets # Save disk space
_image_element_size_1 122e-6
_image_element_size_2 121e-6
###_END_OF_CBF_HEADER

(Binary image data follows, starting at the first free block after the line separator.)

8.0 APPENDIX A: CIF EXTENSIONS DICTIONARY
-----------------------------------------

8.1 Phase 1A: Abstract Image Names Only
---------------------------------------

data_binary_data_class
    _name
'_binary_data_class'
    _type              'char'
    _enumeration_range 'none', 'image'
    _list              'no'
    _definition
; Type of binary data stored in the binary data section. 'none' means
  that no binary data is stored after the header section. 'image'
  means that an array of values is stored.
;

data_image_byte_order
    _name              '_image_byte_order'
    _type              'char'
    _enumeration_range 'highbytefirst', 'lowbytefirst'
    _list              'no'
    _definition
; The order of bytes for integer values which require more than 1
  byte. 'highbytefirst' means that the first byte in the byte stream
  of the bytes which make up an integer value is the most significant
  byte of the integer. This is often referred to as "big endian".
  'lowbytefirst' means that the last byte in the byte stream of the
  bytes which make up an integer value is the most significant byte of
  the integer. This is often referred to as "little endian". (IBM PC's
  and compatibles, and DEC VAXes, use low-byte-first ordered integers,
  whereas Hewlett Packard 700 series, Sun-4, and Silicon Graphics
  systems use high-byte-first ordered integers. DEC Alphas can
  produce/use either, depending on a compiler switch.)
;

data_image_data_compression_type
    _name              '_image_data_compression_type'
    _type              'char'
    _enumeration_range 'none', 'byte_offsets'
    _list              'no'
    _definition
; Type of data compression method used to compress the binary data.
  At present only no data compression, or one simple method, has been
  defined. It is important that output software writes this data name,
  and that input software checks the value, for the case that further
  algorithms become defined. 'none' means that the data is stored in
  normal format as defined by '_image_element_data_type' and
  '_image_byte_order'. 'byte_offsets' means that the data is stored in
  the compression scheme defined in Section 5.0.
;

data_image_size_dimensionality
    _name              '_image_size_dimensionality'
    _type              'numb'
    _enumeration_range 1:
    _list              'no'
    _definition
; The number of dimensions of the data.
;

data_image_element_data_type
    _name              '_image_element_data_type'
    _type              'char'
    _enumeration_range 'unsigned_8_bit_integer', 'signed_8_bit_integer',
                       'unsigned_16_bit_integer', 'signed_16_bit_integer',
                       'unsigned_32_bit_integer', 'signed_32_bit_integer',
                       '32_bit_real_ieee', '64_bit_real_ieee',
                       '32_bit_complex_ieee' (15)
    _list              'no'
    _definition
; Data type of a single element value.
;

8.2 Phase 1B: Experiment Data Items
-----------------------------------

data_image_intensities_gain
    _name              '_image_intensities_gain'
    _type              'numb'
    _enumeration_range 0.0:
    _list              'no'
    _definition
; Detector "gain". The factor by which linearised intensity values
  should be divided to produce counts.
;

data_image_intensities_linearity
    _name              '_image_intensities_linearity'
    _type              'char'
    _enumeration_range 'linear', 'offset', 'scaling_offset',
                       'sqrt_scaled', 'logarithmic_scaled'
    _list              'no'
    _definition
; The intensity linearity scaling used from raw intensity to the
  stored element value. 'linear' is obvious. 'offset' means that the
  value defined by '_image_intensities_offset' should be added to each
  element value. 'scaling' means that the value defined by
  '_image_intensities_scaling' should be multiplied with each element
  value. 'scaling_offset' is the combination of the two previous
  cases, with the scale factor applied before the offset value.
  'sqrt_scaled' means that the square root of raw intensities
  multiplied by '_image_intensities_scaling' is calculated and stored,
  perhaps rounded to the nearest integer. Thus, linearisation involves
  dividing the stored values by '_image_intensities_scaling' and
  squaring the result. 'logarithmic_scaled' means that the base-10
  logarithm of raw intensities multiplied by
  '_image_intensities_scaling' is calculated and stored, perhaps
  rounded to the nearest integer. Thus, linearisation involves
  dividing the stored values by '_image_intensities_scaling' and
  calculating 10 to the power of this number.
;

data_image_intensities_offset
    _name              '_image_intensities_offset'
    _type              'numb'
    _enumeration_range
    _list              'no'
    _definition
; Offset value to add to element values.
  (See '_image_intensities_linearity'.)
;

data_image_intensities_scaling
    _name              '_image_intensities_scaling'
    _type              'numb'
    _enumeration_range
    _list              'no'
    _definition
; Scaling value to multiply with element values.
  (See '_image_intensities_linearity'.)
;

data_image_element_size_*
    _name              '_image_element_size_*'
    _type              'numb'
    _units_description 'Metres'
    _enumeration_range 0.0:
    _definition
; The sizes in metres of an image element. (This supposes that the
  elements are on a regular 2-D grid.)
;

# From the draft revised CIF core dictionary:
#
# Data items in the _diffrn_measurement_ category record details about
# the device used to orient and/or position the crystal during the data
# measurement and the manner in which the diffraction data were measured.
#
# We need more precisely defined items than '_diffrn_measurement_details'.

data_diffrn_measurement_distance
    _name              '_diffrn_measurement_distance'
    _type              'numb'
    _units_description 'Metres'
    _enumeration_range 0.0:
    _list              'no'
    loop_ _example     0.25 # 25 cm sample to detector
    _definition
; The distance between the sample and the intersection of the direct
  beam with the detector; the detector being at the 2-theta = 0
  position with equatorial geometry (17, 18).
;

9.0 REFERENCES
--------------

1. S. R. Hall, F. H. Allen, and I. D. Brown, "The Crystallographic
   Information File (CIF): a New Standard Archive File for
   Crystallography", Acta Cryst., A47, 655-685 (1991).

----------------------------------------------------------------------------

10.0 NOTES
----------

(0) Crystallographic Binary File and CBF are working titles. As a name for the format, this appears reasonably appropriate, but maybe a better name will be suggested.
(1) A pure CIF-based format has been considered inappropriate given the enormous size of many raw experimental data-sets and the desire for efficient storage, reading, and writing.

(2) The block size of 512 bytes is only a suggestion at present, and another size may be considered preferable. Jim suggested a 512-byte block size, for efficiency reasons on OpenVMS, but there was objection to this. I think that we need to support the concept of a "record length" for Fortran direct-access I/O and for certain O.S.'s. For other O.S.'s which don't have file structures, all this necessarily means is that the files are some exact multiple of some number of bytes. A program written in "C" or a similar language would pad out the end of the file to the right number of bytes. Such a concept may also be generally useful for efficiency reasons. In our choice of this number, we should not especially favour VMS for efficiency reasons, but then again ideally we want the reading and writing of the files to be as efficient as possible on ALL possible O.S.'s. If we can, we should avoid building in inefficiency, and if possible leave the opportunity for memory mapping and similar techniques. A 512-byte block size is probably also a good size for Un*x pipes (which may or may not be considered relevant). I suggest either a 512 or 1024 byte block size, but maybe other numbers make more sense for other O.S.'s.

(3) I would like some simple method of checking whether the file really is a CBF or not. Ideally this would be right at the start of the file. Thus, a program only needs to read in n bytes and should then know immediately if the file is of the right type or not. I think this identifier should be some straightforward and clear ASCII string; c.f. PostScript levels I and II. Initially, a restricted format is probably the most practical to define and implement, e.g. only one header and binary section per file. However, later on we may want to extend the format to cover multiple header/binary sections.
Such an important change could be communicated to a program through this version/level number. The underscore character has been used to avoid any ambiguity in the spaces. (Such an identifier should be long enough that it is highly unlikely to occur randomly, and if it is ASCII text, should be very slightly obscure, again to reduce the chances that it is found accidentally. Hence I added the three hashes, but some other form may be equally valid.)

(4) C.f. PostScript and the PostScript document structuring conventions. Maybe some other identifier would be better, but also starting with a hash, e.g. #!! (The end-of-header-section marker also uses this mechanism.)

(5) The format should maintain backward compatibility, e.g. a version 1.0 file can be read in by a version 1.1, 3.0, etc. program, but to allow future extensions the reverse cannot be guaranteed to be true.

(6) The exact manner in which to define the line separator is a subject of discussion. Either a single line-feed character (as is used by Un*x), or the combination of a carriage-return character followed by a line-feed character (as is used by MS-DOS and related systems), are the likely candidates.

(7) Some clear identifier signalling the end of the header section and where the binary section begins, or some equivalent method for achieving the same, is vital. Here a clear identifier is proposed, but an alternative method could also work.

(8) If normal computer data, e.g. 2-byte integers or IEEE reals, are being stored in essentially native format, then word boundaries should be respected. Given that higher "quadruple" precision data types and complex data types may potentially be wanted, I suggest that at least 32-byte boundaries are respected, but maybe for efficiency or simplicity reasons it's desirable to use the full block boundaries.

(9) It would also be possible to define the algorithm so that multi-byte integer byte ordering is not important.
(10) It is not proposed to try defining this data item at present, but it could be added if the demand arises to efficiently input only sub-sets of the data.

(11) More types may well need to be defined. This list doesn't, for example, cover the present Mac-Science intensity scaling scheme. However, it may be viewed as too complicated to support too large a range of scaling schemes.

(12) Some may prefer to define the view as the "camera-man's" view, and maybe this is better as part of an overall consistent co-ordinate system for lab/crystal/detector. I note that MADNES defines the view from the camera-man's point of view. Which definition of the viewing direction should we use? Is there an IUCr standard co-ordinate system?

(13) Other orderings are clearly possible; this numbering is fairly arbitrary.

(14) I still wonder about the use of "image". Should we change the word "image" to "array", which I feel is more consistent with the uses which I have defined? Or should we restrict "image" to refer only to a 2-D array type data object, and eventually define other data classes such as "histogram", "images", "volume", etc.? Or leave the class "image" to refer to a whole variety of N-dimensional arrays?

(15) How should complex arrays be stored? Should pairs of real and imaginary values be stored as alternating values in the element stream? Or should a whole array, e.g. an image, of the real components be stored separately from an identical array containing the imaginary components?

(16) Data compression is a subject with no simple best answer. Different algorithms may be considered best in terms of compression ratio, read/write times, complexity, or applicability. The algorithm proposed is simple and should be very widely applicable, but certainly does not attempt to obtain optimum compression.
(17) More general detector orientation information has deliberately been avoided in the first stage of defining the CBF format, but even to describe the sense of an image from an area-detector a certain amount of external geometrical information is necessary. Does the IUCr have a standard coordinate system to define arbitrary detector position and orientation?

(18) Other definitions of the sample to detector distance are possible, and used. How should the sample to detector distance be defined?

(19) From the CIF core dictionary, '_audit_creation_date' defines only the date of creation. The time of data creation needs to be defined, to a precision of fractions of a second. Either the '_audit_creation_date' definition needs to be extended to cover times, or an '_audit_creation_time' data name is needed, or a new '_image_creation_date_time' name is needed. How is the time of creation of data best stored?