Alternative proposal
- To: Multiple recipients of list <imgcif-l@bnl.gov>
- Subject: Alternative proposal
- From: Andy Hammersley <hammersl@esrf.fr>
- Date: Fri, 7 Jun 1996 12:12:10 -0400 (EDT)
Hello Again,

I include an e-mail from Nick Spadaccini and Syd Hall. I have some detailed replies to some of their comments and suggestions, but first I'll just forward their e-mail in its entirety.

Andy

-------------------------------------------------------------------------------
>From nick@cs.uwa.edu.au Mon May 13 10:38 MET 1996
From: nick@cs.uwa.edu.au
Date: Mon, 13 May 1996 15:22:00 +0800
Message-Id: <199605130722.PAA15323@parma.cs.uwa.oz.au>
To: hammersl@esrf.fr
Subject: ImageCIF/CBF

Syd Hall and I have gone through your submission to COMCIFS concerning the encapsulation of massive binary information within the confines of a standard STAR File structure. We have recognised the need for adaptation, but we believe what we have written is a constructive amalgamation and should make things ultimately easier for all concerned. Take a look at it and give us your comments ...

A STAR approach to storing binary data
--------------------------------------

Syd Hall and Nick Spadaccini (version 1.2 : May 12 1996)

Recently there have been several proposals to use a CIF-like approach to exchange and store "non-text" data.
In particular there have been two specific format proposals by Andy Hammersley, "imageNCIF" and "CBF" (Crystallographic Binary Format), which contain binary machine-specific data plus some "text" data items which (seemingly) comply with the STAR syntax (i.e. <tag> <value> with whitespace separators).

The intent of these proposals is clear: to permit the simple transfer of binary/non-text information in a robust and universally acceptable way. Indeed this is the fundamental aim of the STAR File approach, which chose to restrict its binary representations to the ASCII characters to avoid the machine dependencies that are associated with unrestricted binary words and formats.

This paper is intended to direct the discussion on the handling of binary data at the fundamental issues of data exchange. In permitting unrestricted binary representations in the proposed formats, are ...

  * significant efficiencies gained in storage and flexibility?
  * serious penalties incurred in portability and longevity?

To us the answer to the first question is uncertain (the jury is still out) and to the second a probable YES.

Despite the inclusion of "CIF-like" data, a CBF is a "binary" file and is therefore machine dependent. Such an approach must suffer from serious inherent drawbacks.

1. A CBF cannot be transferred from site to site without modification (e.g. encoding, packing or zipping) and is not at all portable. The members of COMCIFS perhaps should review the history of the development of STAR/CIF and the major driving forces behind it. The most desirable attributes were portability and flexibility.

2. A CBF is not extensible in the same way as STAR. Whereas new data can be inserted into or appended to an existing STAR File, and a dictionary can be extended, without requiring software to be re-written, this cannot be the case with a CBF.
Here we quote from Andy Hammersley's document, from 2.0 OVERVIEW OF THE FORMAT, "A change in the major version may well mean that a program for the previous version cannot input the new version as some major change has occurred to CBF."

3. Hard-coded block sizes, line lengths, dataname sizes etc. are no longer part of the CIF standard, or soon will not be, according to discussions in late 1995.

4. A CBF cannot be handled by the substantial library of STAR conformant software currently available and being developed. The duplication of such tools for specialised applications is very wasteful.

5. The CBF format looks like a CIF and there is a significant danger that it could be mistaken for a CIF (if there is an editor that will handle a binary file). This is potentially confusing and may retard rather than accelerate the acceptance of CIF as a standard crystallographic data exchange approach. It is worth stressing that although the header section of a CBF looks like a CIF, the data is not attached to a dataname in a convenient or easily usable form, and multiple "images" cannot be looped or contained within the file.

6. Finally, mutant forms of CIF such as CBF will tend to be a catalyst for others....based on the often mistaken belief that there is always a better mousetrap, and that it's more efficient to adapt a standard than work within it! Such enhancements eventually lead to the complete collapse of the standard....as has been the case for a number of computer languages. The STAR File is a LONG-TERM archival and exchange approach and therefore its syntax must be considered sacrosanct.

Having issued all of these dire warnings, we must say some good things about this proposal. On a number of counts Andy is to be congratulated for his pioneering efforts in handling large scale binary images. He has identified the needs and the problems associated with this technology and has put forward some possible solutions.
This has made those of us not involved in this field realise the magnitude of the problems, and appreciate that the standard "text" image approach MAY not work for massive data files - which may be terabytes in size. We now understand his concerns about this data and the need to provide convenient descriptors that identify and define the nature of these files. We also share his hope that there may be a way of developing standards that will be applicable to the storage of binary files, and will assist wherever possible in this objective.

In trying to think of ways that the treatment of binary data could be encompassed within the existing STAR File syntax, we pose the following fundamental questions...

(i) If the descriptive parameters of a binary file could be easily "linked" to that file, why can't these be in a separate text file?

(ii) Because binary data is machine-specific (and, therefore, so is the encompassing file), is this file suitable for anything other than "transitory local" use (in other words, it is unsuitable for portable or archival purposes)? [We doubt binary files will ever form part of the IUCr archives, but such files may be retained in-house until the appropriate information is extracted in a more archival form. None of us need to be reminded of the inadequacy of machine-specific data in an age when the half-life of a chip or an OS is about 12 months!]

(iii) If there was a genuine need for the long term archiving of a very large binary image, wouldn't it be much more efficient (and secure!) to encode this into compressed ASCII using the sort of differential or previous pixel coding that Andy uses? [A standard compression algorithm like Lempel-Ziv used in gzip halves storage, and the decoding times are not a consideration for archived data (mind you, the availability of a compatible processor in a few years may be :->).]
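
As an illustration of question (iii), here is a minimal Python sketch of previous-pixel (differential) coding followed by Lempel-Ziv based compression and an ASCII encoding. It is not Andy Hammersley's "byte_offsets" scheme, whose details are not given in this message. The function names, the use of zlib and base64, and the choice of 16-bit wrap-around differences are all assumptions made for the example.

```python
import base64
import struct
import zlib


def encode_pixels(pixels):
    """Delta-code unsigned 16-bit pixel values, compress them, return ASCII text."""
    deltas = []
    previous = 0
    for value in pixels:
        # Store the difference from the previous pixel, wrapped to 16 bits.
        deltas.append((value - previous) & 0xFFFF)
        previous = value
    # ">" fixes the byte order (high byte first), independent of the host machine.
    raw = struct.pack(">%dH" % len(deltas), *deltas)
    # zlib uses DEFLATE, a Lempel-Ziv based method; base64 yields plain ASCII.
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_pixels(text):
    """Invert encode_pixels: ASCII text back to a list of pixel values."""
    raw = zlib.decompress(base64.b64decode(text))
    deltas = struct.unpack(">%dH" % (len(raw) // 2), raw)
    pixels = []
    previous = 0
    for delta in deltas:
        # Undo the wrapped difference to recover the original value.
        previous = (previous + delta) & 0xFFFF
        pixels.append(previous)
    return pixels
```

The text returned by encode_pixels contains only ASCII characters, so it could in principle be carried as a value in a text file. How much space this saves depends entirely on the image, and the sketch makes no claim about that.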
We think it is correct to assume that these images will be kept in-house and stored only for "local" consumption over a limited duration (it is worth stressing, however, that our later recommendations are independent of this assumption being correct!). We also believe that ultimately the images will need to be in a format which is highly compressed and rapidly parsable. Several existing formats provide a machine independent way of doing this, i.e. no big/little endian dependencies.

*** A STAR File approach to handling binary data

We believe strongly that the simplest, most efficient and most elegant approach to handling large scale binary data is that the experimental details describing the binary data, and the binary data itself, be in two separate files. The former will be in a standard text STAR file (which we recommend always be archived with the IUCr or whoever) and the latter in a binary file. An important benefit of this approach is that a researcher wanting to review the archived experimental parameters does not have to load a massive binary file to do so.

Most of the data items defined by Andy Hammersley would appear in the parameter file, though those special "header" names and clauses are now unnecessary (as they are non-conformant!). The major additional data item needed specifies the identity (and location) of the file containing the binary data. This is the pointer that tells the application software where to access the binary data file. The new data item is.....

    _image_binary_data_file    "local filename or URL"

Here is an example parameter file which refers to one binary file....

    data_parameter_file_ex_1
    _binary_data_class             image   # Unneeded if definitions are explicit.
    _image_size_dimensionality     2
    _image_size_dimension_1        1300
    _image_size_dimension_2        1200
    _image_element_ordering        1
    _image_element_data_type       unsigned_16_bit_integer
    _image_byte_order              highbytefirst
    _image_intensities_linearity   linear
    _image_intensities_gain        1.2(1)
    _image_intensities_overload    65535
    _image_intensities_undefined   0
    _image_data_compression_type   byte_offsets
    _image_element_size_1          122e-6
    _image_element_size_2          121e-6
    _image_binary_data_file        "/home/andy/esrf.dat"

A significant advantage of the linked file approach is that more than one binary file can be referenced in a single CIF. The next example shows the file structure when there are 2 binary files: a 2D image on a local disk, and a 3D image located at a site which can be reached via the http protocol.

    data_parameter_file_ex_2
    loop_
        _image_binary_data_file
        _image_size_dimensionality
        _image_size_dimension_1
        _image_size_dimension_2
        _image_size_dimension_3
        _image_element_ordering
        _image_intensities_linearity
        _image_intensities_gain
        _image_intensities_overload
        _image_intensities_undefined
        _image_element_size_1
        _image_element_size_2
        _image_element_size_3
    "/home/andy/esrf.dat"            2 1300 1200   1 1 linear 1.2(1) 65535  0 122e-6 121e-6 0
    "http://www.esrf.fr/arch/3D.dat" 3 1280 1024 256 2 log    1.8(3) 32768 -1 9e-5   1e-6   4e-6

Finally, Brian McMahon has suggested that because the ascii parameter (parent) file and the binary (child) file(s) are "detached", there may need to be a back pointer reference in the binary file. He suggests that,

> ....the binary file contain a back-pointer to the descriptor or catalogue
> file - i.e. the binary file could have a one-"line" header, padded to a
> block boundary if need be, something like
>
>     _catalogue_star_file "/home/archive/expt_3.cat"
>
> The syntax need not be STAR-like, ....
>
>     catalogue_star_file="/home/archive/expt_3.cat"

****

In conclusion we make the following claims:

  * The storage requirements of image data will require the most compact machine independent binary format available.
  * The impossibility of embedding binary data in ASCII CIF files makes that approach non-viable.
  * By separating the storage of the binary file parameter data as ASCII CIF from the image file, one retains the efficiencies of both formats.
  * What we propose provides a truly machine independent formalism which will permit the use of all existing CIF/STAR software tools to check and parse the file parameters.
  * Note that the expansion of file names and/or URLs has already been built into the STAR preprocessor by Spadaccini.
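
To make the linked-file idea concrete, the following Python sketch reads the first example parameter file and works out where the binary data lives and how one element would be unpacked. It is only a sketch under stated assumptions: parse_simple_star handles plain tag/value pairs and comments but not loop_ constructs or multi-line text, so it is not a full STAR parser, and the two lookup tables (ELEMENT_FORMATS, BYTE_ORDERS) are hypothetical mappings covering only the values that appear in the example.

```python
import shlex

# The single-file example from the proposal above.
EXAMPLE = '''
data_parameter_file_ex_1
_binary_data_class             image   # Unneeded if definitions are explicit.
_image_size_dimensionality     2
_image_size_dimension_1        1300
_image_size_dimension_2        1200
_image_element_ordering        1
_image_element_data_type       unsigned_16_bit_integer
_image_byte_order              highbytefirst
_image_intensities_linearity   linear
_image_intensities_gain        1.2(1)
_image_intensities_overload    65535
_image_intensities_undefined   0
_image_data_compression_type   byte_offsets
_image_element_size_1          122e-6
_image_element_size_2          121e-6
_image_binary_data_file        "/home/andy/esrf.dat"
'''

# Assumed mappings to Python struct codes; only the example's values are covered.
ELEMENT_FORMATS = {"unsigned_16_bit_integer": ("H", 2)}
BYTE_ORDERS = {"highbytefirst": ">"}


def parse_simple_star(text):
    """Collect non-looped tag/value pairs (no loop_ or text-field support)."""
    items = {}
    pending_tag = None
    # shlex honours the quoted file name and drops "#" comments.
    for token in shlex.split(text, comments=True):
        if pending_tag is not None:
            items[pending_tag] = token
            pending_tag = None
        elif token.startswith("data_"):
            items["data_block"] = token[len("data_"):]
        elif token.startswith("_"):
            pending_tag = token
    return items


def describe_binary(items):
    """Summarise what an application needs before opening the binary file."""
    ndim = int(items["_image_size_dimensionality"])
    shape = [int(items["_image_size_dimension_%d" % (i + 1)]) for i in range(ndim)]
    code, size = ELEMENT_FORMATS[items["_image_element_data_type"]]
    count = 1
    for extent in shape:
        count *= extent
    return {
        "path": items["_image_binary_data_file"],
        "shape": shape,
        "element_format": BYTE_ORDERS[items["_image_byte_order"]] + code,
        # Size before any compression such as byte_offsets is applied.
        "uncompressed_bytes": count * size,
    }
```

Calling describe_binary(parse_simple_star(EXAMPLE)) gives the file location together with the shape and element layout, all taken from the text file, without the binary file being touched.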