Discussion List Archives


Alternative proposal

  • To: Multiple recipients of list <imgcif-l@bnl.gov>
  • Subject: Alternative proposal
  • From: Andy Hammersley <hammersl@esrf.fr>
  • Date: Fri, 7 Jun 1996 12:12:10 -0400 (EDT)

Hello Again,

   I include an e-mail from Nick Spadaccini and Syd Hall. I have some
detailed replies to some of their comments and suggestions, but first I'll
just forward their e-mail in its entirety.


         Andy

-------------------------------------------------------------------------------
>From nick@cs.uwa.edu.au Mon May 13 10:38 MET 1996
Received: from esrf.esrf.fr by expga.esrf.fr with SMTP
	(1.38.193.4/16.2) id AA14957; Mon, 13 May 1996 10:38:21 +0200
Return-Path: <nick@cs.uwa.edu.au>
Received: from out.esrf.fr (out.esrf.fr [192.168.100.99]) by esrf.esrf.fr (8.6.10/8.6.9) with ESMTP id KAA00372 for <hammersl@esrf.fr>; Mon, 13 May 1996 10:34:03 +0200
From: nick@cs.uwa.edu.au
Received: (from uucp@localhost) by out.esrf.fr (8.6.10/8.6.10) id KAA12195 for <hammersl@esrf.fr>; Mon, 13 May 1996 10:43:50 +0100
Received: from unknown(130.95.1.11) by firewall via smap (V1.3)
	id tmp011286; Mon May 13 10:39:09 1996
Received: from parma.cs.uwa.oz.au (nick@parma.cs.uwa.oz.au [130.95.1.7]) by cs.uwa.oz.au (8.6.8/8.5) with ESMTP id PAA04015; Mon, 13 May 1996 15:22:08 +0800
Received: by parma.cs.uwa.oz.au (8.6.8) id PAA15323; Mon, 13 May 1996 15:22:00 +0800
Date: Mon, 13 May 1996 15:22:00 +0800
Message-Id: <199605130722.PAA15323@parma.cs.uwa.oz.au>
To: hammersl@esrf.fr
Subject: ImageCIF/CBF
Status: RO


Syd Hall and I have gone through your submission to COMCIFS concerning
the encapsulation of massive binary information, within the confines
of a standard STAR File structure. We have recognised the need for adaptation,
but we believe what we have written is a constructive amalgamation and
should make things ultimately easier for all concerned.

Take a look at it and give us your comments ...




                        A STAR approach to storing binary data
                        --------------------------------------
   
                          Syd Hall and Nick Spadaccini
                       
                           (version 1.2 : May 12 1996)
   
   
   Recently there have been several proposals to use a CIF-like approach to
   exchange and store "non-text" data. In particular there have been two specific
   format proposals by Andy Hammersley "imageNCIF" and "CBF" (Crystallographic
   Binary Format) which contain binary machine-specific data plus some "text"
   data items which (seemingly) comply with the STAR syntax (ie. <tag> <value>
   with whitespace separators).
   
   The intent of these proposals is clear: to permit the simple transfer of
   binary/non-text information in a robust and universally acceptable way. 
   Indeed this is the fundamental aim of the STAR File approach which chose to
   restrict its binary representations to the ASCII characters to avoid the
   machine dependencies that are associated with unrestricted binary words
   and formats. This paper is intended to direct the discussion on the
   handling of binary data at the fundamental issues of data exchange.
   
   In permitting unrestricted binary representations in the proposed formats, 
   are ...
   
         * significant efficiencies gained in storage and flexibility?
   
         * serious penalties incurred in portability and longevity?
   
   To us the answer to the first question is uncertain (the jury is
   still out) and to the second a probable YES.
   
   
   Despite the inclusion of "CIF-like" data, a CBF is a "binary" file 
   and is therefore machine dependent.  Such an approach must suffer from 
   serious inherent drawbacks.
   
   1. A CBF cannot be transferred from site to site without modification
      (eg. encoding, packing or zipping) and is not at all portable.
   
      The members of COMCIFS perhaps should review the history of the
      development of STAR/CIF and the major driving forces behind it. 
      The most desirable attributes were portability and flexibility.
   
   2. A CBF is not extensible in the same way as STAR. Whereas new data can be
      inserted or appended to an existing STAR File, as can extensions to
      a dictionary without requiring software to be re-written, this 
      cannot be the case with a CBF.
   
      Here we quote from Andy Hammersley's document, from 2.0 OVERVIEW OF
      THE FORMAT, "A change in the major version may well mean that a
      program for the previous version cannot input the new version as some
      major change has occurred to CBF."
   
   3. Hard-coded block sizes, line lengths, dataname sizes etc are no
      longer part of the CIF standard, or soon will not be, according to
      discussions in late 1995.
   
   4. A CBF cannot be handled by the substantial library of STAR conformant 
      software currently available and being developed. The duplication of
      such tools for specialised applications is very wasteful.
   
   5. The CBF format looks like a CIF and there is a significant danger that 
      it could be mistaken for a CIF (if there is an editor that will 
      handle a binary file). This is potentially confusing and may retard
      rather than accelerate the acceptance of CIF as a standard
      crystallographic data exchange approach.
   
      It is worth stressing that although the header section of a CBF
      looks like a CIF, the data is not attached to a dataname in a 
      convenient or easily usable form, and multiple "images" cannot
      be looped or contained within the file.
   
   6. Finally, mutant forms of CIF such as CBF will tend to be a catalyst 
      for others....based on the often mistaken belief that there is 
      always a better mousetrap, and that it's more efficient to adapt a
      standard than work within it! Such enhancements eventually lead to
      the complete collapse of the standard....as has been the case for
      a number of computer languages. The STAR File is a LONG-TERM archival 
      and exchange approach and therefore its syntax must be considered
      sacrosanct.
   
   
   
   Having issued all of these dire warnings, we must say some good 
   things about this proposal. On a number of counts Andy is to be 
   congratulated for his pioneering efforts in handling large scale
   binary images. He has identified the needs and the problems associated
   with this technology and has put forward some possible solutions. This has
   made those of us not involved in this field realise the magnitude of the
   problems, and appreciate that the standard "text" image approach MAY
   not work for massive data files - which may be terabytes in size.
   
   We now understand his concerns about this data and the need to provide
   convenient descriptors that identify and define the nature of these 
   files. We also share his hope that there may be a way of developing
   standards that will be applicable to the storage of binary files, and
   will assist wherever possible in this objective.
   
   In trying to think of ways that the treatment of binary data could 
   be encompassed within the existing STAR File syntax, we pose the 
   following fundamental questions...
   
   (i) If the descriptive parameters of a binary file could be easily 
       "linked" to that file, why can't these be in a separate text file?
   
   (ii) Because binary data is machine-specific (and, therefore, so is
        the encompassing file), is this file suitable for anything
        other than "transitory local" use (in other words, it is unsuitable
        for portable or archival purposes)? 
        [We doubt binary files will ever form part of the IUCr archives
         but such files may be retained inhouse until the appropriate 
         information is extracted in a more archival form. None of us need 
         to be reminded of the inadequacy of machine-specific data in an
         age when the half-life of a chip or an OS is about 12 months!]
   
   (iii) If there was a genuine need for the long term archiving of a 
         very large binary image, wouldn't it be much more efficient
         (and secure!) to encode this into compressed ascii using the
         sort of differential or previous pixel coding that Andy uses? 
         [A standard compression algorithm like Lempel-Ziv used in gzip
         halves storage, and the decoding times are not a consideration
         for archived data (mind you, the availability of a compatible
         processor in a few years may be :->).]
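
   Question (iii) can be made concrete with a small sketch. It is only an
   illustration under stated assumptions: the previous-pixel differences are
   packed as big-endian 32-bit integers, compressed with zlib (the same
   Lempel-Ziv based DEFLATE method that gzip uses) and written as base64
   text. None of these choices is taken from Andy's byte_offsets scheme, and
   the function names are hypothetical.

```python
import base64
import struct
import zlib


def encode_image_ascii(pixels):
    """Delta-code a flat list of pixel values and return ASCII text."""
    deltas = []
    previous = 0
    for value in pixels:
        deltas.append(value - previous)  # difference from the previous pixel
        previous = value
    # '>' fixes the byte order, so the packed bytes do not depend on the host.
    # 'i' is a signed 32-bit integer (assumed wide enough for every delta).
    raw = struct.pack(">%di" % len(deltas), *deltas)
    # Compress, then map the bytes onto printable ASCII characters.
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_image_ascii(text):
    """Invert encode_image_ascii and return the original pixel list."""
    raw = zlib.decompress(base64.b64decode(text))
    deltas = struct.unpack(">%di" % (len(raw) // 4), raw)
    pixels = []
    previous = 0
    for delta in deltas:
        previous += delta  # running sum restores the absolute values
        pixels.append(previous)
    return pixels
```

   The resulting text could sit inside an ordinary text field of a STAR
   File; how much space it saves depends entirely on the image.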
   
   
   We think it is correct to assume that these images will be kept in-house
   and stored only for "local" consumption over a limited duration (it is 
   worth stressing, however, that our later recommendations are independent 
   of this assumption being correct!).
   
   We also believe that ultimately the images will need to be in a format 
   which is highly compressed and rapidly parsable. Several existing
   formats provide a machine independent way of doing this ie. no 
   big/little endian dependencies. 
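
   As a small illustration of what "no big/little endian dependencies" means
   in practice, the sketch below writes unsigned 16-bit elements with the
   byte order stated explicitly, so the bytes are the same on any host. The
   function names are hypothetical and this is not a description of any
   particular existing format.

```python
import struct


def pack_highbytefirst(pixels):
    """Pack unsigned 16-bit values with the high byte first."""
    # '>' selects big-endian regardless of the machine running the code.
    return struct.pack(">%dH" % len(pixels), *pixels)


def unpack_highbytefirst(data):
    """Unpack bytes written by pack_highbytefirst into a list of ints."""
    return list(struct.unpack(">%dH" % (len(data) // 2), data))
```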
   
   
   
   *** A STAR File approach to handling binary data
   
   We believe strongly that the simplest, most efficient and most elegant
   approach to handling large scale binary data is to keep the experimental
   details describing the binary data, and the binary data itself, in two
   separate files. The former will be in a standard text STAR file
   (which we recommend always be archived with the IUCr or whoever) and the
   latter in a binary file.
   
   An important benefit of this approach is that a researcher wanting to review
   the archived experimental parameters does not have to load a massive binary
   file to do so.  Most of the data items defined by Andy Hammersley would 
   appear in the parameter file, though those special "header" names and 
   clauses are now unnecessary (as they are non-conformant!).
   
   The major additional data item needed specifies the identity (and location)
   of the file containing the binary data. This is the pointer that tells
   the application software where to access the binary data file.
   
   The new data item is.....
   
                _image_binary_data_file    "local filename or URL"
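
   How application software might follow such a pointer can be sketched as
   below. This is an assumption-laden illustration, not the behaviour of any
   existing STAR tool: the helper names are hypothetical, and only the http,
   https and ftp schemes are treated as remote.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Schemes treated as remote in this sketch (an assumption).
REMOTE_SCHEMES = ("http", "https", "ftp")


def is_remote(location):
    """Return True if the value of _image_binary_data_file looks like a URL."""
    return urlparse(location).scheme in REMOTE_SCHEMES


def open_binary_data(location):
    """Open the binary data file for reading, whether local or remote."""
    if is_remote(location):
        return urlopen(location)   # file-like object over the network
    return open(location, "rb")    # plain local file
```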
   
   
   
   Here is an example parameter file which refers to one binary file....
   
   data_parameter_file_ex_1
   
   _binary_data_class		image  # Unneeded if definitions are explicit.
   
   _image_size_dimensionality	2	
   _image_size_dimension_1		1300
   _image_size_dimension_2		1200
   _image_element_ordering		1
   
   _image_element_data_type	unsigned_16_bit_integer
   _image_byte_order		highbytefirst   
   
   _image_intensities_linearity	linear
   _image_intensities_gain		1.2(1)
   _image_intensities_overload	65535
   _image_intensities_undefined	0
   
   _image_data_compression_type	byte_offsets 
   _image_element_size_1          122e-6
   _image_element_size_2          121e-6
   
   _image_binary_data_file           "/home/andy/esrf.dat"
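
   To show how little is needed to read such a parameter file, here is a
   minimal sketch. It is not a conformant STAR parser: it borrows
   shell-style tokenising from the Python standard library, which only
   approximates STAR quoting rules, and it handles unlooped items only. The
   function names are hypothetical.

```python
import shlex


def parse_parameters(text):
    """Collect unlooped `_tag value` pairs from one data block."""
    params = {}
    for line in text.splitlines():
        # comments=True drops everything after '#'; quotes are removed.
        tokens = shlex.split(line, comments=True)
        if len(tokens) == 2 and tokens[0].startswith("_"):
            params[tokens[0]] = tokens[1]
    return params


def element_count(params):
    """Number of image elements implied by the size items."""
    rank = int(params["_image_size_dimensionality"])
    count = 1
    for axis in range(1, rank + 1):
        count *= int(params["_image_size_dimension_%d" % axis])
    return count
```

   Applied to the example above, element_count gives 1300 x 1200 = 1560000
   elements, and the value of _image_binary_data_file tells the program
   which file to open next.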
   
   
   
   
   A significant advantage of the linked file approach is that more than
   one binary file can be referenced in a single CIF.  The next example 
   shows the file structure when there are 2 binary files, a 2D image on
   a local disk, and a 3D image located at a site which can be got via the 
   http protocol.
   
   data_parameter_file_ex_2
   
   loop_
       _image_binary_data_file         
       _image_size_dimensionality     
       _image_size_dimension_1       
       _image_size_dimension_2      
       _image_size_dimension_3      
       _image_element_ordering     
       _image_intensities_linearity   
       _image_intensities_gain       
       _image_intensities_overload  
       _image_intensities_undefined
       _image_element_size_1      
       _image_element_size_2     
       _image_element_size_3     
   
   "/home/andy/esrf.dat"
   2 1300 1200   1  1 linear 1.2(1) 65535 0 122e-6 121e-6  0
   
   "http://www.esrf.fr/arch/3D.dat"
   3 1280 1024 256  2 log    1.8(3) 32768 -1 9e-5 1e-6  4e-6
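
   Reading the looped form takes only a few more lines. Again this is a
   sketch under assumptions rather than a real STAR parser: it expects a
   single loop_ in the text, assumes no data value begins with an
   underscore, and uses shell-style tokenising as a stand-in for STAR
   quoting. The function name is hypothetical.

```python
import shlex


def parse_loop(text):
    """Return one dict per row of the first loop_ found in the text."""
    tokens = shlex.split(text, comments=True)
    position = tokens.index("loop_") + 1
    tags = []
    # The data names follow loop_ directly, one token each.
    while position < len(tokens) and tokens[position].startswith("_"):
        tags.append(tokens[position])
        position += 1
    values = tokens[position:]
    if not tags or len(values) % len(tags) != 0:
        raise ValueError("number of values is not a multiple of the tags")
    # Cut the flat value list into rows of len(tags) items.
    return [
        dict(zip(tags, values[start:start + len(tags)]))
        for start in range(0, len(values), len(tags))
    ]
```

   For the example above this yields two rows, one per binary file, each
   carrying its own dimensions and element sizes.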
   
   
   
   
   
   Finally, Brian McMahon has suggested that because the ascii parameter 
   (parent) file and the binary (child) file(s) are "detached", there may 
   need to be a back pointer reference in the binary file. He suggests that,
   
   > ....the binary file contain a back-pointer to the descriptor or catalogue
   > file - i.e. the binary file could have a one-"line" header, padded to a
   > block boundary if need be, something like
   >
   >   _catalogue_star_file         "/home/archive/expt_3.cat"
   >
   > The syntax need not be STAR-like, ....
   >
   >   catalogue_star_file="/home/archive/expt_3.cat"
   
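
   Brian's suggestion might be realised along the following lines. The
   512-byte block size, the use of spaces as padding and the function names
   are all assumptions made for this sketch; the suggestion itself fixes
   none of them.

```python
BLOCK_SIZE = 512  # assumed block size; not specified in the suggestion


def make_header(catalogue_path, block_size=BLOCK_SIZE):
    """Build a one-line back-pointer header padded to a block boundary."""
    line = ('_catalogue_star_file "%s"\n' % catalogue_path).encode("ascii")
    # Round the length up to the next multiple of block_size.
    padded_length = -(-len(line) // block_size) * block_size
    return line.ljust(padded_length, b" ")


def split_header(data, block_size=BLOCK_SIZE):
    """Return (catalogue_path, binary_payload) from a file's contents."""
    newline = data.index(b"\n")
    tag, value = data[:newline].decode("ascii").split(None, 1)
    if tag != "_catalogue_star_file":
        raise ValueError("no back-pointer header found")
    # The payload starts at the first block boundary after the header line.
    header_length = -(-(newline + 1) // block_size) * block_size
    return value.strip().strip('"'), data[header_length:]
```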
   
   
   
   **** In conclusion we make the following claims:
   
   * The storage requirements of image data will require the most compact
     machine independent binary format available.
   
   * The impossibility of embedding binary data into ascii CIF files makes
     that approach non-viable.
   
   * By separating the storage of the binary file parameter data as ascii CIF
     from the image file, one retains the efficiencies of both formats.
   
   * What we propose provides a truly machine independent formalism which will
     permit the use of all existing CIF/STAR software tools to check and parse
     the file parameters.
   
   * Note that the expansion of file names and/or URLs has already been
     built into the STAR preprocessor by Spadaccini.
   
   




