Discussion List Archives

Re: ImageNCIF/CBF

  • To: Multiple recipients of list <imgcif-l@bnl.gov>
  • Subject: Re: ImageNCIF/CBF
  • From: Andy Hammersley <hammersl@esrf.fr>
  • Date: Mon, 17 Jun 1996 08:34:03 -0400 (EDT)


Dear Nick,

(I've posted your proposal to the imageNCIF discussion group, and I'm 
sending this to them as well.) 

   I appreciate your interest in, and input to, the question of storing large
image data-sets in a CIF or CIF-like manner. However, I feel that the discussion
should take place in an appropriate forum, e.g. the imageNCIF group or COMCIFS,
or both, or ...

I also feel that some of your comments are somewhat unfair and out of
context. The document which you have seen is not a "submission to 
COMCIFS". It is only a draft proposal from within the imageNCIF working 
group. Many questions such as allowing multiple images of different sizes
are still being discussed. (Incidentally, "imageNCIF" and "CBF" are two 
names for the same thing, or if you prefer imageNCIF is the discussion 
group and CBF is the present name for the draft proposal.) 

I disagree with some of the points which you make:

>   1. A CBF cannot be transferred from site to site without modification
>      (eg. encoding, packing or zipping) and is not at all portable.

Not true. Use ftp in binary mode. WWW/Netscape etc. have no problems with
playboy pictures; diffraction data doesn't have to be any different.

>   2. A CBF is not extensible in the same way as STAR. Whereas new data can be
>      inserted or appended to an existing STAR File, as can extensions to
>      a dictionary without requiring software to be re-written, this 
>      cannot be the case with a CBF.
   
The present DRAFT proposal is not extensible in the same sense as a CIF.
However, the possibility of appending images etc. could be included, but at
the cost of extra complexity.

>      Here we quote from Andy Hammersley's document, from 2.0 OVERVIEW OF
>      THE FORMAT, "A change in the major version may well mean that a
>      program for the previous version cannot input the new version as some
>      major change has occurred to CBF."
   
The sorts of changes I was thinking about are far less important than the 
ones which you are proposing for CIF, i.e. removing the 80 byte line 
limit. I was thinking of things like adding extra compression algorithms
(minor version change), or maybe the later addition of multiple binary
sections (major version change), but NOT changes which alter the
structure of the format.

>   3. Hard-coded block sizes, line lengths, dataname sizes etc are no
>      longer part of the CIF standard, or soon will not be, according to
>      discussions in late 1995.
 
The 512 byte blocking size is only a proposal. It could be removed 
completely (but with disadvantages), made a different number, made a 
variable, etc. However, if CIF changes fundamentally it may be more
appropriate to do things differently. (My personal view is that CIF
should not be changed fundamentally; backwards compatibility is very 
important.) The present CBF document and extra dictionary have been
"cast" in the version one DDL. There is no reason that they should
not be re-defined in DDL-2 terms prior to being presented to COMCIFS
etc. (In fact I've so far tried four times, and failed, to copy the
PostScript document describing the DDL-2. The transfer just stops after
about 20 pages.)

>   4. A CBF cannot be handled by the substantial library of STAR conformant 
>      software currently available and being developed. The duplication of
>      such tools for specialised applications is very wasteful.

This is (would be) true; hence the extraction tool. However, it would
make sense to design the format and/or software so that as much software
as possible could be shared. I don't think that this would be difficult.

>   5. The CBF format looks like a CIF and there is a significant danger that 
>      it could be mistaken for a CIF (if there is an editor that will 
>      handle a binary file). This is potentially confusing and may retard
>      rather than accelerate the acceptance of CIF as a standard
>      crystallographic data exchange approach.
   
This is a potential danger. Suggestions to reduce this danger are welcome. 

>      It is worth stressing that although the header section of a CBF
>      looks like a CIF, the data is not attached to a dataname in a 
>      convenient or easily usable form, and multiple "images" cannot
>      be looped or contained within the file.
   
A data name for the binary data could be added in place of the end-of-header
identifier; however, the exact byte position where the binary
data starts is crucial. The looping mechanism could be used to allow
multiple different sized binary "images". This has been suggested by
David Brown. I, however, do not favour this suggestion. David Brown
asks:

DB>       Is it clear how one array is terminated and a second begun when 
DB> the cif contains multiple arrays?  What is the normal method of 
DB> terminating a binary array?  Are there separators in the binary string 
DB> that can be used for this purpose?

The simple answer to this (I think) is that there aren't, not without 
external knowledge, such as that defined within the header section.
Knowledge of the number of pixels in each binary string could be used
to calculate the byte position at which a particular array starts; unless,
of course, data compression has been used.
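
To illustrate the calculation (a sketch only, assuming uncompressed arrays
stored back to back with a fixed number of bytes per pixel; the function
name and the example sizes are hypothetical and not part of the CBF draft):

```python
def array_offsets(data_start, pixel_counts, bytes_per_pixel):
    """Return the byte offset at which each uncompressed array starts.

    data_start      -- byte position of the first binary array (assumed known
                       from the header)
    pixel_counts    -- number of pixels in each array, in file order
    bytes_per_pixel -- fixed element size; this breaks down with compression
    """
    offsets = []
    position = data_start
    for count in pixel_counts:
        offsets.append(position)
        position += count * bytes_per_pixel
    return offsets


# Three hypothetical 2-byte-per-pixel images following a 1024-byte header.
print(array_offsets(1024, [512 * 512, 512 * 512, 100 * 100], 2))
```

With compressed data the size of each array is no longer a simple function
of the pixel count, so the sizes would have to be recorded explicitly.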

I would prefer a format with multiple header/binary section pairs. This 
would need the number of bytes or blocks in each binary section to be
stored in the header sections, so that a program would know how to "jump
over" a binary section and find the start of the next header section. 
Such a mechanism could be defined, but I feel that it would be better to start
with a format which has only one header section and one binary section
initially, for the sake of simplicity. CBF at present could store multiple
images, but all of the same size, e.g. a time sequence.
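
The "jump over" idea can be sketched as follows. This is an illustration
only: the end-of-header marker, the `_binary_size` item and the other data
names are invented for the example and are not taken from the CBF draft.

```python
import io

END_OF_HEADER = b"--end-of-header--\n"  # hypothetical marker


def write_section(stream, header_items, binary):
    """Write one header/binary pair; the header records the binary size."""
    for name, value in header_items.items():
        stream.write(f"{name} {value}\n".encode("ascii"))
    stream.write(f"_binary_size {len(binary)}\n".encode("ascii"))
    stream.write(END_OF_HEADER)
    stream.write(binary)


def read_headers(stream):
    """Collect every header, jumping over each binary section unread."""
    headers = []
    while True:
        line = stream.readline()
        if not line:  # clean end of file between sections
            return headers
        header = {}
        while line != END_OF_HEADER:
            if not line:
                raise ValueError("file ended inside a header")
            name, value = line.decode("ascii").split(None, 1)
            header[name] = value.strip()
            line = stream.readline()
        headers.append(header)
        # Skip the binary section using the size stored in the header.
        stream.seek(int(header["_binary_size"]), io.SEEK_CUR)


buf = io.BytesIO()
write_section(buf, {"_image_width": 4, "_image_height": 2}, bytes(8))
write_section(buf, {"_image_width": 3, "_image_height": 3}, b"\n" * 9)
buf.seek(0)
for header in read_headers(buf):
    print(header)
```

The binary sections are never scanned for separators, so they may contain
any byte values, including ones that look like text.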

>   6. Finally, mutant forms of CIF such as CBF will tend to be a catalyst 
>      for others....based on the often mistaken belief that there is 
>      always a better mousetrap, and that it's more efficient to adapt a
>      standard than work within it! Such enhancements eventually lead to
>      the complete collapse of the standard....as has been the case for
>      a number of computer languages. The STAR File is a LONG-TERM archival 
>      and exchange approach and therefore its syntax must be considered
>      sacrosanct.
   
This is why it's very important that whatever imageNCIF (the working group)
does is done within COMCIFS and coordinated with CIF people. This seems
to be happening.

>   problems, and appreciate that the standard "text" image approach MAY
>   not work for massive data files - which may be terabytes in size. 

Terabytes seems a bit of an exaggeration at present, but who knows what 
will happen in the next 10 years ...

>   (i) If the descriptive parameters of a binary file could be easily 
>       "linked" to that file, why can't these be in a separate text file?
   
Yes, it's possible; I know two file formats which do this. However, your 
example shows a huge problem with this "solution": your file pointers are
wrong as soon as the files are renamed and, in your example, as soon as 
they are copied to another directory. This is a REAL problem. The two 
file solution was raised in the imageNCIF discussions, but nobody
favoured it. Even if you manage to overcome the file pointer problem, 
through naming conventions etc., you are still left with one "logical" 
data-set being stored in two separate places. This leaves the
possibility for the two to become separated. And if the possibility 
exists then it WILL happen.

>   (ii) Because binary data is machine-specific (and, therefore, so is
>        the encompassing file), is this file suitable for anything
>        other than "transitory local" use (in other words, it is unsuitable
>        for portable or archival purposes)? 

Binary images are portable and are transferred between different
computer systems. With integer data only byte swapping is necessary, and
IEEE reals are becoming standard. (The format could be restricted to only 
hold integers, if this was felt to be very important.) 
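
As a rough illustration of how little is involved for integer data (a sketch
only, assuming 32-bit signed pixels and that the file's byte order is known,
e.g. from the header; the function name is hypothetical):

```python
import struct


def read_int32_pixels(raw, byte_order):
    """Decode 32-bit signed integers stored in the given byte order.

    byte_order is "big" or "little"; struct performs any byte swapping
    needed for the machine the program is running on.
    """
    prefix = {"big": ">", "little": "<"}[byte_order]
    count = len(raw) // 4
    return list(struct.unpack(f"{prefix}{count}i", raw[:count * 4]))


pixels = [0, 1, 258, -2]
big = struct.pack(">4i", *pixels)      # as written on a big endian machine
little = struct.pack("<4i", *pixels)   # as written on a little endian machine

# The bytes on disk differ, but either file decodes to the same values.
print(read_int32_pixels(big, "big") == read_int32_pixels(little, "little"))
```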

>        [We doubt binary files will ever form part of the IUCr archives
>         but such files may be retained inhouse until the appropriate 
>         information is extracted in a more archival form. None of us need 
>         to be reminded of the inadequacy of machine-specific data in an
>         age when the half-life of a chip or an OS is about 12 months!]

"Archiving" is probably not the real aim, but transferability and portability
are. Hence the aims are largely the same as for CIF. I see no reason
whatsoever to believe that ASCII encoding has a greater longevity than
computer integer representations, nor probably IEEE floating point
representation. In fact, at present text data are much less
standardised than integer data types. We (imageNCIF) have identified
four commonly used and different ways in which ASCII text data are
defined on different operating systems, whereas multi-byte integer data 
are only stored in two different forms, commonly known as big endian or
little endian. Changes in the future seem more likely to affect
character data than integer and floating point data. Multi-byte
character sets are presently being developed.

Whilst I appreciate your efforts in understanding the problem of mass
data storage and transport, I'm afraid that I, and I think the large
majority of the imageNCIF group, will reject your proposal for the
reasons given above. Nevertheless, I would welcome your continued
involvement and constructive criticism of the proposals.

Best Regards,

      Andy Hammersley







