Discussion List Archives


imageNCIF


Hello,

   Sorry, I've been distracted by other things like writing analysis
software so I haven't been in communication for a while...

So, coming back to my last e-mail, which raised a few questions, and to
David's reply.

These were my questions:

> 1. Do we have consensus on the binary nature of storage for the "image"
>    data ? (As opposed to ASCII encoding of "image" data.)
 
> 2. Do we have consensus on holding header information and binary "image"
>   information together in the same file ? ( The main alternative to
>   this would be to have a separate header file which could therefore be
>   a pure text file, and a binary file for the "image" data.)

> 3. Within the COMCIF framework could a "Crystallographic Binary File"
>    (or similarly named) format be defined ? With a "CIF-compatible" 
>   header section, and a tool to convert the "cif-compatible" header 
>   sections to "cif-compliant" files.

My points 1 and 2 seem to be accepted:

David writes:

>        Andy has summarised very well the consensus that is developing 
> in the group, ...

whereas point 3 gets a less clear answer.

David continues:

>               but it is important to point out that, however much 
> the ascii part of the binary file may look like a cif, it will not be a 
> cif as long as it is included in a file that contains binary 
> information.  

but goes on to say:

>        Still, it would make sense to keep the ascii part in a form that, 
> when extracted, constituted a legitimate cif. ...

I take this to mean a "technical" NO, but more or less a practical YES.

I think that if a binary CIF-compatible format is developed then
ultimately it would be highly desirable that the whole format is "owned"
by the IUCr and maintained in the same or a similar manner to CIF at
present. The simplest and best method to do this, I believe, would be
to extend the function of COMCIFS to cover both CIF and the binary
format. A parallel committee could be imagined, but I would see that as
something best avoided.

Still I get the impression that this is a problem of technicality,
rather than necessarily a real practical problem, and at present is 
somewhat hypothetical.

I think it's time to start defining details of the format...

For the present I'll continue calling the format "Crystallographic
Binary File" (CBF). This is similar to "CIF", but clearly has the word
"binary" inserted. But maybe other people have better ideas ... BIF ??


-------------------------------------------------------------------------------

Here's my attempt to outline the format.

1. CBF is a binary file, containing self-describing "image" and
   auxiliary data.

2. It is an exact number of blocks of **** bytes in length. 

Jim suggested a 512-byte block size, for efficiency reasons on
OpenVMS, but there was objection to this. I think that we need to
support the concept of a "record length" for Fortran direct access I/O
and for certain O.S.'s. For other O.S.'s which don't have file structures
all this necessarily means is that the files are some exact multiple of
some number of bytes. A program written in "C" or a similar language
would pad out the end of the file to the right number of bytes.
Such a concept may also be generally useful for efficiency reasons.

In our choice of this number, we should not especially favour VMS for 
efficiency reasons, but then again ideally we want the reading and writing 
of the files to be as efficient as possible on ALL possible O.S.'s. 
If we can we should avoid building in inefficiency, and if possible
leave the opportunity for memory mapping and similar techniques.

I suggest either 512 or 1024 byte block size, but maybe other numbers 
make more sense for other O.S.'s.
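
As a minimal sketch of what the padding in point 2 could look like in practice (the 512-byte default, the zero fill byte and the function names are assumptions for illustration, not part of any agreed format):

```python
import io

BLOCK_SIZE = 512  # assumption: one of the two block sizes suggested above


def padded_length(n_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Round n_bytes up to the next whole number of blocks."""
    return -(-n_bytes // block_size) * block_size


def pad_to_block(stream, block_size: int = BLOCK_SIZE, fill: bytes = b"\0") -> int:
    """Append fill bytes so the stream ends on a block boundary.

    Returns the new total length. The fill byte is an assumption; the
    proposal does not say what the padding should contain.
    """
    stream.seek(0, io.SEEK_END)
    length = stream.tell()
    target = padded_length(length, block_size)
    stream.write(fill * (target - length))
    return target
```

A file holding 700 bytes would be padded to 1024 bytes with the default block size, so a Fortran direct access reader with a 512-byte record length would see exactly two records.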

3. The very start of the file has an identification item

I would like some simple method of checking whether the file really is a
CBF or not. Ideally this would be right at the start of the file. Thus, a 
program only needs to read in n bytes and should then know immediately
if the file is of the right type or not. I think this identifier should
be some straightforward and clear ASCII string.

4. Somewhere near the start of the file is the CBF version or level.

c.f. PostScript level I and II. Initially, a restricted format is
probably the most practical to define and implement, e.g. only one header
and binary section per file. However, later on we may want to extend the
format to cover multiple header/binary sections. Such an important
change could be communicated to a program through this version/level
number. This could be combined with the identification item.

e.g.

### CRYSTALLOGRAPHIC BINARY FILE FORMAT: VERSION 1.0

(Such an identifier should be long enough that it is highly unlikely to
occur randomly, and if it is ASCII text, should be very slightly
obscure, again to reduce the chances that it is found accidentally. Hence
I added the three hashes, but some other form may be equally valid.)
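
To illustrate points 3 and 4 together, here is a rough sketch of how a program might test the first bytes of a file. It assumes the example identifier line above is used exactly as written and that the version is given as "major.minor"; both are assumptions, since the identifier is only a suggestion.

```python
import re

# Assumption: the example identifier from the text, up to the version number.
MAGIC = b"### CRYSTALLOGRAPHIC BINARY FILE FORMAT: VERSION "
_VERSION_RE = re.compile(rb"(\d+)\.(\d+)")


def read_cbf_version(first_bytes: bytes):
    """Return (major, minor) if the bytes start with the identifier, else None."""
    if not first_bytes.startswith(MAGIC):
        return None
    # Look for "major.minor" immediately after the identifier text.
    match = _VERSION_RE.match(first_bytes, len(MAGIC))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A reader would only need to fetch the first block (or less) of the file and pass it to `read_cbf_version`; a result of `None` means the file is not of the expected type.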

5. Header section: describing following binary section, and containing
other auxiliary information. Defined as for CIF, with the exception of
the line separators.

e.g. _image_size_dimensions 2      # Or equivalent

(Clearly much more detail to be defined.)

[At the ESRF a data format was developed which used the keyword 
"IMAGE", which turned out to mean any binary data section, hence I had 
reservations about using the word "image". However, if we are definitely
referring to images (in some sense) and will use other keywords for
other types of binary data, my previous objection disappears.]

6. Some clear identifier signalling the end of the header section and 
   where the binary section begins, or some equivalent method for
   achieving the same. I favour a very clear identifier; Jim, some time
   ago, seemed to favour a byte count keyword.
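
A minimal sketch of the "clear identifier" option in point 6 follows. The marker text `### END OF HEADER` is purely hypothetical, and the sketch assumes the marker never occurs inside the header itself.

```python
END_MARKER = b"### END OF HEADER"  # hypothetical marker, not from any specification


def split_header(data: bytes, marker: bytes = END_MARKER):
    """Return (header_text, offset_after_marker).

    Raises ValueError if the marker is absent. The offset is where a
    reader would start looking for the binary section.
    """
    pos = data.find(marker)
    if pos < 0:
        raise ValueError("end-of-header marker not found")
    header = data[:pos].decode("ascii")
    return header, pos + len(marker)
```

With a byte count keyword instead, the reader would parse the count from the header and seek directly, avoiding the search but requiring the writer to know the header length in advance.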

7. The binary data. Starting at a new block ?

If normal computer data, e.g. 2-byte integers or IEEE reals, are being
stored in essentially native format then word boundaries should be
respected. Given that higher "quadruple" precision data types and
complex data types may potentially be wanted, I suggest that at least
32 byte boundaries are respected, but maybe for efficiency or simplicity
reasons it's desirable to use the full block boundaries.

(Data types, possible compression, etc. to be defined)
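
The boundary rule in point 7 amounts to rounding the start of the binary data up to a multiple of the chosen boundary. A small sketch, with a hypothetical function name and example offsets:

```python
def align_up(offset: int, boundary: int) -> int:
    """Return the smallest multiple of boundary that is >= offset."""
    if boundary <= 0:
        raise ValueError("boundary must be positive")
    return (offset + boundary - 1) // boundary * boundary
```

If a header happened to end at byte 1234, the binary data would start at byte 1248 with 32-byte boundaries, or at byte 1536 with full 512-byte block boundaries.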

8. Recommended file extension (restricted to three characters).

e.g. cbf

This allows users to recognise file types easily, and gives programs a 
chance to "know" the file type without having to prompt the user.

---------------------

I guess that those are the main features I would like to see in the
format. The precise syntax is not too important (to me), although it is
important that it is precisely defined. (Precise definition, I feel, is a
strong point of the existing CIF dictionary.)

-------------------------------------------------------------------------------

I'll make a few points on other matters which have been raised:

A. Jim objects to words like "horizontal", "vertical", "X-direction",
and "Y-direction" which I tend to use. I understand his objections, but
we do need to be able to relate an abstract byte stream, first into some
regular array form, and then to be able to relate the array to an
experimental set-up, and to a computer screen. I think we also want a
simple language in which to be able to do this (at least for the simple
cases). 

[ I guess I sit too much in front of a computer screen, so all
images I work with have an up and a down. Whilst clearly a 2-D detector
does not have to be vertically mounted (in the Lab frame), in practice
almost all are. So usually the detector has a clear sense of up and
down. Unfortunately, by the time the image has been stored and displayed
the two are often not the same! The same is true for left and right,
but with the added complication that it needs to be defined whether the
image should be defined from the sample looking at the detector, or vice versa.]
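
For the simple cases mentioned in point A, relating the byte stream to a displayed array needs only a few pieces of information. The sketch below is one possible way of expressing them; every parameter name is hypothetical, and it covers only the step from stored order to screen position, not the relation to the experimental set-up.

```python
def element_position(index: int, n_fast: int, n_slow: int,
                     fast_is_horizontal: bool = True,
                     origin_top_left: bool = True):
    """Map a linear element index to (row, column) on a display.

    Rows are counted downwards from the top of the screen. n_fast is the
    number of elements along the fastest-varying direction of the stored
    data, n_slow the number along the other direction.
    """
    if not 0 <= index < n_fast * n_slow:
        raise IndexError("index outside the array")
    slow, fast = divmod(index, n_fast)
    row, col = (slow, fast) if fast_is_horizontal else (fast, slow)
    if not origin_top_left:
        n_rows = n_slow if fast_is_horizontal else n_fast
        row = n_rows - 1 - row  # first stored element is at the bottom left
    return row, col
```

Whether the image is viewed from the sample towards the detector or the reverse would need a further flag that mirrors the column index.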

B. I think that it is best to avoid words like "short", which has been
   used ("usi"), and I guess "long", which hasn't. These mean particular
   things to particular language/compiler implementations and may well
   change in the future. Some equivalent wording which is less open to
   (mis)interpretation is preferable, e.g. 2_byte, 4_byte.
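
As an illustration of point B, element types named by their size can be mapped onto fixed-width decoding independently of what "short" or "long" mean to a given compiler. The type names below are hypothetical, and the byte order argument is an added assumption, since byte order has not been discussed here.

```python
import struct

# Hypothetical element-type names in the spirit of "2_byte" and "4_byte".
# With an explicit byte-order prefix, struct uses standard sizes:
# "h"/"H" are always 2 bytes and "i"/"I" are always 4 bytes.
ELEMENT_CODES = {
    "unsigned_2_byte": "H",
    "signed_2_byte": "h",
    "unsigned_4_byte": "I",
    "signed_4_byte": "i",
}


def unpack_elements(raw: bytes, element_type: str, little_endian: bool = True):
    """Decode raw bytes as a tuple of integers of the named fixed width."""
    code = ELEMENT_CODES[element_type]
    prefix = "<" if little_endian else ">"
    size = struct.calcsize(prefix + code)
    if len(raw) % size:
        raise ValueError("byte count is not a multiple of the element size")
    return struct.unpack(f"{prefix}{len(raw) // size}{code}", raw)
```

The same four bytes decode as two 2-byte values or one 4-byte value depending only on the declared name, which is the kind of unambiguous reading the header keyword should allow.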

-------------------------------------------------------------------------------

Lastly: Whilst my point 4 was not directly answered (being at least in
part dependent on point 3), I have been asked to present a short talk on
"imageNCIF" at the CIF workshop, which takes place during the IUCr
congress in August. 

I see this as encouraging.

        Andy Hammersley








