RE: prototype CBF format

To: Multiple recipients of list <[email protected]>
Subject: RE: prototype CBF format
From: "J.W. Pflugrath" <[email protected]>
Date: Tue, 11 Jun 1996 12:19:06 -0400 (EDT)
Here are some general comments on the proposed prototype CBF format.  
I have left in only parts that I am commenting on.  You will have to
go back to the original to get the complete context.

Jim Pflugrath

PS:  The opinions expressed are my own and not necessarily those of my
     employer, etc.

==============================================================================

>Q. Is this an appropriate way forward ?

Yes.

>Q. Is this a suitable basis for developing a CBF definition which would 
>then be proposed to the COMCIFS/IUCr ?

Yes.

>The Crystallographic Binary File Format


>2. It is an exact number of blocks of 512 bytes in length, and may be 
>   considered in a block structure (2).

I like this.

>3. The very start of the file has an identification item (3). This item
>   also describes the CBF version or level. e.g.

>###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

>The first hash means that this line is a comment line for CIF, but the 
>three hashes mean that this is a line describing the binary file layout 
>for CBF (4). No whitespace may precede the first hash sign.

This gives a so-called magic number to identify the file. 

>4a. The header section, including the identification items which delimit
>it, uses only ASCII characters, and is divided into "lines". The "line
>separator" symbol(s) is/are the same regardless of the operating system
>on which the file is written (6). (This is an importance difference with
>CIF, but must be so, as the file contains binary data, so cannot be 
>translated from one O.S. to another, which is the case for ASCII text
>files.) 

>4b. The header section within the delimiting identification items
>obeys all CIF rules [1], with the exception of the line separators.

>o "Lines" are a maximum of 80 characters long.

This seems artificial.  Why cannot lines be more than 80 characters?  Are
there some rules for dividing longer lines into shorter lines?
Are shorter lines padded to 80 characters? 

Down in the code that does the reading, there is really no difference
between reading a binary file and an ASCII file.
Text files are often read a line at a time: the low-level code reads a
character and checks whether it is an end-of-line (i.e. linefeed).  If it
is, it stops reading; otherwise it keeps reading.  A binary read simply
reads a set number of bytes and returns.  Both types check for end-of-file.

I see the problem: if you read the header without paying attention
to what is in it (you wish to parse it later), you also read
into the binary data, which is bad.  To some extent, if we have only an
end-of-header keyword, we are going to be very inefficient reading the header.

If we know the header length is in the first keyword (say by definition), and
the header is at least 512 bytes long, we can read 512 bytes, parse it for
the header length, then read the remainder of the header.  When you write
the header, first you build it in memory, then you know its length, so you
write out 512-byte chunks with the length in the first keyword.  Now I
think I read/heard arguments against this, but could they please be
presented again?
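
To make the suggestion concrete, here is a rough Python sketch of the
read-512-bytes-then-the-rest idea.  The keyword name `_header_length`, the
fixed-width number, the cr lf separator and padding with blanks are all
assumptions for illustration; none of them is in the proposal.  The sketch
also ignores the 80-character line limit when padding.

```python
import io

BLOCK = 512
MAGIC = b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0"
END = b"###_END_OF_CBF_HEADER\r\n"


def build_header(items):
    """Build the header in memory; its first keyword holds its own length."""
    body = b"".join(f"{k} {v}\r\n".encode("ascii") for k, v in items)
    # Fixed-width field, so filling in the length does not change the length.
    prefix_fmt = MAGIC + b"\r\n_header_length %8d\r\n"  # hypothetical keyword
    unpadded = len(prefix_fmt % 0) + len(body) + len(END)
    total = -(-unpadded // BLOCK) * BLOCK  # round up to a whole block
    padding = b" " * (total - unpadded)    # whitespace before the end marker
    return prefix_fmt % total + body + padding + END


def read_header(f):
    """Read one block, find the header length, then read the remainder."""
    first = f.read(BLOCK)
    for line in first.split(b"\r\n"):
        if line.startswith(b"_header_length"):
            total = int(line.split()[1])
            break
    else:
        raise ValueError("no _header_length keyword in the first block")
    return first + f.read(total - BLOCK)


header = build_header([("_image_byte_order", "highbytefirst")])
stream = io.BytesIO(header + b"\x00\x01binary")
assert read_header(stream) == header  # the file is now positioned at the data
```

The reader never scans the binary section looking for the end-of-header
line, which is the point of putting the length first.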

>o All data names start with an underscore character and are a maximum 
>  of 32 characters long.

This also seems artificial, but I can live with it.

>o The hash symbol (outside a character string) means that all text
>  up to the line separator is a comment.

>o Whitespace outside of character strings is not significant.

So keywords need not appear at the beginning of a line?
And more than one 'keyword value' can appear on a line?

>o Data names are case insensitive.

>o The data item follows the data name separator, and may be of one of
>  two types: text string (char) or number (numb). (The type is
>  specified for each data name.)

>o Text string may be delimited with single of double quotes, or blocks of
>  text may be delimited by semi-colons occuring as the first character on
>  a line.

I need an example of this.  It seems the FIRST character of a line is
special if it is a semicolon.  Or a semicolon is special if it is the
first character.  This seems odd to me.  Are there any other special
characters and/or special placement in CIF?  (that is besides #, ', and ")

>o The 'loop_' mechanism allows a data name to have multiple values

>Any CIF data name may occur within the header section.

>5a. The end of the header section is delimited by the following special
>identifier (4), (7):

>###_END_OF_CBF_HEADER

>The "line" is terminated by the "line separator" immediately after the
>"R" or "HEADER". No whitespace can be added at this point.

This gives a clear termination of the header and the beginning of binary
data.  No problems with it.

>5b. Whitespace (blank characters and lines) may be used to reserve space
>in the header section (for undefined later use), but this white space must
>occur before the end of header delimiter item.

Did we decide to use whitespace to pad out the header to a multiple of 512
bytes?   If so, is a formfeed character whitespace?  Can I put a comment
with a formfeed in it just before the end-of-header keyword if I want?  As in:

# End of header coming up <ff>
###_END_OF_CBF_HEADER

>...

>4.0 DESCRIBING THE BINARY DATA

>If the value is 'none' there is no binary data section in the file.

This takes care of dataless or header-only files.  I like it.

>4.1 The "image" Class of Binary Data (14)

>...
>e.g. If the '_image_size_dimensionality' data item is greater than 1,
>then '_image_size_dimension_2' and maybe other '_image_size_dimension_?'
>items must be defined up to the dimensionality of the array.

I guess we are restrained to one attribute or value per keyword.  This
seems silly to me, but in the interest of being CIF-like I have no major
objections; it just makes life difficult.  I would not mind seeing the
persuasive arguments that settled this matter for the CIF-folks.

>4.2 "Image" Element Rastering and Orientation (17)
>--------------------------------------------------

>Fundamental to treating a long line of data values as a 2-D image or
>series of 2-D images is the knowledge of the manner in which the values 
>need to be wrapped. For the raster orientation to be meaningful we
>define the sense of the view:

>The sense of the view is defined as that looking from the crystal
>towards the detector (12). 

Well, I do not think this is fundamental.  What is fundamental is that
a 2D array is usually stored as rows and columns.  What is a row? What is
a column?  Are they horizontal, vertical?  I have suggested that we not use
words such as horizontal and vertical because my horizontal might be
your vertical.  Or my vertical might be dependent on the experiment.
Row and column can also be confusing.  I suggest nomenclature that deals with
the first and second directions.  These could be either direction 1 and
direction 2 (or direction 0 and direction 1).  We define the first direction as the
FAST direction and the second direction as the SLOW direction. That is how
the data array is stored in the file.  
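
As a small illustration of FAST and SLOW (a sketch only; the function
names are made up here), the FAST index is the one that changes most
rapidly as you walk along the stored element stream:

```python
def element_offset(i_fast, i_slow, n_fast):
    """Offset, in elements, of element (i_fast, i_slow) in the stored stream."""
    return i_slow * n_fast + i_fast


def wrap(stream, n_fast, n_slow):
    """Wrap a flat element stream into n_slow runs of n_fast values each."""
    if len(stream) != n_fast * n_slow:
        raise ValueError("stream length does not match the dimensions")
    return [stream[s * n_fast:(s + 1) * n_fast] for s in range(n_slow)]


stream = list(range(6))          # six elements, n_fast = 3, n_slow = 2
array = wrap(stream, 3, 2)       # array[i_slow][i_fast]
assert array[1][2] == stream[element_offset(2, 1, 3)]
```

Nothing here says which direction is horizontal or vertical; that is left
to the orientation and view information.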

The ORIENTATION and VIEW of the image data are separate issues.

If we take ORIENTATION first, I desire a way to orient the image into
a laboratory frame.  The laboratory frame will be a 3D orthogonal one
with axes X, Y, Z [or 1, 2, 3 :) ].  Caveat: Your XYZ may not be my XYZ.
I can specify orientation vectors for the FIRST image direction and the
SECOND image direction:

_image_orientation_vector_1_1  1
_image_orientation_vector_1_2  0 
_image_orientation_vector_1_3  0

_image_orientation_vector_2_1  0
_image_orientation_vector_2_2  1 
_image_orientation_vector_2_3  0

or whatever.  In the above, my FIRST direction is along lab X, my SECOND
direction is along lab Y.  If my image axes were not orthogonal, I could
still use vectors to designate this.  This might be useful for electron density
maps.  The vectors must be consistent with the lab frame chosen for all
the experimental properties.  This has problems with non-flat images, such
as cylindrical or curved plates.
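
For a flat image, the two orientation vectors are enough to place any
element in the lab frame.  A minimal sketch, assuming the element sizes are
given along the two image directions and assuming an `origin` for element
(0, 0), which is not something defined above:

```python
def pixel_to_lab(i_first, i_second, vec1, vec2, size1, size2,
                 origin=(0.0, 0.0, 0.0)):
    """Lab-frame position of element (i_first, i_second) of a flat image.

    vec1, vec2   : orientation vectors of the FIRST and SECOND directions
    size1, size2 : element sizes along those directions
    origin       : assumed lab position of element (0, 0)
    """
    return tuple(
        o + i_first * size1 * a + i_second * size2 * b
        for o, a, b in zip(origin, vec1, vec2)
    )


# FIRST direction along lab X, SECOND along lab Y, as in the example above.
x, y, z = pixel_to_lab(10, 20, (1, 0, 0), (0, 1, 0), 122e-6, 121e-6)
```

The same arithmetic works if the two vectors are not orthogonal.  It does
not work for cylindrical or curved plates.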

Next Andy proposed a designation for how to view images.

>Below are shown the 8 possible ways of rastering the element stream:
>1, 2, 3, 4, 5, 6, 7, 8, 9 (13).
>...

Other schemes are in use, such as a short text:
  +x+y, +x-y, ..., +y+x, ..., -y-x

Either way is OK by me, but text strings have a (slightly) better chance of 
being understood all by themselves.  Maybe use something besides x and y
which can be confused with lab X and lab Y.  For example,
  +FAST-SLOW or -FIRST+SECOND

where the first part designates what goes across the display left to right
(i.e. width) and the second part designates what goes down the screen from
top to bottom.
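
A sketch of how a display program might interpret such a string.  The
reading of the signs is my assumption: "+" means the index increases left
to right (first part) or top to bottom (second part), and "-" means it
decreases.  That gives 2 orderings times 4 sign combinations, i.e. 8 views.

```python
import re


def parse_view(code):
    """Split e.g. '+FAST-SLOW' into (across, sign) and (down, sign)."""
    m = re.fullmatch(r"([+-])(FAST|SLOW)([+-])(FAST|SLOW)", code.upper())
    if m is None or m.group(2) == m.group(4):
        raise ValueError(f"bad view string: {code!r}")
    return (m.group(2), m.group(1)), (m.group(4), m.group(3))


def display(array, code):
    """Turn array[i_slow][i_fast] into screen rows, top row first."""
    (across, sign_across), (down, sign_down) = parse_view(code)
    n = {"FAST": len(array[0]), "SLOW": len(array)}

    def index(k, axis, sign):
        return k if sign == "+" else n[axis] - 1 - k

    screen = []
    for d in range(n[down]):
        row = []
        for a in range(n[across]):
            pos = {across: index(a, across, sign_across),
                   down: index(d, down, sign_down)}
            row.append(array[pos["SLOW"]][pos["FAST"]])
        screen.append(row)
    return screen


array = [[0, 1, 2], [3, 4, 5]]   # n_fast = 3, n_slow = 2
flipped = display(array, "+FAST-SLOW")
```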

>_image_byte_order highbytefirst     # Written on a Sun-4 workstation

Can we have synonyms for some values?  Such as big_endian, little_endian?
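
Synonyms cost the reading program almost nothing.  A sketch in Python,
where `lowbytefirst` is my guess at the counterpart of `highbytefirst`
and the other two names are the synonyms suggested above:

```python
import struct

# Map each accepted value onto the struct module's byte-order character.
BYTE_ORDER = {
    "highbytefirst": ">",   # value used in the proposal's example
    "big_endian": ">",      # suggested synonym
    "lowbytefirst": "<",    # assumed counterpart, not in the proposal
    "little_endian": "<",   # suggested synonym
}


def unpack_u16(raw, byte_order):
    """Decode raw bytes as unsigned 2-byte integers in the given byte order."""
    prefix = BYTE_ORDER[byte_order.lower()]
    return struct.unpack(f"{prefix}{len(raw) // 2}H", raw)


values = unpack_u16(b"\xff\xff\x00\x01", "highbytefirst")
```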

>_image_intensities_overload  65535  # Saturation level

Yes, saturated pixel values need to be known.

>_image_element_size_1          122e-6 
>_image_element_size_2          121e-6

It might be useful to have a nominal pixel size and overall image size
for images that have spatial distortion or where the pixels vary in size.

>...

>(6) The exact manner in which to define the line separation is a subject
>of discussion. Either using a single line-feed character (as is done by
>Un*x), or using the combination of a carriage-return character followed
>by a line-feed character (as is done by MS-DOS and related systems), are
>the likely candidates. 

cr lf works for Unix, but lf alone does not work for DOS, so why not just
decide on cr lf which works for both?

>(7) Some clear identifier signalling the end of the header section and 
>where the binary section begins, or some equivalent method for
>achieving the same is vital. Here a clear identifier is proposed, but an
>alternative method could also work.

A clear identifier is good.  For speed, I like to read in the header
separately and for speed, this requires knowing the length of the header.
So I would not mind having the header length specified at the beginning
of the file.  I have not seen much support for this feature in this forum
though.

>(8) If normal computer data e.g. 2-byte integers, or IEEE reals are being 
>stored in essentially native format then word boundaries should be 
>respected. Given that higher "quadruple" precision data types and 
>complex data types may potentially be wanted, I suggest that at least 
>32 byte boundaries are respected, but maybe for efficiency or simplity 
>reasons it's desirable to use the full block boundaries. 

This should not be a problem when writing to disk unless the disk is
memory mapped.  In that case, you are not worried about being platform
or architecture dependent anyways.

>(12) Some may prefer to define the view as the "camera-mans" view, and
>may be this is better as part of an overall consistent co-ordinate
>system for lab/crystal/detector. I note that MADNES defines the view
>from the camera-mans point of view.
>
>     Which definition of the viewing direction should we use ?
>     Is there an IUCr standard co-oordinate system ?

Users should be allowed to choose what is natural for them.

>(14) I still wonder about the use of "image". 
>
>     Should we change the word "image" to "array", which I feel is more
>     consistent with the uses which I have defined ?
>
>     Or should we restrict "image" to refer only to a 2-D array type
>     data object, and eventually define other data classes such as
>     "histogram", "images", "volume", etc ?
>
>     Or leave the class "image" to refer to a whole variety of 
>     N-Dimensional arrays ?

"array" seems better to me.

>(15) How should complex arrays be stored ?
>
>     Should pairs of real and imaginary values be stored as alternative
>     values in the element stream ?
>
>     Or should whole array e.g. a image, of the real components be
>     stored separately from an identical array containing the imaginary
>     components ?

Do what is simplest.  Store the values on disks as they would be stored
in memory.  For complex numbers, I think this means use the former (i.e.
alternate values).
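
A sketch of the alternating layout, assuming 4-byte IEEE reals written
high byte first (both are assumptions for the example):

```python
import struct


def pack_complex(values):
    """Store complex values as real, imaginary, real, imaginary, ..."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return struct.pack(f">{len(flat)}f", *flat)


def unpack_complex(raw):
    """Read the interleaved element stream back into complex values."""
    flat = struct.unpack(f">{len(raw) // 4}f", raw)
    return [complex(re_, im_) for re_, im_ in zip(flat[0::2], flat[1::2])]


raw = pack_complex([1 + 2j, 3.5 - 0.25j])   # 2 values, 16 bytes
restored = unpack_complex(raw)
```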
 
>(17) More general detector orientation information has delibrately been
>     avoided in the first stage of defining the CBF format, but even to
>     describe the sense of an image from an area-detector a certain
>     amount of external geometrical information is necessary. 
>
>     Does the IUCr have standard coordinate system to define arbitrary 
>     detector position and orientation ?

A method suggested by David Thomas at one of the EEC Cooperative Programming
Workshops and used in MADNES is to define 3 detector translations and rotations
with respect to the laboratory frame.  That is, you have 6 vectors.  Then
you define the coordinates of the detector in this 6 dimensional system.
The vectors are not necessarily parallel to any lab vector.  The rotations
are applied in the order given.  So if you rotate around lab Z first, then X,
then Y, you must give the rotation vectors in that order.
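
A sketch of why the order matters.  It assumes the rotation axes stay
fixed in the lab frame and that translations are applied after the
rotations; this is my reading for the example, not a description of what
MADNES actually does.

```python
import math


def rotate(v, axis, angle_deg):
    """Rotate vector v about the given axis (Rodrigues' rotation formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    return [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c)
            for i in range(3)]


def place_detector_point(point, rotations, translations):
    """Apply rotations in the order given, then the translations.

    rotations    : list of (axis_vector, angle_in_degrees)
    translations : list of (direction_vector, distance)
    """
    for axis, angle in rotations:
        point = rotate(point, axis, angle)
    for direction, distance in translations:
        point = [p + distance * d for p, d in zip(point, direction)]
    return point


Z, X = (0, 0, 1), (1, 0, 0)
z_then_x = place_detector_point([1, 0, 0], [(Z, 90), (X, 90)], [])
x_then_z = place_detector_point([1, 0, 0], [(X, 90), (Z, 90)], [])
# z_then_x ends up along lab Z, x_then_z along lab Y.
```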

The idea is extended to other experimental properties including 
crystal goniometers, source goniometers or directions, etc.

>(18) Other definitions of sample to detector distance are possible, and
>     used.
>
>     How should the sample to detector distance be defined ?

>(19) From the CIF core dictionary '_audit_creation_date' defines only the
>date of creation. The time of the date creation needs to be defined to
>a precision of fractions of a second. Either the '_audit_creation_date' 
>...
>     How is the time of creation of data best stored ?

Among possibilities are an hh:mm:ss.sss format or seconds since midnight.
I like the former, where hh is 0-23 because I am human and not a computer.  
I am not too concerned about time zones.
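
The two forms convert into each other trivially, so a program can keep
seconds internally and still write the readable form.  A sketch, with
millisecond precision assumed:

```python
def hms_to_seconds(text):
    """Convert 'hh:mm:ss.sss' (hh is 0-23) to seconds since midnight."""
    hh, mm, ss = text.split(":")
    h, m, s = int(hh), int(mm), float(ss)
    if not (0 <= h <= 23 and 0 <= m <= 59 and 0 <= s < 60):
        raise ValueError(f"time out of range: {text!r}")
    return h * 3600 + m * 60 + s


def seconds_to_hms(seconds):
    """Convert seconds since midnight to 'hh:mm:ss.sss'."""
    ms = round(seconds * 1000)          # work in whole milliseconds
    h, rest = divmod(ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


stamp = seconds_to_hms(hms_to_seconds("12:19:06.250"))
```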