No Subject

To: [email protected], [email protected], [email protected]
From: Andy Hammersley <[email protected]>
Date: Mon, 24 Jun 96 16:46:17 +0200

Hello, Brian, David, Nick, Syd, and imageNCIF members,

   I've received too much cif/cbf mail recently to answer every point,
but I'll give a re-cap of my view of the present situation.

My view is that the e-mail conversation which has recently taken place
with Nick, Syd, and Brian, is to a large extent a repeat of e-mails
which have already taken place with David and the imageNCIF group, and 
I think that we more or less reach the same conclusion. I think that 
this is the source of the apparent "Strong feelings appear to be
stirring in the breasts ...". 

I haven't objected since I felt that the basic points had not been made
explicitly enough. Now I have given examples of the scale of the image
storage problem, which was a useful exercise for me to quantify
present (ESRF) storage and data rate problems. (Examples from other labs
would be welcome.) Similarly the undesirability of a two file system was
just accepted by everyone, including David, without explicit discussion
as to why it was dangerous, prone to error, etc. Now various examples
and different members' views have been given.

So the re-cap:

--------------------------------

1. Images are large and efficiency is important. (Disk and bus
bandwidths may be limiting in some situations, but this is more
reason for binary data compression, and with RAID arrays, faster I/O
buses, etc. the limiting factor may vary from application to
application.)

(Note: Efficiency rather than necessarily convenience is one of the main
requirements of the members of imageNCIF.)

2. A separate True-CIF header and binary file pair format is considered
undesirable since this is highly prone to inconsistent file pointers,
and can allow the image header to be separated from the binary
data which is being described. (I'm not the only member of the group
who clearly has a strong belief in Murphy's law.)

3. Given that 2. is rejected the format _CANNOT_ be CIF in a true sense.
Hence, I started using the term imageNCIF = image (Not) CIF, because the
format cannot be CIF. So a binary format is proposed with a header
section which is "CIF-related". David came up with the term
"CIF-compatible" as opposed to "CIF-compliant".

4. Having decided that a format which is not-CIF is required, why make
it similar to CIF ? 

a. For experiments using area-detectors this would be 
the first stage in the data reduction path. Ideally various data reduction
programs would output true-CIFs and auxiliary information stored in the 
raw data file would be output in the CIF. This will be easiest if the 
definitions, data names, etc. are the same. All sorts of CIF DDL names
could easily first appear in a CBF file (See example in CBF document).

b. CIF also exists, COMCIFS exists within the IUCr, so there is already a 
structure for maintaining such a "standard". I would like CBF to be
considered in the same way as other CIF's. In other words make COMCIFS
responsible for coordinating both development of CIF's and CBF's. This
would seem to me to be the safest manner in which to avoid diverging
formats.

--------------------------------

I think that brings the discussion up to date. If it is viewed that COMCIFS
is only concerned with CIF's, then I think that Brian provides a
practical solution to what is probably more a theoretical than real
problem. i.e. COMCIFS is concerned with the "imageCIF" file, after it has
been extracted from the CBF. (However, I would still prefer that the
whole format is considered as IUCr property, and maintained as such
along the same lines as CIF.)


It looks like mmCIF is considering a number of "area-detector" data
names. This needs to be coordinated since we are talking about
the same items. It seems clear that the detector to sample distance
and eventually other quantities describing the detector orientation
need their own data names with precise definitions (e.g. different people
use different definitions for the detector distance). Putting detector
distance and 2-theta angle within the '_diffrn_measurement.details' name
is not O.K. e.g.

S>*    _diffrn_measurement.details
S>     ;      440 frames, 0.20 \%, 150 s,
S>            detector distance 12 cm, 
S>            detector angle 22.5 \%
S>     ;

What does this mean ? How can it be reliably parsed to extract the 
information ? 

The inadequacy of this proposal seems to be recognised:

S>     ...                    It is immediately clear though that
S> items such as _diffrn_measurement.details will not be adequate
S> for our purposes and probably Paula will need to be alerted to
S> this. The five values in this example are each important and
S> really need to be assigned a separate definition.    ...


A clear definition of the coordinate system is needed together with an
appropriate conceptual model of the diffractometer and detector. If
this exists within mmCIF, or elsewhere, I would be more than happy.
Data names and definitions should follow naturally. 

(I need to think further about Jim's proposals for "orientation" and 
"view". What does this imply for writing software ?)

-------------------------------------------------------------------------------

A number of points have been made, some of which reveal differences in 
emphasis between CIF and CBF:

> From Nick and Syd:

---------------

> On Mon, 17 Jun Andy Hammersley writes ....
> 
> >      (eg. encoding, packing or zipping) and is not at all portable.
> 
> > Not true. ftp in binary mode. WWW/netscape etc. has no problems with
> > playboy pictures, diffraction data doesn't have to be any different.
> 
> Yes but ftp is a straight bit->bit copy protocol. If you copy a binary
> file encoded on a little-endian architecture to one with a big endian
> architecture YOU are still going to have to do the translating. ftp
> doesn't do it for you. The reason why playboy pictures are viewable is
> strictly because the standard encodes a particular architecture into it,
> little-endian for gif for instance.
>

And

> >_image_byte_order highbytefirst     # Written on a Sun-4 workstation
> 
> > Can we have synonyms for some values?  Such as big_endian, little_endian?
> 
> Why would your standard hard-encode the architecture in which the data file
> was written? Take a look at the gif etc standard. Choose one and everybody
> conforms to it.
> 
>

So binary files can be portable !

gif solves big versus little endian representation by choosing one and 
sticking to it. TIFF on the other hand "solves" the problem by storing 
the information in the very first two bytes of a tiff file: 'II' or
'MM'. TIFF processing software must know how to process both types. In 
both GIF and TIFF "byte swapping" may or may not be necessary and must be
supported by the read/write software (unless the software is restricted
to run on only one type of platform). We could consider allowing only
one byte order for CBF, but this would lead to unnecessary
inefficiency, and may be quickly out of date. e.g. At present almost 
all Xtallographic software is Unix based, so we might choose big-endian 
storage, but if in some terrible future PC's with DOS-based O.S.'s come to 
rule the world, both reading and writing of the CBF would require 
byte-swapping. Given that both presently exist in large numbers I
suggest that it's better to allow both.
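
As a rough illustration of what "allow both" costs the reading software, here
is a minimal Python sketch. It assumes unsigned 2-byte pixels and the two
header values "highbytefirst" and "lowbytefirst"; the function names are
hypothetical and are not part of the CBF draft.

```python
import struct
import sys

# Map the two byte-order values discussed here onto struct's prefixes.
BYTE_ORDER_PREFIX = {
    "highbytefirst": ">",  # most significant byte stored first (big-endian)
    "lowbytefirst": "<",   # least significant byte stored first (little-endian)
}


def decode_uint16_pixels(raw: bytes, byte_order: str) -> list:
    """Decode unsigned 2-byte pixels stored in the declared byte order."""
    prefix = BYTE_ORDER_PREFIX[byte_order]
    count = len(raw) // 2
    return list(struct.unpack(f"{prefix}{count}H", raw[: count * 2]))


def needs_swap(byte_order: str) -> bool:
    """True when the stored order differs from this machine's native order."""
    native = ">" if sys.byteorder == "big" else "<"
    return BYTE_ORDER_PREFIX[byte_order] != native


raw = bytes([0x01, 0x02, 0x00, 0xFF])
print(decode_uint16_pixels(raw, "highbytefirst"))  # [258, 255]
print(decode_uint16_pixels(raw, "lowbytefirst"))   # [513, 65280]
```

The reader pays for a swap only when needs_swap() is true, which is the
efficiency argument for storing the byte order rather than fixing it.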

Answering Jim's question:

I would prefer to limit the allowed values to data names, so I would
prefer not to allow synonyms. However, if "little_endian" and "big_endian"
are preferred I would suggest that they are used instead. (Personally,
I can never remember which way round big and little endian numbers are
stored, and find the term "highbytefirst" or "lowbytefirst" much more
explicit and easier to understand. If this is just my problem, then I'm
happy to use "little_endian"/"big_endian" since once the software is
written once, you can forget which is which.)


---------------

> On Tue Jun 11 "J.W. Pflugrath" writes ....
> 
> >###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0
> 
> >The first hash means that this line is a comment line for CIF, but the
> >three hashes mean that this is a line describing the binary file layout
> >for CBF (4). No whitespace may precede the first hash sign.
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> A simple question; why not?

This is from the CBF document. The idea is to have an identifier to
recognise the file type at the start of the file. In Unix-speak this is
a "magic number". This is a concept which doesn't seem to exist in a 
CIF. i.e. Neither a computer program nor a human being has a simple
fast method to recognise if a particular file is a CIF or not. Such a 
concept is felt to be generally useful. e.g. If someone tries to send a 
TIFF file to a program designed to read CBF, I would like a simple clear
error message like "not correct file type".

By not allowing white space this is always the same and easy to check.
Clearly a more versatile rule could be given, but the processing
software would have to be more sophisticated and simple O.S. tools might
no longer work.
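
To show how simple the check becomes when no white space is allowed, here is
a minimal Python sketch. The identifier string is the one quoted above from
the CBF document; the function names are hypothetical.

```python
# Identifier from the first line of the draft CBF layout.
MAGIC = b"###_CRYSTALLOGRAPHIC_BINARY_FILE"


def looks_like_cbf(first_bytes: bytes) -> bool:
    """The identifier must start at byte 0, so a plain prefix test is enough."""
    return first_bytes.startswith(MAGIC)


def check_file_type(path: str) -> None:
    """Read only as many bytes as the identifier needs and reject anything else."""
    with open(path, "rb") as handle:
        start = handle.read(len(MAGIC))
    if not looks_like_cbf(start):
        raise ValueError(f"{path}: not correct file type")


print(looks_like_cbf(b"###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0\r\n"))  # True
print(looks_like_cbf(b"II*\x00"))  # False (start of a little-endian TIFF)
```

If leading white space were allowed, the fixed-length read and prefix test
would have to be replaced by a scan of unknown length.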

---------------

> Hmmm.... My understanding of Andy's proposal is that the header, is part
> of the binary file except it contains ASCII bytes, the rest of the data
> is also in binary in the form of packed bytes. If you run "more" over it,
> it will deal with the entire file. BTW Jim could you send me the copy of
> your version of "more". I don't have one that I can pipe a binary file
> through and have it work the way you say.
> 

A CBF doesn't presently exist as far as I know (after all the format is
still being defined), but I have looked at a number of existing binary
files of similar "ASCII-header" lay-out. 'more' on hp-ux A.09.01 does
what I require. i.e. allows me to see the ASCII information stored in
the header section. It more or less skips the binary
image. Alternatively I use emacs, which displays binary with a backslash
followed by the value in decimal. (I use this to view the header, I
don't want to pipe the output anywhere other than to a terminal.)

---------------

> > It is correct that binary information might become difficult to read in 
> > the future if the format for numbers change. 
> 
> It is difficult to read now.
> 

And

> (ii) Check out the directions that other disciplines are taking with 
>      large scale data. Is it true for example that increasingly data 
>      formats are ascii based, and if so, why? 
>

There seems to be a culture problem here. I and many of the members of
the imageNCIF group write software which copes with binary data written
by both little and big endian machines, which may be transparently
compiled and run on both little and big endian machines. Many other
people have also done this, people who write TIFF, GIF and other image
processing software. In the area of area detectors and images I don't
know of anyone using ASCII as a serious storage format. 

> >The "line" is terminated by the "line separator" immediately after the
> >"R" or "HEADER". No whitespace can be added at this point.
> 
> > This gives a clear termination of the header and the beginning of binary
> > data.  No problems with it.
> 
> Don't understand how the presence of spaces after the "R" could confuse
> anything? Why require this restriction?
> 

Knowing the precise byte position of the start of the binary data is 
absolutely crucial. You may view this as a disadvantage of binary.
Thus strict rules are needed outside the ASCII header section. Of
course the information defining this start could be defined differently.

> >5b. Whitespace (blank characters and lines) may be used to reserve space
> >in the header section (for undefined later use), but this white space must
> >occur before the end of header delimiter item.
> 
> You're going to reserve space for currently non-existant data! Why is
> this needed in free format protocol?
> 

Strictly this section is redundant, but it shows how it would be
possible (if required) to reserve space in the header for later addition
of extra auxiliary information, without having to re-write the whole file.
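
A minimal Python sketch of the idea, under stated assumptions: the delimiter
string "###_END_OF_HEADER" is purely hypothetical (the real end of header
item is whatever the CBF document defines), as are the function names and
the reserved size of 128 bytes.

```python
# Hypothetical end of header delimiter, used only for this illustration.
END_MARKER = b"###_END_OF_HEADER\r\n"


def build_header(items: bytes, reserved_size: int) -> bytes:
    """Pad with blanks so that items plus delimiter occupy exactly reserved_size bytes."""
    padding = reserved_size - len(items) - len(END_MARKER)
    if padding < 0:
        raise ValueError("header items do not fit in the reserved space")
    return items + b" " * padding + END_MARKER


def binary_offset(file_bytes: bytes) -> int:
    """Byte position of the first binary byte: immediately after the delimiter."""
    position = file_bytes.find(END_MARKER)
    if position < 0:
        raise ValueError("end of header delimiter not found")
    return position + len(END_MARKER)


image = bytes([0, 1, 2, 3])
items = b"_image_byte_order highbytefirst\r\n"
original = build_header(items, 128) + image
extended = build_header(items + b"# note added later\r\n", 128) + image
print(binary_offset(original), binary_offset(extended))  # 128 128
```

Because the padding sits before the delimiter, the later addition consumes
reserved blanks and the binary data stays at the same byte position.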

-----------------

> There is NO restriction of one value to each keyword. Data may be
> loop_ed. We have a problem "things that seem silly to you, as being
> attributed to being CIF-like". It is a silly restriction and it is NOT a
> CIF requirement. Please read the CIF/STAR specification. Reference above.
> 
>

CIF only allows one level of loops, as I understand things. David
suggested that we don't use loops e.g. for the size of various
dimensions, so that the one available level of looping is available for
other, presently undetermined, purposes. The CBF document is presently 
written on this basis. Maybe this could be re-considered ?

-------------

> >   2. A CBF is not extensible in the same way as STAR. Whereas new data can be
> >      inserted or appended to an existing STAR File, as can extensions to
> >      a dictionary without requiring software to be re-written, this 
> >      cannot be the case with a CBF.
>    
> > The present DRAFT proposal is not extensible in the same sense as a CIF.
> > However the possibility to append images etc. could be included, but with 
> > extra complication overheads.
> 
> Wouldn't it be nice to do this with something as simple as vi or cat! :-)
> 

And

> >(8) If normal computer data e.g. 2-byte integers, or IEEE reals are being 
> >stored in essentially native format then word boundaries should be 
> >respected. Given that higher "quadruple" precision data types and 
> >complex data types may potentially be wanted, I suggest that at least
> >32 byte boundaries are respected, but maybe for efficiency or simplicity
> >reasons it's desirable to use the full block boundaries. 
> 
> .... And will this make this even more portable and easier to understand? 
> Some vision is needed here.
> 

Here the compromise between functionality/ efficiency/ and simplicity
becomes apparent. The view of the imageNCIF group has tended to be that
efficiency and simplicity are the main priorities.

Thus there is a choice between: easily using memory mapping
_OR_ being able to concatenate files or arbitrarily editing the header
(assuming that you have an appropriate editor e.g. emacs).

Restrictions of block boundaries, and even word boundaries could all be
removed, BUT with the price of less efficient access, and extra
complication in the code. This would add extra functionality which is not
obviously desirable. (Whilst I consider it an advantage to be able to
use standard system tools to examine such files, I consider it a
disadvantage and a huge danger to allow editing and other arbitrary 
changing of such files. Imagine someone arbitrarily changes the number
of pixels in one of the image directions !)
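
For what respecting a boundary means in practice, a minimal Python sketch
follows. The 32 byte default is the figure suggested in the quoted point (8);
the 512 byte block size and the function names are assumptions made only for
the example.

```python
def padding_to_boundary(offset: int, boundary: int = 32) -> int:
    """Number of filler bytes needed so that the next item starts on a boundary."""
    return (-offset) % boundary


def aligned_offset(offset: int, boundary: int = 32) -> int:
    """Smallest multiple of boundary that is not less than offset."""
    return offset + padding_to_boundary(offset, boundary)


print(aligned_offset(100))       # 128
print(aligned_offset(128))       # 128 (already aligned, no filler needed)
print(aligned_offset(100, 512))  # 512 (assumed block size)
```

With the binary data starting on such a boundary, the array can be memory
mapped directly; editing the header length by hand would break this.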

--------------

> > cr lf works for Unix, but lf alone does not work for DOS, so why not just
> > decide on cr lf which works for both?
> 
> Ummm .... <cr><lf> does not work for Unix. Only the <lf> bit of the
> <cr><lf> combo works. in other words under DOS you would have one
> buffer, and under Unix you would have the same buffer plus the <cr>
> character on the end. YOU would still have to strip it off.
>

<cr><lf> works in the sense that a human being could use "emacs", maybe
"more", and other standard system tools to look at the header section. 
On Unix they might be slightly annoyed by the "^M" appearing at the
end of every line, but otherwise the header section would read normally.
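
For software rather than human readers, the stray <cr> is cheap to handle.
A minimal Python sketch, with a hypothetical function name:

```python
def split_header_lines(header: bytes) -> list:
    """Split on <lf> and drop one trailing <cr>, so both conventions read alike."""
    return [
        line[:-1] if line.endswith(b"\r") else line
        for line in header.split(b"\n")
    ]


dos_style = b"_image_byte_order highbytefirst\r\n"
unix_style = b"_image_byte_order highbytefirst\n"
print(split_header_lines(dos_style) == split_header_lines(unix_style))  # True
```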

------------------

Yves Epelboin writes:

> I am not personally satisfied with all the details of the proposed 
> format but I do not see any major difficulty since I believe we will 
> need a library of functions to extract relevant information. I do not 
> see any difficulty in the future to enhance, modify and adapt these 
> functions when computers will evolve.
 
And

> If no agreement is acceptable for COMCIF people, let us define a 
> complete new format with or without STAR format since we will use 
> binary files.
> May be we should start the discussion again without this restriction.

What are the problems ? Are they follow-ons from making the format
"CIF-compatible" or otherwise ? 

------------------

Brian McMahon writes:

> How does CBF propose to include an audio annotation of an included
> data set :-)  ?

(Thought experiment)

The '_binary_data_class' would be extended to have an extra value, say
'audio'. Existing software would examine the CBF and on finding a value
for '_binary_data_class' which they didn't recognise would exit
hopefully with a simple error message explaining the problem e.g.

Error: Unknown binary data type: "audio" 
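
A minimal Python sketch of that behaviour. The set of recognised values
("image" only) and the function name are assumptions for the example; the
error text is the one given above.

```python
# Assumed set of classes that this (imaginary) reader was written to handle.
KNOWN_BINARY_DATA_CLASSES = {"image"}


def check_binary_data_class(value: str) -> str:
    """Return the value if recognised, otherwise fail with a simple, clear message."""
    if value not in KNOWN_BINARY_DATA_CLASSES:
        raise ValueError(f'Error: Unknown binary data type: "{value}"')
    return value


try:
    check_binary_data_class("audio")
except ValueError as error:
    print(error)  # Error: Unknown binary data type: "audio"
```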

There would be a number of compulsory and optional data names presumably
of class audio e.g. '_audio_size_dimension' might give the total number of
samples and be compulsory. Alternatively, the '_array' data class could
be argued as being a suitable basis for such storage, but new data
compression algorithms would doubtless be appropriate.

The problem would be to allow this audio information and the image
information to both be defined in the same file. CBF rules to allow
multiple header/binary sections or a single header section referring to
a binary section containing both types of data would be required. I
discussed this in a previous mail message to Nick.

-------------------------------------------------------------------------------

I am off on holiday on the 6th July, then I'll be at Stony Brook/
Brookhaven for a week, and then at Seattle. I look forward to meeting you
in Seattle and

--------------
from Brian:

>  ...             In any case, I hope no irrecoverable decisions are made
> before we all have a chance to get together and work through some of the
> issues over a beer at Seattle.        ...

--------------

continue discussion over a beer.

         Andy