Discussion List Archives


CBF - long discussion


Hi Andy

Syd and I have gone through the discussion group comments and have these
responses. We will follow your suggestion and send them also to the 
discussion group - which I assume is imgcif-l@bnl.gov.

Nick and Syd

##################

Brian passed on the bunch of comments from your discussion group
following the circulation of our CBF comments and criticisms. We checked
them all out....some were interesting....others seem to indicate a quite
serious lack of understanding of either star or your binary format.

The first thing that has to be established in these discussions is that 
Syd and I are critically interested in your proposals for only one 
reason....that is to make sure that you achieve your objectives for
handling massive data in a way that does not prejudice current star/cif
standards and practices.

It is increasingly clear to us, as it should be to you, that the basic idea
of CBF has little relevance to either CIF or STAR and therefore its [CIF/STAR]
syntax concepts should not even be mentioned. We would gather that your 
user group wants to build a binary file format which is neither CIF nor STAR 
compliant and we are more than happy to not participate in these deliberations.
We were seeking in our earlier proposal to let you have your cake 
and eat it ....but if our proposed merging of concepts is unacceptable because 
of separated files (inline comment: look carefully at this as the stated
points against this possibility seem somewhat specious and untechnical)
then we ask that CIF and STAR NOT be referred to in your format name or
descriptions. Of course you can use any of the syntax ideas you want so long
as they are not referred to as being cif/star compliant (because they will
not be!). 

Having made that statement of our interests and intentions, my remaining
remarks are in response to comments of your working group (some of which 
indicate a serious lack of understanding of the technical problems you face,
which the star-text approach was designed to overcome), plus some 
constructive advice on possible considerations.

(i)  An apparently overlooked historical point: standard files based on 
     machine-specific binary formats have been around for eons and the 
     portability problems they pose were one of the reasons for the ascii 
     universal file approaches of asn.1 and star. The bin file problems 
     have not diminished in recent years, and we urge you to confer with
     some network and comsci specialists on portability, efficiency etc. 
     if you are determined that this direction is a viable longterm approach.

(ii) Check out the directions that other disciplines are taking with 
     large scale data. Is it true for example that increasingly data 
     formats are ascii based, and if so, why? 
     
(iii) Following your latest mail I checked with another comsci colleague
      of mine who specialises in languages, networking protocols and 
      distributed environments. He sees no advantage in an exchange
      protocol based on pure binary or mixed ascii/binary formats (the
      latter is achievable).
      The efficiency differences with pure ascii are minimal because the real
      limitation is the disk or network bandwidths. The BIG plus in his 
      view is the human manipulation of the file with STANDARD software.
      "Every other consideration will increasingly pale to insignificance"...
      his words. Explore these points with other experts in the business.

Some quick tidying up to do with several of the comments.....

On Mon Jun 10 "J.W. Pflugrath" writes ...

> In practice, having 2 separate files complicates things enormously.  For
> example, you make a tar or backup tape of your data.  If the ascii file
> ends up on one tape and the binary file ends up on another tape you have
> an annoying problem.

Yes, the extra parent file is a minus.... but weigh this against the 
advantages of easy access to parameters...no contest in my view. In any
case there will always need to be a systematic approach to handling multiple
files, whether they be binary, or ascii and binary.    

> Well, one need not load the entire file to view the experimental parameters.
> They are in the header which can be read easily without reading the
> binary data (e.g. on Unix systems: 'more image_file').
> So this is not a benefit at all.

Hmmm.... My understanding of Andy's proposal is that the header is part
of the binary file, except that it contains ASCII bytes; the rest of the
data is also binary, in the form of packed bytes. If you run "more" over it,
it will deal with the entire file. BTW Jim could you send me a copy of
your version of "more". I don't have one that I can pipe a binary file
through and have it work the way you say.

> Well, as long as we are putting a backpointer reference in the binary file,
> why not put the whole set of experimental parameters there too?  If we are
> going to do that, why not use a CIF(STAR-like) syntax?
   
This point is addressed above. As far as we are concerned you may use whatever 
syntax you like, just don't refer to it as CIF/STAR. We believe that it is important
to avoid that confusion in the user community.

>>   * The impossiblility of imbedding binary data into ascii CIF files makes
>>     it non-viable.

> Has it been decided this was an impossibility?

No bit configs are impossible with unix because it doesn't distinguish file types.
The issue here is one of viability as an exchange standard ... where are the gains,
what are the losses, what extra software is needed, and how is compliance to be
achieved if it is to be used as a "standard". These are the serious issues.

> Ok, we add a tool to strip the ASCII header off the binary image file to
> yield what Syd Hall and Nick Spadaccini have proposed.  Or do CIF/STAR
> tools recognize an end-of-CIF/STAR keyword.  If they did, they would be
> able to work with an ASCII header-binary data file.

It's more than disconcerting to try and contribute to the CBF discussions and
be countered by arguments that show a fundamental lack of understanding of
how STAR works and why it is so powerful in handling repetitive data over other 
formats such as asn.1. These are tag/value protocols; what would one need an
"end-of-CIF/STAR keyword" for, and what does it mean? Ref: Hall & Spad. (1993)
J Chem Inform & Comp Sci 34, 505-508.

 ..... Yves Epelboin writes ...


> It is correct that binary information might become difficult to read in 
> the future if the format for numbers change. 

It is difficult to read now.

> But why to believe that ASCII will remain forever? 

Seems likely that ascii is so imbedded in our culture, and is human
readable, and is completely machine independent that it will be around
for a "little" while longer.


On Tue Jun 11 "J.W. Pflugrath" writes ....

>###_CRYSTALLOGRAPHIC_BINARY_FILE: VERSION 1.0

>The first hash means that this line is a comment line for CIF, but the 
>three hashes mean that this is a line describing the binary file layout 
>for CBF (4). No whitespace may precede the first hash sign.
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A simple question; why not?

>4a. The header section, including the identification items which delimit
>it, uses only ASCII characters, and is divided into "lines". The "line
>separator" symbol(s) is/are the same regardless of the operating system
>on which the file is written (6). (This is an importance difference with
>CIF, but must be so, as the file contains binary data, so cannot be 
>translated from one O.S. to another, which is the case for ASCII text
>files.) 

>4b. The header section within the delimiting identification items
>obeys all CIF rules [1], with the exception of the line separators.

Any exception means CBF does NOT obey the STAR syntax which is, by the
way, patented by the IUCr! The criticality of violating ONE rule in a 
"universal exchange format" cannot be overstressed.

>o "Lines" are a maximum of 80 characters long.

This is a current CIF restriction that is likely to be relaxed. The star
syntax does not fix record lengths. Advice: Don't fix the record length!

>o All data names start with an underscore character and are a maximum 
>  of 32 characters long.

Ditto above. This will also go eventually. Don't fix the name length either.

> So keywords need not appear at the beginning of a line?

Correct!

> And more that one 'keyword value' can appear on a line?

It's part of CIF/STAR - free formatting. BTW the above has to be in a
loop_

>o Data names are case insensitive.

free formatting again.

>o The data item follows the data name separator, and may be of one of
>  two types: text string (char) or number (numb). (The type is
>  specified for each data name.)

>o Text string may be delimited with single of double quotes, or blocks of
>  text may be delimited by semi-colons occuring as the first character on
>  a line.

> I need an example of this.  It seems the FIRST character of a line is
> special if it is a semicolon.  Or a semicolon is special if it is the
> first character.  This seems odd to me.  Are there any other special
> characters and/or special placement in CIF?  (that is besides #, ', and ")

The CIF/STAR specification is pretty explicit about this. The ; as the
first character of a line puts you into the "semi-colon delimited text
string" state, and you return out of this state by the detection of
another ; as the first char of a line.
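
As an illustration only (this is not text from the CIF/STAR specification),
the two-state rule just described can be sketched in Python. The function
name read_text_fields and the data names in the example are made up for
this sketch, and everything except the column-one semicolon rule (quotes,
comments, data names) is deliberately ignored.

```python
def read_text_fields(lines):
    """Collect semicolon-delimited text fields from an iterable of lines.

    Simplified sketch: only the "';' as first character of a line" rule
    is modelled. Quotes, comments and data names are not interpreted.
    """
    fields = []
    current = None  # None means we are outside a text field
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(";"):
            if current is None:
                # Opening ';' in column one: enter the text state.
                current = [line[1:]]
            else:
                # Another ';' in column one: leave the text state.
                fields.append("\n".join(current))
                current = None
        elif current is not None:
            # Inside the field, a ';' elsewhere on the line is plain text.
            current.append(line)
    return fields


example = [
    "_example_comment",          # hypothetical data name
    ";first line of text",
    "a semicolon here ; is not special",
    ";",
    "_example_next_item 42",     # hypothetical data name
]
print(read_text_fields(example))
```

The semicolon is special only as the first character of a line, which is
why the one in the middle of the second text line is kept as ordinary text.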

>The "line" is terminated by the "line separator" immediately after the
>"R" or "HEADER". No whitespace can be added at this point.

> This gives a clear termination of the header and the beginning of binary
> data.  No problems with it.

Don't understand how the presence of spaces after the "R" could confuse
anything? Why require this restriction?

>5b. Whitespace (blank characters and lines) may be used to reserve space
>in the header section (for undefined later use), but this white space must
>occur before the end of header delimiter item.

You're going to reserve space for currently non-existent data! Why is
this needed in a free-format protocol?

> Did we decide to use whitespace to pad out the header to a multiple of 512
> bytes?   If so, is a formfeed character whitespace?  Can I put a comment
> with a formfeed in it just before the end-of-header keyword if I want?  As in:

Dear me ....

>If the value is 'none' there is no binary data section in the file.

> This takes care of dataless or header-only files.  I like it.

Hang on .... you have an end-of-header marker! What do you need a data
value for, to tell you there is no data?

> I guess we are restrained to one attribute or value per keyword.  This
> seems silly to me, but in the interest of being CIF-like I have no major
> objects, it just makes life difficult.  I would not mind seeing the 
> persuasive arguments that settled this matter for the CIF-folks.

There is NO restriction of one value to each keyword. Data may be
loop_ed. We have a problem with things that seem silly to you being
attributed to being CIF-like. It is a silly restriction and it is NOT a
CIF requirement. Please read the CIF/STAR specification. Reference above.
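
To make the loop_ point concrete, here is a minimal Python sketch of how
several values attach to each data name. It is an illustration under strong
assumptions: the function name parse_simple_loop and the data names are
invented, and values are assumed to contain no quotes, no text fields and
no leading underscore.

```python
def parse_simple_loop(text):
    """Parse a single loop_ construct into {data_name: [values...]}.

    Simplified sketch: tokens are separated by any whitespace (free
    format), so names and values may sit anywhere on any line.
    """
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected the text to start with loop_")
    names = []
    i = 1
    # Data names come first; each starts with an underscore.
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i])
        i += 1
    values = tokens[i:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the name count")
    # Values are dealt out to the names in turn, one packet per row.
    columns = {name: [] for name in names}
    for k, value in enumerate(values):
        columns[names[k % len(names)]].append(value)
    return columns


# Hypothetical data names; note the deliberately irregular layout.
text = """
loop_ _example_axis _example_size
   x 1024     y
   2048
"""
print(parse_simple_loop(text))
```

Each data name ends up with two values here, and neither the names nor the
values need to start a line.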

>_image_byte_order highbytefirst     # Written on a Sun-4 workstation

> Can we have synonyms for some values?  Such as big_endian, little_endian?

Why would your standard hard-encode the architecture on which the data file
was written? Take a look at the gif etc. standards. Choose one and everybody
conforms to it.

>(6) The exact manner in which to define the line separation is a subject
>of discussion. Either using a single line-feed character (as is done by
>Un*x), or using the combination of a carriage-return character followed
>by a line-feed character (as is done by MS-DOS and related systems), are
>the likely candidates. 

> cr lf works for Unix, but lf alone does not work for DOS, so why not just
> decide on cr lf which works for both?

Ummm .... <cr><lf> does not work for Unix. Only the <lf> bit of the
<cr><lf> combo works. In other words, under DOS you would have one
buffer, and under Unix you would have the same buffer plus the <cr>
character on the end. YOU would still have to strip it off.
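
A short Python sketch of the stray <cr> follows. The header bytes and data
names are made up for the illustration; reading lines from an in-memory
byte stream splits on <lf> only, which is the Unix convention described
above.

```python
import io

# Two header lines written with <cr><lf> separators (hypothetical names).
raw = b"_example_item_a 1\r\n_example_item_b 2\r\n"

# Split the way a Unix tool would: a line ends at <lf> and nothing else.
unix_lines = [line.rstrip(b"\n") for line in io.BytesIO(raw)]

# Every buffer still carries the <cr> on the end.
print(unix_lines)

# The reader has to strip it explicitly.
cleaned = [line.rstrip(b"\r") for line in unix_lines]
print(cleaned)
```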

>(8) If normal computer data e.g. 2-byte integers, or IEEE reals are being 
>stored in essentially native format then word boundaries should be 
>respected. Given that higher "quadruple" precision data types and 
>complex data types may potentially be wanted, I suggest that at least 
>32 byte boundaries are respected, but maybe for efficiency or simplity 
>reasons it's desirable to use the full block boundaries. 

... And will this make this even more portable and easier to understand? 
Some vision is needed here.


> I propose a new working name for the standard:  BINCIF
> This stands for    But It's Not CIF

Stick to CBF, anything containing CIF is not acceptable.



On 13 Jun "I. David Brown" writes ....

[deleted]

David was just trying to point out what seemed to be a lack of
understanding of what CIF was and how it worked.


On Thu Jun 13 Peter Keller writes ....

More stuff about what CIF/STAR is and how it works


On Mon, 17 Jun Andy Hammersley writes ....

>      (eg. encoding, packing or zipping) and is not at all portable.

> Not true. ftp in binary mode. WWW/netscape etc. has no problems with
> playboy pictures, diffraction data doesn't have to be any different.

Yes, but ftp is a straight bit->bit copy protocol. If you copy a binary
file encoded on a little-endian architecture to one with a big-endian
architecture YOU are still going to have to do the translating. ftp
doesn't do it for you. The reason why playboy pictures are viewable is
strictly because the standard encodes a particular architecture into it,
little-endian for gif for instance.
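
The byte-order point can be sketched with Python's struct module. The value
0x1234 is arbitrary and chosen only for the illustration; the copy step
stands in for a binary-mode transfer, which changes nothing in the bytes.

```python
import struct

value = 0x1234  # arbitrary 16-bit example value

# The same number as written by a little-endian and a big-endian machine.
little = struct.pack("<H", value)
big = struct.pack(">H", value)

# A bit-for-bit copy (what a binary-mode transfer gives you).
transferred = bytes(little)

# Reading with the wrong byte order silently yields a different number.
misread = struct.unpack(">H", transferred)[0]

# Reading with the byte order the format fixes gives the value back.
correct = struct.unpack("<H", transferred)[0]

print(little.hex(), big.hex(), hex(misread), hex(correct))
```

Because the reader names the byte order explicitly, the result does not
depend on the machine doing the reading, which is the effect of a standard
fixing one byte order for everybody.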

>   2. A CBF is not extensible in the same way as STAR. Whereas new data can be
>      inserted or appended to an existing STAR File, as can extensions to
>      a dictionary without requiring software to be re-written, this 
>      cannot be the case with a CBF.
   
> The present DRAFT proposal is not extensible in the same sense as a CIF.
> However the possibility to append images etc. could be included, but with 
> extra complication overheads.

Wouldn't it be nice to do this with something as simple as vi or cat! :-)

>   5. The CBF format looks like a CIF and there is a significant danger that 
>      it could be mistaken for a CIF (if there is an editor that will 
>      handle a binary file). This is potentially confusing and may retard
>      rather than accelerate the acceptance of CIF as a standard
>      crystallographic data exchange approach.
   
> This is a potential danger. Suggestions to reduce this danger are welcome. 

See first suggestion at the beginning of this letter.

>   6. Finally, mutant forms of CIF such as CBF will tend to be a catalyst 
>      for others....based on the often mistaken belief that there is 
>      always a better mousetrap, and that its more efficient to adapt a
>      standard than work within it! Such enhancements eventually lead to
>      the complete collapse of the standard....as has been the case for
>      a number of computer languages. The STAR File is a LONG-TERM archival 
>      and exchange approach and therefore its syntax must be considered
>      sacrosanct.
   
> This is why it's very important that whatever imageNCIF (the working group)
> do, that it's within COMCIFS and coordinated with CIF people. This seems
> to be happening.


The point cannot be made too often: ANY deviation from the current syntax
requirements (apart from some cif restrictions to the star syntax) means
that CBF is a new format. As such we wish you every success with this new
approach and are prepared to help where we can.

Good luck, Nick and Syd.

International Union of Crystallography