Discussion List Archives

Re: Westbrook's draft dictionary

Greetings,

Here is some follow-up to David Brown's comments...

I. David Brown wrote:
> 
>         I had a look at John's draft dictionary.  I may be missing
> something but it seems to me that this draft presents a solution
> that we discussed earlier and rejected.
> 
>         He treats the binary file as a piece of STAR text set between
> semicolon delimiters.  There are a number of difficulties with this
> seemingly simple solution.  Firstly the STAR definitions requires all
> fields to contain only ascii characters.  Secondly carriage returns are
> used to terminate lines even within text strings without themselves being
> part of the text.  Finally there is no guarantee that a binary string will
> not contain the code for 'CR ;' thereby terminating the string in the
> middle.  Any binary sequence cannot, in the nature of things, be
> self-terminating, its length has to be specified externally, and this is
> contrary to all the principles of STAR.  I would be delighted to discover
> that this problem is overcome in DDL2, but it seems to me insurmountable.

These observations are quite correct, and I apologize for leaving out
some important implementation details of the embedded binary data item
approach that Andy and I had discussed off-line.  My suggestion for
overcoming the parsing problem is to treat the binary data items like
variable-length network packets.  A binary data item might look
something like the following:

{data_length, [chksum or some other kind of signatures ...], data}


The advantage here is that you have a rather simple mechanism that would
permit the integration of binary data into CIF-like files.  The
disadvantage is that it breaks all of the STAR and CIF conventions.
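
As a minimal sketch of the packet idea, the Python below frames a binary
item as a length, a checksum, and the raw data.  The field widths (two
4-byte big-endian integers) and the use of CRC-32 as the signature are
assumptions made for illustration; the layout above does not fix either.
The names pack_item and unpack_item are hypothetical.

```python
import struct
import zlib

# Assumed header layout: 4-byte length, then 4-byte CRC-32, both big-endian.
HEADER = struct.Struct(">II")


def pack_item(data: bytes) -> bytes:
    """Frame a binary item as {data_length, checksum, data}."""
    return HEADER.pack(len(data), zlib.crc32(data)) + data


def unpack_item(buf: bytes, offset: int = 0):
    """Read one framed item starting at offset.

    Returns the payload and the offset of the first byte after it.
    """
    length, checksum = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    end = start + length
    data = buf[start:end]
    if len(data) != length:
        raise ValueError("binary item is truncated")
    if zlib.crc32(data) != checksum:
        raise ValueError("binary item failed its checksum")
    return data, end
```

Because the reader consumes exactly data_length bytes, a payload that
happens to contain a line break followed by a semicolon does not end the
item early; the parser resumes ordinary text parsing at the returned
offset.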

> That is why we have been leaning towards a fully binary file with an
> extractable ascii header that when extracted is cif compatible.
>

#
# Pure ascii block ...

data_experiment1

_entry.id   experiment1

loop_
_entry_link.id
_entry_link.entry_id
_entry_link.details
binary_block_1   experiment1 'First binary data set'
binary_block_2   experiment1 'Second binary data set'

#
#   Define encoding details
#
loop_
_array_structure.id
_array_structure.byte_order
_array_structure.encoding_type
dataset1   big_endian  64_bit_real_ieee
dataset2   big_endian  64_bit_real_ieee

#
#  Define the organization
#
loop_
_array_structure.array_id  
_array_structure.index
_array_structure.dimension 
_array_structure.precedence 
_array_structure.direction
 dataset1    1    10     1     increasing
 dataset1    2    100    2     decreasing
 dataset2    1    20     1     increasing
 dataset2    2    20     2     increasing

#
# First binary block  ...
#
data_binary_block_1

_entry.id binary_block_1

_entry_link.id          experiment1
_entry_link.entry_id    binary_block_1
_entry_link.details     'Contains the description of my binary data'

_array_data.array_id
_array_data.data
dataset1 {1000,FFAD00A,0FFFFA82774688299A9A9A9A99A9ADFA897255377377
....
.........}

#
# Second binary block
#
data_binary_block_2

_entry_link.id          experiment1
_entry_link.entry_id    binary_block_2
_entry_link.details     'Contains the description of my binary data'

_array_data.array_id
_array_data.data
dataset2 {400,FFAD00A,0FFFFA82774688299A9A9A9A99A9ADFA897255377377 .....
.........}

#--------------------------------------------------------------------------

Following David's suggestion, the binary data is segregated in separate
data blocks.  In the pure ascii data block, the entry_link items identify
the blocks containing the binary items.  CIF provides no mechanism for
referencing data items between data blocks.  Hence, it is not possible
to specify that 'dataset1' resides in the data block named
'binary_block_1'.  However, in this example we stretch the significance
of the entry_link items to mean that the indicated data blocks are
required to resolve all of the data items referenced in the current
block.  In this way you are essentially specifying a search list of data
blocks in the current file.
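
The search-list reading of entry_link can be sketched in a few lines of
Python.  This is an illustration only: the dictionaries stand in for
already-parsed data blocks, and the function name find_block_for_array
is hypothetical.

```python
def find_block_for_array(array_ids_by_block, links_by_block, start, array_id):
    """Return the name of the first block on the search list holding array_id.

    array_ids_by_block maps a block name to the array ids stored in it.
    links_by_block maps a block name to the block names listed in its
    entry_link items, in file order.
    The search list is the starting block followed by its linked blocks.
    """
    search_list = [start] + list(links_by_block.get(start, []))
    for name in search_list:
        if array_id in array_ids_by_block.get(name, ()):
            return name
    return None


# Data mirroring the example file above.
array_ids_by_block = {
    "experiment1": set(),
    "binary_block_1": {"dataset1"},
    "binary_block_2": {"dataset2"},
}
links_by_block = {"experiment1": ["binary_block_1", "binary_block_2"]}

located = find_block_for_array(
    array_ids_by_block, links_by_block, "experiment1", "dataset2"
)
```

Here the lookup for 'dataset2' starts in experiment1, finds nothing
there, and then walks the linked blocks in order until it reaches
binary_block_2.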

In the present example, _array_data.array_id is the child of
_array_structure.id, and correspondingly a link is specified in each
binary block pointing to the ascii block where the parent item is
defined.  In the ascii block there is no formal necessity for the
entry_link specification, as the binary blocks that are referenced
contain only child data items.  The inclusion of these items does seem a
convenience from an organizational point of view.

Given the above example, one could simply separate the file on data
block boundaries and be left with one ascii CIF and two other files that
would require a bit of special treatment.
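
A rough Python sketch of that separation step follows.  It assumes the
file can be read as text (as in the example, where the binary items are
shown in a printable form) and that no payload line begins with
'data_'; a file carrying raw bytes would need the length-aware reading
described earlier.  The function name split_data_blocks is hypothetical.

```python
def split_data_blocks(text):
    """Split file text into {block_name: block_text} on data_ headers.

    Lines before the first header are dropped.  Comment lines that
    precede a later header stay with the block above them.
    """
    blocks = {}
    name = None
    lines = []
    for line in text.splitlines(keepends=True):
        # Data block headers are matched case-insensitively.
        if line.lower().startswith("data_"):
            if name is not None:
                blocks[name] = "".join(lines)
            name = line.split()[0][len("data_"):]
            lines = [line]
        elif name is not None:
            lines.append(line)
    if name is not None:
        blocks[name] = "".join(lines)
    return blocks
```

Each value can then be written to its own file, with the first block
giving the ascii CIF and the remaining blocks giving the files that need
special handling.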


-- 
******************************************************************
*  John Westbrook                    Ph:  (908) 445-4290         *
*  Department of Chemistry           Fax: (908) 445-4320         *
*  Rutgers University                                            *
*  PO Box 939                     e-mail: jwest@ndb.rutgers.edu  *
*  Piscataway, NJ 08855-0939                                     *
******************************************************************

International Union of Crystallography