Discussion List Archives


Re: Westbrook's draft dictionary

I would like to vigorously endorse _both_ sides in the binary vs ascii dispute,
because we need both formats.  Just as we need the character string "36" to
be able to exchange documents referring to the number 36, we need a pure
character string representation of all information in an imageCIF.  Just as
we need the various internal binary representations of 36 to be able to do
useful computations, we need efficient binary representations of certain
information in an imageCIF.

It is not necessary to perturb CIF in any major way to handle this
approach.  John has already done the major work with this new dictionary,
building on Andy's work, and he laid the groundwork in the work he did on
mmCIF.  In mmCIF, the simple data types of CIF have been refined to allow
us to deal with the distinctions between character strings that represent
integers and character strings that represent reals.  The IUCr already
distinguishes various kinds of text fields which represent flat ascii text
versus formatted text.  Now we need to provide some additional types for
text fields so that a token may be identified with lines of 80 column text
fully respecting the standard CIF conventions for text fields, but where
the characters represent exactly the sort of binary information John has
just proposed.  The identification of the encoding could be done by a
header in the text field, as the IUCr has done for formatted text, but I
would suggest we would have a more robust and extensible system by
associating the type of the field with the token itself in the dictionary.
This would require some extension to the DDL to cover data type semantics
as well as data type syntax, but, if we follow the MIME model, this could
be done very efficiently.
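
As a rough illustration of such a text field, here is a minimal Python
sketch.  It assumes base64 as a stand-in encoding (the encoding itself is
not fixed by this proposal, and base64 costs about 1/3 in overhead, so a
denser scheme might well be preferred).  The function names and the
76-column width are hypothetical choices made for the example.

```python
import base64

LINE_WIDTH = 76  # hypothetical width; anything up to 80 columns would do


def encode_text_field(payload, width=LINE_WIDTH):
    """Encode bytes as a semicolon-delimited text field of short ASCII lines."""
    encoded = base64.b64encode(payload).decode("ascii")
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    # The base64 alphabet contains no ';', so no data line can be mistaken
    # for the closing delimiter of the text field.
    return "\n".join([";"] + lines + [";"])


def decode_text_field(field):
    """Recover the original bytes from a field made by encode_text_field."""
    lines = field.split("\n")
    if len(lines) < 2 or lines[0] != ";" or lines[-1] != ";":
        raise ValueError("not a semicolon-delimited text field")
    return base64.b64decode("".join(lines[1:-1]))
```

The round trip is lossless, and every line stays within 80 columns, which
is all the sketch is meant to show.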


So, suppose we have defined one of these encoded binary tokens, with the
name _block_whatever_icbe (where the icbe at the end might remind us that
this is an imageCIF binary encoded field).  For archiving this is just what
we would use, but for doing an experiment, even though we would take an
encoding which keeps the overhead to less than 3/16, we may well need to
work internally with a pure binary format, or even, for efficiency, to
exchange with our colleagues in binary.  That is fine.  Just as we have
mutually exclusive alternate tokens for other quantities, we could also have
a _block_whatever_bin token, which would flag use of a pure binary field
exactly as John has proposed in our internal file format, which, while very
efficiently translatable to and from a CIF, is _not_ itself a CIF because
of the points raised by I. D. Brown.

If this binary file is not itself a CIF, then why need we be concerned with
it in this discussion?  Because the entire point of CIFs is to provide a
sound way to present and archive information with well understood internal
representations.  The long process of creating mmCIF dealt in large part
with the _internal_ workings of databases.  Much of what is in small
molecule CIF and mmCIF relates to the internal working of refinement
programs.  Why should we not be equally concerned with the internal working
of data collection programs?

Thus, what I am proposing is that data collection work would go forward
with a mixed ascii/binary format, call it imageIPCIF (for image internal
pseudo-CIF), which would have a rigorously defined translation to and from
a pure CIF imageCIF, with the tokens for both imageIPCIF and imageCIF laid
out in one common imageCIF dictionary, written in a slightly extended DDL2
to allow for the full presentation of data-type semantics and type-to-type
conversions for both ascii encoded and pure binary fields, with the
understanding that pure
binary fields are for use within imageIPCIF documents, and are presented
within imageCIF to clarify the semantics of the tokens used there, not as
permission to use them within a CIF.

In imageIPCIF files we would have whatever ascii sections we needed.  When we
hit one of the tokens of a binary type, the file would revert to binary
following John's convention, and then back to ascii.  For maximum
interchange, I would suggest that the ascii section follow the Postscript
convention that any of the following is an acceptable end-of-line:
      <CR>
      <CR><LF>
      <LF>
This is what allows Postscript files to be created on almost any platform
and be seen as a text file on that platform, yet still be processed by all
Postscript printers.
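
A reader that accepts all three line endings could be sketched in Python
as follows.  The function name is hypothetical, and the sketch assumes the
ascii section has already been separated from any binary fields.

```python
import re

# Try <CR><LF> first so that it counts as one line end rather than two.
_EOL = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split an ascii section on <CR>, <CR><LF> or <LF>."""
    return _EOL.split(text)
```

With this, a section written on any of the common platforms yields the
same list of lines.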

When we wish to archive such a file or send it to someone working on a
platform that may have difficulty with our binary conventions, we take the
ascii section and spew it out verbatim until we hit a token of binary type.
Then we look up the associated icbe token, and replace the token and
translate the data field, and we have an honest, pure ascii CIF for
interchange.  Reversing the process is also simple.
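
The translation in both directions can be sketched in Python under some
loud assumptions: the file is taken as already parsed into (token, value)
pairs, so John's binary framing convention is not modelled here; the
lookup table stands in for the dictionary; and base64 is only a
placeholder for whatever encoding is finally chosen.  All function and
table names are hypothetical.

```python
import base64

# Hypothetical stand-in for the dictionary lookup that pairs each
# pure binary token with its ascii encoded alternate.
BIN_TO_ICBE = {"_block_whatever_bin": "_block_whatever_icbe"}
ICBE_TO_BIN = {icbe: raw for raw, icbe in BIN_TO_ICBE.items()}


def to_image_cif(items):
    """Replace each binary item by its icbe token and encoded value."""
    out = []
    for name, value in items:
        if name in BIN_TO_ICBE:
            name = BIN_TO_ICBE[name]
            value = base64.b64encode(value).decode("ascii")
        out.append((name, value))  # ascii items pass through verbatim
    return out


def to_image_ipcif(items):
    """Reverse the process: icbe token back to binary token and raw bytes."""
    out = []
    for name, value in items:
        if name in ICBE_TO_BIN:
            name = ICBE_TO_BIN[name]
            value = base64.b64decode(value)
        out.append((name, value))
    return out
```

Because each binary token has exactly one icbe alternate in the table,
the two functions are inverses of each other on any list of items.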

If people like this general approach, we can look in more detail at the
necessary MIME types and alternate ways of doing this.

 -- Herbert


=====================================================
****                BERNSTEIN + SONS
*   *       INFORMATION SYSTEMS CONSULTANTS
****     P.O. BOX 177, BELLPORT, NY 11713-0177
*   * ***
**** *            Herbert J. Bernstein
  *   ***     yaya@bernstein-plus-sons.com
 ***     *
  *   *** 1-516-286-1339    FAX: 1-516-286-1999
=====================================================


