Re: Westbrook's draft dictionary
- To: Multiple recipients of list <imgcif-l@bnl.gov>
- Subject: Re: Westbrook's draft dictionary
- From: yaya@bernstein-plus-sons.com (Herbert J. Bernstein)
- Date: Sat, 1 Feb 1997 09:33:58 -0500 (EST)
I would like to vigorously endorse _both_ sides in the binary vs ascii dispute, because we need both formats. Just as we need the character string "36" to be able to exchange documents referring to the number 36, we need a pure character string representation of all information in an imageCIF. Just as we need the various internal binary representations of 36 to be able to do useful computations, we need efficient binary representations of certain information in an imageCIF.

It is not necessary to perturb CIF in any major way to handle this approach. John has already done the major work with this new dictionary, building on Andy's work, and John laid the groundwork in the work he did on mmCIF. In mmCIF, the simple data types of CIF have been refined to allow us to deal with the distinctions between character strings that represent integers and character strings that represent reals. The IUCr already distinguishes various kinds of text fields, which represent flat ascii text versus formatted text. Now we need to provide some additional types for text fields, so that a token may be identified with lines of 80-column text fully respecting the standard CIF conventions for text fields, but where the characters represent exactly the sort of binary information John has just proposed.

The identification of the encoding could be done by a header in the text field, as the IUCr has done for formatted text, but I would suggest we would have a more robust and extensible system by associating the type of the field with the token itself in the dictionary. This would require some extension to the DDL to cover data-type semantics as well as data-type syntax, but, if we follow the MIME model, this could be done very efficiently. So, suppose we have defined one of these encoded binary tokens, with the name _block_whatever_icbe (where the icbe at the end might remind us that this is an imageCIF binary encoded field).
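As a concrete sketch of such an encoded binary field: the token name _block_whatever_icbe comes from the discussion above, but the base64 encoding and the exact MIME-style header line are assumptions made purely for illustration; the dictionary would fix the real choices.

```python
import base64
import textwrap

def encode_icbe(tag, data, encoding="base64"):
    # Render binary data as a pure-ascii CIF semicolon text field.
    # A MIME-style header line identifies the encoding; base64 is
    # illustrative here, not the encoding the dictionary would mandate.
    text = base64.b64encode(data).decode("ascii")
    lines = textwrap.wrap(text, 76)  # keep every line within 80 columns
    field = [tag, ";", "Content-Transfer-Encoding: " + encoding, ""]
    field.extend(lines)
    field.append(";")
    return "\n".join(field) + "\n"

def decode_icbe(field_body):
    # Inverse: strip the header line and the blank separator,
    # rejoin the wrapped lines, and decode back to raw bytes.
    lines = field_body.splitlines()
    assert lines[0].startswith("Content-Transfer-Encoding:")
    return base64.b64decode("".join(lines[2:]))
```

A round trip (encode, then decode the text-field body) returns the original bytes, and every emitted line respects the 80-column convention.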
For archiving this is just what we would use, but for doing an experiment, even though we would take an encoding which keeps the overhead to less than 3/16, we may well need to work internally with a pure binary format, or even, for efficiency, to exchange with our colleagues in binary. That is fine. Just as we have mutually exclusive alternate tokens for other quantities, we could also have a _block_whatever_bin token, which would flag use of a pure binary field exactly as John has proposed in our internal file format, which, while very efficiently translatable to and from a CIF, is _not_ itself a CIF because of the points raised by I. D. Brown.

If this binary file is not itself a CIF, then why need we be concerned with it in this discussion? Because the entire point of CIFs is to provide a sound way to present and archive information with well understood internal representations. The long process of creating mmCIF dealt in large part with the _internal_ workings of databases. Much of what is in small molecule CIF and mmCIF relates to the internal working of refinement programs. Why should we not be equally concerned with the internal working of data collection programs?

Thus, what I am proposing is that data collection work would go forward with a mixed ascii/binary format, call it imageIPCIF (for image internal pseudo-CIF), which would have a rigorously defined translation to and from a pure CIF imageCIF, with the tokens for both imageIPCIF and imageCIF laid out in one common imageCIF dictionary, written in a slightly extended DDL2 to allow for the full presentation of data-type semantics and type-to-type conversions for both ascii encoded and pure binary fields, with the understanding that pure binary fields are for use within imageIPCIF documents, and are presented within imageCIF to clarify the semantics of the tokens used there, not as permission to use them within a CIF. In imageIPCIF files we would use whatever ascii sections we needed.
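The alternate-token idea might be sketched as follows. The token names _block_whatever_bin and _block_whatever_icbe come from the proposal above; the pairing table and the base64 translation are assumptions for illustration, since the real encoding and field conventions would be fixed by the dictionary.

```python
import base64

# Mutually exclusive alternate tokens: the pure-binary token used
# inside an imageIPCIF working file and its ascii-encoded partner
# used in the archival imageCIF.  Illustrative names only.
BIN_TO_ICBE = {"_block_whatever_bin": "_block_whatever_icbe"}
ICBE_TO_BIN = {v: k for k, v in BIN_TO_ICBE.items()}

def bin_field_to_ascii(token, raw):
    # Archiving direction: replace the binary-typed token by its
    # icbe partner and translate the raw bytes into a pure-ascii
    # CIF text field (base64 assumed here for illustration).
    text = base64.b64encode(raw).decode("ascii")
    return BIN_TO_ICBE[token] + "\n;\n" + text + "\n;\n"

def ascii_field_to_bin(token, body):
    # Reversing the process: recover the binary token and raw bytes.
    return ICBE_TO_BIN[token], base64.b64decode(body)
```

Because the mapping is one-to-one and the translation is lossless, the conversion can be run in either direction without ambiguity, which is what makes the imageIPCIF/imageCIF pair workable.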
When we hit one of the tokens of a binary type, the file would revert to binary following John's convention, and then back to ascii. For maximum interchange, I would suggest that the ascii sections follow the Postscript convention that any of the following is an acceptable end-of-line: <CR>, <CR><LF>, or <LF>. This is what allows Postscript files to be created on almost any platform and be seen as a text file on that platform, yet still be processed by all Postscript printers.

When we wish to archive such a file or send it to someone working on a platform that may have difficulty with our binary conventions, we take the ascii section and spew it out verbatim until we hit a token of binary type. Then we look up the associated icbe token, replace the token and translate the data field, and we have an honest, pure ascii CIF for interchange. Reversing the process is also simple.

If people like this general approach, we can look in more detail at the necessary MIME types and alternate ways of doing this.

-- Herbert

=====================================================
BERNSTEIN + SONS
INFORMATION SYSTEMS CONSULTANTS
P.O. BOX 177, BELLPORT, NY 11713-0177
Herbert J. Bernstein
yaya@bernstein-plus-sons.com
1-516-286-1339  FAX: 1-516-286-1999
=====================================================