(39) IT volume?; image standard; front of dictionary; data collection
- To: COMCIFS@iucr.ac.uk
- Subject: (39) IT volume?; image standard; front of dictionary; data collection
- From: bm
- Date: Wed, 20 Dec 1995 14:25:11 GMT
Dear Colleagues

Some correspondence on my general introductory review in message (38):

BNF
===
Syd Hall remarked:

S> Good news about the BNF upgrade being on its way.

David Brown inquired:

D> Can someone enlighten me on BNF?

I present my apologies for not being more explicit. BNF stands for
Backus-Naur form. It's a formal representation of the syntax of a computer
language (which is what CIF is, in these terms). Increasingly it is the
expectation of software engineers that a formal data representation should
be supported by machine-readable syntax descriptions, applications
programming interfaces and other devices to facilitate mechanical software
development. It's for this reason that some of the COMCIFS discussions have
taken on more of a computer-science slant than might have first been
expected.

The motivation behind this current work was to give an accurate and formal
description of the STAR syntax in a form susceptible to parser design using
tools familiar to compiler writers, such as lex and yacc. In any case, I
became aware when writing my CIF syntax checker that there were possible
ambiguities in the published STAR descriptions, so the design was to remove
or clarify these grey areas.

Progress on revising dictionaries
=================================
From Syd:

S> Even better news about the powder and mm dictionaries. Brian T and Paula
S> deserve enormous congrats for completing these quite massive tasks... those
S> of us who have done the lesser exercise with the core definitions truly
S> understand what must have been involved.

Peter Murray-Rust asks:

PMR> Is there a sample of the new dictionary and the DDL that it uses?

The DDL is the canonical version 1.4 dictionary: it's at
ftp://ftp.iucr.ac.uk/pub/ddldic.c95 (I realise the CIF home page needs
radical updating now, but the link under "Current CIF Dictionaries" to the
DDL1.4 file is working). I hope to be able to release the dictionary itself
in the next few weeks.

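To illustrate the BNF point made earlier in this message, here is a minimal
sketch of how a grammar written in that style maps onto a parser. The grammar
is a deliberately simplified fragment invented for this example (one data
block of name/value pairs; no loops, save frames or semicolon text fields).
It is not the BNF being prepared for STAR, and the names TOKEN, tokenize and
parse_block are hypothetical.

```python
import re

# Simplified grammar fragment (an assumption for illustration only):
#   <block>  ::= <header> { <item> }
#   <header> ::= "data_" <non-blank-chars>
#   <item>   ::= <name> <value>
#   <name>   ::= "_" <non-blank-chars>
#   <value>  ::= "'" <chars> "'" | <non-blank-chars>
# Alternatives, in order: quoted string, comment to end of line, bare word.
TOKEN = re.compile(r"'([^'\n]*)'|#.*|(\S+)")


def tokenize(text):
    """Return (kind, text) pairs; comments and white space are discarded."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.lastindex == 1:            # quoted string is always a value
            tokens.append(("value", match.group(1)))
        elif match.lastindex == 2:          # bare word: classify by prefix
            word = match.group(2)
            if word.lower().startswith("data_"):
                tokens.append(("header", word[5:]))
            elif word.startswith("_"):
                tokens.append(("name", word))
            else:
                tokens.append(("value", word))
        # lastindex is None when the comment alternative matched: skip it
    return tokens


def parse_block(text):
    """Recognise <block>: one header followed by <name> <value> items."""
    tokens = tokenize(text)
    if not tokens or tokens[0][0] != "header":
        raise SyntaxError("expected a data_ header")
    block_name, rest = tokens[0][1], tokens[1:]
    if len(rest) % 2:
        raise SyntaxError("data name without a value")
    items = {}
    for (name_kind, name), (value_kind, value) in zip(rest[0::2], rest[1::2]):
        if name_kind != "name" or value_kind != "value":
            raise SyntaxError("expected <name> <value>, got %r %r" % (name, value))
        items[name] = value
    return block_name, items


sample = """
data_on_this_dictionary   # identifier and history
    _dictionary_name      cif_core.dic
    _dictionary_version   2.0
    _dictionary_history   '1999-99-99 blah-blah-blah etc'
"""
print(parse_block(sample))
```

Each grammar rule corresponds to one branch of the code, which is the property
that makes a BNF description convenient input for generators such as lex and
yacc. A hand-written sketch like this one says nothing about the grey areas
that the formal description is meant to settle.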
On publishing the CIF standards
===============================
S> As you are aware, and have stated, there are a range of possible publication
S> options for the cif dictionaries and supporting tools, in addition to those
S> already available on the net. I think it worth emphasising that the primary
S> notification of these efforts must still be the normal publication modes.
S> And Paula, Brian, JohnW etc. should be encouraged to get these underway asap.
S> [eg. John is going to put his SIFLIB into JAC at the same time as Herb
S> Bernstein and I am submitting the CIFtbx2 paper, and I have recommended that
S> the DDL2 paper goes into JCICS as did DDL and the STAR papers]. The
S> publication options you have mentioned are the follow up material that will
S> give detailed reference and archive data.

Right. At the request of the IUCr President, I have drawn up a very sketchy
outline proposal for a volume of International Tables that might include
these reference and archive materials. Note that it includes provision for
documentation of SIFLIB in greater depth than would be appropriate for the
JAC paper describing its functionality. It may be debatable whether such
program-specific documentation is appropriate for International Tables, but
I would argue that it is, given that it effectively defines an application
programming interface for mmCIF handling (see my comments above on the
requirements for defining a full CIF environment).

| Proposal for International Tables Volume on Crystallographic Information
| ------------------------------------------------------------------------
| A proposal was advanced during the AsCA'95 meeting in Bangkok for the
| publication of the official CIF dictionaries and attendant documentation in
| International Tables format. This would represent a permanent record of the
| official definitions recognised globally for use in Crystallographic
| Information Files.
| The dictionaries are held as electronic files
| (themselves in CIF-like format), which may easily be converted to text
| format in the style of the published Core dictionary (Acta Cryst. (1991)
| A47, 655-685).
|
| The volume should contain, as a minimum:
|
|    DDL1.4 (base Dictionary Definition Language as used in Core): 3pp.
|    DDL2.1 (relational DDL as used in mmCIF): 7pp.
|    Core Dictionary, version 1996: 33pp.
|    Powder extensions: 12pp.
|    Macromolecular (mmCIF) Dictionary: 102pp.
|    (Molecular Information File core dictionary: 3pp. ?)
|
|    (Page estimates are based on current dictionary versions)
|
|    Primary papers on STAR and DDL (reprinted from J. Chem. Inf. Comput.
|    Sci.) and CIF.
|
|    Commentary papers on CIF and/or the associated dictionaries.
|
| And possibly also:
|
|    Software documentation for standard CIF software libraries
|    (essentially this is to define an API (applications programming
|    interface) to ensure consistent CIF handling, rather than to
|    supply descriptions of individual software packages).
|
|    Handbooks of usage for the standard crystallographic databases.
|
|    Other tabulations of importance in crystallographic data
|    representation (standard image file formats, data compression
|    algorithms...?)
|
| Future editions will also include further dictionaries:
|
|    Incommensurate structures
|    Symmetry
|    Charge density
|
| The volume would naturally be complemented by a CD-ROM including the
| dictionaries in machine-readable format; and perhaps also the standard
| libraries, compiled for various platforms. Indeed a library of standard CIF
| i/o and validation public-domain software could be included.

D37.1 Lengths of data names
---------------------------
S> I want it here for the record that I was not the advocate for restricting
S> the line lengths and data name lengths for CIF. My original proposal was that
S> the cif would have a star syntax. But I was told by other programmers of the
S> day (circa 1988, but some of whom still flourish!)
S> that cif would not catch
S> on if it were so open-ended. This, and the dropping of save frames and nested
S> loops, seemed to be a compromise necessary for their cooperation in this
S> new venture. At the time I think it was the right decision ...but now?
S> Perhaps it's time for an expansion to the full star syntax...PROVIDED that
S> our main collaborators are in agreement! COMCIFS must keep in mind that the
S> success of cif has been because the whole software community has been able
S> to participate in this standard...it must not become simply a comsci
S> plaything. Nick and I are in agreement about the need for this expansion
S> and I think his summary of this is excellent. Perhaps the only thing we
S> differ on is how does one get the whole community and software culture
S> into applying these upgrades. I believe this has to be handled very
S> carefully and with much advance warning.

However, such a development should not take place until after the current
mmCIF dictionary and DDL are unleashed in their present form. How do people
view the next phase of development?

The project to develop a standard image data format that I described in the
last circular has generated a substantial amount of discussion. Let me, for
indexing purposes, retrospectively assign this thread the title

D38.1 A standard for image data
-------------------------------
Peter Murray-Rust has drawn to my attention another initiative for the
transmission of multidimensional data:

PMR> Another initiative which might be worth following up is the development
PMR> of standards to send 3- (and multidimensional) data fields using the MIME
PMR> technology. The prime mover in this is Scott Nelson at lanl.gov. Scott
PMR> and colleagues proposed a MIME type mesh/* for the transmission of such
PMR> data (I'm not sure whether HDF was one of the components). I'm sure that
PMR> crystallographers should investigate this as a way of exchanging 3-D
PMR> fields such as electron density maps.

In respect of the imgCIF proposal (now rechristened imageNCIF), I asked
Andy Hammersley to define what he saw as its goals. He replied:

AH> The aim of "imageNCIF" is to "standardize" the passing of image (and other)
AH> crystallographic experimental data from: one institute to another;
AH> one make of computer system to another; and from one computer program
AH> (acquisition or analysis) to another. If a sufficiently large number
AH> of institutes/ programmers/ and producers of detector equipment can
AH> agree on a common format then the present task of having to support
AH> numerous new and different image (data) formats will be at least
AH> lessened.

This of course reads like a manifesto of CIF, and so why not have a common
format for including image data as well as all the other types of
crystallographic data that we can represent in CIF? Well, one sticking point
has been the requirement for image data to be transported in binary format,
for compactness and efficiency. This is in absolute breach of the
requirement that STAR files be (ASCII) text based. Syd is particularly
emphatic in maintaining that it is not appropriate to breach an accepted
standard at such a fundamental level:

S> ImageNCIF: why have a syntax that is ALMOST star compliant? Why not, if
S> only for the sake of star compliant browsers, editors, etc. put delimiters
S> for the "_image_data" value. I'm not concerned about the portability of
S> binary data (presumably it's encoded externally) but it seems stupid to not
S> encapsulate such data so that it can be handled without reference to a
S> field-length value. I think we must say so...and perhaps more!
S> Here is what I would suggest.
S>
S> Since star does not impose line length constraints, why not simply
S> encapsulate the binary string in semicolons. Ie.
S>
S> _image_data
S> ;
S> w+fh;auweh8y2! dchtgKth xcdluyqgldjy ZGXnJZHGClUAHs;diuydq ipwye..
S> ;
S>
S> The string is not ascii text but this is a smaller crime than
S> [requiring a field-length value]

However, there are technical problems with this - a pure binary stream can
contain successive bytes that might be interpreted as <newline><semicolon>
(and in different ways by different operating systems!), and so generate
spurious delimiters.

S> The IUCr has control over the STAR format and on more than one occasion
S> this has been invoked to prevent violations of the syntax. We should
S> obviously welcome other disciplines using our interchange methods but it
S> will certainly do long term damage if there are mutant star and
S> cif formats in common use.

I have suggested to the imgcif-l discussion list the following three ways of
handling image data files without resorting to STAR modifications:

(1) ASCII encode the data within the CIF, and accept the performance penalty
    in extracting and decoding it.

(2) Maintain the image data as external files, and have the header
    information in separate CIF files. Then you need to have well defined
    protocols for referencing the data files from within the CIFs; and you
    incur the penalty of maintaining two files for every image.

(3) Have a completely separate standard for image data. Use an existing
    implementation (HDF, FITS, etc). Does this really matter?

Discussions on the relative merits of these approaches are continuing (there
seems some consensual drift towards (2)), but Andy Hammersley has also
proposed a mechanism of embedding STAR fields in a larger file of some more
general format:

AH> 4: Break with existing CIF to at least some extent, probably because a
AH> file contains binary data. This could be "CIF" sections within a binary
AH> file, or could be any other non-CIF format.
AH>
AH> The difference between proposal 2 and 4.
AH>
AH> Proposal 2 involves a separate "header" file, and a separate "data"
AH> (binary) file.
AH> Proposal 4 (in at least one scenario) would store both
AH> "header" information and "data" in the same file.
AH>
AH> An advantage of proposal 2 (compared to 4) would be that existing CIF
AH> tools would be able to work directly on the "header" file. For 4, if the
AH> "header" section is very similar to CIF, then an extraction tool could be
AH> used to extract the header section and create an ASCII file which could
AH> then be used with standard CIF tools. If the format used is well
AH> defined and simply related to CIF (whilst being "binary"), this
AH> extraction tool could be very simple to write, and be very portable.
AH>
AH> A disadvantage of proposal 2 is that a single "image" would be stored in
AH> two separate files. This provides the opportunity for the "header"
AH> information to be separated from the "data" (remember Murphy's law).
AH> Depending on how a sequence of n "images" was stored, you might have
AH> 2*n files or n+1 files.
AH>
AH> Image formats of type 2 do exist (e.g. Hamburg OTOKO / BSL format), but
AH> the vast majority of existing image formats fall into the category
AH> defined by proposal 4.

Is this approach something that COMCIFS could sanction (i.e. the embedding
of CIF "data streams" within a different file format)? The extracted CIF
data would, of course, need to be fully compliant with a standalone CIF
file.

##############################################################################

A couple of new threads connected with the Core revision.

D39.1 Introductory data blocks
------------------------------
We have considered various ways of organising the preliminary data in a CIF
dictionary (i.e. those that refer to the dictionary itself). See, for
example, D25.7. As the proposals for STAR primitives and their use in
dictionaries have evolved, the time seems right to have another look at
this. John Westbrook and I agreed the following protocol in Bangkok for DDL1
compliant dictionaries.
It broadly follows the structure in D25.7, but with a few differences which
I shall indicate.

We shall use "standard" introductory datablock names:

data_on_this_dictionary  # will contain the dictionary identifier and history
    _dictionary_name            cif_core.dic
    _dictionary_version         2.0
    _dictionary_history         '1999-99-99 blah-blah-blah etc'

data_include_dependent_dictionaries  # contains references to other dicts
#\include http://www.iucr.ac.uk/cif/ddl_core.dic

Note the #\include preprocessor directive in this example. But note also the
constraints imposed by this construction - ddl_core.dic will be
automatically included in any dictionary that includes the Core. I have an
open mind whether to put the data_include_dependent_dictionaries block into
the current Core revision at all; but by default I *shall*, unless I hear
objections.

The global_ declarations will be OMITTED from the CIF Core dictionary. John
is very worried about the implementation of global_'s and unspecified
#\include's together in the same file, and it seems safest to leave out at
least the global_ block, since its function is accomplished by recognition
and implementation of the _enumeration_default values in the DDL dictionary.
In case I've lost you (again), the earlier proposal was to have

global_
    _list                       no
    _list_mandatory             no
    _list_level                 1
    _enumeration                ?
    _enumeration_default        ?
    _type_conditions            none

However, all of these (with the exception of _type_conditions) have
_enumeration_default values in the DDL1.4 dictionary that match the values
given ('no', 'no', '1' etc). There seems nothing to be gained by duplicating
this information. Because _type_conditions has no _enumeration_default, I
need to add an explicit _type_conditions field to each definition block in
the CIF dictionary.

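The fallback described above can be sketched as a lookup that consults the
definition block first and the DDL's _enumeration_default second. This is a
minimal sketch under stated assumptions, not part of any agreed tool: the
function name resolve_attribute and the sample definition block are
hypothetical, and the only defaults entered are the three quoted above
('no', 'no', '1').

```python
# Defaults as quoted in the text for the DDL1.4 dictionary (illustrative).
# _type_conditions deliberately has no entry, mirroring the DDL.
DDL_ENUMERATION_DEFAULTS = {
    "_list": "no",
    "_list_mandatory": "no",
    "_list_level": "1",
}


def resolve_attribute(definition, attribute):
    """Return the value of a DDL attribute for one definition block.

    An explicit value in the definition wins; otherwise the DDL's
    _enumeration_default is used; otherwise the value is unknown ('?').
    """
    if attribute in definition:
        return definition[attribute]
    return DDL_ENUMERATION_DEFAULTS.get(attribute, "?")


# A made-up definition block that states only what the DDL cannot supply.
example_definition = {
    "_name": "_example_item",
    "_type": "numb",
    "_type_conditions": "none",
}

for attribute in ("_list", "_list_mandatory", "_list_level", "_type_conditions"):
    print(attribute, resolve_attribute(example_definition, attribute))
```

With this arrangement a global_ block adds nothing for the first three
attributes, while a definition that omitted _type_conditions would resolve to
'?', which is why that field has to be written out in every definition block.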
D39.2 Additional items for data collection strategies
-----------------------------------------------------
I've received the following message from Keith Watenpaugh:

KW> I have been making a valiant effort to convince our small molecule
KW> crystallographers to archive their structural data in a CIF format and
KW> have that format carry over to entry into our internal databases and
KW> graphics software. Since they want to store some of the details about
KW> the data collection in a machine-parseable form (something Helen was
KW> strongly wanting, as well) I started looking for the keywords for items
KW> such as scan-speed and background time. I was surprised that they are
KW> missing, while items such as: '_diffrn_refln_counts_bg_1'
KW>                               '_diffrn_refln_counts_bg_2'
KW>                               '_diffrn_refln_counts_net'
KW>                               '_diffrn_refln_counts_peak'
KW>                               '_diffrn_refln_counts_total'
KW> are there, but would not be usable if different scan rates are used. In
KW> fact, this is usually the case with the Siemens diffractometer data
KW> collection. Also, unless one knows the time for which a background was
KW> collected, you can't correct the "...peak" by the "...bg_1/2" items.
KW> Even if using constant scan rates and background counting times, putting
KW> these numbers away in '_diffrn_measurement_details' as character strings
KW> is probably not the way to go. Have I missed something or are items such
KW> as '_diffrn_measurement_scan_rate' and '_diffrn_refln_bg_time', etc?

This seems entirely reasonable. Does anyone wish to proffer _definition's
for the *_scan_rate and *_bg_time proposals; or, better, is anyone able to
supply a more complete set of definitions that will cover all the
requirements for a variable-rate scan? Should I encourage Keith to supply an
extended set?

########################################################################

With that, I am now going to run for cover until January 8.

Best wishes for a merry Christmastide

Brian