Re: [Imgcif-l] proposed change in first line of imgcif files
- To: The Crystallographic Binary File and its imgCIF application to image data <imgcif-l@iucr.org>
- Subject: Re: [Imgcif-l] proposed change in first line of imgcif files
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 18 Sep 2008 09:52:50 -0400 (EDT)
- In-Reply-To: <84F0D152-F08A-485B-B9FD-AA2011B1836E@mrc-lmb.cam.ac.uk>
- References: <20080826195337.H76753@epsilon.pair.com><279aad2a0809172141u3034905bq6ba660c89703b4bb@mail.gmail.com><84F0D152-F08A-485B-B9FD-AA2011B1836E@mrc-lmb.cam.ac.uk>
Dear Colleagues,

There is an important issue on the table -- how to handle critical
"extra-CIF" data, such as magic numbers, whitespace and comments. For
people generating the data, they are just magic numbers, whitespace and
comments. For some people, they are just noise to be discarded, but for
some uses, there is essential information (such as how to parse a
particular imgCIF file) buried in there, and it must not be lost.

Many of us have had our own private schemes for managing such
"extra-CIF" data within CIF-related APIs. Now I wish to propose
formalizing one such scheme, subject to the approval of those affected.

The net result should be as follows. If you have a CIF with magic
numbers, whitespace and comments, including one with the newly proposed
magic numbers for imgCIF, you would do just what you are currently
doing. If you have your own way to formally preserve whatever subset of
magic numbers, whitespace and comments you need, you could keep doing
whatever you are currently doing. But if you want a way to capture such
information in a CIF context, with appropriate changes to your parsers,
this would allow you to do it in a way that you could be fairly sure
would not get tripped up by new datasets.

First, with this message, I am asking Brian to reserve a prefix for some
new tags. I suggest the prefix "ws", for "whitespace", but if that is
taken, then whatever prefix Brian provides should be used in place of
"ws" for the rest of this discussion.

Now to the specific issue of magic numbers: I propose the new tag

  _ws.prologue

to carry as text whatever magic numbers, whitespace and comments occur
before a given data block. Clearly, only the value of _ws.prologue for
the first datablock of a file is meaningful to convey the magic number
of that file.

Please understand that I am not proposing that anyone actually put
_ws.prologue into a CIF that they distribute. If they are just putting
out a CIF, it would make much more sense to write out the internal value
of _ws.prologue as the actual magic number, whitespace and comments
before the data block. But if you are running a database or creating a
CIF copy application, this would give you a neat internal place-holder
for that important information.

As soon as Brian approves a prefix, I will create a draft version of
CBFlib that uses this tag, and a few more, to allow all comments and
whitespace in an imgCIF or CIF to be accessible to an application via
the CBFlib API. I will also adapt the existing comments handling logic
in CIFtbx to handle the same tags.

Here is what I think is a full list for CIFs. (There are a few more for
dictionary save frames, but that can be in another discussion.) Please
provide comments and suggestions.

  _ws.prologue
     -- the whitespace and comments from before the given datablock.

  _ws.epilogue
     -- the whitespace and comments from the tail end of a given
        datablock. Note that the _ws.prologue of the next datablock,
        if any, would have precedence in "eating" what comes before
        it, if given.

  <category_name>.ws_prologue
     -- the whitespace and comments from before the given category.
        If the category is looped, all provided values are
        concatenated on output. On input, the entire prologue is the
        value of this tag for the first row.

  <category_name>.ws_row_prologue
     -- the whitespace and comments from before the given category
        row. If the category is not looped, the tag returns null.

  <category_name>.ws_prologue_<column_name>
     -- the whitespace and comments to appear before the column name.
        If the category is looped, the values should be associated
        with the first row. On output, all the values for all the
        rows of a given category are concatenated.

  <category_name>.ws_epilogue_<column_name>
     -- the whitespace and comments to appear after the column name.
        Only needed for looped categories, and then the values are
        associated with the first row.

  <category_name>.ws_value_epilogue_<column_name>
     -- the whitespace and comments to appear after the value of the
        given column name.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Thu, 18 Sep 2008, Harry Powell wrote:

> Hi
>
> this is right, in essence. But there's more...
>
> On 18 Sep 2008, at 05:41, James Hester wrote:
>>
>> Correct me if I'm wrong, but I
>> believe it is because imgCIF files can be enormous and the overhead of
>> reading through the entire file to determine missing tags is
>> prohibitive.
>
> What I would like to be able to do is determine a lot about the image
> from the first line of the imgCIF/CBF; in this case, "a lot" means
> things like the type of detector for several reasons (I don't believe
> these will ever change, even if imgCIF becomes a universal standard
> that everyone adopts and uses), including:
>
> (1) I have to rely on our local analysis of what default values of
> things to use, rather than items in the imgCIF header - getting a
> detector/image type from the first line means not having to parse the
> entire header in at least some cases.
>
> (2) In the particular case of the Pilatus images, I'm perfectly happy
> to read in the original Pilatus "cbf", which essentially had the first
> line (saying it was a cbf, and is particularly suited to those
> programs which don't use the header information to any useful extent)
> and the binary section, the miniCBF, which has what I describe in
> shorthand as "the useful information", or the full CBF, which is close
> to what people who have worked with CIF over the last 15 - 20 years
> would recognize as a fully-formed CIF. The difference comes in how
> quickly I can read these images in, what parsing routines to use (e.g.
> home written routines, ones someone else may have donated or cbflib,
> or a combination of the three), and how much work I need to do to
> interpret the header stuff.
>
>>
>> Would it be possible to fix this with a DDL attribute that dictionary
>> writers could use to indicate that a data item should appear at the
>> 'beginning' of a data block? This attribute would work as follows: it
>> could take values 'beginning', 'middle' and 'end', with 'middle' being
>> the default. An imgCIF dictionary would specify certain data values
>> as occurring at the 'beginning' (ie in the header) and input programs
>> such as MOSFLM would then need only to read until they found a data
>> value that was not specified as belonging to the 'beginning' (or
>> alternatively, until they found a data value belonging to the 'end'.)
>> This attribute would actually be used quite rarely in the CIF world as
>> I don't think there is a general need for this sort of control of
>> order within a datablock. Note that using multiple dictionaries (eg
>> mmCIF and imgCIF) would not introduce ambiguity about positioning, as
>> the order within the beginning/middle/end zones would still be
>> arbitrary.
>>
>> I'd be interested to hear whether or not such a scheme would remove
>> the need for a special header comment (I'm looking particularly at
>> Harry and Herbert here), and if the response is positive, I will take
>> it to COMCIFS for discussion.
>>
>> James.
>>
>> On Wed, Aug 27, 2008 at 10:35 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>>
>>> There was an informal meeting to discuss imgCIF at the IUCr
>>> Congress in Osaka on 26 August 2008. Details of the
>>> discussion will follow in future messages. This message
>>> will summarize a proposal for a change in the first line
>>> of all CBF/imgCIF files that are not fully populated
>>> with all the imgCIF tags needed for processing by mosflm
>>> and adxv.
>>>
>>> 1. What problem is being solved? As the use of imgCIF
>>> has increased, two very distinct sets of files have appeared:
>>> the "miniCBFs" used for the Pilatus 6m detector and
>>> more fully populated imgCIF files, such as the ones
>>> produced for ADSC detectors. While the information
>>> necessary for processing can be discovered from context
>>> in handling a miniCBF, it may be necessary to read fairly
>>> far into the file to discover that the file is indeed a
>>> miniCBF, complicating the design of reading software.
>>>
>>> 2. The proposed solution. Currently CBF files begin
>>> with a magic number comment line
>>>
>>>          1         2         3         4         5
>>> 12345678901234567890123456789012345678901234567890
>>> ###CBF: VERSION n.m
>>>
>>> We propose to extend the magic number comment line with
>>> two optional fields to read
>>>
>>>          1         2         3         4         5
>>> 12345678901234567890123456789012345678901234567890
>>> ###CBF: VERSION n.m     style     style_version
>>>
>>> where "style" is a unique CBF style identifier left
>>> justified as a single word in columns 25-34 and
>>> "style_version" is a left justified integer in
>>> columns 35-44.
>>>
>>> Each style will be registered in a central repository
>>> along with information on the tags that will be
>>> carried for that style and a template of the tags
>>> that would be needed to fully populate the file.
>>>
>>> More details will follow on this list and on the
>>> CBFlib wiki after the Osaka meeting is over.
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>> _______________________________________________
>>> imgcif-l mailing list
>>> imgcif-l@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/imgcif-l
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>
> Harry
> --
> Dr Harry Powell, MRC Laboratory of Molecular Biology, MRC Centre,
> Hills Road, Cambridge, CB2 0QH

_______________________________________________
imgcif-l mailing list
imgcif-l@iucr.org
http://scripts.iucr.org/mailman/listinfo/imgcif-l
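
As an illustration of the two ideas discussed in this message, here is a minimal Python sketch, not taken from CBFlib or CIFtbx. It assumes the fixed-column layout quoted above (style in columns 25-34, style_version in columns 35-44) and treats everything before the first data_ block header as the text that _ws.prologue would carry. The names CbfMagic, parse_magic_line and split_prologue are hypothetical, as is the style name "minicbf" used in the usage example.

```python
from typing import NamedTuple, Optional, Tuple


class CbfMagic(NamedTuple):
    """Fields of a '###CBF: VERSION n.m [style [style_version]]' line."""
    version: str
    style: Optional[str]
    style_version: Optional[int]


MAGIC_PREFIX = "###CBF: VERSION "


def parse_magic_line(line: str) -> Optional[CbfMagic]:
    """Parse the magic number comment line using the proposed fixed columns.

    Returns None if the line is not a CBF magic number line.
    """
    line = line.rstrip("\r\n")
    if not line.startswith(MAGIC_PREFIX):
        return None
    # The proposal counts columns from 1; Python slices count from 0.
    version = line[len(MAGIC_PREFIX):24].strip()   # up to column 24
    style = line[24:34].strip() or None            # columns 25-34
    raw_style_version = line[34:44].strip()        # columns 35-44
    if not version:
        return None
    style_version = int(raw_style_version) if raw_style_version.isdigit() else None
    return CbfMagic(version, style, style_version)


def split_prologue(text: str) -> Tuple[str, str]:
    """Split CIF text into (prologue, rest).

    The prologue is everything before the first line that begins a data
    block (data_...), i.e. the magic number, whitespace and comments that
    the proposed _ws.prologue tag would hold for the first data block.
    """
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.lstrip().lower().startswith("data_"):
            return text[:offset], text[offset:]
        offset += len(line)
    return text, ""


if __name__ == "__main__":
    # "minicbf" is a made-up style name used only for this example.
    first = "###CBF: VERSION 1.5" + " " * 5 + "minicbf".ljust(10) + "2"
    print(parse_magic_line(first))
    prologue, rest = split_prologue(first + "\n# a comment\n\ndata_image_1\n")
    print(repr(prologue))
```

A reader such as the one Harry describes could call parse_magic_line on the first line alone and choose a parsing route before touching the rest of the header, while a copy application could write the prologue back out verbatim ahead of the first data block.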
- Follow-Ups:
- Re: [Imgcif-l] proposed change in first line of imgcif files (James Hester)
- References:
- [Imgcif-l] proposed change in first line of imgcif files (Herbert J. Bernstein)
- Re: [Imgcif-l] proposed change in first line of imgcif files (James Hester)
- Re: [Imgcif-l] proposed change in first line of imgcif files (Harry Powell)