Re: [Imgcif-l] proposed change in first line of imgcif files

Dear Colleagues,

   There is an important issue on the table -- how to handle critical
"extra-CIF" data, such as magic numbers and whitespace and comments.
For people generating the data, they are just magic numbers,
whitespace and comments.  For some people, they are just noise
to be discarded, but for some uses, there is essential information
(such as how to parse a particular imgCIF file) buried in there, and
it must not be lost.

   Many of us have had our own private schemes for managing such
"extra-CIF" data within CIF-related APIs.  Now I wish to propose to
formalize one such scheme, subject to the approval of those affected.
The net result should be as follows.  If you have a CIF with magic
numbers, whitespace and comments, including one with the newly
proposed magic numbers for imgCIF, you would do just what you are
currently doing.  If you have your own way to formally preserve the
subset of magic numbers, whitespace and comments that you need, you
could keep doing that too.  But if you want a way to capture such
information in a CIF context, with appropriate changes to your
parsers, this scheme would let you do so in a way that you could be
fairly sure would not get tripped up by new datasets.

   First, with this message, I am asking Brian to reserve a prefix
to some new tags.  I suggest the prefix "ws", for "whitespace",
but if that is taken, then whatever prefix Brian provides should
be used in place of "ws" for the rest of this discussion.

   Now to the specific issue of magic numbers:

   I propose the new tag _ws.prologue to carry as text whatever
magic numbers, whitespace and comments occur before a given
data block.  Clearly, only the value of _ws.prologue for the
first datablock of a file is meaningful to convey the magic
number of that file.

   Please understand that I am not proposing that anyone actually
put _ws.prologue into a CIF that they distribute.  If they are
just putting out a CIF, it would make much more sense to just
put out the internal value of _ws.prologue as the actual magic
number, whitespace and comments before the data block.  But, if
you are running a database or creating a CIF copy application,
this would give you a neat internal place-holder for that important
information.
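
   As a rough illustration of the idea only (not CBFlib or CIFtbx
code), the sketch below separates the text before the first data
block from the rest of a file, so that a reader could keep it as an
internal _ws.prologue value.  The helper name split_prologue and the
regular expression are assumptions made for this example; it relies
on the fact that only whitespace and comments may precede the first
data_ header.

```python
import re

# Hypothetical helper for illustration; not part of CBFlib or CIFtbx.
# A data block header is "data_" plus a name, preceded on its line only by
# whitespace (a "#" comment earlier on the line would hide it).
_DATA_HEADER = re.compile(r"^[ \t]*(data_\S*)", re.IGNORECASE | re.MULTILINE)


def split_prologue(cif_text):
    """Return (prologue, rest) for the first data block of cif_text.

    prologue is everything before the first data_ header (magic number,
    whitespace and comments); rest starts at the header itself.
    """
    match = _DATA_HEADER.search(cif_text)
    if match is None:
        # No data block at all: the whole text is prologue.
        return cif_text, ""
    start = match.start(1)
    return cif_text[:start], cif_text[start:]


sample = "###CBF: VERSION 1.5\n# comment\n\ndata_image_1\n_array.id 1\n"
prologue, rest = split_prologue(sample)
```

   Writing prologue followed by rest reproduces the original text
unchanged, which is the round trip a copy application would need.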

   As soon as Brian approves a prefix, I will create a draft version
of CBFlib that uses this tag, and a few more to allow all comments
and whitespace in an imgCIF or CIF to be accessible to an application
via the CBFlib API.  I will also adapt the existing comments handling
logic in CIFtbx to handle the same tags.  Here is what I think
is a full list for CIFs.  (There are a few more for dictionary
save frames, but that can be in another discussion.)  Please
provide comments and suggestions.

   _ws.prologue   -- the whitespace and comments from before the
given datablock
   _ws.epilogue   -- the whitespace and comments from the tail end
of a given datablock.  Note that the _ws.prologue of the next
datablock, if any, would take precedence in "eating" what comes
before it, if given.

   <category_name>.ws_prologue -- the whitespace and comments from
before the given category.  If the category is looped, all provided
values are concatenated on output.  On input the entire prologue
is the value of this tag for the first row.

   <category_name>.ws_row_prologue -- the whitespace and comments from
before the given category row.  If the category is not looped,
the tag returns null.

   <category_name>.ws_prologue_<column_name> -- the whitespace and
comments to appear before the column name.  If the category is looped,
the values should be associated with the first row.  On output, all
the values for all the rows of a given category are concatenated.

   <category_name>.ws_epilogue_<column_name> -- the whitespace and
comments to appear after the column name.  Only needed for looped
categories, and then the values are associated with the first row.

   <category_name>.ws_value_epilogue_<column_name> -- the whitespace
and comments to appear after the value of the given column name.
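
   To make the output side of the tags above concrete, here is a
minimal sketch of how a writer might use them.  It is only one
reading of the list: the function names, the dictionary layout for
rows, and the treatment of null values as empty strings are all
assumptions for this example, and nothing here is existing CBFlib or
CIFtbx API.

```python
# Hypothetical writer-side helpers; names and data layout are assumptions.

def render_datablock(name, body, ws_prologue="", ws_epilogue=""):
    """Write one data block, restoring the saved text around it.

    ws_prologue and ws_epilogue stand for the internal values of
    _ws.prologue and _ws.epilogue; they are emitted as plain text and
    the tags themselves never appear in the output.
    """
    return f"{ws_prologue}data_{name}\n{body}{ws_epilogue}"


def looped_category_prologue(rows):
    """Concatenate the ws_prologue values of all rows of a looped category.

    Each row is assumed to be a dict; a missing or null value adds nothing.
    """
    return "".join(row.get("ws_prologue") or "" for row in rows)


text = render_datablock(
    "image_1",
    "_array.id 1\n",
    ws_prologue="###CBF: VERSION 1.5\n",
    ws_epilogue="# end of block\n",
)
rows = [{"ws_prologue": "# axes\n"}, {"ws_prologue": None}, {"ws_prologue": "# more\n"}]
joined = looped_category_prologue(rows)
```
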
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 18 Sep 2008, Harry Powell wrote:

> Hi
>
> this is right, in essence. But there's more...
>
> On 18 Sep 2008, at 05:41, James Hester wrote:
>>
>
>> Correct me if I'm wrong, but I
>> believe it is because imgCIF files can be enormous and the overhead of
>> reading through the entire file to determine missing tags is
>> prohibitive.
>
> What I would like to be able to do is determine a lot about the image
> from the first line of the imgCIF/CBF; in this case, "a lot" means
> things like the type of detector for several reasons (I don't believe
> these will ever change, even if imgCIF becomes a universal standard
> that everyone adopts and uses), including:
>
> (1) I have to rely on our local analysis of what default values of
> things to use, rather than items in the imgCIF header - getting a
> detector/image type from the first line means not having to parse the
> entire header in at least some cases.
>
> (2) In the particular case of the Pilatus images, I'm perfectly happy
> to read in the original Pilatus "cbf", which essentially had the first
> line (saying it was a cbf, and is particularly suited to those
> programs which don't use the header information to any useful extent)
> and the binary section, the miniCBF, which has what I describe in
> shorthand as "the useful information", or the full CBF, which is close
> to what people who have worked with CIF over the last 15 - 20 years
> would recognize as a fully-formed CIF. The difference comes in how
> quickly I can read these images in, what parsing routines to use (e.g.
> home written routines, ones someone else may have donated or cbflib,
> or a combination of the three), and how much work I need to do to
> interpret the header stuff.
>
>
>>
>>
>> Would it be possible to fix this with a DDL attribute that dictionary
>> writers could use to indicate that a data item should appear at the
>> 'beginning' of a data block?  This attribute would work as follows: it
>> could take values 'beginning', 'middle' and 'end', with 'middle' being
>> the default.  An imgCIF dictionary would specify certain data values
>> as occurring at the 'beginning' (ie in the header) and input programs
>> such as MOSFLM would then need only to read until they found a data
>> value that was not specified as belonging to the 'beginning' (or
>> alternatively, until they found a data value belonging to the 'end'.)
>> This attribute would actually be used quite rarely in the CIF world as
>> I don't think there is a general need for this sort of control of
>> order within a datablock.  Note that using multiple dictionaries (eg
>> mmCIF and imgCIF) would not introduce ambiguity about positioning, as
>> the order within the beginning/middle/end zones would still be
>> arbitrary.
>>
>> I'd be interested to hear whether or not such a scheme would remove
>> the need for a special header comment (I'm looking particularly at
>> Harry and Herbert here), and if the response is positive, I will take
>> it to COMCIFS for discussion.
>>
>> James.
>>
>> On Wed, Aug 27, 2008 at 10:35 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>>
>>> There was an informal meeting to discuss imgCIF at the IUCr
>>> Congress in Osaka on 26 August 2008.  Details of the
>>> discussion will follow in future messages.  This message
>>> will summarize a proposal for a change in the first line
>>> of all CBF/imgCIF files that are not fully populated
>>> with all the imgCIF tags needed for processing by mosflm
>>> and adxv.
>>>
>>> 1.  What problem is being solved?  As the use of imgCIF
>>> has increased, two very distinct sets of files have appeared:
>>> the "miniCBFs" used for the Pilatus 6m detector and
>>> more fully populated imgCIF files, such as the ones
>>> produced for ADSC detectors.  While the information
>>> necessary for processing can be discovered from context
>>> in handling a miniCBF, it may be necessary to read fairly
>>> far into the file to discover that the file is indeed a
>>> miniCBF, complicating the design of reading software.
>>>
>>> 2.  The proposed solution.  Currently CBF files begin
>>> with a magic number comment line
>>>           1         2         3         4         5
>>>  12345678901234567890123456789012345678901234567890
>>>  ###CBF: VERSION n.m
>>>
>>> We propose to extend the magic number comment line with
>>> two optional fields to read
>>>
>>>           1         2         3         4         5
>>>  12345678901234567890123456789012345678901234567890
>>>  ###CBF: VERSION n.m     style     style_version
>>>
>>> where "style" is a unique CBF style identifier left
>>> justified as a single word in columns 25-34 and
>>> "style_version" is a left justified integer in
>>> columns 35-44.
>>>
>>> Each style will be registered in a central repository
>>> along with information on the tags that will be
>>> carried for that style and a template of the tags
>>> that would be needed to fully populate the file.
>>>
>>> More details will follow on this list and on the
>>> CBFlib wiki after the Osaka meeting is over.
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>>   Dowling College, Kramer Science Center, KSC 121
>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                 +1-631-244-3035
>>>                 yaya@dowling.edu
>>> =====================================================
>>> _______________________________________________
>>> imgcif-l mailing list
>>> imgcif-l@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/imgcif-l
>>
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>
> Harry
> -- 
> Dr Harry Powell, MRC Laboratory of Molecular Biology, MRC Centre,
> Hills Road, Cambridge, CB2 0QH
>
>
>
>
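
   For reference, the extended magic-number line described in the
quoted Osaka summary (style in columns 25-34, style_version in
columns 35-44, counting from 1) could be read along the following
lines.  This is a sketch under the assumption that the line follows
the proposed fixed-column layout exactly; the function name and the
style name "minicbf" are invented for the example and are not
registered identifiers.

```python
# Hypothetical reader for the proposed first line; not CBFlib code.
PREFIX = "###CBF: VERSION "  # occupies columns 1-16


def parse_magic_line(line):
    """Return (version, style, style_version) from a CBF magic-number line.

    style and style_version are None when the optional fields are absent.
    """
    line = line.rstrip("\r\n")
    if not line.startswith(PREFIX):
        raise ValueError("not a CBF magic-number line")
    version = line[len(PREFIX):24].strip()   # columns 17-24
    style = line[24:34].strip() or None      # columns 25-34
    version_field = line[34:44].strip()      # columns 35-44
    style_version = int(version_field) if version_field else None
    return version, style, style_version


# Build an example line with the fields padded to their columns.
example = "###CBF: VERSION 1.5".ljust(24) + "minicbf".ljust(10) + "1"
parsed = parse_magic_line(example)
```

   A reader could branch on the returned style before deciding
whether the rest of the header needs to be parsed at all.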
