Discussion List Archives

Re: [Imgcif-l] Adding references to external files to imgCIF

Dear All,

Over a year later I have now written up definitions in DDL2 for inclusion
in imgCIF. The full definitions are at the Github issue (
https://github.com/COMCIFS/imgCIF/issues/7). Please have a look and provide
feedback here or there. Note that I have added datanames for specifying
that the images are contained within compressed archives. I've checked a
few known sources of images (proteindiffraction.org, zenodo, a uni
repository) and this scheme seems to cover those bases. If you have time,
please have a look at your favourite open archive of raw data to see if
this scheme is sufficient for you to specify a particular image in that
archive.  I've reproduced the examples from the definitions below.

Of course, in a perfect world we would just give a DOI, but those days are
not yet upon us because DOIs generally resolve to landing pages rather than
to the data files themselves. Happy to be corrected on that.

best wishes,
James.

Examples
========
#  The frames are contained in a single HDF5-format file accessible
#   at https://zenodo.org/record/12345/files/tartaric.h5. An array of 2D
#   images is found at HDF5 location /entry1/detector1/data

    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.location_uri
    _array_data.external_path
    _array_data.external_frame
    1 1 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 1
    1 2 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 2
    ...
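
As an illustration of how a reader might consume such a loop, here is a minimal Python sketch. It is not a CIF parser: it assumes a single loop_ with whitespace-separated or quoted values, and the name parse_loop is hypothetical. CIF treats any whitespace, including line breaks, as a separator, so a row split across lines parses the same way as a row on one line.

```python
import shlex


def parse_loop(text):
    """Parse one simple CIF loop_ into a list of {dataname: value} dicts.

    Sketch only: comment lines are skipped, but text fields (;...;),
    multiple loops and data block headers are not handled.
    """
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue
        if line.startswith("_") and not values:
            names.append(line)                 # still in the loop header
        else:
            values.extend(shlex.split(line))   # values, possibly quoted
    if not names or len(values) % len(names):
        raise ValueError("value count is not a multiple of the dataname count")
    width = len(names)
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


EXAMPLE = """
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
_array_data.external_path
_array_data.external_frame
1 1 HDF5 https://zenodo.org/record/12345/files/tartaric.h5
/entry1/detector1/data 1
1 2 HDF5 https://zenodo.org/record/12345/files/tartaric.h5
/entry1/detector1/data 2
"""

rows = parse_loop(EXAMPLE)
for row in rows:
    print(row["_array_data.binary_id"], row["_array_data.location_uri"],
          row["_array_data.external_path"], row["_array_data.external_frame"])
```

Each resulting dict carries everything needed to locate one frame: the file URI, the path inside the file, and the frame number.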

#  Frames are contained in individual Smart6000 Bruker-format files
#   accessible using https://uni_repo.edu/5341 in subdirectory run1.

    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.external_version
    _array_data.location_uri
    1 1 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.001
    1 2 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.002
    ...
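
For a long series of single-frame files, rows like these would presumably be generated by a script rather than typed. A minimal sketch follows, assuming three-digit zero-padded suffixes as in tartaric.001; the function frame_rows and its parameters are hypothetical names chosen for illustration.

```python
def frame_rows(base_uri, stem, n_frames, fmt="Bruker", version="Smart6000",
               array_id=1):
    """Return loop_ text with one row per single-frame file.

    Assumes files are named <stem>.001, <stem>.002, ... under base_uri.
    """
    lines = [
        "loop_",
        "_array_data.array_id",
        "_array_data.binary_id",
        "_array_data.external_format",
        "_array_data.external_version",
        "_array_data.location_uri",
    ]
    for i in range(1, n_frames + 1):
        lines.append(f"{array_id} {i} {fmt} {version} {base_uri}/{stem}.{i:03d}")
    return "\n".join(lines)


print(frame_rows("https://uni_repo.edu/5341/run1", "tartaric", 2))
```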

#  Frames with SMV format are contained at data.proteindiffraction.org in a
#   tarred archive compressed with bzip2.

    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.location_uri
    _array_data.external_archive_format
    _array_data.external_archive_path
    1 1 SMV
        https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
        TBZ
        MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0001.img
    1 2 SMV
        https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
        TBZ
        MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0002.img
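
To show how the archive datanames might be used once the archive has been downloaded, here is a hedged Python sketch using the standard tarfile module. The mapping from "TBZ" to a tarfile mode is an assumption of this sketch, not part of the definitions, and read_archived_frame and make_demo_archive are hypothetical helper names. A small archive is built in memory so that the sketch runs without network access.

```python
import io
import tarfile

# Only "TBZ" appears in the examples; mapping it to tarfile's bzip2 mode is
# an assumption of this sketch. Other values would come from the dictionary.
TAR_MODES = {"TBZ": "r:bz2"}


def read_archived_frame(archive_bytes, archive_format, archive_path):
    """Return the raw bytes of one member of an archive held in memory."""
    mode = TAR_MODES[archive_format]
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode=mode) as tar:
        member = tar.extractfile(archive_path)  # KeyError if path is absent
        if member is None:
            raise ValueError(f"{archive_path} is not a regular file")
        return member.read()


def make_demo_archive(members):
    """Build a small bzip2-compressed tar archive in memory for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


demo = make_demo_archive({
    "demo/data/frame_0001.img": b"frame one",
    "demo/data/frame_0002.img": b"frame two",
})
frame = read_archived_frame(demo, "TBZ", "demo/data/frame_0002.img")
print(len(frame), "bytes read")
```

In real use the bytes would come from _array_data.location_uri, and the member name from _array_data.external_archive_path.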


On Tue, 5 Mar 2019 at 16:37, James Hester <jamesrhester@gmail.com> wrote:

> OK, I've drafted up some definitions (just the human-readable part for
> now) for you all to peruse.  Please look at
> https://github.com/COMCIFS/imgCIF/issues/7 and provide feedback here or
> there.
>
> all the best,
> James.
>
> On Thu, 14 Feb 2019 at 14:39, James Hester <jamesrhester@gmail.com> wrote:
>
>> Thanks for the support Herbert. Does anybody have any concerns or
>> improvements to the data names that I sent originally? If not, I guess I
>> will write up some formal dictionary definitions for your consideration.
>>
>> James.
>>
>> On Wed, 13 Feb 2019 at 21:39, Herbert J. Bernstein <yayahjb@gmail.com>
>> wrote:
>>
>>> Dear Colleagues,
>>>
>>>   Since 2012 NIAC and COMCIFS have worked cooperatively to make
>>> imgCIF/CBF and NeXus/HDF5 fully interoperable.  This is very
>>> far along, e.g. with NXtransformations having been added to
>>> NeXus/HDF5 to carry the same information as imgCIF/CBF AXIS.
>>> What James has suggested will allow imgCIF/CBF to carry the same dataset
>>> structure information as is conveyed in the external links of
>>> an Eiger dataset, which divides the collected data into a master file
>>> with the metadata and a set of data files.  This structural division
>>> may not be important for some smaller datasets with only a few hundred
>>> to a few thousand frames, but can be very important in handling the
>>> larger datasets encountered in serial crystallography.  Even for the
>>> smaller datasets this approach can help to solve a problem for archives
>>> and facilities that need to store metadata in a relational database
>>> while the data itself has been parked in raw file systems,
>>> non-relational databases, zenodo, etc.  As with almost all of CIF,
>>> imgCIF/CBF metadata maps very easily and directly into relational
>>> tables, while putting NeXus/HDF5 metadata into a relational database
>>> first requires exactly the same sort of transformations as we have
>>> already designed to map NeXus/HDF5 metadata into imgCIF/CBF.
>>> To me it seems that James' suggestion is not a reinvention
>>> of this particular wheel, but may be an important step in avoiding
>>> reinvention of the wheel.  This may avoid a lot of unnecessary
>>> transformation of huge quantities of raw data in serial crystallography
>>> while making the metadata more accessible.
>>>
>>>   I would suggest giving James' suggestion serious consideration.
>>>
>>>   Regards,
>>>     Herbert
>>>
>>> On Wed, Feb 13, 2019 at 4:02 AM James Hester <jamesrhester@gmail.com>
>>> wrote:
>>>
>>>> Dear Graeme,
>>>>
>>>> The context of this is the idea that a single imgCIF file could be
>>>> generated from a collection of raw image files (in whatever format,
>>>> whether HDF5, or ADSC, or Bruker, or Rigaku, etc.) which would contain
>>>> the metadata pertaining to that collection. In such a situation, some
>>>> way of referring to the raw frames from within the imgCIF file is
>>>> required.
>>>>
>>>> I agree that a perfectly reasonable approach is not to generate any new
>>>> file at all, and simply to access the metadata directly in whatever
>>>> format happens to be there. This was my initial impulse as well and it
>>>> took me a while to understand that the actual proposal was to create an
>>>> imgCIF file, rather than just use imgCIF datanames for specification
>>>> purposes.  From a semantic point of view both amount to the same thing
>>>> so my only real motivation here is to add an image linking facility to
>>>> imgCIF so that the "generate a summary metadata file" approach is
>>>> possible.
>>>>
>>>> Could we just copy the HDF5 way of referring to objects in other HDF5
>>>> files as a quick solution?
>>>>
>>>> all the best,
>>>> James.
>>>>
>>>> On Wed, 13 Feb 2019 at 19:03, Graeme.Winter@Diamond.ac.uk <
>>>> Graeme.Winter@diamond.ac.uk> wrote:
>>>>
>>>> > Dear James,
>>>> >
>>>> > On the face of it, this looks a lot to me like a reinvention of HDF5 -
>>>> > perhaps with specific semantics - and there is already a (complete?)
>>>> > mapping from imgCIF to HDF5 / NeXus
>>>> >
>>>> > Have I missed something? No offence meant, trying to understand the
>>>> > shape of the problem you are trying to solve
>>>> >
>>>> > Thanks & best wishes Graeme
>>>> >
>>>> > > On 13 Feb 2019, at 05:15, James Hester <jamesrhester@gmail.com> wrote:
>>>> > >
>>>> > > Dear All,
>>>> > >
>>>> > > Recent Commdat discussion revealed a desire to reference external
>>>> > > images from within an imgCIF file. This would allow the metadata for
>>>> > > a dataset to be held within a single imgCIF file, while the frames
>>>> > > themselves remain separate. This avoids the impracticality of
>>>> > > navigating through an enormous multi-frame imgCIF file in order to
>>>> > > extract a relatively compact amount of information.
>>>> > >
>>>> > > As a starting proposal, I suggest we extend the _array_data category
>>>> > > with the following three datanames:
>>>> > >
>>>> > > (1) _array_data.external_format    A value drawn from an enumerated
>>>> > > list of formats (e.g. "SMV","HDF5","Bruker"). The definition for each
>>>> > > enumerated value would explain how to interpret
>>>> > > _array_data.internal_path.
>>>> > > (2) _array_data.location_uri       A URI for the file containing the
>>>> > > image. A relative URI is relative to the location of the imgCIF file.
>>>> > > (3) _array_data.internal_path      A format-specific string describing
>>>> > > the location of the frame within the file identified by
>>>> > > _array_data.location_uri, interpreted according to the value given in
>>>> > > _array_data.external_format.
>>>> > >
>>>> > > So for a multi-frame HDF5 file buried in a subdirectory of the
>>>> > > location referenced with a DOI, with appropriate definitions of the
>>>> > > path notation:
>>>> > >
>>>> > > loop_
>>>> > > _array_data.array_id
>>>> > > _array_data.binary_id
>>>> > > _array_data.external_format
>>>> > > _array_data.location_uri
>>>> > > _array_data.internal_path
>>>> > > 1 1 NXMX doi:x.y.z directory/run/masterfilename:/entry1/detector/data[0]
>>>> > > 1 2 NXMX doi:x.y.z directory/run/masterfilename:/entry1/detector/data[1]
>>>> > > ...
>>>> > >
>>>> > > Or for a bunch of single-frame files generated by an ADSC detector in
>>>> > > the same directory as the imgCIF file:
>>>> > >
>>>> > > loop_
>>>> > > _array_data.array_id
>>>> > > _array_data.binary_id
>>>> > > _array_data.external_format
>>>> > > _array_data.location_uri
>>>> > > 1 1 ADSC ./tartaric.001
>>>> > > 1 2 ADSC ./tartaric.002
>>>> > > 1 3 ADSC ./tartaric.003
>>>> > > ...
>>>> > >
>>>> > > The imgCIF data items describing the structure of the data array
>>>> > > would refer to the data after it has been provided by the format. The
>>>> > > form in which it is provided should be specified in the definition of
>>>> > > each value of "_array_data.external_format".  So, for example, the
>>>> > > various compression methods in HDF5 would be invisible if the data as
>>>> > > returned are specified to be an array of Reals.
>>>> > >
>>>> > > From the point of view of initial data validation, it would be
>>>> > > sufficient to check that all referenced files are accessible, and
>>>> > > that the provided locations exist.
>>>> > >
>>>> > > Thoughts?
>>>> > > James.
>>>> > >
>>>> > > --
>>>> > > T +61 (02) 9717 9907
>>>> > > F +61 (02) 9717 3145
>>>> > > M +61 (04) 0249 4148
>>>> > > _______________________________________________
>>>> > > imgcif-l mailing list
>>>> > > imgcif-l@iucr.org
>>>> > > http://mailman.iucr.org/cgi-bin/mailman/listinfo/imgcif-l
>>>> >
>>>> 
>>>> 
>>>
>>
>>
>
>
>


