Discussion List Archives

Re: [Imgcif-l] Adding references to external files to imgCIF

Dear All,

Just a quick note: a further year later and the external data pointers
work has not yet been merged, and neither has a further proposed data
name [1]. On the bright side an implementation using these pointers
has been published as a test of practicality [2]. It would of course
be most welcome if imgCIF deliberative processes could get themselves
to the point that these new data names are merged into the official
version of the main dictionary, given that no issues have been
identified.

Meanwhile, in order to facilitate use of automated DDLm checking tools
on data files using imgCIF data names, I have now generated (1) a
direct translation of the current version 1.8.4 into DDLm and (2) a
direct translation into DDLm with added external data pointers, in a
separate "journals-extension" branch. Both of these currently exist as
pull requests on the https://github.com/COMCIFS/imgCIF repository,
which is intended to hold the DDLm version of the imgCIF dictionary.
Anyone is most welcome to comment on these pull requests of course,
but I emphasise that they simply use a different dictionary language
for defining the same data names, and therefore should have no
implications for current imgCIF/CBF usage.

best wishes,
James.

[1] pull request at https://github.com/yayahjb/cbflib/pull/39
[2] https://github.com/jamesrhester/ImgCIFHandler.jl

On Mon, 12 Apr 2021 at 16:38, James H <jamesrhester@gmail.com> wrote:
>
> Dear All,
>
> Over a year later I have now written up definitions in DDL2 for
> inclusion in imgCIF. The full definitions are at the Github issue
> (https://github.com/COMCIFS/imgCIF/issues/7). Please have a look and
> provide feedback here or there. Note that I have added datanames for
> specifying that the images are contained within compressed archives.
> I've checked a few known sources of images (proteindiffraction.org,
> zenodo, a uni repository) and this scheme seems to cover those bases.
> If you have time, please have a look at your favourite open archive
> of raw data to see if this scheme is sufficient for you to specify a
> particular image in that archive.  I've reproduced the examples from
> the definitions below.
>
> Of course, in a perfect world we would just give a DOI but those days
> are not yet upon us due to landing pages. Happy to be corrected on
> that.
>
> best wishes,
> James.
>
> Examples
> ========
> #  The frames are contained in a single HDF5-format file accessible
> #   at https://zenodo.org/record/12345/files/tartaric.h5. An array of 2D
> #   images is found at HDF5 location /entry1/detector1/data
>
>     loop_
>     _array_data.array_id
>     _array_data.binary_id
>     _array_data.external_format
>     _array_data.location_uri
>     _array_data.external_path
>     _array_data.external_frame
>     1 1 HDF5 https://zenodo.org/record/12345/files/tartaric.h5
>         /entry1/detector1/data 1
>     1 2 HDF5 https://zenodo.org/record/12345/files/tartaric.h5
>         /entry1/detector1/data 2
>     ...
>
> #  Frames are contained in individual Smart6000 Bruker-format files
> #   accessible using https://uni_repo.edu/5341 in subdirectory run1.
>
>     loop_
>     _array_data.array_id
>     _array_data.binary_id
>     _array_data.external_format
>     _array_data.external_version
>     _array_data.location_uri
>     1 1 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.001
>     1 2 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.002
>     ...
>
> #  Frames with SMV format are contained at data.proteindiffraction.org
> #   in a tarred archive compressed with bzip2.
>
>     loop_
>     _array_data.array_id
>     _array_data.binary_id
>     _array_data.external_format
>     _array_data.location_uri
>     _array_data.external_archive_format
>     _array_data.external_archive_path
>     1 1 SMV
>         https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
>         TBZ
>         MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0001.img
>     1 2 SMV
>         https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
>         TBZ
>         MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0002.img
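
As an illustration of the archive example above, here is a minimal Python
sketch of how a reader might pull one frame out of such an archive. It
assumes that TBZ means a bzip2-compressed tar archive, as the comment in
the example states. The function name fetch_archived_frame and the choice
to pass the archive as bytes are hypothetical, made up for this sketch;
they are not part of the proposed definitions or of any existing imgCIF
tool.

```python
import io
import tarfile


def fetch_archived_frame(archive_bytes: bytes, archive_format: str,
                         archive_path: str) -> bytes:
    """Return the raw bytes of one frame stored inside an archive.

    archive_bytes  : content retrieved from _array_data.location_uri
    archive_format : value of _array_data.external_archive_format
    archive_path   : value of _array_data.external_archive_path
    """
    # Only the TBZ case from the example is handled in this sketch.
    if archive_format != "TBZ":
        raise ValueError(f"unsupported archive format: {archive_format}")
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:bz2") as tar:
        # extractfile raises KeyError for an unknown member and returns
        # None when the member is not a regular file.
        member = tar.extractfile(archive_path)
        if member is None:
            raise FileNotFoundError(archive_path)
        return member.read()
```

In practice the archive bytes would first have to be downloaded from the
location URI, and a large archive would probably be better streamed to
disk than held in memory.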
> 
> 
> On Tue, 5 Mar 2019 at 16:37, James Hester <jamesrhester@gmail.com> wrote:
>>
>> OK, I've drafted up some definitions (just the human-readable part
>> for now) for you all to peruse.  Please look at
>> https://github.com/COMCIFS/imgCIF/issues/7 and provide feedback here
>> or there.
>>
>> all the best,
>> James.
>>
>> On Thu, 14 Feb 2019 at 14:39, James Hester <jamesrhester@gmail.com> wrote:
>>>
>>> Thanks for the support Herbert. Does anybody have any concerns or
>>> improvements to the data names that I sent originally? If not, I
>>> guess I will write up some formal dictionary definitions for your
>>> consideration.
>>>
>>> James.
>>>
>>> On Wed, 13 Feb 2019 at 21:39, Herbert J. Bernstein <yayahjb@gmail.com> wrote:
>>>>
>>>> Dear Colleagues,
>>>>
>>>>   Since 2012 NIAC and COMCIFS have worked cooperatively to make
>>>> imgCIF/CBF and NeXus/HDF5 fully interoperable.  This is very far
>>>> along, e.g. with NeXus/HDF5 NXtransformations having been added to
>>>> NeXus/HDF5 to carry the same information as imgCIF/CBF AXIS.  What
>>>> James has suggested will allow imgCIF/CBF to carry the same dataset
>>>> structure information as is conveyed in the external links of an
>>>> Eiger dataset, which divides the collected data into a master file
>>>> with the metadata and a set of data files.  This structural division
>>>> may not be important for some smaller datasets with only a few
>>>> hundred to a few thousand frames, but can be very important in
>>>> handling datasets with more frames than that, such as those
>>>> encountered in serial crystallography.  Even for the smaller
>>>> datasets this approach can help to solve a problem for archives and
>>>> facilities that need to store metadata in a relational database
>>>> while the data itself has been parked in raw file systems,
>>>> non-relational databases, zenodo, etc.  As with almost all of CIF,
>>>> imgCIF/CBF metadata maps very easily and directly into relational
>>>> tables, while putting NeXus/HDF5 metadata into a relational database
>>>> first requires exactly the same sort of transformations as we have
>>>> already designed to map NeXus/HDF5 metadata into imgCIF/CBF.  To me
>>>> it seems that James' suggestion is not a reinvention of this
>>>> particular wheel, but may be an important step in avoiding
>>>> reinvention of the wheel.  This may avoid a lot of unnecessary
>>>> transformation of huge quantities of raw data in serial
>>>> crystallography while making the metadata more accessible.
>>>>
>>>>   I would suggest giving James' suggestion serious consideration.
>>>>
>>>>   Regards,
>>>>     Herbert
>>>>
>>>> On Wed, Feb 13, 2019 at 4:02 AM James Hester <jamesrhester@gmail.com> wrote:
>>>>>
>>>>> Dear Graeme,
>>>>>
>>>>> The context of this is the idea that a single imgCIF file could be
>>>>> generated from a collection of raw image files (in whatever format,
>>>>> whether HDF5, or ADSC, or Bruker, or Rigaku, etc.) which would
>>>>> contain the metadata pertaining to that collection. In such a
>>>>> situation, some way of referring to the raw frames from within the
>>>>> imgCIF file is required.
>>>>>
>>>>> I agree that a perfectly reasonable approach is not to generate any new
>>>>> file at all, and simply to access the metadata directly in whatever format
>>>>> happens to be there. This was my initial impulse as well and it took me a
>>>>> while to understand that the actual proposal was to create an imgCIF file,
>>>>> rather than just use imgCIF datanames for specification purposes.  From a
>>>>> semantic point of view both amount to the same thing so my only real
>>>>> motivation here is to add an image linking facility to imgCIF so that the
>>>>> "generate a summary metadata file" approach is possible.
>>>>>
>>>>> Could we just copy the HDF5 way of referring to objects in other
>>>>> HDF5 files as a quick solution?
>>>>>
>>>>> all the best,
>>>>> James.
>>>>>
>>>>> On Wed, 13 Feb 2019 at 19:03, Graeme.Winter@Diamond.ac.uk <Graeme.Winter@diamond.ac.uk> wrote:
>>>>>
>>>>> > Dear James,
>>>>> >
>>>>> > On the face of it, this looks a lot to me like a reinvention of HDF5 -
>>>>> > perhaps with specific semantics - and there is already a (complete?)
>>>>> > mapping from imgCIF to HDF5 / NeXus
>>>>> >
>>>>> > Have I missed something? No offence meant, trying to understand the
>>>>> > shape of the problem you are trying to solve
>>>>> >
>>>>> > Thanks & best wishes Graeme
>>>>> >
>>>>> > > On 13 Feb 2019, at 05:15, James Hester <jamesrhester@gmail.com> wrote:
>>>>> > >
>>>>> > > Dear All,
>>>>> > >
>>>>> > > Recent Commdat discussion revealed a desire to reference external
>>>>> > > images from within an imgCIF file. This would allow the metadata
>>>>> > > for a dataset to be held within a single imgCIF file, while the
>>>>> > > frames themselves remain separate. This avoids the impracticality
>>>>> > > of navigating through an enormous multi-frame imgCIF file in order
>>>>> > > to extract a relatively compact amount of information.
>>>>> > >
>>>>> > > As a starting proposal, I suggest we extend the _array_data
>>>>> > > category with the following three datanames:
>>>>> > >
>>>>> > > (1) _array_data.external_format    A value drawn from an enumerated
>>>>> > > list of formats (e.g. "SMV", "HDF5", "Bruker"). The definition for
>>>>> > > each enumerated value would explain how to interpret
>>>>> > > _array_data.internal_path
>>>>> > > (2) _array_data.location_uri       A URI for the file containing
>>>>> > > the image. A relative URI is relative to the location of the imgCIF
>>>>> > > file
>>>>> > > (3) _array_data.internal_path      A format-specific string
>>>>> > > describing the location of the frame within the file identified by
>>>>> > > _array_data.location_uri, interpreted according to the value given
>>>>> > > in _array_data.external_format
>>>>> > >
>>>>> > > So for a multi-frame HDF5 file buried in a subdirectory of the
>>>>> > > location referenced with a DOI, with appropriate definitions of the
>>>>> > > path notation:
>>>>> > >
>>>>> > > loop_
>>>>> > > _array_data.array_id
>>>>> > > _array_data.binary_id
>>>>> > > _array_data.external_format
>>>>> > > _array_data.location_uri
>>>>> > > _array_data.internal_path
>>>>> > > 1 1 NXMX doi:x.y.z directory/run/masterfilename:/entry1/detector/data[0]
>>>>> > > 1 2 NXMX doi:x.y.z directory/run/masterfilename:/entry1/detector/data[1]
>>>>> > > ...
>>>>> > >
>>>>> > > Or for a bunch of single-frame files generated by an ADSC detector
>>>>> > > in the same directory as the imgCIF file
>>>>> > >
>>>>> > > loop_
>>>>> > > _array_data.array_id
>>>>> > > _array_data.binary_id
>>>>> > > _array_data.external_format
>>>>> > > _array_data.location_uri
>>>>> > > 1 1 ADSC ./tartaric.001
>>>>> > > 1 2 ADSC ./tartaric.002
>>>>> > > 1 3 ADSC ./tartaric.003
>>>>> > > ...
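
To make the row structure of these loops concrete, the following Python
sketch reads a simple loop, such as the ADSC example above, into one
dictionary per row. It is deliberately naive: it assumes unquoted,
whitespace-separated values, and it is not a CIF parser. The function
name parse_simple_loop is hypothetical.

```python
def parse_simple_loop(text: str) -> list[dict[str, str]]:
    """Turn a simple CIF-style loop into a list of {data name: value} rows."""
    names: list[str] = []
    values: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and the loop_ keyword itself.
        if not line or line.startswith("#") or line == "loop_":
            continue
        if line.startswith("_") and not values:
            names.append(line)            # header: one data name per line
        else:
            values.extend(line.split())   # body: values may span lines
    if not names or len(values) % len(names) != 0:
        raise ValueError("values do not fill a whole number of rows")
    n = len(names)
    return [dict(zip(names, values[i:i + n])) for i in range(0, len(values), n)]


example = """
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
1 1 ADSC ./tartaric.001
1 2 ADSC ./tartaric.002
"""
rows = parse_simple_loop(example)
```

Each entry of rows then maps the four data names to the values for one
frame, which is the form a reader would need before locating the frame.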
>>>>> > >
>>>>> > > The imgCIF data items describing the structure of the data array
>>>>> > > would refer to the data after it has been provided by the format.
>>>>> > > The form in which it is provided should be specified in the
>>>>> > > definition of each value of "_array_data.external_format".  So, for
>>>>> > > example, the various compression methods in HDF5 would be invisible
>>>>> > > if the data as returned are specified to be an array of Reals.
>>>>> > >
>>>>> > > From the point of view of initial data validation, it would be
>>>>> > > sufficient to check that all referenced files are accessible, and
>>>>> > > that the provided locations exist.
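
The accessibility check described here could be sketched in Python
roughly as follows. This is only an illustration under stated
assumptions: relative URIs are resolved against the location of the
imgCIF file, as proposed in item (2); local files are tested for
existence; and http(s) URIs are probed with a HEAD request, which some
servers may not support. The names resolve_location and is_accessible
are hypothetical. Checking that a location inside a file exists would
need a format-specific reader and is not attempted.

```python
import os
import urllib.error
import urllib.request
from urllib.parse import urljoin, urlparse
from urllib.request import url2pathname


def resolve_location(imgcif_uri: str, location_uri: str) -> str:
    """Resolve a possibly relative location URI against the imgCIF file URI."""
    return urljoin(imgcif_uri, location_uri)


def is_accessible(uri: str, timeout: float = 10.0) -> bool:
    """Return True if the file named by an absolute URI can be reached."""
    parts = urlparse(uri)
    if parts.scheme == "file":
        return os.path.isfile(url2pathname(parts.path))
    if parts.scheme in ("http", "https"):
        request = urllib.request.Request(uri, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 400
        except (urllib.error.URLError, OSError):
            return False
    # Other schemes (for example doi:) would need their own resolver.
    return False
```

A validator would call resolve_location for every row of the loop and
report each URI for which is_accessible returns False.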
>>>>> > >
>>>>> > > Thoughts?
>>>>> > > James.
>>>>> > >
>>>>> > > --
>>>>> > > T +61 (02) 9717 9907
>>>>> > > F +61 (02) 9717 3145
>>>>> > > M +61 (04) 0249 4148
>>>>> > > _______________________________________________
>>>>> > > imgcif-l mailing list
>>>>> > > imgcif-l@iucr.org
>>>>> > > http://mailman.iucr.org/cgi-bin/mailman/listinfo/imgcif-l
>>>>> >
>>>>> >



_______________________________
imgcif-l mailing list
imgcif-l@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/imgcif-l
