The Big Data Science Center at the Shanghai Synchrotron Radiation Facility: the dawn of the scientific superfacilities

Alessandro Sepe
Users at the Shanghai Synchrotron Radiation Facility (SSRF) can seamlessly access the scientific and technological architecture of the Big Data Science Center (BDSC) through thin terminals to take full advantage of all the services offered by the BDSC in real time while running their experiments at the SSRF beamlines.

The massive amount of raw data now produced at large scientific facilities creates not only enormous new opportunities but also tremendous challenges. Already, only a small fraction of this multidisciplinary and scientifically complex Big Data is fully analysed and, ultimately, used in scientific publications, and it is predicted that within a few years conventional data-analysis approaches will be overwhelmed, preventing users from producing meaningful science from their large-scale experiments. This is a problem for all synchrotron and neutron facilities, as well as for X-ray free-electron laser facilities, where tens of petabytes are produced annually. Beamtime is expensive, and the lack of automated data-analysis pipelines reduces beamtime efficiency. This “data deluge” effect [1] has implications for all large scientific facilities worldwide, because it affects fast data collection, storage and curation, including data movement and deposition in databases.

We are witnessing the dawn of artificial intelligence (AI), machine learning (ML) and robotic automation at large scientific facilities, generating profound changes in how petabytes of interdisciplinary datasets are intelligently processed, managed, analysed and visualised. The consequent evolution of large scientific facilities into superfacilities enables multimodal user science to confront Big Data challenges that are fundamental for the entire scientific community. The Big Data Science Center (BDSC) at the Shanghai Synchrotron Radiation Facility (SSRF), Shanghai Advanced Research Institute (SARI), Chinese Academy of Sciences, Zhangjiang Laboratory, is the first scientific superfacility in China, and one of the first worldwide [2]. With its state-of-the-art developments, it aims to dramatically accelerate and automate the multidisciplinary research of all users at large national scientific facilities, thus increasing the rate of their scientific discoveries and of the resulting technological advancements, with a clear societal impact. The BDSC Big Data Science platform therefore targets the research projects that national and international universities, academies, research institutes and industries are pursuing at SSRF, where massive scientific-computing support is required to enable the most complete knowledge transfer from scientific research to industrial development, while elastically interfacing them with the top Chinese National Supercomputer Centers.

Video documentary showing the BDSC activities, their international potential and their impact on user science at the SSRF and, more generally, at large national scientific facilities worldwide. Here, the BDSC state-of-the-art superfacility capabilities are demonstrated using the Biological Macromolecular Crystallography Beamline (MX Beamline) as a case study, including users performing a real experiment in real time. The entire BDSC workflow is shown, from the preparation of the user experiment at the beamline, through the setting-up of the beamline, to the real-time Big Data analysis and results visualisation, including the users' data being processed and monitored through the BDSC superclusters and Control Room, as well as the users interacting with the BDSC Platform.

The BDSC is one of the latest SSRF upgrades resulting from the SSRF Phase II project. The BDSC aims to support all the SSRF beamlines, as well as their national and international users, through its state-of-the-art scientific computational infrastructure, including high-performance computing (HPC), latest-generation storage systems and advanced software platforms. Users and beamlines at the SSRF thus benefit, seamlessly, from the advanced Big Data processing, movement, analysis, results interpretation and visualisation capabilities offered by the BDSC. The BDSC aims to help all SSRF users produce high-impact science and technology that meets the highest international quality standards, and so to publish their results in leading international peer-reviewed journals.

Professor Alessandro Sepe, Director of the BDSC, has designed and architected this novel Big Data Science platform at SSRF, which allows all its users to fully exploit the scientific and technological potential of the BDSC for their research. The entire BDSC staff supported Professor Sepe’s efforts in developing, deploying and then constantly upgrading the BDSC infrastructure, which is now fully operational. Here, state-of-the-art Big Data science and technologies, AI, internet of things (IoT), real-time unstaffed and remotely controlled experiments, robotic automation, HPC, cloud/fog supercomputing and massive parallelisation converge on the SSRF through the BDSC’s fully centralised platform, accelerating the multidisciplinary user science performed at SSRF by a factor of 60 and effectively creating the first world-class, user-friendly Chinese superfacility. This greatly augments the interpretation of the scientific data generated by the experiments at SSRF. By developing, deploying and upgrading the Big Data Science platform, the BDSC is fostering full robotic automation at the SSRF beamlines, aiming at real-time unstaffed and remotely controlled experiments, while sharing its experience with other multidisciplinary large facilities worldwide.

The Big Data Science platform developed by the BDSC collects, tags and tracks large volumes of metadata from all the experiments at SSRF in order to fully automate the entire large-facility lifecycle. Hundreds of petabytes of scientific data are thus tagged and then ingested by neural networks for ML. This scientific and technological achievement also allows non-expert users at SSRF to obtain scientifically meaningful results in real time, instead of spending months processing unstructured raw data after returning to their home institutions. The BDSC is thus extending the use of large national facilities to an unprecedented range of scientific and industrial disciplines, dramatically increasing the scientific and technological productivity of large scientific facilities like SSRF and shifting the focus of their users from data science to pure science, thereby enabling a truly user-science-centric and multimodal infrastructure.

The BDSC is also further expanding the SSRF scientific computational capabilities, using the BDSC framework to interface the SSRF directly with supercomputers in China, including the Shanghai Supercomputer Center (SSC), over extremely low-latency networks.

The BDSC further aims to seed the Chinese National Scientific Grid, based on the platform model being developed at the BDSC.

References

[1] C. Wang, U. Steiner & A. Sepe (2018). Synchrotron Big Data Science. Small, 14, 1802291.

[2] C. Wang, F. Yu, Y. Liu, X. Li, J. Chen, J. Thiyagalingam & A. Sepe (2021). Deploying the Big Data Science Center at the Shanghai Synchrotron Radiation Facility: the first superfacility platform in China. Mach. Learn.: Sci. Technol., 2, 035003.

28 June 2022

Copyright © - All Rights Reserved - International Union of Crystallography

The permanent URL for this article is https://www.iucr.org/news/newsletter/volume-30/number-2/the-big-data-science-center-at-the-shanghai-synchrotron-radiation-facility-the-dawn-of-the-scientific-superfacilities