


IUCr activities
International Tables Volume C Second Edition - Data Mining Chapters
Crystallography has been, and remains, one of the most multidisciplinary sciences, linking frontier areas of research. Directly or indirectly, it has produced the largest number of Nobel Laureates in history (29 Prizes for 48 Laureates to date). It is therefore not surprising that the 2024 Nobel Prizes for Physics and Chemistry once again highlight groundbreaking advances made in the field of crystallography. This time, the recognition went to the development of machine learning methods for predicting protein structures.
Crystallographers were among the first scientists to recognize the importance of collecting data on crystal structures. The subsequent explosive growth has been caused by the accumulation of huge amounts of data and the growing ability of computers to store them. Data mining has emerged as the tool of choice for extracting hidden patterns and new knowledge from data by using computing power and applying new techniques of machine learning. This has been aided by earlier developments in computer science such as neural networks (1943), clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Although the key algorithms used in data mining today were known in mathematics as early as the beginning of the last century, data mining is still a young field of science, marked by the rapid evolution of methods and tools and the explosive growth of computational possibilities. Partly for this reason, it is very difficult to find a unified terminology and presentation within the huge amount of specialized literature, including handbooks on data mining. Therefore, to help readers navigate between tasks, algorithms and their applications as used in crystallography, the new edition of International Tables for Crystallography (Volume C: Mathematical, Physical and Chemical Tables) will include two new chapters: 'Data mining. I. Machine learning in crystallography' and 'Data mining. II. Prediction of protein structure and optimization of protein crystallizability'.
The first chapter, written by D. W. M. Hofmann and L. N. Kuleshova, introduces readers to the concept, terminology and overall scheme of machine learning as it is used today. The main tasks of machine learning, such as clustering, anomaly detection, classification, regression and summarization (dimensionality reduction), are then characterized and related to actual problems of crystallography. For example, together with an overview of clustering algorithms and methods of similarity detection (also known as pattern recognition), several important applications of crystal structure clustering are described. Clustering of polymorphs determined under non-ambient conditions reveals significant structural changes and the need to take into account additional descriptors (temperature and pressure) during Crystal Structure Prediction.
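As a rough illustration of the clustering task, the sketch below groups structures whose descriptor vectors lie within a distance cutoff of each other (single-linkage clustering). It is not the method used in the chapter: the function name `single_linkage_clusters`, the cutoff and the two-component descriptor vectors are assumptions made up for this example, and real descriptors (which could include temperature and pressure as extra components) would need appropriate scaling.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage_clusters(points, cutoff):
    """Group points so that any two points closer than `cutoff`
    end up in the same cluster (single-linkage, via union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= cutoff:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy, made-up descriptor vectors (not real crystal data).
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
clusters = single_linkage_clusters(descriptors, cutoff=0.5)
print(clusters)
```

With these toy values the first three vectors form one cluster and the last two another; adding a further descriptor component can split or merge clusters, which is the reason the choice of descriptors matters.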
The mathematical concepts behind the tasks of classification and regression are illustrated using examples drawn from crystal-density estimation and force-field development. The new approach to refining the model and parameters of a force field can handle any type of atom, including metals and ions, and can take external conditions into account. It significantly improves the predictive power of Crystal Structure Prediction, which is one of the primary high-level goals of machine learning.
The summarization (dimensionality reduction) task is illustrated in the chapter by the calculation and prediction of solubility, a property of crucial importance for pharmaceutical development and drug discovery, as are estimates of the crystal energy for monitoring the stability of proposed new drugs.
The importance of the anomaly (outlier) detection task is also highlighted. Since all methods of machine learning are very sensitive to data quality, they can easily detect anomalies and outliers. A careful analysis of any outliers is very often worthwhile, as it can indicate the kinds of problems present: erroneous data, or the inefficiency or unsuitability of the models. Several cleansers have been developed for use during data pre-processing, in addition to the two cleansers used at present in the Cambridge Structural Database. The first of these, enCIFer (CCDC, 2016), checks the syntax of the data, while the second, PLATON (Spek, 2016), checks the reasonableness of the structure and, for instance, annotates the records of structures that contain large voids. The chapter also illustrates how data cleansers that emerge in the course of machine learning work operate within the algorithms and how they can be used for database cleansing. The removal of anomalous data from a dataset often results in a significant increase in the accuracy of data-mining procedures.
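The effect of removing anomalous data can be sketched with a toy regression. The example below is only an illustration under stated assumptions, not a cleanser from the chapter: it fits a straight line by least squares, flags points whose residuals have a large modified z-score (based on the median absolute deviation), and refits without them. The helper names, the threshold of 3.5 and the data values are all invented for this example.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, threshold=3.5):
    """Return indices whose residual from the fitted line has a
    modified z-score above `threshold` (an assumed, common default)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals])
    if mad == 0:
        return []  # residuals too uniform to define a robust scale
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - med) / mad > threshold]


# Made-up data: y is close to 2x + 1, except for one erroneous entry at x = 5.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.05, 6.95, 9.0, 40.0, 13.1, 14.9, 17.05, 18.95]

slope_before, _ = fit_line(xs, ys)
flagged = flag_outliers(xs, ys)
kept = [i for i in range(len(xs)) if i not in flagged]
slope_after, _ = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
print(flagged, round(slope_before, 3), round(slope_after, 3))
```

In this toy case the single erroneous entry visibly distorts the fitted slope, and the refit after its removal recovers a slope close to 2. Whether a flagged point is erroneous data or a sign of an unsuitable model still has to be judged case by case.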
The second chapter, written by A. Kloczkowski et al., is dedicated to the application of machine learning techniques to extract predictive information from the protein, DNA and RNA databases. Mining for information in biological databases involves various forms of data analysis such as clustering, sequence homology searches, structure homology searches, examination of statistical significance, etc. In particular, the data mining of structural fragments of proteins from known structures in the Protein Data Bank (PDB) significantly improves the accuracy of secondary structure prediction. Accordingly, the chapter discusses an original method, fragment data mining (FDM). The method mines structural segments from the PDB and exploits structural information by matching the sequences of these structural fragments, with the aim of improving the prediction of secondary structure. Further improvements are then discussed that combine FDM with the classical GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments. A newer and more accurate approach to secondary structure prediction is also introduced: the SPINE-X method, which uses machine learning to predict secondary structures by mining protein sequences and structures. The results obtained strongly suggest that data mining can be an efficient and accurate approach to secondary structure prediction in proteins.
The last part of this second chapter discusses applications of data mining to the problem of optimizing protein crystallization conditions. Machine learning can be used to improve the yield and quality of protein crystals, and thus aid in solving protein structures by X-ray crystallography. There is a vast amount of data on protein structures, sequences and crystallization conditions that can be mined to aid in structure prediction and structure determination.
We hope that the methods and concepts presented in these chapters will make data mining and machine learning more accessible to the general practitioner in crystallography, and will enable new applications in the field and the discovery of non-trivial and scientifically relevant knowledge.
These early view chapters are freely available for a limited time.
References
Spek, A. (2016). PLATON, University of Utrecht, The Netherlands.
CCDC (2016). enCIFer, Cambridge Crystallographic Data Centre, UK.
Copyright © - All Rights Reserved - International Union of Crystallography