AsCA 2018

MS15: Database developments, validation and data mining

Auckland, New Zealand, 4 December 2018

For the first time at an Asian Crystallographic Association meeting, there was a dedicated microsymposium on data-related topics, chaired by Amy Sarjeant of the Cambridge Crystallographic Data Centre and Genji Kurisu of PDBj, the Japanese partner of the Worldwide Protein Data Bank. The session gave a good overview of current developments within some of the major crystallographic research databases, and reminded the audience of the burgeoning importance to the practising scientist of proper data management and archiving.

Speakers and Chairs: Amy Sarjeant (Co-Chair), Stephen Burley, James Hester, Matthew Lightfoot, Janet Newman, Brian McMahon, Takeshi Kawabata, Genji Kurisu (Co-Chair).

Janet Newman discussed a database of crystallisation conditions and screens that could be used to guide crystallisation campaigns for proteins and other biological macromolecules. Takeshi Kawabata described the services being developed by PDBj to support researchers using cryoelectron microscopy for macromolecular structural research. Stephen Burley reported on progress towards improving validation of ligand structures in the Protein Data Bank. Matthew Lightfoot presented some recent and ongoing developments in deposition and access for users of the Cambridge Structural Database. The session was bookended by two contributions illustrating the IUCr's close involvement with data management and characterisation: James Hester discussed approaches to developing software for automated validation and evaluation of data relationships within any sufficiently well characterised data set, while Brian McMahon gave an account of data projects in which the IUCr has been involved over the last three decades. These include the CIF project and, more recently, considerations of the case for routine deposition of X-ray diffraction images and other experimental raw data.

Programme

Tuesday 4 December 2018
MS 15: Database developments, validation and data mining
14:00-14:15	James Hester	What is a dataset?	Abstract \| Presentation (9.4 MB)
James R. Hester¹
¹ ANSTO, Locked Bag 2001, Kirrawee DC, Australia 2232

Several widely-accepted standards now exist that make finding and citing datasets straightforward. However, after a dataset has been found, application software is not generally able to determine how to process the dataset or if the data are appropriate to its task, especially given that data may be stored in a variety of formats and arrangements and divided or aggregated. If a dataset is formed from several separate digital objects, there is no standard, computationally-robust way to describe their relationships to a computer. These and other issues are easy enough for a human to program on a case-by-case basis, but a generic framework that would underpin automatic processing is currently lacking.

The contents of any dataset can be modelled as a collection of relational tables [1]. A machine-readable set of definitions for the columns of these tables, such as those provided by the CIF dictionaries, can provide all of the information necessary for performing computations on the dataset. A dataset is complete if the set of columns required to fulfill a task is available or can be computed from the rest of the data. A particular use-case is thus equivalent to a list of columns. Application software accesses the data via an interface that presents that data in relational form using a shared standard ontology to name the columns. By requesting specific columns, software can immediately determine if the dataset is suitable for its needs. Integration of multiple distinct data blobs is equivalent to filling in blocks within tables.

[1] Hester, J. R. (2016). "A robust, format-agnostic scientific data transfer framework". Data Science Journal 15, p. 12. DOI: https://doi.org/10.5334/dsj-2016-012
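The completeness idea in this abstract (a use-case is a list of columns, and a dataset is complete if each column is present or can be computed) can be illustrated with a minimal Python sketch. It is not the framework described in the talk: the column names, the `DERIVATIONS` rule table and the functions `resolve` and `is_complete` are all hypothetical, and the "table" is reduced to a single row for brevity.

```python
import math

# Hypothetical single-row "table". Column names are illustrative only,
# loosely modelled on CIF data names rather than copied from a dictionary.
dataset = {
    "cell.length_a": 5.0,
    "cell.length_b": 6.0,
    "cell.length_c": 7.0,
    "cell.angle_alpha": 90.0,
    "cell.angle_beta": 90.0,
    "cell.angle_gamma": 90.0,
}


def cell_volume(a, b, c, alpha, beta, gamma):
    """General unit-cell volume from edge lengths and angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


# Hypothetical derivation rules: target column -> (input columns, function).
DERIVATIONS = {
    "cell.volume": (
        (
            "cell.length_a", "cell.length_b", "cell.length_c",
            "cell.angle_alpha", "cell.angle_beta", "cell.angle_gamma",
        ),
        cell_volume,
    ),
}


def resolve(column, data, rules, _seen=frozenset()):
    """Return a column's value, computing it from other columns if needed.

    Raises KeyError when the column is neither stored nor derivable.
    """
    if column in data:
        return data[column]
    if column in rules and column not in _seen:  # _seen guards against cycles
        inputs, func = rules[column]
        args = [resolve(name, data, rules, _seen | {column}) for name in inputs]
        return func(*args)
    raise KeyError(column)


def is_complete(required, data, rules):
    """True if every column in the use-case is stored or can be computed."""
    try:
        for column in required:
            resolve(column, data, rules)
    except KeyError:
        return False
    return True


print(is_complete(["cell.volume"], dataset, DERIVATIONS))
print(is_complete(["diffrn.ambient_temperature"], dataset, DERIVATIONS))
```

In this sketch an application states its needs as a list of column names and learns immediately whether the dataset can serve them, without knowing how the values are stored.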
14:15-14:40	Janet Newman	Data for crystallisation - answers are in the distance	Abstract \| Presentation (3.7 MB)
Janet Newman¹, Vincent J. Fazio², Alex Khassapov², Marko Ristic¹, Nicolas Rosa¹ and Luke Thorburn¹
¹ CSIRO Biomedical Manufacturing, 343 Royal Parade, Parkville, Vic 3054, Australia
² CSIRO Scientific Computing, Private Bag 10, Clayton South, Vic 3169, Australia

Most attempts to crystallise a protein (or any macromolecule) start by setting up the protein against one or more commercially available screens. There are a good number of vendors of crystallisation screens, and each sells many different screens. Navigating through what is available, and what each screen contains is challenging, and comparing offerings between vendors is almost impossible as there are no standards for how crystallisation data should be described. We have created a defined vocabulary for crystallisation experiments, and have also implemented the concept of 'distance' between two crystallisation conditions. This allows us to build up a database of crystallisation conditions and screens - and to search on screens, conditions, chemicals or similarity. This information can be accessed through the website c6.csiro.au. The webtool can be used to guide crystallisation campaigns, both initial searches and optimisation experiments, and suggestions on how this tool can be used during the course of crystallisation will be presented.
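The abstract does not define the 'distance' used by the C6 tool, so the following Python sketch only illustrates the general idea of a similarity search over conditions. Everything in it is an assumption made for illustration: the `Condition` class, the scaling of concentrations to fractions of a typical maximum, the pH weighting and the example screen are hypothetical and should not be read as the C6 metric.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    """Hypothetical description of one crystallisation condition."""
    name: str
    ph: float
    # chemical name -> concentration as a fraction (0..1) of an assumed
    # typical maximum for that chemical
    chemicals: dict = field(default_factory=dict)


def distance(x, y, ph_weight=0.25, ph_range=14.0):
    """Toy distance in [0, 1]: 0 for identical conditions, symmetric in x and y."""
    names = set(x.chemicals) | set(y.chemicals)
    if names:
        # Mean absolute concentration difference; a chemical missing from
        # one condition counts as concentration zero there.
        chem = sum(
            abs(x.chemicals.get(n, 0.0) - y.chemicals.get(n, 0.0)) for n in names
        ) / len(names)
    else:
        chem = 0.0
    ph = abs(x.ph - y.ph) / ph_range
    return (1.0 - ph_weight) * chem + ph_weight * ph


def nearest(query, screen):
    """Return the condition in `screen` closest to `query`."""
    return min(screen, key=lambda cond: distance(query, cond))


screen = [
    Condition("A1", 7.0, {"PEG 3350": 0.5, "sodium chloride": 0.2}),
    Condition("A2", 5.0, {"ammonium sulfate": 0.6}),
    Condition("A3", 8.5, {"PEG 3350": 0.4, "sodium chloride": 0.1}),
]
hit = Condition("hit", 7.5, {"PEG 3350": 0.45, "sodium chloride": 0.2})
print(nearest(hit, screen).name)
```

A shared vocabulary matters here: the comparison is only meaningful if the same chemical always appears under the same name, which is the standardisation problem the abstract highlights.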
14:40-14:55	Matthew Lightfoot	The Cambridge Structural Database - Developments in deposition and access	Abstract \| Presentation (4.8 MB)
Matthew P. Lightfoot¹ and Suzanna C. Ward¹
¹ The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK

Over the last few years CCDC have developed services to make it as easy as possible to deposit and access small molecule crystal structure data and this presentation will highlight several of these developments. Recent changes to our deposition services allow depositors easily to deposit, view and manage their data. This talk will detail some of these changes including the ability to share data across institutions and developments to ensure that crystallographers are recognised for their work. As well as these recent changes we will discuss ways in which we are looking to further develop these services. An important area for CCDC is data quality and integrity and we will explore our validation checks and new deposition guidelines which aim to aid depositors and help improve the quality and integrity of the data that is deposited at CCDC. We will also discuss the increasing use of data publications, in particular CSD Communications, as a way to directly share data and how we are working to ensure the rise of data publications does not negatively impact the quality of the CSD.

We will conclude this presentation by highlighting recent efforts to integrate and link with other data resources including our recent collaboration with FIZ Karlsruhe that resulted in the launch of joint deposition and access services for crystallographic data across all chemistry. We will show how these new services enable researchers to share data through a single deposition portal and provide free worldwide access to all chemical structures.
14:55-15:20	Stephen Burley	Ligand Validation for the Protein Data Bank	Abstract \| Presentation (4.2 MB)
Stephen K. Burley¹,²
¹ RCSB PDB, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, United States
² RCSB PDB, Skaggs School of Pharmacy and Pharmaceutical Sciences and San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, United States

The Protein Data Bank (PDB) is the global repository for experimentally-determined 3D structures of biological macromolecules. It is managed by the Worldwide Protein Data Bank (wwPDB, wwpdb.org). In addition to biopolymer structure data, the PDB Chemical Component Dictionary (CCD) catalogues small molecule ligands, encompassing IUPAC atom nomenclature for standard amino acids and nucleotides, stereo-chemical assignments, bond order assignments, experimental model and computed ideal coordinates, systematic names, and chemical descriptors. Precise knowledge of interactions between macromolecules and small ligands is central to our understanding of biological function, drug action, mechanisms of drug resistance, and drug-drug interactions.

The wwPDB OneDep system supports PDB data deposition, validation, and biocuration. OneDep produces Validation Reports using standards developed with expert task forces. For X-ray structures, the fit of the ligand to electron density difference maps is assessed quantitatively using real-space R-factors (RSR and RSR Z-scores). Within OneDep, 3D electron density difference maps are produced for expert review. For released structures, precomputed electron density difference maps for bound ligands and wwPDB Validation Reports can be accessed from RCSB PDB Structure Summary pages. The 2015 Ligand Validation Workshop generated community recommendations aimed at further improving validation of ligand structures in the PDB (Adams et al. 2016; Structure 24, 502-508). Progress towards implementation of these recommendations will be reported together with ongoing enhancements to the CCD and wwPDB Validation Report.

wwPDB members are RCSB PDB (supported by NSF, NIH, and DOE), PDBe (EMBL-EBI, Wellcome Trust, BBSRC, MRC, and EU), PDBj (NBDC-JST), and BMRB (NIGMS).
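As a rough illustration of the real-space R-factor mentioned in this abstract, the sketch below uses the commonly quoted definition RSR = Σ|ρ_obs − ρ_calc| / Σ|ρ_obs + ρ_calc|, summed over map grid points around a residue or ligand, and a Z-score taken against reference statistics. It is not the OneDep implementation: the function names, the flat lists of density values and the reference mean and standard deviation are assumptions made for illustration.

```python
def real_space_r(rho_obs, rho_calc):
    """Real-space R-factor over paired density values at the same grid points.

    0 means the two densities agree exactly; larger values mean a worse fit.
    """
    if len(rho_obs) != len(rho_calc):
        raise ValueError("density samples must cover the same grid points")
    numerator = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    denominator = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    if denominator == 0:
        raise ValueError("no density at the sampled grid points")
    return numerator / denominator


def rsr_zscore(rsr, ref_mean, ref_std):
    """Z-score of an RSR value against a reference population.

    ref_mean and ref_std are assumed to come from comparable entries,
    for example the same residue type at similar resolution.
    """
    return (rsr - ref_mean) / ref_std


# Hypothetical density samples around one ligand.
observed = [0.8, 1.1, 0.4, 0.9]
calculated = [0.7, 1.0, 0.6, 0.9]
rsr = real_space_r(observed, calculated)
print(round(rsr, 3), round(rsr_zscore(rsr, ref_mean=0.10, ref_std=0.05), 2))
```

A high positive Z-score flags a fit that is noticeably worse than that of comparable residues, which is the kind of outlier a validation report would bring to a reviewer's attention.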
15:20-15:35	Takeshi Kawabata	Databases and Web services from PDBj for Electron Microscopy	Abstract \| Presentation (5.0 MB)
Takeshi Kawabata¹, Hirofumi Suzuki¹ and Genji Kurisu¹
¹ Institute for Protein Research, Osaka University, Osaka, Japan

Cryo-electron microscopy has recently emerged as a powerful technique for solving atomic 3D structures. Protein Data Bank Japan (PDBj) provides several web databases and services to support researchers using electron microscopy. The database 'EM navigator' provides a user-friendly view of 3D density maps stored in EMDB (Electron Microscopy Data Bank). The 'Omokage search' service finds 3D maps or atomic models whose shapes are similar to a query map or model supplied by the user. The 'gmfit' service performs fitting calculations between a 3D map and an atomic model through the web. The calculation is fast because the density is approximated by a Gaussian mixture model. The gmfit program has been improved to perform partial fitting of an atomic model within only a subspace of the 3D map, using a masking function. A helix detection program using Gaussian functions is almost ready for release.

Finally, we announce that a mirror site of the EMPIAR database (Electron Microscopy Public Image Archive) is now open in Japan (https://empiar.pdbj.org). EMPIAR is a public resource for 2D electron microscopy images developed at EMBL-EBI. A set of 2D images is the raw experimental data from which a 3D density map is reconstructed, and such sets are very large: the average size is about 500 Gbyte. Although these 2D images are difficult to handle because of their size, they are necessary to validate 3D maps, to advance the development of image processing software, and to educate and train EM users. We are now preparing to open an EMPIAR deposition site at PDBj.
15:35-16:00	Brian McMahon	The element of trust: validating and valuing crystallographic data	Abstract \| Presentation (4.9 MB)
Brian McMahon¹, John R. Helliwell² and James R. Hester³
¹ International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, UK. E-mail: bm@iucr.org
² School of Chemistry, University of Manchester, Manchester M13 9PL, England
³ Australian Nuclear Science and Technology Organisation, New Illawarra Road, Lucas Heights, NSW 2234, Australia

The IUCr encourages best practice in data management, and performs data validation in its own journals and by collaboration with structural databases. Its Crystallographic Information Framework (CIF) provides precise data definitions and led to automated criteria (checkCIF) for testing the reasonableness of a derived structural model. These procedures for small-molecule structures are used by many journals and databases. (Reciprocally, PDB validation reports are inspected during the review process of IUCr biological journals.) Many journals and databases also request structure-factor or other underlying data to allow more detailed validation.

Attention is now shifting from processed to raw experimental data. An IUCr Working Group explored the idea of routinely depositing raw diffraction images. Storing such large volumes of data was once prohibitively expensive, but technological improvements have overcome this objection. There is ongoing debate about the scientific value of image deposition, but workshops and publications have informed the discussion with detailed analysis of the potential scientific benefits (DDDWG, 2017). Validation of raw data sets will involve characterisation of image data (using imgCIF-based data names) and formal requirements for essential metadata to allow interpretation of individual images. Work towards such requirements is being carried out under the aegis of the IUCr's Data and CIF Committees (CommDat and COMCIFS).

The recently upgraded CIF specification facilitates DDLm, a machine-readable description of relationships between data items that can automatically generate software methods for testing and evaluating such relationships. This will go further towards ensuring the integrity of published and deposited crystallographic data.

Reference: DDDWG (2017). Final report of the Diffraction Data Deposition Working Group. http://forums.iucr.org/viewtopic.php?f=21&t=396