Ten years and change: the MX data archive at ALS 8.3.1
University of California, San Francisco and Lawrence Berkeley National Laboratory, USA
With very few exceptions, every image collected at ALS beamline 8.3.1 since July 2001 has been backed up. So far, this amounts to a little more than 4 million images and 66 terabytes on ~50,000 DVD-R disks. More recently, LTO4 tapes have been used as well, so that the failure modes of the second copy are independent of those of the first (~3000 images have proved unrecoverable). Assigning the ~20,000 data sets in the archive to PDB entries (~900 credited to 8.3.1) has proved difficult, as no metadata exists and nearly 700,000 images are called 'test'. Unit cell dimensions are not as unique in the PDB as one might expect: one cell (48 62 84 90 101 104) lies within 5 Å and 5° of the cells of more than one fifth of the entire PDB! Exhaustive rigid-body refinement of all compatible-cell PDB entries against the 3000 MAD/SAD data sets collected between 6 and 11 years ago has revealed that ~50% of all unique crystal forms studied in that period have yet to appear in the PDB. Essentially all of these unpublished data sets also remain unsolvable with modern MAD/SAD phasing techniques, but in one trial case from 2002, where the name of the protein was available, molecular replacement via the BALBES server proved successful. The original author is now working on the publication, but it is noteworthy that the search model that worked came from a structural genomics center in 2007, five years after the data were collected. No doubt this is just one of many examples of data sets that could not be solved with the technology available when they were collected, but that remain scientifically valuable once sufficiently powerful methods have been developed.
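The "within 5 Å and 5°" cell-compatibility test mentioned above can be illustrated with a minimal sketch. It assumes a plain parameter-by-parameter comparison of (a, b, c, α, β, γ); the function names `cells_compatible` and `fraction_compatible` are hypothetical, and the comparison cells in the usage lines are made up for illustration. The screening actually used for the archive is not described here, and a production version would probably also need to consider alternative cell settings.

```python
from typing import Iterable, Sequence

Cell = Sequence[float]  # (a, b, c, alpha, beta, gamma) in Angstrom and degrees


def cells_compatible(cell_a: Cell, cell_b: Cell,
                     length_tol: float = 5.0, angle_tol: float = 5.0) -> bool:
    """True if every edge differs by <= length_tol and every angle by <= angle_tol.

    Assumption: a naive element-wise comparison with no cell reduction
    or reindexing; the tolerances default to the 5 A / 5 degree window
    quoted in the text.
    """
    if len(cell_a) != 6 or len(cell_b) != 6:
        raise ValueError("a unit cell needs six parameters")
    lengths_ok = all(abs(x - y) <= length_tol
                     for x, y in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(x - y) <= angle_tol
                    for x, y in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok


def fraction_compatible(query: Cell, cells: Iterable[Cell],
                        length_tol: float = 5.0,
                        angle_tol: float = 5.0) -> float:
    """Fraction of `cells` that fall inside the tolerance window around `query`."""
    cells = list(cells)
    if not cells:
        return 0.0
    hits = sum(cells_compatible(query, c, length_tol, angle_tol) for c in cells)
    return hits / len(cells)


# Usage: the query cell is the one quoted in the text; the other
# cells are invented purely to exercise the functions.
query = (48, 62, 84, 90, 101, 104)
example_cells = [
    (50, 60, 80, 90, 100, 105),   # inside the window
    (60, 62, 84, 90, 101, 104),   # a differs by 12 A
    (48, 62, 84, 90, 90, 90),     # beta and gamma differ by more than 5 degrees
    (45, 65, 88, 92, 98, 100),    # inside the window
]
matching_fraction = fraction_compatible(query, example_cells)
```

Applied to the cells of deposited entries rather than invented ones, a count of this kind is what would show how many entries share a given tolerance window, which is why a cell match alone cannot identify a data set.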