Feature article

Crystallography and proteomics

The sequencing of the genomes of over two dozen species, including the human, has opened up research opportunities on a grander scale than has ever been possible before. It is clear that we know very little about the functions of more than 50% of the gene products whose existence is predicted by the sequences being determined with mind-numbing rapidity. Many proposals have been made concerning the best way to codify, analyze and use this information. There is general agreement that it is worthwhile to determine what all of the gene products do, how they do it and how we can control these processes. Potential economic and health benefits arising from this new knowledge have generated heated debate and high anxiety over many aspects of the process, including the possibility that, by patenting genes, the medical-industrial complex could hold the world hostage. The exorbitant investment required to determine what all of these gene products do has led to an equally divisive debate about which techniques (X-ray crystallography, NMR spectroscopy, mass spectrometry, computer modeling and other computational methods) will be most effective, economical and efficient in providing the answers.

It has been proposed that sequence analysis software will make it possible to catalog all protein structures into a finite number of families that have a common fold, that structure prediction programs can be used to predict the three-dimensional shapes of the various folds, and that predicted shapes will permit accurate prediction of function. There is ample evidence to suggest that this scenario, while appealing, is greatly oversimplified. We know that large families of proteins can have equivalent folds without so much as one fully conserved amino acid, that proteins having the same fold can have entirely different functions, and that many proteins have a number of significantly different forms that may be functionally important.

X-ray crystallographers have participated in the development of schemes for high-throughput determination of a core of structures expected to map out the full range of folds. Ironically, the macromolecules about which the least is currently known, integral membrane proteins, are being excluded from most proposals for high-throughput structure determination because they present challenges that are incompatible with current technology and its facile automation.

Ironically, X-ray structures that provide atomic-level resolution seem to be considered passé. If a crystal diffracts to better than 3 Å resolution, the structure of the molecule crystallized appears to be ineligible for publication in Science, Nature or other journals that define what is currently fashionable in science.

The following quotations from recent articles on structural proteomics and beyond capture some of the range and flavor of this important debate.

  • The speed of acquiring data is now exceeding our ability to comprehend it and put it into the proper biological context. T. Hesman, Science News, 157, April 29, 2000
  • The fact gathering tendency is apparent not just in structural genomics, but also in functional genomics (to identify the role of each gene in the genome) and proteomics (with similar aim for each protein in the cell or organism). Evidently, there are enough facts to keep biologists busy gathering them for decades. So when will they have time to think? Nature 403, Jan. 27, 2000
  • The US National Institute of General Medical Sciences (NIGMS) will launch a structural genomics initiative. After an initial five-year pilot stage, the NIGMS programme aims to generate 10,000 structures from as many protein families as possible. Some groups intend to characterize every protein structure within small organisms such as the bacterium Mycoplasma genitalium, while others may try to obtain as many structures as possible involved in a single cellular process, such as cell division. Paul Smaglik, Nature 403, Feb 17, 2000
  • So far, 4,473 species of bacteria in 905 genera have been validly described; of these, just 29 species in 21 genera have been fully sequenced. The assumption that we now have an essentially complete knowledge of microbial metabolism, and need only skim new genomes to compare to existing ones, risks missing the novelty that has been, and will continue to be, present in each new microbial genome. Julian Parkhill, Nature Biotechnol. 18, May 18, 2000
  • Only 265-350 of the 480 protein-coding genes of M. genitalium (which has the smallest known genome of self-replicating organisms), including 100 genes of unknown function, are essential for growth under laboratory conditions. Nature Biotechnol. 17, 207, 1999
  • Combinatorial chemistry has caused a great cultural change: the redefining of the scientific method itself. For hundreds, if not thousands, of years, scientists have been taught to execute experiments one at a time with very careful control of parameters. Combinatorial chemistry and high throughput methods in general suggest that this fundamental concept can be successfully challenged and expanded. Peter E. Cohen, C&EN, May 15, 2000. Who needs careful control when there is money to be made?
  • Molecular analysis is clearly required to understand higher levels of biological organization, but the converse is also true: the biology of the molecules of life can only be understood in the context of functioning cells and organisms. F.C. Kafatos, Science 287, Feb 25, 2000
  • Pharmacogenomics requires an understanding of the apparent genetic 'disorder' in any organism's genome, of genotype-phenotype mapping, of gene-gene interactions, of intraspecific genetic variability, and of self-organizational processes, rather than endless lists of DNA bases. Sol Hadden, Nature 404, April 6, 2000
  • Features of the very same system depend on the scale of observation. This precludes the extrapolation of knowledge at one level to higher levels where the 'complexity' increases. Understanding why this is so, and determining how to formalize the problem of emergent features and multiscale description is one of the goals of the science of complex systems. Sui Huang, Nature Biotechnol., 18, May 18, 2000
  • Cambridge Healthtech Institute's Beyond Genome 2000 Conference, June 19-23, 2000, covers Bioinformatics and Genome Research, focusing on the computational advances necessary to comprehend the vast amount of information gathered through the Human Genome Project; In Silico Biology, which explores tools being developed to translate raw data into workable models that will provide guidance for target selection; and Proteomics, which will provide in-depth coverage of recent developments in the field of high-throughput protein expression analysis and its impact on diagnostic and therapeutic product development (from a meeting announcement). And all in five days!