A look at why we still have no cures, few treatments and little profound knowledge
There are presently hundreds of papers published weekly in the areas of genomics, in silico modeling and drug discovery. This month's rant will briefly explore the current state of affairs, probe for weaknesses, especially non-informative "discovery," and offer a bit of guidance. For our physical science readers, I present a bit of background and ask the molecular biologists to bear with me.
Background
From the earliest discoveries concerning nucleic acids and their biochemistry, through the Watson-Crick model of the DNA molecule and the translation of genes into proteins, it was apparent that genes were responsible for much of what went on in living cells. From there, it was a natural step to hypothesize that, if genes dictated proteins, and these proteins initiated, accelerated or inhibited biological processes, then it should be possible to alter the blueprints to make more or less of these proteins, or to stop making them altogether, and thus ultimately treat or even (gasp!) cure a disease.
This little trick required knowledge of genetic boundaries, gene-to-protein mapping, and the state of regulation of these genes. That knowledge was not available until the late 1980s and 1990s, when advances in molecular biology produced gene chips (called microarrays) and high-throughput screening techniques. We can now (we think) determine the state of regulation of genes in the "resting" state as well as under different conditions of drug dosage, time and physical insult. We know something about protein production and can reasonably model the processes to determine patterns of regulation in these states. Yet, there are still no cures, few treatments and little profound knowledge despite the onslaught of publications. What gives? (There's obviously a bit we don't yet know, and this ignorance is deadly.)
Current state
Just as there is no scarcity of papers modeling these phenomena, there is no shortage of opinions as to where to go from here. The most obvious of these deal with systems biology and bioinformatics, urging researchers to overlay ever more sophisticated software algorithms that will tie the genomics to the proteomics to the pathways to the biological systems, and relate all of this to an endpoint such as a disease state. Now, let's step back for a moment and consider the questions that we have been asking and those that we should be asking.
From a genomic standpoint, the question is: what genes are 'lighting up' with this drug or that disease state? From a proteomics standpoint, we ask: what proteins are being produced in significantly greater or lesser quantities? Similarly, for pathways and systems: what is being affected, and how? Now, as to what we really need to know…
I translate these questions into the physical science realm by imagining the ultimate engine for change as an old-fashioned electronic DIP switch with hundreds (thousands?) of individual switches, each with three settings: high, resting and low. I then cast the problem as one of genomics, as the genes will ultimately determine the other components.
We may now ask: is there a unique, minimally sufficient set of genes that generates the physiologic result of interest (simple action, disease state, etc.)? If there is, what are the elements (presumably genes) of this set, and what are their switch settings (high, resting or low)? If the labels (genes) for the switches could be determined, it would be a straightforward, though not simple, exercise to determine their settings for a given physiologic state.
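To make the switch analogy concrete, here is a minimal sketch in Python of what "minimally sufficient set" means in this picture: the smallest subset of switches whose joint settings never map to two different physiologic states. The gene names and observations are invented for illustration; nothing here is real biology.

```python
from itertools import combinations

# Hypothetical observations: each "switch" (gene) is high, resting or low,
# and each profile is paired with an observed physiologic state.
observations = [
    ({"geneA": "high",    "geneB": "low",     "geneC": "resting"}, "disease"),
    ({"geneA": "high",    "geneB": "low",     "geneC": "high"},    "disease"),
    ({"geneA": "resting", "geneB": "resting", "geneC": "high"},    "healthy"),
    ({"geneA": "low",     "geneB": "resting", "geneC": "resting"}, "healthy"),
]

def is_sufficient(genes, data):
    """A gene subset is sufficient if its joint settings never
    correspond to two different physiologic states."""
    seen = {}
    for settings, state in data:
        key = tuple(settings[g] for g in genes)
        if seen.setdefault(key, state) != state:
            return False
    return True

def minimal_sufficient_sets(data):
    all_genes = sorted(data[0][0])
    for size in range(1, len(all_genes) + 1):
        hits = [set(c) for c in combinations(all_genes, size)
                if is_sufficient(c, data)]
        if hits:
            return hits  # smallest size that works; note it may not be unique
    return []

print(minimal_sufficient_sets(observations))  # here: [{'geneA'}, {'geneB'}]
```

Note that even this toy example returns two equally small sufficient sets, which is exactly the uniqueness question raised above; and the brute-force search over subsets blows up exponentially, which is why determining the gene labels in the first place is the hard part.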
The hard part, of course, is determining the genes. Because the biological interactions through proteins, pathways and systems are complex, and involve upstream and downstream feedback and feed-forward loops, a whole-genome scan is counterproductive: many non-relevant genes may light up for any given treatment or disease.
Toward a solution
To sidestep the problem, a bit of reverse engineering is required. By this, I specifically refer to mining each pathway and system for known genetic elements pertinent to the state under study, then examining the known interactions in each loop. By doing this across multiple pathways and systems, we can narrow the set of candidate genetic elements to examine in a microarray experiment, as sketched below.
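One way to picture this filtering, under loose assumptions (the pathway names and memberships are invented; in practice they would come from curated sources such as KEGG or Reactome):

```python
from collections import Counter

# Hypothetical pathway annotations: pathway -> genes known to act in it.
pathway_genes = {
    "inflammation": {"geneA", "geneB", "geneC", "geneD"},
    "apoptosis":    {"geneB", "geneC", "geneE"},
    "cell_cycle":   {"geneC", "geneD", "geneF"},
}

# Pathways believed pertinent to the state under study; a gene
# implicated in at least two of them survives the filter.
relevant = ["inflammation", "apoptosis", "cell_cycle"]

votes = Counter(g for p in relevant for g in pathway_genes[p])
candidates = sorted(g for g, n in votes.items() if n >= 2)
print(candidates)  # ['geneB', 'geneC', 'geneD']
```

The voting threshold here is an arbitrary stand-in for whatever evidence-weighting the real databases support; the point is only that cross-referencing pathways shrinks the candidate list before any array is printed.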
We start with a set of, say, 20 to 6,000 genes (the lower bound here is one commonly used in practice, and the upper bound has been proffered to occur in certain physiologic states in mice). The set is then tweaked by mathematical algorithms, of which there is no shortage; multivariate nonlinear modeling and data mining are especially useful for this type of analysis (a toy sketch follows this list). This is applied cross-system, i.e., across the genomic, proteomic, pathway and system databases. Unfortunately, we are not yet there in:
1. completeness of the databases
2. fine-tuning of the mathematics
3. designing the proper experiments
We get ever closer on the data and math, but still fail with the integration.
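For a sense of what the multivariate nonlinear step might look like, here is a sketch on synthetic data. A random forest is only one of many suitable nonlinear methods, and the sample counts, gene counts and library choices (numpy and scikit-learn) are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500               # a microarray-scale toy problem
X = rng.normal(size=(n_samples, n_genes))  # synthetic expression levels

# Pretend two genes actually drive the state; the rest are noise.
y = (X[:, 3] + X[:, 42] > 0).astype(int)

model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X, y)

# Rank genes by importance; ideally indices 3 and 42 float to the top.
top = np.argsort(model.feature_importances_)[::-1][:5]
print("top candidate genes (indices):", top)
```

The catch, and the point of the list above, is that this only works when the expression matrix, the protein measurements and the pathway annotations can all be joined on the same identifiers, and that integration is precisely where we still fail.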
John Wass is a statistician based in Chicago, IL. He may be reached at [email protected].