Transcription factors are proteins that
bind to DNA to promote or suppress protein production. Since almost all
diseases involve disruption of the protein-production process, transcription
factors are promising biological targets for drugs—and could even serve as
drugs themselves.
But there are likely thousands of
transcription factors in humans, each of which might bind to the genome at tens
of thousands of different locations. Previously, there was no cost-effective
way to figure out exactly where transcription factors bind—which exact DNA
letters in a given stretch of genome each of them attaches to. Biologists thus
relied on approximate methods to identify the general vicinity of binding
sites.
In the August issue of PLoS
Computational Biology, a team of researchers from the Computer Science and
Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT)
presented a new analytic technique that identifies binding sites with much
greater accuracy. As a consequence, the researchers were able to infer
previously unknown relationships among transcription factors, which could
provide clues to the roles they play in biological processes.
The researchers initially tested their
technique on two sets of experimental data, which they say represent both “relatively easy and difficult cases” for analysis. In the easy case, their new
technique identified the precise locations at which transcription factors bound
to the genome with more than 90% accuracy, while the accuracy of existing
techniques was about 10% or less. In the difficult case, the new method was
more than 55% accurate, compared to about 5% for existing techniques.
The leading method for determining how
transcription factors behave in living cells is to chop up the DNA from
millions of cells and use protein antibodies to extract the fragments that have
a particular transcription factor attached to them. While the DNA sequence that
a transcription factor binds to consists of only about six to 12 DNA letters,
the fragment extracted by the antibody could be a couple of hundred letters
long. Sequencing the fragments can determine where in the genome they came
from, but it offers little information about where on the fragment the transcription
factor is attached.
Feedback loop
David Gifford, a professor of electrical engineering and computer science and
director of the Computational Genomics Group, his graduate student Yuchun Guo,
and Shaun Mahony, a research scientist in the group, developed a new algorithm
for analyzing millions of experimentally identified fragments and inferring the
precise locations at which transcription factors bind to them.
Previous methods would compare the
fragments to try to identify sequences they had in common. But that’s just the
first step in the MIT researchers’ method. They use that initial, rough
guess about common sequences to predict where, throughout the entire genome,
the transcription factor would bind, and compare those predictions to the
experimental data on where the factor actually did bind. On the basis of that
comparison, they refine their estimate of the specific binding sequence
and repeat the whole process.
“We iterate between estimating where
proteins bind and using that information to discover the sequences that they
bind to,” Gifford says, “and then we go backward and use the sequences they
bind to to improve the estimate of where they’re binding.”
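The back-and-forth that Gifford describes can be illustrated with a toy example. The Python sketch below is not the researchers' algorithm; it is a minimal illustration of the general idea of alternating between "where does the factor bind?" and "what sequence does it bind to?". Every name in it (`refine`, `build_pwm`, `best_position`, the motif width `k`) is hypothetical, and it assumes that each fragment contains exactly one binding site and that background DNA is uniformly random, assumptions that real data would not satisfy.

```python
from collections import Counter
import math

BASES = "ACGT"

def most_common_kmer(fragments, k):
    """Rough first guess: the k-letter sequence seen most often across fragments."""
    counts = Counter(
        frag[i:i + k] for frag in fragments for i in range(len(frag) - k + 1)
    )
    return counts.most_common(1)[0][0]

def build_pwm(kmers, pseudocount=1.0):
    """Position weight matrix: log-odds of each base per position
    versus a uniform background (an assumption of this sketch)."""
    total = len(kmers) + 4 * pseudocount
    pwm = []
    for pos in range(len(kmers[0])):
        column = Counter(kmer[pos] for kmer in kmers)
        pwm.append(
            {b: math.log((column[b] + pseudocount) / total / 0.25) for b in BASES}
        )
    return pwm

def score(pwm, kmer):
    """Sum of per-position log-odds for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, kmer))

def best_position(pwm, frag):
    """Predicted binding location: the highest-scoring window in the fragment."""
    k = len(pwm)
    return max(range(len(frag) - k + 1), key=lambda i: score(pwm, frag[i:i + k]))

def refine(fragments, k, max_rounds=20):
    """Alternate between locating sites and re-estimating the bound sequence."""
    pwm = build_pwm([most_common_kmer(fragments, k)])
    positions = None
    for _ in range(max_rounds):
        # Step 1: use the current sequence model to estimate where binding occurs.
        new_positions = [best_position(pwm, f) for f in fragments]
        if new_positions == positions:
            break  # estimates stopped changing
        positions = new_positions
        # Step 2: use those locations to re-estimate the bound sequence.
        pwm = build_pwm([f[p:p + k] for f, p in zip(fragments, positions)])
    return positions, pwm
```

Calling `refine(fragments, k)` on a list of DNA strings returns one estimated binding position per fragment, along with the final sequence model. The sketch stops when the positions no longer change between rounds, which is one simple way to decide that the loop has settled.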
But determining transcription factors’
precise binding sites is just the first step in understanding their role in
protein production. For a single transcription factor, that role can vary
according to both the type of cell in which it’s active and its interactions
with other transcription factors. It’s the second of these elements that the
MIT researchers are shedding light on, by identifying spatial relationships
between binding locations that imply a functional relationship between the
corresponding transcription factors.
The genome’s language
That approach, Gifford says, is similar to the statistical analysis of
language, which artificial-intelligence researchers have used to build
language-interpreting computer systems. Indeed, Gifford says, the sequences
that transcription factors bind to can be thought of as words and their spacing
as the “syntax” of the genome.
“If you did an analysis of the English
language, you would find a lot of relationships between words that were highly
significant, because they co-occur,” Gifford says. “You would not necessarily
understand from the analysis what their meaning was, but you would know that
they were highly significant and did carry meaning.” The same is true of the
DNA “words” that constitute the transcription-factor binding sites. “If you
look at a null model, which would posit random occurrence of words, then you
ask how unlikely it is that you would see these things together,” Gifford says. “And we’re testing everything against a random model.”
The MIT researchers’ analysis identified a
handful of relationships between transcription factors that were already known,
but it also identified 390 more statistically significant relationships between
binding sites. Some of those may be red herrings, but many of them could turn out
to indicate previously unsuspected relationships between transcription factors,
which could help biologists unravel the mysteries of gene expression.