A new massively parallel binding assay finds binding rules but they’re not always obeyed
August 6, 2024
One of the important discoveries of 21st century biology is that only a small portion of our genome encodes genes, while much more of it encodes regulatory information that controls when and where genes are expressed. Unfortunately, unlike the genetic code that describes how genes are encoded (which was cracked in the 1960s), the code for regulatory DNA has been much harder to work out. That’s a problem, because genetic variants in regulatory DNA have important effects on human health.
A paper published in Science in October of 2022 shows how these regulatory variants can affect our risk for disease, and why the regulatory code is so hard to figure out. With an impressively thorough set of experiments, the paper’s authors show how a single-nucleotide variant in a regulatory DNA sequence leads to a six-fold increase in the risk for a particular type of low-grade glioma, a slow-growing tumor that eventually develops into an aggressive form of brain cancer. The variant disrupts a binding site for a regulatory protein called OCT2, and thereby increases the expression of the Myc gene, which in turn has an impact on tumor growth.
It’s a nice piece of work, but if you asked genome scientists why disrupting this particular binding site changes Myc expression, they couldn’t tell you. The OCT2 binding site is the short sequence TCTGCAAT, and here is the problem: there are millions of these short OCT2 binding sequence motifs in the genome, and yet only a small fraction (1% or less) are actually bound by OCT2. What makes the OCT2-bound sites different from the vast excess of identical but unbound sites? The problem isn’t limited to OCT2; it holds for basically all of the regulatory proteins that function by recognizing short DNA sequences. Answering this question is what it means to solve the regulatory code.
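To make the problem concrete, here is a minimal Python sketch of what a naive motif scan does. It assumes exact string matching on both strands, which is a simplification (real scans usually score matches against a weight matrix), and the toy sequence and function names are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_motif_sites(sequence, motif):
    """List (position, strand) for every exact motif match on either strand."""
    sequence = sequence.upper()
    motif = motif.upper()
    motif_rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if window == motif:
            hits.append((i, "+"))
        elif window == motif_rc:
            hits.append((i, "-"))
    return hits


# Toy sequence (an assumption, not real genomic DNA): one forward-strand
# copy of the OCT2 motif and one reverse-strand copy.
toy_sequence = "GG" + "TCTGCAAT" + "CC" + "ATTGCAGA" + "GG"
sites = find_motif_sites(toy_sequence, "TCTGCAAT")
print(sites)
```

A scan like this reports every match the same way, so sequence alone cannot say which of the hits is actually bound in a cell.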
In the latest issue of Molecular Cell, a paper (free preprint here) introduces a clever new technology to get at the question of regulatory protein binding sites. The technology, called ChIP-ISO, is a variation on a common theme in functional genomics: rather than just test natural genomic sequences, it is much more effective to develop assays that can perturb sequences at scale. Such functional perturbation data is great for training models and for identifying functional differences among sequences, such as bound versus unbound transcription factor binding motifs.
ChIP-ISO, developed by Lu Bai’s lab at Penn State, is a way to test regulatory protein binding to thousands of binding site perturbations in one experiment. It involves (1) synthesizing a library of perturbed DNA sequences, (2) integrating that library into a population of cells at one location in the genome, and (3) measuring protein binding to those sequences via immunoprecipitation and sequencing (ChIP-seq).
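Step 1 can be pictured with a short sketch. This is not the Bai lab's actual library design, which isn't described here; it is a hypothetical example that takes a toy template, knocks out every combination of annotated motifs, and keeps the sequence length fixed. The template, the coordinates, and the `N` filler are all assumptions for illustration.

```python
from itertools import combinations


def knockout_library(template, motifs, filler="N"):
    """Build one variant per combination of motif knockouts.

    motifs maps a motif name to its (start, end) half-open coordinates in
    the template. Knocked-out motifs are overwritten with the filler base so
    that the spacing of everything else is unchanged.
    """
    library = {}
    names = sorted(motifs)
    for k in range(len(names) + 1):
        for combo in combinations(names, k):
            bases = list(template)
            for name in combo:
                start, end = motifs[name]
                bases[start:end] = filler * (end - start)
            label = "+".join(combo) if combo else "wild_type"
            library[label] = "".join(bases)
    return library


# Toy template (an assumption, not the real enhancer): a FOXA1-like core
# motif and an AP-1-like motif separated by short spacers.
template = "AAAA" + "TGTTTAC" + "CCC" + "TGACTCA" + "GGGG"
motifs = {"FOXA1": (4, 11), "AP1": (14, 21)}

library = knockout_library(template, motifs)
for label, seq in library.items():
    print(f"{label:12s} {seq}")
```

In a real design the `N` placeholder would be swapped for a scrambled or neutral sequence that can actually be synthesized, and the number of variants grows quickly as more motifs are combined.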
With ChIP-ISO, the Bai lab analyzed binding of the protein FOXA1 to a regulatory DNA sequence containing three FOXA1 binding motifs (orange bars in the figure below). In the same sequence are a handful of other binding motifs for different regulatory proteins. Looking at a sequence like this, we have no good models of the regulatory code that tell us ahead of time which of these binding motifs is important. So what do the data say?
Transcription factor binding sites in the CCND1 enhancer analyzed by Xu et al.
ChIP-ISO showed that FOXA1 binding site 3 has the biggest effect when deleted, and that deleting binding sites for two other factors, CEBPB (green) and AP-1 (purple), also affects FOXA1 binding. In other words, FOXA1, CEBPB, and AP-1 bind DNA cooperatively when their binding motifs are close together. But if cooperative binding is the rule, why isn’t there cooperative binding at FOXA1 sites 1 and 2, which are close to other binding motifs (light gray bars)? Nobody knows. Maybe the proteins that bind these sites aren’t expressed, or maybe those proteins don’t bind cooperatively with FOXA1. And what about the thousands of other sites in the genome where FOXA1 binds, but AP-1 and CEBPB do not? Or the millions of unbound FOXA1 binding motifs?
FOXA1 binding motif from HOCOMOCO
We don’t have good models that answer these questions, just a set of loose rules that seem to apply in some cases and not others. Our standard models for regulatory protein binding specificities are short motifs like the one above, and we have very little idea how they combine to create functional regulatory DNA elements. And if we can’t say why a regulatory DNA element works, we certainly can’t easily say why this genetic variant breaks it and not that one.
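For readers who want to see what a motif model actually computes, here is a small sketch of a position weight matrix (PWM) built from a handful of made-up aligned sites. The sites, pseudocount, and uniform background are illustrative assumptions, not values from HOCOMOCO.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Turn equal-length aligned sites into per-position log2-odds scores."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the log-odds score of each base in a window the width of the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, position) of the highest-scoring window in a sequence."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Made-up aligned sites resembling a forkhead-like core motif (assumption).
toy_sites = ["TGTTTAC", "TGTTTGC", "TATTTAC", "TGTTTAT"]
pwm = build_pwm(toy_sites)

top_score, top_position = best_hit(pwm, "AAAA" + "TGTTTAC" + "CCC")
print(top_position, round(top_score, 2))
```

The score depends only on the bases inside the window, which is exactly why this kind of model gives the same answer for a bound site and an identical unbound one.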
But with scalable methods like ChIP-ISO, we have a chance of figuring this out, or at least of building deep learning models that do. Right now ChIP-ISO is limited to thousands of perturbations per experiment, but if we scale up to millions, the models might get there.