At NewLimit, we are taking on age-related diseases by designing a new generation of medicines based on the delivery of transcription factors (TFs) to reprogram cells into healthier, functional states.

For background see:

NewLimit Operating Plan for Candidates

Reprogramming factor discovery campaigns

How big is the problem?

Transcription factors often work in combination to remodel the epigenome. Although we have the tools to simultaneously interrogate hundreds or thousands of TF combinations, we can only explore a fraction of the total hypothesis space.

As an example, imagine we want to explore the effects of reprogramming with up to 5 TFs at a time. Even if we narrow down to only 100 “interesting” TFs and hold all other experimental parameters constant (cell type, indication, delivery mechanism), we still face a combinatorial explosion of over 75 million possible perturbations! If we run 10 screens with 1,000 perturbations per screen and 100 cells per perturbation, we will have sequenced 1 million cells and yet observed only 0.013% of those 75 million possible experiments.

Approximate experiment space:

100 “interesting” TFs
Up to 5 TFs in a perturbation
75M possible TF combinations

Approximate difficulty of search:

1000 perturbations/screen
100 cells/TF combination
0.013% of TF combinations observed with 1M cells in 10 screens
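For concreteness, the arithmetic above can be reproduced in a few lines of Python (a back-of-the-envelope sketch; the exact count of combinations of up to 5 of 100 TFs is 79,375,495, i.e. "over 75 million"):

```python
from math import comb

# Number of ways to choose up to 5 TFs from 100 candidates
n_tfs = 100
max_combo_size = 5
total_combos = sum(comb(n_tfs, k) for k in range(1, max_combo_size + 1))

# Screening capacity: 10 screens x 1,000 perturbations x 100 cells each
n_screens = 10
perturbations_per_screen = 1_000
cells_per_perturbation = 100
observed_perturbations = n_screens * perturbations_per_screen

print(f"possible perturbations: {total_combos:,}")                          # 79,375,495
print(f"cells sequenced: {observed_perturbations * cells_per_perturbation:,}")  # 1,000,000
print(f"fraction of space observed: {observed_perturbations / total_combos:.3%}")  # 0.013%
```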

To tackle this experimental space, we’ve incorporated Predictive Modeling as a critical component of our discovery pipeline. We believe the most efficient way to discover new medicines is through iterative feedback between computational and experimental teams, leveraging each team’s strengths within our discovery engine. Here, we describe some of the challenges our Predictive Modeling team will face over the next few years.

What are our modeling challenges?

To efficiently discover new medicines, we need to build models that bridge data modalities, identify relevant assay readouts, and ultimately predict the outcome of new, unobserved experiments. All of these tasks require careful coordination between computational and experimental scientists. We outline several of these modeling challenges below.

Gene perturbation prediction

Even with high-throughput sequencing assays, we can observe only a small fraction of the total number of possible experiments. To efficiently explore the search space, we need to predict the outcome of new experiments and use those predictions to prioritize experiments that are likely to result in “hits” or whose outcomes are most uncertain. We certainly do not want to run experiments to which we already know the answer!
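As a minimal illustration of what “prioritize” could mean in practice (the scoring rule and names below are illustrative, not a description of our production approach), an upper-confidence-bound style acquisition ranks candidate perturbations by predicted effect plus model uncertainty, so both likely hits and poorly understood regions of TF space get screened:

```python
import numpy as np

def rank_candidates(pred_effect: np.ndarray,
                    pred_uncertainty: np.ndarray,
                    kappa: float = 1.0,
                    budget: int = 1000) -> np.ndarray:
    """Rank unobserved perturbations for the next screen.

    pred_effect      -- model's predicted reprogramming effect per candidate
    pred_uncertainty -- model's uncertainty per candidate (e.g. ensemble std)
    kappa            -- trade-off between exploiting likely hits and exploring
                        uncertain regions of TF-combination space
    budget           -- number of perturbations that fit in one screen
    """
    score = pred_effect + kappa * pred_uncertainty   # UCB-style acquisition
    return np.argsort(-score)[:budget]               # indices of top candidates

# Toy usage: 10,000 hypothetical candidate perturbations scored by a model.
effects = np.random.rand(10_000)
uncertainties = np.random.rand(10_000)
next_screen = rank_candidates(effects, uncertainties, kappa=0.5, budget=1000)
```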

The missing barcode problem

Nominally, the output of a sequencing experiment is a list of observed cells, their sequencing readouts, and the barcodes of the perturbations each cell received. Unfortunately, current methods detect only 50-90% of barcodes in individual cells. We may need to build models that are robust to these missing data, such as semi-supervised factor models, especially when we expect the number of perturbations to be high. However, we’re also excited by the opportunity to build barcode denoising models that take advantage of clever experimental designs where possible.

[Figure: Pooled screen layout]

Diagram of the “ground truth” data structure associated with pooled-perturbation screens. Rows correspond to cells; the columns of X correspond to multi-omic readouts (intentionally described without detail), and the columns of Y correspond to individual perturbations. Each element of Y indicates that a perturbation was not present (NP, gray), was in the screen but not observed in that cell (white), or was observed in that cell. At some rate, barcodes recorded as unobserved may actually be present.
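To make this structure concrete, here is a minimal simulation sketch (cell counts and detection rate are made up for illustration, not drawn from our screens) relating the ground-truth assignments, the dropped barcodes, and the three-state encoding of Y; the entries coded as “not observed” are exactly the labels a model should treat as missing rather than negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_perts = 500, 40
detection_rate = 0.7              # illustrative; real rates are roughly 50-90%

# Ground truth (unobservable in practice): each cell receives one perturbation.
truth = np.zeros((n_cells, n_perts), dtype=bool)
truth[np.arange(n_cells), rng.integers(n_perts, size=n_cells)] = True

# Observed barcode calls: each true barcode is detected independently.
observed = truth & (rng.random((n_cells, n_perts)) < detection_rate)

# Encode Y as in the diagram: 0 = not present, 1 = present but not observed
# in that cell, 2 = observed. In a real screen we only see `observed`.
Y = np.zeros((n_cells, n_perts), dtype=np.int8)
Y[truth & ~observed] = 1
Y[observed] = 2
```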

Generalizing to new TFs

Another challenge is representing our perturbations so that our models may generalize to new, unobserved perturbations. We need to build representation spaces where our models can compare different transcription factors and use these differences to estimate how new TFs may change gene expression in our cells. There are many ways to do this, including binding-motif features, properties of the TF proteins, known binding partners, and self-supervised representations such as ProtTrans. We’re excited to build representations that marry the training and intuition of our scientists with modern data-driven learning.
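As one illustrative sketch (the feature tables and helper functions below are hypothetical placeholders, not our actual featurization), a TF representation could concatenate hand-designed features with a protein language-model embedding, and a multi-TF perturbation could then be represented by combining the vectors of its members so that a model can score combinations containing previously unseen TFs:

```python
import numpy as np

def featurize_tf(tf_name: str,
                 motif_features: dict[str, np.ndarray],
                 protein_embeddings: dict[str, np.ndarray]) -> np.ndarray:
    """Build a fixed-length vector for one TF by concatenating hand-crafted
    features with a self-supervised protein embedding."""
    return np.concatenate([
        motif_features[tf_name],      # e.g. binding-motif composition
        protein_embeddings[tf_name],  # e.g. a ProtTrans-style embedding
    ])

def featurize_combo(tf_names: list[str], **feature_tables) -> np.ndarray:
    """Represent a multi-TF perturbation as the sum of its member vectors
    (one simple choice among many possible set-aggregation schemes)."""
    return np.sum([featurize_tf(t, **feature_tables) for t in tf_names], axis=0)
```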

Leveraging external datasets

Building models that leverage external datasets will be useful, given how little of the search space we can explore experimentally. Currently, there exist few large-scale datasets focused on TF screens. However, we can leverage datasets that focus on similar challenges, such as gene perturbation screens (e.g., Norman et al. 2019, Belk et al. 2022, and many others), which may be useful as model demonstrators or for transfer-learning tasks.
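As a hedged sketch of what transfer learning could look like here (the architecture, dimensions, and two-stage recipe are illustrative, not a description of our models), one might pretrain a perturbation-response model on external screens and then fine-tune it on in-house TF data:

```python
import torch
from torch import nn

class PerturbationModel(nn.Module):
    """Predict an expression response from a perturbation representation."""
    def __init__(self, pert_dim: int, n_genes: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pert_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_genes)

    def forward(self, pert_features: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(pert_features))

# Stage 1: pretrain on an external perturbation dataset (e.g. a public CRISPR screen).
model = PerturbationModel(pert_dim=128, n_genes=2000)
# ... pretraining loop on external data ...

# Stage 2: keep the pretrained encoder, re-initialize the output head, and
# fine-tune on in-house TF reprogramming screens.
model.head = nn.Linear(256, 2000)
# ... fine-tuning loop on internal data ...
```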

Iterative design/modeling uncertainty