When are Foundation Models Not the Answer?

Genomic large language models can be impressive, but we need ‘on-device’ data-efficient deep learning strategies to tackle all human cell types.

November 18, 2024

Seeing the light at the end of an LLM tunnel

Apple recently rolled out Apple Intelligence, their version of deep learning tools designed to be “AI for the rest of us.” Generative AI is intended to help us become more efficient by drafting emails we’re too busy to write, summarizing documents we don’t want to read, and turning us into passable artists and graphic designers, no matter how limited our skills. Apple’s release is the latest entry in the fierce competition among tech companies to transform the impressive capabilities of generative AI—such as ChatGPT and Imagen—into must-have consumer features integrated into everyday life.

However, the challenge with running generative AI on a phone is that these enormous models cannot actually operate on a phone’s hardware alone. State-of-the-art large language models require vast amounts of data for training and massive computing power to function. One solution is to rely on remote servers to handle the heavy lifting, but this approach comes with limitations. Privacy risks are an obvious concern, but there’s also the issue of personalization. Ideally, generative AI on a phone would continue learning from user interactions to optimize its output over time. Challenges like these are why “on-device” learning is central to Apple’s AI implementation. Apple employs various strategies to achieve this, including fine-tuned “adapters,” which are “small collections of model weights that are overlaid onto the common base foundation model.” According to Apple, these adapters can be dynamically loaded and swapped, allowing the foundation model to specialize on-the-fly for specific tasks.
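To make the adapter idea concrete, here is a minimal sketch of a shared base layer with small, swappable sets of overlaid weights. It is an illustration only, not Apple's implementation: the low-rank form of the adapter (a "down" and an "up" matrix) is an assumption borrowed from common adapter designs, and the names `AdaptedLayer`, `add_adapter`, and `activate` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AdaptedLayer:
    """A frozen base weight matrix plus swappable adapters (hypothetical sketch)."""

    def __init__(self, base_weights):
        self.base_weights = base_weights  # shared across tasks, never modified
        self.adapters = {}                # task name -> (down, up) matrices
        self.active = None                # name of the adapter currently overlaid

    def add_adapter(self, name, down, up):
        # Assumed low-rank form: down is rank x n_in, up is n_out x rank,
        # so an adapter holds far fewer numbers than the base matrix.
        self.adapters[name] = (down, up)

    def activate(self, name):
        # Switching tasks only changes which small adapter is overlaid.
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def forward(self, x):
        y = matvec(self.base_weights, x)
        if self.active is not None:
            down, up = self.adapters[self.active]
            delta = matvec(up, matvec(down, x))  # task-specific correction
            y = [a + b for a, b in zip(y, delta)]
        return y


# Toy usage: one base layer, one task adapter loaded on demand.
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = AdaptedLayer(base)
layer.add_adapter("summarize", down=[[1.0, 1.0, 1.0]], up=[[0.5], [0.0]])

x = [1.0, 2.0, 3.0]
base_out = layer.forward(x)      # base model only
layer.activate("summarize")
adapted_out = layer.forward(x)   # base model plus overlaid adapter
```

The design point is that the base weights stay untouched while each task contributes only a small number of extra parameters, which is what makes loading and swapping cheap.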

We always need to train models of cell-type activities with less data.

In a similar vein, we have been considering the possibilities of genomic “on-device learning.” While the analogy to phones may not be perfect, there is a pressing need for data-efficient deep learning in genomics that large foundation models alone may not be able to address. AlphaFold, while Nobel Prize-level impressive, doesn’t face one of the key challenges that make modeling the non-coding genome so difficult: cell-type specificity.

Most functional DNA elements outside of protein-coding sequences operate in highly cell type-specific ways. For these sequences, there is no context-free functional target for the model to predict. While the structure of a G protein-coupled receptor remains consistent across cell types, the Rhodopsin promoter, for example, is entirely inactive outside of rod photoreceptors.

Foundation models in genomics are not yet—and may never be—adequate for predicting the activity of non-coding DNA across all human cell types. This is partly because the necessary data may never be available at the required scale. Much of the current training data in this field comes from cell lines, which are at best mediocre proxies for endogenous cell types within whole tissues. For disease-relevant cell types, some of which are rare in the human body, will we ever gather enough data to fully understand the sequence grammar of regulatory DNA?

We believe this is unlikely for two key reasons:

  1. Single-cell technologies, which can assay rare populations, may not generate all the necessary measurements from these populations. Additionally, the epigenomic readouts we obtain are often proxies for function rather than direct measurements of it.
  2. For any given cell type, there may not be enough examples in the genome to train a foundation model effectively. For instance, the 40,000 open chromatin regions found in one cell type are too few to train a robust foundation model.

This underscores the need to explore a genomic equivalent of on-device learning within a small data regime. If we aim to create models capable of accurately predicting the effects of non-coding mutations or generating specific regulatory activity in rare, disease-affected cell types, such an approach may be essential.

The Team @ MGI