This work extends the popular scGPT model by injecting language embeddings from multiple knowledge sources (NCBI gene cards, UniProt protein summaries, and GO annotations) to complement experimental scRNA-seq data for predicting single-cell perturbations. The authors show that while language embeddings alone do not outperform traditional biological representations, they add significant complementary value, especially for single-gene perturbations (via subcellular localization) and two-gene perturbations (via protein summaries).
See the detailed discussion below for methodology, performance metrics, and limitations.
This study introduces the scGenePT model, which augments the scGPT foundation model by incorporating language embeddings derived from extensive gene-related databases. By integrating textual data from sources such as NCBI gene card summaries, UniProtKB protein summaries, and GO annotations, the authors seek to provide complementary prior knowledge to traditional experimental scRNA-seq data for predicting the outcomes of genetic perturbations in single cells.
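As a rough illustration of what "incorporating language embeddings" can mean in practice, the sketch below projects a text-derived gene embedding to the model dimension and adds it to the gene token embedding. This is only a minimal sketch under stated assumptions: the summary above does not specify the fusion mechanism, so the additive scheme, the function names `project` and `fuse`, and the toy dimensions and weights are all hypothetical and are not taken from the scGenePT code.

```python
def project(vec, weights):
    """Apply a linear map; `weights` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def fuse(gene_emb, lang_emb, weights):
    """Add the projected language embedding to the gene token embedding.

    Assumption: fusion is element-wise addition after projection.
    Other schemes (concatenation, gating) are equally conceivable.
    """
    projected = project(lang_emb, weights)
    if len(projected) != len(gene_emb):
        raise ValueError("projected language embedding must match model dimension")
    return [g + p for g, p in zip(gene_emb, projected)]


# Toy example: a 2-dimensional model and a 3-dimensional text embedding.
gene_emb = [0.5, -1.0]            # hypothetical learned gene token embedding
lang_emb = [1.0, 2.0, 3.0]        # hypothetical embedding of a gene summary
weights = [[0.1, 0.0, 0.0],       # hypothetical projection matrix (2 x 3)
           [0.0, 0.5, 0.0]]
fused = fuse(gene_emb, lang_emb, weights)
```

In a real model the projection would be a trainable layer and the text embeddings would come from a pretrained language model applied to the NCBI, UniProt, or GO text.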
Empirically, scGenePT generally matched or improved on the scGPT baseline and other state-of-the-art models such as GEARS on both the Norman and Adamson datasets. For instance, on the Adamson single-gene perturbation dataset, scGenePT incorporating NCBI+UniProt embeddings achieved a Pearson correlation of 0.781 ± 0.02 and an MSE of 0.133 ± 0.01 on all genes. The correlation is within the reported spread of scGPT’s 0.782 ± 0.02, so the gain on this dataset lies in the lower MSE rather than in correlation, indicating predicted magnitudes closer to the observed values. On two-gene perturbations in the more challenging Norman dataset, the scGenePT GO-all and NCBI+UniProt models reached Pearson correlations around 0.698 with improved MSE compared to baselines. These gains were particularly notable in combinatorial perturbations involving unseen genes, underscoring scGenePT’s enhanced ability to generalize beyond training perturbations.
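The two metrics quoted above can behave differently, which is why a model can tie on Pearson correlation yet improve on MSE. The following minimal sketch, in plain Python with made-up numbers, shows both metrics on toy expression vectors. It assumes the metrics are computed directly on predicted versus observed values; the summary does not say whether the paper computes them on raw expression or on changes relative to control.

```python
import math


def mse(pred, true):
    """Mean squared error between predicted and observed values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(true)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    if sd_p == 0 or sd_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_p * sd_t)


# Toy data (not from the paper): both predictions track the observed
# pattern perfectly, but one is offset by a constant.
observed = [1.0, 2.0, 3.0, 4.0]
pred_shifted = [2.0, 3.0, 4.0, 5.0]
pred_exact = [1.0, 2.0, 3.0, 4.0]

r_shifted, r_exact = pearson(pred_shifted, observed), pearson(pred_exact, observed)
mse_shifted, mse_exact = mse(pred_shifted, observed), mse(pred_exact, observed)
```

Because Pearson correlation is invariant to shifting and positive rescaling of the predictions, both toy predictions correlate equally well with the observations, while only MSE penalizes the offset.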
Beyond metrics, interpretability analyses revealed that language-informed models produce more biologically plausible predictions. For example, the model better captured the directionality and magnitude of perturbation effects for genes like POU3F2 in the Norman dataset, accurately predicting downregulation of downstream targets such as FABP5, HSP90AB1, and NPM1, consistent with known biology. This reflects how language embeddings from curated sources provide meaningful prior biological context that guides model predictions in complex perturbations, especially when experimental training data are limited or absent.
Nonetheless, the authors prudently discuss limitations including potential overfitting risks associated with dataset-specific language embeddings and the challenge of generalizing to broader perturbation and cell type contexts. They advocate for the systematic curation of auxiliary text sources, investigation of alternative embedding alignment methods, and extension to diverse datasets to solidify findings.
This study provides a meaningful contribution by rigorously demonstrating that the integration of language-derived knowledge with experimental single-cell data enhances predictive performance in single-cell perturbation models. The approach’s solid experimental validation, combined with interpretability analyses, reveals the tangible biological benefits of multimodal data fusion. While the conceptual novelty of incorporating language embeddings into genomic foundation models is incremental relative to prior multimodal efforts, the thorough empirical and analytical treatment employed offers important insights and paves the way for future innovation in computational single-cell biology.
For instance, the study shows a significant performance boost from incorporating GO Cellular Component annotations in single-gene perturbations, where information about subcellular localization makes predictions more biologically plausible.
The scGenePT paper offers an insightful step forward in the integration of heterogeneous data modalities for single-cell analysis. By combining language-derived gene summaries with experimental scRNA-seq data, the model achieves enhanced performance in predicting complex perturbation effects. Continued efforts to refine alignment techniques and extend evaluations to additional datasets will further solidify the utility of such integrative approaches in computational biology.