BGPT: Paper Review: Separating selection from mutation in antibody language models


Quick Explanation

Key result

The paper argues (and empirically demonstrates) that antibody foundation models trained with masked objectives conflate nucleotide-level mutation biases (codon accessibility, SHM rate heterogeneity) with functional amino-acid selection, and that explicitly separating these processes with a deep amino-acid selection model (DASM) improves zero-shot functional prediction while being dramatically faster.

Evidence includes (i) a nearly 100× probability gap in AbLang2 between amino-acid alternatives requiring one versus multiple nucleotide changes, and (ii) a decline in functional predictive correlation from ~0.49 to ~0.30 when moving from amino acids reachable by a single nucleotide mutation to those requiring multiple nucleotide mutations; DASM recovers performance without that bias.

Long Explanation

Paper Review (BGPT): “Separating selection from mutation in antibody language models”

Primary handle: 10.7554/eLife.109644

Date in manuscript: April 07, 2026.

1) VISUAL: What the paper claims to fix

Mutation bias → functional confounding (masked LMs)

Conventional antibody MLM

Masked objective learns codon accessibility & SHM rate patterns.
Also learns germline memory (explicitly discussed for antibody LMs).
→ confounding in functional mutation scoring.

Paper’s fix: Thrifty+DASM

Separate neutral mutation model from amino-acid selection.
Learn selection factors with a transformer encoder.
Compute mutation probabilities as product: neutral × selection.

All core methodological claims above come from the manuscript’s introduction, results, formulation, and discussion.

2) VISUAL: Mutation-accessibility bias in AbLang2 (Figure 1a,c)

2.1 Bias in predicted amino-acid probabilities (1 vs multi nucleotide changes)

The paper reports that AbLang2 assigns almost 100× lower probability to amino acids requiring multiple nucleotide mutations relative to those requiring a single nucleotide change.
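To make the one-versus-multiple distinction concrete, here is a minimal sketch that counts the fewest nucleotide changes needed to reach each amino acid from a parent codon under the standard genetic code. The names `min_nt_changes` and `single_nt_accessible` are illustrative and do not come from the paper, and the sketch deliberately ignores SHM rate heterogeneity.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_nt_changes(parent_codon, target_aa):
    """Fewest nucleotide differences between parent_codon and any codon for target_aa."""
    return min(
        sum(a != b for a, b in zip(parent_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa
    )


def single_nt_accessible(parent_codon):
    """Amino acids (excluding the parent and stop) reachable by one nucleotide change."""
    parent_aa = CODON_TABLE[parent_codon]
    return sorted(
        aa
        for aa in set(AMINO_ACIDS)
        if aa not in (parent_aa, "*") and min_nt_changes(parent_codon, aa) == 1
    )


print(single_nt_accessible("AGC"))   # serine codon AGC
print(min_nt_changes("AGC", "E"))    # glutamate from AGC
```

For the serine codon AGC, six amino acids are one change away, while glutamate requires all three positions to change. Alternatives in the second category are the ones the paper reports AbLang2 downweighting.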

2.2 Bias harms functional prediction correlation (single-mutant DMS)

On the Koenig et al. FLAb benchmark subset, the paper reports a drop in predictive correlation from 0.49 to 0.30 when moving from amino acids requiring a single nucleotide mutation to those requiring multiple nucleotide mutations (and notes remaining overall degradation when combining classes).
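A stratified comparison of this kind can be sketched as follows, assuming each single-mutant record carries the number of nucleotide changes it requires, a model score, and a measured value. The record layout and the function names are assumptions for illustration, not the paper's evaluation code.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def stratified_pearson(records):
    """records: iterable of (n_nt_changes, model_score, measured_value).

    Returns correlations for single-nucleotide mutants, multi-nucleotide
    mutants, and the pooled set.
    """
    groups = {"single": ([], []), "multi": ([], []), "pooled": ([], [])}
    for n_nt, score, measured in records:
        key = "single" if n_nt == 1 else "multi"
        for name in (key, "pooled"):
            groups[name][0].append(score)
            groups[name][1].append(measured)
    return {
        name: pearson(scores, values)
        for name, (scores, values) in groups.items()
        if len(scores) >= 2
    }
```

Reporting the pooled value alongside the two strata matters because a score offset between the classes can lower the pooled correlation even when each stratum looks acceptable on its own.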

3) VISUAL: DASM improves functional prediction while controlling nucleotide bias

3.1 Koenig benchmark: sign and magnitude changes (Table 1 in provided text)

The manuscript tabulates Pearson correlations on the Koenig benchmark, with binding and expression each scored in multiple ways. In the excerpt of that table available for this review, DASM shows large positive correlations for binding and moderate positive correlations for expression, whereas AbLang2 shows smaller binding correlations and negative expression correlations.

3.2 Conditional perplexity in Rodriguez parent→child mutation trajectories

For predicting the identity of amino-acid changes in held-out parent-child sequence pairs, the paper uses “conditional perplexity” and reports median values of 4.88 for DASM vs 7.39 for AbLang2 across 1000 sequences (lower is better).
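One plausible reading of "conditional perplexity" is sketched below: at each site where the child differs from the parent, the model's probability for the child amino acid is renormalized over all non-parent amino acids, and the perplexity is the exponential of the mean negative log of those conditional probabilities. This definition is an assumption made for illustration and should be checked against the manuscript's methods before reuse.

```python
from math import exp, log


def conditional_perplexity(site_probs, parent_aas, child_aas):
    """Perplexity over substituted sites, conditioned on a substitution occurring.

    site_probs: one dict per site mapping amino acid -> model probability.
    parent_aas, child_aas: aligned sequences of equal length.
    """
    log_probs = []
    for probs, parent, child in zip(site_probs, parent_aas, child_aas):
        if parent == child:
            continue  # unchanged sites do not contribute
        # Renormalize over every amino acid except the parent residue.
        non_parent_mass = sum(p for aa, p in probs.items() if aa != parent)
        log_probs.append(log(probs[child] / non_parent_mass))
    if not log_probs:
        raise ValueError("no substituted sites to score")
    return exp(-sum(log_probs) / len(log_probs))


# Usage: a model that is uniform over 20 amino acids has no information
# about which of the 19 alternatives appears.
alphabet = "ACDEFGHIKLMNPQRSTVWY"
uniform = {aa: 1.0 / len(alphabet) for aa in alphabet}
print(conditional_perplexity([uniform, uniform], "AC", "CC"))
```

Under this reading, a uniform model scores about 19, which gives a reference point for the reported medians of 4.88 and 7.39.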

4) VISUAL: DASM is much faster and smaller (efficiency claim)

The manuscript claims that DASM provides predictions for all variants in a single pass, removing iterative masking used by masked LMs; it reports CPU/GPU speedups in the text and table (with DASM runtime per sequence far lower than AbLang2/ESM2).
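The source of the speedup can be illustrated by counting forward passes. The sketch below uses a hypothetical stand-in model (`CountingModel`) that returns uniform per-position outputs; it does not call AbLang2, ESM2, or DASM, and their real interfaces may differ.

```python
class CountingModel:
    """Hypothetical stand-in for a sequence model; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        # Placeholder output: a uniform distribution over 20 amino acids per position.
        return [[1.0 / 20] * 20 for _ in tokens]


def score_masked_marginals(model, seq, mask_token="<mask>"):
    """Mask each position in turn; one forward pass per position."""
    per_site = []
    for i in range(len(seq)):
        masked = list(seq)
        masked[i] = mask_token
        per_site.append(model(masked)[i])
    return per_site


def score_single_pass(model, seq):
    """Score every position from one forward pass on the unmasked sequence."""
    return model(list(seq))


seq = "EVQLVESGGG"
mlm, single = CountingModel(), CountingModel()
score_masked_marginals(mlm, seq)
score_single_pass(single, seq)
print(mlm.calls, single.calls)
```

The number of passes grows with sequence length for the masked scheme and stays at one for the single-pass scheme, so part of the runtime gap is a property of the scoring protocol, separate from model size.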

Skeptical note: runtime can depend on implementation details, batching, and scoring protocol. The comparison here is based on the manuscript’s described hardware and timing scripts; it is persuasive as an order-of-magnitude comparison but should still be re-verified when ported to different inference pipelines.

5) Model formulation: Where separation happens (mechanistic clarity)

The paper gives a likelihood for codon mutation probabilities factorized into a fixed neutral mutation component and a learned selection factor (transformer-encoder over input amino-acid sequence) that yields codon-level amino-acid selection factors. It further frames the learned selection factors as analogous to mutation–selection models (moving from classic equilibrium-frequency views toward per-site per-amino-acid selection factors).
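The factorization can be sketched as below, assuming the neutral model supplies a probability for each possible child codon at a site and the selection model supplies one factor per amino acid at that site. The function name, argument layout, and numbers are illustrative assumptions, and the handling of synonymous and unchanged codons is simplified relative to whatever the manuscript specifies.

```python
def aa_substitution_probs(neutral_codon_probs, codon_to_aa, selection_factors, parent_aa):
    """Combine neutral codon probabilities with amino-acid selection factors.

    neutral_codon_probs: child codon -> neutral mutation probability (fixed model).
    codon_to_aa: child codon -> encoded amino acid.
    selection_factors: amino acid -> learned selection factor at this site.
    Returns amino acid -> probability of a substitution to that amino acid.
    """
    probs = {}
    for codon, p_neutral in neutral_codon_probs.items():
        aa = codon_to_aa[codon]
        if aa == parent_aa:
            continue  # synonymous changes are not amino-acid substitutions
        # Product form: neutral mutation probability times selection factor.
        probs[aa] = probs.get(aa, 0.0) + p_neutral * selection_factors[aa]
    return probs


# Illustrative numbers only: Asn is easy to reach from a Ser codon, Trp is not.
neutral = {"AAC": 0.02, "AAT": 0.01, "TGG": 0.0001}
codon_to_aa = {"AAC": "N", "AAT": "N", "TGG": "W"}
selection = {"N": 0.5, "W": 2.0}
print(aa_substitution_probs(neutral, codon_to_aa, selection, parent_aa="S"))
```

In this toy case tryptophan has the larger selection factor even though its substitution probability is far smaller. The selection factor, rather than the product, is the quantity intended to carry the functional signal.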

6) Skeptical critique: what could still be confounded?

6.1 Reconstruction/phylogeny assumptions
DASM training relies on clonal-family clustering, phylogenetic inference, and ancestral reconstruction to create parent–child pairs (PCPs). Biases in tree inference, rate models, or clustering could shift what “parent” vs “child” substitutions represent. The paper uses a named substitution model and rate heterogeneity framework and describes paired heavy/light rate partitioning; these are plausible but still model-dependent.

6.2 Neutral model quality (out-of-frame + multihit correction)
The neutral mutation component is inferred from out-of-frame data and includes a “multihit correction” to account for SHM clustering. If out-of-frame neutrality is imperfect or the multihit correction does not fully match within-frame neutrality in all contexts, the selection factor estimates could partly compensate for residual neutral-model errors.

6.3 Metric interpretation: pseudo-perplexity and conditional perplexity
Comparisons for LMs use pseudo-perplexity variants and masked-marginals schemes; for DASM comparisons, selection factors are used directly since selection factors do not correspond to likelihoods needed for perplexity. Metric mismatches can sometimes make comparisons look better or worse depending on evaluation protocol alignment. The manuscript documents these scoring choices; still, a careful re-evaluation under alternative scoring conventions would strengthen robustness claims.

6.4 Dataset scope: antibodies and benchmark-centric evaluation
The evaluation relies heavily on FLAb and additional antibody binding datasets; while this supports the central claim in the antibody domain, generalization beyond antibodies or beyond the particular antibody datasets used for evaluation remains an open empirical question. The paper discusses plans for extension beyond antibodies, but the present evidence is concentrated in antibody settings.

7) Evidence table (quick scan)

Evidence type	Claim	Reported number(s)
Mutation accessibility bias	AbLang2 downweights multi-nucleotide-accessible AA alternatives	~100× lower probability
Functional confounding	Functional correlation drops for multi-nucleotide-accessible mutants	0.49 → 0.30 Pearson correlation
Selection-factor utility	DASM recovers functional predictive signal without nucleotide bias	Koenig correlations: DASM positive (binding); AbLang2 negative or near zero in the excerpted table
Trajectory prediction	Lower conditional perplexity for DASM	Median 4.88 (DASM) vs 7.39 (AbLang2) on 1000 sequences
Efficiency	Single-pass inference yields large speedups	CPU/GPU timing: DASM orders of magnitude faster than AbLang2/ESM2 (plotted on a log scale)


Updated: April 18, 2026