BGPT: Paper Review: Reliable protein–protein docking with AlphaFold, Rosetta, and replica exchange

Fuel Your Discoveries

Quick Explanation Copied

AlphaRED: using AlphaFold confidence to decide when to “dock globally” vs “refine locally”

The paper proposes AlphaRED, a protein–protein docking pipeline that uses AlphaFold-multimer (AFm) confidence to switch between global replica-exchange docking and local backbone refinement, improving near-native docking success over AFm alone on DB5.5 and showing gains on CASP15 nanobody–antigen targets.

Key evidence points: interface-pLDDT AUC = 0.86, success (DockQ > 0.23) = 63% overall, and Ab–Ag success rises to 43% (from ~20% with AFm alone).

Long Explanation

Paper Review (Rigorous, skeptical, evidence-based)

Target paper: Harmalkar, Lyskov, Gray — Reliable protein–protein docking with AlphaFold, Rosetta, and replica exchange (eLife, 2024).

1) Visual overview: what AlphaRED changes

Core idea: AFm sometimes predicts good subunit shapes but wrong docking orientation. AlphaRED attempts to detect likely interface-orientation failure using interface-pLDDT, then either:

interface-pLDDT ≤ 85: run global rigid-body replica-exchange docking (ReplicaDock 2.0) to search the docking landscape.
interface-pLDDT > 85: skip global search and do local backbone refinement with directed backbone moves on AFm-flagged mobile residues.

This “confidence-guided branching” is the method’s main novelty claim.

2) Evidence visualizations (from reported aggregate results)

Values are taken directly from the paper’s reported success rates.

The paper reports interface-pLDDT AUC = 0.86, while the provided text does not include the AUC values for the other metrics.

Reported: among 97 targets with AFm interface-pLDDT ≤ 85, AlphaRED reduces interface-RMSD for 93.

The paper states Ab–Ag success was ~20% for AFm acceptable-or-better and 43% for AlphaRED.

3) Method mechanics (what exactly gets computed?)

Docking benchmark dataset: DB5.5 is used, with targets containing both bound complexes and unbound subunits, and targets are classified by unbound-to-bound RMSD (rigid/medium/difficult).
AFm inputs: The paper generates AF-multimer structures from sequences using ColabFold, specifically referencing AF-multimer v2.3.0 and generating five relaxed predictions (default recycles).
Confidence metric repurposing: interface-pLDDT is the mean of per-residue pLDDT over predicted interface residues (interface residues defined by residue–residue proximity, as described in the text).
Docking quality ground truth: DockQ is used as the docking quality metric with success defined by DockQ > 0.23 corresponding to CAPRI acceptable or higher.

Skeptical check: The paper explicitly acknowledges that AF training overlap could inflate benchmark success (“upper bound” argument).

4) Validation: benchmark vs blind CASP15

Benchmark (DB5.5): performance is summarized over 254 targets, with a special subset of 67 antibody–antigen complexes extracted from DB5.5.

Blind test (CASP15): because training overlap could bias benchmark results, the paper additionally tests on CASP15 targets that were not included in AFm training (as asserted by the authors).

Reported qualitative outcome: one example (T205) improves interface RMSD (AFm to AlphaRED), and the authors state improvements across the five CASP15 targets using predicted top-models as reference.

Critical limitation (important): the provided text indicates reference structures for some CASP15 cases are top predicted models rather than experimentally released structures, which weakens interpretability of absolute docking accuracy.

5) Metrics: what do pLDDT/LDDT/RMSD imply mechanistically?

LDDT (and per-residue LDDT) is described as a superposition-free local distance-difference test used to estimate local similarity of models relative to a reference.
AlphaRED’s hypothesis is operational: interface-pLDDT can serve as a discriminator of whether AFm docking orientation is near-native. This is supported in the paper by ROC/AUC and a confusion-matrix approach at the interface-pLDDT ≥ 85 cutoff.

Skeptical note: AUC and precision/recall at a single threshold do not guarantee calibrated probability or transferability outside the benchmark distribution; the paper itself frames generalization as an open challenge and tests it partially with CASP15.

6) Critical assessment: strengths, blind spots, and potential failure modes

Strengths supported by the paper text

Mechanism-relevant bridging: using AFm confidence to steer physics-based sampling is more than “post-hoc ranking”; it changes the sampling regime (global vs local) and explicitly targets docking-orientation errors.
Quantitative improvements in the “failure slice”: 93/97 AFm-failure targets show improved interface RMSD after AlphaRED.
Ab–Ag focus is statistically meaningful in this benchmark: Ab–Ag success reported as 43% for AlphaRED vs ~20% for AFm top predictions.

Blind spots / critical limitations (from the paper text)

Training-set overlap (benchmark inflation): DB5.5 targets may overlap with AF training data; authors state measured success could be an upper bound.
CASP15 reference may not be fully experimental: where experimental structures are not released, the paper uses top CASP15 predicted models as references, weakening absolute accuracy interpretation.
Confidence metrics aren’t universally reliable: the paper highlights that overall AFm confidence may not correlate with interface correctness; hence they introduce interface-pLDDT. That’s a strength—but also implies confidence transfer could fail in other interface classes not represented in their benchmark.
Compute-time realism depends on infrastructure: the paper frames feasibility in CPU-hours for their server/pipeline usage (as described in provided text). Without independent benchmarking on other clusters, runtime comparisons can be incomplete.

What would most likely disprove the main practical claim?

If interface-pLDDT discrimination (AUC ~0.86 at the described thresholding) fails on a truly independent, non-overlapping dataset, the “branching logic” would lose reliability.
If docking improvements for AFm-failure cases do not reproduce across other docking benchmarks (or under different AFm model versions), the improvement could be method/benchmark-specific.

7) Links to key related literature (confidence metrics + docking score)

AlphaFold (single-chain) conceptual basis:
AlphaFold-multimer for complexes:
DockQ scoring used as the “native-like docking” yardstick:
lDDT (local distance difference test) definition used for local similarity logic:

8) How to use this paper (practical takeaway for docking work)

If you are building a docking pipeline, AlphaRED’s transferable pattern is:

Compute an interface-local confidence (interface-pLDDT) rather than using global confidence alone.
Let that confidence select the sampling regime (global search vs local refinement).
Rank outputs by docking quality scores consistent with benchmark definitions (DockQ).

9) Author reviews (bespoke)

Feedback:

Updated: March 20, 2026