Quick Explanation
AlphaRED: using AlphaFold confidence to decide when to “dock globally” vs “refine locally”
The paper proposes AlphaRED, a protein–protein docking pipeline that uses AlphaFold-multimer (AFm) confidence to switch between global replica-exchange docking and local backbone refinement, improving near-native docking success over AFm alone on DB5.5 and showing gains on CASP15 nanobody–antigen targets.
Key evidence points: interface-pLDDT AUC = 0.86, success (DockQ > 0.23) = 63% overall, and Ab–Ag success rises to 43% (from ~20% with AFm alone).
Long Explanation
Paper Review (Rigorous, skeptical, evidence-based)
Target paper: Harmalkar, Lyskov, Gray — Reliable protein–protein docking with AlphaFold, Rosetta, and replica exchange (eLife, 2024).
1) Visual overview: what AlphaRED changes
Core idea: AFm sometimes predicts good subunit shapes but wrong docking orientation. AlphaRED attempts to detect likely interface-orientation failure using interface-pLDDT, then either:
interface-pLDDT ≤ 85: run global rigid-body replica-exchange docking (ReplicaDock 2.0) to search the docking landscape.
interface-pLDDT > 85: skip global search and do local backbone refinement with directed backbone moves on AFm-flagged mobile residues.
This “confidence-guided branching” is the method’s main novelty claim.
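The branching rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `choose_protocol`, the `Decision` type, and the protocol labels are hypothetical, and only the cutoff of 85 and the direction of the comparison come from the text.

```python
from dataclasses import dataclass

# Cutoff taken from the review text; kept as a parameter because the review
# notes it may be specific to the AFm version and benchmark used.
INTERFACE_PLDDT_CUTOFF = 85.0


@dataclass(frozen=True)
class Decision:
    protocol: str  # "global_docking" or "local_refinement" (hypothetical labels)
    reason: str


def choose_protocol(interface_plddt: float,
                    cutoff: float = INTERFACE_PLDDT_CUTOFF) -> Decision:
    """Pick a sampling regime from the AFm interface-pLDDT (hypothetical helper)."""
    if not 0.0 <= interface_plddt <= 100.0:
        raise ValueError("pLDDT is expected on a 0-100 scale")
    if interface_plddt <= cutoff:
        # Low interface confidence: the orientation may be wrong, so search globally.
        return Decision("global_docking", "interface-pLDDT at or below cutoff")
    # High interface confidence: keep the orientation and refine the backbone locally.
    return Decision("local_refinement", "interface-pLDDT above cutoff")


if __name__ == "__main__":
    for value in (62.0, 85.0, 91.5):
        print(value, choose_protocol(value).protocol)
```

Keeping the cutoff as an argument makes it easy to test how sensitive downstream results are to the choice of 85.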
Reported values, taken directly from the paper:
The paper reports interface-pLDDT AUC = 0.86, while the provided text does not include the AUC values for the other metrics.
Reported: among 97 targets with AFm interface-pLDDT ≤ 85, AlphaRED reduces interface-RMSD for 93.
The paper states Ab–Ag success was ~20% for AFm acceptable-or-better and 43% for AlphaRED.
3) Method mechanics (what exactly gets computed?)
Docking benchmark dataset: DB5.5 is used, with targets containing both bound complexes and unbound subunits, and targets are classified by unbound-to-bound RMSD (rigid/medium/difficult).
AFm inputs: The paper generates AF-multimer structures from sequences using ColabFold, specifically referencing AF-multimer v2.3.0 and generating five relaxed predictions (default recycles).
Confidence metric repurposing: interface-pLDDT is the mean of per-residue pLDDT over predicted interface residues (interface residues defined by residue–residue proximity, as described in the text).
Docking quality ground truth: DockQ is used as the docking quality metric with success defined by DockQ > 0.23 corresponding to CAPRI acceptable or higher.
Skeptical check: The paper explicitly acknowledges that AF training overlap could inflate benchmark success (“upper bound” argument).
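To make the interface-pLDDT definition from the list above concrete, here is a minimal sketch. It assumes one representative coordinate per residue and an 8 Å proximity cutoff; both are illustrative assumptions, since the review only says that interface residues are defined by residue–residue proximity. All names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Residue = Tuple[str, int]             # (chain id, residue number)
Coord = Tuple[float, float, float]    # one representative atom per residue (assumption)


def interface_residues(coords: Dict[Residue, Coord], chain_a: str, chain_b: str,
                       cutoff: float = 8.0) -> List[Residue]:
    """Residues of either chain lying within `cutoff` angstroms of the partner chain."""
    side_a = [r for r in coords if r[0] == chain_a]
    side_b = [r for r in coords if r[0] == chain_b]
    found = set()
    for ra in side_a:
        for rb in side_b:
            if math.dist(coords[ra], coords[rb]) <= cutoff:
                found.add(ra)
                found.add(rb)
    return sorted(found)


def interface_plddt(coords: Dict[Residue, Coord], plddt: Dict[Residue, float],
                    chain_a: str, chain_b: str, cutoff: float = 8.0) -> Optional[float]:
    """Mean per-residue pLDDT over interface residues; None if no contact is predicted."""
    iface = interface_residues(coords, chain_a, chain_b, cutoff)
    if not iface:
        return None
    return sum(plddt[r] for r in iface) / len(iface)


# Toy four-residue "complex" (made-up numbers, not from the paper).
toy_coords = {("A", 1): (0.0, 0.0, 0.0), ("A", 2): (20.0, 0.0, 0.0),
              ("B", 1): (5.0, 0.0, 0.0), ("B", 2): (30.0, 0.0, 0.0)}
toy_plddt = {("A", 1): 90.0, ("A", 2): 50.0, ("B", 1): 80.0, ("B", 2): 40.0}

if __name__ == "__main__":
    print(interface_plddt(toy_coords, toy_plddt, "A", "B"))
```

In the toy data only residues A1 and B1 are in contact, so the low-confidence residues far from the interface do not drag the score down. That is the point of using an interface-local average instead of a global one.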
4) Validation: benchmark vs blind CASP15
Benchmark (DB5.5): performance is summarized over 254 targets, with a special subset of 67 antibody–antigen complexes extracted from DB5.5.
Blind test (CASP15): because training overlap could bias benchmark results, the paper additionally tests on CASP15 targets that were not included in AFm training (as asserted by the authors).
Reported qualitative outcome: one example (T205) improves interface RMSD (AFm to AlphaRED), and the authors state improvements across the five CASP15 targets using predicted top-models as reference.
Critical limitation (important): the provided text indicates reference structures for some CASP15 cases are top predicted models rather than experimentally released structures, which weakens interpretability of absolute docking accuracy.
5) Metrics: what do pLDDT/LDDT/RMSD imply mechanistically?
LDDT (and per-residue LDDT) is described as a superposition-free local distance-difference test used to estimate local similarity of models relative to a reference.
AlphaRED’s hypothesis is operational: interface-pLDDT can serve as a discriminator of whether AFm docking orientation is near-native. This is supported in the paper by ROC/AUC and a confusion-matrix approach at the interface-pLDDT cutoff of 85.
Skeptical note: AUC and precision/recall at a single threshold do not guarantee calibrated probability or transferability outside the benchmark distribution; the paper itself frames generalization as an open challenge and tests it partially with CASP15.
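The two kinds of evidence mentioned above, a threshold-free AUC and a confusion matrix at a single cutoff, can be sketched as follows. The data are made-up toy values and the function names are hypothetical; the sketch only illustrates why the two summaries answer different questions.

```python
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. Quadratic in the number of samples, which is fine for
    benchmark-sized inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_at_cutoff(scores: Sequence[float], labels: Sequence[bool],
                        cutoff: float = 85.0) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when 'score > cutoff' predicts a near-native model."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted = s > cutoff
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy interface-pLDDT values and near-native labels (illustrative only).
toy_scores = [92.0, 88.0, 86.0, 80.0, 70.0, 60.0]
toy_labels = [True, True, False, True, False, False]

if __name__ == "__main__":
    print(roc_auc(toy_scores, toy_labels))
    print(confusion_at_cutoff(toy_scores, toy_labels))
```

The AUC summarizes ranking quality over all possible cutoffs, while the confusion matrix describes one operating point. Neither says whether a given interface-pLDDT value corresponds to a calibrated probability of success.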
6) Critical assessment: strengths, blind spots, and potential failure modes
Strengths supported by the paper text
Mechanism-relevant bridging: using AFm confidence to steer physics-based sampling is more than “post-hoc ranking”; it changes the sampling regime (global vs local) and explicitly targets docking-orientation errors.
Quantitative improvements in the “failure slice”: 93/97 AFm-failure targets show improved interface RMSD after AlphaRED.
Ab–Ag gains are substantial in this benchmark: Ab–Ag success is reported as 43% for AlphaRED vs ~20% for AFm top predictions.
Blind spots / critical limitations (from the paper text)
Training-set overlap (benchmark inflation): DB5.5 targets may overlap with AF training data; authors state measured success could be an upper bound.
CASP15 reference may not be fully experimental: where experimental structures are not released, the paper uses top CASP15 predicted models as references, weakening absolute accuracy interpretation.
Confidence metrics aren’t universally reliable: the paper highlights that overall AFm confidence may not correlate with interface correctness; hence they introduce interface-pLDDT. That’s a strength—but also implies confidence transfer could fail in other interface classes not represented in their benchmark.
Compute-time realism depends on infrastructure: the paper frames feasibility in CPU-hours for their server/pipeline usage (as described in provided text). Without independent benchmarking on other clusters, runtime comparisons can be incomplete.
What would most likely disprove the main practical claim?
If interface-pLDDT discrimination (AUC ~0.86, applied with the cutoff of 85) fails on a truly independent, non-overlapping dataset, the “branching logic” would lose reliability.
If docking improvements for AFm-failure cases do not reproduce across other docking benchmarks (or under different AFm model versions), the improvement could be method/benchmark-specific.
7) Links to key related literature (confidence metrics + docking score)
AlphaFold (single-chain) conceptual basis
AlphaFold-multimer for complexes
DockQ scoring used as the “native-like docking” yardstick
lDDT (local distance difference test) definition used for local similarity logic
8) How to use this paper (practical takeaway for docking work)
If you are building a docking pipeline, AlphaRED’s transferable pattern is:
Compute an interface-local confidence (interface-pLDDT) rather than using global confidence alone.
Let that confidence select the sampling regime (global search vs local refinement).
Rank outputs by docking quality scores consistent with benchmark definitions (DockQ).
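For the last step, the success definition used throughout this review (DockQ > 0.23) can be written down directly. In the sketch below, the 0.49 and 0.80 bands are the commonly cited DockQ cutoffs for CAPRI medium and high quality; they are not stated in the text above and are included as an assumption, as is the strict inequality at 0.23, which follows the wording of this review. Function names are hypothetical.

```python
from typing import Sequence

ACCEPTABLE_CUTOFF = 0.23  # from the review: success means DockQ > 0.23
MEDIUM_CUTOFF = 0.49      # assumption: commonly cited DockQ band for CAPRI medium
HIGH_CUTOFF = 0.80        # assumption: commonly cited DockQ band for CAPRI high


def is_success(dockq: float) -> bool:
    """True when a model counts as CAPRI acceptable or better under the review's rule."""
    return dockq > ACCEPTABLE_CUTOFF


def capri_class(dockq: float) -> str:
    """Map a DockQ score in [0, 1] to a CAPRI-style quality label."""
    if not 0.0 <= dockq <= 1.0:
        raise ValueError("DockQ is expected in [0, 1]")
    if dockq >= HIGH_CUTOFF:
        return "high"
    if dockq >= MEDIUM_CUTOFF:
        return "medium"
    if is_success(dockq):
        return "acceptable"
    return "incorrect"


def success_rate(dockq_scores: Sequence[float]) -> float:
    """Fraction of targets whose model is acceptable or better."""
    if not dockq_scores:
        raise ValueError("no scores given")
    return sum(is_success(s) for s in dockq_scores) / len(dockq_scores)


if __name__ == "__main__":
    toy = [0.10, 0.30, 0.50, 0.90]  # made-up scores
    print([capri_class(s) for s in toy], success_rate(toy))
```

Whatever boundary convention is chosen, it should match the one used by the benchmark being compared against; otherwise success rates are not directly comparable.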
9) Author reviews (bespoke)
Updated: March 20, 2026
BGPT Paper Review
Study Novelty
90%
High novelty comes from the specific, interface-focused confidence repurposing (interface-pLDDT) to deterministically select between global replica-exchange docking and local directed refinement, rather than treating AF confidence only as post-ranking or entirely as a template for docking.
Scientific Quality
90%
Strong benchmark design and metric alignment (DockQ, CAPRI mapping) with quantitative discrimination (interface-pLDDT ROC AUC 0.86) and an explicit branching protocol. Skeptical flags: possible benchmark inflation via AF training overlap and reliance on CASP15 predicted references when experimental structures are not yet released.
Study Generality
70%
Generality is supported by broad DB5.5 coverage (254 targets) and a blind-like CASP15 test, but the provided text indicates dataset- and model-version-specific parameter choices (AFm v2.3.0; interface-pLDDT threshold 85) and limited blind coverage (five CASP15 nanobody–antigen targets).
Study Usefulness
90%
Practical usefulness is high because the pipeline is directly deployable (code on GitHub; online server referenced) and provides a clear operational rule (interface-pLDDT threshold) tied to measurable docking improvements.
Study Reproducibility
80%
The paper reports explicit protocol components (AFm via ColabFold without templates; ReplicaDock 2.0 perturbation magnitudes and replica exchange setup; Rosetta backbone movers; thresholds) and provides code/server availability. Remaining gaps likely include full parameter details in supplementary methods (not fully provided in the excerpt) and dependency on specific AFm version and server environment.
Explanatory Depth
80%
Mechanistic explanation is largely operational and metric-based (confidence → sampling regime → docking-orientation correction; interface-local confidence correlates with DockQ). It does not provide deep physical causal modeling beyond the sampling/refinement logic described.
A follow-up analysis would parse DB5.5/CASP15 result tables from AlphaRED outputs, compute DockQ success fractions by interface-pLDDT branch, and plot ROC/AUC-like discrimination summaries for each metric.
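A minimal sketch of the per-branch part of such an analysis is shown below. The CSV layout and the column names (`target`, `interface_plddt`, `dockq`) are hypothetical, as are the toy rows; real AlphaRED outputs may be organized differently.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def success_by_branch(rows: Iterable[Mapping[str, str]], plddt_cutoff: float = 85.0,
                      dockq_cutoff: float = 0.23) -> Dict[str, float]:
    """Fraction of targets with DockQ > dockq_cutoff, grouped by branch.

    Targets with interface-pLDDT <= plddt_cutoff are assigned to the "global"
    branch and the rest to "local", mirroring the rule described in the review.
    """
    counts = defaultdict(lambda: [0, 0])  # branch -> [successes, total]
    for row in rows:
        branch = "global" if float(row["interface_plddt"]) <= plddt_cutoff else "local"
        counts[branch][1] += 1
        if float(row["dockq"]) > dockq_cutoff:
            counts[branch][0] += 1
    return {branch: wins / total for branch, (wins, total) in counts.items()}


# Made-up rows standing in for a real results table.
TOY_CSV = """target,interface_plddt,dockq
toy1,60.0,0.10
toy2,72.5,0.35
toy3,91.0,0.62
toy4,88.0,0.20
"""

toy_rows = list(csv.DictReader(io.StringIO(TOY_CSV)))

if __name__ == "__main__":
    print(success_by_branch(toy_rows))
```

Reporting success separately per branch shows whether gains come mainly from the global-docking cases, which is where the review says the method is expected to help most.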
Hypothesis Graveyard
Because interface-pLDDT correlates with docking correctness in this benchmark (AUC 0.86), one might overgeneralize that it will be universally calibrated; that would fail on interface classes where confidence is high despite orientation errors, matching the paper’s own motivation that global confidence can be misleading.
It’s tempting to attribute all gains to improved backbone sampling; however, the paper’s results emphasize docking orientation correction and interface discrimination, so a “backbone-only” interpretation would be incomplete if interface orientation dominates error modes.