Reasoning Over Evidence — Falsification, Uncertainty, Truth-grounding & Epistemics
REFUTE scores how well language models critique real science papers—not how fluently they sound. It measures falsification, calibrated uncertainty, limitation-finding, and the discipline to refuse when evidence is missing.
Labs and reviewers increasingly trust AI to read the literature. The dangerous failure mode isn't a model that sounds wrong—it's one that sounds convincing while overstating weak evidence. REFUTE is built to separate persuasive language from honest scientific judgment—rewarding calibration and truth-seeking, not cynicism or verbosity.
Truth Score (0–100): a composite of critique skill, calibration, and discrimination across all four channels. Higher is better. Results are from the June 2026 run; 16 of the 17 tested models have the complete judge-free axes required for a score.
Truth Score = 40% critique skill + 25% calibration (Brier skill score) + 20% forced-choice + 15% planted-flaw discrimination. Reported only when calibration & forced-choice axes are complete (Gemma-4 excluded). Full numbers: leaderboard_master.json.
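The weighting above can be sketched as a small function. This is only an illustration under stated assumptions: the weights come from the formula, but the axis names, the premise that each axis is already normalized to the range 0 to 1, and the example values are hypothetical. The actual normalization is whatever the benchmark's protocol defines.

```python
# Weights copied from the Truth Score formula; the key names are hypothetical.
WEIGHTS = {
    "critique_skill": 0.40,
    "calibration": 0.25,    # Brier skill score
    "forced_choice": 0.20,
    "planted_flaw": 0.15,
}


def truth_score(axes):
    """Return the weighted composite on a 0-100 scale, or None if an axis is missing.

    ASSUMPTION: every axis value is already normalized to [0, 1].
    Returning None for any missing axis is slightly stricter than the stated
    rule, which names only the calibration and forced-choice axes.
    """
    if any(axes.get(name) is None for name in WEIGHTS):
        return None
    return 100.0 * sum(WEIGHTS[name] * axes[name] for name in WEIGHTS)


# Hypothetical axis values, for illustration only (not benchmark results).
example = {
    "critique_skill": 0.8,
    "calibration": 0.5,
    "forced_choice": 0.7,
    "planted_flaw": 0.6,
}
print(truth_score(example))
```

Returning None rather than a partial sum mirrors the rule that a model with incomplete axes is left off the leaderboard instead of being scored on fewer components.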
The best critics are not always the most truthful. Grok-3-Mini writes the sharpest critiques of any model—yet lands at #6 on Truth Score. GPT-5.4 is starker still: near the top on skill, it falls to #11 because it's overconfident when evidence is weak. The open-weight GLM-5.1 makes the opposite trade and wins overall.
On calibration (Brier, lower is better) GLM-5.1 scores 0.12; GPT-5.4 is the worst among the top skill tier at 0.24—it over-flags flaws and cries wolf. The Grok models show the same pattern: top-tier critiques, weaker calibration (Brier 0.19 and 0.23). One polished paragraph can't hide that. (GPT-5.2, GPT-5.4 and Claude-Opus-4.7 are a statistical tie on skill; Grok-3-Mini posts the single highest score.)
Each model is scored on four independent channels, so strength in one can't mask weakness in another.
Critique skill: quality of reasoning about a paper's methods and claims, graded against an expert rubric and length-controlled so verbosity doesn't win.
Calibration: does stated confidence match the strength of the evidence? Scored with a strictly proper Brier loss, with no LLM judge involved.
Forced choice: shown two study summaries, can the model pick the more flawed one? A judge-free test where chance is 50%.
Planted-flaw discrimination: does the model catch deliberate errors hidden in the methods or results? Graded on objective labels, not prose, so fluent writing earns no credit.
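For readers unfamiliar with the calibration metric, here is a minimal sketch of the Brier score and a Brier skill score. The Brier score itself is standard (mean squared error between stated probability and the 0/1 outcome). The reference forecast used for the skill score is an assumption: this sketch uses the common "always predict the base rate" baseline, which may differ from the one REFUTE uses. Function names and sample data are hypothetical.

```python
def brier_score(confidences, labels):
    """Mean squared error between stated probabilities and 0/1 labels (lower is better)."""
    if len(confidences) != len(labels) or not labels:
        raise ValueError("confidences and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(confidences, labels)) / len(labels)


def brier_skill_score(confidences, labels):
    """Return 1 - BS / BS_ref (higher is better; 0 means no better than the reference).

    ASSUMPTION: the reference forecast always predicts the base rate of the labels.
    """
    base_rate = sum(labels) / len(labels)
    reference = brier_score([base_rate] * len(labels), labels)
    if reference == 0:
        raise ValueError("all labels are identical; the skill score is undefined")
    return 1.0 - brier_score(confidences, labels) / reference


# Hypothetical data: 1 = the study really is flawed, 0 = it is sound.
labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
always_unsure = [0.5, 0.5, 0.5, 0.5]

print(brier_score(confident_and_right, labels))
print(brier_skill_score(confident_and_right, labels))
print(brier_skill_score(always_unsure, labels))
```

Because the Brier score is strictly proper, a model minimizes its expected loss only by reporting the probability it actually believes, so neither blanket hedging nor blanket confidence is rewarded.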
The critique channel spans five task types, 24 questions each (120 total), drawn from 2025–2026 papers.
What would disprove the claim? The model must name concrete, measurable observations that would refute the authors' main result.
Find the real weaknesses. Paper-specific design flaws—missing controls, confounds, untested assumptions—not boilerplate like "larger sample needed."
Withhold judgment without data. Key results are hidden; a disciplined model declines to conclude, a weak one invents outcomes.
Narrow claims to what was shown. Rewrite overbroad statements, separating what the data demonstrate from what is merely asserted.
Match certainty to evidence. State a confidence level justified by the study design, not by how the abstract sounds.
Skill is judged by a two-model panel (GLM-5 + Kimi-K2.6) on length-controlled answers; calibration, forced-choice, and planted-flaw axes are judge-free with objective labels. Full protocol and limitations live in RESULTS.md.
All text-only, all on Hugging Face. Pick the split that fits your goal.
The 60 questions that best separate strong from weak models. Use this for headline rankings.
The full benchmark — all five judgment skills, balanced 24 per task.
Sound/flawed twins with objective labels for judge-free calibration and flaw detection.
Load any split in two lines:
from datasets import load_dataset
# Headline rankings — start here
hard = load_dataset("BGPT-OFFICIAL/refute", "refute_hard_60", split="train")
row = hard[0]
print(row["task"], row["paper_title"][:80])
# Feed row["input"] to your model; score against row["reference"] / row["rubric"]
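The last comment in the snippet above can be expanded into a small collection loop. This is a sketch, not the benchmark's official harness: `collect_answers` and `generate` are hypothetical names, the stub row stands in for a real dataset row, and only the field names (`task`, `input`, `reference`, `rubric`) come from the loading snippet. Grading the collected answers is left to whatever protocol you follow.

```python
def collect_answers(rows, generate):
    """Run `generate` on each row's input and keep the grading fields alongside.

    `rows` is any iterable of dict-like rows (a loaded split works);
    `generate` is a hypothetical callable wrapping your model: str -> str.
    """
    results = []
    for row in rows:
        results.append({
            "task": row["task"],
            "answer": generate(row["input"]),
            # .get() tolerates rows that carry only one of the two grading fields.
            "reference": row.get("reference"),
            "rubric": row.get("rubric"),
        })
    return results


# Stub row and stub model, for illustration only.
stub_rows = [
    {"task": "falsification", "input": "Paper summary ...", "reference": "ref", "rubric": "rub"},
]
answers = collect_answers(stub_rows, generate=lambda prompt: "model answer for: " + prompt)
print(answers[0]["task"], answers[0]["answer"])
```

Keeping the reference and rubric next to each answer means the graded file is self-contained and can be scored later without reloading the split.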
REFUTE measures whether a model behaves like a careful scientist when reading a paper: proposing falsifiable tests, finding real limitations, calibrating confidence to the evidence, refusing to invent missing results, and resisting overclaim. It scores epistemic discipline, not writing fluency.
@misc{refute2026,
title={REFUTE: Reasoning Over Evidence Benchmark},
author={BGPT},
year={2026},
howpublished={\url{https://huggingface.co/datasets/BGPT-OFFICIAL/refute}}
}