BGPT
REFUTE Benchmark · A BGPT Open Eval

Does your model reason like a scientist?

Reasoning Over Evidence — Falsification, Uncertainty, Truth-grounding & Epistemics

REFUTE scores how well language models critique real science papers—not how fluently they sound. It measures falsification, calibrated uncertainty, limitation-finding, and the discipline to refuse when evidence is missing.

17 frontier models tested
120 rubric-graded questions
4 independent scoring channels
2025–26 fresh papers, reduced contamination

Why this benchmark exists

Labs and reviewers increasingly trust AI to read the literature. The dangerous failure mode isn't a model that sounds wrong—it's one that sounds convincing while overstating weak evidence. REFUTE is built to separate persuasive language from honest scientific judgment—rewarding calibration and truth-seeking, not cynicism or verbosity.


The leaderboard

Truth Score (0–100) — a composite of skill, calibration, and discrimination across all four channels. Higher is better. June 2026 run; 16 of the 17 tested models have the complete judge-free axes needed to score.

Rank  Model              Access  Truth Score
1     GLM-5.1            open    69.2
2     Claude-Opus-4.7    api     68.6
3     Claude-Opus-4.6    api     68.4
4     Kimi-K2.6          open    66.3
5     GLM-5              open    62.5
6     Grok-3-Mini        api     62.3
7     Qwen3.5-397B       open    62.0
8     GPT-5.2            api     61.8
9     Gemini-3.1-Pro     api     60.0
10    Qwen3-235B         open    59.2
11    GPT-5.4            api     57.9
12    Grok-4.1-Fast      api     55.7
13    Cogito-v2.1-671B   open    53.9
14    Llama-3.3-70B      open    49.4
15    DeepSeek-V4-Pro    open    48.6
16    gpt-oss-120b       open    45.7

Truth Score = 40% critique skill + 25% calibration (Brier skill score) + 20% forced-choice + 15% planted-flaw discrimination. Reported only when calibration & forced-choice axes are complete (Gemma-4 excluded). Full numbers: leaderboard_master.json.
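
The weighting above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's scoring code: the weights come from the formula, but the function name, the key names, and the assumption that every axis has already been rescaled to 0–100 are hypothetical. The sketch also withholds a score if any axis is missing, which is stricter than the reporting rule stated above.

```python
# Weights are taken from the Truth Score formula; everything else is assumed.
WEIGHTS = {
    "critique_skill": 0.40,
    "calibration": 0.25,
    "forced_choice": 0.20,
    "planted_flaw": 0.15,
}


def truth_score(components):
    """Weighted composite of four axes, each assumed to be on a 0-100 scale.

    Returns None when any axis is missing, so an incomplete model is
    left unscored rather than scored on partial data.
    """
    if any(components.get(name) is None for name in WEIGHTS):
        return None
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up axis values, for illustration only.
example = {
    "critique_skill": 70.0,
    "calibration": 60.0,
    "forced_choice": 80.0,
    "planted_flaw": 50.0,
}
print(round(truth_score(example), 1))
```

How each raw axis (for example a skill score out of 10 or a Brier skill score) is mapped onto a common scale is not specified here; see RESULTS.md for the actual protocol.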


The catch: fluent ≠ honest

The best critics are not always the most truthful. Grok-3-Mini writes the sharpest critiques of any model—yet lands at #6 on Truth Score. GPT-5.4 is starker still: near the top on skill, it falls to #11 because it's overconfident when evidence is weak. The open-weight GLM-5.1 makes the opposite trade and wins overall.

Best at critique (skill /10)

  1. Grok-3-Mini 7.46
  2. GPT-5.4 7.22
  3. GPT-5.2 7.21
  4. Claude-Opus-4.7 7.10
  5. Grok-4.1-Fast 7.04

Most truthful (Truth Score)

  1. GLM-5.1 69.2
  2. Claude-Opus-4.7 68.6
  3. Claude-Opus-4.6 68.4
  6. Grok-3-Mini 62.3
  11. GPT-5.4 57.9

On calibration (Brier score, lower is better), GLM-5.1 scores 0.12. GPT-5.4 is the worst of the top skill tier at 0.24: it over-flags flaws and cries wolf. The Grok models show the same pattern of top-tier critiques with weaker calibration (Brier 0.19 and 0.23). One polished paragraph can't hide that. (GPT-5.2, GPT-5.4 and Claude-Opus-4.7 are a statistical tie on skill; Grok-3-Mini posts the single highest score.)


Four things REFUTE measures

Each answer is graded on four independent channels, so strength in one can't mask weakness in another.

Channel 1

Critique skill

Quality of reasoning about a paper's methods and claims—graded against an expert rubric, length-controlled so verbosity doesn't win.

Channel 2

Uncertainty honesty

Does stated confidence match the strength of the evidence? Scored with a strictly proper Brier loss—no LLM judge involved.
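
For readers unfamiliar with the metric, here is a minimal sketch of a Brier score and a Brier skill score. It assumes binary outcomes (1 if the claim holds up, 0 if not) and stated confidences in [0, 1]. The function names are hypothetical, and the choice of a base-rate forecast as the skill-score reference is an assumption, not necessarily what REFUTE uses.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated probability and the 0/1 outcome.

    0.0 is perfect; always answering 0.5 scores 0.25.
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_gaps = [(p - y) ** 2 for p, y in zip(confidences, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


def brier_skill_score(confidences, outcomes):
    """1 - BS / BS_ref, with a base-rate forecast as the (assumed) reference.

    Positive values beat the reference; higher is better.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("skill score is undefined when all outcomes are identical")
    return 1.0 - brier_score(confidences, outcomes) / reference


# Made-up confidences and labels, for illustration only.
stated = [0.9, 0.2, 0.7, 0.4]
truth = [1, 0, 1, 0]
print(brier_score(stated, truth), brier_skill_score(stated, truth))
```

Because the Brier score is a strictly proper scoring rule, a model minimizes its expected loss only by reporting the probability it actually believes, which is why hedging everything at 50% or asserting everything at 100% both lose.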

Channel 3

Flaw discrimination

Shown two study summaries, can the model pick the more flawed one? A judge-free forced choice where chance is 50%.
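
Since chance is 50%, a forced-choice result is only meaningful relative to that baseline. The sketch below computes accuracy together with an exact one-sided binomial p-value against guessing. The function name and the use of a binomial test are assumptions made for illustration; the benchmark's own analysis may differ.

```python
from math import comb


def forced_choice_summary(n_correct, n_trials):
    """Return (accuracy, p_value) for a two-alternative forced choice.

    p_value is the probability of getting at least n_correct right
    out of n_trials by pure guessing (p = 0.5 per trial).
    """
    if n_trials <= 0 or not 0 <= n_correct <= n_trials:
        raise ValueError("need 0 <= n_correct <= n_trials and n_trials > 0")
    accuracy = n_correct / n_trials
    # Count the outcomes with n_correct or more successes, then divide by 2^n.
    tail = sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1))
    return accuracy, tail / 2 ** n_trials


# Made-up counts, for illustration only.
accuracy, p_value = forced_choice_summary(15, 20)
print(accuracy, p_value)
```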

Channel 4

Planted-flaw detection

Catches deliberate errors hidden in the methods or results—graded on objective labels, not prose, so fluent writing earns no credit.
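
Grading against objective labels can be illustrated with two rates: how often planted flaws are caught, and how often sound studies are wrongly flagged. This is a sketch under assumptions: the function name, the boolean encoding, and the choice of these two rates are hypothetical, not the benchmark's published metric.

```python
def detection_rates(flagged, has_flaw):
    """Compare model flags with ground-truth labels.

    flagged[i]  - True if the model said item i contains a flaw
    has_flaw[i] - True if a flaw was actually planted in item i
    """
    if len(flagged) != len(has_flaw):
        raise ValueError("flagged and has_flaw must have equal length")
    n_flawed = sum(1 for y in has_flaw if y)
    n_sound = len(has_flaw) - n_flawed
    if n_flawed == 0 or n_sound == 0:
        raise ValueError("need at least one flawed and one sound item")
    hits = sum(1 for f, y in zip(flagged, has_flaw) if f and y)
    false_alarms = sum(1 for f, y in zip(flagged, has_flaw) if f and not y)
    return {
        "hit_rate": hits / n_flawed,               # planted flaws caught
        "false_alarm_rate": false_alarms / n_sound,  # sound studies flagged
    }


# Made-up flags and labels, for illustration only.
print(detection_rates([True, True, False, True], [True, False, False, True]))
```

Reporting both rates matters: a model that flags everything reaches a perfect hit rate while "crying wolf" on every sound study.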


Five judgment skills

The critique channel spans five task types, 24 questions each (120 total), drawn from 2025–2026 papers.

Falsification

What would disprove the claim? The model must name concrete, measurable observations that would refute the authors' main result.

Limitations

Find the real weaknesses. Paper-specific design flaws—missing controls, confounds, untested assumptions—not boilerplate like "larger sample needed."

Refusal

Withhold judgment without data. Key results are hidden; a disciplined model declines to conclude, a weak one invents outcomes.

Overclaim

Narrow claims to what was shown. Rewrite overbroad statements, separating what the data demonstrate from what is merely asserted.

Confidence

Match certainty to evidence. State a confidence level justified by the study design, not by how the abstract sounds.


How evaluation works

  1. Show a paper card — a structured summary of a recent study (title, problem, methods, population, results)
  2. Ask a reviewer's question — what a careful scientist would probe: limitations, falsifiers, calibrated confidence
  3. Grade on four channels — critique skill, calibration, forced-choice discrimination, and planted-flaw detection
  4. Combine into a Truth Score — when calibration and forced-choice axes are both present

Skill is judged by a two-model panel (GLM-5 + Kimi-K2.6) on length-controlled answers; calibration, forced-choice, and planted-flaw axes are judge-free with objective labels. Full protocol and limitations live in RESULTS.md.


Three ways to use it

All text-only, all on Hugging Face. Pick the split that fits your goal.

refute_120
120 items

The full benchmark — all five judgment skills, balanced 24 per task.

refute_soundness
74 vignettes

Sound/flawed twins with objective labels for judge-free calibration and flaw detection.

Load any split in two lines:

from datasets import load_dataset

# Headline rankings — start here
hard = load_dataset("BGPT-OFFICIAL/refute", "refute_hard_60", split="train")

row = hard[0]
print(row["task"], row["paper_title"][:80])
# Feed row["input"] to your model; score against row["reference"] / row["rubric"]
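
To go from a single row to a full run, a loop like the following may be a useful starting point. It is a sketch, not an official harness: `collect_answers` and `generate` are hypothetical names, and `generate` stands for whatever function calls your model. The field names (`task`, `input`, `reference`, `rubric`) are the ones used in the snippet above.

```python
def collect_answers(rows, generate):
    """Run a model over benchmark rows and keep what a grader would need.

    rows     - iterable of dict-like rows (e.g. a loaded split)
    generate - hypothetical callable: prompt string -> answer string
    """
    results = []
    for row in rows:
        results.append({
            "task": row["task"],
            "answer": generate(row["input"]),
            "reference": row["reference"],
            "rubric": row["rubric"],
        })
    return results


# Stand-in row and stub model so the sketch runs without network access.
demo_rows = [{
    "task": "refusal",
    "input": "Results are hidden. What do you conclude?",
    "reference": "Decline to conclude.",
    "rubric": "Must not invent outcomes.",
}]
answers = collect_answers(demo_rows, lambda prompt: "I cannot conclude without the results.")
print(answers[0]["task"], answers[0]["answer"])
```

Grading the collected answers against each rubric is a separate step; the protocol for that is described in RESULTS.md.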

Frequently Asked Questions

What does REFUTE actually test?

Whether a model behaves like a careful scientist when reading a paper: proposing falsifiable tests, finding real limitations, calibrating confidence to the evidence, refusing to invent missing results, and resisting overclaim. It scores epistemic discipline—not writing fluency.

Why use 2025–2026 papers?

What is the Truth Score?

How do you stop verbose answers from winning?

Who judges the critique tasks?

Do open-weight models really beat closed ones?

What are the known limitations?

How do I run my own model on it?

What license is it under?


Cite REFUTE

@misc{refute2026,
  title={REFUTE: Reasoning Over Evidence Benchmark},
  author={BGPT},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/BGPT-OFFICIAL/refute}}
}

