REFUTE · June 2026 panel

Can AI read science honestly?

REFUTE (Reasoning Over Evidence) is an auto-graded benchmark of 240 items sampled from 2025–2026 empirical papers in the BGPT corpus. Nineteen frontier models were evaluated in June 2026 on whether they can recall findings, name falsifiers, resist overclaim, and discriminate subtle methodological errors, with identical grading rules for every model.

240 release items · 72.8% mean item accuracy · 19 evaluated models · 39–73 Truth Score range

Truth Score rankings

Composite score (0–100) blending four judge-free MCQ axes, probabilistic calibration (Brier skill score), and length-controlled critique skill. Scores in the rankings below are rounded to the nearest integer. Full 19-model panel, June 2026, release set only.

Top 8 of 19 by Truth Score: Claude Opus 4.7, Grok 4.2, Gemini 3.1 Pro, Grok 4.3, Claude Opus 4.6, Grok 3 Mini, Qwen3.5 397B, GLM-5.

Full 19-model breakdown by question type. MCQ columns are exact-match accuracy on release IDs.

#	Model	Truth Score	Facts (Knowledge)	Falsify (Falsification)	Flaws (Discrimination)
1	Claude-Opus-4.7	73	85%	94%	80%
2	Grok-4.2	71	83%	95%	69%
3	Gemini-3.1-Pro	70	88%	94%	85%
4	Grok-4.3	68	78%	87%	76%
5	Claude-Opus-4.6	68	79%	94%	61%
6	Grok-3-Mini	68	80%	85%	72%
7	Qwen3.5-397B-A17B	67	80%	72%	70%
8	GLM-5	65	78%	75%	60%
9	GLM-5.1	63	70%	63%	59%
10	GPT-5.4	61	80%	75%	69%
11	Cogito-v2.1-671B	60	75%	63%	63%
12	GPT-5.2	58	63%	67%	59%
13	Qwen3-235B-Instruct	58	72%	70%	54%
14	Kimi-K2.6	56	67%	47%	56%
15	Grok-4.1-Fast	52	67%	63%	49%
16	gpt-oss-120b	49	72%	55%	59%
17	DeepSeek-V4-Pro	49	58%	50%	60%
18	Gemma-4-31B	46	62%	31%	50%
19	Llama-3.3-70B	39	45%	57%	38%

Panel n=19 · 240 items · mean item accuracy 72.8% · full results (JSON) · methods

Epistemic axes dissociate

Critique skill and calibration are measured independently. They do not track together across models.

Critique skill (blue) and uncertainty calibration (pink) for six illustrated models from the June 2026 panel (full n=19). Higher is better on both axes.

Top 8 models by generative critique skill vs. objective MCQ accuracy — channels that do not move together.

High critique quality does not imply well-calibrated uncertainty.

Models can score strongly on evidence-based reasoning while overstating confidence when data are weak. REFUTE reports each axis separately so this dissociation is visible rather than averaged away. On the release set, Truth Score spans 39–73 across the full 19-model panel; methodological discrimination (63% mean item accuracy) is the primary separator.
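As a rough illustration of how a Brier skill score penalizes overconfidence, the sketch below scores a model's stated confidence against whether each answer turned out to be correct. The function names, the example values, and the choice of reference forecast (the base rate of correct answers) are assumptions made for illustration; the page does not specify how REFUTE defines its reference forecast.

```python
def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_ref.

    Assumption: the reference forecast is the base rate of correct answers.
    A positive BSS beats that reference; a negative BSS is worse than it.
    """
    base_rate = sum(outcomes) / len(outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref


# Hypothetical data (not from the panel): 1 = answer correct, 0 = incorrect.
outcomes = [1, 1, 0, 0]
well_calibrated = [0.9, 0.9, 0.1, 0.1]   # confident only where correct
overconfident = [0.95, 0.95, 0.95, 0.95]  # confident everywhere

print(round(brier_skill_score(well_calibrated, outcomes), 2))
print(round(brier_skill_score(overconfident, outcomes), 2))
```

In this toy case both forecasters have the same accuracy (two of four correct), yet only the first earns a positive skill score, which is the kind of dissociation the paragraph above describes.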

Truth Score spread (39–73) across the full panel. Grok 4.2 ranks #2 overall (71) but sits #6 on flaw discrimination (69%, tied with GPT-5.4), well behind the axis leader, Gemini 3.1 Pro (85%).
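The per-axis rank quoted above can be rechecked from the Flaws column of the rankings table. The sketch below copies those 19 published percentages and applies competition ranking (rank = 1 + number of models with a strictly higher score); the helper name `flaw_rank` and the tie-handling rule are illustrative choices, not something the page specifies.

```python
# Flaws (discrimination) accuracy in percent, copied from the rankings table.
FLAWS = {
    "Claude-Opus-4.7": 80, "Grok-4.2": 69, "Gemini-3.1-Pro": 85,
    "Grok-4.3": 76, "Claude-Opus-4.6": 61, "Grok-3-Mini": 72,
    "Qwen3.5-397B-A17B": 70, "GLM-5": 60, "GLM-5.1": 59,
    "GPT-5.4": 69, "Cogito-v2.1-671B": 63, "GPT-5.2": 59,
    "Qwen3-235B-Instruct": 54, "Kimi-K2.6": 56, "Grok-4.1-Fast": 49,
    "gpt-oss-120b": 59, "DeepSeek-V4-Pro": 60, "Gemma-4-31B": 50,
    "Llama-3.3-70B": 38,
}


def flaw_rank(model: str) -> int:
    """Competition rank: 1 + count of models with a strictly higher score."""
    score = FLAWS[model]
    return 1 + sum(1 for other in FLAWS.values() if other > score)


for name in ("Gemini-3.1-Pro", "Grok-4.2", "GPT-5.4"):
    print(name, flaw_rank(name))
```

Because the table reports rounded percentages, ties such as Grok-4.2 and GPT-5.4 at 69% cannot be broken from the published figures alone.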

Four measurement axes

The release set has 60 + 60 + 40 + 80 = 240 items across the four axes. Each axis is 4-way multiple choice (chance = 25%). Release items were difficulty-filtered from oversampled pools using a 10-model development panel; the published 19-model evaluation uses this fixed set only.

Knowledge · 60 items

Closed-book finding recall

Reported result vs. direction- or magnitude-mutated distractors. No paper title or DOI in the prompt.

Mean item accuracy 73%

Falsification · 60 items

Concrete falsifier selection

Pick the specific observation that would weaken the claim — not a generic limitation or confirmatory restatement.

Mean item accuracy 70%

Overclaim · 40 items

Calibrated vs. hyped conclusion

Paired claims with identical limitation clauses; hype applied to the main clause only.

Mean item accuracy 98% (saturated; down-weighted in Truth Score)

Discrimination · 80 items

Methodological soundness

4-way choice among summaries with woven flaw types (sample size, confounding, p-value misuse, fabricated numbers).

Mean item accuracy 63% — primary hard axis

Mean item accuracy by question type (19-model panel, June 2026). Flaw discrimination is the hardest axis and the main rank separator.

Reproduce the evaluation

Load items from Hugging Face. Grade exact match on the model's final line: ANSWER=<letter>

pip install -U datasets

from datasets import load_dataset
items = load_dataset("BGPT-OFFICIAL/refute", "refute_knowledge", split="train")
# configs: refute_knowledge, refute_falsifier_choice,
#          refute_overclaim_choice, refute_discrimination_hard

Protocol: eval_protocol_mcq_v2.json · integrator guide
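A minimal sketch of the exact-match rule described above, assuming answer letters A–D (the axes are 4-way multiple choice) and that the last non-empty line of the completion must be exactly `ANSWER=<letter>`. The function names are hypothetical, and the published protocol's "robust final-answer parsing" may be more lenient than this strict version.

```python
import re

# Assumption: valid answers are the letters A-D on a line of their own.
ANSWER_RE = re.compile(r"^ANSWER=([A-D])$")


def extract_answer(completion: str):
    """Return the letter on the last non-empty line, or None if it does not match."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return None
    match = ANSWER_RE.match(lines[-1])
    return match.group(1) if match else None


def grade(completion: str, gold: str) -> bool:
    """Exact match between the parsed letter and the gold letter."""
    return extract_answer(completion) == gold


def accuracy(completions, golds) -> float:
    """Fraction of completions graded correct; unparseable outputs count as wrong."""
    pairs = list(zip(completions, golds))
    return sum(grade(c, g) for c, g in pairs) / len(pairs)


print(grade("The effect reverses in the cited cohort.\nANSWER=B", "B"))
print(extract_answer("ANSWER=B\nOn reflection, maybe C."))
```

Treating an unparseable final line as incorrect keeps grading judge-free: no model or human has to interpret free-form text to recover the intended choice.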

Validity checks & limitations (release set)

Exact-match MCQ grading on four v2 axes; no DOI or paper title in public prompts
Item selection: difficulty-filtered using a 10-model development panel before the full 19-model evaluation wave
Truth Score v2 = 20% knowledge + 20% calibration (BSS) + 25% discrimination + 15% falsifier + 5% overclaim + 15% critique skill
Overclaim axis saturated (~98% item accuracy); down-weighted to 5% in Truth Score for that reason
MCQ axes are judge-free; generative critique is only partially represented (15% skill weight) — use v1 configs for open-ended evaluation
Scope: English-language empirical papers (2025–2026 BGPT export); reasoning models use 1024-token budget + robust final-answer parsing
0 metric mismatches between live logs and published scores on release IDs

FAQ

What is REFUTE?

An open benchmark for scientific epistemics: whether language models ground claims in evidence, calibrate uncertainty, and discriminate flawed reasoning when reading recent empirical work. Built from structured extractions in the BGPT database.

Cite

@misc{bgpt_refute_v2_2026,
  title = {REFUTE: Reasoning Over Evidence Benchmark},
  author = {{BGPT Team}},
  year = {2026},
  url = {https://huggingface.co/datasets/BGPT-OFFICIAL/refute}
}
