3-OF-3 PANEL VERIFIED · R1 + R2 + R3 ROUNDS · 2026-05-27 12:45 UTC

Same weights.
Different substrate.
Seven panel-verified gains.

All comparisons: Modulum hosted vs Vanilla H100 baseline running the identical Gemma-4-31B-Q4 weights on matched H100-class clusters (5 endpoints each side). Statistical significance verified by Codex + Grok 4.3 + Gemini 3.1-pro-preview, independent then convergent across three panel rounds.

The headline RAG-grounding finding

Source: runs/rag-resistance-sf-20260524/ — 6-stack × 6-condition × N=30 paired probe. Surfaced by data-audit; R3 panel-verified 2026-05-27.

RAG present_normal · N=30 paired

Modulum: 86.7%
Vanilla: 0.0%
Delta: +86.7pp
Counts: 26/30 vs 0/30 committed
Fisher exact: p = 7.8 × 10⁻¹³
McNemar paired: b=26, c=0, p = 2.98 × 10⁻⁸
"Under matched weights and hardware, Modulum converted explicitly present context into committed answers in 86.7% of paired cases, while vanilla remained ambiguous in all 30 cases."

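The Fisher exact figure on the present_normal card can be reproduced from the counts alone. The following is a minimal sketch using only the Python standard library; the function name `fisher_exact_two_sided` is illustrative and not part of the original analysis pipeline, and the two-sided rule used here (sum every table no more likely than the observed one) is an assumption about how the panel computed it.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 committed cases fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p_obs = prob(a)
    # Small relative tolerance so tables tied with the observed one count.
    return sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9))


# present_normal: Modulum 26 committed / 4 not, Vanilla 0 committed / 30 not.
p_normal = fisher_exact_two_sided(26, 4, 0, 30)
print(f"Fisher exact p = {p_normal:.1e}")
```

With the card's counts this lands at roughly 7.8 × 10⁻¹³. Fisher's test treats the two arms as independent samples, which is why the card also reports a paired McNemar test.
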
RAG present_paraphrase · N=30 paired

Modulum: 90.0%
Vanilla: 0.0%
Delta: +90.0pp
Counts: 27/30 vs 0/30 committed
Fisher exact: p = 9.2 × 10⁻¹⁴
McNemar paired: b=27, c=0, p = 1.49 × 10⁻⁸
"On paraphrased context (same fact, different wording), Modulum commits in 90% of paired cases; vanilla 0%."

RAG present_typo · N=30 paired

Modulum: 86.7%
Vanilla: 3.3%
Delta: +83.3pp
Counts: 26/30 vs 1/30 committed
Fisher exact: p = 1.9 × 10⁻¹¹
McNemar paired: b=25, c=0, p = 5.96 × 10⁻⁸
"On typo-perturbed context, Modulum holds 86.7% commit while vanilla reaches only 3.3%: substrate-attributable robustness to surface-level noise."

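The paired McNemar p-values on the three cards above depend only on the discordant pair counts b (Modulum committed, Vanilla did not) and c (the reverse). A minimal sketch of the exact two-sided binomial form follows; the helper name `mcnemar_exact` is hypothetical, and the assumption is that the cards use the exact test rather than the chi-square approximation.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(b, c)
    # Binomial(n, 0.5) tail at or below the smaller discordant count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts as reported on the three cards.
cards = {
    "present_normal": (26, 0),
    "present_paraphrase": (27, 0),
    "present_typo": (25, 0),
}
mcnemar_p = {name: mcnemar_exact(b, c) for name, (b, c) in cards.items()}
for name, p in mcnemar_p.items():
    print(f"{name}: p = {p:.2e}")
```

With c = 0 the p-value reduces to 2 × 0.5^b, so each additional discordant pair in Modulum's favor halves it.
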
Decay slope — substrate advantage grows with context length

BABILong qa1 paired Modulum vs Vanilla on identical Gemma-4-31B-Q4 weights, N=200 per length. R4 panel verification queued.

Context | Modulum         | Vanilla         | Delta   | Fisher p   | Odds ratio
32k     | 180/200 (90.0%) | 156/200 (78.0%) | +12.0pp | 1.6 × 10⁻³ | 2.54
64k     | 154/200 (77.0%) | 125/200 (62.5%) | +14.5pp | 2.2 × 10⁻³ | 2.01
128k    | 143/200 (71.5%) | 105/200 (52.5%) | +19.0pp | 1.3 × 10⁻⁴ | 2.27
"On BABILong qa1 single-fact retrieval, the substrate advantage over the same-weights vanilla baseline grows with context length: +12pp at 32k, +14.5pp at 64k, +19pp at 128k — all Fisher exact p < 0.01."

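The Delta and Odds ratio columns of the BABILong qa1 table follow directly from the correct-answer counts. A short sketch, assuming the odds ratio is the plain sample cross-product ratio (the variable and function names are illustrative):

```python
N = 200  # items per context length, per stack

# (Modulum correct, Vanilla correct) at each context length.
rows = {"32k": (180, 156), "64k": (154, 125), "128k": (143, 105)}


def odds_ratio(mod_correct, van_correct, n):
    """Sample odds ratio: Modulum odds of a correct answer over Vanilla odds."""
    return (mod_correct * (n - van_correct)) / ((n - mod_correct) * van_correct)


summary = {}
for length, (mod, van) in rows.items():
    delta_pp = 100 * (mod - van) / N
    summary[length] = (round(delta_pp, 1), round(odds_ratio(mod, van, N), 2))
    print(f"{length}: delta = {delta_pp:+.1f}pp, OR = {summary[length][1]}")
```

The delta grows with context length while the odds ratio does not, because the odds ratio also depends on how far both stacks sit from 100%.
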
256k decay slope — graceful through Gemma-4's architectural ceiling

Modulum BABILong qa1 across all four lengths. Panel R5-verified. Decay is monotonic and gradual through the 262k ceiling.

Context length      | Modulum | N            | Drop vs prior length
32k                 | 90.0%   | 200          | n/a
64k                 | 77.0%   | 200          | −13.0pp
128k                | 71.5%   | 200          | −5.5pp
256k (near ceiling) | 59.3%   | 91 (deduped) | −12.2pp
"Modulum shows graceful long-context decay through Gemma-4's 262k architectural ceiling: 90.0% at 32k, 77.0% at 64k, 71.5% at 128k, and 59.3% at 256k on BABILong qa1."
Honest boundary disclosure: at 256k paired (N=48), Gemini 3.1 Pro 83% beats Modulum 60% (McNemar p=9.8e-4). Modulum is positioned as efficient 31B long-context capability on H100-class serving, not as a frontier-model replacement at the ceiling.
Substrate-attribution gap at 256k: a clean Vanilla H100 Q4 256k pair does not yet exist. The RULER 240k_ceiling Vanilla pair is currently running. Early result: both Modulum and Vanilla drop to 0/30 on niah_multikey_3 at 240k (no substrate advantage at the hardest ceiling subtask). Vs base Gemma-4-31B-it FP16: Modulum 59.3% vs 10.0% (Fisher p=2.3e-13). This is directional but confounded by FP16 vs Q4 and by hardware setup; treat it as deployment-stack evidence, not substrate evidence.
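The "Drop vs prior length" column and the monotonic-decay claim for Modulum on BABILong qa1 can be checked with a few lines. This is only an arithmetic sketch over the published percentages; note that the 256k point rests on N=91 deduped items, so its drop is less certain than the others.

```python
# Modulum BABILong qa1 accuracy (%) by context length, from the table.
accuracy = [("32k", 90.0), ("64k", 77.0), ("128k", 71.5), ("256k", 59.3)]

# Change in percentage points relative to the previous (shorter) length.
drops = [
    (cur_len, round(cur - prev, 1))
    for (_, prev), (cur_len, cur) in zip(accuracy, accuracy[1:])
]
is_monotonic = all(change < 0 for _, change in drops)

for length, change in drops:
    print(f"{length}: {change:+.1f}pp")
print("monotonic decay:", is_monotonic)
```
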

Decode / throughput on H100 (NVIDIA serving-density angle)

Modulum vs Vanilla on the same BABILong 128k workload, same H100-class hardware. Panel R5-verified. Must be framed as prefill-dominated (output = 1 token).

Stack   | Concurrency | Avg wall | Per-call tok/s | Cluster TPS | Cost/correct (est)
Modulum | C=1         | 240.8s   | 527.3          | 527.3       | $0.151
Vanilla | C=1         | 278.3s   | 456.2          | 456.2       | $0.198
Modulum | C=4         | 303.3s   | 418.6          | 1,612.2     | $0.227
Vanilla | C=4         | 252.8s   | 502.2          | 1,866.0     | $0.265
"On the same 128k BABILong workload at C=1, Modulum delivered 527 tok/s vs Vanilla 456 tok/s — a +15.6% effective throughput gain. Because BABILong outputs only one token, this primarily measures long-context prefill throughput, not sustained decode speed."
+15.6%: C=1 prefill throughput vs Vanilla
1,723 tok/s: prefill at 256k context, single H100 endpoint
$0.151/correct: at C=1, vs Vanilla $0.198 (−24%)
Boundary condition (panel-mandated disclosure): at C=4, Vanilla leads raw cluster TPS (1,866 vs 1,612, about 16% higher). The pitch story is NOT "Modulum always wins raw TPS"; it is long-context accuracy retained per deployed H100, especially at 128k-256k. Raw throughput does not favor Modulum at higher concurrency; the value is cost-per-correct.
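The headline percentages in this section are relative differences between table cells. A small sketch of that arithmetic, using only the values from the throughput table (the helper `pct_change` is an illustrative name):

```python
def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100 * (new - old) / old


# C=1: per-call tok/s, Modulum vs Vanilla.
c1_throughput_gain = pct_change(527.3, 456.2)
# C=1: estimated cost per correct answer, Modulum vs Vanilla.
c1_cost_change = pct_change(0.151, 0.198)
# C=4: cluster TPS, Modulum vs Vanilla (negative means Vanilla leads).
c4_cluster_gap = pct_change(1612.2, 1866.0)

print(f"C=1 throughput: {c1_throughput_gain:+.1f}%")
print(f"C=1 cost/correct: {c1_cost_change:+.1f}%")
print(f"C=4 cluster TPS: {c4_cluster_gap:+.1f}%")
```

The C=4 row is the boundary case: Modulum trails on cluster TPS there, yet its estimated cost per correct answer is still lower in the table ($0.227 vs $0.265).
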

Comparative — Modulum vs frontier at qa1 128k

Same benchmark, all stacks. Modulum 31B Q4 with substrate sits in the second tier; vanilla same-weights sits in the bottom tier.

Stack                                               | N   | Accuracy | vs Modulum
gpt-5.5 sync (frontier)                             | 50  | 96.0%    | +24.5pp
claude-opus-4-6 (frontier)                          | 50  | 96.0%    | +24.5pp
claude-opus-4-7 (frontier)                          | 50  | 92.0%    | +20.5pp
gemini-3.1-pro-preview (frontier)                   | 50  | 84.0%    | +12.5pp
modulum-gemma-4-31B-Q4 (ours)                       | 200 | 71.5%    | n/a
vanilla-gemma-4-31B-Q4 (same weights, no substrate) | 200 | 52.5%    | −19.0pp
"Modulum 31B-Q4 + substrate reaches 71.5% on qa1 @ 128k, closing most of the gap to frontier models while delivering a +19pp lift over identical vanilla weights."
Appendix only: qa3 multi-hop at 128k shows Modulum 27.0% (N=500) vs Vanilla 23.0% (N=200) — Fisher p=0.29, not statistically significant. Multi-hop substrate effects await BABILong qa2/qa3/qa5 N=200 currently in flight. Grok-4.3 collapsed to 30% on qa1 128k (N=50, external batch) — held out of main comparison per R4 panel.
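To put the qa1 128k comparison in proportion, the +19pp lift can be expressed as a share of the distance from the vanilla baseline to each frontier stack. This is a rough sketch over the table's point estimates only; the frontier rows use N=50 against N=200 for the Gemma stacks, so the shares carry wide uncertainty.

```python
MODULUM, VANILLA = 71.5, 52.5  # qa1 128k accuracy (%), N=200 each

# Frontier accuracy (%) on the same benchmark, N=50 each.
frontier = {
    "gpt-5.5 sync": 96.0,
    "claude-opus-4-6": 96.0,
    "claude-opus-4-7": 92.0,
    "gemini-3.1-pro-preview": 84.0,
}

lift = MODULUM - VANILLA
# Share of the vanilla-to-frontier gap covered by the substrate lift.
gap_share = {
    name: round(100 * lift / (acc - VANILLA), 1) for name, acc in frontier.items()
}

for name, share in gap_share.items():
    print(f"{name}: {share}% of the gap from vanilla")
```
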

The original long-context + safety findings

R1+R2 panel-verified 2026-05-27. Long-context retrieval (BABILong), prompt-injection robustness, and behavioral adherence under hallucination probes.

BABILong qa1 128k · N=200

Modulum: 71.5%
Vanilla: 52.5%
Delta: +19.0pp
Counts: 143/200 vs 105/200
Fisher exact: p = 1.3 × 10⁻⁴
Odds ratio: 2.27
"On 128k-context single-fact retrieval, Modulum delivers +19pp accuracy over vanilla on identical Gemma-4-31B-Q4 weights and matched H100-class clusters."

Injection v2 SOFT · N=60

Modulum: 45.0%
Vanilla: 21.7%
Delta: +23.3pp
Counts: 27/60 vs 13/60 TARGET_HELD
Fisher exact: p = 0.011
Odds ratio: 2.96
"Under Pattern C continuation-hijack injection, Modulum holds the target 2.1× more often than vanilla on the same weights."
Pattern C is direct-override; HARD jailbreak class shows 4-way tie at 100% — substrate has no advantage there.

Coherent refusal · N=150

Modulum: 100.0%
Vanilla: 63.3%
Delta: +36.7pp
Counts: 150/150 vs 95/150
Fisher exact: p = 1.2 × 10⁻¹⁹
Odds ratio: not finite (Modulum had zero failures)
"Under hallucination-inducing prompts, Modulum achieved 100% coherent refusal versus 63.3% baseline on identical weights."
Anchor on coherent refusal — NOT hallucination cure. Zero-hallucination delta (148→150) is p=0.498 (not significant).
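The contrast between the two claims on this card (coherent refusal is significant, the zero-hallucination delta is not) can be seen by running the same two-sided Fisher test on both count pairs. The sketch below works in log space with `math.lgamma`; the function names are illustrative, and the two-sided rule is an assumption about the panel's method.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for [[a, b], [c, d]], in log space."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)

    def prob(x):
        return exp(log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom)

    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p_obs = prob(a)
    # Relative tolerance keeps tables tied with the observed one.
    return sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-7))


# Coherent refusal: Modulum 150/150 vs Vanilla 95/150.
p_refusal = fisher_two_sided(150, 0, 95, 55)
# Zero hallucination: Modulum 150/150 vs Vanilla 148/150.
p_zero_halluc = fisher_two_sided(150, 0, 148, 2)

print(f"coherent refusal p = {p_refusal:.1e}")
print(f"zero-hallucination p = {p_zero_halluc:.3f}")
```

Two failures out of 150 on the Vanilla side are simply too few to separate from zero, which is why the card anchors on coherent refusal.
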
QUALIFIED — secondary RAG conditions

All six RAG-grounding conditions

Full 6-condition breakdown. Includes one null-result (distractor injection tie) as honest boundary disclosure.

Condition                         | Modulum       | Vanilla       | Delta   | Fisher p | McNemar p | Verdict
present_normal                    | 26/30 (86.7%) | 0/30 (0.0%)   | +86.7pp | 7.8e-13  | 2.98e-8   | APPROVE
present_paraphrase                | 27/30 (90.0%) | 0/30 (0.0%)   | +90.0pp | 9.2e-14  | 1.49e-8   | APPROVE
present_typo                      | 26/30 (86.7%) | 1/30 (3.3%)   | +83.3pp | 1.9e-11  | 5.96e-8   | APPROVE
present_contradicted              | 18/30 (60.0%) | 5/30 (16.7%)  | +43.3pp | 1.2e-3   | 2.44e-4   | QUALIFIED
absent (refusal)                  | 10/30 (33.3%) | 0/30 (0.0%)   | +33.3pp | 8.0e-4   | 1.95e-3   | QUALIFIED
present_with_distractor_injection | 28/30 (93.3%) | 28/30 (93.3%) | +0.0pp  | 1.000    | 1.000     | BOUNDARY (no substrate effect)
Hallucination floor: Modulum, Vanilla, Anthropic, OpenAI, Google all 0/30 false-positive (committed when absent). Only xAI/Grok 1/30. Substrate gain is in GROUNDED COMMITMENT, not hallucination suppression.
QUALIFIED — REQUIRES AMORTIZATION CAVEAT

Cost-per-correct economics

Apples-to-apples sync-tier comparison; Modulum at est. $0.30/M amortized H100, BABILong qa1-qa3 128k workload.

~24×: cheaper than Opus 4.6 sync
~4×: cheaper than GPT-5.5 sync
$0.093: per correct answer (est.)
"Under full-utilization amortized H100 assumptions, Modulum estimates ~24× lower cost-per-correct vs Opus 4.6 sync and ~4× vs GPT-5.5 sync on this benchmark."
Mandatory disclosures on slide:
DO NOT CITE

Findings rejected for the deck

Rejected unanimously by all three panelists in R1. Gemini: "would severely undermine analytical credibility." Codex: "Would read as cherry-picking."

Finding                          | Modulum | Vanilla | Delta   | p-value | Status
Halluc zero-hallucination        | 150/150 | 148/150 | +1.3pp  | 0.498   | NOT SIG
Throughput v3 C=1 accuracy       | 21/50   | 16/50   | +10.0pp | 0.408   | NOT SIG · N too small
Throughput v3 C=4 accuracy       | 14/50   | 12/50   | +4.0pp  | 0.820   | NOT SIG
HotpotQA multidoc N=50 (paired)  | 44/50   | 39/50   | +10.0pp | 0.180   | HOLD (N=200 in flight)
12 ACTIVE H100 JOBS — landing this hour

In-flight runs

Run                                                        | Stacks            | N
RULER 240k_ceiling at Gemma-4's architectural max          | Modulum + Vanilla | N=30 each, 4 subtasks
InfiniteBench 6 subtasks (passkey, kv_retrieval, etc.)     | Modulum + Vanilla | N=30 each
LongBench v2 multi-subtask cross-domain                    | Modulum + Vanilla | N=30 each
BABILong qa5 multi-hop substrate-attribution               | Modulum + Vanilla | N=200 each
BABILong qa2, qa3 Vanilla pairs (multi-hop)                | Vanilla           | N=200 each
HotpotQA multidoc N=200 closes p=0.18 → expected p<0.05    | Modulum + Vanilla | N=200 each

Methodology

Panel verification trail — auditable, 3 rounds × 3 models:

R1 Codex: 5ca77f38-e08e-475f-bbed-f5f46905714d
R1 Grok 4.3: f72850b8-0426-4940-b3d8-606d7fb13869
R1 Gemini 3.1-pro-preview: Google API

R2 Codex: 7b65587d-1462-4e33-b8ba-668b7bfac5ab
R2 Grok 4.3: 2058bd66-ed22-4b9d-b77b-156686b90cac
R2 Gemini 3.1-pro-preview: VERIFIED unchanged

R3 Codex: dispatch logged
R3 Grok 4.3: 83650b4e-cec2-4892-a2e9-0e6425d5a802
R3 Gemini 3.1-pro-preview: Google API

All three rounds converged unanimously: 7 APPROVE_FOR_DECK + 3 QUALIFIED + 4 DO_NOT_USE.
CXDB persistent record: modulum-bench:d6c740ff56de