3-OF-3 PANEL VERIFIED
R1 + R2 + R3 ROUNDS
2026-05-27 12:45 UTC
Same weights.
Different substrate.
Seven panel-verified gains.
All comparisons: Modulum hosted vs Vanilla H100 baseline running the identical Gemma-4-31B-Q4 weights on matched H100-class clusters (5 endpoints each side). Statistical significance verified by Codex + Grok 4.3 + Gemini 3.1-pro-preview, independent then convergent across three panel rounds.
The headline RAG-grounding finding
Source: runs/rag-resistance-sf-20260524/ — 6-stack × 6-condition × N=30 paired probe. Surfaced by data-audit; R3 panel-verified 2026-05-27.
RAG present_normal · N=30 paired
+86.7pp delta
Counts: 26/30 vs 0/30 committed
Fisher exact: p = 7.8 × 10⁻¹³
McNemar paired: b=26, c=0, p = 2.98 × 10⁻⁸
"Under matched weights and hardware, Modulum converted explicitly present context into committed answers in 86.7% of paired cases, while vanilla remained ambiguous in all 30 cases."
RAG present_paraphrase · N=30 paired
+90.0pp delta
Counts: 27/30 vs 0/30 committed
Fisher exact: p = 9.2 × 10⁻¹⁴
McNemar paired: b=27, c=0, p = 1.49 × 10⁻⁸
"On paraphrased context (same fact, different wording), Modulum commits in 90% of paired cases; vanilla 0%."
RAG present_typo · N=30 paired
+83.3pp delta
Counts: 26/30 vs 1/30 committed
Fisher exact: p = 1.9 × 10⁻¹¹
McNemar paired: b=25, c=0, p = 5.96 × 10⁻⁸
"On typo-perturbed context, Modulum holds its 86.7% commit rate while vanilla commits in only 3.3% of cases: substrate-attributable robustness to surface-level noise."
Decay slope — substrate advantage grows with context length
BABILong qa1 paired Modulum vs Vanilla on identical Gemma-4-31B-Q4 weights, N=200 per length. R4 panel verification queued.
| Context | Modulum | Vanilla | Delta | Fisher p | Odds ratio |
| 32k | 180/200 (90.0%) | 156/200 (78.0%) | +12.0pp | 1.6 × 10⁻³ | 2.54 |
| 64k | 154/200 (77.0%) | 125/200 (62.5%) | +14.5pp | 2.2 × 10⁻³ | 2.01 |
| 128k | 143/200 (71.5%) | 105/200 (52.5%) | +19.0pp | 1.3 × 10⁻⁴ | 2.27 |
"On BABILong qa1 single-fact retrieval, the substrate advantage over the same-weights vanilla baseline grows with context length: +12pp at 32k, +14.5pp at 64k, +19pp at 128k — all Fisher exact p < 0.01."
Comparative — Modulum vs frontier at qa1 128k
Same benchmark, all stacks. Modulum 31B Q4 with substrate sits in the second tier; vanilla same-weights sits in the bottom tier.
| Stack | N | Accuracy | vs Modulum |
| gpt-5.5 sync (frontier) | 50 | 96.0% | +24.5pp |
| claude-opus-4-6 (frontier) | 50 | 96.0% | +24.5pp |
| claude-opus-4-7 (frontier) | 50 | 92.0% | +20.5pp |
| gemini-3.1-pro-preview (frontier) | 50 | 84.0% | +12.5pp |
| modulum-gemma-4-31B-Q4 (ours) | 200 | 71.5% | — |
| vanilla-gemma-4-31B-Q4 (same weights, no substrate) | 200 | 52.5% | −19.0pp |
"Modulum 31B-Q4 + substrate reaches 71.5% on qa1 @ 128k, a +19pp lift over identical vanilla weights that closes 19 of the 43.5 points between vanilla (52.5%) and the top frontier models (96.0%)."
Appendix only: qa3 multi-hop at 128k shows Modulum 27.0% (N=500) vs Vanilla 23.0% (N=200) — Fisher p=0.29, not statistically significant. Multi-hop substrate effects await BABILong qa2/qa3/qa5 N=200 currently in flight. Grok-4.3 collapsed to 30% on qa1 128k (N=50, external batch) — held out of main comparison per R4 panel.
The original long-context + safety findings
R1+R2 panel-verified 2026-05-27. Long-context retrieval (BABILong), prompt-injection robustness, and behavioral adherence under hallucination probes.
BABILong qa1 128k · N=200
+19.0pp delta
Counts: 143/200 vs 105/200
Fisher exact: p = 1.3 × 10⁻⁴
Odds ratio: 2.27
"On 128k-context single-fact retrieval, Modulum delivers +19pp accuracy over vanilla on identical Gemma-4-31B-Q4 weights and matched H100-class clusters."
Injection v2 SOFT · N=60
+23.3pp delta
Counts: 27/60 vs 13/60 TARGET_HELD
Fisher exact: p = 0.011
Odds ratio: 2.96
"Under Pattern C continuation-hijack injection, Modulum holds the target 2.1× more often than vanilla on the same weights."
Pattern C is direct-override; HARD jailbreak class shows 4-way tie at 100% — substrate has no advantage there.
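The Fisher exact p-values on these cards can be recomputed from the raw counts. The sketch below is a minimal standard-library version of the two-sided test, which sums the probability of every 2×2 table that is no more likely than the observed one. It is an illustration and not the pipeline's own code; the function name `fisher_exact_two_sided` is hypothetical, and small differences from the reported values are possible if the original runs used a different two-sided convention.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows are systems (e.g. Modulum, Vanilla); columns are success / failure.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability of x successes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [pmf(x) for x in range(lo, hi + 1)]
    p_obs = pmf(a)
    # Small relative tolerance so tables tied with the observed one are included.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))

# Injection v2 SOFT: 27/60 vs 13/60 TARGET_HELD
print(f"injection: p = {fisher_exact_two_sided(27, 33, 13, 47):.3g}")
# BABILong qa1 128k: 143/200 vs 105/200
print(f"qa1 128k:  p = {fisher_exact_two_sided(143, 57, 105, 95):.3g}")
```

The full hypergeometric sum is cheap at these sample sizes, so no approximation is needed.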
Coherent refusal · N=150
+36.7pp delta
Counts: 150/150 vs 95/150
Fisher exact: p = 1.2 × 10⁻¹⁹
Odds ratio: ∞
"Under hallucination-inducing prompts, Modulum achieved 100% coherent refusal versus 63.3% baseline on identical weights."
Anchor on coherent refusal — NOT hallucination cure. Zero-hallucination delta (148→150) is p=0.498 (not significant).
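The delta and odds-ratio figures on the three cards above follow directly from the counts. As a quick arithmetic check, the sketch below recomputes both; the helper names `delta_pp` and `odds_ratio` are hypothetical and are not taken from the benchmark code.

```python
def delta_pp(k1: int, n1: int, k2: int, n2: int) -> float:
    """Difference in success rate, in percentage points (system 1 minus system 2)."""
    return 100.0 * (k1 / n1 - k2 / n2)

def odds_ratio(k1: int, n1: int, k2: int, n2: int) -> float:
    """Sample odds ratio of success, system 1 relative to system 2."""
    num = k1 * (n2 - k2)    # successes of 1 times failures of 2
    den = (n1 - k1) * k2    # failures of 1 times successes of 2
    if den == 0:
        # No failures for system 1 (or no successes for system 2).
        return float("inf") if num > 0 else float("nan")
    return num / den

cards = [
    ("BABILong qa1 128k", 143, 200, 105, 200),
    ("Injection v2 SOFT", 27, 60, 13, 60),
    ("Coherent refusal", 150, 150, 95, 150),
]
for name, k1, n1, k2, n2 in cards:
    print(f"{name}: {delta_pp(k1, n1, k2, n2):+.1f}pp, OR = {odds_ratio(k1, n1, k2, n2):.2f}")
```

The infinite odds ratio on the coherent-refusal card comes from Modulum having zero failures, so that figure says nothing about magnitude beyond the counts themselves.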
QUALIFIED — secondary RAG conditions
All six RAG-grounding conditions
Full 6-condition breakdown. Includes one null result (the distractor-injection tie) as an honest boundary disclosure.
| Condition | Modulum | Vanilla | Delta | Fisher p | McNemar p | Verdict |
| present_normal | 26/30 (86.7%) | 0/30 (0.0%) | +86.7pp | 7.8e-13 | 2.98e-8 | APPROVE |
| present_paraphrase | 27/30 (90.0%) | 0/30 (0.0%) | +90.0pp | 9.2e-14 | 1.49e-8 | APPROVE |
| present_typo | 26/30 (86.7%) | 1/30 (3.3%) | +83.3pp | 1.9e-11 | 5.96e-8 | APPROVE |
| present_contradicted | 18/30 (60.0%) | 5/30 (16.7%) | +43.3pp | 1.2e-3 | 2.44e-4 | QUALIFIED |
| absent (refusal) | 10/30 (33.3%) | 0/30 (0.0%) | +33.3pp | 8.0e-4 | 1.95e-3 | QUALIFIED |
| present_with_distractor_injection | 28/30 (93.3%) | 28/30 (93.3%) | +0.0pp | 1.000 | 1.000 | BOUNDARY (no substrate effect) |
Hallucination floor: Modulum, Vanilla, Anthropic, OpenAI, and Google all recorded 0/30 false positives (committed when absent); only xAI/Grok recorded 1/30. Substrate gain is in GROUNDED COMMITMENT, not hallucination suppression.
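The McNemar column can be reproduced from the discordant-pair counts reported for the three APPROVE conditions (b = pairs where only Modulum committed, c = pairs where only vanilla committed). The following is a minimal sketch of the exact two-sided test using only the standard library; the function name `mcnemar_exact` is hypothetical and the code is illustrative, not the apparatus itself.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    Under the null hypothesis each discordant pair is a fair coin flip,
    so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

for condition, b, c in [
    ("present_normal", 26, 0),
    ("present_paraphrase", 27, 0),
    ("present_typo", 25, 0),
]:
    print(f"{condition}: p = {mcnemar_exact(b, c):.3g}")
```

With c = 0 the p-value reduces to 2 × 0.5^b, which is why the three values differ by exact factors of two.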
QUALIFIED — REQUIRES AMORTIZATION CAVEAT
Cost-per-correct economics
Apples-to-apples sync-tier comparison; Modulum at est. $0.30/M amortized H100, BABILong qa1-qa3 128k workload.
~24× cheaper than Opus 4.6 sync
~4× cheaper than GPT-5.5 sync
$0.093 per correct answer (est.)
"Under full-utilization amortized H100 assumptions, Modulum estimates ~24× lower cost-per-correct vs Opus 4.6 sync and ~4× vs GPT-5.5 sync on this benchmark."
Mandatory disclosures on slide:
- $0.30/M is an amortization estimate at an assumed 100% H100 utilization, not realized customer pricing
- Hypernym partner pricing TBD
- Compared against published 2026-05 sync-tier rates; batch pricing excluded
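For readers who want to see how a cost-per-correct figure is assembled, here is a minimal sketch. It assumes that "$0.30/M" means dollars per million tokens and that cost per query is dominated by prompt tokens; both are assumptions, as are the function name and the 128,000-token query size. The example inputs are illustrative only and are not expected to reproduce the $0.093 estimate, which covers the mixed qa1-qa3 workload.

```python
def cost_per_correct(price_per_million_tokens: float,
                     tokens_per_query: int,
                     accuracy: float) -> float:
    """Expected dollars spent per correct answer.

    cost per query = price per million tokens * tokens per query / 1e6
    cost per correct = cost per query / accuracy
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_query = price_per_million_tokens * tokens_per_query / 1_000_000
    return cost_per_query / accuracy

# Hypothetical inputs: the amortized price from this section, one 128k-token
# prompt per query (assumption), and the qa1 128k accuracy of 71.5%.
example = cost_per_correct(0.30, 128_000, 0.715)
print(f"illustrative cost per correct answer: ${example:.4f}")
```

Because accuracy sits in the denominator, the estimate is sensitive to which subtasks are included in the workload, which is one reason the amortization caveat matters.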
DO NOT CITE
Findings rejected for the deck
Rejected unanimously by all 3 panelists in R1. Gemini: "would severely undermine analytical credibility." Codex: "Would read as cherry-picking."
| Finding | Modulum | Vanilla | Delta | p-value | Status |
| Hallucination probes, zero-hallucination | 150/150 | 148/150 | +1.3pp | 0.498 | NOT SIG |
| Throughput v3 C=1 accuracy | 21/50 | 16/50 | +10.0pp | 0.408 | NOT SIG · N too small |
| Throughput v3 C=4 accuracy | 14/50 | 12/50 | +4.0pp | 0.820 | NOT SIG |
| HotpotQA multidoc N=50 (paired) | 44/50 | 39/50 | +10.0pp | 0.180 | HOLD — N=200 in flight |
12 ACTIVE H100 JOBS — landing this hour
In-flight runs
| Run | Stacks | Sample size |
| RULER 240k_ceiling at Gemma-4's architectural max | Modulum + Vanilla | N=30 each, 4 subtasks |
| InfiniteBench 6 subtasks (passkey, kv_retrieval, etc.) | Modulum + Vanilla | N=30 each |
| LongBench v2 multi-subtask cross-domain | Modulum + Vanilla | N=30 each |
| BABILong qa5 multi-hop substrate-attribution | Modulum + Vanilla | N=200 each |
| BABILong qa2, qa3 Vanilla pairs (multi-hop) | Vanilla | N=200 each |
| HotpotQA multidoc N=200 (follow-up to the N=50 result at p=0.18; p<0.05 expected only if the effect size holds) | Modulum + Vanilla | N=200 each |
Methodology
- Modulum vs Vanilla on identical Gemma-4-31B-Q4 weights → paired comparison where indices align
- Paired McNemar exact for sample-aligned binary outcomes; Fisher exact for aggregate proportions
- Wilson 95% confidence intervals on all proportions
- Independent classification (committed / refused / ambiguous) for RAG-resistance set
- Substring match scoring for BABILong; temperature = 0 throughout
- Hardware: 5× H100-class endpoints per cluster (Modulum us-west1-b + us-east4-a; Vanilla us-central1-a + us-east4-a), slot=2, random round-robin
- Apparatus reproducible: research/tracks/modulum-bench/run_*.py + runs/rag-resistance-sf-20260524/
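The Wilson 95% intervals mentioned in the list above are not shown on the cards, so the following sketch illustrates how one could be computed for any reported proportion. It uses the standard Wilson score formula with only the standard library; the function name `wilson_interval` is hypothetical and this is not the apparatus code.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

for label, k, n in [
    ("Modulum qa1 128k", 143, 200),
    ("Vanilla qa1 128k", 105, 200),
    ("Vanilla RAG present_normal", 0, 30),
]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} -> [{100 * lo:.1f}%, {100 * hi:.1f}%]")
```

Unlike the normal-approximation interval, the Wilson interval stays informative at 0/30, where it gives a nonzero upper bound and does not collapse to a point.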
Panel verification trail — auditable, 3 rounds × 3 models:
R1 Codex: 5ca77f38-e08e-475f-bbed-f5f46905714d
R1 Grok 4.3: f72850b8-0426-4940-b3d8-606d7fb13869
R1 Gemini 3.1-pro-preview: Google API
R2 Codex: 7b65587d-1462-4e33-b8ba-668b7bfac5ab
R2 Grok 4.3: 2058bd66-ed22-4b9d-b77b-156686b90cac
R2 Gemini 3.1-pro-preview: VERIFIED unchanged
R3 Codex: dispatch logged
R3 Grok 4.3: 83650b4e-cec2-4892-a2e9-0e6425d5a802
R3 Gemini 3.1-pro-preview: Google API
All three rounds converged unanimously: 7 APPROVE_FOR_DECK + 3 QUALIFIED + 4 DO_NOT_USE.
CXDB persistent record: modulum-bench:d6c740ff56de