3-OF-3 PANEL VERIFIED R1 + R2 + R3 ROUNDS 2026-05-27 12:45 UTC

Same weights.
Different substrate.
Seven panel-verified gains.

All comparisons: Modulum hosted vs Vanilla H100 baseline running the identical Gemma-4-31B-Q4 weights on matched H100-class clusters (5 endpoints each side). Statistical significance verified by Codex + Grok 4.3 + Gemini 3.1-pro-preview, independent then convergent across three panel rounds.

The headline RAG-grounding finding

Source: runs/rag-resistance-sf-20260524/ — 6-stack × 6-condition × N=30 paired probe. Surfaced by data-audit; R3 panel-verified 2026-05-27.

RAG present_normal · N=30 paired

Modulum: 86.7%
Vanilla: 0.0%
Delta: +86.7pp
Counts: 26/30 vs 0/30 committed
Fisher exact: p = 7.8 × 10⁻¹³
McNemar paired: b=26, c=0, p = 2.98 × 10⁻⁸

"Under matched weights and hardware, Modulum converted explicitly present context into committed answers in 86.7% of paired cases, while vanilla remained ambiguous in all 30 cases."

RAG present_paraphrase · N=30 paired

Modulum: 90.0%
Vanilla: 0.0%
Delta: +90.0pp
Counts: 27/30 vs 0/30 committed
Fisher exact: p = 9.2 × 10⁻¹⁴
McNemar paired: b=27, c=0, p = 1.49 × 10⁻⁸

"On paraphrased context (same fact, different wording), Modulum commits in 90% of paired cases; vanilla 0%."

RAG present_typo · N=30 paired

Modulum: 86.7%
Vanilla: 3.3%
Delta: +83.3pp
Counts: 26/30 vs 1/30 committed
Fisher exact: p = 1.9 × 10⁻¹¹
McNemar paired: b=25, c=0, p = 5.96 × 10⁻⁸

"On typo-perturbed context, Modulum holds 86.7% commit; vanilla collapses to 3.3% — substrate-attributable robustness to surface-level noise."

Decay slope — substrate advantage grows with context length

BABILong qa1 paired Modulum vs Vanilla on identical Gemma-4-31B-Q4 weights, N=200 per length. R4 panel verification queued.

Context	Modulum	Vanilla	Delta	Fisher p	Odds ratio
32k	180/200 (90.0%)	156/200 (78.0%)	+12.0pp	1.6 × 10⁻³	2.54
64k	154/200 (77.0%)	125/200 (62.5%)	+14.5pp	2.2 × 10⁻³	2.01
128k	143/200 (71.5%)	105/200 (52.5%)	+19.0pp	1.3 × 10⁻⁴	2.27
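
The delta and odds-ratio columns above can be recomputed from the raw counts. The sketch below is illustrative only: the function name `odds_ratio` and the `babilong` dictionary are introduced here, not taken from the benchmark apparatus, and it uses the plain sample odds ratio with no continuity correction.

```python
def odds_ratio(a_succ: int, a_n: int, b_succ: int, b_n: int) -> float:
    """Sample odds ratio of arm A vs arm B from success counts and totals."""
    a_fail = a_n - a_succ
    b_fail = b_n - b_succ
    return (a_succ * b_fail) / (a_fail * b_succ)

# (Modulum correct, Vanilla correct) out of N=200 per context length,
# copied from the table above.
babilong = {"32k": (180, 156), "64k": (154, 125), "128k": (143, 105)}

for length, (modulum, vanilla) in babilong.items():
    delta_pp = 100 * (modulum - vanilla) / 200
    ratio = odds_ratio(modulum, 200, vanilla, 200)
    print(f"{length}: {delta_pp:+.1f}pp, odds ratio {ratio:.2f}")
```

The odds ratio compares odds (correct / incorrect), so it is not the same quantity as the ratio of the two accuracy percentages.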

"On BABILong qa1 single-fact retrieval, the substrate advantage over the same-weights vanilla baseline grows with context length: +12pp at 32k, +14.5pp at 64k, +19pp at 128k — all Fisher exact p < 0.01."

NEW — Long-context request reliability (panel R6 just-landed)

HotpotQA multi-doc N=200 paired (Modulum vs Vanilla, same Gemma weights, same H100 hardware). Panel R6 3-of-3 unanimous: this is a reliability/SLA finding, not accuracy. Critical for NVIDIA serving-density story.

Reliability — successful request yield · N=200

Modulum: 68.0%
Vanilla: 47.0%
Delta: +21.0pp reliability
Modulum succeeded: 136/200
Vanilla succeeded: 94/200
Workload: HotpotQA multi-doc + contradiction
Hardware: same H100 cluster, same weights

"On identical H100 hardware, Modulum delivered 68% successful long-context responses versus Vanilla's 47% — a +21pp reliability gain on the same workload and same Gemma-4-31B-Q4 weights."

Root-cause caveat (panel-mandated): exact failure modes (OOM vs timeout vs context-length truncation) not yet isolated. Frame as "observed request-success rate at the client boundary" — do not claim "OOM resistance" or "serving-density improvement" as proven mechanism.

Accuracy parity control · successful-only N=65

Modulum: 83.1%
Vanilla: 80.0%
Delta: +3.1pp (not significant)
Counts: 54/65 vs 52/65
McNemar paired: b=2, c=0, p = 0.50
Fisher exact: p = 0.82

"When both calls succeed, accuracy is statistically equivalent (~80–83%) — the substrate does NOT improve answer quality on long-context multi-doc, only the rate at which the call completes successfully."

Why this matters: this is the accuracy-degradation control. It shows no detectable loss of answer quality accompanying the +21pp reliability win (McNemar p = 0.50 on N=65).

APPENDIX ONLY — end-to-end task success, not accuracy

Decomposition: raw paired result for reproducibility

The +20pp raw delta (errors counted as wrong) is publishable only as decomposed "end-to-end task success", not as an accuracy claim, per all three panelists in panel R6.

Interpretation	Modulum	Vanilla	Delta	Test	p-value	Use
C — Reliability (call success)	136/200 (68.0%)	94/200 (47.0%)	+21.0pp	—	—	HEADLINE
B — Accuracy (succ-only)	54/65 (83.1%)	52/65 (80.0%)	+3.1pp	McNemar	0.50	Control (no degradation)
A — End-to-end task success (raw)	119/200 (59.5%)	79/200 (39.5%)	+20.0pp	McNemar	4.5e-5	Appendix only (= reliability × accuracy)
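
As a rough check on the decomposition, the sketch below recomputes the three deltas from the counts in the table and compares row A with the product of rows C and B. The dictionary layout and helper names are illustrative. The product lands close to, but not exactly on, the raw row-A rates; one possible reason (an assumption, not stated in the source) is that accuracy is measured only on the 65 successful-only pairs.

```python
# (successes, total) for Modulum and Vanilla, copied from the table above.
rows = {
    "C reliability": ((136, 200), (94, 200)),
    "B accuracy (succ-only)": ((54, 65), (52, 65)),
    "A end-to-end (raw)": ((119, 200), (79, 200)),
}

def rate(count: tuple) -> float:
    """Proportion as a percentage."""
    successes, total = count
    return 100 * successes / total

def delta_pp(modulum: tuple, vanilla: tuple) -> float:
    """Percentage-point difference, Modulum minus Vanilla."""
    return rate(modulum) - rate(vanilla)

for name, (modulum, vanilla) in rows.items():
    print(f"{name}: {delta_pp(modulum, vanilla):+.1f}pp")

# Product of reliability and successful-only accuracy, per stack.
modulum_product = rate((136, 200)) * rate((54, 65)) / 100
vanilla_product = rate((94, 200)) * rate((52, 65)) / 100
print(f"Modulum product {modulum_product:.1f}% vs raw {rate((119, 200)):.1f}%")
print(f"Vanilla product {vanilla_product:.1f}% vs raw {rate((79, 200)):.1f}%")
```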

256k decay slope — graceful through Gemma-4's architectural ceiling

Modulum BABILong qa1 across all four lengths. Panel R5-verified. Decay is monotonic and gradual through the 262k ceiling.

Context length	Modulum	N	Drop vs prior length
32k	90.0%	200	—
64k	77.0%	200	−13.0pp
128k	71.5%	200	−5.5pp
256k (near ceiling)	59.3%	91 (deduped)	−12.2pp
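
The "Drop vs prior length" column is a first difference over the accuracy series. A minimal sketch, with the variable names chosen here for illustration:

```python
# Modulum BABILong qa1 accuracy (%) by context length, from the table above.
accuracy = [("32k", 90.0), ("64k", 77.0), ("128k", 71.5), ("256k", 59.3)]

# Difference between each length and the one before it, in percentage points.
drops = [
    (current_len, round(current_acc - prior_acc, 1))
    for (_, prior_acc), (current_len, current_acc) in zip(accuracy, accuracy[1:])
]

# Monotonic decay means every step is a decrease.
is_monotonic = all(drop < 0 for _, drop in drops)

for length, drop in drops:
    print(f"{length}: {drop:+.1f}pp")
print("monotonic decay:", is_monotonic)
```

Note that the 256k point rests on N=91 rather than N=200, so its step carries more sampling uncertainty than the others.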

"Modulum shows graceful long-context decay through Gemma-4's 262k architectural ceiling: 90.0% at 32k, 77.0% at 64k, 71.5% at 128k, and 59.3% at 256k on BABILong qa1."

Honest boundary disclosure: at 256k paired (N=48), Gemini 3.1 Pro 83% beats Modulum 60% (McNemar p=9.8e-4). Modulum is positioned as efficient 31B long-context capability on H100-class serving, not as a frontier-model replacement at the ceiling.

Substrate-attribution gap at 256k: a clean Vanilla H100 Q4 256k pair does not yet exist. The RULER 240k_ceiling Vanilla pair is currently running; early result: both Modulum and Vanilla score 0/30 on niah_multikey_3 at 240k (no substrate advantage at the hardest ceiling subtask). Against base Gemma-4-31B-it FP16, Modulum scores 59.3% vs 10.0% (Fisher p=2.3e-13). This is directional but confounded by FP16 vs Q4 and by hardware setup, so treat it as deployment-stack evidence, not substrate evidence.

Decode / throughput on H100 (NVIDIA serving-density angle)

Modulum vs Vanilla on the same BABILong 128k workload, same H100-class hardware. Panel R5-verified. Must be framed as prefill-dominated (output = 1 token).

Stack	Concurrency	Avg wall	Per-call tok/s	Cluster TPS	Cost/correct (est)
Modulum	C=1	240.8s	527.3	527.3	$0.151
Vanilla	C=1	278.3s	456.2	456.2	$0.198
Modulum	C=4	303.3s	418.6	1,612.2	$0.227
Vanilla	C=4	252.8s	502.2	1,866.0	$0.265

"On the same 128k BABILong workload at C=1, Modulum delivered 527 tok/s vs Vanilla 456 tok/s — a +15.6% effective throughput gain. Because BABILong outputs only one token, this primarily measures long-context prefill throughput, not sustained decode speed."

+15.6%: C=1 prefill throughput vs Vanilla
1,723 tok/s: prefill at 256k context, single H100 endpoint
$0.151 per correct at C=1, vs Vanilla $0.198 (−24%)

Boundary condition (panel-mandated disclosure): at C=4, Vanilla leads raw cluster TPS (1,866 vs 1,612). The pitch story is NOT "Modulum always wins raw TPS"; it is long-context accuracy retained per deployed H100, especially at 128k-256k. Modulum holds no raw-throughput advantage at higher concurrency; the value is cost-per-correct.
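
The headline percentages in this section are simple relative changes over the table values. The sketch below recomputes them; `pct_change` is a helper defined here for illustration, and the inputs are copied from the throughput table.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return 100 * (new / base - 1)

# C=1 per-call throughput (tok/s): Modulum vs Vanilla.
c1_gain = pct_change(527.3, 456.2)

# C=4 cluster TPS: Modulum vs Vanilla (negative means Vanilla leads).
c4_gap = pct_change(1612.2, 1866.0)

# Estimated cost per correct answer at C=1: Modulum vs Vanilla.
cost_change = pct_change(0.151, 0.198)

print(f"C=1 throughput: {c1_gain:+.1f}%")
print(f"C=4 cluster TPS: {c4_gap:+.1f}%")
print(f"C=1 cost per correct: {cost_change:+.0f}%")
```

Because the workload emits a single output token, these figures describe prefill-dominated throughput only.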

Comparative — Modulum vs frontier at qa1 128k

Same benchmark, all stacks. Modulum 31B Q4 with substrate sits in the second tier; vanilla same-weights sits in the bottom tier.

Stack	N	Accuracy	vs Modulum
gpt-5.5 sync (frontier)	50	96.0%	+24.5pp
claude-opus-4-6 (frontier)	50	96.0%	+24.5pp
claude-opus-4-7 (frontier)	50	92.0%	+20.5pp
gemini-3.1-pro-preview (frontier)	50	84.0%	+12.5pp
modulum-gemma-4-31B-Q4 (ours)	200	71.5%	—
vanilla-gemma-4-31B-Q4 (same weights, no substrate)	200	52.5%	−19.0pp

"Modulum 31B-Q4 + substrate reaches 71.5% on qa1 @ 128k, a +19pp lift over identical vanilla weights that closes 19 of the 43.5 points separating vanilla from the top frontier score (96.0%)."

Appendix only: qa3 multi-hop at 128k shows Modulum 27.0% (N=500) vs Vanilla 23.0% (N=200) — Fisher p=0.29, not statistically significant. Multi-hop substrate effects await BABILong qa2/qa3/qa5 N=200 currently in flight. Grok-4.3 collapsed to 30% on qa1 128k (N=50, external batch) — held out of main comparison per R4 panel.

The original long-context + safety findings

R1+R2 panel-verified 2026-05-27. Long-context retrieval (BABILong), prompt-injection robustness, and behavioral adherence under hallucination probes.

BABILong qa1 128k · N=200

Modulum: 71.5%
Vanilla: 52.5%
Delta: +19.0pp
Counts: 143/200 vs 105/200
Fisher exact: p = 1.3 × 10⁻⁴
Odds ratio: 2.27

"On 128k-context single-fact retrieval, Modulum delivers +19pp accuracy over vanilla on identical Gemma-4-31B-Q4 weights and matched H100-class clusters."

Injection v2 SOFT · N=60

Modulum: 45.0%
Vanilla: 21.7%
Delta: +23.3pp
Counts: 27/60 vs 13/60 TARGET_HELD
Fisher exact: p = 0.011
Odds ratio: 2.96

"Under Pattern C continuation-hijack injection, Modulum holds the target 2.1× more often than vanilla on the same weights."

Pattern C is direct-override; HARD jailbreak class shows 4-way tie at 100% — substrate has no advantage there.
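
The card reports two different ratios: the "2.1× more often" figure is a ratio of hold rates, while 2.96 is an odds ratio. A small sketch of the distinction, using the TARGET_HELD counts above (the function names are illustrative):

```python
import math

def rate_ratio(a_succ: int, a_n: int, b_succ: int, b_n: int) -> float:
    """Ratio of success rates (how many times more often A succeeds)."""
    return (a_succ / a_n) / (b_succ / b_n)

def odds_ratio(a_succ: int, a_n: int, b_succ: int, b_n: int) -> float:
    """Ratio of success odds; infinite when arm A never fails."""
    a_fail, b_fail = a_n - a_succ, b_n - b_succ
    if a_fail == 0 or b_succ == 0:
        return math.inf
    return (a_succ * b_fail) / (a_fail * b_succ)

# Injection v2 SOFT: 27/60 (Modulum) vs 13/60 (Vanilla) held the target.
print(f"rate ratio: {rate_ratio(27, 60, 13, 60):.1f}x")
print(f"odds ratio: {odds_ratio(27, 60, 13, 60):.2f}")
```

The same `odds_ratio` helper returns infinity for a 150/150 vs 95/150 split, which is why an odds ratio of ∞ carries no magnitude information on its own.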

Coherent refusal · N=150

Modulum: 100.0%
Vanilla: 63.3%
Delta: +36.7pp
Counts: 150/150 vs 95/150
Fisher exact: p = 1.2 × 10⁻¹⁹
Odds ratio: ∞

"Under hallucination-inducing prompts, Modulum achieved 100% coherent refusal versus 63.3% baseline on identical weights."

Anchor on coherent refusal — NOT hallucination cure. Zero-hallucination delta (148→150) is p=0.498 (not significant).

QUALIFIED — secondary RAG conditions

All six RAG-grounding conditions

Full 6-condition breakdown. Includes one null-result (distractor injection tie) as honest boundary disclosure.

Condition	Modulum	Vanilla	Delta	Fisher p	McNemar p	Verdict
present_normal	26/30 (86.7%)	0/30 (0.0%)	+86.7pp	7.8e-13	2.98e-8	APPROVE
present_paraphrase	27/30 (90.0%)	0/30 (0.0%)	+90.0pp	9.2e-14	1.49e-8	APPROVE
present_typo	26/30 (86.7%)	1/30 (3.3%)	+83.3pp	1.9e-11	5.96e-8	APPROVE
present_contradicted	18/30 (60.0%)	5/30 (16.7%)	+43.3pp	1.2e-3	2.44e-4	QUALIFIED
absent (refusal)	10/30 (33.3%)	0/30 (0.0%)	+33.3pp	8.0e-4	1.95e-3	QUALIFIED
present_with_distractor_injection	28/30 (93.3%)	28/30 (93.3%)	+0.0pp	1.000	1.000	BOUNDARY (no substrate effect)
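
The Fisher p column above can be reproduced with a two-sided exact test on each 2×2 table of committed vs not-committed counts. The sketch below is a standard-library version written for illustration; it is not the code in the benchmark apparatus. It uses the common "sum of all tables no more probable than the observed one" definition of the two-sided p-value and compares exact integer weights to avoid floating-point ties.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    col2 = n - col1
    lo = max(0, row1 - col2)
    hi = min(row1, col1)

    def weight(k: int) -> int:
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(col2, row1 - k)

    observed = weight(a)
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n, row1)

# Rows: Modulum (committed, not committed); Vanilla (committed, not committed).
conditions = {
    "present_normal": (26, 4, 0, 30),
    "present_paraphrase": (27, 3, 0, 30),
    "present_typo": (26, 4, 1, 29),
    "absent (refusal)": (10, 20, 0, 30),
    "present_with_distractor_injection": (28, 2, 28, 2),
}

for name, table in conditions.items():
    print(f"{name}: p = {fisher_exact_two_sided(*table):.2g}")
```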

Hallucination floor: Modulum, Vanilla, Anthropic, OpenAI, Google all 0/30 false-positive (committed when absent). Only xAI/Grok 1/30. Substrate gain is in GROUNDED COMMITMENT, not hallucination suppression.

QUALIFIED — REQUIRES AMORTIZATION CAVEAT

Cost-per-correct economics

Apples-to-apples sync-tier comparison; Modulum at est. $0.30/M amortized H100, BABILong qa1-qa3 128k workload.

~24× cheaper than Opus 4.6 sync
~4× cheaper than GPT-5.5 sync
$0.093 per correct answer (est.)

"Under full-utilization amortized H100 assumptions, Modulum estimates ~24× lower cost-per-correct vs Opus 4.6 sync and ~4× vs GPT-5.5 sync on this benchmark."

Mandatory disclosures on slide:

$0.30/M is amortization estimate at assumed 100% H100 utilization — not realized customer pricing
Hypernym partner pricing TBD
Compared against published 2026-05 sync-tier rates; batch pricing excluded

DO NOT CITE

Findings rejected for the deck

Rejected unanimously by all 3 panelists in R1. Gemini: "would severely undermine analytical credibility." Codex: "Would read as cherry-picking."

Finding	Modulum	Vanilla	Delta	p-value	Status
Halluc zero-hallucination	150/150	148/150	+1.3pp	0.498	NOT SIG
Throughput v3 C=1 accuracy	21/50	16/50	+10.0pp	0.408	NOT SIG · N too small
Throughput v3 C=4 accuracy	14/50	12/50	+4.0pp	0.820	NOT SIG
HotpotQA multidoc N=50 (paired)	44/50	39/50	+10.0pp	0.180	HOLD — N=200 in flight

12 ACTIVE H100 JOBS — landing this hour

In-flight runs

RULER 240k_ceiling at Gemma-4's architectural max	Modulum + Vanilla	N=30 each, 4 subtasks
InfiniteBench 6 subtasks (passkey, kv_retrieval, etc.)	Modulum + Vanilla	N=30 each
LongBench v2 multi-subtask cross-domain	Modulum + Vanilla	N=30 each
BABILong qa5 multi-hop substrate-attribution	Modulum + Vanilla	N=200 each
BABILong qa2, qa3 Vanilla pairs (multi-hop)	Vanilla	N=200 each
HotpotQA multidoc N=200 closes p=0.18 → expected p<0.05	Modulum + Vanilla	N=200 each

Methodology

Modulum vs Vanilla on identical Gemma-4-31B-Q4 weights → paired comparison where indices align
Paired McNemar exact for sample-aligned binary outcomes; Fisher exact for aggregate proportions
Wilson 95% confidence intervals on all proportions
Independent classification (committed / refused / ambiguous) for RAG-resistance set
Substring match scoring for BABILong; temperature = 0 throughout
Hardware: 5× H100-class endpoints per cluster (Modulum us-west1-b + us-east4-a; Vanilla us-central1-a + us-east4-a), slot=2, random round-robin
Apparatus reproducible: research/tracks/modulum-bench/run_*.py + runs/rag-resistance-sf-20260524/
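
Two of the procedures named in the list above, the paired McNemar exact test and the Wilson 95% interval, can be sketched with the standard library alone. This is an illustrative reimplementation under the usual textbook definitions, not the code in `run_*.py`; the function names and the z = 1.96 default are choices made here.

```python
from math import comb, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b = pairs where only the first arm succeeded,
    c = pairs where only the second arm succeeded.
    Under the null, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# RAG present_normal: b=26, c=0 discordant pairs.
print(f"McNemar p = {mcnemar_exact(26, 0):.3g}")

# BABILong qa1 128k, Modulum: 143 correct out of 200.
low, high = wilson_interval(143, 200)
print(f"Wilson 95% CI: {100 * low:.1f}% to {100 * high:.1f}%")
```

With c = 0 the McNemar p-value reduces to 2 × 0.5^b, which is why the three headline RAG cards differ only by factors of two.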

Panel verification trail — auditable, 3 rounds × 3 models:

R1 Codex: 5ca77f38-e08e-475f-bbed-f5f46905714d
R1 Grok 4.3: f72850b8-0426-4940-b3d8-606d7fb13869
R1 Gemini 3.1-pro-preview: Google API

R2 Codex: 7b65587d-1462-4e33-b8ba-668b7bfac5ab
R2 Grok 4.3: 2058bd66-ed22-4b9d-b77b-156686b90cac
R2 Gemini 3.1-pro-preview: VERIFIED unchanged

R3 Codex: dispatch logged
R3 Grok 4.3: 83650b4e-cec2-4892-a2e9-0e6425d5a802
R3 Gemini 3.1-pro-preview: Google API

All three rounds converged unanimously: 7 APPROVE_FOR_DECK + 3 QUALIFIED + 4 DO_NOT_USE.
CXDB persistent record: modulum-bench:d6c740ff56de
