Gemma 4 (QAT-4bit) prompt-lab — full 9-case corpus

The two new Quantization-Aware-Training (QAT) 4-bit Gemma 4 models, gemma4-31b-qat (dense) and gemma4-26b-a4b-qat (MoE), both self-hosted and free, run over all 9 corpus cases. Three non-leaking prompting arms (open / plan-first / CWE-checklist) × 2 trials per arm at temperature 0.5. Multi-file cases (nomad/craft/ghost) audit every baseline file; a case counts as detected if ANY file localizes within 10 lines of the planted hunk (the benchmark's gate). Motivation: at 8-bit the 26B-A4B MoE tied the field-leading detection score of 4/9, so we probe the limits of the QAT-4bit variants as we did for Qwen 3.6.
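The detection gate described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Hunk` and `Finding` types and both function names are hypothetical, and it assumes "within 10 lines" means the reported line falls no more than 10 lines before the hunk's first line or after its last line.

```python
from dataclasses import dataclass

WINDOW = 10  # slack around the planted hunk, per the report's gate


@dataclass(frozen=True)
class Hunk:
    path: str
    start: int  # first line of the planted hunk (1-indexed, inclusive)
    end: int    # last line of the planted hunk (inclusive)


@dataclass(frozen=True)
class Finding:
    path: str
    line: int  # line the model pointed at


def localizes(finding: Finding, hunk: Hunk, window: int = WINDOW) -> bool:
    """True if the finding is in the hunk's file and within `window` lines of it."""
    if finding.path != hunk.path:
        return False
    return hunk.start - window <= finding.line <= hunk.end + window


def case_detected(findings: list[Finding], hunks: list[Hunk]) -> bool:
    """A case is detected if ANY finding from ANY audited file localizes."""
    return any(localizes(f, h) for f in findings for h in hunks)
```

Under this reading, a multi-file case needs only one localizing finding across all of its audited files to count as detected.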

Headline

gemma4-26b-a4b-qat: best arm solves 3/5 solved-tier cases and cracks 1/4 hard-miss cases.
gemma4-31b-qat: best arm solves 4/5 solved-tier cases and cracks 2/4 hard-miss cases.
(Counts are unions over trials: a case counts if at least one trial of that arm detected it. Detection is localization-only with no truth judge yet, so confirm the hard-case hits with the judge before claiming them.)

Detection matrix

Cell = trials that detected the case / completed trials for that arm (a case counts if any of its files hit). green = every trial, amber = some, red = none, — = no data. Hard-miss cases (at most 1 baseline detection) are bold.

gemma4-26b-a4b-qat

case	baseline	open	plan	checklist
GHSA-w52v-v783-gw97 CWE-89 · JS · hit anchor	17/20	0/1	1/2	1/2
GHSA-j273-m5qq-6825 CWE-22 · Java · guard	12/20	1/2	2/2	1/1
GHSA-f26g-jm89-4g65 CWE-77 · Rust · medium	11/19	0/2	1/2	0/2
CVE-2026-7474 CWE-22 · Go · medium-hard	7/21	0/2	0/2	0/2
GHSA-9f49-8x56-jmjc CWE-416 · C · hard	2/21	0/1	0/1	0/1
GHSA-mpxh-8fq3-x8mh CWE-787 · C · hard	1/17	0/2	0/1	0/1
GHSA-x9h5-r9v2-vcww CWE-122 · C · hard	1/20	0/1	—	—
GHSA-cc7p-2j3x-x7xf CWE-863 · PHP · hardest	0/20	0/2	1/2	0/2
CVE-2026-5199 CWE-639 · Go · hardest	0/20	2/2	0/2	0/2

gemma4-31b-qat

case	baseline	open	plan	checklist
GHSA-w52v-v783-gw97 CWE-89 · JS · hit anchor	17/20	2/2	2/2	2/2
GHSA-j273-m5qq-6825 CWE-22 · Java · guard	12/20	2/2	2/2	2/2
GHSA-f26g-jm89-4g65 CWE-77 · Rust · medium	11/19	2/2	0/2	0/2
CVE-2026-7474 CWE-22 · Go · medium-hard	7/21	1/2	0/2	1/2
GHSA-9f49-8x56-jmjc CWE-416 · C · hard	2/21	0/2	0/2	0/2
GHSA-mpxh-8fq3-x8mh CWE-787 · C · hard	1/17	0/2	0/2	0/2
GHSA-x9h5-r9v2-vcww CWE-122 · C · hard	1/20	0/2	0/2	0/1
GHSA-cc7p-2j3x-x7xf CWE-863 · PHP · hardest	0/20	2/2	0/2	0/2
CVE-2026-5199 CWE-639 · Go · hardest	0/20	1/2	1/2	0/2
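The cell semantics in the matrices above can be sketched as follows. The function names and the outcome labels "hit" and "miss" are hypothetical; "timeout" and "infra_error" are the no-data statuses this report names, and they drop out of the denominator rather than counting as misses.

```python
from typing import Optional

# Statuses the report treats as no-data rather than as a miss.
NO_DATA = {"timeout", "infra_error"}


def cell(outcomes: list[str]) -> Optional[tuple[int, int]]:
    """Collapse per-trial outcomes into (detected, completed), or None for no data."""
    completed = [o for o in outcomes if o not in NO_DATA]
    if not completed:
        return None
    detected = sum(1 for o in completed if o == "hit")
    return detected, len(completed)


def cell_status(c: Optional[tuple[int, int]]) -> str:
    """Map a cell to the legend: green = every trial, amber = some, red = none."""
    if c is None:
        return "no-data"
    detected, completed = c
    if detected == completed:
        return "green"
    return "amber" if detected > 0 else "red"
```

For example, a 1/1 cell would arise when one of the two trials timed out and the other hit, which this sketch reports as green because every completed trial detected the case.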

Arm summary (union over 2 trials)

Cases each arm detected in at least one trial, split by whether the baseline solved the case (regression guard) or missed it (the real question). Denominators count only cases with a completed run.

model	arm	solved-cases	hard-miss-cases
gemma4-26b-a4b-qat	open	1/5	1/4
gemma4-26b-a4b-qat	plan	3/5	1/3
gemma4-26b-a4b-qat	checklist	2/5	0/3
gemma4-31b-qat	open	4/5	2/4
gemma4-31b-qat	plan	2/5	1/4
gemma4-31b-qat	checklist	3/5	0/4
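As a cross-check, the arm summary can be recomputed from the detection matrix. The sketch below hard-codes the gemma4-26b-a4b-qat cells from this report; the data layout and the name `arm_summary` are illustrative assumptions, and the tier split uses the report's rule that a hard-miss case has at most 1 baseline detection.

```python
# One entry per case: (baseline detections, {arm: (detected, completed) or None}).
# Values copied from the gemma4-26b-a4b-qat matrix; None marks a no-data cell.
MATRIX_26B = [
    (17, {"open": (0, 1), "plan": (1, 2), "checklist": (1, 2)}),
    (12, {"open": (1, 2), "plan": (2, 2), "checklist": (1, 1)}),
    (11, {"open": (0, 2), "plan": (1, 2), "checklist": (0, 2)}),
    (7,  {"open": (0, 2), "plan": (0, 2), "checklist": (0, 2)}),
    (2,  {"open": (0, 1), "plan": (0, 1), "checklist": (0, 1)}),
    (1,  {"open": (0, 2), "plan": (0, 1), "checklist": (0, 1)}),
    (1,  {"open": (0, 1), "plan": None,   "checklist": None}),
    (0,  {"open": (0, 2), "plan": (1, 2), "checklist": (0, 2)}),
    (0,  {"open": (2, 2), "plan": (0, 2), "checklist": (0, 2)}),
]


def arm_summary(matrix, arm):
    """Return ((solved_hit, solved_total), (hard_hit, hard_total)) for one arm."""
    solved = [0, 0]
    hard = [0, 0]
    for baseline_hits, cells in matrix:
        c = cells[arm]
        if c is None:
            continue  # denominators count only cases with a completed run
        bucket = hard if baseline_hits <= 1 else solved
        bucket[1] += 1
        if c[0] > 0:  # union: at least one trial detected the case
            bucket[0] += 1
    return tuple(solved), tuple(hard)


for arm in ("open", "plan", "checklist"):
    print(arm, arm_summary(MATRIX_26B, arm))
```

The plan and checklist arms have a hard-miss denominator of 3 rather than 4 because GHSA-x9h5-r9v2-vcww has no completed run in either arm.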

Off-target findings: real secondary bugs vs false positives

Detection above is only half the story. Every finding a model reports that isn't the planted CVE still lands in front of a human (or a downstream model) who burns time and tokens triaging it, so a model that floods the queue with hallucinations is expensive even when it finds the real bug. But these are real in-the-wild OSS files, not synthetic single-bug fixtures, so an off-target finding is not automatically a false positive: it may be a genuine, never-fixed secondary bug. We judged every off-target finding with the code-grounded Opus FP-judge (it reads the pre-patch source via git show, never the advisory) and classified each as a real bug, a false positive, or undetermined. Distinct sites are deduped (the verdict is a property of the code, not of which model/arm/trial surfaced it), so the table counts how many real secondary bugs actually live in each case's files.

case	distinct off-target sites	real secondary bug	false positive	undetermined
GHSA-w52v-v783-gw97 CWE-89 · JS	1	—	—	1
GHSA-j273-m5qq-6825 CWE-22 · Java	1	1	—	—
GHSA-f26g-jm89-4g65 CWE-77 · Rust	0	—	—	—
CVE-2026-7474 CWE-22 · Go	7	2	5	—
GHSA-9f49-8x56-jmjc CWE-416 · C	6	4	2	—
GHSA-mpxh-8fq3-x8mh CWE-787 · C	6	2	4	—
GHSA-x9h5-r9v2-vcww CWE-122 · C	1	—	1	—
GHSA-cc7p-2j3x-x7xf CWE-863 · PHP	2	1	1	—
CVE-2026-5199 CWE-639 · Go	3	1	2	—
all cases	27	11	15	1

green = judge-confirmed real bug, red = false positive, grey = undetermined. Hard-miss cases bold. The on-target CVE hits (31 findings) are not shown here: those are the planted bugs, already known to be real.
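The site-level dedup behind this table can be sketched as below. How the report actually keys a "distinct site" is not stated, so the `(case, path, line)` key is an assumption, as are the row field names and both function names. One verdict is kept per site no matter how many models, arms, or trials reported it.

```python
from collections import Counter


def distinct_site_verdicts(rows):
    """Keep one verdict per distinct site; the key (case, path, line) is assumed."""
    verdict_by_site = {}
    for row in rows:
        site = (row["case"], row["path"], row["line"])
        # The verdict belongs to the code, so the first one seen is kept.
        verdict_by_site.setdefault(site, row["verdict"])
    return verdict_by_site


def tally_by_case(verdict_by_site):
    """Count distinct sites per case, split by verdict."""
    counts = {}
    for (case, _path, _line), verdict in verdict_by_site.items():
        counts.setdefault(case, Counter())[verdict] += 1
    return counts


# Made-up rows for illustration only; these are not findings from this report.
rows = [
    {"model": "m1", "case": "A", "path": "x.c", "line": 10, "verdict": "real"},
    {"model": "m2", "case": "A", "path": "x.c", "line": 10, "verdict": "real"},
    {"model": "m1", "case": "A", "path": "x.c", "line": 40, "verdict": "false_positive"},
    {"model": "m2", "case": "B", "path": "y.go", "line": 7, "verdict": "undetermined"},
]
print(tally_by_case(distinct_site_verdicts(rows)))
```

In the made-up rows, two models report the same site in case A, so it counts once at site level even though it would count twice in a row-level tally.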

Does one model hallucinate more? Off-target rows by model

Per-model off-target finding volume, split by verdict. Counts are row-level, so repeated reports of the same site count each time. % real = real rows / all off-target rows, undetermined included. A lower % real means more of that model's off-target output is hallucination, which is the cost a human pays per report.

model	off-target rows	real	false positive	undetermined	% real
gemma4-26b-a4b-qat	18	2	13	3	11%
gemma4-31b-qat	27	16	11	0	59%
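The % real column can be reproduced from the row counts with a few lines of arithmetic. The sketch assumes the denominator is all off-target rows, undetermined included, because that is the reading that matches both published percentages; the function name is hypothetical.

```python
def pct_real(real: int, false_positive: int, undetermined: int):
    """Share of off-target rows judged real, as a whole percent (None if no rows)."""
    total = real + false_positive + undetermined
    if total == 0:
        return None
    return round(100 * real / total)


# Row counts copied from the per-model table above.
OFF_TARGET = {
    "gemma4-26b-a4b-qat": (2, 13, 3),
    "gemma4-31b-qat": (16, 11, 0),
}

for model, counts in OFF_TARGET.items():
    print(model, f"{pct_real(*counts)}%")
```

Excluding undetermined rows from the denominator would give 2/15 (about 13%) for the 26B model, which does not match the table, so the all-rows denominator appears to be the one in use.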

Method & caveats

Models: gemma4-31b-qat (dense, 10.20.30.1) and gemma4-26b-a4b-qat (MoE, 10.20.30.2), self-hosted llama-server, free, QAT-4bit weights.
15 target files over 9 cases (nomad 2, craft 5, ghost 2, rest 1); a case is a hit if any file localizes. 90 runs/model = 15 files × 3 arms × 2 trials.
Temperature 0.5 so repeats explore different reasoning paths; llama-server seed=-1 randomizes per request. A timeout/infra_error reads as no-data, never a miss.
Detection is localization-only (free). Hard-case hits still need the truth judge to confirm SAME-bug; this report is the cheap first pass.
Off-target findings are FP-judged once per distinct site by the code-grounded Opus judge (fpjudge_gemma_promptlab.py), which reads the pre-patch source and never the advisory; verdicts persist on judge_fp_verdict so the report is reproducible from the DB. On-target CVE hits are not judged.
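The run-count arithmetic above (15 files × 3 arms × 2 trials = 90 runs per model) can be sketched as an explicit grid. The file labels are placeholders, since the report does not list file names; only the per-case file counts, the arm names, and the trial count come from the text.

```python
from itertools import product

# Multi-file cases and their file counts, as stated in the report.
FILES_PER_CASE = {"nomad": 2, "craft": 5, "ghost": 2}
SINGLE_FILE_CASES = 6  # the remaining 6 of 9 cases audit one file each
ARMS = ("open", "plan", "checklist")
TRIALS = (1, 2)


def target_files():
    """Placeholder labels for the 15 audited files (real paths are not in the report)."""
    files = [f"{case}#{i}" for case, n in FILES_PER_CASE.items() for i in range(n)]
    files += [f"single-{i}#0" for i in range(SINGLE_FILE_CASES)]
    return files


def run_grid():
    """Every (file, arm, trial) combination scheduled for one model."""
    return list(product(target_files(), ARMS, TRIALS))


print(len(target_files()), len(run_grid()))
```

Enumerating the grid up front also makes it easy to spot which cells ended as no-data, since any scheduled combination without a completed result is a gap and not a miss.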

Generated from nelson-gemma-promptlab.db · 2 models · 9 cases × 3 arms × 2 trials.