How far can the self-hosted Qwen 3.6 models be quantized before vulnerability detection, or the token cost of reaching it, degrades? We ran the same open-prompt probe across four precision tiers (full-precision BF16 / UD-Q8_K_XL / UD-Q6_K_XL / UD-Q4_K_XL): three non-leaking arms (open / plan-first / CWE-checklist) × two models, 2 trials/arm at temperature 0.5, over six single-file cases spanning five languages and baseline hit rates from 18/20 down to 0/21. Detection = a finding localized within 10 lines of the planted ground-truth hunk.
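The detection rule above can be sketched as a small predicate. This is a minimal illustration, not the harness's actual code: the `Hunk` type, the `is_localized` name, and the choice to measure the 10-line tolerance from the hunk's edges (rather than its center) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hunk:
    """Hypothetical container for the planted ground-truth hunk (1-indexed lines)."""
    start: int  # first line of the hunk
    end: int    # last line of the hunk, inclusive


def is_localized(finding_line: int, hunk: Hunk, tolerance: int = 10) -> bool:
    """Return True if the finding falls inside the hunk or within
    `tolerance` lines of either edge (assumed interpretation)."""
    return hunk.start - tolerance <= finding_line <= hunk.end + tolerance


hunk = Hunk(start=120, end=126)
print(is_localized(112, hunk))  # True: 8 lines above the hunk
print(is_localized(140, hunk))  # False: 14 lines below the hunk
```

Under this reading, a trial counts as a hit if at least one of its findings satisfies the predicate.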
Cell = localized hits / completed trials, one column per quant. green = both trials hit, amber = split, red = neither, — = no data. A Δ marks an arm where the quants disagree on whether the bug was found. Hard-miss cases (baseline ≤1) bold.
Qwen 3.6 27B (dense):
| case | baseline | open BF16 | open Q8 | open Q6 | open Q4 | plan BF16 | plan Q8 | plan Q6 | plan Q4 | checklist BF16 | checklist Q8 | checklist Q6 | checklist Q4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHSA-w52v-v783-gw97 CWE-89 · JS (hit anchor) | 18/20 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 |
| GHSA-j273-m5qq-6825 CWE-22 · Java (guard) | 17/20 | 2/2 | 2/2 | 1/2 | 1/2 | 0/2 | 2/2 | 2/2 | 1/2 | 2/2 | 1/2 | 2/2 | 1/2 |
| GHSA-f26g-jm89-4g65 CWE-77 · Rust (medium) | 11/20 | 1/2 | 1/2 | 2/2 | 2/2 | 0/2 | 1/2 | 0/2 | 1/2 | 1/2 | 1/2 | 0/2 | 1/2 |
| **GHSA-x9h5-r9v2-vcww CWE-122 · C (hard miss)** | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| **GHSA-9f49-8x56-jmjc CWE-416 · C (hard miss)** | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| **CVE-2026-5199 CWE-639 · Go (hardest miss)** | 0/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
Qwen 3.6 35B-A3B (MoE):
| case | baseline | open BF16 | open Q8 | open Q6 | open Q4 | plan BF16 | plan Q8 | plan Q6 | plan Q4 | checklist BF16 | checklist Q8 | checklist Q6 | checklist Q4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHSA-w52v-v783-gw97 CWE-89 · JS (hit anchor) | 18/20 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 |
| GHSA-j273-m5qq-6825 CWE-22 · Java (guard) | 17/20 | 1/2 | 1/2 | 2/2 | 2/2 | 1/2 | 2/2 | 2/2 | 0/2 | 1/2 | 2/2 | 2/2 | 1/2 |
| GHSA-f26g-jm89-4g65 CWE-77 · Rust (medium) | 11/20 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 1/2 | 0/2 | 0/2 | 0/2 |
| **GHSA-x9h5-r9v2-vcww CWE-122 · C (hard miss)** | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| **GHSA-9f49-8x56-jmjc CWE-416 · C (hard miss)** | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| **CVE-2026-5199 CWE-639 · Go (hardest miss)** | 0/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/1 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
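The cell legend and the Δ marker can be reproduced from the raw `hits/trials` strings. The sketch below assumes that "the bug was found" means at least one hit among completed trials; the function names are illustrative and not taken from the report generator.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a 'hits/trials' cell; any cell without a slash is treated as no data."""
    if "/" not in cell:
        return (0, 0)
    hits, trials = cell.split("/")
    return int(hits), int(trials)


def label(cell: str) -> str:
    """Map a cell to the legend: hit (all trials), split, miss (none), or no data."""
    hits, trials = parse_cell(cell)
    if trials == 0:
        return "no data"
    if hits == trials:
        return "hit"
    if hits == 0:
        return "miss"
    return "split"


def quants_disagree(cells: list[str]) -> bool:
    """True when at least one quant found the bug and at least one did not
    (assumed meaning of the delta marker). Cells with no data are ignored."""
    found = [parse_cell(c)[0] > 0 for c in cells if parse_cell(c)[1] > 0]
    return any(found) and not all(found)


# Java case, plan arm, first detection table: BF16, Q8, Q6, Q4
print(quants_disagree(["0/2", "2/2", "2/2", "1/2"]))  # True
# JS case, open arm: every quant hits in both trials
print(quants_disagree(["2/2", "2/2", "2/2", "2/2"]))  # False
```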
Mean per completed run. Gen tokens = what the model actually generates (the real compute-to-answer); context tokens = input re-read across the ReAct loop (proxy for tool-call turns); wall-clock = latency (also moved by host load). Percentages are vs BF16; + means the smaller quant works harder. A smaller quant that needs materially more gen tokens for the same detection is a false economy where both quants are runnable.
| model | quant | completed runs | gen tokens/run | context tokens/run | wall-clock/run |
|---|---|---|---|---|---|
| Qwen 3.6 27B (dense) | BF16 | 36 | 4671 (+0.0%) | 369151 (+0.0%) | 696s (+0.0%) |
| Qwen 3.6 27B (dense) | Q8 | 36 | 5132 (+9.9%) | 393977 (+6.7%) | 373s (-46.3%) |
| Qwen 3.6 27B (dense) | Q6 | 36 | 5267 (+12.8%) | 360289 (-2.4%) | 364s (-47.6%) |
| Qwen 3.6 27B (dense) | Q4 | 36 | 6036 (+29.2%) | 344132 (-6.8%) | 336s (-51.7%) |
| Qwen 3.6 35B-A3B (MoE) | BF16 | 36 | 8544 (+0.0%) | 272504 (+0.0%) | 448s (+0.0%) |
| Qwen 3.6 35B-A3B (MoE) | Q8 | 35 | 7655 (-10.4%) | 250490 (-8.1%) | 246s (-45.0%) |
| Qwen 3.6 35B-A3B (MoE) | Q6 | 36 | 7333 (-14.2%) | 279185 (+2.5%) | 250s (-44.2%) |
| Qwen 3.6 35B-A3B (MoE) | Q4 | 36 | 8449 (-1.1%) | 264276 (-3.0%) | 253s (-43.5%) |
How many solved/medium cases and hard-miss cases each (model, quant, arm) detected in at least one trial. Denominators count only cases with a completed run.
| model | quant | arm | solved-cases | hard-miss-cases |
|---|---|---|---|---|
| Qwen 3.6 27B (dense) | BF16 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | BF16 | plan | 1/3 | 0/3 |
| Qwen 3.6 27B (dense) | BF16 | checklist | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | plan | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | checklist | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | plan | 2/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | checklist | 2/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | plan | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | checklist | 3/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | checklist | 3/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | checklist | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | checklist | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | plan | 1/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | checklist | 2/3 | 0/3 |
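The coverage figures can be derived from the per-case cells: a case counts as detected if any trial hit, and it enters the denominator only if at least one trial completed. The sketch below is an illustration under those assumptions, with a hypothetical `coverage` helper.

```python
def coverage(cells: list[str]) -> str:
    """Count cases detected in at least one trial over cases with a completed run."""
    # Keep only well-formed 'hits/trials' cells with at least one completed trial.
    completed = [c for c in cells if "/" in c and int(c.split("/")[1]) > 0]
    detected = sum(1 for c in completed if int(c.split("/")[0]) > 0)
    return f"{detected}/{len(completed)}"


# plan arm, BF16 column of the first detection table, cases JS, Java, Rust
print(coverage(["2/2", "0/2", "0/2"]))  # 1/3
```

That result matches the 27B dense / BF16 / plan row of the coverage table.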
These are real in-the-wild OSS files, not synthetic single-bug fixtures, so a finding that isn't the planted CVE is not automatically a false positive — it may be a genuine, never-fixed secondary bug. We judged every off-target finding with the code-grounded Opus FP-judge (it reads the pre-patch source, never the advisory) and classified each as a real bug, a false positive, or undetermined. Distinct sites are deduped globally across all four tiers (the verdict is a property of the code, not the precision), so this counts how many real secondary bugs actually live in each file.
| case | distinct off-target sites | real secondary bug | false positive | undetermined |
|---|---|---|---|---|
| GHSA-w52v-v783-gw97 CWE-89 · JS | 0 | — | — | — |
| GHSA-j273-m5qq-6825 CWE-22 · Java | 4 | 4 | — | — |
| GHSA-f26g-jm89-4g65 CWE-77 · Rust | 0 | — | — | — |
| **GHSA-x9h5-r9v2-vcww CWE-122 · C** | 25 | 4 | 21 | — |
| **GHSA-9f49-8x56-jmjc CWE-416 · C** | 37 | 28 | 9 | — |
| **CVE-2026-5199 CWE-639 · Go** | 2 | — | 2 | — |
| all cases | 68 | 36 | 32 | 0 |
Hard-miss cases bold. The on-target CVE hits (95 findings) are not shown here; those are the planted bug, already known real.
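Global deduplication across tiers could look like the sketch below. The row layout `(tier, case, site, verdict)` and every value in `rows` are hypothetical, made up purely for illustration; they are not taken from the report's databases.

```python
from collections import Counter

# Hypothetical off-target rows: (tier, case, site, verdict). Illustrative only.
rows = [
    ("BF16", "case-a", "parse.c:210", "real"),
    ("Q8",   "case-a", "parse.c:210", "real"),            # same site, another tier
    ("Q4",   "case-a", "util.c:88",   "false_positive"),
    ("Q6",   "case-b", "auth.go:40",  "false_positive"),
]


def distinct_site_verdicts(rows):
    """Collapse rows to one verdict per (case, site), ignoring the tier,
    then count verdicts over the distinct sites."""
    verdict_by_site = {}
    for _tier, case, site, verdict in rows:
        # The verdict is a property of the code, so the first one seen is kept.
        verdict_by_site.setdefault((case, site), verdict)
    return Counter(verdict_by_site.values())


counts = distinct_site_verdicts(rows)
print(len(rows), "rows ->", sum(counts.values()), "distinct sites")
```

Keying on the site rather than the tier is what makes the per-case counts describe the file itself instead of how often a given quant repeated a report.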
Per-tier off-target finding volume split by verdict (row-level, so repeated reports of the same bug count each time). If the smaller quants don't post a materially lower % real, finding precision is quantization-invariant too: the smaller build is no noisier, consistent with the detection result.
| tier | off-target rows | real | false positive | undetermined | % real |
|---|---|---|---|---|---|
| BF16 | 35 | 24 | 11 | 0 | 69% |
| Q8 | 34 | 27 | 7 | 0 | 79% |
| Q6 | 36 | 29 | 7 | 0 | 81% |
| Q4 | 22 | 15 | 7 | 0 | 68% |
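The % real column is the share of off-target rows the judge confirmed as real bugs. The sketch below recomputes it from the table's counts; the `pct_real` helper and the dictionary layout are assumptions.

```python
# (real, off-target rows) per tier, copied from the table above
tiers = {"BF16": (24, 35), "Q8": (27, 34), "Q6": (29, 36), "Q4": (15, 22)}


def pct_real(real: int, total: int) -> int:
    """Share of off-target rows judged real, rounded to a whole percent."""
    return round(100 * real / total)


for tier, (real, total) in tiers.items():
    print(f"{tier}: {pct_real(real, total)}% real")
```

On these counts Q4 lands one point below BF16 (68% vs 69%), while Q8 and Q6 sit above it.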
qwen3.6-27b (dense, 10.20.30.1) and qwen3.6-35b-A3b (MoE, 10.20.30.2), self-hosted llama-server, free. Identical endpoints; only the loaded precision differs (BF16 / UD-Q8/Q6/Q4_K_XL).
seed=-1 randomizes per request.
Detection is localization-only (free); on the hard cases zero findings landed near the planted bug, so there is nothing for a truth-judge to confirm.
Results are stored in per-tier databases (nelson-promptlab-bf16.db, nelson-promptlab.db, nelson-promptlab-6bit.db, nelson-promptlab-4bit.db); the baseline benchmark DB is untouched.
Generated from nelson-promptlab-bf16.db, nelson-promptlab.db, nelson-promptlab-6bit.db, nelson-promptlab-4bit.db · 4 quants × 2 models × 6 cases × 3 arms × 2 trials.