Qwen 3.6 precision sweep: full-precision BF16 vs 8-bit vs 6-bit vs 4-bit

How far can the self-hosted Qwen 3.6 models be quantized before vulnerability detection, or the tokens spent reaching it, degrades? This sweep runs the same open-prompt probe across four precision tiers (full-precision BF16 / UD-Q8_K_XL / UD-Q6_K_XL / UD-Q4_K_XL): three non-leaking arms (open / plan-first / CWE-checklist) × two models, 2 trials/arm at temperature 0.5, over six single-file cases whose baseline detection ranges from 18/20 down to 0/21 and which cover five languages. Detection = a finding localized within 10 lines of the planted ground-truth hunk.
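The detection criterion can be sketched as a simple line-range check. This is a minimal illustration, not the harness's actual code: the `Hunk` type, the `is_localized` name, and the choice to measure the 10-line tolerance from the hunk's first and last lines are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hunk:
    """Hypothetical ground-truth hunk, 1-indexed and inclusive."""
    start: int
    end: int


def is_localized(finding_line: int, hunk: Hunk, tolerance: int = 10) -> bool:
    """True if the reported line is inside the hunk or within `tolerance` lines of its edges."""
    return hunk.start - tolerance <= finding_line <= hunk.end + tolerance


hunk = Hunk(start=120, end=134)
print(is_localized(112, hunk))  # 8 lines above the hunk: True
print(is_localized(150, hunk))  # 16 lines below the hunk: False
```

If the real harness measures distance from the hunk's midpoint, or compares ranges rather than a single reported line, the boundary cases would shift accordingly.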

Hard-miss detection by quant: BF16 0/6 · Q8 0/6 · Q6 0/6 · Q4 0/6
The capability ceiling is the headline: if every tier posts the same hard-case count, the bottleneck is model capability, not precision. The token-usage section below answers the second question: does a smaller quant spend more tokens to reach the same answer (a false economy for anyone who can run either)? Deltas there are measured against BF16.

Detection matrix — BF16, Q8, Q6, Q4 side by side

Cell = localized hits / completed trials, one column per quant. Green (2/2) = both trials hit, amber (1/2) = split, red (0/2) = neither; a cell with no completed trial has no data. A Δ marks an arm where the quants disagree on whether the bug was found. Hard-miss cases (baseline ≤1 hit) are tagged hard miss or hardest miss.

Qwen 3.6 27B (dense)

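| case | CWE · language | tag | baseline | open BF16 | open Q8 | open Q6 | open Q4 | plan BF16 | plan Q8 | plan Q6 | plan Q4 | checklist BF16 | checklist Q8 | checklist Q6 | checklist Q4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHSA-w52v-v783-gw97 | CWE-89 · JS | hit anchor | 18/20 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 |
| GHSA-j273-m5qq-6825 | CWE-22 · Java | guard | 17/20 | 2/2 | 2/2 | 1/2 | 1/2 | 0/2 | 2/2 | 2/2 | 1/2 | 2/2 | 1/2 | 2/2 | 1/2 |
| GHSA-f26g-jm89-4g65 | CWE-77 · Rust | medium | 11/20 | 1/2 | 1/2 | 2/2 | 2/2 | 0/2 | 1/2 | 0/2 | 1/2 | 1/2 | 1/2 | 0/2 | 1/2 |
| GHSA-x9h5-r9v2-vcww | CWE-122 · C | hard miss | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| GHSA-9f49-8x56-jmjc | CWE-416 · C | hard miss | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| CVE-2026-5199 | CWE-639 · Go | hardest miss | 0/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |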

Qwen 3.6 35B-A3B (MoE)

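| case | CWE · language | tag | baseline | open BF16 | open Q8 | open Q6 | open Q4 | plan BF16 | plan Q8 | plan Q6 | plan Q4 | checklist BF16 | checklist Q8 | checklist Q6 | checklist Q4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHSA-w52v-v783-gw97 | CWE-89 · JS | hit anchor | 18/20 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 |
| GHSA-j273-m5qq-6825 | CWE-22 · Java | guard | 17/20 | 1/2 | 1/2 | 2/2 | 2/2 | 1/2 | 2/2 | 2/2 | 0/2 | 1/2 | 2/2 | 2/2 | 1/2 |
| GHSA-f26g-jm89-4g65 | CWE-77 · Rust | medium | 11/20 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 1/2 | 0/2 | 0/2 | 0/2 |
| GHSA-x9h5-r9v2-vcww | CWE-122 · C | hard miss | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| GHSA-9f49-8x56-jmjc | CWE-416 · C | hard miss | 1/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| CVE-2026-5199 | CWE-639 · Go | hardest miss | 0/21 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/1 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 | 0/2 |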
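The cell coloring and the Δ rule from the legend can be sketched as follows. The function names are hypothetical, and two details are assumptions: a cell with zero hits out of a single completed trial is treated as red, and "disagree" means at least one quant found the bug in some trial while another found it in none.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a 'hits/trials' cell such as '1/2' into integers."""
    hits, trials = cell.split("/")
    return int(hits), int(trials)


def cell_color(cell: str) -> str:
    """Map a cell to the legend's color names."""
    hits, trials = parse_cell(cell)
    if trials == 0:
        return "no data"
    if hits == trials:
        return "green"
    if hits == 0:
        return "red"
    return "amber"


def quants_disagree(cells_by_quant: dict[str, str]) -> bool:
    """True when some quants found the bug at least once and others never did."""
    outcomes = set()
    for cell in cells_by_quant.values():
        hits, trials = parse_cell(cell)
        if trials > 0:
            outcomes.add(hits > 0)
    return len(outcomes) > 1


# 27B dense, GHSA-j273-m5qq-6825, plan arm (values from the matrix above)
plan_cells = {"BF16": "0/2", "Q8": "2/2", "Q6": "2/2", "Q4": "1/2"}
print({q: cell_color(c) for q, c in plan_cells.items()})
print(quants_disagree(plan_cells))  # True: BF16 missed, the other tiers hit
```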

Token usage — cost to answer per quant

Mean per completed run. Gen tokens = what the model actually generates (the real compute-to-answer); context tokens = input re-read across the ReAct loop (proxy for tool-call turns); wall-clock = latency (also moved by host load). Percentages are vs BF16; + means the smaller quant works harder. A smaller quant that needs materially more gen tokens for the same detection is a false economy where both quants are runnable.

| model | quant | completed runs | gen tokens/run | context tokens/run | wall-clock/run |
|---|---|---|---|---|---|
| Qwen 3.6 27B (dense) | BF16 | 36 | 4671 (+0.0%) | 369151 (+0.0%) | 696s (+0.0%) |
| Qwen 3.6 27B (dense) | Q8 | 36 | 5132 (+9.9%) | 393977 (+6.7%) | 373s (-46.3%) |
| Qwen 3.6 27B (dense) | Q6 | 36 | 5267 (+12.8%) | 360289 (-2.4%) | 364s (-47.6%) |
| Qwen 3.6 27B (dense) | Q4 | 36 | 6036 (+29.2%) | 344132 (-6.8%) | 336s (-51.7%) |
| Qwen 3.6 35B-A3B (MoE) | BF16 | 36 | 8544 (+0.0%) | 272504 (+0.0%) | 448s (+0.0%) |
| Qwen 3.6 35B-A3B (MoE) | Q8 | 35 | 7655 (-10.4%) | 250490 (-8.1%) | 246s (-45.0%) |
| Qwen 3.6 35B-A3B (MoE) | Q6 | 36 | 7333 (-14.2%) | 279185 (+2.5%) | 250s (-44.2%) |
| Qwen 3.6 35B-A3B (MoE) | Q4 | 36 | 8449 (-1.1%) | 264276 (-3.0%) | 253s (-43.5%) |
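The percentages are plain relative deltas against the BF16 row of the same model. A minimal sketch, using the 27B gen-token means from the table (the helper name `pct_vs_bf16` is hypothetical):

```python
def pct_vs_bf16(value: float, bf16_value: float) -> str:
    """Signed relative change versus the BF16 baseline, one decimal place."""
    delta = (value - bf16_value) / bf16_value * 100
    return f"{delta:+.1f}%"


# Mean gen tokens per run, Qwen 3.6 27B (dense), copied from the table
gen_tokens_27b = {"BF16": 4671, "Q8": 5132, "Q6": 5267, "Q4": 6036}

for quant, tokens in gen_tokens_27b.items():
    print(quant, pct_vs_bf16(tokens, gen_tokens_27b["BF16"]))
# BF16 +0.0%
# Q8 +9.9%
# Q6 +12.8%
# Q4 +29.2%
```

Recomputing the wall-clock deltas from the rounded per-run seconds can differ from the table by about 0.1 percentage points, which suggests the report computes them from unrounded means.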

Arm summary (union over 2 trials)

How many solved/medium cases and hard-miss cases each (model, quant, arm) detected in at least one trial. Denominators count only cases with a completed run.

| model | quant | arm | solved/medium cases | hard-miss cases |
|---|---|---|---|---|
| Qwen 3.6 27B (dense) | BF16 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | BF16 | plan | 1/3 | 0/3 |
| Qwen 3.6 27B (dense) | BF16 | checklist | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | plan | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q8 | checklist | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | plan | 2/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q6 | checklist | 2/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | open | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | plan | 3/3 | 0/3 |
| Qwen 3.6 27B (dense) | Q4 | checklist | 3/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | BF16 | checklist | 3/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q8 | checklist | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | plan | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q6 | checklist | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | open | 2/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | plan | 1/3 | 0/3 |
| Qwen 3.6 35B-A3B (MoE) | Q4 | checklist | 2/3 | 0/3 |
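The union-over-trials roll-up can be reproduced from the detection-matrix cells. A minimal sketch with a hypothetical `arm_summary` helper: a case counts as detected if any trial hit, and a case with no completed trial is left out of the denominator.

```python
def arm_summary(cells: list[str]) -> str:
    """Roll 'hits/trials' cells for one (model, quant, arm) into 'detected/cases'."""
    detected = 0
    counted = 0
    for cell in cells:
        hits, trials = (int(part) for part in cell.split("/"))
        if trials == 0:
            continue  # no completed run: excluded from the denominator
        counted += 1
        if hits > 0:
            detected += 1
    return f"{detected}/{counted}"


# 27B dense, BF16, plan arm: hit anchor, guard, and medium cases from the matrix
print(arm_summary(["2/2", "0/2", "0/2"]))  # 1/3
# 35B-A3B MoE, Q8, plan arm: the three hard-miss cases (one trial did not complete)
print(arm_summary(["0/2", "0/2", "0/1"]))  # 0/3
```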

Off-target findings: real secondary bugs vs false positives

These are real in-the-wild OSS files, not synthetic single-bug fixtures, so a finding that isn't the planted CVE is not automatically a false positive — it may be a genuine, never-fixed secondary bug. We judged every off-target finding with the code-grounded Opus FP-judge (it reads the pre-patch source, never the advisory) and classified each as a real bug, a false positive, or undetermined. Distinct sites are deduped globally across all four tiers (the verdict is a property of the code, not the precision), so this counts how many real secondary bugs actually live in each file.

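| case | CWE · language | distinct off-target sites | real secondary bug | false positive | undetermined |
|---|---|---|---|---|---|
| GHSA-w52v-v783-gw97 | CWE-89 · JS | 0 | 0 | 0 | 0 |
| GHSA-j273-m5qq-6825 | CWE-22 · Java | 4 | 4 | 0 | 0 |
| GHSA-f26g-jm89-4g65 | CWE-77 · Rust | 0 | 0 | 0 | 0 |
| GHSA-x9h5-r9v2-vcww | CWE-122 · C | 25 | 4 | 21 | 0 |
| GHSA-9f49-8x56-jmjc | CWE-416 · C | 37 | 28 | 9 | 0 |
| CVE-2026-5199 | CWE-639 · Go | 2 | 0 | 2 | 0 |
| all cases | | 68 | 36 | 32 | 0 |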

Verdict colors: green = judge-confirmed real bug, red = false positive, grey = undetermined. The hard-miss cases are GHSA-x9h5-r9v2-vcww, GHSA-9f49-8x56-jmjc, and CVE-2026-5199. The on-target CVE hits (95 findings) are not shown here: those are the planted bug, already known to be real.
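The global dedup of off-target sites can be sketched as below. Everything here is illustrative: the record layout, the `(case, site)` key, and the sample rows are hypothetical and do not come from the report's databases. The point is only that one site reported by several tiers contributes one verdict.

```python
from collections import Counter

# Hypothetical judged findings: (case, site, tier, verdict). Not real data.
findings = [
    ("case-A", "parser.c:210", "BF16", "real"),
    ("case-A", "parser.c:210", "Q4", "real"),  # same site, reported again by another tier
    ("case-A", "util.c:88", "Q8", "false_positive"),
    ("case-B", "auth.go:42", "Q6", "false_positive"),
]


def distinct_sites(rows):
    """Keep one verdict per (case, site), ignoring which tier reported it."""
    verdict_by_site = {}
    for case, site, _tier, verdict in rows:
        verdict_by_site.setdefault((case, site), verdict)
    return verdict_by_site


sites = distinct_sites(findings)
counts = Counter(sites.values())
print(len(sites), counts["real"], counts["false_positive"])  # 3 1 2
```

How the real pipeline decides that two findings refer to the same site (exact line, function, or a fuzzy match) is not stated in the report, so the exact-match key above is a simplification.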

Does a smaller quant hallucinate more? Off-target rows by tier

Per-tier off-target finding volume split by verdict (row-level, so repeated reports of the same bug count each time). If the smaller quants don't post a materially lower % real, finding precision is invariant to quantization tier too: the smaller build is no noisier, consistent with the detection result.

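| tier | off-target rows | real | false positive | undetermined | % real |
|---|---|---|---|---|---|
| BF16 | 35 | 24 | 11 | 0 | 69% |
| Q8 | 34 | 27 | 7 | 0 | 79% |
| Q6 | 36 | 29 | 7 | 0 | 81% |
| Q4 | 22 | 15 | 7 | 0 | 68% |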
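The % real column is the share of judged rows with a real verdict, rounded to a whole percent. A minimal sketch using the counts from the table (the `pct_real` helper is a hypothetical name):

```python
# Row-level verdict counts per tier, copied from the table
off_target = {
    "BF16": {"real": 24, "false_positive": 11, "undetermined": 0},
    "Q8": {"real": 27, "false_positive": 7, "undetermined": 0},
    "Q6": {"real": 29, "false_positive": 7, "undetermined": 0},
    "Q4": {"real": 15, "false_positive": 7, "undetermined": 0},
}


def pct_real(verdicts: dict[str, int]) -> int:
    """Share of off-target rows judged real, as a rounded percentage."""
    total = sum(verdicts.values())
    return round(100 * verdicts["real"] / total)


for tier, verdicts in off_target.items():
    print(tier, sum(verdicts.values()), f"{pct_real(verdicts)}%")
# BF16 35 69%
# Q8 34 79%
# Q6 36 81%
# Q4 22 68%
```

With 22 to 36 rows per tier, a single reclassified finding moves the percentage by roughly 3 to 5 points, so the spread between 68% and 81% should be read with that granularity in mind.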

Method & caveats

Generated from nelson-promptlab-bf16.db, nelson-promptlab.db, nelson-promptlab-6bit.db, nelson-promptlab-4bit.db · 4 quants × 2 models × 6 cases × 3 arms × 2 trials.
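The run grid implied by that product can be enumerated to check the expected run counts. A small sketch; the variable names are hypothetical and the labels are taken from the tables in this report.

```python
from itertools import product

quants = ["BF16", "Q8", "Q6", "Q4"]
models = ["Qwen 3.6 27B (dense)", "Qwen 3.6 35B-A3B (MoE)"]
cases = [
    "GHSA-w52v-v783-gw97",
    "GHSA-j273-m5qq-6825",
    "GHSA-f26g-jm89-4g65",
    "GHSA-x9h5-r9v2-vcww",
    "GHSA-9f49-8x56-jmjc",
    "CVE-2026-5199",
]
arms = ["open", "plan", "checklist"]
trials = [1, 2]

grid = list(product(quants, models, cases, arms, trials))
runs_per_model_quant = len(cases) * len(arms) * len(trials)

print(len(grid))             # 288 scheduled runs in total
print(runs_per_model_quant)  # 36 per (model, quant) row
```

The 36 matches the completed-runs column of the token-usage table for every row except 35B-A3B (MoE) at Q8, which completed 35.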