Qwen 3.6 precision sweep: full-precision BF16 vs 8-bit vs 6-bit vs 4-bit

How far can the self-hosted Qwen 3.6 models be quantized before vulnerability detection, or the tokens spent reaching it, degrades? The same open-prompt probe runs across four precision tiers (full-precision BF16 / UD-Q8_K_XL / UD-Q6_K_XL / UD-Q4_K_XL): three non-leaking arms (open / plan-first / CWE-checklist) × two models, 2 trials/arm at temperature 0.5, over six single-file cases in five languages whose baseline hit rates range from 18/20 down to 0/21. Detection = a finding localized within 10 lines of the planted ground-truth hunk.
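The detection rule above can be sketched as a small predicate. This is a minimal illustration, not the harness's actual code: the `Hunk` type, the function names, and the choice to measure distance from the nearest edge of the hunk (rather than from its first line or midpoint) are all assumptions.

```python
from dataclasses import dataclass

TOLERANCE_LINES = 10  # from the report: "within 10 lines"


@dataclass(frozen=True)
class Hunk:
    """Hypothetical ground-truth hunk, 1-indexed and inclusive."""
    start: int
    end: int


def distance_to_hunk(line: int, hunk: Hunk) -> int:
    """Lines between a reported line and the nearest hunk edge (0 if inside)."""
    if line < hunk.start:
        return hunk.start - line
    if line > hunk.end:
        return line - hunk.end
    return 0


def is_localized(line: int, hunk: Hunk, tolerance: int = TOLERANCE_LINES) -> bool:
    """True when a finding lands within `tolerance` lines of the hunk."""
    return distance_to_hunk(line, hunk) <= tolerance


hunk = Hunk(start=100, end=110)
print(is_localized(90, hunk), is_localized(89, hunk))  # True False
```

Under this reading a finding at line 90 counts as a hit for a hunk spanning lines 100 to 110, while line 89 does not.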

Hard-miss detection by quant: BF16 0/6 · Q8 0/6 · Q6 0/6 · Q4 0/6

The capability ceiling is the headline — if every tier posts the same hard-case count, the bottleneck is model capability, not precision. The token-usage section below answers the second question: does a smaller quant spend more tokens to reach the same answer (a false economy for anyone who can run either)? Deltas there are measured against BF16.

Detection matrix — BF16, Q8, Q6, Q4 side by side

Cell = localized hits / completed trials, one column per quant. green = both trials hit, amber = split, red = neither, — = no data. A Δ marks an arm where the quants disagree on whether the bug was found. Hard-miss cases (baseline ≤1) bold.
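The legend above maps onto a few lines of code. The sketch below is an assumed reading, not the report generator's implementation: the function names are hypothetical, a cell with a single completed trial is coloured by the same hits-versus-trials rule, and "disagree" is taken to mean that some quants found the bug at least once while others never did.

```python
from typing import Optional

NO_DATA = "—"  # the table's own no-data marker


def parse_cell(cell: str) -> Optional[tuple[int, int]]:
    """Parse 'hits/trials'; return None for an empty or no-data cell."""
    cell = cell.strip()
    if cell in ("", NO_DATA):
        return None
    hits, trials = cell.split("/")
    return int(hits), int(trials)


def cell_colour(cell: str) -> str:
    """green = every completed trial hit, amber = split, red = none hit."""
    parsed = parse_cell(cell)
    if parsed is None or parsed[1] == 0:
        return "no data"
    hits, trials = parsed
    if hits == trials:
        return "green"
    if hits == 0:
        return "red"
    return "amber"


def quants_disagree(cells: list[str]) -> bool:
    """True when the quants differ on whether the bug was found at all."""
    found = {p[0] > 0 for p in map(parse_cell, cells) if p is not None}
    return len(found) > 1


# 27B, GHSA-j273-m5qq-6825, plan arm: BF16, Q8, Q6, Q4
print(quants_disagree(["0/2", "2/2", "2/2", "1/2"]))  # True
```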

Qwen 3.6 27B (dense)

case	baseline	open				plan				checklist
case	baseline	BF16	Q8	Q6	Q4	BF16	Q8	Q6	Q4	BF16	Q8	Q6	Q4
GHSA-w52v-v783-gw97 CWE-89 · JS · hit anchor	18/20	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2
GHSA-j273-m5qq-6825 CWE-22 · Java · guard	17/20	2/2	2/2	1/2	1/2	0/2	2/2	2/2	1/2	2/2	1/2	2/2	1/2
GHSA-f26g-jm89-4g65 CWE-77 · Rust · medium	11/20	1/2	1/2	2/2	2/2	0/2	1/2	0/2	1/2	1/2	1/2	0/2	1/2
GHSA-x9h5-r9v2-vcww CWE-122 · C · hard miss	1/21	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2
GHSA-9f49-8x56-jmjc CWE-416 · C · hard miss	1/21	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2
CVE-2026-5199 CWE-639 · Go · hardest miss	0/21	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2

Qwen 3.6 35B-A3B (MoE)

case	baseline	open				plan				checklist
case	baseline	BF16	Q8	Q6	Q4	BF16	Q8	Q6	Q4	BF16	Q8	Q6	Q4
GHSA-w52v-v783-gw97 CWE-89 · JS · hit anchor	18/20	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2	2/2
GHSA-j273-m5qq-6825 CWE-22 · Java · guard	17/20	1/2	1/2	2/2	2/2	1/2	2/2	2/2	0/2	1/2	2/2	2/2	1/2
GHSA-f26g-jm89-4g65 CWE-77 · Rust · medium	11/20	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	1/2	0/2	0/2	0/2
GHSA-x9h5-r9v2-vcww CWE-122 · C · hard miss	1/21	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2
GHSA-9f49-8x56-jmjc CWE-416 · C · hard miss	1/21	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2	0/2
CVE-2026-5199 CWE-639 · Go · hardest miss	0/21	0/2	0/2	0/2	0/2	0/2	0/1	0/2	0/2	0/2	0/2	0/2	0/2

Token usage — cost to answer per quant

Mean per completed run. Gen tokens = what the model actually generates (the real compute-to-answer); context tokens = input re-read across the ReAct loop (proxy for tool-call turns); wall-clock = latency (also moved by host load). Percentages are vs BF16; + means the smaller quant works harder. A smaller quant that needs materially more gen tokens for the same detection is a false economy where both quants are runnable.

model	quant	completed runs	gen tokens/run	context tokens/run	wall-clock/run
Qwen 3.6 27B (dense)	BF16	36	4671 (+0.0%)	369151 (+0.0%)	696s (+0.0%)
Qwen 3.6 27B (dense)	Q8	36	5132 (+9.9%)	393977 (+6.7%)	373s (-46.3%)
Qwen 3.6 27B (dense)	Q6	36	5267 (+12.8%)	360289 (-2.4%)	364s (-47.6%)
Qwen 3.6 27B (dense)	Q4	36	6036 (+29.2%)	344132 (-6.8%)	336s (-51.7%)
Qwen 3.6 35B-A3B (MoE)	BF16	36	8544 (+0.0%)	272504 (+0.0%)	448s (+0.0%)
Qwen 3.6 35B-A3B (MoE)	Q8	35	7655 (-10.4%)	250490 (-8.1%)	246s (-45.0%)
Qwen 3.6 35B-A3B (MoE)	Q6	36	7333 (-14.2%)	279185 (+2.5%)	250s (-44.2%)
Qwen 3.6 35B-A3B (MoE)	Q4	36	8449 (-1.1%)	264276 (-3.0%)	253s (-43.5%)
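The "vs BF16" percentages in the table can be recomputed from its mean values. The snippet below is a small sketch with hypothetical helper names; it uses the 27B gen-token column exactly as printed.

```python
def pct_delta(value: float, baseline: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def fmt_cell(value: int, baseline: int, unit: str = "") -> str:
    """Render a table cell such as '5132 (+9.9%)'."""
    return f"{value}{unit} ({pct_delta(value, baseline):+.1f}%)"


# Gen tokens/run for Qwen 3.6 27B (dense), copied from the table above.
gen_27b = {"BF16": 4671, "Q8": 5132, "Q6": 5267, "Q4": 6036}

for quant, tokens in gen_27b.items():
    print(quant, fmt_cell(tokens, gen_27b["BF16"]))
```

The token columns reproduce exactly. Recomputing wall-clock from the rounded seconds can land 0.1 point away from the printed figure (373s vs 696s gives -46.4%, the table shows -46.3%), presumably because the table was computed from unrounded timings; that explanation is an assumption.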

Arm summary (union over 2 trials)

How many solved/medium cases and hard-miss cases each (model, quant, arm) detected in at least one trial. Denominators count only cases with a completed run.

model	quant	arm	solved-cases	hard-miss-cases
Qwen 3.6 27B (dense)	BF16	open	3/3	0/3
Qwen 3.6 27B (dense)	BF16	plan	1/3	0/3
Qwen 3.6 27B (dense)	BF16	checklist	3/3	0/3
Qwen 3.6 27B (dense)	Q8	open	3/3	0/3
Qwen 3.6 27B (dense)	Q8	plan	3/3	0/3
Qwen 3.6 27B (dense)	Q8	checklist	3/3	0/3
Qwen 3.6 27B (dense)	Q6	open	3/3	0/3
Qwen 3.6 27B (dense)	Q6	plan	2/3	0/3
Qwen 3.6 27B (dense)	Q6	checklist	2/3	0/3
Qwen 3.6 27B (dense)	Q4	open	3/3	0/3
Qwen 3.6 27B (dense)	Q4	plan	3/3	0/3
Qwen 3.6 27B (dense)	Q4	checklist	3/3	0/3
Qwen 3.6 35B-A3B (MoE)	BF16	open	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	BF16	plan	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	BF16	checklist	3/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q8	open	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q8	plan	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q8	checklist	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q6	open	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q6	plan	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q6	checklist	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q4	open	2/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q4	plan	1/3	0/3
Qwen 3.6 35B-A3B (MoE)	Q4	checklist	2/3	0/3
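The union-over-trials rule behind this table can be checked against the detection matrix. A minimal sketch, assuming each cell is available as a `(hits, completed_trials)` pair; the function name is hypothetical.

```python
def union_count(cells: list[tuple[int, int]]) -> tuple[int, int]:
    """Return (cases detected in at least one trial, cases with a completed run).

    `cells` holds one (hits, completed_trials) pair per case for a single
    (model, quant, arm) combination.
    """
    completed = [(hits, trials) for hits, trials in cells if trials > 0]
    detected = sum(1 for hits, _ in completed if hits > 0)
    return detected, len(completed)


# Qwen 3.6 27B, BF16, plan arm, read off the detection matrix.
solved_cells = [(2, 2), (0, 2), (0, 2)]  # hit anchor, guard, medium
hard_cells = [(0, 2), (0, 2), (0, 2)]    # the three hard-miss cases

print("solved-cases %d/%d" % union_count(solved_cells))   # solved-cases 1/3
print("hard-miss-cases %d/%d" % union_count(hard_cells))  # hard-miss-cases 0/3
```

A cell such as 0/1 still counts toward the denominator, since one run completed.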

Off-target findings: real secondary bugs vs false positives

These are real in-the-wild OSS files, not synthetic single-bug fixtures, so a finding that isn't the planted CVE is not automatically a false positive — it may be a genuine, never-fixed secondary bug. We judged every off-target finding with the code-grounded Opus FP-judge (it reads the pre-patch source, never the advisory) and classified each as a real bug, a false positive, or undetermined. Distinct sites are deduped globally across all four tiers (the verdict is a property of the code, not the precision), so this counts how many real secondary bugs actually live in each file.

case	distinct off-target sites	real secondary bug	false positive	undetermined
GHSA-w52v-v783-gw97 CWE-89 · JS	0	—	—	—
GHSA-j273-m5qq-6825 CWE-22 · Java	4	4	—	—
GHSA-f26g-jm89-4g65 CWE-77 · Rust	0	—	—	—
GHSA-x9h5-r9v2-vcww CWE-122 · C	25	4	21	—
GHSA-9f49-8x56-jmjc CWE-416 · C	37	28	9	—
CVE-2026-5199 CWE-639 · Go	2	—	2	—
all cases	68	36	32	0
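Global deduplication across tiers can be pictured as keying every off-target finding by its code site and keeping one verdict per site. The sketch below is illustrative only: the report does not say how a "site" is identified, so the `(file, line)` key is an assumption, and the sample findings are made up rather than taken from the run databases.

```python
from collections import Counter

# Hypothetical findings; file names, lines and verdicts are illustrative.
findings = [
    {"tier": "BF16", "file": "parser.c", "line": 212, "verdict": "real"},
    {"tier": "Q8",   "file": "parser.c", "line": 212, "verdict": "real"},
    {"tier": "Q4",   "file": "parser.c", "line": 212, "verdict": "real"},
    {"tier": "Q6",   "file": "parser.c", "line": 480, "verdict": "false_positive"},
]


def dedupe_sites(rows: list[dict]) -> dict[tuple[str, int], str]:
    """Collapse repeated reports of the same site into one verdict.

    The verdict is treated as a property of the code, so the first
    verdict seen for a site is kept regardless of which tier reported it.
    """
    sites: dict[tuple[str, int], str] = {}
    for row in rows:
        sites.setdefault((row["file"], row["line"]), row["verdict"])
    return sites


sites = dedupe_sites(findings)
print(len(findings), "rows ->", len(sites), "distinct sites")
print(Counter(sites.values()))
```

Row-level counts (four here) and site-level counts (two here) answer different questions, which is why the per-case table and the per-tier table report different totals.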

green = judge-confirmed real bug, red = false positive, grey = undetermined. Hard-miss cases bold. The on-target CVE hits (95 findings) are not shown here — those are the planted bug, already known real.

Does a smaller quant hallucinate more? Off-target rows by tier

Per-tier off-target finding volume split by verdict (row-level, so repeated reports of the same bug count each time). If the smaller quants don't post a materially lower % real, finding precision is quant-invariant too: the smaller build is no noisier, consistent with the detection result. Here Q4 (68% real) sits within a point of BF16 (69%), though it reports fewer off-target rows (22 vs 35).

tier	off-target rows	real	false positive	% real
BF16	35	24	11	69%
Q8	34	27	7	79%
Q6	36	29	7	81%
Q4	22	15	7	68%
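The % real column is the share of judged off-target rows that the judge confirmed as real. A short sketch that recomputes it from the table's own counts; the helper name is hypothetical.

```python
# (real, false positive) row counts per tier, copied from the table above.
TIER_ROWS = {
    "BF16": (24, 11),
    "Q8": (27, 7),
    "Q6": (29, 7),
    "Q4": (15, 7),
}


def pct_real(real: int, false_positive: int) -> int:
    """Percentage of off-target rows judged real, rounded to a whole percent."""
    return round(100 * real / (real + false_positive))


for tier, (real, fp) in TIER_ROWS.items():
    print(f"{tier}\t{real + fp}\t{real}\t{fp}\t{pct_real(real, fp)}%")
```

With 22 to 36 rows per tier, a swing of a few rows moves the percentage by several points, so differences of this size are best read as rough indications.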

Method & caveats

Models: qwen3.6-27b (dense, 10.20.30.1) and qwen3.6-35b-A3b (MoE, 10.20.30.2), self-hosted llama-server, free. Identical endpoints — only the loaded precision differs (BF16 / UD-Q8/Q6/Q4_K_XL).
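Per tier: 6 cases × 3 arms × 2 trials × 2 models = 72 cells, i.e. 36 per model. Completed runs per model: 36 of 36 in every tier, except Q8 on the 35B-A3B, where 35 of 36 completed (one plan-arm trial on CVE-2026-5199 is missing).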
Temperature 0.5 so repeats explore; llama-server's default seed=-1 randomizes per request. Detection is localization-only (free); on the hard cases zero findings landed near the planted bug, so there is nothing for a truth-judge to confirm.
Token counts are llama-server's reported usage. Context tokens are large because the ReAct loop re-sends the transcript each turn (cache-blind accounting); the cross-quant comparison is still apples-to-apples since the harness is identical.
Separate DBs per quant (nelson-promptlab-bf16.db, nelson-promptlab.db, nelson-promptlab-6bit.db, nelson-promptlab-4bit.db); the baseline benchmark DB is untouched.
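The cache-blind accounting mentioned in the token-count caveat explains why context tokens run to several hundred thousand per run: every turn is billed for the whole transcript so far. A toy model of that bookkeeping, with a hypothetical function name and made-up turn sizes:

```python
def cache_blind_context_tokens(tokens_added_per_turn: list[int]) -> int:
    """Sum the prompt size of every request in a loop that re-sends the transcript.

    tokens_added_per_turn[i] is the number of tokens newly appended to the
    transcript before request i (system prompt and task for the first turn,
    tool output and prior reply afterwards).
    """
    total = 0
    transcript = 0
    for added in tokens_added_per_turn:
        transcript += added   # transcript grows
        total += transcript   # whole transcript is counted again
    return total


# Illustrative sizes only, not measurements from the runs.
print(cache_blind_context_tokens([2000, 1500, 1500]))  # 10500
```

Because the total grows roughly with the square of the number of turns, context tokens serve as a proxy for tool-call turns rather than for fresh input read.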

Generated from nelson-promptlab-bf16.db, nelson-promptlab.db, nelson-promptlab-6bit.db, nelson-promptlab-4bit.db · 4 quants × 2 models × 6 cases × 3 arms × 2 trials.
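The run grid in the line above can be enumerated to confirm the cell counts quoted in this report. A minimal sketch; the short labels are taken from the report, while the variable names are arbitrary.

```python
from collections import Counter
from itertools import product

QUANTS = ["BF16", "Q8", "Q6", "Q4"]
MODELS = ["qwen3.6-27b", "qwen3.6-35b-A3b"]
CASES = [
    "GHSA-w52v-v783-gw97", "GHSA-j273-m5qq-6825", "GHSA-f26g-jm89-4g65",
    "GHSA-x9h5-r9v2-vcww", "GHSA-9f49-8x56-jmjc", "CVE-2026-5199",
]
ARMS = ["open", "plan", "checklist"]
TRIALS = [1, 2]

# One tuple per planned run: (quant, model, case, arm, trial).
grid = list(product(QUANTS, MODELS, CASES, ARMS, TRIALS))

per_tier = Counter(quant for quant, *_ in grid)
per_model_quant = Counter((model, quant) for quant, model, *_ in grid)

print(len(grid))                                # 288
print(per_tier["Q8"])                           # 72
print(per_model_quant[("qwen3.6-27b", "Q4")])   # 36
```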