Nelson Benchmark Leaderboard

9 competitors, 1 case hit across 36 audited cases

Competitors: 9
Cases audited: 36
Case hits: 1
Competitor spend: $14.84
Judge spend: $27.80

Leaderboard

#	Competitor	Size	Detect (mean of 3 trials)	det+½	Hits/Elig	Partial	Precision	FP/case	Other real	Cost/case	Latency	Tokens/case	Cases
1	deepseek-v4-pro ★	large	8%	25%	0%–25%	—	60%	1.00	5	$1.05	809s	2.4M	4
2	mimo-v2.5-pro	large	0%	0%	0%–0%	—	71%	0.50	5	$0.44	1571s	2.1M	4
3	gemma-4-31b ★	small	0%	12%	0%–0%	1	40%	1.50	4	$0.00	3296s	994k	4
4	mimo-reposcope	large	0%	0%	0%–0%	—	31%	2.75	5	$0.34	846s	1.7M	4
5	mimo-reposcope-2m	large	0%	0%	0%–0%	—	20%	5.00	5	$0.37	743s	1.8M	4
6	gemma-4-31b-reposcope ★	small	0%	0%	0%–0%	—	0%	2.75	—	$0.00	2591s	1.3M	4
7	gemma-4-31b-reposcope-2m ★	small	0%	0%	0%–0%	—	0%	2.00	—	$0.00	2804s	1.3M	4
8	deepseek-pro-reposcope-2m ★	large	0%	0%	0%–0%	—	0%	1.00	—	$0.78	340s	1.8M	4
9	deepseek-pro-reposcope	large	0%	0%	0%–0%	—	—	0.00	—	$0.73	398s	1.6M	4

Detect = case hits / eligible, where eligible = hits + partials + genuine misses; undetermined, refused, and auth/infra-excluded cases are not in the denominator. Partial = cases localized to the right spot but judged a different bug (right place, wrong bug). A partial is an eligible non-hit: it sits in the denominator where it would otherwise be a miss, so it never moves Detect or the ranking; det+½ = (hits + 0.5·partials) / eligible shows its half-credit value for information only. Precision = true findings / (true findings + false positives). Other real = confirmed real bugs the model found that are not the planted target CVE (extra capability, but not counted as detection). Cost/case and Latency are the competitor's own spend and time per audited case. ★ = on a Pareto frontier below.
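The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the `Tally` class and its field names are hypothetical, and the finding counts in the usage note are illustrative values chosen to be consistent with one table row, not figures taken from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tally:
    """Hypothetical per-competitor counts (names are illustrative)."""
    hits: int
    partials: int
    misses: int
    true_findings: int = 0
    false_positives: int = 0

    @property
    def eligible(self) -> int:
        # Undetermined, refused, and auth/infra-excluded cases are
        # simply never added here, so they stay out of the denominator.
        return self.hits + self.partials + self.misses


def detect(t: Tally) -> Optional[float]:
    """Case hits / eligible; None when nothing is eligible."""
    return t.hits / t.eligible if t.eligible else None


def det_half(t: Tally) -> Optional[float]:
    """Half credit for partials: (hits + 0.5 * partials) / eligible."""
    return (t.hits + 0.5 * t.partials) / t.eligible if t.eligible else None


def precision(t: Tally) -> Optional[float]:
    """True findings / (true + false positives); None if nothing scorable."""
    scored = t.true_findings + t.false_positives
    return t.true_findings / scored if scored else None


# One partial and three misses, as in the gemma-4-31b row.
gemma = Tally(hits=0, partials=1, misses=3, true_findings=4, false_positives=6)
print(detect(gemma), det_half(gemma), precision(gemma))
```

With these inputs Detect stays at 0 while det+½ is 0.125 (displayed as 12%), which shows why a partial changes the informational column but not the ranking.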

This run used repeated trials (--repeat). Detect is the mean detection rate across trials — each trial is scored on its own and the rates averaged, so it reflects a typical single run, not best-of-N. The ranking is by that mean. Hits/Elig shows the per-trial detection range (min to max); hover it for the best-of-N pooled count and the spread. In the per-case matrix, a HIT marked n/N was found in only n of N trials (flaky). See bench noise for the full per-trial breakdown.
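The difference between the mean rate and best-of-N can be sketched as follows. This is a simplified illustration under stated assumptions: `summarize_trials` is a hypothetical helper, the eligible set is assumed to be identical in every trial, and which of the three trials produced the hit is a guess made only for the example.

```python
from statistics import mean


def summarize_trials(trial_hits, eligible_cases):
    """trial_hits: one set of hit case IDs per trial.

    Assumes every trial has the same eligible cases (a simplification).
    """
    n = len(eligible_cases)
    # Score each trial on its own, then average the rates.
    rates = [len(hits & eligible_cases) / n for hits in trial_hits]
    # Best-of-N pools hits: a case counts if any trial found it.
    pooled = set().union(*trial_hits) & eligible_cases
    return {
        "mean": mean(rates),
        "range": (min(rates), max(rates)),
        "best_of_n": len(pooled) / n,
    }


cases = {
    "CVE-2026-5199",
    "GHSA-9f49-8x56-jmjc",
    "GHSA-cc7p-2j3x-x7xf",
    "GHSA-x9h5-r9v2-vcww",
}
# One hit in one of three trials (the trial index is an assumption).
trials = [{"GHSA-cc7p-2j3x-x7xf"}, set(), set()]
summary = summarize_trials(trials, cases)
print(summary)
```

For this input the mean is about 0.083 (displayed as 8%), the per-trial range is 0 to 0.25, and the pooled best-of-N rate is 0.25, so a flaky hit looks three times better under pooling than under the mean.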

Pareto frontier

[Charts: Quality vs cost / case; Quality vs latency / case]

Quality = detection rate × precision (precision treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated: no other competitor is at least as good on quality while also cheaper/faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
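A minimal sketch of the dominance check, using the rounded values displayed in the leaderboard table. The `Point` type and `frontier` function are hypothetical names, and the rule coded here (another competitor has quality at least as high and a strictly lower cost or latency) is one reading of the sentence above. Because the inputs are rounded, ties such as the three $0.00 rows may resolve differently on unrounded data.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    name: str
    detect: float
    precision: Optional[float]  # None = no scorable findings
    cost: float                 # $ per case
    latency: float              # seconds per case


def quality(p: Point) -> float:
    # Precision counts as 1.0 when there were no scorable findings.
    return p.detect * (1.0 if p.precision is None else p.precision)


def frontier(points, axis):
    """Names of points not dominated on (quality, axis); lower axis wins."""
    keep = []
    for p in points:
        dominated = any(
            quality(o) >= quality(p) and getattr(o, axis) < getattr(p, axis)
            for o in points
            if o is not p
        )
        if not dominated:
            keep.append(p.name)
    return keep


POINTS = [
    Point("deepseek-v4-pro", 0.08, 0.60, 1.05, 809),
    Point("mimo-v2.5-pro", 0.0, 0.71, 0.44, 1571),
    Point("gemma-4-31b", 0.0, 0.40, 0.00, 3296),
    Point("mimo-reposcope", 0.0, 0.31, 0.34, 846),
    Point("mimo-reposcope-2m", 0.0, 0.20, 0.37, 743),
    Point("gemma-4-31b-reposcope", 0.0, 0.0, 0.00, 2591),
    Point("gemma-4-31b-reposcope-2m", 0.0, 0.0, 0.00, 2804),
    Point("deepseek-pro-reposcope-2m", 0.0, 0.0, 0.78, 340),
    Point("deepseek-pro-reposcope", 0.0, None, 0.73, 398),
]

starred = set(frontier(POINTS, "cost")) | set(frontier(POINTS, "latency"))
print(sorted(starred))
```

Under this reading, the union of the two frontiers is the same five competitors that carry a ★ in the table: the only model with nonzero quality, the three zero-cost rows, and the fastest row.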

Tokens & time per case

Mean total tokens (prompt + completion, with the ReAct loop's resent context counted each turn) per audited case; the trailing number is mean latency/case. Bars are linear, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported — some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
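The "near-zero bar with input ≈ output" tell could be turned into a rough automated flag. The sketch below is only a heuristic: the function name, the `floor` of 10,000 tokens, and the 10% tolerance are all assumptions picked for illustration, not thresholds used by the benchmark.

```python
def metering_suspect(prompt_tokens, completion_tokens, *, floor=10_000, rel_tol=0.10):
    """Flag usage that is both tiny and suspiciously balanced.

    floor and rel_tol are hypothetical thresholds, not benchmark settings.
    """
    total = prompt_tokens + completion_tokens
    if total == 0:
        # No usage reported at all: treat as broken metering.
        return True
    near_zero = total < floor
    larger = max(prompt_tokens, completion_tokens)
    balanced = abs(prompt_tokens - completion_tokens) <= rel_tol * larger
    return near_zero and balanced


print(metering_suspect(120, 118))            # tiny and balanced
print(metering_suspect(2_300_000, 100_000))  # large, prompt-heavy
```

A flagged competitor would need a manual check before its tokens/case or cost/case is trusted, since the flag cannot tell broken metering from a run that genuinely did very little.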

Per-case results

Competitor	CVE-2026-5199	GHSA-9f49-8x56-jmjc	GHSA-cc7p-2j3x-x7xf	GHSA-x9h5-r9v2-vcww
deepseek-v4-pro	miss	miss	HIT 1/3	miss
mimo-v2.5-pro	miss	miss	miss	miss
gemma-4-31b	miss	miss	part	miss
mimo-reposcope	miss	miss	miss	miss
mimo-reposcope-2m	miss	miss	miss	miss
gemma-4-31b-reposcope	miss	miss	miss	miss
gemma-4-31b-reposcope-2m	miss	miss	miss	miss
deepseek-pro-reposcope-2m	miss	miss	miss	miss
deepseek-pro-reposcope	miss	miss	miss	miss

HIT = detected; part = right spot, wrong bug (half credit, eligible non-hit); miss = looked, found nothing; jerr = judge undetermined (out of denominator); refu = model refused the task (out of denominator, never a miss); excl = auth/infra failure (never a miss); · = not run. A HIT marked n/N was found in only n of N trials (flaky); a bare HIT was found in every trial.
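The legend can be read as a small parsing rule for matrix cells. The sketch below is illustrative only: `parse_cell` and `row_detect` are hypothetical helpers, and the mean rate it derives assumes the eligible cases are the same in every trial.

```python
import re

ELIGIBLE = {"HIT", "part", "miss"}        # counted in the Detect denominator
EXCLUDED = {"jerr", "refu", "excl", "·"}  # out of the denominator, never a miss


def parse_cell(cell):
    """Return (status, hit_fraction) for one matrix cell."""
    text = cell.strip()
    flaky = re.fullmatch(r"HIT\s+(\d+)/(\d+)", text)
    if flaky:
        # "HIT n/N": found in only n of N trials.
        return "HIT", int(flaky.group(1)) / int(flaky.group(2))
    if text not in ELIGIBLE | EXCLUDED:
        raise ValueError(f"unknown cell status: {cell!r}")
    # A bare HIT was found in every trial.
    return text, 1.0 if text == "HIT" else 0.0


def row_detect(cells):
    """Return (pooled_rate, mean_rate) over eligible cells, or None."""
    eligible = [(s, f) for s, f in map(parse_cell, cells) if s in ELIGIBLE]
    if not eligible:
        return None
    pooled = sum(1 for s, _ in eligible if s == "HIT") / len(eligible)
    mean_rate = sum(f for _, f in eligible) / len(eligible)
    return pooled, mean_rate


# The deepseek-v4-pro row from the matrix above.
print(row_detect(["miss", "miss", "HIT 1/3", "miss"]))
```

For the deepseek-v4-pro row this gives a pooled rate of 0.25 and a mean rate of about 0.083, in line with the 0%–25% range and the 8% Detect value in the leaderboard.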