Nelson Benchmark Leaderboard

9 competitors, 1 case hit across 36 audited cases

Competitors: 9
Cases audited: 36
Case hits: 1
Competitor spend: $14.84
Judge spend: $27.80

Leaderboard

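| # | Competitor | Size | Detect | det+½ | Hits/Elig | Partial | Precision | FP/case | Other real | Cost/case | Latency | Tokens/case | Cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | deepseek-v4-pro | large | 8% | 25% | 0%–25% | | 60% | 1.00 | 5 | $1.05 | 809s | 2.4M | 4 |
| 2 | mimo-v2.5-pro | large | 0% | 0% | 0%–0% | | 71% | 0.50 | 5 | $0.44 | 1571s | 2.1M | 4 |
| 3 | gemma-4-31b | small | 0% | 12% | 0%–0% | 1 | 40% | 1.50 | 4 | $0.00 | 3296s | 994k | 4 |
| 4 | mimo-reposcope | large | 0% | 0% | 0%–0% | | 31% | 2.75 | 5 | $0.34 | 846s | 1.7M | 4 |
| 5 | mimo-reposcope-2m | large | 0% | 0% | 0%–0% | | 20% | 5.00 | 5 | $0.37 | 743s | 1.8M | 4 |
| 6 | gemma-4-31b-reposcope | small | 0% | 0% | 0%–0% | | 0% | 2.75 | | $0.00 | 2591s | 1.3M | 4 |
| 7 | gemma-4-31b-reposcope-2m | small | 0% | 0% | 0%–0% | | 0% | 2.00 | | $0.00 | 2804s | 1.3M | 4 |
| 8 | deepseek-pro-reposcope-2m | large | 0% | 0% | 0%–0% | | 0% | 1.00 | | $0.78 | 340s | 1.8M | 4 |
| 9 | deepseek-pro-reposcope | large | 0% | 0% | 0%–0% | | | 0.00 | | $0.73 | 398s | 1.6M | 4 |

Every Detect value is the mean of 3 trials. Empty cells were empty in the source table.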

Detect = case hits / eligible, where eligible = hits + partials + genuine misses; undetermined, refused, and auth/infra-excluded cases are not in the denominator.
Partial = cases localized to the right spot but judged a different bug (right place, wrong bug). A partial is an eligible non-hit: it sits in the denominator where it would otherwise be a miss, so it never moves Detect or the ranking.
det+½ = (hits + 0.5·partials) / eligible; it shows the half-credit value of partials for information only.
Precision = true findings / (true findings + false positives).
Other real = confirmed real bugs the model found that are not the planted target CVE (extra capability, but not counted as detection).
Cost/case and Latency = the competitor's own spend and time per audited case.
★ = on a Pareto frontier below.
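The definitions above can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the benchmark's own scoring code: the `TrialCounts` class, its field names, and the example counts are hypothetical, and returning `None` for an empty denominator is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialCounts:
    """Hypothetical container for one competitor's counts in one trial."""
    hits: int
    partials: int
    misses: int            # genuine misses only; excluded cases are not counted here
    true_findings: int
    false_positives: int


def eligible(c: TrialCounts) -> int:
    # Denominator: hits + partials + genuine misses.
    return c.hits + c.partials + c.misses


def detect(c: TrialCounts) -> Optional[float]:
    n = eligible(c)
    return c.hits / n if n else None


def det_half(c: TrialCounts) -> Optional[float]:
    # Partials earn half credit here, but never count toward Detect.
    n = eligible(c)
    return (c.hits + 0.5 * c.partials) / n if n else None


def precision(c: TrialCounts) -> Optional[float]:
    scored = c.true_findings + c.false_positives
    return c.true_findings / scored if scored else None


# Made-up counts, not taken from the leaderboard.
example = TrialCounts(hits=1, partials=1, misses=2, true_findings=3, false_positives=2)
print(detect(example), det_half(example), precision(example))
```

With these made-up counts the partial leaves Detect at 1/4 while det+½ rises to 1.5/4, which is the "never moves Detect" behavior described above.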

This run used repeated trials (--repeat). Detect is the mean detection rate across trials — each trial is scored on its own and the rates averaged, so it reflects a typical single run, not best-of-N. The ranking is by that mean. Hits/Elig shows the per-trial detection range (min to max); hover it for the best-of-N pooled count and the spread. In the per-case matrix, a HIT marked n/N was found in only n of N trials (flaky). See bench noise for the full per-trial breakdown.
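As a rough illustration of the difference between the mean-of-trials rate and a best-of-N pooled rate, the sketch below uses the four case IDs from the per-case table and one hit in one of three trials. Which trial contained the hit is not stated in the report, so the order in `trial_hits` is an assumption, as is treating "pooled" as "hit in at least one trial" and treating every case as eligible in every trial.

```python
# Case IDs from the per-case table; all four are assumed eligible in every trial.
CASES = [
    "CVE-2026-5199",
    "GHSA-9f49-8x56-jmjc",
    "GHSA-cc7p-2j3x-x7xf",
    "GHSA-x9h5-r9v2-vcww",
]

# Hypothetical per-trial hit sets: one hit, placed arbitrarily in trial 2.
trial_hits = [set(), {"GHSA-cc7p-2j3x-x7xf"}, set()]

# Each trial is scored on its own, then the rates are averaged.
per_trial_rates = [len(hits) / len(CASES) for hits in trial_hits]
mean_detect = sum(per_trial_rates) / len(per_trial_rates)
detect_range = (min(per_trial_rates), max(per_trial_rates))

# Best-of-N pooling (assumed meaning): a case counts if any trial hit it.
pooled_hits = set().union(*trial_hits)
best_of_n = len(pooled_hits) / len(CASES)


def matrix_label(case: str) -> str:
    """Label a case as in the per-case matrix (partials are ignored in this sketch)."""
    n = sum(case in hits for hits in trial_hits)
    total = len(trial_hits)
    if n == 0:
        return "miss"
    return "HIT" if n == total else f"HIT {n}/{total}"


print(f"{mean_detect:.0%}", detect_range, best_of_n, matrix_label("GHSA-cc7p-2j3x-x7xf"))
```

Under these assumptions the mean rate is 1/12 (displayed as 8%), the per-trial range is 0% to 25%, and the pooled rate is 25%, which shows why the mean understates what a best-of-N reading would report.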

Pareto frontier

Quality vs cost / case

[Scatter plot: quality (det x prec), 0.0 to 1.0, against cost / case, $0.00 to $1.05 (lower is better); one labeled point per competitor.]

Quality vs latency / case

[Scatter plot: quality (det x prec), 0.0 to 1.0, against latency / case, 340s to 3296s (lower is better); one labeled point per competitor.]

Quality = detection rate x precision (precision treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated — no other competitor is at least as good on quality while also cheaper/faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
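One way to read the non-dominated rule is sketched below. It follows the wording above literally (a point is dominated when another competitor has quality at least as high and a strictly lower cost), which may differ in tie handling from what the report actually implements. The `Point` type and the four example competitors are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    name: str
    detect: float
    precision: Optional[float]  # None = no scorable findings
    cost: float                 # cost per case; latency would work the same way


def quality(p: Point) -> float:
    # Precision is treated as 1.0 when there were no scorable findings.
    prec = 1.0 if p.precision is None else p.precision
    return p.detect * prec


def pareto_frontier(points: List[Point]) -> List[str]:
    frontier = []
    for p in points:
        dominated = any(
            quality(q) >= quality(p) and q.cost < p.cost
            for q in points
            if q is not p
        )
        if not dominated:
            frontier.append(p.name)
    return frontier


# Made-up competitors for illustration only.
points = [
    Point("model-a", 0.50, 0.80, 1.00),  # quality 0.40, most expensive
    Point("model-b", 0.25, None, 0.10),  # quality 0.25 (precision defaults to 1.0)
    Point("model-c", 0.25, 0.60, 0.50),  # quality 0.15, beaten by model-b on both axes
    Point("model-d", 0.00, 0.40, 0.05),  # quality 0.00, but nothing is cheaper
]
print(pareto_frontier(points))
```

Note that a zero-quality competitor can still sit on the frontier under this rule if nothing is cheaper, as `model-d` does here.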

Tokens & time per case

deepseek-v4-pro: 2.4M · 809s
mimo-v2.5-pro: 2.1M · 1571s
mimo-reposcope-2m: 1.8M · 743s
deepseek-pro-reposcope-2m: 1.8M · 340s
mimo-reposcope: 1.7M · 846s
deepseek-pro-reposcope: 1.6M · 398s
gemma-4-31b-reposcope-2m: 1.3M · 2804s
gemma-4-31b-reposcope: 1.3M · 2591s
gemma-4-31b: 994k · 3296s

Mean total tokens (prompt + completion, with the ReAct loop's resent context counted each turn) per audited case; the trailing number is mean latency/case. Bars are linear, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported — some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
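The "near-zero bar with input ≈ output" tell could be checked mechanically along the following lines. This is a heuristic sketch only: the function name, the comparison against the largest total in the run, and both thresholds are assumptions chosen for illustration, not values used by the benchmark.

```python
def looks_under_reported(
    prompt_tokens: int,
    completion_tokens: int,
    run_max_total: int,
    small_fraction: float = 0.01,    # assumed: "near-zero" = under 1% of the largest bar
    balance_tolerance: float = 0.2,  # assumed: "input ≈ output" = within 20% of each other
) -> bool:
    """Flag usage numbers that look like broken metering rather than frugality."""
    if run_max_total <= 0:
        return False
    total = prompt_tokens + completion_tokens
    near_zero = total <= small_fraction * run_max_total
    larger = max(prompt_tokens, completion_tokens)
    balanced = larger == 0 or abs(prompt_tokens - completion_tokens) <= balance_tolerance * larger
    return near_zero and balanced


# Hypothetical readings against a run whose largest bar is 2.4M tokens.
print(looks_under_reported(1_200, 1_100, 2_400_000))     # tiny and balanced
print(looks_under_reported(950_000, 44_000, 2_400_000))  # large and prompt-heavy
```

The balance check rests on the observation that resent context is counted every turn, so a correctly metered ReAct run would be expected to report far more prompt tokens than completion tokens.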

Per-case results

| Competitor | CVE-2026-5199 | GHSA-9f49-8x56-jmjc | GHSA-cc7p-2j3x-x7xf | GHSA-x9h5-r9v2-vcww |
|---|---|---|---|---|
| deepseek-v4-pro | miss | miss | HIT 1/3 | miss |
| mimo-v2.5-pro | miss | miss | miss | miss |
| gemma-4-31b | miss | miss | part | miss |
| mimo-reposcope | miss | miss | miss | miss |
| mimo-reposcope-2m | miss | miss | miss | miss |
| gemma-4-31b-reposcope | miss | miss | miss | miss |
| gemma-4-31b-reposcope-2m | miss | miss | miss | miss |
| deepseek-pro-reposcope-2m | miss | miss | miss | miss |
| deepseek-pro-reposcope | miss | miss | miss | miss |

HIT = detected; part = right spot, wrong bug (half credit, eligible non-hit); miss = looked, found nothing; jerr = judge undetermined (out of denominator); refu = model refused the task (out of denominator, never a miss); excl = auth/infra failure (never a miss); · = not run. A HIT marked n/N was found in only n of N trials (flaky); a bare HIT was found in every trial.
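The legend above implies a simple mapping from matrix cells to denominator membership. The sketch below tallies one row of cells under that reading; the function and set names are hypothetical, and stripping the `n/N` suffix before counting is a simplification that treats a flaky hit like any other hit.

```python
from collections import Counter
from typing import Dict, List

ELIGIBLE = {"HIT", "part", "miss"}   # counted in the denominator
EXCLUDED = {"jerr", "refu", "excl"}  # out of the denominator, never a miss
NOT_RUN = "·"


def tally(cells: List[str]) -> Dict[str, int]:
    """Count one competitor's row of per-case cells."""
    # "HIT 1/3" -> "HIT": drop the flaky n/N suffix (simplification).
    codes = [cell.split()[0] for cell in cells]
    counts = Counter(codes)
    return {
        "hits": counts["HIT"],
        "partials": counts["part"],
        "misses": counts["miss"],
        "eligible": sum(counts[code] for code in ELIGIBLE),
        "excluded": sum(counts[code] for code in EXCLUDED),
        "not_run": counts[NOT_RUN],
    }


# The gemma-4-31b row from the matrix, plus a made-up row with exclusions.
gemma_row = tally(["miss", "miss", "part", "miss"])
mixed_row = tally(["HIT 1/3", "refu", "jerr", "·"])
print(gemma_row)
print(mixed_row)
```

In the made-up mixed row only the hit is eligible, so a refusal or a judge error shrinks the denominator instead of being scored as a miss.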