Nelson Benchmark Leaderboard

29 competitors — 63 case hits across 252 audited cases

Competitors: 29
Cases audited: 252
Case hits: 63
Competitor spend: $162.05
Judge spend: $62.08

Leaderboard

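# | Competitor | Size | Detect | det+½ | Hits/Elig | Partial | Precision | FP/case | Other real | Cost/case | Latency | Tokens/case | Cases
1 | gpt-5.5-pro † | large | 50% | 50% | 2/4 |  | 100% | 0.00 | 2 | $22.82 | 576s | 581k | 4
2 | mimo-v2.5-pro | large | 44% | 50% | 4/9 | 1 | 100% | 0.00 | 5 | $0.08 | 475s | 397k | 9
3 | gpt-5.5 | large | 44% | 44% | 4/9 |  | 100% | 0.00 | 4 | $1.12 | 191s | 766k | 9
4 | opus-4.8 | large | 44% | 44% | 4/9 |  | 91% | 0.11 | 6 | $0.73 | 137s | 501k | 9
5 | gemini-3.5-flash | medium | 44% | 44% | 4/9 |  | 78% | 0.22 | 3 | $0.68 | 181s | 381k | 9
6 | deepseek-v4 (alias) | large | 44% | 44% | 4/9 |  | 75% | 0.22 | 2 | $0.10 | 91s | 623k | 9
7 | gemma4-26b-a4b | small | 43% | 43% | 3/7 |  | 100% | 0.00 |  | $0.00 | 638s | 329k | 7
8 | qwen3.7-max | large | 38% | 44% | 3/8 | 1 | 100% | 0.00 | 5 | $0.32 | 447s | 332k | 8
9 | qwen3.6-27b | small | 38% | 38% | 3/8 |  | 67% | 0.38 | 3 | $0.00 | 1278s | 733k | 8
10 | minimax-m3 | large | 33% | 33% | 3/9 |  | 86% | 0.11 | 2 | $0.23 | 488s | 718k | 9
11 | glm-5.2 | large | 33% | 39% | 3/9 | 1 | 75% | 0.22 | 3 | $0.49 | 305s | 329k | 9
12 | gemini-3.1-pro-preview | large | 33% | 44% | 3/9 | 2 | 59% | 1.00 | 9 | $1.45 | 242s | 334k | 9
13 | hy3-preview | large | 25% | 31% | 2/8 | 1 | 69% | 0.44 | 7 | $0.02 | 386s | 167k | 9
14 | haiku-4.5 | small | 25% | 25% | 2/8 |  | 53% | 0.78 | 6 | $0.35 | 201s | 1.6M | 9
15 | nemotron-3-nano-omni | small | 22% | 22% | 2/9 |  | 100% | 0.00 | 1 | $0.00 | 311s | 64k | 9
16 | nex-n2-pro | large | 22% | 22% | 2/9 |  | 100% | 0.00 | 1 | $0.00 | 207s | 914k | 9
17 | sonnet-4.6 | medium | 22% | 22% | 2/9 |  | 80% | 0.33 | 10 | $0.45 | 207s | 324k | 9
18 | north-mini-code | small | 22% | 22% | 2/9 |  | 67% | 0.22 | 1 | $0.00 | 593s | 579k | 9
19 | gemma4-31b | small | 22% | 33% | 2/9 | 2 | 60% | 0.44 | 4 | $0.00 | 1390s | 233k | 9
20 | glm-5.1 | large | 22% | 22% | 2/9 |  | 54% | 0.67 | 5 | $0.55 | 733s | 654k | 9
21 | nemotron-3-super-120b | small | 22% | 22% | 2/9 |  | 50% | 0.44 | 2 | $0.04 | 491s | 371k | 9
22 | laguna-xs.2 | small | 11% | 11% | 1/9 |  | 100% | 0.00 |  | $0.09 | 380s | 896k | 9
23 | kimi-k2.6 | large | 11% | 11% | 1/9 |  | 80% | 0.11 | 3 | $0.35 | 928s | 447k | 9
24 | owl-alpha | large | 11% | 22% | 1/9 | 2 | 40% | 0.67 | 3 | $0.00 | 406s | 611k | 9
25 | nemotron-3-ultra | medium | 11% | 11% | 1/9 |  | 33% | 0.89 | 3 | $0.35 | 1846s | 666k | 9
26 | kimi-k2.7-code | large | 11% | 17% | 1/9 | 1 | 33% | 0.67 | 2 | $0.51 | 438s | 605k | 9
27 | laguna-m.1 | medium | 0% | 11% | 0/9 | 2 | 9% | 1.11 | 1 | $0.00 | 465s | 858k | 9
28 | mistral-medium | medium | 0% | 0% | 0/9 |  |  | 0.00 |  | $0.00 | 63s | 253k | 9
29 | vibethinker-3b | small | 0% | 0% | 0/9 |  |  | 0.00 |  | $0.00 | 65s | 3k | 9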

Detect = case hits / eligible, where eligible = hits + partials + genuine misses. Undetermined, refused, and auth/infra-excluded cases are not in the denominator.
Partial = cases localized to the right spot but judged a different bug (right place, wrong bug). A partial is an eligible non-hit: it sits in the denominator where it would otherwise be a miss, so it never moves Detect or the ranking.
det+½ = (hits + 0.5·partials) / eligible. It shows the half-credit value of partials for information only.
Precision = true findings / (true findings + false positives).
Other real = confirmed real bugs the model found that are not the planted target CVE (extra capability, but not counted as detection).
Cost/case and Latency are the competitor's own spend and time per audited case.
★ = on a Pareto frontier below.
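The definitions above reduce to a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own scoring code; the function names are hypothetical, and how "true findings" are counted upstream is not specified here, so precision simply takes the two counts as inputs.

```python
from typing import Optional, Tuple


def detection_metrics(hits: int, partials: int, misses: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (Detect, det+1/2) for one competitor.

    Only hits, partials, and genuine misses are eligible; undetermined,
    refused, and excluded cases must already be left out of the inputs.
    """
    eligible = hits + partials + misses
    if eligible == 0:
        return None, None  # nothing scorable
    detect = hits / eligible
    det_half = (hits + 0.5 * partials) / eligible
    return detect, det_half


def precision(true_findings: int, false_positives: int) -> Optional[float]:
    """Return true / (true + false positives), or None with no scorable findings."""
    total = true_findings + false_positives
    return None if total == 0 else true_findings / total


# 4 hits, 1 partial, 4 misses: the partial adds to the denominator only.
detect, det_half = detection_metrics(hits=4, partials=1, misses=4)
print(f"Detect {detect:.0%}, det+1/2 {det_half:.0%}")
```

With 4 hits, 1 partial, and 4 misses the partial leaves Detect at 4/9 and only lifts det+½ to 4.5/9, which is why a partial never changes the ranking.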

† Partial coverage: this competitor completed fewer than the full 9 cases (see the Cases column). Its detection rate is therefore based on fewer audited cases and is not directly rank-comparable with full-corpus competitors — read it alongside the Cases count, not the rank.

Pareto frontier

Quality vs cost / case

[Scatter plot: quality (det x prec, 0.0 to 1.0) against cost / case ($0.00 to $1.45, lower is better); one labeled point per charted competitor.]

Quality vs latency / case

[Scatter plot: quality (det x prec, 0.0 to 1.0) against latency / case (63s to 1846s, lower is better); one labeled point per charted competitor.]

Quality = detection rate x precision (precision treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated — no other competitor is at least as good on quality while also cheaper/faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
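The non-dominance rule can be sketched as follows. This is an illustrative reading of the definition above, not the chart's actual code: `Point`, `quality`, and `pareto_frontier` are hypothetical names, and the sample values are rounded figures from four leaderboard rows, so the output only demonstrates the rule on that subset.

```python
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    name: str
    quality: float
    cost: float  # cost or latency per case; lower is better


def quality(detect: float, precision: Optional[float]) -> float:
    """Detection rate x precision; precision counts as 1.0 when there are no scorable findings."""
    return detect * (1.0 if precision is None else precision)


def pareto_frontier(points: List[Point]) -> List[str]:
    """Names of points that no other point dominates.

    A point is dominated when some other point has quality at least as
    good and is strictly cheaper (or faster).
    """
    frontier = []
    for p in points:
        dominated = any(
            q is not p and q.quality >= p.quality and q.cost < p.cost
            for q in points
        )
        if not dominated:
            frontier.append(p.name)
    return frontier


points = [
    Point("mimo-v2.5-pro", quality(0.44, 1.00), 0.08),
    Point("gpt-5.5", quality(0.44, 1.00), 1.12),
    Point("opus-4.8", quality(0.44, 0.91), 0.73),
    Point("gemma4-26b-a4b", quality(0.43, 1.00), 0.00),
]
print(pareto_frontier(points))
```

In this subset gpt-5.5 and opus-4.8 are dominated by mimo-v2.5-pro, which matches or beats their quality at lower cost. Note that under a strict "cheaper" comparison, points tied on cost never dominate each other.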

These charts omit gpt-5.5-pro (4/9) for three reasons. A competitor that audited fewer than 75% of the 9 cases measures its quality over a smaller, self-selected subset, so its point is not comparable with the full-corpus competitors. A cost-capped probe also sits so far out on the cost axis that every other competitor collapses into one indistinguishable cluster, making the trade-off unreadable. Finally, its position would imply a quality ranking the partial run does not establish. It remains in the leaderboard table above (marked †).
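As a rough sketch of the coverage rule, the check below keeps a competitor on the charts only when it audited at least 75% of the corpus. The constant and function names are hypothetical; the 75% threshold and the 9-case corpus come from the text above.

```python
TOTAL_CASES = 9       # size of the full corpus
MIN_COVERAGE = 0.75   # charts omit competitors below this share


def is_charted(cases_audited: int, total: int = TOTAL_CASES,
               min_coverage: float = MIN_COVERAGE) -> bool:
    """True when the competitor audited enough cases to appear on the Pareto charts."""
    return cases_audited / total >= min_coverage


for name, cases in [("gpt-5.5-pro", 4), ("gemma4-26b-a4b", 7), ("qwen3.7-max", 8)]:
    print(name, is_charted(cases))
```

At 4 of 9 cases (about 44%) gpt-5.5-pro falls below the threshold, while competitors with 7 or 8 audited cases (about 78% and 89%) stay on the charts.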

Tokens & time per case

haiku-4.5: 1.6M · 201s
nex-n2-pro: 914k · 207s
laguna-xs.2: 896k · 380s
laguna-m.1: 858k · 465s
gpt-5.5: 766k · 191s
qwen3.6-27b: 733k · 1278s
minimax-m3: 718k · 488s
nemotron-3-ultra: 666k · 1846s
glm-5.1: 654k · 733s
deepseek-v4 (alias): 623k · 91s
owl-alpha: 611k · 406s
kimi-k2.7-code: 605k · 438s
gpt-5.5-pro: 581k · 576s
north-mini-code: 579k · 593s
opus-4.8: 501k · 137s
kimi-k2.6: 447k · 928s
mimo-v2.5-pro: 397k · 475s
gemini-3.5-flash: 381k · 181s
nemotron-3-super-120b: 371k · 491s
gemini-3.1-pro-preview: 334k · 242s
qwen3.7-max: 332k · 447s
gemma4-26b-a4b: 329k · 638s
glm-5.2: 329k · 305s
sonnet-4.6: 324k · 207s
mistral-medium: 253k · 63s
gemma4-31b: 233k · 1390s
hy3-preview: 167k · 386s
nemotron-3-nano-omni: 64k · 311s
vibethinker-3b: 3k · 65s

Mean total tokens (prompt + completion, with the ReAct loop's resent context counted each turn) per audited case; the trailing number is mean latency/case. Bars are linear, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported — some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
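The metering tell described above (a near-zero total with input ≈ output) could be checked along these lines. This is only a sketch: the function name and both thresholds (`near_zero`, `rel_tol`) are assumptions chosen for illustration, not values used by the benchmark, and a flag is a prompt to inspect the endpoint rather than proof of broken metering.

```python
def looks_under_reported(prompt_tokens: int, completion_tokens: int,
                         near_zero: int = 5_000, rel_tol: float = 0.10) -> bool:
    """Flag usage that is both tiny and suspiciously balanced.

    near_zero and rel_tol are hypothetical thresholds: a real ReAct run
    resends context every turn, so prompt tokens normally far exceed
    completion tokens.
    """
    total = prompt_tokens + completion_tokens
    larger = max(prompt_tokens, completion_tokens)
    balanced = abs(prompt_tokens - completion_tokens) <= rel_tol * larger
    return total < near_zero and balanced


print(looks_under_reported(1_500, 1_500))    # tiny and balanced
print(looks_under_reported(390_000, 7_000))  # large, prompt-heavy
```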

Per-case results

Competitor CVE-2026-5199 CVE-2026-7474 GHSA-9f49-8x56-jmjc GHSA-cc7p-2j3x-x7xf GHSA-f26g-jm89-4g65 GHSA-j273-m5qq-6825 GHSA-mpxh-8fq3-x8mh GHSA-w52v-v783-gw97 GHSA-x9h5-r9v2-vcww
gpt-5.5-pro miss HIT HIT excl excl excl excl excl miss
mimo-v2.5-pro miss HIT miss miss HIT part HIT HIT miss
gpt-5.5 miss HIT miss miss HIT HIT miss HIT miss
opus-4.8 miss miss miss miss HIT HIT miss HIT HIT
gemini-3.5-flash miss HIT miss miss HIT HIT miss HIT miss
deepseek-v4 (alias) miss HIT miss miss HIT HIT miss HIT miss
gemma4-26b-a4b excl miss HIT miss HIT miss miss HIT excl
qwen3.7-max miss HIT miss miss HIT part excl HIT miss
qwen3.6-27b miss miss miss miss HIT HIT excl HIT miss
minimax-m3 miss miss miss miss HIT HIT miss HIT miss
glm-5.2 miss HIT miss miss HIT part miss HIT miss
gemini-3.1-pro-preview miss part miss miss HIT HIT part HIT miss
hy3-preview miss miss miss miss HIT HIT refu part miss
haiku-4.5 miss miss miss miss jerr HIT miss HIT miss
nemotron-3-nano-omni miss miss miss miss HIT miss miss HIT miss
nex-n2-pro miss miss miss miss miss HIT miss HIT miss
sonnet-4.6 miss miss miss miss miss HIT miss HIT miss
north-mini-code miss miss miss miss miss HIT miss HIT miss
gemma4-31b miss HIT miss part miss part miss HIT miss
glm-5.1 miss miss miss miss miss HIT miss HIT miss
nemotron-3-super-120b miss miss miss miss miss HIT miss HIT miss
laguna-xs.2 miss miss miss miss miss miss miss HIT miss
kimi-k2.6 miss miss miss miss miss miss miss HIT miss
owl-alpha miss part miss miss miss part miss HIT miss
nemotron-3-ultra miss miss miss miss miss miss miss HIT miss
kimi-k2.7-code miss miss miss miss miss part miss HIT miss
laguna-m.1 miss miss miss part miss miss miss part miss
mistral-medium miss miss miss miss miss miss miss miss miss
vibethinker-3b miss miss miss miss miss miss miss miss miss

HIT = detected
part = right spot, wrong bug (half credit, eligible non-hit)
miss = looked, found nothing
jerr = judge undetermined (out of denominator)
refu = model refused the task (out of denominator, never a miss)
excl = auth/infra failure (never a miss)
· = not run
A HIT marked n/N was found in only n of N trials (flaky); a bare HIT was found in every trial.
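The status codes map onto the leaderboard's Hits/Elig column as in the sketch below. It assumes a row is available as whitespace-separated status codes without the competitor name, and it does not handle flaky `n/N` markers, whose exact text format is not shown here; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict

# Only these statuses count toward the Detect denominator.
ELIGIBLE_STATUSES = ("HIT", "part", "miss")


def tally(statuses: str) -> Dict[str, int]:
    """Count hits, partials, and eligible cases from one row of status codes."""
    counts = Counter(statuses.split())
    eligible = sum(counts[s] for s in ELIGIBLE_STATUSES)
    return {"hits": counts["HIT"], "partials": counts["part"], "eligible": eligible}


# hy3-preview: one refusal drops out of the denominator, giving 2/8.
print(tally("miss miss miss miss HIT HIT refu part miss"))
# gpt-5.5-pro: five excluded cases leave 2/4.
print(tally("miss HIT HIT excl excl excl excl excl miss"))
```

Because `jerr`, `refu`, `excl`, and `·` are simply absent from `ELIGIBLE_STATUSES`, they can never be counted as misses.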