Nelson Benchmark Leaderboard

18 competitors — 42 case hits across 155 audited cases

Competitors: 18
Cases audited: 155
Case hits: 42
Competitor spend: $146.99
Judge spend: $41.61

Leaderboard

#   Competitor              Size    Detect  Hits/Elig  Precision  FP/case  Other real  Cost/case  Latency  Tokens/case  Cases
1   gpt-5.5-pro             large   50%     2/4        100%       0.00     2           $22.82     576s     581k         4
2   mimo-v2.5-pro           large   44%     4/9        100%       0.00     5           $0.08      475s     397k         9
3   gpt-5.5                 large   44%     4/9        100%       0.00     4           $1.12      191s     766k         9
4   opus-4.8                large   44%     4/9        91%        0.11     6           $0.73      137s     501k         9
5   gemini-3.5-flash        medium  44%     4/9        78%        0.22     3           $0.68      181s     381k         9
6   deepseek-v4-pro         large   44%     4/9        75%        0.22     2           $0.10      91s      623k         9
7   qwen3.7-max             large   38%     3/8        100%       0.00     5           $0.32      447s     332k         8
8   qwen3.6-27b             small   38%     3/8        67%        0.38     3           $0.00      1278s    733k         8
9   gemini-3.1-pro-preview  large   33%     3/9        59%        1.00     9           $1.45      242s     334k         9
10  haiku-4.5               small   25%     2/8        53%        0.78     6           $0.35      201s     1.6M         9
11  sonnet-4.6              medium  22%     2/9        80%        0.33     10          $0.45      207s     324k         9
12  glm-5.1                 large   22%     2/9        54%        0.67     5           $0.55      733s     654k         9
13  nemotron-3-super-120b   large   22%     2/9        50%        0.44     2           $0.04      491s     371k         9
14  hy3-preview             large   12%     1/8        69%        0.44     8           $0.02      386s     167k         9
15  kimi-k2.6               large   11%     1/9        80%        0.11     3           $0.35      928s     447k         9
16  owl-alpha               large   11%     1/9        40%        0.67     3           $0.00      406s     611k         9
17  laguna-m.1              medium  0%      0/9        9%         1.11     1           $0.00      465s     858k         9
18  mistral-medium          medium  0%      0/9        -          0.00     -           $0.00      63s      253k         9

Detect = case hits / eligible cases, where eligible = hits + genuine misses; undetermined, refused, and auth/infra-excluded cases are not in the denominator. Precision = true findings / (true findings + false positives). Other real = confirmed real bugs the model found that are not the planted target CVE (extra capability, but not counted as detection). Cost/case is the competitor's own spend per audited case, and Latency is its time per audited case. ★ = on a Pareto frontier below.
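As a rough illustration of the two ratios defined above, the sketch below computes them from raw counts. The function names, and the choice to return None when a denominator is empty, are assumptions made for this example and are not taken from the benchmark's own code.

```python
from typing import Optional


def detect_rate(hits: int, misses: int) -> Optional[float]:
    """Detect = hits / (hits + genuine misses).

    Undetermined, refused, and excluded cases are simply not passed in,
    so they never reach the denominator.
    """
    eligible = hits + misses
    return hits / eligible if eligible else None


def precision(true_findings: int, false_positives: int) -> Optional[float]:
    """Precision = true findings / (true findings + false positives)."""
    total = true_findings + false_positives
    return true_findings / total if total else None


# haiku-4.5 in the table: 2 hits, 6 misses, 1 judge-undetermined case left out.
print(f"{detect_rate(2, 6):.0%}")  # 25%
```

Returning None rather than 0.0 keeps "nothing to score" distinct from "scored and got everything wrong", which matters for the precision column when a competitor reports no findings at all.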

† Partial coverage: this competitor completed fewer than the full 9 cases (see the Cases column). Its detection rate is therefore based on fewer audited cases and is not directly rank-comparable with full-corpus competitors — read it alongside the Cases count, not the rank.

Pareto frontier

Quality vs cost / case

[Scatter plot. y-axis: quality (det x prec), 0.0 to 1.0. x-axis: cost / case, $0.00 to $1.45 (lower is better). Points: mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash, deepseek-v4-pro, qwen3.7-max, qwen3.6-27b, gemini-3.1-pro-preview, haiku-4.5, sonnet-4.6, glm-5.1, nemotron-3-super-120b, hy3-preview, kimi-k2.6, owl-alpha, laguna-m.1, mistral-medium.]

Quality vs latency / case

[Scatter plot. y-axis: quality (det x prec), 0.0 to 1.0. x-axis: latency / case, 63s to 1278s (lower is better). Points: mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash, deepseek-v4-pro, qwen3.7-max, qwen3.6-27b, gemini-3.1-pro-preview, haiku-4.5, sonnet-4.6, glm-5.1, nemotron-3-super-120b, hy3-preview, kimi-k2.6, owl-alpha, laguna-m.1, mistral-medium.]

Quality = detection rate x precision (precision treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated — no other competitor is at least as good on quality while also cheaper/faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
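The quality score and the non-dominated test described above can be sketched as follows. This is a minimal illustration, not the page's implementation: the Point class and function names are hypothetical, and ties are handled with the usual weak-dominance rule (no worse on both axes, strictly better on at least one), which may differ from how the charts treat ties. The sample is a five-competitor subset of the table, so its result is not the real frontier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Point:
    name: str
    detect: float               # detection rate, 0.0 to 1.0
    precision: Optional[float]  # None = no scorable findings
    cost: float                 # dollars per case; lower is better

    @property
    def quality(self) -> float:
        # Precision is treated as 1.0 when there were no scorable findings.
        prec = 1.0 if self.precision is None else self.precision
        return self.detect * prec


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Subset of the leaderboard rows (detect, precision, cost/case).
sample = [
    Point("mimo-v2.5-pro", 4 / 9, 1.00, 0.08),
    Point("gpt-5.5", 4 / 9, 1.00, 1.12),
    Point("opus-4.8", 4 / 9, 0.91, 0.73),
    Point("deepseek-v4-pro", 4 / 9, 0.75, 0.10),
    Point("mistral-medium", 0.0, None, 0.00),
]
print([p.name for p in pareto_frontier(sample)])
# ['mimo-v2.5-pro', 'mistral-medium']
```

The same functions work for the latency chart by passing latency in seconds in place of cost.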

These charts omit gpt-5.5-pro, which audited only 4 of the 9 cases. A competitor that audited fewer than 75% of the 9 cases measures its quality over a smaller, self-selected subset, so its point is not comparable with the full-corpus competitors, and its position would imply a quality ranking the partial run does not establish. In addition, a cost-capped probe sits so far out on the cost axis that every other competitor collapses into one indistinguishable cluster, making the trade-off unreadable. gpt-5.5-pro remains in the leaderboard table above (marked †).
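The 75% coverage rule can be written as a one-line filter. The constant and function names below are hypothetical; only the threshold and the 9-case corpus size come from the text.

```python
TOTAL_CASES = 9       # size of the full corpus
MIN_COVERAGE = 0.75   # minimum share of cases a competitor must have audited


def is_charted(cases_audited: int, total: int = TOTAL_CASES) -> bool:
    """Return True if the competitor audited enough cases to be plotted."""
    return cases_audited / total >= MIN_COVERAGE


print(is_charted(4))  # False: 4/9 is about 44%, so gpt-5.5-pro is omitted
print(is_charted(8))  # True: 8/9 is about 89%, so the 8-case competitors stay
```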

Tokens & time per case

haiku-4.5               1.6M · 201s
laguna-m.1              858k · 465s
gpt-5.5                 766k · 191s
qwen3.6-27b             733k · 1278s
glm-5.1                 654k · 733s
deepseek-v4-pro         623k · 91s
owl-alpha               611k · 406s
gpt-5.5-pro             581k · 576s
opus-4.8                501k · 137s
kimi-k2.6               447k · 928s
mimo-v2.5-pro           397k · 475s
gemini-3.5-flash        381k · 181s
nemotron-3-super-120b   371k · 491s
gemini-3.1-pro-preview  334k · 242s
qwen3.7-max             332k · 447s
sonnet-4.6              324k · 207s
mistral-medium          253k · 63s
hy3-preview             167k · 386s

Mean total tokens (prompt + completion, with the ReAct loop's resent context counted each turn) per audited case; the trailing number is mean latency/case. Bars are linear, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported — some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
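A sketch of how the per-case token total and the under-reporting tell might be checked. The per-turn tuple layout, the function names, and both thresholds (floor and tolerance) are assumptions chosen for illustration; the page does not state the values it uses, if it applies an automated check at all.

```python
from typing import List, Tuple

# (prompt_tokens, completion_tokens) as reported by the API for one turn.
Turn = Tuple[int, int]


def total_tokens(turns: List[Turn]) -> int:
    """Sum prompt + completion over all turns.

    Each turn's prompt already includes the resent context, so that
    context is counted again every turn, as described above.
    """
    return sum(prompt + completion for prompt, completion in turns)


def looks_under_reported(turns: List[Turn],
                         floor: int = 5_000,
                         tolerance: float = 0.25) -> bool:
    """Flag a near-zero total whose input and output are roughly equal.

    floor and tolerance are hypothetical thresholds, not benchmark values.
    """
    prompt = sum(p for p, _ in turns)
    completion = sum(c for _, c in turns)
    total = prompt + completion
    if total == 0:
        return True
    balanced = abs(prompt - completion) <= tolerance * max(prompt, completion)
    return total < floor and balanced


healthy = [(12_000, 800), (25_000, 900), (40_000, 700)]  # context grows each turn
broken = [(300, 280), (310, 290)]                        # tiny, input close to output

print(total_tokens(healthy), looks_under_reported(healthy))  # 79400 False
print(total_tokens(broken), looks_under_reported(broken))    # 1180 True
```

In a real ReAct loop the prompt side should grow much faster than the completion side, which is why a small, balanced total is treated as suspicious here.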

Per-case results

Competitor CVE-2026-5199 CVE-2026-7474 GHSA-9f49-8x56-jmjc GHSA-cc7p-2j3x-x7xf GHSA-f26g-jm89-4g65 GHSA-j273-m5qq-6825 GHSA-mpxh-8fq3-x8mh GHSA-w52v-v783-gw97 GHSA-x9h5-r9v2-vcww
gpt-5.5-pro miss HIT HIT excl excl excl excl excl miss
mimo-v2.5-pro miss HIT miss miss HIT miss HIT HIT miss
gpt-5.5 miss HIT miss miss HIT HIT miss HIT miss
opus-4.8 miss miss miss miss HIT HIT miss HIT HIT
gemini-3.5-flash miss HIT miss miss HIT HIT miss HIT miss
deepseek-v4-pro miss HIT miss miss HIT HIT miss HIT miss
qwen3.7-max miss HIT miss miss HIT miss excl HIT miss
qwen3.6-27b miss miss miss miss HIT HIT excl HIT miss
gemini-3.1-pro-preview miss miss miss miss HIT HIT miss HIT miss
haiku-4.5 miss miss miss miss jerr HIT miss HIT miss
sonnet-4.6 miss miss miss miss miss HIT miss HIT miss
glm-5.1 miss miss miss miss miss HIT miss HIT miss
nemotron-3-super-120b miss miss miss miss miss HIT miss HIT miss
hy3-preview miss miss miss miss miss HIT refu miss miss
kimi-k2.6 miss miss miss miss miss miss miss HIT miss
owl-alpha miss miss miss miss miss miss miss HIT miss
laguna-m.1 miss miss miss miss miss miss miss miss miss
mistral-medium miss miss miss miss miss miss miss miss miss

HIT = detected; miss = looked, found nothing; jerr = judge undetermined (out of denominator); refu = model refused the task (out of denominator, never a miss); excl = auth/infra failure (never a miss); · = not run.
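The legend above maps directly onto the Hits/Elig and Cases columns of the leaderboard. The sketch below tallies one row of the per-case table; the function name is hypothetical, and the rule that "audited" means every case except excl and not-run ones is inferred from the table's numbers rather than stated on the page.

```python
from collections import Counter
from typing import Tuple


def summarize(row: str) -> Tuple[str, int, int, int]:
    """Return (competitor, hits, eligible, audited) for one per-case row."""
    name, *statuses = row.split()
    counts = Counter(statuses)
    hits = counts["HIT"]
    # Only HIT and miss enter the detection denominator.
    eligible = hits + counts["miss"]
    # Assumption: jerr and refu cases were still audited; excl and "·" were not.
    audited = len(statuses) - counts["excl"] - counts["·"]
    return name, hits, eligible, audited


print(summarize("haiku-4.5 miss miss miss miss jerr HIT miss HIT miss"))
# ('haiku-4.5', 2, 8, 9)
print(summarize("gpt-5.5-pro miss HIT HIT excl excl excl excl excl miss"))
# ('gpt-5.5-pro', 2, 4, 4)
```

Both results agree with the leaderboard rows: haiku-4.5 shows 2/8 over 9 cases, and gpt-5.5-pro shows 2/4 over 4 cases.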