18 competitors — 42 case hits across 155 audited cases
| # | Competitor | Size | Detect | Hits/Elig | Precision | FP/case | Other real | Cost/case | Latency | Tokens/case | Cases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5-pro† | large | 50%† | 2/4 | 100% | 0.00 | 2 | $22.82 | 576s | 581k | 4 |
| 2 | mimo-v2.5-pro ★ | large | 44% | 4/9 | 100% | 0.00 | 5 | $0.08 | 475s | 397k | 9 |
| 3 | gpt-5.5 ★ | large | 44% | 4/9 | 100% | 0.00 | 4 | $1.12 | 191s | 766k | 9 |
| 4 | opus-4.8 ★ | large | 44% | 4/9 | 91% | 0.11 | 6 | $0.73 | 137s | 501k | 9 |
| 5 | gemini-3.5-flash | medium | 44% | 4/9 | 78% | 0.22 | 3 | $0.68 | 181s | 381k | 9 |
| 6 | deepseek-v4-pro ★ | large | 44% | 4/9 | 75% | 0.22 | 2 | $0.10 | 91s | 623k | 9 |
| 7 | qwen3.7-max† | large | 38%† | 3/8 | 100% | 0.00 | 5 | $0.32 | 447s | 332k | 8 |
| 8 | qwen3.6-27b† ★ | small | 38%† | 3/8 | 67% | 0.38 | 3 | $0.00 | 1278s | 733k | 8 |
| 9 | gemini-3.1-pro-preview | large | 33% | 3/9 | 59% | 1.00 | 9 | $1.45 | 242s | 334k | 9 |
| 10 | haiku-4.5 | small | 25% | 2/8 | 53% | 0.78 | 6 | $0.35 | 201s | 1.6M | 9 |
| 11 | sonnet-4.6 | medium | 22% | 2/9 | 80% | 0.33 | 10 | $0.45 | 207s | 324k | 9 |
| 12 | glm-5.1 | large | 22% | 2/9 | 54% | 0.67 | 5 | $0.55 | 733s | 654k | 9 |
| 13 | nemotron-3-super-120b | large | 22% | 2/9 | 50% | 0.44 | 2 | $0.04 | 491s | 371k | 9 |
| 14 | hy3-preview | large | 12% | 1/8 | 69% | 0.44 | 8 | $0.02 | 386s | 167k | 9 |
| 15 | kimi-k2.6 | large | 11% | 1/9 | 80% | 0.11 | 3 | $0.35 | 928s | 447k | 9 |
| 16 | owl-alpha | large | 11% | 1/9 | 40% | 0.67 | 3 | $0.00 | 406s | 611k | 9 |
| 17 | laguna-m.1 | medium | 0% | 0/9 | 9% | 1.11 | 1 | $0.00 | 465s | 858k | 9 |
| 18 | mistral-medium ★ | medium | 0% | 0/9 | — | 0.00 | — | $0.00 | 63s | 253k | 9 |
Detect = case hits / eligible cases, where eligible = hits + genuine misses; undetermined, refused, and auth/infra-excluded cases are left out of the denominator. Precision = true findings / (true findings + false positives). Other real = confirmed real bugs the model found that are not the planted target vulnerability (CVE or GHSA advisory); they show extra capability but are not counted as detections. Cost/case is the competitor's own spend per audited case, and Latency is its mean time per audited case. ★ = on a Pareto frontier below.
† Partial coverage: this competitor completed fewer than the full 9 cases (see the Cases column). Its detection rate is therefore based on fewer audited cases and is not directly rank-comparable with full-corpus competitors — read it alongside the Cases count, not the rank.
Quality = detection rate × precision (precision is treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated: no other competitor is at least as good on quality while also being cheaper or faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
These charts omit gpt-5.5-pro (4/9 cases). A competitor that audited fewer than 75% of the 9 cases measures its quality over a smaller, self-selected subset, so its point is not comparable with the full-corpus competitors. In addition, a cost-capped probe sits so far out on the cost axis that every other competitor collapses into one indistinguishable cluster, making the trade-off unreadable. Its position would also imply a quality ranking that the partial run does not establish. It remains in the leaderboard table above (marked †).
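The non-dominated rule can be checked directly against the leaderboard rows. The sketch below is not the report's own plotting code: the `Point` type and `frontier` helper are hypothetical names, the inputs are the rounded table values, and ties are resolved with the usual Pareto rule (another point must be at least as good on both axes and strictly better on one), which is an assumption about how the charts treat equal values.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    name: str
    hits: int
    eligible: int
    precision: Optional[float]  # None = no scorable findings (shown as a dash)
    cost: float                 # dollars per audited case
    latency: float              # seconds per audited case


# Rounded values copied from the leaderboard; gpt-5.5-pro is left out, as in the charts.
ROWS = [
    Point("mimo-v2.5-pro", 4, 9, 1.00, 0.08, 475),
    Point("gpt-5.5", 4, 9, 1.00, 1.12, 191),
    Point("opus-4.8", 4, 9, 0.91, 0.73, 137),
    Point("gemini-3.5-flash", 4, 9, 0.78, 0.68, 181),
    Point("deepseek-v4-pro", 4, 9, 0.75, 0.10, 91),
    Point("qwen3.7-max", 3, 8, 1.00, 0.32, 447),
    Point("qwen3.6-27b", 3, 8, 0.67, 0.00, 1278),
    Point("gemini-3.1-pro-preview", 3, 9, 0.59, 1.45, 242),
    Point("haiku-4.5", 2, 8, 0.53, 0.35, 201),
    Point("sonnet-4.6", 2, 9, 0.80, 0.45, 207),
    Point("glm-5.1", 2, 9, 0.54, 0.55, 733),
    Point("nemotron-3-super-120b", 2, 9, 0.50, 0.04, 491),
    Point("hy3-preview", 1, 8, 0.69, 0.02, 386),
    Point("kimi-k2.6", 1, 9, 0.80, 0.35, 928),
    Point("owl-alpha", 1, 9, 0.40, 0.00, 406),
    Point("laguna-m.1", 0, 9, 0.09, 0.00, 465),
    Point("mistral-medium", 0, 9, None, 0.00, 63),
]


def quality(p: Point) -> float:
    """Detection rate times precision; precision defaults to 1.0 when absent."""
    precision = 1.0 if p.precision is None else p.precision
    return p.hits / p.eligible * precision


def frontier(points, axis):
    """Names of points that no other point dominates on (quality, axis)."""
    keep = set()
    for p in points:
        dominated = any(
            quality(o) >= quality(p)
            and getattr(o, axis) <= getattr(p, axis)
            and (quality(o) > quality(p) or getattr(o, axis) < getattr(p, axis))
            for o in points
        )
        if not dominated:
            keep.add(p.name)
    return keep


cost_frontier = frontier(ROWS, "cost")
latency_frontier = frontier(ROWS, "latency")
print(sorted(cost_frontier | latency_frontier))
```

Under these assumptions the union of the two frontiers is the six ★ rows in the table: mimo-v2.5-pro and qwen3.6-27b on cost, and gpt-5.5, opus-4.8, deepseek-v4-pro, and mistral-medium on latency.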
Mean total tokens (prompt + completion, with the ReAct loop's resent context counted each turn) per audited case; the trailing number is mean latency per case. Bars use a linear scale, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported. Some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
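The metering tell described above can be turned into a rough automated check. This is only a sketch: the function name, the reference total, and both thresholds are assumptions chosen for illustration, not values used by the report.

```python
def looks_under_reported(
    prompt_tokens: int,
    completion_tokens: int,
    reference_total: int,
    low_fraction: float = 0.05,      # assumed cut-off for "near-zero"
    balance_tolerance: float = 0.25,  # assumed cut-off for "input ≈ output"
) -> bool:
    """Flag usage that is tiny relative to a reference and has input ≈ output.

    reference_total is a typical per-case token total to compare against,
    for example the median across competitors (a hypothetical choice).
    """
    total = prompt_tokens + completion_tokens
    if total == 0:
        # No usage reported at all is treated as suspicious.
        return True
    near_zero = total < low_fraction * reference_total
    larger = max(prompt_tokens, completion_tokens)
    balanced = abs(prompt_tokens - completion_tokens) <= balance_tolerance * larger
    return near_zero and balanced


# Hypothetical numbers, not taken from the report.
print(looks_under_reported(1_200, 1_100, reference_total=447_000))
print(looks_under_reported(430_000, 17_000, reference_total=447_000))
```

The second condition matters because a ReAct loop resends its context every turn, so prompt tokens normally far exceed completion tokens; near-equal counts on a tiny total point at the metering rather than at the model.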
| Competitor | CVE-2026-5199 | CVE-2026-7474 | GHSA-9f49-8x56-jmjc | GHSA-cc7p-2j3x-x7xf | GHSA-f26g-jm89-4g65 | GHSA-j273-m5qq-6825 | GHSA-mpxh-8fq3-x8mh | GHSA-w52v-v783-gw97 | GHSA-x9h5-r9v2-vcww |
|---|---|---|---|---|---|---|---|---|---|
| gpt-5.5-pro | miss | HIT | HIT | excl | excl | excl | excl | excl | miss |
| mimo-v2.5-pro | miss | HIT | miss | miss | HIT | miss | HIT | HIT | miss |
| gpt-5.5 | miss | HIT | miss | miss | HIT | HIT | miss | HIT | miss |
| opus-4.8 | miss | miss | miss | miss | HIT | HIT | miss | HIT | HIT |
| gemini-3.5-flash | miss | HIT | miss | miss | HIT | HIT | miss | HIT | miss |
| deepseek-v4-pro | miss | HIT | miss | miss | HIT | HIT | miss | HIT | miss |
| qwen3.7-max | miss | HIT | miss | miss | HIT | miss | excl | HIT | miss |
| qwen3.6-27b | miss | miss | miss | miss | HIT | HIT | excl | HIT | miss |
| gemini-3.1-pro-preview | miss | miss | miss | miss | HIT | HIT | miss | HIT | miss |
| haiku-4.5 | miss | miss | miss | miss | jerr | HIT | miss | HIT | miss |
| sonnet-4.6 | miss | miss | miss | miss | miss | HIT | miss | HIT | miss |
| glm-5.1 | miss | miss | miss | miss | miss | HIT | miss | HIT | miss |
| nemotron-3-super-120b | miss | miss | miss | miss | miss | HIT | miss | HIT | miss |
| hy3-preview | miss | miss | miss | miss | miss | HIT | refu | miss | miss |
| kimi-k2.6 | miss | miss | miss | miss | miss | miss | miss | HIT | miss |
| owl-alpha | miss | miss | miss | miss | miss | miss | miss | HIT | miss |
| laguna-m.1 | miss | miss | miss | miss | miss | miss | miss | miss | miss |
| mistral-medium | miss | miss | miss | miss | miss | miss | miss | miss | miss |
HIT = detected; miss = looked, found nothing; jerr = judge undetermined (out of denominator); refu = model refused the task (out of denominator, never a miss); excl = auth/infra failure (never a miss); · = not run.
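As a worked example of how a matrix row maps to the Hits/Elig and Cases columns, the sketch below tallies the status codes from the legend. The `tally` helper is a hypothetical name, and the audited-case rule (every case except `excl` and `·`) is inferred from the table rather than stated by the report.

```python
from collections import Counter


def tally(row: str):
    """Return (hits, eligible, audited) for one whitespace-separated matrix row."""
    statuses = row.split()
    counts = Counter(statuses)
    hits = counts["HIT"]
    # Only hits and genuine misses are eligible; jerr, refu, and excl are not.
    eligible = hits + counts["miss"]
    # Inferred from the Cases column: excluded and not-run cases are not audited.
    audited = len(statuses) - counts["excl"] - counts["·"]
    return hits, eligible, audited


haiku = "miss miss miss miss jerr HIT miss HIT miss"
gpt_pro = "miss HIT HIT excl excl excl excl excl miss"

for name, row in [("haiku-4.5", haiku), ("gpt-5.5-pro", gpt_pro)]:
    hits, eligible, audited = tally(row)
    print(f"{name}: {hits}/{eligible} eligible, {audited} audited")
```

For haiku-4.5 this gives 2/8 with 9 audited cases, and for gpt-5.5-pro 2/4 with 4 audited cases, matching the leaderboard rows.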