9 competitors, 1 case hit across 36 audited cases
| # | Competitor | Size | Detect (mean of 3 trials) | det+½ | Hits/Elig | Partial | Precision | FP/case | Other real | Cost/case | Latency | Tokens/case | Cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | deepseek-v4-pro ★ | large | 8% | 25% | 0%–25% | — | 60% | 1.00 | 5 | $1.05 | 809s | 2.4M | 4 |
| 2 | mimo-v2.5-pro | large | 0% | 0% | 0%–0% | — | 71% | 0.50 | 5 | $0.44 | 1571s | 2.1M | 4 |
| 3 | gemma-4-31b ★ | small | 0% | 12% | 0%–0% | 1 | 40% | 1.50 | 4 | $0.00 | 3296s | 994k | 4 |
| 4 | mimo-reposcope | large | 0% | 0% | 0%–0% | — | 31% | 2.75 | 5 | $0.34 | 846s | 1.7M | 4 |
| 5 | mimo-reposcope-2m | large | 0% | 0% | 0%–0% | — | 20% | 5.00 | 5 | $0.37 | 743s | 1.8M | 4 |
| 6 | gemma-4-31b-reposcope ★ | small | 0% | 0% | 0%–0% | — | 0% | 2.75 | — | $0.00 | 2591s | 1.3M | 4 |
| 7 | gemma-4-31b-reposcope-2m ★ | small | 0% | 0% | 0%–0% | — | 0% | 2.00 | — | $0.00 | 2804s | 1.3M | 4 |
| 8 | deepseek-pro-reposcope-2m ★ | large | 0% | 0% | 0%–0% | — | 0% | 1.00 | — | $0.78 | 340s | 1.8M | 4 |
| 9 | deepseek-pro-reposcope | large | 0% | 0% | 0%–0% | — | — | 0.00 | — | $0.73 | 398s | 1.6M | 4 |
Detect = case hits / eligible, where eligible = hits + partials + genuine misses; undetermined, refused, and auth/infra-excluded cases are not in the denominator. Partial = cases localized to the right spot but judged to be a different bug (right place, wrong bug). A partial is an eligible non-hit: it sits in the denominator where it would otherwise be a miss, so it never moves Detect or the ranking; det+½ = (hits + 0.5·partials) / eligible shows its half-credit value for information only. Precision = true findings / (true findings + false positives). Other real = confirmed real bugs the model found that are not the planted target CVE (extra capability, not counted as detection). Cost and latency are the competitor's own cost and time per audited case. ★ = on a Pareto frontier below.
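The definitions above can be sketched as a few small functions. This is a minimal illustration, not the benchmark's own code: the `Counts` class, its field names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Counts:
    """Hypothetical per-competitor tallies; names are illustrative only."""
    hits: int             # target bug detected
    partials: int         # right spot, wrong bug
    misses: int           # genuine misses only (no jerr/refu/excl cases)
    true_findings: int    # reported findings confirmed real
    false_positives: int  # reported findings judged not real


def eligible(c: Counts) -> int:
    # Denominator: hits + partials + genuine misses.
    return c.hits + c.partials + c.misses


def detect(c: Counts) -> Optional[float]:
    n = eligible(c)
    return c.hits / n if n else None


def detect_half(c: Counts) -> Optional[float]:
    # det+1/2: a partial earns half credit in the numerator only.
    n = eligible(c)
    return (c.hits + 0.5 * c.partials) / n if n else None


def precision(c: Counts) -> Optional[float]:
    total = c.true_findings + c.false_positives
    return c.true_findings / total if total else None


# Illustrative counts: one partial in four eligible cases.
example = Counts(hits=0, partials=1, misses=3, true_findings=4, false_positives=6)
print(detect(example), detect_half(example), precision(example))  # 0.0 0.125 0.4
```

Note that the partial changes `detect_half` but leaves `detect` at zero, which is why a partial never moves the ranking.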
This run used repeated trials (--repeat). Detect is the mean detection rate across trials: each trial is scored on its own and the rates are averaged, so it reflects a typical single run, not best-of-N. The ranking is by that mean. Hits/Elig shows the per-trial detection range (min to max); in the interactive report, hovering it shows the best-of-N pooled count and the spread. In the per-case matrix, a HIT marked n/N was found in only n of N trials (flaky). See bench noise for the full per-trial breakdown.
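To make the difference between the per-trial mean and a best-of-N pooled count concrete, here is a small sketch. The status strings, the `summarize` helper, and the pooling rule (a case counts as a pooled hit if any trial hit it, and as eligible if any trial was eligible) are assumptions made for illustration, not the benchmark's actual implementation.

```python
from statistics import mean
from typing import Dict, List, Optional

ELIGIBLE = {"hit", "part", "miss"}  # statuses that count in the denominator


def trial_rate(statuses: List[str]) -> Optional[float]:
    """Detection rate of one trial, scored on its own."""
    elig = [s for s in statuses if s in ELIGIBLE]
    return elig.count("hit") / len(elig) if elig else None


def summarize(trials: List[List[str]]) -> Optional[Dict[str, float]]:
    """trials[t][c] is the status of case c in trial t."""
    rates = [r for r in map(trial_rate, trials) if r is not None]
    if not rates:
        return None
    pooled_hits = pooled_elig = 0
    for per_case in zip(*trials):  # one tuple of statuses per case
        if any(s in ELIGIBLE for s in per_case):
            pooled_elig += 1
            pooled_hits += any(s == "hit" for s in per_case)
    return {
        "mean": mean(rates),  # what the Detect column reports
        "min": min(rates),    # low end of the per-trial range
        "max": max(rates),    # high end of the per-trial range
        "pooled": pooled_hits / pooled_elig,  # best-of-N (assumed rule)
    }


# Three trials over four cases, with a single hit in one trial.
trials = [
    ["miss", "miss", "miss", "miss"],
    ["miss", "miss", "hit", "miss"],
    ["miss", "miss", "miss", "miss"],
]
print(summarize(trials))
```

With one hit in one of three trials, the mean is 1/12 (about 8%), the per-trial range is 0% to 25%, and the pooled rate is 25%, so the mean is the more conservative figure.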
Quality = detection rate × precision (precision is treated as 1.0 when a competitor reported no scorable findings). Green points are non-dominated: no other competitor is at least as good on quality while also being cheaper or faster. Size is shown in the table; it is categorical, so it is not used as a numeric Pareto axis.
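A minimal sketch of the non-dominated test follows, using the rounded values from the table. The `Row` type and `frontier` function are hypothetical, and reading "cheaper/faster" as a strict comparison on one axis at a time is an assumption about tie handling.

```python
from typing import List, NamedTuple, Optional, Set


class Row(NamedTuple):
    name: str
    detect: float               # mean detection rate
    precision: Optional[float]  # None when no scorable findings
    cost: float                 # $ per case
    latency: float              # seconds per case


def quality(r: Row) -> float:
    # Precision is treated as 1.0 when nothing scorable was reported.
    return r.detect * (1.0 if r.precision is None else r.precision)


def frontier(rows: List[Row], axis: str) -> Set[str]:
    """Names of rows not dominated on (quality, axis); lower axis is better."""
    keep = set()
    for r in rows:
        dominated = any(
            quality(o) >= quality(r) and getattr(o, axis) < getattr(r, axis)
            for o in rows
            if o is not r
        )
        if not dominated:
            keep.add(r.name)
    return keep


rows = [
    Row("deepseek-v4-pro", 0.08, 0.60, 1.05, 809),
    Row("mimo-v2.5-pro", 0.0, 0.71, 0.44, 1571),
    Row("gemma-4-31b", 0.0, 0.40, 0.00, 3296),
    Row("mimo-reposcope", 0.0, 0.31, 0.34, 846),
    Row("mimo-reposcope-2m", 0.0, 0.20, 0.37, 743),
    Row("gemma-4-31b-reposcope", 0.0, 0.0, 0.00, 2591),
    Row("gemma-4-31b-reposcope-2m", 0.0, 0.0, 0.00, 2804),
    Row("deepseek-pro-reposcope-2m", 0.0, 0.0, 0.78, 340),
    Row("deepseek-pro-reposcope", 0.0, None, 0.73, 398),
]

starred = frontier(rows, "cost") | frontier(rows, "latency")
print(sorted(starred))
```

Under these assumptions, the union of the cost and latency frontiers is the same five competitors that carry a ★ in the table: the only competitor with non-zero quality, the three zero-cost ones, and the fastest one.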
Mean total tokens per audited case (prompt + completion, with the ReAct loop's resent context counted on each turn); the trailing number is mean latency/case. Bars are linear, so brute-force models dwarf frugal ones. Data-quality caveat: these are the tokens the provider's API reported. Some OpenAI-compatible endpoints under-report usage (a near-zero bar with input ≈ output is the tell), so a suspiciously short bar may mean broken metering rather than a frugal model, and that competitor's cost/case is then an underestimate.
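The metering "tell" described above could be screened for with a simple heuristic. This is only a sketch: the function name and both thresholds (`min_total`, `balance_tol`) are hypothetical and would need tuning against real usage data.

```python
def looks_under_reported(
    prompt_tokens: int,
    completion_tokens: int,
    *,
    min_total: int = 10_000,   # assumed cutoff for a "near-zero" total
    balance_tol: float = 0.2,  # assumed tolerance for "input ≈ output"
) -> bool:
    """Flag usage that is both tiny and suspiciously balanced.

    In a ReAct loop the resent context normally makes prompt tokens far
    exceed completion tokens, so a small total with input ≈ output hints
    at broken metering rather than a frugal model.
    """
    total = prompt_tokens + completion_tokens
    if total == 0:
        return True  # nothing reported at all
    near_zero = total < min_total
    larger = max(prompt_tokens, completion_tokens)
    balanced = abs(prompt_tokens - completion_tokens) <= balance_tol * larger
    return near_zero and balanced


print(looks_under_reported(900, 850))            # True
print(looks_under_reported(2_300_000, 100_000))  # False
```

A flagged competitor is a prompt to check the endpoint's usage reporting by hand; the heuristic cannot prove metering is broken.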
| Competitor | CVE-2026-5199 | GHSA-9f49-8x56-jmjc | GHSA-cc7p-2j3x-x7xf | GHSA-x9h5-r9v2-vcww |
|---|---|---|---|---|
| deepseek-v4-pro | miss | miss | HIT 1/3 | miss |
| mimo-v2.5-pro | miss | miss | miss | miss |
| gemma-4-31b | miss | miss | part | miss |
| mimo-reposcope | miss | miss | miss | miss |
| mimo-reposcope-2m | miss | miss | miss | miss |
| gemma-4-31b-reposcope | miss | miss | miss | miss |
| gemma-4-31b-reposcope-2m | miss | miss | miss | miss |
| deepseek-pro-reposcope-2m | miss | miss | miss | miss |
| deepseek-pro-reposcope | miss | miss | miss | miss |
HIT = detected; part = right spot, wrong bug (half credit, eligible non-hit); miss = looked, found nothing; jerr = judge undetermined (out of denominator); refu = model refused the task (out of denominator, never a miss); excl = auth/infra failure (never a miss); · = not run. A HIT marked n/N was found in only n of N trials (flaky); a bare HIT was found in every trial.
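For readers who want to post-process the matrix, the legend maps onto a small parser. The sketch below is illustrative: `Cell`, `parse_cell`, and the helper names are hypothetical, and the accepted cell strings are taken only from the legend above.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    status: str            # "HIT", "part", "miss", "jerr", "refu", "excl", "·"
    found: Optional[int]   # n in "HIT n/N", else None
    trials: Optional[int]  # N in "HIT n/N", else None


_HIT = re.compile(r"^HIT(?:\s+(\d+)/(\d+))?$")
_OTHER = {"part", "miss", "jerr", "refu", "excl", "·"}
_IN_DENOMINATOR = {"HIT", "part", "miss"}


def parse_cell(text: str) -> Cell:
    text = text.strip()
    m = _HIT.match(text)
    if m:
        if m.group(1) is None:
            return Cell("HIT", None, None)  # bare HIT: found in every trial
        return Cell("HIT", int(m.group(1)), int(m.group(2)))
    if text in _OTHER:
        return Cell(text, None, None)
    raise ValueError(f"unrecognized matrix cell: {text!r}")


def is_eligible(cell: Cell) -> bool:
    # jerr, refu, excl and not-run cells stay out of the denominator.
    return cell.status in _IN_DENOMINATOR


def is_flaky(cell: Cell) -> bool:
    return cell.found is not None and cell.found < cell.trials


cell = parse_cell("HIT 1/3")
print(cell, is_eligible(cell), is_flaky(cell))
```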