# SupportAgentBench: LLM Customer Support Leaderboard

_162 grounded support conversations · 120 routine / 24 must-escalate / 18 adversarial · multi-turn (≤8 turns) · median of N=3 runs · English · updated July 2026._

There is no composite score: judge each metric on its own. Rows are ordered by over-escalation, lowest first. Over-escalation is the share of solvable tickets handed to a human anyway (the real handover rate). Ties are broken by fewest unsafe actions, then by highest escalation accuracy (the Escalation % column).

Unsafe actions counts the adversarial traps (of 18) in which a forbidden action fired. Differences of 1–2 are within noise at this sample size, so read that column as bands: 0–2 safe · 3–5 middling · 6+ reckless. Policy is scored from 0 to 1; higher is better.

Cost is a tier, not a price: $ budget · $$ mid · $$$ premium · $$$$ flagship. Tiers are per 1,000 conversations, agent-only, and computed from tokens actually consumed (including hidden reasoning tokens), not from list price per token.

🏆 = value pick · ⚠️ = weak adversarial safety, or heavy over-escalation where Notes says so. Full method at /eval/methodology.
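
The band cut-offs and subset sizes stated above can be made concrete with a short sketch. The names `SUBSETS`, `unsafe_band`, and `points_per_case` are illustrative and not part of the benchmark's tooling; only the 120/24/18 split and the band thresholds come from this page.

```python
# Illustrative helper names; only the cut-offs and subset sizes come from this page.
SUBSETS = {"routine": 120, "must_escalate": 24, "adversarial": 18}


def unsafe_band(unsafe_actions: int) -> str:
    """Map an unsafe-action count (0 to 18) to the band used to read that column."""
    if not 0 <= unsafe_actions <= SUBSETS["adversarial"]:
        raise ValueError("unsafe_actions must be between 0 and 18")
    if unsafe_actions <= 2:
        return "safe"
    if unsafe_actions <= 5:
        return "middling"
    return "reckless"


def points_per_case(subset: str) -> float:
    """Percentage points that a single conversation contributes within a subset."""
    return 100 / SUBSETS[subset]


if __name__ == "__main__":
    for count in (2, 3, 7):
        print(count, unsafe_band(count))
    for name in SUBSETS:
        print(name, round(points_per_case(name), 2))
```

A single adversarial trap moves that subset's rate by about 5.6 percentage points, and a single must-escalate ticket moves Escalation % by about 4.2, which is consistent with reading small gaps in those columns as noise rather than rank.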

| Model | Escalation % | Over-escalation % | Unsafe actions /18 | Resolution % | Policy (0–1) | Cost tier | Notes |
|---|---|---|---|---|---|---|---|
| [grok-4.3](/eval/models/grok-4-3) 🏆 | 96 | 2.5 | 3 | ≈95 | 0.78 | $$ | lowest over-esc; the autonomy pick |
| [deepseek-v4-flash](/eval/models/deepseek-v4-flash)  | 92 | 4 | 5 | ≈94 | 0.81 | $ | strong cheap resolver; weaker safety |
| [gpt-5.5](/eval/models/gpt-5-5)  | 100 | 5 | 3 | ≈95 | 0.87 | $$$$ | best quality, top price tier |
| [deepseek-v4-pro](/eval/models/deepseek-v4-pro)  | 79 | 5 | 4 | ≈95 | 0.76 | $ | beaten by deepseek-v4-flash at the same cost tier |
| [sonnet-5](/eval/models/sonnet-5)  | 88 | 6 | 4 | ≈95 | 0.79 | $$$ | best Claude; fixes sonnet-4.6's escalation |
| [minimax-m3](/eval/models/minimax-m3) ⚠️ | 83 | 7 | 7 | ≈90 | 0.69 | $ | board floor; weakest policy score + safety |
| [gemini-3.1-pro](/eval/models/gemini-3-1-pro)  | 100 | 8 | 3 | ≈93 | 0.87 | $$$ | best non-GPT policy score; priciest gemini |
| [gpt-5.4-mini](/eval/models/gpt-5-4-mini) 🏆 | 100 | 8 | 3 | ≈92 | 0.79 | $$ | GPT value pick |
| [gpt-5.2](/eval/models/gpt-5-2)  | 83 | 8 | 3 | ≈90 | 0.83 | $$$ | older; under-escalates |
| [gpt-5.4](/eval/models/gpt-5-4)  | 100 | 8 | 4 | ≈94 | 0.86 | $$$ | strong; beaten on value by mini |
| [gpt-5.4-nano](/eval/models/gpt-5-4-nano) ⚠️ | 100 | 8 | 7 | ≈95 | 0.81 | $ | cheapest GPT, weak safety |
| [glm-5.2](/eval/models/glm-5-2) ⚠️ | 75 | 8 | 7 | ≈95 | 0.79 | $$ | under-escalates + unsafe actions |
| [kimi-k2.7-code](/eval/models/kimi-k2-7-code)  | 88 | 9 | 4 | ≈95 | 0.75 | $$ | solid; behind k2.6 |
| [gemma-4-31b](/eval/models/gemma-4-31b) 🏆 | 100 | 10 | 2 | ≈82 | 0.82 | $ | best value (reasoning on) |
| [gemini-3-flash](/eval/models/gemini-3-flash)  | 88 | 10 | 3 | ≈95 | 0.78 | $ | cheap + lean output |
| [kimi-k2.6](/eval/models/kimi-k2-6)  | 96 | 10 | 4 | ≈95 | 0.81 | $$ | strong, balanced |
| [mimo-v2.5-pro](/eval/models/mimo-v2-5-pro)  | 83 | 11 | 4 | ≈84 | 0.77 | $ | doesn't beat base mimo |
| [gemini-3.5-flash](/eval/models/gemini-3-5-flash)  | 96 | 12 | 2 | ≈95 | 0.81 | $$$ | top-tier + safe |
| [mimo-v2.5](/eval/models/mimo-v2-5) 🏆 | 88 | 12 | 2 | ≈87 | 0.78 | $ | cheapest agent; beats its "pro" |
| [gemini-3.1-flash-lite](/eval/models/gemini-3-1-flash-lite) 🏆 | 92 | 12.5 | 2 | ≈95 | 0.77 | $$ | cheap + safest |
| [haiku-4.5](/eval/models/haiku-4-5)  | 79 | 13 | 6 | ≈97 | 0.74 | $$ | weak Claude |
| [qwen3.7-max](/eval/models/qwen3-7-max)  | 93 | 15 | 3 | ≈96 | 0.78 | $$$ | safe but over-cautious + pricey |
| [sonnet-4.6](/eval/models/sonnet-4-6)  | 79 | 15 | 3 | ≈96 | 0.81 | $$$ | safe but pricey; escalates wrongly both ways |
| [qwen3.7-plus](/eval/models/qwen3-7-plus) ⚠️ | 96 | 22 | 2 | ≈93 | 0.73 | $$ | over-escalator (22%) |

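The row order in the table follows the sort rule given in the introduction. Below is a minimal sketch of that rule, using four rows copied from the table. The `Row` dataclass and `sort_key` function are hypothetical names, and reading "then escalation accuracy" as "higher first" is an assumption that matches the published order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    escalation_pct: float       # Escalation % column
    over_escalation_pct: float  # Over-escalation % column
    unsafe_actions: int         # Unsafe actions /18 column


def sort_key(row: Row) -> tuple[float, int, float]:
    # Lowest over-escalation first, then fewest unsafe actions,
    # then highest escalation accuracy (negated so larger values sort earlier).
    return (row.over_escalation_pct, row.unsafe_actions, -row.escalation_pct)


# Four rows copied from the table, deliberately shuffled.
rows = [
    Row("gpt-5.4", 100, 8, 4),
    Row("gpt-5.2", 83, 8, 3),
    Row("grok-4.3", 96, 2.5, 3),
    Row("gemini-3.1-pro", 100, 8, 3),
]

ranked = sorted(rows, key=sort_key)

if __name__ == "__main__":
    for row in ranked:
        print(row.model)
```

Rows that tie on all three keys, such as gemini-3.1-pro and gpt-5.4-mini, keep their input order because Python's `sorted` is stable.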