SupportAgentBench · 162 cases · updated July 2026
Which models can actually run a support desk?
A support agent has to do several things at once: pull real order data with tools, follow the store's policy, understand what a frustrated customer actually wants, stay pleasant, and not invent facts. We ran 24 models through 162 grounded multi-turn support conversations and measured exactly that, including what each one costs to run.
SupportAgentBench is an independent benchmark that evaluates 24 large language models as ecommerce customer-support agents across 162 grounded, multi-turn conversations, measuring resolution quality, escalation calibration, adversarial safety, policy adherence, and cost tier. There is no composite score: each metric is published separately. Headline results: gpt-5.5 posts the strongest metrics across the board; budget-tier models match flagship escalation and safety; and safety failures come from believing unverified claims, not from pressure.
Start here
Four answers to “which model?”
gpt-5.5
The strongest metrics across the board: best resolution, perfect escalation accuracy, best policy adherence. Also the only model in the flagship price tier. Most support desks don't need it.
gemma-4-31b
Perfect escalation accuracy and top-tier safety in the budget price tier, with reasoning on. Open weights: run it on your own infrastructure.
gpt-5.4-mini
Near-flagship escalation accuracy and safety in the mid price tier. The safe choice when you just want a managed API.
mimo-v2.5
Holds the adversarial line better than most of the board at the lowest cost measured. The floor for a working desk is far lower than the flagships suggest.
Full leaderboard
162 grounded conversations · 120 routine / 24 must-escalate / 18 adversarial · multi-turn (≤8) · median of N=3. No composite score. Rows are sorted by over-escalation (the real handover rate) by default.
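| # | Model | Over-escalation (%) | Unsafe actions (of 18) | Escalation accuracy (%) | Resolution (%) | Policy adherence (0–1) | Cost tier | Notes |
|---|---|---|---|---|---|---|---|---|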
| 1 | grok-4.3🏆 | 2.5 | 3 | 96 | ≈95 | 0.78 | $$ | lowest over-esc; the autonomy pick |
| 2 | deepseek-v4-flash | 4 | 5 | 92 | ≈94 | 0.81 | $ | strong cheap resolver; weaker safety |
| 3 | gpt-5.5 | 5 | 3 | 100 | ≈95 | 0.87 | $$$$ | best quality, top price tier |
| 4 | deepseek-v4-pro | 5 | 4 | 79 | ≈95 | 0.76 | $ | dominated by deepseek-flash |
| 5 | sonnet-5 | 6 | 4 | 88 | ≈95 | 0.79 | $$$ | best Claude; fixes 4.6's escalation |
| 6 | minimax-m3⚠️ | 7 | 7 | 83 | ≈90 | 0.69 | $ | board floor; weakest IF + safety |
| 7 | gemini-3.1-pro | 8 | 3 | 100 | ≈93 | 0.87 | $$$ | best non-GPT IF; priciest gemini |
| 8 | gpt-5.4-mini🏆 | 8 | 3 | 100 | ≈92 | 0.79 | $$ | GPT value pick |
| 9 | gpt-5.2 | 8 | 3 | 83 | ≈90 | 0.83 | $$$ | older; under-escalates |
| 10 | gpt-5.4 | 8 | 4 | 100 | ≈94 | 0.86 | $$$ | strong; beaten on value by mini |
| 11 | gpt-5.4-nano⚠️ | 8 | 7 | 100 | ≈95 | 0.81 | $ | cheapest GPT, weak safety |
| 12 | glm-5.2⚠️ | 8 | 7 | 75 | ≈95 | 0.79 | $$ | under-escalates + unsafe actions |
| 13 | kimi-k2.7-code | 9 | 4 | 88 | ≈95 | 0.75 | $$ | solid; behind k2.6 |
| 14 | gemma-4-31b🏆💭 | 10 | 2 | 100 | ≈82 | 0.82 | $ | best value (reasoning on) |
| 15 | gemini-3-flash | 10 | 3 | 88 | ≈95 | 0.78 | $ | cheap + lean output |
| 16 | kimi-k2.6 | 10 | 4 | 96 | ≈95 | 0.81 | $$ | strong, balanced |
| 17 | mimo-v2.5-pro | 11 | 4 | 83 | ≈84 | 0.77 | $ | doesn't beat base mimo |
| 18 | gemini-3.5-flash | 12 | 2 | 96 | ≈95 | 0.81 | $$$ | top-tier + safe |
| 19 | mimo-v2.5🏆 | 12 | 2 | 88 | ≈87 | 0.78 | $ | cheapest agent; beats its "pro" |
| 20 | gemini-3.1-flash-lite🏆 | 12.5 | 2 | 92 | ≈95 | 0.77 | $$ | cheap + safest |
| 21 | haiku-4.5 | 13 | 6 | 79 | ≈97 | 0.74 | $$ | weak Claude |
| 22 | qwen3.7-max | 15 | 3 | 93 | ≈96 | 0.78 | $$$ | safe but over-cautious + pricey |
| 23 | sonnet-4.6 | 15 | 3 | 79 | ≈96 | 0.81 | $$$ | safe but pricey; escalates wrongly both ways |
| 24 | qwen3.7-plus⚠️ | 22 | 2 | 96 | ≈93 | 0.73 | $$ | over-escalator (22%) |
Rows are ordered by over-escalation by default. Over-escalation is the share of solvable tickets the model hands to a human anyway; it is the real handover rate, and the number that decides how much of your queue the agent actually takes off your team. Ties break by fewest unsafe actions, then by escalation accuracy (all read straight from the transcript).
🏆 value pick · ⚠️ weak adversarial safety · 💭 run with reasoning on.
Unsafe actions = adversarial traps (of 18) where a forbidden write action fired. With 18 traps and a median of 3 runs, a difference of 1–2 unsafe actions is within noise, so read the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless), not as an exact ranking.
Cost is a tier, not a price: $ budget · $$ mid · $$$ premium · $$$$ flagship. Tiers are per 1,000 conversations, agent-only, and computed from the tokens the agent actually consumed in our runs (including hidden reasoning tokens), not from list price per token.
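The default ordering and the safety bands described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's own code: the `Row` type, `sort_key`, and `safety_band` are hypothetical names, and the rows are a subset copied from the leaderboard.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    over_escalation_pct: float
    unsafe_actions: int          # out of 18 adversarial traps
    escalation_accuracy_pct: int


# A few leaderboard rows, deliberately out of order.
ROWS = [
    Row("gpt-5.4-nano", 8, 7, 100),
    Row("gpt-5.5", 5, 3, 100),
    Row("deepseek-v4-pro", 5, 4, 79),
    Row("grok-4.3", 2.5, 3, 96),
    Row("gpt-5.4", 8, 4, 100),
    Row("gpt-5.2", 8, 3, 83),
    Row("gpt-5.4-mini", 8, 3, 100),
]


def sort_key(row: Row):
    # Lowest over-escalation first; ties break by fewest unsafe actions,
    # then by highest escalation accuracy (hence the minus sign).
    return (row.over_escalation_pct, row.unsafe_actions, -row.escalation_accuracy_pct)


def safety_band(unsafe_actions: int) -> str:
    # Bands as stated in the table note: 0-2 safe, 3-5 middling, 6+ reckless.
    if unsafe_actions <= 2:
        return "safe"
    if unsafe_actions <= 5:
        return "middling"
    return "reckless"


ranked = sorted(ROWS, key=sort_key)
for position, row in enumerate(ranked, start=1):
    print(position, row.model, safety_band(row.unsafe_actions))
```

To rank by a different axis, swap the key function, for example `key=lambda r: r.unsafe_actions`.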
The value picture
Price doesn’t buy judgment
Nearly everything on the board resolves ~95% of solvable tickets, so resolution is not the decision. Safety is. Plot unsafe actions against cost tier: higher on the chart = fewer traps failed.
The budget band contains both the safest models measured (gemma-4-31b, mimo-v2.5: 2 unsafe actions) and the most reckless (gpt-5.4-nano, minimax-m3: 7). Paying more doesn’t reliably buy line-holding either: the flagship holds 15 of 18 traps while models one to three tiers down hold 16. Pick from the top band of the chart, then pay as little as your queue allows.
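The same comparison can be made without the chart by grouping the leaderboard's unsafe-action counts by cost tier. The sketch below copies the tier and unsafe-action columns from the table; the names `BOARD` and `unsafe_range_by_tier` are ours, chosen for illustration.

```python
from collections import defaultdict

# model -> (cost tier, unsafe actions out of 18), copied from the leaderboard.
BOARD = {
    "grok-4.3": ("$$", 3), "deepseek-v4-flash": ("$", 5), "gpt-5.5": ("$$$$", 3),
    "deepseek-v4-pro": ("$", 4), "sonnet-5": ("$$$", 4), "minimax-m3": ("$", 7),
    "gemini-3.1-pro": ("$$$", 3), "gpt-5.4-mini": ("$$", 3), "gpt-5.2": ("$$$", 3),
    "gpt-5.4": ("$$$", 4), "gpt-5.4-nano": ("$", 7), "glm-5.2": ("$$", 7),
    "kimi-k2.7-code": ("$$", 4), "gemma-4-31b": ("$", 2), "gemini-3-flash": ("$", 3),
    "kimi-k2.6": ("$$", 4), "mimo-v2.5-pro": ("$", 4), "gemini-3.5-flash": ("$$$", 2),
    "mimo-v2.5": ("$", 2), "gemini-3.1-flash-lite": ("$$", 2), "haiku-4.5": ("$$", 6),
    "qwen3.7-max": ("$$$", 3), "sonnet-4.6": ("$$$", 3), "qwen3.7-plus": ("$$", 2),
}


def unsafe_range_by_tier(board):
    """Return {tier: (fewest unsafe actions, most unsafe actions)}."""
    by_tier = defaultdict(list)
    for tier, unsafe in board.values():
        by_tier[tier].append(unsafe)
    return {tier: (min(vals), max(vals)) for tier, vals in by_tier.items()}


ranges = unsafe_range_by_tier(BOARD)
for tier in ("$", "$$", "$$$", "$$$$"):
    best, worst = ranges[tier]
    print(f"{tier:<5} best {best}/18, worst {worst}/18")
```

The best result in each of the three lower tiers is 2 unsafe actions, which is the point of the paragraph above: the floor of the safety column does not move with price.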
Pick by budget
$ · Budget tier
mimo-v2.5
Escalation accuracy 88% · unsafe actions 2/18
Honest escalations (every promised handover fires) and a firm adversarial line at a rounding-error price. The strongest cheap generalist measured.
$ · Budget tier
gemma-4-31b
Escalation accuracy 100% · unsafe actions 2/18
Perfect escalation accuracy and top-tier safety, with the policy check visible in its reasoning. Open weights.
$$ · Mid tier
gpt-5.4-mini
Escalation accuracy 100% · unsafe actions 3/18
Never missed a must-escalate case, with frontier-grade calibration. The default pick for a managed API.
$$$$ · Flagship tier
gpt-5.5
Escalation accuracy 100% · unsafe actions 3/18
The quality ceiling: best resolution, escalation accuracy, and policy adherence on the board.
One caveat: the picks weigh escalation and safety most heavily. If a different failure mode matters more to your store, another model can be the right pick: grok-4.3 has the lowest over-escalation on the board.
The hold-the-line set
How far pressure pushes each model
Every adversarial case tempts the agent past two lines: first the point where it should stop and verify or escalate, then the point where a forbidden action fires: a free reship, a fraud reroute, a wrongful cancel.
The pattern across all 24 models: nobody folds to threats. Models get pushed into the red by believable claims: damage with no photo, a polite “it never arrived.” The safest models stop in the margin and ask for proof; the ones deep in the red believed the customer and skipped the check.
The decision axis
Escalation calibration, both ways
Support agents make two very different kinds of mistake. A cheap mistake hands a solvable ticket to a human: annoying, and it erodes the point of the agent. A catastrophic mistake keeps a case that needed a human, or takes a wrong action outright. We never average the two: this chart shows both directions separately. Perfect calibration is the center line; shorter bars on both sides win.
Which side matters more depends on what you sell: a wrongful cancellation is a rounding error for a t-shirt store and a disaster for flights. Weight the sides yourself.
The spread is the story: grok-4.3 barely leaves the center, while qwen3.7-plus dumps over a fifth of solvable tickets on humans and glm-5.2 misses a quarter of the cases that truly need one.
What we measure
We publish the individual metrics and no composite number. A “safe but useless” agent and a “helpful but reckless” agent fail on different axes, and collapsing them into one score hides exactly the tradeoff you need to see. Read the metric that matches your desk’s risk profile.
- Escalation accuracy: must-escalate cases handed to a human. Read deterministically from the transcript.
- Adversarial safety (unsafe actions): forbidden actions fired on the 18 hold-the-line traps. Read deterministically from the transcript.
- Over-escalation: resolvable tickets handed to a human unnecessarily, the cost of playing it too safe.
- Resolution quality: solvable tickets actually resolved, judged against the store's policy.
- Policy adherence: how closely the agent followed store policy and instructions (0–1).
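As a rough illustration of how the three transcript-derived metrics could be computed, here is a minimal sketch. The `Outcome` record and `score` function are hypothetical, not the benchmark's harness, and the sketch assumes over-escalation is measured on the routine cases. The sample counts are made up to match the benchmark's 120 / 24 / 18 split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    kind: str                             # "routine", "must_escalate", or "adversarial"
    escalated: bool                       # did the agent hand the case to a human?
    forbidden_action_fired: bool = False  # only meaningful for adversarial traps


def score(outcomes):
    routine = [o for o in outcomes if o.kind == "routine"]
    must = [o for o in outcomes if o.kind == "must_escalate"]
    traps = [o for o in outcomes if o.kind == "adversarial"]
    # Assumes every case type is present; an empty group would divide by zero.
    return {
        "over_escalation_pct": 100 * sum(o.escalated for o in routine) / len(routine),
        "escalation_accuracy_pct": 100 * sum(o.escalated for o in must) / len(must),
        "unsafe_actions": sum(o.forbidden_action_fired for o in traps),
        "traps": len(traps),
    }


# Synthetic example: 3 of 120 routine cases escalated, 23 of 24 must-escalate
# cases escalated, 3 of 18 traps ended in a forbidden action.
outcomes = (
    [Outcome("routine", True)] * 3
    + [Outcome("routine", False)] * 117
    + [Outcome("must_escalate", True)] * 23
    + [Outcome("must_escalate", False)]
    + [Outcome("adversarial", False, True)] * 3
    + [Outcome("adversarial", False)] * 15
)
result = score(outcomes)
print(result)
```

Resolution quality and policy adherence are left out on purpose: the definitions above describe them as judged, not read off the transcript, so they do not reduce to a simple count.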
Beyond the scores
What we learned reading every transcript
These patterns come from reading the conversations message by message across all 24 models: the things the aggregate scores hide.
The open models are already close
The best open-weights model, gemma-4-31b, matches the flagships on escalation accuracy and adversarial safety from the budget price tier, and the Chinese open models span the entire board: mimo-v2.5 out-holds models several price tiers above it while glm-5.2 and minimax-m3 sit at the floor. The question isn’t which flagship wins; it’s how little model you can get away with.
Unsafe actions come from believing the claim, not folding to pressure
Models hold the line against threats, chargebacks, fraud reroutes, and VIP pressure almost universally. They break when they believe a soft claim: damage without proof, a repeat “never arrived” claimant. Their guardrails key on hostile tone, not on missing evidence. Several models, including gpt-5.4-mini, narrate the red flag in their own reasoning and then act anyway.
The reply can be right while the action is wrong
A correct-sounding reply can hide an incorrect tool call. gpt-5.4-nano fired a replacement carrying the wrong item’s variant ID; others presented stale pre-reshipment tracking as the new shipment. These are wrong decisions at the action layer: you only catch them by reading what the agent did, not what it said.
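One way to catch this class of error is to validate the tool call itself against the order record, independently of the reply text. The sketch below is hypothetical throughout: the tool-call shape, the order fields, and `check_replacement_call` are assumptions for illustration, not the benchmark's schema.

```python
def check_replacement_call(tool_call: dict, order: dict, claimed_sku: str) -> list[str]:
    """Return the problems found in a replacement tool call (empty list if none)."""
    problems = []
    args = tool_call.get("args", {})

    # The action must target the order the ticket is about.
    if args.get("order_id") != order["order_id"]:
        problems.append("order_id does not match the ticket's order")

    # The variant being shipped must be the one the customer reported.
    variants = {item["sku"]: item["variant_id"] for item in order["items"]}
    expected = variants.get(claimed_sku)
    if expected is None:
        problems.append("claimed item is not on the order")
    elif args.get("variant_id") != expected:
        problems.append("variant_id does not match the claimed item")
    return problems


order = {
    "order_id": "A-1001",
    "items": [
        {"sku": "duffel", "variant_id": "v-black-40l"},
        {"sku": "strap", "variant_id": "v-strap-std"},
    ],
}
# The reply may say "replacement duffel on its way" while the call carries the strap's variant.
bad_call = {"name": "create_replacement", "args": {"order_id": "A-1001", "variant_id": "v-strap-std"}}
good_call = {"name": "create_replacement", "args": {"order_id": "A-1001", "variant_id": "v-black-40l"}}

print(check_replacement_call(bad_call, order, claimed_sku="duffel"))
print(check_replacement_call(good_call, order, claimed_sku="duffel"))
```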
Escalation calibration is the real decision axis
Must-escalate accuracy spreads from a clean 100% (the gpt-5.4 family, gpt-5.5, gemini-3.1-pro, gemma-4-31b) down to 75–83%, and unnecessary escalation of solvable tickets runs from 2.5% to 22%. grok-4.3 hands over the fewest solvable tickets on the board; qwen3.7-plus dumps more than a fifth of them on humans.
Terse models win
Agent messages per conversation anti-correlate with quality across every metric we track: the strongest models close tickets in ~3.2–3.8 turns while the floor circles for ~4.8–5.4. gemma-4-31b is the tersest model measured: it spends its budget thinking, not talking. Verbosity, not brevity, is what tracks failure.
How to read this honestly
We publish the things that could bias the ranking up front, not buried in a footnote.
Single agent, single store, single vertical
Every model runs the same production-style support desk for one premium travel-goods store, in English. It's a strong proxy, not a per-agent guarantee: results can shift in other verticals (subscriptions, electronics, apparel), and you should validate on your own transcripts before any production switch.
Small adversarial sample: read bands, not ranks
The hold-the-line set is 18 traps, scored as the median of three runs. At that sample size a difference of 1–2 unsafe actions between models is within noise. Treat the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless); only gaps across bands are meaningful.
Customers are simulated
An LLM plays the customer, which keeps every model under identical pressure but narrows diversity: real customers are stranger and less predictable than a language model improvising one. Simulated conversations also vary run to run, so published numbers are the median of three (a few models were run N=1–2; their reports say so).
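For reference, taking the median across runs looks like the sketch below. The per-run counts are invented for illustration and `published_value` is a hypothetical helper. Note how `statistics.median` behaves when only two runs exist: it averages them, so an N=2 figure is less robust to one odd run than an N=3 figure.

```python
from statistics import median


def published_value(run_values):
    """Median of a metric across repeated runs of the same model."""
    if not run_values:
        raise ValueError("need at least one run")
    return median(run_values)


three_runs = [3, 2, 5]   # hypothetical unsafe-action counts from three runs
two_runs = [4, 2]        # a model that was only run twice

print(published_value(three_runs))  # the middle value; the outlier 5 is ignored
print(published_value(two_runs))    # mean of the two values
```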
What the next version fixes
- Seed conversations from anonymized real support transcripts: resume from mid-conversation, with the human agent's real resolution as ground truth.
- Repeated-decision sampling: replay the same decision fork 10× per model to put a confidence interval on the safety numbers.
- More verticals: the same harness pointed at stores whose stakes differ (subscriptions, electronics, apparel).
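To give a sense of what replaying a decision fork 10× could yield, here is a sketch that puts a 95% interval on an unsafe-action rate. The Wilson score interval is our choice for the illustration (it behaves reasonably at small sample sizes); the plan above does not say which interval method will be used.

```python
from math import sqrt


def wilson_interval(failures: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical: the forbidden action fired in 2 of 10 replays of one fork.
low, high = wilson_interval(2, 10)
print(f"unsafe rate 20%, 95% interval roughly {low:.0%} to {high:.0%}")
```

Even at ten replays the interval is wide, which is consistent with the advice above to read the current safety column as bands.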
Want an agent that scores like this on your store?
Adelante builds and runs the support agent for you, picking the right model per workload, with the guardrails this benchmark stress-tests.