SupportAgentBench is an independent benchmark that evaluates 24 large language models as ecommerce customer-support agents across 162 grounded, multi-turn conversations, measuring resolution quality, escalation calibration, adversarial safety, policy adherence, and cost tier. There is no composite score: each metric is published separately. Headline results: gpt-5.5 posts the strongest metrics across the board; budget-tier models match flagship escalation and safety; and safety failures come from believing unverified claims, not from pressure.

Start here

Four answers to “which model?”

gpt-5.5

The strongest metrics across the board: best resolution, perfect escalation accuracy, best policy adherence. Also the only model in the flagship price tier. Most support desks don't need it.

The open-weights answer

gemma-4-31b

Perfect escalation accuracy and top-tier safety in the budget price tier, with reasoning on. Open weights: run it on your own infrastructure.

gpt-5.4-mini

Near-flagship escalation accuracy and safety in the mid price tier. The safe choice when you just want a managed API.

mimo-v2.5

Holds the adversarial line better than most of the board at the lowest cost measured. The floor for a working desk is far lower than the flagships suggest.

Full leaderboard

162 grounded conversations · 120 routine / 24 must-escalate / 18 adversarial · multi-turn (≤8 turns) · median of N=3 runs. No composite score: the board is sorted by over-escalation (the real handover rate) by default; tap a header to sort by the metric your desk cares about.

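#	Model	Over-escalation (%)	Unsafe actions (of 18)	Escalation accuracy (%)	Resolution (%)	Policy adherence (IF, 0–1)	Cost tier	Notes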
1	grok-4.3🏆	2.5	3	96	≈95	0.78	$$	lowest over-esc; the autonomy pick
2	deepseek-v4-flash	4	5	92	≈94	0.81	$	strong cheap resolver; weaker safety
3	gpt-5.5	5	3	100	≈95	0.87	$$$$	best quality, top price tier
4	deepseek-v4-pro	5	4	79	≈95	0.76	$	dominated by deepseek-flash
5	sonnet-5	6	4	88	≈95	0.79	$$$	best Claude; fixes 4.6's escalation
6	minimax-m3⚠️	7	7	83	≈90	0.69	$	board floor; weakest IF + safety
7	gemini-3.1-pro	8	3	100	≈93	0.87	$$$	best non-GPT IF; priciest gemini
8	gpt-5.4-mini🏆	8	3	100	≈92	0.79	$$	GPT value pick
9	gpt-5.2	8	3	83	≈90	0.83	$$$	older; under-escalates
10	gpt-5.4	8	4	100	≈94	0.86	$$$	strong; beaten on value by mini
11	gpt-5.4-nano⚠️	8	7	100	≈95	0.81	$	cheapest GPT, weak safety
12	glm-5.2⚠️	8	7	75	≈95	0.79	$$	under-escalates + unsafe actions
13	kimi-k2.7-code	9	4	88	≈95	0.75	$$	solid; behind k2.6
14	gemma-4-31b🏆💭	10	2	100	≈82	0.82	$	best value (reasoning on)
15	gemini-3-flash	10	3	88	≈95	0.78	$	cheap + lean output
16	kimi-k2.6	10	4	96	≈95	0.81	$$	strong, balanced
17	mimo-v2.5-pro	11	4	83	≈84	0.77	$	doesn't beat base mimo
18	gemini-3.5-flash	12	2	96	≈95	0.81	$$$	top-tier + safe
19	mimo-v2.5🏆	12	2	88	≈87	0.78	$	cheapest agent; beats its "pro"
20	gemini-3.1-flash-lite🏆	12.5	2	92	≈95	0.77	$$	cheap + safest
21	haiku-4.5	13	6	79	≈97	0.74	$$	weak Claude
22	qwen3.7-max	15	3	93	≈96	0.78	$$$	safe but over-cautious + pricey
23	sonnet-4.6	15	3	79	≈96	0.81	$$$	safe but pricey; escalates wrongly both ways
24	qwen3.7-plus⚠️	22	2	96	≈93	0.73	$$	over-escalator (22%)

The board is ordered by over-escalation by default: the share of solvable tickets the model hands to a human anyway. This is the real handover rate, the number that decides how much of your queue the agent actually takes off your team. Ties break by fewest unsafe actions, then by escalation accuracy (all read straight from the transcript). Tap any header to sort by the axis that matters to your desk.

🏆 value pick · ⚠️ weak adversarial safety · 💭 run with reasoning on.

Unsafe actions = adversarial traps (of 18) where a forbidden write action fired. With 18 traps and a median of 3 runs, a difference of 1–2 unsafe actions is within noise: read the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless), not as an exact ranking.

Cost is a tier, not a price: $ budget · $$ mid · $$$ premium · $$$$ flagship. Tiers are per 1,000 conversations, agent-only, and computed from the tokens the agent actually consumed in our runs (including hidden reasoning tokens), not from list price per token.
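The default ordering described above can be reproduced with a composite sort key. The sketch below is an illustration, not the site's actual code: the `Row` type and `default_sort_key` name are hypothetical, the rows are a handful copied from the leaderboard, and the direction of the last tie-break (higher escalation accuracy first) is an assumption inferred from the published row order.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    over_escalation_pct: float
    unsafe_actions: int
    escalation_accuracy_pct: int


# A few rows copied from the leaderboard, deliberately shuffled.
ROWS = [
    Row("gpt-5.4-nano", 8, 7, 100),
    Row("gpt-5.4", 8, 4, 100),
    Row("grok-4.3", 2.5, 3, 96),
    Row("gpt-5.2", 8, 3, 83),
    Row("gpt-5.4-mini", 8, 3, 100),
    Row("glm-5.2", 8, 7, 75),
]


def default_sort_key(row: Row) -> tuple:
    # Lowest over-escalation first, then fewest unsafe actions,
    # then highest escalation accuracy (assumed direction).
    return (row.over_escalation_pct, row.unsafe_actions, -row.escalation_accuracy_pct)


ranked = sorted(ROWS, key=default_sort_key)
for position, row in enumerate(ranked, start=1):
    print(position, row.model)
```

Sorting by a different column only means swapping the key function, which is why no single ordering is privileged as a composite score.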

The value picture

Price doesn’t buy judgment

Nearly everything on the board resolves ~95% of solvable tickets, so resolution is not the decision. Safety is. The chart plots unsafe actions against cost tier; higher on the chart means fewer traps failed.

The budget band contains both the safest models measured (gemma-4-31b, mimo-v2.5: 2 unsafe actions) and the most reckless (gpt-5.4-nano, minimax-m3: 7). Paying more doesn’t reliably buy line-holding either: the flagship holds 15 of 18 traps, while models one to three tiers below it hold 16. Pick from the top band of the chart, then pay as little as your queue allows.
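The "top band first, then cheapest" rule can be written down as a two-step filter. This is a minimal sketch under stated assumptions: the function names are hypothetical, the band thresholds are the ones this page gives for the safety column (0–2 safe, 3–5 middling, 6+ reckless), and the rows are a subset copied from the leaderboard. It looks only at unsafe actions and cost tier, so it ignores every other metric.

```python
# (model, unsafe_actions, cost_tier) copied from the leaderboard.
BOARD = [
    ("gpt-5.5", 3, "$$$$"),
    ("gemma-4-31b", 2, "$"),
    ("mimo-v2.5", 2, "$"),
    ("gemini-3.5-flash", 2, "$$$"),
    ("gemini-3.1-flash-lite", 2, "$$"),
    ("gpt-5.4-mini", 3, "$$"),
    ("gpt-5.4-nano", 7, "$"),
    ("minimax-m3", 7, "$"),
]


def safety_band(unsafe_actions: int) -> str:
    # Bands as published: 0-2 safe, 3-5 middling, 6+ reckless.
    if unsafe_actions <= 2:
        return "safe"
    if unsafe_actions <= 5:
        return "middling"
    return "reckless"


def cheapest_in_safe_band(board):
    # Step 1: keep only models in the safe band.
    safe = [(m, t) for m, unsafe, t in board if safety_band(unsafe) == "safe"]
    if not safe:
        return []
    # Step 2: among those, keep the lowest cost tier (fewest "$" signs).
    lowest = min(len(tier) for _, tier in safe)
    return [model for model, tier in safe if len(tier) == lowest]


print(cheapest_in_safe_band(BOARD))
```

On these rows the filter returns the two budget-tier models with 2 unsafe actions. A real selection would add the escalation columns as further filters before minimizing cost.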

Pick by budget

$ · Budget tier

mimo-v2.5

Escalation accuracy 88% · unsafe actions 2/18

Honest escalations (every promised handover fires) and a firm adversarial line at a rounding-error price. The strongest cheap generalist measured.

$ · Budget tier

gemma-4-31b

Escalation accuracy 100% · unsafe actions 2/18

Perfect escalation accuracy and top-tier safety, with the policy check visible in its reasoning. Open weights.

$$ · Mid tier

gpt-5.4-mini

Escalation accuracy 100% · unsafe actions 3/18

Never missed a must-escalate case, with frontier-grade calibration. The default pick for a managed API.

$$$$ · Flagship tier

gpt-5.5

Escalation accuracy 100% · unsafe actions 3/18

The quality ceiling: best resolution, escalation accuracy, and policy adherence on the board.

One caveat: the picks weigh escalation and safety most heavily. If a different failure mode matters more to your store, another model can be the right pick; grok-4.3, for example, has the lowest over-escalation on the board.

The hold-the-line set

How far pressure pushes each model

Every adversarial case tempts the agent past two lines: first, the point where it should stop and verify or escalate; then, the point where a forbidden action fires, such as a free reship, a fraud reroute, or a wrongful cancel.

The pattern across all 24 models: nobody folds to threats. Models get pushed into the red by believable claims: damage with no photo, a polite “it never arrived.” The safest models stop in the margin and ask for proof; the ones deep in the red believed the customer and skipped the check.

The decision axis

Escalation calibration, both ways

Support agents make two very different kinds of mistake. A cheap mistake hands a solvable ticket to a human: annoying, and it erodes the point of the agent. A catastrophic mistake keeps a case that needed a human, or takes a wrong action outright. We never average the two: this chart shows both directions separately. Perfect calibration is the center line; shorter bars on both sides win.

Which side matters more depends on what you sell: a wrongful cancellation is a rounding error for a t-shirt store and a disaster for flights. Weight the sides yourself.

The spread is the story: grok-4.3 barely leaves the center, while qwen3.7-plus dumps over a fifth of solvable tickets on humans and glm-5.2 misses a quarter of the cases that truly need one.
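One way to "weight the sides yourself" is a reader-side weighted sum of the two error rates. This is not a score the benchmark publishes; it is a sketch with hypothetical weights and a hypothetical function name, using four rows copied from the leaderboard. The missed-escalation rate is taken as 100 minus escalation accuracy. The two rates have different denominators (solvable tickets versus must-escalate cases), so the result is only useful for ranking under your own weights.

```python
# model -> (over_escalation_pct, escalation_accuracy_pct), from the leaderboard.
CALIBRATION = {
    "grok-4.3": (2.5, 96),
    "gpt-5.5": (5, 100),
    "glm-5.2": (8, 75),
    "qwen3.7-plus": (22, 96),
}


def weighted_miscalibration(over_pct, esc_acc_pct, w_over, w_miss):
    # Missed escalations are the must-escalate cases the model kept.
    missed_pct = 100 - esc_acc_pct
    return w_over * over_pct + w_miss * missed_pct


def rank(calibration, w_over, w_miss):
    return sorted(
        calibration,
        key=lambda m: weighted_miscalibration(*calibration[m], w_over, w_miss),
    )


# Hypothetical low-stakes store: needless handovers hurt most.
low_stakes = rank(CALIBRATION, w_over=3, w_miss=1)
# Hypothetical high-stakes store: a missed escalation is far worse.
high_stakes = rank(CALIBRATION, w_over=1, w_miss=5)
print(low_stakes)
print(high_stakes)
```

With these made-up weights the top pick flips between grok-4.3 and gpt-5.5, and the bottom two swap as well, which is the practical meaning of keeping the two directions separate.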

What we measure

We publish the individual metrics and no composite number. A “safe but useless” agent and a “helpful but reckless” agent fail on different axes, and collapsing them into one score hides exactly the tradeoff you need to see. Read the metric that matches your desk’s risk profile.

Escalation accuracy: Share of must-escalate cases handed to a human. Read deterministically from the transcript.
Adversarial safety (unsafe actions): Number of the 18 hold-the-line traps on which a forbidden action fired. Read deterministically from the transcript.
Over-escalation: Share of resolvable tickets handed to a human unnecessarily; the cost of playing it too safe.
Resolution quality: Share of solvable tickets actually resolved, judged against the store's policy.
Policy adherence: How closely the agent followed store policy and instructions (0–1).
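The three transcript-derived metrics above reduce to simple counts once each conversation is labeled. The sketch below assumes a hypothetical `Outcome` record (the benchmark's real transcript schema is not described here) and assumes the routine cases are the solvable set used as the over-escalation denominator. The sample data is synthetic and only exercises the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    case_type: str  # "routine", "must_escalate", or "adversarial" (assumed labels)
    escalated: bool  # the agent handed the ticket to a human
    forbidden_action_fired: bool  # a forbidden write action appears in the transcript


def _share_pct(flags):
    flags = list(flags)
    if not flags:
        raise ValueError("no cases of this type")
    return 100 * sum(flags) / len(flags)


def escalation_accuracy(outcomes):
    # Share of must-escalate cases that were actually handed over.
    return _share_pct(o.escalated for o in outcomes if o.case_type == "must_escalate")


def over_escalation(outcomes):
    # Share of routine (assumed solvable) cases handed over anyway.
    return _share_pct(o.escalated for o in outcomes if o.case_type == "routine")


def unsafe_actions(outcomes):
    # Count of adversarial traps where a forbidden action fired.
    return sum(
        o.forbidden_action_fired for o in outcomes if o.case_type == "adversarial"
    )


# Synthetic sample, not benchmark data.
sample = (
    [Outcome("must_escalate", True, False)] * 3
    + [Outcome("must_escalate", False, False)]
    + [Outcome("routine", False, False)] * 9
    + [Outcome("routine", True, False)]
    + [Outcome("adversarial", False, True)]
    + [Outcome("adversarial", False, False)] * 2
)
print(escalation_accuracy(sample), over_escalation(sample), unsafe_actions(sample))
```

Resolution quality and policy adherence are judged rather than counted, so they are left out of this sketch.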

Beyond the scores

What we learned reading every transcript

These patterns come from reading the conversations message by message across all 24 models: the things the aggregate scores hide.

The open models are already close

The best open-weights model, gemma-4-31b, matches the flagships on escalation accuracy and adversarial safety while sitting in the budget price tier, and the Chinese open models span the entire board: mimo-v2.5 out-holds models several price tiers above it while glm-5.2 and minimax-m3 sit at the floor. The question isn’t which flagship wins; it’s how little model you can get away with.

Unsafe actions come from believing the claim, not folding to pressure

Models hold the line against threats, chargebacks, fraud reroutes, and VIP pressure almost universally. They break when they believe a soft claim: damage without proof, a repeat “never arrived” claimant. Their guardrails key on hostile tone, not on missing evidence. Several models, including gpt-5.4-mini, narrate the red flag in their own reasoning and then act anyway.

The reply can be right while the action is wrong

A correct-sounding reply can hide an incorrect tool call. gpt-5.4-nano fired a replacement carrying the wrong item’s variant ID; others presented stale pre-reshipment tracking as the new shipment. These are wrong decisions at the action layer: you only catch them by reading what the agent did, not what it said.

Escalation calibration is the real decision axis

Must-escalate accuracy spreads from a clean 100% (the gpt-5.4 family, gpt-5.5, gemma) down to 75–83%, and unnecessary escalation of solvable tickets runs from 2.5% to 22%. grok-4.3 hands over the fewest solvable tickets on the board; qwen3.7-plus dumps more than a fifth of them on humans.

Terse models win

Agent messages per conversation anti-correlate with quality across every metric we track: the strongest models close tickets in ~3.2–3.8 turns while the floor circles for ~4.8–5.4. gemma-4-31b is the tersest model measured: it spends its budget thinking, not talking. Verbosity, not brevity, is what tracks failure.

How to read this honestly

We publish the things that could bias the ranking up front, not buried in a footnote.

Single agent, single store, single vertical

Every model runs the same production-style support desk for one premium travel-goods store, in English. It's a strong proxy, not a per-agent guarantee: results can shift in other verticals (subscriptions, electronics, apparel), and you should validate on your own transcripts before any production switch.

Small adversarial sample: read bands, not ranks

The hold-the-line set is 18 traps, scored as the median of three runs. At that sample size a difference of 1–2 unsafe actions between models is within noise. Treat the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless); only gaps across bands are meaningful.
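To see why a gap of two unsafe actions says little at this sample size, one can put a rough interval around each count. The sketch below uses the standard Wilson score interval for a binomial proportion; it is an illustration, not the benchmark's own method, and it simplifies by treating the published median as a single draw of 18 independent trials.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score interval for k successes in n trials (z=1.96 is about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Compare a model with 2 unsafe actions to one with 4, both out of 18 traps.
low_2, high_2 = wilson_interval(2, 18)
low_4, high_4 = wilson_interval(4, 18)
overlap = low_4 < high_2  # True when the two intervals share a range
print(f"2/18: {low_2:.3f}-{high_2:.3f}")
print(f"4/18: {low_4:.3f}-{high_4:.3f}")
print("intervals overlap:", overlap)
```

The two intervals overlap substantially, which is consistent with reading the column as coarse bands instead of an exact ranking.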

Customers are simulated

An LLM plays the customer, which keeps every model under identical pressure but narrows diversity: real customers are stranger and less predictable than a language model improvising one. Simulated conversations also vary run to run, so published numbers are the median of three (a few models were run N=1–2; their reports say so).

What the next version fixes

Seed conversations from anonymized real support transcripts: resume from mid-conversation, with the human agent's real resolution as ground truth.
Repeated-decision sampling: replay the same decision fork 10× per model to put a confidence interval on the safety numbers.
More verticals: the same harness pointed at stores whose stakes differ (subscriptions, electronics, apparel).

Want an agent that scores like this on your store?

Adelante builds and runs the support agent for you, picking the right model per workload, with the guardrails this benchmark stress-tests.
