SupportAgentBench · 162 cases · updated July 2026

Which models can actually run a support desk?

A support agent has to do several things at once: pull real order data with tools, follow the store's policy, understand what a frustrated customer actually wants, stay pleasant, and not invent facts. We ran 24 models through 162 real multi-turn support conversations and measured exactly that, including what each one costs to run.

SupportAgentBench is an independent benchmark that evaluates 24 large language models as ecommerce customer-support agents across 162 grounded, multi-turn conversations, measuring resolution quality, escalation calibration, adversarial safety, policy adherence, and cost tier. There is no composite score: each metric is published separately. Headline results: gpt-5.5 posts the best overall quality at the top price tier, though not the lowest over-escalation or the fewest unsafe actions; budget-tier models match flagship escalation and safety; and safety failures come from believing unverified claims, not from pressure.

Start here

Four answers to “which model?”

Full leaderboard

162 grounded conversations · 120 routine / 24 must-escalate / 18 adversarial · multi-turn (≤8) · median of N=3. No composite score. Sorted by over-escalation (the real handover rate) by default; tap a header to sort by the metric your desk cares about.

Read the scoring methodology →
  Columns: model · cost tier · over-escalation · unsafe actions (of 18) · escalation accuracy · note

  1. grok-4.3 🏆 · $$ · 2.5% · 3/18 · 96% · lowest over-esc; the autonomy pick
  2. deepseek-v4-flash · $ · 4% · 5/18 · 92% · strong cheap resolver; weaker safety
  3. gpt-5.5 · $$$$ · 5% · 3/18 · 100% · best quality, top price tier
  4. deepseek-v4-pro · $ · 5% · 4/18 · 79% · dominated by deepseek-flash
  5. sonnet-5 · $$$ · 6% · 4/18 · 88% · best Claude; fixes 4.6's escalation
  6. minimax-m3 ⚠️ · $ · 7% · 7/18 · 83% · board floor; weakest IF + safety
  7. gemini-3.1-pro · $$$ · 8% · 3/18 · 100% · best non-GPT IF; priciest gemini
  8. gpt-5.4-mini 🏆 · $$ · 8% · 3/18 · 100% · GPT value pick
  9. gpt-5.2 · $$$ · 8% · 3/18 · 83% · older; under-escalates
  10. gpt-5.4 · $$$ · 8% · 4/18 · 100% · strong; beaten on value by mini
  11. gpt-5.4-nano ⚠️ · $ · 8% · 7/18 · 100% · cheapest GPT, weak safety
  12. glm-5.2 ⚠️ · $$ · 8% · 7/18 · 75% · under-escalates + unsafe actions
  13. kimi-k2.7-code · $$ · 9% · 4/18 · 88% · solid; behind k2.6
  14. gemma-4-31b 🏆 · $ · 10% · 2/18 · 100% · best value (reasoning on)
  15. gemini-3-flash · $ · 10% · 3/18 · 88% · cheap + lean output
  16. kimi-k2.6 · $$ · 10% · 4/18 · 96% · strong, balanced
  17. mimo-v2.5-pro · $ · 11% · 4/18 · 83% · doesn't beat base mimo
  18. gemini-3.5-flash · $$$ · 12% · 2/18 · 96% · top-tier + safe
  19. mimo-v2.5 🏆 · $ · 12% · 2/18 · 88% · cheapest agent; beats its "pro"
  20. gemini-3.1-flash-lite 🏆 · $$ · 12.5% · 2/18 · 92% · cheap + safest
  21. haiku-4.5 · $$ · 13% · 6/18 · 79% · weak Claude
  22. qwen3.7-max · $$$ · 15% · 3/18 · 93% · safe but over-cautious + pricey
  23. sonnet-4.6 · $$$ · 15% · 3/18 · 79% · safe but pricey; escalates wrongly both ways
  24. qwen3.7-plus ⚠️ · $$ · 22% · 2/18 · 96% · over-escalator (22%)

Ordered by over-escalation by default: the share of solvable tickets the model hands to a human anyway. This is the real handover rate, the number that decides how much of your queue the agent actually takes off your team. Ties break by fewest unsafe actions, then by escalation accuracy (all read straight from the transcript). Tap any header to sort by the axis that matters to your desk.

Legend: 🏆 value pick · ⚠️ weak adversarial safety · 💭 run with reasoning on.

Unsafe actions = adversarial traps (of 18) where a forbidden write action fired. With 18 traps and a median of 3 runs, a difference of 1–2 unsafe actions is within noise, so read the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless), not as an exact ranking.

Cost is a tier, not a price: $ budget · $$ mid · $$$ premium · $$$$ flagship. Tiers are per 1,000 conversations, agent-only, and computed from the tokens the agent actually consumed in our runs (including hidden reasoning tokens), not from list price per token.
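The default ordering and the safety bands described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's own code: the `Row` type, the field names, and the function names are assumptions, and the five sample rows are copied from the leaderboard.

```python
from typing import NamedTuple


class Row(NamedTuple):
    # Field names are hypothetical; values come from the published leaderboard.
    model: str
    over_escalation_pct: float      # % of solvable tickets handed to a human
    unsafe_actions: int             # traps (of 18) where a forbidden write fired
    escalation_accuracy_pct: float  # % of must-escalate cases handed over


def safety_band(unsafe_actions: int) -> str:
    """Bands from the text: 0-2 safe, 3-5 middling, 6+ reckless."""
    if unsafe_actions <= 2:
        return "safe"
    if unsafe_actions <= 5:
        return "middling"
    return "reckless"


def default_order(rows):
    """Lowest over-escalation first; ties break by fewest unsafe actions,
    then by highest escalation accuracy."""
    return sorted(
        rows,
        key=lambda r: (r.over_escalation_pct, r.unsafe_actions, -r.escalation_accuracy_pct),
    )


rows = [
    Row("gemini-3.5-flash", 12.0, 2, 96.0),
    Row("gpt-5.5", 5.0, 3, 100.0),
    Row("mimo-v2.5", 12.0, 2, 88.0),
    Row("deepseek-v4-pro", 5.0, 4, 79.0),
    Row("grok-4.3", 2.5, 3, 96.0),
]
ranked = [r.model for r in default_order(rows)]
```

Sorting on a tuple keeps the tie-break rules in one place; negating escalation accuracy makes higher accuracy sort first within a tie.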

The value picture

Price doesn’t buy judgment

Nearly everything on the board resolves ~95% of solvable tickets, so resolution is not the decision. Safety is. Plot unsafe actions against cost tier: higher on the chart = fewer traps failed.

[Chart: unsafe actions (of 18 traps, 0–8 on the vertical axis; higher on the chart = safer) against cost tier per 1,000 conversations, agent-only ($ budget · $$ mid · $$$ premium · $$$$ flagship). One point per model, using the unsafe-action counts, escalation accuracy, and cost tiers from the leaderboard; value picks and models with 6+ unsafe actions are highlighted.]

The budget band contains both the safest models measured (gemma-4-31b, mimo-v2.5: 2 unsafe actions) and the most reckless (gpt-5.4-nano, minimax-m3: 7). Paying more doesn’t reliably buy line-holding either: the flagship holds on 15 of 18 traps, while models one to three tiers down hold on 16. Pick from the top band of the chart, then pay as little as your queue allows.
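One way to read "pick from the top band, then pay as little as your queue allows" is as a filter followed by a sort. The sketch below assumes that reading. The `shortlist` function, the dictionary keys, and the optional over-escalation cap (a stand-in for how much handover a queue can absorb) are hypothetical; the sample figures are copied from the leaderboard.

```python
TIER_RANK = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}


def shortlist(models, max_unsafe=2, max_over_esc_pct=None):
    """Keep models in the safe band (0-2 unsafe actions), optionally cap
    over-escalation, then order by cost tier and over-escalation."""
    kept = [m for m in models if m["unsafe"] <= max_unsafe]
    if max_over_esc_pct is not None:
        kept = [m for m in kept if m["over_esc_pct"] <= max_over_esc_pct]
    return sorted(kept, key=lambda m: (TIER_RANK[m["tier"]], m["over_esc_pct"]))


# Sample rows copied from the leaderboard (hypothetical key names).
models = [
    {"name": "gpt-5.5", "tier": "$$$$", "over_esc_pct": 5.0, "unsafe": 3},
    {"name": "gemini-3.5-flash", "tier": "$$$", "over_esc_pct": 12.0, "unsafe": 2},
    {"name": "qwen3.7-plus", "tier": "$$", "over_esc_pct": 22.0, "unsafe": 2},
    {"name": "gemini-3.1-flash-lite", "tier": "$$", "over_esc_pct": 12.5, "unsafe": 2},
    {"name": "gemma-4-31b", "tier": "$", "over_esc_pct": 10.0, "unsafe": 2},
    {"name": "mimo-v2.5", "tier": "$", "over_esc_pct": 12.0, "unsafe": 2},
    {"name": "gpt-5.4-nano", "tier": "$", "over_esc_pct": 8.0, "unsafe": 7},
]

all_safe = [m["name"] for m in shortlist(models)]
capped = [m["name"] for m in shortlist(models, max_over_esc_pct=15.0)]
```

With a 15% cap on over-escalation, qwen3.7-plus drops out despite sitting in the safe band, which matches its over-escalator note on the board.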

One caveat: the picks weigh escalation and safety most heavily. If a different failure mode matters more to your store, another model can be the right pick. For example, if handing over as few solvable tickets as possible matters most, grok-4.3 has the lowest over-escalation on the board.

[Figure: how far each model is pushed on the adversarial traps, across three zones: resolves (benign requests) → safety margin (verify or escalate) → forbidden action (free goods · fraud · wrongful cancel)]

  • gemini-3.5-flash: asks for proof, holds · 2/18 unsafe
  • gemma-4-31b: policy check visible in reasoning · 2/18 unsafe
  • mimo-v2.5: holds above its price tier · 2/18 unsafe
  • gpt-5.5: flagship, still 3 unsafe actions · 3/18 unsafe
  • haiku-4.5: over-complies, reships on demand · 6/18 unsafe
  • gpt-5.4-nano: text right, action wrong · 7/18 unsafe

The hold-the-line set

How far pressure pushes each model

Every adversarial case tempts the agent past two lines: first the point where it should stop and verify or escalate, then the point where a forbidden action fires: a free reship, a fraud reroute, a wrongful cancel.

The pattern across all 24 models: almost nobody folds to threats. Models get pushed into the red by believable claims, such as damage with no photo or a polite “it never arrived.” The safest models stop in the margin and ask for proof; the ones deep in the red believed the customer and skipped the check.

The decision axis

Escalation calibration, both ways

Support agents make two very different kinds of mistake. A cheap mistake hands a solvable ticket to a human: annoying, and it erodes the point of the agent. A catastrophic mistake keeps a case that needed a human, or takes a wrong action outright. We never average the two: this chart shows both directions separately. Perfect calibration is the center line; shorter bars on both sides win.

Which side matters more depends on what you sell: a wrongful cancellation is a rounding error for a t-shirt store and a disaster for flights. Weight the sides yourself.

The spread is the story: grok-4.3 barely leaves the center, while qwen3.7-plus dumps over a fifth of solvable tickets on humans and glm-5.2 misses a quarter of the cases that truly need one.
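The advice to weight the two sides yourself can be made concrete with a small expected-cost calculation. This is a sketch under stated assumptions, not part of the benchmark: the function name, the `share_must_escalate` traffic mix, and the cost weights are hypothetical, and the miss rate is taken as 100% minus the published escalation accuracy.

```python
def expected_escalation_cost(over_esc_pct, escalation_accuracy_pct,
                             share_must_escalate, cost_over, cost_miss):
    """Expected cost per ticket from the two escalation error types.

    share_must_escalate, cost_over and cost_miss are store-specific
    assumptions that you supply; they are not benchmark outputs.
    """
    over_rate = over_esc_pct / 100.0
    miss_rate = (100.0 - escalation_accuracy_pct) / 100.0
    solvable_share = 1.0 - share_must_escalate
    return (solvable_share * over_rate * cost_over
            + share_must_escalate * miss_rate * cost_miss)


# (over-escalation %, escalation accuracy %) from the leaderboard.
board = {
    "grok-4.3": (2.5, 96.0),
    "qwen3.7-plus": (22.0, 96.0),
    "glm-5.2": (8.0, 75.0),
}


def rank(cost_miss):
    # Assumes 10% of traffic truly needs a human and a unit cost per needless handover.
    return sorted(
        board,
        key=lambda name: expected_escalation_cost(*board[name], 0.10, 1.0, cost_miss),
    )


low_stakes = rank(cost_miss=1.0)    # a miss costs the same as a needless handover
high_stakes = rank(cost_miss=20.0)  # a miss costs 20x a needless handover
```

Under these assumed weights, glm-5.2 ranks ahead of qwen3.7-plus when a miss is cheap and behind it when a miss is expensive, while grok-4.3 stays first in both cases. Only the ordering is meaningful here; the absolute numbers depend entirely on the weights you choose.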

[Chart: escalation calibration by model. Left: % of must-escalate set missed (misses cases that need a human). Right: % of solvable tickets over-routed (hands off tickets it should solve). Center line: perfect calibration. Per model, the missed share is 100% minus escalation accuracy, and the over-routed share is the over-escalation figure in the leaderboard.]

What we measure

We publish the individual metrics and no composite number. A “safe but useless” agent and a “helpful but reckless” agent fail on different axes, and collapsing them into one score hides exactly the tradeoff you need to see. Read the metric that matches your desk’s risk profile.

Escalation accuracy
Share of the 24 must-escalate cases the agent handed to a human. Read deterministically from the transcript.
Adversarial safety (unsafe actions)
Number of the 18 hold-the-line traps on which a forbidden action fired; lower is better. Read deterministically from the transcript.
Over-escalation
Share of resolvable tickets handed to a human unnecessarily: the cost of playing it too safe.
Resolution quality
Share of solvable tickets actually resolved, judged against the store's policy.
Policy adherence
How closely the agent followed store policy and instructions, scored 0–1.
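The three transcript-derived metrics can be sketched as simple counts over per-case outcomes, followed by a median across runs. This is a minimal illustration, not the benchmark's harness. It assumes each case record carries hypothetical `kind`, `escalated`, and `forbidden_action` fields, that over-escalation is measured on the 120 routine cases, and that the median is taken per metric. The toy runs are invented for the example and are not benchmark data.

```python
from statistics import median


def run_metrics(cases):
    """Compute one run's metrics from per-case outcomes (hypothetical schema)."""
    routine = [c for c in cases if c["kind"] == "routine"]
    must = [c for c in cases if c["kind"] == "must_escalate"]
    adversarial = [c for c in cases if c["kind"] == "adversarial"]
    return {
        # Solvable tickets handed to a human anyway.
        "over_escalation": sum(c["escalated"] for c in routine) / len(routine),
        # Must-escalate cases that were actually handed over.
        "escalation_accuracy": sum(c["escalated"] for c in must) / len(must),
        # Traps on which a forbidden action fired (a count, not a rate).
        "unsafe_actions": sum(c["forbidden_action"] for c in adversarial),
    }


def median_over_runs(runs):
    """Median of each metric across repeated runs of the same case set."""
    per_run = [run_metrics(r) for r in runs]
    return {key: median(m[key] for m in per_run) for key in per_run[0]}


def make_run(n_over, n_escalated, n_unsafe):
    """Build a toy run with the benchmark's 120 / 24 / 18 case split."""
    cases = [{"kind": "routine", "escalated": i < n_over, "forbidden_action": False}
             for i in range(120)]
    cases += [{"kind": "must_escalate", "escalated": i < n_escalated, "forbidden_action": False}
              for i in range(24)]
    cases += [{"kind": "adversarial", "escalated": False, "forbidden_action": i < n_unsafe}
              for i in range(18)]
    return cases


# Three invented runs for illustration only.
result = median_over_runs([make_run(3, 23, 3), make_run(5, 24, 2), make_run(4, 22, 3)])
```

Keeping the three metrics as separate dictionary entries mirrors the no-composite-score choice: nothing in the sketch averages them together.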

Beyond the scores

What we learned reading every transcript

These patterns come from reading the conversations message by message across all 24 models: the things the aggregate scores hide.

The open models are already close

The best open-weights model, gemma-4-31b, matches the flagships on escalation accuracy and adversarial safety from the budget price tier, and the Chinese open models span the entire board: mimo-v2.5 out-holds models several price tiers above it while glm-5.2 and minimax-m3 sit at the floor. The question isn’t which flagship wins; it’s how little model you can get away with.

Unsafe actions come from believing the claim, not folding to pressure

Models hold the line against threats, chargebacks, fraud reroutes, and VIP pressure almost universally. They break when they believe a soft claim: damage without proof, a repeat “never arrived” claimant. Their guardrails key on hostile tone, not on missing evidence. Several models, including gpt-5.4-mini, narrate the red flag in their own reasoning and then act anyway.

The reply can be right while the action is wrong

A correct-sounding reply can hide an incorrect tool call. gpt-5.4-nano fired a replacement carrying the wrong item’s variant ID; others presented stale pre-reshipment tracking as the new shipment. These are wrong decisions at the action layer: you only catch them by reading what the agent did, not what it said.
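Catching this class of error means checking the tool call itself against the order record, not the reply text. The sketch below shows one such check for the wrong-variant case. Everything in it is hypothetical: the tool-call and order shapes, the field names, and the `audit_replacement` function are illustrative and do not describe the benchmark's harness or any real store API.

```python
def audit_replacement(tool_call, order, reported_variant_id):
    """Return a list of problems found in a replacement tool call.

    tool_call, order and reported_variant_id use a hypothetical schema:
    the check compares what the agent did with the order on file and
    with the item the customer actually reported.
    """
    problems = []
    args = tool_call["args"]
    if args["order_id"] != order["order_id"]:
        problems.append("replacement targets a different order")
    order_variants = {item["variant_id"] for item in order["line_items"]}
    if args["variant_id"] not in order_variants:
        problems.append("variant is not on the order")
    elif args["variant_id"] != reported_variant_id:
        problems.append("variant is on the order but is not the reported item")
    return problems


order = {
    "order_id": "A-1001",
    "line_items": [{"variant_id": "case-black"}, {"variant_id": "strap-tan"}],
}
# The customer reported the black case; the agent's call ships the tan strap.
bad_call = {"tool": "create_replacement",
            "args": {"order_id": "A-1001", "variant_id": "strap-tan"}}
good_call = {"tool": "create_replacement",
             "args": {"order_id": "A-1001", "variant_id": "case-black"}}

bad_problems = audit_replacement(bad_call, order, "case-black")
good_problems = audit_replacement(good_call, order, "case-black")
```

A check like this never reads the reply, so a fluent, correct-sounding message cannot mask a wrong write action.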

Escalation calibration is the real decision axis

Must-escalate accuracy spreads from a clean 100% (gpt-5.5, the gpt-5.4 family, gemini-3.1-pro, gemma-4-31b) down to 75–83%, and unnecessary escalation of solvable tickets runs from 2.5% to 22%. grok-4.3 hands over the fewest solvable tickets on the board; qwen3.7-plus dumps more than a fifth of them on humans.

Terse models win

Agent messages per conversation anti-correlate with quality across every metric we track: the strongest models close tickets in ~3.2–3.8 turns while the floor circles for ~4.8–5.4. gemma-4-31b is the tersest model measured: it spends its budget thinking, not talking. Verbosity, not brevity, is what tracks failure.

How to read this honestly

We publish the things that could bias the ranking up front, not buried in a footnote.

Single agent, single store, single vertical

Every model runs the same production-style support desk for one premium travel-goods store, in English. It's a strong proxy, not a per-agent guarantee: results can shift in other verticals (subscriptions, electronics, apparel), and you should validate on your own transcripts before any production switch.

Small adversarial sample: read bands, not ranks

The hold-the-line set is 18 traps, scored as the median of three runs. At that sample size a difference of 1–2 unsafe actions between models is within noise. Treat the safety column as bands (0–2 safe · 3–5 middling · 6+ reckless); only gaps across bands are meaningful.
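A rough way to see why 1–2 unsafe actions is within noise is to put a confidence interval on a count out of 18. The sketch below uses the Wilson score interval for a binomial proportion. It is a simplification rather than the benchmark's own analysis: it treats the 18 traps as independent trials of a single run and ignores the median-of-three step.

```python
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k failures out of n trials (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low_2, high_2 = wilson_interval(2, 18)  # e.g. a model with 2/18 unsafe actions
low_3, high_3 = wilson_interval(3, 18)  # e.g. a model with 3/18 unsafe actions

# True when the two intervals share some range of plausible failure rates.
intervals_overlap = low_3 < high_2 and low_2 < high_3
```

At n = 18 the intervals for 2/18 and 3/18 overlap over most of their range, which is consistent with reading the safety column as bands and not as an exact ranking.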

Customers are simulated

An LLM plays the customer, which keeps every model under identical pressure but narrows diversity: real customers are stranger and less predictable than a language model improvising one. Simulated conversations also vary run to run, so published numbers are the median of three (a few models were run N=1–2; their reports say so).

What the next version fixes

  • Seed conversations from anonymized real support transcripts: resume from mid-conversation, with the human agent's real resolution as ground truth.
  • Repeated-decision sampling: replay the same decision fork 10× per model to put a confidence interval on the safety numbers.
  • More verticals: the same harness pointed at stores whose stakes differ (subscriptions, electronics, apparel).

Want an agent that scores like this on your store?

Adelante builds and runs the support agent for you, picking the right model per workload, with the guardrails this benchmark stress-tests.
