AI Agent Model Routing: Cut Your API Bill by 60% Without Losing Quality

Brain-and-muscle model routing: use expensive models for planning, cheap models for execution. Real cost breakdowns and the routing logic that makes it work.

By Lucas Powell·April 29, 2026·5 min read·1,150 words

You built something. It works. You ran it for a week. Then you opened the API dashboard.

There it is. The number. Bigger than you expected. Possibly bigger than your lunch budget for the month. You click around to see if there's a mistake. There is not.

This is the moment most builders hit when they move from "testing with 20 examples" to "running in production." The instinct when building is to use the best model for everything — it's the one you tested with, the one that produces the output you like, the one that made the demo look good. That instinct is expensive.

"Best model for everything" is like hiring a McKinsey partner to sort your mail. Technically, they could do it. They'd probably sort it quite thoughtfully. But you're paying $1,200 an hour for someone to make piles, and the mail doesn't care.

The fix is model routing. And it can cut your bill by 40-60% without touching output quality on the things that matter.


The brain and muscle framework

Every agent pipeline has two categories of work. Knowing which is which is the whole game.

Brain work is planning, reasoning, judgment, synthesising messy information into something coherent, writing things that need to sound like a human. This is where quality compounds — a sharper model makes genuinely better decisions. Brain work is typically a small fraction of your total token volume, but it determines the quality of your output. Use the frontier models here: Claude Opus 4.7, GPT-5, Gemini 2.5 Pro.

Muscle work is classification, data extraction, format conversion, pattern matching, routing decisions, simple transformations. It's repetitive, high-volume, and clearly defined. A model that costs 35x less will get it right just as often — the task doesn't require reasoning, it requires reliability. Use the value models: Claude Haiku 4.5, GPT-5.4 mini, Gemini 2.5 Flash, DeepSeek V4 Flash.

The routing principle is simple: send every task to the cheapest model that can do it well. Escalate to a better model only when quality actually matters.
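
In code, the simplest version of this is a static map from pipeline step to model. A minimal Python sketch; the step names are illustrative and the model identifiers are the ones this article uses, not a recommendation of exact API strings:

```python
# Static brain/muscle routing: each step declares whether it needs judgment
# (brain) or just reliable pattern work (muscle).
# Model identifiers follow the article; swap in whatever you actually run.

MODEL_FOR = {
    "brain": "claude-opus-4-7",    # planning, synthesis, customer-facing writing
    "muscle": "claude-haiku-4-5",  # extraction, classification, format conversion
}

# Illustrative step-to-category mapping for a lead research pipeline.
STEP_CATEGORY = {
    "pull_company_data": "muscle",
    "extract_key_facts": "muscle",
    "write_summary": "brain",
}

def model_for_step(step: str) -> str:
    """Cheapest model that can do the step well; reserve the frontier model for brain work."""
    return MODEL_FOR[STEP_CATEGORY[step]]

print(model_for_step("extract_key_facts"))  # -> claude-haiku-4-5
```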


What the numbers look like

Here's what this costs in practice. Current pricing per million tokens:

| Model | Input | Output |
|---|---|---|
| Claude Opus 4.7 | $5 | $25 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $1 | $5 |
| DeepSeek V4 Flash | $0.14 | $0.28 |

Take a lead research workflow processing 5,000 prospects a month. The pipeline pulls company data, extracts key facts, and writes a personalised summary for each lead.

  • All-Opus: Everything goes to the flagship model. ~$340/month.
  • Routed (Haiku for parsing, Opus for synthesis): Extraction is muscle work; only the final summary is brain work. ~$28/month.
  • Aggressive routing (DeepSeek for parsing, Sonnet for synthesis): Push even harder on the extraction step. ~$8/month.

Same quality on the output that gets read by a human. The $332 gap between the all-Opus and the aggressive setups sits entirely in work that a cheaper model handles identically. That's a roughly 40x cost reduction, and it's the kind of calculation the cost calculator makes easy to check before you build.

See the full model pricing breakdown if you want to run your own numbers.
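
Or run a rough version in code: the arithmetic is just tokens divided by a million, multiplied by the prices in the table above. A minimal Python sketch; the per-step token counts are placeholder assumptions, not measurements from any real pipeline:

```python
# Per-call cost from the pricing table above (prices in $ per million tokens).
# Token counts below are placeholder assumptions; profile your own pipeline
# and substitute real numbers before trusting the output.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "claude-opus-4-7":   (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00, 5.00),
    "deepseek-v4-flash": (0.14, 0.28),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Example: the same hypothetical extraction step (8k tokens in, 500 out),
# priced on each model and scaled to 5,000 prospects a month.
for model in PRICES:
    monthly = 5_000 * call_cost(model, 8_000, 500)
    print(f"{model:<18} ~${monthly:,.2f}/month for the extraction step")
```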


Five routing patterns worth knowing

1. Classify then escalate

A cheap model handles the first pass: is this a refund request, a technical question, or an out-of-scope message? Simple, fast, cheap. Only the edge cases — "I'm not sure what to do with this" — get escalated to the expensive model. Works well for support triage, content moderation, and anything with defined categories.
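
Here's a minimal sketch of the pattern, assuming the Anthropic Python SDK and the model identifiers this article uses; the category list and prompts are illustrative:

```python
# Classify-then-escalate: the cheap model takes the first pass; only the
# cases it can't place confidently get re-run on the expensive model.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_model(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

CATEGORIES = {"refund", "technical", "out_of_scope"}

def triage(message: str) -> str:
    prompt = (
        "Classify this support message as exactly one word: "
        "refund, technical, or out_of_scope. If you are unsure, answer unsure.\n\n"
        + message
    )
    label = call_model("claude-haiku-4-5", prompt).strip().lower()
    if label in CATEGORIES:
        return label  # cheap model was confident: done
    return call_model("claude-opus-4-7", prompt).strip().lower()  # escalate the edge case
```

Sending the identical prompt to both models keeps the escalated cases easy to audit side by side.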

2. Extract then synthesise

The cheap model pulls structured data from documents: dates, names, figures, key claims. The expensive model takes that clean structured input and writes the actual output. Extraction is pattern matching; synthesis is judgment. Route accordingly. This is the pattern behind the lead research example above.
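
A sketch of the hand-off, under the same assumptions (Anthropic SDK, the article's model identifiers); the extraction fields and prompts are illustrative:

```python
# Extract-then-synthesise: Haiku pulls structured facts from the raw source;
# Opus only ever sees the compact, cleaned-up version.
import json
import anthropic

client = anthropic.Anthropic()

def call_model(model: str, prompt: str, max_tokens: int = 1024) -> str:
    resp = client.messages.create(
        model=model, max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def research_summary(raw_company_data: str) -> str:
    # Muscle work: extraction on the cheap model.
    facts_json = call_model(
        "claude-haiku-4-5",
        "Extract company name, industry, headcount, funding, and any recent "
        "news as a JSON object. Respond with JSON only.\n\n" + raw_company_data,
    )
    facts = json.loads(facts_json)  # a production version would validate or repair the JSON
    # Brain work: synthesis on the expensive model, fed only the clean facts.
    return call_model(
        "claude-opus-4-7",
        "Write a short, personalised outreach summary for a sales rep based on "
        "these facts:\n" + json.dumps(facts, indent=2),
    )
```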

3. Draft then refine

The cheap model writes a complete first draft. The expensive model (or a human) refines it. You get the cost of a Haiku draft with Opus-level polish on the output. Works for email generation, report writing, and anywhere you have a clear template but want a sharp final pass.
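
Sketched under the same assumptions, the pattern is just two calls in sequence, with the expensive model working from the cheap draft rather than from scratch:

```python
# Draft-then-refine: Haiku writes the full first draft; Opus makes a final
# editing pass over it.
import anthropic

client = anthropic.Anthropic()

def call_model(model: str, prompt: str, max_tokens: int = 2048) -> str:
    resp = client.messages.create(
        model=model, max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def write_email(brief: str) -> str:
    draft = call_model(
        "claude-haiku-4-5",
        "Write a first-draft outreach email from this brief:\n" + brief,
    )
    return call_model(
        "claude-opus-4-7",
        "Tighten this draft: fix awkward phrasing, cut filler, and keep it "
        "under 150 words. Return only the revised email.\n\n" + draft,
    )
```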

4. Route by confidence

Some models can return a confidence score or self-assessment. If the cheap model flags low confidence — or produces output that fails a simple validation check — automatically re-run with a better model. This keeps the good cases cheap while catching the ones that need more firepower. Takes slightly more logic to implement, but it's worth it for production pipelines.
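
One way to sketch it: have the cheap model return its answer with a self-reported confidence score, validate the result, and re-run on the bigger model when either check fails. The JSON prompt, threshold, and fallback logic below are illustrative choices, not a built-in API feature:

```python
# Route-by-confidence: the cheap model answers and self-reports confidence;
# low confidence or a failed validation check triggers a re-run on Opus.
import json
import anthropic

client = anthropic.Anthropic()

def call_model(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model, max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

CONFIDENCE_THRESHOLD = 0.8  # tune against a sample of real traffic

def answer_with_fallback(question: str) -> str:
    prompt = (
        'Answer the question, then rate your confidence from 0 to 1. '
        'Respond as JSON only: {"answer": "...", "confidence": 0.0}\n\n' + question
    )
    try:
        parsed = json.loads(call_model("claude-haiku-4-5", prompt))
        if parsed["answer"] and float(parsed["confidence"]) >= CONFIDENCE_THRESHOLD:
            return parsed["answer"]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass  # malformed output counts as a failed validation check
    # Low confidence, empty answer, or broken JSON: spend the money.
    return call_model("claude-opus-4-7", question)
```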

5. Volume gate

Assign a complexity score to incoming tasks (based on length, ambiguity, or domain). Everything below the threshold hits the cheap model automatically. Everything above routes to the expensive model. Simple, auditable, and easy to tune: adjust the threshold as you see where tasks land over time.
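
A minimal version of the gate. The scoring heuristics and threshold below are illustrative; the point is that every routing decision is reproducible from an explicit score:

```python
# Volume gate: score each incoming task, send everything under the threshold
# to the cheap model and everything over it to the expensive one.

CHEAP, EXPENSIVE = "claude-haiku-4-5", "claude-opus-4-7"
GATE_THRESHOLD = 5  # tune by sampling routed tasks and checking output quality

def complexity_score(task_text: str) -> int:
    score = 0
    score += len(task_text) // 2_000           # long inputs are usually harder
    score += task_text.count("?")              # many questions suggest ambiguity
    if any(w in task_text.lower() for w in ("legal", "contract", "medical")):
        score += 5                             # high-stakes domains go upmarket
    return score

def route(task_text: str) -> str:
    return EXPENSIVE if complexity_score(task_text) >= GATE_THRESHOLD else CHEAP

# Every decision is reproducible from the score, which makes the gate easy
# to audit and to re-tune as the cheap models improve.
```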

For implementation: n8n handles conditional branching cleanly if you're building visual workflows. If you're writing direct API calls, it's a single parameter change — model: "claude-haiku-4-5" instead of model: "claude-opus-4-7". Lindy has multi-model support built into its workflow logic if you want a no-code path.


Where routing goes wrong

This is the part most guides skip. Model routing fails when you route the wrong step to the cheap model and the quality drop is visible to the end user.

The synthesis step in research workflows. The customer-facing final output. Anything where a real human is about to read and judge the result. These are not places to cheap out. If the model is making a judgment call that determines what a user sees, hears, or acts on — that's brain work, not muscle work, even if it feels routine.

The tell: if you'd be embarrassed to show a specific output to a customer, that step needed a better model. Audit your pipeline output before you declare routing a success, not after a user complains.

We covered the broader ROI picture in where AI agents deliver the most value — the same principle applies. The savings are real, but they come from routing correctly, not just routing cheaply.


Start here

If you've got a pipeline running and a bill you'd rather not repeat, the quickest win is to identify the highest-volume step and ask: is this brain work or muscle work? If it's pattern matching, extraction, or classification, it's almost certainly muscle work, and you can switch it to Haiku or Flash today.

Profile your token usage by step. Most builders are surprised to find that 70-80% of their tokens are sitting on muscle work running at frontier prices. Move those to value models. Keep the brain work on the good stuff. Check the bill next month.
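
Profiling doesn't need special tooling. If you're calling the Anthropic API directly, each response reports its own token usage, so a counter per step is enough. A minimal sketch; the step names are whatever your pipeline already uses:

```python
# Per-step token profiling: tally input/output tokens for each named step so
# you can see how much volume is sitting on muscle work at frontier prices.
from collections import defaultdict
import anthropic

client = anthropic.Anthropic()
usage_by_step = defaultdict(lambda: {"input": 0, "output": 0})

def tracked_call(step: str, model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    usage_by_step[step]["input"] += resp.usage.input_tokens
    usage_by_step[step]["output"] += resp.usage.output_tokens
    return resp.content[0].text

# ... run the pipeline through tracked_call(...), then:
def report():
    total = sum(u["input"] + u["output"] for u in usage_by_step.values()) or 1
    for step, u in sorted(usage_by_step.items()):
        share = 100 * (u["input"] + u["output"]) / total
        print(f"{step:<20} {u['input']:>10} in  {u['output']:>8} out  {share:5.1f}%")
```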

That's model routing. It's not complicated. It's just the thing most builders skip when they're building fast, then have to fix when the invoice arrives.

About the author

Lucas Powell

Founder, Growth 8020

Founder of Growth 8020. Started Agent Shortlist as the publication he wished existed when his team had to pick AI tools.