The ARR framework: which tasks should you actually give to an AI agent?
A short mental model for deciding which tasks belong with an AI agent and which don't. Three letters. Autonomous, Recurring, Reviewable. Skip the rest.
Most teams that struggle with AI agents aren't struggling with the agent. They're struggling because they picked the wrong task to give it.
A simple three-letter mental model fixes most of this: a task is a good fit for an AI agent if it's Autonomous, Recurring, and Reviewable. That's it. If a task fails any of the three checks, hand it back to a human and build something else.
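The rule above is a plain logical AND over three yes/no answers. Here is a minimal sketch of that check in Python. The `Task` class and both function names are hypothetical, made up for illustration, and the hard part (answering each question honestly) still happens outside the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A candidate task with one yes/no answer per ARR letter (hypothetical type)."""
    name: str
    autonomous: bool   # finishable from a written brief, no mid-task human call
    recurring: bool    # happens often enough to amortize the setup cost
    reviewable: bool   # output can be checked faster than doing the task by hand


def failed_checks(task: Task) -> list[str]:
    """Return the names of the ARR checks the task fails, in A-R-R order."""
    checks = [
        ("Autonomous", task.autonomous),
        ("Recurring", task.recurring),
        ("Reviewable", task.reviewable),
    ]
    return [label for label, passed in checks if not passed]


def is_agent_fit(task: Task) -> bool:
    """A task is a fit only if it fails none of the three checks."""
    return not failed_checks(task)


support = Task("Support reply drafting", True, True, True)
strategy = Task("Company strategy doc", False, False, False)
print(is_agent_fit(support), failed_checks(strategy))
```

Returning the list of failed letters, rather than a bare true/false, is a deliberate choice: the letter that fails tells you how to reshape the task.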
Here's why each letter matters and what the failure modes look like when one is missing.
A — Autonomous
The agent has to be able to finish the task without depending on real-time human judgment partway through. If the work needs someone to make a call at step three before step four can happen, the agent is mostly an expensive way to schedule meetings with yourself.
The clearest signal a task is autonomous: a competent contractor could complete it with a written brief and no follow-up questions. If the brief itself takes longer to write than just doing the work, the task isn't a fit.
Failure mode when A is missing: you build the agent, but it sits idle 80% of the time waiting for human decisions. The "autonomous" workflow ends up coordinated by a flood of Slack pings to you. Net productivity is negative.
Examples that pass A:
- Drafting customer-support replies for known ticket categories
- Summarizing yesterday's competitor activity from a predefined list of sources
- Routing inbound leads to the right SDR based on rules
Examples that fail A:
- Strategy decisions on which markets to enter
- Hiring shortlist creation that requires reading body language in interviews
- Editorial judgment on whether a story is "us" or not
R — Recurring
The agent needs to do the task often enough to amortize the setup cost. The setup cost is real and usually larger than people estimate: defining the task, writing the prompt, testing edge cases, building evaluation criteria, iterating until quality holds, monitoring for drift. Call it 10–40 hours for a non-trivial agent.
If the task only happens once or twice a year, you'd spend less time just doing it manually. The threshold isn't a hard rule, but as a starting point: if the task doesn't happen at least weekly, the math probably doesn't work.
Failure mode when R is missing: you spend two days building an agent for a quarterly task. Three months later when the task comes around again, the API has changed, the model is on a new version, your prompt no longer works, and you spend another day fixing it. Net cost is now higher than just doing it manually.
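The amortization argument can be made concrete with a rough break-even calculation. This is a sketch under simple assumptions: the function names are hypothetical, the time saved per run is a number you estimate yourself, and ongoing maintenance (the "another day fixing it" cost above) is ignored, which makes the result optimistic.

```python
import math


def break_even_runs(setup_hours: float, minutes_saved_per_run: float) -> int:
    """Number of runs needed before the time saved covers the setup time."""
    if minutes_saved_per_run <= 0:
        raise ValueError("the agent must save some time per run to ever break even")
    return math.ceil(setup_hours * 60 / minutes_saved_per_run)


def weeks_to_break_even(
    setup_hours: float, minutes_saved_per_run: float, runs_per_week: float
) -> float:
    """Calendar weeks until break-even at a given cadence."""
    if runs_per_week <= 0:
        raise ValueError("runs_per_week must be positive")
    return break_even_runs(setup_hours, minutes_saved_per_run) / runs_per_week


# Illustrative inputs only: a two-day (16 h) build for a quarterly task
# that saves two hours per run, versus a 20 h build for a per-ticket task
# that saves three minutes per run at 500 runs a week.
quarterly = weeks_to_break_even(16, 120, runs_per_week=4 / 52)
per_ticket = weeks_to_break_even(20, 3, runs_per_week=500)
print(round(quarterly), round(per_ticket, 1))
```

With these made-up inputs the quarterly task takes about two years to pay back its setup, while the per-ticket task pays back in under a week. The exact numbers matter less than the gap between them.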
Examples that pass R:
- Daily standup summaries (daily ✓)
- Weekly support volume reports (weekly ✓)
- Per-ticket reply drafting (10–1,000+ times per day ✓)
Examples that fail R:
- Annual strategic planning docs
- Onboarding flow design for a new market entry
- One-off competitive analysis for a board meeting
R — Reviewable
You need to be able to tell, in less time than doing the task manually, whether the agent's output is correct. If checking the answer takes longer than producing it, the agent has added work, not removed it.
This is the subtlest of the three. It's also the one that gets violated most often by builders excited about new capabilities.
Failure mode when Reviewable is missing: you build an agent that "does the work," but verifying each output takes 80% of the original time. The remaining 20% of "saved" time gets eaten by context-switching costs (you have to look up what the original task even was to evaluate the output). Net time saved: zero or negative. Plus you've added a new failure mode where the agent's confident-but-wrong output goes uncaught.
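The arithmetic behind that failure mode is simple enough to write down. A minimal sketch, assuming you can estimate three numbers per run: how long the task takes by hand, how long the review takes, and how long it takes to get back into context. The function names are hypothetical.

```python
def net_minutes_saved(
    manual_minutes: float,
    review_minutes: float,
    context_switch_minutes: float = 0.0,
) -> float:
    """Time saved per run once review and context switching are paid for."""
    return manual_minutes - (review_minutes + context_switch_minutes)


def passes_review_check(manual_minutes: float, review_minutes: float) -> bool:
    """The Reviewable test: checking must take less time than doing."""
    return review_minutes < manual_minutes


# The failure mode above with a 30-minute task: review takes 80% (24 min),
# and context switching eats the remaining 20% (6 min).
print(net_minutes_saved(30, 24, 6))

# A summary you can spot-check against the source in 30 seconds.
print(net_minutes_saved(30, 0.5))
```

Note that the first case technically passes `passes_review_check` (24 is less than 30) and still saves nothing. Reviewing faster than doing is the minimum bar; the net figure is what tells you whether the agent is worth running.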
Examples that pass the review check:
- Summary outputs you can spot-check by reading the source for 30 seconds
- Code changes you can see in a diff and test with one command
- Data extraction where the answer is verifiable against a known structure
Examples that fail the review check:
- Long-form articles where checking accuracy means reading the article
- Strategic recommendations where you'd need to do the analysis yourself to verify
- Novel synthesis where the agent could be confidently wrong in subtle ways
Why all three matter
Each letter independently rules out a real category of mistake. Together they leave you with a much smaller surface of "good agent tasks" than the demos suggest.
Walking through a few real-world examples:
Customer support reply drafting at scale. Autonomous (clear input and output), Recurring (hundreds of times a day), Reviewable (a human can scan a draft in seconds). Three out of three. Good agent task.
Writing the company strategy doc. Not autonomous (depends on judgment calls), not recurring (annual), not reviewable (the whole point is to produce something only the leader can validate). Zero out of three. Terrible agent task.
Coding a new feature. Autonomous if the spec is tight, Recurring if the codebase has lots of similar features to ship, Reviewable if tests exist that catch regressions. Often all three pass — which is why coding agents have taken off. Sometimes only one or two pass, which is why some teams find their coding agent useless.
Personalized cold outreach. Autonomous (the prospect data and the desired outreach are well-defined), Recurring (most teams send dozens per day), Reviewable if a human reads each draft before send, but only barely. The review check is the wobbly leg here. If you're scaling past 200 outreach emails a day with no human review, you've broken Reviewable.
Research summaries. Usually autonomous and recurring, but reviewability depends on whether you have the source material on hand to spot-check. If the agent reads 100 sources and gives you a one-page summary, you can't verify it without reading the 100 sources yourself. Fails Reviewable unless the agent shows its work.
The corollary: how to fix tasks that fail ARR
A task that fails one letter isn't lost. You can often re-shape it until all three pass:
- Fails Autonomous? Narrow the scope. Instead of "handle the customer," try "handle tier-1 categories with a known resolution path." Shrink the surface until the agent doesn't need a human decision mid-flight.
- Fails Recurring? Don't build the agent. Or pair it with similar tasks that share infrastructure. A "weekly executive briefing agent" only justifies its setup if it can also do the daily briefing and the monthly briefing: same agent, same prompts, different cadence.
- Fails Reviewable? Force the output into a reviewable shape. Make the agent show its sources. Make it provide structured outputs alongside the prose. Make it flag confidence levels. The goal isn't to make the work disappear. It's to make the checking of the work fast enough that the agent still saves time net.
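One way to picture a "reviewable shape" is an output where every claim carries its sources and a confidence flag, so a reviewer can go straight to the weak spots. The sketch below is an assumption about what such a structure could look like, not a standard format: the `Claim` type, the `claims_to_review` helper, the 0.7 threshold, and the example claims are all hypothetical, and a model's self-reported confidence is only a rough signal.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One statement from the agent's output (hypothetical structure)."""
    text: str
    sources: list[str] = field(default_factory=list)  # where the claim came from
    confidence: float = 0.0  # agent's self-reported confidence, 0.0 to 1.0


def claims_to_review(claims: list[Claim], min_confidence: float = 0.7) -> list[Claim]:
    """Return the claims a human should check first.

    A claim is flagged when it cites no source or when its self-reported
    confidence falls below the threshold.
    """
    return [c for c in claims if not c.sources or c.confidence < min_confidence]


summary = [
    Claim("Competitor A changed its pricing page", ["pricing page"], 0.9),
    Claim("Competitor B is planning a launch", [], 0.8),
    Claim("Competitor C paused hiring", ["careers page"], 0.4),
]
for claim in claims_to_review(summary):
    print(claim.text)
```

The reviewer reads two claims instead of three here. On a summary built from 100 sources, the same filter is the difference between spot-checking and redoing the work.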
What this doesn't tell you
ARR is necessary but not sufficient. A task can pass all three checks and still fail in production because:
- The model isn't capable enough at this specific work yet
- The integrations don't exist for the systems the agent needs to touch
- The economics don't work — model cost exceeds value delivered
- Your team can't operationalize the agent without dedicated engineering time
ARR filters out most bad agent ideas quickly. The remaining ideas need a different test: do you have the budget, the tooling, and the team to actually build and maintain this? That's a separate conversation, covered in "Where AI agents actually deliver ROI" and "How much does it cost to build an AI agent."
When to use this framework
Three moments where pulling out ARR pays off immediately:
- When someone brings you "an AI agent idea." Run it through the three letters out loud. If one fails, you've saved a week of building.
- When you're picking which agent to build first. Use ARR to rank candidate workflows. The one that passes all three most cleanly is the one to build first.
- When you're auditing existing agents that aren't paying off. If a deployed agent isn't delivering value, the failure is almost always one of the three letters. Identifying which one tells you whether to fix it or kill it.
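For the second moment, ranking candidate workflows, one possible approach is to score each letter on a small scale and sort by the weakest letter first, since a task is only as good as its wobbliest leg. This is a sketch, not part of the framework itself: the 0 to 5 scale, the tie-break on total score, and the example scores are all assumptions for illustration.

```python
def rank_candidates(candidates: dict[str, tuple[int, int, int]]) -> list[str]:
    """Rank task names from strongest to weakest ARR fit.

    Each value is an (autonomous, recurring, reviewable) score on a 0-5 scale.
    Tasks are ordered by their weakest score, then by their total score.
    """
    return sorted(
        candidates,
        key=lambda name: (min(candidates[name]), sum(candidates[name])),
        reverse=True,
    )


# Illustrative scores only; you would assign these yourself.
candidates = {
    "Research summaries": (4, 4, 1),
    "Support reply drafting": (5, 5, 5),
    "Company strategy doc": (0, 0, 0),
    "Personalized cold outreach": (4, 5, 2),
}
for name in rank_candidates(candidates):
    print(name)
```

Sorting on the minimum rather than the sum keeps a task with two strong letters and one failing letter from outranking a task that is merely solid on all three.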
The full picker recommends platforms for ARR-passing tasks; the calculator tells you what they'll cost. ARR is the upstream filter that decides whether to use either at all.
About the author

Lucas Powell
Founder of Growth 8020. Started Agent Shortlist as the publication he wished existed when his team had to pick AI tools.