
AI Agent Task Selection: The ARR Framework (Autonomous, Recurring, Reviewable)

AI agent task selection made simple. The ARR framework — Autonomous, Recurring, Reviewable — decides which tasks belong with an AI agent and which don't.

By Lucas Powell · May 17, 2026

Most teams that struggle with AI agents aren't struggling with the agent. They're struggling because they picked the wrong task to give it.

A simple three-letter mental model fixes most of this: a task is a good fit for an AI agent if it's Autonomous, Recurring, and Reviewable. That's it. If a task fails any of the three checks, hand it back to a human and build something else, unless you can reshape the task until it passes.

Here's why each letter matters and what the failure modes look like when one is missing.

A — Autonomous

The agent has to be able to finish the task without depending on real-time human judgment partway through. If the work needs someone to make a call at step three before step four can happen, the agent is mostly an expensive way to schedule meetings with yourself.

The clearest signal a task is autonomous: a competent contractor could complete it with a written brief and no follow-up questions. If the brief itself takes longer to write than just doing the work, the task isn't a fit.

Failure mode when A is missing: you build the agent, but it sits idle 80% of the time waiting for human decisions. The "autonomous" workflow ends up coordinated by a flood of Slack pings to you. Net productivity is negative.

Examples that pass A:

Drafting customer-support replies for known ticket categories
Summarizing yesterday's competitor activity from a predefined list of sources
Routing inbound leads to the right SDR based on rules

Examples that fail A:

Strategy decisions on which markets to enter
Hiring shortlist creation that requires reading body language in interviews
Editorial judgment on whether a story is "us" or not

R — Recurring

The agent needs to do the task often enough to amortize the setup cost. The setup cost is real and usually larger than people estimate: defining the task, writing the prompt, testing edge cases, building evaluation criteria, iterating until quality holds, monitoring for drift. Call it 10–40 hours for a non-trivial agent.

If the task only happens once or twice a year, you'd spend less time just doing it manually. The threshold isn't a hard rule, but as a starting point: if the task doesn't happen at least weekly, the math probably doesn't work.
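The amortization argument can be made concrete with a rough break-even calculation. The sketch below is illustrative only: the 20 setup hours sit inside the 10–40 hour range above, but the 30 minutes saved per run and the 250 working days per year are assumed numbers, and the function names are hypothetical.

```python
def break_even_runs(setup_hours: float, minutes_saved_per_run: float) -> float:
    """Runs needed before cumulative time saved equals the setup time."""
    if minutes_saved_per_run <= 0:
        raise ValueError("the agent must save time on every run to ever break even")
    return setup_hours * 60 / minutes_saved_per_run


def years_to_break_even(setup_hours: float, minutes_saved_per_run: float,
                        runs_per_year: float) -> float:
    """Calendar time to recoup setup, ignoring maintenance between runs."""
    return break_even_runs(setup_hours, minutes_saved_per_run) / runs_per_year


# Assumed inputs: 20 setup hours, 30 minutes saved per run.
for label, runs_per_year in [("quarterly", 4), ("weekly", 52), ("daily", 250)]:
    years = years_to_break_even(20, 30, runs_per_year)
    print(f"{label:>9}: {years:.2f} years to break even")
```

Under these assumptions the agent needs 40 runs to pay for itself: about ten years for a quarterly task, roughly nine months for a weekly one, and about two months for a daily one. The calculation leaves out upkeep between runs, so infrequent tasks look better here than they would in practice.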

Failure mode when Recurring is missing: you spend two days building an agent for a quarterly task. Three months later, when the task comes around again, the API has changed, the model is on a new version, your prompt no longer works, and you spend another day fixing it. Net cost is now higher than just doing it manually.

Examples that pass R:

Daily standup summaries (daily ✓)
Weekly support volume reports (weekly ✓)
Per-ticket reply drafting (10–1,000+ times per day ✓)

Examples that fail R:

Annual strategic planning docs
Onboarding flow design for a new market entry
One-off competitive analysis for a board meeting

R — Reviewable

You need to be able to tell, in less time than doing the task manually, whether the agent's output is correct. If checking the answer takes longer than producing it, the agent has added work, not removed it.

This is the subtlest of the three. It's also the one that gets violated most often by builders excited about new capabilities.

Failure mode when Reviewable is missing: you build an agent that "does the work," but verifying each output takes 80% of the original time. The remaining 20% of "saved" time gets eaten by context-switching costs, because you have to look up what the original task even was before you can evaluate the output. Net time saved: zero or negative. Plus you've added a new failure mode where the agent's confident-but-wrong output goes uncaught.
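The arithmetic behind that failure mode fits in a few lines. This is a sketch under stated assumptions: the 60-minute manual task and the 5% spot-check cost are made-up inputs, and `net_minutes_saved` is a hypothetical helper; only the 80% and 20% figures come from the paragraph above.

```python
def net_minutes_saved(manual_minutes: float, review_pct: float,
                      overhead_pct: float = 0.0) -> float:
    """Time saved per task after paying for review and context switching.

    review_pct and overhead_pct are percentages of the manual time.
    A result of zero or less means the agent added work.
    """
    return manual_minutes * (100 - review_pct - overhead_pct) / 100


# The failure mode described above: 80% review plus 20% context switching.
heavy_review = net_minutes_saved(60, review_pct=80, overhead_pct=20)

# Assumed contrast case: output that can be spot-checked in 5% of the manual time.
spot_check = net_minutes_saved(60, review_pct=5)

print(heavy_review, spot_check)
```

With these inputs the heavy-review case saves 0 minutes per task and the spot-check case saves 57. The same agent, doing the same work, swings from worthless to valuable purely on how cheap the check is.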

Examples that pass the review check:

Summary outputs you can spot-check by reading the source for 30 seconds
Code changes you can see in a diff and test with one command
Data extraction where the answer is verifiable against a known structure

Examples that fail the review check:

Long-form articles where checking accuracy means reading the article
Strategic recommendations where you'd need to do the analysis yourself to verify
Novel synthesis where the agent could be confidently wrong in subtle ways

Why all three matter

Each letter independently rules out a real category of mistake. Together they leave you with a much smaller surface of "good agent tasks" than the demos suggest.

Walking through a few real-world examples:

Customer support reply drafting at scale. Autonomous (clear input and output), Recurring (hundreds of times a day), Reviewable (a human can scan a draft in seconds). Three out of three. Good agent task.

Writing the company strategy doc. Not autonomous (depends on judgment calls), not recurring (annual), not reviewable (the whole point is to produce something only the leader can validate). Zero out of three. Terrible agent task.

Coding a new feature. Autonomous if the spec is tight, Recurring if the codebase has lots of similar features to ship, Reviewable if tests exist that catch regressions. Often all three pass, which is why coding agents have taken off. Sometimes only one or two pass, which is why some teams find their coding agent useless.

Personalized cold outreach. Autonomous (the prospect data and the desired outreach are well-defined), Recurring (most teams send dozens per day), Reviewable if a human reads each draft before send, but only barely. The review check is the wobbly leg here. If you're scaling past 200 outreach emails a day with no human review, you've broken Reviewable.

Research summaries. Usually autonomous and recurring, but reviewability depends on whether you have the source material on hand to spot-check. If the agent reads 100 sources and gives you a one-page summary, you can't verify it without reading the 100 sources yourself. Fails Reviewable unless the agent shows its work.

The corollary: how to fix tasks that fail ARR

A task that fails one letter isn't lost. You can often re-shape it until all three pass:

Fails Autonomous? Narrow the scope. Instead of "handle the customer," try "handle tier-1 categories with a known resolution path." Shrink the surface until the agent doesn't need a human decision mid-flight.
Fails Recurring? Don't build the agent. Or pair it with similar tasks that share infrastructure. A "weekly executive briefing agent" only justifies its setup if it can also do the daily briefing and the monthly briefing — same agent, same prompts, different cadence.
Fails Reviewable? Force the output into a reviewable shape. Make the agent show its sources. Make it provide structured outputs alongside the prose. Make it flag confidence levels. The goal isn't to make the work disappear — it's to make the checking of the work fast enough that the agent still saves time net.
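One way to picture a "reviewable shape" is an output type that carries its own sources and confidence next to the prose. The sketch below is hypothetical throughout: the `Claim` and `ReviewableOutput` names, the field layout, and the 0.7 threshold are all assumptions, and the agent's self-reported confidence is treated only as a hint for where a reviewer should look first.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    source: str        # where a reviewer can verify the claim; empty if none given
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0 (assumed field)


@dataclass
class ReviewableOutput:
    summary: str                                # the prose a human actually reads
    claims: list[Claim] = field(default_factory=list)

    def needs_attention(self, threshold: float = 0.7) -> list[Claim]:
        """Claims a reviewer should check first: unsourced or low-confidence."""
        return [c for c in self.claims
                if not c.source or c.confidence < threshold]


# Hypothetical competitor summary with three claims.
output = ReviewableOutput(
    summary="Three competitor updates this week.",
    claims=[
        Claim("Competitor X cut prices", "https://example.com/pricing", 0.9),
        Claim("Competitor Y is hiring", "", 0.9),
        Claim("Competitor Z plans a launch", "https://example.com/blog", 0.4),
    ],
)
for claim in output.needs_attention():
    print("check:", claim.text)
```

Here the reviewer is pointed at two of the three claims (the unsourced one and the low-confidence one) instead of rereading everything, which is what keeps the check faster than the task.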

What this doesn't tell you

ARR is necessary but not sufficient. A task can pass all three checks and still fail in production because:

The model isn't capable enough at this specific work yet
The integrations don't exist for the systems the agent needs to touch
The economics don't work — model cost exceeds value delivered
Your team can't operationalize the agent without dedicated engineering time

ARR filters out most bad agent ideas quickly. The remaining ideas need a different test: do you have the budget, the tooling, and the team to actually build and maintain this? That's a separate conversation, covered in "where AI agents actually deliver ROI" and "how much does it cost to build an AI agent."

When to use this framework

Three moments where pulling out ARR pays off immediately:

When someone brings you "an AI agent idea." Run it through the three letters out loud. If one fails, you've saved a week of building.
When you're picking which agent to build first. Use ARR to rank candidate workflows. The one that passes all three most cleanly is the one to build first.
When you're auditing existing agents that aren't paying off. If a deployed agent isn't delivering value, the failure is almost always one of the three letters. Identifying which one tells you whether to fix it or kill it.
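For the ranking and auditing cases, a small scoring sketch can make "passes all three most cleanly" explicit. Everything here is an assumption for illustration: the 0-to-1 scores are judgment calls you would assign yourself, the example scores are not from this article, and ranking by the weakest letter is one reasonable reading of the rule that a task must pass all three checks.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    autonomous: float  # 0.0 (fails) to 1.0 (passes cleanly), scored by you
    recurring: float
    reviewable: float

    def weakest(self) -> tuple[str, float]:
        """The lowest-scoring letter, which is the one to fix or kill on."""
        scores = {
            "Autonomous": self.autonomous,
            "Recurring": self.recurring,
            "Reviewable": self.reviewable,
        }
        letter = min(scores, key=scores.get)
        return letter, scores[letter]


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates so the one whose weakest letter is strongest comes first."""
    return sorted(candidates, key=lambda c: c.weakest()[1], reverse=True)


# Assumed scores for three workflows discussed earlier in the article.
candidates = [
    Candidate("Company strategy doc", 0.0, 0.0, 0.0),
    Candidate("Personalized cold outreach", 1.0, 1.0, 0.5),
    Candidate("Support reply drafting", 1.0, 1.0, 1.0),
]
for c in rank(candidates):
    letter, score = c.weakest()
    print(f"{c.name}: weakest letter is {letter} ({score})")
```

Ranking by the minimum rather than the average keeps a task with one failing letter from being rescued by two strong ones, and the weakest letter doubles as the audit answer for an agent that isn't paying off.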

The full picker recommends platforms for ARR-passing tasks; the calculator tells you what they'll cost. ARR is the upstream filter that decides whether to use either at all.

About the author

Lucas Powell

Founder, Growth 8020 · Editor, Agent Shortlist

Founder of Growth 8020, an AI-first B2B marketing studio. Editor of Agent Shortlist — the publication he wished existed when his team had to pick AI tools.
