Evals as PRDs: How AI Teams Are Replacing Specs With Tests
Evals are becoming the new PRD for AI agent development — quantifiable tests that define 'done' for coding agents. The frameworks, the patterns, and the failure modes.
Top AI engineering teams are quietly changing how they spec software. Instead of writing PRDs that describe what the agent should do step-by-step, they're writing evaluations — quantifiable tests the agent's output must pass — and letting the coding agent figure out the steps.
The pattern is called "evals as PRDs." The shift it represents is real: for non-deterministic systems, specifying the outcome works better than specifying the process.
This guide covers what evals actually are, the frameworks worth knowing, how the "evals as PRDs" pattern works in practice, and the failure modes that catch teams that adopt it without thinking.
What evals actually are
An eval is one input plus the expected behaviour. The agent runs against the input; the eval framework checks whether the output meets the criterion; it reports pass or fail.
A trivial example for a support-classification agent:
input: "My credit card was charged twice for the same order"
eval: output must include category="billing" with confidence > 0.8
The eval doesn't dictate how the agent should reach that conclusion. It just checks whether the output is right.
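A minimal sketch of that check in Python, assuming the agent returns a dict with `category` and `confidence` keys. The `classify_stub` function is a hypothetical stand-in for the real agent, included only so the example runs; it is not part of any framework.

```python
from typing import Callable, TypedDict


class Classification(TypedDict):
    category: str
    confidence: float


def billing_eval(agent: Callable[[str], Classification]) -> bool:
    """Pass if the agent labels the double-charge ticket as billing with confidence > 0.8."""
    ticket = "My credit card was charged twice for the same order"
    result = agent(ticket)
    # The eval only inspects the output; it says nothing about how the agent got there.
    return result["category"] == "billing" and result["confidence"] > 0.8


def classify_stub(ticket: str) -> Classification:
    """Hypothetical stand-in for the real agent (assumption for illustration)."""
    if "charged" in ticket.lower():
        return {"category": "billing", "confidence": 0.93}
    return {"category": "other", "confidence": 0.50}
```

Calling `billing_eval(classify_stub)` passes. Swapping in any other callable with the same output shape reuses the eval unchanged.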
A more sophisticated example using LLM-as-judge:
input: <500-word support ticket about a refund>
eval: response should
- acknowledge the customer's frustration in the opening
- reference the specific order number from the ticket
- state the refund policy clearly
- end with the next action the customer needs to take
The eval framework calls an LLM (often Claude or GPT) to evaluate whether each criterion is satisfied, and produces a pass/fail with reasoning.
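The judging step can be sketched without committing to a provider. In the sketch below, `judge` is a hypothetical callable you would back with a real model call; its signature, the `Verdict` type, and the helper names are assumptions for illustration, not any framework's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The four criteria from the refund-ticket example above.
CRITERIA = [
    "acknowledges the customer's frustration in the opening",
    "references the specific order number from the ticket",
    "states the refund policy clearly",
    "ends with the next action the customer needs to take",
]


@dataclass
class Verdict:
    criterion: str
    passed: bool
    reasoning: str


# Assumed judge signature: (ticket, response, criterion) -> (passed, reasoning).
Judge = Callable[[str, str, str], Tuple[bool, str]]


def judge_response(ticket: str, response: str, judge: Judge) -> List[Verdict]:
    """Ask the judge about each criterion separately and keep its reasoning."""
    verdicts = []
    for criterion in CRITERIA:
        passed, reasoning = judge(ticket, response, criterion)
        verdicts.append(Verdict(criterion, passed, reasoning))
    return verdicts


def all_passed(verdicts: List[Verdict]) -> bool:
    """The eval passes only if every criterion passed."""
    return all(v.passed for v in verdicts)
```

Judging one criterion per call keeps the reasoning attached to the criterion it belongs to, which makes a failed run easier to read.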
Evals are testing infrastructure for non-deterministic systems. Where unit tests check that add(2, 2) === 4, evals check that a probabilistic system's outputs satisfy semantic properties.
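Because the same input can produce different outputs from one run to the next, one common way to handle it is to run an eval several times and require a minimum pass rate. A minimal sketch of that idea follows; the `trials` and `threshold` defaults are illustrative assumptions, not recommendations from any framework.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], trials: int = 20) -> float:
    """Run the same eval `trials` times and return the fraction of passes."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_once())
    return passes / trials


def meets_threshold(
    run_once: Callable[[], bool], trials: int = 20, threshold: float = 0.9
) -> bool:
    """Treat the eval as passing only if the pass rate reaches the threshold."""
    return pass_rate(run_once, trials) >= threshold
```

Here `run_once` is whatever closure runs the agent on one input and applies the check.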
The 'evals as PRDs' pattern
The traditional PRD shape: a document describing what the feature should do, how it should work, what the UI should look like, what algorithms it should use. The engineering team reads the spec, implements it, ships.
The evals-as-PRDs shape: a suite of evals defining what "done" means. The implementation details are unspecified. A coding agent (Claude Code, Cursor in agent mode, OpenAI Codex, Amp, Roo Code with Architect mode, or OpenHands for autonomous task completion) iterates — tries different approaches, runs the evals, refines, tries again — until every eval passes.
The shift is from process specification to outcome specification. You're not telling the agent how to solve the problem; you're defining what success looks like and letting it figure out the how.
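The loop itself is simple to sketch. Below, `propose` stands in for the coding agent: a hypothetical callable that receives the names of the currently failing evals and returns a new candidate implementation. The function names and the `max_rounds` cap are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# An eval takes a candidate implementation and returns pass/fail.
Eval = Callable[[Any], bool]


def failing_evals(candidate: Any, evals: Dict[str, Eval]) -> List[str]:
    """Return the names of the evals the candidate does not pass."""
    return [name for name, check in evals.items() if not check(candidate)]


def iterate_until_green(
    propose: Callable[[List[str]], Any],
    evals: Dict[str, Eval],
    max_rounds: int = 50,
) -> Tuple[Optional[Any], List[str]]:
    """Ask for candidates until one passes every eval or the budget runs out.

    Returns (candidate, []) on success, or (None, last_failing_names) on failure.
    """
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(feedback)  # the agent sees only which evals failed
        feedback = failing_evals(candidate, evals)
        if not feedback:
            return candidate, []
    return None, feedback
```

Note what the agent is told: which evals failed, not which steps to take. The `max_rounds` cap matters in practice, since an unbounded loop can burn compute on a suite that no candidate can satisfy.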
When this works well: hard problems where the right algorithm isn't obvious upfront. Database indexing strategy, ranking algorithms, edge-case handling in classification, retrieval tuning. The agent tries hundreds of approaches in the background; you wake up to the one that passed.
When this doesn't work: tasks where the goal is the process, not the outcome (writing a tutorial in a specific voice, designing a specific UI flow, executing a regulated workflow). For these, the steps matter as much as the output, and a PRD that specifies them is still the right tool.
The honest take: evals as PRDs is a powerful pattern for a narrow band of problems — algorithmically-hard, outcome-measurable work. For the rest, evals supplement PRDs rather than replace them.
The eval frameworks worth knowing
Five frameworks come up most often in production eval workflows in 2026.
Inspect AI — the UK AI Security Institute's research-grade evaluation framework. Open-source. Strongest framework for rigorous, multi-model evaluation including agent-trajectory evals (testing not just the final output but the steps the agent took to get there). The right pick if you want the most serious evaluation tooling available and you're comfortable in Python.
promptfoo — the most popular open-source eval framework. Fast setup (a YAML file gets you running), broad model support across every major provider, simple LLM-as-judge primitives. The right default for most teams starting with evals.
Braintrust — commercial hosted eval platform. UI for inspecting eval runs, comparing versions, tracking drift over time. The right pick when the team wants hosted infrastructure and shareable eval dashboards, not just CLI tooling.
LangSmith — LangChain's evaluation product. Tightly integrated with LangGraph workflows, strong for teams already on the LangChain stack. The right pick if your agent code is already LangChain-shaped.
OpenAI Evals — the original eval framework. Open-source. The right pick if you're committed to OpenAI's models and want the framework with the longest production history.
The choice between them is less about features (they all do similar things) and more about your stack. Python-first team: Inspect AI. Self-hosted with YAML config and a CLI: promptfoo. Hosted UI: Braintrust. LangChain shop: LangSmith. OpenAI shop: OpenAI Evals.
How to write evals that actually work
Three rules from teams that ship eval-driven agents successfully.
Rule 1: Test the behaviour, not the prompt.
If your eval would break the moment you rewrote the prompt, it's testing the wrong thing. Evals should pass even when you swap the model, change the prompt, or refactor the agent — as long as the output still satisfies the criterion. Bad eval: "the response includes the word 'unfortunately.'" Good eval: "the response acknowledges the customer's frustration in a tonally appropriate way."
Rule 2: Cover the long tail, not just the happy path.
The happy path is what your demo tests. Production failures come from cases nobody anticipated — ambiguous inputs, malformed data, edge cases, hostile inputs. Eval coverage on the long tail is where the value compounds.
A pattern that works: every production failure becomes a new eval. The customer who got the broken response is now an input in your test suite. If that case ever regresses, the suite catches it before it ships.
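One lightweight way to do this is an append-only JSONL file that the eval runner loads on every run. The sketch below assumes that layout; the file format and the field names `input`, `expected`, and `note` are hypothetical choices, not a convention from any framework.

```python
import json
from pathlib import Path
from typing import Dict, List


def record_failure(suite_path: Path, ticket: str, expected: str, note: str) -> None:
    """Append one production failure to the regression suite as a JSON line."""
    case = {"input": ticket, "expected": expected, "note": note}
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def load_suite(suite_path: Path) -> List[Dict[str, str]]:
    """Load every recorded case; a missing file means an empty suite."""
    if not suite_path.exists():
        return []
    with suite_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the `note` field next to each case preserves why the case exists, which helps when someone later wonders whether an eval is still needed.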
Rule 3: Make the failure mode obvious.
When an eval fails, the message should tell you what's wrong in plain language. Not "expected match returned false" — "the response did not include the order number from the input, which the customer needs to track their refund." Good eval failure messages turn debug sessions from hours to minutes.
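A sketch of a check that explains itself, using the order-number case. The `#12345` order-number format and the `EvalResult` type are assumptions for illustration; a real suite would match whatever format your tickets use.

```python
import re
from dataclasses import dataclass

# Assumed order-number format: "#" followed by at least four digits.
ORDER_PATTERN = re.compile(r"#(\d{4,})")


@dataclass
class EvalResult:
    passed: bool
    message: str


def check_order_number(ticket: str, response: str) -> EvalResult:
    """Check that the response repeats the order number found in the ticket."""
    match = ORDER_PATTERN.search(ticket)
    if match is None:
        # Distinguish a broken eval input from a broken agent response.
        return EvalResult(
            False,
            "Eval setup problem: the ticket contains no order number "
            "in the assumed '#12345' format.",
        )
    order = match.group(1)
    if order in ORDER_PATTERN.findall(response):
        return EvalResult(True, f"The response includes order number #{order}.")
    return EvalResult(
        False,
        f"The response did not include order number #{order} from the input, "
        "which the customer needs to track their refund.",
    )
```

The failure message names the missing value and why it matters, so whoever reads the report does not have to open the transcript to find out what went wrong.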
A worked example: ranking algorithm via evals as PRDs
A concrete example of evals replacing a PRD. The job: build a ranking algorithm that orders search results for a B2B product catalog.
Traditional PRD: specify the ranking factors (recency, relevance, popularity), the weights, the merge logic, the edge cases. Engineering implements. Iterates against feedback. Ships in weeks.
Evals as PRDs: write 200 evals that define what "good ranking" means.
For query "industrial widgets":
- Result 1 should be in the top 3
- Results 4–7 should all be widget products (not adjacent categories)
- Results 8–10 should include at least 2 from the past 30 days
For query "compatible with Acme Model X":
- Top 3 results must all be confirmed compatible (data flag)
- No result without explicit compatibility data in top 5
For query "Acme":
- Results 1–3 should be Acme-branded
- Results 4–10 should include competing brands (don't single-source)
... 197 more evals covering different query shapes
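Several of these criteria translate directly into positional checks over a ranked list. The sketch below assumes each result is a dict with hypothetical fields `brand`, `category`, `compatible` (`True`, `False`, or `None` when no data exists), and `listed_days_ago`; none of these names come from a real catalog schema.

```python
from typing import Any, Dict, List

Result = Dict[str, Any]


def widgets_in_positions_4_to_7(results: List[Result]) -> bool:
    """Results 4-7 should all be widget products."""
    return len(results) >= 7 and all(r["category"] == "widget" for r in results[3:7])


def recent_in_positions_8_to_10(results: List[Result], max_age_days: int = 30) -> bool:
    """Results 8-10 should include at least 2 listed within the last 30 days."""
    recent = [r for r in results[7:10] if r["listed_days_ago"] <= max_age_days]
    return len(recent) >= 2


def top_results_compatible(results: List[Result]) -> bool:
    """Top 3 must be confirmed compatible; nothing in the top 5 may lack data."""
    top3_confirmed = len(results) >= 3 and all(
        r["compatible"] is True for r in results[:3]
    )
    top5_has_data = all(r["compatible"] is not None for r in results[:5])
    return top3_confirmed and top5_has_data


def brand_mix(results: List[Result], brand: str = "Acme") -> bool:
    """Results 1-3 should be the queried brand; results 4-10 should include a competitor."""
    top3_branded = len(results) >= 3 and all(r["brand"] == brand for r in results[:3])
    has_competitor = any(r["brand"] != brand for r in results[3:10])
    return top3_branded and has_competitor
```

Each check takes only the ranked output, so the same functions keep working whether the agent changes the weights, the merge logic, or the whole algorithm.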
The coding agent (Claude Code in this case) implements an initial ranking algorithm, runs the evals, and sees that 73 pass. It tries a different weighting and runs again: 89 pass. It adds a recency boost: 112 pass. It continues until 195/200 pass and the remaining 5 failures have been reviewed and confirmed as genuine edge cases rather than bugs.
Time to ship: similar to traditional PRD development. The advantage: the eval suite is now permanent infrastructure. Every future ranking change is validated against the same 200 evals, so regressions are caught as soon as the suite runs.
This is the pattern in action: the evals become the spec, the agent does the implementation work, and the team's job shifts to designing eval coverage rather than designing algorithms.
The honest failure mode: Goodhart's Law
The big risk with evals-as-PRDs is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The agent passes every eval. The agent ships. Production users hit edge cases the evals didn't cover. The agent fails in non-obvious ways.
This isn't a hypothetical — it's the failure mode every team that adopts this pattern encounters.
The mitigation isn't "write more evals" (you can't anticipate everything). It's a three-part discipline:
- Pair evals with structured user testing. Run the agent against real users in a staged rollout. Their failures become new evals.
- Gradual production rollout with monitoring. Ship to 1% of traffic, monitor outcomes against business metrics (not just evals), expand if outcomes hold.
- Production failures become new evals. Every failure mode discovered in production gets codified as an eval. The suite grows over time, catching the regressions next time.
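The staged-rollout step in the list above can be sketched in two small functions: one that deterministically assigns a user to the rollout, and one that decides whether to expand. The hashing scheme, the doubling step, and the 2% tolerance are all assumptions for illustration, not a prescribed policy.

```python
import hashlib


def in_rollout(user_id: str, percent: float) -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=1.0 covers buckets 0..99


def next_rollout_percent(
    current: float,
    metric_now: float,
    metric_baseline: float,
    tolerance: float = 0.02,
    step: float = 2.0,
) -> float:
    """Expand if the business metric holds near baseline; otherwise roll back to 0."""
    if metric_now < metric_baseline * (1 - tolerance):
        return 0.0
    return min(100.0, current * step)
```

Hashing the user ID keeps each user on the same side of the rollout across requests, which makes their outcomes comparable over time. The metric passed in should be a business outcome, not the eval pass rate.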
Teams that treat evals as a complete coverage tool fail with this pattern. Teams that treat evals as a high-leverage portion of a broader testing discipline succeed.
The cost math
What running evals actually costs.
For a 50-eval suite running against Claude Sonnet 4.6 on a typical agent workload (~3,000 tokens per eval input + ~500 tokens per output + ~1,000 tokens per LLM-as-judge evaluation):
| Run frequency | Cost per run | Monthly cost |
|---|---|---|
| On every deploy (10x/month) | ~$2.50 | ~$25 |
| On every deploy + nightly (40x/month) | ~$2.50 | ~$100 |
| On every deploy + nightly + per-PR (200x/month) | ~$2.50 | ~$500 |
For a 500-eval suite, multiply by 10.
These costs are small relative to what a single shipped agent bug costs. Treat the eval bill as an investment in the discipline, not as a running expense.
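The table's arithmetic is easy to reproduce for your own numbers. The sketch below takes the ~$2.50 per-run figure as given rather than deriving it from token prices, and assumes cost scales linearly with suite size, as the "multiply by 10" rule implies. The function names are illustrative.

```python
def tokens_per_run(suite_size: int, input_tokens: int = 3_000,
                   output_tokens: int = 500, judge_tokens: int = 1_000) -> int:
    """Total tokens for one full run, using the per-eval figures above."""
    return suite_size * (input_tokens + output_tokens + judge_tokens)


def monthly_cost(cost_per_run: float, runs_per_month: int) -> float:
    """Monthly spend is the per-run cost times the number of runs."""
    return cost_per_run * runs_per_month


def scale_for_suite(cost_per_run_at_50: float, suite_size: int) -> float:
    """Scale the 50-eval per-run cost linearly to another suite size."""
    return cost_per_run_at_50 * suite_size / 50


# Run frequencies from the table above.
SCHEDULES = {
    "every deploy": 10,
    "every deploy + nightly": 40,
    "every deploy + nightly + per-PR": 200,
}

for name, runs in SCHEDULES.items():
    print(f"{name}: ~${monthly_cost(2.50, runs):.0f}/month")
```

A 50-eval run at these sizes is 225,000 tokens. Plugging in your own per-run cost and schedule gives the monthly figure directly.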
When evals-as-PRDs is right vs wrong
The pattern is the right pick when:
- The task has measurable success criteria
- The right algorithm/approach isn't obvious upfront
- The agent has compute available to iterate (often background time, not real-time)
- The cost of shipping a wrong implementation is high
The pattern is the wrong pick when:
- The work's value is in the steps, not the outcome (regulatory, audit-trail, process-anchored work)
- The success criteria can't be made testable
- The team doesn't have time to write rigorous evals before starting
- The implementation is straightforward and a PRD ships faster
For most product teams in 2026: use evals to supplement PRDs. Let the agent iterate against evals on the algorithm-hard parts; write PRDs for the process-anchored parts. The hybrid is more productive than the pure pattern in either direction.
What to read next
The AI agent observability guide covers what to instrument so eval results and production behaviour can be correlated. Loop engineering covers the trigger layer that decides when evals run. AI agent workflow design covers the workflow patterns evals are testing against. The shortlist has the platform-by-platform breakdown if you're picking the coding agent that will iterate against your evals.
If you're stuck choosing between eval frameworks for a specific stack, promptfoo is the default starting point and Inspect AI is the upgrade path for serious teams.
About the author

Lucas Powell
Founder, Growth 8020 · Editor, Agent Shortlist
Founder of Growth 8020, an AI-first B2B marketing studio. Editor of Agent Shortlist — the publication he wished existed when his team had to pick AI tools.
More in this series
Every article in the foundations cluster — for builders who want the full picture.
How to Create an AI Agent: A Tested Builder's Guide (2026)
Loop Engineering: How to Design Self-Prompting AI Agents
Multi-Agent AI: When to Use It, When to Skip It, What Actually Works
The ARR framework: which tasks should you actually give to an AI agent?
Director vs doer: the mindset shift that separates working AI agents from broken ones
The lethal trifecta: the AI agent security trap nobody warns you about
AI Agent Model Routing: Cut Your API Bill by 60% Without Losing Quality
AI Agent Observability: What to Monitor and How
AI Agent Guardrails: How to Not Delete Your Database in 9 Seconds
AI Agent Orchestration: Frameworks, Platforms, and What Actually Works
AI Agent Workflow Design: Patterns That Ship in Production
The best AI agent frameworks in 2026: LangGraph, CrewAI, AutoGen, and what to pick
AI Agent Skills and Memory: How to Make Agents Get Better Over Time