What are AI agent evals?

Evals are quantifiable tests that measure whether an AI agent's output meets a defined success criterion. Each eval pairs one input with the expected behaviour: the agent runs against it, and the eval framework reports pass or fail. Evals are how production AI teams know whether a prompt change, model swap, or workflow tweak actually improved things. They're testing infrastructure for non-deterministic systems.

What does 'evals as PRDs' mean?

Evals as PRDs is the pattern of specifying agent features by writing the tests they must pass rather than the steps they should take. Instead of 'the agent should use Algorithm X, then approach Y, then format Z,' the spec is 'the agent's output must pass these 50 evals.' The coding agent then iterates, trying different algorithms, approaches, and formats, until the evals pass. The eval set becomes the source of truth for 'done.'

What are the best AI agent eval frameworks in 2026?

Five are worth considering. Inspect AI (open-source, from the UK AI Security Institute) for the most rigorous research-grade evaluation infrastructure. promptfoo for the most popular open-source pick, with fast setup and broad model support. Braintrust for commercial teams wanting hosted infrastructure and UI tooling. LangSmith for teams already on LangChain or LangGraph. OpenAI Evals for one of the earliest frameworks, strongest if you're committed to OpenAI's stack. The right pick depends on whether you want self-hosted or hosted, open-source or commercial, and on your existing stack.

How do you write good AI agent evals?

Three rules. First, test the behaviour, not the prompt: the eval should pass even if you rewrite the prompt entirely. Second, cover the long tail, not just the happy path: most production failures come from edge cases the prompt didn't anticipate. Third, make the eval failure mode obvious: when an eval fails, the message should tell you what's wrong in plain language, not require detective work. The eval set's quality determines the agent's quality.

What's the difference between evals and traditional unit tests?

Traditional unit tests check deterministic outputs — given input X, expect exact output Y. Evals check non-deterministic outputs — given input X, the response should satisfy property P (uses the right tone, answers the question, includes the required fields, doesn't hallucinate). The eval framework typically uses an LLM-as-judge or rubric-based scoring to evaluate properties rather than exact-match comparison. Same infrastructure shape, different evaluation primitive.

How much does running evals cost?

Token costs scale with eval count and frequency. A 50-eval suite run against Claude Sonnet 4.6 on a typical agent workload costs roughly $1-5 per run. Running on every deploy plus nightly costs roughly $30-150/month. The math is cheap relative to the cost of shipping a broken agent. LLM-as-judge evals add a second model call per eval, so budget for the judge's tokens as well; with long rubrics the cost per eval can approach double.

What's the failure mode of evals as PRDs?

Goodhart's Law. The agent over-fits to the eval set: it passes every test in the suite, then fails in production on cases the evals didn't cover. The fix isn't writing more evals (you can't anticipate everything); it's pairing evals with structured user testing, gradual production rollout with monitoring, and a feedback loop where production failures become new evals. Evals are a coverage tool, not complete coverage.

Should I use evals for non-coding agents (sales, support, content)?

Yes, but the eval primitives are different. For support agents, evals test response quality, brand voice, and escalation triggers. For content agents, evals test tone, format, and factual accuracy. The same frameworks (promptfoo, Braintrust, LangSmith) support both code and content evals. The discipline of writing evals before shipping is valuable across every agent category: the cost is low, and the catch rate on bugs before they ship is high.

Evals as PRDs: How AI Teams Are Replacing Specs With Tests

Evals are becoming the new PRD for AI agents — quantifiable tests that define 'done' for coding agents. Frameworks, patterns, failure modes.

By Lucas Powell · June 17, 2026 · 7 min read · 1,640 words

Top AI engineering teams are quietly changing how they spec software. Instead of writing PRDs (product requirements documents) that describe what the agent should do step-by-step, they're writing evaluations (quantifiable tests the agent's output must pass) and letting the coding agent figure out the steps.

The pattern's called "evals as PRDs." The shift it represents is real: for non-deterministic systems, specifying the outcome beats specifying the process.

This guide covers what evals actually are, the frameworks worth knowing, how the "evals as PRDs" pattern works in practice, and the failure modes that catch teams that adopt it without thinking.

What evals actually are

An eval is one input plus the expected behaviour. The agent runs against the input; the eval framework checks whether the output meets the criterion; it reports pass or fail.

A trivial example for a support-classification agent:

input: "My credit card was charged twice for the same order"
eval: output must include category="billing" with confidence > 0.8

The eval doesn't dictate how the agent should reach that conclusion. It just checks whether the output is right.
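As a rough illustration, the check above could be written as a plain Python function. This is a minimal sketch, not any framework's API: the `Classification` type, `eval_billing_ticket`, and the stand-in `keyword_agent` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Classification:
    category: str
    confidence: float


def eval_billing_ticket(agent: Callable[[str], Classification]) -> bool:
    """Pass if the agent labels the double-charge ticket as billing with confidence > 0.8."""
    ticket = "My credit card was charged twice for the same order"
    result = agent(ticket)
    return result.category == "billing" and result.confidence > 0.8


# Hypothetical stand-in agent, present only so the sketch runs end to end.
def keyword_agent(text: str) -> Classification:
    if "charged" in text.lower():
        return Classification("billing", 0.93)
    return Classification("other", 0.50)


print(eval_billing_ticket(keyword_agent))
```

The eval only inspects the output object, so any agent with the same call signature can be swapped in without touching the check.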

A more sophisticated example using LLM-as-judge:

input: <500-word support ticket about a refund>
eval: response should
  - acknowledge the customer's frustration in the opening
  - reference the specific order number from the ticket
  - state the refund policy clearly
  - end with the next action the customer needs to take

The eval framework calls an LLM (often Claude or GPT) to evaluate whether each criterion is satisfied, and produces a pass/fail with reasoning.
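One way to picture the rubric flow is the sketch below. It assumes the judge is passed in as a function, so no particular provider SDK is implied; `run_rubric`, the `Judge` type, and `stub_judge` are hypothetical names, and the order number `#4821` is made up for the demonstration. A real judge would send the ticket, the response, and one criterion to an LLM and parse its verdict.

```python
from typing import Callable, Dict, List, Tuple

# The four criteria from the refund example above.
RUBRIC: List[str] = [
    "acknowledges the customer's frustration in the opening",
    "references the specific order number from the ticket",
    "states the refund policy clearly",
    "ends with the next action the customer needs to take",
]

# A judge takes (ticket, response, criterion) and returns (passed, reasoning).
Judge = Callable[[str, str, str], Tuple[bool, str]]


def run_rubric(ticket: str, response: str, judge: Judge) -> Dict[str, object]:
    """Ask the judge about each criterion; the eval passes only if all pass."""
    results = []
    for criterion in RUBRIC:
        passed, reasoning = judge(ticket, response, criterion)
        results.append({"criterion": criterion, "passed": passed, "reasoning": reasoning})
    return {"passed": all(r["passed"] for r in results), "results": results}


def stub_judge(ticket: str, response: str, criterion: str) -> Tuple[bool, str]:
    # Placeholder logic standing in for an LLM call (an assumption of this sketch).
    if "order number" in criterion:
        if "#4821" in response:
            return True, "order number found in response"
        return False, "order number missing from response"
    return True, "assumed satisfied by the stub"
```

Keeping one verdict per criterion, rather than a single overall score, is what lets the report say which part of the response fell short.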

Evals are testing infrastructure for non-deterministic systems. Where unit tests check that add(2, 2) === 4, evals check that a probabilistic system's outputs satisfy semantic properties.

The 'evals as PRDs' pattern

The traditional PRD shape: a document describing what the feature should do, how it should work, what the UI should look like, what algorithms it should use. The engineering team reads the spec, implements it, ships.

The evals-as-PRDs shape: a suite of evals defining what "done" means. The implementation details are unspecified. A coding agent (Claude Code, Cursor in agent mode, OpenAI Codex, Amp, Roo Code with Architect mode, or OpenHands for autonomous task completion) iterates — tries different approaches, runs the evals, refines, tries again, until every eval passes.

The shift is from process specification to outcome specification. You're not telling the agent how to solve the problem; you're defining what success looks like and letting it figure out the how.

When this works well: hard problems where the right algorithm isn't obvious upfront. Database indexing strategy, ranking algorithms, edge-case handling in classification, retrieval tuning. The agent tries hundreds of approaches in the background; you wake up to the one that passed.

When this doesn't work: tasks where the goal is the process, not the outcome (writing a tutorial in a specific voice, designing a specific UI flow, executing a regulated workflow). For these, the steps matter as much as the output, and a PRD that specifies them is still the right tool.

In practice, evals as PRDs is a powerful pattern for a narrow band of problems: algorithmically hard, outcome-measurable work. For the rest, evals supplement PRDs rather than replace them.

The eval frameworks worth knowing

Five frameworks cover most production eval workflows in 2026.

Inspect AI: open-source, research-grade evaluation infrastructure from the UK AI Security Institute. A strong framework for rigorous, multi-model evaluation, including agent-trajectory evals (testing not just the final output but the steps the agent took to get there). The right pick if you want the most serious evaluation tooling available and you're comfortable in Python.

promptfoo: the most popular open-source eval framework. Fast setup (a YAML file gets you running), broad model support across every major provider, simple LLM-as-judge primitives. The right default for most teams starting with evals.

Braintrust: commercial hosted eval platform. UI for inspecting eval runs, comparing versions, tracking drift over time. The right pick when the team wants hosted infrastructure and shareable eval dashboards, not just CLI tooling.

LangSmith: LangChain's evaluation product. Tightly integrated with LangGraph workflows, strong for teams already on the LangChain stack. The right pick if your agent code is already LangChain-shaped.

OpenAI Evals: one of the earliest eval frameworks. Open-source. The right pick if you're committed to OpenAI's models and want a framework with a long production history.

The choice between them is less about features (they all do similar things) and more about your stack. Self-hosted and open-source: Inspect AI (Python) or promptfoo (YAML config and CLI). Hosted UI: Braintrust. LangChain shop: LangSmith. OpenAI shop: OpenAI Evals.

How to write evals that actually work

Three rules from teams that ship eval-driven agents successfully.

Rule 1: Test the behaviour, not the prompt.

If your eval would break the moment you rewrote the prompt, it's testing the wrong thing. Evals should pass even when you swap the model, change the prompt, or refactor the agent — as long as the output still satisfies the criterion. Bad eval: "the response includes the word 'unfortunately.'" Good eval: "the response acknowledges the customer's frustration in a tonally appropriate way."

Rule 2: Cover the long tail, not just the happy path.

The happy path is what your demo tests. Production failures come from cases nobody anticipated — ambiguous inputs, malformed data, edge cases, hostile inputs. Eval coverage on the long tail is where the value compounds.

A pattern that works: every production failure becomes a new eval. The customer who got the broken response is now an input in your test suite. You can never regress on that case again.
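A minimal sketch of that habit, assuming the suite is stored as a JSONL file with one case per line. The file layout and the function name `add_regression_case` are assumptions made for illustration; the frameworks above each have their own dataset formats.

```python
import json
from pathlib import Path


def add_regression_case(suite_path: Path, user_input: str, expectation: str, source: str) -> int:
    """Append a production failure to a JSONL eval suite and return the new case count."""
    case = {"input": user_input, "expectation": expectation, "source": source}
    # Append mode creates the file on first use and never rewrites earlier cases.
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    with suite_path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Recording a `source` field (a ticket or incident reference, for example) keeps the link between each eval and the production failure that motivated it.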

Rule 3: Make the failure mode obvious.

When an eval fails, the message should tell you what's wrong in plain language. Not "expected match returned false" — "the response did not include the order number from the input, which the customer needs to track their refund." Good eval failure messages turn debug sessions from hours to minutes.
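A small sketch of a check that explains itself. It assumes order numbers look like `#` followed by digits, which is an illustrative convention rather than anything from a real system; `check_order_number` is a hypothetical name.

```python
import re
from typing import Optional


def check_order_number(ticket: str, response: str) -> Optional[str]:
    """Return None on pass, or a plain-language failure message."""
    match = re.search(r"#\d+", ticket)
    if match is None:
        return None  # the ticket has no order number, so there is nothing to check
    order_number = match.group(0)
    if order_number in response:
        return None
    return (
        f"The response did not include order number {order_number} from the input, "
        "which the customer needs to track their refund."
    )
```

Returning the message instead of a bare boolean means the test report carries the explanation with it.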

A worked example: ranking algorithm via evals as PRDs

A concrete example of evals replacing a PRD. The job: build a ranking algorithm that orders search results for a B2B product catalog.

Traditional PRD: specify the ranking factors (recency, relevance, popularity), the weights, the merge logic, the edge cases. Engineering implements. Iterates against feedback. Ships in weeks.

Evals as PRDs: write 200 evals that define what "good ranking" means.

For query "industrial widgets":
  - The most relevant widget product should be in the top 3
  - Results 4–7 should all be widget products (not adjacent categories)
  - Results 8–10 should include at least 2 from the past 30 days

For query "compatible with Acme Model X":
  - Top 3 results must all be confirmed compatible (data flag)
  - No result without explicit compatibility data in top 5

For query "Acme":
  - Results 1–3 should be Acme-branded
  - Results 4–10 should include competing brands (don't single-source)

... 197 more evals covering different query shapes
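Two of the evals above, sketched as Python checks over an ordered result list. The `Result` fields are assumptions about what the catalog exposes, with `compatible=None` standing for "no compatibility data on record"; the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    name: str
    brand: str
    compatible: Optional[bool] = None  # None means no compatibility data on record


def eval_brand_query(results: List[Result], brand: str = "Acme") -> List[str]:
    """Checks for a brand query; returns failure messages (empty list means pass)."""
    failures = []
    if not all(r.brand == brand for r in results[:3]):
        failures.append(f"Results 1-3 should all be {brand}-branded.")
    if not any(r.brand != brand for r in results[3:10]):
        failures.append("Results 4-10 should include competing brands.")
    return failures


def eval_compatibility_query(results: List[Result]) -> List[str]:
    """Checks for a 'compatible with ...' query; returns failure messages."""
    failures = []
    if not all(r.compatible is True for r in results[:3]):
        failures.append("Top 3 results must all be confirmed compatible.")
    if any(r.compatible is None for r in results[:5]):
        failures.append("No result without compatibility data may appear in the top 5.")
    return failures
```

Neither check mentions weights or scoring formulas, so the ranking implementation can change freely underneath them.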

The coding agent (Claude Code in this case) implements an initial ranking algorithm, runs the evals, and sees that 73 pass. It tries a different weighting and runs again: 89 pass. It adds a recency boost: 112 pass. It continues until 195 of 200 pass and the team has reviewed the remaining 5 and confirmed they are genuine edge cases rather than bugs.
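The scoring side of that loop can be pictured with the toy sketch below. It is a deliberate simplification: a real coding agent generates and edits candidates as it goes, whereas here the candidates are supplied up front. `count_passes` and `pick_best` are hypothetical names, and the candidates are trivial functions rather than ranking algorithms.

```python
from typing import Callable, Dict, List, Tuple

# An eval takes a candidate implementation and returns pass or fail.
Eval = Callable[[Callable], bool]


def count_passes(candidate: Callable, evals: List[Eval]) -> int:
    """Number of evals the candidate passes."""
    return sum(1 for e in evals if e(candidate))


def pick_best(candidates: Dict[str, Callable], evals: List[Eval]) -> Tuple[str, int]:
    """Score every candidate against the suite and keep the one with the most passes."""
    scores = {name: count_passes(fn, evals) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy suite and candidates, present only to make the sketch runnable.
toy_evals: List[Eval] = [
    lambda f: f(2) == 4,
    lambda f: f(0) == 0,
    lambda f: f(3) == 6,
]
toy_candidates = {"identity": lambda x: x, "double": lambda x: 2 * x}

print(pick_best(toy_candidates, toy_evals))
```

The pass count is the only feedback signal in this picture, which is why the coverage of the suite matters so much.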

Time to ship is similar to traditional PRD development. The advantage is that the eval suite is now permanent infrastructure. Every future ranking change is validated against the same 200 evals. Regressions are caught the moment they happen.

This is the pattern in action: the evals become the spec, the agent does the implementation work, and the team's job shifts to designing eval coverage rather than designing algorithms.

The honest failure mode: Goodhart's Law

The big risk with evals-as-PRDs is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The agent passes every eval. The agent ships. Production users hit edge cases the evals didn't cover. The agent fails in non-obvious ways.

This isn't a hypothetical — it's the failure mode every team that adopts this pattern encounters.

The mitigation isn't "write more evals" (you can't anticipate everything). It's a three-part discipline:

1. Pair evals with structured user testing. Run the agent against real users in a staged rollout. Their failures become new evals.
2. Gradual production rollout with monitoring. Ship to 1% of traffic, monitor outcomes against business metrics (not just evals), expand if outcomes hold.
3. Production failures become new evals. Every failure mode discovered in production gets codified as an eval. The suite grows over time, catching the regressions next time.
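For the gradual rollout step, one common technique is deterministic hash bucketing, sketched below. The function name `in_rollout` and the salt value are hypothetical; this is one way to hold a user in or out of the 1% consistently, not a description of any specific team's setup.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "agent-v2") -> bool:
    """Deterministically place a user in the first `percent` percent of 10,000 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    # percent=1 admits buckets 0-99, which is 1% of the 10,000 buckets.
    return bucket < percent * 100
```

Because the bucket depends only on the salt and the user ID, the same user stays in the same group as the percentage grows, which keeps the business metrics comparable between stages.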

Teams that treat evals as a complete coverage tool fail with this pattern. Teams that treat evals as a high-leverage portion of a broader testing discipline succeed.

The cost math

What running evals actually costs.

For a 50-eval suite running against Claude Sonnet 4.6 on a typical agent workload (~3,000 tokens per eval input + ~500 tokens per output + ~1,000 tokens per LLM-as-judge evaluation):

Run frequency	Cost per run	Monthly cost
On every deploy (10x/month)	~$2.50	~$25
On every deploy + nightly (40x/month)	~$2.50	~$100
On every deploy + nightly + per-PR (200x/month)	~$2.50	~$500

For a 500-eval suite, multiply by 10.
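The table's arithmetic, written out as a sketch. It takes the ~$2.50 per 50-eval run figure above as given and scales it linearly with suite size and run count; `monthly_eval_cost` is a hypothetical helper, and real bills will vary with token counts and model pricing.

```python
def monthly_eval_cost(runs_per_month: int, suite_size: int = 50,
                      cost_per_50_evals: float = 2.50) -> float:
    """Scale the ~$2.50-per-50-evals estimate linearly by suite size and run count."""
    cost_per_run = cost_per_50_evals * suite_size / 50
    return runs_per_month * cost_per_run


print(monthly_eval_cost(10))                   # every deploy
print(monthly_eval_cost(40))                   # every deploy + nightly
print(monthly_eval_cost(200))                  # every deploy + nightly + per-PR
print(monthly_eval_cost(40, suite_size=500))   # 500-eval suite, deploy + nightly
```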

These costs are small relative to what a single shipped agent bug costs. Treat the eval bill as an investment in the discipline rather than as overhead.

When evals-as-PRDs is right vs wrong

The pattern is the right pick when:

- The task has measurable success criteria
- The right algorithm/approach isn't obvious upfront
- The agent has compute available to iterate (often background time, not real-time)
- The cost of shipping a wrong implementation is high

The pattern is the wrong pick when:

- The work's value is in the steps, not the outcome (regulatory, audit-trail, process-anchored work)
- The success criteria can't be made testable
- The team doesn't have time to write rigorous evals before starting
- The implementation is straightforward and a PRD ships faster

For most product teams in 2026: use evals to supplement PRDs. Let the agent iterate against evals on the algorithm-hard parts; write PRDs for the process-anchored parts. The hybrid is more productive than the pure pattern in either direction.

What to read next

The AI agent observability guide covers what to instrument so eval results and production behaviour can be correlated. Loop engineering covers the trigger layer that decides when evals run. AI agent workflow design covers the workflow patterns evals are testing against. The shortlist has the platform-by-platform breakdown if you're picking the coding agent that will iterate against your evals.

If you're stuck choosing between eval frameworks for a specific stack, promptfoo is the default starting point and Inspect AI is the upgrade path for serious teams.

About the author

Lucas Powell

Founder, Growth 8020 · Editor, Agent Shortlist

Founder of Growth 8020, an AI-first B2B marketing studio. Editor of Agent Shortlist — the publication he wished existed when his team had to pick AI tools.
