Agent Shortlist

AI Agent Observability: What to Monitor and How

The best AI agent observability and monitoring tools, what to instrument, and how to find hidden costs fast. LangSmith, Paperclip, OpenTelemetry, Datadog — compared, with the four metrics every production agent needs.

By Lucas Powell · March 20, 2026

A growth team running a lead research workflow discovered — three months in — that 40% of their entire monthly API spend was coming from a single classification step. One prompt. Running thousands of times a day. On Claude Sonnet when Claude Haiku would have done the job just as well.

They found it because someone finally set up an observability dashboard and looked at it. Before that, they had none. The workflow had been running for three months. The money had been leaving for three months.

The fix took twenty minutes. The three months of overspend didn't come back.

That's what AI agent observability is for. Not dashboards for their own sake — finding the things you'd never notice otherwise until you're staring at a bigger API bill than expected.

Why agent observability is different from app monitoring

Traditional application monitoring answers straightforward questions. Did the server return a 500? How long did the database query take? Did the job complete?

Agents are non-deterministic. The same input can produce different outputs. They make decisions — which tool to call, what to include in a prompt, whether to spawn a subagent — and those decisions aren't logged by default. A crash log tells you something broke. It doesn't tell you why the agent took a wrong turn three steps earlier and spent $12 in API calls getting there.

The other difference: agents take real-world actions. They send emails. They write to databases. They make API calls on your behalf. A slow HTTP endpoint is annoying. An agent that misunderstood its task and sent 400 customer emails is a crisis.

You can't debug what you can't see. That's the whole argument for observability. With agents, the stakes for not seeing are higher.

The four things you must instrument

Most agent monitoring setups collect too much or too little. Here are the four signals that actually matter.

1. Token usage per step

Agents are expensive in proportion to how much context they carry. A single run that works in testing might cost 10x more in production once real data inflates the prompt.

Instrument tokens in and tokens out at every step — not just the final output. This is where cost visibility lives. Without it, you'll get a monthly API bill you can't explain or attribute to specific workflows.

Check the token cost calculator before committing to a model stack. The difference between GPT-4o and Claude Haiku on a high-volume agent pipeline can be hundreds of dollars a month, for outputs that are just as good on simple tasks.
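Per-step cost attribution can be sketched in a few lines. This is a minimal illustration, not a billing tool: the model names and the prices in PRICE_PER_MTOK are hypothetical placeholders, and the record fields assume you log a step name, model, and token counts.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens. Replace with your provider's current rates.
PRICE_PER_MTOK = {
    "small-model": {"in": 1.00, "out": 5.00},
    "large-model": {"in": 3.00, "out": 15.00},
}

def step_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of one step from its token counts."""
    price = PRICE_PER_MTOK[model]
    return (tokens_in * price["in"] + tokens_out * price["out"]) / 1_000_000

def cost_by_step(records: list[dict]) -> dict[str, float]:
    """Sum cost per step name across a list of log records."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["step"]] += step_cost(r["model"], r["tokens_in"], r["tokens_out"])
    return dict(totals)

# Illustrative records, not real data.
records = [
    {"step": "classify_intent", "model": "large-model", "tokens_in": 1240, "tokens_out": 83},
    {"step": "classify_intent", "model": "large-model", "tokens_in": 1100, "tokens_out": 90},
    {"step": "draft_reply", "model": "small-model", "tokens_in": 2000, "tokens_out": 400},
]
totals = cost_by_step(records)
```

Sorting totals by value is usually enough to show which step dominates the bill.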

2. Tool call success and failure rates

Most agents fail not because the model hallucinated, but because a tool call failed. The browser scrape timed out. The API returned a 429. The database query returned no rows and the agent didn't handle the empty state.

Track: which tools are called, how often, and what percentage succeed. A tool that fails 15% of the time isn't a fluke — it's a reliability problem that will bite you at scale.

This is also where you catch agents that have learned to avoid a broken tool by hallucinating an answer instead. The tool call rate drops. The hallucination rate rises. The success metric stays green. You miss it entirely without this signal.
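One way to get these numbers is to wrap each tool in a counter. The sketch below assumes tools are plain Python callables that raise on failure; tracked, success_rate, and the search_crm stand-in are hypothetical names, not part of any framework.

```python
from collections import Counter
from typing import Any, Callable, Optional

calls: Counter = Counter()
failures: Counter = Counter()

def tracked(name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call and every raised exception is counted."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        calls[name] += 1
        try:
            return tool(*args, **kwargs)
        except Exception:
            failures[name] += 1
            raise  # the agent still sees the failure
    return wrapper

def success_rate(name: str) -> Optional[float]:
    """Fraction of successful calls, or None if the tool was never called."""
    if calls[name] == 0:
        return None
    return 1 - failures[name] / calls[name]

def search_crm(query: str) -> list[str]:
    """Stand-in tool that fails on an empty query."""
    if not query:
        raise ValueError("empty query")
    return [f"lead for {query}"]

search = tracked("search_crm", search_crm)
for q in ["acme", "", "globex", "initech"]:
    try:
        search(q)
    except ValueError:
        pass  # a real agent would handle or report this
```

Watching calls per run alongside the success rate is what exposes an agent that has quietly stopped calling a tool.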

3. Latency per step

End-to-end latency tells you the pipeline is slow. Latency per step tells you which part is the bottleneck.

In most agent pipelines, latency clusters in one of three places: the first LLM call (large system prompt), a specific tool (external API with rate limits), or a loop that runs more iterations than expected. You can't fix what you can't locate.

Instrument start time and end time at every node. A step that usually takes 2 seconds taking 12 seconds is a flag — even if the output looks correct.
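A context manager is one simple way to time every node. This is a sketch under assumptions: the in-memory step_durations_ms store and the 3x-median threshold in is_slow are illustrative choices, and a real setup would write durations to your logs instead.

```python
import statistics
import time
from contextlib import contextmanager

# Durations in milliseconds, keyed by step name.
step_durations_ms: dict[str, list[float]] = {}

@contextmanager
def timed_step(name: str):
    """Record the wall-clock duration of the wrapped block, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        step_durations_ms.setdefault(name, []).append(elapsed_ms)

def is_slow(name: str, latest_ms: float, factor: float = 3.0) -> bool:
    """True if latest_ms exceeds `factor` times the step's median duration so far."""
    history = step_durations_ms.get(name, [])
    if not history:
        return False
    return latest_ms > factor * statistics.median(history)

with timed_step("classify_intent"):
    time.sleep(0.01)  # stand-in for the real LLM or tool call
```

With a history of roughly 2-second runs, a 12-second run trips is_slow even when the output looks fine.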

4. Output quality signals

This is the hardest one and the most important. Token counts and latency are mechanical. Quality is semantic.

The signals that work at scale:

  • Task completion rate — did the agent finish the task it was asked to do, or did it bail out with an error or partial output?
  • Human correction rate — for agents with human review downstream, how often does a reviewer change the output?
  • Confidence flags — some agents can self-report uncertainty. Wire that into your logs.
  • Hallucination triggers — fact-check a sample of outputs against ground truth. Track the rate.

You won't instrument all of these on day one. Start with task completion rate. It's the one that catches the most failure modes.
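The first two signals in the list above are simple ratios once each run is recorded. A minimal sketch, assuming each run record carries a status plus reviewed and corrected flags; those field names are hypothetical.

```python
from typing import Optional

def completion_rate(runs: list[dict]) -> float:
    """Share of runs that finished the task (status == 'completed')."""
    if not runs:
        return 0.0
    return sum(r["status"] == "completed" for r in runs) / len(runs)

def correction_rate(runs: list[dict]) -> Optional[float]:
    """Among reviewed runs, the share where the reviewer changed the output."""
    reviewed = [r for r in runs if r.get("reviewed")]
    if not reviewed:
        return None  # nothing reviewed yet, so no rate to report
    return sum(bool(r["corrected"]) for r in reviewed) / len(reviewed)

# Illustrative records, not real data.
runs = [
    {"status": "completed", "reviewed": True, "corrected": False},
    {"status": "completed", "reviewed": True, "corrected": True},
    {"status": "error", "reviewed": False, "corrected": False},
    {"status": "partial", "reviewed": False, "corrected": False},
]
```

Here two of four runs completed, and one of the two reviewed outputs was corrected, so both rates come out at 0.5.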


What a good observability setup looks like

Three components. You need all three.

Structured logs with trace IDs. Every agent action should emit a JSON log with: trace ID (unique per user request), step name, model used, tokens in/out, tool calls made, duration, and a pass/fail flag. The trace ID is what lets you follow a request through every agent hop in a multi-agent pipeline. Without it, you can't correlate a downstream failure to the upstream decision that caused it.

Human-readable decision summaries. Raw logs are for machines. Engineers debugging an issue need to read what the agent decided and why in plain language. Some frameworks support reasoning traces out of the box. If yours doesn't, add a summary field to your logs — have the agent narrate its decision in a sentence or two before executing.

Cost dashboards. Token costs per workflow, per agent, per model. Updated daily. With alerts for anomalous spend. If a workflow's cost doubles overnight, you want to know before the billing cycle closes.
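The "cost doubles overnight" alert can be sketched as a comparison against a trailing baseline. The function name, the data shape (a list of daily costs per workflow, oldest first), and the factor of 2 are assumptions for illustration.

```python
from statistics import mean

def spend_anomalies(daily_cost: dict[str, list[float]], factor: float = 2.0) -> list[str]:
    """Return workflows whose latest daily cost is at least `factor` times the mean of prior days."""
    flagged = []
    for workflow, costs in daily_cost.items():
        if len(costs) < 2:
            continue  # no history to compare against
        *history, today = costs
        baseline = mean(history)
        if baseline > 0 and today >= factor * baseline:
            flagged.append(workflow)
    return flagged

# Illustrative daily costs in USD, oldest first.
daily_cost = {
    "lead_research": [10.0, 11.0, 9.0, 24.0],
    "doc_processing": [5.0, 5.5, 5.2, 5.4],
}
flagged = spend_anomalies(daily_cost)
```

A mean over prior days is a crude baseline. It is enough to catch a doubling, but a workflow with strong weekly seasonality would need something smarter.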


Tools that exist today

You don't have to build this from scratch. Several tools handle the heavy lifting.

LangSmith

LangChain's observability layer. Works with any LLM application, not just LangChain-based ones. Captures every LLM call, tool call, and chain step automatically. Gives you trace views, cost tracking, and a dataset for evaluating outputs over time.

LangSmith's strength: it's comprehensive and the free tier is generous. Its weakness: it's cloud-hosted by LangChain. If your agents process sensitive data, check whether that fits your compliance requirements.

Paperclip's audit trail

Paperclip has an immutable decision log built into the platform. Every agent decision, action, and tool call is recorded and linked to the agent that made it. You can reconstruct exactly what happened in a multi-agent workflow — which agent did what, in what order, and what the inputs and outputs were at each step.

This is particularly useful for multi-agent orchestration scenarios where a chain of agents produces a final output and you need to trace a problem back to its source. The audit trail makes that reconstruction straightforward.

Paperclip is open-source and self-hosted, which means the audit data stays on your infrastructure.

OpenTelemetry

If you're building a custom agent framework or your agents are embedded in a larger application, OpenTelemetry is the standard for custom instrumentation. You instrument your agent code with spans and export traces to your existing observability stack — Honeycomb, Jaeger, Grafana Tempo.

The work is upfront. Once instrumented, you get trace data that integrates cleanly with the rest of your application monitoring, which is useful when you need to correlate agent behaviour with downstream system health.

Datadog and Honeycomb

If your team already uses Datadog or Honeycomb for APM, you can push agent logs there. Both tools can ingest structured JSON, build dashboards on token costs and latency, and alert on anomalies.

Pushing logs this way isn't a native LLM observability solution, and you'll build the queries yourself, but for teams with existing tooling it's faster than adopting a new platform. The cost dashboard you need can be built in an afternoon with structured logs and a basic SQL-style query.
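To show the shape of that query without tying it to one vendor's query language, here is a sketch that loads JSON log lines into an in-memory SQLite database from Python's standard library. The table layout and sample lines are assumptions, and your log platform's syntax will differ.

```python
import json
import sqlite3

# Illustrative log lines, one JSON object per agent step.
log_lines = [
    '{"trace_id": "t1", "step": "classify_intent", "model": "m-large", "tokens_in": 1240, "tokens_out": 83, "duration_ms": 1820}',
    '{"trace_id": "t2", "step": "classify_intent", "model": "m-large", "tokens_in": 1100, "tokens_out": 90, "duration_ms": 1800}',
    '{"trace_id": "t1", "step": "draft_reply", "model": "m-small", "tokens_in": 2000, "tokens_out": 400, "duration_ms": 3000}',
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE steps (trace_id TEXT, step TEXT, model TEXT, "
    "tokens_in INTEGER, tokens_out INTEGER, duration_ms INTEGER)"
)
for line in log_lines:
    r = json.loads(line)
    conn.execute(
        "INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?)",
        (r["trace_id"], r["step"], r["model"], r["tokens_in"], r["tokens_out"], r["duration_ms"]),
    )

# Token totals and average latency per step and model, biggest consumers first.
rows = conn.execute(
    """
    SELECT step, model,
           SUM(tokens_in)  AS total_in,
           SUM(tokens_out) AS total_out,
           AVG(duration_ms) AS avg_ms
    FROM steps
    GROUP BY step, model
    ORDER BY total_in DESC
    """
).fetchall()
```

The GROUP BY on step and model is the part that matters: it is what turns "AI API costs" into a line item per step.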

Compare model costs across providers before choosing a stack — your observability approach should account for switching models if you find a cheaper option for a given step.


The minimum viable observability setup

For teams that want to ship something today, not build a monitoring platform.

Emit this JSON object at every agent step:

{
  "trace_id": "uuid-per-user-request",
  "step": "classify_intent",
  "model": "claude-haiku-4-5",
  "tokens_in": 1240,
  "tokens_out": 83,
  "tools_called": ["search_crm"],
  "tool_success": true,
  "duration_ms": 1820,
  "pass": true
}

In practice, a single step's log line looks like this:

{
  "trace_id": "req_a4f2b",
  "step": "lead-classifier",
  "model": "claude-haiku-4-5",
  "tokens_in": 847,
  "tokens_out": 42,
  "tools_called": [],
  "tool_success": null,
  "duration_ms": 1240,
  "pass": true
}

One line per step. Queryable in any log platform. Shows you instantly when a cheap task is accidentally running on an expensive model. (tool_success is null here because this step called no tools.)

That's it. Nine fields. Ship them to any log platform: CloudWatch, Datadog, Logtail, or even a flat file. Build a single dashboard showing cost by step and success rate by tool.

This takes one afternoon to implement. It catches 80% of production issues. It's $0 to run if you're already paying for a log platform.
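A minimal sketch of the emitting side, using only the standard library. The helper names log_step and new_trace_id are hypothetical, and the StringIO buffer stands in for whatever file or log shipper you actually write to.

```python
import io
import json
import sys
import uuid
from typing import Optional, TextIO

def new_trace_id() -> str:
    """One ID per user request; pass it to every step that request triggers."""
    return uuid.uuid4().hex

def log_step(
    trace_id: str,
    step: str,
    model: str,
    tokens_in: int,
    tokens_out: int,
    tools_called: list[str],
    tool_success: Optional[bool],
    duration_ms: int,
    passed: bool,
    stream: TextIO = sys.stdout,
) -> dict:
    """Write one JSON line for an agent step and return the record."""
    record = {
        "trace_id": trace_id,
        "step": step,
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tools_called": tools_called,
        "tool_success": tool_success,
        "duration_ms": duration_ms,
        "pass": passed,
    }
    stream.write(json.dumps(record) + "\n")
    return record

buffer = io.StringIO()  # stand-in for a log file or shipper
trace_id = new_trace_id()
log_step(trace_id, "classify_intent", "claude-haiku-4-5", 1240, 83,
         ["search_crm"], True, 1820, True, stream=buffer)
```

Generate the trace ID once at the entry point and pass it down, so every step in the same request shares it.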

Start here. Add LangSmith or OpenTelemetry when you need deeper trace visualisation or output evaluation.


The ROI is immediate

One team running a document processing pipeline found that 40% of their monthly agent API budget was being spent on a single classification step — a call to GPT-4o that was determining document type. The actual classification task required no reasoning. A swap to a smaller, cheaper model cost them two hours of work and cut their monthly bill by a third.

They only found this because they started logging tokens per step. Without it, the budget line was "AI API costs." With it, it was "GPT-4o on classification: $340/month."

Observability doesn't just tell you when things break. It tells you where you're overpaying.

Most teams skip it because it feels like infrastructure work rather than product work. It is infrastructure work. But it pays for itself the first time you find a misattributed cost, a broken tool call you didn't know was failing, or an agent spending 14 seconds on a step that should take two.


AI agent observability tools compared

A head-to-head across the tools we'd actually recommend in 2026. The right pick depends less on feature parity and more on where the rest of your stack already lives.

| Tool | Best for | Open source | Pricing | Trace visualisation | Cost attribution | Output evaluation |
|---|---|---|---|---|---|---|
| LangSmith | LangChain / LangGraph teams | No | Free tier; $39+/user/mo paid | Excellent | Built-in | Built-in (evals, datasets) |
| Paperclip audit trail | Multi-agent platform users | Yes (MIT) | Free (self-hosted) | Good (chronological) | Per-agent budget tracking | Manual review queue |
| OpenTelemetry + Grafana | Self-hosted, multi-cloud | Yes (Apache 2.0) | Free (your infra) | Configurable | DIY | DIY |
| Datadog LLM Observability | Teams already on Datadog | No | Usage-based, ~$15/host/mo+ | Excellent | Built-in | Add-on |
| Honeycomb | High-cardinality tracing | No | Free tier; $130/mo+ paid | Excellent (BubbleUp) | DIY (custom fields) | DIY |
| Helicone | API proxy approach | Yes (Apache 2.0) | Free tier; $20+/mo paid | Good | Built-in | Built-in |

Two patterns to call out:

  • For LangChain or LangGraph teams, LangSmith is the default. The integration is tight enough that the cost of using anything else is real engineering work.
  • For teams running Paperclip, the built-in audit trail handles 80% of the observability problem without adding another tool. Per-agent budgets, immutable action logs, and chronological trace views ship by default.

AI agent observability best practices

Five practices that consistently separate teams who catch the $340/month classification leak in week one from teams who catch it at the quarterly bill review:

1. Instrument tokens per step from day one, not day 90

The most expensive observability mistake is adding it later. Every team we've seen find a major cost leak found it because they had token logging from the start. Teams that skip it discover the leak months in, after the spend is fully baked into "expected cost."

2. Log the prompt and the response, not just the metadata

The compressed log line "agent called LLM, got result" is useless when you're debugging. Log the actual prompt and the actual response, with PII redaction if your domain requires it. Storage is cheap; reproducible debugging is not.
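Redaction before logging can start as a pair of regular expressions. The patterns below are illustrative and far from exhaustive (they cover email addresses and phone-like numbers only), and prompt_log_fields is a hypothetical helper. Treat this as a starting sketch, not a compliance control.

```python
import re

# Illustrative patterns only; real PII coverage needs more than two regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def prompt_log_fields(prompt: str, response: str) -> dict[str, str]:
    """Fields to merge into a step's log record, with PII redacted first."""
    return {"prompt": redact(prompt), "response": redact(response)}

fields = prompt_log_fields(
    "Summarise the thread with jane.doe@example.com",
    "Jane asked for a callback on +1 (555) 010-9999.",
)
```

Redacting at write time means the raw text never reaches the log platform. The trade-off is that you lose the original if you later need it for debugging.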

3. Alert on tool call failure rates, not just absolute counts

A tool call going from 99% success to 92% success is a leading indicator of a broken integration. An absolute count alert ("more than 100 failures") won't catch it because volume is also growing. Set alerts on the rate, not the count.
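The difference between the two alert styles is easy to see with the numbers from this example. The 5% threshold and the 50-call minimum below are assumed values to tune for your own traffic, and the weekly figures are invented to match the 99% and 92% success rates.

```python
def should_alert(calls: int, failures: int,
                 max_failure_rate: float = 0.05, min_calls: int = 50) -> bool:
    """Alert when the failure rate exceeds the threshold, ignoring tiny samples."""
    if calls < min_calls:
        return False
    return failures / calls > max_failure_rate

COUNT_THRESHOLD = 100  # the "more than 100 failures" style of alert

last_week = {"calls": 1000, "failures": 10}  # 99% success
this_week = {"calls": 1200, "failures": 96}  # 92% success

count_alert = this_week["failures"] > COUNT_THRESHOLD
rate_alert = should_alert(this_week["calls"], this_week["failures"])
```

With these numbers the count alert stays silent at 96 failures while the rate alert fires at 8%.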

4. Track latency per step, not just end-to-end

A workflow that runs in 4 seconds total can hide a step that takes 3.5 seconds because of a slow tool. End-to-end latency tells you something is slow; per-step latency tells you which thing is slow. Always instrument both.

5. Sample the output for quality, don't just count it

Output quality drifts. Models update, prompts decay, edge cases accumulate. The only way to catch this is to sample N outputs per day and either evaluate them with a rubric (LangSmith Evals, Helicone Evaluations) or have a human spot-check them. Automated counting gives you "the agent ran"; sampling tells you whether the result was any good.
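Picking the N outputs can be done reproducibly by seeding the sampler with the date. This is a sketch with a hypothetical daily_sample helper; how you evaluate the sampled outputs (rubric or human) is up to your stack.

```python
import random

def daily_sample(output_ids: list[str], n: int, seed: str) -> list[str]:
    """Pick up to n outputs for review; the same seed gives the same sample."""
    rng = random.Random(seed)
    return rng.sample(output_ids, min(n, len(output_ids)))

# Illustrative IDs for one day's outputs.
todays_outputs = [f"out_{i}" for i in range(100)]
to_review = daily_sample(todays_outputs, 5, seed="2026-03-20")
```

Seeding with the date means anyone re-running the job for that day reviews the same five outputs.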


Frequently asked questions

What is AI agent observability?

AI agent observability is the practice of instrumenting AI agents in production so you can answer four questions in real time: what is each agent doing, how much is it costing, where is it slow, and is the output any good. It's the difference between "the agent ran" and "the agent ran, processed 47 documents at an average cost of $0.03 each, succeeded on 45 of them, and the two failures came from a malformed PDF in the input stream."

What are the best AI agent observability tools?

For most teams: LangSmith if you're on LangChain or LangGraph; Paperclip's built-in audit trail if you're using Paperclip for orchestration; OpenTelemetry + Grafana if you want fully self-hosted; Datadog LLM Observability if the rest of your stack already runs on Datadog. Helicone is the strongest open-source pick if you want an API-proxy approach.

What should I monitor in an AI agent?

Four things, in the order they usually pay off: token usage per step (catches cost leaks fastest), tool call success and failure rates (catches broken integrations), latency per step (catches the slow tool that's killing UX), output quality signals (catches quality drift before customers do). Skip output evaluation only if you have a human in every loop already.

How do I monitor an AI agent continuously?

Three layers: (1) emit structured logs from every agent action, covering prompt, response, tokens used, latency, and success/failure; (2) ship those logs to an observability platform (LangSmith, Datadog, Honeycomb, or a Grafana setup over OpenTelemetry); (3) set alerts on tool failure rates, cost-per-step anomalies, and latency tails. The minimum viable setup is just (1) written to a file. Even a flat file with no platform behind it is better than no instrumentation.

How is AI agent monitoring different from regular app monitoring?

Regular app monitoring tracks deterministic systems: a function either runs or it doesn't, returns the right value or the wrong one. AI agents are probabilistic: every call has a non-deterministic cost (tokens), a non-deterministic latency (sometimes 200ms, sometimes 14 seconds), and a non-deterministic quality (sometimes correct, sometimes confidently wrong). Standard APM tools (Datadog, New Relic) handle the deterministic shape; LLM observability tools (LangSmith, Helicone, Datadog LLM Observability) add the probabilistic dimension.

Do I really need AI agent observability for a small project?

If you're running a single agent doing a single task and your monthly bill is under $20, no. Skip it. Add it the moment the workflow goes to production for real users or the bill goes above $50/month — whichever comes first. The ROI is almost always faster than teams expect: the first cost-leak you catch typically pays for the instrumentation work itself.

Is observability worth it before I have any agents in production?

Yes — set it up before the first deploy, not after. Adding observability after the fact means months of operating blind while you catch up. The instrumentation is cheap; the cost of not having it the day a customer complains is high. The lethal trifecta describes the security version of this problem — observability is the operational version.


What to read next

The AI agent orchestration guide covers what sits above observability: budget controls, approval gates, and audit trails. The lethal trifecta covers the architectural security version of the same operational discipline. The Paperclip review breaks down the only platform that ships an audit trail by default. And the cost calculator helps you size what your agent should cost before you instrument it, a useful baseline for spotting anomalies.

You built the agent. Now you need to know what it's doing.

About the author

Lucas Powell

Founder, Growth 8020 · Editor, Agent Shortlist

Founder of Growth 8020, an AI-first B2B marketing studio. Editor of Agent Shortlist — the publication he wished existed when his team had to pick AI tools.