AI Agent Observability: What to Monitor and How

Most teams deploy AI agents and skip monitoring entirely. Here's what to instrument, which tools work, and how to find hidden costs fast.

By Lucas Powell · April 29, 2026

Most teams deploy an AI agent, test it a few times, and push it to production. Monitoring is an afterthought — or skipped entirely.

That's fine until the agent does something unexpected. Then you're staring at a blank dashboard trying to reconstruct what happened from memory.

AI agent observability is the discipline of knowing what your agents are doing, why they're doing it, and what it costs. This guide covers what to instrument, which tools exist today, and the minimum viable setup for teams that don't want to build a monitoring platform from scratch.

Why agent observability is different from app monitoring

Traditional application monitoring answers straightforward questions. Did the server return a 500? How long did the database query take? Did the job complete?

Agents are non-deterministic. The same input can produce different outputs. They make decisions — which tool to call, what to include in a prompt, whether to spawn a subagent — and those decisions aren't logged by default. A crash log tells you something broke. It doesn't tell you why the agent took a wrong turn three steps earlier and spent $12 in API calls getting there.

The other difference: agents take real-world actions. They send emails. They write to databases. They make API calls on your behalf. A slow HTTP endpoint is annoying. An agent that misunderstood its task and sent 400 customer emails is a crisis.

You can't debug what you can't see. That's the whole argument for observability. With agents, the stakes for not seeing are higher.

The four things you must instrument

Most agent monitoring setups collect too much or too little. Here are the four signals that actually matter.

1. Token usage per step

Agents are expensive in proportion to how much context they carry. A single run that works in testing might cost 10x more in production once real data inflates the prompt.

Instrument tokens in and tokens out at every step — not just the final output. This is where cost visibility lives. Without it, you'll get a monthly API bill you can't explain or attribute to specific workflows.
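A minimal sketch of what that looks like in Python, assuming an OpenAI-style client whose responses expose a usage object with prompt and completion token counts (field names vary by provider):

import json
import logging

logger = logging.getLogger("agent")

def call_llm_step(client, step_name, model, messages):
    # One LLM call, one log record: tokens in and out attributed to a named step.
    response = client.chat.completions.create(model=model, messages=messages)
    logger.info(json.dumps({
        "step": step_name,
        "model": model,
        "tokens_in": response.usage.prompt_tokens,
        "tokens_out": response.usage.completion_tokens,
    }))
    return response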

Check the token cost calculator before committing to a model stack. The difference between GPT-4o and Claude Haiku on a high-volume agent pipeline can be hundreds of dollars a month for the same workload.

2. Tool call success and failure rates

Most agents fail not because the model hallucinated, but because a tool call failed. The browser scrape timed out. The API returned a 429. The database query returned no rows and the agent didn't handle the empty state.

Track: which tools are called, how often, and what percentage succeed. A tool that fails 15% of the time isn't a fluke — it's a reliability problem that will bite you at scale.
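One lightweight way to get those numbers is a decorator around every tool function. A sketch, with in-memory counters standing in for whatever metrics backend you actually use:

import functools
from collections import defaultdict

tool_calls = defaultdict(int)
tool_failures = defaultdict(int)

def tracked_tool(fn):
    # Count every call and every exception, keyed by tool name.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tool_calls[fn.__name__] += 1
        try:
            return fn(*args, **kwargs)
        except Exception:
            tool_failures[fn.__name__] += 1
            raise
    return wrapper

def failure_rate(tool_name):
    calls = tool_calls[tool_name]
    return tool_failures[tool_name] / calls if calls else 0.0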

This is also where you catch agents that have learned to avoid a broken tool by hallucinating an answer instead. The tool call rate drops. The hallucination rate rises. The success metric stays green. You miss it entirely without this signal.

3. Latency per step

End-to-end latency tells you the pipeline is slow. Latency per step tells you which part is the bottleneck.

In most agent pipelines, latency clusters in one of three places: the first LLM call (large system prompt), a specific tool (external API with rate limits), or a loop that runs more iterations than expected. You can't fix what you can't locate.

Instrument start time and end time at every node. A step that usually takes 2 seconds and suddenly takes 12 is a flag — even if the output looks correct.
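A context manager is enough. A sketch, where emit is a stand-in for whatever log shipper you use:

import time
from contextlib import contextmanager

@contextmanager
def timed_step(emit, step_name):
    # Wrap one pipeline node and record its wall-clock duration in milliseconds.
    start = time.perf_counter()
    try:
        yield
    finally:
        emit({"step": step_name, "duration_ms": int((time.perf_counter() - start) * 1000)})

# Usage:
# with timed_step(emit, "classify_intent"):
#     result = classify_intent(document)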

4. Output quality signals

This is the hardest one and the most important. Token counts and latency are mechanical. Quality is semantic.

The signals that work at scale:

  • Task completion rate — did the agent finish the task it was asked to do, or did it bail out with an error or partial output?
  • Human correction rate — for agents with human review downstream, how often does a reviewer change the output?
  • Confidence flags — some agents can self-report uncertainty. Wire that into your logs.
  • Hallucination triggers — fact-check a sample of outputs against ground truth. Track the rate.

You won't instrument all of these on day one. Start with task completion rate. It's the one that catches the most failure modes.
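If you're already emitting the structured step logs described below, task completion rate falls out of them. A rough sketch, assuming each record carries a trace_id and a pass flag:

from collections import defaultdict

def task_completion_rate(step_records):
    # Group step logs by request; a run counts as complete only if every step passed.
    runs = defaultdict(list)
    for record in step_records:
        runs[record["trace_id"]].append(record)
    completed = sum(1 for steps in runs.values() if all(s["pass"] for s in steps))
    return completed / len(runs) if runs else 0.0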


What a good observability setup looks like

Three components. You need all three.

Structured logs with trace IDs. Every agent action should emit a JSON log with: trace ID (unique per user request), step name, model used, tokens in/out, tool calls made, duration, and a pass/fail flag. The trace ID is what lets you follow a request through every agent hop in a multi-agent pipeline. Without it, you can't correlate a downstream failure to the upstream decision that caused it.
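Within a single process, Python's contextvars is one way to make the trace ID available to every step without threading it through each function signature. A sketch, with emit standing in for your log shipper:

import uuid
from contextvars import ContextVar

current_trace_id: ContextVar[str] = ContextVar("trace_id", default="")

def start_trace() -> str:
    # Call once per incoming user request; every downstream log picks up this ID.
    trace_id = str(uuid.uuid4())
    current_trace_id.set(trace_id)
    return trace_id

def log_step(emit, **fields):
    emit({"trace_id": current_trace_id.get(), **fields})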

Human-readable decision summaries. Raw logs are for machines. Engineers debugging an issue need to read what the agent decided and why in plain language. Some frameworks support reasoning traces out of the box. If yours doesn't, add a summary field to your logs — have the agent narrate its decision in a sentence or two before executing.

Cost dashboards. Token costs per workflow, per agent, per model. Updated daily. With alerts for anomalous spend. If a workflow's cost doubles overnight, you want to know before the billing cycle closes.
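The alert itself can be crude and still useful. A sketch that compares today's spend for a workflow against its trailing average, with placeholder per-1K-token prices you'd replace with your provider's actual rates:

def workflow_cost(step_records, price_in_per_1k, price_out_per_1k):
    # Dollar cost of a batch of step logs at the given per-1K-token prices.
    return sum(
        r["tokens_in"] / 1000 * price_in_per_1k + r["tokens_out"] / 1000 * price_out_per_1k
        for r in step_records
    )

def spend_is_anomalous(todays_cost, trailing_average, threshold=2.0):
    # Fire an alert when a workflow costs more than `threshold` times its usual day.
    return todays_cost > threshold * trailing_average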


Tools that exist today

You don't have to build this from scratch. Several tools handle the heavy lifting.

LangSmith

LangChain's observability layer. Works with any LLM application, not just LangChain-based ones. Captures every LLM call, tool call, and chain step automatically. Gives you trace views, cost tracking, and a dataset for evaluating outputs over time.

LangSmith's strength: it's comprehensive and the free tier is generous. Its weakness: it's cloud-hosted by LangChain. If your agents process sensitive data, check whether that fits your compliance requirements.

Paperclip's audit trail

Paperclip has an immutable decision log built into the platform. Every agent decision, action, and tool call is recorded and linked to the agent that made it. You can reconstruct exactly what happened in a multi-agent workflow — which agent did what, in what order, and what the inputs and outputs were at each step.

This is particularly useful for multi-agent orchestration scenarios where a chain of agents produces a final output and you need to trace a problem back to its source. The audit trail makes that reconstruction straightforward.

Paperclip is open-source and self-hosted, which means the audit data stays on your infrastructure.

OpenTelemetry

If you're building a custom agent framework or your agents are embedded in a larger application, OpenTelemetry is the standard for custom instrumentation. You instrument your agent code with spans and export traces to your existing observability stack — Honeycomb, Jaeger, Grafana Tempo.

The work is upfront. Once instrumented, you get trace data that integrates cleanly with the rest of your application monitoring, which is useful when you need to correlate agent behaviour with downstream system health.
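Here's a sketch of a single instrumented step, assuming you've already configured an OpenTelemetry tracer provider and exporter. The attribute names loosely follow the GenAI semantic conventions; check the current spec before standardising on them. The run_model call and its result fields are placeholders for your own code.

from opentelemetry import trace

tracer = trace.get_tracer("agent-pipeline")

def classify_intent(document):
    # One span per agent step; attributes carry the same fields as a structured log.
    with tracer.start_as_current_span("classify_intent") as span:
        result = run_model(document)  # placeholder for your actual LLM call
        span.set_attribute("gen_ai.usage.input_tokens", result.tokens_in)
        span.set_attribute("gen_ai.usage.output_tokens", result.tokens_out)
        span.set_attribute("agent.step.pass", result.ok)
        return result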

Datadog and Honeycomb

If your team already uses Datadog or Honeycomb for APM, you can push agent logs there. Both tools can ingest structured JSON, build dashboards on token costs and latency, and alert on anomalies.

This isn't a native LLM observability solution — you'll build the queries yourself — but for teams with existing tooling it's faster than adopting a new platform. The cost dashboard you need can be built in an afternoon with structured logs and a basic SQL-style query.

Compare model costs across providers before choosing a stack — your observability approach should account for switching models if you find a cheaper option for a given step.


The minimum viable observability setup

For teams that want to ship something today, not build a monitoring platform.

Emit this JSON object at every agent step:

{
  "trace_id": "uuid-per-user-request",
  "step": "classify_intent",
  "model": "claude-haiku-3-5",
  "tokens_in": 1240,
  "tokens_out": 83,
  "tools_called": ["search_crm"],
  "tool_success": true,
  "duration_ms": 1820,
  "pass": true
}

That's it. Nine fields. Ship them to any log platform — CloudWatch, Datadog, Logtail, or even a flat file. Build a single dashboard showing cost by step and success rate by tool.
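A minimal emitter in Python; the function name and the print-to-stdout destination are placeholders for whatever log platform you ship to:

import json
import sys

def emit_step(trace_id, step, model, tokens_in, tokens_out,
              tools_called, tool_success, duration_ms, ok):
    # One JSON line per agent step; the record's "pass" field is named `ok`
    # here only because pass is a Python keyword.
    record = {
        "trace_id": trace_id,
        "step": step,
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tools_called": tools_called,
        "tool_success": tool_success,
        "duration_ms": duration_ms,
        "pass": ok,
    }
    print(json.dumps(record), file=sys.stdout)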

This takes one afternoon to implement. It catches 80% of production issues. It's $0 to run if you're already paying for a log platform.

Start here. Add LangSmith or OpenTelemetry when you need deeper trace visualisation or output evaluation.


The ROI is immediate

One team running a document processing pipeline found that 40% of their monthly agent API budget was being spent on a single classification step — a call to GPT-4o that was determining document type. The actual classification task required no reasoning. A swap to a smaller, cheaper model cost them two hours of work and cut their monthly bill by a third.

They only found this because they started logging tokens per step. Without it, the budget line was "AI API costs." With it, it was "GPT-4o on classification: $340/month."

Observability doesn't just tell you when things break. It tells you where you're overpaying.

Most teams skip it because it feels like infrastructure work rather than product work. It is infrastructure work. But it pays for itself the first time you find a misattributed cost, a broken tool call you didn't know was failing, or an agent spending 14 seconds on a step that should take two.

You built the agent. Now you need to know what it's doing.

About the author

Lucas Powell

Founder, Growth 8020

Founder of Growth 8020. Started Agent Shortlist as the publication he wished existed when his team had to pick AI tools.