📡 AI Observability

How you know whether your AI feature is working in production. The single most underbuilt layer in AI products in 2026.

Why it matters

Without observability, you ship AI features and have no idea whether they're working. Drift, regressions, edge-case failures, and cost surprises all stay invisible. Teams that build observability early ship reliable AI; teams that don't ship demos that break in production.

The core idea

AI observability = logging every model call, tracing multi-step agent runs, sampling outputs for quality, monitoring cost and latency, and alerting on anomalies. Tools like LangSmith, Helicone, Datadog LLM Observability, and Arize are emerging. The PM specifies what to observe; engineering builds the infrastructure.

What to observe

Per call:

Input tokens, output tokens, latency, cost
Model used, prompt version, temperature
User ID, session ID, feature
Output (sampled for review)
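
A minimal sketch of what one per-call log record could look like, covering the fields listed above. The `ModelCallRecord` name, the field names, the `example-model` key, and the prices are all illustrative assumptions, not any vendor's schema or real pricing.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class ModelCallRecord:
    model: str
    prompt_version: str
    temperature: float
    input_tokens: int
    output_tokens: int
    latency_ms: float
    user_id: str
    session_id: str
    feature: str
    output_sample: Optional[str] = None  # filled in only for calls sampled for review

    @property
    def cost_usd(self) -> float:
        """Cost derived from token counts and the (assumed) price table."""
        price = PRICE_PER_MTOK[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000

    def to_json(self) -> str:
        """Serialize to one JSON line, including the derived cost."""
        row = asdict(self)
        row["cost_usd"] = round(self.cost_usd, 6)
        return json.dumps(row)
```

One JSON line per call is a convenient shape because it can be appended to a log file and queried later by user, session, feature, or prompt version.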

Per session / agent run:

Number of model calls
Total cost, total latency
Tool calls (which, in what order)
Success / failure status
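
The per-run fields above can be derived from an ordered list of steps. The sketch below is one hedged way to do that; `Step`, `AgentRunTrace`, and the field names are hypothetical, and total latency is computed as a simple sum, which assumes the steps ran one after another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    kind: str          # "model" or "tool"
    name: str          # model name or tool name
    cost_usd: float
    latency_ms: float
    ok: bool = True


@dataclass
class AgentRunTrace:
    run_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        """Append a step in the order it happened."""
        self.steps.append(step)

    def summary(self) -> dict:
        """Roll the ordered steps up into the per-run metrics."""
        return {
            "run_id": self.run_id,
            "model_calls": sum(1 for s in self.steps if s.kind == "model"),
            "tool_calls": [s.name for s in self.steps if s.kind == "tool"],
            "failed_steps": [s.name for s in self.steps if not s.ok],
            "total_cost_usd": sum(s.cost_usd for s in self.steps),
            "total_latency_ms": sum(s.latency_ms for s in self.steps),
            "status": "success" if all(s.ok for s in self.steps) else "failure",
        }
```

Keeping the tool calls as an ordered list, rather than a count, is what lets a reviewer see which tool failed and what the agent did next.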

Aggregate:

p50, p95, p99 latency by feature
Cost per user per day
Error rate, hallucination rate (from eval sampling)
Quality drift (eval scores over time)
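
As an illustration of the latency aggregate above, here is a small sketch that computes p50, p95, and p99 per feature using the nearest-rank definition of a percentile. The function names and the `(feature, latency_ms)` input shape are assumptions; a real dashboard tool may use a different percentile method and give slightly different numbers.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_feature(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (feature, latency_ms) pairs and report p50/p95/p99 for each feature."""
    by_feature: Dict[str, List[float]] = defaultdict(list)
    for feature, latency_ms in records:
        by_feature[feature].append(latency_ms)
    return {
        feature: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for feature, vals in by_feature.items()
    }
```

Reporting by feature matters because a healthy overall p99 can hide one feature whose long tail is getting worse.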

The tools (2026)

LangSmith. Best for LangChain-based apps; tracing, evaluation, dataset management.
Helicone. Lightweight observability for LLM apps. Easy to set up.
Arize Phoenix. Open-source observability + evals.
Datadog LLM Observability. For teams already on Datadog.
Custom. Many teams roll their own — logs to S3, queries in Athena/BigQuery, dashboards in Grafana.

The PM workflows

Daily check. Glance at cost, error rate, latency. Catches anomalies fast.

Weekly review. Quality samples, eval score trends, cost trends. Decide on changes.

Monthly deep-dive. Production failures, eval drift, new use cases emerging.

Quarterly. Cost optimization (can we route more traffic to smaller models? prompt caching wins?), eval suite refresh.

The failure modes observability catches

Cost blowup. A bug or a malicious user spikes token usage 100x. Without monitoring, you find out from a $50K bill at the end of the month.
Quality drift. The model provider quietly updates the model and outputs degrade. Eval scores drop, so you catch it in week 1 instead of month 3.
Latency regression. Long-tail latencies creep up and user experience degrades. p99 monitoring catches it.
Hallucination spike. A new use case appears where the model is failing. Sampling plus alerts catch it.
Tool failure in agents. The agent gets stuck because one tool is broken. The trace shows it immediately.
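
The cost-blowup case above can be caught with a very simple rule: compare current spend against a trailing baseline. The sketch below is one hedged way to express that; the `cost_spike` name, the 5x ratio, and the $1 floor are illustrative assumptions to tune against your own traffic, not recommended values.

```python
from statistics import median
from typing import Sequence


def cost_spike(history_usd: Sequence[float], current_usd: float,
               ratio_threshold: float = 5.0, min_usd: float = 1.0) -> bool:
    """Return True when current spend is at least ratio_threshold times the trailing median.

    history_usd: previous daily spend for the same user or feature.
    min_usd: absolute floor, so tiny amounts never trigger an alert.
    """
    if current_usd < min_usd:
        return False
    baseline = median(history_usd) if history_usd else 0.0
    if baseline <= 0:
        # No usable baseline: any spend above the floor is worth a look.
        return True
    return current_usd / baseline >= ratio_threshold
```

The median is used instead of the mean so that one earlier spike in the history does not raise the baseline and mask the next one.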

What good looks like

The mature AI PM team has:

Every model call logged
Every agent run traced
Production sampling continuously evaluated
Dashboards visible to PMs and engineers
Alerts on anomalies
Weekly review of trends

Teams without this ship and hope, and often discover problems weeks late, when users complain.

Real-world examples

Anthropic / OpenAI customers

Observability as table stakes

By mid-2026, AI observability has moved from 'nice to have' to 'table stakes' for production AI products. Teams ship with logging, tracing, and sampling on day 1. The discipline correlates with which AI products feel reliable in production vs. which ones quietly break.

Go deeper — recommended reading

AI PM's Ultimate Guide: Observability

Aakash Gupta · Product Growth

Interview questions (1)

What observability would you want for an AI agent in production?

Three layers.

Per-call layer. Every model call logged: tokens, latency, cost, model used, prompt version, output (sampled). Lets you debug specific failures.

Per-agent-run layer. Trace of the whole agent loop — tool calls in order, intermediate reasoning, total tokens, final outcome. Critical for agents because the failure mode is usually 'tool X returned weird data and agent went off the rails.'

Aggregate layer. Dashboards showing: p50/p95/p99 latency by feature, cost per user per day, error rate, eval quality trend. Alerts on anomalies (cost spike, latency regression, error rate above threshold).

Plus production sampling: 1-5% of agent runs are scored by the eval suite continuously, which catches quality drift between releases.
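
One way to pick the sampled runs is to hash the run ID, so the decision is stable across services and retries. This is a sketch under assumptions: the `in_eval_sample` name and the 2% default are hypothetical, and the default is simply one point inside the 1-5% range mentioned above.

```python
import hashlib


def in_eval_sample(run_id: str, rate: float = 0.02) -> bool:
    """Deterministically decide whether a run belongs to the eval sample.

    The same run_id always gives the same answer, and roughly `rate`
    of all run IDs are selected.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(rate * 2**64)            # compare against the rate's share of that range
```

Hash-based sampling avoids storing a separate "was this sampled?" flag, and raising the rate later keeps every previously sampled run in the sample.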

The tools: LangSmith if we're on LangChain, Helicone or custom for lightweight setups, Datadog LLM Observability for teams already on Datadog.

PM workflow: daily glance at the dashboard, weekly review of trends, monthly deep-dive on failures, and a quarterly pass on cost optimization. The discipline catches problems before users do, which is the difference between a reliable AI product and one that quietly breaks.

Related concepts

🤖 AI Agents for PMs

Agents are the dominant AI UX of 2025-26. PMs who can design and ship agentic products have a defensible career skill.

📊 Evals: The FAQ Every AI PM Needs

Evals are how you know if your AI product actually works. The single most-skipped discipline by junior AI teams.