AI Observability
How you know if your AI feature is working in production. The single most-underbuilt layer in AI products in 2026.
Without observability, you ship AI features and have no idea if they're working. Drift, regressions, edge cases, cost surprises: all invisible. The teams that build observability early ship reliable AI; those that don't ship demos that break in production.
AI observability means logging every model call, tracing multi-step agent runs, sampling outputs for quality, monitoring cost and latency, and alerting on anomalies. Tools such as LangSmith, Helicone, Datadog AI, and Arize are emerging. The PM specifies what to observe; engineering builds the infrastructure.
What to observe
Per call:
- Input tokens, output tokens, latency, cost
- Model used, prompt version, temperature
- User ID, session ID, feature
- Output (sampled for review)
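The per-call fields above can be sketched as one structured log record. This is a minimal illustration, not any vendor's schema: `CallRecord`, `build_record`, `to_log_line`, the `example-model` name, and the prices in `PRICE_PER_MTOK` are all hypothetical placeholders.

```python
import json
import random
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical prices in USD per 1M tokens; replace with your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class CallRecord:
    model: str
    prompt_version: str
    temperature: float
    user_id: str
    session_id: str
    feature: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    output_sample: Optional[str]  # filled in only when the call is sampled for review


def build_record(*, model, prompt_version, temperature, user_id, session_id,
                 feature, input_tokens, output_tokens, latency_ms, output_text,
                 sample_rate=0.05, rng=random.random):
    """Build one log record; keep the raw output for only a fraction of calls."""
    price = PRICE_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    keep_output = rng() < sample_rate
    return CallRecord(
        model=model, prompt_version=prompt_version, temperature=temperature,
        user_id=user_id, session_id=session_id, feature=feature,
        input_tokens=input_tokens, output_tokens=output_tokens,
        latency_ms=latency_ms, cost_usd=round(cost, 6),
        output_sample=output_text if keep_output else None,
    )


def to_log_line(record):
    """Serialize the record as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)
```

Passing `rng` as a parameter keeps the sampling decision testable; in production the default `random.random` is enough.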
Per session / agent run:
- Number of model calls
- Total cost, total latency
- Tool calls (which, in what order)
- Success / failure status
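A per-run trace can be as simple as an ordered list of steps that rolls up into the fields listed above. The sketch below uses hypothetical names (`Step`, `RunTrace`) and assumes steps run sequentially, so total latency is a plain sum; parallel tool calls would need wall-clock timestamps instead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    kind: str            # "model" or "tool"
    name: str            # model name or tool name
    latency_ms: float
    cost_usd: float = 0.0
    ok: bool = True


@dataclass
class RunTrace:
    run_id: str
    steps: List[Step] = field(default_factory=list)

    def add(self, step):
        """Append a step in the order it happened."""
        self.steps.append(step)

    def summary(self):
        """Roll the steps up into the per-run fields."""
        failed = [s.name for s in self.steps if not s.ok]
        return {
            "run_id": self.run_id,
            "model_calls": sum(1 for s in self.steps if s.kind == "model"),
            "tool_calls": [s.name for s in self.steps if s.kind == "tool"],
            "total_cost_usd": round(sum(s.cost_usd for s in self.steps), 6),
            "total_latency_ms": sum(s.latency_ms for s in self.steps),
            "status": "failure" if failed else "success",
            "first_failed_step": failed[0] if failed else None,
        }
```

Recording `first_failed_step` is what makes a stuck agent quick to diagnose: the summary points at the broken tool without anyone reading the full trace.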
Aggregate:
- p50, p95, p99 latency by feature
- Cost per user per day
- Error rate, hallucination rate (from eval sampling)
- Quality drift (eval scores over time)
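The latency and cost aggregates above can be computed directly from per-call records. This sketch assumes each record is a dict with the hypothetical keys `feature`, `latency_ms`, `user_id`, `day`, and `cost_usd`, and it uses the nearest-rank percentile definition; dashboards and warehouses may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_feature(records):
    """Return {feature: {"p50": ..., "p95": ..., "p99": ...}}."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["feature"]].append(r["latency_ms"])
    return {
        feature: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for feature, vals in grouped.items()
    }


def cost_per_user_per_day(records):
    """Return {(user_id, day): total cost in USD}."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["user_id"], r["day"])] += r["cost_usd"]
    return dict(totals)
```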
The tools (2026)
- LangSmith. Best for LangChain-based apps; tracing, evaluation, dataset management.
- Helicone. Lightweight observability for LLM apps. Easy to set up.
- Arize Phoenix. Open-source observability + evals.
- Datadog AI Observability. For teams already on Datadog.
- Custom. Many teams roll their own: logs to S3, queries in Athena/BigQuery, dashboards in Grafana.
The PM workflows
Daily check. Glance at cost, error rate, latency. Catches anomalies fast.
Weekly review. Quality samples, eval score trends, cost trends. Decide on changes.
Monthly deep-dive. Production failures, eval drift, new use cases emerging.
Quarterly. Cost optimization (can we route more traffic to smaller models? prompt caching wins?), eval suite refresh.
The failure modes observability catches
- Cost blowup. A bug or malicious user spikes token usage 100x. Without monitoring, you find out from a $50K bill at the end of the month.
- Quality drift. Model provider quietly updates the model; outputs degrade. Eval scores drop, and you catch it in week 1 instead of month 3.
- Latency regression. Long-tail latencies creep up and user experience degrades. p99 monitoring catches it.
- Hallucination spike. The model starts failing on a new use case. Sampling plus alerts catch it.
- Tool failure in agents. Agent gets stuck because one tool is broken. Trace shows immediately.
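Two of these failure modes, cost blowup and quality drift, reduce to simple threshold checks against a baseline. The sketch below is one way to express them; the function names and the default thresholds (`multiplier=5.0`, `min_baseline_usd=1.0`, `max_drop=0.05`) are illustrative assumptions to tune against your own traffic.

```python
from statistics import mean, median


def cost_spike(history_usd, current_usd, multiplier=5.0, min_baseline_usd=1.0):
    """True when current spend exceeds multiplier x the trailing median spend.

    history_usd: spend for previous comparable windows (e.g. the last 24 hours).
    min_baseline_usd: floor so that near-zero history does not trigger noise alerts.
    """
    if not history_usd:
        return False  # no baseline yet, so nothing to compare against
    baseline = max(median(history_usd), min_baseline_usd)
    return current_usd > multiplier * baseline


def quality_drift(baseline_scores, recent_scores, max_drop=0.05):
    """True when the mean eval score fell by more than max_drop versus baseline."""
    if not baseline_scores or not recent_scores:
        return False
    return mean(baseline_scores) - mean(recent_scores) > max_drop
```

The median is used for the cost baseline so that one earlier spike in the history does not raise the bar for detecting the next one.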
What good looks like
The mature AI PM team has:
- Every model call logged
- Every agent run traced
- Production sampling continuously evaluated
- Dashboards visible to PMs and engineers
- Alerts on anomalies
- Weekly review of trends
Teams without this: ship and hope. Often discover problems weeks late, when users complain.
Real-world examples
By mid-2026, AI observability has moved from 'nice to have' to 'table stakes' for production AI products. Teams ship with logging, tracing, and sampling on day 1. The discipline correlates with which AI products feel reliable in production vs. which ones quietly break.
Interview questions (1)
Q1. What observability would you want for an AI agent in production? (ai-pm, senior)
Three layers.
Per-call layer. Every model call logged: tokens, latency, cost, model used, prompt version, output (sampled). Lets you debug specific failures.
Per-agent-run layer. Trace of the whole agent loop: tool calls in order, intermediate reasoning, total tokens, final outcome. Critical for agents because the failure mode is usually 'tool X returned weird data and agent went off the rails.'
Aggregate layer. Dashboards showing: p50/p95/p99 latency by feature, cost per user per day, error rate, eval quality trend. Alerts on anomalies (cost spike, latency regression, error rate above threshold).
Plus: production sampling. 1-5% of agent runs get re-evaluated by the eval suite continuously. Catches quality drift between releases.
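One way to pick the 1-5% of runs is to hash the run ID, so the decision is stable across services and replays. This is a sketch under that assumption; `in_eval_sample` and the default rate of 2% are hypothetical, not a feature of any listed tool.

```python
import hashlib


def in_eval_sample(run_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of runs for re-evaluation.

    The same run_id always gets the same answer, so every service that sees
    the run agrees on whether it is in the sample.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)
```

Compared with calling a random number generator per run, hashing makes the sample reproducible: rerunning the eval job over the same logs selects the same runs.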
The tools: LangSmith if we're on LangChain, Helicone or custom for lightweight setups, Datadog AI for teams already on Datadog.
PM workflow: daily glance at the dashboard, weekly review of trends, monthly deep-dive on failures and cost optimization. The discipline catches problems before users do, which is the difference between a reliable AI product and one that quietly breaks.