LLM observability — why production AI dies without it
What observability means for LLM-based systems, the specific signals to monitor, and why teams that skip observability ship AI products that silently degrade.
Why LLM systems fail silently
Traditional software fails loudly. An exception throws, a 500 returns, a log shows a stack trace. You know something's wrong quickly.
LLM systems fail quietly. The model returns plausible-looking but wrong output. Retrieval misses relevant context. A prompt injection slips through. The user gets a wrong answer and either doesn't notice or complains days later when enough wrong answers have accumulated. By then the data is dirty, trust is eroded, and debugging requires archaeology.
This is why LLM observability matters more than traditional observability — the failure modes are subtler and the cost of delayed detection is higher.
What to log for every LLM interaction
Every LLM call in production should log:
- Input: the user query or triggering event, before prompt templating
- Prompt: the full prompt sent to the model, including system prompt + retrieved context + user query
- Retrieval context: what was retrieved, from where, with what relevance scores
- Model metadata: provider, model name, model version, temperature, max tokens
- Tool calls: if the model invoked functions/tools, which ones and what they returned
- Output: the full model response
- Token usage: input and output tokens (for cost and latency analysis)
- Latency: end-to-end and per stage (retrieval, LLM, tool calls)
- Cost: computed from token usage and pricing
- Session/user context: trace ID, user ID, session ID for correlation
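In practice this collapses into one structured record per call. Here is a minimal sketch in Python, with made-up field values and a plain JSON emitter standing in for whatever observability backend you actually ship to:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class LLMInteractionLog:
    """One structured record per LLM call, covering the fields listed above."""
    trace_id: str
    user_id: str
    session_id: str
    user_input: str               # raw input, before prompt templating
    prompt: str                   # full prompt actually sent to the model
    retrieved_chunks: list[dict]  # [{"source": ..., "score": ...}, ...]
    provider: str
    model: str
    model_version: str
    temperature: float
    max_tokens: int
    tool_calls: list[dict]        # [{"name": ..., "arguments": ..., "result": ...}]
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: dict              # {"retrieval": ..., "llm": ..., "tools": ..., "total": ...}
    cost_usd: float
    timestamp: float = field(default_factory=time.time)

def emit(record: LLMInteractionLog) -> None:
    # Ship as JSON to stdout / a log pipeline; swap for your observability client.
    print(json.dumps(asdict(record)))

# Hypothetical usage with illustrative values:
emit(LLMInteractionLog(
    trace_id=str(uuid.uuid4()), user_id="u-42", session_id="s-7",
    user_input="What is our refund policy?",
    prompt="[system prompt + retrieved context + user query]",
    retrieved_chunks=[{"source": "policies.md", "score": 0.82}],
    provider="openai", model="gpt-4o", model_version="2024-08-06",
    temperature=0.2, max_tokens=1024,
    tool_calls=[], output="Refunds are available within 30 days...",
    input_tokens=812, output_tokens=164,
    latency_ms={"retrieval": 120, "llm": 1840, "tools": 0, "total": 1990},
    cost_usd=0.0041,
))
```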
This data answers the questions that come up during debugging:
- Why did the agent say X?
- What context was it using?
- Did we change anything recently that affected quality?
- Is latency or cost drifting?
What to monitor continuously
Beyond logging, specific metrics need continuous monitoring:
Accuracy (golden set regression)
Maintain a golden set of inputs with expected outputs. Run it on every deployment, then daily in production. Alert on regression greater than a threshold (e.g., 5% drop in accuracy).
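A minimal sketch of that gate, assuming a hypothetical run_agent callable and a crude substring match as the scorer (real scoring is usually fuzzier, often judge-based):

```python
from typing import Callable

# Golden set: inputs with expected outputs (illustrative entries).
GOLDEN_SET = [
    {"input": "What is the refund window?", "expected": "30 days"},
    {"input": "Which plan includes SSO?", "expected": "Enterprise"},
]

REGRESSION_THRESHOLD = 0.05  # alert on a >5% absolute drop vs. the last known-good run

def golden_set_accuracy(run_agent: Callable[[str], str]) -> float:
    correct = sum(
        1 for case in GOLDEN_SET
        if case["expected"].lower() in run_agent(case["input"]).lower()
    )
    return correct / len(GOLDEN_SET)

def check_regression(current: float, baseline: float) -> None:
    if baseline - current > REGRESSION_THRESHOLD:
        # Wire this to your real alerting channel (PagerDuty, Slack, ...).
        raise RuntimeError(f"Golden set regression: {baseline:.1%} -> {current:.1%}")
```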
Retrieval recall
For retrieval-grounded agents, monitor whether expected documents are being retrieved for queries where ground truth exists. Drops in retrieval recall usually precede accuracy drops.
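Recall@k over labelled queries is enough to catch most of these drops. A small sketch:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of known-relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find; treat as full recall
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Aggregate over queries that have ground-truth labels and alert on drops,
# just like the golden-set accuracy check above.
```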
Hallucination rate
Measured by asking the agent questions where you know the answer and checking if the response is grounded in source material or fabricated. Can be automated with a judge LLM for scale.
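A rough sketch of the judge-LLM approach, where call_judge_model stands in for whatever client you use and the prompt wording is only illustrative:

```python
JUDGE_PROMPT = """You are grading an AI answer for groundedness.
Source material:
{sources}

Question: {question}
Answer: {answer}

Reply with exactly one word: GROUNDED if every claim in the answer is supported
by the source material, otherwise FABRICATED."""

def is_grounded(question: str, answer: str, sources: str, call_judge_model) -> bool:
    # call_judge_model is a placeholder: takes a prompt string, returns the model's text.
    verdict = call_judge_model(
        JUDGE_PROMPT.format(sources=sources, question=question, answer=answer)
    )
    return verdict.strip().upper().startswith("GROUNDED")

# Hallucination rate = fraction of sampled interactions where is_grounded(...) is False.
```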
Latency tails
P50, P95, P99 latency. LLM latency is often bimodal — most requests fast, some slow. Track the tails.
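Computing those percentiles from a window of per-request latencies takes a few lines of stdlib Python (assumes at least a handful of samples):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 1st..99th percentiles; index 49 -> P50, 94 -> P95, 98 -> P99.
    q = statistics.quantiles(latencies_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```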
Cost per interaction
Total cost / total interactions over rolling windows. Alert on significant increases (often indicates prompt inflation or retrieval bloat).
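A sketch of the rolling-window calculation plus a simple drift check; the 1.5x threshold is an illustrative default, not a recommendation:

```python
def cost_per_interaction(events: list[tuple[float, float]], window_s: float, now: float) -> float:
    """events: (timestamp, cost_usd) pairs; average cost over the rolling window."""
    recent = [cost for ts, cost in events if now - ts <= window_s]
    return sum(recent) / max(len(recent), 1)

def check_cost_drift(current: float, baseline: float, max_ratio: float = 1.5) -> bool:
    """True if cost per interaction has grown past the alert threshold (e.g. +50%)."""
    return baseline > 0 and current / baseline > max_ratio
```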
Escalation rate (if human-in-the-loop)
Rate at which agent decisions are overridden by humans. Increasing escalation often signals quality degradation.
Error rate
LLM API errors, retrieval errors, tool-call errors. Each tracked separately because the remediation differs.
Tools that work
Several mature tools handle LLM observability:
- Langfuse — open-source, self-hostable, strong on tracing. Good default for teams that want ownership.
- Helicone — proxy-based, very easy setup, good analytics UI.
- LangSmith — tightly coupled with LangChain. Good if you're already on LangChain.
- OpenTelemetry + custom exporters — if you have existing OTEL infrastructure, extending it for LLM observability makes sense.
- Arize / WhyLabs — more enterprise, focused on ML observability including LLMs.
We default to Langfuse for most client engagements. It's open-source, self-hostable (important for data residency), and covers 90% of what teams need without the enterprise overhead.
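To give a flavour of what instrumentation looks like, here is a rough sketch against the Langfuse v2 Python SDK's low-level tracing API; retrieve, build_prompt, and call_model are placeholders for your own code, and the API has evolved across major versions, so check the current docs before copying this:

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

def answer(query: str, user_id: str, session_id: str) -> str:
    trace = langfuse.trace(name="support-agent", user_id=user_id, session_id=session_id)

    # Retrieval as its own span, so retrieval latency and output are visible per trace.
    retrieval = trace.span(name="retrieval", input={"query": query})
    chunks = retrieve(query)                    # placeholder: your retrieval function
    retrieval.end(output={"chunks": [c["source"] for c in chunks]})

    # LLM call as a generation, with model metadata and token usage attached.
    generation = trace.generation(
        name="answer",
        model="gpt-4o",
        model_parameters={"temperature": 0.2, "max_tokens": 1024},
        input=build_prompt(query, chunks),      # placeholder: your prompt templating
    )
    completion = call_model(query, chunks)      # placeholder: your model client
    generation.end(
        output=completion["text"],
        usage={"input": completion["input_tokens"], "output": completion["output_tokens"]},
    )
    return completion["text"]
```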
Observability drives improvement
With good observability, improvement cycles become routine:
- Monitor catches a quality drop on Monday
- Engineering reviews traces, finds that retrieval is returning fewer relevant chunks for certain queries
- Reranking parameters are tuned in staging
- A/B test against production shows improvement
- Rollout, with observability catching any regression
Without observability, this cycle doesn't happen. Quality degrades; nobody notices for weeks; by then debugging is archaeology rather than engineering.
Common mistakes
1. Logging only inputs and outputs. Missing retrieval, tool calls, and metadata makes debugging impossible.
2. Not sampling at scale. At high volume, log 100% of inputs/outputs but sample detailed traces (e.g., 10%). Cost and storage become issues without sampling.
3. No PII handling. LLM inputs often contain PII. Logging must respect privacy: redact, tokenize, or exclude PII before it hits observability storage (see the sketch after this list).
4. No production/staging separation. Staging and production observability should be separate to avoid polluting metrics with test data.
5. Alert fatigue. Too many alerts → ignored alerts. Tune thresholds and route alerts to the right people.
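For point 3, a minimal redaction sketch that runs before anything reaches observability storage; the patterns are illustrative only, and a real deployment should lean on a dedicated PII detection library or tokenization service:

```python
import re

# Illustrative patterns only; real systems should use dedicated PII detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

# Apply to user input, prompts, and outputs before they reach the log pipeline:
# record.user_input = redact_pii(record.user_input)
```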
Conclusion
LLM observability is the difference between production AI that improves over time and production AI that silently degrades. It's boring infrastructure work that pays back enormously. If you're shipping anything AI-powered to production, invest in observability on day one.
If you're running production AI without observability and want to fix that, talk to us.
Related reading: Ten agentic AI deployments · Real cost of enterprise AI · Build vs buy for AI agents