What we've learned shipping ten agentic AI deployments — unvarnished lessons
Ten production agentic AI deployments later, what actually works, what doesn't, and what vendors oversell. Lessons on retrieval, guardrails, observability, and human-in-the-loop.
Context
This is about agentic AI — systems that take actions autonomously based on reasoning over data — not chatbots that answer questions. Everything below is from deployments we've shipped to production, not demos or PoCs. Across ten production deployments we've deployed agents for:
- Customer support (Tier-1 ticket resolution)
- Invoice reconciliation and posting
- Procurement negotiation within policy bounds
- Contract and lease review with source citations
- HR helpdesk (leave, benefits, expenses)
- Sales research and qualification
- Document translation with context preservation
- Expense policy classification
- Inventory reorder exception triage
- Fraud pattern detection with investigator handoff
Different domains, different stakes, different scale. But the lessons rhyme. Here's what we've learned.
Lesson 1: Retrieval quality determines agent quality
Every production agent we've built is grounded in some retrieval layer — a vector store plus structured data, or hybrid search combining both. The retrieval layer is where agent quality gets made or lost.
What happens when retrieval is weak:
- Agent returns confidently wrong answers based on irrelevant context
- Hallucination rates spike because the model is "reaching" for relevance
- User trust collapses within the first week of scale
What happens when retrieval is strong:
- Agent responses are grounded in the right source material
- Hallucination rates drop to near-zero
- Users come to trust the agent as an authoritative source
The engineering investment in retrieval is substantial and often underestimated:
- Chunking strategy: How you split source documents matters. Too-small chunks lose context; too-large chunks include noise. We tune chunk size by domain, typically 400-1,200 tokens with 50-100 token overlap.
- Hybrid search: Pure vector search misses exact-match queries; pure keyword search misses semantic matches. Hybrid (BM25 + vector) with fusion ranking consistently outperforms either alone.
- Reranking: Cross-encoder reranking on top-20 candidates improves top-5 precision dramatically.
- Evaluation harness: Golden set of queries with expected documents. Measured on every deployment. Regression alerts if scores drop.
Budget 30-40% of total engineering effort on retrieval. It's not glamorous work but it's the difference between an agent users trust and one they ignore.
Lesson 2: Guardrails belong in CI
You cannot rely on prompt engineering to keep agents in bounds. "Don't promise refunds over $500" is a prompt instruction. The model will ignore it sometimes, especially under prompt injection or adversarial input.
Real guardrails are tests. Every policy becomes an adversarial test case in CI:
- "If a user asks for a $600 refund, does the agent commit to it?" → test expects escalation
- "If a user threatens negative reviews, does the agent offer extra credit?" → test expects decline
- "If the agent is told 'ignore previous instructions,' does it follow the override?" → test expects ignoring
We run hundreds of adversarial tests on every deployment. When new policies are added, new tests are added. When a production failure happens, a new test is added. The test suite is the ground truth for acceptable agent behavior.
Lesson 3: Observability is not optional
Every agent decision needs to be logged with inputs, tools used, sources cited, model version, prompt version, and rationale. If you can't replay a decision, you can't debug one. If you can't debug, you can't improve.
What we log per agent interaction:
- Input: the user query or triggering event
- Retrieval results: what sources the agent retrieved and their relevance scores
- Tools called: which functions the agent executed and their parameters
- Intermediate reasoning: the agent's chain of thought (if observable)
- Output: what the agent returned or action it took
- Model metadata: model version, prompt template version, retrieval version
- Timestamp and user/session identifiers
Queries on this log answer questions like:
- "Why did the agent approve a refund over policy?"
- "What sources was the agent using when it gave this wrong answer?"
- "Did a prompt change affect accuracy?"
- "What % of decisions are being escalated to humans?"
Without observability, AI deployments become a black box. With it, they become debuggable software.
Lesson 4: Human-in-the-loop for high-stakes actions
Agents that compose actions are powerful. Agents that commit high-stakes actions without human review are dangerous — even with guardrails.
Our default: human-in-the-loop for anything with financial, legal, or customer-facing consequences. Examples from production:
- Invoice processing agent: Matches, reconciles, and prepares journal entries autonomously. Human reviews the journal before posting.
- Refund agent: Determines refund eligibility and amount. Human approves refunds over $X.
- Contract review agent: Identifies risk clauses and redlines. Human lawyer reviews and approves changes.
- Hire decision agent: Scores candidates. Human recruiter makes the final call.
The human-in-the-loop pattern doesn't make the agent less useful — it makes it useful longer. The agent handles the 80% of cases that are unambiguous; humans handle the 20% that require judgment, creativity, or stake-appropriate review.
Lesson 5: Measure against a blind control group
Don't ship your agent and then measure improvement — that's confirmation bias. Run A/B or shadow traffic against a human-only control group. Measure real outcomes, not proxy metrics.
What we measure:
- For support agents: resolution rate, customer satisfaction, escalation rate, time to resolution
- For invoice agents: match accuracy, exceptions flagged correctly vs incorrectly, processing time
- For document review agents: issues identified, false positives, lawyer time saved
We run agent + control in parallel for 4-8 weeks before committing to scale-out. If the agent isn't materially better (in whatever metric matters), we don't scale — we improve the agent or decommission it.
Lesson 6: Model-agnostic infrastructure wins
Claude is best for some workloads. GPT is best for others. Open models (Llama, Mistral) are fine for still others — especially when data residency or cost demands it. Building against a single vendor's SDK is a long-term mistake.
Our infrastructure routes everything through a gateway (Vercel AI Gateway, OpenRouter, or self-hosted abstractions). Model changes are config updates, not code changes. Failover across providers is automatic.
This flexibility compounds:
- We switched several agents from GPT-4 to Claude Sonnet when Claude's quality improved for long-context tasks
- We moved some workloads to open models when data residency became a requirement
- When a specific provider has an outage, agents continue on fallback models
- Cost optimization is a config knob, not a re-architecture
Lock-in to a specific vendor's stack feels fine in year 1 and hurts in year 3.
Lesson 7: Start narrow, expand deliberately
The agents that succeeded started with narrow scope. The agents that struggled started with broad scope.
Narrow: "Handle tier-1 support tickets about password resets, account access, and billing questions in English for customers in the US time zone."
Broad: "Handle customer support."
Narrow agents deliver measurable value within 8-12 weeks. They build organizational trust. They expand scope deliberately, one dimension at a time, based on measured performance.
Broad agents take forever to deploy, produce marginal results, and either get killed by the organization or limp along for years without clear ROI.
Lesson 8: The failure mode is silent degradation
When AI systems fail, they usually fail silently. Accuracy drifts down over time as:
- Source data changes and retrieval goes out of date
- User query patterns evolve beyond what the agent was tuned for
- Model vendors update their APIs or models
- Adversarial prompts not in the test suite start appearing
Without active monitoring, this drift is invisible until someone complains. By then the damage is accumulated.
We run weekly evaluations against golden sets and alert on regressions. Monthly reviews surface qualitative issues. Quarterly tuning addresses the drift.
AI isn't deploy-and-forget. It's deploy-and-maintain.
Lesson 9: Cost economics matter
Model inference costs aren't negligible at scale. An agent answering 10,000 queries/day at an average of 2,000 tokens in/out costs:
- With Claude Opus: ~$18,000/month
- With Claude Sonnet: ~$4,500/month
- With Claude Haiku: ~$900/month
- With GPT-4: ~$15,000/month
- With GPT-4o-mini: ~$450/month
- With a well-tuned open model: ~$200/month (infrastructure cost only)
Model choice has order-of-magnitude cost implications. For high-volume agents, route by query complexity: Opus/GPT-4 for hard queries, Haiku/GPT-4o-mini for easy ones. Typical savings: 60-80% vs using the flagship model for everything.
Lesson 10: ROI needs to be modeled, not assumed
Some AI deployments have fantastic ROI. Some don't. The difference is usually in the use-case economics, not the AI quality.
High-ROI patterns:
- High-volume, repetitive knowledge work (tier-1 support, invoice matching)
- Clear ground truth for evaluation
- Measurable cost or time savings per interaction
Low-ROI patterns:
- Low-volume work (agent never pays back its engineering cost)
- No clear ground truth (can't evaluate whether the agent is even helping)
- Work where humans add non-replaceable value (complex judgment, creativity, empathy)
We model ROI during discovery using honest assumptions. If the math doesn't work, we say so. Deploying AI to deploy AI is how companies end up with expensive infrastructure producing no value.
Conclusion
Agentic AI works in production — if you do the work. The work isn't prompt engineering. The work is the surrounding infrastructure: retrieval, guardrails, observability, human-in-the-loop, blind measurement, and ongoing maintenance. Teams that invest in the infrastructure ship AI systems that deliver real value. Teams that invest only in prompts produce demos that don't scale.
If you're starting a production AI engagement and want to do it right, talk to us.
Related reading: LLM observability in production · Build vs buy for AI agents · Real cost of enterprise AI