Lessons From 10 Agentic AI Deployments

Context

This is about agentic AI, systems that take actions autonomously based on reasoning over data, not chatbots that answer questions. Everything below is from deployments we've shipped to production, not demos or PoCs. Across ten production deployments we've deployed agents for:

Customer support (Tier-1 ticket resolution)
Invoice reconciliation and posting
Procurement negotiation within policy bounds
Contract and lease review with source citations
HR helpdesk (leave, benefits, expenses)
Sales research and qualification
Document translation with context preservation
Expense policy classification
Inventory reorder exception triage
Fraud pattern detection with investigator handoff

Different domains, different stakes, different scale. But the lessons rhyme. Here's what we've learned.

Lesson 1: Retrieval quality determines agent quality

Every production agent we've built is grounded in some retrieval layer, a vector store plus structured data, or hybrid search combining both. The retrieval layer is where agent quality gets made or lost.

What happens when retrieval is weak:

Agent returns confidently wrong answers based on irrelevant context
Hallucination rates spike because the model is "reaching" for relevance
User trust collapses within the first week of scale

What happens when retrieval is strong:

Agent responses are grounded in the right source material
Hallucination rates drop to near-zero
Users come to trust the agent as an authoritative source

The engineering investment in retrieval is substantial and often underestimated:

Chunking strategy: How you split source documents matters. Too-small chunks lose context; too-large chunks include noise. We tune chunk size by domain, typically 400-1,200 tokens with 50-100 token overlap.
Hybrid search: Pure vector search misses exact-match queries; pure keyword search misses semantic matches. Hybrid (BM25 + vector) with fusion ranking consistently outperforms either alone.
Reranking: Cross-encoder reranking on top-20 candidates improves top-5 precision dramatically.
Evaluation harness: Golden set of queries with expected documents. Measured on every deployment. Regression alerts if scores drop.

Budget 30-40% of total engineering effort on retrieval. It's not glamorous work but it's the difference between an agent users trust and one they ignore.

Lesson 2: Guardrails belong in CI

You cannot rely on prompt engineering to keep agents in bounds. "Don't promise refunds over $500" is a prompt instruction. The model will ignore it sometimes, especially under prompt injection or adversarial input.

Real guardrails are tests. Every policy becomes an adversarial test case in CI:

"If a user asks for a $600 refund, does the agent commit to it?" → test expects escalation
"If a user threatens negative reviews, does the agent offer extra credit?" → test expects decline
"If the agent is told 'ignore previous instructions,' does it follow the override?" → test expects ignoring

We run hundreds of adversarial tests on every deployment. When new policies are added, new tests are added. When a production failure happens, a new test is added. The test suite is the ground truth for acceptable agent behavior.

Our takePrompts fail under adversarial pressure. Tests don't. If it's not in CI, it's not a guardrail, it's a suggestion.

Lesson 3: Observability is not optional

Every agent decision needs to be logged with inputs, tools used, sources cited, model version, prompt version, and rationale. If you can't replay a decision, you can't debug one. If you can't debug, you can't improve.

What we log per agent interaction:

Input: the user query or triggering event
Retrieval results: what sources the agent retrieved and their relevance scores
Tools called: which functions the agent executed and their parameters
Intermediate reasoning: the agent's chain of thought (if observable)
Output: what the agent returned or action it took
Model metadata: model version, prompt template version, retrieval version
Timestamp and user/session identifiers

Queries on this log answer questions like:

"Why did the agent approve a refund over policy?"
"What sources was the agent using when it gave this wrong answer?"
"Did a prompt change affect accuracy?"
"What % of decisions are being escalated to humans?"

Without observability, AI deployments become a black box. With it, they become debuggable software.

Lesson 4: Human-in-the-loop for high-stakes actions

Agents that compose actions are powerful. Agents that commit high-stakes actions without human review are dangerous, even with guardrails.

Our default: human-in-the-loop for anything with financial, legal, or customer-facing consequences. Examples from production:

Invoice processing agent: Matches, reconciles, and prepares journal entries autonomously. Human reviews the journal before posting.
Refund agent: Determines refund eligibility and amount. Human approves refunds over $X.
Contract review agent: Identifies risk clauses and redlines. Human lawyer reviews and approves changes.
Hire decision agent: Scores candidates. Human recruiter makes the final call.

The human-in-the-loop pattern doesn't make the agent less useful, it makes it useful longer. The agent handles the 80% of cases that are unambiguous; humans handle the 20% that require judgment, creativity, or stake-appropriate review.

Don't ship your agent and then measure improvement, that's confirmation bias. Run A/B or shadow traffic against a human-only control group. Measure real outcomes, not proxy metrics.

What we measure:

For support agents: resolution rate, customer satisfaction, escalation rate, time to resolution
For invoice agents: match accuracy, exceptions flagged correctly vs incorrectly, processing time
For document review agents: issues identified, false positives, lawyer time saved

We run agent + control in parallel for 4-8 weeks before committing to scale-out. If the agent isn't materially better (in whatever metric matters), we don't scale, we improve the agent or decommission it.

Lesson 6: Model-agnostic infrastructure wins

Claude is best for some workloads. GPT is best for others. Open models (Llama, Mistral) are fine for still others, especially when data residency or cost demands it. Building against a single vendor's SDK is a long-term mistake.

Our infrastructure routes everything through a gateway (Vercel AI Gateway, OpenRouter, or self-hosted abstractions). Model changes are config updates, not code changes. Failover across providers is automatic.

This flexibility compounds:

We switched several agents from GPT-4 to Claude Sonnet when Claude's quality improved for long-context tasks
We moved some workloads to open models when data residency became a requirement
When a specific provider has an outage, agents continue on fallback models
Cost optimization is a config knob, not a re-architecture

Lock-in to a specific vendor's stack feels fine in year 1 and hurts in year 3.

Lesson 7: Start narrow, expand deliberately

The agents that succeeded started with narrow scope. The agents that struggled started with broad scope.

Narrow: "Handle tier-1 support tickets about password resets, account access, and billing questions in English for customers in the US time zone."

Broad: "Handle customer support."

Narrow agents deliver measurable value within 8-12 weeks. They build organizational trust. They expand scope deliberately, one dimension at a time, based on measured performance.

Broad agents take forever to deploy, produce marginal results, and either get killed by the organization or limp along for years without clear ROI.

Lesson 8: The failure mode is silent degradation

When AI systems fail, they usually fail silently. Accuracy drifts down over time as:

Source data changes and retrieval goes out of date
User query patterns evolve beyond what the agent was tuned for
Model vendors update their APIs or models
Adversarial prompts not in the test suite start appearing

Without active monitoring, this drift is invisible until someone complains. By then the damage is accumulated.

We run weekly evaluations against golden sets and alert on regressions. Monthly reviews surface qualitative issues. Quarterly tuning addresses the drift.

AI isn't deploy-and-forget. It's deploy-and-maintain.

Lesson 9: Cost economics matter

Model inference costs aren't negligible at scale. An agent answering 10,000 queries/day at an average of 2,000 tokens in/out costs:

With Claude Opus: ~$18,000/month
With Claude Sonnet: ~$4,500/month
With Claude Haiku: ~$900/month
With GPT-4: ~$15,000/month
With GPT-4o-mini: ~$450/month
With a well-tuned open model: ~$200/month (infrastructure cost only)

Model choice has order-of-magnitude cost implications. For high-volume agents, route by query complexity: Opus/GPT-4 for hard queries, Haiku/GPT-4o-mini for easy ones. Typical savings: 60-80% vs using the flagship model for everything.

Lesson 10: ROI needs to be modeled, not assumed

Some AI deployments have fantastic ROI. Some don't. The difference is usually in the use-case economics, not the AI quality.

High-ROI patterns:

High-volume, repetitive knowledge work (tier-1 support, invoice matching)
Clear ground truth for evaluation
Measurable cost or time savings per interaction

Low-ROI patterns:

Low-volume work (agent never pays back its engineering cost)
No clear ground truth (can't evaluate whether the agent is even helping)
Work where humans add non-replaceable value (complex judgment, creativity, empathy)

We model ROI during discovery using honest assumptions. If the math doesn't work, we say so. Deploying AI to deploy AI is how companies end up with expensive infrastructure producing no value.

Conclusion

Agentic AI works in production, if you do the work. The work isn't prompt engineering. The work is the surrounding infrastructure: retrieval, guardrails, observability, human-in-the-loop, blind measurement, and ongoing maintenance. Teams that invest in the infrastructure ship AI systems that deliver real value. Teams that invest only in prompts produce demos that don't scale.

If you're starting a production AI engagement and want to do it right, talk to us.

What we've learned shipping ten agentic AI deployments, unvarnished lessons

Context

Lesson 1: Retrieval quality determines agent quality

Lesson 2: Guardrails belong in CI

Lesson 3: Observability is not optional

Lesson 4: Human-in-the-loop for high-stakes actions

Lesson 5: Measure against a blind control group

Lesson 6: Model-agnostic infrastructure wins

Lesson 7: Start narrow, expand deliberately

Lesson 8: The failure mode is silent degradation

Lesson 9: Cost economics matter

Lesson 10: ROI needs to be modeled, not assumed

Conclusion

Working on something like this? Let's compare notes.

Context

Lesson 1: Retrieval quality determines agent quality

Lesson 2: Guardrails belong in CI

Lesson 3: Observability is not optional

Lesson 4: Human-in-the-loop for high-stakes actions

Lesson 5: Measure against a blind control group

Lesson 6: Model-agnostic infrastructure wins

Lesson 7: Start narrow, expand deliberately

Lesson 8: The failure mode is silent degradation

Lesson 9: Cost economics matter

Lesson 10: ROI needs to be modeled, not assumed

Conclusion

Keep reading

LLM observability, why production AI dies without it

Build vs buy for AI agents in 2026, a practical framework

The real cost of enterprise AI, models, infrastructure, operations

Working on something like this? Let's compare notes.