Agentic AI Development: What Production Looks Like in 2026
by Green Dolphin Software, AI / Agentic practice
The gap between an agentic AI demo and an agentic AI production system is enormous. Most demos work because the demo environment is a happy path: clean data, no edge cases, no PII, no audit trail. Production is the opposite. After delivering several Fortune 500 agentic engagements across Salesforce Agentforce, Agent Fusion, Anthropic Claude, OpenAI, AWS Bedrock, Azure OpenAI, Google Vertex AI, Snowflake Cortex, and Databricks Mosaic AI, we've converged on a small set of patterns that hold up.
This is the playbook.
What a production agentic AI architecture looks like
What you're seeing, left to right:
- Triggers — Salesforce UI / Page (where Agentforce lives natively), Slack/Teams, email/webhook, scheduled cron, document/PDF intake.
- Agent orchestration — Salesforce Agentforce or Agent Fusion sitting on the Atlas Reasoning Engine, routing user intent to topics + actions. Under the hood a multi-LLM router picks the right model per task (Claude for reasoning, OpenAI for tool calling, Bedrock for FedRAMP-eligible workloads, Vertex for Google-shop integration, etc.). Tool / function registry holds structured-output schemas. Prompt cache + evaluation harness measure both cost and quality.
- RAG + tools — vector store (Pinecone, Weaviate, Snowflake Cortex Search, Databricks Mosaic AI Vector Search, or a custom embeddings pipeline), the knowledge base it indexes (Confluence, SharePoint, SF Data Cloud, Snowflake), and tool endpoints that wrap enterprise APIs as agent-callable functions.
- Systems — Salesforce, NetSuite/ERP, Snowflake/Databricks, Slack/email. Agents read, update, and take action.
- Observability — full prompt + response logging, token cost + latency, hallucination detector, shadow evaluation on 5-10% of traffic, output to Datadog or Splunk.
The confidence gate → human review loop is what makes this audit-ready. Outputs below a configured threshold get queued for SME review rather than auto-actioned. That single discipline catches the silent failures (hallucinations that look plausible) that demos never expose.
Five patterns that make production agentic systems survive
1. Structured output, always
Free-form text from an LLM into a regulated workflow is a quality and compliance disaster waiting to happen. Every agent action is defined as a tool with a strict JSON schema. Anthropic Claude does this via tool definitions, OpenAI via JSON mode and function calling, Bedrock via response format directives. The orchestration layer validates every output against the tool's schema before acting on it.
2. Prompt caching as the cost lever
For agentic systems with a stable system prompt (10K+ tokens describing the business domain, schemas, rules, tool definitions, examples), prompt caching cuts API costs by 80-90%. On a recent engagement processing ~50K agent interactions/day, prompt caching dropped the monthly LLM bill from $4,200 to $620 with no other changes. We build this in from day one.
3. Confidence-score gating
Every agent call returns a confidence field (either model-native or computed via a calibration step). Below a threshold, the agent does NOT auto-action — it queues the suggestion for human review. The threshold is tuned per use case but typically starts conservative (0.85+) and loosens as the team builds trust through observed outcomes.
4. Shadow evaluation
Route 5-10% of agent decisions through a parallel "expected behavior" check — either a deterministic rules engine, or a second LLM with a different prompt, or a human reviewer. Disagreements get flagged. This catches model drift early, before it affects business outcomes. Without this, you find out about quality degradation when a customer complains.
5. Full prompt and response logging
Every agent interaction's full prompt, response, model version, latency, and token count goes to a structured store (Splunk, Datadog, or a dedicated analytics store). When something looks wrong, you can replay the exact call. You can A/B test prompt changes against the same input distribution. You can prove to auditors what the model "knew" at the time of each decision.
AI / Agentic engagement tiers
$25,000 Starter (~3 weeks)
One Salesforce Agentforce agent (Sales or Service) with up to 4 actions plus 2 data sources. Or one LLM integration into an existing workflow (document extraction, summarization, classification). Or one custom RAG pipeline over a single data source. Sized for proving the value of agentic AI in one workflow before expanding.
$50,000 Standard (~6 weeks)
Multi-agent orchestration: Agentforce + Agent Fusion with shared Atlas Reasoning Engine, or multi-LLM routing where different models handle different intents. Up to 3 data sources for RAG. Includes the evaluation harness baseline. Where most "we want real agentic AI, not a chatbot" engagements land.
$75,000 Enterprise (~8 weeks)
Enterprise agent platform with RAG over multiple data sources, full evaluation framework, governance model (which agents do what, who owns the prompts, how new agents get reviewed), and integration with the broader integration estate. AI System APIs exposed through MuleSoft or your existing iPaaS for reuse across business units.
$100,000+ Custom (10-12+ weeks)
Regulated agentic AI (HIPAA, FedRAMP, PCI-DSS) with full audit trail, immutable evidence store, human-in-the-loop required by policy, content filters, data classification, and compliance documentation pack. This is where AI starts touching protected data and the procurement bar gets serious.
Platforms we deliver on
- Salesforce Agentforce (Sales, Service, Marketing, custom agents) — Topic design, Action implementation, Atlas Reasoning Engine, Data Cloud integration, Einstein Trust Layer configuration.
- Salesforce Agent Fusion — multi-agent orchestration with shared reasoning state.
- Anthropic Claude (Sonnet, Opus, Haiku) — including prompt caching and tool use.
- OpenAI (GPT-5, GPT-4 family, o-series) — Chat Completions, Assistants, function calling, file/vector store.
- AWS Bedrock — Claude on Bedrock, Bedrock Agents, Knowledge Bases, Guardrails, Flows. FedRAMP-eligible deployments.
- Azure OpenAI — Deployments, Function calling, On Your Data, content filters.
- Google Vertex AI — Model Endpoint, Search App, RAG Engine, reasoning agents.
- Snowflake Cortex — Cortex Search Service for RAG over warehouse data, Cortex Analyst, Cortex Fine-Tuned models.
- Databricks Mosaic AI — Vector Search Index, Mosaic AI Agents, Model Serving Endpoints, Feature Store.
- Vector DBs: Pinecone, Weaviate, plus native vector stores in Snowflake/Databricks/Postgres pgvector.
- Custom RAG: embeddings pipeline + vector store + retrieval endpoint + re-ranker, built to fit when off-the-shelf doesn't.
What's included on every AI / Agentic engagement (Starter through Custom)
Same standard as Integration: source code in Green Dolphin's GitHub transferred to client at acceptance, comprehensive README per repo, full design package (topology + landscape + sequence diagrams + per-API/agent design), ≥80% unit test coverage, Postman collection for every endpoint, all-environment deploy with post-deploy Postman tests, error handling + structured logging.
Plus AI-specific additions:
- Prompt cache configuration with measured cost reduction
- Evaluation harness baseline (golden test set, automated regression on prompt changes)
- Confidence gate configuration with documented thresholds
- Full prompt + response logging to your observability stack
- Trust Layer / guardrails / content filter configuration where the platform supports it
- Tool/function registry documented for future agent additions
What's NOT included (typical out-of-scope)
- Model fine-tuning unless explicitly scoped (most use cases are better served by prompt engineering + RAG against frontier models)
- Custom embedding model training (we use OpenAI/Cohere/Voyage embeddings off the shelf unless your data justifies otherwise)
- Generation of training data labels (client-owned, with SME effort)
- Long-term agent monitoring and prompt updates (Managed Services)
- AI governance committee / policy authoring (advisory only; not policy-writing)
Ready to scope an AI / Agentic engagement?
If you've watched a few impressive AI demos and you're wondering what it actually takes to get one into production at your enterprise, submit the 6-step intake form. Fixed-bid SOW in 3 business days. We've done this on Salesforce Agentforce, multi-LLM routers, and regulated environments.
For the deeper technical playbook on AI on MuleSoft specifically, see our companion piece: AI-Native MuleSoft: Five Integration Patterns That Actually Ship.

