Agentic AI Development: What Production Looks Like in 2026

by Green Dolphin Software, AI / Agentic practice

The gap between an agentic AI demo and an agentic AI production system is enormous. Most demos work because the demo environment is a happy path: clean data, no edge cases, no PII, no audit trail. Production is the opposite. After delivering several Fortune 500 agentic engagements across Salesforce Agentforce, Agent Fusion, Anthropic Claude, OpenAI, AWS Bedrock, Azure OpenAI, Google Vertex AI, Snowflake Cortex, and Databricks Mosaic AI, we've converged on a small set of patterns that hold up.

This is the playbook.

What a production agentic AI architecture looks like

[Figure: Production agentic AI architecture. Business triggers on the left, agent orchestration layer in the middle (Agentforce, Atlas Reasoning, multi-LLM router), RAG + tool layer, enterprise system APIs, and observability / evaluation on the right. Synthetic example, not real client data.]

What you're seeing, left to right:

  • Triggers — Salesforce UI / Page (where Agentforce lives natively), Slack/Teams, email/webhook, scheduled cron, document/PDF intake.
  • Agent orchestration — Salesforce Agentforce or Agent Fusion sitting on the Atlas Reasoning Engine, routing user intent to topics + actions. Under the hood a multi-LLM router picks the right model per task (Claude for reasoning, OpenAI for tool calling, Bedrock for FedRAMP-eligible workloads, Vertex for Google-shop integration, etc.). Tool / function registry holds structured-output schemas. Prompt cache + evaluation harness measure both cost and quality.
  • RAG + tools — vector store (Pinecone, Weaviate, Snowflake Cortex Search, Databricks Mosaic AI Vector Search, or a custom embeddings pipeline), the knowledge base it indexes (Confluence, SharePoint, SF Data Cloud, Snowflake), and tool endpoints that wrap enterprise APIs as agent-callable functions.
  • Systems — Salesforce, NetSuite/ERP, Snowflake/Databricks, Slack/email. Agents read, update, and take action.
  • Observability — full prompt + response logging, token cost + latency, hallucination detector, shadow evaluation on 5-10% of traffic, output to Datadog or Splunk.

The confidence gate → human review loop is what makes this audit-ready. Outputs below a configured threshold get queued for SME review rather than auto-actioned. That single discipline catches the silent failures (hallucinations that look plausible) that demos never expose.
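The per-task model selection described under agent orchestration can be sketched as a small lookup. This is a minimal illustration, not how Agentforce or any vendor router works internally: the `Task` enum, the `ROUTES` table, and `pick_provider` are hypothetical names, and the rule that a compliance requirement overrides the task preference is an assumption made for the example.

```python
from enum import Enum


class Task(Enum):
    REASONING = "reasoning"
    TOOL_CALLING = "tool_calling"
    FEDRAMP = "fedramp"
    GOOGLE_INTEGRATION = "google_integration"


# Hypothetical routing table mirroring the pairings named in the text.
ROUTES = {
    Task.REASONING: "claude",
    Task.TOOL_CALLING: "openai",
    Task.FEDRAMP: "bedrock",
    Task.GOOGLE_INTEGRATION: "vertex",
}


def pick_provider(task: Task, requires_fedramp: bool = False,
                  default: str = "claude") -> str:
    """Return the provider key for a task.

    Assumption: a FedRAMP requirement wins over the task-based preference.
    """
    if requires_fedramp:
        return ROUTES[Task.FEDRAMP]
    return ROUTES.get(task, default)
```

Keeping the table as data rather than branching logic means a routing change is a reviewable one-line diff.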

Five patterns that make production agentic systems survive

1. Structured output, always

Feeding free-form LLM text into a regulated workflow is a quality and compliance disaster waiting to happen. Instead, every agent action is defined as a tool with a strict JSON schema. Anthropic Claude does this via tool definitions, OpenAI via function calling and structured outputs (JSON mode alone guarantees valid JSON, not conformance to a schema), Bedrock via tool definitions in the Converse API. The orchestration layer validates every output against the tool's schema before acting on it.
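As a rough sketch of that validation step, the check below parses a model's tool-call output and rejects anything that does not match the declared fields. It uses only the standard library and a deliberately simplified schema format (field name mapped to type and required flag) rather than full JSON Schema. `UPDATE_CASE_SCHEMA` and `validate_tool_call` are hypothetical names invented for the example.

```python
import json

# Hypothetical tool schema: field name -> (expected type, required?)
UPDATE_CASE_SCHEMA = {
    "case_id": (str, True),
    "status": (str, True),
    "note": (str, False),
}


def validate_tool_call(raw: str, schema: dict) -> dict:
    """Parse model output; raise ValueError unless it matches the schema."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level value must be an object")
    unknown = set(payload) - set(schema)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    for name, (expected, required) in schema.items():
        if name not in payload:
            if required:
                raise ValueError(f"missing required field: {name}")
            continue
        if not isinstance(payload[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return payload
```

The orchestration layer would call this before dispatching the action and treat a `ValueError` as a failed call (retry or route to review), never as something to patch up silently.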

2. Prompt caching as the cost lever

For agentic systems with a stable system prompt (10K+ tokens describing the business domain, schemas, rules, tool definitions, examples), prompt caching can cut API costs by 80-90%, because the repeated prefix is billed at a discounted input rate on every cache hit. On a recent engagement processing ~50K agent interactions/day, prompt caching dropped the monthly LLM bill from $4,200 to $620 (about 85%) with no other changes. We build this in from day one.
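The arithmetic behind that claim can be checked in a few lines. The first function computes the saving from the two bill figures quoted above. The second is a simplified model of what the stable prefix costs with and without caching. Its parameters (`hit_rate`, `read_multiplier`, `price_per_mtok`) are illustrative assumptions, not any provider's published pricing, and it ignores any premium charged for writing to the cache.

```python
def reduction(before: float, after: float) -> float:
    """Fractional saving when a bill drops from `before` to `after`."""
    return 1.0 - after / before


def prefix_cost(calls: int, prefix_tokens: int, price_per_mtok: float,
                hit_rate: float = 0.0, read_multiplier: float = 1.0) -> float:
    """Cost of the stable prompt prefix across `calls` requests.

    Assumption: cache hits are billed at `read_multiplier` times the normal
    input price; misses are billed at the full price.
    """
    full = calls * prefix_tokens * price_per_mtok / 1_000_000
    return full * (hit_rate * read_multiplier + (1.0 - hit_rate))


# Figures quoted in the text: $4,200/month down to $620/month.
print(f"{reduction(4200, 620):.1%}")  # prints 85.2%
```

The model also shows why the saving depends on how much of each request is stable prefix: tokens that change per call are never cached.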

3. Confidence-score gating

Every agent call returns a confidence field (either model-native or computed via a calibration step). Below a threshold, the agent does NOT auto-action — it queues the suggestion for human review. The threshold is tuned per use case but typically starts conservative (0.85+) and loosens as the team builds trust through observed outcomes.
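A minimal sketch of such a gate, assuming the confidence value is already attached to each suggestion. `Suggestion` and `ConfidenceGate` are hypothetical names, and the in-memory list stands in for whatever review queue a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Suggestion:
    action: str
    confidence: float


@dataclass
class ConfidenceGate:
    threshold: float = 0.85  # conservative starting point from the text
    review_queue: List[Suggestion] = field(default_factory=list)

    def route(self, suggestion: Suggestion) -> str:
        """Auto-action at or above the threshold; otherwise queue for review."""
        if suggestion.confidence >= self.threshold:
            return "auto_action"
        self.review_queue.append(suggestion)
        return "human_review"
```

Because the threshold is a plain field, loosening it per use case is a configuration change that can be documented alongside the observed outcomes that justified it.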

4. Shadow evaluation

Route 5-10% of agent decisions through a parallel "expected behavior" check: a deterministic rules engine, a second LLM with a different prompt, or a human reviewer. Disagreements get flagged. This catches model drift early, before it affects business outcomes. Without this, you find out about quality degradation when a customer complains.
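One way to pick the shadow sample is to hash the interaction ID, so the same interaction is always in or out of the sample when a call is replayed. The sketch below assumes that approach. The function names and the shape of the result dict are hypothetical, and `expected_fn` stands in for whichever checker (rules engine, second model, reviewer lookup) is used.

```python
import hashlib
from typing import Any, Callable, Optional


def in_shadow_sample(interaction_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of interactions by ID hash."""
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def shadow_check(interaction_id: str, agent_decision: Any,
                 expected_fn: Callable[[Any], Any], payload: Any,
                 rate: float = 0.05) -> Optional[dict]:
    """Return None if not sampled, else a record of agreement or disagreement."""
    if not in_shadow_sample(interaction_id, rate):
        return None
    expected = expected_fn(payload)
    return {
        "interaction_id": interaction_id,
        "agent": agent_decision,
        "expected": expected,
        "disagree": agent_decision != expected,
    }
```

Records with `disagree` set are the ones to flag; tracking their share over time gives an early drift signal.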

5. Full prompt and response logging

Every agent interaction's full prompt, response, model version, latency, and token count go to a structured store (Splunk, Datadog, or a dedicated analytics store). When something looks wrong, you can replay the exact call. You can A/B test prompt changes against the same input distribution. You can prove to auditors what the model "knew" at the time of each decision.
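A sketch of what one such log record might look like, serialized as a single JSON line. The field names and the `AgentCallRecord` and `log_call` helpers are assumptions for illustration. The `sink` callable stands in for whatever shipper forwards lines to the observability stack.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentCallRecord:
    call_id: str
    model_version: str
    prompt: str          # full prompt, so the exact call can be replayed
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    timestamp: float     # seconds since the epoch


def log_call(record: AgentCallRecord, sink: Callable[[str], None]) -> str:
    """Serialize the record as one JSON line and hand it to the sink."""
    line = json.dumps(asdict(record), sort_keys=True)
    sink(line)
    return line
```

Passing `print` as the sink is enough for local testing. Logging the model version on every record is what makes replay and before/after prompt comparisons meaningful.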

AI / Agentic engagement tiers

$25,000 Starter (~3 weeks)

One Salesforce Agentforce agent (Sales or Service) with up to 4 actions plus 2 data sources. Or one LLM integration into an existing workflow (document extraction, summarization, classification). Or one custom RAG pipeline over a single data source. Sized for proving the value of agentic AI in one workflow before expanding.

$50,000 Standard (~6 weeks)

Multi-agent orchestration: Agentforce + Agent Fusion with shared Atlas Reasoning Engine, or multi-LLM routing where different models handle different intents. Up to 3 data sources for RAG. Includes the evaluation harness baseline. Where most "we want real agentic AI, not a chatbot" engagements land.

$75,000 Enterprise (~8 weeks)

Enterprise agent platform with RAG over multiple data sources, full evaluation framework, governance model (which agents do what, who owns the prompts, how new agents get reviewed), and connection to your broader integration estate. AI System APIs exposed through MuleSoft or your existing iPaaS for reuse across business units.

$100,000+ Custom (10-12+ weeks)

Regulated agentic AI (HIPAA, FedRAMP, PCI-DSS) with full audit trail, immutable evidence store, human-in-the-loop required by policy, content filters, data classification, and compliance documentation pack. This is where AI starts touching protected data and the procurement bar gets serious.

Platforms we deliver on

  • Salesforce Agentforce (Sales, Service, Marketing, custom agents) — Topic design, Action implementation, Atlas Reasoning Engine, Data Cloud integration, Einstein Trust Layer configuration.
  • Salesforce Agent Fusion — multi-agent orchestration with shared reasoning state.
  • Anthropic Claude (Sonnet, Opus, Haiku) — including prompt caching and tool use.
  • OpenAI (GPT-5, GPT-4 family, o-series) — Chat Completions, Assistants, function calling, file/vector store.
  • AWS Bedrock — Claude on Bedrock, Bedrock Agents, Knowledge Bases, Guardrails, Flows. FedRAMP-eligible deployments.
  • Azure OpenAI — Deployments, Function calling, On Your Data, content filters.
  • Google Vertex AI — Model Endpoint, Search App, RAG Engine, reasoning agents.
  • Snowflake Cortex — Cortex Search Service for RAG over warehouse data, Cortex Analyst, Cortex Fine-Tuned models.
  • Databricks Mosaic AI — Vector Search Index, Mosaic AI Agents, Model Serving Endpoints, Feature Store.
  • Vector DBs: Pinecone, Weaviate, plus native vector stores in Snowflake/Databricks/Postgres pgvector.
  • Custom RAG: embeddings pipeline + vector store + retrieval endpoint + re-ranker, built to fit when off-the-shelf doesn't.

What's included on every AI / Agentic engagement (Starter through Custom)

Same standard as Integration: source code in Green Dolphin's GitHub transferred to client at acceptance, comprehensive README per repo, full design package (topology + landscape + sequence diagrams + per-API/agent design), ≥80% unit test coverage, Postman collection for every endpoint, all-environment deploy with post-deploy Postman tests, error handling + structured logging.

Plus AI-specific additions:

  • Prompt cache configuration with measured cost reduction
  • Evaluation harness baseline (golden test set, automated regression on prompt changes)
  • Confidence gate configuration with documented thresholds
  • Full prompt + response logging to your observability stack
  • Trust Layer / guardrails / content filter configuration where the platform supports it
  • Tool/function registry documented for future agent additions

What's NOT included (typical out-of-scope)

  • Model fine-tuning unless explicitly scoped (most use cases are better served by prompt engineering + RAG against frontier models)
  • Custom embedding model training (we use OpenAI/Cohere/Voyage embeddings off the shelf unless your data justifies otherwise)
  • Generation of training data labels (client-owned, with SME effort)
  • Long-term agent monitoring and prompt updates (Managed Services)
  • AI governance committee / policy authoring (advisory only; not policy-writing)

Ready to scope an AI / Agentic engagement?

If you've watched a few impressive AI demos and you're wondering what it actually takes to get one into production at your enterprise, submit the 6-step intake form. Fixed-bid SOW in 3 business days. We've done this on Salesforce Agentforce, multi-LLM routers, and regulated environments.

For the deeper technical playbook on AI on MuleSoft specifically, see our companion piece: AI-Native MuleSoft: Five Integration Patterns That Actually Ship.
