Data Architecture in 2026: Picking the Right Platform Stack

by Green Dolphin Software, Data architecture practice

The data platform decision is a 5-year commitment with seven-figure implications. The buyer is asked to pick between Snowflake, Databricks, BigQuery, Synapse, Redshift, and now Microsoft Fabric — usually based on a 90-minute vendor pitch and a free PoC. The ingestion layer decision (Informatica, Talend, Fivetran, dbt, Airbyte, native CDC, custom Spark) gets even less rigorous treatment.

This post is the vendor-neutral framework we use for clients on a $25K+ Data Architecture engagement: pick the right warehouse, pick the right ingestion stack, design the data model and governance framework, and lay the groundwork for build — with no build, no vendor kickback, no agenda beyond a stack that survives the 5-year cost curve.

Target-state data architecture

[Figure: Data architecture, medallion lakehouse with AI layer. Sources on the left (apps + legacy + streaming), ingestion via Informatica / Talend / Fivetran / native CDC into Bronze → Silver → Gold layers in Snowflake / Databricks, consumed by BI tools and AI/RAG agents, with a Unity / Atlan governance overlay. Synthetic example, not real client data.]

The diagram shows the medallion lakehouse pattern that's become the default for new builds in 2025-2026. Bronze is raw landing — source-shaped, full history, schema-on-read. Silver is the cleansed and conformed layer — dedup, type-cast, joins, slowly-changing dimensions, PII redaction, canonical entities. Gold is business-ready — aggregated, denormalized, BI features, embeddings for AI, dimensional/star/data-vault models, all versioned and SLA-backed.
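To make the three layers concrete, here is a minimal sketch in plain Python. The order records and their field names (`order_id`, `customer_email`, `amount`, `updated_at`) are hypothetical, and the unsalted hash is only a stand-in for real PII masking. In practice this logic would live in dbt models or Spark notebooks rather than in-process code.

```python
from collections import defaultdict
from decimal import Decimal
import hashlib

# Bronze: raw, source-shaped rows. Everything is a string; duplicates are kept.
# (Hypothetical sample data, not a real schema.)
bronze = [
    {"order_id": "1", "customer_email": "a@example.com", "amount": "10.50", "updated_at": "2026-01-01"},
    {"order_id": "1", "customer_email": "a@example.com", "amount": "12.00", "updated_at": "2026-01-02"},
    {"order_id": "2", "customer_email": "b@example.com", "amount": "7.25", "updated_at": "2026-01-01"},
]

def to_silver(rows):
    """Dedup on order_id (latest updated_at wins), type-cast, and replace PII."""
    latest = {}
    for row in rows:
        key = row["order_id"]
        # ISO-8601 date strings compare correctly as plain strings.
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    silver = []
    for row in latest.values():
        silver.append({
            "order_id": int(row["order_id"]),
            # Illustrative pseudonymization only; real masking needs salting or tokenization.
            "customer_key": hashlib.sha256(row["customer_email"].encode()).hexdigest()[:12],
            "amount": Decimal(row["amount"]),
            "updated_at": row["updated_at"],
        })
    return sorted(silver, key=lambda r: r["order_id"])

def to_gold(rows):
    """Aggregate to a business-ready shape: total order amount per day."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["updated_at"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze keeps both versions of order 1, Silver keeps only the latest with typed values and no raw email, and Gold holds only the aggregate a dashboard would read.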

None of this is novel architecture in 2026, but the platform combination you build it on still matters enormously.

Warehouse / lakehouse decision (the 5-year choice)

Snowflake

The default for SQL-first analytics. Strongest ecosystem of partner tools, easy admin, predictable cost model (compute + storage decoupled). Cortex adds in-warehouse AI (Search Service for RAG, Cortex Analyst for natural-language SQL, fine-tuned models on warehouse data). Weak on streaming-first workloads (Snowpipe Streaming is improving but not Kafka-native). Strong fit: enterprise SQL shops, BI-heavy workloads, regulated industries that value predictability over raw throughput.

Databricks

The default for ML-heavy and large-scale data-engineering workloads. Spark-native, notebook-first, Unity Catalog has matured into a genuinely strong governance story. Mosaic AI (formerly MosaicML) provides vector search, agent runtime, model serving, and feature store — tightly integrated. Cost can be unpredictable if Spark jobs aren't tuned. Strong fit: ML/AI-heavy teams, Spark-fluent shops, lakehouse architectures with mixed structured + unstructured data.

BigQuery

Strong if you're a Google shop or if your workload is dominated by unpredictable, bursty queries (BigQuery's serverless pricing rewards this). BigLake brings lakehouse-style external table access. Weak if you need fine-grained workload isolation. Strong fit: GCP-native shops, ad-hoc analytics, data-mesh patterns.

Azure Synapse

Microsoft's mature warehouse. Dedicated SQL pools for predictable workloads, serverless for ad-hoc. Strong integration with Power BI and Azure Data Factory. Has been positioned for replacement by Microsoft Fabric (see below). Strong fit: Azure-only shops where Fabric isn't yet mature enough for the workload.

AWS Redshift

The veteran. Mature ecosystem, predictable cost. Redshift Spectrum reads S3 data lake. Strong fit: AWS-native shops with workloads already on Redshift; less compelling for greenfield in 2026.

Microsoft Fabric

Microsoft's new unified analytics platform — OneLake (shared storage), Power BI, Data Factory, Synapse Data Warehouse, Synapse Data Engineering (Spark) — bundled. Compelling vision but still maturing in 2026. Strong fit: organizations heavily committed to Microsoft 365 + Power BI willing to bet on the integrated story.

How we pick

Three questions answer most of the warehouse decision:

  1. What's your shop's cloud center of gravity? AWS-heavy → Redshift or Snowflake. Azure-heavy → Synapse or Fabric or Snowflake. GCP-heavy → BigQuery or Snowflake. Multi-cloud → Snowflake or Databricks (both cloud-portable).
  2. What's the workload mix? Mostly SQL/BI → Snowflake or BigQuery. ML/Spark-heavy → Databricks. Bursty ad-hoc → BigQuery.
  3. What's the 3-year AI plan? AI in the warehouse (Cortex / Cortex Search) → Snowflake. ML platform integration (Mosaic AI, agents) → Databricks. Both work but they're tuned differently.
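The three questions above can be sketched as a simple voting table. The candidate sets are transcribed from the list. The scoring rule (one vote per question a platform satisfies), the key names, and the `shortlist` function are assumptions of this sketch, not a formal scoring method.

```python
# Candidate platforms per answer, transcribed from the three questions.
BY_CLOUD = {
    "aws": {"Redshift", "Snowflake"},
    "azure": {"Synapse", "Fabric", "Snowflake"},
    "gcp": {"BigQuery", "Snowflake"},
    "multi": {"Snowflake", "Databricks"},
}
BY_WORKLOAD = {
    "sql_bi": {"Snowflake", "BigQuery"},
    "ml_spark": {"Databricks"},
    "bursty_adhoc": {"BigQuery"},
}
BY_AI_PLAN = {
    "in_warehouse": {"Snowflake"},
    "ml_platform": {"Databricks"},
}

def shortlist(cloud, workload, ai_plan):
    """Rank platforms by how many of the three questions they satisfy.

    Hypothetical helper: equal weighting per question is an assumption.
    Ties are broken alphabetically to keep the output deterministic.
    """
    votes = {}
    for candidates in (BY_CLOUD[cloud], BY_WORKLOAD[workload], BY_AI_PLAN[ai_plan]):
        for platform in candidates:
            votes[platform] = votes.get(platform, 0) + 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = shortlist("aws", "sql_bi", "in_warehouse")
```

A table like this only narrows the field. The remaining candidates still need to be scored against real workload, cost, and governance requirements.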

Ingestion / ELT decision (the cost-control layer)

The ingestion layer choice is where TCO is won or lost. Per-row pricing on managed connectors looks cheap at proof-of-concept volume and can become punishing at scale.
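A rough way to see the cost curve is to compare a pure per-row price against a fixed-cost-plus-cheap-rows option. Every rate below is a made-up placeholder, not any vendor's pricing, and the linear model ignores the volume tiers and discounts that real contracts usually include.

```python
def monthly_cost(rows_per_day, fixed_monthly, price_per_million_rows):
    """Linear monthly cost: a fixed platform/people charge plus a per-row charge."""
    millions_per_month = rows_per_day * 30 / 1_000_000
    return fixed_monthly + millions_per_month * price_per_million_rows

def break_even_rows_per_day(managed_rate, self_fixed, self_rate):
    """Daily volume above which the self-managed option is cheaper."""
    millions_per_month = self_fixed / (managed_rate - self_rate)
    return millions_per_month * 1_000_000 / 30

# HYPOTHETICAL rates, chosen only to illustrate the shape of the curve.
MANAGED_RATE = 50.0   # $ per million rows, managed connector, no fixed fee
SELF_FIXED = 6_000.0  # $ per month of engineering time and infrastructure
SELF_RATE = 2.0       # $ per million rows of compute for the self-managed path

for rows_per_day in (1_000_000, 10_000_000, 100_000_000):
    managed = monthly_cost(rows_per_day, 0.0, MANAGED_RATE)
    self_managed = monthly_cost(rows_per_day, SELF_FIXED, SELF_RATE)
    print(f"{rows_per_day:>11,} rows/day  managed ${managed:>9,.0f}  self-managed ${self_managed:>9,.0f}")

print(f"break-even at about {break_even_rows_per_day(MANAGED_RATE, SELF_FIXED, SELF_RATE):,.0f} rows/day")
```

With these placeholder numbers the managed option wins comfortably at proof-of-concept volume and loses by an order of magnitude at 100M rows/day, which is the pattern a TCO model should test with real quotes.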

Informatica IICS (Intelligent Cloud Services)

Enterprise heritage. Strong on complex CDC, mainframe sources, MDM integration, with governance baked in. Informatica has been retooling for cloud-native delivery under its IDMC (Intelligent Data Management Cloud) umbrella. Expensive but predictable. Strong fit: enterprise data shops already on Informatica, regulated industries, complex ETL with embedded transformations.

Talend

Open-source heritage (now Qlik-owned). Broad connector library. Good for hybrid on-prem + cloud. Slower release cadence than the SaaS-native vendors. Strong fit: mid-market and enterprise with existing Talend skills, hybrid estates.

Fivetran

The "set it and forget it" managed connector vendor. Excellent SaaS connectors with maintained schemas. Pricing scales with rows synced — fine at small scale, expensive at large. Strong fit: RevOps-driven analytics teams, modern stacks, willing to trade cost for engineering velocity.

Airbyte

Open-source Fivetran alternative. Cloud and self-hosted options. Smaller connector library than Fivetran but growing. Strong fit: budget-sensitive teams, teams that want connector source code, teams comfortable with some operational maintenance.

dbt

Not ingestion but transformation. dbt handles the Bronze→Silver and Silver→Gold steps as the SQL-and-Python transformation framework. Version-controlled, testable, lineage-aware. Default for most modern data teams. Strong fit: virtually any team building Silver and Gold layers.

Hevo, Stitch, Matillion

Other managed-connector options. Hevo competes with Fivetran on cost. Stitch is the budget end. Matillion is stronger on complex transformations and traditional ETL patterns. Strong fit: each has a niche; we recommend whichever matches your team's ETL pattern preference.

Apache NiFi, Streaming (Kafka Connect, Debezium, Striim)

For real-time CDC and streaming-first architectures. Apache NiFi for flow-based data movement (popular in regulated industries). Kafka Connect / Debezium for source CDC. Striim and Qlik Replicate for enterprise CDC with broader source coverage. Strong fit: real-time analytics, event-driven architectures, very large transactional databases.

Native cloud (AWS Glue, Azure Data Factory, GCP Dataflow)

Cloud-native data integration. Cheapest at scale within their respective clouds. Less polished UX than the SaaS vendors but lower per-row cost. Strong fit: cloud-native teams comfortable building their own connectors when needed.

How we pick

  • High row volume (100M+/day) at predictable scale → native cloud (Glue, ADF, Dataflow) or self-hosted Airbyte/NiFi
  • Mid volume, fast SaaS connector roll-out → Fivetran (if budget allows) or Hevo (budget-conscious)
  • Complex enterprise CDC with mainframe/legacy → Informatica or Striim
  • Open-source-friendly with team engineering capacity → Airbyte + dbt
  • Streaming-first → Kafka Connect + Debezium + dbt on the warehouse

dbt is almost always part of the answer for the Silver→Gold transformation layer.
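The rules of thumb above can be written as an ordered set of checks. The recommendations come from the list. The precedence order, the parameter names, and the `pick_ingestion` function itself are assumptions of this sketch, since real cases often match more than one rule.

```python
HIGH_VOLUME_ROWS_PER_DAY = 100_000_000  # the "100M+/day" threshold from the list

def pick_ingestion(rows_per_day, streaming_first=False, legacy_cdc=False,
                   has_engineering_capacity=False, budget_sensitive=False):
    """Return a suggested ingestion stack; dbt is appended for Silver→Gold.

    Hypothetical helper. The order of the checks is an assumption:
    architectural constraints first, then volume, then team and budget.
    """
    if streaming_first:
        stack = ["Kafka Connect", "Debezium"]
    elif legacy_cdc:
        stack = ["Informatica or Striim"]
    elif rows_per_day >= HIGH_VOLUME_ROWS_PER_DAY:
        stack = ["native cloud (Glue, ADF, Dataflow) or self-hosted Airbyte/NiFi"]
    elif has_engineering_capacity:
        stack = ["Airbyte"]
    elif budget_sensitive:
        stack = ["Hevo"]
    else:
        stack = ["Fivetran"]
    return stack + ["dbt"]

suggestion = pick_ingestion(5_000_000, budget_sensitive=True)
```

Treat the output as a starting shortlist per workload rather than a single answer for the whole estate.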

Governance: the often-skipped step

Governance has six components: a catalog (Unity Catalog if on Databricks, Atlan or Collibra for multi-platform estates, Alation for traditional enterprises, Informatica EDC if you're already in the Informatica ecosystem, Microsoft Purview for Microsoft shops); lineage (often provided by the catalog); PII classification; RBAC + masking; cost monitoring (built into Snowflake and Databricks, needs tooling on other platforms); and quality checks (Great Expectations or dbt tests).
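Two of these controls, role-based masking and quality checks, are easy to illustrate in a few lines. This is a plain-Python sketch under stated assumptions: the column classification, the role name, and both helper functions are hypothetical. Real platforms enforce masking through warehouse policies, and quality rules are usually expressed as dbt tests or Great Expectations suites.

```python
PII_COLUMNS = {"email", "phone"}     # hypothetical output of a PII classification step
UNMASKED_ROLES = {"data_steward"}    # hypothetical role allowed to see raw values

def mask_row(row, role):
    """Return a copy of the row with PII columns masked unless the role is exempt."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {k: ("***MASKED***" if k in PII_COLUMNS else v) for k, v in row.items()}

def check_not_null_unique(rows, column):
    """Mimic the common not-null and unique column tests; return failure messages."""
    values = [r.get(column) for r in rows]
    failures = []
    if any(v is None for v in values):
        failures.append(f"{column}: null values found")
    non_null = [v for v in values if v is not None]
    if len(non_null) != len(set(non_null)):
        failures.append(f"{column}: duplicate values found")
    return failures

customers = [
    {"customer_id": 1, "email": "a@example.com", "phone": "555-0100"},
    {"customer_id": 2, "email": "b@example.com", "phone": "555-0101"},
    {"customer_id": 2, "email": None, "phone": "555-0102"},
]

analyst_view = [mask_row(r, "analyst") for r in customers]
problems = check_not_null_unique(customers, "customer_id") + check_not_null_unique(customers, "email")
```

The point of the sketch is that both controls are cheap to define once the classification exists, which is why deferring them rarely saves real effort.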

Skipping governance until "later" is the most expensive non-decision a data team makes.

Data Architecture engagement — what you get for $25K+

A Data Architecture engagement starts at $25,000, is design-only, and runs ~2-3 weeks. Deliverables:

  • 2-3 discovery sessions with client data + business SMEs (recorded)
  • Vendor-neutral platform comparison: Snowflake vs Databricks vs BigQuery vs Synapse vs Redshift vs Fabric, scored against your workload, your cloud center of gravity, your AI roadmap, your governance requirements
  • Target-state data architecture diagram (warehouse + lakehouse + streaming + AI layer)
  • Data model design — dimensional, data vault, or one-big-table, picked to fit the use case
  • Medallion architecture (Bronze raw / Silver cleansed / Gold business-ready)
  • Ingestion / ELT design — Informatica vs Talend vs Fivetran vs Airbyte vs native CDC, per workload
  • Governance recommendations — catalog choice, lineage, PII classification, access control
  • Cost / TCO model — monthly and annual cost projection for each recommended platform across 3-year horizon
  • AI layer design (where applicable) — Snowflake Cortex / Databricks Mosaic AI / Vertex AI integration with the warehouse
  • Real-time streaming integration design (Kafka, Kinesis, Pub/Sub) when needed
  • 90-day data modernization roadmap
  • 2 design review sessions with client data / architecture team

Not included (separate engagements): data migration execution, pipeline/dbt-model/Spark-notebook build, license procurement, data cleansing / MDM tooling implementation.

When to do a Data Architecture engagement vs jumping to build

Three signals you should do the design engagement first:

  1. You're choosing between two or more platforms and the decision has 7-figure 3-year cost implications.
  2. You're starting an AI/RAG initiative and the data layer it'll sit on top of isn't designed yet.
  3. You need a fundable design package for budget approval before procurement signs off on platform licenses.

If you already know what platform you're on and you just need the pipelines built, skip Data Architecture and submit a regular Integration intake instead.

Ready to scope?

If you're evaluating data platforms and want vendor-neutral architecture leadership before committing to a 5-year license, submit the 6-step intake form. $25K+, ~2-3 weeks, full design package returned.

For a synthetic sample of the design depth you'd receive (deliverable Word docs modeled on real engagements), see /samples.
