Streaming vs Batch Data Architecture: Kafka, Debezium, Iceberg, Materialize in 2026

by Green Dolphin Software, Data architecture practice

"Should we go streaming-first?" is the most common architecture question we get from data leaders in 2026. The honest answer is: usually not the way you mean. Kafka-by-default is among the most expensive architecture mistakes on the modern data roadmap, second only to skipping the governance layer entirely.

This post is the framework we use on $25K+ Data Architecture engagements when streaming vs batch is on the table. Vendor-neutral, no kickback agreements.

The three real workload categories

The streaming-vs-batch decision is not actually binary. Three categories, three different right answers:

1. Analytical batch (the 80% case)

"We need yesterday's numbers in the dashboard by 9am." Reports. KPIs. Finance. Operations summaries. ML training data. Most of what data teams actually build.

Right answer: batch ELT. Fivetran / Airbyte / native CDC into Bronze; dbt transforms into Silver / Gold; orchestrated daily or hourly with Airflow / Dagster / Prefect / dbt Cloud. Materialize the results as warehouse tables and serve them to BI tools.
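
As a rough illustration of the Bronze / Silver / Gold flow, here is a minimal Python sketch. In a real stack these steps would normally be dbt SQL models; the row shape, the `loaded_at` column, and the function names `to_silver` and `to_gold` are assumptions made up for this example, and the layer definitions (raw, cleaned, aggregated) follow the common convention rather than anything specific to one platform.

```python
from collections import defaultdict

# Bronze: raw landed rows, duplicates and string-typed amounts included (hypothetical data).
bronze = [
    {"order_id": 1, "day": "2026-01-05", "amount": "40.00", "loaded_at": 1},
    {"order_id": 1, "day": "2026-01-05", "amount": "45.00", "loaded_at": 2},  # later correction
    {"order_id": 2, "day": "2026-01-05", "amount": "15.50", "loaded_at": 1},
    {"order_id": 3, "day": "2026-01-06", "amount": "20.00", "loaded_at": 1},
]

def to_silver(rows):
    """Keep the most recently loaded version of each order and cast amount to a float."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["loaded_at"]):
        latest[row["order_id"]] = {**row, "amount": float(row["amount"])}
    return list(latest.values())

def to_gold(rows):
    """Aggregate cleaned orders into one revenue figure per day."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["day"]] += row["amount"]
    return dict(revenue)

# One scheduled run: rebuild Silver and Gold from whatever has landed in Bronze.
gold = to_gold(to_silver(bronze))
```

The whole pipeline is a pure function of what has landed, which is why a daily or hourly scheduler is enough: a failed run is fixed by running it again.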

2. Operational analytics (the 15% case)

"We need to see fraud signals within 30 seconds." "We need a recommender that updates within 1 minute of user action." "We need an inventory dashboard the warehouse staff can trust to-the-minute."

Right answer: CDC streaming + stream-native processing. Debezium / Fivetran HVR / native CDC pushes change events into Kafka / Pulsar / Kinesis. A stream-native processing engine (Flink / Spark Structured Streaming / Materialize / RisingWave / ClickHouse) maintains incrementally materialized views. The warehouse may not even be in the loop for the critical path.
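
To make "incrementally materialized view" concrete, here is a minimal Python sketch of what such an engine does conceptually: it adjusts a running result per change event instead of recomputing from scratch. The events follow the Debezium envelope convention of `op`, `before`, and `after` fields; the inventory row shape (`sku`, `qty`) and the function name `apply_change` are hypothetical, and the sketch assumes the source is configured so that updates and deletes carry a full `before` image.

```python
from collections import defaultdict

def apply_change(stock_by_sku, event):
    """Apply one Debezium-style change event to a running per-SKU stock total."""
    op = event["op"]  # "c" create, "u" update, "d" delete, "r" snapshot read
    before, after = event.get("before"), event.get("after")
    # Retract the old row's contribution, then add the new row's contribution.
    if op in ("u", "d") and before is not None:
        stock_by_sku[before["sku"]] -= before["qty"]
    if op in ("c", "r", "u") and after is not None:
        stock_by_sku[after["sku"]] += after["qty"]
    return stock_by_sku

# Hypothetical change stream for an inventory table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "sku": "A", "qty": 10}},
    {"op": "c", "before": None, "after": {"id": 2, "sku": "A", "qty": 5}},
    {"op": "u", "before": {"id": 1, "sku": "A", "qty": 10},
     "after": {"id": 1, "sku": "A", "qty": 7}},
    {"op": "d", "before": {"id": 2, "sku": "A", "qty": 5}, "after": None},
]

view = defaultdict(int)
for event in events:
    apply_change(view, event)
```

Each event costs constant work regardless of table size, which is where the latency advantage over a scheduled full refresh comes from. Production engines add state checkpointing, ordering guarantees, and failure recovery on top of this idea.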

3. Event-driven application (the 5% case)

"Our app fires order events; downstream needs to react." This is application architecture, not data architecture. Same tooling, different ownership.

Right answer: Kafka / Pulsar as the application backbone, with stream processing for materialized read models. The data warehouse is a downstream consumer (Iceberg / Delta / Hudi sinks), not the source of truth.

Why "Kafka by default" is the expensive mistake

Three failure modes we see repeatedly:

  1. Standing up Kafka for an 80%-batch workload. Kafka adds: a stateful cluster to operate, schema registry, connector orchestration, dead-letter handling, exactly-once semantics, consumer-lag monitoring, replay strategies. None of that pays off if your real requirement is "daily refresh."

  2. Treating Kafka as a database. Topic retention forever, no compaction strategy, fan-out to a dozen consumers each doing slightly different processing. Eventually the bill (in storage + operational complexity) exceeds the warehouse it was meant to feed.

  3. Skipping the warehouse. Streaming-only architectures look elegant on a whiteboard. In practice, the warehouse is what BI tools, analysts, and auditors need. Skipping it means rebuilding it inside the streaming layer — at much higher cost.
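
The compaction strategy mentioned in the second failure mode can be sketched in a few lines of Python. This is a toy model of the semantics only: keep the latest record per key, and treat a `None` value as a tombstone that deletes the key. The keys and values are hypothetical. In Kafka itself this behavior is enabled per topic with `cleanup.policy=compact`, and the broker keeps tombstones around for a configurable period before dropping them, which this sketch ignores.

```python
def compact(log):
    """Keep only the latest value per key; a None value is a tombstone that removes the key."""
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest

# Hypothetical changelog topic as (key, value) pairs in offset order.
log = [
    ("cust-1", {"tier": "free"}),
    ("cust-2", {"tier": "pro"}),
    ("cust-1", {"tier": "pro"}),   # supersedes the first cust-1 record
    ("cust-2", None),              # tombstone: cust-2 was deleted
]

compacted = compact(log)
```

Without compaction or a finite retention window, every superseded record stays on disk forever, which is how topic storage ends up outgrowing the warehouse it feeds.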

The right pattern in 2026 is almost always batch as the default, streaming where it earns its cost.

The ingestion landscape

| Tool | Best for | Worst at |
|---|---|---|
| Fivetran | SaaS connectors, lean team, fast rollout | High-volume custom sources (cost scales fast) |
| Airbyte | Open source flexibility, custom connectors | Operating overhead vs Fivetran |
| Fivetran HVR | High-volume CDC from RDBMS / mainframe | License cost; depth of feature set is overkill for small estates |
| Informatica IDMC | Regulated enterprise, multi-platform CDC | Modern developer ergonomics |
| Talend | Open-source heritage, broad connector library | UX and platform momentum |
| Striim | Real-time CDC with built-in pipelines | Newer in regulated industries |
| Debezium | Open-source CDC, self-hosted | DIY operations, schema-evolution discipline |
| Native CDC | Snowflake Streams, Databricks DLT, BigQuery CDC | Cross-platform portability |

For most clients we recommend Fivetran for SaaS-to-warehouse + native CDC for warehouse-to-warehouse + Debezium for high-volume operational CDC. The "one ingestion tool to rule them all" instinct usually loses to a polyglot stack.

The stream-processing landscape

| Engine | Best for | Worst at |
|---|---|---|
| Apache Flink | Stateful complex event processing, lowest-latency | Operational complexity, steep learning curve |
| Spark Structured Streaming | Spark-fluent teams, batch + streaming unification | True millisecond latency |
| Kafka Streams | Already-on-Kafka stacks, JVM ecosystem | Polyglot teams; ops still real |
| Materialize | SQL-native incremental views, low-latency analytics | Niche awareness; vendor maturity |
| RisingWave | Postgres-compatible streaming SQL, open source | Smaller ecosystem than Flink |
| ClickHouse | Real-time OLAP, ingest + query at scale | Not a true stream processor (window over append) |
| Snowflake Dynamic Tables | In-warehouse incremental, no separate cluster | Warehouse-credit costs at high refresh frequency |
| Databricks DLT | Lakehouse incremental, Unity Catalog lineage | Spark-DBU economics |

For most clients in 2026 we recommend Snowflake Dynamic Tables or Databricks DLT for the analytical-incremental layer — staying inside the warehouse — and reserve Flink / Materialize / RisingWave for the truly latency-critical operational analytics workloads.

Lakehouse table format: Iceberg vs Hudi vs Delta

Streaming sinks need a table format that supports concurrent writes + ACID + schema evolution + time travel. The three options:

  • Apache Iceberg — emerging open standard, multi-engine (Snowflake, Databricks, BigQuery, ClickHouse all read it), strongest 2026 momentum
  • Delta Lake — Databricks-native, strongest tooling integration inside Databricks, Iceberg compatibility via UniForm
  • Apache Hudi — record-level upserts, mature CDC story, niche outside Uber-scale workloads

For greenfield in 2026: Iceberg. The multi-engine readability is the right insurance against vendor lock-in. If already deep in Databricks: Delta with UniForm enabled.
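
For readers less familiar with why a table format gives you ACID commits and time travel, here is a toy Python model of the snapshot idea that these formats share: each commit produces a new immutable snapshot, writers commit optimistically against the snapshot they started from, and readers can ask for any retained snapshot. The class `SnapshotTable` and its methods are hypothetical and are not the API of Iceberg, Delta, or Hudi. The toy is also stricter than the real formats, which can usually re-apply non-conflicting appends automatically instead of rejecting them.

```python
class SnapshotTable:
    """Toy table: every commit appends an immutable snapshot that readers can select."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    @property
    def current_id(self):
        return len(self.snapshots) - 1

    def commit(self, base_id, new_rows):
        """Optimistic concurrency: succeed only if nobody has committed since base_id."""
        if base_id != self.current_id:
            raise RuntimeError("conflict: table moved on, refresh and retry")
        self.snapshots.append(self.snapshots[base_id] + tuple(new_rows))
        return self.current_id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time-travel to an older one by id."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

table = SnapshotTable()
base = table.current_id  # two writers start from the same snapshot
table.commit(base, [{"order_id": 1}])
try:
    table.commit(base, [{"order_id": 2}])  # second writer commits against a stale base
    conflicted = False
except RuntimeError:
    conflicted = True
    table.commit(table.current_id, [{"order_id": 2}])  # retry on the fresh snapshot
```

Because readers only ever see complete snapshots, a streaming sink can commit continuously while BI queries run against a consistent view of the table.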

Pattern recommendations by workload

| Workload | Ingestion | Storage | Processing | Serving |
|---|---|---|---|---|
| Daily BI dashboards | Fivetran / Airbyte | Warehouse-native | dbt | Snowflake / BigQuery / Databricks |
| Hourly operational dashboards | Native CDC | Warehouse-native | Dynamic Tables / DLT | Same warehouse |
| Sub-minute fraud / recommender | Debezium → Kafka | Iceberg | Flink / Materialize | ClickHouse / Pinot |
| Event-driven app | App writes Kafka | Iceberg sink | Flink stateful | Kafka topics + materialized read models |
| ML training data | Fivetran / native CDC | Iceberg or Delta | Spark batch | Feature store (Feast / Tecton) |

How we pick

The decision tree on a $25K+ Data Architecture engagement:

  1. What is the actual latency requirement, in writing, with an SLA owner? Most "real-time" is daily-or-hourly when pressed.
  2. What is the cost of being wrong? Streaming infrastructure costs 3-10× batch for the same throughput. The premium has to buy something specific.
  3. Who owns the streaming infrastructure on-call? If the answer is "we figure that out later," streaming is the wrong choice.
  4. What is the source data shape? SaaS APIs → batch is almost always right. RDBMS with CDC enabled → streaming is cheaper than you think. Application events → streaming is the right answer.
  5. What is the warehouse / lakehouse already in place? If you already have one, use it for analytical workloads; reserve streaming for operational analytics.
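
Parts of the decision tree above can be written down as a small function, which is a useful exercise for forcing the answers into the open. The sketch below encodes only questions 1, 3, and 4; the cost question and the existing-platform question need judgment that does not reduce to a flag. The `Workload` fields, the `recommend` function, and the 60-second cutoff (taken from the "sub-minute" row in the pattern table) are all assumptions for illustration, not a scoring model we ship.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "sub-minute" is treated as the point where streaming starts to earn its cost.
STREAMING_CUTOFF_SECONDS = 60

@dataclass
class Workload:
    latency_sla_seconds: Optional[float]  # None means no written SLA with a named owner
    has_oncall_owner: bool                # someone is named for streaming on-call
    source: str                           # "saas_api", "rdbms_cdc", or "app_events"

def recommend(w: Workload) -> str:
    """Return "batch" or "streaming" for one workload, defaulting to batch."""
    if w.latency_sla_seconds is None:
        return "batch"      # question 1: no written requirement, so default to batch
    if not w.has_oncall_owner:
        return "batch"      # question 3: nobody to operate the streaming layer
    if w.source == "saas_api":
        return "batch"      # question 4: SaaS APIs are almost always batch
    if w.source == "app_events":
        return "streaming"  # question 4: application events are streaming by nature
    if w.source == "rdbms_cdc" and w.latency_sla_seconds <= STREAMING_CUTOFF_SECONDS:
        return "streaming"  # CDC source with a genuinely tight SLA
    return "batch"

fraud = Workload(latency_sla_seconds=30, has_oncall_owner=True, source="rdbms_cdc")
dashboard = Workload(latency_sla_seconds=3600, has_oncall_owner=True, source="rdbms_cdc")
```

Here `recommend(fraud)` returns "streaming" and `recommend(dashboard)` returns "batch". The function falls through to batch unless every precondition for streaming is met, which mirrors the batch-first default argued for throughout this post.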

The honest summary

Default to batch. Add streaming where the latency requirement, the source-data shape, or the application architecture genuinely demands it. The cost of building a maintainable batch pipeline first and adding streaming later is much lower than the cost of standing up Kafka for a workload that did not need it.

Concrete next step

If the streaming-vs-batch decision is on the architecture roadmap, a $25K Data Architecture engagement returns a fixed-bid recommendation with:

  • Target-state diagram (ingestion + storage + processing + serving for each workload class)
  • Workload-by-workload latency SLA proposal with owners
  • 3-year TCO comparison for streaming-first vs batch-first vs polyglot
  • Migration path from existing state to target state

Start the intake. Fixed-bid SOW returned in 3 business days. See also the broader platform-selection framework, the in-warehouse AI comparison, and the vector database framework.
