Streaming vs Batch Data Architecture: Kafka, Debezium, Iceberg, Materialize in 2026
by Green Dolphin Software, Data architecture practice

"Should we go streaming-first?" is the most common architecture question we get from data leaders in 2026. The honest answer: usually not the way you mean. Kafka-by-default is one of the most expensive architecture mistakes on the modern data roadmap, second only to skipping the governance layer entirely.
This post is the framework we use on $25K+ Data Architecture engagements when streaming vs batch is on the table. Vendor-neutral, no kickback agreements.
The three real workload categories
The streaming-vs-batch decision is not actually binary. Three categories, three different right answers:
1. Analytical batch (the 80% case)
"We need yesterday's numbers in the dashboard by 9am." Reports. KPIs. Finance. Operations summaries. ML training data. Most of what data teams actually build.
Right answer: batch ELT. Fivetran / Airbyte / native CDC into Bronze; dbt transforms into Silver / Gold; orchestrated daily or hourly with Airflow / Dagster / Prefect / dbt Cloud. Materialize the Gold tables in the warehouse and serve them to BI tools.
2. Operational analytics (the 15% case)
"We need to see fraud signals within 30 seconds." "We need a recommender that updates within 1 minute of user action." "We need an inventory dashboard the warehouse staff can trust to-the-minute."
Right answer: CDC streaming + stream-native processing. Debezium / Fivetran HVR / native CDC pushes change events into Kafka / Pulsar / Kinesis. A stream-native processing engine (Flink / Spark Structured Streaming / Materialize / RisingWave / ClickHouse) maintains incrementally materialized views. The warehouse may not even be on the critical path.
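To make "incrementally materialized view" concrete, here is a minimal Python sketch of what such an engine does conceptually: it applies each change event to a running aggregate instead of recomputing from scratch. The events follow the Debezium envelope shape (`op`, `before`, `after`); the `customer_id` and `amount` columns and the `apply_change` helper are hypothetical. The sketch assumes every event carries full `before` and `after` row images, which depends on how the source database and connector are configured.

```python
from collections import defaultdict

def apply_change(totals, event):
    """Apply one Debezium-style change event to a per-customer running total."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    before, after = event.get("before"), event.get("after")
    # Retract the old row's contribution, if there was one.
    if op in ("u", "d") and before is not None:
        totals[before["customer_id"]] -= before["amount"]
    # Add the new row's contribution, if there is one.
    if op in ("c", "r", "u") and after is not None:
        totals[after["customer_id"]] += after["amount"]
    return totals

events = [
    {"op": "c", "before": None, "after": {"id": 1, "customer_id": "a", "amount": 40}},
    {"op": "c", "before": None, "after": {"id": 2, "customer_id": "b", "amount": 15}},
    {"op": "u", "before": {"id": 1, "customer_id": "a", "amount": 40},
                "after": {"id": 1, "customer_id": "a", "amount": 55}},
    {"op": "d", "before": {"id": 2, "customer_id": "b", "amount": 15}, "after": None},
]

totals = defaultdict(int)
for event in events:
    apply_change(totals, event)

print(dict(totals))  # {'a': 55, 'b': 0}
```

Each event costs a constant amount of work regardless of table size, which is the property that makes sub-minute freshness affordable. A production engine also has to handle ordering, retries, and state recovery, none of which this sketch attempts.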
3. Event-driven application (the 5% case)
"Our app fires order events; downstream needs to react." This is application architecture, not data architecture. Same tooling, different ownership.
Right answer: Kafka / Pulsar as the application backbone, with stream processing for materialized read models. The data warehouse is a downstream consumer (Iceberg / Delta / Hudi sinks), not the source of truth.
Why "Kafka by default" is the expensive mistake
Three failure modes we see repeatedly:
- Standing up Kafka for an 80%-batch workload. Kafka adds: a stateful cluster to operate, schema registry, connector orchestration, dead-letter handling, exactly-once semantics, consumer-lag monitoring, replay strategies. None of that pays off if your real requirement is "daily refresh."
- Treating Kafka as a database. Topic retention forever, no compaction strategy, fan-out to a dozen consumers each doing slightly different processing. Eventually the bill (in storage + operational complexity) exceeds the warehouse it was meant to feed.
- Skipping the warehouse. Streaming-only architectures look elegant on a whiteboard. In practice, the warehouse is what BI tools, analysts, and auditors need. Skipping it means rebuilding it inside the streaming layer, at much higher cost.
The right pattern in 2026 is almost always batch as the default, streaming where it earns its cost.
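For teams unsure what a "compaction strategy" buys them, the toy model below shows the end state a key-compacted topic converges to: the latest record per key, with deleted keys (a `None` value standing in for a tombstone) dropped. The `compact` function and the SKU data are hypothetical illustrations. Real Kafka compaction runs in the background on closed log segments and retains tombstones for a configurable period, so treat this as a picture of the semantics rather than the mechanism.

```python
def compact(log):
    """Return what a key-compacted log converges to:
    the latest record per key, with tombstoned keys removed."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    # Drop keys whose latest record is a tombstone (value is None).
    survivors = [(off, k, v) for k, (off, v) in latest.items() if v is not None]
    # Surviving records keep their original relative order.
    return [(k, v) for off, k, v in sorted(survivors)]

log = [
    ("sku-1", 10),
    ("sku-2", 4),
    ("sku-1", 7),     # supersedes the first sku-1 record
    ("sku-2", None),  # tombstone: sku-2 was deleted
    ("sku-3", 1),
]

print(compact(log))  # [('sku-1', 7), ('sku-3', 1)]
```

With infinite retention and no compaction, all five records stay on disk forever; with compaction, storage tracks the number of live keys rather than the number of events ever written.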
The ingestion landscape
| Tool | Best for | Worst at |
|---|---|---|
| Fivetran | SaaS connectors, lean team, fast rollout | High-volume custom sources (cost scales fast) |
| Airbyte | Open source flexibility, custom connectors | Operating overhead vs Fivetran |
| Fivetran HVR | High-volume CDC from RDBMS / mainframe | License cost; depth of feature set is overkill for small estates |
| Informatica IDMC | Regulated enterprise, multi-platform CDC | Modern developer ergonomics |
| Talend | Open-source heritage, broad connector library | UX and platform momentum |
| Striim | Real-time CDC with built-in pipelines | Newer in regulated industries |
| Debezium | Open-source CDC, self-hosted | DIY operations, schema-evolution discipline |
| Native CDC (Snowflake Streams, Databricks DLT, BigQuery CDC) | Staying inside one platform | Cross-platform portability |
For most clients we recommend Fivetran for SaaS-to-warehouse + native CDC for warehouse-to-warehouse + Debezium for high-volume operational CDC. The "one ingestion tool to rule them all" instinct usually loses to a polyglot stack.
The stream-processing landscape
| Engine | Best for | Worst at |
|---|---|---|
| Apache Flink | Stateful complex event processing, lowest-latency | Operational complexity, steep learning curve |
| Spark Structured Streaming | Spark-fluent teams, batch + streaming unification | True millisecond latency |
| Kafka Streams | Already-on-Kafka stacks, JVM ecosystem | Polyglot teams; ops still real |
| Materialize | SQL-native incremental views, low-latency analytics | Niche awareness; vendor maturity |
| RisingWave | Postgres-compatible streaming SQL, open source | Smaller ecosystem than Flink |
| ClickHouse | Real-time OLAP, ingest + query at scale | Not a true stream processor (window over append) |
| Snowflake Dynamic Tables | In-warehouse incremental, no separate cluster | Warehouse-credit costs at high refresh frequency |
| Databricks DLT | Lakehouse incremental, Unity Catalog lineage | Spark-DBU economics |
For most clients in 2026 we recommend Snowflake Dynamic Tables or Databricks DLT for the analytical-incremental layer — staying inside the warehouse — and reserve Flink / Materialize / RisingWave for the truly latency-critical operational analytics workloads.
Lakehouse table format: Iceberg vs Hudi vs Delta
Streaming sinks need a table format that supports concurrent writes + ACID + schema evolution + time travel. The three options:
- Apache Iceberg — emerging open standard, multi-engine (Snowflake, Databricks, BigQuery, ClickHouse all read it), strongest 2026 momentum
- Delta Lake — Databricks-native, strongest tooling integration inside Databricks, Iceberg compatibility via UniForm
- Apache Hudi — record-level upserts, mature CDC story, niche outside Uber-scale workloads
For greenfield in 2026: Iceberg. The multi-engine readability is the right insurance against vendor lock-in. If already deep in Databricks: Delta with UniForm enabled.
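Two of the requirements above, ACID commits under concurrent writers and time travel, come from the same idea: every commit produces a new immutable snapshot. The toy class below sketches that idea in Python. `SnapshotTable` and its methods are hypothetical and do not mirror the actual metadata layout of Iceberg, Delta, or Hudi; schema evolution is left out entirely.

```python
class SnapshotTable:
    """Toy table: every commit produces a new immutable snapshot."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    @property
    def current_id(self):
        return len(self.snapshots) - 1

    def commit(self, base_id, new_rows):
        # Optimistic concurrency: reject a commit built on a stale snapshot.
        if base_id != self.current_id:
            raise RuntimeError("stale snapshot; refresh and retry")
        self.snapshots.append(self.snapshots[base_id] + tuple(new_rows))
        return self.current_id

    def scan(self, snapshot_id=None):
        # Time travel: read any earlier snapshot by id.
        sid = self.current_id if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

table = SnapshotTable()
base = table.current_id
s1 = table.commit(base, [{"order_id": 1}])

# A second writer that started from the same base loses the race.
try:
    table.commit(base, [{"order_id": 2}])
    conflict = False
except RuntimeError:
    conflict = True

s2 = table.commit(table.current_id, [{"order_id": 2}])
print(conflict, table.scan(s1), table.scan())
```

Real table formats typically retry or reconcile non-conflicting commits rather than failing outright, and they store snapshots as metadata pointing at data files instead of copying rows. The point of the sketch is only that readers always see a complete snapshot and that old snapshots remain queryable.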
Pattern recommendations by workload
| Workload | Ingestion | Storage | Processing | Serving |
|---|---|---|---|---|
| Daily BI dashboards | Fivetran / Airbyte | Warehouse-native | dbt | Snowflake / BigQuery / Databricks |
| Hourly operational dashboards | Native CDC | Warehouse-native | Dynamic Tables / DLT | Same warehouse |
| Sub-minute fraud / recommender | Debezium → Kafka | Iceberg | Flink / Materialize | ClickHouse / Pinot |
| Event-driven app | App writes Kafka | Iceberg sink | Flink stateful | Kafka topics + materialized read models |
| ML training data | Fivetran / native CDC | Iceberg or Delta | Spark batch | Feature store (Feast / Tecton) |
How we pick
The decision tree on a $25K+ Data Architecture engagement:
- What is the actual latency requirement, in writing, with an SLA owner? Most "real-time" is daily-or-hourly when pressed.
- What is the cost of being wrong? Streaming infrastructure costs 3-10× batch for the same throughput. The premium has to buy something specific.
- Who owns the streaming infrastructure on-call? If the answer is "we figure that out later," streaming is the wrong choice.
- What is the source data shape? SaaS APIs → batch is almost always right. RDBMS with CDC enabled → streaming is cheaper than you think. Application events → streaming is the right answer.
- What is the warehouse / lakehouse already in place? If you already have one, use it for analytical workloads; reserve streaming for operational analytics.
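The first four questions can be sketched as a small Python function. This is an illustration of the ordering, not our engagement tooling: the `Workload` fields, the `recommend` function, and the one-hour `batch_floor_seconds` cutoff are all hypothetical, and the last question (what platform is already in place) is not encoded. The cost helper simply applies the 3-10× premium quoted above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Workload:
    name: str
    latency_sla_seconds: float    # the written latency requirement
    sla_owner: Optional[str]      # who signed the SLA, if anyone
    oncall_owner: Optional[str]   # who carries the pager for streaming infra
    source: str                   # "saas_api", "rdbms_cdc", or "app_events"

def recommend(w: Workload, batch_floor_seconds: float = 3600) -> Tuple[str, str]:
    """Return (choice, reason). batch_floor_seconds is an assumed cutoff."""
    if w.sla_owner is None:
        return "batch", "no owned, written latency SLA"
    if w.latency_sla_seconds >= batch_floor_seconds:
        return "batch", "hourly-or-slower refresh meets the SLA"
    if w.oncall_owner is None:
        return "batch", "nobody owns streaming on-call yet"
    if w.source == "saas_api":
        return "batch", "SaaS APIs rarely justify streaming"
    return "streaming", f"sub-hour SLA on {w.source} with named owners"

def streaming_cost_range(batch_annual_cost: float) -> Tuple[float, float]:
    """Apply the 3-10x streaming premium to a batch cost estimate."""
    return 3 * batch_annual_cost, 10 * batch_annual_cost

fraud = Workload("fraud signals", 30, "risk lead", "platform team", "rdbms_cdc")
kpis = Workload("finance KPIs", 86_400, "finance lead", None, "saas_api")

print(recommend(fraud))
print(recommend(kpis))
print(streaming_cost_range(100_000))
```

Note the order of the checks: a workload only reaches "streaming" after it has a written SLA, a sub-hour requirement, and a named on-call owner. Everything else falls through to batch, which matches the default we argue for.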
The honest summary
Default to batch. Add streaming where the latency requirement, the source-data shape, or the application architecture genuinely demands it. The cost of building a maintainable batch pipeline first and adding streaming later is much lower than the cost of standing up Kafka for a workload that did not need it.
Concrete next step
If the streaming-vs-batch decision is on the architecture roadmap, a $25K Data Architecture engagement returns a fixed-bid recommendation with:
- Target-state diagram (ingestion + storage + processing + serving for each workload class)
- Workload-by-workload latency SLA proposal with owners
- 3-year TCO comparison for streaming-first vs batch-first vs polyglot
- Migration path from existing state to target state
Start the intake. Fixed-bid SOW returned in 3 business days. See also the broader platform-selection framework, the in-warehouse AI comparison, and the vector database framework.

