How long does an integration engagement take?

Starter engagements ship in ~3 weeks. Standard ~6. Enterprise ~8. Custom platform initiatives 10-12+ weeks. AI-augmented delivery cuts the typical 8-12 week industry timeline to 3-8 weeks without cutting quality.

What is the engagement floor price?

A basic integration starts at $10,000 (one source → one target, standard fields, low volume). Multi-integration engagements start at $25,000 (3–5 integrations sharing a pattern) and scale in $25K increments — $50K, $75K, $100K+. Every engagement is delivered as a fixed-bid SOW with target-state architecture diagram, returned within 3 business days of intake.

What integration platforms does Green Dolphin support?

MuleSoft Anypoint, Dell Boomi, Workato, Oracle Integration Cloud (OIC), TIBCO, Talend, SnapLogic, Informatica, Azure Integration Services, SAP CPI, Apigee, Kong. Plus custom Java (Spring Boot), .NET, and Node.js. Plus AWS (Lambda + EventBridge), Azure (Functions + Logic Apps), and GCP (Cloud Functions + Pub/Sub).

Do you offer time-and-materials engagements?

No. All Green Dolphin engagements are fixed-bid SOWs. T&M is not a billing model offered. If scope changes mid-engagement, a written change order with a new fixed price is issued for client approval.

What managed services options are available after delivery?

Managed services provide 10 hours/week of senior architect time as an optional add-on to any fixed-bid SOW. Available in 3-month ($25K), 6-month ($48K, 4% off), 12-month ($90K, 10% off), and 24-month ($168K, 16% off) terms.

What industries does Green Dolphin work in?

Financial Services, Healthcare, Retail, Telecommunications, Aerospace & Defense, Public Sector, Logistics & Supply Chain, and Manufacturing, including regulated environments under HIPAA, SOX, FedRAMP, GDPR, and PCI-DSS.

Streaming vs Batch Data Architecture: Kafka, Debezium, Iceberg, Materialize in 2026

May 16, 2026

by Green Dolphin Software, Data architecture practice

Streaming vs batch data architecture in 2026

"Should we go streaming-first?" is the most common architecture question we get from data leaders in 2026. The honest answer is: usually not the way you mean. Kafka-by-default is the most expensive architecture mistake on the modern data roadmap — second only to skipping the governance layer entirely.

This post is the framework we use on $25K+ Data Architecture engagements when streaming vs batch is on the table. Vendor-neutral, no kickback agreements.

The three real workload categories

The streaming-vs-batch decision is not actually binary. Three categories, three different right answers:

1. Analytical batch (the 80% case)

"We need yesterday's numbers in the dashboard by 9am." Reports. KPIs. Finance. Operations summaries. ML training data. Most of what data teams actually build.

Right answer: batch ELT. Fivetran / Airbyte / native CDC into Bronze; dbt transforms into Silver / Gold; orchestrated daily or hourly with Airflow / Dagster / Prefect / dbt Cloud. Materialize the results as warehouse tables and serve them to BI tools.
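A minimal sketch of the Bronze → Silver → Gold flow, using Python's built-in sqlite3 as a stand-in for the warehouse. The table names, columns, and sample rows are hypothetical; in practice the load step would be an ingestion tool and the transforms would be dbt models triggered by an orchestrator.

```python
import sqlite3


def run_daily_batch(conn, raw_orders):
    """One scheduled run: land raw rows in Bronze, clean into Silver, aggregate into Gold."""
    cur = conn.cursor()

    # Bronze: raw rows landed as-is, everything still text.
    cur.execute("DROP TABLE IF EXISTS bronze_orders")
    cur.execute("CREATE TABLE bronze_orders (order_id TEXT, order_date TEXT, amount TEXT)")
    cur.executemany("INSERT INTO bronze_orders VALUES (?, ?, ?)", raw_orders)

    # Silver: typed, exact duplicates removed, rows with no amount dropped.
    cur.execute("DROP TABLE IF EXISTS silver_orders")
    cur.execute(
        """
        CREATE TABLE silver_orders AS
        SELECT DISTINCT order_id, order_date, CAST(amount AS REAL) AS amount
        FROM bronze_orders
        WHERE amount IS NOT NULL AND amount <> ''
        """
    )

    # Gold: the table a dashboard would read.
    cur.execute("DROP TABLE IF EXISTS gold_daily_revenue")
    cur.execute(
        """
        CREATE TABLE gold_daily_revenue AS
        SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM silver_orders
        GROUP BY order_date
        """
    )
    conn.commit()
    return cur.execute(
        "SELECT order_date, orders, revenue FROM gold_daily_revenue ORDER BY order_date"
    ).fetchall()


# Hypothetical sample data: one duplicate row and one row with a missing amount.
raw = [
    ("o1", "2026-05-15", "10.50"),
    ("o1", "2026-05-15", "10.50"),
    ("o2", "2026-05-15", "4.50"),
    ("o3", "2026-05-16", ""),
    ("o4", "2026-05-16", "20"),
]
gold = run_daily_batch(sqlite3.connect(":memory:"), raw)
```

Each run rebuilds the tables from scratch, which is the simplest batch strategy and is often enough for a daily refresh.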

2. Operational analytics (the 15% case)

"We need to see fraud signals within 30 seconds." "We need a recommender that updates within 1 minute of user action." "We need an inventory dashboard the warehouse staff can trust to-the-minute."

Right answer: CDC streaming + stream-native processing. Debezium / Fivetran HVR / native CDC pushes change events into Kafka / Pulsar / Kinesis. A stream-native processing engine (Flink / Spark Structured Streaming / Materialize / RisingWave / ClickHouse) maintains incrementally materialized views. The warehouse may not even be on the critical path.
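To make "incrementally materialized view" concrete, here is a small sketch that keeps a per-SKU on-hand total up to date from change events shaped like Debezium's envelope (`op`, `before`, `after`). The `sku` and `qty` fields and the sample events are hypothetical, and the sketch assumes every update and delete carries a full `before` image, which depends on how the source database is configured.

```python
from collections import defaultdict


def apply_change(view, event):
    """Fold one change event into the running totals without rescanning the source table."""
    before = event.get("before")
    after = event.get("after")
    if before is not None:
        # Update or delete: retract the old row's contribution.
        view[before["sku"]] -= before["qty"]
    if after is not None:
        # Create or update: add the new row's contribution.
        view[after["sku"]] += after["qty"]
    return view


# Hypothetical change stream for an inventory table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "sku": "A", "qty": 5}},
    {"op": "c", "before": None, "after": {"id": 2, "sku": "A", "qty": 3}},
    {"op": "u", "before": {"id": 1, "sku": "A", "qty": 5}, "after": {"id": 1, "sku": "A", "qty": 2}},
    {"op": "c", "before": None, "after": {"id": 3, "sku": "B", "qty": 7}},
    {"op": "d", "before": {"id": 2, "sku": "A", "qty": 3}, "after": None},
]

on_hand = defaultdict(int)
for event in events:
    apply_change(on_hand, event)
```

The work per event is constant regardless of table size, which is the property the stream-native engines above provide at scale, along with state management and fault tolerance that this sketch leaves out.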

3. Event-driven application (the 5% case)

"Our app fires order events; downstream needs to react." This is application architecture, not data architecture. Same tooling, different ownership.

Right answer: Kafka / Pulsar as the application backbone, with stream processing for materialized read models. The data warehouse is a downstream consumer (Iceberg / Delta / Hudi sinks), not the source of truth.

Why "Kafka by default" is the expensive mistake

Three failure modes we see repeatedly:

Standing up Kafka for an 80%-batch workload. Kafka adds: a stateful cluster to operate, schema registry, connector orchestration, dead-letter handling, exactly-once semantics, consumer-lag monitoring, replay strategies. None of that pays off if your real requirement is "daily refresh."
Treating Kafka as a database. Topic retention forever, no compaction strategy, fan-out to a dozen consumers each doing slightly different processing. Eventually the bill (storage plus operational complexity) exceeds the cost of the warehouse it was meant to feed.
Skipping the warehouse. Streaming-only architectures look elegant on a whiteboard. In practice, the warehouse is what BI tools, analysts, and auditors need. Skipping it means rebuilding it inside the streaming layer — at much higher cost.
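On the "no compaction strategy" point, a toy model shows what compaction buys. This is a sketch of the idea, not Kafka's implementation: it keeps the latest record per key and treats a `None` value as a tombstone that removes the key (real Kafka retains tombstones for a configurable period before dropping them). The keys and values are hypothetical.

```python
def compact(log):
    """Return the log with only the latest record per key, in original offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    # Drop keys whose latest record is a tombstone (value is None).
    return [(key, value) for key, (_, value) in survivors if value is not None]


# Hypothetical topic contents: (key, value) pairs in offset order.
log = [
    ("u1", "a"),
    ("u2", "b"),
    ("u1", "c"),   # supersedes offset 0
    ("u2", None),  # tombstone: u2 is deleted
    ("u3", "d"),
]
compacted = compact(log)
```

Without compaction, storage grows with the number of updates ever written; with it, storage is bounded by the number of live keys.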

The right pattern in 2026 is almost always batch as the default, streaming where it earns its cost.

The ingestion landscape

Tool	Best for	Worst at
Fivetran	SaaS connectors, lean team, fast rollout	High-volume custom sources (cost scales fast)
Airbyte	Open source flexibility, custom connectors	Operating overhead vs Fivetran
Fivetran HVR	High-volume CDC from RDBMS / mainframe	License cost; depth of feature set is overkill for small estates
Informatica IDMC	Regulated enterprise, multi-platform CDC	Modern developer ergonomics
Talend	Open-source heritage, broad connector library	UX and platform momentum
Striim	Real-time CDC with built-in pipelines	Newer in regulated industries
Debezium	Open-source CDC, self-hosted	DIY operations, schema-evolution discipline
Native CDC	Staying inside one platform (Snowflake Streams, Databricks DLT, BigQuery CDC)	Cross-platform portability

For most clients we recommend Fivetran for SaaS-to-warehouse + native CDC for warehouse-to-warehouse + Debezium for high-volume operational CDC. The "one ingestion tool to rule them all" instinct usually loses to a polyglot stack.

The stream-processing landscape

Engine	Best for	Worst at
Apache Flink	Stateful complex event processing, lowest-latency	Operational complexity, steep learning curve
Spark Structured Streaming	Spark-fluent teams, batch + streaming unification	True millisecond latency
Kafka Streams	Already-on-Kafka stacks, JVM ecosystem	Polyglot teams; ops still real
Materialize	SQL-native incremental views, low-latency analytics	Niche awareness; vendor maturity
RisingWave	Postgres-compatible streaming SQL, open source	Smaller ecosystem than Flink
ClickHouse	Real-time OLAP, ingest + query at scale	Not a true stream processor (window over append)
Snowflake Dynamic Tables	In-warehouse incremental, no separate cluster	Warehouse-credit costs at high refresh frequency
Databricks DLT	Lakehouse incremental, Unity Catalog lineage	Spark-DBU economics

For most clients in 2026 we recommend Snowflake Dynamic Tables or Databricks DLT for the analytical-incremental layer — staying inside the warehouse — and reserve Flink / Materialize / RisingWave for the truly latency-critical operational analytics workloads.

Lakehouse table format: Iceberg vs Hudi vs Delta

Streaming sinks need a table format that supports concurrent writes + ACID + schema evolution + time travel. The three options:

Apache Iceberg — emerging open standard, multi-engine (Snowflake, Databricks, BigQuery, ClickHouse all read it), strongest 2026 momentum
Delta Lake — Databricks-native, strongest tooling integration inside Databricks, Iceberg compatibility via UniForm
Apache Hudi — record-level upserts, mature CDC story, niche outside Uber-scale workloads

For greenfield in 2026: Iceberg. The multi-engine readability is the right insurance against vendor lock-in. If already deep in Databricks: Delta with UniForm enabled.
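The requirements listed above (concurrent writes, atomic commits, time travel) share one underlying idea: every commit produces a new immutable snapshot. The toy class below illustrates that idea only. It is not how Iceberg, Delta, or Hudi are implemented; real formats track data files and metadata rather than rows in memory, and typically retry non-conflicting appends instead of rejecting them. The class and method names are hypothetical.

```python
class SnapshotTable:
    """Toy table: each commit appends an immutable snapshot; readers choose one by id."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    def current_id(self):
        return len(self.snapshots) - 1

    def commit(self, base_id, new_rows):
        # Optimistic concurrency: the writer states which snapshot it read.
        # If another writer has committed since then, this commit is rejected.
        if base_id != self.current_id():
            raise RuntimeError(f"conflict: table changed since snapshot {base_id}")
        self.snapshots.append(self.snapshots[base_id] + tuple(new_rows))
        return self.current_id()

    def read(self, snapshot_id=None):
        # Time travel: any earlier snapshot remains readable by id.
        if snapshot_id is None:
            snapshot_id = self.current_id()
        return list(self.snapshots[snapshot_id])


table = SnapshotTable()
s1 = table.commit(0, ["row-1", "row-2"])
s2 = table.commit(s1, ["row-3"])

try:
    table.commit(s1, ["row-4"])  # stale writer: based on s1, but s2 already exists
    stale_rejected = False
except RuntimeError:
    stale_rejected = True
```

Readers of `s1` are unaffected by the later commit, which is what lets a streaming sink write while BI queries read a consistent view.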

Pattern recommendations by workload

Workload	Ingestion	Storage	Processing	Serving
Daily BI dashboards	Fivetran / Airbyte	Warehouse-native	dbt	Snowflake / BigQuery / Databricks
Hourly operational dashboards	Native CDC	Warehouse-native	Dynamic Tables / DLT	Same warehouse
Sub-minute fraud / recommender	Debezium → Kafka	Iceberg	Flink / Materialize	ClickHouse / Pinot
Event-driven app	App writes Kafka	Iceberg sink	Flink stateful	Kafka topics + materialized read models
ML training data	Fivetran / native CDC	Iceberg or Delta	Spark batch	Feature store (Feast / Tecton)

How we pick

The decision tree on a $25K+ Data Architecture engagement:

What is the actual latency requirement, in writing, with an SLA owner? Most "real-time" is daily-or-hourly when pressed.
What is the cost of being wrong? Streaming infrastructure costs 3-10× batch for the same throughput. The premium has to buy something specific.
Who owns the streaming infrastructure on-call? If the answer is "we figure that out later," streaming is the wrong choice.
What is the source data shape? SaaS APIs → batch is almost always right. RDBMS with CDC enabled → streaming is cheaper than you think. Application events → streaming is the right answer.
What is the warehouse / lakehouse already in place? If you already have one, use it for analytical workloads; reserve streaming for operational analytics.
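Part of this decision tree can be written down as a function. The sketch below encodes only the latency, on-call, and source-shape questions; the cost and existing-platform questions need numbers and context that do not reduce to a rule. The field names, the source labels, and the 60-second threshold are all assumptions chosen for illustration, and the sketch hardens "almost always" into "always" for SaaS sources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    latency_sla_seconds: Optional[float]  # None means no latency requirement in writing
    sla_owner: Optional[str]
    on_call_owner: Optional[str]
    source: str  # hypothetical labels: "saas_api", "rdbms_cdc", "app_events"


def recommend(w, streaming_threshold_seconds=60):
    """Return (choice, reason). Defaults to batch unless streaming earns its cost."""
    if w.latency_sla_seconds is None or not w.sla_owner:
        return ("batch", "no written latency SLA with an owner")
    if w.source == "saas_api":
        return ("batch", "SaaS API sources rarely justify streaming")
    if w.source != "app_events" and w.latency_sla_seconds > streaming_threshold_seconds:
        return ("batch", "latency requirement is met by scheduled or incremental refresh")
    if not w.on_call_owner:
        return ("batch", "no on-call owner for streaming infrastructure")
    return ("streaming", "latency or source shape demands it, and ownership is in place")


# Hypothetical workloads.
dashboard = Workload(86400, "finance-analytics", None, "saas_api")
fraud = Workload(30, "risk-eng", "platform-oncall", "rdbms_cdc")
unowned = Workload(30, "risk-eng", None, "rdbms_cdc")
vague = Workload(None, None, None, "rdbms_cdc")
```

For example, `recommend(fraud)` returns a streaming recommendation, while `recommend(unowned)` falls back to batch because nobody is on call for the streaming layer.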

The honest summary

Default to batch. Add streaming where the latency requirement, the source-data shape, or the application architecture genuinely demands it. The cost of building a maintainable batch pipeline first and adding streaming later is much lower than the cost of standing up Kafka for a workload that did not need it.

Concrete next step

If the streaming-vs-batch decision is on the architecture roadmap, a $25K Data Architecture engagement returns a fixed-bid recommendation with:

Target-state diagram (ingestion + storage + processing + serving for each workload class)
Workload-by-workload latency SLA proposal with owners
3-year TCO comparison for streaming-first vs batch-first vs polyglot
Migration path from existing state to target state

Start the intake. Fixed-bid SOW returned in 3 business days. See also the broader platform-selection framework, the in-warehouse AI comparison, and the vector database framework.
