Data Pipeline Patterns: A Practical Reference

Every data pipeline makes a choice about how data moves, how it gets stored, and how often it gets refreshed. Those choices compound. A bad call on load strategy at the start of a project can mean years of slow pipelines, stale data, or a table that nobody trusts because the numbers change without warning.

This isn’t meant to be a hand-holding, in-depth explanation of each item, or a fully exhaustive one for that matter. Selfishly, I’ve always wanted a quick intro/reference to the most common patterns that I could point to when talking to customers or peers.


Loading Strategies

The mechanics of how data gets written to a destination and how changes are tracked.

Truncate and Reload
  What it does: Wipe the destination and reload everything from the source.
  Use when: The source is small; there is no reliable change indicator; the destination must mirror the source exactly, deletes included.
  Watch out for: Does not scale; every run puts a full load on the source; if the pipeline fails mid-run, the table is empty.

Incremental Load
  What it does: Load only records changed since the last run, filtered by timestamp or ID.
  Use when: Large tables with a reliable updated_at or created_at column.
  Watch out for: Misses updates that bypass the timestamp; does not capture deletes; requires tracking last-run state.

Upsert
  What it does: Insert new records, update existing ones.
  Use when: Dimension tables and reference data that must stay in sync with the source, updates included.
  Watch out for: Requires a clean key to match on; slower than a plain insert; deletes still need separate handling.

Append Only
  What it does: Write new records, never modify existing ones.
  Use when: Events, logs, transactions, audit trails; basically any time you want to keep a full history of everything.
  Watch out for: The table grows indefinitely; querying current state requires a separate derived table or aggregation.

Snapshot
  What it does: Periodically copy the full source table with a date column appended.
  Use when: No CDC or reliable change tracking is available; the table is small enough for full periodic copies.
  Watch out for: Storage grows with table size times snapshot frequency; no intra-period changes are captured.

Change Data Capture (CDC)
  What it does: Read every insert, update, and delete from the source database’s transaction log as events.
  Use when: Near-real-time movement; large tables where polling is too slow; you need to capture deletes reliably; feeding event-driven pipelines.
  Watch out for: Requires transaction log access, which managed cloud databases often restrict; schema changes on the source can break consumers; log retention causes issues if the pipeline falls behind.

Slowly Changing Dimensions (SCD)
  What it does: Track historical changes in dimension attributes; Type 1 overwrites, Type 2 adds a new row with a date range, preserving full history.
  Use when: Dimension tables where you need point-in-time accuracy (Type 2) or only current state matters (Type 1).
  Watch out for: Type 2 adds query complexity: every join needs a date range filter; the table grows with each tracked change; dbt snapshots automate Type 2.

The pattern should match the data type, not the other way around. Reference tables get full reloads. Event tables are append-only. Dimension tables use upserts or SCD Type 2. CDC is the right choice when you need sub-minute latency or reliable delete tracking. Popular CDC tools: Debezium (open source standard), Airbyte, AWS DMS, Google Datastream.
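To make the upsert pattern concrete, here is a minimal sketch in Python against an in-memory SQLite table. The dim_customer table and its columns are invented for illustration, and most warehouses would express this as a MERGE statement instead:

```python
import sqlite3

# Hypothetical dimension table keyed by customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

def upsert(rows):
    # INSERT ... ON CONFLICT updates existing keys instead of duplicating them,
    # which is what keeps the dimension in sync with the source on re-runs.
    conn.executemany(
        """INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)
           ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name""",
        rows,
    )

upsert([(1, "Ada"), (2, "Grace")])
upsert([(2, "Grace Hopper"), (3, "Edsger")])  # an update and a new row
print(conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall())
# → [(1, 'Ada'), (2, 'Grace Hopper'), (3, 'Edsger')]
```

Note that the key match is doing all the work here: without a clean, stable customer_id, the same source record would land as multiple rows.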


Architecture Patterns

Loading strategies are the building blocks. Architecture patterns are how you assemble them into a system. These three are not mutually exclusive; in practice, most pipelines combine pieces of all three.

Lambda Architecture

Lambda runs two parallel tracks. A batch layer reprocesses everything periodically to produce accurate, complete results. A speed layer processes the stream in real time for low latency but potentially incomplete results. A serving layer merges both.

When to use it: You need both historical accuracy and low latency, and you are willing to maintain two systems to get it.

Pros: Correct results from the batch layer. Low latency from the speed layer.

Cons: Two separate pipelines computing the same thing in different ways. Double the code, double the bugs, two codebases to keep in sync. The merge logic in the serving layer is usually the messiest part. Most teams that built Lambda architectures eventually regretted it.


Kappa Architecture

Kappa collapses Lambda’s two layers into one. Everything is a stream. Historical reprocessing is done by replaying the event log from the beginning. The same code handles both real-time and historical data.

When to use it: You have a durable, replayable event store (Kafka with long retention is the standard). Your transformation logic is the same for historical and real-time data.

Pros: One codebase. No batch/stream divergence to maintain. Simpler operationally than Lambda.

Cons: Requires a replayable event store, which not every source provides. Full reprocessing can be slow at high volumes. Not every problem is naturally modeled as a stream.


Medallion Architecture

Medallion organizes data into layers: bronze (raw ingested data), silver (cleaned and conformed), gold (aggregated, business-ready). Analysts query gold. dbt models the silver and gold transformations. The bronze layer is the safety net: raw data you can always reprocess if a downstream transformation turns out to be wrong.

When to use it: Almost always useful as an organizational principle for a data warehouse or lakehouse. Works especially well with dbt, and pairs naturally with either Lambda or Kappa for the ingestion layer.

Pros: Clear separation between raw and transformed data. Downstream layers can be reprocessed without re-ingesting from the source. Easy to inspect data at each stage when something breaks.

Cons: Multiple copies of data means more storage. Requires discipline to maintain the layer boundaries. Gold logic creeping into silver is a common failure mode.
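A toy sketch of the three layers, using SQLite views in place of dbt models. The table names and the cleaning rules are invented for illustration; the point is only that each layer reads from the one below it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bronze: raw ingested rows, kept as-is so they can always be reprocessed.
conn.execute("CREATE TABLE bronze_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO bronze_orders VALUES (?, ?, ?)",
    [(1, "10.50", "complete"), (2, "bad", "complete"), (3, "4.25", "CANCELLED")],
)

# Silver: cleaned and conformed -- cast types, normalize values, drop garbage.
conn.execute("""CREATE VIEW silver_orders AS
                SELECT id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
                FROM bronze_orders
                WHERE amount GLOB '[0-9]*'""")

# Gold: the business-ready aggregate that analysts actually query.
conn.execute("""CREATE VIEW gold_revenue AS
                SELECT SUM(amount) AS revenue FROM silver_orders
                WHERE status = 'complete'""")

print(conn.execute("SELECT revenue FROM gold_revenue").fetchone()[0])  # → 10.5
```

Because bronze is untouched, a bug in the silver cleaning logic can be fixed and the downstream views rebuilt without going back to the source system.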


Reliability Fundamentals

Four patterns that belong in every production pipeline. Not glamorous, but skipping them is how you end up with duplicated records and 2 AM incidents.

Idempotency

A pipeline is idempotent when running it twice for the same input produces the same result as running it once. No duplicate records, no corrupted state.

Get this wrong and everything else gets harder. Safe retries, backfills, and recovery from partial failures all depend on it. Achieve idempotency with upserts instead of blind inserts, output partitions keyed by logical date so re-runs overwrite rather than append, and a control table that tracks completed intervals so re-runs skip already-processed ones.
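The control-table approach can be sketched like this with an in-memory SQLite database. The events and load_control tables and the date-partition scheme are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")
conn.execute("CREATE TABLE load_control (event_date TEXT PRIMARY KEY)")

def load_partition(event_date, rows):
    # Skip intervals the control table already marks as done.
    done = conn.execute(
        "SELECT 1 FROM load_control WHERE event_date = ?", (event_date,)
    ).fetchone()
    if done:
        return
    # Overwrite the logical-date partition rather than appending to it,
    # so a retry after a partial failure cannot duplicate rows.
    conn.execute("DELETE FROM events WHERE event_date = ?", (event_date,))
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(event_date, p) for p in rows])
    conn.execute("INSERT INTO load_control VALUES (?)", (event_date,))

load_partition("2024-01-01", ["a", "b"])
load_partition("2024-01-01", ["a", "b"])  # second run is a no-op
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

Running the same interval twice, whether by accident or as a deliberate backfill, produces exactly the same table state.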

Atomic Swap

Never truncate a production table and reload it in place. If the pipeline fails midway, the table is empty and downstream queries return nothing.

Load into a staging table first, then swap it into production atomically with a rename or partition swap. Readers see either the old complete dataset or the new one. They never see a partial state. This is the correct implementation for any full reload.
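Here is a sketch of the staging-then-swap sequence against SQLite; the table names are invented, and real warehouses do the swap with a rename, a partition exchange, or by repointing a view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (n INTEGER)")
conn.execute("INSERT INTO report VALUES (1)")

def full_reload(rows):
    # Build the new dataset off to the side; a failure here leaves
    # the production table untouched.
    conn.execute("DROP TABLE IF EXISTS report_staging")
    conn.execute("CREATE TABLE report_staging (n INTEGER)")
    conn.executemany("INSERT INTO report_staging VALUES (?)", [(r,) for r in rows])
    # Then swap it in with renames inside one transaction, so readers see
    # either the old complete dataset or the new one, never a partial state.
    with conn:
        conn.execute("ALTER TABLE report RENAME TO report_old")
        conn.execute("ALTER TABLE report_staging RENAME TO report")
        conn.execute("DROP TABLE report_old")

full_reload([10, 20, 30])
print(conn.execute("SELECT COUNT(*) FROM report").fetchone()[0])  # → 3
```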

Dead Letter Queue

In pipelines that process records one at a time, some records will fail. A dead letter queue (DLQ) routes those failures to a separate table or queue rather than silently dropping them or halting the batch. Failed records are preserved and can be replayed after the underlying issue is fixed.

A DLQ that nobody monitors is just silent data loss. Build the alerting alongside it.
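Stripped to its core, the pattern is a try/except around each record, with a list standing in for the dead letter table or queue (the handler and its validation rule are made up for illustration):

```python
failed = []  # stand-in for a dead letter table or queue

def process_batch(records, handler):
    for rec in records:
        try:
            handler(rec)
        except Exception as exc:
            # Route the failure aside instead of dropping it or halting the
            # batch; keeping the record and the error makes replay possible.
            failed.append({"record": rec, "error": str(exc)})

def handler(rec):
    if rec < 0:
        raise ValueError("negative value")

process_batch([1, -2, 3], handler)
print(failed)  # → [{'record': -2, 'error': 'negative value'}]
```

Records 1 and 3 process normally; the bad record is preserved with its error instead of taking the whole batch down.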

Retry with Exponential Backoff

Transient failures are common: network blips, rate limits, brief database unavailability. Retry logic handles them without manual intervention. Exponential backoff (doubling the wait between attempts with some random jitter) prevents retries from piling up on a struggling system. Set a maximum retry count or you will end up masking real failures with infinite retries.
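A minimal sketch of the retry loop, with doubling delays, jitter, and a hard cap on attempts:

```python
import random
import time

def retry(fn, max_attempts=5, base_delay=0.5):
    # Double the wait after each failed attempt and add random jitter, so
    # retries from many workers don't hammer a struggling system in lockstep.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # cap retries so real failures surface, not loop forever
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("blip")
    return "ok"

print(retry(flaky, base_delay=0.01))  # → ok
```

In production you would typically catch only the exception types you know to be transient; retrying on every exception, as this sketch does, can mask genuine bugs.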


Orchestration Patterns

The trigger mechanism is a separate decision from what the pipeline actually does.

Time-based
  Triggered by: Cron schedule.
  Use when: Data arrives on a predictable cadence; downstream only needs refreshes at set intervals.
  Trade-off: Runs whether or not there is new data; if upstream is late, the pipeline proceeds anyway.

Event-driven
  Triggered by: File arrival, queue message, webhook.
  Use when: Data arrives unpredictably; you want minimal latency between arrival and processing.
  Trade-off: Requires reliable event delivery; harder to test; event floods produce run floods.

Data-aware
  Triggered by: Upstream dataset updated (Airflow Datasets, etc.).
  Use when: Chained pipelines where downstream should run as soon as upstream is ready.
  Trade-off: Requires upstream to emit dataset update signals; tooling is still maturing.

Sensor / Polling
  Triggered by: A task polling until a condition is true.
  Use when: You cannot get a push notification and must check for readiness.
  Trade-off: Holds worker slots while waiting; polling interval trades latency against resource use.

Production systems typically mix several of these. Nightly batch jobs run on cron. File-drop workflows use sensors or event triggers. Pipeline chains use data-aware scheduling where the orchestrator supports it.
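The sensor pattern reduces to a loop like the one below; this is a bare sketch, and orchestrators such as Airflow package it as a sensor task with their own timeout and reschedule semantics:

```python
import time

def wait_for(condition, poll_interval=30.0, timeout=3600.0):
    # Poll until the condition is true or the timeout expires. A longer
    # interval saves resources; a shorter one reduces latency.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # Failing loudly beats waiting forever on data that will never arrive.
    raise TimeoutError("upstream data never became ready")

# Example: wait for a (hypothetical) readiness check before processing.
wait_for(lambda: True, poll_interval=1.0, timeout=10.0)
```

The trade-off in the table is visible in the signature: poll_interval is latency versus load, and the loop occupies a worker the entire time it runs.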


How These Fit Together

A typical setup: CDC reads changes from a production database into a staging table or message queue. An append-only or incremental load writes those events into the data warehouse (bronze layer). dbt transformations clean and conform the data (silver layer) and aggregate it into reporting tables (gold layer). The whole thing runs on a mix of cron schedules and data-aware triggers, with idempotency and atomic swaps built into every load step.

These patterns are not a menu where you pick one. Most real pipelines mix several. The load strategy, the transformation approach, and the architecture pattern are independent decisions. Pick each one based on what the data actually requires.

