How Not to Take Down Your SCADA Source

SCADA is one of those data sources where the obvious ETL pattern is the wrong one. Teams reach for a full refresh because the data backfills, edits happen behind your back, and quality flags shift after the fact. Six months later the source vendor calls to ask why the pipeline is putting so much load on their database.

This is a pattern we see a lot. The pipeline starts simple. Pull the last few months every run, just to be safe. The motivation is right. The implementation is treating an operational system like a data warehouse, and the operational system is the one paying for it.


Why full refresh feels safe

SCADA is one of the worst sources for a naive incremental load, and the reasons are specific.

Readings arrive late. A pumper enters a manual reading three days after the fact. An RTU comes back online and dumps a buffer of telemetry. A sensor that was offline gets backfilled by an integrator. The timestamp on the reading is its event time, not the time it landed in the historian. An incremental query keyed on event time will miss everything that arrived after the last run.

Historical values get edited. Calibration corrections, manual overrides, allocation adjustments. Any of these can rewrite values you already pulled. An incremental load that only looks at new data will leave the warehouse out of sync with the source.

Quality flags change over time. A value flagged “good” on Monday gets flagged “bad” on Friday when someone notices a calibration issue. The row didn’t move. Only its metadata changed.

So the engineer reaches for the simplest correctness guarantee: pull the last 90 days every run. Or the last six months. Or, when it really goes off the rails, the full history of the recent tag set every time, because what’s the harm.

The harm is that the source is somebody else’s production system.


What full refresh actually costs

The cost shows up in places that are easy to miss until somebody starts paying attention.

Vendor systems are usually multi-tenant. When you pull from a cloud-hosted SCADA platform, the database you’re querying is shared with their other customers. The load you put on it affects them too. The vendor sees the impact long before you do, because their dashboards show queue depth and slow queries across the whole tenancy.

Volume scales with tag count, not with what changed. A pipeline pulling 30 days of one-minute aggregates across 5,000 tags is moving roughly 220 million rows per run. Doing that every six hours is roughly 880 million rows per day off a source database sized for live monitoring, not bulk extraction. The vast majority of those rows are byte-identical to what’s already in the warehouse.

You’re paying for the same data over and over. Egress charges if the source is in the cloud, transform cycles in your warehouse, storage for intermediate landing tables. None of that work changes the contents of the target.

Idempotency starts mattering more than it should. Once the volume is large enough that runs overlap, the upserts get expensive, deadlocks appear, and “just rerun it” stops being cheap.

And the vendor’s job is to keep their operational system running for their other customers. They will eventually ask you to stop.


A better pattern

Backfills and late arrivals are real. A naive watermark on event time will miss them. The approach that actually works splits the work across a few passes, each tuned to a different latency assumption.

A short overlap window for recent data. Pull the last seven days of readings every run, keyed on event time. The overlap catches most late arrivals. The volume is bounded. Merge into the target by primary key so the overlap doesn’t create duplicates.

A “separate edits pull” on a slower cadence. If the source exposes an updated_at or modified_at column, use it. A nightly pass keyed on updated_at > last_run catches historical edits without re-reading data that didn’t change. That column is often there but not advertised. Worth asking the vendor about.

A reconciliation job, not a refresh job. Once a week, sample N tags at random and compare row counts and hash totals between source and warehouse over the trailing 90 days. If they match, do nothing. If they don’t, repair the specific window that drifted. This replaces “full refresh just in case” with a job that only does work when there’s actually work to do.

A deep historical pull, only when needed. Backfill more than a few weeks back almost never happens in practice. When it does, schedule a one-time pull off-hours, coordinated with the vendor, against a bounded date range. Make it an explicit event rather than baking it into every run.

Done this way, the load on the source drops by something like an order of magnitude, and the warehouse still has every value it had before.


Practical levers in an Airflow pipeline

A few specifics for the Airflow and Python case, since that’s where most of these pipelines live. The patterns we lean on across the rest of our stack, covered in Building a Production Data Pipeline on PPDM with Airflow and DuckDB, apply here too.

Tune the polling cadence to the use case, not to the source. If the downstream report runs at 6 AM, the ingestion needs to finish before 6 AM. It does not need to run every fifteen minutes. We see this all the time: jobs running on a schedule inherited from a different pipeline that nobody revisited.

Chunk time-window queries. A 30-day pull of all tags is one giant query. Split it into daily chunks with explicit WHERE clauses and the source can plan each one. The query optimizer will thank you, the network won’t time out, and a failure costs you one day instead of a month.

Use Airflow pools and max_active_runs deliberately. Size the pool to the source’s tolerance, not your warehouse’s. If the source can handle two concurrent extractions, the pool has two slots. The rest queue. This is the simplest backpressure mechanism Airflow gives you and it’s almost always under-used.

Land raw, transform later. Pull raw values into a landing zone (Parquet on object storage, or a staging table) and do the time-series shaping inside your warehouse where you control the resources. Aggregations against the source database make every pull more expensive than it needs to be.

Make the watermarks observable. A small table that records, per source and per pull type, the last successful watermark and the row count returned. When something goes wrong it’s the first thing you look at. When the vendor calls, you can answer in specifics instead of guessing.


Talk to the vendor before they talk to you

Most SaaS SCADA vendors are willing to help if you ask first. Many offer bulk export endpoints, read replicas, or off-peak windows that they don’t advertise on the marketing page. Some have a data team contact separate from the support queue who can tell you what their database can actually handle.

Asking early costs a fifteen-minute call. Not asking costs the angry one.


Default to less work

Nobody designs a SCADA pipeline to be wasteful. It happens by accident, one safe default at a time. Backfills, edits, and routine ingestion are three different jobs. Trying to solve all three with one big query is what gets you the call from the vendor.

If you’re pulling from a vendor-hosted SCADA system today, the question worth asking is what the last six months of your pipeline runs actually did to their database. The answer is usually more than you’d guess.


Further Reading

Get in touch