June 9, 2026

Bringing SCADA Data Into Your Data Warehouse Without Breaking Either

By John Wassilak

There’s a screen in a control room somewhere in the Anadarko Basin showing live tank levels, casing pressures, and flow rates on a few hundred wells. The data goes back five years. The pumper checks it every morning. The reservoir engineer pulls a snapshot when something looks off. Aside from that, almost none of it touches the analytical side of the business.

This is not a story about one bad operator. It’s how most of the industry runs.

Operators have years of high-resolution operational data that never quite made it into the data warehouse, the BI reports, the type curves, or the monthly close. The integration work is unglamorous, the data volumes are intimidating, and the team that owns the operational systems answers to a different set of priorities than the team that would use the analytical output. The result is a gap that nobody owns and almost nobody is closing.

The work to close it isn’t exotic. It’s just not getting done.

Some terminology

A few definitions, because the names get conflated.

SCADA (Supervisory Control and Data Acquisition) is the operational system that monitors and controls field equipment. RTUs at the well or facility level talk to a SCADA server, which displays data and sometimes sends control commands back out.
A historian is the time-series database that sits behind the live SCADA display, storing values over time. OSI PI, AVEVA Historian, Ignition’s tag history, and Canary are common ones in upstream. Some smaller operators use the database that ships with their SCADA package without ever calling it a historian.
Tags are the individual data points. A single well might have dozens: tubing pressure, casing pressure, tank levels by tank, flow rates, run status, alarms. A facility can have hundreds. A whole field can run into the tens of thousands.
OPC UA is the protocol most modern systems use to expose tag data to other applications. Older systems may use OPC DA, Modbus exposed through a gateway, or vendor-specific APIs.

Most of the interesting data lives in the historian. Years of high-resolution telemetry across every well you own. The challenge is getting it into the analytical side of the business without either breaking the operational system or drowning your warehouse in noise.

Why this hasn’t been solved already

You can find more than one SCADA-to-warehouse product on the market. Plenty of operators have bought one. Adoption is uneven, and the reasons are specific.

The operational and analytical teams report to different people. SCADA falls under operations. The data warehouse falls under IT or, increasingly, a data team. The handoff between them is rarely defined. The data has to physically cross a boundary that nobody owns.

The volumes are genuinely large. One-second resolution on a few thousand tags produces hundreds of millions of records per day (86,400 seconds times 3,000 tags is roughly 260 million). Naively dumping that into the same database that holds your accounting and master data is a way to make both slow. Storing it raw and forever in your warehouse is also wasteful for an analytical workload that almost never needs sub-minute resolution.
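The volume arithmetic is easy to check for your own field. The sketch below assumes every tag stores one value per sample interval, and the 3,000-tag figure is only an illustrative stand-in for “a few thousand.”

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def records_per_day(tag_count: int, sample_interval_s: float) -> int:
    """Rows per day if every tag stores one value per sample interval."""
    return int(tag_count * SECONDS_PER_DAY / sample_interval_s)

# 3,000 tags is an assumed, illustrative tag count.
raw_rows = records_per_day(3_000, 1)      # one-second resolution
minute_rows = records_per_day(3_000, 60)  # one-minute aggregates

print(f"{raw_rows:,} rows/day raw")
print(f"{minute_rows:,} rows/day at one-minute resolution")
```

Moving from one-second to one-minute resolution cuts the row count by a factor of sixty, which is the difference between a problem and a non-problem for most warehouses.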

The data quality is uneven. Sensors fail. Calibrations drift. A pressure transducer reads zero for a week before anyone notices. Run statuses say “running” while the pumper is standing next to a downed well. The operations team works around this with their judgment. An analytical pipeline that takes the data at face value will draw wrong conclusions.

The historian is somebody else’s product. Pulling data out of OSI PI is straightforward if you have the right license tier and connector. Pulling it out of a vendor-specific package may require working through that vendor. The contractual and architectural details vary by historian, and they matter.

The downstream use cases aren’t always obvious. It’s easy to argue for accounting data in the warehouse. The dollar value is visible. The argument for SCADA in the warehouse depends on someone wanting to do downtime analysis, predictive maintenance, well surveillance at scale, or production allocation against meters. If nobody is explicitly asking for those, the project doesn’t get prioritized.

What a real SCADA pipeline looks like

The architecture is well understood. The challenge is making the decisions in the middle deliberately rather than by default.

Extraction at the source

The pipeline starts at the historian. The most common pattern is an OPC UA client or a vendor connector pulling tag values on a schedule, into a staging area that lives outside the operational network. The operational side stays in charge of the live system. The analytical side gets its own copy.

The connector has to be read-only, with no possibility of writing back into the historian, and the operations team has to be comfortable with that. The polling rate matters too. Pulling raw one-second data on every tag from a busy historian can degrade the operational system. Pulling one-minute aggregates is plenty for most analytical use cases and much kinder to the source.
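One way to make the read-only requirement structural rather than a promise is to code the extractor against an interface that has no write method at all. The sketch below is hypothetical throughout: HistorianReader, FakeReader, extract_window, and the tag names are illustrative and do not correspond to any vendor’s connector API. A real implementation would wrap an OPC UA client or a vendor SDK behind the same interface.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Protocol, TextIO, Tuple

# (tag name, interval start, aggregated value, quality flag)
Row = Tuple[str, datetime, float, str]

class HistorianReader(Protocol):
    """Hypothetical read-only interface: there is deliberately no write method."""
    def read_aggregates(self, tags: Iterable[str], start: datetime,
                        end: datetime, interval: timedelta) -> Iterator[Row]: ...

class FakeReader:
    """Stand-in for a real connector so the sketch runs without a historian."""
    def read_aggregates(self, tags, start, end, interval):
        t = start
        while t < end:
            for tag in tags:
                yield (tag, t, 100.0, "good")
            t += interval

def extract_window(reader: HistorianReader, tags, start, end, staging: TextIO) -> int:
    """Pull one-minute aggregates for one time window into a staging file."""
    writer = csv.writer(staging)
    count = 0
    for tag, ts, value, quality in reader.read_aggregates(
            tags, start, end, timedelta(minutes=1)):
        writer.writerow([tag, ts.isoformat(), value, quality])
        count += 1
    return count

staging = io.StringIO()  # in practice, a file or bucket outside the operational network
start = datetime(2026, 6, 1, tzinfo=timezone.utc)
tags = ["SMITH_1H.CSG_PSI", "SMITH_1H.TBG_PSI"]  # hypothetical tag names
n = extract_window(FakeReader(), tags, start, start + timedelta(minutes=5), staging)
print(n, "rows staged")
```

Extracting in bounded time windows also gives the scheduler a natural unit to retry when a pull fails.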

A time-series-shaped destination

A row-per-tag-per-timestamp table in your transactional database is going to be sad inside of six months. The right destination is a time-series-shaped store: TimescaleDB on top of PostgreSQL, InfluxDB, ClickHouse, or partitioned Parquet on object storage queried through DuckDB. Each has tradeoffs. For most mid-size operators, TimescaleDB is the lowest-friction choice because it’s still PostgreSQL underneath, which means it slots into the rest of the stack we’ve covered in Building a Production Data Pipeline on PPDM with Airflow and DuckDB.

The schema is mostly the same regardless of engine. A tag table, a tag values table keyed on tag and timestamp, and a well-to-tag mapping that links the operational identifiers to your master well table.
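A minimal sketch of that three-table shape follows. It uses SQLite through Python’s standard library only so that it runs anywhere; it is not TimescaleDB, and the table names, column names, tag name, and well identifier are all assumptions made for illustration. The quality column is included on the assumption that the historian’s quality flag is carried across with each value.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tag (
    tag_id   INTEGER PRIMARY KEY,
    tag_name TEXT NOT NULL UNIQUE,   -- name as it appears in the historian
    unit     TEXT                    -- e.g. psi, bbl
);
CREATE TABLE tag_value (
    tag_id  INTEGER NOT NULL REFERENCES tag(tag_id),
    ts      TEXT    NOT NULL,        -- ISO 8601 UTC timestamp
    value   REAL,
    quality TEXT    NOT NULL,        -- quality flag carried from the historian
    PRIMARY KEY (tag_id, ts)         -- keyed on tag and timestamp
);
CREATE TABLE well_tag_map (
    tag_id  INTEGER PRIMARY KEY REFERENCES tag(tag_id),  -- one well per tag in this sketch
    well_id TEXT NOT NULL            -- key into the master well table
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample data.
conn.execute("INSERT INTO tag VALUES (1, 'SMITH_1H.CSG_PSI', 'psi')")
conn.execute("INSERT INTO well_tag_map VALUES (1, 'WELL-0001')")
conn.execute("INSERT INTO tag_value VALUES (1, '2026-06-01T00:00:00Z', 412.5, 'good')")

rows = conn.execute("""
    SELECT m.well_id, t.tag_name, v.ts, v.value
    FROM tag_value v
    JOIN tag t ON t.tag_id = v.tag_id
    JOIN well_tag_map m ON m.tag_id = v.tag_id
    WHERE v.quality = 'good'
""").fetchall()
print(rows)
```

In a time-series engine the tag_value table is the one that gets partitioned by time; the other two stay small.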

Downsampling and retention

This is where most projects make their important decisions by accident.

You almost never want to store raw resolution forever in your analytical store. A practical pattern: keep raw resolution for a short window, maybe 90 days, in case someone needs to do an event reconstruction. Roll up to one-minute aggregates for one to two years. Roll up to hourly or daily for the long-term archive. The detail is still in the historian if anyone ever truly needs to go further back.
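Writing the retention tiers down as data makes the decision explicit instead of accidental. This is a sketch using the example windows above (90 days raw, two years at one minute, hourly beyond that); the resolution labels and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

# (maximum age, resolution kept), newest tier first. Example values only.
TIERS = [
    (timedelta(days=90), "raw"),
    (timedelta(days=730), "1min"),
]
ARCHIVE_RESOLUTION = "1hour"

def resolution_for(ts: datetime, now: datetime) -> str:
    """Return the finest resolution the analytical store keeps for a timestamp."""
    age = now - ts
    for max_age, resolution in TIERS:
        if age <= max_age:
            return resolution
    return ARCHIVE_RESOLUTION

now = datetime(2026, 6, 9, tzinfo=timezone.utc)
for days in (10, 400, 1000):
    print(days, "days old ->", resolution_for(now - timedelta(days=days), now))
```

A scheduled job can use the same table to decide which rows to roll up and which raw rows are safe to drop.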

The aggregates aren’t just averages. For pressures, you usually want min, max, and average per interval. For flow rates and volumes, you want the time-weighted average and the total. For run status, you want percent runtime over the interval. The roll-up logic encodes domain knowledge, and getting it right matters more than picking the right engine.
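The three roll-up rules can be sketched in a few lines. The code assumes samples arrive as (seconds, value) pairs sorted by time, that each value holds until the next sample arrives, that rates are in barrels per day, and that run status is encoded as 1 for running and 0 for stopped. All function names are illustrative.

```python
from statistics import mean

def pressure_rollup(values):
    """Min, max, and plain average for a pressure tag over one interval."""
    return {"min": min(values), "max": max(values), "avg": mean(values)}

def time_weighted_average(samples, start, end):
    """samples: sorted (t_seconds, value) pairs; each value holds until the next sample."""
    weighted = 0.0
    covered = 0.0
    for i, (t, v) in enumerate(samples):
        seg_start = max(t, start)
        next_t = samples[i + 1][0] if i + 1 < len(samples) else end
        seg_end = min(next_t, end)
        if seg_end > seg_start:
            weighted += v * (seg_end - seg_start)
            covered += seg_end - seg_start
    return weighted / covered if covered else None

def interval_total(rate_samples, start, end):
    """Volume over the interval, assuming rates are expressed per day."""
    twa = time_weighted_average(rate_samples, start, end)
    return None if twa is None else twa * (end - start) / 86_400

def percent_runtime(status_samples, start, end):
    """Run status as 1 (running) or 0 (stopped); result is percent of the interval."""
    twa = time_weighted_average(status_samples, start, end)
    return None if twa is None else 100.0 * twa

# One-minute interval: the well flows at 100 bbl/d for 45 s, then shuts in.
rate = [(0, 100.0), (45, 0.0)]
status = [(0, 1), (45, 0)]
print(pressure_rollup([410.0, 415.0, 405.0]))
print(time_weighted_average(rate, 0, 60))  # 75.0; a plain mean of the samples gives 50.0
print(percent_runtime(status, 0, 60))      # 75.0
```

The gap between 75 and 50 in the example is the reason a plain average is the wrong roll-up for irregularly sampled rates.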

Entity resolution to the well master

SCADA tags are named however the integrator named them, which is often inconsistent across vintages of installations. A well drilled in 2008 may have tags named after the lease. A well drilled in 2022 may have tags named after the API number. The same well’s tags may have been renamed when SCADA was upgraded.

You have to map every tag to a well in your master, the same way you map every record from any other source. This is the same problem we covered in OCC Data Ingestion: Automating What Most Companies Still Do by Hand. The well master is the well master. The matching rules are the same rules.
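A rule-based resolver is usually enough to get started. The sketch below assumes two naming conventions: tags prefixed with a 10-digit API number, and tags prefixed with a lease-style alias. The tag names, API number, alias table, and well IDs are all made up for illustration. Anything that fails both rules should land in a review queue rather than being guessed at.

```python
import re

# Assumed convention: a 10-digit API number, optionally followed by up to
# four extra digits, then a dot and the measurement name.
API_PREFIX = re.compile(r"^(\d{10})\d{0,4}\.")

def resolve_tag(tag_name, api_to_well, alias_to_well):
    """Return the master well ID for a tag, or None if no rule matches."""
    m = API_PREFIX.match(tag_name)
    if m and m.group(1) in api_to_well:
        return api_to_well[m.group(1)]
    # Fall back to the lease-style prefix, normalized to upper case.
    prefix = tag_name.split(".", 1)[0].strip().upper()
    return alias_to_well.get(prefix)

# Hypothetical lookup tables built from the well master.
api_to_well = {"3501512345": "WELL-0001"}
alias_to_well = {"SMITH_1H": "WELL-0001"}

tags = ["3501512345.CSG_PSI", "smith_1h.TBG_PSI", "OLD_LEASE_7.TANK1_LVL"]
resolved = {t: resolve_tag(t, api_to_well, alias_to_well) for t in tags}
unmatched = [t for t, well in resolved.items() if well is None]
print(resolved)
print("needs review:", unmatched)
```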

Data quality flags

A SCADA value with no quality information is a value you can’t trust. Most historians expose a quality flag alongside the value, distinguishing good readings from sensor errors, stale values, and manually overridden ones. That flag has to come across with the data, and your analytical queries need to filter on it. A “good” filter strips out the bad readings before they contaminate aggregates. An “any” filter is for forensics. Defaulting to “good” is almost always what you want.
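The effect of the filter is easy to see on a small example. The sketch assumes quality arrives as a simple string per reading; real historians typically expose numeric or enumerated status codes that would need to be mapped to something like this first. The function name and the quality labels are assumptions.

```python
from statistics import mean

def average(readings, quality_filter="good"):
    """readings: iterable of (value, quality). Defaults to good readings only."""
    if quality_filter == "any":
        values = [v for v, _ in readings]
    else:
        values = [v for v, q in readings if q == quality_filter]
    return mean(values) if values else None

# A casing pressure tag where the transducer dropped out twice.
readings = [(412.0, "good"), (0.0, "bad"), (408.0, "good"), (0.0, "stale")]

print(average(readings))         # good readings only
print(average(readings, "any"))  # everything, for forensics
```

With the default filter the interval averages 410 psi. With “any” it averages 205 psi, a number that describes the sensor’s outage rather than the well.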

The reconciliation problem (still)

SCADA volumes are not the same as the production volumes that go on the report.

SCADA shows what the meters and sensors saw, in close to real time. Reported production usually goes through an allocation step, gets adjusted for tank-to-meter differences, and may be the result of monthly reconciliation by an analyst. A SCADA-based volume number for a given well in a given month rarely matches the allocated production number in the production database. Both are correct for their own purpose.

This is the same general shape as the reconciliation problems covered in Reconciling Land and Production Data and called out in Field Tickets and the Digitization Gap. The pipeline isn’t trying to make the two numbers identical. It’s making the difference visible, categorized, and auditable.

The variance itself becomes a signal. A well whose SCADA volume and allocated volume disagree by a consistent percentage may have a metering issue, an allocation methodology issue, or both. Catching that earlier than monthly close is most of the value.
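A minimal sketch of treating the variance as a signal: compute the percent difference per period, then flag wells that are off in the same direction every period. The 2 percent threshold, the three-period minimum, the function names, and the sample volumes are all assumptions for illustration, not recommended values.

```python
def pct_variance(scada_volume, allocated_volume):
    """Percent by which the SCADA volume differs from the allocated volume."""
    if allocated_volume == 0:
        return None  # undefined; handle zero-allocation periods separately
    return (scada_volume - allocated_volume) * 100.0 / allocated_volume

def consistent_bias(variances, threshold_pct=2.0, min_periods=3):
    """True if every period is off in the same direction by more than the threshold."""
    if len(variances) < min_periods:
        return False
    return (all(v > threshold_pct for v in variances)
            or all(v < -threshold_pct for v in variances))

# Hypothetical (SCADA, allocated) volumes per period for two wells.
well_a = [(1048, 1000), (1051, 1000), (1053, 1000)]  # always about 5% high
well_b = [(1048, 1000), (990, 1000), (1053, 1000)]   # noisy, no steady bias

for name, periods in (("A", well_a), ("B", well_b)):
    variances = [pct_variance(s, a) for s, a in periods]
    print(name, variances, "consistent bias:", consistent_bias(variances))
```

Running the same check on daily or weekly volumes instead of monthly ones is what moves the detection ahead of monthly close.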

Where to start

The trap is trying to ingest every tag from every well on the first project. The right move is to pick a use case, work backward to the minimum set of tags it requires, and ingest those.

Three use cases that almost always justify a focused first pass:

Downtime tracking. Run status, alarm states, and pump-off counts across the field. Lets operations measure runtime and identify chronic underperformers without pulling reports out of the SCADA system one well at a time.
Production surveillance. Daily and hourly volumes against expected ranges, with deviation alerts. Catches problems earlier than the monthly production report does.
Reconciliation support. Volumes for the reporting period, ready to be compared against the production database during monthly close. Closes the loop on the reconciliation problem above.

Each of those needs a manageable subset of tags. None of them require historical depth past a year or two. Each one produces visible value within a quarter if it’s set up correctly.

Build one. Run it in parallel with whatever the operations team is doing today. Once it’s working and the comparison shows it’s accurate, broaden the scope.

What this buys you

A working SCADA pipeline turns operational data from something you look at in a control room into something the rest of the business can act on.

Reservoir engineering gets a real history of pressures and rates across the field, queryable in the same place as everything else. Operations can measure cross-field runtime and downtime without building it by hand each month. Accounting has a real second source for production volumes during monthly close. And the autonomous and AI-driven tooling that everyone keeps trying to deploy in field operations finally has the clean, structured time-series data it actually needs to work, which is the point Jeff made about the data foundation that makes autonomy work better.

None of that is a moonshot. It’s the unglamorous infrastructure work that makes the more interesting work possible. Most operators don’t need a better algorithm or a fancier platform. They need their SCADA data to stop living only in the control room.

That’s a fixable problem.

We Were Just at PPDM 2026

We spent April 27 through 29 at the PPDM Energy Data Convention in Houston. SCADA integration and time-series data came up in more than one conversation at the booth. If that’s a problem you’re working on, we’d love to hear how you’re thinking about it.
