Why dbt Belongs in Your OT Data Stack

Most analytics engineers we talk to have a clean dbt project for their business data. Finance, sales, contracts, vendor master, all modeled, tested, documented, lineage front to back. Then we ask what happens to the SCADA data once it lands in the warehouse and the answer is some version of “the engineers query it directly.”

The instinct here is understandable. OT (Operational Technology) data feels like a different kind of beast. The schemas come from somewhere outside the data team’s control. The volumes are large. The cadence is faster. The consumers (field engineers, surveillance teams, allocation analysts) move quickly and don’t usually wait for an analytics engineer to model anything.

But OT data has the same data hygiene problems dbt was built to solve. Usually worse. And the cost of getting it wrong is higher, because the consumers are making operational decisions on the output.


What raw SCADA data in the warehouse actually looks like

After the ingestion is wired up, the warehouse has tag values landing somewhere. Some flavor of (tag_name, timestamp, value, quality_flag, source_system). Often partitioned by ingestion date. Often pulled from a historian or vendor API at one-minute or fifteen-minute aggregates, with raw resolution available on request.

This is fine as a starting layer. It is the equivalent of a raw.source_table in your business data stack. Nobody would query that directly to build a dashboard. They would build a model on top of it.

The problem is that for SCADA data, this almost never happens. Every consumer ends up writing their own interpretation directly against the raw layer. Five engineers come away with five definitions of “average tubing pressure last week.” Two of them filter on quality flags. Two of them don’t. One of them filters out a different set of quality flags than the documentation suggests. All of them have their own opinion on whether null readings are zeros or gaps.

None of them are wrong, exactly. They are answering slightly different questions because the underlying tag has no canonical interpretation.

This is the same shape of problem that drove every business intelligence team to dbt a decade ago. The solution is the same.


The medallion pattern applies directly

The standard bronze-silver-gold structure is a good fit for SCADA data, and the boundaries are arguably cleaner than in business data.

Bronze. Raw tag data as ingested. Never modified. Partitioned by ingestion date and source. This is the source of truth for backfill and replay, and it’s the only layer that talks to the ingestion pipeline directly. Nothing else touches it.

Silver. Cleaned, quality-filtered, tag-normalized, gap-handled. This is where the hard work happens. Quality flag codes get translated into a common schema. Tag names get mapped to a canonical name through a tag dictionary. Units get normalized. Asset hierarchy gets resolved (this tag belongs to this well, this well belongs to this pad, this pad belongs to this field). Gap-handling logic gets applied once, deliberately, by people who know what the right answer is.

Gold. Asset-centric, consumer-ready models. Well performance summary at fifteen-minute intervals. Facility status at the hour. Production by basin by day. These are the models the engineer-built surveillance app should query, not the raw tag table.

The shape of this is identical to a business data stack. The discipline is the same. The artifacts (sources, tests, models, docs) are the same. dbt was built for exactly this.

The hardest layer is silver. Tag normalization and asset contextualization are the work that makes everything above them useful, and it is the work that gets skipped when teams jump straight from bronze to dashboard. The same point we made about vendor sprawl applies here: the ingestion is the easy part. The normalization is the project.


Tests on SCADA data catch real problems

dbt tests on business data tend to be defensive. Null checks on foreign keys. Uniqueness on surrogate keys. Accepted values on enum columns. They catch occasional weirdness.

Tests on SCADA data catch operational problems. That distinction matters.

Freshness tests on max timestamp per tag. If a sensor stopped reporting an hour ago, the data team finds out before the surveillance engineer does. This is the single highest-value test you can write against SCADA sources. The mechanic is the same as any dbt source freshness test. The interpretation is different. A stale tag is a real-world failure mode, not a data engineering one.

Null rate tests. If twenty percent of values for a tag came in as null this morning when the baseline is closer to two percent, the vendor API probably has a problem. Or the historian is dropping packets. Or a transducer failed.

Accepted range tests. Tubing pressure between zero and the rated max for the well. Tank levels between zero and the tank capacity. Flow rates within physically plausible bounds for the equipment. A value outside the range is either a sensor that needs calibration or a unit-conversion bug, and you want to know which one. Both are real problems worth catching.

Relationship tests against the tag dictionary. Every tag in the bronze layer should have a row in the tag dictionary, or the silver layer should know it doesn’t and route accordingly. A tag that shows up unannounced means the vendor added something and you don’t know about it yet. This catches the vendor schema drift problem before it becomes a downstream surprise.

Each of these tests fires when something is wrong with the physical world, not just when something is wrong with the data. That is a different class of value than business-data tests typically deliver.


Lineage matters more in OT

When the CFO asks why a revenue number moved, the finance team can walk back through the transformations because the lineage is documented. The model that produced the number, the upstream models that fed it, the source tables that fed those, all visible.

The same question gets asked about operational data, and it gets asked more often. A reservoir engineer sees a well’s daily production curve drop ten percent. They want to know if that’s a real signal or a data quality artifact. Without lineage, the answer is “probably real, let me check.” With lineage, the answer is “the silver-layer normalization swapped this tag’s mapping last Thursday, so the apparent drop is the renormalization.” One answer takes five minutes. The other takes a day.

dbt’s lineage graph is genuinely the right artifact for this. The fact that it ships for free with the rest of dbt is part of what makes the framework the right fit for OT data, not a separate sell.


The technical details that are worth getting right

A few specifics for the dbt-on-SCADA case, because the framework was not originally designed for time-series volumes.

Incremental models are the correct pattern for tag history. Tag history is append-only, partitioned by day or by hour depending on polling frequency. dbt’s incremental materialization handles this cleanly. The merge key is (tag_id, event_timestamp). Late-arriving data is handled by the same overlap-window pattern we covered in How Not to Take Down Your SCADA Source, applied here at the dbt layer.

Clustering and partitioning matter at this scale. Snowflake’s micro-partition pruning, BigQuery’s clustering, Databricks’ partitioning, all of them work the same way: queries that filter on the clustered columns get fast, queries that don’t get expensive. For tag data, the right clustering keys are usually (event_timestamp, tag_id) in that order. Get this wrong and your gold-layer aggregates get expensive in a hurry.

Snapshots are the right answer for the tag dictionary. Tag definitions change over time. New tags get added. Existing tags get reassigned to different wells when a workover happens. dbt snapshots capture this as a slowly-changing-dimension-style history, which means a query against historical data gets the tag definitions that were correct at the time, not the current ones. This is a subtle but important point that often gets missed.

Elementary or Monte Carlo on top of the dbt project. Both integrate cleanly with dbt and both add a class of anomaly detection that is hard to get from tests alone. Worth setting up once the bronze and silver layers are stable.


The case that should not need to be made

If your SCADA data is in Snowflake (or BigQuery, or Databricks) and your dbt project stops at the business data layer, you have lineage and tests for the data your CFO consumes and none of it for the data your field engineers are making operational decisions with.

That is backwards. The consequences of getting an operational decision wrong are usually more immediate than the consequences of getting a financial report wrong. The discipline should at least be equal.

The good news is that the dbt project you already have is the right starting point. The patterns transfer. The tests are even easier to write because the real-world rules (a tank level can’t be negative; a pressure transducer can’t read above its rated max) are concrete in a way business data rarely is.

Operational data is not a special case. It deserves the same engineering discipline as everything else. The team that figures this out first stops fighting fires every month and starts answering the questions that matter.


Further Reading

Get in touch