Your SCADA Data Is Already in Snowflake. That Doesn't Mean It's Reliable.

In most upstream data programs there is a moment when SCADA quietly stops being a SCADA problem and becomes a warehouse problem. The handoff usually happens without anyone announcing it. Somebody capable, often a data engineer or a reservoir engineer who codes, stands up a pipeline. It works. A handful of other engineers notice and start building on top of it. Six months later there are dashboards, surveillance apps, and an allocation comparison all pulling from tag history nobody officially owns.

That setup is further along than people give it credit for. It is also closer to the edge than they realize.


A pipeline that runs is not the same as a pipeline you can trust

We talk to a lot of teams who have already solved the hard parts of getting SCADA data into a cloud warehouse. They have figured out the connector pattern. They have a destination. The data shows up. From a distance, that looks like a finished job.

It is not finished. It is a working proof of concept that is now load-bearing for decisions further downstream than anyone planned.

The first time the pipeline breaks, you find out the way every team eventually finds out. A production engineer notices a dashboard is stale. An allocation report comes in low. Somebody pings the Slack channel and asks if anything changed. The answer is always the same. Something changed, and nobody was watching for it.

The gap between a pipeline that runs and a pipeline you can trust is almost never the ingestion tool. It is observability, error handling, documented ownership, and a recovery procedure that exists somewhere other than one person’s head.


How the typical setup ends up here

The pattern is consistent across operators. Each step makes sense on its own. The composite is what gets you in trouble.

Step one. Somebody stands up a pipeline because they need the data for one thing. Maybe it’s a Cirrus Link bridge feeding Snowpipe Streaming. Maybe it’s Estuary or Fivetran picking up a vendor API. Maybe it’s Sparkplug B over MQTT into Kafka, then into Snowflake. The mechanism is usually fine. Modern tooling for this is good.

Step two. Other engineers see the data is queryable and start using it. A surveillance dashboard. A workover candidate report. A monthly comparison against allocated production. Each new consumer assumes the pipeline is more permanent than its author intended.

Step three. The original author moves on, gets promoted, or just stops being the active maintainer. The pipeline keeps running. Nobody officially takes it over because, from the outside, there is nothing wrong with it.

Step four. Something upstream changes. The historian vendor adds a new tag schema. The broker hits a memory threshold during an integration window. A clock drifts. An API rotates a token. The pipeline fails partially, silently, or in a way that is only visible to the people consuming the data.

By the time anyone realizes something is wrong, the decisions made off that data have already been made.


Where the actual fragility lives

A few places we see this break, in roughly the order we see it.

MQTT brokers with no production owner. Sparkplug B on top of Ignition is genuinely a good pattern. It is fast to set up, the message format is well documented, and the publish-on-change semantics are kind to the source. The problem is that the broker becomes a critical piece of infrastructure that often sits with whoever happened to deploy it. HiveMQ has a solid managed MQTT broker, and Confluent Cloud does the same for the Kafka side if that is where the messages go next. The question is who in your org is responsible for the broker today, and whether they know they are.

Snowpipe and Snowpipe Streaming failures that don’t page anyone. Snowflake’s ingestion side is genuinely reliable. It is also genuinely silent. A Snowpipe Streaming channel that closes unexpectedly, a stage with a credentials problem, a COPY command rejecting half a file because of a schema mismatch: all of these can sit for hours before anyone notices, because the only signal you get is that newer data stops landing. We have seen multiple operators discover, only when their morning report ran, that the pipeline had failed eight hours earlier.

Vendor APIs that change without warning. A SaaS SCADA vendor decides to rename a field. They put a notice in their release notes. Nobody on your team is on the distribution list. Your extractor either drops the renamed field silently or errors out in a way that gets retried until it fills the alerting queue. Both happen.
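A renamed field is cheap to catch at the extractor if the expected shape is written down. The sketch below is one minimal way to do that, not the way any particular vendor or connector handles it. The field names and the `detect_schema_drift` helper are hypothetical and exist only for illustration.

```python
from typing import Iterable

# Hypothetical field names for illustration; substitute the vendor's real ones.
EXPECTED_FIELDS = frozenset({"tag_id", "timestamp", "value", "quality"})


def detect_schema_drift(records: Iterable[dict], expected: frozenset = EXPECTED_FIELDS) -> dict:
    """Compare the fields seen in a batch against the fields the extractor expects."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    return {
        "missing": sorted(expected - seen),     # expected but absent from the whole batch
        "unexpected": sorted(seen - expected),  # present but unknown, often a rename
    }


# Example batch in which the vendor has renamed "value" to "val".
batch = [
    {"tag_id": "WELL_12.TBG_PRESS", "timestamp": "2024-05-01T00:00:00Z", "val": 812.4, "quality": 192},
]
drift = detect_schema_drift(batch)
if drift["missing"] or drift["unexpected"]:
    print(f"schema drift: {drift}")
```

Failing the load loudly on a non-empty result is usually a better default than dropping the unknown field, because a missing field paired with an unexpected one is almost always a rename.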

Clock drift between source and warehouse. Sounds minor. Is not. When the historian’s wall clock and the warehouse’s wall clock disagree by even a few minutes, late-arriving readings start landing in the wrong partition, freshness checks lie, and reconciliation jobs that compare against allocated production produce confusing results. The fix is straightforward. The cost of not noticing for a quarter is not.
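One inexpensive way to notice drift is to compare each reading's source timestamp with the time the warehouse received it. The sketch below assumes both are available as timezone-aware UTC datetimes. The `ingest_lag_summary` name and the five-minute threshold are assumptions chosen for illustration, not a recommended setting.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def ingest_lag_summary(pairs, max_lag=timedelta(minutes=5)):
    """Summarize (event_time, ingested_at) pairs, both timezone-aware UTC datetimes.

    A negative lag means the source stamped a reading later than the warehouse
    received it, which transport delay alone cannot explain.
    """
    lags = [(ingested - event).total_seconds() for event, ingested in pairs]
    if not lags:
        return {"count": 0, "median_lag_s": None, "future_stamped": 0, "suspect": False}
    future = sum(1 for lag in lags if lag < 0)
    med = median(lags)
    return {
        "count": len(lags),
        "median_lag_s": med,
        "future_stamped": future,
        "suspect": future > 0 or med > max_lag.total_seconds(),
    }


# Example: a historian clock running three minutes ahead of the warehouse.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
pairs = [
    (now + timedelta(minutes=3), now),
    (now + timedelta(minutes=3, seconds=10), now + timedelta(seconds=10)),
]
summary = ingest_lag_summary(pairs)
print(summary)
```

A large positive median lag can also come from ordinary transport delay or buffering, so the result is a prompt to investigate rather than proof of drift. Future-stamped readings are the less ambiguous signal.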

No runbook for the obvious failure modes. The broker drops. The credentials expire. The vendor changes their export format. Every one of these is predictable. None of them has a documented response in most pipelines we audit. The recovery time on a 2 AM page comes down to how quickly somebody can reverse-engineer what they built six months ago.


Snowflake isn’t the problem

Worth saying directly, because we get asked. The choice of Snowflake as the destination is fine. Snowflake handles time-series tag data well when the tables are clustered correctly. Snowpipe Streaming is a reasonable answer for low-latency ingestion. The cost model is predictable enough to plan around.

The reliability problem is not the destination. It is everything between the historian and the destination, and the operational discipline around what happens when that path breaks.

This matters because the instinct when a SCADA pipeline becomes flaky is to rebuild the whole thing. Pick a new ingestion tool. Pick a new destination. Start over. Most of the time the architecture is fine and the gap is operational. Rebuilding doesn’t fix the operational gap. It just delays the next time you hit it.


What hardening actually looks like

The work is unglamorous. It also pays for itself the first time the pipeline breaks and someone notices in fifteen minutes instead of eight hours.

dbt tests on the raw tag data. Freshness tests on max timestamp per tag catch sensors that stopped reporting. Null-rate tests catch vendor API problems. Accepted-range tests catch sensors that need calibration. These are cheap to write, easy to schedule, and they fire before the dashboard goes stale. We covered the dbt side of this in Bringing SCADA Data Into Your Data Warehouse Without Breaking Either, and the same patterns apply once the data is in Snowflake.
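In dbt these checks would be written as tests in YAML and SQL. The logic behind them is small enough to show in a few lines of Python. The sketch below is an illustration of the three checks, not dbt code, and the `check_tag` function and all thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_tag(readings, now, max_age, max_null_rate, low, high):
    """Run freshness, null-rate, and accepted-range checks on one tag's readings.

    `readings` is a list of (timestamp, value) tuples; value may be None.
    Returns a list of failed check names; an empty list means all three passed.
    """
    if not readings:
        return ["freshness"]  # no rows at all counts as stale
    failures = []
    newest = max(ts for ts, _ in readings)
    if now - newest > max_age:
        failures.append("freshness")
    values = [v for _, v in readings]
    nulls = sum(1 for v in values if v is None)
    if nulls / len(values) > max_null_rate:
        failures.append("null_rate")
    if any(v is not None and not (low <= v <= high) for v in values):
        failures.append("accepted_range")
    return failures


# Example: a tubing-pressure tag with made-up thresholds.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
readings = [
    (now - timedelta(hours=2), 810.0),
    (now - timedelta(minutes=90), None),
    (now - timedelta(minutes=61), 5200.0),
]
print(check_tag(readings, now, timedelta(minutes=30), 0.1, 0.0, 3000.0))
```

Per-tag thresholds matter here. A tag that reports on change can legitimately go quiet for hours, so a single global freshness limit tends to either miss real outages or page constantly.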

Watermark tables that record what the pipeline actually did. Per source, per tag set, the last successful watermark and the row count returned. When something goes wrong, this is the first thing anyone needs. When things are running normally, it gives you a baseline to detect anomalies against.
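A watermark table can be very small. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse table so that it runs anywhere. The table name, column names, and helper functions are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse table
conn.execute(
    """
    CREATE TABLE pipeline_watermarks (
        source     TEXT NOT NULL,
        tag_set    TEXT NOT NULL,
        run_at     TEXT NOT NULL,    -- when the extract finished (ISO 8601, UTC)
        watermark  TEXT NOT NULL,    -- newest source timestamp loaded so far
        row_count  INTEGER NOT NULL  -- rows returned by this run
    )
    """
)


def record_run(source, tag_set, run_at, watermark, row_count):
    """Append one row per run; history is kept so a baseline can be computed later."""
    conn.execute(
        "INSERT INTO pipeline_watermarks VALUES (?, ?, ?, ?, ?)",
        (source, tag_set, run_at, watermark, row_count),
    )


def last_run(source, tag_set):
    """Return (watermark, row_count) for the most recent run, or None if there is none."""
    return conn.execute(
        "SELECT watermark, row_count FROM pipeline_watermarks "
        "WHERE source = ? AND tag_set = ? ORDER BY run_at DESC LIMIT 1",
        (source, tag_set),
    ).fetchone()


def baseline_row_count(source, tag_set):
    """Average rows per run across recorded history, as a crude anomaly baseline."""
    (avg,) = conn.execute(
        "SELECT AVG(row_count) FROM pipeline_watermarks WHERE source = ? AND tag_set = ?",
        (source, tag_set),
    ).fetchone()
    return avg


# Example: two normal runs, then a run that "succeeded" with zero rows.
record_run("historian", "pad_a", "2024-05-01T01:00:00Z", "2024-05-01T00:55:00Z", 1200)
record_run("historian", "pad_a", "2024-05-01T02:00:00Z", "2024-05-01T01:55:00Z", 1200)
record_run("historian", "pad_a", "2024-05-01T03:00:00Z", "2024-05-01T01:55:00Z", 0)
print(last_run("historian", "pad_a"), baseline_row_count("historian", "pad_a"))
```

Keeping one row per run, rather than overwriting a single row per source, is what makes the baseline possible. A watermark that has not advanced across several runs is itself a useful signal.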

Owned alerting that fires on the right thing. Not “the pipeline ran.” Pipelines that ran and produced empty results are the failure mode you care about. Freshness alerting at the tag-set level, not at the job level.
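The difference between job-level and tag-set-level alerting can be sketched in a few lines. The function below treats job status as only one input and evaluates each tag set on what has actually landed. The `freshness_alerts` name, the tag-set names, and the fifteen-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness_alerts(job_succeeded, newest_by_tag_set, now, max_age):
    """Return alert messages per tag set, whether or not the job reported success.

    `newest_by_tag_set` maps a tag-set name to the newest reading timestamp
    landed in the warehouse, or None if nothing has landed.
    """
    alerts = []
    if not job_succeeded:
        alerts.append("job: last run failed")
    for tag_set, newest in sorted(newest_by_tag_set.items()):
        if newest is None:
            alerts.append(f"{tag_set}: no data landed")
        elif now - newest > max_age:
            minutes = int((now - newest).total_seconds() // 60)
            alerts.append(f"{tag_set}: newest reading is {minutes} min old")
    return alerts


# Example: the job reports success, but two of three tag sets are not fresh.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
state = {
    "pad_a": now - timedelta(minutes=5),
    "pad_b": now - timedelta(hours=8),
    "pad_c": None,
}
for alert in freshness_alerts(True, state, now, timedelta(minutes=15)):
    print(alert)
```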

A runbook that lives next to the code. Broker is unreachable: here is who to call and how to verify. Credentials are expired: here is the rotation procedure. Vendor schema changed: here is the test that catches it and how to add the new field. Five or six pages covers ninety percent of incidents. The same handoff principle applies to ingestion pipelines as it does to Airflow configuration, where the config that only lives in someone’s head is the config that takes down the pipeline.

A designated owner. One name. Not a team. Not a Slack channel. Not “whoever built it.” A person who gets paged when freshness drops and is responsible for either fixing it or escalating. Most pipelines don’t have one of these because nobody volunteered. That is precisely the problem.


The test that matters

If your engineers are using SCADA data in Snowflake today, the question worth asking is straightforward.

Does the team have a runbook for what happens when the pipeline breaks at 2 AM? Can a person who didn’t build it follow the procedure? Will the right people get paged before the next morning’s dashboard goes stale?

If the answer is no, the pipeline you have is a proof of concept that is running in production. That is not a failure. Most companies are in exactly the same state, and most are further along architecturally than they think. The work to close the gap is not a rewrite. It is a few weeks of tests, alerts, ownership decisions, and writing down what the team already knows.

The pipelines that survive contact with reality are not the ones with the cleverest architecture. They are the ones somebody is responsible for.

