Crawl, Walk, Run: A Realistic Sequence for SCADA Ingestion

Most SCADA ingestion programs we audit are not failing because the technology is wrong. They are failing because they tried to boil the ocean. Eighteen months in, the team has three half-built pipelines, a long backlog of “remaining vendors,” a dashboard that pulls from a fourth source nobody mentioned in the kickoff, and a sponsor asking when this is going to start producing value.

A phased approach gets you to actual value faster. Not because phasing is intrinsically virtuous. Because the first pipeline is the one that proves the pattern, and you can’t expand a pattern that doesn’t exist yet.

This post is a sequencing guide for someone who has just been told they own SCADA ingestion. The technology choices matter less than the order you make them in.


Crawl: prove the pattern on one asset

The temptation in the first phase is to optimize for coverage. Pick the vendor with the most assets, plan for the whole portfolio, scope ten pipelines at once. We watch this happen constantly and it is almost always the wrong move.

The right move is to optimize for proving a pattern that can be handed to someone else.

Pick the cleanest asset, not the most important one. That means the asset with the best documentation, the most accessible vendor API, and the fewest unknowns. For most modern operators this is an Ignition installation, because Ignition speaks OPC UA natively, supports MQTT through add-on modules, and has a well-documented historian. If your portfolio includes one, start there. If it doesn’t, the next cleanest is usually OSIsoft PI (now AVEVA PI System), because PI Web API is mature and well-documented.

Avoid the temptation to “start with the hardest one to prove we can do it.” This is bad advice. The crawl phase is not about proving you can do hard things. It is about proving you have a repeatable pattern. Hard assets break repeatability and slow the proof.

One pipeline, end to end, fully documented. Raw tags landing in the warehouse. Basic quality checks running on a schedule. One dbt model surfacing the data in a usable format. One dashboard or report consuming the model, even if it’s just for internal validation. Every step written down with enough specificity that the next person can repeat it.

The deliverable from this phase is not coverage. It is a working pipeline plus a document that lets the next person rebuild it without the original author in the room.

Validate the boring stuff before you move on. Timestamp integrity, tag completeness, freshness end to end. The data in the warehouse matches what’s in the historian, on the day it arrives, for the right wells, in the right units. We have seen teams skip this validation and then spend the next two phases discovering that their reference pipeline had a clock-drift bug. Fix it once, here, before it becomes a pattern.
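To make those three checks concrete, here is a minimal sketch in Python. It assumes readings arrive as (tag, timestamp, value) tuples with timezone-aware UTC timestamps. The function name check_batch, the tag names, and both thresholds are illustrative assumptions, not part of any vendor API.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, expected_tags, now,
                max_lag=timedelta(minutes=15),
                max_future=timedelta(minutes=1)):
    """Return a list of human-readable issues for one ingested batch.

    rows: iterable of (tag, timestamp, value), timestamps timezone-aware UTC.
    Thresholds are illustrative assumptions, not recommendations.
    """
    rows = list(rows)
    issues = []

    # Tag completeness: every expected tag should appear at least once.
    seen = {tag for tag, _, _ in rows}
    missing = sorted(set(expected_tags) - seen)
    if missing:
        issues.append(f"missing tags: {missing}")

    # Timestamp integrity: readings from the future usually mean clock drift
    # or a timezone mix-up between the historian and the warehouse.
    future = sorted({tag for tag, ts, _ in rows if ts > now + max_future})
    if future:
        issues.append(f"future timestamps: {future}")

    # Freshness: judge staleness on plausible timestamps only, so a drifting
    # clock cannot mask a stalled pipeline.
    plausible = [ts for _, ts, _ in rows if ts <= now + max_future]
    if not plausible or now - max(plausible) > max_lag:
        issues.append("stale: no recent plausible reading")

    return issues
```

An empty list means the batch passed. Anything else is worth a look before the pattern gets copied to the next asset.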

Documentation is the deliverable. Not a slide deck. Real artifacts: a README that walks through the architecture, a runbook for the most common failure modes, an inventory of credentials and where they live, and notes on the non-obvious decisions (why one-minute aggregates instead of raw, why this specific quality flag is filtered, why the polling cadence is what it is). The same point we made about building for the next engineer applies in spades to the crawl phase.

The goal of crawl is to get to a state where you can say, with a straight face, “here is one fully working, fully documented pipeline.” If you can’t, you do not have a foundation. You have a pilot.


Walk: expand on the proven pattern

Once the pattern is proven, the second phase is about scale. This is where most programs make the transition from “novel project” to “predictable engineering work.” The shape of the work changes, and the discipline has to shift with it.

Onboard remaining assets on the same vendor platform. Every Ignition site in the portfolio. Every PI server. Whatever the crawl-phase vendor was, finish that vendor first. The marginal cost of the second pipeline on the same platform should be a small fraction of the first one, because the pattern is already there. If it isn’t, the crawl-phase pattern wasn’t really a pattern.

Add the next tier of vendors. Usually this is the SaaS telemetry platforms with accessible APIs. The major SaaS SCADA tools all expose data in some form. The API quality varies. The patterns to handle each one are similar in shape but different in details. Plan for two or three weeks per new vendor, less if the vendor has a published export and more if you’re reverse-engineering one.

Tag normalization becomes real. This is the phase where the silver layer in dbt has to start reconciling tag names across vendors. We covered the tag dictionary work in detail, and this is the phase where the dictionary stops being a stub and starts being a maintained artifact. Plan for it. Budget for it. Assign someone to it. If nobody owns the tag dictionary, the silver layer doesn’t get built, and then the gold layer can’t exist.
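The core of the dictionary is a lookup from (vendor, raw tag) to a canonical name. The Python sketch below shows the shape of that logic only. In practice it would more likely live as a dbt seed joined in the silver layer. Every tag name, asset name, and unit here is hypothetical.

```python
# Hypothetical dictionary: (vendor, raw tag) -> (asset, measurement, unit).
TAG_DICTIONARY = {
    ("ignition", "Site01/Well_A/TubingPressure"): ("well_a", "tubing_pressure", "psi"),
    ("pi", "W-A.TBG.PRESS"): ("well_a", "tubing_pressure", "psi"),
}


def split_mapped(readings):
    """Split raw readings into normalized rows and unmapped tags.

    readings: iterable of (vendor, raw_tag, value).
    Returns (mapped, unmapped): mapped is a list of dicts with canonical
    names, unmapped is a list of (vendor, raw_tag) pairs needing an owner.
    """
    mapped, unmapped = [], []
    for vendor, raw_tag, value in readings:
        canonical = TAG_DICTIONARY.get((vendor, raw_tag))
        if canonical is None:
            # Surface gaps instead of dropping them silently.
            unmapped.append((vendor, raw_tag))
            continue
        asset, measurement, unit = canonical
        mapped.append({
            "asset": asset,
            "measurement": measurement,
            "unit": unit,
            "value": value,
        })
    return mapped, unmapped
```

The unmapped list is the useful part. It is the work queue for whoever owns the dictionary.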

Introduce orchestration. Once you have more than three or four pipelines running on different schedules, the cron jobs scattered across machines stop being manageable. This is the right time to introduce Airflow, Dagster, or Orchestra, depending on what is already in your stack. Pick one. Move the pipelines under it. The cost of doing this later, once you have twenty pipelines, is significantly higher than doing it now.

Address credentials and access gaps. Inevitably, some assets you assumed were accessible turn out to require access you don’t have. A vendor relationship that needs upgrading. A historian that requires a license tier you don’t have. A site whose IT contact left and whose VPN credentials are someone else’s problem. List them, escalate them, and don’t pretend they’re going to resolve themselves. The walk phase is the right time to do this, because by now you have proof that the pipeline pattern works and you have the credibility to ask for access.

The exit criterion for walk is straightforward. The majority of your portfolio is feeding one warehouse, on a managed orchestration layer, with a working tag dictionary and a documented runbook. The remaining assets are scoped, known, and queued for the next phase.


Run: unification and the harder questions

The third phase is where most programs stop measuring themselves against coverage and start measuring themselves against capability. The data is in. The integration is mostly done. The harder questions become the right ones to ask.

Onboard the legacy long tail. The older RTU platforms. The historian extracts that came in a custom file format. The asset whose SCADA admin retired and whose system is on a server in a closet. The work here is unglamorous and often involves working through previous vendors or contractors. Plan for it to take longer than you expect. The marginal value is real but small per asset.

Build a semantic layer. Once the underlying data is reliable and normalized, the right next step is a layer that lets business users (and, increasingly, AI-powered interfaces) ask questions in domain terms rather than tag terms. “How is this pad performing today?” should not require knowing which tags map to which pad. The semantic layer answers that. dbt’s semantic layer, Cube, or your warehouse’s native semantic features all work; pick the one that fits the stack.
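As a toy illustration of what “domain terms rather than tag terms” means, the Python sketch below resolves a pad to its wells and sums one canonical measurement. It is not how dbt’s semantic layer or Cube is configured. The pad, well, and measurement names are hypothetical, and it assumes the data has already been normalized to (asset, measurement) keys.

```python
# Hypothetical mapping from a business entity (pad) to its wells.
PAD_WELLS = {"pad_7": ["well_a", "well_b"]}


def pad_total(pad, measurement, latest):
    """Sum the latest value of one measurement across a pad's wells.

    latest: dict keyed by (asset, measurement) -> most recent value.
    Returns (total, missing_wells) so gaps are visible to the caller.
    """
    total = 0.0
    missing = []
    for well in PAD_WELLS[pad]:
        key = (well, measurement)
        if key in latest:
            total += latest[key]
        else:
            missing.append(well)
    return total, missing
```

The caller asks about pad_7 and oil_rate. Nobody has to know which vendor tags sit underneath.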

Predictive maintenance and anomaly detection become viable. Not before now. The reason most predictive maintenance pilots fail is that the underlying data is too noisy, too inconsistent, or too fragmented to support the models. Once the data foundation is reliable and normalized, the same point Jeff made about clean data being the unlock for autonomy starts to pay off. The algorithms are not the constraint. The data is.

Resolve the streaming versus batch question. This is the right phase to resolve it, because by now you have actual tag volumes, actual query patterns, and actual cost data. Not estimates. Most operators discover that the streaming-first pitch from the early days was overkill for ninety percent of their tags, and that one-minute or fifteen-minute batch pulls do the job at a fraction of the cost. A small set of tags (alarm states, critical equipment status) might justify true streaming. Make that decision based on what the workload actually looks like, not on what you guessed at the start.
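A rough way to see the volume difference is to count rows landed per day under each plan. The Python sketch below is arithmetic only. The 10,000-tag portfolio and the 500 fast tags are made-up numbers, and it approximates streaming as one reading per second per tag, which real event-driven streams may not match.

```python
SECONDS_PER_DAY = 86_400


def rows_per_day(tag_count, interval_seconds):
    """Rows landed per day if every tag is sampled at a fixed interval."""
    return tag_count * (SECONDS_PER_DAY // interval_seconds)


def plan_rows_per_day(plan):
    """plan: list of (tag_count, interval_seconds) tiers."""
    return sum(rows_per_day(count, interval) for count, interval in plan)


# Hypothetical portfolio of 10,000 tags.
everything_fast = plan_rows_per_day([(10_000, 1)])       # 864,000,000 rows/day
tiered = plan_rows_per_day([(500, 1), (9_500, 60)])      # 56,880,000 rows/day
```

Under these assumed numbers the tiered plan lands roughly one fifteenth of the rows. Substitute your own tag counts and cadences once you have them. Row counts are a proxy for cost, not a bill.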

Build the governance you can’t avoid any longer. Tag dictionary maintenance procedures. Lineage documentation. Notification process for when vendors add or rename tags. Owner assignments for each pipeline. None of this is glamorous, all of it has been deferrable until now, and from this point forward it is what keeps the program from sliding backward.
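The notification process for added or renamed tags can start as a scheduled diff between the tags in the dictionary and the tags the vendor currently reports. The Python sketch below is one way to do that under stated assumptions. The function name diff_tags, the tag names, and the 0.8 similarity cutoff are illustrative, and a string-similarity match is only a hint that a human should confirm.

```python
import difflib


def diff_tags(known, observed, cutoff=0.8):
    """Compare dictionary tags with the tags a vendor currently reports.

    Returns added tags, missing tags, and likely renames (a missing tag
    whose name closely resembles an added one).
    """
    known, observed = set(known), set(observed)
    added = sorted(observed - known)
    missing = sorted(known - observed)

    possible_renames = {}
    for tag in missing:
        # get_close_matches returns the best candidates at or above cutoff.
        match = difflib.get_close_matches(tag, added, n=1, cutoff=cutoff)
        if match:
            possible_renames[tag] = match[0]

    return {
        "added": added,
        "missing": missing,
        "possible_renames": possible_renames,
    }
```

Route the output to whoever owns the pipeline. A non-empty result is a prompt to update the dictionary, not an automatic fix.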


The mistakes we see most often

A few patterns that come up repeatedly. Each one looks reasonable in the moment.

Starting with the hardest asset. “If we can ingest from this one, we can ingest from anything.” Maybe. But you spent four months getting through one pipeline and you still don’t have a repeatable pattern, because the asset you picked had too many unknowns to generalize from. Pick easy first. Generalize from working. Hard comes later.

Building ingestion without a transformation layer. Raw tag dumps in the warehouse are not usable at scale. Five consumers will write five interpretations of the same tag. Build the silver layer in parallel with the ingestion, not after.

Skipping documentation in the crawl phase. The entire value of the crawl is a pattern that can be handed to the next person. If you can’t hand it off, you don’t have a foundation; you have one engineer’s tribal knowledge.

Confusing active scanning with passive extraction. Active scanning of live OT networks can destabilize PLCs and is a different conversation than pulling data from a historian’s API. Always confirm the access method before touching a live system. The cost of getting this wrong is not “the pipeline broke.” The cost is “we tripped a well.” This is a conversation to have with the operations team before the project starts, not after.

Conflating polling frequency with streaming. Most SCADA use cases don’t need true streaming. One- to fifteen-minute batch pulls are operationally equivalent for surveillance, allocation, and reporting use cases, and they are significantly cheaper at scale. Start with batch. Move to streaming only for the specific use cases that need it. The same pattern we covered in How Not to Take Down Your SCADA Source applies: the cheapest pipeline is the one that does the least work.

Sequencing the vendor onboarding in the wrong order. Easy assets first means the harder ones get the benefit of a working pattern. Going vendor-by-vendor in the order of corporate convenience means each new vendor is its own project. Sequence based on engineering complexity, not on which executive’s portfolio gets attention first.


The bar for moving from crawl to walk

The single most common failure mode for SCADA ingestion programs is moving from crawl to walk before crawl is actually done. The pressure to show coverage is real. The fix is to be explicit about the criteria.

You are done with crawl when you can point at one fully working, fully documented pipeline for one asset, and a person who didn’t build it can extend the pattern to a new asset on the same vendor without asking the original author for help.

If you can’t do that, you don’t have a foundation. You have a pilot.

Walk is faster than crawl. Run is faster than walk. The compounding works in your favor if and only if the crawl phase produced a real pattern. Skip that step and every subsequent phase costs more than it should.

The teams that finish this work in two years are the teams that took the first three months to do crawl correctly. The teams that are still mid-walk after three years almost always skipped it.

