Data Quality in Upstream Oil and Gas: What Goes Wrong and Where to Start

Every operator we’ve ever worked with has a data quality story. It usually starts with something like “we tried to consolidate all our well data last year” and ends with “and then we found out the numbers didn’t match so we stopped.” The specifics vary. The shape of the story is always the same.

Upstream oil and gas has data quality problems that most industries do not. It’s not that the people are careless. It’s that the data itself is genuinely hard. Wells get drilled, recompleted, sidetracked, plugged, re-entered, and reassigned to different operators over decades. Land interests get carved up and reassembled. Production allocation depends on judgment calls that were made years ago and never documented. By the time anyone tries to make all of this agree in a single system, the problems are layered deep enough that pulling on one thread rarely helps.

If you’re sitting on a pile of well, land, and production data and you already know the numbers don’t match, you’re not alone. The question isn’t whether your data has quality problems. The question is which ones are worth fixing and in what order.


The failure modes you’ll actually see

Upstream data quality issues don’t look like the generic examples from a data governance textbook. They have specific shapes that anyone who has worked in the industry will recognize.

Well identifier chaos. The same well appears under different API numbers in different systems. Someone used a 10-digit API where a 14-digit API was required, or vice versa. The leading zeros got stripped when the number went through Excel. A completion update created a new record instead of updating the existing one. Multiply this across twenty years of records and you end up with a well list where every well exists three times, and nobody is sure which version is current.
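Untangling that starts with normalizing every raw identifier to one canonical form before comparing records. Here is a minimal sketch that standardizes raw API numbers to 14 digits; the padding conventions (right-padding a 10-digit base with zeros for the missing sidetrack and event codes) are common but not universal, so treat this as an assumption to verify against your own records, not a rule.

```python
import re

def normalize_api14(raw: str) -> str:
    """Normalize a raw API number to a canonical 14-digit string.

    Assumes the common convention: a 10-digit base API
    (state + county + well) padded with zeros when the
    sidetrack/event suffix is absent.
    """
    digits = re.sub(r"\D", "", raw)      # strip dashes, spaces, etc.
    if len(digits) < 10:                 # Excel stripped leading zeros
        digits = digits.zfill(10)
    if len(digits) == 10:
        digits += "0000"                 # no sidetrack/event suffix
    elif len(digits) == 12:
        digits += "00"                   # sidetrack but no event code
    if len(digits) != 14:
        raise ValueError(f"unrecognized API number: {raw!r}")
    return digits
```

Values that still fail to parse should be quarantined for human review, not guessed at; a normalizer that silently invents a well identifier is worse than the chaos it was meant to fix.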

Operator and ownership drift. Operator names get spelled inconsistently. “XYZ Energy LLC” appears as “XYZ Energy”, “XYZ Energy, L.L.C.”, “XYZ Energy Company”, and half a dozen variants. Working interest percentages get updated in one system and not in another. The royalty records in accounting show one breakdown; the land department has another; neither matches the current reality.
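Operator name cleanup usually begins with a normalization key like the one below, which collapses punctuation and legal-suffix variants into a single comparison string. The suffix list is an illustrative assumption to extend from your own data, and a matching key only flags candidates for merging; it doesn’t prove two records are the same company.

```python
import re

# Suffixes that commonly vary between systems; extend per your data.
_SUFFIXES = r"\b(LLC|L\.?L\.?C\.?|INC\.?|CO\.?|COMPANY|CORP\.?|LP|L\.?P\.?)\b"

def normalize_operator(name: str) -> str:
    """Collapse common spelling variants of an operator name into a
    single comparison key. A matching key is a merge candidate,
    not proof -- keep a human review step before merging records."""
    key = name.upper()
    key = re.sub(_SUFFIXES, "", key)       # drop legal-suffix variants
    key = re.sub(r"[^A-Z0-9 ]", " ", key)  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip() # collapse whitespace
    return key
```

With this, “XYZ Energy LLC”, “XYZ Energy, L.L.C.”, and “XYZ Energy Company” all reduce to the same key and surface as one cluster to review.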

Production allocation disagreements. Field-reported volumes don’t match metered volumes. Metered volumes don’t match what accounting booked. Accounting doesn’t match what got reported to the OCC. Each of those differences has an explanation, and each explanation made sense at the time it was introduced, but the net result is that the company has four different “production numbers” for the same well in the same month.

Land and production that don’t speak to each other. Production data lives in one system. Land data lives in another. The connection between a well and the land interests that generate revenue from it is maintained by whoever owns the revenue run, which may or may not match what either the production system or the land system thinks is true.

Historical data that’s effectively unqueryable. Decades of production history exists, but it lives in formats that are hard to query (legacy systems, PDFs, archived Excel files, paper filings that were scanned but never indexed). It’s technically there. For practical purposes it isn’t.


Why standard frameworks don’t quite fit

Data quality as a discipline has a well-developed literature. Dimensions like completeness, accuracy, consistency, and timeliness are well understood. Frameworks for measuring and improving them are everywhere.

They don’t map cleanly onto upstream data for a few reasons.

Historical data was never collected with a quality standard in mind. You can set a standard for new data going forward. You cannot retroactively apply it to thirty years of filings. A lot of the quality problems in an operator’s current data are not bugs that can be fixed. They’re artifacts of how the data was originally recorded, which wasn’t with analytics in mind.

“Correct” is often a judgment call. For a well that was drilled in 1978, recompleted in 1991, and had an operator change in 2004, what’s the “correct” operator of record? It depends on what you’re asking and when you’re asking it. A generic data quality rule can’t capture that nuance.

The business logic is industry-specific. Working interest calculations, royalty breakdowns, production allocation methods. None of these look like anything in a standard enterprise data model. A data quality tool that doesn’t understand what a net revenue interest is cannot validate that the data is consistent.

This doesn’t mean standard frameworks are useless. It means they need to be adapted to the domain, and the people doing the adaptation need to actually understand the industry.


Where to start

The instinct when facing a large data quality problem is to try to fix everything at once. That’s how most data quality initiatives die.

The approach that works looks like this.

Pick one business outcome that matters. Don’t start with “our data quality is bad.” Start with “our monthly revenue reconciliation takes eight days and is wrong about 15 percent of the time.” Or “our reserves report requires two weeks of cleanup every quarter.” Or “when we get a diligence request, we can’t answer it quickly.” Those are the problems worth solving. The data quality work is in service of solving them.

Trace the data that feeds that outcome. Work backwards from the output. If the monthly revenue reconciliation is the problem, figure out which tables, systems, and sources contribute. Don’t try to inventory all of your data. Just the data that matters for this one outcome.

Define what “correct” means for the pieces that matter. This is the hardest part and the part that gets skipped most often. What’s the authoritative source for well operator? What’s the rule for resolving production volume disagreements? What’s the allocation methodology? These are business decisions, not technical ones. They need to be made, documented, and agreed on before any automation will help.
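One lightweight way to make those decisions enforceable is to write the agreed precedence down as data rather than prose. The source names below are hypothetical; the point is that the rule lives in one place, is visible to everyone, and can be applied by code consistently instead of being re-argued in every reconciliation meeting.

```python
# Illustrative source-of-truth precedence, agreed by the business and
# documented where everyone can see it. System names are assumptions.
SYSTEM_OF_RECORD = {
    "well_operator":     ["land_system", "accounting", "field_reports"],
    "production_volume": ["meter", "field_reports", "accounting"],
}

def resolve(field: str, candidates: dict):
    """Pick the value from the highest-precedence source that has one.

    candidates maps source name -> reported value for one well/field.
    Returns (value, winning_source), or (None, None) if nothing matched.
    """
    for source in SYSTEM_OF_RECORD[field]:
        if source in candidates and candidates[source] is not None:
            return candidates[source], source
    return None, None
```

The resolution logic stays trivial on purpose. What matters is that changing the precedence order is an explicit, reviewable edit to one table, not a quiet change buried in someone’s spreadsheet.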

Instrument the current state. Put measurement in place before you try to fix anything. How many wells in your current data have conflicting operator records? How large are the production reconciliation variances? How often does the land department’s working interest disagree with accounting’s? You can’t improve what you don’t measure. And without baseline numbers, you won’t be able to show anyone that the investment paid off.
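A baseline can be as simple as a script that counts the known disagreement types. The sketch below assumes flat rows of (well, source, value) tuples and a 1 percent spread tolerance; both the row shapes and the threshold are placeholders to swap for your own schema and whatever tolerance the business has agreed on.

```python
from collections import defaultdict

def baseline_metrics(well_rows, prod_rows):
    """Compute baseline quality metrics before any cleanup starts.

    well_rows: iterable of (api14, source_system, operator)
    prod_rows: iterable of (api14, month, source_system, volume_bbl)
    Row shapes and the 1% tolerance are illustrative assumptions.
    """
    # Wells whose source systems disagree on the operator of record
    operators = defaultdict(set)
    for api14, _, operator in well_rows:
        operators[api14].add(operator)
    conflicting = sum(1 for ops in operators.values() if len(ops) > 1)

    # Spread between the highest and lowest volume reported for the
    # same well-month across sources (field, meter, accounting, ...)
    volumes = defaultdict(list)
    for api14, month, _, vol in prod_rows:
        volumes[(api14, month)].append(vol)
    over_tolerance = sum(
        1 for vols in volumes.values()
        if max(vols) - min(vols) > 0.01 * max(vols))  # >1% spread

    return {
        "wells_with_conflicting_operator": conflicting,
        "well_months_over_1pct_spread": over_tolerance,
    }
```

Run it on a schedule and keep the history. The trend line is what proves, six months in, that the cleanup is actually working.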

Fix the inputs first, then backfill. New data coming in should be clean. Build the ingestion pipelines to enforce the rules you just defined. Once the new data is trustworthy, you can start working backwards through the historical data. If you try to clean history first, new bad data keeps arriving and the cleanup never ends.
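Enforcing the rules at ingestion can start as small as a per-record validator that returns violations instead of silently loading the row. The field names and rules below are illustrative stand-ins for whatever your team actually agreed on; records that fail go to a quarantine table for review rather than into the well master.

```python
def validate_well_record(rec: dict) -> list[str]:
    """Return rule violations for an incoming well record; an empty
    list means the record passes. Field names and rules here are
    illustrative -- substitute the standards your team agreed on."""
    errors = []
    api = rec.get("api14", "")
    if not (api.isdigit() and len(api) == 14):
        errors.append("api14 must be exactly 14 digits")
    if not rec.get("operator"):
        errors.append("operator is required")
    wi = rec.get("working_interest")
    if wi is not None and not (0.0 <= wi <= 1.0):
        errors.append("working_interest must be between 0 and 1")
    return errors
```

The design choice worth copying is that the validator reports every violation rather than stopping at the first one, so the quarantine queue tells reviewers the whole story for each rejected record.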


The prerequisite problem

A lot of data quality conversations end up in the same place: there’s no foundation to improve on. The data is scattered across systems with no consistent structure, so defining quality rules is impossible, and enforcing them is harder still.

For most operators, the prerequisite is getting the data into a single, structured, properly modeled place. That’s the migration we wrote about in From Spreadsheets to a Real Data Stack: A Realistic Migration Path for Mid-Size Operators, and it’s the same foundation that makes PPDM implementation useful, as covered in Why Oklahoma Energy Companies Can’t Afford to Ignore Data Engineering.

Data quality initiatives tend to fail when they’re attempted before the foundation exists. You can’t enforce consistency across systems if the systems can’t be joined. You can’t measure completeness if there’s no agreed-upon scope of what “complete” means. Start with the structural work. The quality work gets much easier on a proper foundation.


The organizational side

Tools and pipelines only go so far. Data quality is ultimately a question of who is responsible for what.

The usual problem is that responsibility for data accuracy is diffuse. The land team owns some of it. Accounting owns some of it. Engineering owns some of it. IT owns the systems. Nobody owns the fact that the numbers don’t agree across those groups.

Fixing that doesn’t require a large new organization. It requires naming the person who owns the master record for each domain. Who owns the well master? Who owns the working interest register? Who owns the production allocation methodology? Once those owners exist, disputes have somewhere to go, and the data quality rules have someone who can approve them.

This isn’t glamorous work. It’s also the difference between a company that has trustworthy data and a company that doesn’t.


Progress, not perfection

Nobody’s upstream data is perfect. The goal isn’t perfect data. The goal is data that’s good enough for the decisions you need to make, with visibility into where the known gaps are.

The operators we see making real progress share a few traits. They pick specific, measurable problems. They invest in the structural work before the cleanup work. They name owners for their data domains. They instrument their current state before they start changing it. And they accept that this is a multi-year program, not a one-quarter project.

The payoff is that the decisions the business cares about (where to drill, what to buy, how to report, what to forecast) start being made from data the organization actually trusts. Everything else follows from that.


See Us at PPDM 2026

We’ll be at the PPDM Energy Data Convention in Houston, April 27 through 29. Stop by Booth #2 if you want to talk about data quality, PPDM, or any of the specific headaches in your current stack. We’d love to hear what you’re working on.
