There are 42 gallons in a barrel of oil. Has been since the 1860s. You can drop a barrel onto any rig, any pipeline, any refinery on earth and everyone in the supply chain agrees on what’s inside it, where it came from, and how it got there.
That didn’t happen by accident. The industry standardized because it had to. When you’re shipping crude across borders, blending streams from different fields, settling royalty payments to half a dozen interest holders, and answering to regulators in three different jurisdictions, you cannot afford to guess what’s in the barrel.
Now ask the same question about your data.
Where did this number come from. Who touched it. What rules were applied. What’s the chain of custody between the meter on the wellhead and the cell in this monthly report. Most operators we talk to can answer that question for the barrel. Almost nobody can answer it for the data.
That’s the gap this series is about.
The barrel is a chain of custody
When a barrel of oil moves through the supply chain, it carries more than the crude. It carries a record. The lease it came from. The volume reported at the meter. The run ticket from the truck or the pipeline. The lab analysis that classified the grade. The settlement that allocated revenue to the interest owners.
Every one of those steps is documented, signed off by someone with authority, and reconciled against the next step in the chain. If a barrel shows up at the refinery and the volume doesn’t match the run ticket, somebody is going to find out why. There is no scenario in which everyone shrugs and says “the numbers are close enough.”
The reason this works is that the industry built provenance and lineage into the physical workflow from day one. Nobody decided to add it later. The seal on the barrel, the run ticket, the gauge report, the division order, all of those are governance. They just don’t get called that.
Your data has the same chain of custody. It just isn’t documented anywhere.
What lineage and provenance actually mean for data
Strip the jargon and there are two questions.
Provenance. Where did this data come from, originally. Not where you got it from yesterday. The original source. The meter, the filing, the contract, the field ticket, the SCADA system. The thing that produced the value before any system in your stack ever saw it.
Lineage. What happened to it between the original source and where it is now. What systems did it pass through. What transformations were applied. What rules were used to allocate, aggregate, or reconcile. Who has touched it, when, and why.
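The two questions map naturally onto two record types. Here's a minimal sketch in Python; the class and field names are ours for illustration, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    """Where the value originally came from, before any system touched it."""
    source_type: str       # e.g. "meter", "run_ticket", "scada", "contract"
    source_id: str         # identifier of the physical or legal source
    observed_at: datetime  # when the original value was produced

@dataclass(frozen=True)
class LineageEvent:
    """One step in the chain between the source and the current value."""
    system: str            # system the value passed through
    action: str            # e.g. "ingest", "allocate", "aggregate", "override"
    rule: str              # the rule or methodology applied
    actor: str             # person or service responsible
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class TracedValue:
    """A value that carries its origin and an append-only history."""
    value: float
    provenance: Provenance
    lineage: list[LineageEvent] = field(default_factory=list)
```

Nothing about this is exotic. The point is that the value and its history travel together, the way the barrel travels with its run ticket.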
If you can answer those two questions for any number that runs your business, you have lineage and provenance. If you can’t, you have a black box with a number coming out of it, and at some point that number is going to disagree with another number in another black box, and the people responsible for sorting it out are going to spend a week trying to reconstruct what should have been written down in the first place.
We see this constantly. The monthly production number is wrong, and the only way to figure out why is to ask the analyst who built the report, who has to ask the engineer who set up the allocation, who has to ask the vendor who configured the SCADA tags, who has to check whether anything changed in the field. By the time you get an answer, the next monthly close is already underway.
Built in, not bolted on
There is a version of this problem where the answer is “buy a data lineage tool.” That isn’t quite right.
You can buy a tool that scans your warehouse and infers some of the lineage from the SQL it sees. That’s useful at the warehouse layer. It does not give you provenance back to the meter. It does not capture the human decisions that shaped the data along the way. It cannot tell you that the production allocation methodology changed in March because the field engineer noticed a calibration issue and adjusted the splits. That information has to be captured deliberately, by the people who made the decision, at the point the decision was made.
This is what we mean when we say governance has to be built in, not bolted on. Bolting it on means standing up a tool, scanning the existing systems, and trying to reverse-engineer the chain of custody after the fact. You always end up with a partial picture, and the parts that are missing tend to be the parts that matter most.
Building it in means the chain of custody is part of the workflow that produces the data. Every ingestion pipeline records what it pulled, when, from where, and what it did to the data on the way in. Every transformation logs the rule it applied and who owned the rule. Every manual override gets attributed to a person and a reason. The lineage is a side effect of doing the work, not a separate project to document the work after it’s done.
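To make "side effect of doing the work" concrete, here is one way it can look in code. This is an illustrative sketch, not a real framework; the names (`traced_step`, `AUDIT_LOG`) and the pro-rata rule are ours. Each pipeline step declares the rule it applies and who owns it, and running the step writes the audit entry automatically:

```python
import functools
from datetime import datetime, timezone

# Append-only audit trail, populated as a side effect of running steps.
AUDIT_LOG: list[dict] = []

def traced_step(rule: str, owner: str):
    """Decorator: running the step records what it did, under which rule,
    owned by whom, and when. No separate documentation pass required."""
    def wrap(fn):
        @functools.wraps(fn)
        def run(data):
            result = fn(data)
            AUDIT_LOG.append({
                "step": fn.__name__,
                "rule": rule,
                "owner": owner,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return run
    return wrap

@traced_step(rule="pro-rata by meter volume", owner="field-eng")
def allocate(volumes: dict[str, float]) -> dict[str, float]:
    """Split production across wells in proportion to metered volume."""
    total = sum(volumes.values())
    return {well: v / total for well, v in volumes.items()}

splits = allocate({"WH-101": 300.0, "WH-104": 100.0})
# AUDIT_LOG now holds one entry: the step, the rule, the owner, and when.
```

When the allocation methodology changes in March, the change shows up in the trail because the rule is declared where the work happens, not reconstructed afterward.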
The barrel doesn’t need a separate provenance tool. The provenance is the run ticket. It’s part of how the barrel moves. The data version of that is what we build for the operators we work with.
Why this matters more in upstream than most industries
Plenty of industries get away with weak data lineage. Upstream is not one of them.
The data feeds revenue runs that pay interest owners who will absolutely notice if the numbers shift. It feeds reserves reports that get filed with the SEC. It feeds regulatory filings to state agencies that have audit authority. It feeds diligence packages for transactions where a single number being off by ten percent can move the price by millions.
We covered the data quality side of this in Data Quality in Upstream Oil and Gas: What Goes Wrong and Where to Start, and the reconciliation side in Why Your PPDM Implementation Failed (and How to Try Again). The lineage problem sits underneath both of them. You cannot reliably fix a quality issue in a system whose history you can’t trace. You cannot rescue a stalled implementation if you don’t know what data is flowing where.
The companies that have invested in this don’t talk about it as a data project. They talk about it as a way of running the business. They can answer the provenance and lineage questions in minutes instead of weeks. When something disagrees, the conversation is about which rule to apply, not about reconstructing what happened. When a regulator asks, the answer is already on file. When a buyer’s diligence team shows up, the data is ready.
Where this series is going
This is the first of three posts about treating data the way the industry treats the barrel.
In Part 2, we’ll talk about measurement: every drop traced and accounted for from ingestion through reporting, with quality and governance baked into the pipeline rather than reviewed at the end. We’ll also cover what it means to be divestiture-ready from day one, not in a panic two weeks before close.
In Part 3, we’ll talk about standardization. The industry standardized the barrel in the 1860s and saved itself a century of expensive disputes. Your data has the same opportunity, and the cost of not taking it shows up most painfully in the diligence room.
The thread through all three posts is the same. The discipline that the oil industry has applied to the physical product for 150 years is the discipline your data deserves, and we’re well past the point where treating it as optional is acceptable.
What we do
This is the work we do for upstream operators. We build ingestion pipelines that capture provenance and lineage as a side effect of running. We model data so that the chain of custody is queryable, not buried in tribal knowledge. We help operators get to the point where the answer to “where did this number come from” takes five minutes, not five days.
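“Queryable” means this: if every stored value keeps a pointer to the record it was derived from, answering “where did this number come from” is a walk up the chain, not an investigation. A toy sketch, with made-up record IDs and a plain dictionary standing in for the database:

```python
# Each record points at what it was derived from and names the rule.
# IDs and rules here are illustrative, not a real schema.
RECORDS = {
    "report:2024-03:gross_oil": {"derived_from": "alloc:2024-03:lease-7", "via": "monthly rollup"},
    "alloc:2024-03:lease-7":    {"derived_from": "scada:tag-88:2024-03",  "via": "pro-rata allocation"},
    "scada:tag-88:2024-03":     {"derived_from": "meter:WH-104",          "via": "daily ingest"},
    "meter:WH-104":             {"derived_from": None,                    "via": "physical measurement"},
}

def chain_of_custody(record_id: str) -> list[str]:
    """Walk from a reported number back to its original source."""
    chain = []
    while record_id is not None:
        rec = RECORDS[record_id]
        chain.append(f"{record_id} ({rec['via']})")
        record_id = rec["derived_from"]
    return chain

for step in chain_of_custody("report:2024-03:gross_oil"):
    print(step)
```

Four lookups, from the cell in the monthly report back to the meter on the wellhead. That is the five-minute answer.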
We were just at the PPDM Energy Data Convention in Houston, April 27 through 29, having a version of this conversation with operators all week. If you were there and we didn’t get to talk, or if any of this sounds like a problem you’ve been carrying, start a conversation. We’d like to hear what you’re working on.
Further Reading
- Data Quality in Upstream Oil and Gas: What Goes Wrong and Where to Start
- Why Your PPDM Implementation Failed (and How to Try Again)
- What the PPDM Model Actually Gives You (and What It Doesn’t)
- From Spreadsheets to a Real Data Stack: A Realistic Migration Path for Mid-Size Operators
- OCC Data Ingestion: Automating What Most Companies Still Do by Hand