42 Gallons

42 Gallons, Part 3: Standardized Since 1866. Your Data Costs Millions More Every Time It Isn't.

In the early 1860s, oil was being shipped in whatever container was lying around. Whiskey barrels. Gunpowder kegs. Custom-made wooden barrels in whatever size the cooper felt like making that week. Buyers and sellers argued constantly about volumes. Every shipment was a negotiation. Every settlement was a fight.

By 1866, the producers in Pennsylvania had agreed on a standard. 42 gallons in a barrel. Period. The argument over volume went away, and the industry got on with the work of moving oil at scale.

A century and a half later, the upstream industry runs on that standard so completely that nobody thinks about it. Nobody has to. The volume is the volume. The unit is the unit. The argument was settled when most of our great-grandparents were children.

Now look at your data.

Every operator we work with has the data equivalent of pre-1866 chaos somewhere in their stack. Well names that don’t match across systems. API numbers in three different formats. Operator names that exist in fifteen variations. Units of measure that get assumed rather than recorded. Date formats that depend on whoever wrote the export script. Working interest percentages that round differently depending on which department is reporting them.

Each of those inconsistencies costs money every time it shows up, and the bill comes due all at once when somebody puts the asset on the market.

This is Part 3 of our 42 Gallons series. Part 1 was about lineage and provenance. Part 2 was about measurement and governance through the pipeline. This post is about why the lack of standardization is the single most expensive data problem in upstream, and why the cost shows up most clearly in the diligence room.


What standardization actually means in upstream data

Standardization is not a single thing. It’s a discipline applied at every layer of the data.

Identifiers. Every well has one canonical identifier in your systems. The 10-digit API, the 14-digit API, the lease and well name, the operator’s internal well code, all of those exist in the wild. One of them is the canonical identifier in your stack, and the rest get cross-referenced to it deterministically. Not by hand. By a maintained, owned mapping.

Names. Operator names, vendor names, county names, formation names. All of these have variants. “XYZ Energy LLC” and “XYZ Energy” and “XYZ Energy, L.L.C.” are the same operator. If they exist as three rows in your system, your data is not standardized. Period.

Units. Production volumes are reported in barrels, in MCF, in BOE, depending on the stream and the system. The unit has to live with the value. Implicit unit conventions are fine until somebody changes one and forgets to update the assumption.

Time. Production months, accrual periods, effective dates, posting dates. These are not the same thing. Standardized data treats each one as a distinct concept. Pre-standardized data treats them as interchangeable, until they aren’t.

Categorical values. Reservoir types, well statuses, completion methods, classification codes. Every one of these has a defined set of acceptable values somewhere in PPDM or in industry references. Pre-standardized data invents new values whenever an analyst types something into a free-text field.

This list is not exhaustive. It’s representative. The pattern is the same across all of them. Standardization means there is one right answer, the right answer is documented, the systems enforce it, and exceptions get handled deliberately rather than absorbed silently.
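To make that concrete, here is a minimal sketch of what the pattern can look like in code, assuming a Python ingestion layer. The reference data (OPERATOR_ALIASES, ALLOWED_WELL_STATUSES) and the 10-to-14-digit API padding rule are illustrative placeholders, not the real thing; in a production stack the mappings live in maintained, owned reference tables rather than in code.

```python
import re
from dataclasses import dataclass

# Illustrative reference data. In practice these are maintained,
# versioned reference tables with an owner, not literals in code.
OPERATOR_ALIASES = {
    "XYZ ENERGY LLC": "XYZ Energy LLC",
    "XYZ ENERGY": "XYZ Energy LLC",
    "XYZ ENERGY, L.L.C.": "XYZ Energy LLC",
}
ALLOWED_WELL_STATUSES = {"PRODUCING", "SHUT-IN", "PLUGGED", "DRILLING"}

@dataclass(frozen=True)
class Volume:
    value: float
    unit: str  # the unit lives with the value instead of being assumed

def canonical_api(raw: str) -> str:
    """Normalize an API number to a canonical 14-digit form."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits += "0000"  # simplification: pad sidetrack and event codes
    if len(digits) != 14:
        raise ValueError(f"unrecognized API number: {raw!r}")
    return digits

def canonical_operator(raw: str) -> str:
    """Resolve an operator-name variant to its single canonical form."""
    key = raw.strip().upper()
    if key not in OPERATOR_ALIASES:
        raise ValueError(f"unmapped operator name: {raw!r}")
    return OPERATOR_ALIASES[key]

def validated_status(raw: str) -> str:
    """Accept only statuses from the canonical value list."""
    status = raw.strip().upper()
    if status not in ALLOWED_WELL_STATUSES:
        raise ValueError(f"status not in canonical list: {raw!r}")
    return status
```

The point is not the code. The point is that every rule is explicit, owned, and enforceable rather than living in someone's head.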


Where the cost shows up

If standardization is so valuable, why does almost every operator we talk to have a stack full of inconsistencies? Because the cost of weak standardization is hidden. It shows up as friction, not as a line item.

The monthly close takes longer because half the reconciliation time goes to name and identifier mismatches. The reserves report requires manual cleanup every quarter because the categorical codes don’t agree across systems. New analysts spend their first six months learning the tribal rules for which spelling of an operator name is “the right one.” Vendors charge integration premiums because their tools have to handle whatever variant your data happens to be in.

None of those costs ever appear as “weak data standardization.” They appear as ten percent more headcount, slower close cycles, and lower analyst productivity. They are absorbed into the cost of doing business.

The bill comes due in the diligence room.


Why diligence is where it hurts most

When an operator decides to sell an asset, the buyer’s diligence team is going to ask for the data. They’re going to ask for it in the format they expect. They’re going to compare it against public records, regulatory filings, and the seller’s own representations. They’re going to find every inconsistency that the operator has been quietly absorbing for years.

The result is predictable.

The volumes don’t reconcile cleanly to OCC filings. The land records don’t tie to the production data. The well counts in the data room don’t match the well counts in the well master. Operator names show up in two different forms in the same package. Categorical codes don’t match the industry references the buyer’s team is using.

Each one of those findings is a piece of leverage at the negotiation table. Each one is a reason for the buyer to discount the price, demand additional reps and warranties, or ask for an indemnity that will hold real money in escrow. Each one is also a reason for the buyer’s confidence in the rest of the data to drop.

We’ve seen deals close at materially lower prices because of data inconsistencies that the seller didn’t know they had. We’ve seen deals stretch out for months because diligence kept turning up issues that required rework. We’ve seen sellers walk away from transactions because the cost of fixing the data was higher than the upside of the deal at the offered price.

The dollar amounts are not small. On a mid-size package, a couple of percentage points of price discount because the buyer doesn’t trust the data is millions of dollars. The same couple of percentage points compounded across reps and warranties or indemnification holdbacks pushes the number higher. And those costs are always larger than the cost of doing the standardization work in the first place.


Fix it at ingestion, not at the data room

The instinct, when an asset goes on the market, is to assemble a team to clean up the data for the diligence package. Hire consultants. Pull engineers off other work. Reconcile by hand. Build a one-off data room that holds together long enough to close the deal.

We’ve watched this play out many times. It works, in the sense that the deal gets done. It fails, in the sense that none of that effort produces lasting value. The cleaned-up data goes into the buyer’s systems. The seller goes back to its pre-deal state, with the same inconsistencies, ready to do the same scramble for the next transaction.

The version that produces lasting value is to fix the standardization at ingestion. Identifier normalization happens at the boundary, where the data enters. Name resolution runs as part of every load. Unit conventions are explicit. Categorical values are validated against the canonical lists. The data lands in a model that knows the difference between accrual and posting dates, and the systems enforce it.
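As a rough sketch of what that looks like at the boundary, reusing the helpers sketched earlier: every incoming row is normalized as it lands, and anything that fails the checks is quarantined for deliberate review rather than silently absorbed. The load_clean and quarantine callables here are hypothetical stand-ins for whatever your pipeline actually writes to.

```python
def ingest(rows, load_clean, quarantine):
    """Standardize at the boundary; exceptions are handled deliberately."""
    for raw in rows:
        try:
            clean = {
                "api_14": canonical_api(raw["api"]),
                "operator": canonical_operator(raw["operator"]),
                "status": validated_status(raw["status"]),
                "oil": Volume(float(raw["oil_volume"]), raw["oil_unit"]),
                # production month is a distinct concept from posting date
                "production_month": raw["production_month"],
            }
        except (KeyError, ValueError) as problem:
            quarantine(raw, reason=str(problem))  # never absorbed silently
            continue
        load_clean(clean)
```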

We covered the technical pattern in OCC Data Ingestion: Automating What Most Companies Still Do by Hand and the broader migration story in From Spreadsheets to a Real Data Stack: A Realistic Migration Path for Mid-Size Operators. Standardization is the layer that makes both of those investments pay off.

When you fix it at ingestion, the diligence question becomes a query. The asset package is generated, not assembled. The buyer’s team finds clean data that ties cleanly to public records, and has very little leverage to discount on data quality grounds.
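One hedged illustration of “generated, not assembled,” assuming a standardized store with hypothetical well and production tables:

```python
import sqlite3

def asset_package(conn: sqlite3.Connection, asset_id: str, cutoff: str):
    """Pull a diligence extract as a parameterized query, not a manual build."""
    return conn.execute(
        """
        SELECT w.api_14, w.operator, w.status,
               p.production_month, p.oil_volume, p.oil_unit
        FROM well AS w
        JOIN production AS p ON p.api_14 = w.api_14
        WHERE w.asset_id = ?
          AND p.production_month <= ?
        ORDER BY w.api_14, p.production_month
        """,
        (asset_id, cutoff),
    ).fetchall()
```

The cutoff date is a parameter, which is the whole point: the package reflects the standardized state of the stack, not a one-off cleanup.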


Why PPDM matters here, specifically

Adopting PPDM is, in part, a standardization decision. The model is the industry’s accumulated answer to the question of how upstream data should be structured. The naming conventions, entity relationships, and acceptable value lists are not arbitrary. They reflect decades of work on exactly the standardization problems this post is about.

We wrote about what PPDM actually gives you in What the PPDM Model Actually Gives You (and What It Doesn’t). One of the things it gives you is a defensible answer when a buyer’s team asks how your data is organized. “We’re PPDM-aligned” is a different conversation than “we have a homegrown schema and we’ll need to walk you through it.”

Standardization through PPDM is also one of the most common reasons we see operators start the modeling work in the first place. A divestiture is on the calendar. A capital raise is in the future. An acquirer’s diligence team is going to be in the room. The standardization gap that has been absorbed for years is suddenly visible, and the cost of closing it is going to show up in the deal.


The thread across the series

The 42 Gallons series has been about treating data with the same discipline the industry has applied to the physical product for over 150 years. Lineage and provenance in Part 1. Measurement and governance in Part 2. Standardization in this post.

The thread is the same in each of them. The discipline is not optional in the physical world. The industry would not function without it. The same discipline, applied to data, produces the same result. Less friction, lower cost, higher confidence. And critically, a state of readiness that lets the business move when it needs to.

The cost of not doing this work compounds. Every quarter that passes adds another layer of inconsistency, another tribal rule that lives in someone’s head, another integration point that depends on a fragile assumption. The bill, when it comes due, is paid in deal value.

The good news is that the work itself is well understood. The patterns exist. The model exists. The tooling exists. What’s required is the deliberate decision to push the standardization upstream, and the willingness to invest in it before the diligence team is on the calendar.


What we do

We help upstream operators get to the state this series has been describing. Ingestion pipelines that capture provenance and standardize at the edge. Data models that make the chain of custody queryable. Governance that lives in the systems, not in a SharePoint site. The work that turns “give us six weeks to assemble the data room” into “what cutoff date do you want?”

We were just at the PPDM Energy Data Convention in Houston, April 27 through 29, having this conversation with operators of every size. If you have a transaction on the horizon, or if you have already been through one and decided you don’t want the next one to feel the same way, start a conversation. We’d like to hear what you’re working on.

