OCC Data Ingestion: Automating What Most Companies Still Do by Hand
Walk into any Oklahoma upstream operator’s office and you’ll find someone whose job, at least in part, involves pulling data from the Oklahoma Corporation Commission. They download PDF filings. They open CSVs that were exported from a system designed in the 1990s. They copy numbers into spreadsheets. They reconcile wellbores across three different identifier conventions. They email the results to whoever asked.
Some of these people are very good at what they do. That’s the problem. The company leans on that skill so hard that nobody notices the process itself is broken.
OCC data is foundational. Production volumes, completion reports, permit records, operator changes. If you’re an Oklahoma operator and you don’t have current OCC data flowing into your own systems, you’re making decisions from stale numbers and hoping they hold up.
There’s no technical reason to still be doing this by hand. The reason it persists is that the automation isn’t glamorous, and nobody wants to own the migration from “Tammy pulls it every month” to “the pipeline pulls it every night.”
What OCC data actually looks like
For anyone outside Oklahoma, it’s worth explaining what we’re dealing with.
The Oklahoma Corporation Commission regulates oil and gas activity in the state. Operators file production reports, completion reports, permits, and various other documents. Some of that data is accessible through the OCC’s web interfaces. Some of it is posted as bulk downloads. Some of it arrives in formats that have not meaningfully changed in twenty years.
A partial list of what an operator typically needs:
- Monthly production data by well, by lease, by operator
- Completion reports (Form 1002A and related filings)
- Drilling permits and spud records
- Operator change records
- Pooling orders and other commission orders
- Plugging and workover reports
Each of these lives in a slightly different place, has a slightly different format, and updates on a slightly different cadence. There is no single “OCC API” that gives you everything in a clean JSON response.
Why the manual process persists
It’s easy to say “just automate it.” The reason most shops haven’t is more specific than general inertia.
The data isn’t clean. Well identifiers aren’t consistent across filings. The same well might appear under three different API numbers depending on which form you’re looking at. Operator names get misspelled. Lease descriptions use abbreviations that are specific to who filed the report. A human with industry experience can look at a record and say “yeah, that’s the same well,” but that intuition is hard to encode without doing the work.
The formats are awkward. Some data comes as CSVs with inconsistent column ordering. Some comes as fixed-width text files. Some comes as PDFs that need to be parsed. The person doing it manually has a mental model that handles all the weird cases. A naive automation doesn’t.
The stakes of getting it wrong are real. If your automated pipeline miscategorizes a well, that error flows into revenue calculations, reserves reports, and regulatory submissions. Manual processes have an obvious checkpoint (a human looking at the data) that automation has to replace with something deliberate.
Nobody has the budget to do it right. The person who currently pulls the data costs a few hours a month. Rebuilding that as a proper pipeline costs real engineering time. The ROI is there, but it’s not always visible until the manual process breaks.
What a proper ingestion pipeline looks like
None of this is exotic. The building blocks are the same ones used in any data engineering shop. The difference is that the pipeline has to account for the particular shape of OCC data.
Extraction
The extraction layer handles getting data out of OCC sources on a schedule. For bulk downloads, that’s an HTTP fetch against the published file location. For web-sourced data, it’s a targeted scrape that respects the site’s structure and rate limits. For PDFs, it’s OCR plus structured extraction, which works well enough for standardized filings that don’t change layout.
The extractor should be idempotent. If it runs twice on the same day, it should produce the same result. It should track what it’s already pulled so it doesn’t waste time re-fetching unchanged data. And it should fail loudly when a source changes its format, which happens more often than you’d think.
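The change-tracking idea can be sketched in a few lines. This is illustrative Python, not OCC-specific code: `fetch_bytes` stands in for whatever HTTP client pulls the bulk file, and the JSON state file stands in for wherever your pipeline records what it has already pulled.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional

def fetch_if_changed(source_name: str, fetch_bytes: Callable[[], bytes],
                     state_file: Path) -> Optional[bytes]:
    """Fetch a source and return its bytes only if the content changed
    since the last run. Running twice on the same day is a no-op: same
    input, same recorded digest, no duplicate downstream work."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    data = fetch_bytes()  # e.g. requests.get(url).content in a real pipeline
    digest = hashlib.sha256(data).hexdigest()
    if state.get(source_name) == digest:
        return None  # unchanged since last pull; skip downstream processing
    state[source_name] = digest
    state_file.write_text(json.dumps(state))
    return data
```

Hashing the content rather than trusting timestamps is deliberate: OCC bulk files are sometimes re-posted with a new date but identical contents, and the reverse.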
Parsing and normalization
Raw extracted data is not usable data. The parsing step handles the messy realities:
- Column mappings for CSVs whose format changed three years ago, so both the old and new layouts need handling
- API number standardization (padding, stripping prefixes, matching the well across filings)
- Date format normalization
- Unit conversions for volumes
This is where domain expertise matters. Somebody on the project needs to understand the difference between a 14-digit API number and a 10-digit one, and when the extra digits matter. Somebody needs to know which production fields are metered versus allocated. A generic data engineer without that context will build a pipeline that technically runs and quietly produces wrong numbers.
Entity resolution
Wells, leases, and operators need to be matched across filings. The OCC doesn’t hand you a clean identifier to join on in every case. You end up doing fuzzy matching against your own internal master data, often with a human review step for low-confidence matches.
The practical approach is to build a master well table in your own system, keyed on your internal identifier, with the OCC identifiers mapped in as alternate keys. New filings get matched to the master. Unmatched records go into a review queue. Over time, the review queue shrinks as the matching rules improve.
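The match-or-queue flow might look like the sketch below. `SequenceMatcher` stands in for whatever similarity metric you actually use (a production matcher would also weigh location, operator, and lease); the field names and thresholds are placeholders:

```python
from difflib import SequenceMatcher

def match_filing(filing: dict, master_wells: list,
                 accept: float = 0.90, review: float = 0.70):
    """Match an incoming OCC filing against the master well table.
    An exact hit on a known alternate key wins outright; otherwise score
    by name similarity and route mid-confidence results to a review queue."""
    for well in master_wells:
        if filing.get("api_number") in well.get("alt_keys", []):
            return ("matched", well["well_id"], 1.0)
    best, best_score = None, 0.0
    for well in master_wells:
        score = SequenceMatcher(None, filing["well_name"].upper(),
                                well["well_name"].upper()).ratio()
        if score > best_score:
            best, best_score = well, score
    if best and best_score >= accept:
        return ("matched", best["well_id"], best_score)
    if best and best_score >= review:
        return ("review", best["well_id"], best_score)   # human looks at it
    return ("unmatched", None, best_score)
```

The three-way outcome is the important design choice: auto-accept, human review, or reject. Collapsing it to a binary match is how silent misjoins happen.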
Loading into a structured model
Once the data is clean and resolved, it lands in a proper data model. For most upstream operators, that means PPDM-aligned tables in PostgreSQL or SQL Server. Production volumes go to a production table keyed on well and reporting period. Completion records go to a completion table. Permits go to a permits table. Each with consistent join keys back to the master well record.
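The load step should be idempotent too: keyed on well and period, so re-loading an amended OCC file overwrites the prior row instead of duplicating it. A minimal sketch using SQLite for illustration; the real target would be PPDM-aligned tables in PostgreSQL or SQL Server, and the column names here are placeholders:

```python
import sqlite3

def load_production(conn: sqlite3.Connection, rows: list) -> None:
    """Upsert production volumes keyed on (well_id, period). Late filings
    and amendments replace the earlier row rather than stacking up."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS production (
            well_id  TEXT NOT NULL,
            period   TEXT NOT NULL,   -- reporting month, e.g. '2025-07'
            oil_bbl  REAL,
            gas_mcf  REAL,
            PRIMARY KEY (well_id, period)
        )""")
    conn.executemany("""
        INSERT INTO production (well_id, period, oil_bbl, gas_mcf)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (well_id, period) DO UPDATE SET
            oil_bbl = excluded.oil_bbl,
            gas_mcf = excluded.gas_mcf
        """, rows)
    conn.commit()
```

The same `ON CONFLICT ... DO UPDATE` pattern works in PostgreSQL; SQL Server expresses it with `MERGE`.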
If you want background on why PPDM is the right target model for this, we covered that in Why Oklahoma Energy Companies Can’t Afford to Ignore Data Engineering. For the pipeline patterns that make this kind of ingestion reliable, Data Pipeline Patterns: A Practical Reference walks through the specifics.
Orchestration
Something has to run all of this on a schedule, handle failures, retry when appropriate, and alert a human when something genuinely breaks. Airflow is the most common choice, though for simpler shops a lighter scheduler works fine. The important part isn’t which tool you use. It’s that the whole pipeline runs without anyone touching it until something goes wrong, and that when something goes wrong you find out quickly.
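Stripped of any particular tool, the retry-and-alert contract the orchestrator provides is small. This is not a replacement for Airflow, just the shape of the behavior, with `alert` standing in for however you actually page a human:

```python
import logging
import time

def run_with_retries(task, max_attempts: int = 3, base_delay: float = 60.0,
                     alert=lambda msg: logging.error(msg)):
    """Retry transient failures with exponential backoff; alert a human
    only when retries are exhausted, then re-raise so the failure is
    visible to whatever called us."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                alert(f"{task.__name__} failed after {attempt} attempts: {exc}")
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Alerting only after the final attempt matters: a pipeline that pages someone on every transient network blip trains people to ignore the pages.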
The reconciliation problem
OCC data doesn’t live in isolation. It has to reconcile with your internal production records, your SCADA data, your accounting system, and (eventually) your revenue reporting.
This is where a lot of automation projects stall. Getting the data in is the easy part. Making it agree with your other systems is where the hard problems live.
A few of the common reconciliation issues:
Field-reported versus OCC-reported volumes. These rarely match exactly. Differences come from timing, metering, allocation methodology, and simple data entry errors. A good pipeline tracks the variance, flags significant differences, and makes the mismatches visible rather than hiding them.
Well identifier drift. Your internal well list and the OCC’s well list will diverge over time. New wells get added, wells get recompleted, operators change. The reconciliation process needs to handle all of that without manual cleanup every month.
Allocation differences. When multiple wells share a facility, allocation methodology matters. The OCC may have one view; your accounting team may have another. Automating the ingestion doesn’t resolve this by itself. It just makes the disagreement visible and auditable, which is more than most operators have today.
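The variance-tracking idea from the first item above can be sketched simply. The keys, tolerance, and "missing from OCC" handling are illustrative assumptions; the point is that mismatches become records you can review, not numbers that silently disagree:

```python
def flag_variances(internal: dict, occ: dict, tolerance: float = 0.05):
    """Compare field-reported volumes to OCC-reported volumes keyed on
    (well, period) and surface anything beyond a relative tolerance
    instead of silently preferring one source."""
    flags = []
    for key, field_vol in internal.items():
        occ_vol = occ.get(key)
        if occ_vol is None:
            flags.append((key, field_vol, None, "missing from OCC"))
            continue
        denom = max(abs(field_vol), abs(occ_vol), 1e-9)  # avoid divide-by-zero
        variance = abs(field_vol - occ_vol) / denom
        if variance > tolerance:
            flags.append((key, field_vol, occ_vol, f"{variance:.1%} variance"))
    return flags
```

The tolerance belongs in configuration, not code: what counts as a significant variance is a business judgment, and it differs between metered and allocated volumes.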
If the broader context of building this kind of data foundation feels relevant, we wrote about the full migration path in From Spreadsheets to a Real Data Stack: A Realistic Migration Path for Mid-Size Operators.
What this buys you
Done properly, an automated OCC ingestion pipeline changes a few things materially.
Monthly reporting stops being a multi-day exercise. Production dashboards can show current data because the data is current. Due diligence requests get answered from a database query instead of a scramble through spreadsheets. Acquisition evaluations can include a proper OCC history pull in hours instead of weeks.
Maybe more importantly, the people who used to do the manual work can do something else. That’s the actual return on investment. The pipeline doesn’t eliminate a role; it eliminates the lowest-value part of that role and frees the person to work on analysis, forecasting, or operations support.
Start small, run it for real
The mistake to avoid is trying to build the perfect comprehensive pipeline on day one. Start with the single most painful OCC data feed. Build the ingestion for that one source end to end. Run it in production for a month. Fix what breaks. Then add the next source.
Within six months, a focused team can replace the bulk of the manual OCC work. The goal isn’t to eliminate human judgment from the loop. It’s to put human judgment where it actually adds value (reviewing exceptions, resolving ambiguities, approving unusual cases) instead of where it’s being wasted (copying numbers between systems).
See Us at PPDM 2026
We’ll be at the PPDM Energy Data Convention in Houston, April 27 through 29. Stop by Booth #2 if you want to talk about OCC data, PPDM implementations, or anything else on your plate. We’d love to hear what you’re working on.
Further Reading
- Why Oklahoma Energy Companies Can’t Afford to Ignore Data Engineering
- From Spreadsheets to a Real Data Stack: A Realistic Migration Path for Mid-Size Operators
- Data Pipeline Patterns: A Practical Reference
- 5 Signs Your Oklahoma Business Needs a Data Engineer
- You Should Outsource Your Data Engineering