AI Beyond the Hype

The Dirty Data Problem AI Was Made For

Every organization has a data graveyard.

It’s the filing cabinets scanned to PDFs in 2007 that nobody can search. The Excel files with 47 columns and no documentation. The maintenance logs from equipment installed before anyone currently on staff was hired. The handwritten forms digitized to images. The legacy database with field names like CUST_CD_01 and MISC_FLG_X.

According to IBM and IDC research, up to 80% of enterprise data is unstructured—residing in PDFs, text documents, images, and spreadsheets. It’s growing three times faster than structured data, yet less than 1% of it is being used today.

This isn’t a technology gap. It’s a semantic gap. The data exists. It’s just trapped in formats that machines couldn’t understand—until now.


Why Traditional Automation Failed

For decades, organizations have tried to automate the processing of unstructured data. The tools were there: regex, rules engines, pattern matching, OCR. They just didn’t work well enough.

Regex can’t understand context. A pattern like \d{3}-\d{2}-\d{4} will match a Social Security Number. It will also match dates, phone number fragments, and random number sequences that happen to fit the pattern. Regex matches syntax, not meaning.
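
To see how little the pattern actually tells you, here's a minimal Python snippet. The sample strings are invented for illustration:

```python
import re

# One pattern, three unrelated meanings: regex matches syntax, not meaning.
ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

lines = [
    "Employee SSN: 123-45-6789",     # an actual Social Security Number
    "Filed under ref 310-07-2019",   # a filing reference ending in a date
    "Serial 842-11-3307 replaced",   # an equipment serial number
]

for line in lines:
    match = ssn_pattern.search(line)
    print(f"{line!r} -> {match.group() if match else 'no match'}")
# All three lines match. Only one is a Social Security Number.
```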

Rules engines require every case to be anticipated. If your documents use “N/A”, “n/a”, “NA”, “Not Applicable”, “None”, “-”, and blank fields to mean the same thing, you need a rule for each. When a new variation appears—and it will—the system breaks or produces garbage.
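
Here's a sketch of why the enumeration approach decays; the variant list is illustrative:

```python
# A rules-engine approach: enumerate every known way to write "no value".
# The list is never complete; each new document source adds a variant.
KNOWN_NULLS = {"n/a", "na", "not applicable", "none", "-", ""}

def is_missing(value: str) -> bool:
    return value.strip().lower() in KNOWN_NULLS

print(is_missing("N/A"))             # True: anticipated
print(is_missing("Not Applicable"))  # True: anticipated
print(is_missing("not avail."))      # False: new variant, silently wrong
```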

Pattern matching breaks on variations. Invoices from one vendor look nothing like invoices from another. Maintenance logs from the 1990s use different abbreviations than logs from the 2010s. The same information appears in different positions, with different labels, in different formats.

OCR reads characters, not meaning. Optical character recognition can turn a scanned document into text. It can’t tell you what the text means, which parts are important, or how the pieces relate to each other.

The result is what I call the “80% automation” problem: traditional tools can automate 80% of cases, but the remaining 20% requires manual intervention, and handling that 20% often costs more than the automation saved on the other 80%.


What LLMs Actually Unlock

Large language models approach the problem differently. Instead of matching patterns, they understand meaning.

Semantic understanding. An LLM can read “Customer terminated agreement effective 12/31” and understand that this is a cancellation with a specific date. It doesn’t need a rule for every possible phrasing. It understands what “terminated” means in context.

Format tolerance. The same information presented as a table, a paragraph, a bulleted list, or a form field can all be processed. The LLM understands the content regardless of how it’s formatted.

Context handling. Industry jargon, abbreviations, implicit references—LLMs handle these because they’ve learned from vast amounts of similar text. A maintenance log that says “Replaced PRV per SOP, unit back online 1430” makes sense to a model that’s seen thousands of similar logs.

Graceful degradation. When an LLM isn’t certain, it can say so. It can provide a confidence score, flag ambiguous cases for human review, or make a best guess with appropriate caveats. Traditional automation either works or fails—there’s no middle ground.

This is the shift: from “does this match my pattern?” to “what does this actually mean?”


The Use Cases That Matter

Let’s make this concrete with use cases drawn from the industries where dirty data is most painful.

Document → Structured Data

The problem: Contracts, regulations, specifications, and reports arrive as PDFs. The information inside needs to get into databases, workflows, and analytics systems.

Traditional approach: Hire people to read documents and manually enter data. For high-volume cases, build custom extraction rules that break whenever document formats change.

AI approach: Feed the document to an LLM with instructions on what to extract. Get structured JSON output. Handle edge cases with confidence scoring and human review for low-confidence extractions.
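
A minimal sketch of that approach in Python. The prompt, the field list, and the `call_llm` helper are illustrative stand-ins for whatever provider client and schema you actually use:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your provider's chat or completions client."""
    raise NotImplementedError

EXTRACTION_PROMPT = """\
Extract these fields from the contract text below and return ONLY valid
JSON: party_name, effective_date (ISO 8601), termination_date (ISO 8601
or null), confidence (0.0-1.0 for the extraction as a whole).

Contract text:
{document}
"""

def extract(document_text: str, review_threshold: float = 0.8) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(document=document_text))
    record = json.loads(raw)
    # Low-confidence extractions go to a human instead of the database.
    record["needs_review"] = record.get("confidence", 0.0) < review_threshold
    return record
```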

Real examples: A shipping company processing bills of lading from thousands of partners, each with slightly different formats, can normalize them into a unified schema automatically. An energy company can take decades of equipment maintenance logs and make them queryable for predictive maintenance analysis.

Legacy System Translation

The problem: Organizations modernizing legacy systems need to understand what the old systems do. Documentation is missing or outdated. The people who built them have retired.

Traditional approach: Hire consultants to reverse-engineer the code and document it manually. Hope the documentation stays accurate during the migration.

AI approach: Use LLMs to read legacy code and generate documentation, explain business logic, and identify dependencies. Deloitte reports that organizations are using this approach for COBOL modernization, extracting business rules from decades-old code.
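
A sketch of the idea: hand the model a legacy routine and ask for the business rule in plain language. The COBOL fragment (borrowing the cryptic field names from earlier) and the `call_llm` helper are illustrative:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in your provider's client."""
    raise NotImplementedError

LEGACY_SNIPPET = """\
IF CUST-BAL > 5000 AND CUST-CD-01 = 'P'
    MOVE 'Y' TO MISC-FLG-X
END-IF.
"""

prompt = (
    "You are documenting a legacy COBOL system. Explain in plain English "
    "the business rule this code implements, and list every field it "
    "reads and writes:\n\n" + LEGACY_SNIPPET
)

documentation = call_llm(prompt)
```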

Data Standardization

The problem: Mergers, acquisitions, and organic growth leave organizations with the same data in different formats across different systems. “Customer” means one thing in sales, another in support, another in finance.

Traditional approach: Massive data governance initiatives. Months of meetings to agree on standards. Years of migration projects, many of which fail.

AI approach: Use LLMs to normalize data at the point of integration. Map fields semantically rather than syntactically. A field called CUST_NM in one system and customer_full_name in another can be recognized as equivalent based on content and context.
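
One way to sketch that mapping step. The canonical field list and `call_llm` helper are assumptions for illustration; in practice you'd show the model sample values so it can judge by content:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your provider's client."""
    raise NotImplementedError

CANONICAL_FIELDS = ["customer_full_name", "customer_id", "account_status"]

def map_fields(source_fields: list[str], samples: dict) -> dict:
    prompt = (
        f"Map each source field to one of {CANONICAL_FIELDS}, or to null "
        f"if nothing fits. Judge by the sample values, not just the name. "
        f"Return JSON mapping source field to canonical field.\n\n"
        f"Source fields: {source_fields}\n"
        f"Sample values: {samples}"
    )
    return json.loads(call_llm(prompt))

# map_fields(["CUST_NM", "ACCT_ST"], {"CUST_NM": "Jane Q. Doe", "ACCT_ST": "A"})
```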

Content Classification

The problem: Documents need to be routed, tagged, or categorized. Regulatory filings need to go to compliance. Customer complaints need to go to support. Technical specifications need to go to engineering.

Traditional approach: Keyword matching. If it contains “complaint,” route to support. If it contains “invoice,” route to finance. Except when it contains both, or neither, or uses different terminology.

AI approach: Classify based on meaning, not keywords. An LLM can read a document and determine its purpose even when it uses unexpected language or covers multiple topics.
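
A sketch of meaning-based routing; the route labels and `call_llm` helper are illustrative:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in your provider's client."""
    raise NotImplementedError

ROUTES = ["compliance", "support", "finance", "engineering"]

def classify(document_text: str) -> str:
    prompt = (
        f"Classify this document's primary purpose as exactly one of "
        f"{ROUTES}. Judge by meaning, not keywords: a complaint about an "
        f"invoice is still a complaint. Reply with the label only.\n\n"
        + document_text
    )
    label = call_llm(prompt).strip().lower()
    # Anything outside the known routes gets flagged rather than guessed.
    return label if label in ROUTES else "needs_review"
```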


AI as the Preprocessing Layer

Here’s the architectural insight that makes this practical: AI isn’t replacing your data infrastructure. It’s sitting in front of it.

Think of LLMs as a preprocessing layer before your ETL. Unstructured inputs go in. Structured outputs come out. Those outputs flow into your existing databases, data warehouses, and APIs.

flowchart LR
    subgraph Sources["Unstructured Sources"]
        A1[PDFs, Images, Docs]
        A2[Legacy Systems]
        A3[Mixed Format Files]
        A4[Free-Text Fields]
    end

    subgraph AI["AI Processing"]
        B1[LLM Extraction]
        B2[LLM Translation]
        B3[LLM Normalization]
        B4[LLM Classification]
    end

    subgraph Destinations["Structured Destinations"]
        C1[Database Tables]
        C2[Data Warehouse]
        C3[API Endpoints]
        C4[Analytics Systems]
    end

    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3
    A4 --> B4 --> C4

This is what IBM calls Unstructured Data Integration (UDI)—reimagining the traditional ETL process for unstructured data. The workflow connects to raw, unstructured sources, enhances data quality by structuring and cleansing, and delivers refined output to systems ready for use.

The key insight: you don’t have to replace your existing infrastructure. You just add a layer that makes your existing infrastructure usable with data it couldn’t process before.


Traditional Tools Aren’t Dead—They’re Complementary

Here’s what the “traditional automation failed” framing gets wrong: regex, rules engines, and pattern matching didn’t fail. They hit a ceiling. They’re still excellent at what they do—validating structure, enforcing formats, catching obvious errors. They just can’t understand meaning.

The most effective pipelines use traditional tools on both sides of AI processing:

Before AI: OCR converts scanned documents to text. Pattern matching identifies document types for routing. Preprocessing scripts clean up encoding issues and normalize whitespace.

AI processing: The LLM handles what traditional tools can’t—understanding what the text actually means, extracting structured data from messy formats, normalizing inconsistent terminology.

After AI: Regex validates that extracted dates and numbers are properly formatted. Rules engines verify values fall within expected ranges. Schema validation confirms the output structure is correct.

Consider a scanned maintenance log. OCR reads “Replaced pump bearing, unit restored 14:30” from the image. The LLM understands the semantic content—this is a repair completion with a timestamp—and extracts structured fields. Then regex validates that “14:30” is a properly formatted time, and rules engines verify the repair type matches the equipment category.
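
Here's what that after-AI validation layer might look like; the time regex and the repair-type table are illustrative:

```python
import re

# Deterministic checks downstream of the LLM: cheap, fast, no ambiguity.
TIME_RE = re.compile(r"^(?:[01]\d|2[0-3]):[0-5]\d$")
VALID_REPAIRS = {"pump": {"bearing replacement", "seal replacement"}}

def validate(extracted: dict) -> list[str]:
    errors = []
    if not TIME_RE.match(extracted.get("restored_at", "")):
        errors.append("restored_at is not a valid HH:MM time")
    allowed = VALID_REPAIRS.get(extracted.get("equipment"), set())
    if extracted.get("repair_type") not in allowed:
        errors.append("repair_type does not match equipment category")
    return errors

print(validate({"equipment": "pump",
                "repair_type": "bearing replacement",
                "restored_at": "14:30"}))  # [] -- passes both checks
```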

This layered approach closes the data quality gap. Traditional preprocessing gets data into a form AI can work with. AI handles the semantic understanding. Traditional validation catches formatting errors and edge cases AI might miss. Each layer does what it does best.


The “Good Enough” Threshold

Here’s where pragmatism matters: AI extraction isn’t perfect. LLMs make mistakes. They hallucinate occasionally. They miss edge cases.

But here’s the question that matters: Is 95% accuracy with AI better than 0% automation with humans who can’t keep up with volume?

For most batch processing use cases, the answer is yes.

When you’re processing thousands of documents, a 5% error rate means flagging 50 documents per thousand for human review. That’s manageable. Processing those same thousand documents entirely by hand isn’t.

The organizations succeeding with AI data processing have accepted this tradeoff:

  • High confidence results → proceed automatically
  • Low confidence results → flag for human review
  • Systematic errors → refine the prompts and retry

This is fundamentally different from traditional automation, which either works or doesn’t. AI gives you a dial you can tune based on your tolerance for error versus your capacity for manual review.
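
A sketch of that dial in code; the thresholds are tuning parameters you set from your own error tolerance, not fixed values:

```python
def route(record: dict, auto_threshold: float = 0.9,
          review_threshold: float = 0.6) -> str:
    """One possible three-way split on extraction confidence."""
    confidence = record.get("confidence", 0.0)
    if confidence >= auto_threshold:
        return "proceed"       # write straight to the destination system
    if confidence >= review_threshold:
        return "human_review"  # queue for a reviewer
    return "retry"             # candidate for prompt refinement and rerun

print(route({"confidence": 0.95}))  # proceed
print(route({"confidence": 0.72}))  # human_review
print(route({"confidence": 0.30}))  # retry
```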


Why This Problem Is Universal

The dirty data problem isn’t unique to any industry. It’s universal because every organization:

  • Has been operating long enough to accumulate legacy formats
  • Has merged with or acquired other organizations with different systems
  • Uses partners and vendors who send data in their own formats
  • Has humans who create documents in inconsistent ways

Energy and industrial: Decades of equipment records, maintenance logs, and compliance documentation in formats that predate current systems.

Shipping and logistics: Bills of lading, customs forms, and tracking data from thousands of partners, each with their own formats and conventions.

Finance: Contracts, regulatory filings, and transaction records that need to flow into modern compliance and analytics systems.

Government and public sector: Permit applications, case files, and records that need to be digitized and made searchable for modern service delivery.

The specific documents differ. The fundamental problem—unstructured data that needs to become structured—is the same everywhere.


What Comes Next

The dirty data problem is solvable now in ways it wasn’t five years ago. LLMs provide the semantic understanding that traditional automation lacked.

But understanding the capability is only part of the picture. The next question is practical: When does it make economic sense? How do you calculate whether AI processing costs less than human processing? What are the batch API strategies that make high-volume processing affordable?

That’s what we’ll cover in the next post: The Economics of AI Batch Processing.


This is the second post in a series on AI for batch data processing. Read the first post: Is AI a Bubble? Maybe. Here’s What Won’t Burst.


InFocus Data builds custom AI pipelines that extract structure from your messiest data sources. If you’re sitting on legacy documents, inconsistent formats, or data that’s trapped in systems that don’t talk to each other, we can help.