Stop Centralizing Everything: Your Data Lake Is Not a Strategy

If you’ve been in or around technology long enough, you’ve watched the vocabulary rotate every few years without much changing underneath. Data warehouse gave way to data lake. Data lake gave way to data lakehouse. Then came data fabric, data vault, data hub, data mart, operational data store, semantic layer, metrics layer, and, somehow, data ocean. Each new term arrives with the same underlying pitch: centralize your data and the insights will follow.

They rarely do.


The default answer has always been “put it all in one place”

When an organization decides to get serious about data, the standard playbook looks something like this. Pick a platform (often whatever the cloud vendor recommends), stand up a centralized repository, start loading everything into it, and wait for analytics to emerge.

The logic sounds reasonable. If all your data is in one place, anyone who needs it can find it. You can join tables that were never designed to talk to each other. You can answer questions that used to require a phone call and a week of someone’s time.

The problem is that the plan stops there. The “centralize everything” step gets treated as the destination, not the beginning.


Years go by. The data engineers stay busy. The business sees nothing.

Here is what actually happens.

The pipeline work never ends. There’s always one more source to connect, one more schema change to handle, one more API that started returning data differently. The data engineering team is perpetually in motion and perpetually behind. From the outside, they look productive. From the inside, they’re running to stand still.

Meanwhile, the data that landed in the centralized repository three years ago sits untouched. Nobody knows what it means. The person who built the original pipeline left. The business unit that requested it changed priorities. The data is technically there, but it isn’t useful to anyone.

Leadership starts asking why the data team isn’t delivering value. The data team points to the volume of pipelines they maintain. Nobody is wrong, and nothing improves.

This pattern repeats across organizations of every size. It’s not a talent problem. It’s an architecture problem combined with a governance problem that was never addressed.


What actually went wrong

Centralizing data is not the same as organizing it. Moving everything into a data lake gives you a large, undifferentiated collection of files with no clear ownership, no enforced quality standards, and no agreed-upon definitions.

A few things happen as a result:

Nobody owns anything. When data lives in a shared pool, responsibility for its accuracy and meaning becomes diffuse. If the sales numbers in the warehouse don’t match what the CRM shows, who fixes it? If no one owns it, no one fixes it.

Quality degrades without visibility. Data quality problems compound over time. A field populated inconsistently when it was first loaded becomes the foundation for a dashboard two years later. By the time someone notices the error, the lineage is opaque and the fix is expensive.

The business can’t self-serve. The original promise of centralization was that business users could access data without constantly asking the data team. In practice, a massive undifferentiated repository requires expert navigation. Without clear structure and documentation, business users need just as much help as before.

Governance gets deferred indefinitely. When the goal is “get everything in one place first, figure out governance later,” governance never comes. There is always more ingestion work to do.


A different starting point

The better question isn’t “where do we put all our data?” It’s “what decisions do we need to make, and what data do we need to make them?”

That reframe changes everything. Instead of building a centralized repository and hoping value emerges, you start by identifying the specific business outcomes that matter, tracing those back to the data they require, and building infrastructure to serve those decisions. You bring data together only where there’s a clear, documented reason to do so.

This often means leaving data distributed, closer to the domain that owns and understands it. The finance team’s revenue data stays in the finance team’s care. The operations data stays with operations. The connections between those domains are made explicit through governance: agreed-upon definitions, shared identifiers, documented lineage.
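To make “explicit through governance” concrete, here’s a minimal sketch of what one of those cross-domain agreements could look like written down as a lightweight data contract. Everything in it is hypothetical (the domain names, the owner, the customer_id key, the definitions); the point is that the agreement is documented, owned, and versioned, not that it lives in any particular tool.

    from dataclasses import dataclass, field


    @dataclass
    class DataContract:
        """A written agreement between two domains about shared data."""
        name: str                  # what this connection is for
        producing_domain: str      # the domain that owns the source data
        consuming_domain: str      # the domain that depends on it
        owner: str                 # a named person accountable for it
        shared_key: str            # the identifier both sides agree to join on
        definitions: dict[str, str] = field(default_factory=dict)


    # Hypothetical example: finance and operations agree on what
    # "customer_id" and "revenue" mean before any pipeline is built.
    revenue_by_account = DataContract(
        name="revenue_by_account",
        producing_domain="finance",
        consuming_domain="operations",
        owner="jane.doe@example.com",
        shared_key="customer_id",
        definitions={
            "customer_id": "Billing account ID from the CRM, not the web login ID.",
            "revenue": "Recognized revenue per accounting policy, net of refunds.",
        },
    )

A contract like this can live in version control next to the pipeline that implements it; when a definition changes, the diff is the conversation.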

Governance isn’t a bureaucratic afterthought. It’s the infrastructure that makes distributed data actually useful.


Picking an architecture that fits how you actually work

The label doesn’t matter. What matters is asking a few honest questions before you build anything:

Who actually owns each data source? If the answer is “nobody” or “the data team,” you have a governance problem before you have an architecture problem.

Which domains need to share data, and for what purpose? Start with the connections that have clear business value. Don’t connect everything on the assumption that someone might find it useful someday.

How does your organization actually make decisions? A highly centralized decision-making structure might genuinely benefit from a more centralized data model. A decentralized organization with strong domain teams will fight centralized data ownership at every step.

What can your team realistically maintain? A sophisticated federated architecture that nobody understands is worse than a simpler one that works.


Governance first, then connection

Whatever architecture fits your organization, governance has to be a first-class concern, not something you’ll get to after the ingestion work is done. That means:

  • Clear ownership for every data domain, with a named person accountable for its quality and meaning
  • Agreed-upon definitions for shared concepts (what is a “customer”? what counts as “revenue”?)
  • Documented lineage for the data that feeds your most important decisions
  • A process for resolving conflicts when definitions don’t align across domains

None of this requires a massive platform purchase. A lot of it is conversation and documentation. But it has to happen before the architecture, not after.
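As a sketch of how little tooling that requires: the register below is just a dictionary that could live in a versioned file, plus one check that surfaces datasets with no accountable owner. All of the dataset names, owners, and lineage strings are invented for illustration.

    # A governance register as plain, versioned documentation.
    # Every name, owner, and lineage entry here is hypothetical.
    REGISTER = {
        "finance.revenue": {
            "owner": "jane.doe@example.com",
            "definition": "Recognized revenue per accounting policy, net of refunds.",
            "lineage": "CRM invoices -> finance ledger -> monthly close report",
        },
        "ops.plant_output": {
            "owner": None,  # an unowned dataset: exactly the gap the check below flags
            "definition": "Units produced per plant per day.",
            "lineage": "PLC sensors -> plant historian -> daily rollup",
        },
    }


    def unowned_datasets(register: dict) -> list[str]:
        """Return every dataset with no named person accountable for it."""
        return [name for name, meta in register.items() if not meta.get("owner")]


    for name in unowned_datasets(REGISTER):
        print(f"No accountable owner for {name} -- assign one before connecting it.")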

If you want a deeper look at how governance fits into a broader organizational data strategy, the Making Data Strategy Work post covers the people and process side of this in more detail.


What this looks like in practice

A manufacturer with plants in three states doesn’t need all its sensor data, HR records, procurement history, and customer data pooled into a single lake. It needs the sensor data from each plant accessible to the people making operations decisions, connected to procurement data when there’s a specific reason to analyze spend against production efficiency, with clear ownership at each plant for the data that plant generates.

A healthcare organization doesn’t need every clinical and administrative system feeding a central repository. It needs clinical data governed and accessible to clinical teams, billing data governed and accessible to finance, and a well-defined set of connections between those domains for the analysis that actually requires them.

The architecture in both cases might look distributed, federated, or centralized depending on the specific domain. The common thread is intentionality. Every connection exists because there’s a documented reason for it, and every dataset has an owner who’s accountable for it.
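To ground the manufacturer case in code: the spend-versus-efficiency analysis is a single documented connection, joined on agreed identifiers, rather than a standing mirror of every system. The table and column names below are invented; the shape of the join is the point.

    import pandas as pd

    # Hypothetical extracts, each pulled from the domain that owns it.
    plant_output = pd.DataFrame({
        "plant_id": ["TX1", "TX1", "OH2"],
        "month": ["2024-01", "2024-02", "2024-01"],
        "units_produced": [12000, 11500, 9800],
    })
    procurement_spend = pd.DataFrame({
        "plant_id": ["TX1", "TX1", "OH2"],
        "month": ["2024-01", "2024-02", "2024-01"],
        "spend_usd": [240000, 251000, 199000],
    })

    # The connection exists for one stated reason: spend vs. production
    # efficiency. plant_id and month are the shared identifiers both
    # domains agreed on; nothing else is pooled.
    spend_efficiency = plant_output.merge(
        procurement_spend, on=["plant_id", "month"], validate="one_to_one"
    )
    spend_efficiency["cost_per_unit"] = (
        spend_efficiency["spend_usd"] / spend_efficiency["units_produced"]
    )
    print(spend_efficiency.sort_values("cost_per_unit"))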


The honest reality

There is no architecture that fixes bad data ownership or substitutes for clear business requirements. A distributed setup with no governance is just a mess that’s harder to debug. A centralized lake with genuine ownership and accountability can work well. The pattern matters less than the discipline.

What doesn’t work, reliably and consistently, is centralizing everything first and expecting value to follow. The data engineering team stays busy. The business doesn’t see results. Years pass. Someone rebrands the initiative with the next buzzword and the cycle starts again.

Here’s the irony: “governance” is itself one of those buzzwords. Everyone nods when it comes up in a strategy meeting, and almost nobody leaves the room with a shared understanding of what it actually means or who’s supposed to do it. Nobody operationalizes it, except the tool vendors who sold you that one platform you never got working. It gets added to slide decks next to “data mesh” and “data fabric” and treated with the same vague reverence.

But strip away the language and governance is just people agreeing on what data means, who’s responsible for it, and what to do when something breaks. It’s a conversation before it’s a process, and a process before it’s a platform. No tool delivers it.

That’s the shift worth making. Data engineering teams spend most of their time on pipelines and infrastructure, on tools and platforms, on moving data from one place to another faster and more reliably. That work matters. But it isn’t where business value comes from. Business value comes from data that people trust, understand, and can act on. Getting there is less a technical problem than a people problem.

The next buzzword will come along eventually. The organizations that actually figure this out won’t be the ones that adopted it earliest. They’ll be the ones that stayed focused on the fundamentals: the right data, the right people, and a clear line to a decision that matters.

