How to Build a Lakehouse Layer Above ERP Systems

The lakehouse only works above ERP if you stop treating ERP like a source you can freely reshape

Most lakehouse failures above ERP systems start with a simple mistake: teams try to copy everything, every night, and assume the warehouse will sort it out later. It usually does not. What you get instead is a brittle pipeline, mismatched totals, and finance asking why the dashboard is off by A$12,480 when the period-close report in the ERP is still the only number anyone trusts.

That is the real problem. Not the technology. The problem is that ERP data behaves differently from web or app data. It changes after posting. It gets corrected. It reopens. It carries business rules that live half in configuration, half in people’s heads.

A lakehouse architecture can sit above an ERP system and become a proper data foundation, but only if it is designed around those behaviours. If you build it like a generic analytics pipeline, you end up with a second ERP, only messier.

Start with the rule most teams ignore: the ERP remains the system of record

A governed source of truth does not mean the lakehouse replaces the ERP. It means the lakehouse makes the ERP usable for reporting, planning, and cross-system analysis without rewriting accounting logic on the fly. That distinction matters.

Finance usually wants closed-period integrity. Operations wants near-real-time movement on stock, orders, and margin. Sales wants pipeline and fulfilment visibility. Those needs conflict unless you separate what is operationally useful from what is financially authoritative.

A practical setup usually has three layers:

  1. Raw landing, where ERP extracts arrive with minimal transformation.
  2. Conformed or curated models, where core entities are standardised.
  3. Reporting marts, where business-ready measures are published for warehouse reporting and dashboards.

The mistake is pushing every rule into the first layer, or pushing every rule all the way down into the dashboard. Both create drift.
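To make the split concrete, here is a minimal SQL sketch of the three layers. The table and column names are illustrative, not any ERP's real schema; the point is that each layer adds exactly one kind of rule.

```sql
-- Layer 1: raw landing. Rows exactly as extracted, stamped at load time.
CREATE TABLE raw_erp.invoices AS
SELECT src.*, CURRENT_TIMESTAMP AS load_ts
FROM landing.erp_invoice_extract AS src;

-- Layer 2: conformed model. Standardise keys, types, and statuses once.
CREATE VIEW curated.fact_invoice AS
SELECT
    invoice_id,
    CAST(customer_ref AS VARCHAR)      AS customer_key,
    CAST(posting_date AS DATE)         AS posting_date,
    accounting_period,
    amount_base_currency               AS amount,
    status_code IN ('POSTED', 'PAID')  AS is_posted  -- one definition, reused everywhere
FROM raw_erp.invoices;

-- Layer 3: reporting mart. A business-ready measure and nothing else.
CREATE VIEW marts.net_revenue_by_period AS
SELECT accounting_period, SUM(amount) AS net_revenue
FROM curated.fact_invoice
WHERE is_posted
GROUP BY accounting_period;
```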

What breaks first when you mirror ERP data too aggressively

The first thing that usually breaks is not storage. It is the ingestion pattern.

If you try to mirror ERP data into a lakehouse with full-table pulls every night, you will hit one of three walls fast:

  • API limits or throttling
  • Long extract windows that overlap with business use
  • Incremental logic that misses updates, reversals, or deleted records

NetSuite data integration is a good example. In dev, a small sandbox with a few thousand transactions looks clean. In production, SuiteTalk or REST queries start timing out, custom fields appear only on some record types, and your delta logic works until a backdated change lands in a closed period. Then totals shift and no one can explain why.

The redesign is usually boring, which is why people skip it. You split ingestion by object type and volatility.

A better ingestion pattern

  • Reference tables like subsidiaries, locations, departments, item master, customer master, and chart of accounts can be pulled less often.
  • Transaction tables such as invoices, bills, fulfilments, inventory adjustments, and journal entries need incremental capture with change tracking.
  • High-churn objects like open orders, open AP, and open AR often need more frequent refreshes, but only for the fields that actually drive reporting.

If the ERP supports audit fields, system notes, or modified timestamps, use them. If it does not, you may need a watermark plus periodic reconciliation. That sounds slower, but it is cheaper than nightly firefighting.
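A minimal sketch of that watermark-plus-reconciliation pattern, assuming the source exposes a last_modified timestamp; every table and column name here is hypothetical, and MERGE syntax varies slightly by engine. The overlap window absorbs clock skew, and the reconciliation query exists to catch what the watermark cannot see, such as hard deletes.

```sql
-- Incremental capture into the lakehouse copy: pull only rows changed
-- since the last successful load, minus a one-day overlap for clock skew.
MERGE INTO lake.invoices AS tgt
USING (
    SELECT invoice_id, amount, status_code, accounting_period
    FROM erp_extract.invoices                   -- this run's API extract
    WHERE last_modified > (SELECT MAX(loaded_through) - INTERVAL '1' DAY
                           FROM etl.load_watermarks
                           WHERE object_name = 'invoice')
) AS src
ON tgt.invoice_id = src.invoice_id
WHEN MATCHED THEN
    UPDATE SET amount = src.amount, status_code = src.status_code
WHEN NOT MATCHED THEN
    INSERT (invoice_id, amount, status_code, accounting_period)
    VALUES (src.invoice_id, src.amount, src.status_code, src.accounting_period);

-- Periodic reconciliation: run the same totals in the ERP and diff them.
-- A mismatch flags hard deletes or backdated changes the watermark missed.
SELECT accounting_period, COUNT(*) AS row_count, SUM(amount) AS total
FROM lake.invoices
GROUP BY accounting_period;
```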

Late journals and reopened periods are where reporting usually drifts

Warehouse reporting drifts from the ERP when teams assume posting date is enough. It is not. Finance cares about posting date, accounting period, reversal chains, backdated corrections, and whether a period was reopened after close.

That means your lakehouse layer needs to preserve both the original event and the accounting view. A journal entered on 3 July but posted into June because the period was reopened should not disappear into a generic “created_at” bucket.

The cleanest pattern is to store:

  • transaction date
  • posting date
  • accounting period
  • load timestamp
  • source system status
  • reversal or correction linkages

Then build two measures, not one. One for operational freshness, one for financial truth. Business users can still see near-real-time ERP analytics, but finance can trust period-close numbers because the reporting layer is explicitly period-aware.
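A sketch of what "two measures, not one" looks like over a single period-aware fact table. The fact_journal columns mirror the list above; all names are illustrative.

```sql
-- Operational freshness: keyed on transaction date, includes everything
-- loaded so far, regardless of posting status.
CREATE VIEW marts.revenue_operational AS
SELECT transaction_date, SUM(amount) AS revenue_to_date
FROM curated.fact_journal
GROUP BY transaction_date;

-- Financial truth: keyed on accounting period, so the journal entered on
-- 3 July but posted into a reopened June lands in June, where finance
-- expects it.
CREATE VIEW marts.revenue_financial AS
SELECT accounting_period, SUM(amount) AS revenue_posted
FROM curated.fact_journal
WHERE source_system_status = 'POSTED'
GROUP BY accounting_period;
```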

If you skip this, every month-end becomes a debate about which dashboard is right. That is not an analytics problem. It is a design problem.

Key takeaway: If finance can reopen a period, your lakehouse has to model that possibility from day one, or your “single source of truth” will fracture at month-end.

When raw sync becomes too noisy to be useful

There is a point where the raw ERP-to-lakehouse sync stops adding value and starts adding noise. You usually hit it when:

  • the same business event appears in multiple tables
  • custom fields are sparsely populated and inconsistent
  • the source has too many statuses to report cleanly
  • downstream users keep asking for “just one more rule” to make the data understandable

That is the point to pre-model or filter before landing into the analytics layer.

For example, a sales order may exist as header, line, fulfilment, invoice, credit note, and cancellation records. If your analysts only need booked revenue, open backlog, and shipped but unbilled value, there is no prize for landing every source row unshaped and asking Power BI or dbt to sort it out later.

The same applies to inventory. Raw movements are useful for audit and investigation. They are not automatically useful for stock availability, landed cost, or margin reporting. Those are different models.

A good rule is this: if a source object needs the same transformation in every report, pre-model it once. If it needs different treatment by function, keep it raw longer.
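Applied to the sales order example, pre-modelling might look like the sketch below. It assumes, for brevity, at most one fulfilment and one invoice row per order line; every table and status name is illustrative.

```sql
-- Pre-model the order once: collapse header, line, fulfilment, and invoice
-- rows into the three measures analysts actually asked for.
CREATE VIEW curated.order_position AS
SELECT
    l.order_id,
    l.line_id,
    l.line_amount                                  AS booked_amount,
    CASE WHEN l.line_status = 'OPEN'
         THEN l.line_amount ELSE 0 END             AS open_backlog,
    CASE WHEN f.shipped_qty > 0 AND i.invoice_id IS NULL
         THEN l.line_amount ELSE 0 END             AS shipped_unbilled
FROM raw_erp.order_lines AS l
LEFT JOIN raw_erp.fulfilments   AS f ON f.line_id = l.line_id
LEFT JOIN raw_erp.invoice_lines AS i ON i.line_id = l.line_id
WHERE l.line_status <> 'CANCELLED';
```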

How messy ERP data tells on itself

Teams usually realise their ERP data model is too messy for a straightforward lakehouse layer when the first three dashboards disagree on basic counts.

One report says there are 1,248 open sales orders. Another says 1,271. Finance says 1,233 after excluding cancelled lines. Then someone discovers the ERP stores header status one way, line status another, and the BI tool is counting both.

That is the sign the source model needs interpretation, not just replication.

The first workaround should not be a pile of spreadsheet mappings. It should be a thin semantic layer or conformed model that standardises the business definitions once. Usually that means:

  • defining the grain of each fact table
  • agreeing on canonical dimensions like customer, item, location, and period
  • documenting which ERP status values are included or excluded
  • separating operational states from accounting states

This creates some technical debt, but it is controlled debt. The long-term debt comes from letting every analyst invent their own version of “active customer” or “net sales”.
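A thin conformed model for the open-order disagreement above might be as small as this sketch. The status values are illustrative; the point is that the counting rule is written down exactly once.

```sql
-- Grain: one row per order line. Inclusion rules live here, once,
-- instead of being re-derived inside each dashboard.
CREATE VIEW curated.open_order_lines AS
SELECT o.order_id, l.line_id
FROM raw_erp.order_headers AS o
JOIN raw_erp.order_lines   AS l ON l.order_id = o.order_id
WHERE o.header_status = 'OPEN'         -- operational state
  AND l.line_status <> 'CANCELLED';    -- finance's exclusion, applied once

-- Every report counts from the same view, so the dashboards that said
-- 1,248, 1,271, and 1,233 converge on one number.
SELECT COUNT(DISTINCT order_id) AS open_orders
FROM curated.open_order_lines;
```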

The hidden cost nobody budgets for

Keeping ERP dimension tables, chart of accounts changes, and master data mappings aligned across the lakehouse and downstream reports is a maintenance job, not a one-off project.

That is the part people underestimate. A chart of accounts restructure, a new department hierarchy, or a customer master merge can invalidate half a dozen reports overnight. If the ERP changes, your lakehouse does not automatically know that “Sales ANZ” was split into “Sales East” and “Sales West” last quarter.

The hidden cost shows up in three places:

| Area | What goes wrong | What it costs |
|---|---|---|
| Dimensions | Old codes remain in history, new codes appear without mapping | Broken trend lines, duplicate counts |
| Chart of accounts | Account roll-ups change mid-year | Margin and P&L inconsistencies |
| Master data | Customer, item, or supplier merges are not tracked | Double counting, misallocated revenue, stock confusion |

The fix is not more ad hoc mapping tables. It is a governed master-data process with change control. Somebody has to own the mapping lifecycle, not just the extract.
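One way to make the mapping lifecycle concrete is a governed, date-effective mapping table, sketched below with the "Sales ANZ" split as the running example; table and column names are hypothetical.

```sql
-- Date-effective mapping: history keeps its old codes, new codes map
-- cleanly, and the split is an insert, not a rewrite of history.
CREATE TABLE mdm.department_map (
    source_code    VARCHAR NOT NULL,   -- code as it appears in the ERP
    reporting_code VARCHAR NOT NULL,   -- code the reports roll up to
    valid_from     DATE    NOT NULL,
    valid_to       DATE    NOT NULL    -- e.g. 9999-12-31 for current rows
);

-- Facts join on the mapping valid at posting date, not on "latest code",
-- so trend lines survive the restructure.
SELECT m.reporting_code, f.accounting_period, SUM(f.amount) AS amount
FROM curated.fact_journal  AS f
JOIN mdm.department_map    AS m
  ON m.source_code = f.department_code
 AND f.posting_date BETWEEN m.valid_from AND m.valid_to
GROUP BY m.reporting_code, f.accounting_period;
```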

Where to keep logic, and where not to

Not every ERP object belongs in the lakehouse in the same way. Some should stay close to warehouse reporting, where they are shaped for consumption. Others should be pushed deeper into the lakehouse because they are foundational and reused everywhere.

A useful split looks like this:

| Keep closer to reporting layer | Push into lakehouse foundation |
|---|---|
| Revenue recognition views | Customer master |
| Margin by channel | Item master |
| Open order backlog | Location and warehouse hierarchy |
| Period-close P&L | Chart of accounts mappings |
| Sales rep performance | Supplier master |

If the same metric can be modelled multiple ways, decide whether the business wants a financial view, an operational view, or both. Gross margin by invoice line is not the same as margin by shipment, and neither matches margin by GL posting. That is not a modelling flaw. It is reality.

The lakehouse should hold the reusable building blocks. The reporting layer should hold the business definition people actually use to run the company.

When the lakehouse starts copying ERP logic, you have gone too far

The first sign is simple. A dashboard needs a new exception rule, and instead of changing one model, someone edits three dbt models, two SQL views, and a Power BI measure.

That is duplication. And once it starts, every dashboard becomes its own version of truth.

You stop it by drawing a line between source interpretation and business presentation.

  • Source interpretation belongs in one curated layer.
  • Business presentation belongs in named, documented marts.
  • Dashboard-specific logic should be the exception, not the norm.

If a metric definition cannot be explained in two sentences, it probably should not live in five places. That is especially true with ERP analytics, where a small change in treatment can swing month-end results.
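As a sketch of that line: the exception rule is stated once in the curated layer, and the marts inherit it. The intercompany exclusion here is purely illustrative.

```sql
-- Source interpretation: the exception rule lives here, once.
CREATE VIEW curated.fact_invoice_reportable AS
SELECT *
FROM curated.fact_invoice
WHERE customer_key <> 'INTERCO';   -- illustrative: exclude intercompany

-- Business presentation: marts read the filtered view, so a new exception
-- means editing one model, not three dbt models and a Power BI measure.
CREATE VIEW marts.revenue_excl_interco AS
SELECT accounting_period, SUM(amount) AS net_revenue
FROM curated.fact_invoice_reportable
WHERE is_posted
GROUP BY accounting_period;
```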

When self-service stops being safe

There is a point where ERP data in a lakehouse becomes unreliable enough that teams stop trusting self-service analytics and go back to direct ERP extracts or a traditional warehouse model.

You will see it when:

  • users export to Excel to “check the numbers”
  • finance refuses to sign off on dashboard totals
  • analysts keep building one-off reconciliations
  • the lakehouse is fast, but no longer trusted

That usually happens when governance is weak, not because the technology failed. If lineage is unclear, period logic is inconsistent, or master data is unmanaged, self-service becomes a liability. People will always choose a slower report they trust over a faster one they do not.

At that point, the answer is not to add more dashboards. It is to tighten the model, reduce the number of published metrics, and re-establish the governed source of truth.

What a workable lakehouse above ERP actually looks like

A workable design is usually less ambitious than people expect, and much more disciplined.

  1. Land ERP data with minimal loss, but do not confuse raw with usable.
  2. Separate operational freshness from financial truth so period-close reporting stays stable.
  3. Conform the core dimensions once, especially customer, item, location, account, and period.
  4. Pre-model high-noise objects where the source is too messy to expose directly.
  5. Publish a small number of trusted marts for warehouse reporting and planning.
  6. Reconcile regularly against the ERP, especially after close, master-data changes, or migrations (see the sketch after this list).
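Step 6 is the easiest to automate. A minimal reconciliation sketch, assuming the ERP can export its own trial-balance totals; all names are illustrative.

```sql
-- Diff the mart's period totals against the ERP's own trial balance.
-- Any row returned is a variance to explain before anyone trusts the mart.
SELECT COALESCE(m.accounting_period, e.accounting_period)  AS period,
       m.net_revenue                                       AS lakehouse_total,
       e.net_revenue                                       AS erp_total,
       COALESCE(m.net_revenue, 0) - COALESCE(e.net_revenue, 0) AS variance
FROM marts.net_revenue_by_period AS m
FULL OUTER JOIN raw_erp.trial_balance_extract AS e
  ON e.accounting_period = m.accounting_period
WHERE COALESCE(m.net_revenue, 0) <> COALESCE(e.net_revenue, 0);
```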

That is how you build a data foundation that survives real business use.

The practical test before you build another pipeline

Before you add the next NetSuite data integration, ask three questions:

  • Can finance reconcile this to the ERP without manual explanation?
  • Will this still work after a reopened period or backdated correction?
  • Is this logic reusable, or are we about to build another one-off report?

If the answer to any of those is no, the lakehouse layer is not ready yet.

Start with the objects that matter most, not the objects that are easiest to extract. Then design for change, because ERP data will change whether your architecture likes it or not.

If you are reviewing an ERP-to-lakehouse design this week, pull one high-value report, trace it back to source tables, and check where the numbers diverge. That exercise will tell you more than another architecture diagram ever will.
