Master data management: what it actually takes to keep one version of the truth

Your operations team says you sell 4,200 products. Your e-commerce platform lists 4,847. The warehouse management system has 3,980 active SKUs. Finance reports revenue across 4,612 product line items.

None of these numbers are wrong. They’re just counting different representations of the same products across systems that were never designed to agree with each other.

This is the problem master data management exists to solve.

What master data actually is

Master data is the core set of business entities that your organization operates on — products, customers, suppliers, locations, employees, assets. These are the nouns of your business. Not the transactions (orders, payments, shipments) and not the analytics (reports, dashboards, forecasts). The things that transactions happen to and analytics are about.

A product is master data. An order for that product is transactional data. The quarterly revenue from that product is analytical data. All three depend on the product record being accurate, but only the product record is master data.

Master data management is the discipline of ensuring these core entities have a single, authoritative representation — a golden record — that every system in the organization can reference. When someone asks “how many products do we sell?” the answer should be the same regardless of which system they query.

Why organizations end up with multiple versions

No one sets out to create conflicting master data. It accumulates naturally.

Systems are adopted independently. The CRM team picks a platform. The warehouse team picks a different one. E-commerce launches on a third. Each system creates its own entity records because that’s how software works — you set it up and populate it with data.

Naming conventions diverge. One team enters “Apple iPhone 15 Pro 256GB Natural Titanium.” Another enters “iPhone 15 Pro — Nat Ti, 256.” A third enters “APPL-IP15P-256-NT.” All correct within their system. None match across systems.

Mergers and acquisitions double everything overnight. Two companies merge. Each has its own product catalog, customer database, and supplier list. The overlap might be 30% or 70%, but until someone matches the records, you don’t know how large it is or which records it covers.

Fields go stale at different rates. The ERP has last year’s pricing. The website has current pricing. The POS system has promotional pricing that expired two months ago. Each system is “accurate” as of its last update.

The same product across four internal systems

System	Product name	SKU	Category	Price
ERP	iPhone 15 Pro 256GB NatTi	APPL-IP15P-256-NT	Mobile Devices	$1,199.00
E-commerce	Apple iPhone 15 Pro (256GB) — Natural Titanium	IP15PRO-256-TITAN	Smartphones > Apple	$1,199.99
POS	iPhone 15 Pro 256 Titanium	7291048	Phones	$1,149.99
Warehouse	APPLE IPHONE 15 PRO 256GB	UPC-194253938897	Electronics/Phones	$1,199.00

Same physical product. Four different names, four different SKU formats, four different category taxonomies, three different prices.

Look at that table. There’s no field you can join on across all four systems. The product name is close but never identical. The SKU formats are incompatible: internal codes, retailer codes, POS register numbers, UPC barcodes. Even the price disagrees: the POS still reflects a holiday promotion, and the e-commerce price is 99 cents above the ERP and warehouse price.
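To make the missing join key concrete, here is a minimal Python sketch that looks for a value shared by all four systems in each field. The records are copied from the table above. The dictionary layout and the `shared_values` helper are illustrative assumptions, not part of any particular tool.

```python
# The four records from the table above, one per system.
# The e-commerce name contains a dash, written here as the escape \u2014.
records = {
    "ERP": {"name": "iPhone 15 Pro 256GB NatTi",
            "sku": "APPL-IP15P-256-NT", "price": "1199.00"},
    "E-commerce": {"name": "Apple iPhone 15 Pro (256GB) \u2014 Natural Titanium",
                   "sku": "IP15PRO-256-TITAN", "price": "1199.99"},
    "POS": {"name": "iPhone 15 Pro 256 Titanium",
            "sku": "7291048", "price": "1149.99"},
    "Warehouse": {"name": "APPLE IPHONE 15 PRO 256GB",
                  "sku": "UPC-194253938897", "price": "1199.00"},
}


def shared_values(field):
    """Return the values of `field` that appear in every system.

    An exact join on `field` can only link all four records if this set
    is non-empty.
    """
    value_sets = [{rec[field]} for rec in records.values()]
    return set.intersection(*value_sets)


for field in ("name", "sku", "price"):
    common = shared_values(field)
    print(field, "->", common if common else "no value shared by all four systems")
```

Every field comes back empty, so an exact-match join links none of these records even though they describe one physical product.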

A human can see these are the same product in about two seconds. Getting software to make that determination reliably across thousands of products is the core challenge of master data management.
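One simple way to approximate that human judgment is to normalize the names and compare their word overlap. The sketch below uses Jaccard similarity on tokens. The normalization rules, the helper names, and the unrelated comparison record are assumptions for illustration, and real matching tools use considerably more than this.

```python
import re
from itertools import combinations

# Product names from the table above (the dash is written as \u2014).
NAMES = {
    "ERP": "iPhone 15 Pro 256GB NatTi",
    "E-commerce": "Apple iPhone 15 Pro (256GB) \u2014 Natural Titanium",
    "POS": "iPhone 15 Pro 256 Titanium",
    "Warehouse": "APPLE IPHONE 15 PRO 256GB",
}
UNRELATED = "Samsung Galaxy S24 256GB"  # hypothetical record, not from the table


def tokens(name):
    """Lowercase, split on non-alphanumerics, and strip a 'gb' unit so '256GB' equals '256'."""
    raw = re.findall(r"[a-z0-9]+", name.lower())
    return {t[:-2] if re.fullmatch(r"\d+gb", t) else t for t in raw}


def jaccard(a, b):
    """Share of distinct tokens that the two names have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Score every pair of systems against each other.
pair_scores = {
    (s1, s2): jaccard(NAMES[s1], NAMES[s2])
    for s1, s2 in combinations(NAMES, 2)
}
for pair, score in pair_scores.items():
    print(pair, round(score, 2))
print("ERP vs unrelated:", round(jaccard(NAMES["ERP"], UNRELATED), 2))
```

With these rules every pair of the four names scores at least 0.5, while the unrelated record scores 0.125 against the ERP name. The approach has a clear limit: a different storage variant of the same phone shares almost all of its tokens with a true match, so token overlap alone cannot separate variants. That is part of why reliable matching is hard.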

The cost of fragmented master data

The consequences compound silently. No error messages fire. Reports just quietly diverge.

Business impact of fragmented master data

Impact	What it measures	Share
Duplicate product listings	Average across retail organizations	18%
Pricing inconsistencies	Cross-channel price conflicts	14%
Redundant vendor payments	Same supplier paid under different codes	(not shown)
Compliance exposure	Incomplete entity views in regulated industries	16%
Wasted marketing spend	Duplicate mailings, fragmented segments	12%

Percentage impact. Based on Gartner data quality research and industry benchmarks.

Duplicate product listings mean your catalog is larger than reality. Inventory counts are split across records. Reorder points trigger incorrectly because demand is fragmented across two entries for the same product.

Pricing inconsistencies erode customer trust. A customer sees one price on the website, another at checkout, and a third on the receipt. Or worse — a B2B buyer negotiates a contract price that never makes it into the ordering system.

Redundant vendor payments happen when the same supplier exists under two vendor codes. Procurement doesn’t realize how much volume it already places with that one supplier and misses quantity discounts, or it pays the same invoice twice under different codes.

Compliance failures in regulated industries aren’t just expensive — they’re existential. If the same entity appears under different names across your systems, your exposure calculations, KYC checks, and sanctions screening are all incomplete.

The traditional MDM approach — and why it stalls

Enterprise MDM platforms — Informatica MDM, Reltio, SAP Master Data Governance, Profisee — attack this problem with a centralized hub-and-spoke architecture. Every system feeds its entity records into a central hub, which deduplicates, merges, and publishes golden records back out.
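The merge step in the hub can be pictured as a small function that takes the per-system records for one product and builds a single golden record from them. The sketch below is not how any of the named platforms work internally. The `GoldenRecord` class, the `FIELD_PRIORITY` rule, and the choice of which system to trust for which field are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenRecord:
    name: str
    category: str
    price: str
    # Cross-reference back to each source system's own identifier.
    source_ids: dict = field(default_factory=dict)


# Hypothetical merge rule: systems to trust for each field, in order of preference.
FIELD_PRIORITY = {
    "name": ["E-commerce", "ERP", "POS", "Warehouse"],
    "category": ["ERP", "E-commerce", "Warehouse", "POS"],
    "price": ["E-commerce", "ERP", "Warehouse", "POS"],
}


def merge(records):
    """Merge per-system records that are already known to describe the same product."""
    chosen = {}
    for fld, systems in FIELD_PRIORITY.items():
        # Take the first non-empty value in priority order; None if no system has one.
        chosen[fld] = next(
            (records[s][fld] for s in systems if records.get(s, {}).get(fld)),
            None,
        )
    source_ids = {system: rec["sku"] for system, rec in records.items()}
    return GoldenRecord(source_ids=source_ids, **chosen)


# Sample input: the iPhone records from the table earlier in this article.
records = {
    "ERP": {"name": "iPhone 15 Pro 256GB NatTi", "sku": "APPL-IP15P-256-NT",
            "category": "Mobile Devices", "price": "1199.00"},
    "E-commerce": {"name": "Apple iPhone 15 Pro (256GB) \u2014 Natural Titanium",
                   "sku": "IP15PRO-256-TITAN", "category": "Smartphones > Apple",
                   "price": "1199.99"},
    "POS": {"name": "iPhone 15 Pro 256 Titanium", "sku": "7291048",
            "category": "Phones", "price": "1149.99"},
    "Warehouse": {"name": "APPLE IPHONE 15 PRO 256GB", "sku": "UPC-194253938897",
                  "category": "Electronics/Phones", "price": "1199.00"},
}

golden = merge(records)
print(golden)
```

The code is the easy part. Agreeing on what goes into something like `FIELD_PRIORITY` is the governance work, and knowing which records belong together in the first place is the matching work.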

The architecture is sound. The execution is brutal.

Implementation cycles run 12-24 months. The software itself is expensive — six figures annually for mid-market, seven figures for enterprise. But the real cost is organizational: data stewards must manually review thousands of potential duplicates that the system can’t resolve automatically. Governance committees must agree on naming conventions, category taxonomies, and merge rules.

Most MDM projects stall in this manual review phase. The initial deduplication pass identifies 15,000 potential matches. A data steward reviews 200 per day. At that rate, the backlog takes 75 working days, about three and a half months, to clear. By then the source systems have generated another 5,000 potential duplicates.
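The backlog arithmetic can be checked in a few lines. The figures come from the paragraph above. The assumption that new duplicates arrive at a steady daily rate is added here for illustration and is not something the figures themselves state.

```python
import math

initial_backlog = 15_000        # potential matches from the first deduplication pass
reviewed_per_day = 200          # one data steward's review throughput
new_during_first_pass = 5_000   # new potential duplicates created in the meantime

# Time to clear the original backlog if nothing new arrived.
days_without_inflow = initial_backlog / reviewed_per_day

# Assumption: the 5,000 new duplicates arrive evenly over that period.
inflow_per_day = new_during_first_pass / days_without_inflow
net_cleared_per_day = reviewed_per_day - inflow_per_day

# Time to reach an empty queue once the steady inflow is accounted for.
days_with_inflow = math.ceil(initial_backlog / net_cleared_per_day)

print(f"Without new duplicates: {days_without_inflow:.0f} working days")
print(f"Inflow: about {inflow_per_day:.0f} new potential duplicates per day")
print(f"With steady inflow: {days_with_inflow} working days")
```

Under that assumption a single steward needs 113 working days rather than 75 to reach an empty queue, and roughly a third of each day's review capacity goes to records that did not exist when the project started.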

The result is a half-implemented MDM platform that nobody trusts, sitting alongside the same fragmented source systems it was supposed to replace.

A practical alternative: match first, govern later

There’s a simpler way to start.

Instead of building the full MDM infrastructure, start with the matching problem. Export your product data from two systems. Upload both CSVs. Find the overlaps.

This gives you immediate, concrete value:

Here are the 3,400 products that exist in both systems — now you can merge them, flag them, or at least stop creating more duplicates.
Here are the 800 products in System A that don’t appear in System B — these might be legitimate differences, or they might be records that never got migrated.
Here are the 1,200 records where names are similar but attributes conflict — these need human review, but now you have a focused list instead of an open-ended mandate.
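The three lists above can be sketched with nothing more than the Python standard library. The CSV contents, column names, and `compare` function below are hypothetical. The sketch pairs records on an exactly equal normalized name, which is a deliberate simplification: a real matching run would use fuzzy comparison to catch names that are similar but not identical.

```python
import csv
import io

# Hypothetical exports from two systems.
CSV_A = """name,price
iPhone 15 Pro 256GB,1199.00
Pixel 8 128GB,699.00
Galaxy S24 256GB,859.99
"""
CSV_B = """name,price
IPHONE 15 PRO 256GB,1199.00
pixel 8  128gb,649.00
"""


def load(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize(name):
    """Lowercase and collapse runs of whitespace so trivial differences don't block a match."""
    return " ".join(name.lower().split())


def compare(system_a, system_b, key="name", check=("price",)):
    """Split system A's records into matched, conflicting, and A-only buckets."""
    index_b = {normalize(rec[key]): rec for rec in system_b}
    matched, conflicts, only_a = [], [], []
    for rec in system_a:
        other = index_b.get(normalize(rec[key]))
        if other is None:
            only_a.append(rec)                 # in A, no counterpart in B
        elif any(rec[f] != other[f] for f in check):
            conflicts.append((rec, other))     # same product, attributes disagree
        else:
            matched.append((rec, other))       # same product, attributes agree
    return matched, conflicts, only_a


matched, conflicts, only_a = compare(load(CSV_A), load(CSV_B))
print("In both systems:", [a["name"] for a, _ in matched])
print("Needs review:", [(a["name"], a["price"], b["price"]) for a, b in conflicts])
print("Only in A:", [a["name"] for a in only_a])
```

Running it against real exports mostly means swapping the embedded strings for files and choosing which columns go into `check`.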

This is what MDM looks like in practice for most teams: periodic matching runs across systems, not a real-time golden record platform. It’s less elegant architecturally, but it actually ships.

When master data includes more than text

Here’s what traditional MDM platforms — and traditional matching tools — miss: modern master data isn’t just text and numbers in spreadsheet columns.

Products have images. A product listing includes a photo, and that photo contains information — brand logos, packaging design, color accuracy, physical condition — that the text fields don’t capture.

Suppliers have contracts and spec sheets. A vendor record might link to a PDF certificate, a product spec sheet, or an annual report. Those documents contain structured information (certifications, financial data, technical specifications) that’s relevant for matching but locked inside unstructured files.

Real estate listings have photos. Insurance claims have damage images. Construction projects have blueprints.

When your matching tool can only compare text columns, it’s working with a fraction of the available information. Two product records might have different names but identical product photos. Two vendor records might have different addresses but the same PDF certification number. A text-only matcher will miss these connections.

This is where file-based matching changes the equation. By including images, PDFs, and documents as first-class data in the matching pipeline, you capture similarity signals that text alone can’t provide. AI reads the product photo and extracts the brand. It reads the PDF spec sheet and extracts the certification number. Those extracted attributes become matchable data alongside the text fields you already have.
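Once attributes have been extracted from files, using them in matching is ordinary data comparison. The sketch below assumes the extraction has already happened and stores its output in an `extracted` dictionary on each record. The vendor records, the `certificate_number` field, and the `shared_signals` helper are hypothetical, and the extraction step itself is not shown.

```python
# Hypothetical vendor records. The "extracted" values stand in for whatever
# an upstream extraction step pulled out of each vendor's attached PDF.
vendor_a = {
    "name": "Acme Industrial Supply",
    "address": "12 Harbor Rd",
    "extracted": {"certificate_number": "CERT-0001"},
}
vendor_b = {
    "name": "ACME Ind. Supply Co.",
    "address": "PO Box 44",
    "extracted": {"certificate_number": "CERT-0001"},
}


def shared_signals(a, b):
    """Return extracted attributes that both records have with identical, non-empty values."""
    return {
        key: value
        for key, value in a["extracted"].items()
        if value is not None and b["extracted"].get(key) == value
    }


signals = shared_signals(vendor_a, vendor_b)
if signals:
    print("Candidate match via file-derived attributes:", signals)
else:
    print("No file-derived attributes in common")
```

Neither the names nor the addresses of these two records agree exactly, so a text-only exact comparison would keep them apart. The shared certificate number is what flags them as a candidate pair for review.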

For product MDM specifically, adding product images and the attributes extracted from them to your matching pipeline can substantially improve match accuracy, because the product photo is often the most reliable identifier you have.

Getting started

If you’re sitting on two exports from systems that should agree but don’t, you have a master data problem. You don’t need a million-dollar MDM platform to start solving it.

Upload both files. Match them. See where they overlap, where they diverge, and where the conflicts are. That’s the foundation — and it takes ten minutes, not twelve months.
