Entity resolution explained: turning messy records into clean data

You have a spreadsheet of customer records from your CRM. You have another from a marketing platform. A third from a billing system. Somewhere in these three files, the same person appears three times — with three different spellings of their name, two different email addresses, and an outdated phone number in one of them.

Figuring out that these three records are the same person is entity resolution.

What entity resolution actually is

Entity resolution is the process of determining when two or more records refer to the same real-world entity. That entity might be a person, a company, a product, a physical address, or a financial instrument. The records might live in the same database or across completely separate systems.

The core question is simple: are these two records the same thing?

The answer is almost never straightforward.

Why it’s hard

If every system used the same unique identifier for every entity, this problem wouldn’t exist. But they don’t. And that’s not a solvable infrastructure problem — it’s a fundamental characteristic of how data gets created.

Names change. People get married, companies rebrand, products get renamed. Jane Miller becomes Jane Torres. Facebook, Inc. becomes Meta Platforms, Inc.

Data entry varies. One clerk types “Robert”. Another types “Bob”. A third types “Rob.”, with a trailing period. All correct. None match exactly.

Abbreviations and formatting differ. “123 N. Main St., Ste 200” in one system, “123 North Main Street Suite 200” in another, “123 N Main St #200” in a third.

Multilingual variation. Mohammed has over a dozen common English transliterations. Japanese company names may appear in kanji, katakana, or romanized form depending on the source.

Fields go stale. The phone number on record from 2021 is not the same phone number the person uses in 2026. The email they used for a free trial three years ago has been abandoned.

None of these are bugs. They’re the normal state of data across any organization that’s been operating for more than a year.
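Some of this variation can be reduced before any matching happens. The following is a minimal Python sketch, not a production normalizer: the `NICKNAMES` and `ABBREVIATIONS` tables and both function names are hypothetical and cover only the examples above. A rule such as `st` meaning "street" is itself an assumption, since it can also mean "saint".

```python
import re

# Hypothetical lookup tables; real ones would be far larger and locale-aware.
NICKNAMES = {"bob": "robert", "rob": "robert"}
ABBREVIATIONS = {"n": "north", "st": "street", "ste": "suite", "#": "suite"}


def normalize_first_name(raw: str) -> str:
    """Lowercase, drop a trailing period, and map known nicknames."""
    cleaned = raw.strip().rstrip(".").lower()
    return NICKNAMES.get(cleaned, cleaned)


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand a few known abbreviations."""
    text = raw.lower().replace("#", " # ")  # split "#200" into "#" and "200"
    text = re.sub(r"[.,]", " ", text)       # drop periods and commas
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


names = {normalize_first_name(n) for n in ["Robert", "Bob", "Rob."]}
addresses = {
    normalize_address(a)
    for a in [
        "123 N. Main St., Ste 200",
        "123 North Main Street Suite 200",
        "123 N Main St #200",
    ]
}
# Each set collapses to a single normalized value for these inputs.
```

Normalization like this only handles variation you can anticipate. Name changes and stale fields still need similarity scoring or additional evidence.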

The same person across five systems

System	Name	Email	Phone	Address
Hospital	Robert J. Chen	rchen@gmail.com	(415) 555-0142	789 Oak Ave, SF, CA 94102
Bank	Robert Chen	robert.chen@outlook.com	415-555-0142	789 Oak Avenue, San Francisco, CA
Voter roll	CHEN, ROBERT JAMES	—	—	789 OAK AVE APT 3B SAN FRANCISCO CA 94102
Loyalty program	Bob Chen	bobchen88@yahoo.com	415.555.0199	789 Oak Ave #3B, SF CA 94102
Social media	bob_chen_sf	bobchen88@yahoo.com	—	San Francisco, CA

Same person, five representations. No two records share all fields. Only two share an email.

Look at that table. An exact-match join on any single field would miss most of these connections. Even a join on email only links the loyalty program and social media records. The hospital and bank records share a phone number — but in different formats. The voter roll uses all-caps last-name-first formatting with no email or phone at all.
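The point about exact-match joins can be checked directly against the table. This is a small illustrative sketch in Python; the dictionary layout and the helper names `digits_only` and `exact_links` are assumptions made for the example, and only the email and phone columns are included.

```python
import re
from collections import defaultdict

# Email and phone values copied from the table; None marks a missing field.
records = {
    "hospital": {"email": "rchen@gmail.com", "phone": "(415) 555-0142"},
    "bank": {"email": "robert.chen@outlook.com", "phone": "415-555-0142"},
    "voter_roll": {"email": None, "phone": None},
    "loyalty": {"email": "bobchen88@yahoo.com", "phone": "415.555.0199"},
    "social": {"email": "bobchen88@yahoo.com", "phone": None},
}


def digits_only(phone):
    """Strip every non-digit character so formatting no longer matters."""
    return re.sub(r"\D", "", phone) if phone else None


def exact_links(key_fn):
    """Group systems whose key is identical; keep groups of two or more."""
    groups = defaultdict(list)
    for system, rec in records.items():
        key = key_fn(rec)
        if key is not None:
            groups[key].append(system)
    return [sorted(g) for g in groups.values() if len(g) > 1]


raw_phone_links = exact_links(lambda r: r["phone"])                 # no links
clean_phone_links = exact_links(lambda r: digits_only(r["phone"]))  # hospital + bank
email_links = exact_links(lambda r: r["email"])                     # loyalty + social
```

Even after normalizing phone numbers, the voter roll record stays unlinked, and nothing connects the hospital and bank pair to the loyalty and social media pair. Closing those gaps takes fuzzy comparison of names and addresses.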

A human reviewer could probably connect all five in under a minute. Getting software to do it reliably at scale is the core challenge of entity resolution.

The three sub-problems

Entity resolution is an umbrella term that covers three distinct operations.

Deduplication finds duplicate records within a single dataset. Your CRM has 50,000 contacts and you suspect 8% are duplicates created when the same person inquired through different channels. Deduplication identifies and consolidates them.

Record linkage connects records across two different datasets. You have a customer list and a purchased lead list — which leads are already customers? This is what most people mean when they say “data matching.”
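One way to see the difference between the two operations is in which candidate pairs get compared. The sketch below uses made-up sample names and is only meant to show the shape of the comparison space, not a matching method.

```python
from itertools import combinations, product

# Hypothetical sample data for illustration only.
crm = ["Robert Chen", "Bob Chen", "Jane Torres"]
leads = ["robert chen", "Maria Lopez"]

# Deduplication: every unordered pair within one dataset, n * (n - 1) / 2 pairs.
dedup_pairs = list(combinations(crm, 2))

# Record linkage: every pair across two datasets, n * m pairs.
linkage_pairs = list(product(crm, leads))
```

Both counts grow quickly with dataset size, which is why real pipelines avoid comparing every possible pair.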

Canonicalization decides which version of a matched record to keep. If you’ve identified three records for the same person, which name spelling do you use? Which email? Which address? Canonicalization produces a single “golden record” from multiple inputs.
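There is no single correct rule for building a golden record. As one possible approach, the Python sketch below keeps, for each field, the non-empty value from the most recently updated record. The `updated` field, the dates, and the function name `golden_record` are assumptions for illustration; other rules, such as preferring the most trusted source system, are just as valid.

```python
from datetime import date


def golden_record(records):
    """For each field, take the newest non-empty value (one possible rule)."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    fields = sorted({f for r in records for f in r if f != "updated"})
    merged = {}
    for field in fields:
        for rec in newest_first:
            if rec.get(field):  # skip missing or empty values
                merged[field] = rec[field]
                break
    return merged


# Hypothetical matched records with made-up update dates.
matched = [
    {"name": "Robert Chen", "email": "robert.chen@outlook.com",
     "phone": "415-555-0142", "updated": date(2021, 3, 1)},
    {"name": "Bob Chen", "email": "bobchen88@yahoo.com",
     "phone": "", "updated": date(2026, 1, 15)},
]

golden = golden_record(matched)
```

Here the name and email come from the newer record, while the phone falls back to the older one because the newer record has none.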

Most projects focus on one or two of these. Record linkage is the most common starting point.

Where entity resolution matters

This isn’t an abstract data science exercise. Failed entity resolution has direct operational consequences across every industry that maintains customer or entity records.

Healthcare. Patient matching errors are a patient safety issue. When the same patient has two medical records, clinicians miss medication interactions, allergies, and prior test results. Industry estimates put the duplicate rate at 8-12% across hospital systems. CHIME’s National Patient ID Challenge highlighted that no reliable universal patient identifier exists in the US.

Financial services. Know Your Customer (KYC) and Anti-Money Laundering (AML) regulations require firms to maintain a complete view of each client relationship. If the same entity holds accounts under slightly different names across subsidiaries, the firm’s exposure calculations are wrong and regulatory reporting is incomplete.

Government. Voter roll maintenance, benefits administration, and tax records all require entity resolution. The same person registered in two states needs to be identified — not to prevent fraud (which is vanishingly rare) but to maintain accurate rolls.

Retail and e-commerce. A customer with three accounts sees inconsistent loyalty points, receives duplicate marketing emails, and generates misleading lifetime value calculations. At scale, this fragments your understanding of who your best customers actually are.

The cost of getting it wrong

The consequences of poor entity resolution compound silently. No one gets an error message. The data just quietly degrades.

Business impact of unresolved duplicates

Metric	What it covers	Impact
Duplicate rate in CRM	Average across industries	15%
Wasted marketing spend	Duplicate mailings, split audiences	12%
Missed match rate	Real matches not identified	25%
Customer complaints from duplicates	Wrong name, repeated outreach	(value not shown)
Compliance exposure	Incomplete entity views	18%

Figures represent percentage impact. Sources: Gartner data quality research, industry benchmarks.

Duplicate mailings are the most visible symptom. A customer receives the same offer twice — or worse, receives a “We miss you!” win-back email while actively being a customer, just under a different record.

Split customer profiles mean your analytics are wrong. Lifetime value calculations are understated because revenue is spread across two records. Segmentation models make worse predictions because they’re trained on fragmented data.

Missed fraud detection. If a bad actor creates accounts under slight name variations, and your system treats each as a separate entity, the pattern is invisible.

Compliance failures. In regulated industries, an incomplete view of an entity’s relationship with your firm isn’t just an analytics problem — it’s a regulatory finding.

How matching tools solve this

Modern entity resolution tools — including Match Data Studio — attack this problem through a pipeline approach:

Normalize and clean the data so that superficial differences (case, whitespace, abbreviations) don’t cause false negatives.
Block or pre-filter to avoid comparing every record against every other record. Group candidates by ZIP code, first letter of last name, or similar blocking keys to keep the comparison space manageable.
Compute similarity using multiple methods — fuzzy string matching for names, embedding similarity for descriptions, exact matching for IDs, numeric comparison for dates and amounts.
Score and threshold to separate confident matches from borderline cases.
Confirm ambiguous pairs using LLM reasoning that considers the full context of both records.

The result is a list of matched pairs with confidence scores, ready for review or automated merging.
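The first four steps can be sketched in a few lines of Python using only the standard library. This is an illustrative toy, not how Match Data Studio or any particular tool is implemented: the sample records, the ZIP-code blocking key, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.9 and 0.7 thresholds are all assumptions chosen for the example. Pairs labeled "review" are the ones a confirmation step would examine.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records.
records = [
    {"id": 1, "name": "Robert J. Chen", "zip": "94102"},
    {"id": 2, "name": "robert  chen", "zip": "94102"},
    {"id": 3, "name": "Jane Torres", "zip": "94102"},
    {"id": 4, "name": "Robert Chen", "zip": "10001"},
]


def normalize(name):
    """Step 1: lowercase, drop periods, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def block(rows):
    """Step 2: only records sharing a ZIP code are compared."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row["zip"]].append(row)
    return blocks


def similarity(a, b):
    """Step 3: character-level similarity between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(score, match_at=0.9, review_at=0.7):
    """Step 4: separate confident matches from borderline cases."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "non-match"


def find_matches(rows):
    results = []
    for group in block(rows).values():
        for a, b in combinations(group, 2):
            score = similarity(a["name"], b["name"])
            label = classify(score)
            if label != "non-match":
                results.append((a["id"], b["id"], round(score, 2), label))
    return results


pairs = find_matches(records)
```

Record 4 is never compared with records 1 and 2 because it falls in a different block. That is the trade-off of blocking: fewer comparisons, at the risk of missing matches whose blocking key differs.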

Getting started

If you’re sitting on two datasets that probably overlap and you need to find the connections, entity resolution is the formal name for what you’re trying to do.

Match Data Studio handles the full pipeline — from normalization through AI-confirmed matching — without requiring you to write code or configure fuzzy matching algorithms manually. Upload two CSVs, describe what you’re matching, and let the AI assistant configure the pipeline.

