Entity resolution explained: turning messy records into clean data
Entity resolution is the process of determining when two records refer to the same real-world entity. Here's what it is, why it's hard, and what happens when you get it wrong.
You have a spreadsheet of customer records from your CRM. You have another from a marketing platform. A third from a billing system. Somewhere in these three files, the same person appears three times — with three different spellings of their name, two different email addresses, and an outdated phone number in one of them.
Figuring out that these three records are the same person is entity resolution.
What entity resolution actually is
Entity resolution is the process of determining when two or more records refer to the same real-world entity. That entity might be a person, a company, a product, a physical address, or a financial instrument. The records might live in the same database or across completely separate systems.
The core question is simple: are these two records the same thing?
The answer is almost never straightforward.
Why it’s hard
If every system used the same unique identifier for every entity, this problem wouldn’t exist. But they don’t. And that’s not a solvable infrastructure problem — it’s a fundamental characteristic of how data gets created.
Names change. People get married, companies rebrand, products get renamed. Jane Miller becomes Jane Torres. Facebook, Inc. becomes Meta Platforms, Inc.
Data entry varies. One clerk types “Robert”. Another types “Bob”. A third types “Rob.” with a trailing period. All correct. None match exactly.
Abbreviations and formatting differ. One system stores “123 N. Main St., Ste 200”. Another stores “123 North Main Street Suite 200”. A third stores “123 N Main St #200”.
Multilingual variation. Mohammed has over a dozen common English transliterations. Japanese company names may appear in kanji, katakana, or romanized form depending on the source.
Fields go stale. The phone number on record from 2021 is not the same phone number the person uses in 2026. The email they used for a free trial three years ago has been abandoned.
None of these are bugs. They’re the normal state of data across any organization that’s been operating for more than a year.
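As a rough illustration of how formatting differences like the address variants above can be neutralized, here is a minimal sketch. The abbreviation map and the `normalize_address` helper are hypothetical, cover only these examples, and assume that "#" introduces a suite number; a real address standardizer needs far more rules.

```python
import re

# Illustrative abbreviation map (an assumption, not a complete postal standard).
ABBREVIATIONS = {
    "north": "n",
    "street": "st",
    "suite": "ste",
    "avenue": "ave",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse a few common abbreviations."""
    text = raw.lower()
    text = text.replace("#", " ste ")    # assumption: "#200" means suite 200
    text = re.sub(r"[.,]", " ", text)    # drop periods and commas
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

variants = [
    "123 N. Main St., Ste 200",
    "123 North Main Street Suite 200",
    "123 N Main St #200",
]

# A set collapses identical normalized strings into one entry.
normalized = {normalize_address(v) for v in variants}
print(normalized)
```

All three variants reduce to the same string here only because the rules were written for these inputs. Stale phone numbers and changed names cannot be fixed by normalization at all.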
| System | Name | Email | Phone | Address |
|---|---|---|---|---|
| Hospital | Robert J. Chen | rchen@gmail.com | (415) 555-0142 | 789 Oak Ave, SF, CA 94102 |
| Bank | Robert Chen | robert.chen@outlook.com | 415-555-0142 | 789 Oak Avenue, San Francisco, CA |
| Voter roll | CHEN, ROBERT JAMES | — | — | 789 OAK AVE APT 3B SAN FRANCISCO CA 94102 |
| Loyalty program | Bob Chen | bobchen88@yahoo.com | 415.555.0199 | 789 Oak Ave #3B, SF CA 94102 |
| Social media | bob_chen_sf | bobchen88@yahoo.com | — | San Francisco, CA |
Same person, five representations. No two records share all fields. Only two share an email.
Look at that table. An exact-match join on any single field would miss most of these connections. Even a join on email only links the loyalty program and social media records. The hospital and bank records share a phone number — but in different formats. The voter roll uses all-caps last-name-first formatting with no email or phone at all.
A human reviewer could probably connect all five in under a minute. Getting software to do it reliably at scale is the core challenge of entity resolution.
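To make the exact-match problem concrete, the sketch below runs naive joins over the email and phone values from the table. The `linked_pairs` and `digits_only` helpers are hypothetical names introduced for this example, and stripping phones to digits is just one possible normalization.

```python
import re

# Email and phone values copied from the table; None marks a missing field.
records = [
    {"system": "Hospital", "email": "rchen@gmail.com", "phone": "(415) 555-0142"},
    {"system": "Bank", "email": "robert.chen@outlook.com", "phone": "415-555-0142"},
    {"system": "Voter roll", "email": None, "phone": None},
    {"system": "Loyalty program", "email": "bobchen88@yahoo.com", "phone": "415.555.0199"},
    {"system": "Social media", "email": "bobchen88@yahoo.com", "phone": None},
]

def digits_only(phone):
    """Strip everything except digits so formatting differences disappear."""
    return re.sub(r"\D", "", phone) if phone else None

def linked_pairs(rows, key):
    """Return pairs of systems whose key(row) values are equal and not missing."""
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            if key(a) is not None and key(a) == key(b):
                pairs.append((a["system"], b["system"]))
    return pairs

exact_email = linked_pairs(records, lambda r: r["email"])
exact_phone = linked_pairs(records, lambda r: r["phone"])
normalized_phone = linked_pairs(records, lambda r: digits_only(r["phone"]))

print("exact email:", exact_email)
print("exact phone:", exact_phone)
print("normalized phone:", normalized_phone)
```

Normalizing the phone field recovers the hospital-bank link that the exact join misses, but the voter roll record still connects to nothing. Linking it requires comparing names and addresses.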
The three sub-problems
Entity resolution is an umbrella term that covers three distinct operations.
Deduplication finds duplicate records within a single dataset. Your CRM has 50,000 contacts and you suspect 8% are duplicates created when the same person inquired through different channels. Deduplication identifies and consolidates them.
Record linkage connects records across two different datasets. You have a customer list and a purchased lead list — which leads are already customers? This is what most people mean when they say “data matching.”
Canonicalization decides which version of a matched record to keep. If you’ve identified three records for the same person, which name spelling do you use? Which email? Which address? Canonicalization produces a single “golden record” from multiple inputs.
Most projects focus on one or two of these. Record linkage is the most common starting point.
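Canonicalization is the easiest of the three to sketch. The example below assumes one simple survivorship rule: for each field, keep the most recently updated non-empty value. The records are loosely adapted from the table above, the `updated` dates and missing fields are invented for illustration, and `golden_record` is a hypothetical helper rather than any tool's actual API.

```python
from datetime import date

# Three records already judged to be the same person.
# The `updated` dates and the None values are made up for this example.
matched = [
    {"name": "Robert J. Chen", "email": "rchen@gmail.com",
     "phone": "(415) 555-0142", "updated": date(2021, 3, 1)},
    {"name": "Robert Chen", "email": "robert.chen@outlook.com",
     "phone": None, "updated": date(2024, 6, 15)},
    {"name": "Bob Chen", "email": None,
     "phone": "415.555.0199", "updated": date(2025, 1, 10)},
]

def golden_record(records, fields):
    """For each field, take the newest non-empty value across the records."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    return {
        field: next((r[field] for r in newest_first if r[field]), None)
        for field in fields
    }

golden = golden_record(matched, ["name", "email", "phone"])
print(golden)
```

"Newest wins" picks the nickname "Bob Chen" as the canonical name, which may not be what you want. Real survivorship rules are usually set per field, for example by preferring the most complete name or the most trusted source system.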
Where entity resolution matters
This isn’t an abstract data science exercise. Failed entity resolution has direct operational consequences across every industry that maintains customer or entity records.
Healthcare. Patient matching errors are a patient safety issue. When the same patient has two medical records, clinicians miss medication interactions, allergies, and prior test results. Industry estimates put the duplicate rate across hospital systems at 8-12%. CHIME’s National Patient ID Challenge highlighted that no reliable universal patient identifier exists in the US.
Financial services. Know Your Customer (KYC) and Anti-Money Laundering (AML) regulations require firms to maintain a complete view of each client relationship. If the same entity holds accounts under slightly different names across subsidiaries, the firm’s exposure calculations are wrong and regulatory reporting is incomplete.
Government. Voter roll maintenance, benefits administration, and tax records all require entity resolution. The same person registered in two states needs to be identified — not to prevent fraud (which is vanishingly rare) but to maintain accurate rolls.
Retail and e-commerce. A customer with three accounts sees inconsistent loyalty points, receives duplicate marketing emails, and generates misleading lifetime value calculations. At scale, this fragments your understanding of who your best customers actually are.
The cost of getting it wrong
The consequences of poor entity resolution compound silently. No one gets an error message. The data just quietly degrades.
Duplicate mailings are the most visible symptom. A customer receives the same offer twice — or worse, receives a “We miss you!” win-back email while actively being a customer, just under a different record.
Split customer profiles mean your analytics are wrong. Lifetime value calculations are understated because revenue is spread across two records. Segmentation models make worse predictions because they’re trained on fragmented data.
Missed fraud detection. If a bad actor creates accounts under slight name variations, and your system treats each as a separate entity, the pattern is invisible.
Compliance failures. In regulated industries, an incomplete view of an entity’s relationship with your firm isn’t just an analytics problem — it’s a regulatory finding.
How matching tools solve this
Modern entity resolution tools — including Match Data Studio — attack this problem through a pipeline approach:
- Normalize and clean the data so that superficial differences (case, whitespace, abbreviations) don’t cause false negatives.
- Block or pre-filter to avoid comparing every record against every other record. Group candidates by ZIP code, first letter of last name, or similar blocking keys to keep the comparison space manageable.
- Compute similarity using multiple methods — fuzzy string matching for names, embedding similarity for descriptions, exact matching for IDs, numeric comparison for dates and amounts.
- Score and threshold to separate confident matches from borderline cases.
- Confirm ambiguous pairs using LLM reasoning that considers the full context of both records.
The result is a list of matched pairs with confidence scores, ready for review or automated merging.
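The first four steps can be sketched with nothing but the Python standard library. This is a toy illustration under stated assumptions, not how Match Data Studio or any particular tool implements them: the nickname table, the thresholds (0.85 and 0.60), the ZIP blocking key, and records 4 and 5 are all invented for the example, and `difflib.SequenceMatcher` stands in for a proper fuzzy-matching method. The LLM confirmation step is not implemented; pairs labeled "review" are the ones that would be passed to it.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

NICKNAMES = {"bob": "robert", "rob": "robert"}  # tiny illustrative table
MATCH_THRESHOLD = 0.85    # assumed cutoff for confident matches
REVIEW_THRESHOLD = 0.60   # assumed cutoff for borderline pairs

def normalize_name(raw):
    """Step 1: lowercase, drop periods, reorder 'LAST, FIRST', expand nicknames."""
    text = raw.lower().replace(".", "")
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
        text = f"{first} {last}"
    return " ".join(NICKNAMES.get(tok, tok) for tok in text.split())

def candidate_pairs(records, block_key):
    """Step 2: only compare records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

def classify(score):
    """Step 4: turn a similarity score into a label."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "review"   # would go to the confirmation step
    return "non-match"

def resolve(records, block_key="zip"):
    results = []
    for a, b in candidate_pairs(records, block_key):
        # Step 3: character-level similarity of the normalized names.
        score = SequenceMatcher(
            None, normalize_name(a["name"]), normalize_name(b["name"])
        ).ratio()
        results.append((a["id"], b["id"], round(score, 3), classify(score)))
    return results

records = [
    {"id": 1, "name": "Robert J. Chen", "zip": "94102"},
    {"id": 2, "name": "CHEN, ROBERT JAMES", "zip": "94102"},
    {"id": 3, "name": "Bob Chen", "zip": "94102"},
    {"id": 4, "name": "Roberta Chang", "zip": "94102"},  # invented near-miss
    {"id": 5, "name": "Robert Chen", "zip": "10001"},    # invented, different ZIP
]

results = resolve(records)
for row in results:
    print(row)
```

Blocking on ZIP cuts the work from 10 pairwise comparisons to 6, but it also means record 5 is never compared with anything, even though its name matches record 3 exactly after normalization. That is the usual trade-off when choosing a blocking key: a narrower key saves comparisons and risks dropping true matches.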
Getting started
If you’re sitting on two datasets that probably overlap and you need to find the connections, entity resolution is the formal name for what you’re trying to do.
Match Data Studio handles the full pipeline — from normalization through AI-confirmed matching — without requiring you to write code or configure fuzzy matching algorithms manually. Upload two CSVs, describe what you’re matching, and let the AI assistant configure the pipeline.