You open a CSV file exported from your CRM, a vendor database, or a scraped dataset. It has 50,000 rows. You suspect many of those rows are duplicates — the same person, company, or product entered multiple times with slight variations. You select all columns, click Remove Duplicates in Excel, and it finds 312 exact duplicates. Problem solved?

Not even close. The 312 records Excel found are the trivial case — rows that happen to be identical across every single cell. The real duplicates, the ones inflating your counts and corrupting your analytics, have a different phone format in one field, a nickname instead of a full name in another, and an abbreviated address in a third. Excel doesn’t see these as duplicates. They are.

The gap between what Excel catches and what actually exists in your data is often staggering. In our experience working with customer and product datasets, true duplicate rates range from 8% to 25% of total rows, while exact-match detection catches 1-4%. That means the tool most people use for deduplication misses 80-95% of actual duplicates.

Exact duplicates vs fuzzy duplicates (and why the second kind matters more)

An exact duplicate is a row where every compared field is identical, character for character. These are relatively rare in practice. They occur when the same record is imported twice from an identical source — a double-submitted form, a repeated batch import, a copy-paste error.

A fuzzy duplicate is a row that refers to the same real-world entity but was recorded differently. Same person, different spelling. Same address, different abbreviation. Same product, different description.

The same customer entered four times — only one pair is an exact duplicate
Row  Name            Email                 Phone            City
A    Robert Johnson  r.johnson@gmail.com   (555) 234-5678   San Francisco
B    Robert Johnson  r.johnson@gmail.com   (555) 234-5678   San Francisco
C    Bob Johnson     rjohnson@gmail.com    555-234-5678     San Francisco, CA
D    R. Johnson      r.johnson@gmail.com   5552345678       SF

Rows A and B are exact duplicates. Rows C and D are fuzzy duplicates of the same person. Excel's Remove Duplicates finds only the A/B pair.

Rows A and B are identical in every cell. Excel catches this. Rows C and D are clearly the same person as A and B — same phone number (different format), same email (with or without the period), same city (abbreviated differently). Excel treats C and D as completely unique records.

In most real datasets, fuzzy duplicates outnumber exact duplicates by a factor of five to ten. They’re also more damaging, because they create the illusion of a larger dataset. When a sales manager sees 50,000 contacts, they make staffing and budget decisions based on that number. If 8,000 of those contacts are duplicates, every downstream decision is working from inflated data.

Why Excel’s “Remove Duplicates” misses most real duplicates

Excel’s deduplication is a character-level exact comparison: it treats each row (or the selected columns) as a single value and groups rows whose values are identical. This approach has three fundamental limitations.

It has no concept of similarity. Robert and Bob are 0% match in Excel’s view. There is no partial credit, no similarity score, no way to say “these are close enough.” It’s binary: identical or different.

It compares entire fields, not meaningful components. (555) 234-5678 and 555-234-5678 contain the same ten digits. They represent the same phone number. But as strings, the parentheses, space, and dashes leave them only about 79% similar by edit distance, well short of identical.

It can’t handle structural variation. If one row has San Francisco in the city column and another has San Francisco, CA, Excel sees different strings. It doesn’t know that the second value contains extra information rather than contradictory information.

These aren’t bugs in Excel. Remove Duplicates does exactly what it promises — it removes rows that are identical across selected columns. The problem is that real-world deduplication requires a fundamentally different approach: comparing records by what they represent, not by the exact characters in each cell.

Detecting fuzzy duplicates: the field-by-field approach

Effective deduplication compares records field by field, using the right comparison method for each field type, then combines the field-level scores into an overall similarity score.

Name fields need algorithms that handle transpositions, abbreviations, and nicknames. Jaro-Winkler is effective for personal names because it gives extra weight to matching prefixes (catching Robert vs Rob). For business names, token-based comparison works better because word order varies (Johnson & Johnson Inc vs Inc Johnson and Johnson).
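Token-based comparison for business names can be sketched in a few lines with the standard library (Jaro-Winkler itself is usually pulled from a library such as jellyfish; the function name here is illustrative):

```python
import re

def token_set_similarity(a: str, b: str) -> float:
    """Jaccard similarity over word tokens -- order-insensitive, so
    'Johnson & Johnson Inc' matches 'Inc Johnson and Johnson'."""
    def tokens(s: str) -> set:
        s = s.lower().replace("&", "and")  # unify common variants
        return set(re.findall(r"[a-z0-9]+", s))
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

token_set_similarity("Johnson & Johnson Inc", "Inc Johnson and Johnson")  # 1.0
```

Because the comparison works on the set of words rather than their order, reordered business names score as identical.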

Phone and ID fields should be compared after stripping formatting. Extract digits only, then compare. (555) 234-5678 becomes 5552345678. This alone converts most phone “mismatches” into exact matches.
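Digit extraction is a one-liner with a regular expression; a minimal sketch:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip everything but digits so formatting variants compare equal."""
    return re.sub(r"\D", "", raw)

normalize_phone("(555) 234-5678")  # '5552345678'
```

After this step, all three phone formats from the example table compare as exact matches.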

Address fields need normalization before comparison: expand abbreviations (St to Street, Blvd to Boulevard), standardize directionals (N to North), and parse into components (street number, street name, unit, city, state, ZIP). Compare each component separately.
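A minimal sketch of the abbreviation-expansion step (the mapping here is a tiny illustrative sample; a real pipeline would use a fuller table such as the USPS suffix list, plus a proper address parser for the component split):

```python
import re

# Illustrative abbreviation map -- extend with the full USPS suffix list in practice.
ABBREVIATIONS = {
    "st": "street", "blvd": "boulevard", "ave": "avenue",
    "n": "north", "s": "south", "e": "east", "w": "west",
}

def normalize_address(raw: str) -> str:
    """Lowercase, tokenize, and expand known abbreviations word by word."""
    words = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

normalize_address("123 N Main St")  # '123 north main street'
```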

Email fields are highly discriminating. A character-level comparison works well, but consider splitting on @ and comparing the local part and domain separately. rjohnson@gmail.com and r.johnson@gmail.com differ by one period in the local part — that’s almost certainly the same person.
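The split-and-compare idea looks like this. Note that treating dots in the local part as insignificant is a Gmail convention; other providers may consider them meaningful, so this is a heuristic, not a rule:

```python
def emails_match(a: str, b: str) -> bool:
    """Compare local part and domain separately.
    Dots in the local part are ignored -- a Gmail-style heuristic."""
    def parts(addr: str):
        local, _, domain = addr.lower().partition("@")
        return local.replace(".", ""), domain
    return parts(a) == parts(b)

emails_match("rjohnson@gmail.com", "r.johnson@gmail.com")  # True
```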

Free-text fields (descriptions, notes, comments) benefit from embedding-based comparison. A product described as “wireless bluetooth headphones with noise canceling” and “noise-cancelling BT headphones, wireless” contain the same information in different words. AI embeddings capture this semantic similarity where character-level comparison fails.
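Combining the field-level scores can be sketched as a weighted sum. This uses the standard library's difflib.SequenceMatcher as a stand-in character similarity (a production pipeline would use Jaro-Winkler for names, as above); the field names and weights are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

# Illustrative field weights -- tune per dataset.
WEIGHTS = {"name": 0.5, "phone": 0.3, "city": 0.2}

def field_scores(a: dict, b: dict) -> dict:
    """Per-field similarity; SequenceMatcher stands in for Jaro-Winkler here."""
    digits = lambda s: re.sub(r"\D", "", s)
    chars = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return {
        "name": chars(a["name"], b["name"]),
        "phone": 1.0 if digits(a["phone"]) == digits(b["phone"]) else 0.0,
        "city": chars(a["city"], b["city"]),
    }

def overall_score(a: dict, b: dict) -> float:
    """Weighted combination of the field-level scores."""
    scores = field_scores(a, b)
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)
```

With this scorer, rows A and C from the earlier table get a perfect phone score once formatting is stripped, even though their name and city strings differ.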

Duplicate detection rate by method
Method                     Basis                         Detection rate
Excel Remove Duplicates    Exact match only              14%
Case-insensitive exact     After lowercasing             23%
Normalized exact           After cleaning + formatting   41%
Single-field fuzzy         Name field, Jaro-Winkler      58%
Multi-field fuzzy          Name + phone + city           76%
AI-powered matching        Embeddings + LLM confirm      91%

Percentage of true duplicates found in a 25,000-row CRM export with a 12% true duplicate rate. Illustrative figures.

The progression tells a clear story. Exact matching in any form (including case-insensitive) catches fewer than a quarter of real duplicates. Normalization helps — cleaning phone formats and addresses pushes detection to 41%. Fuzzy matching on a single field reaches 58%, and multi-field fuzzy matching reaches 76%. AI-powered matching, which understands semantic equivalence and can reason about ambiguous cases, pushes detection above 90%.

Each step adds value, but the jump from exact matching to approximate comparison is where the biggest gains happen.

Choosing a similarity threshold for deduplication

Every fuzzy matching approach produces a similarity score for each pair. You need a threshold: pairs above the threshold are flagged as duplicates; pairs below are treated as distinct records.

Setting this threshold is the most consequential decision in any deduplication workflow. Too high and you miss real duplicates. Too low and you merge records that are actually different people, companies, or products.

Start at 0.85. This is a reasonable default for most datasets. At 0.85, two records need to be very similar across multiple fields to be flagged — similar enough that the vast majority of flagged pairs are real duplicates.

Review a sample of 30-50 pairs near the boundary. Look at pairs scoring 0.83-0.87. How many are real duplicates? If all 30 are correct, you can probably lower the threshold. If 10 of them are false positives, you should raise it.

Adjust per field type. Names might need a lower threshold (0.80) because nicknames and abbreviations create legitimate variation between duplicates. Phone numbers should have a higher threshold (0.95) because they’re either the same number or they’re not — there’s no “kind of the same phone number.”

Consider the cost of each error type. In a marketing database, false positives (merging two different people into one record) cause you to lose a contact. False negatives (failing to merge duplicates) cause you to send duplicate communications. Which is more costly for your use case? If losing contacts is worse, set a lower threshold and manually review flagged pairs. If duplicate outreach is worse, set a higher threshold and accept that some duplicates will survive.
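Given a labeled sample, the precision/recall tradeoff can be measured directly. A sketch, assuming pair scores keyed by record-ID tuples and a ground-truth set of known duplicate pairs:

```python
def sweep_thresholds(scored_pairs, true_pairs, thresholds):
    """Precision/recall at each candidate threshold.
    scored_pairs: {(id_a, id_b): similarity}; true_pairs: set of known-duplicate pairs."""
    results = {}
    for t in thresholds:
        flagged = {p for p, s in scored_pairs.items() if s >= t}
        tp = len(flagged & true_pairs)
        precision = tp / len(flagged) if flagged else 1.0
        recall = tp / len(true_pairs) if true_pairs else 1.0
        results[t] = (precision, recall)
    return results
```

Running this over a grid of thresholds on a few hundred labeled pairs produces exactly the kind of table shown below, tuned to your own data.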

Threshold tuning: precision and recall tradeoff on a 25,000-row dataset
Threshold   Pairs found   True positives   False positives   Precision   Recall
0.70        2,840         1,380            1,460             48.6%       92.0%
0.75        2,210         1,340            870               60.6%       89.3%
0.80        1,720         1,280            440               74.4%       85.3%
0.85        1,350         1,190            160               88.1%       79.3%
0.90        1,050         1,010            40                96.2%       67.3%
0.95        680           670              10                98.5%       44.7%

Based on a dataset with 1,500 true duplicate pairs. Precision = true positives / pairs found. Recall = true positives / total true duplicates.

The table illustrates the fundamental tradeoff. At 0.70, you catch 92% of duplicates but nearly half of your flagged pairs are wrong — you’d spend more time reviewing false positives than you’d save by catching duplicates. At 0.95, almost every flagged pair is correct, but you miss more than half of the actual duplicates. The 0.85 row is often the sweet spot: 88% precision with 79% recall means manageable review burden and meaningful duplicate reduction.

Handling the duplicates you find: merge, keep-best, or flag

Finding duplicates is half the problem. Deciding what to do with them is the other half.

Merge strategy: combine the best data from each duplicate into a single surviving record. Take the most complete name, the most recent email, the phone number that’s most likely current. This produces the highest-quality output but requires rules for which value wins when duplicates disagree.

Keep-best strategy: pick the most complete or most recent duplicate and discard the others. Simpler than merging, but you lose data that only exists in the discarded records. A secondary record might have a phone number that the primary record lacks.

Flag-for-review strategy: mark duplicate pairs in a new column without removing anything. A human reviewer makes the final call on each pair. This is the safest approach for sensitive data (medical records, financial accounts) where an incorrect merge has serious consequences.

In practice, most teams use a hybrid: auto-merge high-confidence pairs (scores above 0.95), flag medium-confidence pairs (0.85-0.95) for review, and leave low-confidence pairs (below 0.85) as separate records. This directs human attention where it matters most — the ambiguous cases — while automating the clear-cut ones.
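The hybrid routing rule is simple enough to state as code; the 0.95/0.85 cutoffs are the defaults discussed above and should be tuned per dataset:

```python
def triage(score: float, auto_merge: float = 0.95, review: float = 0.85) -> str:
    """Route a scored pair: merge automatically, queue for a human, or keep separate."""
    if score >= auto_merge:
        return "auto-merge"
    if score >= review:
        return "flag-for-review"
    return "keep-separate"
```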

Deduplication at scale: when your file has 100K+ rows

A file with 100,000 rows contains 4,999,950,000 possible pairs. Comparing every row against every other row is computationally infeasible — even at 10,000 comparisons per second, it would take nearly six days.

The solution is blocking: dividing your file into groups of records that are likely to contain duplicates, then only comparing within each group.

Common blocking keys:

  • First three characters of last name (all Smiths compared together, not against all Johnsons)
  • ZIP code or first three digits of ZIP
  • First letter of name plus state
  • Phone area code
  • Domain of email address

A good blocking key reduces comparisons by 95-99% while keeping virtually all true duplicate pairs in the same block. The tradeoff is that duplicates split across blocks — a Smith in one block and a Smyth in another — won’t be compared. Multi-pass blocking with different keys addresses this: first pass blocks on name prefix, second pass blocks on ZIP code, third pass blocks on phone prefix. Pairs caught by any pass are compared.
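A single blocking pass can be sketched as a group-by on the key followed by within-block pairing (key_fn here is whatever blocking key you choose, such as the first three characters of the last name):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(rows, key_fn):
    """Group rows by a blocking key; candidate pairs come only from within a block."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[key_fn(row)].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)

rows = [{"last": "Smith"}, {"last": "Smithers"}, {"last": "Johnson"}]
list(block_pairs(rows, lambda r: r["last"][:3].lower()))  # [(0, 1)]
```

Multi-pass blocking is just running this with several different key_fn choices and taking the union of the candidate pairs.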

Sorted-neighborhood approach is an alternative to blocking. Sort the file by a key field (like a phonetic encoding of the name), then compare each record only against its nearest neighbors in the sorted order — typically a window of 5-20 records. This works well when duplicates cluster naturally in sorted order and avoids the hard boundaries of blocking.
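The sorted-neighborhood pass can be sketched as a sort plus a sliding window (a phonetic encoding such as Soundex would typically serve as the sort key; a plain string sort is used here for illustration):

```python
def sorted_neighborhood_pairs(rows, key_fn, window=5):
    """Sort rows by key, then pair each record with its next `window` neighbors."""
    order = sorted(range(len(rows)), key=lambda i: key_fn(rows[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            yield (i, j)
```

Unlike blocking, every record is compared with its nearest neighbors regardless of which side of a block boundary it falls on.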

Iterative deduplication handles the case where a single entity appears more than twice. After the first pass identifies pairs, you build clusters: if A matches B and B matches C, then A, B, and C are all duplicates even if A and C don’t directly match. Transitive closure on the match graph produces complete duplicate clusters rather than isolated pairs.
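Transitive closure over matched pairs is a classic union-find problem; a compact sketch:

```python
def duplicate_clusters(pairs):
    """Union-find closure: A~B and B~C puts A, B, and C in one cluster."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

duplicate_clusters([("A", "B"), ("B", "C"), ("D", "E")])  # clusters {A,B,C} and {D,E}
```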

Match Data Studio’s pipeline handles all of this automatically. Stage 1 applies blocking to generate candidate pairs. Stage 2 computes embeddings only for rows that survived blocking (saving AI costs). Stage 3 runs similarity comparison and LLM confirmation on the candidate pairs. You configure the matching rules; the pipeline handles the scale.


Deduplicating a CSV file properly requires more than Excel’s Remove Duplicates. It requires fuzzy matching, threshold tuning, blocking strategies, and a plan for handling the duplicates you find. Match Data Studio automates the entire workflow — from field-level cleaning through AI-powered duplicate detection to exportable results. Start deduplicating your data now →
