You open a CSV file exported from your CRM, a vendor database, or a scraped dataset. It has 50,000 rows. You suspect many of those rows are duplicates — the same person, company, or product entered multiple times with slight variations. You select all columns, click Remove Duplicates in Excel, and it finds 312 exact duplicates. Problem solved?

Not even close. The 312 records Excel found are the trivial case — rows that happen to be identical across every single cell. The real duplicates, the ones inflating your counts and corrupting your analytics, have a different phone format in one field, a nickname instead of a full name in another, and an abbreviated address in a third. Excel doesn’t see these as duplicates. They are.

The gap between what Excel catches and what actually exists in your data is often staggering. In our experience working with customer and product datasets, true duplicate rates range from 8% to 25% of total rows, while exact-match detection catches 1-4%. That means the tool most people use for deduplication misses 80-95% of actual duplicates.

Exact duplicates vs fuzzy duplicates (and why the second kind matters more)

An exact duplicate is a row where every compared field is identical, character for character. These are relatively rare in practice. They occur when the same record is imported twice from an identical source — a double-submitted form, a repeated batch import, a copy-paste error.

A fuzzy duplicate is a row that refers to the same real-world entity but was recorded differently. Same person, different spelling. Same address, different abbreviation. Same product, different description.

The same customer entered four times — only one pair is an exact duplicate
Row  Name            Email                 Phone            City
A    Robert Johnson  r.johnson@gmail.com   (555) 234-5678   San Francisco
B    Robert Johnson  r.johnson@gmail.com   (555) 234-5678   San Francisco
C    Bob Johnson     rjohnson@gmail.com    555-234-5678     San Francisco, CA
D    R. Johnson      r.johnson@gmail.com   5552345678       SF

Rows A and B are exact duplicates. Rows C and D are fuzzy duplicates of the same person. Excel's Remove Duplicates finds only the A/B pair.

Rows A and B are identical in every cell. Excel catches this. Rows C and D are clearly the same person as A and B — same phone number (different format), same email (with or without the period), same city (abbreviated differently). Excel treats C and D as completely unique records.

In most real datasets, fuzzy duplicates outnumber exact duplicates by a factor of five to ten. They’re also more damaging, because they create the illusion of a larger dataset. When a sales manager sees 50,000 contacts, they make staffing and budget decisions based on that number. If 8,000 of those contacts are duplicates, every downstream decision is working from inflated data.

Why Excel’s “Remove Duplicates” misses most real duplicates

Excel’s deduplication is a character-level exact comparison: it treats each row (or the selected columns) as a single value and groups rows whose values are identical. This approach has three fundamental limitations.

It has no concept of similarity. Robert and Bob are 0% match in Excel’s view. There is no partial credit, no similarity score, no way to say “these are close enough.” It’s binary: identical or different.

It compares entire fields, not meaningful components. (555) 234-5678 and 555-234-5678 contain the same ten digits. They represent the same phone number. But as strings, the parentheses, space, and dashes leave them only about 79% similar by edit distance, well short of identical.

It can’t handle structural variation. If one row has San Francisco in the city column and another has San Francisco, CA, Excel sees different strings. It doesn’t know that the second value contains extra information rather than contradictory information.

These aren’t bugs in Excel. Remove Duplicates does exactly what it promises — it removes rows that are identical across selected columns. The problem is that real-world deduplication requires a fundamentally different approach: comparing records by what they represent, not by the exact characters in each cell.

Detecting fuzzy duplicates: the field-by-field approach

Effective deduplication compares records field by field, using the right comparison method for each field type, then combines the field-level scores into an overall similarity score.

Name fields need algorithms that handle transpositions, abbreviations, and nicknames. Jaro-Winkler is effective for personal names because it gives extra weight to matching prefixes (catching Robert vs Rob). For business names, token-based comparison works better because word order varies (Johnson & Johnson Inc vs Inc Johnson and Johnson).
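Token-based comparison for business names can be sketched in a few lines with the standard library (Jaro-Winkler itself is usually pulled from a library such as jellyfish; the function name here is illustrative):

```python
import re

def token_set_similarity(a: str, b: str) -> float:
    """Jaccard similarity over word tokens -- order-insensitive, so
    'Johnson & Johnson Inc' matches 'Inc Johnson and Johnson'."""
    def tokens(s: str) -> set:
        s = s.lower().replace("&", "and")  # unify common variants
        return set(re.findall(r"[a-z0-9]+", s))
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

token_set_similarity("Johnson & Johnson Inc", "Inc Johnson and Johnson")  # 1.0
```

Because the comparison works on the set of words rather than their order, reordered business names score as identical.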

Phone and ID fields should be compared after stripping formatting. Extract digits only, then compare. (555) 234-5678 becomes 5552345678. This alone converts most phone “mismatches” into exact matches.
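Digit extraction is a one-liner with a regular expression; a minimal sketch:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip everything but digits so formatting variants compare equal."""
    return re.sub(r"\D", "", raw)

normalize_phone("(555) 234-5678")  # '5552345678'
```

After this step, all three phone formats from the example table compare as exact matches.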

Address fields need normalization before comparison: expand abbreviations (St to Street, Blvd to Boulevard), standardize directionals (N to North), and parse into components (street number, street name, unit, city, state, ZIP). Compare each component separately.
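A minimal sketch of the abbreviation-expansion step (the mapping here is a tiny illustrative sample; a real pipeline would use a fuller table such as the USPS suffix list, plus a proper address parser for the component split):

```python
import re

# Illustrative abbreviation map -- extend with the full USPS suffix list in practice.
ABBREVIATIONS = {
    "st": "street", "blvd": "boulevard", "ave": "avenue",
    "n": "north", "s": "south", "e": "east", "w": "west",
}

def normalize_address(raw: str) -> str:
    """Lowercase, tokenize, and expand known abbreviations word by word."""
    words = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

normalize_address("123 N Main St")  # '123 north main street'
```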

Email fields are highly discriminating. A character-level comparison works well, but consider splitting on @ and comparing the local part and domain separately. rjohnson@gmail.com and r.johnson@gmail.com differ by one period in the local part — that’s almost certainly the same person.
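The split-and-compare idea looks like this. Note that treating dots in the local part as insignificant is a Gmail convention; other providers may consider them meaningful, so this is a heuristic, not a rule:

```python
def emails_match(a: str, b: str) -> bool:
    """Compare local part and domain separately.
    Dots in the local part are ignored -- a Gmail-style heuristic."""
    def parts(addr: str):
        local, _, domain = addr.lower().partition("@")
        return local.replace(".", ""), domain
    return parts(a) == parts(b)

emails_match("rjohnson@gmail.com", "r.johnson@gmail.com")  # True
```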

Free-text fields (descriptions, notes, comments) benefit from embedding-based comparison. A product described as “wireless bluetooth headphones with noise canceling” and “noise-cancelling BT headphones, wireless” contain the same information in different words. AI embeddings capture this semantic similarity where character-level comparison fails.
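Combining the field-level scores can be sketched as a weighted sum. This uses the standard library's difflib.SequenceMatcher as a stand-in character similarity (a production pipeline would use Jaro-Winkler for names, as above); the field names and weights are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

# Illustrative field weights -- tune per dataset.
WEIGHTS = {"name": 0.5, "phone": 0.3, "city": 0.2}

def field_scores(a: dict, b: dict) -> dict:
    """Per-field similarity; SequenceMatcher stands in for Jaro-Winkler here."""
    digits = lambda s: re.sub(r"\D", "", s)
    chars = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return {
        "name": chars(a["name"], b["name"]),
        "phone": 1.0 if digits(a["phone"]) == digits(b["phone"]) else 0.0,
        "city": chars(a["city"], b["city"]),
    }

def overall_score(a: dict, b: dict) -> float:
    """Weighted combination of the field-level scores."""
    scores = field_scores(a, b)
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)
```

With this scorer, rows A and C from the earlier table get a perfect phone score once formatting is stripped, even though their name and city strings differ.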

Duplicate detection rate by method
Method                     Basis                         Detection rate
Excel Remove Duplicates    Exact match only              14%
Case-insensitive exact     After lowercasing             23%
Normalized exact           After cleaning + formatting   41%
Single-field fuzzy         Name field, Jaro-Winkler      58%
Multi-field fuzzy          Name + phone + city           76%
AI-powered matching        Embeddings + LLM confirm      91%

Percentage of true duplicates found in a 25,000-row CRM export with a 12% true duplicate rate. Illustrative figures.

The progression tells a clear story. Exact matching in any form (including case-insensitive) catches fewer than a quarter of real duplicates. Normalization helps — cleaning phone formats and addresses pushes detection to 41%. Fuzzy matching on a single field reaches 58%, and multi-field fuzzy matching reaches 76%. AI-powered matching, which understands semantic equivalence and can reason about ambiguous cases, pushes detection above 90%.

Each step adds value, but the jump from exact matching to approximate comparison is where the biggest gains happen.

Choosing a similarity threshold for deduplication

Every fuzzy matching approach produces a similarity score for each pair. You need a threshold: pairs above the threshold are flagged as duplicates; pairs below are treated as distinct records.

Setting this threshold is the most consequential decision in any deduplication workflow. Too high and you miss real duplicates. Too low and you merge records that are actually different people, companies, or products.

Start at 0.85. This is a reasonable default for most datasets. At 0.85, two records need to be very similar across multiple fields to be flagged — similar enough that the vast majority of flagged pairs are real duplicates.

Review a sample of 30-50 pairs near the boundary. Look at pairs scoring 0.83-0.87. How many are real duplicates? If all 30 are correct, you can probably lower the threshold. If 10 of them are false positives, you should raise it.

Adjust per field type. Names might need a lower threshold (0.80) because nicknames and abbreviations create legitimate variation between duplicates. Phone numbers should have a higher threshold (0.95) because they’re either the same number or they’re not — there’s no “kind of the same phone number.”

Consider the cost of each error type. In a marketing database, false positives (merging two different people into one record) cause you to lose a contact. False negatives (failing to merge duplicates) cause you to send duplicate communications. Which is more costly for your use case? If losing contacts is worse, set a lower threshold and manually review flagged pairs. If duplicate outreach is worse, set a higher threshold and accept that some duplicates will survive.
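Given a labeled sample, the precision/recall tradeoff can be measured directly. A sketch, assuming pair scores keyed by record-ID tuples and a ground-truth set of known duplicate pairs:

```python
def sweep_thresholds(scored_pairs, true_pairs, thresholds):
    """Precision/recall at each candidate threshold.
    scored_pairs: {(id_a, id_b): similarity}; true_pairs: set of known-duplicate pairs."""
    results = {}
    for t in thresholds:
        flagged = {p for p, s in scored_pairs.items() if s >= t}
        tp = len(flagged & true_pairs)
        precision = tp / len(flagged) if flagged else 1.0
        recall = tp / len(true_pairs) if true_pairs else 1.0
        results[t] = (precision, recall)
    return results
```

Running this over a grid of thresholds on a few hundred labeled pairs produces exactly the kind of table shown below, tuned to your own data.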

Threshold tuning: precision and recall tradeoff on a 25,000-row dataset
Threshold   Pairs found   True positives   False positives   Precision   Recall
0.70        2,840         1,380            1,460             48.6%       92.0%
0.75        2,210         1,340            870               60.6%       89.3%
0.80        1,720         1,280            440               74.4%       85.3%
0.85        1,350         1,190            160               88.1%       79.3%
0.90        1,050         1,010            40                96.2%       67.3%
0.95        680           670              10                98.5%       44.7%

Based on a dataset with 1,500 true duplicate pairs. Precision = true positives / pairs found. Recall = true positives / total true duplicates.

The table illustrates the fundamental tradeoff. At 0.70, you catch 92% of duplicates but nearly half of your flagged pairs are wrong — you’d spend more time reviewing false positives than you’d save by catching duplicates. At 0.95, almost every flagged pair is correct, but you miss more than half of the actual duplicates. The 0.85 row is often the sweet spot: 88% precision with 79% recall means manageable review burden and meaningful duplicate reduction.

Handling the duplicates you find: merge, keep-best, or flag

Finding duplicates is half the problem. Deciding what to do with them is the other half.

Merge strategy: combine the best data from each duplicate into a single surviving record. Take the most complete name, the most recent email, the phone number that’s most likely current. This produces the highest-quality output but requires rules for which value wins when duplicates disagree.

Keep-best strategy: pick the most complete or most recent duplicate and discard the others. Simpler than merging, but you lose data that only exists in the discarded records. A secondary record might have a phone number that the primary record lacks.

Flag-for-review strategy: mark duplicate pairs in a new column without removing anything. A human reviewer makes the final call on each pair. This is the safest approach for sensitive data (medical records, financial accounts) where an incorrect merge has serious consequences.

In practice, most teams use a hybrid: auto-merge high-confidence pairs (scores above 0.95), flag medium-confidence pairs (0.85-0.95) for review, and leave low-confidence pairs (below 0.85) as separate records. This directs human attention where it matters most — the ambiguous cases — while automating the clear-cut ones.
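The hybrid routing rule is simple enough to state as code; the 0.95/0.85 cutoffs are the defaults discussed above and should be tuned per dataset:

```python
def triage(score: float, auto_merge: float = 0.95, review: float = 0.85) -> str:
    """Route a scored pair: merge automatically, queue for a human, or keep separate."""
    if score >= auto_merge:
        return "auto-merge"
    if score >= review:
        return "flag-for-review"
    return "keep-separate"
```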

Deduplication at scale: when your file has 100K+ rows

A file with 100,000 rows contains 4,999,950,000 possible pairs. Comparing every row against every other row is computationally infeasible — even at 10,000 comparisons per second, it would take nearly six days.

The solution is blocking: dividing your file into groups of records that are likely to contain duplicates, then only comparing within each group.

Common blocking keys:

  • First three characters of last name (all Smiths compared together, not against all Johnsons)
  • ZIP code or first three digits of ZIP
  • First letter of name plus state
  • Phone area code
  • Domain of email address

A good blocking key reduces comparisons by 95-99% while keeping virtually all true duplicate pairs in the same block. The tradeoff is that duplicates split across blocks — a Smith in one block and a Smyth in another — won’t be compared. Multi-pass blocking with different keys addresses this: first pass blocks on name prefix, second pass blocks on ZIP code, third pass blocks on phone prefix. Pairs caught by any pass are compared.
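A single blocking pass can be sketched as a group-by on the key followed by within-block pairing (key_fn here is whatever blocking key you choose, such as the first three characters of the last name):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(rows, key_fn):
    """Group rows by a blocking key; candidate pairs come only from within a block."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[key_fn(row)].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)

rows = [{"last": "Smith"}, {"last": "Smithers"}, {"last": "Johnson"}]
list(block_pairs(rows, lambda r: r["last"][:3].lower()))  # [(0, 1)]
```

Multi-pass blocking is just running this with several different key_fn choices and taking the union of the candidate pairs.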

Sorted-neighborhood approach is an alternative to blocking. Sort the file by a key field (like a phonetic encoding of the name), then compare each record only against its nearest neighbors in the sorted order — typically a window of 5-20 records. This works well when duplicates cluster naturally in sorted order and avoids the hard boundaries of blocking.
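The sorted-neighborhood pass can be sketched as a sort plus a sliding window (a phonetic encoding such as Soundex would typically serve as the sort key; a plain string sort is used here for illustration):

```python
def sorted_neighborhood_pairs(rows, key_fn, window=5):
    """Sort rows by key, then pair each record with its next `window` neighbors."""
    order = sorted(range(len(rows)), key=lambda i: key_fn(rows[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            yield (i, j)
```

Unlike blocking, every record is compared with its nearest neighbors regardless of which side of a block boundary it falls on.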

Iterative deduplication handles the case where a single entity appears more than twice. After the first pass identifies pairs, you build clusters: if A matches B and B matches C, then A, B, and C are all duplicates even if A and C don’t directly match. Transitive closure on the match graph produces complete duplicate clusters rather than isolated pairs.
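Transitive closure over matched pairs is a classic union-find problem; a compact sketch:

```python
def duplicate_clusters(pairs):
    """Union-find closure: A~B and B~C puts A, B, and C in one cluster."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

duplicate_clusters([("A", "B"), ("B", "C"), ("D", "E")])  # clusters {A,B,C} and {D,E}
```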

Match Data Studio’s pipeline handles all of this automatically. Stage 1 applies blocking to generate candidate pairs. Stage 2 computes embeddings only for rows that survived blocking (saving AI costs). Stage 3 runs similarity comparison and LLM confirmation on the candidate pairs. You configure the matching rules; the pipeline handles the scale.


Deduplicating a CSV file properly requires more than Excel’s Remove Duplicates. It requires fuzzy matching, threshold tuning, blocking strategies, and a plan for handling the duplicates you find. Match Data Studio automates the entire workflow — from field-level cleaning through AI-powered duplicate detection to exportable results. Start deduplicating your data now →
