How to merge two spreadsheets when the data doesn't match exactly
VLOOKUP and INDEX-MATCH break when data is messy. Learn how to merge two spreadsheets with fuzzy matching, AI-powered joins, and practical validation steps.
You have two spreadsheets that contain overlapping records. One is a customer export from your CRM. The other is a purchase history from your e-commerce platform. You need to combine them into a single view where each customer row includes their purchase data.
The obvious approach is VLOOKUP or INDEX-MATCH on a shared key like email or customer ID. You write the formula, press Enter, and get a wall of #N/A errors. Not because the data doesn’t overlap — you know many of these customers exist in both files — but because the keys don’t match character-for-character.
This is the most common data task that spreadsheet formulas cannot reliably solve. And for most data professionals, it’s a recurring problem: vendor lists against internal catalogs, lead lists against CRM exports, survey responses against enrollment records. The data should join. It just doesn’t.
Why VLOOKUP and INDEX-MATCH fail on real data
VLOOKUP and INDEX-MATCH are designed for exact lookups. They compare the lookup value against the lookup array byte-for-byte. If even a single character differs — a trailing space, a period after an abbreviation, a different capitalization pattern — the formula returns no match.
This works fine on clean, system-generated data where IDs are consistent. It falls apart on data entered or modified by humans, exported from different systems, or collected at different times.
Consider a simple example: merging a CRM export with a vendor contact list on company name.
| CRM export | Vendor list | Exact match? |
|---|---|---|
| Acme Corporation | Acme Corp. | No |
| Johnson & Johnson | Johnson and Johnson | No |
| AT&T Inc. | AT&T | No |
| The Walt Disney Company | Walt Disney Co | No |
| JPMorgan Chase & Co. | JP Morgan Chase | No |
| 3M Company | 3M | No |
| McDonald's Corporation | McDonalds Corp | No |
| Procter & Gamble Co. | Procter and Gamble | No |
Eight companies, eight matches that any human reader spots instantly, and zero matches for VLOOKUP. The formula doesn’t know that Corp. and Corporation mean the same thing, that & and and are interchangeable, or that a trailing Inc. is irrelevant for identification purposes.
You could try to fix this with helper columns that strip punctuation, remove suffixes, and normalize abbreviations. Some people build elaborate nested SUBSTITUTE formulas. But each fix is specific to one pattern: you catch Corp. vs Corporation and then miss LLC vs L.L.C.; you handle & vs and and then discover records where one file uses + instead. The whack-a-mole never ends because real-world data variation is combinatorial.
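The whack-a-mole is easy to see in a quick sketch. The normalizer below (a minimal illustration; the suffix list and replacement rules are my own, not an exhaustive set) fixes several of the pairs from the table above, yet still misses JPMorgan Chase vs JP Morgan Chase, because the variation there is spacing inside the name itself:

```python
import re

# Illustrative suffix list -- real data will need more (Ltd, GmbH, PLC, ...).
SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}

def normalize(name: str) -> str:
    """Lowercase, expand '&', strip punctuation, and drop legal suffixes."""
    s = name.lower().replace("&", " and ")
    s = re.sub(r"[^a-z0-9 ]", " ", s)  # punctuation -> spaces
    tokens = [t for t in s.split() if t not in SUFFIXES]
    return " ".join(tokens)

pairs = [
    ("Acme Corporation", "Acme Corp."),
    ("Johnson & Johnson", "Johnson and Johnson"),
    ("JPMorgan Chase & Co.", "JP Morgan Chase"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {normalize(a) == normalize(b)}")
```

The first two pairs now match; the third still fails because "jpmorgan" and "jp morgan" tokenize differently. Every rule you add exposes a variation you haven't handled yet.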
The three types of mismatches that break spreadsheet joins
Not all mismatches are the same. Understanding the categories helps you choose the right fix.
Formatting mismatches are surface-level differences in how the same value was recorded. Case differences (SMITH vs Smith), punctuation differences (St. vs Street), whitespace differences (leading spaces, double spaces), and encoding differences (curly quotes vs straight quotes). These are the easiest to fix because the underlying value is identical — only the representation differs.
Semantic mismatches are different representations of the same meaning. Robert vs Bob. United States vs US vs USA. Apt 4B vs Unit 4B vs #4B. These require knowledge about what the values mean, not just what characters they contain. No string manipulation formula can reliably resolve these without a lookup table or language model.
Structural mismatches are differences in how information is organized across columns. One file has First Name and Last Name as separate columns; the other has a single Full Name field. One file stores city, state, and ZIP in three columns; the other concatenates them into one address string. These require parsing and restructuring before any matching can happen.
Most real-world merge failures involve all three types simultaneously. A typical unmatched record might have a formatting difference in the name field, a semantic difference in the address field, and a structural difference in how phone numbers are stored.
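Structural mismatches have to be resolved by parsing before any matching can happen. A minimal sketch for the name case (it handles only the two layouts mentioned above; real name parsing needs more cases, such as suffixes like Jr. or multi-word surnames):

```python
def split_name(full: str) -> tuple[str, str]:
    """Split a full-name field into (first, last).

    Handles only 'Last, First' and 'First [Middle] Last' layouts.
    """
    if "," in full:
        last, first = (p.strip() for p in full.split(",", 1))
        first = first.split()[0]  # drop middle names after the comma
    else:
        parts = full.split()
        first, last = parts[0], parts[-1]
    return first, last

split_name("Smith, John")    # ('John', 'Smith')
split_name("John R. Smith")  # ('John', 'Smith')
```

Once both files are parsed into the same column layout, the formatting and semantic fixes can be applied per field.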
Cleaning your data before merging (the 80% fix)
Before reaching for advanced tools, apply basic cleaning to both spreadsheets. This is the single highest-impact step in any merge workflow, and it resolves the majority of formatting mismatches.
Lowercase everything. =LOWER(A2) in a helper column eliminates all case-related mismatches. This alone recovers 10-15% of failed joins in typical datasets.
Trim whitespace. =TRIM(A2) removes leading, trailing, and excess internal spaces. Invisible whitespace is one of the most common causes of #N/A errors that feel inexplicable — the values look identical in the cell but don’t match.
Standardize punctuation. Remove periods from abbreviations, normalize ampersands, strip quotes and special characters. =SUBSTITUTE(SUBSTITUTE(A2,".",""),"&","and") handles two common patterns, but you’ll need to chain more for thorough cleaning.
Normalize formats. Convert phone numbers to digits only. Convert dates to a single format. Expand or remove common suffixes like Inc., LLC, Corp., Co., and Ltd.
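Outside the spreadsheet, the same steps translate to a few lines of Python. This is a sketch; the US-centric country-code rule in `clean_phone` is an assumption you may need to change for your data:

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase, trim, collapse whitespace, and normalize punctuation."""
    s = re.sub(r"\s+", " ", raw.lower().strip())
    return s.replace(".", "").replace("&", "and")

def clean_phone(raw: str) -> str:
    """Reduce a phone number to digits, dropping a leading US country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

clean_text("  ACME  Corp. ")        # 'acme corp'
clean_phone("+1 (415) 555-0134")    # '4155550134'
```

Applying these helpers to both files before matching plays the same role as the helper-column formulas above, but the rules live in one place instead of being chained inside nested SUBSTITUTE calls.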
After cleaning both files with these steps, try your VLOOKUP again. You’ll often find that 60-80% of the previously unmatched records now join successfully.
The cleaning steps are free and fast, and they don’t require specialized software. But they only shrink the unmatched residual; they don’t eliminate it. The records that survive cleaning are where you need fuzzy matching, and now fuzzy matching only has to handle the genuinely hard cases, not the formatting noise.
When you need fuzzy matching instead of exact joins
After cleaning, the records that still don’t match fall into categories that string manipulation can’t solve:
Nicknames and abbreviations. Bob vs Robert, Liz vs Elizabeth, Intl vs International. These require knowledge of common equivalences.
Typos and data entry errors. Jhon vs John, Sanfrancisco vs San Francisco. These are one or two characters off but unpredictable in where the error occurs.
Reordered components. Smith, John vs John Smith. 100 Main St, New York vs New York, 100 Main St. The same information, rearranged.
Missing or extra tokens. John Smith vs John R. Smith. Acme vs Acme Industries Inc. One record has more detail than the other.
Fuzzy matching algorithms handle these by scoring how similar two strings are rather than checking for exact equality. Plain Levenshtein distance counts Jhon Smith vs John Smith as two single-character edits in ten characters (80% similar); a Damerau-Levenshtein variant that treats a transposition as one edit scores it 90%. Character-based metrics like Jaro-Winkler score Robert Johnson against Bob Johnson poorly, but a token-based approach can still recognize the shared surname.
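To make the scoring concrete, here is a minimal pure-Python Levenshtein implementation (a sketch; production code would use an optimized library such as rapidfuzz):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (dynamic programming, O(len*len))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalized to a 0..1 similarity score."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

similarity("Jhon Smith", "John Smith")  # 0.8: the transposition costs two edits
```

Note that plain Levenshtein charges the Jhon/John transposition as two substitutions; metrics designed for typos (Damerau-Levenshtein, Jaro-Winkler) score it higher, which is why real matching tools combine several algorithms.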
AI-powered matching goes further. Embedding models convert text into numerical vectors that capture meaning, not just characters. In embedding space, Bob and Robert are close together because they appear in similar contexts across millions of documents. 123 Main St and 123 Main Street are nearly identical vectors. This semantic understanding is what lets AI matching handle the cases that even sophisticated string algorithms miss.
Step-by-step: merging two CSVs with AI-powered matching
Here’s a practical workflow for merging two spreadsheets when cleaning alone isn’t enough.
1. Export both files as CSV. Most matching tools work with CSV, not native Excel formats. Save both spreadsheets as .csv files. Check that special characters (accented letters, currency symbols) survived the export.
2. Identify your matching fields. Which columns should be compared? Ideally, pick two to four fields that together uniquely identify a record. Company name alone has too many false positives. Company name plus city plus phone number is much more discriminating.
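Combining several fields into one score can be as simple as a weighted average of per-field similarities. A sketch using the standard library's difflib (the field names and weights here are illustrative, not a recommendation):

```python
from difflib import SequenceMatcher

# Hypothetical field weights -- tune per dataset.
WEIGHTS = {"company": 0.5, "city": 0.2, "phone": 0.3}

def field_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average similarity over the configured matching fields."""
    return sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())
```

A company-name match alone might score 1.0 on a false positive; requiring city and phone to agree as well pulls ambiguous pairs below the threshold.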
3. Upload and configure. In Match Data Studio, create a project and upload both CSVs. The AI assistant analyzes your columns and suggests a matching configuration — which fields to compare, what similarity algorithms to use per field, and what thresholds to set.
4. Run a sample. Before processing the full dataset, run a sample on 100 rows. Review the matched pairs: are the top matches correct? Are obvious pairs present? Are there false positives?
5. Adjust and iterate. If the sample shows problems — too many false matches, missing obvious pairs — adjust the configuration. Raise thresholds if you’re getting false positives. Add more matching fields if common values are causing confusion. Lower thresholds if you’re missing close matches.
6. Run the full merge. Once the sample looks right, run the full dataset. The pipeline applies blocking to keep the comparison space manageable, runs your configured similarity algorithms, and outputs a merged file with both sides joined and confidence scores for each pair.
7. Export and review. Download the results CSV. Sort by confidence score to review borderline matches first — these are the ones most likely to need human judgment.
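The blocking step in the pipeline (step 6) is worth understanding: comparing every row of file A against every row of file B grows as the product of the two file sizes, so records are first grouped by a coarse key and only compared within groups. A simplified sketch (the choice of key here is illustrative):

```python
from collections import defaultdict

def block_key(record: dict) -> str:
    """Coarse grouping key: first three characters of the cleaned name.
    Too coarse wastes comparisons; too fine silently drops true matches."""
    return record["company"].lower().replace(" ", "")[:3]

def candidate_pairs(file_a: list[dict], file_b: list[dict]):
    """Yield only the (a, b) pairs that share a blocking key."""
    buckets = defaultdict(list)
    for rec in file_b:
        buckets[block_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in buckets.get(block_key(rec_a), []):
            yield rec_a, rec_b
```

With 5,000 rows in each file, exhaustive comparison is 25 million pairs; blocking typically cuts that by orders of magnitude, at the cost of missing matches whose keys disagree.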
How to validate your merged results
A merged spreadsheet is only useful if the matches are correct. Validation is not optional.
Spot-check the high-confidence matches. Pick 20-30 pairs with scores above 0.90. These should be obviously correct. If you find errors here, something is fundamentally misconfigured.
Focus on the boundary. The matches most likely to be wrong are those near your threshold. If your threshold is 0.80, review the pairs scoring 0.80-0.85 carefully. This is where the precision/recall tradeoff lives.
Check the unmatched residual. Look at the records that didn’t match anything. Are there obvious pairs that were missed? If so, your threshold may be too high or you may need to add matching fields.
Measure your match rate. What percentage of records from each file found a match? If File A has 5,000 rows and File B has 3,000 rows, and you know the true overlap is roughly 2,500 records, your match count should be in that range.
| Check | What to look for | Action if problem found |
|---|---|---|
| High-confidence pairs (>0.90) | All pairs should be correct matches | Review matching field configuration |
| Boundary pairs (0.80-0.85) | Most should be correct; some false positives expected | Raise threshold if too many false positives |
| Unmatched records | No obvious missed matches | Lower threshold or add matching fields |
| Match count vs expected overlap | Should be within 10-15% of known overlap | Investigate systematic misses |
| Duplicate matches | Each record should match at most one partner | Enable one-to-one constraint or review manually |
| Field-level scores | Individual field scores should make sense | Check for uncleaned formatting issues |
Spend 15-20 minutes on validation before using merged data downstream. This prevents compounding errors in reports and analyses.
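The quantitative checks from the table can be scripted against the exported results. A sketch (the `score` column name and the 0.05 boundary band are assumptions to adjust for your export):

```python
def validation_summary(matches: list[dict], total_records: int,
                       threshold: float = 0.80) -> dict:
    """Summarize a merge result: overall match rate, high-confidence count,
    and the boundary pairs that deserve manual review first."""
    scores = [m["score"] for m in matches]
    return {
        "match_rate": len(matches) / total_records,
        "high_confidence": sum(s >= 0.90 for s in scores),
        "boundary_for_review": sum(threshold <= s < threshold + 0.05
                                   for s in scores),
    }
```

Running this after every merge gives you the score distribution at a glance, which is exactly the intuition the next paragraph describes building.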
Validation isn’t just about correctness today — it builds your intuition for threshold tuning on future projects. Once you’ve validated three or four merge jobs, you’ll develop a sense for what a good score distribution looks like and where to set thresholds for different data types.
What to do with the records that don’t merge
Even with fuzzy matching and AI, some records won’t find a partner. This is expected and informative.
Genuinely unique records exist in one file but not the other. A customer who never made a purchase. A transaction from a customer not in your CRM. These are valid non-matches.
Records that should match but didn’t usually have a specific, fixable cause. Maybe the matching fields you chose don’t overlap for those records. Maybe there’s a data quality issue unique to a subset of records (a batch import with truncated names, a region where addresses follow a different format). Investigating a sample of these non-matches often reveals patterns you can address with additional cleaning or configuration adjustments.
Ambiguous records are cases where a record could plausibly match multiple partners or where the similarity scores are too low to confidently choose. Flag these for manual review rather than forcing a match or discarding them.
The goal is not 100% automated merge rate. The goal is automating the easy 85-95% so your human review time is spent on the genuinely ambiguous cases, not on formatting differences that a machine should handle.
Merging spreadsheets with messy data doesn’t require writing complex formulas or manual row-by-row comparison. Match Data Studio automates the cleaning, fuzzy matching, and validation steps so you get a reliable merged file in minutes. Start your first merge project free →
Keep reading
- Getting started with CSV matching — a step-by-step walkthrough of your first matching job from upload to download
- Data cleaning before matching — the prep steps that resolve most matching failures before you even start
- Five matching mistakes that silently ruin your results — the configuration errors that produce plausible but wrong output