How to match two CSV files in Python (and when to use a tool instead)
From pandas merge to fuzzywuzzy to recordlinkage — a practical guide to CSV matching in Python, plus a framework for deciding when custom code stops making sense.
Matching two CSV files is one of the most common data tasks in any analytics or engineering workflow. You have a list of customers in one file and a list of leads in another. Or product records from two suppliers. Or employee data from an HRIS export and a payroll system dump. The goal is the same: figure out which rows in file A correspond to which rows in file B.
Python is the natural first choice. The ecosystem is rich, the libraries are well-documented, and most data professionals already have pandas installed. But as matching complexity grows, the gap between what a script can handle and what the job actually requires widens faster than most people expect.
This post walks through the Python approach at three levels of sophistication, then gives you a framework for deciding when to stop building and start using a purpose-built tool.
The basic approach: pandas merge on a shared key
If both CSV files share a clean, unique key column — an email address, a product SKU, a customer ID — then pd.merge() is all you need.
import pandas as pd
df1 = pd.read_csv("customers.csv")
df2 = pd.read_csv("leads.csv")
matched = pd.merge(df1, df2, on="email", how="inner")
The how parameter controls the join type: inner returns only rows that match in both files, left keeps all rows from the first file, outer keeps everything. You can merge on multiple columns using on=["email", "last_name"] and handle column name mismatches with left_on and right_on.
This approach is fast, simple, and works perfectly when the data is clean. In practice, it rarely is. The email column in file A has john.smith@company.com while file B has JOHN.SMITH@COMPANY.COM. Or one file uses Johnson & Associates while the other uses Johnson and Associates Inc. Exact merge silently misses these — no error, no warning, just a smaller result set than expected.
Exact merge is step zero, not the solution. It handles the easy cases and tells you how much of your data is clean enough to match exactly. Everything it misses needs a different approach.
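A cheap first improvement is to normalize the key columns before merging, and to audit the result with indicator=True rather than trusting the row count. A minimal sketch with inline toy data (the column names are illustrative, not from any particular file):

```python
import pandas as pd

df1 = pd.DataFrame({"email": ["John.Smith@Company.com", "a@b.com"], "name": ["John", "Ana"]})
df2 = pd.DataFrame({"email": ["john.smith@company.com", "c@d.com"], "stage": ["MQL", "SQL"]})

# Normalize the key before merging: strip whitespace, lowercase
for df in (df1, df2):
    df["email_key"] = df["email"].str.strip().str.lower()

# indicator=True adds a _merge column recording where each row came from,
# which makes it easy to see how many rows failed to match and why
matched = pd.merge(df1, df2, on="email_key", how="outer", indicator=True)
print(matched["_merge"].value_counts())
```

Without the normalization step, the John.Smith / john.smith pair above would silently fail to match; the `_merge` column turns that silent loss into a number you can inspect.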
Adding fuzzy matching with fuzzywuzzy and rapidfuzz
When keys don’t match exactly, fuzzy string matching compares strings by similarity rather than equality. The fuzzywuzzy library (and its much faster, largely API-compatible C++ reimplementation, rapidfuzz) provides several comparison functions out of the box.
from rapidfuzz import fuzz, process
# Compare two strings directly
score = fuzz.ratio("Johnson Associates", "Johnson & Associates Inc")
# Returns ~86
# Find the best match for a string within a list
best = process.extractOne(
    "Johnson Associates",
    ["Johnson & Associates Inc", "Jackson Corp", "Johnston LLC"],
    scorer=fuzz.token_sort_ratio,
)
# Returns a (match, score, index) tuple, e.g. ("Johnson & Associates Inc", ~90, 0)
The token_sort_ratio scorer splits strings into tokens, sorts them alphabetically, and then compares — which makes it order-independent. token_set_ratio goes further by comparing the intersection and remainders separately, handling cases where one string has extra words.
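The effect of token sorting is easy to demonstrate with a stdlib-only sketch. Here difflib's SequenceMatcher stands in for rapidfuzz's scorer, so the exact numbers differ, but the order-independence it illustrates is the same:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Character-level similarity scaled to [0, 100], like fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Sort the words in each string before comparing, so word order is ignored."""
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(sort_tokens(a), sort_tokens(b))

a, b = "Associates Johnson", "Johnson Associates"
print(round(ratio(a, b)))             # well below 100: word order differs
print(round(token_sort_ratio(a, b)))  # 100: identical after sorting tokens
```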
For a full CSV matching job, you typically loop through one file and find the best match in the other:
results = []
for _, row in df1.iterrows():
    match = process.extractOne(
        row["company_name"],
        df2["company_name"].tolist(),
        scorer=fuzz.token_sort_ratio,
        score_cutoff=75,
    )
    if match:
        results.append({
            "source": row["company_name"],
            "matched": match[0],
            "score": match[1],
        })
This works. For a few hundred rows, it even works quickly. But there are three problems you will hit almost immediately.
First, speed. extractOne with default settings compares the query against every candidate. With 5,000 rows in each file, that is 25 million string comparisons. The rapidfuzz library is significantly faster than fuzzywuzzy (often 10-50x), but even at microseconds per comparison, large datasets grind to a halt.
Second, multi-field matching. Real matching usually requires comparing more than one field — name plus city, or name plus phone number. Fuzzy matching libraries operate on single strings. Combining scores across fields means writing your own weighted scoring logic, deciding how to handle missing values, and testing edge cases.
Third, threshold tuning. A score_cutoff of 75 is a guess. Too low and you get false positives. Too high and you miss valid matches. The right threshold depends on your data, your field types, and how dirty the formatting is. There is no way to know without running experiments and inspecting results.
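A small experiment makes the threshold problem concrete. With a handful of hand-labeled toy pairs (difflib standing in for a real scorer), sweeping the cutoff shows that no single value cleanly separates true from false matches: raising it trades false positives for false negatives.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hand-labeled pairs: (string_a, string_b, is_true_match). Note the third
# pair: two different firms with near-identical names score very high.
labeled = [
    ("Johnson & Associates Inc", "Johnson Associates", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Johnson Associates", "Johnston Associates", False),
    ("Jackson Corp", "Johnson Associates", False),
]

for cutoff in (60, 75, 90):
    predicted = [sim(a, b) >= cutoff for a, b, _ in labeled]
    tp = sum(p and t for p, (_, _, t) in zip(predicted, labeled))
    fp = sum(p and not t for p, (_, _, t) in zip(predicted, labeled))
    fn = sum(t and not p for p, (_, _, t) in zip(predicted, labeled))
    print(f"cutoff={cutoff}: true_pos={tp} false_pos={fp} false_neg={fn}")
```

On data like this, every cutoff leaves at least one error, which is why threshold tuning requires inspecting real results rather than picking a number up front.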
Scaling up: the recordlinkage library for blocking and classification
The recordlinkage library was built specifically for the full matching pipeline: indexing (generating candidate pairs), comparing, and classifying.
import recordlinkage
indexer = recordlinkage.Index()
indexer.block("state") # Only compare records in the same state
candidates = indexer.index(df1, df2)
compare = recordlinkage.Compare()
compare.string("name", "name", method="jarowinkler", threshold=0.85)
compare.string("city", "city", method="levenshtein", threshold=0.80)
compare.exact("zip_code", "zip_code")
features = compare.compute(candidates, df1, df2)
The blocking step is the key improvement. Instead of comparing every record against every other record, you only compare records that share a blocking key — in this case, the same state. This eliminates the vast majority of comparisons and makes the problem tractable for larger datasets.
After computing features, you can classify matches using a threshold on the summed scores, or use the library’s built-in classifiers (K-means, logistic regression, SVM) trained on labeled examples.
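For the simple threshold route, the classification step can be as short as this. The features DataFrame below is simulated toy data in the shape compare.compute() returns: a MultiIndex of candidate pairs, one column per comparison.

```python
import pandas as pd

# Simulated comparison output: one row per candidate pair, one column per
# comparison, values near 1 for agreement and 0 otherwise
features = pd.DataFrame(
    {"name": [1, 1, 0], "city": [1, 0, 0], "zip_code": [1, 1, 0]},
    index=pd.MultiIndex.from_tuples([(0, 10), (1, 11), (2, 12)]),
)

# Rule-based classifier: call a pair a match if at least two of the
# three comparisons agree
matches = features[features.sum(axis=1) >= 2]
print(matches.index.tolist())  # [(0, 10), (1, 11)]
```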
recordlinkage is a serious tool. It handles the mechanics well. But it still requires significant effort to use effectively: you choose the blocking keys, the comparison methods, the thresholds, and the classifier. You write the data preprocessing, handle missing values, and build the evaluation pipeline. For a one-off matching job, this can mean days of development before you see your first results.
Where DIY matching hits a wall (the complexity cliff)
The progression from pd.merge to rapidfuzz to recordlinkage is a progression of increasing capability and increasing code complexity. At each level, you solve new problems but introduce new engineering requirements.
| Approach | Lines of Code | Handles Typos | Handles Scale | Iteration Speed | Non-technical Users |
|---|---|---|---|---|---|
| pandas merge | 5–10 | No | Yes (exact only) | Minutes | No |
| rapidfuzz loop | 30–60 | Single field | < 5K rows | Hours | No |
| rapidfuzz multi-field | 100–200 | Multiple fields | < 5K rows | Hours | No |
| recordlinkage pipeline | 200–500 | Multiple fields | < 50K rows | Days | No |
| recordlinkage + embeddings | 500–1,000+ | Semantic | < 100K rows | Weeks | No |
Lines of code include preprocessing, scoring, evaluation, and output formatting — not just the matching step.
The complexity cliff appears between rows three and four. Adding blocking, multi-algorithm comparison, classification, and evaluation turns a script into a project. Adding AI embeddings or LLM confirmation turns a project into an infrastructure challenge — you need API keys, rate limiting, caching, error handling, and cost management.
And there is a subtler problem: iteration cost. In a script, changing the matching logic means editing code, re-running, and inspecting output. Each iteration takes minutes to hours depending on dataset size. In a production matching job, you often need 5-10 iterations to get thresholds and field weights right. If each iteration requires code changes and a full re-run, the feedback loop is painfully slow.
Cost comparison: custom code vs dedicated matching tool
The real cost of DIY matching is not the API bill — it is the engineering time. Consider a typical matching project: 20,000 rows per dataset, company names with formatting differences, addresses that need normalization, and a requirement for high precision.
The pandas merge approach is cheapest in time but produces incomplete results. The recordlinkage pipeline produces good results but requires a week of work. The dedicated tool sits in a different category entirely — the complexity is handled by the platform, so you spend your time on configuration and validation rather than engineering.
This is not a knock against Python. For one-off exploratory matching on small datasets, a script is the right answer. But when matching is a recurring task, when the datasets are large, or when the results need to be reliable enough for production use, the economics shift.
A decision framework: build vs use a tool
The build-vs-buy decision depends on five factors. Score each one for your situation.
Dataset size. Under 1,000 rows per side, a script is fine. Between 1,000 and 10,000, you need blocking and some optimization. Above 10,000, you need a real pipeline with pre-filtering, batched processing, and checkpointing.
Match complexity. Exact key matching is trivial. Single-field fuzzy matching is manageable. Multi-field matching with different algorithms per field, weighted scoring, and threshold tuning is where code complexity compounds.
Iteration frequency. If you match once and move on, the upfront development cost is amortized over a single run. If you match monthly — new customer lists, updated vendor data, quarterly reconciliation — every code change costs you again. A configurable tool lets you adjust parameters without touching code.
Team composition. If the person who needs the matching results is also a Python developer, scripting is natural. If the matching requester is an analyst, a product manager, or a finance team member, they need a GUI they can operate independently.
AI requirements. If your data has semantic equivalences — “IBM” vs “International Business Machines”, “laptop” vs “notebook computer” — string algorithms will not catch them. You need embeddings, which means API integration, vector storage, and cosine similarity computation. If borderline cases need reasoning about context, you need LLM confirmation, which adds another layer of infrastructure.
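The core of the embeddings route is cosine similarity over vectors returned by an embedding API. A stdlib-only sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values here are invented for illustration):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for embedding-API output: semantically equivalent
# names ("IBM" vs "International Business Machines") land close together
ibm_long = [0.81, 0.33, 0.11]
ibm_short = [0.80, 0.35, 0.10]
unrelated = [0.05, 0.20, 0.95]

print(round(cosine(ibm_long, ibm_short), 3))  # close to 1.0
print(round(cosine(ibm_long, unrelated), 3))  # much lower
```

The math is a few lines; the infrastructure around it, batching API calls, caching vectors, picking a similarity cutoff, is where the real cost sits.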
Score yourself honestly on each factor. If three or more point toward “use a tool,” the development time you save will far exceed the tool’s cost.
One more consideration: maintenance. A matching script that works today breaks when the data format changes, when a library updates its API, or when a new edge case appears. A maintained tool handles these changes for you.
Match Data Studio handles the full pipeline — from pre-filtering and fuzzy matching to AI embeddings and LLM confirmation — without writing code. Upload your CSVs, describe what you are matching, and the AI assistant configures the right combination of algorithms for your data. Get started free →
Keep reading
- Fuzzy matching algorithms explained — Levenshtein, Jaro-Winkler, and when to use each one
- How to choose the right matching algorithm — a decision framework for selecting the best approach for your field types
- Matching at scale: performance strategies — blocking, pre-filter cascades, and parallelism for large datasets