Matching two CSV files is one of the most common data tasks in any analytics or engineering workflow. You have a list of customers in one file and a list of leads in another. Or product records from two suppliers. Or employee data from an HRIS export and a payroll system dump. The goal is the same: figure out which rows in file A correspond to which rows in file B.

Python is the natural first choice. The ecosystem is rich, the libraries are well-documented, and most data professionals already have pandas installed. But as matching complexity grows, the gap between what a script can handle and what the job actually requires widens faster than most people expect.

This post walks through the Python approach at three levels of sophistication, then gives you a framework for deciding when to stop building and start using a purpose-built tool.

The basic approach: pandas merge on a shared key

If both CSV files share a clean, unique key column — an email address, a product SKU, a customer ID — then pd.merge() is all you need.

import pandas as pd

df1 = pd.read_csv("customers.csv")
df2 = pd.read_csv("leads.csv")

matched = pd.merge(df1, df2, on="email", how="inner")

The how parameter controls the join type: inner returns only rows that match in both files, left keeps all rows from the first file, outer keeps everything. You can merge on multiple columns using on=["email", "last_name"] and handle column name mismatches with left_on and right_on.
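
These options are easiest to see on a toy pair of frames (the frame and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSVs; the key column is named differently in each.
customers = pd.DataFrame({"email": ["a@x.com", "b@x.com"], "name": ["Ann", "Bob"]})
leads = pd.DataFrame({"contact_email": ["b@x.com", "c@x.com"], "source": ["web", "ads"]})

# left_on/right_on bridge the differing column names. indicator=True adds a
# _merge column telling you whether each row came from the left file, the
# right file, or both, which gives a quick audit of how complete the match is.
merged = pd.merge(
    customers, leads,
    left_on="email", right_on="contact_email",
    how="outer", indicator=True,
)
print(merged["_merge"].value_counts())
```

The `_merge` counts are worth checking first: a large `left_only` bucket means the exact join is missing matches.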

This approach is fast, simple, and works perfectly when the data is clean. In practice, it rarely is. The email column in file A has john.smith@company.com while file B has JOHN.SMITH@COMPANY.COM. Or one file uses Johnson & Associates while the other uses Johnson and Associates Inc. Exact merge silently misses these — no error, no warning, just a smaller result set than expected.

Exact merge is step zero, not the solution. It handles the easy cases and tells you how much of your data is clean enough to match exactly. Everything it misses needs a different approach.
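
A cheap way to rescue many of those near-misses before reaching for fuzzy matching is to normalize the key column first. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

df1 = pd.DataFrame({"email": ["John.Smith@Company.com "], "plan": ["pro"]})
df2 = pd.DataFrame({"email": ["john.smith@company.com"], "stage": ["demo"]})

# Lowercase and strip whitespace so trivial formatting differences
# stop blocking the exact join.
for df in (df1, df2):
    df["email_key"] = df["email"].str.strip().str.lower()

matched = pd.merge(df1, df2, on="email_key", how="inner")
print(len(matched))  # 1 match instead of 0
```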

Adding fuzzy matching with fuzzywuzzy and rapidfuzz

When keys don’t match exactly, fuzzy string matching compares strings by similarity rather than equality. The fuzzywuzzy library (and rapidfuzz, a much faster C++ reimplementation with a near-identical API) provides several comparison functions out of the box.

from rapidfuzz import fuzz, process, utils

# Compare two strings directly
score = fuzz.ratio("Johnson Associates", "Johnson & Associates Inc")
# Returns ~86

# Find best match for a string within a list
best = process.extractOne(
    "Johnson Associates",
    ["Johnson & Associates Inc", "Jackson Corp", "Johnston LLC"],
    scorer=fuzz.token_sort_ratio,
    processor=utils.default_process  # lowercase and strip punctuation before scoring
)
# Returns ("Johnson & Associates Inc", 90.0, 0)

The token_sort_ratio scorer splits strings into tokens, sorts them alphabetically, and then compares — which makes it order-independent. token_set_ratio goes further by comparing the intersection and remainders separately, handling cases where one string has extra words.

For a full CSV matching job, you typically loop through one file and find the best match in the other:

choices = df2["company_name"].tolist()  # build the candidate list once, not per row

results = []
for _, row in df1.iterrows():
    match = process.extractOne(
        row["company_name"],
        choices,
        scorer=fuzz.token_sort_ratio,
        score_cutoff=75
    )
    if match:
        results.append({
            "source": row["company_name"],
            "matched": match[0],
            "score": match[1]
        })

This works. For a few hundred rows, it even works quickly. But there are three problems you will hit almost immediately.

First, speed. extractOne with default settings compares the query against every candidate. With 5,000 rows in each file, that is 25 million string comparisons. The rapidfuzz library is significantly faster than fuzzywuzzy (often 10-50x), but even at microseconds per comparison, large datasets grind to a halt.

Second, multi-field matching. Real matching usually requires comparing more than one field — name plus city, or name plus phone number. Fuzzy matching libraries operate on single strings. Combining scores across fields means writing your own weighted scoring logic, deciding how to handle missing values, and testing edge cases.

Third, threshold tuning. A score_cutoff of 75 is a guess. Too low and you get false positives. Too high and you miss valid matches. The right threshold depends on your data, your field types, and how dirty the formatting is. There is no way to know without running experiments and inspecting results.

Scaling up: the recordlinkage library for blocking and classification

The recordlinkage library was built specifically for the full matching pipeline: indexing (generating candidate pairs), comparing, and classifying.

import recordlinkage

indexer = recordlinkage.Index()
indexer.block("state")  # Only compare records in the same state
candidates = indexer.index(df1, df2)

compare = recordlinkage.Compare()
compare.string("name", "name", method="jarowinkler", threshold=0.85)
compare.string("city", "city", method="levenshtein", threshold=0.80)
compare.exact("zip_code", "zip_code")

features = compare.compute(candidates, df1, df2)

The blocking step is the key improvement. Instead of comparing every record against every other record, you only compare records that share a blocking key — in this case, the same state. This eliminates the vast majority of comparisons and makes the problem tractable for larger datasets.

After computing features, you can classify matches using a threshold on the summed scores, or use the library’s built-in classifiers (K-means, logistic regression, SVM) trained on labeled examples.
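
With binary comparison outputs, the simplest classifier is a vote count over the feature columns. A self-contained sketch using a toy frame shaped like compute()’s output, which is a MultiIndex of candidate pairs with one 0/1 column per comparison:

```python
import pandas as pd

# Toy stand-in for compare.compute()'s output: rows are candidate pairs,
# columns are the name/city/zip comparisons, values are 0 or 1.
features = pd.DataFrame(
    {"name": [1, 1, 0], "city": [1, 0, 0], "zip_code": [1, 0, 1]},
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 2), (2, 1)]),
)

# Declare a match when at least two of the three comparisons agree.
matches = features[features.sum(axis=1) >= 2]
print(matches.index.tolist())  # [(0, 0)]
```

The library’s built-in classifiers replace this fixed threshold with a model fitted on the feature vectors.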

recordlinkage is a serious tool. It handles the mechanics well. But it still requires significant effort to use effectively: you choose the blocking keys, the comparison methods, the thresholds, and the classifier. You write the data preprocessing, handle missing values, and build the evaluation pipeline. For a one-off matching job, this can mean days of development before you see your first results.

Where DIY matching hits a wall (the complexity cliff)

The progression from pd.merge to rapidfuzz to recordlinkage is a progression of increasing capability and increasing code complexity. At each level, you solve new problems but introduce new engineering requirements.

Complexity progression in Python CSV matching
Approach Lines of Code Handles Typos Handles Scale Iteration Speed Non-technical Users
pandas merge 5–10 No Yes (exact only) Minutes No
rapidfuzz loop 30–60 Single field < 5K rows Hours No
rapidfuzz multi-field 100–200 Multiple fields < 5K rows Hours No
recordlinkage pipeline 200–500 Multiple fields < 50K rows Days No
recordlinkage + embeddings 500–1,000+ Semantic < 100K rows Weeks No

Lines of code include preprocessing, scoring, evaluation, and output formatting — not just the matching step.

The complexity cliff appears between rows three and four. Adding blocking, multi-algorithm comparison, classification, and evaluation turns a script into a project. Adding AI embeddings or LLM confirmation turns a project into an infrastructure challenge — you need API keys, rate limiting, caching, error handling, and cost management.

And there is a subtler problem: iteration cost. In a script, changing the matching logic means editing code, re-running, and inspecting output. Each iteration takes minutes to hours depending on dataset size. In a production matching job, you often need 5-10 iterations to get thresholds and field weights right. If each iteration requires code changes and a full re-run, the feedback loop is painfully slow.

Cost comparison: custom code vs dedicated matching tool

The real cost of DIY matching is not the API bill — it is the engineering time. Consider a typical matching project: 20,000 rows per dataset, company names with formatting differences, addresses that need normalization, and a requirement for high precision.

Total cost for a 20K × 20K company matching project

| Approach | Effort profile | Total time |
| --- | --- | --- |
| pandas merge (exact only) | Fast but misses ~40% of matches | 2 hours |
| rapidfuzz script | 2 days dev + iteration | 16 hours |
| recordlinkage pipeline | 5 days dev + tuning | 40 hours |
| Custom + embeddings | 10 days dev + API costs | 80 hours |
| Dedicated matching tool | 30 min setup + credits | 4 hours |

Estimates include development, testing, iteration, and manual review time. Dedicated tool assumes a pay-per-use model.

The pandas merge approach is cheapest in time but produces incomplete results. The recordlinkage pipeline produces good results but requires a week of work. The dedicated tool sits in a different category entirely — the complexity is handled by the platform, so you spend your time on configuration and validation rather than engineering.

This is not a knock against Python. For one-off exploratory matching on small datasets, a script is the right answer. But when matching is a recurring task, when the datasets are large, or when the results need to be reliable enough for production use, the economics shift.

A decision framework: build vs use a tool

The build-vs-buy decision depends on five factors. Score each one for your situation.

Dataset size. Under 1,000 rows per side, a script is fine. Between 1,000 and 10,000, you need blocking and some optimization. Above 10,000, you need a real pipeline with pre-filtering, batched processing, and checkpointing.

Match complexity. Exact key matching is trivial. Single-field fuzzy matching is manageable. Multi-field matching with different algorithms per field, weighted scoring, and threshold tuning is where code complexity compounds.

Iteration frequency. If you match once and move on, the upfront development cost is amortized over a single run. If you match monthly — new customer lists, updated vendor data, quarterly reconciliation — every code change costs you again. A configurable tool lets you adjust parameters without touching code.

Team composition. If the person who needs the matching results is also a Python developer, scripting is natural. If the matching requester is an analyst, a product manager, or a finance team member, they need a GUI they can operate independently.

AI requirements. If your data has semantic equivalences — “IBM” vs “International Business Machines”, “laptop” vs “notebook computer” — string algorithms will not catch them. You need embeddings, which means API integration, vector storage, and cosine similarity computation. If borderline cases need reasoning about context, you need LLM confirmation, which adds another layer of infrastructure.

Score yourself honestly on each factor. If three or more point toward “use a tool,” the development time you save will far exceed the tool’s cost.

One more consideration: maintenance. A matching script that works today breaks when the data format changes, when a library updates its API, or when a new edge case appears. A maintained tool handles these changes for you.


Match Data Studio handles the full pipeline — from pre-filtering and fuzzy matching to AI embeddings and LLM confirmation — without writing code. Upload your CSVs, describe what you are matching, and the AI assistant configures the right combination of algorithms for your data. Get started free →

