How to match two CSV files in Python (and when to use a tool instead)
From pandas merge to fuzzywuzzy to recordlinkage — a practical guide to CSV matching in Python, plus a framework for deciding when custom code stops making sense.
Matching two CSV files is one of the most common data tasks in any analytics or engineering workflow. You have a list of customers in one file and a list of leads in another. Or product records from two suppliers. Or employee data from an HRIS export and a payroll system dump. The goal is the same: figure out which rows in file A correspond to which rows in file B.
Python is the natural first choice. The ecosystem is rich, the libraries are well-documented, and most data professionals already have pandas installed. But as matching complexity grows, the gap between what a script can handle and what the job actually requires widens faster than most people expect.
This post walks through the Python approach at three levels of sophistication, then gives you a framework for deciding when to stop building and start using a purpose-built tool.
The basic approach: pandas merge on a shared key
If both CSV files share a clean, unique key column — an email address, a product SKU, a customer ID — then pd.merge() is all you need.
import pandas as pd
df1 = pd.read_csv("customers.csv")
df2 = pd.read_csv("leads.csv")
matched = pd.merge(df1, df2, on="email", how="inner")
The how parameter controls the join type: inner returns only rows that match in both files, left keeps all rows from the first file, outer keeps everything. You can merge on multiple columns using on=["email", "last_name"] and handle column name mismatches with left_on and right_on.
This approach is fast, simple, and works perfectly when the data is clean. In practice, it rarely is. The email column in file A has john.smith@company.com while file B has JOHN.SMITH@COMPANY.COM. Or one file uses Johnson & Associates while the other uses Johnson and Associates Inc. Exact merge silently misses these — no error, no warning, just a smaller result set than expected.
Exact merge is step zero, not the solution. It handles the easy cases and tells you how much of your data is clean enough to match exactly. Everything it misses needs a different approach.
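A cheap first improvement is to normalize the key columns before merging, and to audit the result with indicator=True rather than trusting the row count. A minimal sketch with inline toy data (the column names are illustrative, not from any particular file):

```python
import pandas as pd

df1 = pd.DataFrame({"email": ["John.Smith@Company.com", "a@b.com"], "name": ["John", "Ana"]})
df2 = pd.DataFrame({"email": ["john.smith@company.com", "c@d.com"], "stage": ["MQL", "SQL"]})

# Normalize the key before merging: strip whitespace, lowercase
for df in (df1, df2):
    df["email_key"] = df["email"].str.strip().str.lower()

# indicator=True adds a _merge column recording where each row came from,
# which makes it easy to see how many rows failed to match and why
matched = pd.merge(df1, df2, on="email_key", how="outer", indicator=True)
print(matched["_merge"].value_counts())
```

Without the normalization step, the John.Smith / john.smith pair above would silently fail to match; the `_merge` column turns that silent loss into a number you can inspect.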
Adding fuzzy matching with fuzzywuzzy and rapidfuzz
When keys don’t match exactly, fuzzy string matching compares strings by similarity rather than equality. The fuzzywuzzy library (and its much faster, largely API-compatible C++ reimplementation, rapidfuzz) provides several comparison functions out of the box.
from rapidfuzz import fuzz, process
# Compare two strings directly
score = fuzz.ratio("Johnson Associates", "Johnson & Associates Inc")
# Returns ~86
# Find the best match for a string within a list
best = process.extractOne(
    "Johnson Associates",
    ["Johnson & Associates Inc", "Jackson Corp", "Johnston LLC"],
    scorer=fuzz.token_sort_ratio,
)
# Returns a (match, score, index) tuple, e.g. ("Johnson & Associates Inc", ~90, 0)
The token_sort_ratio scorer splits strings into tokens, sorts them alphabetically, and then compares — which makes it order-independent. token_set_ratio goes further by comparing the intersection and remainders separately, handling cases where one string has extra words.
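The effect of token sorting is easy to demonstrate with a stdlib-only sketch. Here difflib's SequenceMatcher stands in for rapidfuzz's scorer, so the exact numbers differ, but the order-independence it illustrates is the same:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Character-level similarity scaled to [0, 100], like fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Sort the words in each string before comparing, so word order is ignored."""
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(sort_tokens(a), sort_tokens(b))

a, b = "Associates Johnson", "Johnson Associates"
print(round(ratio(a, b)))             # well below 100: word order differs
print(round(token_sort_ratio(a, b)))  # 100: identical after sorting tokens
```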
For a full CSV matching job, you typically loop through one file and find the best match in the other:
results = []
for _, row in df1.iterrows():
    match = process.extractOne(
        row["company_name"],
        df2["company_name"].tolist(),
        scorer=fuzz.token_sort_ratio,
        score_cutoff=75,
    )
    if match:
        results.append({
            "source": row["company_name"],
            "matched": match[0],
            "score": match[1],
        })
This works. For a few hundred rows, it even works quickly. But there are three problems you will hit almost immediately.
First, speed. extractOne with default settings compares the query against every candidate. With 5,000 rows in each file, that is 25 million string comparisons. The rapidfuzz library is significantly faster than fuzzywuzzy (often 10-50x), but even at microseconds per comparison, large datasets grind to a halt.
Second, multi-field matching. Real matching usually requires comparing more than one field — name plus city, or name plus phone number. Fuzzy matching libraries operate on single strings. Combining scores across fields means writing your own weighted scoring logic, deciding how to handle missing values, and testing edge cases.
Third, threshold tuning. A score_cutoff of 75 is a guess. Too low and you get false positives. Too high and you miss valid matches. The right threshold depends on your data, your field types, and how dirty the formatting is. There is no way to know without running experiments and inspecting results.
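A small experiment makes the threshold problem concrete. With a handful of hand-labeled toy pairs (difflib standing in for a real scorer), sweeping the cutoff shows that no single value cleanly separates true from false matches: raising it trades false positives for false negatives.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hand-labeled pairs: (string_a, string_b, is_true_match). Note the third
# pair: two different firms with near-identical names score very high.
labeled = [
    ("Johnson & Associates Inc", "Johnson Associates", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Johnson Associates", "Johnston Associates", False),
    ("Jackson Corp", "Johnson Associates", False),
]

for cutoff in (60, 75, 90):
    predicted = [sim(a, b) >= cutoff for a, b, _ in labeled]
    tp = sum(p and t for p, (_, _, t) in zip(predicted, labeled))
    fp = sum(p and not t for p, (_, _, t) in zip(predicted, labeled))
    fn = sum(t and not p for p, (_, _, t) in zip(predicted, labeled))
    print(f"cutoff={cutoff}: true_pos={tp} false_pos={fp} false_neg={fn}")
```

On data like this, every cutoff leaves at least one error, which is why threshold tuning requires inspecting real results rather than picking a number up front.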
Scaling up: the recordlinkage library for blocking and classification
The recordlinkage library was built specifically for the full matching pipeline: indexing (generating candidate pairs), comparing, and classifying.
import recordlinkage
indexer = recordlinkage.Index()
indexer.block("state") # Only compare records in the same state
candidates = indexer.index(df1, df2)
compare = recordlinkage.Compare()
compare.string("name", "name", method="jarowinkler", threshold=0.85)
compare.string("city", "city", method="levenshtein", threshold=0.80)
compare.exact("zip_code", "zip_code")
features = compare.compute(candidates, df1, df2)
The blocking step is the key improvement. Instead of comparing every record against every other record, you only compare records that share a blocking key — in this case, the same state. This eliminates the vast majority of comparisons and makes the problem tractable for larger datasets.
After computing features, you can classify matches using a threshold on the summed scores, or use the library’s built-in classifiers (K-means, logistic regression, SVM) trained on labeled examples.
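For the simple threshold route, the classification step can be as short as this. The features DataFrame below is simulated toy data in the shape compare.compute() returns: a MultiIndex of candidate pairs, one column per comparison.

```python
import pandas as pd

# Simulated comparison output: one row per candidate pair, one column per
# comparison, values near 1 for agreement and 0 otherwise
features = pd.DataFrame(
    {"name": [1, 1, 0], "city": [1, 0, 0], "zip_code": [1, 1, 0]},
    index=pd.MultiIndex.from_tuples([(0, 10), (1, 11), (2, 12)]),
)

# Rule-based classifier: call a pair a match if at least two of the
# three comparisons agree
matches = features[features.sum(axis=1) >= 2]
print(matches.index.tolist())  # [(0, 10), (1, 11)]
```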
recordlinkage is a serious tool. It handles the mechanics well. But it still requires significant effort to use effectively: you choose the blocking keys, the comparison methods, the thresholds, and the classifier. You write the data preprocessing, handle missing values, and build the evaluation pipeline. For a one-off matching job, this can mean days of development before you see your first results.
Where DIY matching hits a wall (the complexity cliff)
The progression from pd.merge to rapidfuzz to recordlinkage is a progression of increasing capability and increasing code complexity. At each level, you solve new problems but introduce new engineering requirements.
| Approach | Lines of Code | Handles Typos | Handles Scale | Iteration Speed | Non-technical Users |
|---|---|---|---|---|---|
| pandas merge | 5–10 | No | Yes (exact only) | Minutes | No |
| rapidfuzz loop | 30–60 | Single field | < 5K rows | Hours | No |
| rapidfuzz multi-field | 100–200 | Multiple fields | < 5K rows | Hours | No |
| recordlinkage pipeline | 200–500 | Multiple fields | < 50K rows | Days | No |
| recordlinkage + embeddings | 500–1,000+ | Semantic | < 100K rows | Weeks | No |
Lines of code include preprocessing, scoring, evaluation, and output formatting — not just the matching step.
The complexity cliff appears between rows three and four. Adding blocking, multi-algorithm comparison, classification, and evaluation turns a script into a project. Adding AI embeddings or LLM confirmation turns a project into an infrastructure challenge — you need API keys, rate limiting, caching, error handling, and cost management.
And there is a subtler problem: iteration cost. In a script, changing the matching logic means editing code, re-running, and inspecting output. Each iteration takes minutes to hours depending on dataset size. In a production matching job, you often need 5-10 iterations to get thresholds and field weights right. If each iteration requires code changes and a full re-run, the feedback loop is painfully slow.
Cost comparison: custom code vs dedicated matching tool
The real cost of DIY matching is not the API bill — it is the engineering time. Consider a typical matching project: 20,000 rows per dataset, company names with formatting differences, addresses that need normalization, and a requirement for high precision.
The pandas merge approach is cheapest in time but produces incomplete results. The recordlinkage pipeline produces good results but requires a week of work. The dedicated tool sits in a different category entirely — the complexity is handled by the platform, so you spend your time on configuration and validation rather than engineering.
This is not a knock against Python. For one-off exploratory matching on small datasets, a script is the right answer. But when matching is a recurring task, when the datasets are large, or when the results need to be reliable enough for production use, the economics shift.
A decision framework: build vs use a tool
The build-vs-buy decision depends on five factors. Score each one for your situation.
Dataset size. Under 1,000 rows per side, a script is fine. Between 1,000 and 10,000, you need blocking and some optimization. Above 10,000, you need a real pipeline with pre-filtering, batched processing, and checkpointing.
Match complexity. Exact key matching is trivial. Single-field fuzzy matching is manageable. Multi-field matching with different algorithms per field, weighted scoring, and threshold tuning is where code complexity compounds.
Iteration frequency. If you match once and move on, the upfront development cost is amortized over a single run. If you match monthly — new customer lists, updated vendor data, quarterly reconciliation — every code change costs you again. A configurable tool lets you adjust parameters without touching code.
Team composition. If the person who needs the matching results is also a Python developer, scripting is natural. If the matching requester is an analyst, a product manager, or a finance team member, they need a GUI they can operate independently.
AI requirements. If your data has semantic equivalences — “IBM” vs “International Business Machines”, “laptop” vs “notebook computer” — string algorithms will not catch them. You need embeddings, which means API integration, vector storage, and cosine similarity computation. If borderline cases need reasoning about context, you need LLM confirmation, which adds another layer of infrastructure.
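The core of the embeddings route is cosine similarity over vectors returned by an embedding API. A stdlib-only sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values here are invented for illustration):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for embedding-API output: semantically equivalent
# names ("IBM" vs "International Business Machines") land close together
ibm_long = [0.81, 0.33, 0.11]
ibm_short = [0.80, 0.35, 0.10]
unrelated = [0.05, 0.20, 0.95]

print(round(cosine(ibm_long, ibm_short), 3))  # close to 1.0
print(round(cosine(ibm_long, unrelated), 3))  # much lower
```

The math is a few lines; the infrastructure around it, batching API calls, caching vectors, picking a similarity cutoff, is where the real cost sits.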
Score yourself honestly on each factor. If three or more point toward “use a tool,” the development time you save will far exceed the tool’s cost.
One more consideration: maintenance. A matching script that works today breaks when the data format changes, when a library updates its API, or when a new edge case appears. A maintained tool handles these changes for you.
Match Data Studio handles the full pipeline — from pre-filtering and fuzzy matching to AI embeddings and LLM confirmation — without writing code. Upload your CSVs, describe what you are matching, and the AI assistant configures the right combination of algorithms for your data. Get started free →
Keep reading
- Fuzzy matching algorithms explained — Levenshtein, Jaro-Winkler, and when to use each one
- How to choose the right matching algorithm — a decision framework for selecting the best approach for your field types
- Matching at scale: performance strategies — blocking, pre-filter cascades, and parallelism for large datasets