You’ve got two spreadsheets. One from your CRM, one from a purchased lead list. You need to find the overlapping records. The names don’t match exactly — “Robert Smith” in one, “Bob Smith” in the other. Addresses are abbreviated differently. Some phone numbers have country codes, some don’t.

You open ChatGPT and paste in a few rows from each spreadsheet. “Which of these records from List B match records in List A?” It works. ChatGPT identifies that Robert and Bob are the same person, that “123 Oak St” and “123 Oak Street” are the same address, and that the phone numbers match once you strip the +1.

So the natural next question: can you just do this for the whole dataset?

The short answer is yes, LLMs can do fuzzy matching. The longer answer is that what works brilliantly on 10 rows falls apart — or becomes impractical — at 500, 5,000, or 50,000 rows.

What LLMs are actually good at

Let’s give credit where it’s due. Large language models — ChatGPT (GPT-4o), Gemini, Claude, Llama, Mistral — are remarkably good at understanding when two records refer to the same entity. They handle problems that traditional fuzzy matching algorithms struggle with:

Nicknames and abbreviations. Bob = Robert, Bill = William, Corp. = Corporation, St. = Street. LLMs know these equivalences because they’ve seen them millions of times in training data. A Jaro-Winkler algorithm gives “Bob” and “Robert” a similarity score of 0.38 — a clear non-match. An LLM knows they’re the same name.
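To see the gap concretely, here is a minimal pure-Python Jaro-Winkler sketch. Exact scores vary slightly between implementations, so treat the numbers as illustrative rather than canonical:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: rewards shared characters and penalizes
    transpositions, but knows nothing about nicknames."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("martha", "marhta"))  # ≈ 0.961 — the classic example
print(jaro_winkler("bob", "robert"))     # far below any match threshold
```

Whatever the exact value, the nickname pair lands nowhere near a 0.85 match threshold — the algorithm has no way to know that Bob and Robert are the same name.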

Semantic equivalence. VP of Engineering and Vice President, Engineering Dept describe the same role. 123 Oak Street, Suite 4B and 123 Oak St #4B are the same location. LLMs understand meaning, not just character sequences.

Contextual reasoning. An LLM can consider multiple fields together. “Robert Smith at 123 Oak Street” and “Bob Smith at 123 Oak St” — even though neither the name nor the address is an exact match, the combination is strongly suggestive. The LLM reasons across fields the way a human would.

Multilingual matching. 株式会社トヨタ and Toyota Motor Corporation refer to the same entity. An LLM handles this without any translation rules or language-specific configuration.
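In a chat window, this is just a question you type. To automate it, you wrap the same question in an API call. A minimal sketch — the prompt wording, field names, and the commented-out `gpt-4o` call are illustrative assumptions, not a prescription:

```python
import json

def build_match_prompt(rec_a: dict, rec_b: dict) -> str:
    """Assemble a pairwise comparison prompt. Asking for strict JSON
    keeps the verdict machine-parseable."""
    return (
        "Do these two records refer to the same entity? "
        'Reply with JSON: {"match": true|false, "reason": "..."}\n'
        f"Record A: {json.dumps(rec_a)}\n"
        f"Record B: {json.dumps(rec_b)}"
    )

prompt = build_match_prompt(
    {"name": "Robert Smith", "address": "123 Oak Street"},
    {"name": "Bob Smith", "address": "123 Oak St"},
)

# Hypothetical call — requires an API key; model choice is up to you:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": prompt}],
#     response_format={"type": "json_object"},
# )
# verdict = json.loads(resp.choices[0].message.content)
```

One call, one pair, one verdict. Keep that unit cost in mind — it is the crux of everything that follows.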

Here’s a quick test. Five record pairs, scored by a traditional algorithm and by an LLM:

Traditional fuzzy matching vs. LLM matching
| Record A | Record B | Jaro-Winkler | LLM verdict |
|---|---|---|---|
| Robert Smith | Bob Smith | 0.38 | Match (nickname) |
| Johnson & Associates Inc | Johnson Associates Incorporated | 0.61 | Match (abbreviation) |
| 123 Oak Street, Ste 4B | 123 Oak St #4B | 0.72 | Match (formatting) |
| María García López | Maria Garcia Lopez | 0.89 | Match (accent normalization) |
| Acme Corp | Acme Corporation of Delaware | 0.52 | Match (legal entity variant) |

Jaro-Winkler scores below 0.85 would typically be classified as non-matches, so a traditional algorithm at that threshold would miss four of the five pairs. The LLM goes 5 for 5. That's a compelling demo.

Where it breaks down

So if LLMs are so good at matching, why not just use them? Five reasons — and they all compound at scale.

1. Context window limits

ChatGPT (GPT-4o) has a 128K token context window. Claude has 200K. Gemini has up to 1M. That sounds like a lot. It isn’t.

A typical CSV row — first name, last name, email, phone, company, address, city, state, ZIP — is about 80-120 tokens. To compare every record in Dataset A against every record in Dataset B, you have to present every pair to the model.

For a 500 × 500 comparison, that’s 250,000 pairs. At ~250 tokens per pair (both records plus formatting), you need 62.5 million tokens. That’s 490x larger than GPT-4o’s context window and 62x larger than Gemini’s.

You can’t paste your datasets into a chat window and ask “find the matches.” Even for relatively small datasets, you need to break the problem into thousands of individual API calls.
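The arithmetic is easy to check:

```python
# Back-of-the-envelope token budget for a 500 × 500 comparison.
pairs = 500 * 500            # 250,000 pairs in the full cross-product
tokens_needed = pairs * 250  # ~250 tokens per pair -> 62.5M tokens

for model, window in {"GPT-4o": 128_000, "Claude": 200_000,
                      "Gemini": 1_000_000}.items():
    print(f"{model}: {tokens_needed // window}x the context window")
```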

Pairs to compare by dataset size
| Dataset sizes | Pairs to compare | Feasibility |
|---|---|---|
| 100 × 100 | 10,000 | Fits in context |
| 500 × 500 | 250,000 | Needs batching |
| 5K × 5K | 25 million | Major engineering |
| 50K × 50K | 2.5 billion | Impractical |

Figures assume a full cross-product comparison. Smart pre-filtering can reduce this by 90%+, but requires building the filtering infrastructure.

2. Cost

LLM API calls are cheap individually but expensive at scale.

At GPT-4o pricing (~$2.50 per million input tokens, ~$10 per million output tokens), comparing 10,000 pairs costs roughly $4-8. Comparing 250,000 pairs costs $100-200. Comparing 25 million pairs costs $10,000-20,000.

And that’s assuming you structure every call efficiently — minimal prompt, batch processing, no retries. In practice, costs are 2-3x higher because of prompt overhead, error handling, and the need for structured output parsing.

Claude, Gemini, and other models have comparable pricing at the high end. Cheaper models (GPT-4o mini, Gemini Flash, Claude Haiku) reduce costs by 5-10x but also reduce matching accuracy on difficult pairs.
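A rough cost model makes the scaling visible. The per-pair token counts here are assumptions — measure your own prompts:

```python
def estimate_cost(pairs: int, in_tokens: int = 250, out_tokens: int = 15,
                  in_price: float = 2.50, out_price: float = 10.00) -> float:
    """Rough pairwise-comparison cost at GPT-4o-style pricing
    (USD per million tokens). Token counts per pair are assumptions."""
    return pairs * (in_tokens * in_price + out_tokens * out_price) / 1e6

for n in (10_000, 250_000, 25_000_000):
    print(f"{n:>12,} pairs: ${estimate_cost(n):,.0f}")
```

Cost grows linearly with pair count, and pair count grows quadratically with dataset size — that's the trap.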

Estimated LLM matching cost by dataset size
| Dataset sizes | Pairs | Estimated cost | Assessment |
|---|---|---|---|
| 100 × 100 | 10K | ~$5 | Affordable — good for testing |
| 500 × 500 | 250K | ~$150 | Noticeable — needs justification |
| 5K × 5K | 25M | ~$15,000 | Expensive — enterprise budget |
| 50K × 50K | 2.5B | ~$1,500,000 | Impractical — nobody does this |

Estimated cost using GPT-4o for direct pairwise LLM comparison without pre-filtering. Actual costs vary by prompt length and output structure.

3. Speed

Even with parallel API calls, LLMs are slow compared to algorithmic matching.

A traditional fuzzy matching algorithm (Jaro-Winkler, TF-IDF + cosine) can score 100,000 pairs per second on a single CPU core. An LLM API call takes 0.5-3 seconds per response, and each response handles one pair (or a small batch of 5-10 if you’re clever with prompting).

At 10 pairs per second (optimistic), 250,000 pairs take 7 hours. At 2 pairs per second (realistic with rate limits and retries), it’s 35 hours. For 25 million pairs, you’re looking at weeks.
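The same back-of-the-envelope math:

```python
pairs = 250_000  # a 500 × 500 comparison

# Algorithmic scoring at ~100K pairs/s on one CPU core:
print(f"algorithmic: {pairs / 100_000:.1f} s")

# LLM API throughput, optimistic vs. realistic:
print(f"LLM @ 10 pairs/s: {pairs / 10 / 3600:.0f} h")
print(f"LLM @ 2 pairs/s:  {pairs / 2 / 3600:.0f} h")
```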

4. Consistency

LLMs are non-deterministic. Ask the same question twice and you might get different answers. Temperature settings help, but don’t eliminate the issue.

This matters for matching because you need consistent thresholds. If “Bob Smith” and “Robert Smith” match on Monday but not on Tuesday because the model’s confidence fluctuated, your results are unreliable. Traditional algorithms give you the same score every time.

It also matters for auditing. If a compliance team asks why two records were (or weren’t) matched, “the LLM decided” isn’t an acceptable answer. You need reproducible, traceable match decisions.

5. No pipeline infrastructure

Matching at scale isn’t just comparison — it’s a pipeline.

You need to parse and normalize the input CSVs. Generate candidate pairs (because comparing every row against every other row is combinatorially explosive). Run cheap pre-filters to eliminate obvious non-matches. Score the remaining candidates. Apply thresholds. Handle conflicts (what if Record A matches both Record B and Record C?). Generate output with match confidence scores. Allow re-runs with adjusted thresholds.
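Candidate generation alone is a real piece of engineering. The standard technique is blocking: bucket records by a cheap key and compare only within buckets. A sketch — the key choice (ZIP plus first letter of the last name) is purely illustrative, and real pipelines use several keys so a single typo can't hide a match:

```python
from collections import defaultdict

def blocking_key(rec: dict) -> tuple:
    # Illustrative key: ZIP code + first letter of the last name.
    return (rec.get("zip", ""), rec.get("last_name", "")[:1].lower())

def candidate_pairs(list_a: list, list_b: list):
    """Yield only pairs that share a blocking key,
    instead of the full cross-product."""
    index = defaultdict(list)
    for b in list_b:
        index[blocking_key(b)].append(b)
    for a in list_a:
        for b in index[blocking_key(a)]:
            yield (a, b)
```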

None of this exists when you’re pasting data into ChatGPT. You’d need to build it — the prompt engineering, the batching logic, the error handling, the output parsing, the deduplication, the threshold tuning. You’re building a matching platform with LLM API calls as the comparison engine.

At which point you should probably use a matching platform.

The model-by-model breakdown

Each LLM has different strengths for matching tasks.

LLM capabilities for record matching
| Model | Matching quality | Context window | Best use case | Limitation |
|---|---|---|---|---|
| GPT-4o | Excellent | 128K tokens | High-accuracy pairwise comparison | Cost at scale, rate limits |
| GPT-4o mini | Good | 128K tokens | Budget-friendly batch comparison | Misses subtle matches |
| Claude Opus | Excellent | 200K tokens | Complex multi-field reasoning | Cost at scale |
| Claude Haiku | Good | 200K tokens | Fast, cheap pre-screening | Less nuanced on edge cases |
| Gemini 2.5 Pro | Excellent | 1M tokens | Largest batch context | Availability, rate limits |
| Gemini 2.5 Flash | Good | 1M tokens | Fast multimodal matching | Less reasoning depth |
| Llama 3 (self-hosted) | Good | 128K tokens | No API costs, full control | Requires GPU infrastructure |

Quality ratings are relative to the matching task. All frontier models handle standard name and address matching well. Differences emerge on edge cases.

For a one-time, small matching task (under 100 rows per dataset), any of these models will do fine in a chat interface. Paste the data, ask for matches, review the results.

For recurring or larger tasks, the model choice matters less than the architecture around it. The right answer isn’t “which LLM should I use for matching?” — it’s “how do I build a pipeline that uses LLMs only where they add value?”

The right architecture: LLMs where they matter, algorithms everywhere else

The cost and speed problems dissolve when you stop using LLMs for everything and start using them for the specific things they’re good at.

A well-designed matching pipeline uses LLMs in three targeted ways:

1. Understanding the data. Before matching starts, an LLM analyzes your columns and sample rows to understand what you’re matching. It suggests which fields to compare, which algorithms to use, and what thresholds make sense. This is a one-time call, not a per-record cost.

2. Extracting attributes. LLMs fill in missing data. If your CSV has product names but no brand column, the LLM reads each product name and extracts the brand. If you have product images, the LLM describes what it sees and extracts structured attributes. This runs once per unique record, not once per pair.

3. Confirming borderline matches. After cheap algorithmic pre-filters and embedding similarity have eliminated 95% of pairs, the remaining 5% are genuinely ambiguous. This is where LLM reasoning shines — and where the cost is manageable because you’re sending 5% of pairs, not 100%.

Everything else — normalization, pre-filtering, candidate generation, embedding, cosine similarity, threshold application, output formatting — is handled by purpose-built algorithms that are faster, cheaper, and deterministic.
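The triage step can be sketched in a few lines. The thresholds and the scoring function here are illustrative placeholders, not tuned values:

```python
def triage(pairs, cheap_score, hi: float = 0.92, lo: float = 0.70):
    """Split candidate pairs into auto-accept, auto-reject, and the
    ambiguous band that goes to the LLM for final arbitration."""
    accept, reject, send_to_llm = [], [], []
    for a, b in pairs:
        s = cheap_score(a, b)
        if s >= hi:
            accept.append((a, b))       # clear match: no LLM needed
        elif s < lo:
            reject.append((a, b))       # clear non-match: no LLM needed
        else:
            send_to_llm.append((a, b))  # genuinely ambiguous
    return accept, reject, send_to_llm
```

Only the `send_to_llm` band — typically a few percent of candidates — incurs per-pair LLM cost; everything else is settled deterministically.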

Cost comparison: raw LLM vs. hybrid pipeline
| Approach | Description | Relative cost |
|---|---|---|
| LLM on all pairs | Every comparison through GPT-4o/Claude | 100% |
| Hybrid pipeline | Pre-filter → Embed → LLM on 5% borderline | ~4% |

Relative cost for a 2,000 × 3,000 record matching job. Hybrid pipeline achieves comparable accuracy at ~4% of the cost.

The hybrid pipeline achieves comparable or better accuracy at a fraction of the cost — because the LLM only handles the cases that actually need its reasoning capability.

What a dedicated matching tool gives you

Match Data Studio is built on exactly this architecture. LLMs handle what they’re good at. Algorithms handle what they’re good at. The pipeline orchestrates both.

Here’s what you get that you don’t get from pasting data into ChatGPT:

Scale. Upload CSVs with tens of thousands of rows. The pipeline generates candidate pairs, runs cheap pre-filters, and only sends the hard cases to AI. A job that would cost $15,000 in raw LLM calls costs a few dollars in credits.

Speed. String and numeric pre-filters eliminate 90%+ of pairs in seconds. Embeddings handle the remaining candidates in parallel. LLM confirmation runs only on borderline pairs. A full run on a few thousand rows takes minutes, not hours.

Consistency. Pre-filters and embeddings produce the same scores every time. LLM confirmation is used for final arbitration, not as the primary scoring mechanism. Your results are reproducible.

Configuration without code. An AI assistant analyzes your data and configures the pipeline — which fields to compare, which algorithms to use, what thresholds to set. You don’t write prompts, build batching logic, or parse JSON responses.

File-based matching. This is something no chat-based LLM workflow can do. Upload product images or PDF documents alongside your CSVs, and the pipeline extracts attributes from files, embeds visual descriptions, and compares files side by side. Try asking ChatGPT to “match these 500 product photos against this catalog” — it simply can’t. A dedicated tool can.

Iterative refinement. Run a sample (100 rows), check the results, adjust thresholds, run again. This feedback loop is critical for getting matching right and impossible to replicate in a chat interface.

So, can ChatGPT do fuzzy matching?

Yes. And so can Claude, Gemini, Llama, and every other capable LLM. They’re genuinely excellent at understanding when two records refer to the same entity.

But “can it do matching” and “should you use it for matching” are different questions.

If you have 50 records and a one-time need, paste them into your favorite chatbot. It’ll work.

If you have 500 records and need reliable results, you’ll spend an hour crafting prompts, batching requests, and parsing output. It’ll work, but you’ll wonder if there’s a better way.

If you have 5,000+ records, need repeatable results, or want to match on images and documents alongside text — you need a tool built for the job. One that uses LLMs where they add value and purpose-built algorithms everywhere else.


Match Data Studio uses the same AI models you’d use manually — but in a pipeline designed for matching at scale. Upload your CSVs, let the AI configure the pipeline, and get results in minutes instead of hours.

Try it free →