You have two datasets. You need to find which records refer to the same entity. You’ve read about Levenshtein, Jaro-Winkler, TF-IDF, embeddings, and a dozen other algorithms. Now you need to pick one — or more likely, pick the right combination.

This guide gives you a decision framework based on what your data actually looks like.

Start with your data type

The single biggest factor in algorithm selection is what you’re comparing. A person’s name behaves differently than a company name, which behaves differently than a street address. Each has its own dominant error patterns.

Personal names — Typos, nicknames (Bob vs Robert), cultural variations (family name first vs last), initials, suffixes (Jr, III). The prefix is usually correct. Errors concentrate in the middle and end.

Company names — Abbreviations (Inc vs Incorporated), dropped words (& Associates), legal entity suffixes, doing-business-as variations. The distinctive part of the name is often buried among generic tokens.

Addresses — Abbreviations (St vs Street), unit/suite formatting, directional prefixes (N, South), ZIP+4 variations. Highly structured but inconsistently formatted.

Product names — Model numbers embedded in free text, brand prefixes, size/color suffixes, version numbers. Often a mix of meaningful identifiers and descriptive words.

Mixed or compound fields — Some datasets concatenate name + address + ID into a single field, or you need to match across multiple fields simultaneously. This requires a different strategy than single-field comparison.

Key questions before you choose

Before picking algorithms, answer these:

How dirty is the data? If both datasets come from the same system with minor variations, simple fuzzy matching works. If they come from different vendors with different formatting standards, you need more sophisticated approaches.

Are there word-order variations? If names appear as Last, First in one dataset and First Last in another, you need token-aware algorithms that ignore order.
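
As a minimal illustration (pure stdlib; the comma handling here is deliberately naive), an order-free token comparison treats Last, First and First Last as the same name:

```python
def name_tokens(name: str) -> frozenset:
    """Normalize a name into an order-free set of lowercase tokens."""
    return frozenset(name.replace(",", " ").lower().split())

def tokens_match(a: str, b: str) -> bool:
    """True when both names contain the same tokens, in any order."""
    return name_tokens(a) == name_tokens(b)

# "Last, First" vs "First Last" compare equal once order is ignored
print(tokens_match("Garcia, Maria", "Maria Garcia"))  # True
```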

Is the data multilingual? Phonetic algorithms like Soundex are designed for English and break down on other languages. Embeddings handle multiple languages natively.

How much data? At 1,000 records, you can afford expensive pairwise comparisons. At 100,000, you need blocking and pre-filtering strategies. At 1 million, you need to be deliberate about every computation.
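
Blocking is the standard way to avoid comparing every record against every other. A minimal sketch, assuming hypothetical `name` and `zip` fields — records only get compared when they share a cheap blocking key:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    # Hypothetical blocking key: first letter of the name plus the
    # 3-digit ZIP prefix. Real keys depend on your schema and data quality.
    return record["name"][0].lower() + record["zip"][:3]

def candidate_pairs(records: list) -> list:
    """Group records by blocking key; only pairs sharing a key are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

records = [
    {"name": "Alice Smith", "zip": "90210"},
    {"name": "Alicia Smith", "zip": "90211"},
    {"name": "Bob Jones", "zip": "10001"},
]
print(len(candidate_pairs(records)))  # 1 candidate pair instead of 3
```

The trade-off: a match whose records land in different blocks is lost forever, which is why production systems often run several blocking passes with different keys.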

What’s the cost of errors? In some domains (medical records, financial compliance), a missed match is dangerous. In others (marketing deduplication), a few misses are acceptable. This determines how aggressively you filter.

Decision matrix

Algorithm selection by data type
| Data type | Primary issue | Recommended algorithm | Fallback |
| --- | --- | --- | --- |
| Personal names (typos) | Character-level errors | Jaro-Winkler | Levenshtein |
| Personal names (nicknames) | Semantic equivalence | Embeddings + LLM | Phonetic (Soundex) |
| Company names | Abbreviations, suffixes | TF-IDF + Token Set Ratio | Embeddings |
| Street addresses | Format inconsistency | Normalization + token matching | N-gram overlap |
| Product codes / SKUs | Exact or near-exact IDs | Exact match + Levenshtein | N-grams |
| Product descriptions | Variable wording | TF-IDF + cosine similarity | Embeddings |
| Email addresses | Typos, domain variations | Exact + Levenshtein (local part) | Token matching |
| Phone numbers | Formatting only | Normalization + exact match | Levenshtein (digits only) |
| Mixed / multi-field | Multiple error types | Layered pipeline | Weighted field scoring |

Recommendations assume moderate data quality. Severely messy data usually requires embeddings regardless of field type.

The layered approach

The most effective matching pipelines don’t rely on a single algorithm. They use a cascade: cheap, fast filters first, then progressively more expensive and accurate methods on the surviving candidates.

Here’s what that looks like in practice:

Layer 1: Exact match — Zero cost. Compare normalized strings directly. This catches 40-60% of true matches in typical datasets where both sources format some records the same way.

Layer 2: String pre-filter — Low cost. Apply Jaro-Winkler, token matching, or n-gram overlap with a generous threshold. The goal isn’t precision — it’s filtering out obvious non-matches. A pair that scores below 0.4 on Jaro-Winkler is almost certainly not a match. Remove it from consideration.
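
A pure-Python sketch of the standard Jaro-Winkler formula used as a generous pre-filter (in production you would reach for an optimized library such as rapidfuzz; this version just makes the mechanics visible):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):          # count out-of-order matched characters
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix (up to 4 chars) — why it suits names."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)

# Pre-filter with a generous threshold: keep anything that might be a match.
pairs = [("catherine", "katherine"), ("smith", "zzzzz")]
survivors = [(a, b) for a, b in pairs if jaro_winkler(a, b) >= 0.4]
print(survivors)  # only the catherine/katherine pair survives
```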

Layer 3: Numeric pre-filter — Low cost. If you have numeric fields (ZIP codes, prices, dates, IDs), apply percentage-difference filters. Two records whose ZIP codes place them 500 miles apart are not the same entity.
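
A percentage-difference filter is a one-liner; the tolerance shown here is an arbitrary example value, not a recommendation:

```python
def within_pct(a: float, b: float, tolerance: float = 0.10) -> bool:
    """True when a and b differ by at most `tolerance`, relative to the larger value."""
    if a == b:
        return True  # also covers the a == b == 0 case, avoiding division by zero
    return abs(a - b) / max(abs(a), abs(b)) <= tolerance

# Price filter: $19.99 vs $21.50 differ by ~7%, so the pair survives;
# $19.99 vs $49.99 differ by ~60%, so it is dropped before any AI cost.
print(within_pct(19.99, 21.50))  # True
print(within_pct(19.99, 49.99))  # False
```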

Layer 4: Embedding similarity — Moderate cost. Generate vector embeddings for surviving candidates and compute cosine similarity. This catches semantic matches that string algorithms miss — abbreviations, nicknames, format variations.

Layer 5: LLM confirmation — High cost. Send borderline pairs (e.g., cosine similarity 0.70-0.85) to a language model with full record context. The LLM reasons about whether two records plausibly refer to the same entity, considering all fields together.
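
The hand-off between layers 4 and 5 can be sketched as cosine similarity plus a routing band. The band edges below mirror the 0.70-0.85 example above; the 3-d vectors are toys standing in for real embedding output:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(sim: float, lo: float = 0.70, hi: float = 0.85) -> str:
    """Route a candidate pair by embedding similarity."""
    if sim >= hi:
        return "accept"       # confident match, no LLM needed
    if sim >= lo:
        return "llm_review"   # borderline: send full record context to the LLM
    return "reject"           # confident non-match

print(route(cosine([1, 0, 1], [1, 0.2, 0.9])))  # prints "accept"
```

Only the `llm_review` bucket incurs per-pair LLM cost, which is what makes the funnel economics work.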

Each layer is a funnel. The expensive layers only process what survived the cheap ones.

Records requiring LLM review — effect of pre-filtering

| Pipeline stage | Effect | Candidate pairs remaining |
| --- | --- | --- |
| No pre-filtering | Brute force: every pair | 100% |
| After exact match | 28% resolved cheaply | 72% |
| After string filter | Removes obvious non-matches | 31% |
| After numeric filter | ZIP, price, date constraints | 15% |
| After embedding filter | Only ambiguous pairs remain | 8% |

Illustrative figures for a 10K x 10K matching job. Actual reduction depends on data overlap and quality.

The difference is dramatic. Without pre-filtering, you’d send all candidate pairs through the LLM — slow and expensive. With a proper cascade, only 8% of pairs need expensive processing. On a 10K x 10K job, that’s the difference between reviewing 100 million pairs with an LLM and reviewing 8 million. At $0.001 per LLM call, that’s $92,000 in savings.

This is exactly the architecture Match Data Studio uses. The pipeline runs string and numeric pre-filters in Stage 1 (no AI cost), generates embeddings only for surviving rows in Stage 2, applies cosine similarity and LLM confirmation in Stage 3, and outputs results in Stage 4.

Common mistakes

Using one algorithm for everything

Jaro-Winkler is great for names but terrible for company names with word-order variations. TF-IDF is great for company names but overkill for ZIP codes. Match each field to the right algorithm.

Setting thresholds too tight

A threshold of 0.95 on a name field will miss Robert vs Bob, Catherine vs Katherine, and Smith Jr. vs Smith. Start with a lower threshold (0.70-0.80) and tighten it after reviewing sample results. It’s easier to remove false positives than to discover missed matches.

Setting thresholds too loose

A threshold of 0.40 will match John with Jane and Smith with Smoot. If your results have too many false positives, the signal-to-noise ratio collapses and manual review becomes impractical. Aim for a sweet spot where most flagged pairs are real matches.

Ignoring field-specific strategies

Some fields are high-signal (email, phone, SSN last 4). Others are low-signal (city name, state). Weighting all fields equally dilutes the value of strong identifiers. Configure higher weights for fields that are more unique to an entity.
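
A weighted combination can be sketched like this; the weights are hypothetical and should come from your own knowledge of which fields uniquely identify an entity:

```python
# Hypothetical per-field weights: high-signal fields dominate the combined score.
WEIGHTS = {"email": 0.5, "phone": 0.3, "name": 0.15, "city": 0.05}

def weighted_score(field_scores: dict) -> float:
    """Combine per-field similarity scores (0..1), renormalizing over the
    fields actually present (every key must appear in WEIGHTS)."""
    total = sum(WEIGHTS[f] for f in field_scores)
    return sum(WEIGHTS[f] * s for f, s in field_scores.items()) / total

# An exact email match outweighs a mediocre name match and a city mismatch.
print(round(weighted_score({"email": 1.0, "name": 0.6, "city": 0.0}), 3))
```

Renormalizing over the present fields means a missing phone number does not silently drag the score down.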

Skipping normalization

Before any algorithm runs, normalize your data: lowercase, strip punctuation, expand common abbreviations (St to Street, Inc to Incorporated), remove extra whitespace. This alone can improve match rates by 10-20% with zero algorithmic cost.
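
All four normalization steps fit in a few lines; the abbreviation map here is a tiny hypothetical sample you would extend with your domain's common forms:

```python
import re

# Hypothetical abbreviation map — extend with your domain's common variants.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "inc": "incorporated"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(normalize("123 Main St., Acme Inc."))  # "123 main street acme incorporated"
```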

Practical starting points

If you’re not sure where to start, here are sensible defaults:

For a quick first pass: Normalize both datasets, then run Token Set Ratio with a threshold of 0.75. This handles most common variations (order, abbreviation, case) with a single algorithm. Review the results and iterate.
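
A simplified Token Set Ratio can be built on the stdlib's `difflib.SequenceMatcher` — in the spirit of the fuzzywuzzy/rapidfuzz version, scored 0..1 rather than 0..100:

```python
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_set_ratio(a: str, b: str) -> float:
    """Simplified token-set ratio: compare the shared tokens against each
    side's sorted token string, so word order and extra tokens matter less."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    s1 = (inter + " " + " ".join(sorted(ta - tb))).strip()
    s2 = (inter + " " + " ".join(sorted(tb - ta))).strip()
    return max(_ratio(inter, s1), _ratio(inter, s2), _ratio(s1, s2))

threshold = 0.75
candidates = [("Acme Corp", "Corp Acme"), ("Acme Corp", "Globex LLC")]
matches = [(a, b) for a, b in candidates if token_set_ratio(a, b) >= threshold]
print(matches)  # word order is ignored, so only the Acme pair survives
```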

For production accuracy: Use the layered approach. Configure field-specific algorithms — Jaro-Winkler for names, token matching for addresses, exact match for IDs — with embedding similarity as a catch-all, and LLM confirmation for borderline cases.

For maximum recall: Lower all thresholds by 0.10-0.15 and add LLM confirmation on a wider band of candidates. Accept higher cost and more manual review in exchange for catching every possible match.

The right configuration depends on your specific data and tolerance for errors. Run sample batches, review the results, and adjust. Two or three iterations usually gets you to a configuration that holds across the full dataset.


Match Data Studio handles algorithm selection automatically. The AI assistant analyzes your data columns and sample rows, then configures the right combination of string matching, numeric filtering, embeddings, and LLM confirmation for each field. Get started free →

