You have two datasets. You need to find which records refer to the same entity. You’ve read about Levenshtein, Jaro-Winkler, TF-IDF, embeddings, and a dozen other algorithms. Now you need to pick one — or more likely, pick the right combination.

This guide gives you a decision framework based on what your data actually looks like.

Start with your data type

The single biggest factor in algorithm selection is what you’re comparing. A person’s name behaves differently than a company name, which behaves differently than a street address. Each has its own dominant error patterns.

Personal names — Typos, nicknames (Bob vs Robert), cultural variations (family name first vs last), initials, suffixes (Jr, III). The prefix is usually correct. Errors concentrate in the middle and end.

Company names — Abbreviations (Inc vs Incorporated), dropped words (& Associates), legal entity suffixes, doing-business-as variations. The distinctive part of the name is often buried among generic tokens.

Addresses — Abbreviations (St vs Street), unit/suite formatting, directional prefixes (N, South), ZIP+4 variations. Highly structured but inconsistently formatted.

Product names — Model numbers embedded in free text, brand prefixes, size/color suffixes, version numbers. Often a mix of meaningful identifiers and descriptive words.

Mixed or compound fields — Some datasets concatenate name + address + ID into a single field, or you need to match across multiple fields simultaneously. This requires a different strategy than single-field comparison.

Key questions before you choose

Before picking algorithms, answer these:

How dirty is the data? If both datasets come from the same system with minor variations, simple fuzzy matching works. If they come from different vendors with different formatting standards, you need more sophisticated approaches.

Are there word-order variations? If names appear as Last, First in one dataset and First Last in another, you need token-aware algorithms that ignore order.
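
As a minimal illustration (pure stdlib; the comma handling here is deliberately naive), an order-free token comparison treats Last, First and First Last as the same name:

```python
def name_tokens(name: str) -> frozenset:
    """Normalize a name into an order-free set of lowercase tokens."""
    return frozenset(name.replace(",", " ").lower().split())

def tokens_match(a: str, b: str) -> bool:
    """True when both names contain the same tokens, in any order."""
    return name_tokens(a) == name_tokens(b)

# "Last, First" vs "First Last" compare equal once order is ignored
print(tokens_match("Garcia, Maria", "Maria Garcia"))  # True
```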

Is the data multilingual? Phonetic algorithms like Soundex are designed for English and break down on other languages. Embeddings handle multiple languages natively.

How much data? At 1,000 records, you can afford expensive pairwise comparisons. At 100,000, you need blocking and pre-filtering strategies. At 1 million, you need to be deliberate about every computation.
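
Blocking is the standard way to avoid comparing every record against every other. A minimal sketch, assuming hypothetical `name` and `zip` fields — records only get compared when they share a cheap blocking key:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    # Hypothetical blocking key: first letter of the name plus the
    # 3-digit ZIP prefix. Real keys depend on your schema and data quality.
    return record["name"][0].lower() + record["zip"][:3]

def candidate_pairs(records: list) -> list:
    """Group records by blocking key; only pairs sharing a key are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

records = [
    {"name": "Alice Smith", "zip": "90210"},
    {"name": "Alicia Smith", "zip": "90211"},
    {"name": "Bob Jones", "zip": "10001"},
]
print(len(candidate_pairs(records)))  # 1 candidate pair instead of 3
```

The trade-off: a match whose records land in different blocks is lost forever, which is why production systems often run several blocking passes with different keys.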

What’s the cost of errors? In some domains (medical records, financial compliance), a missed match is dangerous. In others (marketing deduplication), a few misses are acceptable. This determines how aggressively you filter.

Decision matrix

Algorithm selection by data type
| Data type | Primary issue | Recommended algorithm | Fallback |
| --- | --- | --- | --- |
| Personal names (typos) | Character-level errors | Jaro-Winkler | Levenshtein |
| Personal names (nicknames) | Semantic equivalence | Embeddings + LLM | Phonetic (Soundex) |
| Company names | Abbreviations, suffixes | TF-IDF + Token Set Ratio | Embeddings |
| Street addresses | Format inconsistency | Normalization + token matching | N-gram overlap |
| Product codes / SKUs | Exact or near-exact IDs | Exact match + Levenshtein | N-grams |
| Product descriptions | Variable wording | TF-IDF + cosine similarity | Embeddings |
| Email addresses | Typos, domain variations | Exact + Levenshtein (local part) | Token matching |
| Phone numbers | Formatting only | Normalization + exact match | Levenshtein (digits only) |
| Mixed / multi-field | Multiple error types | Layered pipeline | Weighted field scoring |

Recommendations assume moderate data quality. Severely messy data usually requires embeddings regardless of field type.

The layered approach

The most effective matching pipelines don’t rely on a single algorithm. They use a cascade: cheap, fast filters first, then progressively more expensive and accurate methods on the surviving candidates.

Here’s what that looks like in practice:

Layer 1: Exact match — Zero cost. Compare normalized strings directly. This catches 40-60% of true matches in typical datasets where both sources format some records the same way.

Layer 2: String pre-filter — Low cost. Apply Jaro-Winkler, token matching, or n-gram overlap with a generous threshold. The goal isn’t precision — it’s filtering out obvious non-matches. A pair that scores below 0.4 on Jaro-Winkler is almost certainly not a match. Remove it from consideration.
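
A pure-Python sketch of the standard Jaro-Winkler formula used as a generous pre-filter (in production you would reach for an optimized library such as rapidfuzz; this version just makes the mechanics visible):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):          # count out-of-order matched characters
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix (up to 4 chars) — why it suits names."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)

# Pre-filter with a generous threshold: keep anything that might be a match.
pairs = [("catherine", "katherine"), ("smith", "zzzzz")]
survivors = [(a, b) for a, b in pairs if jaro_winkler(a, b) >= 0.4]
print(survivors)  # only the catherine/katherine pair survives
```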

Layer 3: Numeric pre-filter — Low cost. If you have numeric fields (ZIP codes, prices, dates, IDs), apply percentage-difference filters. Two records whose ZIP codes place them 500 miles apart are not the same entity.
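
A percentage-difference filter is a one-liner; the tolerance shown here is an arbitrary example value, not a recommendation:

```python
def within_pct(a: float, b: float, tolerance: float = 0.10) -> bool:
    """True when a and b differ by at most `tolerance`, relative to the larger value."""
    if a == b:
        return True  # also covers the a == b == 0 case, avoiding division by zero
    return abs(a - b) / max(abs(a), abs(b)) <= tolerance

# Price filter: $19.99 vs $21.50 differ by ~7%, so the pair survives;
# $19.99 vs $49.99 differ by ~60%, so it is dropped before any AI cost.
print(within_pct(19.99, 21.50))  # True
print(within_pct(19.99, 49.99))  # False
```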

Layer 4: Embedding similarity — Moderate cost. Generate vector embeddings for surviving candidates and compute cosine similarity. This catches semantic matches that string algorithms miss — abbreviations, nicknames, format variations.

Layer 5: LLM confirmation — High cost. Send borderline pairs (e.g., cosine similarity 0.70-0.85) to a language model with full record context. The LLM reasons about whether two records plausibly refer to the same entity, considering all fields together.
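
The hand-off between layers 4 and 5 can be sketched as cosine similarity plus a routing band. The band edges below mirror the 0.70-0.85 example above; the 3-d vectors are toys standing in for real embedding output:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(sim: float, lo: float = 0.70, hi: float = 0.85) -> str:
    """Route a candidate pair by embedding similarity."""
    if sim >= hi:
        return "accept"       # confident match, no LLM needed
    if sim >= lo:
        return "llm_review"   # borderline: send full record context to the LLM
    return "reject"           # confident non-match

print(route(cosine([1, 0, 1], [1, 0.2, 0.9])))  # prints "accept"
```

Only the `llm_review` bucket incurs per-pair LLM cost, which is what makes the funnel economics work.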

Each layer is a funnel. The expensive layers only process what survived the cheap ones.

Records requiring LLM review — effect of pre-filtering

| Pipeline stage | Effect | Candidate pairs remaining |
| --- | --- | --- |
| No pre-filtering | Brute force: every pair | 100% |
| After exact match | 28% resolved cheaply | 72% |
| After string filter | Removes obvious non-matches | 31% |
| After numeric filter | ZIP, price, date constraints | 15% |
| After embedding filter | Only ambiguous pairs remain | 8% |

Illustrative figures for a 10K x 10K matching job. Actual reduction depends on data overlap and quality.

The difference is dramatic. Without pre-filtering, you’d send all candidate pairs through the LLM — slow and expensive. With a proper cascade, only 8% of pairs need expensive processing. On a 10K x 10K job, that’s the difference between reviewing 100 million pairs with an LLM and reviewing 8 million. At $0.001 per LLM call, that’s $92,000 in savings.

This is exactly the architecture Match Data Studio uses. The pipeline runs string and numeric pre-filters in Stage 1 (no AI cost), generates embeddings only for surviving rows in Stage 2, applies cosine similarity and LLM confirmation in Stage 3, and outputs results in Stage 4.

Common mistakes

Using one algorithm for everything

Jaro-Winkler is great for names but terrible for company names with word-order variations. TF-IDF is great for company names but overkill for ZIP codes. Match each field to the right algorithm.

Setting thresholds too tight

A threshold of 0.95 on a name field will miss Robert vs Bob, Catherine vs Katherine, and Smith Jr. vs Smith. Start with a lower threshold (0.70-0.80) and tighten it after reviewing sample results. It’s easier to remove false positives than to discover missed matches.

Setting thresholds too loose

A threshold of 0.40 will match John with Jane and Smith with Smoot. If your results have too many false positives, the signal-to-noise ratio collapses and manual review becomes impractical. Aim for a sweet spot where most flagged pairs are real matches.

Ignoring field-specific strategies

Some fields are high-signal (email, phone, SSN last 4). Others are low-signal (city name, state). Weighting all fields equally dilutes the value of strong identifiers. Configure higher weights for fields that are more unique to an entity.
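
A weighted combination can be sketched like this; the weights are hypothetical and should come from your own knowledge of which fields uniquely identify an entity:

```python
# Hypothetical per-field weights: high-signal fields dominate the combined score.
WEIGHTS = {"email": 0.5, "phone": 0.3, "name": 0.15, "city": 0.05}

def weighted_score(field_scores: dict) -> float:
    """Combine per-field similarity scores (0..1), renormalizing over the
    fields actually present (every key must appear in WEIGHTS)."""
    total = sum(WEIGHTS[f] for f in field_scores)
    return sum(WEIGHTS[f] * s for f, s in field_scores.items()) / total

# An exact email match outweighs a mediocre name match and a city mismatch.
print(round(weighted_score({"email": 1.0, "name": 0.6, "city": 0.0}), 3))
```

Renormalizing over the present fields means a missing phone number does not silently drag the score down.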

Skipping normalization

Before any algorithm runs, normalize your data: lowercase, strip punctuation, expand common abbreviations (St to Street, Inc to Incorporated), remove extra whitespace. This alone can improve match rates by 10-20% with zero algorithmic cost.
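
All four normalization steps fit in a few lines; the abbreviation map here is a tiny hypothetical sample you would extend with your domain's common forms:

```python
import re

# Hypothetical abbreviation map — extend with your domain's common variants.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "inc": "incorporated"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(normalize("123 Main St., Acme Inc."))  # "123 main street acme incorporated"
```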

Practical starting points

If you’re not sure where to start, here are sensible defaults:

For a quick first pass: Normalize both datasets, then run Token Set Ratio with a threshold of 0.75. This handles most common variations (order, abbreviation, case) with a single algorithm. Review the results and iterate.
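
A simplified Token Set Ratio can be built on the stdlib's `difflib.SequenceMatcher` — in the spirit of the fuzzywuzzy/rapidfuzz version, scored 0..1 rather than 0..100:

```python
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_set_ratio(a: str, b: str) -> float:
    """Simplified token-set ratio: compare the shared tokens against each
    side's sorted token string, so word order and extra tokens matter less."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    s1 = (inter + " " + " ".join(sorted(ta - tb))).strip()
    s2 = (inter + " " + " ".join(sorted(tb - ta))).strip()
    return max(_ratio(inter, s1), _ratio(inter, s2), _ratio(s1, s2))

threshold = 0.75
candidates = [("Acme Corp", "Corp Acme"), ("Acme Corp", "Globex LLC")]
matches = [(a, b) for a, b in candidates if token_set_ratio(a, b) >= threshold]
print(matches)  # word order is ignored, so only the Acme pair survives
```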

For production accuracy: Use the layered approach. Configure field-specific algorithms — Jaro-Winkler for names, token matching for addresses, exact match for IDs — with embedding similarity as a catch-all, and LLM confirmation for borderline cases.

For maximum recall: Lower all thresholds by 0.10-0.15 and add LLM confirmation on a wider band of candidates. Accept higher cost and more manual review in exchange for catching every possible match.

The right configuration depends on your specific data and tolerance for errors. Run sample batches, review the results, and adjust. Two or three iterations usually gets you to a configuration that holds across the full dataset.


Match Data Studio handles algorithm selection automatically. The AI assistant analyzes your data columns and sample rows, then configures the right combination of string matching, numeric filtering, embeddings, and LLM confirmation for each field. Get started free →

