When two records refer to the same entity but don’t match exactly, you need a fuzzy matching algorithm. The trouble is there are dozens of them, each with different strengths. Picking the wrong one means missed matches or false positives — sometimes both.

This post covers six algorithms worth knowing, with concrete examples so you can see how each one behaves on real data.

Levenshtein distance (edit distance)

Levenshtein counts the minimum number of single-character edits — insertions, deletions, and substitutions — needed to transform one string into another.

Example: Smith vs Smyth

Only one substitution is needed (i to y), so the edit distance is 1. To get a normalized similarity score, use the formula:

score = 1 - (distance / max_length)
score = 1 - (1 / 5) = 0.80

Levenshtein is intuitive and well-understood. It works well for short strings with minor typos. But it has two weaknesses: it’s slow on long strings (O(n*m) time complexity), and it treats every character position equally. Swapping the first two letters costs the same as swapping letters in the middle, even though first-letter errors are much rarer in names.
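The recurrence can be sketched with the standard dynamic-programming approach (a minimal pure-Python version, not optimized for production use):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum single-character insertions, deletions, and substitutions."""
    # prev holds distances from a[:i-1] to each prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free on exact match)
            ))
        prev = curr
    return prev[-1]

def levenshtein_score(a: str, b: str) -> float:
    """Normalized similarity: 1 - distance / max_length."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Smith", "Smyth"))        # 1
print(levenshtein_score("Smith", "Smyth"))  # 0.8
```

The two-row formulation keeps memory at O(min(n, m)) instead of the full O(n*m) matrix, though the time cost is unchanged.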

Best for: Short fields with occasional typos — product codes, last names, ZIP codes.

Jaro-Winkler similarity

Jaro-Winkler measures character-level matching within a sliding window, then applies a bonus for matching prefixes. The prefix bonus is the key insight: most typos in names happen in the middle or end, not the beginning.

Example: Martha vs Marhta

The base Jaro score considers how many characters match (within a window of floor(max(len1, len2) / 2) - 1) and how many transpositions are needed. Winkler then adds a prefix bonus scaled by the room left above the base score: score = jaro + prefix_len * 0.1 * (1 - jaro), where prefix_len counts matching characters in the first four positions.

For Martha vs Marhta, the Jaro-Winkler score is approximately 0.961 — very high, because the prefix Mar matches and only two adjacent characters are swapped.

Compare that to Levenshtein on the same pair: distance 2, normalized score 0.667. Jaro-Winkler is clearly better here.
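For reference, here is a compact implementation of Jaro plus the Winkler prefix adjustment (a sketch to show the mechanics; production code should use a vetted library such as rapidfuzz or jellyfish):

```python
def jaro(s1: str, s2: str) -> float:
    """Base Jaro similarity: matches within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t1 = [c for c, m in zip(s1, matched1) if m]
    t2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(t1, t2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro plus a bonus for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("martha", "marhta"), 3))  # 0.961
```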

Best for: Personal names, especially first and last names where the beginning is usually correct.

Soundex and Metaphone (phonetic encoding)

Soundex converts a name into a four-character code based on how it sounds in English. Names that sound alike produce the same code, regardless of spelling.

Example: Steven and Stephen both encode to S315.

Metaphone is a more refined version that handles English pronunciation rules better — silent letters, ph = f, consonant clusters. Double Metaphone generates two codes to handle ambiguous pronunciations.

The approach is simple: encode both strings, then compare codes for equality.
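A simplified American Soundex fits in a few lines (this sketch handles the common h/w rule but skips edge cases that real libraries cover):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digit codes."""
    if not name:
        return ""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not break a run of same-coded consonants
            prev = code
    return (encoded + "000")[:4]  # pad with zeros, truncate to four characters

print(soundex("Steven"), soundex("Stephen"))  # S315 S315
```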

The limitations are real. Phonetic algorithms are language-specific (Soundex was designed for English census data in the early 1900s). They discard most vowel information, so they can’t distinguish Smith from Smoth. And they’re useless for non-name fields — you wouldn’t use Soundex on company names or addresses.

Best for: English-language personal names where spelling varies but pronunciation doesn’t. Not suitable for addresses, product names, or non-English text.

Token Set Ratio

Token Set Ratio splits both strings into sets of tokens (words), then finds the best overlap between them. It’s order-independent, which makes it powerful for fields where the same information appears in different arrangements.

Example: John Robert Smith vs Smith, John R.

A character-level algorithm would score this pair poorly — the characters are in completely different positions. Token Set Ratio first normalizes by lowercasing, stripping punctuation, and sorting tokens. Then it compares the intersection and remainders. The score for this pair is typically above 0.85.

Token Set Ratio handles:

  • Word reordering (First Last vs Last, First)
  • Abbreviations mixed with full words (J. Robert vs John Robert)
  • Extra tokens in one string (Johnson & Associates Inc. vs Johnson Associates)
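The comparison above can be sketched with only the standard library (difflib's SequenceMatcher stands in for the Levenshtein-style ratio that libraries such as rapidfuzz use):

```python
import re
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_set_ratio(s1: str, s2: str) -> float:
    """Split into shared vs leftover tokens, then score the best pairing."""
    t1 = set(re.findall(r"[a-z0-9]+", s1.lower()))
    t2 = set(re.findall(r"[a-z0-9]+", s2.lower()))
    inter = " ".join(sorted(t1 & t2))
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()
    return max(_ratio(inter, combined1),
               _ratio(inter, combined2),
               _ratio(combined1, combined2))

print(round(token_set_ratio("John Robert Smith", "Smith, John R."), 2))  # 0.91
```

The key design choice is scoring the sorted intersection against each full combined string: tokens present in both records count fully, while tokens unique to one record only dilute the score.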

Best for: Multi-word fields where word order varies — full names, company names, addresses.

N-grams (character-level)

N-gram matching breaks strings into overlapping substrings of length n (typically 2 or 3 characters), then measures overlap using Jaccard similarity or cosine similarity.

Example: Bigrams (n=2) of Smith: {Sm, mi, it, th}. Bigrams of Smyth: {Sm, my, yt, th}.

Jaccard similarity = |intersection| / |union| = |{Sm, th}| / |{Sm, mi, it, th, my, yt}| = 2/6 = 0.333.

That score looks low, but n-grams shine on longer strings where a single typo disrupts fewer n-grams proportionally. For Johnson Associates vs Johnsen Associates, the bigram overlap is much higher because the typo only affects two out of many bigrams.
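Bigram Jaccard takes only a few lines and reproduces both numbers:

```python
def bigrams(s: str) -> set:
    """Overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_jaccard(s1: str, s2: str) -> float:
    a, b = bigrams(s1), bigrams(s2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(round(bigram_jaccard("Smith", "Smyth"), 3))                            # 0.333
print(round(bigram_jaccard("Johnson Associates", "Johnsen Associates"), 3))  # 0.833
```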

N-grams are also fast to compute using inverted indexes, making them practical for large-scale candidate generation.

Best for: Longer strings where Levenshtein becomes expensive. Good as a pre-filter to generate candidates for a more precise algorithm.

TF-IDF + cosine similarity

TF-IDF (Term Frequency-Inverse Document Frequency) treats each record as a document. Common tokens like Inc, LLC, or Street get low weight because they appear everywhere. Rare tokens like Acme or Palantir get high weight because they’re distinctive.

Example: Comparing Johnson & Associates Inc vs Johnson Associates Incorporated.

The token Johnson is moderately common. Associates is common. Inc and Incorporated are extremely common. TF-IDF downweights all of these and focuses on the distinctive overlap — Johnson — which is exactly the signal you want.

After TF-IDF weighting, cosine similarity measures the angle between the two weighted vectors. This combination handles the classic problem with company names: half the tokens are generic suffixes that add noise to simpler algorithms.
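Here is a toy TF-IDF + cosine sketch over a five-record corpus, using smoothed IDF in the sklearn style. Real pipelines would use a proper vectorizer and a much larger corpus; the corpus and resulting scores below are purely illustrative.

```python
import math
import re
from collections import Counter

def tokenize(s: str) -> list:
    return re.findall(r"[a-z0-9]+", s.lower())

def make_scorer(corpus):
    """Return a cosine-similarity scorer with IDF weights learned from corpus."""
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs)
    df = Counter(t for d in docs for t in d)
    # Smoothed IDF: frequent tokens (Inc, Associates) get low weight
    idf = lambda t: 1.0 + math.log((1 + n) / (1 + df[t]))

    def vectorize(s):
        return {t: c * idf(t) for t, c in Counter(tokenize(s)).items()}

    def score(a, b):
        va, vb = vectorize(a), vectorize(b)
        dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
        norm = (math.sqrt(sum(w * w for w in va.values()))
                * math.sqrt(sum(w * w for w in vb.values())))
        return dot / norm if norm else 0.0

    return score

corpus = [
    "Johnson & Associates Inc",
    "Johnson Associates Incorporated",
    "Acme Manufacturing Inc",
    "Palantir Technologies Inc",
    "Smith Associates Inc",
]
score = make_scorer(corpus)
same = score("Johnson & Associates Inc", "Johnson Associates Incorporated")
diff = score("Acme Manufacturing Inc", "Smith Associates Inc")
print(round(same, 2), round(diff, 2))  # 0.64 0.16
```

Even on this tiny corpus, the pair sharing distinctive tokens (Johnson) scores far above the pair sharing only generic ones (Inc).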

Best for: Company names, product descriptions, any field where some tokens are far more informative than others.

Algorithm comparison

Fuzzy matching algorithms at a glance:

| Algorithm | Best for | Handles typos | Handles reordering | Speed | Score range |
|---|---|---|---|---|---|
| Levenshtein | Short fields, codes | Yes | No | Moderate | 0–1 |
| Jaro-Winkler | Personal names | Yes (prefix-aware) | No | Fast | 0–1 |
| Soundex / Metaphone | English names | Phonetic only | No | Very fast | Match / no match |
| Token Set Ratio | Multi-word fields | Partial | Yes | Moderate | 0–100 |
| N-grams | Longer strings | Yes | Partial | Fast (indexable) | 0–1 |
| TF-IDF + Cosine | Company names | Partial | Yes | Fast (pre-indexed) | 0–1 |

Speed is relative to string length and dataset size. Every algorithm here is effectively instantaneous on a single pair of short strings; the differences matter when scoring millions of candidate pairs.

Concrete example: scoring the same pair

How do these algorithms score a realistic company name comparison?

Comparing: Johnson & Associates Inc vs Johnson Associates Incorporated

Similarity scores — 'Johnson & Associates Inc' vs 'Johnson Associates Incorporated':

| Algorithm | Score | Why |
|---|---|---|
| Levenshtein | 62% | Penalized by length difference |
| Jaro-Winkler | 78% | Prefix match helps |
| Token Set Ratio | 90% | Ignores order, handles partial |
| Bigram Jaccard | 58% | Many non-overlapping bigrams |
| TF-IDF + Cosine | 94% | Downweights common suffixes |

Scores are illustrative. Exact values depend on implementation and normalization settings.

TF-IDF + cosine and Token Set Ratio both handle this case well because they focus on meaningful tokens and ignore word order. Levenshtein and bigram Jaccard struggle because the strings differ significantly at the character level despite referring to the same entity.

When one algorithm isn’t enough

No single algorithm handles every data quality issue. A name field with typos needs Jaro-Winkler. A company name field with abbreviations needs TF-IDF. An address field needs token matching after normalization.

The practical approach is to layer algorithms:

  1. Exact match first — free, catches the easy ones
  2. String pre-filter — Jaro-Winkler or token matching to catch near-misses cheaply
  3. Semantic comparison — embeddings or TF-IDF for the harder cases
  4. LLM confirmation — for genuinely ambiguous pairs that need reasoning

Each layer is a filter. The expensive algorithms only run on records that survive the cheaper stages. This keeps costs manageable even at scale.
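The cascade can be sketched as a short function. The 0.3 threshold and the expensive_scorer hook are illustrative placeholders, not a prescription:

```python
from difflib import SequenceMatcher

def match_pipeline(a: str, b: str, expensive_scorer=None) -> float:
    """Layered matching: cheap stages filter, expensive stages confirm."""
    # Stage 1: exact match is free and catches the easy ones
    if a == b:
        return 1.0
    # Stage 2: cheap string pre-filter; clear non-matches stop here
    cheap = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if cheap < 0.3 or expensive_scorer is None:
        return cheap
    # Stage 3+: only survivors pay for the expensive comparison
    return expensive_scorer(a, b)
```

A TF-IDF scorer, an embedding model, or an LLM call would slot in as expensive_scorer; in practice the threshold should be tuned against labeled pairs.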

Choosing your approach

The right algorithm depends on your field type, data quality, and scale. If you’re matching a single field type with consistent formatting, one algorithm may suffice. If you’re matching real-world data across systems — with mixed formatting, abbreviations, and semantic equivalences — you’ll need a combination.

The next post in this series covers how to choose the right algorithm for specific data types and use cases.


Match Data Studio applies these algorithms automatically as part of its matching pipeline. Upload your CSVs, describe what you’re matching, and the AI assistant configures the right combination of string, numeric, and semantic matching for your data. Get started free →

