When two records refer to the same entity but don’t match exactly, you need a fuzzy matching algorithm. The trouble is there are dozens of them, each with different strengths. Picking the wrong one means missed matches or false positives — sometimes both.

This post covers six algorithms worth knowing, with concrete examples so you can see how each one behaves on real data.

Levenshtein distance (edit distance)

Levenshtein counts the minimum number of single-character edits — insertions, deletions, and substitutions — needed to transform one string into another.

Example: Smith vs Smyth

Only one substitution is needed (i to y), so the edit distance is 1. To get a normalized similarity score, use the formula:

score = 1 - (distance / max_length)
score = 1 - (1 / 5) = 0.80

Levenshtein is intuitive and well-understood. It works well for short strings with minor typos. But it has two weaknesses: it’s slow on long strings (O(n*m) time complexity), and it treats every character position equally. Swapping the first two letters costs the same as swapping letters in the middle, even though first-letter errors are much rarer in names.
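The recurrence can be sketched with the standard dynamic-programming approach (a minimal pure-Python version, not optimized for production use):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum single-character insertions, deletions, and substitutions."""
    # prev holds distances from a[:i-1] to each prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free on exact match)
            ))
        prev = curr
    return prev[-1]

def levenshtein_score(a: str, b: str) -> float:
    """Normalized similarity: 1 - distance / max_length."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Smith", "Smyth"))        # 1
print(levenshtein_score("Smith", "Smyth"))  # 0.8
```

The two-row formulation keeps memory at O(min(n, m)) instead of the full O(n*m) matrix, though the time cost is unchanged.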

Best for: Short fields with occasional typos — product codes, last names, ZIP codes.

Jaro-Winkler similarity

Jaro-Winkler measures character-level matching within a sliding window, then applies a bonus for matching prefixes. The prefix bonus is the key insight: most typos in names happen in the middle or end, not the beginning.

Example: Martha vs Marhta

The base Jaro score considers how many characters match (within a window of floor(max(len1, len2) / 2) - 1) and how many transpositions are needed. Winkler then adds a prefix bonus scaled by the room left above the base score: score = jaro + prefix_len * 0.1 * (1 - jaro), where prefix_len counts matching characters in the first four positions.

For Martha vs Marhta, the Jaro-Winkler score is approximately 0.961 — very high, because the prefix Mar matches and only two adjacent characters are swapped.

Compare that to Levenshtein on the same pair: distance 2, normalized score 0.667. Jaro-Winkler is clearly better here.
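For reference, here is a compact implementation of Jaro plus the Winkler prefix adjustment (a sketch to show the mechanics; production code should use a vetted library such as rapidfuzz or jellyfish):

```python
def jaro(s1: str, s2: str) -> float:
    """Base Jaro similarity: matches within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t1 = [c for c, m in zip(s1, matched1) if m]
    t2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(t1, t2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro plus a bonus for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("martha", "marhta"), 3))  # 0.961
```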

Best for: Personal names, especially first and last names where the beginning is usually correct.

Soundex and Metaphone (phonetic encoding)

Soundex converts a name into a four-character code based on how it sounds in English. Names that sound alike produce the same code, regardless of spelling.

Example: Steven and Stephen both encode to S315.

Metaphone is a more refined version that handles English pronunciation rules better — silent letters, ph = f, consonant clusters. Double Metaphone generates two codes to handle ambiguous pronunciations.

The approach is simple: encode both strings, then compare codes for equality.
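A simplified American Soundex fits in a few lines (this sketch handles the common h/w rule but skips edge cases that real libraries cover):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digit codes."""
    if not name:
        return ""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not break a run of same-coded consonants
            prev = code
    return (encoded + "000")[:4]  # pad with zeros, truncate to four characters

print(soundex("Steven"), soundex("Stephen"))  # S315 S315
```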

The limitations are real. Phonetic algorithms are language-specific (Soundex was designed for English census data in the early 1900s). They discard most vowel information, so they can’t distinguish Smith from Smoth. And they’re useless for non-name fields — you wouldn’t use Soundex on company names or addresses.

Best for: English-language personal names where spelling varies but pronunciation doesn’t. Not suitable for addresses, product names, or non-English text.

Token Set Ratio

Token Set Ratio splits both strings into sets of tokens (words), then finds the best overlap between them. It’s order-independent, which makes it powerful for fields where the same information appears in different arrangements.

Example: John Robert Smith vs Smith, John R.

A character-level algorithm would score this pair poorly — the characters are in completely different positions. Token Set Ratio first normalizes by lowercasing, stripping punctuation, and sorting tokens. Then it compares the intersection and remainders. The score for this pair is typically above 0.85.

Token Set Ratio handles:

  • Word reordering (First Last vs Last, First)
  • Abbreviations mixed with full words (J. Robert vs John Robert)
  • Extra tokens in one string (Johnson & Associates Inc. vs Johnson Associates)
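The comparison above can be sketched with only the standard library (difflib's SequenceMatcher stands in for the Levenshtein-style ratio that libraries such as rapidfuzz use):

```python
import re
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_set_ratio(s1: str, s2: str) -> float:
    """Split into shared vs leftover tokens, then score the best pairing."""
    t1 = set(re.findall(r"[a-z0-9]+", s1.lower()))
    t2 = set(re.findall(r"[a-z0-9]+", s2.lower()))
    inter = " ".join(sorted(t1 & t2))
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()
    return max(_ratio(inter, combined1),
               _ratio(inter, combined2),
               _ratio(combined1, combined2))

print(round(token_set_ratio("John Robert Smith", "Smith, John R."), 2))  # 0.91
```

The key design choice is scoring the sorted intersection against each full combined string: tokens present in both records count fully, while tokens unique to one record only dilute the score.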

Best for: Multi-word fields where word order varies — full names, company names, addresses.

N-grams (character-level)

N-gram matching breaks strings into overlapping substrings of length n (typically 2 or 3 characters), then measures overlap using Jaccard similarity or cosine similarity.

Example: Bigrams (n=2) of Smith: {Sm, mi, it, th}. Bigrams of Smyth: {Sm, my, yt, th}.

Jaccard similarity = |intersection| / |union| = |{Sm, th}| / |{Sm, mi, it, th, my, yt}| = 2/6 = 0.333.

That score looks low, but n-grams shine on longer strings where a single typo disrupts fewer n-grams proportionally. For Johnson Associates vs Johnsen Associates, the bigram overlap is much higher because the typo only affects two out of many bigrams.
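Bigram Jaccard takes only a few lines and reproduces both numbers:

```python
def bigrams(s: str) -> set:
    """Overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_jaccard(s1: str, s2: str) -> float:
    a, b = bigrams(s1), bigrams(s2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(round(bigram_jaccard("Smith", "Smyth"), 3))                            # 0.333
print(round(bigram_jaccard("Johnson Associates", "Johnsen Associates"), 3))  # 0.833
```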

N-grams are also fast to compute using inverted indexes, making them practical for large-scale candidate generation.

Best for: Longer strings where Levenshtein becomes expensive. Good as a pre-filter to generate candidates for a more precise algorithm.

TF-IDF + cosine similarity

TF-IDF (Term Frequency-Inverse Document Frequency) treats each record as a document. Common tokens like Inc, LLC, or Street get low weight because they appear everywhere. Rare tokens like Acme or Palantir get high weight because they’re distinctive.

Example: Comparing Johnson & Associates Inc vs Johnson Associates Incorporated.

The token Johnson is moderately common. Associates is common. Inc and Incorporated are extremely common. TF-IDF downweights all of these and focuses on the distinctive overlap — Johnson — which is exactly the signal you want.

After TF-IDF weighting, cosine similarity measures the angle between the two weighted vectors. This combination handles the classic problem with company names: half the tokens are generic suffixes that add noise to simpler algorithms.
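Here is a toy TF-IDF + cosine sketch over a five-record corpus, using smoothed IDF in the sklearn style. Real pipelines would use a proper vectorizer and a much larger corpus; the corpus and resulting scores below are purely illustrative.

```python
import math
import re
from collections import Counter

def tokenize(s: str) -> list:
    return re.findall(r"[a-z0-9]+", s.lower())

def make_scorer(corpus):
    """Return a cosine-similarity scorer with IDF weights learned from corpus."""
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs)
    df = Counter(t for d in docs for t in d)
    # Smoothed IDF: frequent tokens (Inc, Associates) get low weight
    idf = lambda t: 1.0 + math.log((1 + n) / (1 + df[t]))

    def vectorize(s):
        return {t: c * idf(t) for t, c in Counter(tokenize(s)).items()}

    def score(a, b):
        va, vb = vectorize(a), vectorize(b)
        dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
        norm = (math.sqrt(sum(w * w for w in va.values()))
                * math.sqrt(sum(w * w for w in vb.values())))
        return dot / norm if norm else 0.0

    return score

corpus = [
    "Johnson & Associates Inc",
    "Johnson Associates Incorporated",
    "Acme Manufacturing Inc",
    "Palantir Technologies Inc",
    "Smith Associates Inc",
]
score = make_scorer(corpus)
same = score("Johnson & Associates Inc", "Johnson Associates Incorporated")
diff = score("Acme Manufacturing Inc", "Smith Associates Inc")
print(round(same, 2), round(diff, 2))  # 0.64 0.16
```

Even on this tiny corpus, the pair sharing distinctive tokens (Johnson) scores far above the pair sharing only generic ones (Inc).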

Best for: Company names, product descriptions, any field where some tokens are far more informative than others.

Algorithm comparison

Fuzzy matching algorithms at a glance:

| Algorithm | Best for | Handles typos | Handles reordering | Speed | Score range |
|---|---|---|---|---|---|
| Levenshtein | Short fields, codes | Yes | No | Moderate | 0–1 |
| Jaro-Winkler | Personal names | Yes (prefix-aware) | No | Fast | 0–1 |
| Soundex / Metaphone | English names | Phonetic only | No | Very fast | Match / no match |
| Token Set Ratio | Multi-word fields | Partial | Yes | Moderate | 0–100 |
| N-grams | Longer strings | Yes | Partial | Fast (indexable) | 0–1 |
| TF-IDF + Cosine | Company names | Partial | Yes | Fast (pre-indexed) | 0–1 |

Speed is relative to string length and dataset size. Every algorithm here is effectively instantaneous on a single pair of short strings; the differences matter when scoring millions of candidate pairs.

Concrete example: scoring the same pair

How do these algorithms score a realistic company name comparison?

Comparing: Johnson & Associates Inc vs Johnson Associates Incorporated

Similarity scores — 'Johnson & Associates Inc' vs 'Johnson Associates Incorporated':

| Algorithm | Score | Why |
|---|---|---|
| Levenshtein | 62% | Penalized by length difference |
| Jaro-Winkler | 78% | Prefix match helps |
| Token Set Ratio | 90% | Ignores order, handles partial |
| Bigram Jaccard | 58% | Many non-overlapping bigrams |
| TF-IDF + Cosine | 94% | Downweights common suffixes |

Scores are illustrative. Exact values depend on implementation and normalization settings.

TF-IDF + cosine and Token Set Ratio both handle this case well because they focus on meaningful tokens and ignore word order. Levenshtein and bigram Jaccard struggle because the strings differ significantly at the character level despite referring to the same entity.

When one algorithm isn’t enough

No single algorithm handles every data quality issue. A name field with typos needs Jaro-Winkler. A company name field with abbreviations needs TF-IDF. An address field needs token matching after normalization.

The practical approach is to layer algorithms:

  1. Exact match first — free, catches the easy ones
  2. String pre-filter — Jaro-Winkler or token matching to catch near-misses cheaply
  3. Semantic comparison — embeddings or TF-IDF for the harder cases
  4. LLM confirmation — for genuinely ambiguous pairs that need reasoning

Each layer is a filter. The expensive algorithms only run on records that survive the cheaper stages. This keeps costs manageable even at scale.
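The cascade can be sketched as a short function. The 0.3 threshold and the expensive_scorer hook are illustrative placeholders, not a prescription:

```python
from difflib import SequenceMatcher

def match_pipeline(a: str, b: str, expensive_scorer=None) -> float:
    """Layered matching: cheap stages filter, expensive stages confirm."""
    # Stage 1: exact match is free and catches the easy ones
    if a == b:
        return 1.0
    # Stage 2: cheap string pre-filter; clear non-matches stop here
    cheap = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if cheap < 0.3 or expensive_scorer is None:
        return cheap
    # Stage 3+: only survivors pay for the expensive comparison
    return expensive_scorer(a, b)
```

A TF-IDF scorer, an embedding model, or an LLM call would slot in as expensive_scorer; in practice the threshold should be tuned against labeled pairs.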

Choosing your approach

The right algorithm depends on your field type, data quality, and scale. If you’re matching a single field type with consistent formatting, one algorithm may suffice. If you’re matching real-world data across systems — with mixed formatting, abbreviations, and semantic equivalences — you’ll need a combination.

The next post in this series covers how to choose the right algorithm for specific data types and use cases.


Match Data Studio applies these algorithms automatically as part of its matching pipeline. Upload your CSVs, describe what you’re matching, and the AI assistant configures the right combination of string, numeric, and semantic matching for your data. Get started free →

