Every organization that works with data from more than one source eventually hits the same wall: two spreadsheets that describe the same real-world entities using different words, different formats, and different levels of completeness. Figuring out which rows refer to the same thing is data matching — and the way we solve it has changed fundamentally over the past two decades.

The progression from handwritten rules to machine learning to neural embeddings to large language models isn’t just a story of incremental improvement. Each generation of technology solved a category of problem that the previous generation couldn’t touch. Understanding this progression — what each approach does, where it breaks, and why the next one was needed — is the key to building matching systems that actually work on messy data.

Generation 1: Rules and string comparison

The first generation of data matching is entirely deterministic. You define rules. The system applies them. A match is either true or false.

At the simplest level, this means exact matching — if the email addresses are identical, the records match. More sophisticated rule-based systems normalize values first (lowercase, strip punctuation, expand abbreviations) and then compare using string distance algorithms. Jaro-Winkler scores the similarity of two strings with a bonus for matching prefixes. Levenshtein counts the minimum character edits to transform one string into another. Soundex encodes names by their English pronunciation.

These algorithms are well-understood, fast, and cheap. A single CPU core can evaluate millions of string comparisons per second. The rules are fully traceable — you can explain exactly why any two records matched or didn’t.
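The normalize-then-compare flow takes only a few lines of Python. The Levenshtein implementation below is a minimal textbook version — libraries such as jellyfish or RapidFuzz ship optimized equivalents, along with Jaro-Winkler and phonetic encoders:

```python
import re

def normalize(s: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    s = re.sub(r"[^\w\s]", "", s.lower())
    return re.sub(r"\s+", " ", s).strip()

def levenshtein(a: str, b: str) -> int:
    """Minimum single-character edits (insert/delete/substitute) from a to b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))                               # 3
print(levenshtein(normalize("Acme, Corp."), normalize("ACME Corp")))  # 0
```

Normalization does much of the work: after lowercasing and stripping punctuation, the two company-name variants above are identical before the distance function even runs.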

Where rules work

Rules handle the easy cases reliably. When both datasets follow consistent formatting conventions, when the matching fields are structured (product codes, phone numbers, ZIP codes), and when variations are predictable (abbreviations, case differences, minor typos), string rules achieve 90–95% accuracy.

Where rules break

The problems start with anything that requires understanding rather than character comparison.

Semantic equivalence. Bob and Robert barely register on any character metric. 3M and Minnesota Mining and Manufacturing have essentially zero string overlap. Société Générale SA and SocGen — no string comparison function will connect these, because the relationship isn’t in the characters. It’s in human knowledge about what those strings mean.

Unpredictable variation. Every rule handles variations you’ve anticipated. A rule for St → Street won’t catch Blvd → Boulevard unless you add another rule. Each new dataset brings formatting conventions you haven’t seen. The rule set grows without bound, and maintaining it becomes an engineering project in itself.

Missing and partial data. When Record A has a phone number but no address, and Record B has an address but no phone, string rules on those fields produce no signal. A human would look at the name, the city, and the context — and make a judgment. Rules can’t make judgment calls.

Scale. For n records matched against m records, you need n × m comparisons. Each comparison runs every rule. At 10,000 × 10,000 (100 million pairs), even fast string comparisons take hours, and most of those comparisons are between records that obviously don’t match.
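The standard mitigation is blocking: group records by a cheap key and only compare pairs within the same group. A rough sketch — the field names and blocking key here are illustrative, not from any particular library:

```python
from collections import defaultdict

def blocked_pairs(left, right, key):
    """Group the right-hand records by a cheap blocking key (e.g. ZIP code)
    and only generate candidate pairs within the same block."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[key(rec)].append(rec)
    for a in left:
        for b in buckets.get(key(a), []):
            yield a, b

left  = [{"name": "Acme Corp",  "zip": "78701"},
         {"name": "Bolt Supply", "zip": "10001"}]
right = [{"name": "ACME Corporation", "zip": "78701"},
         {"name": "Candle Co",        "zip": "94107"}]

pairs = list(blocked_pairs(left, right, key=lambda r: r["zip"]))
print(len(pairs))  # 1 candidate pair instead of the 4 in the full cross-product
```

Blocking trades a little recall (a typo in the blocking key hides a true match) for a dramatic reduction in comparisons — which is exactly the trade the funnel architecture described later formalizes.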

String comparison algorithms: what they catch and what they miss
Algorithm | Catches | Misses | Best for
Exact match | Identical values | Any variation at all | IDs, codes, emails
Jaro-Winkler | Typos, transpositions (prefix-weighted) | Reordering, semantic equivalence | Personal names
Levenshtein | Insertions, deletions, substitutions | Word reordering, synonyms | Short strings, codes
Soundex / Metaphone | Phonetic similarity | Non-English, non-names | English personal names
Token set ratio | Word reordering, partial overlap | Semantic meaning, abbreviations | Multi-word fields
TF-IDF + cosine | Important token overlap, downweights common terms | Synonyms, cross-language | Company names, descriptions

Each algorithm covers a different failure mode. No single algorithm handles all types of variation.

Rule-based matching peaked in the early 2000s. It’s still the foundation of every matching system — but it’s not sufficient on its own for any dataset where the formatting wasn’t controlled at the point of entry.

Generation 2: Machine learning

The insight behind ML-based matching is simple: instead of writing rules by hand, train a model to learn which field comparisons predict a true match.

The Fellegi-Sunter foundation

The theoretical groundwork came decades before the practical ML implementations. In 1969, Ivan Fellegi and Alan Sunter published a statistical framework for record linkage that formalized what string rules couldn’t: not all field agreements are equally informative.

Their model assigns weights based on two probabilities. The m-probability is the chance a field agrees given the records truly match (e.g., 95% of true matches share a ZIP code). The u-probability is the chance a field agrees by coincidence (e.g., 2% of random pairs share a ZIP code). Fields with high m-probability and low u-probability — like phone numbers or tax IDs — get high weights. Fields with high u-probability — like sharing the last name “Smith” — get low weights.

This was a fundamental shift: from treating every field comparison equally to weighting them by their statistical informativeness. A shared rare last name like “Xiangwenthakur” is far more diagnostic than a shared “Johnson.” Fellegi-Sunter captures this. Rule-based systems don’t.
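In the Fellegi-Sunter framework, a field that agrees contributes a weight of log2(m/u), and a field that disagrees contributes log2((1-m)/(1-u)); summing the weights across fields gives the pair's overall match score. Using the ZIP code probabilities from the example above:

```python
from math import log2

def fs_weight(m: float, u: float, agrees: bool) -> float:
    """Fellegi-Sunter log-likelihood weight for one field comparison.
    m: P(field agrees | records truly match)
    u: P(field agrees | records are a random non-matching pair)"""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

# ZIP code from the example: agrees in 95% of true matches,
# and in 2% of random pairs by coincidence.
print(round(fs_weight(m=0.95, u=0.02, agrees=True), 2))   # 5.57
print(round(fs_weight(m=0.95, u=0.02, agrees=False), 2))  # -4.29
```

A highly diagnostic field like a tax ID (m near 1, u near 0) earns a large positive weight on agreement, while agreement on “Smith” (high u) earns almost nothing — which is exactly the intuition described above, made quantitative.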

Supervised learning for matching

With the Fellegi-Sunter framework providing the intuition, supervised machine learning made it operational. The approach works like this:

  1. Feature engineering. For each candidate pair, compute a vector of comparison features: Jaro-Winkler score on the name field, exact match on ZIP code, Levenshtein distance on address, token overlap on company name. Each pair becomes a row of numbers.

  2. Labeled training data. Human reviewers label a sample of pairs as match or non-match. This is the expensive part — you need hundreds to thousands of labeled examples for the model to learn from.

  3. Classification. A model (random forest, gradient-boosted tree, SVM, or logistic regression) learns which combinations of feature values predict a match. The model might learn that a high name similarity combined with an exact ZIP code match is strong evidence, while a high name similarity alone — without address agreement — is weak evidence.

  4. Prediction. The trained model scores unlabeled pairs, producing a probability of matching. A threshold determines the final classification.
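The four steps can be sketched end to end. This toy version uses a hand-rolled logistic regression so it stays self-contained; in practice you would reach for scikit-learn's random forest or gradient boosting, and the feature values and labels below are illustrative:

```python
from math import exp

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Tiny stochastic-gradient-descent logistic regression.
    X: comparison-feature vectors, one per candidate pair.
    y: 1 for match, 0 for non-match."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1 / (1 + exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Probability that the pair described by feature vector x is a match."""
    return 1 / (1 + exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Features per pair: [name Jaro-Winkler score, exact ZIP match] (toy labels).
X = [[0.95, 1], [0.90, 1], [0.92, 0], [0.30, 0], [0.40, 1], [0.20, 0]]
y = [1, 1, 0, 0, 0, 0]

w, b = train_logistic(X, y)
print(predict(w, b, [0.93, 1]) > 0.5)  # True: high name similarity + ZIP match
```

Note what the model learns from this data: high name similarity alone ([0.92, 0]) is labeled a non-match, so the learned weights require address agreement as corroboration — the kind of feature interaction described in the next section.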

What ML added

Automatic weight learning. Instead of manually tuning thresholds for each field, the model learns optimal weights from data. It discovers interactions humans might miss — like the fact that company name similarity matters more when the city also matches.

Handling of missing values. A decision tree can route around missing fields. If the phone number is NULL, the model relies on other features instead of producing no signal.

Adaptability. Train on a new labeled set and the model adjusts to a new data domain. No rule rewriting needed — just new examples.

What ML still couldn’t do

Feature engineering was manual. Someone still had to decide which string similarity metrics to compute for which fields, how to handle multi-valued fields, and how to normalize inputs. The model learned weights over features, but the features themselves were hand-designed.

No semantic understanding. The features were still character-level comparisons. Bob vs Robert produced a low Jaro-Winkler score, and the model learned to weight that accordingly. But it couldn’t understand that Bob is a nickname for Robert — it could only learn that low name similarity sometimes still correlates with true matches, which reduces precision.

Labeled data requirement. Good supervised models need substantial labeled training data. For a new domain (matching pharmaceutical products instead of customer records), you need a new labeled set. Active learning helped — the model identifies the most uncertain pairs and asks humans to label only those — but the requirement never disappeared.

Matching recall by approach on messy vendor data
Exact match only (misses all variations): 48%
Hand-tuned rules (covers anticipated patterns): 68%
ML, random forest (learns optimal feature weights): 82%
ML + active learning (prioritized labeling): 86%

Recall measured on a 5,000-record vendor matching dataset with real-world quality issues — abbreviations, missing fields, multiple formats.

Machine learning was the standard for serious record linkage from roughly 2010 to 2018. Libraries like dedupe (Python), Magellan (University of Wisconsin), and commercial platforms operationalized the supervised approach. The bottleneck was always the same: feature engineering and labeled data.

Generation 3: Neural embeddings

The next leap came not from better classification algorithms, but from better representations. Instead of hand-engineering comparison features, neural networks learned to represent entire records as dense vectors — embeddings — that capture semantic meaning.

From words to vectors

The precursors appeared in NLP research. Word2Vec (2013) and GloVe (2014) demonstrated that words could be represented as vectors in a space where relationships were preserved: king - man + woman ≈ queen. These word embeddings captured semantic similarity that character comparisons never could — automobile and car landed near each other despite sharing no characters.

But word embeddings had a critical limitation for data matching: they operated on individual words, not on phrases or records. Averaging the word vectors for a multi-field record produced a lossy, order-insensitive representation that missed important context.

Sentence-BERT and record embeddings

The transformer architecture (2017) and BERT (2018) changed this. Transformers process entire sequences with an attention mechanism that weighs the importance of each token relative to every other token. BERT learned deep contextual representations — the word “bank” gets a different vector depending on whether it appears near “river” or “money.”

But BERT alone was impractical for matching. Comparing two records required feeding both into the network simultaneously, making pairwise comparison across large datasets prohibitively expensive — Nils Reimers and Iryna Gurevych calculated that finding the most similar pair in 10,000 sentences would take ~65 hours with BERT.

Their solution, Sentence-BERT (2019), used a Siamese network architecture: two identical BERT networks with shared weights, each encoding one record independently. The output is a fixed-size vector per record. You compute vectors once, then compare millions of pairs using cosine similarity in seconds. This reduced the 65-hour problem to 5 seconds.

For data matching, the approach works like this:

  1. Concatenate relevant fields from a record into a text representation: "Acme Corp, 123 Oak St, Austin TX 78701, Industrial Supplies"
  2. Pass through the embedding model to get a vector (256–768 dimensions)
  3. Do the same for every record in both datasets
  4. Compare vectors using cosine similarity — a score from -1 to 1
  5. Apply a threshold to classify matches
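With vectors in hand, step 4 is a one-liner. The sketch below uses toy 4-dimensional vectors in place of real model output; in practice the vectors would come from an embedding model (for example via the sentence-transformers library):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for real 256-768 dimensional record embeddings.
rec_a = [0.21, 0.80, 0.43, 0.11]  # "Acme Corp, 123 Oak St, Austin TX 78701"
rec_b = [0.20, 0.78, 0.45, 0.09]  # "ACME Corporation, 123 Oak Street, Austin"
rec_c = [0.90, 0.05, 0.10, 0.70]  # unrelated record

print(cosine(rec_a, rec_b) > 0.99)  # True: near-duplicates sit close together
print(cosine(rec_a, rec_c) < 0.5)   # True: unrelated records score lower
```

The payoff is the Sentence-BERT economics described above: each record is embedded exactly once, and the pairwise step reduces to this cheap arithmetic, so millions of comparisons run in seconds.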

What embeddings solved

Semantic understanding at scale. Robert Smith and Bob Smith produce similar embeddings because the model learned from billions of text examples that Bob is a form of Robert. 123 Oak Street and 123 Oak St produce nearly identical vectors. München and Munich are close in vector space. No rules, no lookup tables — the model generalizes from its training data.

Cross-field reasoning. When you embed a concatenated record rather than individual fields, the model captures relationships between fields. A company name at the same address as a personal name will have higher similarity than two unrelated records at different addresses.

Multilingual matching. Modern embedding models work across languages. A product description in Spanish and its English equivalent cluster together without translation rules.

No feature engineering. The model learns its own features from the raw text. No manual selection of which string metrics to compute for which fields.

What embeddings still miss

Opaque decisions. When two records get a cosine similarity of 0.82, you can’t easily explain why. Which tokens drove the score? Which were ignored? The vector space is a black box.

Threshold sensitivity. Embedding cosine scores occupy a narrower effective range than string similarity scores. Most pairs fall between 0.60 and 0.95. A threshold change from 0.80 to 0.78 can dramatically shift your match count.

Cost per record. Each embedding requires model inference — either an API call or local GPU time. At scale, this adds up. Embedding 100,000 records is affordable; running model inference across all 50 million candidate pairs is not.

No reasoning. Embeddings capture similarity but don’t reason. Two records might have high embedding similarity because they describe similar-but-different entities (two dentists named Smith in the same city). Distinguishing “similar” from “same” requires contextual judgment that vector distance can’t provide.

Evolution of data representation for matching
Generation | Representation | Captures semantics | Cost per record | Reasoning
Rules | Raw field values + string metrics | No | Free | None
ML | Hand-engineered comparison features | No | Free | Statistical patterns
Word embeddings | Average of per-word vectors | Partial | Low | None
Sentence/record embeddings | Dense vector from transformer | Yes | Moderate | None
LLMs | Full contextual understanding | Yes | High | Multi-step reasoning

Each generation provides richer representation at higher cost. Production systems layer them to use the cheapest sufficient method for each pair.

Generation 4: Large language models

LLMs brought something none of the previous approaches had: the ability to reason about whether two records refer to the same entity.

From similarity to judgment

Embeddings answer the question “how similar are these two records?” LLMs answer a different question: “are these two records the same entity, and why?”

That distinction matters. Two dental practices in the same city — "Smith Dental, 456 Elm St, Portland OR" and "Smith Family Dentistry, 789 Pine Ave, Portland OR" — might produce high embedding similarity. They’re both dental practices named Smith in Portland. But they’re different businesses at different addresses. An embedding score of 0.84 doesn’t tell you whether this is one entity or two.

An LLM can examine both records, notice the different addresses, consider that dental practices don’t typically operate from two locations, and conclude: not a match. Or, if additional context shows the same phone number and the same dentist name, the LLM can reason that the practice moved locations — same entity, new address.

Zero-shot and few-shot matching

One of the most significant properties of LLMs for data matching is that they require no training data for the specific matching task.

Zero-shot matching. Present two records to the LLM with a prompt like: “Are these two records the same company? Consider the company name, address, and industry. Respond with MATCH, NO MATCH, or UNCERTAIN with a brief explanation.” The model draws on its training knowledge to make the judgment. No labeled examples needed.

Few-shot matching. Include 3–5 example pairs with their correct labels in the prompt. “Here are examples of matching and non-matching records…” The LLM adapts its decision-making to the pattern established by the examples — handling domain-specific conventions (like knowing that pharmaceutical companies often have subsidiary names that differ from their parent) without explicit rules.

This eliminates the labeled-data bottleneck that constrained supervised ML. A new matching task in a new domain doesn’t require weeks of labeling. It requires a well-crafted prompt.
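A zero-shot matching prompt is just a string. The sketch below assembles one from two records — the exact wording is illustrative, and the resulting string would be sent to whatever chat-completion API you use:

```python
def build_match_prompt(record_a: dict, record_b: dict) -> str:
    """Assemble a zero-shot company-matching prompt from two records."""
    fmt = lambda r: ", ".join(f"{k}: {v}" for k, v in r.items())
    return (
        "Are these two records the same company? Consider the company "
        "name, address, and industry.\n"
        f"Record A: {fmt(record_a)}\n"
        f"Record B: {fmt(record_b)}\n"
        "Respond with MATCH, NO MATCH, or UNCERTAIN, with a brief explanation."
    )

prompt = build_match_prompt(
    {"name": "Acme Corp", "city": "Austin", "industry": "Industrial Supplies"},
    {"name": "ACME Corporation", "city": "Austin", "industry": "Industrial"},
)
print(prompt)
```

The few-shot variant is the same function with 3–5 labeled example pairs prepended before the records under judgment.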

Multimodal matching

The latest LLMs extend beyond text. When a dataset includes images (product photos, document scans, logos), file attachments (PDFs, invoices), or URLs pointing to media, multimodal models can incorporate this content directly into their matching judgment.

A product matching task might have a record with the text "Blue ceramic coffee mug, 12oz" in one dataset and a product photo with a brief description "Handmade mug" in another. A multimodal LLM can examine the image, read the text, and determine that they’re describing the same product — something no amount of string comparison or text embedding can accomplish without the visual information.

The cost problem

LLMs are the most capable matching technique available — and the most expensive. A single match judgment using a capable model costs $0.005–$0.05 depending on the model and prompt length. For 1,000 pairs, that’s $5–$50. Manageable. For 50 million pairs (a 5,000 × 10,000 matching job), it’s $250,000–$2.5 million. Not manageable.

This is why no production system sends every pair through an LLM. The economics demand a layered approach.

The modern pipeline: every generation in one system

The state of the art isn’t any single technique — it’s a pipeline that applies each generation of technology where it’s most cost-effective.

The funnel architecture

Stage 1: Rules (Generation 1). Deterministic pre-filters eliminate the obvious non-matches. String prefix matching, exact ZIP code comparison, numeric range filters. These run in microseconds, cost nothing, and typically eliminate 80–95% of candidate pairs. You’re spending $0 to remove the pairs that obviously don’t match.

Stage 2: Embeddings (Generation 3). The surviving pairs get embedded and compared by cosine similarity. This is where semantic understanding kicks in — Bob/Robert, St/Street, München/Munich. The embedding model processes only the records that survived Stage 1, which is typically 5–20% of the full dataset. Cost: a few dollars for a medium-sized job.

Stage 3: LLM confirmation (Generation 4). Pairs with embedding scores in the ambiguous range — high enough to be plausible, too low to be certain — get sent to an LLM. The model examines both records in full context and renders a judgment. This is the most expensive per-pair step, but because only 1–5% of pairs reach this stage, the total cost is contained.

The ML weight-learning from Generation 2 is embedded throughout: threshold selection, feature importance, and the statistical models that determine which pairs are “ambiguous enough” to warrant LLM review all trace back to the supervised learning framework.
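Put together, the funnel is a short loop: cheap checks first, with each later stage seeing only the survivors. A schematic sketch — the rule, scoring, and "LLM" functions below are toy stand-ins for the real components, and the thresholds are illustrative (real embedding-score bands sit higher, e.g. 0.78–0.92):

```python
def match_pipeline(pairs, rule_filter, embed_score, llm_judge,
                   low=0.4, high=0.9):
    """Layered funnel: rules -> embedding score -> LLM for the ambiguous band."""
    matches = []
    for a, b in pairs:
        if not rule_filter(a, b):        # Stage 1: free deterministic elimination
            continue
        score = embed_score(a, b)        # Stage 2: semantic similarity
        if score >= high:                # confident match, no LLM needed
            matches.append((a, b))
        elif score >= low and llm_judge(a, b):  # Stage 3: ambiguous band only
            matches.append((a, b))
    return matches

# Toy stand-ins: rule = same ZIP, score = shared name tokens,
# "LLM" approves when the cities agree.
rule  = lambda a, b: a["zip"] == b["zip"]
score = lambda a, b: len(set(a["name"].lower().split())
                         & set(b["name"].lower().split())) / 2
judge = lambda a, b: a["city"] == b["city"]

left  = {"name": "Acme Corp", "zip": "78701", "city": "Austin"}
right = {"name": "Acme Corporation", "zip": "78701", "city": "Austin"}
print(len(match_pipeline([(left, right)], rule, score, judge)))  # 1
```

The structure is the point: the expensive judge runs only on pairs that pass the free filter and land in the ambiguous score band, which is how the per-pair economics stay sane.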

Pairs remaining at each pipeline stage (5K × 10K records)
All candidate pairs (full cross-product): 50M
After string pre-filter (90% eliminated by rules): 5M
After numeric pre-filter (50% more eliminated): 2.5M
After embedding threshold (80% eliminated by similarity): 0.5M
Confirmed matches (final output): 0.12M

50 million pairs enter. ~120,000 confirmed matches emerge. Rules handled 95% of the elimination for free. Embeddings refined the remainder. The LLM saw only the genuinely ambiguous pairs — on the order of 0.01% of the total.

Why the layered approach wins

The economics are compelling. Running an LLM on all 50 million pairs would cost hundreds of thousands of dollars. Running embeddings on all pairs would cost hundreds. Running the full pipeline — rules first, then embeddings, then LLM — costs under $50 for the same job at comparable accuracy.

Cost and accuracy by approach (5K × 10K matching job)
Approach | Pairs processed by AI | Estimated cost | Recall | Time
Rules only | 0 | $0 | 60–72% | Seconds
Embeddings on all pairs | 50M | $150–500 | 84–90% | Hours
LLM on all pairs | 50M | $250K–2.5M | 92–97% | Days
Layered pipeline | ~500K embedding + ~5K LLM | $15–50 | 93–97% | Minutes

The layered pipeline achieves LLM-level accuracy at rules-level cost by applying each technique only where it's needed.

The accuracy gains compound, too. Rules catch the clean matches perfectly. Embeddings catch the semantic variations that rules miss. The LLM resolves the genuinely ambiguous cases that embeddings score inconclusively. Each layer handles the failure mode of the previous one.

What changed at each step — and why it mattered

The evolution from rules to LLMs isn’t just a technology progression. It’s a progression in what the system can understand about the data it’s matching.

Rules understand characters. They can tell you that two strings differ by two characters, or that one contains the other. They have no concept of what the strings mean.

ML understands patterns. It can learn that certain combinations of field similarities predict a match, even when individual fields are ambiguous. But it operates on hand-crafted features that still reduce records to character-level comparisons.

Embeddings understand meaning. They know that “athletic footwear” and “running shoes” refer to similar things, even with zero character overlap. But they produce a single similarity score without explanation or reasoning.

LLMs understand context. They can consider multiple pieces of evidence together, apply world knowledge, handle edge cases, and explain their reasoning. “The company names differ, but both records list the same registered agent at the same address — this is likely the same entity operating under a DBA.”

Each generation didn’t replace the previous one. It extended the system’s capability into a domain the previous generation couldn’t reach. And each generation’s strength compensates for the previous one’s weakness — which is why production systems use all of them.

Match recall by technology generation on messy real-world data
Exact match (Generation 0: identical strings only): 48%
String rules (Generation 1: character-level comparison): 68%
ML classification (Generation 2: learned feature weights): 83%
Neural embeddings (Generation 3: semantic vectors): 89%
Hybrid pipeline (all generations: rules → embeddings → LLM): 95%

Recall measured as percentage of known true matches found. Tested on a 5,000-record dataset with name, address, and company fields containing real-world quality issues.

Practical implications

If you’re building or choosing a matching system today, three things matter.

Don’t skip the cheap layers. It’s tempting to throw everything at an LLM and call it done. The accuracy will be excellent — and the cost will be prohibitive. String rules and numeric filters are free and eliminate the vast majority of non-matches. Always run them first.

Embeddings are the highest-leverage investment. They provide semantic understanding at a fraction of LLM cost. For most matching tasks, embeddings are the difference between “good enough” and “production-quality.” The jump from rules-only to rules-plus-embeddings is larger than the jump from embeddings to embeddings-plus-LLM.

LLMs are for the hard cases. The 5–10% of pairs where embedding scores are ambiguous — where the records are similar enough to plausibly be the same entity but different enough to warrant scrutiny — is where LLM reasoning earns its cost. Use them selectively, not universally.

The best matching systems aren’t the ones using the most advanced AI on every comparison. They’re the ones that use the right technique at the right stage — fast and cheap where possible, powerful and expensive only where necessary.


Match Data Studio applies this layered architecture automatically. String and numeric pre-filters run before any AI processing to eliminate obvious non-matches for free. Embeddings handle semantic similarity. LLM confirmation resolves the genuinely ambiguous cases. You get four generations of matching technology in a single pipeline — without building it yourself. Get started free →

