Record matching broadly falls into two camps: rule-based systems that compare strings deterministically, and AI embedding systems that capture semantic meaning in vector space. Both work. Neither is sufficient alone. Understanding when to use each — and how to combine them — is the key to accurate, cost-effective matching.

Rule-based matching

Rule-based matching applies explicit comparison functions to field values. If the Jaro-Winkler score on the name field exceeds 0.85 and the ZIP codes match exactly, it’s a match. The logic is spelled out, step by step.
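A minimal sketch of such a rule, using the standard library's `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler (which would come from a third-party library such as jellyfish); the rule structure — a similarity score, a threshold, and an exact field check — is the same:

```python
from difflib import SequenceMatcher

def is_match(a: dict, b: dict, name_threshold: float = 0.85) -> bool:
    """Deterministic rule: name similarity above threshold AND exact ZIP match.

    SequenceMatcher.ratio() stands in for Jaro-Winkler here;
    swap in a real Jaro-Winkler implementation for production use.
    """
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= name_threshold and a["zip"] == b["zip"]

print(is_match({"name": "Robert Chen", "zip": "78701"},
               {"name": "Robert Chenn", "zip": "78701"}))  # True: near-identical name, same ZIP
print(is_match({"name": "Robert Chen", "zip": "78701"},
               {"name": "Robert Chen", "zip": "78702"}))   # False: ZIP differs
```

The same inputs always produce the same output, and the decision is fully explainable from the two numbers involved.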

Strengths

Deterministic. The same inputs always produce the same output. You can explain every match decision: “These records matched because the name similarity was 0.91 and the address token overlap was 0.88.”

Fast. String comparison functions run in microseconds. You can process millions of pairs per minute on a single machine.

Interpretable. When a match is wrong, you can trace exactly which rule fired and why. Debugging is straightforward. Adjusting a threshold or adding an exception is a config change, not a retraining exercise.

Zero marginal cost. After implementation, rules cost nothing to evaluate. No API calls, no GPU time, no per-record pricing.

Weaknesses

Brittle. Rules only handle the patterns you’ve anticipated. A rule for Inc vs Incorporated won’t catch Corp vs Corporation unless you add another rule. And it definitely won’t catch Bob vs Robert — that’s not a string transformation, it’s world knowledge.
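A hypothetical suffix-normalization map illustrates the treadmill: every variation needs its own explicit entry, and knowledge like Bob/Robert has no place to live in it at all:

```python
# Each corporate-suffix variation needs its own entry; the map grows
# with every new pattern you discover in the data.
SUFFIX_MAP = {
    "inc": "incorporated",
    "corp": "corporation",
    "co": "company",
    "ltd": "limited",
}

def normalize_company(name: str) -> str:
    """Lowercase, strip trailing punctuation, expand known suffixes."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return " ".join(SUFFIX_MAP.get(t, t) for t in tokens)

print(normalize_company("Acme Corp."))  # acme corporation
print(normalize_company("Acme Inc"))    # acme incorporated
```

Every unhandled pattern (Co. vs Company vs Cie., GmbH vs AG) means another entry — and no entry will ever connect a nickname to a legal name.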

No semantic understanding. Oak Creek Investments LLC and Robert Chen could be the same beneficial owner if Robert is the registered agent. No combination of string rules will connect those records. The information needed isn’t in the character sequences — it’s in the relationship between the name, the address, and the entity type.

Combinatorial explosion. As you add more fields and more variation patterns, the number of rules grows rapidly. A system matching names, addresses, phone numbers, and emails across two messy datasets might need hundreds of rules to achieve 85% recall. Maintaining that rule set becomes its own engineering project.

Format-dependent. Rules that work on one dataset often fail on another. A rule tuned for Last, First Middle format breaks when the next dataset uses First M. Last. Each new data source requires rule auditing and adjustment.

AI embedding matching

Embedding models convert text into high-dimensional vectors (typically 256–768 dimensions) that capture semantic meaning. Records with similar meaning land near each other in vector space, even if the surface text looks different.

How it works

  1. Concatenate relevant fields from a record into a text representation: "Robert Chen, 123 Oak St, Austin TX 78701"
  2. Pass through an embedding model (e.g., Gemini, OpenAI, or Sentence-BERT) to get a vector
  3. Compare vectors using cosine similarity — a score from -1 to 1, where higher means more similar
  4. Apply a threshold to classify matches
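Step 3 is plain vector math. A toy sketch with 3-dimensional vectors standing in for real 256–768-dimension embeddings:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors pointing in nearly the same direction score close to 1.0.
u = [0.9, 0.1, 0.3]
v = [0.8, 0.2, 0.35]
print(round(cosine_similarity(u, v), 3))
```

Real embeddings behave the same way, just in hundreds of dimensions: two records whose vectors point in nearly the same direction score near 1.0 and are treated as likely matches.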

Strengths

Semantic understanding. Robert Smith and Bob Smith produce similar embeddings because the model has learned that Bob is a common nickname for Robert. 123 Oak Street and 123 Oak St produce nearly identical embeddings because the model understands the abbreviation.

Robust to format variation. Embeddings handle case differences, abbreviations, word order changes, and missing tokens without explicit rules. The model generalizes from its training data.

Multi-field reasoning. When you embed a concatenated record rather than individual fields, the model captures relationships between fields. An LLC name at the same address as a personal name will have higher similarity than two unrelated names at different addresses.

Multilingual. Modern embedding models work across languages. A record in Spanish and its English equivalent will cluster together without any translation rules.

Weaknesses

Non-deterministic. Different model versions can produce different embeddings. Results may shift slightly when the model is updated. This makes exact reproducibility harder.

Cost per record. Each embedding requires an API call or local model inference. At scale, this adds up — a few cents per thousand records for hosted APIs, or GPU infrastructure costs for self-hosted models.

Less interpretable. When two records get a cosine similarity of 0.82, you can’t easily explain why. Which tokens contributed? Which were ignored? The vector space is opaque.

Threshold sensitivity. Cosine similarity scores for embeddings occupy a narrower effective range than string similarity scores. Most pairs fall between 0.60 and 0.95. A threshold change from 0.80 to 0.78 can dramatically change your match count in ways that aren’t intuitive.

Head-to-head comparison

Rule-based vs embedding matching
| Dimension | Rule-Based | AI Embeddings | Hybrid |
|---|---|---|---|
| Accuracy (clean data) | 90–95% | 88–93% | 95–98% |
| Accuracy (messy data) | 60–75% | 80–90% | 90–96% |
| Speed (10K pairs) | < 1 second | 5–30 seconds | 2–15 seconds |
| Cost per 10K pairs | ~$0 | $0.02–$0.10 | $0.01–$0.05 |
| Handles semantic equivalence | No | Yes | Yes |
| Interpretable | Fully | Limited | Partially |
| Setup effort | High (rule authoring) | Low (configure model) | Medium |

Accuracy ranges are illustrative. Actual performance depends on data characteristics and configuration quality.

The pattern is clear: rules win on clean, structured data. Embeddings win on messy, variable data. The hybrid approach wins on everything — at the cost of slightly more complexity.

The hybrid approach

The best matching systems aren’t pure rule-based or pure AI. They’re layered pipelines that use each approach where it’s strongest.

Architecture

Stage 1: Rules eliminate obvious non-matches. String pre-filters (prefix match, contains, Jaro-Winkler with a low threshold) and numeric filters (ZIP code range, date proximity) are fast and free. They typically eliminate 70–90% of candidate pairs.

Stage 2: Embeddings score remaining candidates. Generate embeddings only for records that survived Stage 1. This is where the AI cost concentrates — but because pre-filtering removed most candidates, you’re embedding a fraction of the full dataset.

Stage 3: LLM confirms borderline cases. Pairs with embedding similarity in the ambiguous range (say, 0.70–0.85) get sent to a language model with full record context. The LLM can reason about whether two records are the same entity using all available evidence.

This cascade gives you the speed and interpretability of rules on the easy cases, the semantic understanding of embeddings on the hard cases, and the reasoning capability of LLMs on the genuinely ambiguous cases.
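The cascade above can be sketched as a single classifier. The stage-2 and stage-3 functions below are deterministic stand-ins so the example runs as-is — a real pipeline would call an embedding model and an LLM there — and the thresholds are illustrative:

```python
from difflib import SequenceMatcher

def embedding_similarity(a: str, b: str) -> float:
    # Stand-in for cosine similarity between real embedding vectors.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def llm_confirm(a: str, b: str) -> bool:
    # Stand-in for an LLM judging the pair with full record context.
    return a.split()[-1].lower() == b.split()[-1].lower()

def classify_pair(a: str, b: str) -> str:
    # Stage 1: cheap string pre-filter removes obvious non-matches for free.
    if SequenceMatcher(None, a.lower(), b.lower()).ratio() < 0.5:
        return "non-match"
    # Stage 2: embedding similarity decides the clear cases.
    score = embedding_similarity(a, b)
    if score >= 0.85:
        return "match"
    if score < 0.70:
        return "non-match"
    # Stage 3: only the ambiguous 0.70-0.85 band pays for an LLM call.
    return "match" if llm_confirm(a, b) else "non-match"

print(classify_pair("Robert Chen", "Bob Chen"))    # escalates to stage 3
print(classify_pair("Oak Creek LLC", "Jane Doe"))  # rejected at stage 1, no AI cost
```

Note how the expensive stages only see the pairs the cheaper stages could not settle.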

Cost analysis

The layered approach is dramatically cheaper than using AI on everything.

Cost per 10,000 candidate pairs by approach
| Approach | Cost | Notes |
|---|---|---|
| Rules only | $0 | Free, but misses 25–40% of messy matches |
| Hybrid pipeline | $8 | Rules → embeddings → LLM |
| Embeddings on all pairs | $35 | Generates every vector |
| LLM on all pairs | $100 | Every pair reviewed by LLM |

Approximate costs at typical API pricing. Rules cost ~$0. Embeddings ~$0.003/record. LLM confirmation ~$0.005/pair.

The hybrid approach costs roughly 8% of what a full LLM pipeline costs while achieving comparable accuracy. The savings come from not sending obvious non-matches through expensive processing.
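The arithmetic can be sketched from the footnote's unit prices. The 90% filter rate and 25% ambiguous share below are assumptions (chosen in the ranges the article describes), not measured values:

```python
# Back-of-envelope hybrid cost for 10,000 candidate pairs, using the
# per-unit prices quoted above. Filter rate and ambiguous share are assumed.
PAIRS = 10_000
EMBED_PER_RECORD = 0.003   # embedding cost per record
LLM_PER_PAIR = 0.005       # LLM confirmation cost per pair
FILTER_RATE = 0.90         # assumed share of pairs rules eliminate for free
AMBIGUOUS_SHARE = 0.25     # assumed share of survivors in the ambiguous band

surviving = round(PAIRS * (1 - FILTER_RATE))        # pairs past the rule stage
embed_cost = surviving * 2 * EMBED_PER_RECORD       # both records of each pair
llm_cost = round(surviving * AMBIGUOUS_SHARE) * LLM_PER_PAIR
print(f"${embed_cost + llm_cost:.2f}")              # prints $7.25
```

Under these assumptions the hybrid run lands in the same ballpark as the ~$8 figure above, with almost all of the spend concentrated in the embedding stage.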

When to use each approach

Use rules alone when:

  • Your data is well-structured and comes from known, consistent sources
  • The matching fields are short and standardized (product codes, phone numbers, emails)
  • You need deterministic, reproducible results for compliance or auditing
  • Budget is zero and you’re willing to accept lower recall on messy records
  • Latency requirements are sub-second for real-time matching

Use embeddings alone when:

  • Your data is unstructured or highly variable (free-text descriptions, multilingual fields)
  • You’re matching across fundamentally different schemas (one dataset has full addresses, the other has partial addresses with different formatting)
  • You need to match at a semantic level (company names to individual names via ownership)
  • The dataset is small enough that embedding every record is affordable

Use the hybrid approach when:

  • You have a mix of structured and unstructured fields
  • Data quality varies — some records are clean, others are messy
  • You need high accuracy but can’t afford to run AI on every comparison
  • Your dataset is large enough that brute-force AI is cost-prohibitive
  • You want the best achievable result regardless of which technique produces it

Accuracy on real-world messy data

The difference between approaches is most visible on messy data — the kind that comes from merging vendor lists, county records, and CRM exports.

Match recall on a messy 5,000-record dataset
| Approach | Recall | Notes |
|---|---|---|
| Exact match only | 52% | Misses all variations |
| Rules (Jaro-Winkler + token) | 71% | Catches typos and reordering |
| Embeddings only | 84% | Catches semantic equivalence |
| Hybrid pipeline | 93% | Rules + embeddings + LLM |

Recall measured as percentage of known true matches identified. Dataset contains name, address, and company fields with real-world quality issues.

The hybrid pipeline recovers 93% of true matches — 22 percentage points more than rules alone and 9 points more than embeddings alone. The LLM confirmation stage is responsible for most of that final gain, catching cases where even embeddings produce ambiguous scores.

Practical recommendations

If you’re building a matching workflow from scratch, start with the hybrid approach. It’s not more complex to configure — the configuration is just a list of fields with their types and thresholds. The pipeline handles the rest.

If you’re already using rule-based matching and seeing too many misses, add an embedding layer for the fields where rules struggle (names, company names, descriptions). Keep your rules for the fields where they work well (IDs, codes, phone numbers).

If cost is the primary constraint, invest your AI budget where it matters most: embedding the fields that vary semantically, and LLM-confirming only the pairs that fall in the ambiguous score range.


Match Data Studio uses the hybrid architecture by default. String and numeric pre-filters run before any AI operations, keeping costs low. Embeddings and LLM confirmation handle the cases that rules can’t. Get started free →

