Record matching broadly falls into two camps: rule-based systems that compare strings deterministically, and AI embedding systems that capture semantic meaning in vector space. Both work. Neither is sufficient alone. Understanding when to use each — and how to combine them — is the key to accurate, cost-effective matching.

Rule-based matching

Rule-based matching applies explicit comparison functions to field values. If the Jaro-Winkler score on the name field exceeds 0.85 and the ZIP codes match exactly, it’s a match. The logic is spelled out, step by step.
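A minimal sketch of such a rule, using the standard library's `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler (which would come from a third-party library such as jellyfish); the rule structure — a similarity score, a threshold, and an exact field check — is the same:

```python
from difflib import SequenceMatcher

def is_match(a: dict, b: dict, name_threshold: float = 0.85) -> bool:
    """Deterministic rule: name similarity above threshold AND exact ZIP match.

    SequenceMatcher.ratio() stands in for Jaro-Winkler here;
    swap in a real Jaro-Winkler implementation for production use.
    """
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= name_threshold and a["zip"] == b["zip"]

print(is_match({"name": "Robert Chen", "zip": "78701"},
               {"name": "Robert Chenn", "zip": "78701"}))  # True: near-identical name, same ZIP
print(is_match({"name": "Robert Chen", "zip": "78701"},
               {"name": "Robert Chen", "zip": "78702"}))   # False: ZIP differs
```

The same inputs always produce the same output, and the decision is fully explainable from the two numbers involved.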

Strengths

Deterministic. The same inputs always produce the same output. You can explain every match decision: “These records matched because the name similarity was 0.91 and the address token overlap was 0.88.”

Fast. String comparison functions run in microseconds. You can process millions of pairs per minute on a single machine.

Interpretable. When a match is wrong, you can trace exactly which rule fired and why. Debugging is straightforward. Adjusting a threshold or adding an exception is a config change, not a retraining exercise.

Zero marginal cost. After implementation, rules cost nothing to evaluate. No API calls, no GPU time, no per-record pricing.

Weaknesses

Brittle. Rules only handle the patterns you’ve anticipated. A rule for Inc vs Incorporated won’t catch Corp vs Corporation unless you add another rule. And it definitely won’t catch Bob vs Robert — that’s not a string transformation, it’s world knowledge.
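A hypothetical suffix-normalization map illustrates the treadmill: every variation needs its own explicit entry, and knowledge like Bob/Robert has no place to live in it at all:

```python
# Each corporate-suffix variation needs its own entry; the map grows
# with every new pattern you discover in the data.
SUFFIX_MAP = {
    "inc": "incorporated",
    "corp": "corporation",
    "co": "company",
    "ltd": "limited",
}

def normalize_company(name: str) -> str:
    """Lowercase, strip trailing punctuation, expand known suffixes."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return " ".join(SUFFIX_MAP.get(t, t) for t in tokens)

print(normalize_company("Acme Corp."))  # acme corporation
print(normalize_company("Acme Inc"))    # acme incorporated
```

Every unhandled pattern (Co. vs Company vs Cie., GmbH vs AG) means another entry — and no entry will ever connect a nickname to a legal name.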

No semantic understanding. Oak Creek Investments LLC and Robert Chen could be the same beneficial owner if Robert is the registered agent. No combination of string rules will connect those records. The information needed isn’t in the character sequences — it’s in the relationship between the name, the address, and the entity type.

Combinatorial explosion. As you add more fields and more variation patterns, the number of rules grows rapidly. A system matching names, addresses, phone numbers, and emails across two messy datasets might need hundreds of rules to achieve 85% recall. Maintaining that rule set becomes its own engineering project.

Format-dependent. Rules that work on one dataset often fail on another. A rule tuned for Last, First Middle format breaks when the next dataset uses First M. Last. Each new data source requires rule auditing and adjustment.

AI embedding matching

Embedding models convert text into high-dimensional vectors (typically 256–768 dimensions) that capture semantic meaning. Records with similar meaning land near each other in vector space, even if the surface text looks different.

How it works

  1. Concatenate relevant fields from a record into a text representation: "Robert Chen, 123 Oak St, Austin TX 78701"
  2. Pass through an embedding model (e.g., Gemini, OpenAI, or Sentence-BERT) to get a vector
  3. Compare vectors using cosine similarity — a score from -1 to 1, where higher means more similar
  4. Apply a threshold to classify matches
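Step 3 is plain vector math. A toy sketch with 3-dimensional vectors standing in for real 256–768-dimension embeddings:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors pointing in nearly the same direction score close to 1.0.
u = [0.9, 0.1, 0.3]
v = [0.8, 0.2, 0.35]
print(round(cosine_similarity(u, v), 3))
```

Real embeddings behave the same way, just in hundreds of dimensions: two records whose vectors point in nearly the same direction score near 1.0 and are treated as likely matches.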

Strengths

Semantic understanding. Robert Smith and Bob Smith produce similar embeddings because the model has learned that Bob is a common nickname for Robert. 123 Oak Street and 123 Oak St produce nearly identical embeddings because the model understands the abbreviation.

Robust to format variation. Embeddings handle case differences, abbreviations, word order changes, and missing tokens without explicit rules. The model generalizes from its training data.

Multi-field reasoning. When you embed a concatenated record rather than individual fields, the model captures relationships between fields. An LLC name at the same address as a personal name will have higher similarity than two unrelated names at different addresses.

Multilingual. Modern embedding models work across languages. A record in Spanish and its English equivalent will cluster together without any translation rules.

Weaknesses

Non-deterministic. Different model versions can produce different embeddings. Results may shift slightly when the model is updated. This makes exact reproducibility harder.

Cost per record. Each embedding requires an API call or local model inference. At scale, this adds up — a few cents per thousand records for hosted APIs, or GPU infrastructure costs for self-hosted models.

Less interpretable. When two records get a cosine similarity of 0.82, you can’t easily explain why. Which tokens contributed? Which were ignored? The vector space is opaque.

Threshold sensitivity. Cosine similarity scores for embeddings occupy a narrower effective range than string similarity scores. Most pairs fall between 0.60 and 0.95. A threshold change from 0.80 to 0.78 can dramatically change your match count in ways that aren’t intuitive.

Head-to-head comparison

Rule-based vs embedding matching
| Dimension | Rule-Based | AI Embeddings | Hybrid |
|---|---|---|---|
| Accuracy (clean data) | 90–95% | 88–93% | 95–98% |
| Accuracy (messy data) | 60–75% | 80–90% | 90–96% |
| Speed (10K pairs) | < 1 second | 5–30 seconds | 2–15 seconds |
| Cost per 10K pairs | ~$0 | $0.02–$0.10 | $0.01–$0.05 |
| Handles semantic equivalence | No | Yes | Yes |
| Interpretable | Fully | Limited | Partially |
| Setup effort | High (rule authoring) | Low (configure model) | Medium |

Accuracy ranges are illustrative. Actual performance depends on data characteristics and configuration quality.

The pattern is clear: rules win on clean, structured data. Embeddings win on messy, variable data. The hybrid approach wins on everything — at the cost of slightly more complexity.

The hybrid approach

The best matching systems aren’t pure rule-based or pure AI. They’re layered pipelines that use each approach where it’s strongest.

Architecture

Stage 1: Rules eliminate obvious non-matches. String pre-filters (prefix match, contains, Jaro-Winkler with a low threshold) and numeric filters (ZIP code range, date proximity) are fast and free. They typically eliminate 70–90% of candidate pairs.

Stage 2: Embeddings score remaining candidates. Generate embeddings only for records that survived Stage 1. This is where the AI cost concentrates — but because pre-filtering removed most candidates, you’re embedding a fraction of the full dataset.

Stage 3: LLM confirms borderline cases. Pairs with embedding similarity in the ambiguous range (say, 0.70–0.85) get sent to a language model with full record context. The LLM can reason about whether two records are the same entity using all available evidence.

This cascade gives you the speed and interpretability of rules on the easy cases, the semantic understanding of embeddings on the hard cases, and the reasoning capability of LLMs on the genuinely ambiguous cases.
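The cascade above can be sketched as a single classifier. The stage-2 and stage-3 functions below are deterministic stand-ins so the example runs as-is — a real pipeline would call an embedding model and an LLM there — and the thresholds are illustrative:

```python
from difflib import SequenceMatcher

def embedding_similarity(a: str, b: str) -> float:
    # Stand-in for cosine similarity between real embedding vectors.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def llm_confirm(a: str, b: str) -> bool:
    # Stand-in for an LLM judging the pair with full record context.
    return a.split()[-1].lower() == b.split()[-1].lower()

def classify_pair(a: str, b: str) -> str:
    # Stage 1: cheap string pre-filter removes obvious non-matches for free.
    if SequenceMatcher(None, a.lower(), b.lower()).ratio() < 0.5:
        return "non-match"
    # Stage 2: embedding similarity decides the clear cases.
    score = embedding_similarity(a, b)
    if score >= 0.85:
        return "match"
    if score < 0.70:
        return "non-match"
    # Stage 3: only the ambiguous 0.70-0.85 band pays for an LLM call.
    return "match" if llm_confirm(a, b) else "non-match"

print(classify_pair("Robert Chen", "Bob Chen"))    # escalates to stage 3
print(classify_pair("Oak Creek LLC", "Jane Doe"))  # rejected at stage 1, no AI cost
```

Note how the expensive stages only see the pairs the cheaper stages could not settle.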

Cost analysis

The layered approach is dramatically cheaper than using AI on everything.

Cost per 10,000 candidate pairs by approach
| Approach | Cost | Notes |
|---|---|---|
| Rules only | $0 | Free, but misses 25–40% of messy matches |
| Hybrid pipeline | $8 | Rules → embeddings → LLM |
| Embeddings on all pairs | $35 | Generates every vector |
| LLM on all pairs | $100 | Every pair reviewed by LLM |

Approximate costs at typical API pricing. Rules cost ~$0. Embeddings ~$0.003/record. LLM confirmation ~$0.005/pair.

The hybrid approach costs roughly 8% of what a full LLM pipeline costs while achieving comparable accuracy. The savings come from not sending obvious non-matches through expensive processing.
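The arithmetic can be sketched from the footnote's unit prices. The 90% filter rate and 25% ambiguous share below are assumptions (chosen in the ranges the article describes), not measured values:

```python
# Back-of-envelope hybrid cost for 10,000 candidate pairs, using the
# per-unit prices quoted above. Filter rate and ambiguous share are assumed.
PAIRS = 10_000
EMBED_PER_RECORD = 0.003   # embedding cost per record
LLM_PER_PAIR = 0.005       # LLM confirmation cost per pair
FILTER_RATE = 0.90         # assumed share of pairs rules eliminate for free
AMBIGUOUS_SHARE = 0.25     # assumed share of survivors in the ambiguous band

surviving = round(PAIRS * (1 - FILTER_RATE))        # pairs past the rule stage
embed_cost = surviving * 2 * EMBED_PER_RECORD       # both records of each pair
llm_cost = round(surviving * AMBIGUOUS_SHARE) * LLM_PER_PAIR
print(f"${embed_cost + llm_cost:.2f}")              # prints $7.25
```

Under these assumptions the hybrid run lands in the same ballpark as the ~$8 figure above, with almost all of the spend concentrated in the embedding stage.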

When to use each approach

Use rules alone when:

  • Your data is well-structured and comes from known, consistent sources
  • The matching fields are short and standardized (product codes, phone numbers, emails)
  • You need deterministic, reproducible results for compliance or auditing
  • Budget is zero and you’re willing to accept lower recall on messy records
  • Latency requirements are sub-second for real-time matching

Use embeddings alone when:

  • Your data is unstructured or highly variable (free-text descriptions, multilingual fields)
  • You’re matching across fundamentally different schemas (one dataset has full addresses, the other has partial addresses with different formatting)
  • You need to match at a semantic level (company names to individual names via ownership)
  • The dataset is small enough that embedding every record is affordable

Use the hybrid approach when:

  • You have a mix of structured and unstructured fields
  • Data quality varies — some records are clean, others are messy
  • You need high accuracy but can’t afford to run AI on every comparison
  • Your dataset is large enough that brute-force AI is cost-prohibitive
  • You want the best achievable result regardless of which technique produces it

Accuracy on real-world messy data

The difference between approaches is most visible on messy data — the kind that comes from merging vendor lists, county records, and CRM exports.

Match recall on a messy 5,000-record dataset
| Approach | Recall | Notes |
|---|---|---|
| Exact match only | 52% | Misses all variations |
| Rules (Jaro-Winkler + token) | 71% | Catches typos and reordering |
| Embeddings only | 84% | Catches semantic equivalence |
| Hybrid pipeline | 93% | Rules + embeddings + LLM |

Recall measured as percentage of known true matches identified. Dataset contains name, address, and company fields with real-world quality issues.

The hybrid pipeline recovers 93% of true matches — 22 percentage points more than rules alone and 9 points more than embeddings alone. The LLM confirmation stage is responsible for most of that final gain, catching cases where even embeddings produce ambiguous scores.

Practical recommendations

If you’re building a matching workflow from scratch, start with the hybrid approach. It’s not more complex to configure — the configuration is just a list of fields with their types and thresholds. The pipeline handles the rest.

If you’re already using rule-based matching and seeing too many misses, add an embedding layer for the fields where rules struggle (names, company names, descriptions). Keep your rules for the fields where they work well (IDs, codes, phone numbers).

If cost is the primary constraint, invest your AI budget where it matters most: embedding the fields that vary semantically, and LLM-confirming only the pairs that fall in the ambiguous score range.


Match Data Studio uses the hybrid architecture by default. String and numeric pre-filters run before any AI operations, keeping costs low. Embeddings and LLM confirmation handle the cases that rules can’t. Get started free →

