The most common tuning question in Match Data Studio is: why am I missing matches I expected, or seeing matches I didn’t expect?

The answer is almost always threshold configuration. This post explains what similarity scores mean and how to tune them.

What is a similarity score?

When you embed a text field and compare two records, Match Data Studio computes a cosine similarity score, which for text embeddings typically falls between 0 and 1:

  • 1.0 — Vectors are identical (exact or very close match)
  • 0.85–0.99 — High similarity (likely matches, minor variation)
  • 0.70–0.84 — Moderate similarity (possible matches, more variation)
  • Below 0.70 — Low similarity (usually not a match)

These thresholds are a starting point — the right values depend heavily on your data.
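To make the score concrete, here is a minimal sketch of cosine similarity in pure Python (the function name and signature are illustrative, not Match Data Studio's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0; orthogonal vectors score 0.0; everything in between maps onto the buckets above.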

Why the same threshold works differently across datasets

Consider two scenarios:

Scenario A: Matching product descriptions
Descriptions are long, rich in keywords, and highly specific. A threshold of 0.80 is usually appropriate: two products with a cosine similarity of 0.80 are very likely the same product.

Scenario B: Matching people’s names
Short fields, few tokens. “Robert Smith” and “Bob Smith” might have a cosine similarity of only 0.65 with text embeddings, because the embedding model doesn’t know they’re the same name. Here you might need string matching (Jaro-Winkler) alongside, or instead of, embeddings.
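For short name fields, a string-similarity metric like Jaro-Winkler often separates matches better than embeddings. A self-contained sketch of the standard algorithm (not Match Data Studio's internal implementation):

```python
def jaro(s1, s2):
    """Jaro similarity: rewards shared characters within a
    sliding window, penalizes transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions among matched characters.
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boosts the Jaro score for a common
    prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Note that “Robert Smith” vs. “Bob Smith” scores noticeably higher under Jaro-Winkler than under a typical embedding comparison, because the shared characters and suffix dominate.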

The AI assistant tries to pick the right strategy for each field, but you may need to tune.

Precision vs recall

Think of threshold tuning as a dial between two extremes:

  • High threshold (e.g., 0.90): More precise — fewer false positives, but you’ll miss real matches where the text varies significantly.
  • Low threshold (e.g., 0.60): Higher recall — catches more real matches, but also produces more false positives that need review.

The right threshold depends on what’s more costly for your use case: missing a real match, or reviewing a false positive.
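The trade-off is easy to see with a small sweep over scored pairs. The scores and labels below are hypothetical, made up for illustration:

```python
# (similarity score, is this a true match?) — hypothetical labeled pairs
scored = [(0.95, True), (0.88, True), (0.82, False),
          (0.74, True), (0.66, False), (0.61, True)]

def precision_recall(pairs, threshold):
    """Precision and recall if we accept every pair at or
    above the threshold."""
    predicted = [(s, y) for s, y in pairs if s >= threshold]
    tp = sum(1 for _, y in predicted if y)
    actual = sum(1 for _, y in pairs if y)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / actual if actual else 1.0
    return precision, recall
```

On this data, a threshold of 0.90 gives perfect precision but misses three of the four real matches; 0.60 recovers all four but lets two false positives through. That gap is what tuning resolves.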

How to tune thresholds in practice

  1. Run a sample with the AI’s suggested configuration
  2. Look at the results in two buckets:
    • Matched pairs: Are any of these wrong? (false positives → raise threshold)
    • Unmatched rows: Pick a few you expected to match. Check what score they got. If they scored just below the threshold, lower it slightly.
  3. Adjust thresholds in the Config view, then run another sample
  4. Repeat until results look right
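Step 2's second bucket — expected matches that scored just under the line — can be pulled out mechanically. A sketch, assuming you have exported pairs with their scores (the data and helper name here are hypothetical):

```python
def near_misses(scored_pairs, threshold, margin=0.05):
    """Unmatched pairs that scored just below the threshold —
    the strongest evidence for lowering it slightly."""
    return [(a, b, s) for a, b, s in scored_pairs
            if threshold - margin <= s < threshold]

# Hypothetical sample-run output: (record A, record B, score)
pairs = [
    ("Acme Corp", "ACME Corporation", 0.78),
    ("Acme Corp", "Apex Ltd", 0.41),
]
```

With a threshold of 0.80, only the first pair lands in the near-miss band; the second is a genuine non-match and should stay unmatched even if you lower the threshold.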

Per-field vs combined thresholds

Match Data Studio lets you set thresholds per field. This is powerful:

  • High threshold on a reliable field (like email or product code) means you don’t need a high threshold on a noisier field (like address)
  • You can weight fields differently — a match on a unique identifier can override a weak match on a name
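Both ideas above can be sketched together: per-field weights plus an override rule for a reliable identifier. The field names, weights, and thresholds below are illustrative assumptions, not Match Data Studio defaults:

```python
# Hypothetical per-field configuration.
FIELDS = {
    "email":   {"weight": 0.6, "threshold": 0.95},
    "name":    {"weight": 0.3, "threshold": 0.75},
    "address": {"weight": 0.1, "threshold": 0.60},
}

def combined_score(field_scores, config=FIELDS):
    """Weighted average of per-field similarity scores."""
    total = sum(cfg["weight"] for cfg in config.values())
    return sum(field_scores[f] * cfg["weight"]
               for f, cfg in config.items()) / total

def is_match(field_scores, config=FIELDS, overall=0.80):
    # A strong match on a unique identifier (here: email)
    # decides the pair outright, even if name/address are weak.
    if field_scores.get("email", 0) >= config["email"]["threshold"]:
        return True
    return combined_score(field_scores, config) >= overall
```

This is why a noisy address field doesn't need a high threshold of its own: a confident email match carries the decision.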

When embeddings aren’t enough

For short, structured fields — names, codes, addresses — embeddings sometimes underperform. In these cases, add a string matching rule for that field alongside the embedding. The pipeline combines scores using configurable weights.
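Combining the two scores for a single field is a weighted blend. A minimal sketch, assuming a tunable `string_weight` knob (the name is hypothetical; the actual configuration keys may differ):

```python
def blended_field_score(embedding_score, string_score, string_weight=0.7):
    """Blend an embedding score with a string-matching score
    for one short, structured field. Weighting the string score
    higher reflects that it is more reliable on names and codes."""
    return string_weight * string_score + (1 - string_weight) * embedding_score
```

For the “Robert Smith” / “Bob Smith” case, a weak embedding score paired with a strong string score yields a blended score well above either threshold band alone.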


The goal is a configuration that catches every real match with minimal manual review. A few sample runs are usually enough to dial it in.

