The most common tuning question in Match Data Studio is: why am I missing matches I expected, or seeing matches I didn’t expect?

The answer is almost always threshold configuration. This post explains what similarity scores mean and how to tune them.

What is a similarity score?

When you embed a text field and compare two records, Match Data Studio computes a cosine similarity score, which for text embeddings typically falls between 0 and 1:

  • 1.0 — Vectors are identical (exact or very close match)
  • 0.85–0.99 — High similarity (likely matches, minor variation)
  • 0.70–0.84 — Moderate similarity (possible matches, more variation)
  • Below 0.70 — Low similarity (usually not a match)

These thresholds are a starting point — the right values depend heavily on your data.
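To make the score concrete, here is a minimal sketch of cosine similarity in pure Python (the function name and signature are illustrative, not Match Data Studio's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0; orthogonal vectors score 0.0; everything in between maps onto the buckets above.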

Why the same threshold works differently across datasets

Consider two scenarios:

Scenario A: Matching product descriptions
Descriptions are long, rich in keywords, and highly specific. A threshold of 0.80 is usually appropriate: two products with a cosine similarity of 0.80 are very likely the same product.

Scenario B: Matching people’s names
Short fields, few tokens. “Robert Smith” and “Bob Smith” might have a cosine similarity of only 0.65 with text embeddings, because the embedding model doesn’t know they’re the same name. Here you might need string matching (Jaro-Winkler) alongside, or instead of, embeddings.
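For short name fields, a string-similarity metric like Jaro-Winkler often separates matches better than embeddings. A self-contained sketch of the standard algorithm (not Match Data Studio's internal implementation):

```python
def jaro(s1, s2):
    """Jaro similarity: rewards shared characters within a
    sliding window, penalizes transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions among matched characters.
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boosts the Jaro score for a common
    prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Note that “Robert Smith” vs. “Bob Smith” scores noticeably higher under Jaro-Winkler than under a typical embedding comparison, because the shared characters and suffix dominate.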

The AI assistant tries to pick the right strategy for each field, but you may need to tune.

Precision vs recall

Think of threshold tuning as a dial between two extremes:

  • High threshold (e.g., 0.90): More precise — fewer false positives, but you’ll miss real matches where the text varies significantly.
  • Low threshold (e.g., 0.60): Higher recall — catches more real matches, but also produces more false positives that need review.

The right threshold depends on what’s more costly for your use case: missing a real match, or reviewing a false positive.
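The trade-off is easy to see with a small sweep over scored pairs. The scores and labels below are hypothetical, made up for illustration:

```python
# (similarity score, is this a true match?) — hypothetical labeled pairs
scored = [(0.95, True), (0.88, True), (0.82, False),
          (0.74, True), (0.66, False), (0.61, True)]

def precision_recall(pairs, threshold):
    """Precision and recall if we accept every pair at or
    above the threshold."""
    predicted = [(s, y) for s, y in pairs if s >= threshold]
    tp = sum(1 for _, y in predicted if y)
    actual = sum(1 for _, y in pairs if y)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / actual if actual else 1.0
    return precision, recall
```

On this data, a threshold of 0.90 gives perfect precision but misses three of the four real matches; 0.60 recovers all four but lets two false positives through. That gap is what tuning resolves.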

How to tune thresholds in practice

  1. Run a sample with the AI’s suggested configuration
  2. Look at the results in two buckets:
    • Matched pairs: Are any of these wrong? (false positives → raise threshold)
    • Unmatched rows: Pick a few you expected to match. Check what score they got. If they scored just below the threshold, lower it slightly.
  3. Adjust thresholds in the Config view, then run another sample
  4. Repeat until results look right
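Step 2's second bucket — expected matches that scored just under the line — can be pulled out mechanically. A sketch, assuming you have exported pairs with their scores (the data and helper name here are hypothetical):

```python
def near_misses(scored_pairs, threshold, margin=0.05):
    """Unmatched pairs that scored just below the threshold —
    the strongest evidence for lowering it slightly."""
    return [(a, b, s) for a, b, s in scored_pairs
            if threshold - margin <= s < threshold]

# Hypothetical sample-run output: (record A, record B, score)
pairs = [
    ("Acme Corp", "ACME Corporation", 0.78),
    ("Acme Corp", "Apex Ltd", 0.41),
]
```

With a threshold of 0.80, only the first pair lands in the near-miss band; the second is a genuine non-match and should stay unmatched even if you lower the threshold.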

Per-field vs combined thresholds

Match Data Studio lets you set thresholds per field. This is powerful:

  • High threshold on a reliable field (like email or product code) means you don’t need a high threshold on a noisier field (like address)
  • You can weight fields differently — a match on a unique identifier can override a weak match on a name
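Both ideas above can be sketched together: per-field weights plus an override rule for a reliable identifier. The field names, weights, and thresholds below are illustrative assumptions, not Match Data Studio defaults:

```python
# Hypothetical per-field configuration.
FIELDS = {
    "email":   {"weight": 0.6, "threshold": 0.95},
    "name":    {"weight": 0.3, "threshold": 0.75},
    "address": {"weight": 0.1, "threshold": 0.60},
}

def combined_score(field_scores, config=FIELDS):
    """Weighted average of per-field similarity scores."""
    total = sum(cfg["weight"] for cfg in config.values())
    return sum(field_scores[f] * cfg["weight"]
               for f, cfg in config.items()) / total

def is_match(field_scores, config=FIELDS, overall=0.80):
    # A strong match on a unique identifier (here: email)
    # decides the pair outright, even if name/address are weak.
    if field_scores.get("email", 0) >= config["email"]["threshold"]:
        return True
    return combined_score(field_scores, config) >= overall
```

This is why a noisy address field doesn't need a high threshold of its own: a confident email match carries the decision.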

When embeddings aren’t enough

For short, structured fields — names, codes, addresses — embeddings sometimes underperform. In these cases, add a string matching rule for that field alongside the embedding. The pipeline combines scores using configurable weights.
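Combining the two scores for a single field is a weighted blend. A minimal sketch, assuming a tunable `string_weight` knob (the name is hypothetical; the actual configuration keys may differ):

```python
def blended_field_score(embedding_score, string_score, string_weight=0.7):
    """Blend an embedding score with a string-matching score
    for one short, structured field. Weighting the string score
    higher reflects that it is more reliable on names and codes."""
    return string_weight * string_score + (1 - string_weight) * embedding_score
```

For the “Robert Smith” / “Bob Smith” case, a weak embedding score paired with a strong string score yields a blended score well above either threshold band alone.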


The goal is a configuration that catches every real match with minimal manual review. A few sample runs are usually enough to dial it in.

