How to measure matching quality: precision, recall, and F1 for data teams
Number of matches is not a quality metric. Learn how to measure precision, recall, and F1 score for data matching — including practical sampling methods when you don't have labeled data.
You ran a matching job and it returned 4,327 matched pairs. Is that good? Is that bad? You have no way to know from that number alone. It could mean your pipeline found nearly every true match in your data with high accuracy. Or it could mean half of those pairs are wrong and another 2,000 true matches were missed entirely.
The number of matches is not a quality metric. To understand whether your matching is actually working, you need precision, recall, and F1 score. These three numbers tell you what your match count never can: how much of what you found is correct, and how much of what exists you actually found.
Why “number of matches” is not a quality metric
When teams evaluate matching results, the first instinct is to look at the match count. This is natural but misleading. A high match count could mean your thresholds are too loose (lots of false positives inflating the number). A low match count could mean your thresholds are too strict (missing real matches). The count tells you the size of the output, not the quality.
Consider two matching runs on the same data:
Run A: 5,000 matched pairs. Of those, 4,200 are correct and 800 are wrong. Meanwhile, 600 true matches were missed.
Run B: 3,800 matched pairs. Of those, 3,700 are correct and 100 are wrong. Meanwhile, 1,100 true matches were missed.
Run A found more matches, but nearly 1 in 6 are wrong. Run B found fewer matches, but almost all of them are correct. Which is better? It depends entirely on what you are using the matches for. And you cannot make that judgment without knowing the precision and recall of each run.
This is why data teams need formal quality metrics. They separate signal from noise and give you a basis for tuning your pipeline.
Precision: what percentage of your matches are correct
Precision answers a simple question: of all the pairs my system called a match, how many are actually correct?
Precision = True Positives / (True Positives + False Positives)
A true positive is a matched pair that genuinely refers to the same entity. A false positive is a matched pair that your system returned but that is actually two different entities.
If your matching produces 1,000 pairs and 920 of them are real matches, your precision is 92 percent. The other 80 pairs are noise — records your system incorrectly linked.
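The calculation is a one-liner. Here is a minimal sketch (the function name and the example counts from the paragraph above are illustrative, not from any particular library):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of returned matched pairs that are genuinely correct."""
    return true_positives / (true_positives + false_positives)

# The example above: 1,000 returned pairs, of which 920 are real matches.
print(precision(920, 80))  # 0.92
```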
Why precision matters. Every false positive has a cost. In customer deduplication, a false positive merges two different customers into one record, corrupting both. In financial reconciliation, a false positive marks a transaction as settled when it is not. In compliance matching, a false positive triggers an investigation for the wrong person.
The cost of false positives determines how much precision you need. For automated downstream processes where errors propagate silently, you need precision above 95 percent. For workflows where a human reviews every match, 85 percent precision may be acceptable because the reviewer catches the mistakes.
| Use Case | Target Precision | False Positive Cost | Typical Review Process |
|---|---|---|---|
| Financial reconciliation | 97%+ | Incorrect settlement, audit risk | Automated, no review |
| Compliance screening | 95%+ | Unnecessary investigation, legal cost | Analyst review of flagged pairs |
| CRM deduplication | 90%+ | Merged wrong contacts, lost data | Batch review before merge |
| Marketing list merge | 85%+ | Duplicate mailings, wasted spend | Spot check sample |
| Exploratory analysis | 75%+ | Wrong conclusions from dirty joins | Manual review of all results |
Targets are guidelines. The right threshold depends on the cost of errors in your specific workflow.
Recall: what percentage of true matches did you find
Recall answers the opposite question: of all the true matches that exist in my data, how many did my system find?
Recall = True Positives / (True Positives + False Negatives)
A false negative is a true match that your system missed — two records that refer to the same entity but were not linked.
If there are 1,200 true matches in your data and your system found 960 of them, your recall is 80 percent. The other 240 true matches were missed.
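As a sketch, mirroring the precision formula (again with illustrative numbers from the paragraph above):

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all true matches in the data that the system found."""
    return true_positives / (true_positives + false_negatives)

# The example above: 1,200 true matches exist, 960 were found, 240 were missed.
print(recall(960, 240))  # 0.8
```

Note that the denominator requires the count of matches that exist, not just the count the system returned, which is exactly why recall needs estimation in practice.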
Why recall matters. Every missed match has a cost too. In deduplication, missed matches mean duplicate records persist. In customer matching, missed matches mean you cannot build a unified view of a customer who appears in multiple systems. In fraud detection, missed matches mean a known bad actor slips through because the system did not link their aliases.
Recall is harder to measure than precision because you need to know the total number of true matches — not just the ones your system found. In most real datasets, this number is unknown, which is why estimation methods (covered below) are essential.
The precision-recall tradeoff
Precision and recall are inherently in tension. When you adjust your matching thresholds, you move along a curve: tighter thresholds improve precision but reduce recall. Looser thresholds improve recall but reduce precision.
This is not a flaw in your system. It is a fundamental property of any classification task. The question is not “how do I get both to 100 percent” but “where on the curve does my use case need to be?”
For example, suppose you score every candidate pair and sweep the decision threshold on a labeled sample. At a threshold of 0.60, the system finds 96 percent of true matches (high recall) but only 72 percent of returned pairs are correct (low precision). At 0.90, precision jumps to 98 percent but recall drops to 42 percent — more than half of true matches are missed.
The sweet spot for most production use cases is in the 0.75 to 0.85 range, where both precision and recall are reasonably high. But the optimal threshold depends on which error is more expensive in your context.
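You can trace this curve yourself on any labeled sample. The sketch below assumes you have candidate pairs with a similarity score and a known label; the function name and toy data are made up for illustration:

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (similarity_score, is_true_match) tuples.
    Returns {threshold: (precision, recall)} computed on labeled data."""
    total_true = sum(1 for _, label in scored_pairs if label)
    curve = {}
    for t in thresholds:
        predicted = [label for score, label in scored_pairs if score >= t]
        tp = sum(predicted)
        p = tp / len(predicted) if predicted else 1.0
        r = tp / total_true if total_true else 0.0
        curve[t] = (p, r)
    return curve

# Toy labeled sample: (score, is_true_match)
pairs = [(0.95, True), (0.88, True), (0.82, False), (0.78, True),
         (0.71, False), (0.65, True), (0.60, False)]
for t, (p, r) in sweep_thresholds(pairs, [0.60, 0.75, 0.90]).items():
    print(f"threshold {t:.2f}: precision={p:.2f} recall={r:.2f}")
```

Even on this tiny sample the tradeoff is visible: the 0.60 threshold catches every true match but lets in all three false pairs, while the 0.90 threshold is perfectly precise but finds only one match in four.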
Precision-critical use cases. Financial reconciliation, compliance, automated merges where errors are hard to reverse. Err toward higher thresholds. Accept lower recall and handle missed matches through manual review or a second-pass process.
Recall-critical use cases. Fraud detection, customer 360 consolidation, medical record linkage where missing a link could have serious consequences. Err toward lower thresholds. Accept more false positives and add a review step to filter them.
Balanced use cases. Marketing deduplication, vendor matching, general data consolidation. Aim for the threshold that maximizes the F1 score.
How to estimate quality without a fully labeled dataset
In theory, measuring precision and recall requires knowing the ground truth — which pairs are true matches and which are not. In practice, you almost never have this. Labeling every pair in a dataset of 10,000 × 10,000 records is 100 million judgments. Nobody is doing that.
The solution is sampling. Two targeted samples give you statistically reliable estimates of both metrics.
Estimating precision: sample your matches
Take a random sample of 200 to 500 matched pairs from your results. Have a domain expert review each pair and label it as “correct match” or “incorrect match.” Count the correct ones. Divide by the sample size.
If 185 out of 200 sampled pairs are correct, your estimated precision is 92.5 percent. With a sample of 200, the 95 percent confidence interval is roughly plus or minus 3.5 percentage points, meaning your true precision is likely between 89 and 96 percent. Increasing the sample to 500 narrows the interval to roughly plus or minus 2.3 percentage points.
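The interval above comes from the standard normal approximation for a proportion. A minimal sketch (reasonable when the sample is a few hundred pairs and the estimate is not near 0 or 100 percent; for small samples a Wilson interval is safer):

```python
import math

def precision_interval(correct: int, sample_size: int, z: float = 1.96):
    """Point estimate and 95% margin of error for sampled precision,
    using the normal approximation for a binomial proportion."""
    p = correct / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

p, m = precision_interval(185, 200)
print(f"precision ~ {p:.3f} +/- {m:.3f}")  # ~ 0.925 +/- 0.037

p, m = precision_interval(463, 500)   # same rate, larger sample
print(f"precision ~ {p:.3f} +/- {m:.3f}")  # noticeably narrower
```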
Stratified sampling improves accuracy. Instead of a purely random sample, stratify by similarity score. Take equal numbers of pairs from the high-confidence range (0.90+), the medium range (0.75-0.90), and the borderline range (near your threshold). This ensures you measure precision where it matters most — at the decision boundary.
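A stratified draw is a few lines of standard-library Python. The band boundaries below follow the ranges in the paragraph above; the function name is illustrative:

```python
import random

def stratified_sample(matched_pairs, per_stratum=70, seed=42):
    """matched_pairs: list of (score, pair) tuples from a matching run.
    Draws an equal-size review sample from each similarity band."""
    bands = {
        "high (0.90+)":       [p for p in matched_pairs if p[0] >= 0.90],
        "medium (0.75-0.90)": [p for p in matched_pairs if 0.75 <= p[0] < 0.90],
        "borderline (<0.75)": [p for p in matched_pairs if p[0] < 0.75],
    }
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {name: rng.sample(pairs, min(per_stratum, len(pairs)))
            for name, pairs in bands.items()}
```

One caveat: because the strata are deliberately not proportional to the output, an overall precision estimate from a stratified sample should reweight each band's precision by that band's share of the full result set.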
Estimating recall: sample your non-matches
This is harder, but essential. Take a random sample of 100 to 200 records from dataset A that were not matched. For each one, have a domain expert manually search dataset B for a true match. If the expert finds matches that your system missed, those are false negatives.
If the expert finds true matches for 12 out of 100 unmatched records, your estimated false negative rate among unmatched records is 12 percent. Combined with your match count, you can estimate total true matches and compute recall.
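Putting the two samples together, the arithmetic looks like this. The sketch assumes each missed record corresponds to one missed match (a record with several missed matches would push the true false negative count higher); all counts are hypothetical:

```python
def estimate_recall(match_count, precision_est, unmatched_count, fn_rate_est):
    """Estimate recall from two review samples:
    precision_est -- fraction of sampled matched pairs judged correct
    fn_rate_est   -- fraction of sampled unmatched records with a missed match
    """
    tp_est = match_count * precision_est      # estimated true positives
    fn_est = unmatched_count * fn_rate_est    # estimated false negatives
    return tp_est / (tp_est + fn_est)

# Hypothetical run: 4,327 matches at ~92.5% sampled precision,
# 5,000 unmatched records with a 12% missed-match rate in the sample.
print(round(estimate_recall(4327, 0.925, 5000, 0.12), 3))  # roughly 0.87
```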
| What You Measure | Sample Source | Sample Size | Who Reviews | Time Estimate |
|---|---|---|---|---|
| Precision | Matched pairs | 200–500 | Domain expert | 2–4 hours |
| Recall (missed matches) | Unmatched records from A | 100–200 | Domain expert | 4–8 hours |
| Threshold sensitivity | Pairs near threshold boundary | 100–200 | Domain expert | 2–3 hours |
| Edge case analysis | Lowest-scoring matches | 50–100 | Domain expert | 1–2 hours |
Time estimates assume an expert familiar with the data. First-time review takes longer.
The total investment is 10 to 15 hours of expert time. That is substantial, but it gives you reliable quality metrics for a dataset that might contain millions of records. Without this investment, you are flying blind.
When to skip sampling
If your matching job is small enough that a human can review all the results — say, under 500 matched pairs — skip sampling and do a full review. You get exact precision for free. Recall still requires checking the unmatched records, but a full precision review is always better than a sampled one when it is feasible.
Using F1 score to track matching improvement over iterations
F1 score is the harmonic mean of precision and recall. It gives a single number that balances both metrics.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 ranges from 0 to 1. It is high only when both precision and recall are high. An F1 of 0.90 means you are achieving strong performance on both fronts. An F1 of 0.70 means at least one metric is dragging down the other.
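In code, with a guard for the degenerate case where both metrics are zero:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.84, 0.78), 3))  # 0.809
```

Because the harmonic mean is dominated by the smaller value, f1(0.99, 0.40) is only about 0.57 — a lopsided pipeline cannot hide behind one strong metric.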
The real value of F1 is tracking improvement across iterations. When you change a threshold, add a matching field, or switch algorithms, F1 tells you whether the change helped or hurt overall quality.
| Iteration | Change Made | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Baseline | Default config, threshold 0.75 | 84% | 78% | 0.809 |
| v2 | Added phone number as matching field | 86% | 83% | 0.845 |
| v3 | Lowered embedding threshold to 0.70 | 81% | 89% | 0.848 |
| v4 | Added LLM confirmation for borderline pairs | 91% | 88% | 0.895 |
| v5 | Normalized addresses before matching | 93% | 90% | 0.915 |
Each iteration builds on the previous one. F1 captures the net effect of each change.
This table tells a story. The baseline configuration achieved F1 of 0.809. Adding phone numbers helped both metrics. Lowering the embedding threshold improved recall significantly but dropped precision — yet F1 still went up slightly because the recall gain outweighed the precision loss. Adding LLM confirmation was the biggest single improvement, boosting precision without sacrificing recall. Address normalization pushed both metrics higher.
Without F1 tracking, iteration v3 might have looked like a regression because precision dropped. F1 shows it was actually a net positive.
Target F1 scores by use case. Financial and compliance matching should target F1 above 0.95. Customer deduplication and CRM consolidation should aim for F1 above 0.90. Marketing and analytical use cases can often accept F1 of 0.85 or above. Exploratory matching, where all results will be reviewed manually, can work with F1 of 0.75 or above.
The important thing is to pick a target, measure against it, and iterate until you hit it. Matching quality is not a one-shot exercise. It is an iterative process of tuning thresholds, adding fields, improving normalization, and measuring the impact of each change.
Building a quality-first matching workflow
Putting these concepts into practice requires a structured workflow. Here is the sequence that produces reliable results:
Step 1: Run the initial match. Use your best guess at configuration — field selection, thresholds, algorithm choice. Get a first set of results.
Step 2: Sample and measure. Pull 200 matched pairs for precision review. Pull 100 unmatched records for recall estimation. Label them.
Step 3: Compute baselines. Calculate precision, recall, and F1. These are your starting numbers.
Step 4: Diagnose. If precision is low, examine the false positives. What made the system think they were matches? Common causes: threshold too low, irrelevant field contributing noise, missing normalization. If recall is low, examine the false negatives. Why were they missed? Common causes: threshold too high, missing blocking key, data quality issue in a key field.
Step 5: Iterate. Make one change at a time. Re-run. Re-sample. Recompute F1. If F1 improved, keep the change. If not, revert.
Step 6: Validate at scale. Once F1 meets your target on the sample, run the full dataset. Spot-check the results but do not re-label the entire output. Your sample-based estimate should hold.
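The measure-and-iterate loop in steps 2 through 5 can be wrapped in a small evaluation harness so every iteration is scored the same way. This is a sketch with hypothetical counts, not a prescribed tool; it combines the sampled-precision and sampled-recall estimates described earlier:

```python
def evaluate(labeled_matches, labeled_unmatched, match_count, unmatched_count):
    """Estimate precision, recall, and F1 from two review samples.

    labeled_matches   -- list of bools, True = reviewed pair is a correct match
    labeled_unmatched -- list of bools, True = reviewer found a missed match
    """
    precision = sum(labeled_matches) / len(labeled_matches)
    fn_rate = sum(labeled_unmatched) / len(labeled_unmatched)
    tp_est = match_count * precision        # estimated true positives
    fn_est = unmatched_count * fn_rate      # estimated false negatives
    recall = tp_est / (tp_est + fn_est)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Step 5 in practice: keep a change only if F1 beats the previous iteration.
baseline = evaluate([True] * 185 + [False] * 15,   # 185/200 matches correct
                    [True] * 12 + [False] * 88,    # 12/100 missed matches found
                    match_count=4327, unmatched_count=5000)
print({k: round(v, 3) for k, v in baseline.items()})
```

Logging this dictionary per iteration gives you the improvement table from the previous section for free.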
This workflow takes more time upfront than just running a match and hoping for the best. But it produces results you can trust and explain to stakeholders. When someone asks “how accurate is this matching?” you have a real answer backed by data, not a guess.
Match Data Studio gives you sample results and confidence scores after every run, so you can evaluate precision without building your own review workflow. Configure your pipeline, inspect the results, tune your thresholds, and iterate until quality meets your target. Get started free →
Keep reading
- Understanding similarity thresholds — how threshold tuning directly controls your precision-recall balance
- Five matching mistakes that silently ruin your results — the errors that hurt quality before you even start measuring
- How to choose the right matching algorithm — algorithm selection is the foundation that quality metrics build on