How to measure matching quality: precision, recall, and F1 for data teams
Number of matches is not a quality metric. Learn how to measure precision, recall, and F1 score for data matching — including practical sampling methods when you don't have labeled data.
You ran a matching job and it returned 4,327 matched pairs. Is that good? Is that bad? You have no way to know from that number alone. It could mean your pipeline found nearly every true match in your data with high accuracy. Or it could mean half of those pairs are wrong and another 2,000 true matches were missed entirely.
The number of matches is not a quality metric. To understand whether your matching is actually working, you need precision, recall, and F1 score. These three numbers tell you what your match count never can: how much of what you found is correct, and how much of what exists you actually found.
Why “number of matches” is not a quality metric
When teams evaluate matching results, the first instinct is to look at the match count. This is natural but misleading. A high match count could mean your thresholds are too loose (lots of false positives inflating the number). A low match count could mean your thresholds are too strict (missing real matches). The count tells you the size of the output, not the quality.
Consider two matching runs on the same data:
Run A: 5,000 matched pairs. Of those, 4,200 are correct and 800 are wrong. Meanwhile, 600 true matches were missed.
Run B: 3,800 matched pairs. Of those, 3,700 are correct and 100 are wrong. Meanwhile, 1,100 true matches were missed.
Run A found more matches, but nearly 1 in 6 are wrong. Run B found fewer matches, but almost all of them are correct. Which is better? It depends entirely on what you are using the matches for. And you cannot make that judgment without knowing the precision and recall of each run.
This is why data teams need formal quality metrics. They separate signal from noise and give you a basis for tuning your pipeline.
Precision: what percentage of your matches are correct
Precision answers a simple question: of all the pairs my system called a match, how many are actually correct?
Precision = True Positives / (True Positives + False Positives)
A true positive is a matched pair that genuinely refers to the same entity. A false positive is a matched pair that your system returned but that is actually two different entities.
If your matching produces 1,000 pairs and 920 of them are real matches, your precision is 92 percent. The other 80 pairs are noise — records your system incorrectly linked.
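The calculation is a one-liner. Here is a minimal sketch (the function name and the example counts from the paragraph above are illustrative, not from any particular library):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of returned matched pairs that are genuinely correct."""
    return true_positives / (true_positives + false_positives)

# The example above: 1,000 returned pairs, of which 920 are real matches.
print(precision(920, 80))  # 0.92
```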
Why precision matters. Every false positive has a cost. In customer deduplication, a false positive merges two different customers into one record, corrupting both. In financial reconciliation, a false positive marks a transaction as settled when it is not. In compliance matching, a false positive triggers an investigation for the wrong person.
The cost of false positives determines how much precision you need. For automated downstream processes where errors propagate silently, you need precision above 95 percent. For workflows where a human reviews every match, 85 percent precision may be acceptable because the reviewer catches the mistakes.
| Use Case | Target Precision | False Positive Cost | Typical Review Process |
|---|---|---|---|
| Financial reconciliation | 97%+ | Incorrect settlement, audit risk | Automated, no review |
| Compliance screening | 95%+ | Unnecessary investigation, legal cost | Analyst review of flagged pairs |
| CRM deduplication | 90%+ | Merged wrong contacts, lost data | Batch review before merge |
| Marketing list merge | 85%+ | Duplicate mailings, wasted spend | Spot check sample |
| Exploratory analysis | 75%+ | Wrong conclusions from dirty joins | Manual review of all results |
Targets are guidelines. The right threshold depends on the cost of errors in your specific workflow.
Recall: what percentage of true matches did you find
Recall answers the opposite question: of all the true matches that exist in my data, how many did my system find?
Recall = True Positives / (True Positives + False Negatives)
A false negative is a true match that your system missed — two records that refer to the same entity but were not linked.
If there are 1,200 true matches in your data and your system found 960 of them, your recall is 80 percent. The other 240 true matches were missed.
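As a sketch, mirroring the precision formula (again with illustrative numbers from the paragraph above):

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all true matches in the data that the system found."""
    return true_positives / (true_positives + false_negatives)

# The example above: 1,200 true matches exist, 960 were found, 240 were missed.
print(recall(960, 240))  # 0.8
```

Note that the denominator requires the count of matches that exist, not just the count the system returned, which is exactly why recall needs estimation in practice.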
Why recall matters. Every missed match has a cost too. In deduplication, missed matches mean duplicate records persist. In customer matching, missed matches mean you cannot build a unified view of a customer who appears in multiple systems. In fraud detection, missed matches mean a known bad actor slips through because the system did not link their aliases.
Recall is harder to measure than precision because you need to know the total number of true matches — not just the ones your system found. In most real datasets, this number is unknown, which is why estimation methods (covered below) are essential.
The precision-recall tradeoff
Precision and recall are inherently in tension. When you adjust your matching thresholds, you move along a curve: tighter thresholds improve precision but reduce recall. Looser thresholds improve recall but reduce precision.
This is not a flaw in your system. It is a fundamental property of any classification task. The question is not “how do I get both to 100 percent” but “where on the curve does my use case need to be?”
For example, suppose you score every candidate pair and sweep the decision threshold on a labeled sample. At a threshold of 0.60, the system finds 96 percent of true matches (high recall) but only 72 percent of returned pairs are correct (low precision). At 0.90, precision jumps to 98 percent but recall drops to 42 percent — more than half of true matches are missed.
The sweet spot for most production use cases is in the 0.75 to 0.85 range, where both precision and recall are reasonably high. But the optimal threshold depends on which error is more expensive in your context.
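You can trace this curve yourself on any labeled sample. The sketch below assumes you have candidate pairs with a similarity score and a known label; the function name and toy data are made up for illustration:

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (similarity_score, is_true_match) tuples.
    Returns {threshold: (precision, recall)} computed on labeled data."""
    total_true = sum(1 for _, label in scored_pairs if label)
    curve = {}
    for t in thresholds:
        predicted = [label for score, label in scored_pairs if score >= t]
        tp = sum(predicted)
        p = tp / len(predicted) if predicted else 1.0
        r = tp / total_true if total_true else 0.0
        curve[t] = (p, r)
    return curve

# Toy labeled sample: (score, is_true_match)
pairs = [(0.95, True), (0.88, True), (0.82, False), (0.78, True),
         (0.71, False), (0.65, True), (0.60, False)]
for t, (p, r) in sweep_thresholds(pairs, [0.60, 0.75, 0.90]).items():
    print(f"threshold {t:.2f}: precision={p:.2f} recall={r:.2f}")
```

Even on this tiny sample the tradeoff is visible: the 0.60 threshold catches every true match but lets in all three false pairs, while the 0.90 threshold is perfectly precise but finds only one match in four.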
Precision-critical use cases. Financial reconciliation, compliance, automated merges where errors are hard to reverse. Err toward higher thresholds. Accept lower recall and handle missed matches through manual review or a second-pass process.
Recall-critical use cases. Fraud detection, customer 360 consolidation, medical record linkage where missing a link could have serious consequences. Err toward lower thresholds. Accept more false positives and add a review step to filter them.
Balanced use cases. Marketing deduplication, vendor matching, general data consolidation. Aim for the threshold that maximizes the F1 score.
How to estimate quality without a fully labeled dataset
In theory, measuring precision and recall requires knowing the ground truth — which pairs are true matches and which are not. In practice, you almost never have this. Labeling every pair in a dataset of 10,000 × 10,000 records is 100 million judgments. Nobody is doing that.
The solution is sampling. Two targeted samples give you statistically reliable estimates of both metrics.
Estimating precision: sample your matches
Take a random sample of 200 to 500 matched pairs from your results. Have a domain expert review each pair and label it as “correct match” or “incorrect match.” Count the correct ones. Divide by the sample size.
If 185 out of 200 sampled pairs are correct, your estimated precision is 92.5 percent. With a sample of 200, the 95 percent confidence interval is roughly plus or minus 3.5 percentage points, meaning your true precision is likely between 89 and 96 percent. Increasing the sample to 500 narrows the interval to roughly plus or minus 2.3 percentage points.
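The interval above comes from the standard normal approximation for a proportion. A minimal sketch (reasonable when the sample is a few hundred pairs and the estimate is not near 0 or 100 percent; for small samples a Wilson interval is safer):

```python
import math

def precision_interval(correct: int, sample_size: int, z: float = 1.96):
    """Point estimate and 95% margin of error for sampled precision,
    using the normal approximation for a binomial proportion."""
    p = correct / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

p, m = precision_interval(185, 200)
print(f"precision ~ {p:.3f} +/- {m:.3f}")  # ~ 0.925 +/- 0.037

p, m = precision_interval(463, 500)   # same rate, larger sample
print(f"precision ~ {p:.3f} +/- {m:.3f}")  # noticeably narrower
```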
Stratified sampling improves accuracy. Instead of a purely random sample, stratify by similarity score. Take equal numbers of pairs from the high-confidence range (0.90+), the medium range (0.75-0.90), and the borderline range (near your threshold). This ensures you measure precision where it matters most — at the decision boundary.
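A stratified draw is a few lines of standard-library Python. The band boundaries below follow the ranges in the paragraph above; the function name is illustrative:

```python
import random

def stratified_sample(matched_pairs, per_stratum=70, seed=42):
    """matched_pairs: list of (score, pair) tuples from a matching run.
    Draws an equal-size review sample from each similarity band."""
    bands = {
        "high (0.90+)":       [p for p in matched_pairs if p[0] >= 0.90],
        "medium (0.75-0.90)": [p for p in matched_pairs if 0.75 <= p[0] < 0.90],
        "borderline (<0.75)": [p for p in matched_pairs if p[0] < 0.75],
    }
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {name: rng.sample(pairs, min(per_stratum, len(pairs)))
            for name, pairs in bands.items()}
```

One caveat: because the strata are deliberately not proportional to the output, an overall precision estimate from a stratified sample should reweight each band's precision by that band's share of the full result set.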
Estimating recall: sample your non-matches
This is harder, but essential. Take a random sample of 100 to 200 records from dataset A that were not matched. For each one, have a domain expert manually search dataset B for a true match. If the expert finds matches that your system missed, those are false negatives.
If the expert finds true matches for 12 out of 100 unmatched records, your estimated false negative rate among unmatched records is 12 percent. Combined with your match count, you can estimate total true matches and compute recall.
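Putting the two samples together, the arithmetic looks like this. The sketch assumes each missed record corresponds to one missed match (a record with several missed matches would push the true false negative count higher); all counts are hypothetical:

```python
def estimate_recall(match_count, precision_est, unmatched_count, fn_rate_est):
    """Estimate recall from two review samples:
    precision_est -- fraction of sampled matched pairs judged correct
    fn_rate_est   -- fraction of sampled unmatched records with a missed match
    """
    tp_est = match_count * precision_est      # estimated true positives
    fn_est = unmatched_count * fn_rate_est    # estimated false negatives
    return tp_est / (tp_est + fn_est)

# Hypothetical run: 4,327 matches at ~92.5% sampled precision,
# 5,000 unmatched records with a 12% missed-match rate in the sample.
print(round(estimate_recall(4327, 0.925, 5000, 0.12), 3))  # roughly 0.87
```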
| What You Measure | Sample Source | Sample Size | Who Reviews | Time Estimate |
|---|---|---|---|---|
| Precision | Matched pairs | 200–500 | Domain expert | 2–4 hours |
| Recall (missed matches) | Unmatched records from A | 100–200 | Domain expert | 4–8 hours |
| Threshold sensitivity | Pairs near threshold boundary | 100–200 | Domain expert | 2–3 hours |
| Edge case analysis | Lowest-scoring matches | 50–100 | Domain expert | 1–2 hours |
Time estimates assume an expert familiar with the data. First-time review takes longer.
The total investment is 10 to 15 hours of expert time. That is substantial, but it gives you reliable quality metrics for a dataset that might contain millions of records. Without this investment, you are flying blind.
When to skip sampling
If your matching job is small enough that a human can review all the results — say, under 500 matched pairs — skip sampling and do a full review. You get exact precision for free. Recall still requires checking the unmatched records, but a full precision review is always better than a sampled one when it is feasible.
Using F1 score to track matching improvement over iterations
F1 score is the harmonic mean of precision and recall. It gives a single number that balances both metrics.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 ranges from 0 to 1. It is high only when both precision and recall are high. An F1 of 0.90 means you are achieving strong performance on both fronts. An F1 of 0.70 means at least one metric is dragging down the other.
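In code, with a guard for the degenerate case where both metrics are zero:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.84, 0.78), 3))  # 0.809
```

Because the harmonic mean is dominated by the smaller value, f1(0.99, 0.40) is only about 0.57 — a lopsided pipeline cannot hide behind one strong metric.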
The real value of F1 is tracking improvement across iterations. When you change a threshold, add a matching field, or switch algorithms, F1 tells you whether the change helped or hurt overall quality.
| Iteration | Change Made | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Baseline | Default config, threshold 0.75 | 84% | 78% | 0.809 |
| v2 | Added phone number as matching field | 86% | 83% | 0.845 |
| v3 | Lowered embedding threshold to 0.70 | 81% | 89% | 0.848 |
| v4 | Added LLM confirmation for borderline pairs | 91% | 88% | 0.895 |
| v5 | Normalized addresses before matching | 93% | 90% | 0.915 |
Each iteration builds on the previous one. F1 captures the net effect of each change.
This table tells a story. The baseline configuration achieved F1 of 0.809. Adding phone numbers helped both metrics. Lowering the embedding threshold improved recall significantly but dropped precision — yet F1 still went up slightly because the recall gain outweighed the precision loss. Adding LLM confirmation was the biggest single improvement, boosting precision without sacrificing recall. Address normalization pushed both metrics higher.
Without F1 tracking, iteration v3 might have looked like a regression because precision dropped. F1 shows it was actually a net positive.
Target F1 scores by use case. Financial and compliance matching should target F1 above 0.95. Customer deduplication and CRM consolidation should aim for F1 above 0.90. Marketing and analytical use cases can often accept F1 of 0.85 or above. Exploratory matching, where all results will be reviewed manually, can work with F1 of 0.75 or above.
The important thing is to pick a target, measure against it, and iterate until you hit it. Matching quality is not a one-shot exercise. It is an iterative process of tuning thresholds, adding fields, improving normalization, and measuring the impact of each change.
Building a quality-first matching workflow
Putting these concepts into practice requires a structured workflow. Here is the sequence that produces reliable results:
Step 1: Run the initial match. Use your best guess at configuration — field selection, thresholds, algorithm choice. Get a first set of results.
Step 2: Sample and measure. Pull 200 matched pairs for precision review. Pull 100 unmatched records for recall estimation. Label them.
Step 3: Compute baselines. Calculate precision, recall, and F1. These are your starting numbers.
Step 4: Diagnose. If precision is low, examine the false positives. What made the system think they were matches? Common causes: threshold too low, irrelevant field contributing noise, missing normalization. If recall is low, examine the false negatives. Why were they missed? Common causes: threshold too high, missing blocking key, data quality issue in a key field.
Step 5: Iterate. Make one change at a time. Re-run. Re-sample. Recompute F1. If F1 improved, keep the change. If not, revert.
Step 6: Validate at scale. Once F1 meets your target on the sample, run the full dataset. Spot-check the results but do not re-label the entire output. Your sample-based estimate should hold.
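The measure-and-iterate loop in steps 2 through 5 can be wrapped in a small evaluation harness so every iteration is scored the same way. This is a sketch with hypothetical counts, not a prescribed tool; it combines the sampled-precision and sampled-recall estimates described earlier:

```python
def evaluate(labeled_matches, labeled_unmatched, match_count, unmatched_count):
    """Estimate precision, recall, and F1 from two review samples.

    labeled_matches   -- list of bools, True = reviewed pair is a correct match
    labeled_unmatched -- list of bools, True = reviewer found a missed match
    """
    precision = sum(labeled_matches) / len(labeled_matches)
    fn_rate = sum(labeled_unmatched) / len(labeled_unmatched)
    tp_est = match_count * precision        # estimated true positives
    fn_est = unmatched_count * fn_rate      # estimated false negatives
    recall = tp_est / (tp_est + fn_est)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Step 5 in practice: keep a change only if F1 beats the previous iteration.
baseline = evaluate([True] * 185 + [False] * 15,   # 185/200 matches correct
                    [True] * 12 + [False] * 88,    # 12/100 missed matches found
                    match_count=4327, unmatched_count=5000)
print({k: round(v, 3) for k, v in baseline.items()})
```

Logging this dictionary per iteration gives you the improvement table from the previous section for free.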
This workflow takes more time upfront than just running a match and hoping for the best. But it produces results you can trust and explain to stakeholders. When someone asks “how accurate is this matching?” you have a real answer backed by data, not a guess.
Match Data Studio gives you sample results and confidence scores after every run, so you can evaluate precision without building your own review workflow. Configure your pipeline, inspect the results, tune your thresholds, and iterate until quality meets your target. Get started free →
Keep reading
- Understanding similarity thresholds — how threshold tuning directly controls your precision-recall balance
- Five matching mistakes that silently ruin your results — the errors that hurt quality before you even start measuring
- How to choose the right matching algorithm — algorithm selection is the foundation that quality metrics build on