How to set up blocking keys to speed up large matching jobs
Matching slows down fast at scale. Learn how blocking keys reduce comparisons by orders of magnitude, how to choose effective keys, and how multi-pass blocking recovers missed pairs.
Every record matching project starts with the same math problem, whether you realize it or not. You have N records in dataset A and M records in dataset B. To find every match, you need to compare every record in A against every record in B. That is N times M comparisons. At small scale, this is invisible. At large scale, it is the single biggest obstacle between you and your results.
Blocking keys are the standard solution. They have been used in record linkage since the 1960s, and they remain the most important optimization in any matching pipeline. This post explains how they work, how to choose them, and how to combine them for maximum coverage.
Why matching is slow: the N x M problem
The cost of brute-force matching grows quadratically. Two datasets of 1,000 records each produce 1 million pairs. That is manageable. But datasets grow faster than people expect, and the pair count grows with them.
| Dataset A | Dataset B | Total Pairs | At 0.5ms per pair | At 2ms per pair |
|---|---|---|---|---|
| 1,000 | 1,000 | 1 million | 8 minutes | 33 minutes |
| 5,000 | 5,000 | 25 million | 3.5 hours | 14 hours |
| 10,000 | 10,000 | 100 million | 14 hours | 2.3 days |
| 50,000 | 50,000 | 2.5 billion | 14.5 days | 58 days |
| 100,000 | 100,000 | 10 billion | 58 days | 231 days |
Times assume sequential processing. Parallelism helps, but the fundamental scaling problem remains.
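The table's arithmetic is easy to sanity-check. A minimal sketch (the function name is illustrative, not from any library):

```python
# Back-of-envelope cost of brute-force matching (illustrative).
def brute_force_seconds(n_a, n_b, ms_per_pair):
    """Total time in seconds for an n_a x n_b brute-force pass
    at ms_per_pair milliseconds per comparison, run sequentially."""
    return n_a * n_b * ms_per_pair / 1000.0

# 10,000 x 10,000 records at 0.5 ms per pair:
seconds = brute_force_seconds(10_000, 10_000, 0.5)
print(f"{seconds / 3600:.1f} hours")  # prints "13.9 hours"
```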
At 10,000 records per side, brute-force matching already takes hours for a fast comparison function and days for a slower one. At 50,000 per side, it becomes impractical. At 100,000, evaluating every pair in a reasonable timeframe demands either enormous parallel hardware or a way to skip most pairs entirely, and even with parallelism the cost still scales with the pair count.
The problem is not the comparison function. The problem is the number of comparisons. You need a way to avoid generating most of those pairs in the first place.
What blocking keys are and how they reduce comparisons
A blocking key is a field value (or a value derived from one or more fields) used to partition records into groups called blocks. Records in different blocks are never compared. Only records that share the same blocking key value become candidate pairs.
The logic is straightforward: if two records do not share a basic attribute, they are almost certainly not a match. Two people in different states are unlikely to be the same person. Two products in different categories are unlikely to be the same item. Do not waste time comparing them.
Consider two datasets of 50,000 customer records each, with a ZIP code field containing roughly 500 distinct values. Without blocking, you generate 2.5 billion pairs. With ZIP code blocking, you only compare records within the same ZIP. If the 50,000 records distribute evenly across 500 ZIP codes, each block has 100 records, producing 100 times 100 equals 10,000 pairs per block. Across 500 blocks, that is 5 million pairs total — a 500x reduction.
In practice, blocks are not evenly sized. Dense ZIPs produce larger blocks. But even with skewed distributions, blocking typically reduces comparisons by 95 to 99.9 percent. That is the difference between a job that runs overnight and one that cannot run at all.
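A single blocking pass is only a few lines of code. A minimal sketch, assuming records are dicts with a `zip` field (the field names and record shape are illustrative assumptions):

```python
from collections import defaultdict

def block_candidates(records_a, records_b, key_fn):
    """Generate candidate pairs: only records sharing a blocking
    key value are paired. key_fn maps a record to its blocking
    key, or None to skip the record entirely."""
    blocks_b = defaultdict(list)
    for rec in records_b:
        key = key_fn(rec)
        if key is not None:
            blocks_b[key].append(rec)
    for rec_a in records_a:
        key = key_fn(rec_a)
        for rec_b in blocks_b.get(key, []):
            yield rec_a, rec_b

# Example: block on ZIP code (field names are assumptions).
a = [{"id": 1, "zip": "10001"}, {"id": 2, "zip": "94107"}]
b = [{"id": 9, "zip": "10001"}, {"id": 8, "zip": "60601"}]
pairs = list(block_candidates(a, b, lambda r: r.get("zip")))
# Only the two 10001 records become a pair; cross-ZIP pairs are never generated.
```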
Choosing blocking keys: high cardinality, low miss rate
Not every field makes a good blocking key. The ideal blocking key has three properties:
High cardinality. The more distinct values the key has, the smaller the blocks. A field with 10 distinct values divides your data into 10 groups — modest reduction. A field with 1,000 distinct values divides your data into 1,000 groups — substantial reduction. ZIP code prefixes, area codes, and category hierarchies tend to have high cardinality.
Low error rate. The blocking key must be correct in both datasets. If a record has a typo in the blocking field, it ends up in the wrong block and never gets compared to its true match. Fields that are entered consistently — structured codes, categories, numeric fields — make better blocking keys than freeform text.
Available in both datasets. The key must exist in both dataset A and dataset B. If dataset A has full addresses but dataset B only has city names, you cannot block on ZIP code. You need to find a common denominator.
| Blocking Key | Typical Cardinality | Error Rate | Pair Reduction | Risk |
|---|---|---|---|---|
| Full ZIP code | ~40,000 | Low (structured) | 99.5%+ | Typos create orphan blocks |
| 3-digit ZIP prefix | ~900 | Very low | 99% | Larger blocks, safer |
| State / region | 50 | Very low | 90–95% | Large metro blocks |
| First 3 chars of name | ~2,000 | Moderate | 98% | Typo in first char = missed pair |
| Product category | 20–200 | Low | 85–99% | Miscategorized items missed |
| Birth year | ~80 | Low | 95% | Narrow; combine with other keys |
Reduction percentages are illustrative and depend on the distribution of values in your data.
The 3-digit ZIP prefix is often the best single blocking key for US address data. It has high cardinality (about 900 values), extremely low error rate (people rarely get the first three digits of their ZIP wrong), and is commonly available across datasets. Full 5-digit ZIP has even higher cardinality but is slightly more error-prone.
For product data, category or brand often works well. For person data, the first three characters of last name combined with state creates tight blocks while remaining robust to most errors.
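Before committing to a key, it is worth profiling it against these three properties. A minimal sketch that reports cardinality, the largest block's share of records, and the missing-value rate (the function name and sample values are illustrative):

```python
from collections import Counter

def profile_key(values):
    """Summarize a candidate blocking key across a dataset:
    distinct values, the largest block's share of non-missing
    records, and the fraction of missing values."""
    present = [v for v in values if v]
    counts = Counter(present)
    return {
        "cardinality": len(counts),
        "max_block_share": max(counts.values()) / len(present) if present else 0.0,
        "missing_rate": 1 - len(present) / len(values) if values else 0.0,
    }

# Profile 3-digit ZIP prefixes (toy data); a high max_block_share
# or missing_rate signals a weak blocking key.
profile = profile_key(["100", "100", "941", "606", "", "100"])
```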
Common blocking strategies by data type
Different data domains call for different blocking approaches. Here are the strategies that work in practice.
Customer and person matching. Block on 3-digit ZIP prefix or state. Alternatively, block on the first two or three characters of the last name. For datasets with phone numbers, the area code (first three digits) is an excellent blocking key because it is rarely entered incorrectly. For datasets with dates of birth, birth year creates blocks of roughly uniform size.
Address matching. Block on ZIP code or city name. For street addresses, the first three characters of the street name can work, though street name abbreviations (St vs Street, Ave vs Avenue) need normalization first. House number modulo 100 is sometimes used as a secondary key to create tighter blocks.
Product matching. Block on category, brand, or both. For e-commerce products, the combination of top-level category and first word of the product name creates small, focused blocks. For SKU-level matching, the manufacturer code prefix is often available and highly reliable.
Business and company matching. Block on state or metro area. The first three characters of the company name after removing common prefixes (The, A, An) works well. For datasets with industry codes (SIC, NAICS), these provide natural blocking boundaries.
Financial record matching. Block on transaction date (or date range, such as same week). Amount bucketing — rounding to the nearest $100 or $1,000 — creates coarse blocks. Currency code is an obvious blocking key when matching across multi-currency systems.
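The key-derivation functions behind these strategies are short. Illustrative sketches for three of them, one per field family (the function names and cleanup rules are assumptions for this post, not a standard):

```python
import re

def zip3(zip_code):
    """Geographic key: 3-digit ZIP prefix from a raw ZIP string."""
    digits = re.sub(r"\D", "", zip_code or "")
    return digits[:3] or None

def name_prefix(name, n=3):
    """Identity key: first n letters of a last name, letters only."""
    cleaned = re.sub(r"[^a-z]", "", (name or "").lower())
    return cleaned[:n] or None

def area_code(phone):
    """Contact key: first 3 digits of a 10+ digit phone number."""
    digits = re.sub(r"\D", "", phone or "")
    return digits[:3] if len(digits) >= 10 else None

zip3("94107-1234")           # '941'
name_prefix("O'Brien")       # 'obr'
area_code("(415) 555-0100")  # '415'
```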
Multi-pass blocking: using multiple keys for higher recall
Single-key blocking has a fundamental weakness. If the blocking key value is wrong for a record — a typo in the ZIP code, a miscategorized product, a misspelled first character of a name — that record lands in the wrong block and its true match is never found.
Multi-pass blocking solves this by running the blocking and comparison process multiple times, each pass using a different blocking key. The final candidate set is the union of all passes.
Pass 1: Block on 3-digit ZIP prefix. This catches all pairs where both records have the correct ZIP.
Pass 2: Block on first 3 characters of last name. This catches pairs where the ZIP was wrong but the name was entered correctly.
Pass 3: Block on phone area code. This catches pairs where both ZIP and name have errors but the phone number is correct.
A record only needs to match on one blocking key to enter the candidate set. This dramatically improves recall. In practice, two or three passes cover the vast majority of true matches.
As an illustration, suppose a single ZIP-based pass achieves 96 percent recall: good but not great, since every ZIP typo is a missed match. Adding a name-prefix pass recovers most of those losses, pushing recall past 99 percent, and a third pass on phone area code recovers most of the remainder.
The cost of multi-pass blocking is more candidate pairs. Two passes roughly double the candidate count (minus overlap), and three passes roughly triple it. But the absolute numbers are still far smaller than brute force. If single-pass blocking produces 5 million pairs from a 50K x 50K dataset, three passes might produce 12 million — still a 200x reduction from the 2.5 billion brute-force count.
The key is to deduplicate candidate pairs across passes. Many pairs will appear in multiple passes. A simple set union on the pair identifiers ensures each pair is only evaluated once.
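The multi-pass loop with set-based deduplication can be sketched as follows, assuming each record carries a unique `id` field (an assumption for this example):

```python
from collections import defaultdict

def multi_pass_candidates(records_a, records_b, key_fns):
    """Union of candidate pairs across blocking passes,
    deduplicated by (id_a, id_b) so each pair is scored once."""
    seen = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for rec in records_b:
            key = key_fn(rec)
            if key is not None:
                blocks[key].append(rec)
        for rec_a in records_a:
            key = key_fn(rec_a)
            for rec_b in blocks.get(key, []):
                pair_id = (rec_a["id"], rec_b["id"])
                if pair_id not in seen:
                    seen.add(pair_id)
                    yield rec_a, rec_b

# A ZIP typo sinks pass 1, but the name-prefix pass rescues the pair.
a = [{"id": "a1", "zip": "10001", "name": "smith"}]
b = [{"id": "b1", "zip": "10010", "name": "smith"}]  # ZIP typo
passes = [lambda r: r["zip"], lambda r: r["name"][:3]]
pairs = list(multi_pass_candidates(a, b, passes))
```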
Measuring blocking quality: the reduction ratio vs pairs completeness tradeoff
Two metrics tell you whether your blocking strategy is working.
Reduction ratio (RR) measures how many pairs blocking eliminated. The formula is:
RR = 1 - (candidate_pairs / total_possible_pairs)
A reduction ratio of 0.99 means blocking eliminated 99 percent of all possible pairs. Higher is better for performance but may sacrifice completeness.
Pairs completeness (PC) measures what fraction of true matching pairs survived blocking. The formula is:
PC = true_pairs_in_candidates / total_true_pairs
A pairs completeness of 0.95 means 95 percent of all true matches made it into the candidate set. Higher is better for match quality.
These two metrics are in tension. Tighter blocking (fewer, smaller blocks) increases the reduction ratio but risks dropping true pairs. Looser blocking (more, larger blocks) improves pairs completeness but generates more candidates.
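Both metrics are one-line calculations. A sketch using the counts from the ZIP example earlier (function names are illustrative):

```python
def reduction_ratio(candidate_pairs, total_possible_pairs):
    """RR = 1 - candidate_pairs / total_possible_pairs."""
    return 1 - candidate_pairs / total_possible_pairs

def pairs_completeness(true_pairs_in_candidates, total_true_pairs):
    """PC = fraction of true matching pairs that survived blocking."""
    return true_pairs_in_candidates / total_true_pairs

# 50K x 50K example: 5 million candidates out of 2.5 billion possible.
rr = reduction_ratio(5_000_000, 2_500_000_000)  # 0.998
pc = pairs_completeness(96, 100)                # 0.96
```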
| Blocking Strategy | Candidate Pairs | Reduction Ratio | Pairs Completeness | Processing Time |
|---|---|---|---|---|
| No blocking | 2.5 billion | 0% | 100% | ~6 days |
| State only | 125 million | 95.0% | 99.8% | ~7 hours |
| 3-digit ZIP | 5 million | 99.8% | 96.2% | ~17 min |
| 5-digit ZIP | 800,000 | 99.97% | 89.5% | ~3 min |
| ZIP + Name prefix (2 pass) | 9 million | 99.6% | 99.1% | ~30 min |
| ZIP + Name + Phone (3 pass) | 12 million | 99.5% | 99.7% | ~40 min |
Processing times assume downstream similarity scoring at 0.2ms per pair. Actual times depend on scoring method and hardware.
The three-pass strategy in the last row shows the sweet spot: 99.5 percent of pairs are eliminated (fast processing), and 99.7 percent of true matches are retained (high quality). The five-digit ZIP strategy is faster but drops 10.5 percent of true pairs — an unacceptable loss for most use cases.
The practical approach is to start with a single blocking key that has high cardinality and low error rate, measure pairs completeness on a labeled sample, and add passes until completeness is above your threshold. For most applications, two passes are sufficient. Three passes are worth it when the stakes are high and you can afford the extra processing time.
When labeled data is unavailable — which is common — you can estimate pairs completeness by sampling. Take 200 records from dataset A, manually find their true matches in dataset B, then check how many of those true pairs appear in your candidate set. This gives a reliable estimate without labeling the entire dataset.
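The sampling check fits in a few lines. A sketch where the pair identifiers are made up for illustration:

```python
def estimate_pc(sample_true_pairs, candidate_pair_ids):
    """Estimate pairs completeness from a manually labeled sample.
    sample_true_pairs: list of (id_a, id_b) tuples found by hand;
    candidate_pair_ids: set of (id_a, id_b) produced by blocking."""
    hits = sum(1 for pair in sample_true_pairs if pair in candidate_pair_ids)
    return hits / len(sample_true_pairs)

# Hand-label a sample of true matches, then check candidate survival.
truth = [("a1", "b9"), ("a2", "b4"), ("a3", "b7"), ("a4", "b2")]
candidates = {("a1", "b9"), ("a2", "b4"), ("a3", "b7")}
estimate_pc(truth, candidates)  # 0.75
```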
Putting it together: a blocking checklist
Setting up blocking keys for a large matching job is not complicated, but it requires deliberate choices. Here is the sequence that works:
Step 1: Inventory your fields. List every field available in both datasets. Identify which fields are structured (codes, categories, dates, numbers) versus unstructured (names, descriptions, addresses).
Step 2: Pick your primary blocking key. Choose the structured field with the highest cardinality and lowest error rate. For US addresses, this is usually 3-digit ZIP prefix. For products, it is category or brand. For people without addresses, it is birth year or last name prefix.
Step 3: Normalize before blocking. Strip whitespace, lowercase text, standardize abbreviations, and remove punctuation. A blocking key that treats “New York” and “new york” as different values defeats the purpose. Normalization should happen before blocking key extraction, not after.
Step 4: Measure block sizes. Compute the distribution of block sizes. If any single block contains more than 1 percent of your dataset, that block is too large and will dominate processing time. Consider splitting large blocks with a secondary key (e.g., add first letter of name within a large ZIP block).
Step 5: Add a second pass. Choose a blocking key from a different field family. If your primary key is geographic (ZIP), your second key should be identity-based (name prefix) or contact-based (phone area code). The point is independence — if one key has an error, the other should still be correct.
Step 6: Deduplicate and proceed. Merge candidate pairs from all passes, remove duplicates, and pass them to your comparison pipeline. The comparison steps — string similarity, embedding similarity, LLM confirmation — only run on the candidate set that blocking produced.
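The steps above are mostly glue code. A minimal sketch of the normalization in Step 3 (the abbreviation table is a small illustrative sample, not an exhaustive list):

```python
import re

def normalize(value):
    """Normalize a field before blocking-key extraction:
    lowercase, strip punctuation, collapse whitespace, and
    standardize a few common address abbreviations."""
    abbrev = {"street": "st", "avenue": "ave", "boulevard": "blvd"}
    text = re.sub(r"[^\w\s]", " ", (value or "").lower())
    words = [abbrev.get(word, word) for word in text.split()]
    return " ".join(words)

normalize("New York")        # 'new york'
normalize("  5th Avenue, ")  # '5th ave'
```

Running this before key extraction ensures that "New York" and "new york" land in the same block.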
Match Data Studio handles blocking automatically as part of its pipeline architecture. The system applies string pre-filters, numeric pre-filters, and AI similarity in a cascade that eliminates non-matching pairs at each stage — so expensive operations only run on records that survive the cheaper checks. Upload your CSVs and let the pipeline handle the scaling. Get started free →
Keep reading
- Matching at scale: strategies for millions of records — the full optimization stack for large datasets
- How to choose the right matching algorithm — picking the right comparison method before optimizing for speed
- Fuzzy matching algorithms explained — the algorithms that run after blocking narrows the candidate set