Every record matching project starts with the same math problem, whether you realize it or not. You have N records in dataset A and M records in dataset B. To find every match, you need to compare every record in A against every record in B. That is N × M comparisons. At small scale, this is invisible. At large scale, it is the single biggest obstacle between you and your results.

Blocking keys are the standard solution. They have been used in record linkage since the 1960s, and they remain the most important optimization in any matching pipeline. This post explains how they work, how to choose them, and how to combine them for maximum coverage.

Why matching is slow: the N × M problem

The cost of brute-force matching grows quadratically. Two datasets of 1,000 records each produce 1 million pairs. That is manageable. But datasets grow faster than people expect, and the pair count grows with them.

How pair counts scale with dataset size
| Dataset A | Dataset B | Total pairs | At 0.5ms per pair | At 2ms per pair |
|---|---|---|---|---|
| 1,000 | 1,000 | 1 million | 8 minutes | 33 minutes |
| 5,000 | 5,000 | 25 million | 3.5 hours | 14 hours |
| 10,000 | 10,000 | 100 million | 14 hours | 2.3 days |
| 50,000 | 50,000 | 2.5 billion | 14.5 days | 58 days |
| 100,000 | 100,000 | 10 billion | 58 days | 231 days |

Times assume sequential processing. Parallelism helps, but the fundamental scaling problem remains.

At 10,000 records per side, brute-force matching already takes hours with a fast comparison function and days with a slower one. At 50,000 per side, it becomes impractical. At 100,000, evaluating every pair sequentially would take the better part of a year — and because parallelism only divides a quadratic workload by a constant factor, no realistic amount of hardware makes it comfortable.

The problem is not the comparison function. The problem is the number of comparisons. You need a way to avoid generating most of those pairs in the first place.

What blocking keys are and how they reduce comparisons

A blocking key is a field value (or a value derived from one or more fields) used to partition records into groups called blocks. Records in different blocks are never compared. Only records that share the same blocking key value become candidate pairs.

The logic is straightforward: if two records do not share a basic attribute, they are almost certainly not a match. Two people in different states are unlikely to be the same person. Two products in different categories are unlikely to be the same item. Do not waste time comparing them.

Consider matching two datasets of 50,000 customer records each, with a ZIP code field containing roughly 500 distinct values. Without blocking, you generate 2.5 billion pairs. With ZIP code blocking, you only compare records within the same ZIP. If the 50,000 records distribute evenly across 500 ZIP codes, each block has 100 records on each side, producing 100 × 100 = 10,000 pairs per block. Across 500 blocks, that is 5 million pairs total — a 500x reduction.

In practice, blocks are not evenly sized. Dense ZIPs produce larger blocks. But even with skewed distributions, blocking typically reduces comparisons by 95 to 99.9 percent. That is the difference between a job that runs overnight and one that cannot run at all.
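The partition-then-pair mechanics fit in a few lines of Python. This is a toy sketch: the records and the `zip` field name are invented for illustration, standing in for whatever blocking field you actually use.

```python
from collections import defaultdict
from itertools import product

def block_by_key(records, key_field):
    """Group records into blocks keyed by the value of one field."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key_field]].append(rec)
    return blocks

# Toy datasets; 'zip' stands in for any blocking field.
a = [{"id": 1, "zip": "02139"}, {"id": 2, "zip": "10001"}]
b = [{"id": 7, "zip": "02139"}, {"id": 8, "zip": "94103"}]

blocks_a = block_by_key(a, "zip")
blocks_b = block_by_key(b, "zip")

# Candidate pairs come only from blocks present on BOTH sides.
candidates = [
    (ra["id"], rb["id"])
    for key in blocks_a.keys() & blocks_b.keys()
    for ra, rb in product(blocks_a[key], blocks_b[key])
]
# Brute force would generate 2 x 2 = 4 pairs; blocking leaves 1.
```

Note that blocks appearing on only one side are skipped entirely, so records with rare key values cost nothing downstream.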

Choosing blocking keys: high cardinality, low miss rate

Not every field makes a good blocking key. The ideal blocking key has three properties:

High cardinality. The more distinct values the key has, the smaller the blocks. A field with 10 distinct values divides your data into 10 groups — modest reduction. A field with 1,000 distinct values divides your data into 1,000 groups — substantial reduction. ZIP code prefixes, area codes, and category hierarchies tend to have high cardinality.

Low error rate. The blocking key must be correct in both datasets. If a record has a typo in the blocking field, it ends up in the wrong block and never gets compared to its true match. Fields that are entered consistently — structured codes, categories, numeric fields — make better blocking keys than freeform text.

Available in both datasets. The key must exist in both dataset A and dataset B. If dataset A has full addresses but dataset B only has city names, you cannot block on ZIP code. You need to find a common denominator.

Blocking key quality comparison
| Blocking key | Typical cardinality | Error rate | Pair reduction | Risk |
|---|---|---|---|---|
| Full ZIP code | ~40,000 | Low (structured) | 99.5%+ | Typos create orphan blocks |
| 3-digit ZIP prefix | ~900 | Very low | 99% | Larger blocks, safer |
| State / region | 50 | Very low | 90–95% | Large metro blocks |
| First 3 chars of name | ~2,000 | Moderate | 98% | Typo in first char = missed pair |
| Product category | 20–200 | Low | 85–99% | Miscategorized items missed |
| Birth year | ~80 | Low | 95% | Narrow; combine with other keys |

Reduction percentages are illustrative and depend on the distribution of values in your data.

The 3-digit ZIP prefix is often the best single blocking key for US address data. It has high cardinality (about 900 values), extremely low error rate (people rarely get the first three digits of their ZIP wrong), and is commonly available across datasets. Full 5-digit ZIP has even higher cardinality but is slightly more error-prone.

For product data, category or brand often works well. For person data, the first three characters of last name combined with state creates tight blocks while remaining robust to most errors.
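Derived keys like these are simple functions of a record. A sketch of the two keys just mentioned, assuming records are dicts with hypothetical `zip`, `last_name`, and `state` fields — returning `None` for missing data so those records can be skipped rather than lumped into one giant empty-key block:

```python
def zip3(record):
    """3-digit ZIP prefix: high cardinality, rarely mistyped."""
    z = (record.get("zip") or "").strip()
    return z[:3] if len(z) >= 3 else None

def name_state(record):
    """First 3 chars of last name plus state, for person data."""
    last = (record.get("last_name") or "").strip().lower()
    state = (record.get("state") or "").strip().upper()
    return (last[:3], state) if last and state else None

rec = {"zip": "02139-4301", "last_name": "Hernandez", "state": "ma"}
zip3(rec)        # "021"
name_state(rec)  # ("her", "MA")
```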

Common blocking strategies by data type

Different data domains call for different blocking approaches. Here are the strategies that work in practice.

Customer and person matching. Block on 3-digit ZIP prefix or state. Alternatively, block on the first two or three characters of the last name. For datasets with phone numbers, the area code (first three digits) is an excellent blocking key because it is rarely entered incorrectly. For datasets with dates of birth, birth year creates blocks of roughly uniform size.

Address matching. Block on ZIP code or city name. For street addresses, the first three characters of the street name can work, though street name abbreviations (St vs Street, Ave vs Avenue) need normalization first. House number modulo 100 is sometimes used as a secondary key to create tighter blocks.

Product matching. Block on category, brand, or both. For e-commerce products, the combination of top-level category and first word of the product name creates small, focused blocks. For SKU-level matching, the manufacturer code prefix is often available and highly reliable.

Business and company matching. Block on state or metro area. The first three characters of the company name after removing common prefixes (The, A, An) works well. For datasets with industry codes (SIC, NAICS), these provide natural blocking boundaries.

Financial record matching. Block on transaction date (or date range, such as same week). Amount bucketing — rounding to the nearest $100 or $1,000 — creates coarse blocks. Currency code is an obvious blocking key when matching across multi-currency systems.
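The date and amount bucketing described for financial records might look like the following sketch. The bucket width and field names are assumptions for illustration, not a fixed standard:

```python
from datetime import date

def week_bucket(d: date) -> str:
    """ISO year-week, so same-week transactions share a block."""
    iso = d.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"

def amount_bucket(amount: float, width: int = 100) -> int:
    """Round to the nearest $100 so near-equal amounts co-block."""
    return round(amount / width) * width

def txn_key(txn):
    """Compound blocking key: currency + week + amount bucket."""
    return (txn["currency"],
            week_bucket(txn["date"]),
            amount_bucket(txn["amount"]))

txn_key({"currency": "USD", "date": date(2024, 3, 14), "amount": 1849.50})
# ('USD', '2024-W11', 1800)
```

One caveat with bucketing: two amounts that straddle a bucket boundary land in different blocks, so a real pipeline often probes adjacent buckets or relies on a second pass to catch those pairs.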

Multi-pass blocking: using multiple keys for higher recall

Single-key blocking has a fundamental weakness. If the blocking key value is wrong for a record — a typo in the ZIP code, a miscategorized product, a misspelled first character of a name — that record lands in the wrong block and its true match is never found.

Multi-pass blocking solves this by running the blocking and comparison process multiple times, each pass using a different blocking key. The final candidate set is the union of all passes.

Pass 1: Block on 3-digit ZIP prefix. This catches all pairs where both records have the correct ZIP.

Pass 2: Block on first 3 characters of last name. This catches pairs where the ZIP was wrong but the name was entered correctly.

Pass 3: Block on phone area code. This catches pairs where both ZIP and name have errors but the phone number is correct.

A record only needs to match on one blocking key to enter the candidate set. This dramatically improves recall. In practice, two or three passes cover the vast majority of true matches.

Match recall by blocking strategy (10K person records, 2% field error rate)

| Blocking strategy | Recall | Failure mode |
|---|---|---|
| ZIP only | 91% | Misses ZIP typos |
| Name prefix only | 87% | Misses name-initial errors |
| Phone area code only | 82% | Missing phone fields hurt |
| ZIP + Name (2 pass) | 97% | Covers most errors |
| ZIP + Name + Phone (3 pass) | 99% | Near-complete coverage |

Recall measured against manually labeled ground truth. Error rate is the percentage of records with at least one incorrect blocking field.

With a single ZIP-based pass, you achieve 91 percent recall — good but not great. Every ZIP typo is a missed match. Adding a name-prefix pass recovers most of those losses, pushing recall to 97 percent. A third pass on phone area code gets you to 99 percent.

The cost of multi-pass blocking is more candidate pairs. Two passes roughly double the candidate count (minus overlap), and three passes roughly triple it. But the absolute numbers are still far smaller than brute force. If single-pass blocking produces 5 million pairs from a 50K x 50K dataset, three passes might produce 12 million — still a 200x reduction from the 2.5 billion brute-force count.

The key is to deduplicate candidate pairs across passes. Many pairs will appear in multiple passes. A simple set union on the pair identifiers ensures each pair is only evaluated once.
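The passes and the set union can be sketched together in Python. The three `key_fn` lambdas below are hypothetical stand-ins for the ZIP-prefix, name-prefix, and area-code passes described above; using a `set` for the pair identifiers makes the cross-pass deduplication automatic:

```python
from collections import defaultdict
from itertools import product

def candidate_pairs(records_a, records_b, key_fn):
    """One blocking pass: yield (id_a, id_b) for records sharing a key."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for r in records_a:
        k = key_fn(r)
        if k is not None:
            blocks_a[k].append(r["id"])
    for r in records_b:
        k = key_fn(r)
        if k is not None:
            blocks_b[k].append(r["id"])
    for k in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[k], blocks_b[k])

def multi_pass(records_a, records_b, key_fns):
    """Union of candidate pairs across passes; the set dedups overlap."""
    pairs = set()
    for key_fn in key_fns:
        pairs.update(candidate_pairs(records_a, records_b, key_fn))
    return pairs

# Hypothetical key functions: ZIP prefix, name prefix, area code.
passes = [
    lambda r: (r.get("zip") or "")[:3] or None,
    lambda r: (r.get("last_name") or "")[:3].lower() or None,
    lambda r: (r.get("phone") or "")[:3] or None,
]
```

A record with a ZIP typo falls out of the first pass but is recovered by the name or phone pass, which is exactly the recall improvement the table above illustrates.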

Measuring blocking quality: the reduction ratio vs pairs completeness tradeoff

Two metrics tell you whether your blocking strategy is working.

Reduction ratio (RR) measures how many pairs blocking eliminated. The formula is:

RR = 1 - (candidate_pairs / total_possible_pairs)

A reduction ratio of 0.99 means blocking eliminated 99 percent of all possible pairs. Higher is better for performance but may sacrifice completeness.

Pairs completeness (PC) measures what fraction of true matching pairs survived blocking. The formula is:

PC = true_pairs_in_candidates / total_true_pairs

A pairs completeness of 0.95 means 95 percent of all true matches made it into the candidate set. Higher is better for match quality.

These two metrics are in tension. Tighter blocking (fewer, smaller blocks) increases the reduction ratio but risks dropping true pairs. Looser blocking (more, larger blocks) improves pairs completeness but generates more candidates.
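Both metrics are one-liners once you have the candidate set and, for pairs completeness, a set of known true pairs:

```python
def reduction_ratio(n_candidates: int, n_a: int, n_b: int) -> float:
    """Fraction of all possible pairs that blocking eliminated."""
    return 1 - n_candidates / (n_a * n_b)

def pairs_completeness(candidates: set, true_pairs: set) -> float:
    """Fraction of known true pairs that survived blocking."""
    return len(true_pairs & candidates) / len(true_pairs)

# 5M candidates out of 50K x 50K possible pairs:
reduction_ratio(5_000_000, 50_000, 50_000)  # 0.998
```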

Blocking strategy quality metrics (50K x 50K customer records)
| Blocking strategy | Candidate pairs | Reduction ratio | Pairs completeness | Processing time |
|---|---|---|---|---|
| No blocking | 2.5 billion | 0% | 100% | Impossible |
| State only | 125 million | 95.0% | 99.8% | ~6 hours |
| 3-digit ZIP | 5 million | 99.8% | 96.2% | ~15 min |
| 5-digit ZIP | 800,000 | 99.97% | 89.5% | ~3 min |
| ZIP + Name prefix (2 pass) | 9 million | 99.6% | 99.1% | ~25 min |
| ZIP + Name + Phone (3 pass) | 12 million | 99.5% | 99.7% | ~35 min |

Processing times assume downstream similarity scoring at 0.2ms per pair. Actual times depend on scoring method and hardware.

The three-pass strategy in the last row shows the sweet spot: 99.5 percent of pairs are eliminated (fast processing), and 99.7 percent of true matches are retained (high quality). The five-digit ZIP strategy is faster but drops 10.5 percent of true pairs — an unacceptable loss for most use cases.

The practical approach is to start with a single blocking key that has high cardinality and low error rate, measure pairs completeness on a labeled sample, and add passes until completeness is above your threshold. For most applications, two passes are sufficient. Three passes are worth it when the stakes are high and you can afford the extra processing time.

When labeled data is unavailable — which is common — you can estimate pairs completeness by sampling. Take 200 records from dataset A, manually find their true matches in dataset B, then check how many of those true pairs appear in your candidate set. This gives a reliable estimate without labeling the entire dataset.
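A sketch of that sampling estimate, assuming the hand-labeled true matches are `(id_a, id_b)` tuples and the candidate pairs are held in a set:

```python
import random

def estimate_pairs_completeness(labeled_true_pairs, candidate_pairs,
                                sample_size=200, seed=0):
    """Estimate PC from a hand-labeled sample of true pairs."""
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    sample = rng.sample(labeled_true_pairs,
                        min(sample_size, len(labeled_true_pairs)))
    found = sum(1 for pair in sample if pair in candidate_pairs)
    return found / len(sample)
```

With 200 sampled pairs the estimate carries a sampling margin of a few percentage points, which is usually precise enough to decide whether another blocking pass is needed.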

Putting it together: a blocking checklist

Setting up blocking keys for a large matching job is not complicated, but it requires deliberate choices. Here is the sequence that works:

Step 1: Inventory your fields. List every field available in both datasets. Identify which fields are structured (codes, categories, dates, numbers) versus unstructured (names, descriptions, addresses).

Step 2: Pick your primary blocking key. Choose the structured field with the highest cardinality and lowest error rate. For US addresses, this is usually 3-digit ZIP prefix. For products, it is category or brand. For people without addresses, it is birth year or last name prefix.

Step 3: Normalize before blocking. Strip whitespace, lowercase text, standardize abbreviations, and remove punctuation. A blocking key that treats “New York” and “new york” as different values defeats the purpose. Normalization should happen before blocking key extraction, not after.
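A minimal normalization pass might look like the following sketch; the abbreviation map is a hypothetical starter that you would extend for your domain:

```python
import re

def normalize(value: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, expand abbreviations."""
    v = value.lower().strip()
    v = re.sub(r"[^\w\s]", " ", v)      # drop punctuation
    v = re.sub(r"\s+", " ", v).strip()  # collapse runs of whitespace
    # Hypothetical abbreviation map; extend per domain.
    abbrev = {"st": "street", "ave": "avenue", "rd": "road"}
    return " ".join(abbrev.get(tok, tok) for tok in v.split())

normalize("  123 Main St.")  # "123 main street"
normalize("New  York")       # "new york"
```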

Step 4: Measure block sizes. Compute the distribution of block sizes. If any single block contains more than 1 percent of your dataset, that block is too large and will dominate processing time. Consider splitting large blocks with a secondary key (e.g., add first letter of name within a large ZIP block).
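Measuring the block-size distribution is a quick counting pass. A sketch that flags any block holding more than the 1 percent threshold:

```python
from collections import Counter

def block_size_report(records, key_fn, threshold=0.01):
    """Return all block sizes plus the blocks exceeding `threshold`
    as a fraction of the dataset."""
    sizes = Counter(key_fn(r) for r in records)
    total = len(records)
    oversized = {k: n for k, n in sizes.items() if n / total > threshold}
    return sizes, oversized
```

Any key that shows up in `oversized` is a candidate for splitting with a secondary key, as described above.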

Step 5: Add a second pass. Choose a blocking key from a different field family. If your primary key is geographic (ZIP), your second key should be identity-based (name prefix) or contact-based (phone area code). The point is independence — if one key has an error, the other should still be correct.

Step 6: Deduplicate and proceed. Merge candidate pairs from all passes, remove duplicates, and pass them to your comparison pipeline. The comparison steps — string similarity, embedding similarity, LLM confirmation — only run on the candidate set that blocking produced.


Match Data Studio handles blocking automatically as part of its pipeline architecture. The system applies string pre-filters, numeric pre-filters, and AI similarity in a cascade that eliminates non-matching pairs at each stage — so expensive operations only run on records that survive the cheaper checks. Upload your CSVs and let the pipeline handle the scaling. Get started free →

