Matching at scale: strategies for millions of records
How to handle the N x M explosion in record matching — blocking strategies, pre-filter cascades, batch processing, and fault tolerance for large datasets.
Matching two small datasets is straightforward. Compare every record in dataset A against every record in dataset B, score each pair, keep the ones above your threshold. A few hundred records? Done in seconds.
But matching scales quadratically. And quadratic scaling breaks things fast.
The N x M problem
If dataset A has N records and dataset B has M records, brute-force comparison requires N x M pair evaluations. This grows faster than most people expect.
| Dataset A | Dataset B | Brute Force Pairs | At 1ms/pair | At $0.001/pair |
|---|---|---|---|---|
| 1,000 | 1,000 | 1 million | 17 minutes | $1,000 |
| 10,000 | 10,000 | 100 million | 28 hours | $100,000 |
| 50,000 | 50,000 | 2.5 billion | 29 days | $2.5M |
| 100,000 | 100,000 | 10 billion | 116 days | $10M |
| 500,000 | 500,000 | 250 billion | 7.9 years | $250M |
Costs assume LLM-based comparison at ~$0.001 per pair. Embedding costs are lower but still prohibitive at brute-force scale.
At 10K x 10K, brute force already takes 28 hours for a 1ms-per-pair operation. At 100K x 100K, evaluating every pair is off the table at any realistic per-pair AI cost: $10M at a tenth of a cent each. You need strategies to avoid generating most of those pairs in the first place.
Blocking strategies
Blocking divides records into groups that share a key, then only compares records within the same block. The idea: if two records don’t share a basic attribute, they’re almost certainly not a match. Don’t waste time comparing them.
Common blocking keys
ZIP code or postal code. Two people at the same address will share a ZIP. This is the simplest and most effective blocking key for address-based matching. A dataset spread evenly across 500 unique ZIPs reduces comparisons by roughly 500x; skewed distributions reduce it less, since a few oversized blocks dominate the pair count.
First N characters of name. Records starting with Smi only compare against other records starting with Smi. This catches Smith, Smithson, Smitty but won’t compare them against Johnson. Effective for name matching with low typo rates in the first few characters.
Phonetic code. Soundex or Metaphone encoding of the name field. Smith and Smyth produce the same Soundex code and end up in the same block. More robust than prefix blocking for names with typos, but only works for English.
Year or date range. For time-stamped records, only compare within a plausible date window. A transaction from 2024 probably doesn’t match one from 2019.
Composite keys. Combine blocking keys for tighter blocks: ZIP + first letter of last name or state + Soundex code. The tighter the blocks, the fewer comparisons — but you risk missing matches where the blocking key itself has errors.
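The mechanics above are simple to sketch. This is an illustrative implementation, not a library API: records are assumed to be dicts with `zip` and `last_name` fields, and the composite key shown is the ZIP-plus-first-letter scheme described above.

```python
from collections import defaultdict

def block_records(records, key_fn):
    """Group records by blocking key; only same-block records are ever compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def candidate_pairs(records_a, records_b, key_fn):
    """Yield only pairs that share a blocking key, instead of all N x M pairs."""
    blocks_b = block_records(records_b, key_fn)
    for a in records_a:
        for b in blocks_b.get(key_fn(a), []):
            yield a, b

# Composite key: ZIP plus first letter of last name (tighter blocks, fewer pairs)
composite_key = lambda r: (r["zip"], r["last_name"][:1].upper())
```

Swapping `composite_key` for a looser single-field key trades fewer missed matches for more comparisons, which is exactly the tension the next section describes.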
The blocking tradeoff
Blocking has a fundamental tension: tighter blocks mean fewer comparisons but more missed matches. If a record has a typo in the ZIP code, it ends up in the wrong block and never gets compared to its true match.
The standard mitigation is redundant blocking: use multiple blocking keys and take the union of all candidate pairs. Block on ZIP and on name prefix and on phone area code. A pair only needs to share one blocking key to be compared. This recovers matches that any single blocking scheme would miss, at the cost of more total comparisons.
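Redundant blocking can be sketched as a union of index pairs across several key functions; this is a minimal version assuming dict records, returning `(i, j)` index pairs rather than the records themselves.

```python
from collections import defaultdict

def union_candidates(records_a, records_b, key_fns):
    """Redundant blocking: a pair becomes a candidate if it shares ANY blocking key."""
    candidates = set()
    for key_fn in key_fns:
        blocks_b = defaultdict(list)
        for j, rec in enumerate(records_b):
            blocks_b[key_fn(rec)].append(j)
        for i, rec in enumerate(records_a):
            for j in blocks_b.get(key_fn(rec), []):
                candidates.add((i, j))  # the set union deduplicates across schemes
    return candidates
```

A record with a ZIP typo still surfaces through the name-prefix key, which is the whole point of the union.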
The pre-filter cascade
Blocking gets you candidate pairs. The pre-filter cascade winnows them down further before expensive processing. Each stage is a progressively more accurate (and more expensive) filter.
Stage 1: Exact match. Do any fields match exactly after normalization? If the email addresses are identical, that’s a match — no further processing needed. This resolves 20-40% of true matches in typical datasets, at zero cost.
Stage 2: String prefix / contains. Does the name in dataset A start with or contain the name in dataset B? This catches John R. Smith vs John Smith without computing a full similarity score. Fast and effective for trimmed or truncated fields.
Stage 3: String similarity threshold. Jaro-Winkler or token overlap with a generous threshold (e.g., 0.60). The goal is to remove pairs that are clearly not matches, not to confirm matches. This is a coarse filter — it should have high recall and is allowed to have low precision.
Stage 4: Numeric range. If a numeric field (price, square footage, year built, quantity) differs by more than a configurable percentage, eliminate the pair. Two houses with square footage differing by 50% are not the same property listing.
Stage 5: Embedding similarity. Generate embeddings for all surviving records and compute cosine similarity on candidate pairs. This is the first expensive step — but by this point, 80-95% of pairs have already been eliminated.
Stage 6: LLM confirmation. For pairs in the ambiguous similarity range, send full record context to a language model for a yes/no match decision. This is the most expensive step per pair, reserved for the hardest cases.
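The cascade's control flow is worth making concrete: each stage returns a definite verdict or passes the pair down. This sketch covers stages 1, 3, and 4 with stdlib stand-ins (`difflib.SequenceMatcher` in place of Jaro-Winkler); the field names and thresholds are illustrative assumptions.

```python
from difflib import SequenceMatcher

def exact_email(a, b):
    """Stage 1: identical normalized emails decide the pair immediately."""
    if a.get("email") and a["email"].lower() == b.get("email", "").lower():
        return True
    return None  # undecided, fall through

def coarse_name_similarity(a, b, threshold=0.60):
    """Stage 3: generous floor; SequenceMatcher stands in for Jaro-Winkler."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return None if score >= threshold else False

def numeric_range(a, b, field="sqft", max_pct=0.50):
    """Stage 4: eliminate pairs whose numeric field differs by more than max_pct."""
    hi, lo = max(a[field], b[field]), min(a[field], b[field])
    return None if hi and (hi - lo) / hi <= max_pct else False

def run_cascade(a, b, stages):
    """Apply cheap-to-expensive filters; each returns True/False or None."""
    for name, stage in stages:
        verdict = stage(a, b)
        if verdict is not None:
            return verdict, name
    return None, "llm_review"  # survivors go to embeddings / LLM confirmation

STAGES = [("exact", exact_email),
          ("similarity", coarse_name_similarity),
          ("numeric", numeric_range)]
```

Note the coarse filter's job is only to say "definitely not"; anything plausible survives to the expensive stages.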
| Dataset Size | Brute Force Pairs | After Blocking (ZIP) | After Pre-filter Cascade | Final LLM Reviews |
|---|---|---|---|---|
| 1K x 1K | 1M | 50K | 8K | 800 |
| 10K x 10K | 100M | 5M | 400K | 32K |
| 50K x 50K | 2.5B | 125M | 5M | 200K |
| 100K x 100K | 10B | 500M | 18M | 540K |
| 500K x 500K | 250B | 12.5B | 180M | 1.8M |
Figures are illustrative. Actual reduction depends on data characteristics, blocking key distribution, and filter thresholds.
The 50K x 50K case goes from 2.5 billion brute-force pairs to 200,000 LLM reviews — a 12,500x reduction. That’s the difference between a job that’s financially impossible and one that costs a few hundred dollars.
Batch processing and parallelism
Even after filtering, large jobs need efficient execution. Embedding APIs and LLMs have throughput limits and latency overhead that make naive sequential processing slow.
Embedding batching
Most embedding APIs accept batch requests — 100 to 2,000 texts per call. Batching amortizes the HTTP overhead and often gets better throughput. For a dataset of 50,000 records that survived pre-filtering, sending 500-record batches means 100 API calls instead of 50,000.
Key considerations:
- Respect rate limits. Most APIs allow 100-500 requests per minute. Back off and retry on 429 errors.
- Parallelize within rate limits. Send multiple batch requests concurrently up to your rate ceiling.
- Cache embeddings. If you re-run with a different threshold, you shouldn’t need to regenerate embeddings for unchanged records.
LLM parallelism
LLM confirmation calls are independent — each pair can be evaluated without knowing the results of other pairs. This makes them naturally parallelizable.
The practical constraint is API throughput. At 60 requests per minute (common for pay-as-you-go tiers), confirming 200,000 pairs takes 55 hours sequentially. At 600 RPM with batch mode, it takes 5.5 hours. Structure your system to maximize concurrent requests within rate limits.
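Because the calls are independent, a bounded thread pool is enough to saturate the rate limit. A minimal sketch, where `confirm_fn(pair) -> bool` is a placeholder for your LLM confirmation call:

```python
from concurrent.futures import ThreadPoolExecutor

def confirm_pairs(pairs, confirm_fn, max_concurrent=10):
    """Fan out independent LLM yes/no decisions; `max_workers` caps in-flight
    requests so the aggregate stays under the provider's RPM ceiling."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(confirm_fn, pairs))
```

`pool.map` preserves input order, so results line up with the candidate-pair list without extra bookkeeping.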
Chunk-level processing
For very large datasets, process in chunks: take 10,000 records at a time, complete all stages, write results, move to the next chunk. This keeps memory usage bounded and enables progress tracking at the chunk level.
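The chunk loop itself is trivial; the discipline is persisting each chunk's results before starting the next. A sketch, with `run_pipeline` and `write_results` as caller-supplied placeholders:

```python
def process_in_chunks(records, run_pipeline, write_results, chunk_size=10_000):
    """Run all matching stages on one bounded chunk at a time; persist each
    chunk's results before moving on, keeping memory flat."""
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        write_results(start // chunk_size, run_pipeline(chunk))
```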
Checkpoint and resume
Long-running matching jobs will fail. Networks drop. APIs hit rate limits. Cloud functions time out. A job that runs for 6 hours and fails at hour 5 with no checkpoint is a disaster — you’ve wasted all the compute and have nothing to show for it.
What to checkpoint
- After each pipeline stage. Write intermediate results (filtered pairs, enriched records, confirmed matches) to persistent storage before starting the next stage. If stage 3 fails, restart from stage 3 using stage 2’s outputs.
- Within long stages. If LLM confirmation processes 200,000 pairs, write confirmed results every 10,000 pairs. On restart, skip the pairs already confirmed and continue from the last checkpoint.
- Embedding cache. Store generated embeddings so they survive restarts. Regenerating embeddings for 50,000 records is expensive and unnecessary if the source data hasn’t changed.
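Within-stage checkpointing can be sketched with a JSON file keyed by pair index; a production system would use durable storage, but the skip-what's-done logic is the same. The file path and `every` interval are illustrative.

```python
import json, os

def confirm_with_checkpoints(pairs, confirm_fn, path, every=10_000):
    """Write confirmed results every `every` pairs; on restart, already-confirmed
    pairs are skipped and work resumes past the last checkpoint."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = {int(k): v for k, v in json.load(f).items()}
    for i, pair in enumerate(pairs):
        if i in done:
            continue  # confirmed in a previous run
        done[i] = confirm_fn(pair)
        if (i + 1) % every == 0:
            with open(path, "w") as f:
                json.dump(done, f)
    with open(path, "w") as f:  # final flush
        json.dump(done, f)
    return done
```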
Pause and resume
Some jobs are too large for a single execution window. A cloud function with a 60-minute timeout can’t process a 4-hour matching job in one shot. The system needs to:
- Detect when it’s approaching the timeout
- Save progress to a checkpoint
- Enqueue itself to continue from that checkpoint
- On resume, load the checkpoint and continue where it left off
This requires careful state management — tracking which candidate pairs have been processed, which are pending, and what intermediate scores have been computed. But it’s essential for reliability at scale.
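The deadline-detection part of that loop can be sketched simply. The 3300-second budget is an assumed safety margin (about 55 minutes of a 60-minute cloud function timeout); how the caller enqueues the continuation task is platform-specific and left abstract here.

```python
import time

def run_with_deadline(work_items, process, save_checkpoint, budget_seconds=3300):
    """Process until near the execution deadline, then checkpoint and return
    "PAUSED" so the caller can enqueue a continuation that resumes at the
    saved index."""
    start = time.monotonic()
    for i, item in enumerate(work_items):
        if time.monotonic() - start > budget_seconds:
            save_checkpoint(i)  # the next invocation starts at work_items[i]
            return "PAUSED"
        process(item)
    return "DONE"
```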
Time comparison
How do these strategies combine in practice?
The full optimization stack — blocking, pre-filter cascade, batching, and parallelism — reduces a 115-hour job to 1.5 hours. That’s not an incremental improvement. It’s the difference between a job that can’t run and one that finishes during lunch.
Architecture patterns
Stage-based pipeline
Divide the work into discrete stages, each reading from the previous stage’s output. Stage 1 transforms and pre-filters. Stage 2 generates embeddings. Stage 3 scores and confirms matches. Stage 4 produces output.
Each stage is an independent function that can be deployed, scaled, and debugged separately. If stage 3 has a bug, fix it and re-run from stage 3 — stages 1 and 2 don’t need to repeat.
Queue-driven execution
Use a task queue between stages. Stage 1 completes, writes its output to storage, and enqueues a task for stage 2. Stage 2 picks up the task, processes, and enqueues stage 3. This decouples stages and provides automatic retry on failure.
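The handoff pattern can be sketched with the stdlib's in-process `queue.Queue`; a real deployment would swap in a durable queue (which is what provides the automatic retry), but the stage-enqueues-next-stage shape is identical. Each stage here returns `(next_stage_name, output)`, with `None` signalling the end of the pipeline.

```python
import queue

def run_queue(stages, first_stage, payload):
    """Drain a task queue where each stage persists its output and enqueues the
    next stage's task. `stages` maps name -> fn(data) -> (next_name, output)."""
    tasks = queue.Queue()
    tasks.put((first_stage, payload))
    output = payload
    while not tasks.empty():
        name, data = tasks.get()
        next_stage, output = stages[name](data)
        if next_stage is not None:
            tasks.put((next_stage, output))
    return output
```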
Progress tracking
For user-facing systems, real-time progress is important. Track:
- Which stage is currently running
- Percentage complete within the current stage
- Estimated time remaining
- Record counts at each filter stage (so users can see how the funnel narrows)
This data serves double duty: it keeps users informed and helps you debug when something goes wrong. If 99% of pairs are eliminated by the string filter, either your threshold is too tight or your data has very low overlap.
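The funnel counts are the piece most worth sketching, since the stage-over-stage percentage is what exposes a mistuned threshold at a glance. A minimal tracker, with stage names as illustrative assumptions:

```python
class FunnelProgress:
    """Record pair counts at each filter stage so the narrowing funnel is
    visible to users and to whoever debugs a suspicious run."""
    def __init__(self):
        self.counts = []

    def record(self, stage, count):
        self.counts.append((stage, count))

    def report(self):
        lines = []
        for i, (stage, count) in enumerate(self.counts):
            prev = self.counts[i - 1][1] if i else count
            pct = 100.0 * count / prev if prev else 0.0
            lines.append(f"{stage}: {count:,} ({pct:.1f}% of previous stage)")
        return "\n".join(lines)
```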
Practical limits
Even with all optimizations, there are practical ceilings. A 1M x 1M matching job generates candidate pairs in the billions after blocking. The embedding and LLM costs, even after aggressive filtering, can reach thousands of dollars. At that scale, you need:
- Domain-specific blocking keys that achieve 99%+ reduction
- Aggressive pre-filter thresholds tuned on sample data
- Batched embedding APIs with high rate limits
- Budget modeling before starting the job
For most real-world use cases — matching customer lists, deduplicating CRM records, reconciling vendor data — datasets under 100K records are the norm. The strategies in this post make those jobs fast, affordable, and reliable.
Match Data Studio implements the full cascade architecture: blocking, string pre-filters, numeric pre-filters, embedding similarity, and LLM confirmation — with checkpointing and progress tracking built in. Upload your CSVs, configure the pipeline, and the system handles the scaling. Get started free →
Keep reading
- AI embeddings vs rule-based matching — the hybrid architecture that makes scale feasible
- How to choose the right matching algorithm — picking the right method before optimizing for speed
- Marketplace deduplication — scaling dedup across millions of scraped listings