Matching at scale: strategies for millions of records
How to handle the N x M explosion in record matching — blocking strategies, pre-filter cascades, batch processing, and fault tolerance for large datasets.
Matching two small datasets is straightforward. Compare every record in dataset A against every record in dataset B, score each pair, keep the ones above your threshold. A few hundred records? Done in seconds.
But matching scales quadratically. And quadratic scaling breaks things fast.
The N x M problem
If dataset A has N records and dataset B has M records, brute-force comparison requires N x M pair evaluations. This grows faster than most people expect.
| Dataset A | Dataset B | Brute Force Pairs | At 1ms/pair | At $0.001/pair |
|---|---|---|---|---|
| 1,000 | 1,000 | 1 million | 17 minutes | $1,000 |
| 10,000 | 10,000 | 100 million | 28 hours | $100,000 |
| 50,000 | 50,000 | 2.5 billion | 29 days | $2.5M |
| 100,000 | 100,000 | 10 billion | 116 days | $10M |
| 500,000 | 500,000 | 250 billion | 7.9 years | $250M |
Costs assume LLM-based comparison at ~$0.001 per pair. Embedding costs are lower but still prohibitive at brute-force scale.
At 10K x 10K, brute force already takes 28 hours for a 1ms-per-pair operation. At 100K x 100K, evaluating every pair is off the table at any realistic per-pair AI cost: $10M at a tenth of a cent each. You need strategies to avoid generating most of those pairs in the first place.
Blocking strategies
Blocking divides records into groups that share a key, then only compares records within the same block. The idea: if two records don’t share a basic attribute, they’re almost certainly not a match. Don’t waste time comparing them.
Common blocking keys
ZIP code or postal code. Two people at the same address will share a ZIP. This is the simplest and most effective blocking key for address-based matching. A dataset spread evenly across 500 unique ZIPs reduces comparisons by roughly 500x; skewed distributions reduce it less, since a few oversized blocks dominate the pair count.
First N characters of name. Records starting with Smi only compare against other records starting with Smi. This catches Smith, Smithson, Smitty but won’t compare them against Johnson. Effective for name matching with low typo rates in the first few characters.
Phonetic code. Soundex or Metaphone encoding of the name field. Smith and Smyth produce the same Soundex code and end up in the same block. More robust than prefix blocking for names with typos, but only works for English.
Year or date range. For time-stamped records, only compare within a plausible date window. A transaction from 2024 probably doesn’t match one from 2019.
Composite keys. Combine blocking keys for tighter blocks: ZIP + first letter of last name or state + Soundex code. The tighter the blocks, the fewer comparisons — but you risk missing matches where the blocking key itself has errors.
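The mechanics above are simple to sketch. This is an illustrative implementation, not a library API: records are assumed to be dicts with `zip` and `last_name` fields, and the composite key shown is the ZIP-plus-first-letter scheme described above.

```python
from collections import defaultdict

def block_records(records, key_fn):
    """Group records by blocking key; only same-block records are ever compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def candidate_pairs(records_a, records_b, key_fn):
    """Yield only pairs that share a blocking key, instead of all N x M pairs."""
    blocks_b = block_records(records_b, key_fn)
    for a in records_a:
        for b in blocks_b.get(key_fn(a), []):
            yield a, b

# Composite key: ZIP plus first letter of last name (tighter blocks, fewer pairs)
composite_key = lambda r: (r["zip"], r["last_name"][:1].upper())
```

Swapping `composite_key` for a looser single-field key trades fewer missed matches for more comparisons, which is exactly the tension the next section describes.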
The blocking tradeoff
Blocking has a fundamental tension: tighter blocks mean fewer comparisons but more missed matches. If a record has a typo in the ZIP code, it ends up in the wrong block and never gets compared to its true match.
The standard mitigation is redundant blocking: use multiple blocking keys and take the union of all candidate pairs. Block on ZIP and on name prefix and on phone area code. A pair only needs to share one blocking key to be compared. This recovers matches that any single blocking scheme would miss, at the cost of more total comparisons.
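Redundant blocking can be sketched as a union of index pairs across several key functions; this is a minimal version assuming dict records, returning `(i, j)` index pairs rather than the records themselves.

```python
from collections import defaultdict

def union_candidates(records_a, records_b, key_fns):
    """Redundant blocking: a pair becomes a candidate if it shares ANY blocking key."""
    candidates = set()
    for key_fn in key_fns:
        blocks_b = defaultdict(list)
        for j, rec in enumerate(records_b):
            blocks_b[key_fn(rec)].append(j)
        for i, rec in enumerate(records_a):
            for j in blocks_b.get(key_fn(rec), []):
                candidates.add((i, j))  # the set union deduplicates across schemes
    return candidates
```

A record with a ZIP typo still surfaces through the name-prefix key, which is the whole point of the union.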
The pre-filter cascade
Blocking gets you candidate pairs. The pre-filter cascade winnows them down further before expensive processing. Each stage is a progressively more accurate (and more expensive) filter.
Stage 1: Exact match. Do any fields match exactly after normalization? If the email addresses are identical, that’s a match — no further processing needed. This resolves 20-40% of true matches in typical datasets, at zero cost.
Stage 2: String prefix / contains. Does the name in dataset A start with or contain the name in dataset B? This catches John R. Smith vs John Smith without computing a full similarity score. Fast and effective for trimmed or truncated fields.
Stage 3: String similarity threshold. Jaro-Winkler or token overlap with a generous threshold (e.g., 0.60). The goal is to remove pairs that are clearly not matches, not to confirm matches. This is a coarse filter — it should have high recall and is allowed to have low precision.
Stage 4: Numeric range. If a numeric field (price, square footage, year built, quantity) differs by more than a configurable percentage, eliminate the pair. Two houses with square footage differing by 50% are not the same property listing.
Stage 5: Embedding similarity. Generate embeddings for all surviving records and compute cosine similarity on candidate pairs. This is the first expensive step — but by this point, 80-95% of pairs have already been eliminated.
Stage 6: LLM confirmation. For pairs in the ambiguous similarity range, send full record context to a language model for a yes/no match decision. This is the most expensive step per pair, reserved for the hardest cases.
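The cascade's control flow is worth making concrete: each stage returns a definite verdict or passes the pair down. This sketch covers stages 1, 3, and 4 with stdlib stand-ins (`difflib.SequenceMatcher` in place of Jaro-Winkler); the field names and thresholds are illustrative assumptions.

```python
from difflib import SequenceMatcher

def exact_email(a, b):
    """Stage 1: identical normalized emails decide the pair immediately."""
    if a.get("email") and a["email"].lower() == b.get("email", "").lower():
        return True
    return None  # undecided, fall through

def coarse_name_similarity(a, b, threshold=0.60):
    """Stage 3: generous floor; SequenceMatcher stands in for Jaro-Winkler."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return None if score >= threshold else False

def numeric_range(a, b, field="sqft", max_pct=0.50):
    """Stage 4: eliminate pairs whose numeric field differs by more than max_pct."""
    hi, lo = max(a[field], b[field]), min(a[field], b[field])
    return None if hi and (hi - lo) / hi <= max_pct else False

def run_cascade(a, b, stages):
    """Apply cheap-to-expensive filters; each returns True/False or None."""
    for name, stage in stages:
        verdict = stage(a, b)
        if verdict is not None:
            return verdict, name
    return None, "llm_review"  # survivors go to embeddings / LLM confirmation

STAGES = [("exact", exact_email),
          ("similarity", coarse_name_similarity),
          ("numeric", numeric_range)]
```

Note the coarse filter's job is only to say "definitely not"; anything plausible survives to the expensive stages.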
| Dataset Size | Brute Force Pairs | After Blocking (ZIP) | After Pre-filter Cascade | Final LLM Reviews |
|---|---|---|---|---|
| 1K x 1K | 1M | 50K | 8K | 800 |
| 10K x 10K | 100M | 5M | 400K | 32K |
| 50K x 50K | 2.5B | 125M | 5M | 200K |
| 100K x 100K | 10B | 500M | 18M | 540K |
| 500K x 500K | 250B | 12.5B | 180M | 1.8M |
Figures are illustrative. Actual reduction depends on data characteristics, blocking key distribution, and filter thresholds.
The 50K x 50K case goes from 2.5 billion brute-force pairs to 200,000 LLM reviews — a 12,500x reduction. That’s the difference between a job that’s financially impossible and one that costs a few hundred dollars.
Batch processing and parallelism
Even after filtering, large jobs need efficient execution. Embedding APIs and LLMs have throughput limits and latency overhead that make naive sequential processing slow.
Embedding batching
Most embedding APIs accept batch requests — 100 to 2,000 texts per call. Batching amortizes the HTTP overhead and often gets better throughput. For a dataset of 50,000 records that survived pre-filtering, sending 500-record batches means 100 API calls instead of 50,000.
Key considerations:
- Respect rate limits. Most APIs allow 100-500 requests per minute. Back off and retry on 429 errors.
- Parallelize within rate limits. Send multiple batch requests concurrently up to your rate ceiling.
- Cache embeddings. If you re-run with a different threshold, you shouldn’t need to regenerate embeddings for unchanged records.
LLM parallelism
LLM confirmation calls are independent — each pair can be evaluated without knowing the results of other pairs. This makes them naturally parallelizable.
The practical constraint is API throughput. At 60 requests per minute (common for pay-as-you-go tiers), confirming 200,000 pairs takes 55 hours sequentially. At 600 RPM with batch mode, it takes 5.5 hours. Structure your system to maximize concurrent requests within rate limits.
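Because the calls are independent, a bounded thread pool is enough to saturate the rate limit. A minimal sketch, where `confirm_fn(pair) -> bool` is a placeholder for your LLM confirmation call:

```python
from concurrent.futures import ThreadPoolExecutor

def confirm_pairs(pairs, confirm_fn, max_concurrent=10):
    """Fan out independent LLM yes/no decisions; `max_workers` caps in-flight
    requests so the aggregate stays under the provider's RPM ceiling."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(confirm_fn, pairs))
```

`pool.map` preserves input order, so results line up with the candidate-pair list without extra bookkeeping.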
Chunk-level processing
For very large datasets, process in chunks: take 10,000 records at a time, complete all stages, write results, move to the next chunk. This keeps memory usage bounded and enables progress tracking at the chunk level.
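The chunk loop itself is trivial; the discipline is persisting each chunk's results before starting the next. A sketch, with `run_pipeline` and `write_results` as caller-supplied placeholders:

```python
def process_in_chunks(records, run_pipeline, write_results, chunk_size=10_000):
    """Run all matching stages on one bounded chunk at a time; persist each
    chunk's results before moving on, keeping memory flat."""
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        write_results(start // chunk_size, run_pipeline(chunk))
```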
Checkpoint and resume
Long-running matching jobs will fail. Networks drop. APIs hit rate limits. Cloud functions time out. A job that runs for 6 hours and fails at hour 5 with no checkpoint is a disaster — you’ve wasted all the compute and have nothing to show for it.
What to checkpoint
- After each pipeline stage. Write intermediate results (filtered pairs, enriched records, confirmed matches) to persistent storage before starting the next stage. If stage 3 fails, restart from stage 3 using stage 2’s outputs.
- Within long stages. If LLM confirmation processes 200,000 pairs, write confirmed results every 10,000 pairs. On restart, skip the pairs already confirmed and continue from the last checkpoint.
- Embedding cache. Store generated embeddings so they survive restarts. Regenerating embeddings for 50,000 records is expensive and unnecessary if the source data hasn’t changed.
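Within-stage checkpointing can be sketched with a JSON file keyed by pair index; a production system would use durable storage, but the skip-what's-done logic is the same. The file path and `every` interval are illustrative.

```python
import json, os

def confirm_with_checkpoints(pairs, confirm_fn, path, every=10_000):
    """Write confirmed results every `every` pairs; on restart, already-confirmed
    pairs are skipped and work resumes past the last checkpoint."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = {int(k): v for k, v in json.load(f).items()}
    for i, pair in enumerate(pairs):
        if i in done:
            continue  # confirmed in a previous run
        done[i] = confirm_fn(pair)
        if (i + 1) % every == 0:
            with open(path, "w") as f:
                json.dump(done, f)
    with open(path, "w") as f:  # final flush
        json.dump(done, f)
    return done
```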
Pause and resume
Some jobs are too large for a single execution window. A cloud function with a 60-minute timeout can’t process a 4-hour matching job in one shot. The system needs to:
- Detect when it’s approaching the timeout
- Save progress to a checkpoint
- Enqueue itself to continue from that checkpoint
- On resume, load the checkpoint and continue where it left off
This requires careful state management — tracking which candidate pairs have been processed, which are pending, and what intermediate scores have been computed. But it’s essential for reliability at scale.
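The deadline-detection part of that loop can be sketched simply. The 3300-second budget is an assumed safety margin (about 55 minutes of a 60-minute cloud function timeout); how the caller enqueues the continuation task is platform-specific and left abstract here.

```python
import time

def run_with_deadline(work_items, process, save_checkpoint, budget_seconds=3300):
    """Process until near the execution deadline, then checkpoint and return
    "PAUSED" so the caller can enqueue a continuation that resumes at the
    saved index."""
    start = time.monotonic()
    for i, item in enumerate(work_items):
        if time.monotonic() - start > budget_seconds:
            save_checkpoint(i)  # the next invocation starts at work_items[i]
            return "PAUSED"
        process(item)
    return "DONE"
```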
Time comparison
How do these strategies combine in practice?
The full optimization stack — blocking, pre-filter cascade, batching, and parallelism — reduces a 115-hour job to 1.5 hours. That’s not an incremental improvement. It’s the difference between a job that can’t run and one that finishes during lunch.
Architecture patterns
Stage-based pipeline
Divide the work into discrete stages, each reading from the previous stage’s output. Stage 1 transforms and pre-filters. Stage 2 generates embeddings. Stage 3 scores and confirms matches. Stage 4 produces output.
Each stage is an independent function that can be deployed, scaled, and debugged separately. If stage 3 has a bug, fix it and re-run from stage 3 — stages 1 and 2 don’t need to repeat.
Queue-driven execution
Use a task queue between stages. Stage 1 completes, writes its output to storage, and enqueues a task for stage 2. Stage 2 picks up the task, processes, and enqueues stage 3. This decouples stages and provides automatic retry on failure.
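The handoff pattern can be sketched with the stdlib's in-process `queue.Queue`; a real deployment would swap in a durable queue (which is what provides the automatic retry), but the stage-enqueues-next-stage shape is identical. Each stage here returns `(next_stage_name, output)`, with `None` signalling the end of the pipeline.

```python
import queue

def run_queue(stages, first_stage, payload):
    """Drain a task queue where each stage persists its output and enqueues the
    next stage's task. `stages` maps name -> fn(data) -> (next_name, output)."""
    tasks = queue.Queue()
    tasks.put((first_stage, payload))
    output = payload
    while not tasks.empty():
        name, data = tasks.get()
        next_stage, output = stages[name](data)
        if next_stage is not None:
            tasks.put((next_stage, output))
    return output
```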
Progress tracking
For user-facing systems, real-time progress is important. Track:
- Which stage is currently running
- Percentage complete within the current stage
- Estimated time remaining
- Record counts at each filter stage (so users can see how the funnel narrows)
This data serves double duty: it keeps users informed and helps you debug when something goes wrong. If 99% of pairs are eliminated by the string filter, either your threshold is too tight or your data has very low overlap.
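The funnel counts are the piece most worth sketching, since the stage-over-stage percentage is what exposes a mistuned threshold at a glance. A minimal tracker, with stage names as illustrative assumptions:

```python
class FunnelProgress:
    """Record pair counts at each filter stage so the narrowing funnel is
    visible to users and to whoever debugs a suspicious run."""
    def __init__(self):
        self.counts = []

    def record(self, stage, count):
        self.counts.append((stage, count))

    def report(self):
        lines = []
        for i, (stage, count) in enumerate(self.counts):
            prev = self.counts[i - 1][1] if i else count
            pct = 100.0 * count / prev if prev else 0.0
            lines.append(f"{stage}: {count:,} ({pct:.1f}% of previous stage)")
        return "\n".join(lines)
```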
Practical limits
Even with all optimizations, there are practical ceilings. A 1M x 1M matching job generates candidate pairs in the billions after blocking. The embedding and LLM costs, even after aggressive filtering, can reach thousands of dollars. At that scale, you need:
- Domain-specific blocking keys that achieve 99%+ reduction
- Aggressive pre-filter thresholds tuned on sample data
- Batched embedding APIs with high rate limits
- Budget modeling before starting the job
For most real-world use cases — matching customer lists, deduplicating CRM records, reconciling vendor data — datasets under 100K records are the norm. The strategies in this post make those jobs fast, affordable, and reliable.
Match Data Studio implements the full cascade architecture: blocking, string pre-filters, numeric pre-filters, embedding similarity, and LLM confirmation — with checkpointing and progress tracking built in. Upload your CSVs, configure the pipeline, and the system handles the scaling. Get started free →
Keep reading
- AI embeddings vs rule-based matching — the hybrid architecture that makes scale feasible
- How to choose the right matching algorithm — picking the right method before optimizing for speed
- Marketplace deduplication — scaling dedup across millions of scraped listings