How to validate a data migration by matching source and target records
Catch migration errors before go-live. Learn how to match source and target records, check for data loss, detect transformation errors, and build a migration validation report.
You spent six months planning the migration. The ETL scripts are written, the schema mapping is done, the cutover window is scheduled for Saturday night. On Monday morning, everyone logs into the new system and starts working.
Two weeks later, someone notices that 1,400 customer records have the wrong billing address. A month after that, finance discovers that transaction amounts in the new system don’t match the old one for records created before 2019. By quarter-end, the compliance team finds that 3% of account records are simply missing.
None of these errors triggered an alert during migration. The ETL jobs completed successfully. Row counts matched. The data looked fine. But nobody matched individual records between source and target to verify that the data actually transferred correctly.
This is what migration validation catches — and what count-based checks miss.
What can go wrong in a data migration (and what matching catches)
Migrations fail in ways that are invisible to basic checks. Understanding the failure modes tells you what to look for.
Truncation. A VARCHAR(100) field in the source maps to a VARCHAR(40) field in the target. Long values get silently clipped. A company name like International Business Machines Corporation becomes International Business Machines Corporat. No error is thrown. The data is technically valid. But it’s wrong.
Encoding corruption. Character encoding mismatches turn Müller into MÃ¼ller or M?ller. Accented names, currency symbols, and CJK characters are particularly vulnerable. The record exists in the target, but the data is garbled.
Transformation errors. Date formats flip during migration. 03/04/2023 means March 4th in the source’s MM/DD/YYYY format, but a target that expects DD/MM/YYYY stores it as April 3rd. The error is only detectable if you compare the actual values.
Dropped records. A WHERE clause in the ETL script inadvertently filters out records with NULL values in a non-key field. The record count drops by 2%, but in a system of 300,000 records, nobody notices that 6,000 rows vanished.
Duplicate creation. A retry after a timeout inserts the same batch twice. Or a lookup join produces unexpected one-to-many relationships, duplicating parent records. The target now has more records than the source.
Default value injection. Fields that are NULL in the source get populated with defaults in the target — empty strings, zeros, or placeholder dates like 1900-01-01. The record exists and has data, but it’s fabricated data that never existed in the source.
Every one of these errors is invisible to a simple row count check. Every one of them is detectable by matching individual records between source and target and comparing field values.
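Once matched source and target values sit side by side, these failure modes can be screened mechanically. Here is a minimal sketch of such a classifier; the placeholder set and the ordering of the checks are assumptions for illustration, not a standard:

```python
# Hypothetical field-level classifier for the failure modes above:
# truncation (target is a clipped prefix of the source), mojibake
# (UTF-8 bytes misread as Latin-1), and default-value injection.

PLACEHOLDERS = {"", "0", "1900-01-01", "N/A"}  # assumed defaults; adjust per system

def classify_discrepancy(source, target):
    if source == target:
        return "match"
    # Truncation: the target value is a strict prefix of the source value.
    if source and target and source.startswith(target) and len(target) < len(source):
        return "truncation"
    # Mojibake: re-encoding the target as Latin-1 and decoding as UTF-8
    # recovers the source when UTF-8 bytes were misread as Latin-1.
    if source and target:
        try:
            if target.encode("latin-1").decode("utf-8") == source:
                return "encoding"
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass
    # Default injection: NULL in the source, a known placeholder in the target.
    if source is None and target in PLACEHOLDERS:
        return "default_injection"
    return "value_change"
```

Run against matched pairs, this turns “the data looks wrong” into a per-field tally of which failure mode occurred.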
Exporting source and target snapshots for comparison
Before you can match records, you need comparable exports from both systems. This step is more nuanced than it sounds.
Timing matters. If the source system is still live during migration, records may be created, updated, or deleted between the time you export the source snapshot and the time the target is populated. Always export the source snapshot after the cutover freeze — the point at which the source system stops accepting writes.
Export the same scope. If your migration covers three years of transaction history, export three years from both sides. If it covers only active customers, apply the same filter to both exports. Mismatched scope produces phantom “missing” records that are actually just out of scope.
Include primary keys from both systems. If the target system assigns new IDs, export a cross-reference table that maps old IDs to new IDs. If no cross-reference exists (common in legacy migrations), you’ll need to match on natural keys instead.
Normalize format before exporting. Export dates in ISO format (YYYY-MM-DD). Export numbers without currency symbols or thousands separators. Export text in UTF-8. The less formatting variation between exports, the fewer false mismatches during validation.
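The normalization rules above can be captured in a few small helpers applied to each column before export. A sketch, assuming the source stores dates as MM/DD/YYYY (adjust the format string to your system):

```python
from datetime import datetime
import re

def normalize_date(raw, fmt="%m/%d/%Y"):
    """Convert a source-format date (MM/DD/YYYY assumed here) to ISO."""
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")

def normalize_amount(raw):
    """Strip currency symbols and thousands separators: '$1,234.50' -> '1234.50'."""
    return re.sub(r"[^\d.\-]", "", raw)

def normalize_text(raw):
    """Collapse runs of whitespace so spacing differences don't count as mismatches."""
    return " ".join(raw.split())
```

Applying the same helpers to both exports means any remaining difference is a migration artifact, not a formatting one.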
A practical approach: export each system to CSV with consistent column naming. One file for the source, one for the target. Include every field you need to validate — not just the key fields, but the data fields where errors might hide.
Matching records: key fields, fuzzy tolerance, and handling transformations
With both snapshots in hand, the core task is matching each source record to its corresponding target record. This is record linkage — the same technique used for entity resolution, applied to a different problem.
When IDs survive migration, matching is straightforward. Join on the shared primary key. Every source record should have exactly one target record, and vice versa. Unmatched records on either side indicate dropped or duplicated rows.
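The shared-key case reduces to a dictionary join. A minimal sketch (field names are illustrative):

```python
def key_join(source_rows, target_rows, key="id"):
    """Join two lists of row dicts on a shared primary key; return matched
    pairs plus unmatched source rows (dropped) and orphan target rows."""
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    # If len(target_rows) != len(tgt), the target has duplicate keys.
    matched = {k: (src[k], tgt[k]) for k in src.keys() & tgt.keys()}
    missing = [src[k] for k in src.keys() - tgt.keys()]   # dropped during migration
    orphans = [tgt[k] for k in tgt.keys() - src.keys()]   # no corresponding source
    return matched, missing, orphans
```

Comparing `len(target_rows)` against the number of distinct keys also exposes duplicate insertion, which a plain join would silently collapse.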
When IDs change, you need natural keys. These are combinations of business fields that uniquely identify a record:
- Customer records: name + email + phone, or name + address + date of birth
- Order records: order number + date + customer ID
- Product records: SKU + name, or UPC + manufacturer
- Transaction records: account + date + amount + reference number
Natural keys often require fuzzy matching because the migration itself may have altered formatting. A name stored as SMITH, JOHN A. in the legacy system might become John A. Smith in the new system. Exact matching would miss this; fuzzy matching with a high threshold (0.85+) catches it.
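The SMITH, JOHN A. case can be handled by normalizing the name before scoring similarity. A sketch using the standard library’s `difflib` (the normalization rules are assumptions about one common legacy format, not a general solution):

```python
from difflib import SequenceMatcher

def normalize_name(name):
    """'SMITH, JOHN A.' -> 'john a smith': reorder 'LAST, FIRST',
    lowercase, drop periods, collapse whitespace."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.replace(".", " ").lower().split())

def name_similarity(a, b):
    """Similarity in [0, 1]; compare against a high threshold like 0.85."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

With normalization doing the heavy lifting, formatting-only differences score at or near 1.0, and the 0.85 threshold is left to absorb genuine noise like a missing middle initial.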
| Record type | Primary natural key | Secondary natural key | Fuzzy tolerance | Expected match rate |
|---|---|---|---|---|
| Customer | Email + last name | Phone + ZIP + first name | 0.85 on names | 97-99% |
| Order | Order number + date | Customer + amount + date | Exact on numbers | 99%+ |
| Product | SKU or UPC | Name + category + price | 0.80 on names | 95-98% |
| Employee | Employee ID or SSN-last4 | Name + hire date + department | 0.90 on names | 98-99% |
| Transaction | Reference + date + amount | Account + date + amount | Exact on amounts | 99%+ |
Match rates assume clean migration. Lower rates indicate data quality issues that need investigation.
Set your threshold high for migration validation. Unlike entity resolution where you’re looking for possible matches across messy data, migration validation is comparing records that should be identical. If a record doesn’t match at a high threshold, something went wrong during migration — and that’s exactly what you want to find.
The four validation checks: count, completeness, accuracy, and orphans
A thorough migration validation runs four distinct checks, each catching a different class of error.
Check 1: Count validation. The simplest check. Does the number of records in the target match the source? Do it at the table level and also at the segment level — counts by record type, by date range, by status, by geographic region. Table-level counts can match even when segment-level counts are off, because drops and duplicates can cancel each other out.
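Segment-level counting is a one-pass tally on each export. A sketch, with the segment field name as an assumption:

```python
from collections import Counter

def segment_counts(rows, segment_field):
    """Tally rows per segment value (record type, status, region, ...)."""
    return Counter(r[segment_field] for r in rows)

def count_check(source_rows, target_rows, segment_field):
    """Return segments whose counts disagree. Drops and duplicates that
    cancel at the table level still show up here."""
    src = segment_counts(source_rows, segment_field)
    tgt = segment_counts(target_rows, segment_field)
    return {seg: (src.get(seg, 0), tgt.get(seg, 0))
            for seg in src.keys() | tgt.keys()
            if src.get(seg, 0) != tgt.get(seg, 0)}
```

Note that in the test below the table-level totals match (three rows each) while two segments are off, which is exactly the cancellation case the check exists to catch.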
Check 2: Completeness validation. For every record in the source, does a corresponding record exist in the target? This is where record matching happens. After matching, any unmatched source records are missing from the target. Any unmatched target records are orphans — records that exist in the target but have no source, indicating duplicates or fabricated data.
Check 3: Accuracy validation. For each matched pair, do the field values agree? This is the most revealing check. Compare every critical field: names, dates, amounts, statuses, addresses. Flag any pair where a field value changed unexpectedly. A customer’s name should be identical (or a known transformation of the original). A transaction amount should match to the penny. A date should not have shifted by a day or a month.
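Field-by-field comparison of a matched pair can be sketched as below; the "tolerable" tier here admits only case and whitespace differences, which is an assumption you should tune to your own acceptance criteria:

```python
def compare_fields(src_row, tgt_row, fields):
    """Classify each field of a matched pair as identical, tolerable
    (case/whitespace differences only), or mismatch."""
    results = {}
    for f in fields:
        s, t = src_row.get(f), tgt_row.get(f)
        if s == t:
            results[f] = "identical"
        elif (isinstance(s, str) and isinstance(t, str)
              and " ".join(s.split()).lower() == " ".join(t.split()).lower()):
            results[f] = "tolerable"
        else:
            results[f] = "mismatch"
    return results
```

Amounts and dates should never land in the tolerable tier; for those fields, anything other than "identical" is a finding.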
Check 4: Orphan analysis. Records that exist in the target but not the source need explanation. Some are expected — the migration process may generate system records, default configurations, or seed data. Others are duplicates created by retry logic, or records from a different migration batch that shouldn’t be in this dataset. Classify every orphan.
These four checks form a hierarchy. Count validation is the quickest sanity check. Completeness validation catches missing and extra records. Accuracy validation catches field-level corruption. Orphan analysis explains anomalies. Skip any layer and you leave a category of errors undetected.
Building a migration validation report from match results
Raw match results are data. A validation report is a decision-making tool. The difference is structure and context.
Start with the executive summary: how many records were validated, what percentage matched, and what’s the overall data quality assessment. Decision-makers need one number: is this migration clean enough to go live?
Then break it down by category:
Match summary. Total source records, total target records, matched pairs, unmatched source (missing), unmatched target (orphans). Include percentages. A 99.2% match rate sounds good until you realize that 0.8% of 500,000 records is 4,000 missing customers.
Field-level accuracy. For each validated field, report the percentage of matched pairs where the field value is identical, where it differs within acceptable tolerance (case changes, whitespace), and where it differs beyond tolerance (actual data corruption). Rank fields by error rate so the team knows where to focus remediation.
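Ranking fields by error rate is a simple aggregation over the per-pair results. A sketch, assuming each pair has been reduced to a dict of per-field statuses (as in a field comparison step):

```python
from collections import Counter

def field_error_rates(pair_results):
    """pair_results: list of per-pair dicts like {'name': 'mismatch', ...}.
    Returns (field, beyond-tolerance error rate) pairs, worst first."""
    totals, errors = Counter(), Counter()
    for result in pair_results:
        for field, status in result.items():
            totals[field] += 1
            if status == "mismatch":
                errors[field] += 1
    rates = {f: errors[f] / totals[f] for f in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

The top of this ranking is where remediation effort goes first.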
Error classification. Group discrepancies by type: truncation, encoding, format change, value change, NULL injection. Each type has a different root cause and a different fix. Truncation means a schema mismatch. Encoding errors mean a character set configuration issue. Format changes mean a transformation rule is wrong.
Sample discrepancies. Include 10-20 specific examples of each error type. Abstractions like “3% of name fields have encoding errors” become actionable when accompanied by concrete examples: Source: Müller | Target: MÃ¼ller. Engineers can trace each example back to the specific ETL step that caused it.
Go/no-go recommendation. Based on the error rates and severity, does the migration meet the acceptance criteria defined in the migration plan? If not, which errors must be fixed before cutover and which can be remediated post-migration?
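The recommendation itself can be mechanized against the acceptance criteria. A sketch; the threshold names and values here are illustrative assumptions, since the real numbers come from your migration plan:

```python
def go_no_go(match_rate, field_rates, thresholds):
    """Check validation results against assumed acceptance criteria, e.g.
    {'min_match_rate': 0.99, 'max_field_error_rate': 0.001}."""
    blockers = []
    if match_rate < thresholds["min_match_rate"]:
        blockers.append("match rate %.2f%% below minimum" % (match_rate * 100))
    for field, rate in field_rates.items():
        if rate > thresholds["max_field_error_rate"]:
            blockers.append("%s error rate %.2f%% exceeds maximum" % (field, rate * 100))
    return ("GO" if not blockers else "NO-GO", blockers)
```

A NO-GO comes back with its list of blockers, which maps directly onto the "must fix before cutover" section of the report.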
When to re-run: iterative validation during phased migrations
Large migrations rarely happen in a single cutover. Phased migrations move data in stages — by module, by business unit, by date range, or by entity type. Each phase needs its own validation, and the validation approach must evolve across phases.
Phase 1: Pilot validation. Migrate a small, representative subset (1,000-5,000 records). Validate every record, every field. This is where you catch systematic errors — transformation bugs, encoding issues, schema mismatches. Fix them before scaling.
Phase 2: Incremental validation. Each subsequent batch gets the same four checks. Compare error rates across batches. If batch 3 has a higher error rate than batches 1 and 2, something changed — maybe a new record type triggered an untested code path, or a source table was modified between extracts.
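Comparing a batch against its predecessors can be as simple as checking each batch’s error rate against the mean of the batches before it. A sketch; the 1.5x tolerance is an arbitrary illustrative choice:

```python
def flag_anomalous_batches(batch_error_rates, tolerance=1.5):
    """Flag batch indices whose error rate exceeds the mean of all earlier
    batches by more than `tolerance` times (tolerance is an assumption)."""
    flagged = []
    for i in range(1, len(batch_error_rates)):
        prior = batch_error_rates[:i]
        baseline = sum(prior) / len(prior)
        if baseline and batch_error_rates[i] > tolerance * baseline:
            flagged.append(i)
    return flagged
```

A flagged batch is the trigger to diff its ETL configuration and source extract against the earlier, clean batches.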
Phase 3: Cumulative validation. After the final batch, validate the entire target dataset against the complete source snapshot. This catches cross-batch issues: records that were modified in the source between batch extracts, or lookup references that point to records loaded in a different batch.
Phase 4: Post-cutover validation. Run validation again one week after go-live. Compare the target system’s current state against the validated snapshot. This catches issues introduced by early production usage — data that was modified by users before data stewards finished their remediation work.
Each phase generates its own validation report. Track error rates across phases to ensure they’re trending downward. If a new error type appears in phase 3 that didn’t exist in phase 1, investigate immediately — it may indicate a systemic issue that affects all previously migrated records.
The key insight is that migration validation is not a one-time event. It’s an iterative process that runs alongside the migration itself, catching errors early when they’re cheap to fix rather than late when they’ve already propagated into production workflows.
Match Data Studio handles the record matching at the heart of migration validation — upload your source and target exports as CSVs, match on natural keys with fuzzy tolerance, and get a field-by-field comparison that shows exactly where the data diverges. Start validating your migration data →
Keep reading
- How to compare two datasets — the general technique behind migration validation
- Data cleaning before matching — prep work that reduces false mismatches in your validation
- Five matching mistakes that silently ruin your results — configuration errors that can make validation results misleading