You upload two CSV files, run a matching job, and get disappointing results. Half the matches are wrong. Obvious pairs were missed. Before tweaking thresholds or changing algorithms, look at the data itself.

Matching algorithms compare what you give them. If one file has JOHN SMITH and the other has john smith, that’s already a penalty before any real comparison begins. If one file records phone numbers as (555) 123-4567 and the other as 5551234567, an exact match on phone will find zero overlap.

Most matching failures aren’t algorithm failures. They’re data preparation failures.

Why raw data produces bad matches

Every system that generates CSV exports has its own formatting conventions. CRMs store names in title case. Accounting systems use all caps. Government databases truncate fields to fixed widths. Marketing platforms allow freeform entry with no validation.

When you bring two of these exports together, you’re not just comparing entities — you’re comparing the formatting decisions of two different software systems and every person who ever typed data into them.

The matching algorithm has to work through all of this noise to find the signal. Some algorithms handle noise better than others, but none of them are immune. Cleaning the data before matching removes the noise so the algorithm can focus on actual differences between records.

The essential cleaning steps

Here’s what actually moves the needle, in order of effort vs. impact.

Case normalization

Trivial to implement, frequently skipped. JOHN SMITH, John Smith, john smith, and John SMITH should all compare identically. Lowercasing everything before matching eliminates an entire class of false negatives.
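A minimal sketch of this step in Python (the function name is illustrative):

```python
def normalize_case(value: str) -> str:
    # Lowercase so JOHN SMITH, John Smith, and john smith compare
    # identically. casefold() is a more aggressive lower() that also
    # handles non-ASCII cases (e.g. German ß -> ss).
    return value.casefold()
```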

Whitespace trimming and normalization

Leading spaces, trailing spaces, double spaces, tabs — all invisible to a human scanning a spreadsheet, all treated as different characters by a matching algorithm. " John Smith " and "John Smith" are different strings.

Trim leading and trailing whitespace. Collapse multiple internal spaces to single spaces. Remove non-breaking spaces and other Unicode whitespace characters.
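All three operations fit in one regex-based helper; a sketch, assuming Python's `re` module (whose Unicode `\s` covers tabs and non-breaking spaces):

```python
import re

def normalize_whitespace(value: str) -> str:
    # \s in Unicode mode matches spaces, tabs, newlines, and other
    # Unicode whitespace such as the non-breaking space (\xa0).
    # Collapse every run to a single space, then trim the ends.
    return re.sub(r"\s+", " ", value).strip()
```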

Character encoding fixes

If you see CafÃ© where Café should be, or Ã‘ instead of Ñ, you have a UTF-8/Latin-1 encoding mismatch. This is more common than most people expect, especially with data exported from older systems or passed through Excel. Mojibake (garbled characters from encoding mismatches) will cause every affected record to fail matching.
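The classic UTF-8-read-as-Latin-1 mistake can often be reversed with a round-trip re-encode; a sketch of that heuristic (it can mis-fire on genuine Latin-1 text, so treat it as a best-effort repair — a dedicated library like ftfy is more robust):

```python
def fix_mojibake(value: str) -> str:
    # Text that was UTF-8 but got decoded as Latin-1 shows up as
    # sequences like "CafÃ©". Encoding back to Latin-1 recovers the
    # original bytes; decoding those as UTF-8 reverses the damage.
    # If either step fails, the string was probably fine already.
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value
```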

Field parsing

A single column often contains multiple pieces of information. John Smith, CEO in a name field. 123 Main St, Apt 4 in an address field. New York, NY 10001 in a city field that also contains state and ZIP.

Splitting compound fields into separate components lets you match on each component independently. Name matching should compare names, not names-with-titles.
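For the city-state-ZIP case above, a sketch of one such split (the regex assumes US-style "City, ST 12345" values and is illustrative, not exhaustive):

```python
import re

def parse_city_state_zip(value: str) -> tuple[str, str, str]:
    # Split "New York, NY 10001" into (city, state, zip) so each
    # component can be matched independently. ZIP+4 suffixes are
    # accepted but dropped.
    match = re.match(r"^\s*(.+?),\s*([A-Z]{2})\s+(\d{5})(?:-\d{4})?\s*$", value)
    if match:
        return match.group(1), match.group(2), match.group(3)
    # Unparseable: keep the raw value in the city slot for review.
    return value, "", ""
```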

Address standardization

Addresses are a special case of normalization, complex enough to warrant their own post (see our address matching guide). The short version: expand abbreviations (St to Street, N to North, Apt to Apartment), parse into components, and compare components separately.
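The abbreviation-expansion piece can be sketched as a token lookup (the table below is a deliberately tiny illustrative subset; a real pipeline would use a full USPS-style table or a parser like libpostal):

```python
# Illustrative subset of an abbreviation table.
ABBREVIATIONS = {
    "st": "street", "n": "north", "apt": "apartment",
    "ste": "suite", "#": "suite",
}

def standardize_address(value: str) -> str:
    # Lowercase, strip punctuation, split "#200" into "# 200", then
    # expand every token that appears in the abbreviation table.
    cleaned = value.lower().replace(".", "").replace(",", "").replace("#", "# ")
    return " ".join(ABBREVIATIONS.get(t, t) for t in cleaned.split())
```

With this, "123 N. Main St., Ste 200" and "123 North Main Street #200" standardize to the same string.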

Phone number normalization

Strip everything that isn’t a digit. +1 (555) 123-4567, 555-123-4567, 5551234567, and 555.123.4567 should all become the same canonical form. Decide whether to keep country codes (relevant for international data) or strip them (for domestic-only matching).
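A sketch of both decisions in one helper (the US-specific country-code rule is an assumption for domestic-only matching):

```python
import re

def normalize_phone(value: str, strip_country_code: bool = True) -> str:
    # Remove everything that isn't a digit.
    digits = re.sub(r"\D", "", value)
    # For domestic-only matching, drop a leading US country code so
    # "+1 (555) 123-4567" and "555-123-4567" become identical.
    if strip_country_code and len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```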

Date format standardization

02/15/2026, 2026-02-15, Feb 15, 2026, 15/02/2026 — four representations of the same date. If dates are a matching field (birth date, registration date, transaction date), convert everything to ISO 8601 (YYYY-MM-DD) before comparing.

The ambiguous case: is 01/02/2026 January 2nd or February 1st? If your data mixes US and European date formats, you need to resolve this before matching or you’ll get systematic errors.
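A sketch of format-list parsing with the standard library (the format list is an assumption; the first match wins, so put your data's convention — US or European — first to resolve the 01/02/2026 ambiguity deliberately):

```python
from datetime import datetime

# Tried in order; list the convention your data actually uses first.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%B %d, %Y"]

def to_iso_date(value: str) -> str:
    # Convert any recognized representation to ISO 8601 (YYYY-MM-DD).
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""  # unparseable: flag for review rather than guess
```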

Null and empty handling

Missing data takes many forms: empty strings, N/A, n/a, NULL, null, UNKNOWN, Unknown, -, --, 0, None, not provided. These all mean “no value” but they’ll match against each other as if they were real data.

Normalize all missing-value indicators to a consistent representation (empty string or actual null) so they don’t generate false positive matches. Two records that both have N/A in the phone field are not a phone number match.
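A sketch of this mapping (the indicator set is an illustrative subset; whether values like "0" mean "missing" depends on the field, so extend the set per column):

```python
# Common missing-value spellings, compared case-insensitively.
NULL_INDICATORS = {"", "n/a", "null", "unknown", "-", "--", "none", "not provided"}

def normalize_null(value: str) -> str:
    # Map every missing-value spelling to the empty string so two
    # records with "N/A" phone fields never count as a phone match.
    return "" if value.strip().casefold() in NULL_INDICATORS else value
```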

Before and after

Here’s what these cleaning steps look like applied to real-world sample records.

Sample records before and after cleaning:

| Field   | Before (raw)               | After (cleaned)                 |
| ------- | -------------------------- | ------------------------------- |
| Name    | JOHN R. SMITH Jr.          | john r smith jr                 |
| Name    | Smith, John Robert         | john robert smith               |
| Email   | John.Smith@GMAIL.COM       | john.smith@gmail.com            |
| Phone   | +1 (555) 123-4567          | 5551234567                      |
| Phone   | 555.123.4567               | 5551234567                      |
| Address | 123 N. Main St., Ste 200   | 123 north main street suite 200 |
| Address | 123 North Main Street #200 | 123 north main street suite 200 |
| Date    | 02/15/2026                 | 2026-02-15                      |
| Date    | Feb 15, 2026               | 2026-02-15                      |
| City    | New York                   | new york                        |
| Missing | N/A                        | (empty)                         |
| Missing | UNKNOWN                    | (empty)                         |

After cleaning, records that represent the same entity produce identical or near-identical field values.

Notice what happened. The two phone numbers are now identical. The two addresses are now identical. The two dates are now identical. The null indicators are both empty strings. Before cleaning, none of these would have matched exactly. After cleaning, all of them do.

The impact is measurable

Each cleaning step applied in isolation has a modest effect. Applied together, they compound dramatically.

Match rate improvement by cleaning step (cumulative):

| Cleaning step applied      | Gain     | Match rate |
| -------------------------- | -------- | ---------- |
| Raw data (no cleaning)     | baseline | 47%        |
| + Case normalization       | +9 pp    | 56%        |
| + Whitespace trimming      | +4 pp    | 60%        |
| + Phone normalization      | +8 pp    | 68%        |
| + Address standardization  | +8 pp    | 76%        |
| + Date formatting          | +3 pp    | 79%        |
| + Null handling            | +3 pp    | 82%        |
| + AI similarity matching   | +11 pp   | 93%        |
Illustrative figures for a typical two-dataset matching job with 5,000 records per side.

The baseline — matching raw, uncleaned data — finds 47% of true matches. That means more than half of real matches are invisible because of formatting differences.

Basic text normalization (case + whitespace) gets you to 60%. Phone and address standardization push it to 76%. By the time you’ve applied all cleaning steps, you’re at 82% — and that’s before any fuzzy matching or AI similarity.

The AI matching layer adds another 11 points on clean data. But notice: if you skip cleaning and go straight to AI matching on raw data, the AI is spending its capacity dealing with formatting noise instead of actual semantic comparison. Cleaning first makes the AI layer more effective, not less necessary.

The 80/20 rule of data matching

80% of matching improvement comes from basic cleaning. The first four steps — case normalization, whitespace trimming, phone normalization, and address standardization — are all deterministic string operations. No machine learning. No configuration. No cost per record. They take minutes to implement and they eliminate the most common category of false negatives.

The remaining 20% of improvement comes from fuzzy matching, embeddings, and LLM confirmation — the techniques that handle genuine variation (nicknames, abbreviations, semantically equivalent descriptions). These are valuable, but they work best on data that’s already been cleaned.

How Match Data Studio handles this

Match Data Studio’s Stage 1 pipeline handles transformations before any AI processing begins. When you configure a project, the AI assistant sets up normalization rules for each field:

  • Text fields get case normalization, whitespace trimming, and character encoding fixes
  • Phone fields get digit extraction
  • Address fields get abbreviation expansion and component parsing
  • Date fields get format standardization
  • Null indicators get normalized to empty values

These transformations run before the data enters the similarity computation stages, so the embedding and comparison steps work on clean, consistent inputs.

You can review and modify every transformation rule in the pipeline configuration view. If the AI assistant missed a field or picked the wrong normalization strategy, adjust it before running.

The point is that data cleaning doesn’t have to be a separate manual step. It’s built into the matching pipeline. But understanding what it does — and why it matters — helps you evaluate results and troubleshoot when matches aren’t what you expected.


Clean data matches better. It’s that simple. Start a project in Match Data Studio and see the difference on your own datasets.

