Getting started with CSV matching

Matching records across two CSV datasets is one of the most common data tasks, and one of the most frustrating when the data is messy. Names are misspelled, IDs differ between systems, and addresses are abbreviated differently.

This guide walks through your first matching job in Match Data Studio from start to finish.

Step 1: Prepare your CSVs

You don’t need to do much preprocessing. Match Data Studio works directly with raw CSV files. A few things that help:

Make sure your files have headers in the first row
Remove any completely empty rows or columns
If you have a field that’s a likely unique identifier (like an email or product code), keep it in — it helps the AI configure matching rules

The files can be any size, but for your first run, a dataset of a few hundred rows will give you fast feedback.
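
If you would rather script the cleanup than do it in a spreadsheet, here is a minimal sketch in Python using only the standard library. The function name `clean_csv` is made up for this example and is not part of Match Data Studio. It assumes a column counts as "completely empty" only when both its header and every value are blank.

```python
import csv
import io


def clean_csv(text: str) -> str:
    """Drop fully empty rows and columns from CSV text, keeping the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or not any(cell.strip() for cell in rows[0]):
        raise ValueError("expected column headers in the first row")

    # Pad every row to the widest row so columns line up.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, body = rows[0], rows[1:]

    # Remove rows in which every cell is blank.
    body = [row for row in body if any(cell.strip() for cell in row)]

    # Keep a column if it has a header name or at least one value.
    keep = [
        i for i in range(width)
        if header[i].strip() or any(row[i].strip() for row in body)
    ]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    writer.writerows([row[i] for i in keep] for row in body)
    return out.getvalue()


raw = "name,email,,\nAda,ada@example.com,,\n,,,\nBob,,,\n"
print(clean_csv(raw))
```

The sample input mimics a spreadsheet export with trailing blank columns and a blank row in the middle. Named columns are kept even when they have gaps, since sparse fields can still be useful for matching.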

Step 2: Create a project and upload

Sign in, click New project, give it a name, and upload your two files. The uploader shows progress and confirms row counts on completion.

After uploading, Match Data Studio analyzes the columns and sample rows from both files to understand your data.

Step 3: Configure with the AI assistant

The AI assistant asks a few questions to set up the match:

What are you trying to match? (e.g., “customer records from our CRM against a purchased contact list”)
Which fields should be compared?
What makes two records a definite match vs. a likely match?

Based on your answers, the assistant configures an eight-stage matching pipeline:

Data preparation — normalization and cleaning
Text completions — filling sparse fields (optional)
Vector embeddings — semantic similarity
Cosine similarity — pairwise comparison with thresholds
Numeric matching — for dates, IDs, prices
String matching — fuzzy algorithms for names and codes
LLM confirmation — for borderline pairs
Export — column selection for output

You can review and edit any stage before running.
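
To give a feel for what the similarity stages compute, here is a rough sketch in Python of cosine similarity, fuzzy string comparison, and threshold-based labeling. It is a generic illustration and not how Match Data Studio implements these stages. The function names and the 0.9 and 0.7 cutoffs are assumptions chosen for the example, and `difflib.SequenceMatcher` stands in for whatever fuzzy algorithm your pipeline is configured to use.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def string_similarity(left, right):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, left.strip().lower(), right.strip().lower()).ratio()


def classify(score, definite=0.9, likely=0.7):
    """Label a score using two hypothetical cutoffs."""
    if score >= definite:
        return "definite"
    if score >= likely:
        return "likely"
    return "no match"


name_score = string_similarity("Jon Smith", "John Smith")
print(round(name_score, 3), classify(name_score))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Two cutoffs rather than one mirror the "definite match vs. likely match" question above: pairs in the middle band are the borderline cases that a confirmation step would look at more closely.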

Step 4: Run a sample first

Before processing your full dataset, run a sample (5 credits). This processes up to 100 rows from each file and shows you matched pairs with confidence scores.

Check the results:

Are the matches correct?
Is the threshold too tight (missing real matches) or too loose (including false positives)?

Adjust the configuration and run another sample if needed.
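
One way to judge whether a threshold is too tight or too loose is to hand-label a handful of sample pairs and see how precision and recall move as the cutoff changes. The sketch below assumes you have copied scores out of a sample run and marked each pair as a true match or not. The scores and labels here are invented purely for illustration.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs is a list of (score, is_true_match) tuples."""
    tp = sum(1 for score, truth in scored_pairs if score >= threshold and truth)
    fp = sum(1 for score, truth in scored_pairs if score >= threshold and not truth)
    fn = sum(1 for score, truth in scored_pairs if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing accepted, nothing wrong
    recall = tp / (tp + fn) if tp + fn else 1.0     # no true matches to miss
    return precision, recall


# Hypothetical hand-labeled sample: (confidence score, is it really a match?)
sample = [
    (0.97, True), (0.91, True), (0.84, True),
    (0.80, False), (0.72, True), (0.65, False),
]

for threshold in (0.6, 0.75, 0.9):
    precision, recall = precision_recall(sample, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this made-up sample, a cutoff of 0.9 accepts only correct pairs but finds just half of the true matches (too tight), while 0.6 finds all of them but a third of the accepted pairs are wrong (too loose).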

Step 5: Run the full dataset

Once the sample looks right, click Run full. Cost depends on the number of rows — you’ll see a credit estimate before confirming.

The pipeline runs in the background. You can close the tab and come back — progress is tracked in real time.

Step 6: Download results

When the run completes, download the results CSV. It contains:

All matched pairs
Fields from both datasets side by side
Similarity scores per field
An overall confidence score

From here, you can import into your CRM, database, or spreadsheet for review.
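
Before importing, you may want to split the results into pairs you trust and pairs that need a human look. Here is a minimal sketch in Python. The column names (`name_a`, `name_b`, `confidence`) and the 0.9 cutoff are assumptions for the example, so check the header row of your own export and adjust them.

```python
import csv
import io


def split_by_confidence(handle, column="confidence", cutoff=0.9):
    """Split result rows into (auto_accept, needs_review) by overall confidence."""
    auto_accept, needs_review = [], []
    for row in csv.DictReader(handle):
        if float(row[column]) >= cutoff:
            auto_accept.append(row)
        else:
            needs_review.append(row)
    return auto_accept, needs_review


# Stand-in for a downloaded results file; with a real export, use
# open("results.csv", newline="") instead of io.StringIO(...).
example = io.StringIO(
    "name_a,name_b,confidence\n"
    "Ada,Ada L.,0.96\n"
    "Bob,Rob,0.71\n"
)
auto_accept, needs_review = split_by_confidence(example)
print(len(auto_accept), "accepted,", len(needs_review), "to review")
```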

That’s a complete first run. The whole process — from upload to results — typically takes under 10 minutes for datasets of a few thousand rows.

Questions? Contact us or browse the blog for more guides.
