Every recruiter has lived this scenario: you find a promising candidate on LinkedIn, reach out, and they reply with “I already spoke with your colleague last week.” You check your ATS — no record of the conversation. Then you discover the candidate exists under three separate records: one from a job board application six months ago, one entered manually by a different recruiter after a networking event, and one from a referral that came in through email.

The candidate had one experience with your company. Your system has three fragmented views of it.

This is not an edge case. It is the default state of recruiting data.

The duplicate candidate problem in recruiting (and what it costs)

Recruiting teams source candidates from an expanding number of channels — LinkedIn, Indeed, Glassdoor, ZipRecruiter, company career pages, employee referrals, recruiting agency submissions, university career fairs, and internal mobility programs. Each channel feeds into your ATS through a different integration or manual entry workflow, and each one creates a new record.

The result is predictable: a single candidate accumulates multiple records across channels, and your ATS has no reliable way to detect them.

Duplicate candidate rates by sourcing channel count

  1-2 channels (small teams, single job board):       8%
  3-4 channels (mid-size, multi-board + referrals):  18%
  5-6 channels (enterprise, diverse sourcing):       27%
  7+ channels (global teams, agencies + events):     38%

Estimated duplicate rates based on typical ATS databases. More channels means more entry points for the same candidate.

The costs go beyond data clutter. When two recruiters unknowingly work the same candidate for the same role, the candidate receives duplicate outreach — a poor experience that damages your employer brand. When a candidate’s interview history is split across records, hiring managers lack the full picture during evaluation. When disposition tracking is fragmented, your compliance reporting becomes unreliable, which matters in regulated industries and government contracting.

There is also a hidden cost to recruiting metrics. If 25% of your candidate records are duplicates, your "candidates sourced" number overstates the true candidate count by a third: the real figure is only 75% of what you report. Your cost-per-candidate is artificially low. Your source-of-hire attribution is unreliable because the same candidate may have been "sourced" from three channels, but only one led to the actual hire. Every downstream metric that starts with candidate count inherits the distortion.

Why candidates appear differently across sources

The matching challenge in recruiting data is not just about typos. Candidates legitimately appear differently across platforms because of how people present themselves in different contexts.

One candidate across five sourcing channels

Source                 | Name                | Email                        | Title                    | Company      | Location
LinkedIn Recruiter     | Katherine M. Rivera | (none)                       | Senior Software Engineer | Stripe, Inc. | San Francisco, CA
Indeed application     | Kate Rivera         | krivera.dev@gmail.com        | Sr. SWE                  | Stripe       | San Francisco
Employee referral      | Kat Rivera          | katherine@stripe.com         | Senior Engineer          | Stripe       | SF Bay Area
University alumni list | Katherine Rivera    | kmrivera@alumni.berkeley.edu | Software Engineer        | (none)       | California
Agency submission      | K. Rivera           | (none)                       | Senior Software Eng.     | Stripe Inc   | SF, CA

Different name formats, emails, title abbreviations, company suffixes, and location representations — all the same person.

Name variations alone create significant matching difficulty. "Katherine," "Kate," "Kat," and "K." are all valid representations, and none of them match on an exact string comparison. An ATS that deduplicates only on exact email match fares no better here: only three of the five records have an email address at all, and all three addresses are different, so email matching links none of them.
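As a rough sketch of how nickname handling can work, the comparison below canonicalizes first names through a lookup table before comparing. The NICKNAMES map is a tiny illustrative sample; production systems use dictionaries with thousands of entries.

```python
# Sketch: nickname-aware name comparison. The nickname map below is a
# small illustrative sample, not a complete dictionary.

NICKNAMES = {
    "kate": "katherine", "kat": "katherine", "kathy": "katherine",
    "bob": "robert", "rob": "robert", "liz": "elizabeth",
}

def canonical_first(name: str) -> str:
    """Lowercase a first name, drop trailing periods, expand known nicknames."""
    n = name.lower().strip(".")
    return NICKNAMES.get(n, n)

def names_compatible(a: str, b: str) -> bool:
    """True if two full names could plausibly be the same person.
    Assumes 'First [Middle...] Last' format; compares canonical first
    names (allowing bare initials) and requires identical last names."""
    a_first, *_, a_last = a.lower().split()
    b_first, *_, b_last = b.lower().split()
    if a_last != b_last:
        return False
    ca, cb = canonical_first(a_first), canonical_first(b_first)
    # A single initial like "K." is compatible with any name sharing it
    if len(ca) == 1 or len(cb) == 1:
        return ca[0] == cb[0]
    return ca == cb

print(names_compatible("Kate Rivera", "Katherine M. Rivera"))  # True
print(names_compatible("K. Rivera", "Kat Rivera"))             # True
print(names_compatible("Karen Rivera", "Kate Rivera"))         # False
```

A name-compatibility result like this should only ever raise match confidence; on its own it never confirms identity, exactly as described above.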

Title variations compound the problem. “Senior Software Engineer,” “Sr. SWE,” “Senior Engineer,” and “Software Engineer” all describe the same role at different levels of abbreviation or precision. Company names fluctuate between “Stripe,” “Stripe, Inc.,” “Stripe Inc,” and just the domain context. Location representations range from full city-state to abbreviations to regional descriptions.

These are not data quality problems in the traditional sense. Each source is internally consistent and accurate. The mismatch happens when you try to reconcile records across sources that follow different formatting conventions.

The matching fields: name, email, phone, current company, location

Not all candidate fields carry equal weight for deduplication. The right matching strategy uses a hierarchy of identifiers, from high-confidence unique fields down to supporting contextual fields.

Email address is the strongest single identifier. When two records share an email, the probability of a true match is above 95%. The challenge is that candidates use different emails across platforms — a personal Gmail for job board applications, a work email in referral contexts, a university email on alumni lists. Matching on email alone catches some duplicates but misses many.

Phone number is the second strongest identifier. Like email, it is nearly unique to an individual, but candidates do not always provide it, and formatting varies (dashes, parentheses, country codes, extensions).
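Formatting noise in phone numbers is straightforward to neutralize before matching. The sketch below reduces a number to a digits-only key, dropping extensions and applying a default country code to 10-digit national numbers; this is a simplification, and production systems typically rely on a dedicated library such as libphonenumber.

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to digits only, dropping extensions and
    prefixing a default country code to bare 10-digit numbers
    (a US/Canada-centric simplification for this sketch)."""
    # Drop anything after an extension marker ("x", "ext", "ext.")
    raw = re.split(r"(?i)\s*(?:x|ext\.?)\s*\d+$", raw)[0]
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country + digits
    return digits

# All of these collapse to the same matching key:
for raw in ["(415) 555-0137", "415-555-0137",
            "+1 415 555 0137", "415.555.0137 ext. 22"]:
    print(normalize_phone(raw))
```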

Full name is necessary but insufficient. It catches the majority of true matches when combined with other fields, but it has high ambiguity for common names. A name match should raise confidence, not confirm identity.

Current company is a strong supporting signal. Two records with similar names and the same current employer are very likely the same person. But company names vary across sources, and candidates change jobs — the company listed in a six-month-old application may not match the company on a current LinkedIn profile.

Location is a weak but useful tiebreaker. It helps distinguish between “Sarah Johnson in Dallas” and “Sarah Johnson in Portland” but adds little value when other fields already align. Location formats vary wildly across sources.

The practical approach is to configure matching with email and phone as primary identifiers (high weight, high threshold), name and company as secondary identifiers (medium weight, fuzzy matching), and location as a supporting field (low weight, broad geographic matching that treats “SF” and “San Francisco, CA” as equivalent).
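That hierarchy can be sketched as a weighted score. The weights, the comparator choices, and the example records below are illustrative assumptions, not a prescribed configuration.

```python
# Sketch of the tiered weighting described above: exact match on primary
# identifiers, fuzzy match on secondary ones. Weights are illustrative.
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() if a and b else 0.0

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted match confidence in [0, 1]."""
    # Primary identifiers: exact match, high weight
    email = 1.0 if rec_a.get("email") and rec_a.get("email") == rec_b.get("email") else 0.0
    phone = 1.0 if rec_a.get("phone") and rec_a.get("phone") == rec_b.get("phone") else 0.0
    # Secondary identifiers: fuzzy match, medium weight
    name = fuzzy(rec_a.get("name", ""), rec_b.get("name", ""))
    company = fuzzy(rec_a.get("company", ""), rec_b.get("company", ""))
    # Supporting field: low weight
    location = fuzzy(rec_a.get("location", ""), rec_b.get("location", ""))
    weights = {"email": 0.35, "phone": 0.25, "name": 0.20,
               "company": 0.15, "location": 0.05}
    signals = {"email": email, "phone": phone, "name": name,
               "company": company, "location": location}
    return sum(weights[f] * signals[f] for f in weights)

a = {"name": "Kate Rivera", "email": "krivera.dev@gmail.com", "company": "Stripe"}
b = {"name": "Katherine Rivera", "email": "krivera.dev@gmail.com", "company": "Stripe, Inc."}
print(round(match_score(a, b), 2))
```

Note that a missing field contributes zero rather than counting against the match; two records should not look dissimilar just because one source omitted a phone number.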

Handling title and company variations across platforms

Candidate titles deserve special treatment because they carry recruiting-specific meaning that generic fuzzy matching misses.

Consider the difference between “VP of Engineering” and “VP of Sales.” These titles are roughly 70% similar as strings — same prefix, same structure, one word different. But they represent completely different functions, and matching them would be a serious error. Meanwhile, “VP of Engineering” and “Vice President, Engineering” are only about 55% similar as strings but represent the identical role.

AI-powered matching handles this distinction well because it operates on semantic meaning rather than character overlap. The embedding for “VP of Engineering” and “Vice President, Engineering” will be nearly identical because they encode the same concept. The embedding for “VP of Engineering” and “VP of Sales” will diverge because the underlying roles are different.
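You can see the failure mode without an embedding model: plain token overlap (ignoring “of”) rates “VP of Engineering” closer to “VP of Sales” than to “Vice President, Engineering.” The sketch below shows how even a small abbreviation dictionary (illustrative, not exhaustive) partially recovers the semantic comparison that embeddings perform natively.

```python
# Token-overlap similarity on titles, with abbreviation expansion.
# The ABBREV map is a small illustrative sample.

ABBREV = {"vp": "vice president", "sr": "senior", "jr": "junior",
          "swe": "software engineer", "eng": "engineering"}

def tokens(title: str) -> set:
    """Lowercase, strip punctuation, expand abbreviations, drop filler words."""
    words = title.lower().replace(",", "").replace(".", "").split()
    expanded = " ".join(ABBREV.get(w, w) for w in words)
    return set(expanded.split()) - {"of", "the"}

def title_similarity(a: str, b: str) -> float:
    """Jaccard overlap on expanded token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

same_role = title_similarity("VP of Engineering", "Vice President, Engineering")
diff_role = title_similarity("VP of Engineering", "VP of Sales")
print(same_role, diff_role)  # 1.0 0.5
```

After expansion, the same-role pair scores higher than the different-role pair, restoring the ordering that raw string similarity gets wrong. Embeddings achieve the same effect without a hand-built dictionary.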

Company name matching has a similar pattern. You need to handle legitimate variations (abbreviations, suffixes, parent-vs-subsidiary) without conflating different companies.

Common company name variations in candidate records

Variation type       | Example A             | Example B            | Same company?
Legal suffix         | Microsoft Corporation | Microsoft Corp.      | Yes
Abbreviation         | JPMorgan Chase        | JPMC                 | Yes
Parent vs subsidiary | Alphabet Inc.         | Google               | Depends on context
Common prefix        | General Electric      | General Motors       | No
Acquired company     | Tableau Software      | Salesforce (Tableau) | Yes, post-acquisition
DBA name             | Meta Platforms Inc.   | Facebook             | Yes

AI embeddings handle most of these correctly because they encode company identity, not just string patterns.

For recruiting specifically, the parent-vs-subsidiary distinction matters. A candidate who lists “Google” and one who lists “Alphabet” may or may not be a match depending on your context. If you are deduplicating across all roles, they are the same person. If you are tracking candidates per business unit, you may want to distinguish them. Your matching configuration should reflect this decision.
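A normalization pass can encode that decision explicitly. In the sketch below, legal suffixes are stripped unconditionally, while parent/subsidiary mappings live in an opt-in alias table, because they are a policy choice rather than a string problem. The suffix pattern and alias entries are small illustrative samples.

```python
import re

# Illustrative company-name normalizer. SUFFIXES and ALIASES are samples.
SUFFIXES = r"\b(?:inc|incorporated|corp|corporation|llc|ltd|co|company)\b\.?"
ALIASES = {"alphabet": "google"}  # opt-in: parent/subsidiary is a policy call

def normalize_company(name: str, apply_aliases: bool = False) -> str:
    """Lowercase, strip legal suffixes and punctuation, collapse whitespace,
    optionally map known parent/subsidiary aliases."""
    n = name.lower()
    n = re.sub(SUFFIXES, "", n)
    n = re.sub(r"[.,]", "", n).strip()
    n = re.sub(r"\s+", " ", n)
    if apply_aliases:
        n = ALIASES.get(n, n)
    return n

print(normalize_company("Microsoft Corporation"))              # microsoft
print(normalize_company("Microsoft Corp."))                    # microsoft
print(normalize_company("Stripe, Inc."))                       # stripe
print(normalize_company("Alphabet Inc.", apply_aliases=True))  # google
```

Note that suffix stripping alone leaves “General Electric” and “General Motors” distinct, which is the desired behavior: normalization should remove legal noise, not collapse genuinely different companies.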

Building a unified talent pool from multiple sourcing channels

The deduplication workflow for recruiting data follows a specific sequence that accounts for the realities of multi-channel sourcing.

Step 1: Export and normalize. Pull candidate records from each source as CSV files. Standardize column names — map “First Name” and “Last Name” from your ATS to a combined “Full Name” column, or vice versa. Ensure email and phone fields are in consistent formats. This normalization step prevents easy matches from being missed due to structural differences between exports.
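A minimal version of this normalization pass might look like the following, assuming two exports with different column conventions; the column names here are examples, not a standard.

```python
import csv
import io

# Two hypothetical exports with different column conventions.
ATS_CSV = """First Name,Last Name,Email Address
Kate,Rivera,KRivera.dev@Gmail.com
"""
LINKEDIN_CSV = """Full Name,Email
Katherine M. Rivera,
"""

def normalize(reader, name_cols, email_col):
    """Map source-specific columns onto a shared schema:
    combined full_name plus lowercased email."""
    rows = []
    for row in reader:
        name = " ".join(row[c].strip() for c in name_cols if row.get(c))
        email = (row.get(email_col) or "").strip().lower()
        rows.append({"full_name": name, "email": email})
    return rows

ats = normalize(csv.DictReader(io.StringIO(ATS_CSV)),
                ["First Name", "Last Name"], "Email Address")
li = normalize(csv.DictReader(io.StringIO(LINKEDIN_CSV)),
               ["Full Name"], "Email")
print(ats[0])  # {'full_name': 'Kate Rivera', 'email': 'krivera.dev@gmail.com'}
print(li[0])
```

Once every source is in the same shape, the matching steps that follow never have to special-case which export a record came from.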

Step 2: Choose your golden record source. Decide which system contains the most authoritative candidate data. Typically this is your primary ATS, because it has interview history, disposition codes, and recruiter notes. All other sources will be matched against this golden source.

Step 3: Match iteratively. Match your ATS export against each external source one at a time. Start with the source most likely to produce high-confidence matches — usually LinkedIn or your second ATS, if you have one. Then match against job board exports, referral lists, and event attendee lists.

Step 4: Review and merge. For each matching round, review the results in three tiers. High-confidence matches (email or phone match plus name similarity) can be auto-merged. Medium-confidence matches (name plus company, no contact info match) should be reviewed by a recruiter. Low-confidence matches should be flagged but not merged without manual verification.
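The three tiers reduce to a small routing rule. The boolean inputs below would come from whatever matcher you use; the logic simply mirrors the tier description above.

```python
# Three-tier routing for match review, as described above.
# Inputs are signals produced by an upstream matcher.

def route_match(email_match: bool, phone_match: bool,
                name_similar: bool, company_similar: bool) -> str:
    contact_match = email_match or phone_match
    if contact_match and name_similar:
        return "auto-merge"          # high confidence
    if name_similar and company_similar and not contact_match:
        return "recruiter-review"    # medium confidence
    if name_similar:
        return "flag-only"           # low confidence: never merge unreviewed
    return "no-match"

print(route_match(True, False, True, False))   # auto-merge
print(route_match(False, False, True, True))   # recruiter-review
print(route_match(False, False, True, False))  # flag-only
```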

Step 5: Enrich on merge. When merging records, fill gaps in your golden record with data from the matched source. If your ATS record lacks a phone number but the referral submission has one, add it. If the LinkedIn record has a more current title, update it. The merge should add information, not overwrite verified data.
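A gap-filling merge can be sketched as follows. The `refreshable` list marks fields allowed to overwrite, such as current title; a real implementation would also compare timestamps to confirm the source value is actually newer.

```python
# Enrich-on-merge sketch: the golden record keeps its verified values,
# and the matched source only fills gaps or refreshes whitelisted fields.

def enrich(golden: dict, source: dict, refreshable=("title",)) -> dict:
    merged = dict(golden)
    for field, value in source.items():
        if not value:
            continue                      # never copy empty values
        if not merged.get(field):
            merged[field] = value         # fill a gap in the golden record
        elif field in refreshable:
            merged[field] = value         # allowed to refresh (e.g. current title)
    return merged

golden = {"name": "Kat Rivera", "email": "katherine@stripe.com",
          "phone": "", "title": "Senior Engineer"}
source = {"name": "Katherine M. Rivera", "phone": "14155550137",
          "title": "Senior Software Engineer"}
print(enrich(golden, source))
```

The key invariant is the one stated above: the merge adds information (phone filled, title refreshed) without overwriting verified data (the golden record's name survives).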

Typical match confidence distribution — ATS vs. LinkedIn export

  High confidence (email or phone + name match):     42%
  Medium confidence (name + company, no contact):    31%
  Low confidence (name only or partial signals):     15%
  No match (new candidates not in ATS):              12%

Distribution from matching a 20K-record ATS against a 5K LinkedIn Recruiter export. The 12% unmatched records are genuinely new candidates to add.

The 12% of unmatched records in the external source are not failures — they represent genuinely new candidates that your ATS did not previously contain. Add these as new records. This is how deduplication and sourcing work together: the same matching job both cleans your existing data and identifies net-new candidates.

Keeping your pipeline clean: ongoing deduplication as candidates flow in

A one-time deduplication project is useful, but the problem recurs immediately. New applications arrive daily from multiple channels. Without ongoing deduplication, your database drifts back to its duplicated state within months.

The sustainable approach is to build deduplication into your sourcing workflow rather than treating it as a periodic cleanup project.

At ingestion. When a new candidate record enters your ATS — whether through a job board application, a recruiter adding a contact, or an agency submission — match it against existing records before creating a new entry. If a high-confidence match exists, merge the new data into the existing record and alert the relevant recruiter. If a medium-confidence match exists, flag it for review.
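Ingestion-time matching reduces to a guard in front of record creation. The thresholds and the trivial score function below are illustrative placeholders for whatever matcher you actually use.

```python
# Ingestion-time dedup guard (sketch). Thresholds are illustrative, and
# score_fn stands in for a real matcher.

HIGH, MEDIUM = 0.85, 0.60

def ingest(new_record, existing_records, score_fn):
    """Return the action to take for an incoming record:
    merge into a match, queue for review, or create a new record."""
    best, best_score = None, 0.0
    for rec in existing_records:
        s = score_fn(new_record, rec)
        if s > best_score:
            best, best_score = rec, s
    if best_score >= HIGH:
        return ("merge", best)    # merge + alert the owning recruiter
    if best_score >= MEDIUM:
        return ("review", best)   # queue for recruiter review
    return ("create", None)       # genuinely new candidate

# Example with a trivial scorer keyed on exact email:
score = lambda a, b: 1.0 if a["email"] and a["email"] == b["email"] else 0.0
existing = [{"email": "krivera.dev@gmail.com"}]
print(ingest({"email": "krivera.dev@gmail.com"}, existing, score))
```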

On a weekly cadence. Run a batch deduplication across recent entries to catch matches that ingestion-time matching missed — for example, when two applications from different channels arrive on the same day and neither existed in the database previously.

Before major sourcing pushes. When your team is about to source heavily for a new role — pulling lists from LinkedIn, running Boolean searches, requesting agency submissions — deduplicate your existing pool for that function and location first. This prevents the sourcing push from generating duplicate outreach to candidates already in your pipeline.

The operational benefit is significant. Recruiters stop wasting time on candidates already being worked by colleagues. Hiring managers see complete candidate histories. Source-of-hire attribution becomes accurate. And candidates receive a consistent, professional experience regardless of how many times they have crossed paths with your organization.

Candidate data is inherently messy because people present themselves differently across contexts, change jobs, update contact information, and use different names in different settings. The question is not whether your ATS has duplicates — it does. The question is whether you have a systematic process for finding and resolving them.


Match Data Studio matches candidate records across ATS exports, job board data, and referral lists using AI that handles name variants, title abbreviations, and company name inconsistencies. Start deduplicating your talent pool →

