Marketplace data deduplication: cleaning scraped listings at scale

You scrape Amazon, eBay, and Walmart weekly. Last week’s pull returned 48,000 product listings across three categories. Your analyst builds a report: “48,000 unique products in the competitive landscape.”

That number is wrong. It’s probably closer to 19,000 unique products listed an average of 2.5 times each — across sellers, across platforms, and sometimes multiple times within the same platform.

Every decision made on the inflated number is distorted. Market sizing is off by 2.5x. Average pricing is skewed by seller variation. Category composition looks broader than it actually is. Trend analysis mistakes seller churn for product churn.

Deduplication isn’t a nice-to-have. It’s a prerequisite for any marketplace analytics that means anything.

Why marketplace data is so duplicated

Marketplaces are not catalogs. They’re platforms where multiple sellers list the same product independently. Each seller writes their own title, sets their own price, and adds their own keywords. Amazon alone can have 15+ listings for a single popular product from different sellers.

Five Amazon listings — actually two unique products

Listing title	Seller	Price	Actual product
Sony WH-1000XM5 Wireless Noise Canceling Headphones Black	TechDirect	$298	Sony WH-1000XM5 (Black)
Sony WH1000XM5 Headphones Bluetooth NC Over Ear BLACK - NEW	AudioDeals247	$279	Sony WH-1000XM5 (Black)
SONY WH-1000XM5/B Wireless ANC Headphones — Factory Sealed	BestPriceElectronics	$289	Sony WH-1000XM5 (Black)
Anker Soundcore Life Q35 Wireless ANC Headphones Navy Blue	AnkerOfficial	$79	Anker Soundcore Life Q35 (Navy)
Soundcore by Anker Life Q35 Bluetooth Active NC Headphone NAVY	GadgetWorld	$74	Anker Soundcore Life Q35 (Navy)

Three Sony listings and two Anker listings represent just two unique products.

The Sony headphones appear three times under three different titles. The model number is written “WH-1000XM5” in two listings and “WH1000XM5” (no hyphen) in the third. Noise cancellation shows up as “Noise Canceling”, “NC”, and “ANC”. Two sellers say “Wireless” and one says “Bluetooth”. One seller adds “NEW”, another adds “Factory Sealed”. The model number suffix “/B” appears in one listing but not the others.

None of these titles match exactly. But they’re all the same product, and any analytics pipeline that treats them as three separate products is producing misleading output.

The taxonomy of marketplace duplicates

Not all duplicates are created equal. Understanding the types helps calibrate your deduplication approach.

Same product, different sellers: The most common type. Identical physical product listed by multiple third-party sellers. Titles vary because each seller optimizes for different keywords.

Same product, different conditions: One listing says “New”, another says “Renewed”, another says “Open Box”. These are the same underlying product but may warrant separate treatment depending on your analysis — you might want to aggregate them under one product but keep condition as an attribute.

Same product, bundled differently: “Sony WH-1000XM5” vs “Sony WH-1000XM5 + Carrying Case Bundle” vs “Sony WH-1000XM5 with Extra Ear Pads”. The base product is the same, but the bundle adds complexity. Aggressive deduplication merges these; conservative deduplication keeps them separate.

Size and color variants: “Nike Air Max 90 White Size 10” vs “Nike Air Max 90 White Size 11”. These are the same product in a meaningful sense (same model) but different SKUs. Your deduplication strategy should define whether variants collapse into one record or remain separate.

Cross-platform duplicates: The same seller or product appearing on Amazon, eBay, and Walmart. Titles may differ significantly across platforms because each has different SEO conventions and character limits.

Why naive approaches break down

UPC/EAN matching works when it’s available — but scraped listings often don’t expose the barcode. Amazon’s ASIN is platform-specific and not available on eBay. eBay’s item ID is meaningless on Amazon. There’s no universal product identifier consistently present across scraped marketplace data.

Brand + model number extraction helps but is fragile. Model numbers appear in different formats (“WH-1000XM5” vs “WH1000XM5” vs “1000XM5”), and not all products have clear model numbers. Fashion, home goods, and grocery categories are particularly difficult because model numbers either don’t exist or aren’t consistently included in titles.

Title clustering with TF-IDF or n-grams gets you partway, but the noise in marketplace titles — seller-injected keywords, promotional text, variant attributes — pushes similar products apart and pulls different products together. “Sony WH-1000XM5 Case” (an accessory) might score closer to “Sony WH-1000XM5 Headphones” than two differently-titled listings for the same headphones.
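The accessory problem is easy to reproduce with plain token overlap. The sketch below uses Jaccard similarity over whitespace tokens as a simplified stand-in for TF-IDF or n-gram scoring; the function name `token_jaccard` is illustrative and not taken from any library.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


headphones = "Sony WH-1000XM5 Headphones"
accessory = "Sony WH-1000XM5 Case"
listing_a = "Sony WH-1000XM5 Wireless Noise Canceling Headphones Black"
listing_b = "Sony WH1000XM5 Headphones Bluetooth NC Over Ear BLACK - NEW"

# Different products (headphones vs. a case) that share brand and model tokens.
accessory_score = token_jaccard(headphones, accessory)
# The same product, described by two different sellers.
same_product_score = token_jaccard(listing_a, listing_b)
```

With this scoring the case scores 0.5 against the headphones, while the two genuine duplicates score about 0.21, because they share only three of fourteen distinct tokens. Any threshold that accepts the real duplicates also accepts the accessory.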

Duplicate rate by marketplace category

Category	Notes	Duplicate rate
Consumer electronics	High seller overlap, clear model numbers	68%
Beauty & personal care	Brand variations, kit/bundle combos	54%
Home & kitchen	Generic descriptions, less standardized	45%
Clothing & apparel	Size/color variants inflate counts	72%
Grocery & gourmet	Pack sizes create real distinctions	38%

Percentage of scraped listings that are duplicates of another listing in the same dataset. Based on typical multi-marketplace scrapes.

The deduplication pipeline

Effective marketplace deduplication follows a layered approach.

1. Normalize

Strip the noise that makes matching harder without adding signal:

Remove promotional prefixes and suffixes: “SALE”, “FREE SHIPPING”, “BEST DEAL”, “NEW IN BOX”
Standardize brand names: map “Sony”, “SONY”, “Sony Electronics” to a canonical form
Normalize model numbers: remove hyphens, standardize spacing, uppercase
Extract and separate variant attributes: size, color, condition, quantity

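This preprocessing can be simple regex and lookup-table work. It doesn’t need to be perfect, because the embedding step later in the pipeline absorbs residual noise, but it removes most of the superficial variation between titles and produces the clean brand values that blocking depends on.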
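A minimal sketch of that normalization, assuming small hand-written lookup lists (`PROMO_PATTERN`, `MODEL_PATTERN`, and `COLORS` are illustrative and would be far larger in practice). It covers promotional text, model-number formatting, and color extraction; brand mapping and condition extraction are left out for brevity.

```python
import re

# Illustrative lists only; a real pipeline would maintain larger lookup tables.
PROMO_PATTERN = re.compile(
    r"\b(FREE SHIPPING|BEST DEAL|NEW IN BOX|FACTORY SEALED|SALE|NEW)\b",
    re.IGNORECASE,
)
# Letters, an optional hyphen, then digits and any trailing alphanumerics.
MODEL_PATTERN = re.compile(r"\b[A-Za-z]+-?\d+[A-Za-z0-9]*\b")
COLORS = {"BLACK", "NAVY", "WHITE", "BLUE"}


def normalize_title(title: str) -> dict:
    # Remove promotional phrases wherever they appear in the title.
    text = PROMO_PATTERN.sub(" ", title)
    # Collapse model-number variants: "WH-1000XM5" and "wh1000xm5" -> "WH1000XM5".
    text = MODEL_PATTERN.sub(lambda m: m.group(0).replace("-", "").upper(), text)
    # Replace punctuation with spaces, uppercase, and split into tokens.
    tokens = re.sub(r"\W+", " ", text).upper().split()
    # Separate color attributes from the core title.
    colors = [t for t in tokens if t in COLORS]
    core = [t for t in tokens if t not in COLORS]
    return {"title": " ".join(core), "colors": colors}
```

Applied to the first two Sony titles from the table, both results contain the same `WH1000XM5` token and report `BLACK` as a separate attribute, so the remaining differences are descriptive wording rather than formatting.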

2. Block by category and brand

Before computing pairwise similarity, group listings into blocks. A Sony headphone should never be compared against an Anker headphone. A listing in “Electronics > Headphones” should never be compared against “Home > Kitchen Appliances.”

Blocking on brand alone typically reduces the comparison space by 95%+. Combined with category, it drops further. This is what makes deduplication feasible at marketplace scale.
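The effect of blocking on the number of comparisons can be sketched as follows. The record fields `brand` and `category` are assumed names for whatever the normalization step produces.

```python
from collections import defaultdict
from math import comb


def block_listings(listings):
    """Group listings by (brand, category); only listings in a group get compared."""
    blocks = defaultdict(list)
    for listing in listings:
        key = (listing["brand"].casefold(), listing["category"].casefold())
        blocks[key].append(listing)
    return blocks


def comparison_count(blocks):
    """Total pairwise comparisons when pairs are formed only within blocks."""
    return sum(comb(len(members), 2) for members in blocks.values())


listings = [
    {"brand": "Sony", "category": "Headphones", "title": "WH-1000XM5 listing 1"},
    {"brand": "SONY", "category": "Headphones", "title": "WH-1000XM5 listing 2"},
    {"brand": "Sony", "category": "headphones", "title": "WH-1000XM5 listing 3"},
    {"brand": "Anker", "category": "Headphones", "title": "Life Q35 listing 1"},
    {"brand": "Anker", "category": "Headphones", "title": "Life Q35 listing 2"},
]

all_pairs = comb(len(listings), 2)                      # every listing vs. every other
blocked_pairs = comparison_count(block_listings(listings))
```

For these five listings, all-pairs comparison needs 10 comparisons and blocking needs 4. The gap widens quickly with volume: 200,000 listings compared all-against-all is roughly 20 billion pairs, while a hypothetical even split into 2,000 blocks of 100 listings needs about 9.9 million.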

3. Embed and score

Within each block, generate AI embeddings for each listing’s title (and description, if available). Compute cosine similarity between all pairs in the block. Pairs above a threshold become candidate duplicates.

The embedding model understands that “Wireless Noise Canceling Headphones” and “Bluetooth ANC Headphone” are semantically equivalent. It captures the product identity regardless of how the seller chose to describe it.
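The scoring loop itself is small. The sketch below shows pairwise cosine scoring within one block; `toy_embed` is a placeholder built from character-trigram counts so the example runs without dependencies, and it has none of the semantic understanding described above. In a real pipeline you would pass your embedding model's function as `embed`, and the threshold value is an assumption to be tuned on reviewed samples.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def toy_embed(text):
    """Placeholder embedding: character-trigram counts. Not a semantic model."""
    t = text.casefold()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(titles, embed=toy_embed, threshold=0.5):
    """Return (i, j, score) for every pair in the block scoring at or above threshold."""
    vectors = [embed(t) for t in titles]
    pairs = []
    for i, j in combinations(range(len(titles)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

Because the loop only runs inside a block, its quadratic cost applies to tens or hundreds of listings at a time rather than to the whole dataset.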

4. Cluster duplicates

Candidate pairs form a graph where nodes are listings and edges are similarity scores. Connected components in this graph represent clusters of duplicate listings for the same product.

A cluster might contain 8 Amazon listings, 3 eBay listings, and 2 Walmart listings — all for the Sony WH-1000XM5 in black. That cluster collapses to one canonical record.
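Connected components can be computed with a small union-find structure. This is one common way to do it, shown as a sketch; listings are referred to by integer index, and `pairs` is assumed to be the list of candidate duplicates that passed the threshold.

```python
def cluster_listings(n, pairs):
    """Group n listings into connected components given candidate duplicate pairs."""
    parent = list(range(n))

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


# Listings 0-2 are linked in a chain; listings 3-4 are linked directly.
clusters = cluster_listings(5, [(0, 1), (1, 2), (3, 4)])
```

Here `clusters` is `[[0, 1, 2], [3, 4]]`. Note that listings 0 and 2 end up together even though they were never directly paired. That chaining is inherent to connected components, so a loose threshold can merge distinct products through a single bridging listing, which is one reason to review a sample of clusters.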

5. Pick canonical records

For each cluster, select or construct a canonical record:

Product name: Use the most complete, least promotional title in the cluster
Price: Report min, max, mean, and median across the cluster
Seller count: Count of unique sellers across all listings in the cluster
Platform presence: Which marketplaces carry this product
Condition breakdown: How many listings are new vs renewed vs open box

The output is one row per unique product with aggregated marketplace intelligence attached.
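A sketch of that aggregation for one cluster, using sample rows modeled on the Sony listings from the table earlier. The field names and the title heuristic (fewest promotional words, then longest title) are assumptions chosen for illustration, not a prescribed rule.

```python
from collections import Counter
from statistics import mean, median

PROMO_WORDS = {"new", "sale", "sealed", "deal", "free"}  # illustrative only


def promo_score(title):
    """Count promotional words in a title."""
    return sum(word in PROMO_WORDS for word in title.casefold().split())


def canonical_record(cluster):
    prices = [listing["price"] for listing in cluster]
    # Prefer the least promotional title; break ties with the longest one.
    best = min(cluster, key=lambda l: (promo_score(l["title"]), -len(l["title"])))
    return {
        "name": best["title"],
        "price_min": min(prices),
        "price_max": max(prices),
        "price_mean": mean(prices),
        "price_median": median(prices),
        "seller_count": len({l["seller"] for l in cluster}),
        "platforms": sorted({l["platform"] for l in cluster}),
        "conditions": dict(Counter(l["condition"] for l in cluster)),
    }


sony_cluster = [
    {"title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones Black",
     "seller": "TechDirect", "price": 298.0, "platform": "amazon", "condition": "new"},
    {"title": "Sony WH1000XM5 Headphones Bluetooth NC Over Ear BLACK - NEW",
     "seller": "AudioDeals247", "price": 279.0, "platform": "amazon", "condition": "new"},
    {"title": "SONY WH-1000XM5/B Wireless ANC Headphones - Factory Sealed",
     "seller": "BestPriceElectronics", "price": 289.0, "platform": "amazon", "condition": "new"},
]

record = canonical_record(sony_cluster)
```

For this cluster the record reports a price range of $279 to $298 with a median of $289, three unique sellers, and the TechDirect title as the product name because it contains no promotional words.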

What clean data enables

Accurate market sizing

“How many unique wireless headphones are sold across the top 3 marketplaces?” becomes answerable. Instead of reporting 12,000 listings, you report 4,800 unique products with an average of 2.5 listings each. That’s a fundamentally different market size estimate.

Real price analytics

Price analysis on raw listings is misleading. The “average price” of a product is inflated by outlier sellers and skewed by listing volume. When 8 of the 11 listings for a product come from one marketplace, that marketplace’s pricing dominates the average even if it’s not representative.

Deduplicated data lets you compute price per unique product, price range across sellers, and cross-platform price differences — metrics that actually inform purchasing and pricing decisions.
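The listing-volume skew is simple arithmetic. The sketch below uses made-up prices for a single product with 8 listings on one marketplace and 3 on another, and compares the raw listing average with a per-platform average. Averaging platform means is just one possible correction, shown here for illustration.

```python
from statistics import mean

# Hypothetical prices for one product: 8 listings on platform A, 3 on platform B.
prices_by_platform = {
    "platform_a": [300.0] * 8,
    "platform_b": [270.0] * 3,
}

# Raw listing average: every listing counts equally, so platform A dominates.
all_prices = [p for prices in prices_by_platform.values() for p in prices]
listing_weighted = mean(all_prices)

# Platform-balanced average: each marketplace contributes one mean price.
platform_balanced = mean(mean(prices) for prices in prices_by_platform.values())
```

The raw average comes out near $291.82 while the platform-balanced figure is $285.00. Neither number is available until the 11 listings are known to be the same product.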

Catalog management

Retailers building or curating their own product catalog need to know what exists in the market. Deduplicated marketplace data serves as a reference catalog: here are the 4,800 unique products in this category, here’s who sells each one, here’s the price range, here’s which ones you carry and which you don’t.

Procurement intelligence

Bulk purchasers and resellers use marketplace data to identify sourcing opportunities. Deduplicated data reveals which products have the most sellers (indicating commodity status and price competition) vs which have few sellers (potential supply constraints or niche opportunities).

Running this at scale

A weekly marketplace scrape across 3 platforms in 5 categories might produce 200,000+ raw listings. The deduplication pipeline needs to handle this volume efficiently.

The key is blocking. With brand + category blocking, the 200,000 listings break into thousands of small blocks, each containing 10–200 listings. Embedding and similarity computation within these blocks is fast. Total wall-clock time is dominated by embedding generation, which parallelizes well because each listing is embedded independently of the others.

For ongoing monitoring, incremental deduplication is more efficient than re-processing the full dataset each week. Match new listings against the existing canonical catalog. Only listings that don’t match any existing canonical record need full cluster analysis.
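The incremental routing step might look like the sketch below. The `token_overlap` function is a dependency-free placeholder for the embedding similarity used elsewhere in the pipeline, and the catalog layout (product ID mapped to canonical title) and threshold are assumptions.

```python
def token_overlap(a, b):
    """Placeholder similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.casefold().split()), set(b.casefold().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def route_new_listings(new_titles, catalog, similarity=token_overlap, threshold=0.6):
    """Attach new listings to existing canonical products where possible.

    Returns (matched, unmatched): matched maps title -> product ID, and
    unmatched lists the titles that still need full cluster analysis.
    """
    matched, unmatched = {}, []
    for title in new_titles:
        best_id, best_score = None, 0.0
        for product_id, canonical_title in catalog.items():
            score = similarity(title, canonical_title)
            if score > best_score:
                best_id, best_score = product_id, score
        if best_id is not None and best_score >= threshold:
            matched[title] = best_id
        else:
            unmatched.append(title)
    return matched, unmatched


catalog = {"p1": "sony wh1000xm5 headphones"}
new_titles = ["Sony WH1000XM5 headphones black", "anker soundcore q35"]
matched, unmatched = route_new_listings(new_titles, catalog)
```

Here the Sony listing attaches to `p1` and the Anker listing falls through to full clustering. In practice the inner loop would only scan catalog entries in the new listing's own brand and category block rather than the whole catalog.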

Getting started

1. Export your scraped marketplace data as CSV: one file per platform, or one combined file with a source column
2. Upload to Match Data Studio
3. Configure the matching: “Deduplicate product listings by semantic similarity of title, blocking on brand and category”
4. Review a sample of the clusters to verify accuracy
5. Export the deduplicated canonical catalog with aggregated pricing

The difference between raw scraped data and deduplicated marketplace intelligence is the difference between noise and signal. The matching step is what separates the two.

Ready to clean your marketplace data? Start deduplicating with Match Data Studio →
