From web scraping to price intelligence: matching products across competing sites

You scraped five competitor sites last night. You have 200,000 product records across five CSVs. You want to answer a simple question: where are we priced higher than our competitors?

The problem is that none of these records agree on what to call the same product.

The naming problem

Every retailer and marketplace has its own conventions for product titles. Sellers optimize titles for search, not for data consistency. Abbreviations vary. Attribute ordering varies. Promotional text gets injected into titles.

The same physical product might appear like this across three sources:

The same product across three scraped sources

Source	Product title	Price	SKU / ID
RetailerA.com	Apple iPhone 15 Pro Max 256GB Natural Titanium	$1,199	APL-IP15PM-256-NT
MarketplaceB	iPhone 15 Pro Max, 256 GB, Nat. Titanium — Unlocked	$1,149	B0CJ4R7MQ2
DiscountC.com	APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED	$1,089	DC-98234
RetailerA.com	Samsung Galaxy S24 Ultra 512GB Titanium Gray	$1,299	SAM-GS24U-512-TG
MarketplaceB	Galaxy S24 Ultra 512 GB Titanium Grey SM-S928B	$1,279	B0CTDJ4KVP
DiscountC.com	SAMSUNG S24 ULTRA 512G GRAY TITAN UNLOCKED	$1,199	DC-44521

Each source uses different naming conventions, abbreviations, and attribute formatting.

No shared SKU. No shared product ID. The names are recognizably similar to a human reader but different enough that exact matching returns zero results.

Why exact and basic fuzzy matching fail

Exact matching on product title matches nothing. Not a single pair across any two sources will join, because no two retailers use the same string.

SKU matching fails because each platform generates its own internal IDs. Universal identifiers such as UPCs exist, but nothing obliges retailers to expose them in their page HTML, so scraped records often lack one.

Basic fuzzy matching (Levenshtein distance, n-gram overlap) gets you partway there. “iPhone 15 Pro Max” and “iPHONE 15PM” share some character sequences, but the distance is large enough that you’ll need a very permissive threshold — which then creates false positives between genuinely different products. A phone case named “iPhone 15 Pro Max Premium Case” would score highly against the phone itself.
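To make the false-positive problem concrete, here is a minimal sketch of character-trigram Jaccard similarity in plain Python. The function names are illustrative and not taken from any particular library, and production fuzzy matchers differ in tokenization and weighting.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of lowercase character trigrams in text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings' trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


phone = "iPhone 15 Pro Max"
abbreviated = "iPHONE 15PM"                    # same product, abbreviated
phone_case = "iPhone 15 Pro Max Premium Case"  # different product

same_product_score = trigram_jaccard(phone, abbreviated)
accessory_score = trigram_jaccard(phone, phone_case)
print(f"abbreviated listing: {same_product_score:.2f}")
print(f"phone case:          {accessory_score:.2f}")
```

With these strings the phone case scores higher than the abbreviated listing, so any threshold loose enough to accept the real match also accepts the accessory.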

The fundamental issue is that these are string-level algorithms applied to a semantic-level problem. The product titles don’t need to look similar — they need to mean the same thing.

Match rate by method: 10,000 product pairs across 3 sources

Method	Match rate	Note
Exact title match	(not shown)	Nearly useless
SKU / UPC join	22%	Only where UPCs exist
Fuzzy string (trigram)	48%	High false-positive rate
AI embeddings + LLM	89%	Semantic understanding

Illustrative figures. Actual rates depend on category complexity and source formatting.

How AI embeddings solve this

AI embeddings convert the full product description into a high-dimensional vector that captures the meaning of the text, not just the characters. When the embedding model processes “Apple iPhone 15 Pro Max 256GB Natural Titanium” and “APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED”, both vectors land in nearly the same region of embedding space — because both descriptions refer to the same product.

This is fundamentally different from string similarity. A model trained on large amounts of product text has typically learned that “15PM” abbreviates “15 Pro Max”, that “TITAN” and “Titanium” refer to the same finish, and that “256G” and “256GB” are equivalent. It processes the product description holistically rather than comparing character sequences.

For cases where embeddings alone produce borderline scores — perhaps a 256GB and 512GB variant of the same phone — an LLM confirmation step reasons over the full record: “Same model, same color, different storage capacity. These are distinct SKUs, not duplicates.”
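The two-stage idea can be sketched as cosine similarity plus a routing rule. This is a minimal illustration, not the product's actual pipeline: the thresholds are hypothetical, and the three-dimensional vectors are toy stand-ins for real embeddings, which would come from whichever embedding model you use.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical thresholds; calibrate them on a reviewed sample of your own data.
ACCEPT_AT = 0.92
REJECT_BELOW = 0.80


def route(score: float) -> str:
    """Decide what to do with a candidate pair given its similarity score."""
    if score >= ACCEPT_AT:
        return "match"
    if score < REJECT_BELOW:
        return "no_match"
    return "llm_review"  # borderline, e.g. same model but different storage


# Toy vectors standing in for the embeddings of two product titles.
title_a = [0.9, 0.1, 0.4]
title_b = [0.8, 0.2, 0.5]
decision = route(cosine_similarity(title_a, title_b))
```

Only the pairs routed to "llm_review" incur the slower, more expensive confirmation step, which is the point of keeping a middle band rather than a single cutoff.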

The practical workflow

1. Scrape and normalize

Run your scrapers against each competitor site. Export each source as a CSV with at minimum: product title, price, and category. Additional fields like brand, model number, or specifications improve matching accuracy.

2. Light preprocessing

Strip obvious noise: promotional prefixes (“SALE!”, “NEW ARRIVAL”), seller-specific tags, and HTML artifacts. Standardize price to a common currency. This doesn’t need to be perfect — the AI handles residual noise — but removing the most egregious junk improves throughput.
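A light cleanup pass might look like the sketch below. The promo phrase list and the storage-size rule are assumptions chosen for an electronics catalog; adapt both to whatever noise your own scrapes contain.

```python
import html
import re

# Assumed promotional phrases; extend this list from what your scrapes contain.
PROMO_PATTERNS = [r"\bsale\b!?", r"\bnew arrival\b", r"\bnew sealed\b"]


def clean_title(raw: str) -> str:
    """Remove HTML debris and promo phrases, then tidy whitespace."""
    text = html.unescape(raw)                 # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover tags
    for pattern in PROMO_PATTERNS:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Normalize storage sizes: "256 GB" and "256G" both become "256GB".
    # Assumes an electronics catalog; a grocery "500 g" would be mangled.
    text = re.sub(r"\b(\d{2,4})\s*(?:gb|g)\b", r"\1GB", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def parse_price(raw: str) -> float:
    """Turn a string like "$1,199" into 1199.0 (single-currency input assumed)."""
    return float(re.sub(r"[^\d.]", "", raw))


cleaned = clean_title("SALE! APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED")
price = parse_price("$1,199")
```

Case is deliberately left untouched here, since the matching step is expected to tolerate it; currency conversion is also out of scope for this sketch.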

3. Upload and match

Upload any two source CSVs to Match Data Studio. The AI assistant will analyze your columns and suggest a matching configuration: which fields to embed, what string pre-filters to apply (brand blocking is often useful), and what similarity thresholds to use.

For a typical 10,000-product catalog comparison, the pipeline generates candidate pairs from brand-blocked combinations, runs embedding similarity, and confirms borderline matches with LLM reasoning.
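Brand blocking, in its simplest form, means only comparing records that share a brand. The sketch below assumes each record carries a `brand` field (the scraped titles alone do not guarantee one) and is not a description of Match Data Studio's internals.

```python
from collections import defaultdict


def blocked_pairs(source_a: list[dict], source_b: list[dict]):
    """Yield (a, b) record pairs whose normalized brand values are equal."""
    by_brand = defaultdict(list)
    for record in source_b:
        by_brand[record["brand"].strip().lower()].append(record)
    for record in source_a:
        for candidate in by_brand.get(record["brand"].strip().lower(), []):
            yield record, candidate


retailer_a = [
    {"brand": "Apple", "title": "Apple iPhone 15 Pro Max 256GB Natural Titanium"},
    {"brand": "Samsung", "title": "Samsung Galaxy S24 Ultra 512GB Titanium Gray"},
]
discount_c = [
    {"brand": "APPLE", "title": "APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED"},
    {"brand": "SAMSUNG", "title": "SAMSUNG S24 ULTRA 512G GRAY TITAN UNLOCKED"},
]

pairs = list(blocked_pairs(retailer_a, discount_c))
# Two candidate pairs instead of the four an unblocked cross join would produce.
```

The saving grows quickly with catalog size, because the number of pairs scales with the size of each brand group rather than with the product of the two full catalogs.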

4. Analyze price differences

The output is a CSV of matched product pairs with prices from both sources. From here, the analysis is straightforward:

Competitive pricing: Sort by price difference to find where you’re priced above or below each competitor.
MAP violation detection: Filter for products priced below the minimum advertised price from the matched manufacturer catalog.
Assortment gaps: Products in the competitor’s catalog with no match in yours represent potential assortment opportunities.
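The three analyses above can be sketched with the standard library alone. The column names are assumptions rather than a documented export schema, the `map_price` value is made up for illustration, and `DC-00001` is a hypothetical unmatched listing.

```python
# Matched-pair rows using RetailerA and DiscountC prices from the earlier table.
matched = [
    {"product": "iPhone 15 Pro Max 256GB", "our_price": 1199.0,
     "competitor_price": 1089.0, "map_price": 1099.0},  # MAP value is made up
    {"product": "Galaxy S24 Ultra 512GB", "our_price": 1299.0,
     "competitor_price": 1199.0, "map_price": None},    # no MAP on file
]


def price_gaps(rows):
    """Rows sorted by how far our price sits above the competitor's."""
    with_gap = [{**r, "gap": r["our_price"] - r["competitor_price"]} for r in rows]
    return sorted(with_gap, key=lambda r: r["gap"], reverse=True)


def map_violations(rows):
    """Rows where the competitor advertises below a known MAP."""
    return [r for r in rows
            if r["map_price"] is not None and r["competitor_price"] < r["map_price"]]


def assortment_gaps(competitor_ids, matched_competitor_ids):
    """Competitor product IDs that found no match in our catalog."""
    return sorted(set(competitor_ids) - set(matched_competitor_ids))


ranked = price_gaps(matched)
violations = map_violations(matched)
gaps = assortment_gaps(["DC-98234", "DC-44521", "DC-00001"],
                       ["DC-98234", "DC-44521"])
```

For real exports of tens of thousands of rows, a dataframe library would likely be more convenient; the logic stays the same.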

Business applications

Price monitoring at scale

Matching scraped competitor data against your own catalog creates a living competitive intelligence feed. Rather than manually checking competitor prices on key products, you have a systematic view across your entire catalog.

MAP enforcement

Brands scrape reseller sites to detect minimum advertised price violations. The challenge is matching their internal product catalog against the thousands of listing variations across resellers. AI matching handles the name variation problem that makes exact-match MAP monitoring unreliable.

Assortment gap analysis

The unmatched records are as valuable as the matched ones. Products that appear in competitor catalogs but not in yours represent potential additions. Products in your catalog with no competitor equivalent may be differentiation opportunities — or deadweight inventory.

Market entry research

Before entering a new product category, scrape the major players and match their catalogs against each other. The result shows you the real competitive landscape: how many unique products exist (vs duplicated listings), what the actual price range looks like per product, and where the pricing clusters are.

What makes this hard to do manually

A competitive intelligence analyst can match products across two sources manually. It takes about 5–10 seconds per pair for unambiguous matches and 30+ seconds for difficult ones. At 10,000 products across 5 sources (10 possible source pairs), you’re looking at roughly 100,000 pairwise comparisons: hundreds of hours of analyst time, repeated every time you re-scrape.
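One way to arrive at that figure, assuming every product must be checked once for each pairing of two sources and that every match is an easy one:

```python
from math import comb

products = 10_000
sources = 5
source_pairs = comb(sources, 2)          # 10 ways to pick two of five sources
comparisons = products * source_pairs    # 100,000 product-level match decisions

# 5-10 seconds each for unambiguous matches, per the estimate above.
hours_low = comparisons * 5 / 3600
hours_high = comparisons * 10 / 3600
print(f"{comparisons:,} comparisons, {hours_low:.0f} to {hours_high:.0f} hours")
# prints "100,000 comparisons, 139 to 278 hours"
```

Difficult pairs at 30+ seconds each would push the total higher still.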

The value of automated matching isn’t just speed. It’s consistency and repeatability. Run the same matching job weekly. Track how competitor pricing evolves over time. Detect new products appearing in competitor catalogs within days, not months.

Getting started

The workflow is simpler than it sounds:

Export your scraped competitor data as CSV files (one per source)
Upload any two to Match Data Studio
Describe the matching logic: “Match products by semantic similarity of title + brand + key specifications”
Review a sample of matches to calibrate thresholds
Run the full dataset and export the matched pairs with pricing

The output gives you a clean, deduplicated view of the competitive landscape with price comparisons you can actually trust.

Ready to turn scraped product data into competitive intelligence? Start matching with Match Data Studio →
