From web scraping to price intelligence: matching products across competing sites
Scraped product data from competitor sites uses different naming conventions, SKU systems, and category structures. AI-powered matching connects equivalent products across sources to build real-time competitive pricing intelligence.
You scraped five competitor sites last night. You have 200,000 product records across five CSVs. You want to answer a simple question: where are we priced higher than our competitors?
The problem is that none of these records agree on what to call the same product.
The naming problem
Every retailer and marketplace has its own conventions for product titles. Sellers optimize titles for search, not for data consistency. Abbreviations vary. Attribute ordering varies. Promotional text gets injected into titles.
The same physical product might appear like this across three sources:
| Source | Product title | Price | SKU / ID |
|---|---|---|---|
| RetailerA.com | Apple iPhone 15 Pro Max 256GB Natural Titanium | $1,199 | APL-IP15PM-256-NT |
| MarketplaceB | iPhone 15 Pro Max, 256 GB, Nat. Titanium — Unlocked | $1,149 | B0CJ4R7MQ2 |
| DiscountC.com | APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED | $1,089 | DC-98234 |
| RetailerA.com | Samsung Galaxy S24 Ultra 512GB Titanium Gray | $1,299 | SAM-GS24U-512-TG |
| MarketplaceB | Galaxy S24 Ultra 512 GB Titanium Grey SM-S928B | $1,279 | B0CTDJ4KVP |
| DiscountC.com | SAMSUNG S24 ULTRA 512G GRAY TITAN UNLOCKED | $1,199 | DC-44521 |
Each source uses different naming conventions, abbreviations, and attribute formatting.
No shared SKU. No shared product ID. The names are recognizably similar to a human reader but different enough that exact matching returns zero results.
Why exact and basic fuzzy matching fail
Exact matching on product title matches nothing. Not a single pair across any two sources will join, because no two retailers use the same string.
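To see how completely exact matching fails, here is a minimal sketch that intersects the Samsung titles from the table above. The function and variable names are illustrative, and the case-folding step is an assumption about what a "forgiving" exact join might look like.

```python
from itertools import combinations

# Samsung titles copied from the comparison table, keyed by source.
titles_by_source = {
    "RetailerA.com": {"Samsung Galaxy S24 Ultra 512GB Titanium Gray"},
    "MarketplaceB": {"Galaxy S24 Ultra 512 GB Titanium Grey SM-S928B"},
    "DiscountC.com": {"SAMSUNG S24 ULTRA 512G GRAY TITAN UNLOCKED"},
}


def exact_matches(titles_a, titles_b, normalize=False):
    """Return titles present in both sets, optionally ignoring case and outer whitespace."""
    if normalize:
        titles_a = {t.casefold().strip() for t in titles_a}
        titles_b = {t.casefold().strip() for t in titles_b}
    return titles_a & titles_b


# Compare every pair of sources, even with the forgiving normalization turned on.
exact_overlap = {
    (s1, s2): exact_matches(titles_by_source[s1], titles_by_source[s2], normalize=True)
    for s1, s2 in combinations(titles_by_source, 2)
}

for pair, shared in exact_overlap.items():
    print(pair, len(shared))
```

Every source pair comes back with an empty set, even though all three titles describe the same phone.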
SKU matching fails because each platform generates its own internal IDs. Manufacturers do assign universal identifiers such as UPC or GTIN barcodes, but retailers don’t consistently expose them in the page HTML you scrape, so you can’t rely on them as a join key.
Basic fuzzy matching (Levenshtein distance, n-gram overlap) gets you partway there. “iPhone 15 Pro Max” and “iPHONE 15PM” share some character sequences, but the distance is large enough that you’ll need a very permissive threshold — which then creates false positives between genuinely different products. A phone case named “iPhone 15 Pro Max Premium Case” would score highly against the phone itself.
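The phone case problem is easy to reproduce. The sketch below uses word-level token overlap (Jaccard similarity) as a simple stand-in for string-level scoring. It is not Levenshtein distance, and the function names are illustrative, but it fails in the same way: the score depends on shared surface tokens, not on meaning.

```python
import re


def tokens(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(title_a, title_b):
    """Share of distinct tokens that two titles have in common (0.0 to 1.0)."""
    a, b = tokens(title_a), tokens(title_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


phone = "Apple iPhone 15 Pro Max 256GB Natural Titanium"
same_phone_abbreviated = "APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED"
phone_case = "iPhone 15 Pro Max Premium Case"

print("true match:", round(jaccard(phone, same_phone_abbreviated), 2))
print("phone case:", round(jaccard(phone, phone_case), 2))
```

The accessory outscores the genuine match, so any threshold loose enough to accept the abbreviated listing also accepts the case.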
The fundamental issue is that these are string-level algorithms applied to a semantic-level problem. The product titles don’t need to look similar — they need to mean the same thing.
How AI embeddings solve this
AI embeddings convert the full product description into a high-dimensional vector that captures the meaning of the text, not just the characters. When the embedding model processes “Apple iPhone 15 Pro Max 256GB Natural Titanium” and “APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED”, both vectors land in nearly the same region of embedding space — because both descriptions refer to the same product.
This is fundamentally different from string similarity. The embedding model has learned that “15PM” is an abbreviation for “15 Pro Max”, that “TITAN” and “Titanium” refer to the same finish, that “256G” and “256GB” are equivalent. It processes the product description holistically rather than comparing character sequences.
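Embedding vectors are typically compared with cosine similarity. The sketch below shows only that comparison step. The three-dimensional vectors are made up for illustration and do not come from any real model; a real embedding has hundreds or thousands of dimensions and would come from whichever embedding model you use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for real embeddings.
toy_vectors = {
    "retailer_a_phone": [0.90, 0.10, 0.05],
    "discount_c_phone": [0.88, 0.12, 0.07],
    "phone_case": [0.30, 0.90, 0.10],
}

same_product = cosine_similarity(
    toy_vectors["retailer_a_phone"], toy_vectors["discount_c_phone"]
)
accessory = cosine_similarity(
    toy_vectors["retailer_a_phone"], toy_vectors["phone_case"]
)
print(round(same_product, 3), round(accessory, 3))
```

Vectors that point in nearly the same direction score close to 1.0 regardless of how different the original strings looked.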
For cases where embeddings alone produce borderline scores — perhaps a 256GB and 512GB variant of the same phone — an LLM confirmation step reasons over the full record: “Same model, same color, different storage capacity. These are distinct SKUs, not duplicates.”
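One way to picture the hand-off between the two stages is a three-way routing rule on the similarity score. This is a sketch under assumptions: the threshold values and label names below are hypothetical, not Match Data Studio defaults, and in practice you would calibrate them on a reviewed sample.

```python
def route_candidate(score, reject_below=0.75, accept_above=0.92):
    """Decide what to do with a candidate pair based on its similarity score.

    Both thresholds are hypothetical placeholders for illustration.
    """
    if score >= accept_above:
        return "match"
    if score < reject_below:
        return "no_match"
    # Borderline: e.g. same model and color but possibly different storage.
    return "needs_llm_review"


for score in (0.97, 0.85, 0.40):
    print(score, route_candidate(score))
```

Only the middle band goes to the slower LLM step, which keeps the expensive reasoning focused on pairs where it can change the outcome.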
The practical workflow
1. Scrape and normalize
Run your scrapers against each competitor site. Export each source as a CSV with at minimum: product title, price, and category. Additional fields like brand, model number, or specifications improve matching accuracy.
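A small check at load time catches exports that are missing a required field before they reach the matching step. In this sketch the column names (`title`, `price`, `category`, `brand`) are assumptions about how your scraper labels its output.

```python
import csv
import io

# Assumed column names; adjust to whatever your scraper actually emits.
REQUIRED_COLUMNS = {"title", "price", "category"}


def load_products(csv_file):
    """Read product rows from an open CSV file and verify the required columns exist."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


# In-memory stand-in for one exported source file.
sample = io.StringIO(
    "title,price,category,brand\n"
    "Samsung Galaxy S24 Ultra 512GB Titanium Gray,1299,Phones,Samsung\n"
)
products = load_products(sample)
print(products[0]["title"], products[0]["brand"])
```

With a real export you would pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.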
2. Light preprocessing
Strip obvious noise: promotional prefixes (“SALE!”, “NEW ARRIVAL”), seller-specific tags, and HTML artifacts. Standardize price to a common currency. This doesn’t need to be perfect — the AI handles residual noise — but removing the most egregious junk improves throughput.
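A minimal preprocessing sketch might look like the following. The promotional prefixes are the two named above; the regular expressions and function names are illustrative. Currency conversion is left out, and the price parser assumes a US-style format with a dot as the decimal separator.

```python
import html
import re
from decimal import Decimal

# Leading promotional text such as "SALE!" or "NEW ARRIVAL:".
PROMO_PREFIX = re.compile(r"^\s*(?:SALE|NEW ARRIVAL)\b!?[\s:\-]*", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")


def clean_title(raw):
    """Decode HTML entities, drop tags and promo prefixes, collapse whitespace."""
    text = html.unescape(raw)
    text = HTML_TAG.sub(" ", text)
    text = PROMO_PREFIX.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_price(raw):
    """Turn a string like '$1,199' into a Decimal, ignoring symbols and separators."""
    return Decimal(re.sub(r"[^\d.]", "", raw))


print(clean_title("SALE! <b>Apple iPhone 15 Pro Max</b> 256GB Natural Titanium"))
print(parse_price("$1,199"))
```

The word boundary in the prefix pattern keeps it from eating the start of legitimate titles that merely begin with the same letters.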
3. Upload and match
Upload any two source CSVs to Match Data Studio. The AI assistant will analyze your columns and suggest a matching configuration: which fields to embed, what string pre-filters to apply (brand blocking is often useful), and what similarity thresholds to use.
For a typical 10,000-product catalog comparison, the pipeline generates candidate pairs from brand-blocked combinations, runs embedding similarity, and confirms borderline matches with LLM reasoning.
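Brand blocking is what keeps the candidate set manageable: instead of comparing every record in one source against every record in the other, you only pair records that share a brand. The sketch below illustrates the idea in plain Python and is not Match Data Studio's implementation. It assumes each record carries a `brand` field, which is itself an assumption, since scraped titles often omit it.

```python
from collections import defaultdict


def brand_blocked_pairs(source_a, source_b):
    """Yield (record_a, record_b) pairs whose brands match, ignoring case."""
    by_brand = defaultdict(list)
    for rec in source_b:
        by_brand[rec["brand"].casefold()].append(rec)
    for rec_a in source_a:
        for rec_b in by_brand.get(rec_a["brand"].casefold(), []):
            yield rec_a, rec_b


retailer_a = [
    {"brand": "Apple", "title": "Apple iPhone 15 Pro Max 256GB Natural Titanium"},
    {"brand": "Samsung", "title": "Samsung Galaxy S24 Ultra 512GB Titanium Gray"},
]
discount_c = [
    {"brand": "APPLE", "title": "APPLE iPHONE 15PM 256G TITAN (Natural) NEW SEALED"},
    {"brand": "SAMSUNG", "title": "SAMSUNG S24 ULTRA 512G GRAY TITAN UNLOCKED"},
]

candidates = list(brand_blocked_pairs(retailer_a, discount_c))
print(len(candidates), "candidate pairs instead of", len(retailer_a) * len(discount_c))
```

On two records per source the saving is trivial, but the full cross product grows with the product of the two catalog sizes, while blocked pairs grow only within each brand.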
4. Analyze price differences
The output is a CSV of matched product pairs with prices from both sources. From here, the analysis is straightforward:
- Competitive pricing: Sort by price difference to find where you’re priced above or below each competitor.
- MAP violation detection: Filter for products priced below the minimum advertised price from the matched manufacturer catalog.
- Assortment gaps: Products in the competitor’s catalog with no match in yours represent potential assortment opportunities.
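The three analyses above can be sketched in a few lines of plain Python. The field names and product labels are illustrative, the prices come from the table earlier in the article, and treating RetailerA.com as "your" catalog is an assumption made purely for the example. Any MAP values you pass in would come from the manufacturer's catalog.

```python
from decimal import Decimal

# Matched pairs: RetailerA.com prices as "ours", DiscountC.com as the competitor.
matched = [
    {"product": "iPhone 15 Pro Max 256GB", "competitor_id": "DC-98234",
     "our_price": Decimal("1199"), "competitor_price": Decimal("1089")},
    {"product": "Galaxy S24 Ultra 512GB", "competitor_id": "DC-44521",
     "our_price": Decimal("1299"), "competitor_price": Decimal("1199")},
]


def price_gaps(pairs):
    """Add a 'gap' field (ours minus theirs) and sort with the largest gap first."""
    rows = [dict(p, gap=p["our_price"] - p["competitor_price"]) for p in pairs]
    return sorted(rows, key=lambda r: r["gap"], reverse=True)


def map_violations(pairs, map_prices):
    """Return pairs where the competitor advertises below the product's MAP."""
    return [
        p for p in pairs
        if p["product"] in map_prices and p["competitor_price"] < map_prices[p["product"]]
    ]


def assortment_gaps(competitor_ids, matched_competitor_ids):
    """Return competitor IDs that found no match in your catalog."""
    return sorted(set(competitor_ids) - set(matched_competitor_ids))


for row in price_gaps(matched):
    print(row["product"], row["gap"])
```

A positive gap means you are priced above the competitor on that product; sorting by it puts the largest exposures at the top.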
Business applications
Price monitoring at scale
Matching scraped competitor data against your own catalog creates a living competitive intelligence feed. Rather than manually checking competitor prices on key products, you have a systematic view across your entire catalog.
MAP enforcement
Brands scrape reseller sites to detect minimum advertised price violations. The challenge is matching their internal product catalog against the thousands of listing variations across resellers. AI matching handles the name variation problem that makes exact-match MAP monitoring unreliable.
Assortment gap analysis
The unmatched records are as valuable as the matched ones. Products that appear in competitor catalogs but not in yours represent potential additions. Products in your catalog with no competitor equivalent may be differentiation opportunities — or deadweight inventory.
Market entry research
Before entering a new product category, scrape the major players and match their catalogs against each other. The result shows you the real competitive landscape: how many unique products exist (vs duplicated listings), what the actual price range looks like per product, and where the pricing clusters are.
What makes this hard to do manually
A competitive intelligence analyst can match products across two sources manually. It takes about 5–10 seconds per pair for unambiguous matches and 30+ seconds for difficult ones. With 10,000 products across 5 sources, there are 10 source-to-source combinations, so you’re looking at roughly 100,000 match decisions: hundreds of hours of analyst time, repeated every time you re-scrape.
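The arithmetic behind that estimate is worth spelling out. This back-of-the-envelope sketch assumes one match decision per product for each pair of sources, and uses only the 5–10 second figure for easy matches, so it is a lower bound.

```python
from math import comb

sources = 5
products_per_source = 10_000

# Every unordered pair of sources has to be reconciled once.
source_pairs = comb(sources, 2)

# One match decision per product for each source pair.
match_decisions = source_pairs * products_per_source

# Analyst time at 5 and 10 seconds per decision, converted to hours.
hours_low = match_decisions * 5 / 3600
hours_high = match_decisions * 10 / 3600

print(source_pairs, match_decisions, round(hours_low), round(hours_high))
```

That works out to roughly 139 to 278 hours per scrape cycle before counting any of the 30-second difficult cases.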
The value of automated matching isn’t just speed. It’s consistency and repeatability. Run the same matching job weekly. Track how competitor pricing evolves over time. Detect new products appearing in competitor catalogs within days, not months.
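Once matching is repeatable, tracking change over time is a matter of diffing two runs. The sketch below assumes each run has been reduced to a mapping from competitor product ID to price; that snapshot format and the function name are illustrative.

```python
def diff_snapshots(previous, current):
    """Compare two {product_id: price} snapshots from consecutive matching runs.

    Returns (new_product_ids, price_changes), where price_changes maps
    product_id to an (old_price, new_price) tuple.
    """
    new_products = sorted(current.keys() - previous.keys())
    price_changes = {
        pid: (previous[pid], current[pid])
        for pid in current.keys() & previous.keys()
        if current[pid] != previous[pid]
    }
    return new_products, price_changes
```

Calling `diff_snapshots(last_week, this_week)` returns the IDs that appear only in the newer run, plus the old and new price for every ID whose price moved.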
Getting started
The workflow is simpler than it sounds:
- Export your scraped competitor data as CSV files (one per source)
- Upload any two to Match Data Studio
- Describe the matching logic: “Match products by semantic similarity of title + brand + key specifications”
- Review a sample of matches to calibrate thresholds
- Run the full dataset and export the matched pairs with pricing
The output gives you a clean, deduplicated view of the competitive landscape with price comparisons you can actually trust.
Ready to turn scraped product data into competitive intelligence? Start matching with Match Data Studio →