Why SQL and pandas can't accurately match retail products — and what can
SQL JOINs and pandas merges fail on color variants, promotional naming, translated descriptions, and spec formatting differences. AI embeddings and LLMs understand that 'Midnight' means black and 'Violet' means purple. Here's why traditional tools hit a ceiling and how hybrid pipelines break through it.
A retail operations team exports two product catalogs as CSVs. One from their internal inventory system, one from a supplier. They open a Jupyter notebook, load both into pandas DataFrames, and write a merge:
```python
merged = pd.merge(inventory, supplier, on='product_name', how='inner')
```
Zero matches. Not a single row. The inventory calls it “Samsung Galaxy S24 Ultra 512GB Titanium Gray.” The supplier calls it “Galaxy S24 Ultra 512 GB Gris Titanio.” Same phone. Different strings.
So the team tries fuzzy matching. They install fuzzywuzzy, compute token set ratios, set a threshold at 80. Now they get matches — but also false positives, because “Galaxy S24 Ultra 512GB” scores 83 against “Galaxy S24 Ultra 256GB.” Those aren’t the same product. One has twice the storage.
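One patch for this particular failure is a spec-aware guard: extract the storage capacity from each name and veto any fuzzy match where the capacities disagree. Below is a minimal sketch of that guard; the regex assumes capacities are written like "512GB", "512 GB", or "1TB", which covers common listings but not every format.

```python
import re

def storage_gb(name: str):
    """Extract a storage capacity in GB from a product name, or None.

    Assumes capacities appear as '512GB', '512 GB', or '1TB'.
    Uses a binary factor (1 TB = 1024 GB), the convention retail
    listings follow.
    """
    m = re.search(r'(\d+(?:\.\d+)?)\s*(GB|TB)', name, re.IGNORECASE)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2).upper()
    return value * 1024 if unit == 'TB' else value

def specs_conflict(a: str, b: str) -> bool:
    """True when both names state a storage capacity and they disagree."""
    cap_a, cap_b = storage_gb(a), storage_gb(b)
    return cap_a is not None and cap_b is not None and cap_a != cap_b

# The fuzzy score alone says "match"; the spec guard vetoes it
specs_conflict("Galaxy S24 Ultra 512GB", "Galaxy S24 Ultra 256GB")  # True
```

Guards like this stop the 512GB-vs-256GB false positive, but they only work for specs you thought to parse. Every new attribute needs its own extractor, which is exactly the maintenance burden the rest of this article is about.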
This scenario plays out daily in retail. The tools aren’t broken — they’re being asked to solve a problem they weren’t designed for.
The five problems SQL and pandas can’t solve
1. Color variant naming
This is the problem that makes retail matching uniquely difficult. The same physical color gets described with completely different words depending on the brand, the retailer, the market, and sometimes the mood of whoever wrote the product listing.
| Actual color | Brand A | Brand B | Brand C | Brand D |
|---|---|---|---|---|
| Black | Midnight | Phantom Black | Obsidian | Noir |
| Dark blue | Midnight Blue | Navy | Abyss Blue | Bleu Nuit |
| White | Starlight | Cream | Frost White | Blanc Glacier |
| Purple | Violet | Lavender | Deep Purple | Ultra Violet |
| Gray | Silver | Titanium Gray | Graphite | Gris Acier |
| Gold | Sunrise Gold | Champagne | Amber | Doré |
| Green | Alpine Green | Sage | Forest | Vert Sapin |
| Pink | Rose | Blush Pink | Flamingo | Rosa |
Every row is one color. No two brands use the same name. SQL equality, LIKE patterns, and Levenshtein distance all fail here.
Consider a straightforward matching task: your inventory lists a phone as “Titanium Gray” and the supplier lists the same phone as “Silver.” Try writing a SQL query that knows those are the same color:
```sql
-- This returns nothing
SELECT * FROM inventory i
JOIN supplier s ON i.color = s.color
WHERE i.model = 'Galaxy S24 Ultra';

-- This also returns nothing
SELECT * FROM inventory i
JOIN supplier s ON i.color LIKE '%' || s.color || '%'
WHERE i.model = 'Galaxy S24 Ultra';
```
You could build a lookup table mapping every color variant to a canonical name. But the table would need hundreds of entries — and every new brand, every new season, every new product launch introduces terms you haven’t seen before. Apple invented “Starlight” in 2021. Samsung introduced “Phantom Black” in 2021. Google launched “Hazel” in 2023. The vocabulary is unbounded.
```python
# The lookup table approach — always incomplete
color_map = {
    'midnight': 'black', 'phantom black': 'black', 'obsidian': 'black',
    'noir': 'black', 'jet black': 'black', 'onyx': 'black',
    'silver': 'gray', 'titanium gray': 'gray', 'graphite': 'gray',
    'violet': 'purple', 'lavender': 'purple', 'deep purple': 'purple',
    # ... hundreds more, and you're still missing next season's names
}
```
This doesn’t scale. And it breaks the moment a brand uses a word that has a different color meaning in context — “Midnight” is black for iPhones but dark blue for Samsung Galaxy watches.
2. Promotional and marketing naming
Retailers inject promotional language into product names. The same physical product gets different names depending on when and where it’s sold.
| Source | Product name as listed |
|---|---|
| Manufacturer catalog | Sony WH-1000XM5 Wireless Noise Cancelling Headphones - Black |
| Retailer website (normal) | Sony WH-1000XM5 Over-Ear Headphones, Black |
| Retailer website (sale) | SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED |
| Marketplace listing | Sony WH1000XM5 Bluetooth NC Headphones - Midnight Black - SEALED NEW |
| Outlet store | Sony 1000XM5 Noise Cancel Wireless — Black [Refurb] |
| Bundle listing | Sony WH-1000XM5 Black + Carrying Case + Cable Bundle |
Six representations of the same base product. Promotional text, condition markers, abbreviations, and bundling all change the product name.
A pandas fuzzy match between “Sony WH-1000XM5 Wireless Noise Cancelling Headphones - Black” and “SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED” produces a token sort ratio of about 62. That’s below most useful thresholds — the promotional text dilutes the matching signal.
You can strip promotional keywords with regex:
```python
import re

# Remove common promo tokens, then collapse the leftover double spaces
promo_pattern = r'\b(SAVE|LIMITED|NEW|SEALED|SALE|FREE SHIPPING|BEST DEAL|HOT)\b'
df['clean_name'] = (
    df['product_name']
    .str.replace(promo_pattern, '', regex=True, case=False)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
```
But this is a whack-a-mole game. “LIMITED EDITION” might be promotional — or it might be a genuine product variant (like a limited edition colorway). “NEW” might be promotional noise or a condition indicator. “BUNDLE” changes what the product actually is. Rules can’t tell the difference without understanding what the words mean.
3. Language and translation differences
Retail is global. A product catalog from a European supplier often mixes languages — Spanish product names in one column, English brand names in another, French category labels in a third.
| English (US catalog) | Spanish (MX supplier) | French (CA supplier) | German (EU distributor) |
|---|---|---|---|
| Samsung 65" 4K Smart TV | Samsung Televisor 65" 4K Inteligente | Samsung Téléviseur 65" 4K Intelligent | Samsung 65 Zoll 4K Fernseher |
| Wireless Bluetooth Earbuds | Audífonos Inalámbricos Bluetooth | Écouteurs Sans Fil Bluetooth | Kabellose Bluetooth-Ohrhörer |
| Stainless Steel Water Bottle 32oz | Botella de Agua Acero Inoxidable 1L | Bouteille d'Eau Inox 1L | Edelstahl Trinkflasche 1L |
| Running Shoes Men Size 10 | Tenis para Correr Hombre Talla 28 | Chaussures de Course Homme Taille 44 | Laufschuhe Herren Größe 44 |
Same products, four languages. Note the unit differences: 32oz vs 1L, Size 10 vs Talla 28 vs Taille 44.
SQL can’t join “Wireless Bluetooth Earbuds” to “Audífonos Inalámbricos Bluetooth.” The Levenshtein distance between those strings is well over 20 edits. Even if you somehow matched “Bluetooth” (the one shared token), you’d need to know that “Audífonos Inalámbricos” means “Wireless Earbuds” — that’s translation, not string comparison.
And the problem goes beyond words. Shoe size 10 (US) = Taille 44 (EU) = Talla 28 (MX). 32oz ≈ 1L. These are unit conversions that require domain knowledge, not character matching.
```python
# This produces zero matches across languages
from fuzzywuzzy import fuzz

score = fuzz.token_set_ratio(
    "Wireless Bluetooth Earbuds",
    "Audífonos Inalámbricos Bluetooth",
)
# score ≈ 40 — well below any useful threshold
```
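The unit differences above (32oz vs 1L, US size 10 vs EU 44) can be bridged with conversion logic, but only if someone builds and maintains it. Here is a minimal sketch; the conversion constants and the shoe-size excerpt are illustrative, not a complete table.

```python
# Normalize units to a canonical form before comparing.
# The conversion tables here are illustrative, not exhaustive.

OZ_PER_LITER = 33.814  # fluid ounces per liter

def volume_liters(value: float, unit: str) -> float:
    """Convert a volume to liters (supports 'oz' and 'l' here)."""
    unit = unit.lower()
    if unit == 'oz':
        return value / OZ_PER_LITER
    if unit in ('l', 'liter', 'litre'):
        return value
    raise ValueError(f"unknown unit: {unit}")

# Men's shoe sizes, US -> (EU, MX): a tiny excerpt of a full table
SHOE_SIZE_US_TO_EU_MX = {9: (42.5, 27), 10: (44, 28), 11: (45, 29)}

def same_volume(a, b, tolerance=0.08) -> bool:
    """True when two (value, unit) volumes agree within ~8%."""
    la, lb = volume_liters(*a), volume_liters(*b)
    return abs(la - lb) / max(la, lb) <= tolerance

same_volume((32, 'oz'), (1, 'l'))  # True: 32oz is about 0.946L
```

Note what this code cannot do: it knows nothing about "Talla" vs "Taille", and every new unit or sizing system means another table. The conversions are mechanical; knowing *which* conversions apply to which field is the part that requires understanding.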
4. Specification formatting
The same technical specification gets formatted dozens of ways. Storage capacity, screen size, processor model, RAM — every system has its own conventions.
| Attribute | Format A | Format B | Format C | Format D |
|---|---|---|---|---|
| Storage | 256GB | 256 GB | 256G | 0.25TB |
| Screen size | 6.7" | 6.7 inch | 6.7-inch | 170mm |
| RAM | 8GB RAM | 8 GB | 8GB LPDDR5 | 8192MB |
| Battery | 5000mAh | 5000 mAh | 5Ah | 19.25Wh |
| Weight | 240g | 0.24kg | 8.47oz | 0.53lb |
| Resolution | 3088x1440 | 3088 x 1440 | 1440p | WQHD+ |
All values in each row are equivalent. String matching treats them as completely different.
A SQL WHERE clause can handle one or two of these with REPLACE and CAST:
```sql
-- This handles "256GB" vs "256 GB" but not "256G" or "0.25TB"
WHERE REPLACE(i.storage, ' ', '') = REPLACE(s.storage, ' ', '')
```
To handle all variants, you’d need a parsing function for every spec field that extracts the numeric value and unit, converts to a common unit, and compares. That’s feasible for one field. For ten fields across a product catalog with inconsistent naming of the fields themselves — it’s an engineering project.
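To make the scale of that project concrete, here is what one such normalizer looks like for a single field. This is a sketch of the storage case only; it assumes binary factors (1 TB = 1024 GB), which is how "0.25TB" lines up with "256GB" in retail listings.

```python
import re

# One spec-normalizer; a real catalog needs one of these per spec field.
UNIT_TO_GB = {'MB': 1 / 1024, 'M': 1 / 1024,
              'GB': 1, 'G': 1,
              'TB': 1024, 'T': 1024}

def normalize_storage(text: str):
    """Return storage in GB from strings like '256GB', '256 G', '0.25TB'."""
    m = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB|M|G|T)\s*',
                     text, re.IGNORECASE)
    if not m:
        return None
    return float(m.group(1)) * UNIT_TO_GB[m.group(2).upper()]

# All of these collapse to the same canonical value
[normalize_storage(s) for s in ('256GB', '256 GB', '256G', '0.25TB')]
# -> [256.0, 256.0, 256.0, 256.0]
```

Multiply this by screen sizes, battery capacities, weights, and resolutions, each with its own units and edge cases, and the "engineering project" label is accurate.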
5. Implicit product knowledge
Some matching decisions require knowledge that isn’t in the data at all.
| Record A | Record B | Why they match | Why string matching fails |
|---|---|---|---|
| iPhone 15 Pro Max | A3106 (Apple) | A3106 is the model number for iPhone 15 Pro Max | No string overlap beyond 'Apple' |
| Dyson V15 Detect | Dyson SV47 | SV47 is the internal model code for V15 Detect | Different naming system entirely |
| AirPods Pro 2nd Gen | AirPods Pro USB-C | 2nd gen is the USB-C version | String match might work — but USB-C vs 2nd Gen aren't equivalent strings |
| Galaxy Buds2 Pro | SM-R510 | SM-R510 is the Samsung model number | No string overlap |
| PS5 Slim Digital | CFI-2015 | CFI-2015 is the SKU for PS5 Slim Digital Edition | Completely different strings |
These matches are obvious to a product specialist but invisible to any string comparison algorithm.
No amount of pandas string manipulation will connect “Dyson V15 Detect” to “Dyson SV47.” That connection exists in Dyson’s product documentation, in retail databases, in product reviews — it’s world knowledge. String algorithms don’t have world knowledge.
How AI solves each problem
Embeddings understand color semantics
When an embedding model processes “Titanium Gray” and “Silver,” both vectors land in a similar region of the color semantic space. The model has learned from millions of product descriptions that these terms describe similar colors in the context of consumer electronics.
| Color A | Color B | Cosine similarity | String similarity (Jaro-Winkler) |
|---|---|---|---|
| Midnight | Black | 0.87 | 0.43 |
| Violet | Purple | 0.91 | 0.48 |
| Titanium Gray | Silver | 0.84 | 0.51 |
| Starlight | White | 0.82 | 0.38 |
| Alpine Green | Green | 0.89 | 0.68 |
| Rose Gold | Doré | 0.79 | 0.30 |
| Phantom Black | Noir | 0.85 | 0.29 |
| Lavender | Light Purple | 0.93 | 0.42 |
Embedding similarity captures semantic closeness. String similarity measures character overlap — which is near-random for creative color names. Scores are illustrative.
The Jaro-Winkler scores for these pairs are useless — most hover around 0.30-0.50, well below any matching threshold. The embedding scores correctly cluster color synonyms together, because the model understands that “Midnight” in the context of a phone listing means black.
Critically, embeddings also know when color names are not equivalent. “Midnight” and “Midnight Blue” produce embeddings that are close but distinguishable — the model captures that “Midnight Blue” refers to a dark blue, not pure black. This nuance is impossible with a lookup table approach, where you’d have to decide: does “Midnight” map to black or dark blue? The answer depends on context, which only the embedding captures.
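The similarity metric behind these tables is plain cosine similarity over embedding vectors. The sketch below shows the computation itself; the three-dimensional vectors are toy stand-ins (real embeddings have hundreds of dimensions, and real values come from a model, not from this example).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real model output: a well-trained model
# places "midnight" and "black" in nearly the same direction, and
# "navy" further away.
midnight = [0.82, 0.41, 0.10]
black = [0.79, 0.45, 0.12]
navy = [0.30, 0.20, 0.93]

cosine_similarity(midnight, black)  # high: near-synonym colors
cosine_similarity(midnight, navy)   # much lower: different colors
```

The metric itself is trivial; all the intelligence lives in where the model places the vectors. That placement, learned from millions of product descriptions, is what no lookup table can replicate.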
LLMs strip promotional noise intelligently
An LLM can read “SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED” and understand that “SAVE 30%,” “Premium,” and “LIMITED” are promotional modifiers, while “Sony WH-1000XM5,” “ANC Headphones,” and “Black” are product identifiers. It doesn’t need a regex pattern for every promotional keyword — it understands the structure of a product listing.
More importantly, an LLM can distinguish promotional text from product information:
- “LIMITED” in “LIMITED TIME OFFER” → promotional, ignore
- “LIMITED” in “Limited Edition Colorway” → product variant, keep
- “NEW” in “NEW ARRIVAL” → promotional, ignore
- “NEW” in “New 2024 Model” → product generation, keep
- “BUNDLE” in “GREAT BUNDLE DEAL” → promotional framing
- “BUNDLE” in “Headphones + Case Bundle” → actual bundle, different product
This contextual reasoning is what makes LLMs fundamentally different from regex. The same word means different things in different positions, and the LLM processes meaning, not patterns.
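In practice, this disambiguation is posed to the model as a short classification prompt. The sketch below shows one way to phrase it; the prompt wording and the commented-out `call_llm` helper are hypothetical, and any chat-completion API slots in where the helper is.

```python
# A sketch of how the word-disambiguation question can be posed to an LLM.
# The prompt wording and call_llm() are hypothetical, not a specific API.

PROMPT_TEMPLATE = (
    "In the product listing below, classify the word '{word}' as either "
    "PROMOTIONAL (marketing noise to ignore) or PRODUCT (part of the "
    "product's identity). Answer with one word.\n\n"
    "Listing: {listing}"
)

def build_prompt(word: str, listing: str) -> str:
    return PROMPT_TEMPLATE.format(word=word, listing=listing)

# def classify(word, listing):
#     return call_llm(build_prompt(word, listing))  # hypothetical helper

build_prompt("LIMITED", "Nike Air Max Limited Edition Colorway")
```

Note that the prompt contains no list of promotional keywords. The model is asked to judge the word in context, which is precisely what the regex approach cannot do.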
Embeddings work across languages
Modern embedding models are multilingual by design. “Wireless Bluetooth Earbuds” and “Audífonos Inalámbricos Bluetooth” produce vectors in the same semantic neighborhood because the model was trained on text in dozens of languages and learned that these phrases refer to the same concept.
| English | Spanish | French | Embedding similarity |
|---|---|---|---|
| Wireless Earbuds | Audífonos Inalámbricos | Écouteurs Sans Fil | 0.88–0.92 |
| Stainless Steel Water Bottle | Botella de Acero Inoxidable | Bouteille en Inox | 0.85–0.90 |
| Running Shoes | Tenis para Correr | Chaussures de Course | 0.87–0.91 |
| Smart TV 65 inch | Televisor Inteligente 65" | Téléviseur Intelligent 65" | 0.90–0.94 |
| Noise Cancelling Headphones | Audífonos con Cancelación de Ruido | Casque à Réduction de Bruit | 0.86–0.91 |
Embedding models trained on multilingual data natively understand cross-language equivalence. No translation step needed. Score ranges reflect variation across embedding models.
No translation API. No multilingual lookup dictionary. No language detection step. The embedding model handles it implicitly, because semantic similarity in vector space doesn’t depend on which language the text is written in.
LLMs reason about specifications and model numbers
When an LLM examines two records — one listing “iPhone 15 Pro Max” and the other listing “A3106 (Apple)” — it can connect them because it has been trained on product documentation, spec sheets, and retail databases. It knows that A3106 is a model number for a specific iPhone variant.
This is the capability that no amount of SQL or pandas code can replicate. String comparison functions compare characters. LLMs compare meanings, drawing on a vast base of product knowledge to make connections that would otherwise require a human product specialist.
The SQL ceiling and the AI floor
Here’s the practical problem: SQL and pandas are excellent at things AI is wasteful on, and AI is excellent at things SQL and pandas can’t do. Using only one approach means either wasting money on easy comparisons or missing hard matches entirely.
The gap between roughly 58% recall (about the best a pandas pipeline reaches, even with extensive engineering) and 94% (the hybrid pipeline) represents thousands of product matches that traditional tools miss. Those aren’t edge cases — they’re common scenarios like color variant naming, cross-language catalogs, and model number differences.
The hybrid pipeline: next-generation product matching
The architecture that solves this combines both approaches in a cost-efficient funnel.
Layer 1: SQL-style string and numeric pre-filters (free, instant)
Use the operations SQL and pandas are good at to eliminate obvious non-matches.
Brand blocking. If both datasets have a brand column, require exact brand match (after normalization) before comparing anything else. This alone eliminates 90%+ of candidate pairs and costs nothing.
```sql
-- The kind of filter SQL excels at
WHERE LOWER(TRIM(a.brand)) = LOWER(TRIM(b.brand))
```
Numeric range filters. If a $50 product appears in one dataset, it’s not going to match a $500 product in the other. A price-proximity filter (within 30% of each other, adjustable) eliminates more pairs for free.
String prefix or contains. If the product names share no tokens at all — not even the model number — they’re almost certainly different products. A fast check for any shared significant token (longer than 3 characters) is cheap and eliminates noise.
These are the operations SQL was built for. They run on millions of pairs in seconds. Let them do what they’re good at.
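The same Layer 1 logic is just as natural in Python. Here is a minimal sketch combining all three filters; the field names and the 30% price tolerance are illustrative choices, not fixed parts of the pipeline.

```python
# Sketch of Layer 1 pre-filtering; field names are illustrative.

def normalize(s: str) -> str:
    return s.strip().lower()

def significant_tokens(name: str) -> set:
    """Tokens longer than 3 characters: the 'any shared token' check."""
    return {t for t in normalize(name).split() if len(t) > 3}

def passes_prefilter(a: dict, b: dict) -> bool:
    # Brand blocking: exact match after normalization
    if normalize(a['brand']) != normalize(b['brand']):
        return False
    # Price proximity: within 30% of each other (adjustable)
    lo, hi = sorted((a['price'], b['price']))
    if hi > lo * 1.3:
        return False
    # Require at least one shared significant token between the names
    return bool(significant_tokens(a['name']) & significant_tokens(b['name']))

a = {'brand': 'Sony ', 'name': 'Sony WH-1000XM5 Headphones Black', 'price': 348}
b = {'brand': 'sony', 'name': 'SAVE 30% Sony WH-1000XM5 ANC Black', 'price': 329}
passes_prefilter(a, b)  # True: the pair survives into the AI layers
```

Every pair this function rejects is a pair the AI layers never pay for, which is the entire economic point of the funnel.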
Layer 2: AI attribute extraction (cost per record, runs once)
For the records that survived pre-filtering, an LLM reads each product listing and extracts structured attributes: brand, model, color (normalized), storage, screen size, category, condition.
This is where the color problem gets solved. The LLM reads “Phantom Black” and extracts color: black. It reads “Gris Titanio” and extracts color: gray. It reads “Starlight” and extracts color: white/silver. Now both records have a normalized color field that downstream comparisons can use — without a manually maintained lookup table.
The extraction runs once per unique record, not once per pair. For a 10,000-record dataset, that’s 10,000 LLM calls — not the millions of calls you’d need for pairwise LLM comparison.
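The once-per-unique-record economics can be enforced with a simple cache in front of the extraction call. In this sketch, `extract_attributes` is a hypothetical stand-in for the real LLM call; the cache guarantees each distinct listing is sent to the model exactly once, no matter how many candidate pairs it appears in.

```python
# Sketch: cache extraction so each unique listing costs one LLM call.
# extract_attributes() is a hypothetical wrapper around a real model call.
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def extract_attributes(listing: str) -> tuple:
    global calls
    calls += 1
    # ... a real pipeline would call the LLM here and parse its output ...
    return ('stub-brand', 'stub-model', 'stub-color')

records = ['Galaxy S24 Ultra 512GB Titanium Gray'] * 3 + ['Pixel 8 Pro Hazel']
for r in records:
    extract_attributes(r)

calls  # 2: one call per unique listing, not per row and not per pair
```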
Layer 3: Embedding similarity (cost per record, computed once)
Each record’s extracted attributes and original text are embedded into a semantic vector. Cosine similarity between candidate pairs produces a score from 0 to 1.
This is where cross-language matching, promotional noise tolerance, and specification format variation get handled. The embeddings capture meaning — “256GB” and “256 GB” and “256G” all produce essentially the same embedding contribution, because the model understands they mean the same thing.
Layer 4: LLM confirmation (cost per pair, only ambiguous pairs)
Pairs with embedding similarity in the ambiguous range — high enough to be plausible, low enough to be uncertain — get sent to an LLM for final confirmation. The LLM sees both full records and decides: same product, different product, or same product but different variant (size, color, storage).
This is the most expensive step, but it only runs on the pairs that genuinely need it. In a typical retail matching job, that’s 2-5% of the original candidate pairs.
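The routing that keeps Layer 4 cheap is a pair of thresholds on the embedding score. The cutoffs below are illustrative; in practice they are tuned per dataset by inspecting the similarity distribution.

```python
# Sketch of Layer 4 routing. The thresholds are illustrative assumptions.

AUTO_MATCH = 0.93   # above this: accept without an LLM call
AUTO_REJECT = 0.75  # below this: reject without an LLM call

def route(similarity: float) -> str:
    if similarity >= AUTO_MATCH:
        return 'match'
    if similarity < AUTO_REJECT:
        return 'no-match'
    return 'send-to-llm'  # only the ambiguous band pays for a model call

[route(s) for s in (0.97, 0.85, 0.60)]
# -> ['match', 'send-to-llm', 'no-match']
```

Widening the ambiguous band buys accuracy at the cost of more LLM calls; narrowing it does the reverse. That single knob is how the pipeline trades cost against recall.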
| Problem | SQL/Pandas pre-filter | AI extraction | Embedding similarity | LLM confirmation |
|---|---|---|---|---|
| Exact duplicates | Solves completely | — | — | — |
| Formatting differences (256GB vs 256 GB) | Partially (with normalization) | Solves | Solves | — |
| Color variants (Midnight vs Black) | Can't solve | Solves (normalized extraction) | Solves (semantic similarity) | Solves |
| Promotional naming (SALE 30%...) | Partially (regex) | Solves (ignores promo text) | Solves (semantic focus) | Solves |
| Cross-language (English vs Spanish) | Can't solve | Partially | Solves (multilingual model) | Solves |
| Model number ↔ product name | Can't solve | Can't solve | Sometimes | Solves (world knowledge) |
| Variant distinction (256GB vs 512GB) | Solves (numeric filter) | Solves (extracts storage) | Sometimes | Solves |
Each layer handles different failure modes. No single layer solves everything. The combination covers the full spectrum.
Cost structure
The funnel shape is the key economic insight. Each layer is more expensive than the last — but each layer processes dramatically fewer pairs.
For a job like the 10,000-record example above, the total AI cost is roughly $18. Without the SQL-style pre-filtering layers, embedding every candidate pair would cost over $1,200, and running LLM confirmation on every pair would cost over $60,000. The pre-filters don’t just save money — they make the approach economically viable.
A concrete example: matching phone catalogs
A mobile phone retailer needs to match their internal catalog (2,400 products) against a distributor’s catalog (3,100 products) to reconcile inventory and pricing. Here’s how the matching plays out on real product naming patterns.
| Retailer listing | Distributor listing | Matched by |
|---|---|---|
| Samsung Galaxy S24 Ultra 256GB Black | Samsung Galaxy S24 Ultra 256GB Black | Layer 1: exact string match |
| iPhone 15 Pro 128GB Blue Titanium | Apple iPhone 15 Pro 128 GB Blue Titanium | Layer 1: string pre-filter (high token overlap) |
| Galaxy S24+ 256GB Violet | Samsung Galaxy S24 Plus 256GB Lilac Purple | Layer 3: embedding similarity (Violet ≈ Lilac Purple, S24+ = S24 Plus) |
| Pixel 8 Pro 256GB Porcelain | Google Pixel 8 Pro 256GB Crema Cerámica | Layer 3: embedding similarity (cross-language, Porcelain ≈ Crema Cerámica) |
| iPhone 15 128GB Pink | Apple iPhone 15 128Go Rose | Layer 3: embedding similarity (Pink = Rose, 128GB = 128Go) |
| OnePlus 12 256GB Silky Black | OnePlus 12 256GB Noir Soyeux | Layer 3: embedding similarity (Silky Black = Noir Soyeux) |
| Motorola Edge 40 Neo 256GB | XT2335-3 Motorola 256GB | Layer 4: LLM confirmation (knows XT2335-3 = Edge 40 Neo) |
| Galaxy Z Flip5 256GB Mint | SM-F731B Samsung 256GB Green | Layer 4: LLM confirmation (knows SM-F731B = Z Flip5, Mint ≈ Green) |
Real-world phone catalog naming patterns. Each row shows a true match and which pipeline layer identified it.
Without the AI layers, the last six matches would all be missed. The retailer would think those products exist in one catalog but not the other — leading to incorrect inventory counts, missed pricing comparisons, and manual reconciliation work.
With the hybrid pipeline, all eight pairs are matched correctly. The easy ones are caught instantly by string filters. The color variants and cross-language names are caught by embeddings. The model-number-to-product-name connections are caught by LLM reasoning.
What this means for retail teams
You don’t need to build lookup tables
The traditional approach to color matching, abbreviation handling, and format normalization is lookup tables. Hundreds of lines mapping “Midnight” → “Black,” “Violet” → “Purple,” “Starlight” → “White.” These tables are expensive to build, expensive to maintain, and always incomplete.
AI extraction replaces lookup tables with a single instruction: “Extract the normalized color from this product description.” The model handles every color variant it’s ever seen in training data — including terms from next season that don’t exist yet in any lookup table.
You don’t need a translation step
If your supplier catalogs include Spanish, French, German, or Portuguese product names, you don’t need to run them through a translation API before matching. Multilingual embeddings handle cross-language matching natively. The pipeline matches “Audífonos Inalámbricos Bluetooth” to “Wireless Bluetooth Earbuds” without either record being translated.
Your SQL skills still matter
The pre-filtering layers of the pipeline are essentially SQL operations — exact match, LIKE patterns, numeric range comparisons. The skills your team already has are what make the AI layers affordable. Without good pre-filtering, the AI cost would be 50-100x higher.
The shift isn’t from SQL to AI. It’s from SQL alone to SQL plus AI — letting each technology handle what it’s best at.
The accuracy ceiling is real
If you’re matching retail products with traditional tools and getting 50-60% recall, that’s not a tuning problem. You’ve hit the ceiling of what string comparison can do on data with color variants, promotional naming, and cross-language descriptions. More regex rules, more lookup tables, and lower thresholds won’t close the gap — they’ll just add false positives.
Breaking through that ceiling requires understanding what the text means, not what characters it contains. That’s what embeddings and LLMs provide.
Match Data Studio runs this exact pipeline architecture. SQL-style string and numeric pre-filters eliminate the easy non-matches for free. AI attribute extraction normalizes color variants, promotional text, and specification formats. Multilingual embeddings handle cross-language matching. LLM confirmation resolves the genuinely ambiguous pairs. You configure the pipeline through a visual interface — no lookup tables, no regex libraries, no translation APIs.