Why SQL and pandas can't accurately match retail products — and what can
SQL JOINs and pandas merges fail on color variants, promotional naming, translated descriptions, and spec formatting differences. AI embeddings and LLMs understand that 'Midnight' means black and 'Violet' means purple. Here's why traditional tools hit a ceiling and how hybrid pipelines break through it.
A retail operations team exports two product catalogs as CSVs. One from their internal inventory system, one from a supplier. They open a Jupyter notebook, load both into pandas DataFrames, and write a merge:
```python
merged = pd.merge(inventory, supplier, on='product_name', how='inner')
```
Zero matches. Not a single row. The inventory calls it “Samsung Galaxy S24 Ultra 512GB Titanium Gray.” The supplier calls it “Galaxy S24 Ultra 512 GB Gris Titanio.” Same phone. Different strings.
So the team tries fuzzy matching. They install fuzzywuzzy, compute token set ratios, set a threshold at 80. Now they get matches — but also false positives, because “Galaxy S24 Ultra 512GB” scores 83 against “Galaxy S24 Ultra 256GB.” Those aren’t the same product. One has twice the storage.
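One patch for this particular failure is a spec-aware guard: extract the storage capacity from each name and veto any fuzzy match where the capacities disagree. Below is a minimal sketch of that guard; the regex assumes capacities are written like "512GB", "512 GB", or "1TB", which covers common listings but not every format.

```python
import re

def storage_gb(name: str):
    """Extract a storage capacity in GB from a product name, or None.

    Assumes capacities appear as '512GB', '512 GB', or '1TB'.
    Uses a binary factor (1 TB = 1024 GB), the convention retail
    listings follow.
    """
    m = re.search(r'(\d+(?:\.\d+)?)\s*(GB|TB)', name, re.IGNORECASE)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2).upper()
    return value * 1024 if unit == 'TB' else value

def specs_conflict(a: str, b: str) -> bool:
    """True when both names state a storage capacity and they disagree."""
    cap_a, cap_b = storage_gb(a), storage_gb(b)
    return cap_a is not None and cap_b is not None and cap_a != cap_b

# The fuzzy score alone says "match"; the spec guard vetoes it
specs_conflict("Galaxy S24 Ultra 512GB", "Galaxy S24 Ultra 256GB")  # True
```

Guards like this stop the 512GB-vs-256GB false positive, but they only work for specs you thought to parse. Every new attribute needs its own extractor, which is exactly the maintenance burden the rest of this article is about.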
This scenario plays out daily in retail. The tools aren’t broken — they’re being asked to solve a problem they weren’t designed for.
The five problems SQL and pandas can’t solve
1. Color variant naming
This is the problem that makes retail matching uniquely difficult. The same physical color gets described with completely different words depending on the brand, the retailer, the market, and sometimes the mood of whoever wrote the product listing.
| Actual color | Brand A | Brand B | Brand C | Brand D |
|---|---|---|---|---|
| Black | Midnight | Phantom Black | Obsidian | Noir |
| Dark blue | Midnight Blue | Navy | Abyss Blue | Bleu Nuit |
| White | Starlight | Cream | Frost White | Blanc Glacier |
| Purple | Violet | Lavender | Deep Purple | Ultra Violet |
| Gray | Silver | Titanium Gray | Graphite | Gris Acier |
| Gold | Sunrise Gold | Champagne | Amber | Doré |
| Green | Alpine Green | Sage | Forest | Vert Sapin |
| Pink | Rose | Blush Pink | Flamingo | Rosa |
Every row is one color. No two brands use the same name. SQL equality, LIKE patterns, and Levenshtein distance all fail here.
Consider a straightforward matching task: your inventory lists a phone as “Titanium Gray” and the supplier lists the same phone as “Silver.” Try writing a SQL query that knows those are the same color:
```sql
-- This returns nothing
SELECT * FROM inventory i
JOIN supplier s ON i.color = s.color
WHERE i.model = 'Galaxy S24 Ultra';

-- This also returns nothing
SELECT * FROM inventory i
JOIN supplier s ON i.color LIKE '%' || s.color || '%'
WHERE i.model = 'Galaxy S24 Ultra';
```
You could build a lookup table mapping every color variant to a canonical name. But the table would need hundreds of entries — and every new brand, every new season, every new product launch introduces terms you haven’t seen before. Apple invented “Starlight” in 2021. Samsung introduced “Phantom Black” in 2021. Google launched “Hazel” in 2023. The vocabulary is unbounded.
```python
# The lookup table approach — always incomplete
color_map = {
    'midnight': 'black', 'phantom black': 'black', 'obsidian': 'black',
    'noir': 'black', 'jet black': 'black', 'onyx': 'black',
    'silver': 'gray', 'titanium gray': 'gray', 'graphite': 'gray',
    'violet': 'purple', 'lavender': 'purple', 'deep purple': 'purple',
    # ... hundreds more, and you're still missing next season's names
}
```
This doesn’t scale. And it breaks the moment a brand uses a word that has a different color meaning in context — “Midnight” is black for iPhones but dark blue for Samsung Galaxy watches.
2. Promotional and marketing naming
Retailers inject promotional language into product names. The same physical product gets different names depending on when and where it’s sold.
| Source | Product name as listed |
|---|---|
| Manufacturer catalog | Sony WH-1000XM5 Wireless Noise Cancelling Headphones - Black |
| Retailer website (normal) | Sony WH-1000XM5 Over-Ear Headphones, Black |
| Retailer website (sale) | SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED |
| Marketplace listing | Sony WH1000XM5 Bluetooth NC Headphones - Midnight Black - SEALED NEW |
| Outlet store | Sony 1000XM5 Noise Cancel Wireless — Black [Refurb] |
| Bundle listing | Sony WH-1000XM5 Black + Carrying Case + Cable Bundle |
Six representations of the same base product. Promotional text, condition markers, abbreviations, and bundling all change the product name.
A pandas fuzzy match between “Sony WH-1000XM5 Wireless Noise Cancelling Headphones - Black” and “SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED” produces a token sort ratio of about 62. That’s below most useful thresholds — the promotional text dilutes the matching signal.
You can strip promotional keywords with regex:
```python
import re

# Remove common promo tokens, then collapse the leftover double spaces
promo_pattern = r'\b(SAVE|LIMITED|NEW|SEALED|SALE|FREE SHIPPING|BEST DEAL|HOT)\b'
df['clean_name'] = (
    df['product_name']
    .str.replace(promo_pattern, '', regex=True, case=False)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
```
But this is a whack-a-mole game. “LIMITED EDITION” might be promotional — or it might be a genuine product variant (like a limited edition colorway). “NEW” might be promotional noise or a condition indicator. “BUNDLE” changes what the product actually is. Rules can’t tell the difference without understanding what the words mean.
3. Language and translation differences
Retail is global. A product catalog from a European supplier often mixes languages — Spanish product names in one column, English brand names in another, French category labels in a third.
| English (US catalog) | Spanish (MX supplier) | French (CA supplier) | German (EU distributor) |
|---|---|---|---|
| Samsung 65" 4K Smart TV | Samsung Televisor 65" 4K Inteligente | Samsung Téléviseur 65" 4K Intelligent | Samsung 65 Zoll 4K Fernseher |
| Wireless Bluetooth Earbuds | Audífonos Inalámbricos Bluetooth | Écouteurs Sans Fil Bluetooth | Kabellose Bluetooth-Ohrhörer |
| Stainless Steel Water Bottle 32oz | Botella de Agua Acero Inoxidable 1L | Bouteille d'Eau Inox 1L | Edelstahl Trinkflasche 1L |
| Running Shoes Men Size 10 | Tenis para Correr Hombre Talla 28 | Chaussures de Course Homme Taille 44 | Laufschuhe Herren Größe 44 |
Same products, four languages. Note the unit differences: 32oz vs 1L, Size 10 vs Talla 28 vs Taille 44.
SQL can’t join “Wireless Bluetooth Earbuds” to “Audífonos Inalámbricos Bluetooth.” The Levenshtein distance between those strings is well over 20 edits. Even if you somehow matched “Bluetooth” (the one shared token), you’d need to know that “Audífonos Inalámbricos” means “Wireless Earbuds” — that’s translation, not string comparison.
And the problem goes beyond words. Shoe size 10 (US) = Taille 44 (EU) = Talla 28 (MX). 32oz ≈ 1L. These are unit conversions that require domain knowledge, not character matching.
```python
# This produces zero matches across languages
from fuzzywuzzy import fuzz

score = fuzz.token_set_ratio(
    "Wireless Bluetooth Earbuds",
    "Audífonos Inalámbricos Bluetooth",
)
# score ≈ 40 — well below any useful threshold
```
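The unit differences above (32oz vs 1L, US size 10 vs EU 44) can be bridged with conversion logic, but only if someone builds and maintains it. Here is a minimal sketch; the conversion constants and the shoe-size excerpt are illustrative, not a complete table.

```python
# Normalize units to a canonical form before comparing.
# The conversion tables here are illustrative, not exhaustive.

OZ_PER_LITER = 33.814  # fluid ounces per liter

def volume_liters(value: float, unit: str) -> float:
    """Convert a volume to liters (supports 'oz' and 'l' here)."""
    unit = unit.lower()
    if unit == 'oz':
        return value / OZ_PER_LITER
    if unit in ('l', 'liter', 'litre'):
        return value
    raise ValueError(f"unknown unit: {unit}")

# Men's shoe sizes, US -> (EU, MX): a tiny excerpt of a full table
SHOE_SIZE_US_TO_EU_MX = {9: (42.5, 27), 10: (44, 28), 11: (45, 29)}

def same_volume(a, b, tolerance=0.08) -> bool:
    """True when two (value, unit) volumes agree within ~8%."""
    la, lb = volume_liters(*a), volume_liters(*b)
    return abs(la - lb) / max(la, lb) <= tolerance

same_volume((32, 'oz'), (1, 'l'))  # True: 32oz is about 0.946L
```

Note what this code cannot do: it knows nothing about "Talla" vs "Taille", and every new unit or sizing system means another table. The conversions are mechanical; knowing *which* conversions apply to which field is the part that requires understanding.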
4. Specification formatting
The same technical specification gets formatted dozens of ways. Storage capacity, screen size, processor model, RAM — every system has its own conventions.
| Attribute | Format A | Format B | Format C | Format D |
|---|---|---|---|---|
| Storage | 256GB | 256 GB | 256G | 0.25TB |
| Screen size | 6.7" | 6.7 inch | 6.7-inch | 170mm |
| RAM | 8GB RAM | 8 GB | 8GB LPDDR5 | 8192MB |
| Battery | 5000mAh | 5000 mAh | 5Ah | 19.25Wh |
| Weight | 240g | 0.24kg | 8.47oz | 0.53lb |
| Resolution | 3088x1440 | 3088 x 1440 | 1440p | WQHD+ |
All values in each row are equivalent. String matching treats them as completely different.
A SQL WHERE clause can handle one or two of these with REPLACE and CAST:
```sql
-- This handles "256GB" vs "256 GB" but not "256G" or "0.25TB"
WHERE REPLACE(i.storage, ' ', '') = REPLACE(s.storage, ' ', '')
```
To handle all variants, you’d need a parsing function for every spec field that extracts the numeric value and unit, converts to a common unit, and compares. That’s feasible for one field. For ten fields across a product catalog with inconsistent naming of the fields themselves — it’s an engineering project.
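To make the scale of that project concrete, here is what one such normalizer looks like for a single field. This is a sketch of the storage case only; it assumes binary factors (1 TB = 1024 GB), which is how "0.25TB" lines up with "256GB" in retail listings.

```python
import re

# One spec-normalizer; a real catalog needs one of these per spec field.
UNIT_TO_GB = {'MB': 1 / 1024, 'M': 1 / 1024,
              'GB': 1, 'G': 1,
              'TB': 1024, 'T': 1024}

def normalize_storage(text: str):
    """Return storage in GB from strings like '256GB', '256 G', '0.25TB'."""
    m = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB|M|G|T)\s*',
                     text, re.IGNORECASE)
    if not m:
        return None
    return float(m.group(1)) * UNIT_TO_GB[m.group(2).upper()]

# All of these collapse to the same canonical value
[normalize_storage(s) for s in ('256GB', '256 GB', '256G', '0.25TB')]
# -> [256.0, 256.0, 256.0, 256.0]
```

Multiply this by screen sizes, battery capacities, weights, and resolutions, each with its own units and edge cases, and the "engineering project" label is accurate.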
5. Implicit product knowledge
Some matching decisions require knowledge that isn’t in the data at all.
| Record A | Record B | Why they match | Why string matching fails |
|---|---|---|---|
| iPhone 15 Pro Max | A3106 (Apple) | A3106 is the model number for iPhone 15 Pro Max | No string overlap beyond 'Apple' |
| Dyson V15 Detect | Dyson SV47 | SV47 is the internal model code for V15 Detect | Different naming system entirely |
| AirPods Pro 2nd Gen | AirPods Pro USB-C | 2nd gen is the USB-C version | String match might work — but USB-C vs 2nd Gen aren't equivalent strings |
| Galaxy Buds2 Pro | SM-R510 | SM-R510 is the Samsung model number | No string overlap |
| PS5 Slim Digital | CFI-2015 | CFI-2015 is the SKU for PS5 Slim Digital Edition | Completely different strings |
These matches are obvious to a product specialist but invisible to any string comparison algorithm.
No amount of pandas string manipulation will connect “Dyson V15 Detect” to “Dyson SV47.” That connection exists in Dyson’s product documentation, in retail databases, in product reviews — it’s world knowledge. String algorithms don’t have world knowledge.
How AI solves each problem
Embeddings understand color semantics
When an embedding model processes “Titanium Gray” and “Silver,” both vectors land in a similar region of the color semantic space. The model has learned from millions of product descriptions that these terms describe similar colors in the context of consumer electronics.
| Color A | Color B | Cosine similarity | String similarity (Jaro-Winkler) |
|---|---|---|---|
| Midnight | Black | 0.87 | 0.43 |
| Violet | Purple | 0.91 | 0.48 |
| Titanium Gray | Silver | 0.84 | 0.51 |
| Starlight | White | 0.82 | 0.38 |
| Alpine Green | Green | 0.89 | 0.68 |
| Rose Gold | Doré | 0.79 | 0.30 |
| Phantom Black | Noir | 0.85 | 0.29 |
| Lavender | Light Purple | 0.93 | 0.42 |
Embedding similarity captures semantic closeness. String similarity measures character overlap — which is near-random for creative color names. Scores are illustrative.
The Jaro-Winkler scores for these pairs are useless — most hover around 0.30-0.50, well below any matching threshold. The embedding scores correctly cluster color synonyms together, because the model understands that “Midnight” in the context of a phone listing means black.
Critically, embeddings also know when color names are not equivalent. “Midnight” and “Midnight Blue” produce embeddings that are close but distinguishable — the model captures that “Midnight Blue” refers to a dark blue, not pure black. This nuance is impossible with a lookup table approach, where you’d have to decide: does “Midnight” map to black or dark blue? The answer depends on context, which only the embedding captures.
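The similarity metric behind these tables is plain cosine similarity over embedding vectors. The sketch below shows the computation itself; the three-dimensional vectors are toy stand-ins (real embeddings have hundreds of dimensions, and real values come from a model, not from this example).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real model output: a well-trained model
# places "midnight" and "black" in nearly the same direction, and
# "navy" further away.
midnight = [0.82, 0.41, 0.10]
black = [0.79, 0.45, 0.12]
navy = [0.30, 0.20, 0.93]

cosine_similarity(midnight, black)  # high: near-synonym colors
cosine_similarity(midnight, navy)   # much lower: different colors
```

The metric itself is trivial; all the intelligence lives in where the model places the vectors. That placement, learned from millions of product descriptions, is what no lookup table can replicate.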
LLMs strip promotional noise intelligently
An LLM can read “SAVE 30% Sony WH-1000XM5 Premium ANC Headphones Black - LIMITED” and understand that “SAVE 30%,” “Premium,” and “LIMITED” are promotional modifiers, while “Sony WH-1000XM5,” “ANC Headphones,” and “Black” are product identifiers. It doesn’t need a regex pattern for every promotional keyword — it understands the structure of a product listing.
More importantly, an LLM can distinguish promotional text from product information:
- “LIMITED” in “LIMITED TIME OFFER” → promotional, ignore
- “LIMITED” in “Limited Edition Colorway” → product variant, keep
- “NEW” in “NEW ARRIVAL” → promotional, ignore
- “NEW” in “New 2024 Model” → product generation, keep
- “BUNDLE” in “GREAT BUNDLE DEAL” → promotional framing
- “BUNDLE” in “Headphones + Case Bundle” → actual bundle, different product
This contextual reasoning is what makes LLMs fundamentally different from regex. The same word means different things in different positions, and the LLM processes meaning, not patterns.
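In practice, this disambiguation is posed to the model as a short classification prompt. The sketch below shows one way to phrase it; the prompt wording and the commented-out `call_llm` helper are hypothetical, and any chat-completion API slots in where the helper is.

```python
# A sketch of how the word-disambiguation question can be posed to an LLM.
# The prompt wording and call_llm() are hypothetical, not a specific API.

PROMPT_TEMPLATE = (
    "In the product listing below, classify the word '{word}' as either "
    "PROMOTIONAL (marketing noise to ignore) or PRODUCT (part of the "
    "product's identity). Answer with one word.\n\n"
    "Listing: {listing}"
)

def build_prompt(word: str, listing: str) -> str:
    return PROMPT_TEMPLATE.format(word=word, listing=listing)

# def classify(word, listing):
#     return call_llm(build_prompt(word, listing))  # hypothetical helper

build_prompt("LIMITED", "Nike Air Max Limited Edition Colorway")
```

Note that the prompt contains no list of promotional keywords. The model is asked to judge the word in context, which is precisely what the regex approach cannot do.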
Embeddings work across languages
Modern embedding models are multilingual by design. “Wireless Bluetooth Earbuds” and “Audífonos Inalámbricos Bluetooth” produce vectors in the same semantic neighborhood because the model was trained on text in dozens of languages and learned that these phrases refer to the same concept.
| English | Spanish | French | Embedding similarity |
|---|---|---|---|
| Wireless Earbuds | Audífonos Inalámbricos | Écouteurs Sans Fil | 0.88–0.92 |
| Stainless Steel Water Bottle | Botella de Acero Inoxidable | Bouteille en Inox | 0.85–0.90 |
| Running Shoes | Tenis para Correr | Chaussures de Course | 0.87–0.91 |
| Smart TV 65 inch | Televisor Inteligente 65" | Téléviseur Intelligent 65" | 0.90–0.94 |
| Noise Cancelling Headphones | Audífonos con Cancelación de Ruido | Casque à Réduction de Bruit | 0.86–0.91 |
Embedding models trained on multilingual data natively understand cross-language equivalence. No translation step needed. Score ranges reflect variation across embedding models.
No translation API. No multilingual lookup dictionary. No language detection step. The embedding model handles it implicitly, because semantic similarity in vector space doesn’t depend on which language the text is written in.
LLMs reason about specifications and model numbers
When an LLM examines two records — one listing “iPhone 15 Pro Max” and the other listing “A3106 (Apple)” — it can connect them because it has been trained on product documentation, spec sheets, and retail databases. It knows that A3106 is a model number for a specific iPhone variant.
This is the capability that no amount of SQL or pandas code can replicate. String comparison functions compare characters. LLMs compare meanings, drawing on a vast base of product knowledge to make connections that would otherwise require a human product specialist.
The SQL ceiling and the AI floor
Here’s the practical problem: SQL and pandas are excellent at things AI is wasteful on, and AI is excellent at things SQL and pandas can’t do. Using only one approach means either wasting money on easy comparisons or missing hard matches entirely.
The gap between roughly 58% recall (about the best a pandas pipeline reaches, even with extensive engineering) and 94% (the hybrid pipeline) represents thousands of product matches that traditional tools miss. Those aren’t edge cases — they’re common scenarios like color variant naming, cross-language catalogs, and model number differences.
The hybrid pipeline: next-generation product matching
The architecture that solves this combines both approaches in a cost-efficient funnel.
Layer 1: SQL-style string and numeric pre-filters (free, instant)
Use the operations SQL and pandas are good at to eliminate obvious non-matches.
Brand blocking. If both datasets have a brand column, require exact brand match (after normalization) before comparing anything else. This alone eliminates 90%+ of candidate pairs and costs nothing.
```sql
-- The kind of filter SQL excels at
WHERE LOWER(TRIM(a.brand)) = LOWER(TRIM(b.brand))
```
Numeric range filters. If a $50 product appears in one dataset, it’s not going to match a $500 product in the other. A price-proximity filter (within 30% of each other, adjustable) eliminates more pairs for free.
String prefix or contains. If the product names share no tokens at all — not even the model number — they’re almost certainly different products. A fast check for any shared significant token (longer than 3 characters) is cheap and eliminates noise.
These are the operations SQL was built for. They run on millions of pairs in seconds. Let them do what they’re good at.
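The same Layer 1 logic is just as natural in Python. Here is a minimal sketch combining all three filters; the field names and the 30% price tolerance are illustrative choices, not fixed parts of the pipeline.

```python
# Sketch of Layer 1 pre-filtering; field names are illustrative.

def normalize(s: str) -> str:
    return s.strip().lower()

def significant_tokens(name: str) -> set:
    """Tokens longer than 3 characters: the 'any shared token' check."""
    return {t for t in normalize(name).split() if len(t) > 3}

def passes_prefilter(a: dict, b: dict) -> bool:
    # Brand blocking: exact match after normalization
    if normalize(a['brand']) != normalize(b['brand']):
        return False
    # Price proximity: within 30% of each other (adjustable)
    lo, hi = sorted((a['price'], b['price']))
    if hi > lo * 1.3:
        return False
    # Require at least one shared significant token between the names
    return bool(significant_tokens(a['name']) & significant_tokens(b['name']))

a = {'brand': 'Sony ', 'name': 'Sony WH-1000XM5 Headphones Black', 'price': 348}
b = {'brand': 'sony', 'name': 'SAVE 30% Sony WH-1000XM5 ANC Black', 'price': 329}
passes_prefilter(a, b)  # True: the pair survives into the AI layers
```

Every pair this function rejects is a pair the AI layers never pay for, which is the entire economic point of the funnel.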
Layer 2: AI attribute extraction (cost per record, runs once)
For the records that survived pre-filtering, an LLM reads each product listing and extracts structured attributes: brand, model, color (normalized), storage, screen size, category, condition.
This is where the color problem gets solved. The LLM reads “Phantom Black” and extracts color: black. It reads “Gris Titanio” and extracts color: gray. It reads “Starlight” and extracts color: white/silver. Now both records have a normalized color field that downstream comparisons can use — without a manually maintained lookup table.
The extraction runs once per unique record, not once per pair. For a 10,000-record dataset, that’s 10,000 LLM calls — not the millions of calls you’d need for pairwise LLM comparison.
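The once-per-unique-record economics can be enforced with a simple cache in front of the extraction call. In this sketch, `extract_attributes` is a hypothetical stand-in for the real LLM call; the cache guarantees each distinct listing is sent to the model exactly once, no matter how many candidate pairs it appears in.

```python
# Sketch: cache extraction so each unique listing costs one LLM call.
# extract_attributes() is a hypothetical wrapper around a real model call.
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def extract_attributes(listing: str) -> tuple:
    global calls
    calls += 1
    # ... a real pipeline would call the LLM here and parse its output ...
    return ('stub-brand', 'stub-model', 'stub-color')

records = ['Galaxy S24 Ultra 512GB Titanium Gray'] * 3 + ['Pixel 8 Pro Hazel']
for r in records:
    extract_attributes(r)

calls  # 2: one call per unique listing, not per row and not per pair
```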
Layer 3: Embedding similarity (cost per record, computed once)
Each record’s extracted attributes and original text are embedded into a semantic vector. Cosine similarity between candidate pairs produces a score from 0 to 1.
This is where cross-language matching, promotional noise tolerance, and specification format variation get handled. The embeddings capture meaning — “256GB” and “256 GB” and “256G” all produce essentially the same embedding contribution, because the model understands they mean the same thing.
Layer 4: LLM confirmation (cost per pair, only ambiguous pairs)
Pairs with embedding similarity in the ambiguous range — high enough to be plausible, low enough to be uncertain — get sent to an LLM for final confirmation. The LLM sees both full records and decides: same product, different product, or same product but different variant (size, color, storage).
This is the most expensive step, but it only runs on the pairs that genuinely need it. In a typical retail matching job, that’s 2-5% of the original candidate pairs.
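The routing that keeps Layer 4 cheap is a pair of thresholds on the embedding score. The cutoffs below are illustrative; in practice they are tuned per dataset by inspecting the similarity distribution.

```python
# Sketch of Layer 4 routing. The thresholds are illustrative assumptions.

AUTO_MATCH = 0.93   # above this: accept without an LLM call
AUTO_REJECT = 0.75  # below this: reject without an LLM call

def route(similarity: float) -> str:
    if similarity >= AUTO_MATCH:
        return 'match'
    if similarity < AUTO_REJECT:
        return 'no-match'
    return 'send-to-llm'  # only the ambiguous band pays for a model call

[route(s) for s in (0.97, 0.85, 0.60)]
# -> ['match', 'send-to-llm', 'no-match']
```

Widening the ambiguous band buys accuracy at the cost of more LLM calls; narrowing it does the reverse. That single knob is how the pipeline trades cost against recall.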
| Problem | SQL/Pandas pre-filter | AI extraction | Embedding similarity | LLM confirmation |
|---|---|---|---|---|
| Exact duplicates | Solves completely | — | — | — |
| Formatting differences (256GB vs 256 GB) | Partially (with normalization) | Solves | Solves | — |
| Color variants (Midnight vs Black) | Can't solve | Solves (normalized extraction) | Solves (semantic similarity) | Solves |
| Promotional naming (SALE 30%...) | Partially (regex) | Solves (ignores promo text) | Solves (semantic focus) | Solves |
| Cross-language (English vs Spanish) | Can't solve | Partially | Solves (multilingual model) | Solves |
| Model number ↔ product name | Can't solve | Can't solve | Sometimes | Solves (world knowledge) |
| Variant distinction (256GB vs 512GB) | Solves (numeric filter) | Solves (extracts storage) | Sometimes | Solves |
Each layer handles different failure modes. No single layer solves everything. The combination covers the full spectrum.
Cost structure
The funnel shape is the key economic insight. Each layer is more expensive than the last — but each layer processes dramatically fewer pairs.
For a job like the 10,000-record example above, the total AI cost is roughly $18. Without the SQL-style pre-filtering layers, embedding every candidate pair would cost over $1,200, and running LLM confirmation on every pair would cost over $60,000. The pre-filters don’t just save money — they make the approach economically viable.
A concrete example: matching phone catalogs
A mobile phone retailer needs to match their internal catalog (2,400 products) against a distributor’s catalog (3,100 products) to reconcile inventory and pricing. Here’s how the matching plays out on real product naming patterns.
| Retailer listing | Distributor listing | Matched by |
|---|---|---|
| Samsung Galaxy S24 Ultra 256GB Black | Samsung Galaxy S24 Ultra 256GB Black | Layer 1: exact string match |
| iPhone 15 Pro 128GB Blue Titanium | Apple iPhone 15 Pro 128 GB Blue Titanium | Layer 1: string pre-filter (high token overlap) |
| Galaxy S24+ 256GB Violet | Samsung Galaxy S24 Plus 256GB Lilac Purple | Layer 3: embedding similarity (Violet ≈ Lilac Purple, S24+ = S24 Plus) |
| Pixel 8 Pro 256GB Porcelain | Google Pixel 8 Pro 256GB Crema Cerámica | Layer 3: embedding similarity (cross-language, Porcelain ≈ Crema Cerámica) |
| iPhone 15 128GB Pink | Apple iPhone 15 128Go Rose | Layer 3: embedding similarity (Pink = Rose, 128GB = 128Go) |
| OnePlus 12 256GB Silky Black | OnePlus 12 256GB Noir Soyeux | Layer 3: embedding similarity (Silky Black = Noir Soyeux) |
| Motorola Edge 40 Neo 256GB | XT2335-3 Motorola 256GB | Layer 4: LLM confirmation (knows XT2335-3 = Edge 40 Neo) |
| Galaxy Z Flip5 256GB Mint | SM-F731B Samsung 256GB Green | Layer 4: LLM confirmation (knows SM-F731B = Z Flip5, Mint ≈ Green) |
Real-world phone catalog naming patterns. Each row shows a true match and which pipeline layer identified it.
Without the AI layers, the last six matches would all be missed. The retailer would think those products exist in one catalog but not the other — leading to incorrect inventory counts, missed pricing comparisons, and manual reconciliation work.
With the hybrid pipeline, all eight pairs are matched correctly. The easy ones are caught instantly by string filters. The color variants and cross-language names are caught by embeddings. The model-number-to-product-name connections are caught by LLM reasoning.
What this means for retail teams
You don’t need to build lookup tables
The traditional approach to color matching, abbreviation handling, and format normalization is lookup tables. Hundreds of lines mapping “Midnight” → “Black,” “Violet” → “Purple,” “Starlight” → “White.” These tables are expensive to build, expensive to maintain, and always incomplete.
AI extraction replaces lookup tables with a single instruction: “Extract the normalized color from this product description.” The model handles every color variant it’s ever seen in training data — including terms from next season that don’t exist yet in any lookup table.
You don’t need a translation step
If your supplier catalogs include Spanish, French, German, or Portuguese product names, you don’t need to run them through a translation API before matching. Multilingual embeddings handle cross-language matching natively. The pipeline matches “Audífonos Inalámbricos Bluetooth” to “Wireless Bluetooth Earbuds” without either record being translated.
Your SQL skills still matter
The pre-filtering layers of the pipeline are essentially SQL operations — exact match, LIKE patterns, numeric range comparisons. The skills your team already has are what make the AI layers affordable. Without good pre-filtering, the AI cost would be 50-100x higher.
The shift isn’t from SQL to AI. It’s from SQL alone to SQL plus AI — letting each technology handle what it’s best at.
The accuracy ceiling is real
If you’re matching retail products with traditional tools and getting 50-60% recall, that’s not a tuning problem. You’ve hit the ceiling of what string comparison can do on data with color variants, promotional naming, and cross-language descriptions. More regex rules, more lookup tables, and lower thresholds won’t close the gap — they’ll just add false positives.
Breaking through that ceiling requires understanding what the text means, not what characters it contains. That’s what embeddings and LLMs provide.
Match Data Studio runs this exact pipeline architecture. SQL-style string and numeric pre-filters eliminate the easy non-matches for free. AI attribute extraction normalizes color variants, promotional text, and specification formats. Multilingual embeddings handle cross-language matching. LLM confirmation resolves the genuinely ambiguous pairs. You configure the pipeline through a visual interface — no lookup tables, no regex libraries, no translation APIs.