Product data annotation: why your catalog needs more attributes before matching works
Product matching accuracy depends on attribute richness. Sparse product data produces weak matches. Here’s how to annotate product catalogs — manually and with AI — to make matching reliable.
A retailer acquires a competitor and needs to merge the product catalogs. They export both as CSVs: 12,000 products from their system, 8,400 from the acquisition. They run a matching job on product name and price.
The results are a mess. “Blue T-Shirt Size M” from catalog A matches six different products in catalog B because half their apparel has equally generic names. “Wireless Headphones” matches eleven times. Meanwhile, identical items listed under genuinely different names — “Men’s Crew Neck Tee (Navy, M)” vs. “Blue T-Shirt Size M” — don’t match at all.
The problem isn’t the matching algorithm. The problem is the data. When your product records have three fields — name, price, category — the algorithm has three signals to work with. That’s not enough to distinguish thousands of products in the same category and price range.
What product data annotation means
Product data annotation is the process of adding structured attributes to product records. Every attribute you add gives the matching algorithm another signal to compare on.
A minimal product record looks like this:
| Field | Value |
|---|---|
| Name | Blue T-Shirt Size M |
| Price | $24.99 |
| Category | Apparel |
A well-annotated product record looks like this:
| Field | Value |
|---|---|
| Name | Blue T-Shirt Size M |
| Brand | Hanes |
| Material | 100% cotton, jersey knit |
| Color | Navy blue |
| Size | M |
| Fit | Regular |
| Neckline | Crew neck |
| Sleeve | Short sleeve |
| UPC | 038257364118 |
| Price | $24.99 |
| Category | Men’s > Tops > T-Shirts |
The first record gives a matching algorithm 3 signals. The second gives it 11. When the algorithm can compare brand, material, color, size, fit, neckline, and sleeve length, it can confidently distinguish “Blue T-Shirt Size M” (Hanes navy cotton crew neck) from “Blue T-Shirt Size M” (Fruit of the Loom royal blue polyester V-neck).
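To make the difference concrete, here is a simplified sketch (in Python, with illustrative records and a deliberately naive comparison) of how extra attributes push apart two products that look identical on name and price alone:

```python
# Simplified sketch: more shared attributes give a matcher more signals
# to agree or disagree on. Records and the scoring rule are illustrative.

def attribute_overlap(a: dict, b: dict) -> float:
    """Fraction of shared fields whose normalized values agree."""
    shared = [k for k in a if k in b]
    if not shared:
        return 0.0
    agree = sum(
        1 for k in shared
        if str(a[k]).strip().lower() == str(b[k]).strip().lower()
    )
    return agree / len(shared)

sparse_a = {"name": "Blue T-Shirt Size M", "price": "24.99"}
sparse_b = {"name": "Blue T-Shirt Size M", "price": "24.99"}

rich_a = {**sparse_a, "brand": "Hanes", "material": "cotton",
          "neckline": "crew", "color": "navy"}
rich_b = {**sparse_b, "brand": "Fruit of the Loom", "material": "polyester",
          "neckline": "v-neck", "color": "royal blue"}

print(attribute_overlap(sparse_a, sparse_b))  # 1.0 -- a false positive
print(attribute_overlap(rich_a, rich_b))      # ~0.33 -- correctly pushed apart
```

A real matcher weights fields and normalizes values, but the principle holds: every attribute that agrees or disagrees is one more piece of evidence.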
The annotation quality spectrum
| Attribute | Minimal | Moderate | Rich |
|---|---|---|---|
| Name | Wireless Headphones | Sony WH-1000XM5 Wireless | Sony WH-1000XM5 Wireless NC Headphones |
| Brand | — | Sony | Sony |
| Model | — | — | WH-1000XM5 |
| Category | Electronics | Audio > Headphones | Audio > Headphones > Over-ear > Noise-canceling |
| Color | — | Black | Black (matte finish) |
| Connectivity | — | Bluetooth | Bluetooth 5.2, 3.5mm aux |
| Features | — | — | ANC, 30hr battery, multipoint, LDAC |
| Weight | — | — | 250g |
| Price | $298 | $298 | $298 |
| Matching confidence | 62% | 81% | 96% |
Matching confidence represents how reliably this record can be distinguished from similar products in the same category.
At the minimal level, “Wireless Headphones” at $298 could be any of a dozen products. At the moderate level, brand and category narrow it to a handful. At the rich level, the model number alone is nearly unique — and the supporting attributes confirm it.
The relationship between annotation depth and matching quality isn’t linear. The first few attributes (brand, model, category) provide the largest jump. Additional attributes (weight, connectivity, features) provide diminishing but still meaningful improvements, especially for distinguishing variants and similar models.
Manual annotation doesn’t scale
The obvious approach is to hire someone to fill in the missing attributes. And for small catalogs, it works.
At 50-100 products per annotator per day (a realistic rate for thorough annotation with quality checks), a catalog of 25,000 products keeps a team of three busy for four to eight months of workdays. By the time they finish, hundreds of products have changed, been discontinued, or been added. The annotation is never truly “done.”
Manual annotation also introduces consistency problems. Annotator A writes “Navy blue.” Annotator B writes “Dark navy.” Annotator C writes “Blue (navy).” These are the same color described three different ways, and they won’t match unless you add another normalization step.
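That normalization step can be as simple as a canonical lookup table. The synonym map below is an illustrative sketch; a real catalog needs one built from its own vocabulary:

```python
# Sketch of a normalization pass for free-text colors. The synonym map
# is illustrative, not exhaustive.

COLOR_CANON = {
    "navy blue": "navy",
    "dark navy": "navy",
    "blue (navy)": "navy",
    "royal": "royal blue",
}

def normalize_color(raw: str) -> str:
    key = raw.strip().lower()
    return COLOR_CANON.get(key, key)

# All three annotators' spellings collapse to one canonical value.
assert normalize_color("Navy blue") == normalize_color("Blue (navy)") == "navy"
```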
AI-powered annotation from text
Language models can read a product name and description and extract structured attributes automatically. Given the input “Sony WH-1000XM5 Wireless Noise Canceling Bluetooth Over-Ear Headphones, Black,” a model can reliably extract:
- Brand: Sony
- Model: WH-1000XM5
- Type: Over-ear headphones
- Features: Wireless, noise canceling, Bluetooth
- Color: Black
This works well when the product name or description is detailed. The AI is parsing structured information that’s already present in the text — just not in separate fields.
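As a rough sketch of what that extraction call can look like, using the Gemini Python SDK (the model name, prompt wording, and JSON schema are assumptions, not a prescribed setup):

```python
import json
import google.generativeai as genai

# Sketch: extract structured attributes from a product listing's text.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

PROMPT = """Extract product attributes as JSON with keys:
brand, model, type, features (list), color. Use null for anything
not stated in the text. Text: {listing}"""

listing = ("Sony WH-1000XM5 Wireless Noise Canceling Bluetooth "
           "Over-Ear Headphones, Black")
response = model.generate_content(
    PROMPT.format(listing=listing),
    generation_config={"response_mime_type": "application/json"},
)
attributes = json.loads(response.text)
print(attributes["brand"], attributes["model"])  # e.g. Sony WH-1000XM5
```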
But here’s the limitation: AI can only extract what the text contains. If the product listing says “Blue T-Shirt Size M” and nothing else, the AI can extract color (blue), size (M), and type (t-shirt). It cannot extract brand, material, neckline, sleeve length, or fit — because that information simply isn’t in the text.
When the product image tells you what the text doesn’t
This is where product images change the equation.
A product photo contains information that text listings routinely omit:
- Brand — logos on the product, labels, packaging
- Material — visual texture reveals cotton vs. polyester, matte vs. glossy, wood vs. laminate
- Color accuracy — “blue” in text could be navy, royal, baby blue, or teal; the photo shows the exact shade
- Design details — button count, zipper style, stitching pattern, hardware finish
- Condition — new in packaging, used, refurbished, damaged
- Size and proportions — relative to other objects in the image
| Attribute | From text only | From text + image |
|---|---|---|
| Brand | Sometimes (if in name) | Almost always (logo visible) |
| Exact color | Approximate (“blue”) | Precise (navy matte) |
| Material | Rarely | Usually (visual texture) |
| Condition | If explicitly stated | Visible (packaging, wear) |
| Design details | Almost never | Visible (buttons, stitching) |
| Neckline/fit | Sometimes | Always visible |
| Accessories included | If listed | Visible in photo |
| Packaging type | Rarely | Visible (box, bag, loose) |
Image-based extraction is most valuable for attributes that sellers don’t bother typing into text fields.
A product listing that says “Blue T-Shirt Size M” and includes a product photo can be annotated by AI as: brand Hanes (logo on collar tag), material cotton jersey (visible knit texture), color navy (precise shade from image), neckline crew (visible), sleeve short (visible), fit regular (proportions), condition new with tags (tag visible).
That’s seven additional attributes extracted from a single image — attributes that the text listing didn’t provide and a text-only AI couldn’t infer.
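A minimal sketch of that image pass, again using the Gemini SDK (the model name, prompt wording, and file path are assumptions), merging image attributes only into the gaps the text left empty:

```python
import json
from PIL import Image
import google.generativeai as genai

# Sketch: fill attribute gaps from a product photo. The attribute list
# mirrors the t-shirt example above.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

PROMPT = """From this product photo, return JSON with keys: brand,
material, color, neckline, sleeve, fit, condition. Use null for
anything you cannot see in the image."""

photo = Image.open("listings/blue-tshirt-m.jpg")  # hypothetical path
response = model.generate_content(
    [PROMPT, photo],
    generation_config={"response_mime_type": "application/json"},
)
image_attrs = json.loads(response.text)

# Merge: image attributes fill only fields the text left empty
# (missing keys count as gaps, so new attributes are added too).
record = {"name": "Blue T-Shirt Size M", "color": None, "brand": None}
record.update({k: v for k, v in image_attrs.items() if record.get(k) is None})
```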
Building an annotation pipeline
The practical approach combines text and image annotation in stages:
Stage 1: Extract what the text already contains. Use AI completions to parse brand, model, category, and basic attributes from product names and descriptions. This is fast and high-confidence.
Stage 2: Fill gaps from product images. For attributes the text doesn’t provide — material, exact color, condition, design details — use file-based AI extraction. Upload the product photos, mark the image column as a “file” type, and create extraction rules that reference the images.
Stage 3: Embed the enriched records. With 10-12 attributes per product instead of 3, embedding similarity becomes much more discriminating. “Navy cotton crew-neck short-sleeve t-shirt by Hanes” embeds very differently from “Royal blue polyester V-neck tank top by Fruit of the Loom” — even though both started as “Blue T-Shirt Size M” in the original data.
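A sketch of this stage with an open-source embedding model (sentence-transformers here is an assumption; any text-embedding API behaves the same way):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch of Stage 3: serialize enriched records and compare embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def to_text(record: dict) -> str:
    # Flatten the enriched attributes into one string for embedding.
    return "; ".join(f"{k}: {v}" for k, v in record.items() if v)

enriched_a = to_text({"type": "t-shirt", "brand": "Hanes", "color": "navy",
                      "material": "cotton", "neckline": "crew"})
enriched_b = to_text({"type": "tank top", "brand": "Fruit of the Loom",
                      "color": "royal blue", "material": "polyester",
                      "neckline": "v-neck"})

vec_a, vec_b = model.encode([enriched_a, enriched_b])
cosine = float(np.dot(vec_a, vec_b) /
               (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"similarity: {cosine:.2f}")  # far below a typical match threshold
```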
Stage 4: Match with confidence. The enriched, embedded records produce matches that are accurate enough to act on without manual review for the clear cases, and focused enough for efficient human review of the borderline cases.
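In code, that split is a pair of thresholds. The values below are illustrative and should be tuned against a labeled sample of your own catalog:

```python
# Sketch of Stage 4: route candidate pairs by similarity score.
# Thresholds are illustrative, not recommended defaults.

AUTO_ACCEPT = 0.92
NEEDS_REVIEW = 0.75

def route(score: float) -> str:
    if score >= AUTO_ACCEPT:
        return "accept"        # clear case: act without manual review
    if score >= NEEDS_REVIEW:
        return "human_review"  # borderline case: queue for a reviewer
    return "reject"

for s in (0.96, 0.81, 0.62):
    print(s, route(s))
```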
In Match Data Studio, this pipeline runs in a single project. Upload your CSVs, upload your product images, configure the extraction rules with the AI assistant, and let the pipeline handle the rest. The images never leave your project: Gemini processes them for attribute extraction, and the extracted text attributes then flow through the standard matching pipeline.
For a deeper look at how image categorization works at scale, see our guide on image categorization for product matching. And for the full technical walkthrough of extracting specific attributes from product photos, see extracting matchable attributes from product images.
Your product data is richer than your spreadsheet suggests. Add images to the matching pipeline and let AI extract the attributes your text fields are missing.
Start annotating your catalog →
Keep reading
- Extracting matchable attributes from product images — let AI pull attributes your text fields are missing
- Image categorization at scale — from folders of photos to structured, matchable data
- Data cleaning before matching — the broader prep checklist for any matching project