Product data annotation: why your catalog needs more attributes before matching works
Product matching accuracy depends on attribute richness. Sparse product data produces weak matches. Here’s how to annotate product catalogs — manually and with AI — to make matching reliable.
A retailer acquires a competitor and needs to merge the product catalogs. They export both as CSVs: 12,000 products from their system, 8,400 from the acquisition. They run a matching job on product name and price.
The results are a mess. “Blue T-Shirt Size M” from catalog A matches six different products in catalog B because half their apparel has equally generic names. “Wireless Headphones” matches eleven times. Meanwhile, identical items listed under genuinely different names — “Men’s Crew Neck Tee (Navy, M)” vs. “Blue T-Shirt Size M” — don’t match at all.
The problem isn’t the matching algorithm. The problem is the data. When your product records have three fields — name, price, category — the algorithm has three signals to work with. That’s not enough to distinguish thousands of products in the same category and price range.
What product data annotation means
Product data annotation is the process of adding structured attributes to product records. Every attribute you add gives the matching algorithm another signal to compare on.
A minimal product record looks like this:
| Field | Value |
|---|---|
| Name | Blue T-Shirt Size M |
| Price | $24.99 |
| Category | Apparel |
A well-annotated product record looks like this:
| Field | Value |
|---|---|
| Name | Blue T-Shirt Size M |
| Brand | Hanes |
| Material | 100% cotton, jersey knit |
| Color | Navy blue |
| Size | M |
| Fit | Regular |
| Neckline | Crew neck |
| Sleeve | Short sleeve |
| UPC | 038257364118 |
| Price | $24.99 |
| Category | Men’s > Tops > T-Shirts |
The first record gives a matching algorithm 3 signals. The second gives it 11. When the algorithm can compare brand, material, color, size, fit, neckline, and sleeve length, it can confidently distinguish “Blue T-Shirt Size M” (Hanes navy cotton crew neck) from “Blue T-Shirt Size M” (Fruit of the Loom royal blue polyester V-neck).
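To make the difference concrete, here is a simplified sketch (in Python, with illustrative records and a deliberately naive comparison) of how extra attributes push apart two products that look identical on name and price alone:

```python
# Simplified sketch: more shared attributes give a matcher more signals
# to agree or disagree on. Records and the scoring rule are illustrative.

def attribute_overlap(a: dict, b: dict) -> float:
    """Fraction of shared fields whose normalized values agree."""
    shared = [k for k in a if k in b]
    if not shared:
        return 0.0
    agree = sum(
        1 for k in shared
        if str(a[k]).strip().lower() == str(b[k]).strip().lower()
    )
    return agree / len(shared)

sparse_a = {"name": "Blue T-Shirt Size M", "price": "24.99"}
sparse_b = {"name": "Blue T-Shirt Size M", "price": "24.99"}

rich_a = {**sparse_a, "brand": "Hanes", "material": "cotton",
          "neckline": "crew", "color": "navy"}
rich_b = {**sparse_b, "brand": "Fruit of the Loom", "material": "polyester",
          "neckline": "v-neck", "color": "royal blue"}

print(attribute_overlap(sparse_a, sparse_b))  # 1.0 -- a false positive
print(attribute_overlap(rich_a, rich_b))      # ~0.33 -- correctly pushed apart
```

A real matcher weights fields and normalizes values, but the principle holds: every attribute that agrees or disagrees is one more piece of evidence.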
The annotation quality spectrum
| Attribute | Minimal | Moderate | Rich |
|---|---|---|---|
| Name | Wireless Headphones | Sony WH-1000XM5 Wireless | Sony WH-1000XM5 Wireless NC Headphones |
| Brand | — | Sony | Sony |
| Model | — | — | WH-1000XM5 |
| Category | Electronics | Audio > Headphones | Audio > Headphones > Over-ear > Noise-canceling |
| Color | — | Black | Black (matte finish) |
| Connectivity | — | Bluetooth | Bluetooth 5.2, 3.5mm aux |
| Features | — | — | ANC, 30hr battery, multipoint, LDAC |
| Weight | — | — | 250g |
| Price | $298 | $298 | $298 |
| Matching confidence | 62% | 81% | 96% |
Matching confidence represents how reliably this record can be distinguished from similar products in the same category.
At the minimal level, “Wireless Headphones” at $298 could be any of a dozen products. At the moderate level, brand and category narrow it to a handful. At the rich level, the model number alone is nearly unique — and the supporting attributes confirm it.
The relationship between annotation depth and matching quality isn’t linear. The first few attributes (brand, model, category) provide the largest jump. Additional attributes (weight, connectivity, features) provide diminishing but still meaningful improvements, especially for distinguishing variants and similar models.
Manual annotation doesn’t scale
The obvious approach is to hire someone to fill in the missing attributes. And for small catalogs, it works.
At 50-100 products per annotator per day (a realistic rate for thorough annotation with quality checks), a catalog of 25,000 products keeps a team of three busy for four to eight months of workdays. By the time they finish, hundreds of products have changed, been discontinued, or been added. The annotation is never truly “done.”
Manual annotation also introduces consistency problems. Annotator A writes “Navy blue.” Annotator B writes “Dark navy.” Annotator C writes “Blue (navy).” These are the same color described three different ways, and they won’t match unless you add another normalization step.
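That normalization step can be as simple as a canonical lookup table. The synonym map below is an illustrative sketch; a real catalog needs one built from its own vocabulary:

```python
# Sketch of a normalization pass for free-text colors. The synonym map
# is illustrative, not exhaustive.

COLOR_CANON = {
    "navy blue": "navy",
    "dark navy": "navy",
    "blue (navy)": "navy",
    "royal": "royal blue",
}

def normalize_color(raw: str) -> str:
    key = raw.strip().lower()
    return COLOR_CANON.get(key, key)

# All three annotators' spellings collapse to one canonical value.
assert normalize_color("Navy blue") == normalize_color("Blue (navy)") == "navy"
```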
AI-powered annotation from text
Language models can read a product name and description and extract structured attributes automatically. Given the input “Sony WH-1000XM5 Wireless Noise Canceling Bluetooth Over-Ear Headphones, Black,” a model can reliably extract:
- Brand: Sony
- Model: WH-1000XM5
- Type: Over-ear headphones
- Features: Wireless, noise canceling, Bluetooth
- Color: Black
This works well when the product name or description is detailed. The AI is parsing structured information that’s already present in the text — just not in separate fields.
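As a rough sketch of what that extraction call can look like, using the Gemini Python SDK (the model name, prompt wording, and JSON schema are assumptions, not a prescribed setup):

```python
import json
import google.generativeai as genai

# Sketch: extract structured attributes from a product listing's text.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

PROMPT = """Extract product attributes as JSON with keys:
brand, model, type, features (list), color. Use null for anything
not stated in the text. Text: {listing}"""

listing = ("Sony WH-1000XM5 Wireless Noise Canceling Bluetooth "
           "Over-Ear Headphones, Black")
response = model.generate_content(
    PROMPT.format(listing=listing),
    generation_config={"response_mime_type": "application/json"},
)
attributes = json.loads(response.text)
print(attributes["brand"], attributes["model"])  # e.g. Sony WH-1000XM5
```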
But here’s the limitation: AI can only extract what the text contains. If the product listing says “Blue T-Shirt Size M” and nothing else, the AI can extract color (blue), size (M), and type (t-shirt). It cannot extract brand, material, neckline, sleeve length, or fit — because that information simply isn’t in the text.
When the product image tells you what the text doesn’t
This is where product images change the equation.
A product photo contains information that text listings routinely omit:
- Brand — logos on the product, labels, packaging
- Material — visual texture reveals cotton vs. polyester, matte vs. glossy, wood vs. laminate
- Color accuracy — “blue” in text could be navy, royal, baby blue, or teal; the photo shows the exact shade
- Design details — button count, zipper style, stitching pattern, hardware finish
- Condition — new in packaging, used, refurbished, damaged
- Size and proportions — relative to other objects in the image
| Attribute | From text only | From text + image |
|---|---|---|
| Brand | Sometimes (if in name) | Almost always (logo visible) |
| Exact color | Approximate (“blue”) | Precise (navy matte) |
| Material | Rarely | Usually (visual texture) |
| Condition | If explicitly stated | Visible (packaging, wear) |
| Design details | Almost never | Visible (buttons, stitching) |
| Neckline/fit | Sometimes | Always visible |
| Accessories included | If listed | Visible in photo |
| Packaging type | Rarely | Visible (box, bag, loose) |
Image-based extraction is most valuable for attributes that sellers don’t bother typing into text fields.
A product listing that says “Blue T-Shirt Size M” and includes a product photo can be annotated by AI as: brand Hanes (logo on collar tag), material cotton jersey (visible knit texture), color navy (precise shade from image), neckline crew (visible), sleeve short (visible), fit regular (proportions), condition new with tags (tag visible).
That’s seven additional attributes extracted from a single image — attributes that the text listing didn’t provide and a text-only AI couldn’t infer.
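A minimal sketch of that image pass, again using the Gemini SDK (the model name, prompt wording, and file path are assumptions), merging image attributes only into the gaps the text left empty:

```python
import json
from PIL import Image
import google.generativeai as genai

# Sketch: fill attribute gaps from a product photo. The attribute list
# mirrors the t-shirt example above.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

PROMPT = """From this product photo, return JSON with keys: brand,
material, color, neckline, sleeve, fit, condition. Use null for
anything you cannot see in the image."""

photo = Image.open("listings/blue-tshirt-m.jpg")  # hypothetical path
response = model.generate_content(
    [PROMPT, photo],
    generation_config={"response_mime_type": "application/json"},
)
image_attrs = json.loads(response.text)

# Merge: image attributes fill only fields the text left empty
# (missing keys count as gaps, so new attributes are added too).
record = {"name": "Blue T-Shirt Size M", "color": None, "brand": None}
record.update({k: v for k, v in image_attrs.items() if record.get(k) is None})
```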
Building an annotation pipeline
The practical approach combines text and image annotation in stages:
Stage 1: Extract what the text already contains. Use AI completions to parse brand, model, category, and basic attributes from product names and descriptions. This is fast and high-confidence.
Stage 2: Fill gaps from product images. For attributes the text doesn’t provide — material, exact color, condition, design details — use file-based AI extraction. Upload the product photos, mark the image column as a “file” type, and create extraction rules that reference the images.
Stage 3: Embed the enriched records. With 10-12 attributes per product instead of 3, embedding similarity becomes much more discriminating. “Navy cotton crew-neck short-sleeve t-shirt by Hanes” embeds very differently from “Royal blue polyester V-neck tank top by Fruit of the Loom” — even though both started as “Blue T-Shirt Size M” in the original data.
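A sketch of this stage with an open-source embedding model (sentence-transformers here is an assumption; any text-embedding API behaves the same way):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch of Stage 3: serialize enriched records and compare embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def to_text(record: dict) -> str:
    # Flatten the enriched attributes into one string for embedding.
    return "; ".join(f"{k}: {v}" for k, v in record.items() if v)

enriched_a = to_text({"type": "t-shirt", "brand": "Hanes", "color": "navy",
                      "material": "cotton", "neckline": "crew"})
enriched_b = to_text({"type": "tank top", "brand": "Fruit of the Loom",
                      "color": "royal blue", "material": "polyester",
                      "neckline": "v-neck"})

vec_a, vec_b = model.encode([enriched_a, enriched_b])
cosine = float(np.dot(vec_a, vec_b) /
               (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"similarity: {cosine:.2f}")  # far below a typical match threshold
```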
Stage 4: Match with confidence. The enriched, embedded records produce matches that are accurate enough to act on without manual review for the clear cases, and focused enough for efficient human review of the borderline cases.
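In code, that split is a pair of thresholds. The values below are illustrative and should be tuned against a labeled sample of your own catalog:

```python
# Sketch of Stage 4: route candidate pairs by similarity score.
# Thresholds are illustrative, not recommended defaults.

AUTO_ACCEPT = 0.92
NEEDS_REVIEW = 0.75

def route(score: float) -> str:
    if score >= AUTO_ACCEPT:
        return "accept"        # clear case: act without manual review
    if score >= NEEDS_REVIEW:
        return "human_review"  # borderline case: queue for a reviewer
    return "reject"

for s in (0.96, 0.81, 0.62):
    print(s, route(s))
```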
In Match Data Studio, this pipeline runs in a single project. Upload your CSVs, upload your product images, configure the extraction rules with the AI assistant, and let the pipeline handle the rest. The images never leave your project: Gemini processes them for attribute extraction, and the extracted text attributes then flow through the standard matching pipeline.
For a deeper look at how image categorization works at scale, see our guide on image categorization for product matching. And for the full technical walkthrough of extracting specific attributes from product photos, see extracting matchable attributes from product images.
Your product data is richer than your spreadsheet suggests. Add images to the matching pipeline and let AI extract the attributes your text fields are missing.
Start annotating your catalog →
Keep reading
- Extracting matchable attributes from product images — let AI pull attributes your text fields are missing
- Image categorization at scale — from folders of photos to structured, matchable data
- Data cleaning before matching — the broader prep checklist for any matching project