An e-commerce operations team inherits 12,000 product images from a supplier migration. The files are named IMG_4021.jpg through IMG_16033.jpg. There’s a CSV mapping each filename to a product name and price, but no category, no brand field, no material, no color — just the product name the supplier typed in, which ranges from helpful (“Bosch 18V Impact Driver Kit”) to useless (“Tool Set”).

The team needs to match these products against their existing catalog of 45,000 items. The text fields alone aren’t enough — too many generic names, too little structure. But every one of those 12,000 images shows exactly what the product is. A human flipping through them could categorize each one in seconds.

The question is how to do that at scale.

What image categorization actually means

Image categorization is the process of assigning structured labels to images: category, subcategory, attributes, and a natural-language description. It turns a visual asset into a data row.

This is distinct from image recognition (identifying objects in a scene — “there is a dog and a bicycle”) and image search (finding visually similar images in a database). Categorization produces structured, queryable, matchable output from visual input.

The output of categorizing a product image might look like:

Category:     Power Tools
Subcategory:  Impact Drivers
Brand:        Bosch
Color:        Teal/black
Power source: Battery (18V)
Kit contents: Driver, 2 batteries, charger, case
Condition:    New, in original packaging

That structured output is functionally identical to a CSV row. Once an image has been categorized, it can be embedded, compared, filtered, and matched just like any text-based record.
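To make that concrete, here is a minimal sketch of treating one categorization result as a flat record and serializing it as CSV. The field names mirror the example above but are illustrative, not a fixed schema:

```python
import csv
import io

# Hypothetical categorization output for one product image.
record = {
    "image_file": "IMG_4021.jpg",
    "category": "Power Tools",
    "subcategory": "Impact Drivers",
    "brand": "Bosch",
    "color": "Teal/Black",
    "power_source": "Battery (18V)",
    "condition": "New, in original packaging",
}

# Once flattened, the image's attributes behave like any other CSV row:
# they can be filtered, joined, and matched against a text-based catalog.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

From here on, nothing downstream needs to know the row came from pixels rather than typed-in text.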

The manual bottleneck

Human categorization is accurate. It’s also prohibitively slow for any dataset beyond a few hundred images.

Image categorization throughput

Manual (human annotator)            ~80 images/hour with quality checks
Semi-automated (human + templates)  ~300 images/hour with pre-filled forms
AI multimodal (Gemini/GPT-4V)       ~2,400 images/hour with batch processing

AI throughput varies by image complexity and API rate limits.

At 80 images per hour, a human annotator needs 150 hours — nearly a month of full-time work — to categorize 12,000 images. And that assumes consistent quality from hour one to hour 150, which is unrealistic. Fatigue introduces errors and inconsistencies.
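The arithmetic above generalizes to any backlog size. A back-of-the-envelope sketch, using the illustrative rates from the table:

```python
# Time estimates for a 12,000-image backlog at each throughput tier.
IMAGES = 12_000
rates = {"manual": 80, "semi_automated": 300, "ai_multimodal": 2_400}

for method, per_hour in rates.items():
    hours = IMAGES / per_hour
    # 40-hour weeks; rates are illustrative, not benchmarks.
    print(f"{method}: {hours:.0f} hours ({hours / 40:.1f} work weeks)")
```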

The cost compounds when you consider that product catalogs aren’t static. New products arrive weekly. Suppliers change. Seasonal inventory rotates. Manual categorization is perpetually behind.

How AI categorization works

Modern multimodal AI models — Gemini, GPT-4V, Claude — can look at an image and produce structured output. They don’t just recognize objects; they understand context, read text in images, interpret visual attributes, and produce human-quality descriptions.

Given a product image of a blue ceramic mug, the model produces:

Category:     Kitchen & Dining
Subcategory:  Mugs & Cups
Material:     Ceramic (glazed)
Color:        Cobalt blue
Capacity:     ~12 oz (estimated from proportions)
Style:        Modern minimalist
Handle:       Single loop handle
Condition:    New
Description:  A cobalt blue glazed ceramic mug with a modern
              minimalist design, featuring a single loop handle
              and smooth matte exterior finish.

The key insight: this output is structured data extracted from visual input. The image has been transformed from an opaque binary file into a row of queryable attributes. Once you have this data, the image-as-a-file ceases to matter — the extracted attributes and descriptions are what flow into the matching pipeline.
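In practice, the step from "model reply" to "data row" needs a normalization layer: every image must yield the same fields, even when the model omits one or returns something unparseable. A minimal sketch, assuming the model is prompted to reply in JSON (the schema and function names here are illustrative, not Match Data Studio's API):

```python
import json

# Fixed attribute schema applied to every image; any field the model
# omits falls back to None so rows stay comparable across the dataset.
SCHEMA = ["category", "subcategory", "brand", "color", "material", "condition"]

def to_row(model_response: str) -> dict:
    """Normalize a model's JSON reply into one flat, schema-consistent row."""
    try:
        raw = json.loads(model_response)
    except json.JSONDecodeError:
        raw = {}  # unusable reply: emit an all-None row rather than crash
    return {field: raw.get(field) for field in SCHEMA}

# A well-formed reply fills the row; a partial one degrades gracefully.
print(to_row('{"category": "Kitchen & Dining", "color": "Cobalt blue"}'))
```

Because every row carries the same keys, the downstream matching pipeline never has to special-case missing attributes.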

Categorization at scale

When you’re categorizing thousands of images, the approach changes from one-off analysis to batch processing. The system needs to:

  1. Process images in parallel — send batches of images to the AI, don’t wait for each one sequentially
  2. Apply consistent schemas — every image gets the same set of attributes extracted, ensuring comparability
  3. Handle edge cases gracefully — blurry images, images with multiple products, stock photos vs. actual product photos
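The first two requirements can be sketched with a bounded worker pool. This is a hedged illustration: `categorize` is a stand-in for the real multimodal API call, and the pool size is a placeholder you would tune to your API's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def categorize(filename: str) -> dict:
    # Stand-in for one multimodal API request per image (e.g., a Gemini
    # call); here it just returns a placeholder row with a fixed schema.
    return {"image_file": filename, "category": "uncategorized"}

filenames = [f"IMG_{n}.jpg" for n in range(4021, 4031)]

# Requirement 1: process images in parallel rather than sequentially.
# A bounded pool keeps concurrency under the API's rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(categorize, filenames))

# Requirement 2: every result shares the same schema, so rows stay comparable.
assert all(row.keys() == rows[0].keys() for row in rows)
print(f"categorized {len(rows)} images")
```

Edge cases (requirement 3) slot into `categorize` itself: a blurry or multi-product image can return a flagged row instead of failing the whole batch.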
AI-categorized product images

Image file    Category     Subcategory       Brand          Color         Material
IMG_4021.jpg  Power Tools  Impact Drivers    Bosch          Teal/Black    Metal/Plastic
IMG_4022.jpg  Kitchen      Cookware Sets     Le Creuset     Flame Orange  Enameled cast iron
IMG_4023.jpg  Furniture    Office Chairs     Herman Miller  Graphite      Mesh/Aluminum
IMG_4024.jpg  Electronics  Wireless Earbuds  Apple          White         Plastic/Silicone
IMG_4025.jpg  Apparel      Running Shoes     Nike           Volt/Black    Mesh/Rubber

Each row was derived entirely from the visual content of the image; no text input was provided beyond the filename. The AI identified the brand from logos and design language, determined the material from visual texture, and assigned categories based on what the product actually is, not what a supplier chose to call it.

From categories to embeddings

Categorization produces structured attributes. But for matching, you also need a way to measure how similar two products are. This is where embeddings come in.

Once an image has been categorized and described in natural language, that description can be embedded into a vector — a numerical representation that captures semantic meaning. Two products with similar descriptions produce similar vectors, even if they use different words.

Consider two supplier photos of the same product:

  • Supplier A’s photo → AI description: “A teal and black Bosch 18V cordless impact driver with two batteries, charger, and hard carrying case”
  • Supplier B’s photo → AI description: “Bosch GDR 18V-210 C cordless impact wrench kit, turquoise body, includes battery pack and charger in molded case”

These descriptions use different words — “impact driver” vs “impact wrench,” “teal” vs “turquoise,” “hard carrying case” vs “molded case.” But when embedded, their vectors will be highly similar because they describe the same semantic content.

This is the bridge between visual data and the matching pipeline. The image becomes a description, the description becomes a vector, and the vector can be compared against every other vector in the opposing dataset using cosine similarity.
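The comparison step reduces to a short formula. A self-contained sketch of cosine similarity, using tiny 4-dimensional stand-ins for real embedding vectors (which typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the two "supplier" descriptions of the same impact driver
# point in nearly the same direction, so their similarity is near 1.0.
supplier_a = [0.82, 0.10, 0.55, 0.03]
supplier_b = [0.79, 0.14, 0.58, 0.05]
unrelated  = [0.05, 0.91, 0.02, 0.40]

print(round(cosine_similarity(supplier_a, supplier_b), 3))  # near 1.0
print(round(cosine_similarity(supplier_a, unrelated), 3))   # much lower
```

In production the vectors come from an embedding model rather than being hand-written, but the comparison is exactly this operation, run against every vector in the opposing dataset.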

Practical applications

Image categorization isn’t limited to product matching. Any domain where images contain structured information that needs to be extracted and compared benefits from this approach.

Product catalog enrichment. A retailer receives supplier product photos with minimal metadata. AI categorization fills in brand, category, material, and color — turning a folder of photos into a structured catalog ready for matching against the retailer’s existing inventory.

Real estate. Listing photos are categorized by room type (kitchen, bedroom, bathroom), condition (renovated, original, needs work), and features (hardwood floors, granite counters, stainless appliances). These visual attributes help match the same property across different listing platforms where the text descriptions differ.

Insurance. Claim photos are categorized by damage type (water, fire, impact, wear), severity (minor, moderate, severe), and affected components (roof, siding, foundation). Structured damage data enables matching against policy records and historical claims.

Inventory and asset management. Warehouse photos are categorized by product type, condition, and storage requirements. Visual categorization catches misplacements — a product on the wrong shelf — that barcode-only systems miss.

How this works in Match Data Studio

The workflow integrates image categorization directly into the matching pipeline:

  1. Upload your CSV with a column containing image filenames (e.g., product_image).
  2. Upload the images to your project through the Project Files panel — drag and drop, or upload a ZIP.
  3. Mark the image column as “file” type in the Type Definitions stage. This tells the system that IMG_4021.jpg is a reference to an actual file, not literal text.
  4. Create AI enrichment rules that reference the file column. Example: “Analyze this product image and extract category, subcategory, brand, color, material, and condition.”
  5. Add the file column to embeddings. The system automatically generates a detailed text description of each image and embeds it. You just set the similarity threshold.
  6. Run the pipeline. The extracted attributes become regular text columns that feed into string matching, embedding comparison, and LLM confirmation.

The image files stay in your project storage. They are uploaded to Gemini only transiently for AI processing, and it is the extracted text attributes and embedding vectors, not the images, that flow through the rest of the pipeline.

For a deeper dive into extracting specific product attributes from images — beyond broad categorization — see our guide on attribute extraction from product images.


Your images are data, not just decoration. Turn them into structured, matchable records.

Start categorizing →

