Image categorization at scale: from folders of photos to structured data
Thousands of images sitting in folders with meaningless filenames. AI image categorization extracts structured labels, categories, and descriptions — turning visual assets into matchable data.
An e-commerce operations team inherits 12,000 product images from a supplier migration. The files are named IMG_4021.jpg through IMG_16033.jpg. There’s a CSV mapping each filename to a product name and price, but no category, no brand field, no material, no color — just the product name the supplier typed in, which ranges from helpful (“Bosch 18V Impact Driver Kit”) to useless (“Tool Set”).
The team needs to match these products against their existing catalog of 45,000 items. The text fields alone aren’t enough — too many generic names, too little structure. But every one of those 12,000 images shows exactly what the product is. A human flipping through them could categorize each one in seconds.
The question is how to do that at scale.
What image categorization actually means
Image categorization is the process of assigning structured labels to images: category, subcategory, attributes, and a natural-language description. It turns a visual asset into a data row.
This is distinct from image recognition (identifying objects in a scene — “there is a dog and a bicycle”) and image search (finding visually similar images in a database). Categorization produces structured, queryable, matchable output from visual input.
The output of categorizing a product image might look like:
Category: Power Tools
Subcategory: Impact Drivers
Brand: Bosch
Color: Teal/black
Power source: Battery (18V)
Kit contents: Driver, 2 batteries, charger, case
Condition: New, in original packaging
That structured output is functionally identical to a CSV row. Once an image has been categorized, it can be embedded, compared, filtered, and matched just like any text-based record.
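To make that concrete, here is a minimal sketch of how a categorization result maps onto a CSV row. The field names and values are illustrative, mirroring the impact-driver example above:

```python
import csv
import io

# Hypothetical categorization result for the impact-driver image above.
record = {
    "category": "Power Tools",
    "subcategory": "Impact Drivers",
    "brand": "Bosch",
    "color": "Teal/black",
    "power_source": "Battery (18V)",
    "condition": "New, in original packaging",
}

# Write it out as a CSV row -- at this point the image is ordinary tabular data.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```

Anything downstream that consumes CSV rows, filters, joins, embedding pipelines, can now consume the image's attributes the same way.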
The manual bottleneck
Human categorization is accurate. It’s also prohibitively slow for any dataset beyond a few hundred images.
At 80 images per hour, a human annotator needs 150 hours — nearly a month of full-time work — to categorize 12,000 images. And that assumes consistent quality from hour one to hour 150, which is unrealistic. Fatigue introduces errors and inconsistencies.
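The throughput estimate above can be checked directly:

```python
# Back-of-the-envelope check on the manual-annotation estimate.
images = 12_000
rate_per_hour = 80          # a steady human annotation pace

hours = images / rate_per_hour
weeks = hours / 40          # assuming 40-hour work weeks

print(f"{hours:.0f} hours, {weeks:.2f} full-time weeks")
```

And that is for a single pass over a static dataset, with no rework.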
The cost compounds when you consider that product catalogs aren’t static. New products arrive weekly. Suppliers change. Seasonal inventory rotates. Manual categorization is perpetually behind.
How AI categorization works
Modern multimodal AI models — Gemini, GPT-4V, Claude — can look at an image and produce structured output. They don’t just recognize objects; they understand context, read text in images, interpret visual attributes, and produce human-quality descriptions.
Given a product image of a blue ceramic mug, the model produces:
Category: Kitchen & Dining
Subcategory: Mugs & Cups
Material: Ceramic (glazed)
Color: Cobalt blue
Capacity: ~12 oz (estimated from proportions)
Style: Modern minimalist
Handle: Single loop handle
Condition: New
Description: A cobalt blue glazed ceramic mug with a modern minimalist design, featuring a single loop handle and smooth matte exterior finish.
The key insight: this output is structured data extracted from visual input. The image has been transformed from an opaque binary file into a row of queryable attributes. Once you have this data, the image-as-a-file ceases to matter — the extracted attributes and descriptions are what flow into the matching pipeline.
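In practice, you typically prompt the model to return JSON and parse the reply into a row. A minimal sketch, where the model_reply string stands in for an actual multimodal API response (the prompt and schema are illustrative):

```python
import json

# Stand-in for a multimodal model's reply. A real call would send the image
# plus a prompt like "Return JSON with category, subcategory, material, ..."
model_reply = """
{
  "category": "Kitchen & Dining",
  "subcategory": "Mugs & Cups",
  "material": "Ceramic (glazed)",
  "color": "Cobalt blue",
  "description": "A cobalt blue glazed ceramic mug with a single loop handle."
}
"""

# Opaque image -> queryable attributes.
row = json.loads(model_reply)
print(row["category"], "/", row["subcategory"])
```

From here on, the pipeline only ever touches the parsed row, not the image file.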
Categorization at scale
When you’re categorizing thousands of images, the approach changes from one-off analysis to batch processing. The system needs to:
- Process images in parallel — send batches of images to the AI, don’t wait for each one sequentially
- Apply consistent schemas — every image gets the same set of attributes extracted, ensuring comparability
- Handle edge cases gracefully — blurry images, images with multiple products, stock photos vs. actual product photos
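The three requirements above can be sketched as a small batch loop. The stubbed categorize() function stands in for the real model call, and the schema is illustrative; a production version would add retries, rate limiting, and edge-case handling:

```python
from concurrent.futures import ThreadPoolExecutor

SCHEMA = ("category", "subcategory", "brand", "color", "material")

def categorize(filename: str) -> dict:
    """Stub for the per-image model call. Always returns the same schema,
    so every image yields a comparable row."""
    # A real implementation would upload the image and parse the model's
    # JSON reply here, falling back to None for unreadable images.
    return {"file": filename, **{field: None for field in SCHEMA}}

filenames = [f"IMG_{i}.jpg" for i in range(4021, 4031)]

# Model calls are I/O-bound, so a thread pool gives easy parallelism.
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(categorize, filenames))

print(len(rows))  # one row per image, all with identical columns
```

The consistent schema is what makes the output below possible: every image, no matter how different, lands in the same table.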
| Image file | Category | Subcategory | Brand | Color | Material |
|---|---|---|---|---|---|
| IMG_4021.jpg | Power Tools | Impact Drivers | Bosch | Teal/Black | Metal/Plastic |
| IMG_4022.jpg | Kitchen | Cookware Sets | Le Creuset | Flame Orange | Enameled cast iron |
| IMG_4023.jpg | Furniture | Office Chairs | Herman Miller | Graphite | Mesh/Aluminum |
| IMG_4024.jpg | Electronics | Wireless Earbuds | Apple | White | Plastic/Silicone |
| IMG_4025.jpg | Apparel | Running Shoes | Nike | Volt/Black | Mesh/Rubber |
Each row in that table was derived entirely from the visual content of the image — no text input beyond the filename. The AI identified the brand from logos and design language, determined the material from visual texture, and assigned categories based on what the product actually is, not what a supplier chose to call it.
From categories to embeddings
Categorization produces structured attributes. But for matching, you also need a way to measure how similar two products are. This is where embeddings come in.
Once an image has been categorized and described in natural language, that description can be embedded into a vector — a numerical representation that captures semantic meaning. Two products with similar descriptions produce similar vectors, even if they use different words.
Consider two supplier photos of the same product:
- Supplier A’s photo → AI description: “A teal and black Bosch 18V cordless impact driver with two batteries, charger, and hard carrying case”
- Supplier B’s photo → AI description: “Bosch GDR 18V-210 C cordless impact wrench kit, turquoise body, includes battery pack and charger in molded case”
These descriptions use different words — “impact driver” vs “impact wrench,” “teal” vs “turquoise,” “hard carrying case” vs “molded case.” But when embedded, their vectors will be highly similar because they describe the same semantic content.
This is the bridge between visual data and the matching pipeline. The image becomes a description, the description becomes a vector, and the vector can be compared against every other vector in the opposing dataset using cosine similarity.
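Cosine similarity itself is a short computation. Here it is with toy 3-dimensional vectors standing in for real embeddings, which typically have hundreds of dimensions; the vector values are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy vectors: the two Bosch driver descriptions would embed to nearby
# points, while an unrelated product lands far away.
driver_a = [0.9, 0.1, 0.2]
driver_b = [0.85, 0.15, 0.25]
mug      = [0.1, 0.9, 0.3]

print(cosine_similarity(driver_a, driver_b))  # close to 1.0
print(cosine_similarity(driver_a, mug))       # much lower
```

In the real pipeline, the vectors come from an embedding model and the comparison runs against every candidate in the opposing dataset, but the core operation is exactly this.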
Practical applications
Image categorization isn’t limited to product matching. Any domain where images contain structured information that needs to be extracted and compared benefits from this approach.
Product catalog enrichment. A retailer receives supplier product photos with minimal metadata. AI categorization fills in brand, category, material, and color — turning a folder of photos into a structured catalog ready for matching against the retailer’s existing inventory.
Real estate. Listing photos are categorized by room type (kitchen, bedroom, bathroom), condition (renovated, original, needs work), and features (hardwood floors, granite counters, stainless appliances). These visual attributes help match the same property across different listing platforms where the text descriptions differ.
Insurance. Claim photos are categorized by damage type (water, fire, impact, wear), severity (minor, moderate, severe), and affected components (roof, siding, foundation). Structured damage data enables matching against policy records and historical claims.
Inventory and asset management. Warehouse photos are categorized by product type, condition, and storage requirements. Visual categorization catches misplacements — a product on the wrong shelf — that barcode-only systems miss.
How this works in Match Data Studio
The workflow integrates image categorization directly into the matching pipeline:
1. Upload your CSV with a column containing image filenames (e.g., product_image).
2. Upload the images to your project through the Project Files panel — drag and drop, or upload a ZIP.
3. Mark the image column as “file” type in the Type Definitions stage. This tells the system that IMG_4021.jpg is a reference to an actual file, not literal text.
4. Create AI enrichment rules that reference the file column. Example: “Analyze this product image and extract category, subcategory, brand, color, material, and condition.”
5. Add the file column to embeddings. The system automatically generates a detailed text description of each image and embeds it. You just set the similarity threshold.
6. Run the pipeline. The extracted attributes become regular text columns that feed into string matching, embedding comparison, and LLM confirmation.
The images themselves never leave your project. They’re uploaded to Gemini temporarily for AI processing, then the extracted text attributes and embedding vectors are what flow through the rest of the pipeline.
For a deeper dive into extracting specific product attributes from images — beyond broad categorization — see our guide on attribute extraction from product images.
Your images are data, not just decoration. Turn them into structured, matchable records.
Keep reading
- Extracting matchable attributes from product images — go beyond categories to structured attributes
- Matching with images and attributes — the complete file-based matching workflow
- PDF categorization and data extraction — similar techniques applied to document files