AI extraction vs AI enrichment: how structured data gets pulled from files
Extraction produces one column from a file. Enrichment produces many. Understanding the difference — and when to use each — determines whether your matching pipeline gets the right signals.
You upload a CSV with 10,000 rows. One column contains product image URLs. Another contains PDF spec sheet filenames. The text columns have product names and prices — useful but incomplete. The real information is locked inside those files.
The matching pipeline needs structured data to compare records. It can’t compare two JPEGs pixel by pixel and tell you they’re the same product. It can’t diff two PDFs and decide they describe the same component. What it can do is extract structured attributes from those files and then match on those attributes.
Two pipeline stages handle this: AI extraction and AI enrichment. They sound similar. They both use AI to pull information from files. But they serve different purposes, and choosing the right one for each task determines whether your matching pipeline gets useful signals or noisy ones.
AI extraction: one question, one answer
AI extraction asks a single, focused question about a file and gets back a single value. One input column, one output column.
Examples:
- “What brand is shown in this product image?” → "Nike"
- “What is the document type of this PDF?” → "Invoice"
- “What color is the primary item in this photo?” → "Navy blue"
- “Is this property photo an interior or exterior shot?” → "Interior — kitchen"
Each extraction rule creates exactly one new column in your dataset. The AI sees the file, answers the specific question, and moves on.
| Image | → brand | → primary_color | → condition |
|---|---|---|---|
| shoe_001.jpg | Nike | White/Black | New with tags |
| shoe_002.jpg | Adidas | Grey/White | New without box |
| shoe_003.jpg | New Balance | Navy/Red | Used — light wear |
| shoe_004.jpg | Nike | Black/Volt | New with box |
| shoe_005.jpg | Puma | White/Gold | New with tags |
Three separate extraction rules, each producing one column. Each rule = one AI call per row.
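The rule-per-column mechanics can be sketched as a simple loop: one focused question per rule, one model call per rule per row. Here `ask_model` is a hypothetical stand-in for a real vision-model API, stubbed with canned answers so the sketch runs offline — the names and prompts are illustrative, not an actual API.

```python
# Sketch only: `ask_model` stands in for a real vision-model API call,
# stubbed with canned answers so the example runs without a network.
def ask_model(image, prompt):
    canned = {
        "What brand is shown in this product image?": "Nike",
        "What color is the primary item in this photo?": "White/Black",
        "What is the condition of this item?": "New with tags",
    }
    return canned.get(prompt, "Unknown")

# Each extraction rule = one focused question = one new column.
EXTRACTION_RULES = {
    "brand": "What brand is shown in this product image?",
    "primary_color": "What color is the primary item in this photo?",
    "condition": "What is the condition of this item?",
}

def run_extraction_rules(row):
    # Three rules -> three separate AI calls for this single row.
    for column, prompt in EXTRACTION_RULES.items():
        row[column] = ask_model(row["image"], prompt)
    return row

row = run_extraction_rules({"image": "shoe_001.jpg"})
```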
The advantage of extraction is precision. When you ask “What brand?” the AI focuses entirely on brand identification. It looks for logos, brand text, design language — nothing else. The answer is short, consistent, and directly usable as a matching field.
The disadvantage is cost at scale. If you need five attributes from each image, you need five extraction rules, which means five AI calls per row. At 10,000 rows, that’s 50,000 API calls for five attributes.
AI enrichment: one call, many answers
AI enrichment asks a complex question and gets back multiple structured fields in a single AI call. One input file, multiple output columns.
Example enrichment prompt: “Analyze this product image. Extract: brand, product type, primary color, material, and condition.”
That single call returns:
- brand: "Nike"
- product_type: "Running shoe — low top"
- primary_color: "White with black swoosh and accents"
- material: "Mesh upper, rubber sole"
- condition: "New with original box and tags"
Five fields from one API call instead of five.
| Image | → brand | → product_type | → primary_color | → material | → condition |
|---|---|---|---|---|---|
| shoe_001.jpg | Nike | Running shoe | White/Black | Mesh/Rubber | New with tags |
| shoe_002.jpg | Adidas | Basketball shoe | Grey/White | Leather/Rubber | New without box |
| shoe_003.jpg | New Balance | Trail runner | Navy/Red | Synthetic/Rubber | Used — light wear |
One enrichment rule, one AI call per row, five output columns. 5x more efficient than five separate extractions.
Enrichment is efficient. But it trades some precision for breadth. When the AI is asked to extract five attributes simultaneously, it distributes its attention across all of them. For most use cases this works fine — multimodal models are good at multi-task extraction. But for a particularly nuanced field (e.g., identifying a specific sub-model of a product from subtle visual differences), a dedicated extraction rule with a focused prompt will outperform the enrichment approach.
When to use which
The decision is straightforward:
| Scenario | Use extraction | Use enrichment |
|---|---|---|
| Need 1-2 attributes from a file | ✓ | |
| Need 3+ attributes from a file | | ✓ |
| Need a field that requires focused analysis | ✓ | |
| Need a quick categorization + description | | ✓ |
| Working with a very large dataset (cost matters) | | ✓ |
| Need maximum precision on a single field | ✓ | |
| Building a matching pipeline from scratch | | ✓ (start here, then add extractions for weak fields) |
The practical pattern is: start with enrichment to get broad coverage, then add targeted extraction rules for fields where the enrichment output isn’t precise enough.
For example, you run enrichment to extract brand, color, material, and product type from product images. The brand extraction is 95% accurate — good enough. But the material extraction is only 78% accurate because the enrichment prompt doesn’t give the AI enough room to analyze textures carefully. So you add a dedicated extraction rule: “Examine this product image closely. What is the primary material? Look at surface texture, sheen, visible grain patterns, and construction details. Be specific — ‘full-grain leather’ not just ‘leather.’” That focused prompt brings material accuracy to 90%.
How the prompts differ
The prompt is everything. Same AI model, same image — but a well-crafted prompt extracts precise, matchable data while a vague prompt returns useless descriptions.
Extraction prompt anatomy
An extraction prompt should be:
- Focused on one attribute. Don’t ask about color in a brand-extraction prompt.
- Specific about format. “Return the brand name only, no other text” prevents the AI from adding qualifiers.
- Clear about edge cases. “If no brand is visible, return ‘Unknown’” prevents hallucinated brand names.
Good extraction prompt:
“Identify the brand of this product from any visible logos, labels, brand text, or distinctive brand design elements. Return only the brand name. If the brand is not identifiable, return ‘Unknown’.”
Bad extraction prompt:
“What brand is this?”
The bad prompt might return “This appears to be a Nike product based on the swoosh logo visible on the side” — a sentence, not a matchable value. The good prompt returns “Nike” — clean, consistent, directly comparable across rows.
Enrichment prompt anatomy
An enrichment prompt should be:
- Structured with clear field names. Tell the AI exactly what fields you want and what to call them.
- Specific about each field’s requirements. A one-line description per field prevents ambiguity.
- Explicit about the output format. The system parses the response into columns, so consistency matters.
Good enrichment prompt:
“Analyze this product image and extract the following fields:
- brand: The brand name from logos or labels. ‘Unknown’ if not identifiable.
- product_type: Specific product category (e.g., ‘running shoe’ not just ‘shoe’).
- primary_color: The dominant color(s), be specific (e.g., ‘navy blue’ not ‘blue’).
- material: Primary material of the main body (e.g., ‘mesh upper’ or ‘full-grain leather’).
- condition: New/Used/Refurbished with visible evidence.”
Bad enrichment prompt:
“Describe this product.”
The bad prompt returns a narrative paragraph. The good prompt returns five discrete, matchable fields.
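Since the system parses the enrichment response into columns, a predictable `field: value` layout is what makes the output usable. A minimal parser sketch, assuming the model follows the line-per-field format requested above (the function and field names are illustrative):

```python
# Minimal sketch: parse a line-per-field enrichment response into
# columns. Assumes the model followed the requested "field: value" format.
def parse_enrichment(response, expected_fields):
    columns = {}
    for line in response.splitlines():
        line = line.strip().lstrip("- ")
        if ":" not in line:
            continue  # skip narrative lines the model sneaked in
        field, _, value = line.partition(":")
        field = field.strip()
        if field in expected_fields:
            columns[field] = value.strip()
    # Missing fields become explicit gaps rather than silent omissions.
    for field in expected_fields:
        columns.setdefault(field, "Unknown")
    return columns

response = """brand: Nike
product_type: Running shoe
primary_color: White with black accents
material: Mesh upper, rubber sole
condition: New with tags"""

cols = parse_enrichment(
    response,
    ["brand", "product_type", "primary_color", "material", "condition"],
)
```

Defaulting missing fields to "Unknown" keeps the column count stable across rows, which downstream matching depends on.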
File types and what they yield
Different file types contain different kinds of extractable information. The prompt strategy should match the file type.
Images (JPG, PNG, WebP)
Images are the most information-dense files for visual attributes. A single product photo can yield 8–12 structured fields.
- Strong extractions: Brand, color, product type, design features, condition, packaging.
- Weak extractions: Exact dimensions (no reference object), weight (not visual), technical specifications (not in the image).
Prompt strategy: Focus on visually observable attributes. Don’t ask images for information that requires text data (model numbers, specifications, pricing).
PDFs
PDFs contain structured text, tables, images, and layout — the most varied file type. What you can extract depends on the document type.
- Strong extractions: Entity names, dates, financial figures, technical specifications, compliance data, tabular content.
- Weak extractions: Implicit relationships (“this clause modifies that clause”), sentiment, intent.
Prompt strategy: Be specific about which sections to extract from. “Extract the manufacturer and part number from the header block” outperforms “What manufacturer made this?” because it directs the AI’s attention.
URLs (public web content)
When a column contains URLs instead of filenames, the AI fetches the page content server-side. This works for public product pages, listing URLs, and documentation.
- Strong extractions: Structured data embedded in pages (prices, specifications, descriptions), visible content in images on the page.
- Weak extractions: Dynamic content (JavaScript-rendered data), content behind authentication, Cloudflare-protected pages.
Prompt strategy: Specify what you expect the URL to contain. “This URL points to a product listing page. Extract the product name, price, and specifications” gives the AI context about what kind of page it’s reading.
The pipeline flow
Extraction and enrichment run during Stage 2 of the matching pipeline. Here’s how they fit into the full flow:
| Stage | What happens | Uses files? |
|---|---|---|
| Stage 1: Prepare | Normalize columns, generate candidate pairs, apply string/numeric pre-filters | No — text and numbers only |
| Stage 2: Enrich | AI extraction → AI enrichment → Embeddings | Yes — files processed here |
| Stage 3: Match | Cosine similarity, thresholds, LLM confirmation | Yes — LLM can view files for confirmation |
| Stage 4: Output | Generate results CSV | No — uses extracted text from Stage 2 |
Files are processed in Stage 2. The extracted text flows through Stages 3 and 4 as regular columns.
The critical insight: files are read once in Stage 2, then the extracted text replaces them for all downstream operations. The embedding model doesn’t see your images — it sees the text descriptions generated from your images. The cosine similarity calculation doesn’t compare PDFs — it compares the vectors of the text extracted from those PDFs.
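The Stage 3 comparison therefore operates purely on text vectors. A toy sketch of that similarity step, with invented 3-dimensional vectors standing in for a real embedding model's output (real embeddings have hundreds of dimensions):

```python
import math

# Toy sketch: cosine similarity over embedding vectors. The 3-d vectors
# below are invented stand-ins for embeddings of the *extracted text* --
# the calculation never sees the original images or PDFs.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

emb_record_a = [0.9, 0.1, 0.2]    # "Nike running shoe, white/black, new"
emb_record_b = [0.88, 0.12, 0.21] # near-identical extracted description
emb_record_c = [0.1, 0.9, 0.3]    # unrelated product description

sim_match = cosine_similarity(emb_record_a, emb_record_b)
sim_nonmatch = cosine_similarity(emb_record_a, emb_record_c)
```

Specific, consistent extracted text pushes true matches toward 1.0 and unrelated records well below any sensible threshold; vague descriptions compress that gap.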
This design means extraction and enrichment quality directly determines matching quality. A sloppy enrichment prompt that returns vague descriptions produces vague embeddings that match imprecisely. A sharp enrichment prompt that returns specific, structured attributes produces specific embeddings that match accurately.
Cost and performance tradeoffs
At scale, the choice between extraction and enrichment has real cost implications.
The hybrid approach — one enrichment rule for broad coverage plus one or two targeted extraction rules for fields that need extra precision — typically delivers the best cost/quality balance. You get 80% of the value from the enrichment call and use targeted extraction to push the remaining fields to the accuracy you need.
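The call-count arithmetic behind that tradeoff, for a hypothetical 10,000-row dataset needing five attributes per row (the numbers are illustrative):

```python
# Call-count arithmetic for the three strategies, assuming 10,000 rows
# and five attributes per row (illustrative numbers, not benchmarks).
rows = 10_000

# Pure extraction: one call per attribute per row.
extraction_calls = 5 * rows   # 50,000 calls

# Pure enrichment: one call per row covers all five attributes.
enrichment_calls = 1 * rows   # 10,000 calls

# Hybrid: one enrichment call plus two targeted extraction rules
# for the fields that need extra precision.
hybrid_calls = (1 + 2) * rows # 30,000 calls
```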
Extraction for categorization
A common pattern is using extraction purely for categorization — sorting files into types before applying different enrichment strategies.
For example, a dataset with a mixed document column (some rows have product photos, some have PDF spec sheets, some have certificates):
- First pass: extraction to categorize. “What type of file is this? Return one of: product_photo, spec_sheet, certificate, invoice, other.”
- Second pass: enrichment tuned by category. Different enrichment prompts for each file type. Product photos get “Extract brand, color, material, condition.” Spec sheets get “Extract manufacturer, part number, material grade, pressure rating.” Certificates get “Extract issuing authority, certification type, expiry date.”
This two-pass approach is more accurate than a single generic enrichment prompt because the AI knows what kind of document it’s analyzing before trying to extract specific fields.
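The two-pass flow amounts to a categorize-then-dispatch table. In this sketch the prompts mirror the examples above, and `categorize` stands in for the first-pass extraction call (replaced here with a crude filename heuristic so the example runs offline):

```python
# Sketch of the two-pass pattern: a first-pass extraction categorizes
# the file, then a second-pass enrichment prompt tuned to that category
# is selected. Prompts and names are illustrative.
ENRICHMENT_PROMPTS = {
    "product_photo": "Extract brand, color, material, condition.",
    "spec_sheet": "Extract manufacturer, part number, material grade, pressure rating.",
    "certificate": "Extract issuing authority, certification type, expiry date.",
}
FALLBACK_PROMPT = "Describe the key identifying attributes of this file."

def categorize(filename):
    # Stand-in for the first-pass extraction call; a crude filename
    # heuristic replaces the real model here.
    if filename.endswith((".jpg", ".png")):
        return "product_photo"
    if "spec" in filename:
        return "spec_sheet"
    if "cert" in filename:
        return "certificate"
    return "other"

def choose_prompt(filename):
    return ENRICHMENT_PROMPTS.get(categorize(filename), FALLBACK_PROMPT)

prompt = choose_prompt("valve_spec_sheet.pdf")
```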
Common mistakes
Asking for too many fields in one enrichment call. Enrichment works well for 3–8 fields. Beyond that, extraction quality degrades — the AI has too many objectives and doesn’t focus enough on any single one. If you need 15 attributes, split into two enrichment rules of 7–8 fields each.
Not specifying output format. Without format guidance, the AI might return “The brand appears to be Nike, based on the swoosh logo” for one row and just “Adidas” for another. The inconsistency breaks downstream matching. Specify: “Return only the brand name, nothing else.”
Asking images for non-visual information. “What is the price of this product?” from a product photo is almost always wrong — the price isn’t in the image. Extract price from text columns, extract visual attributes from images. Use each data source for what it’s good at.
Ignoring extraction quality before running a full match. Run a sample (5 rows from each dataset) and check the extracted values. If the brand column says “Unknown” for 40% of rows, your enrichment prompt needs work before you scale to 10,000 rows. Debugging at 10 rows costs nothing. Debugging at 10,000 rows costs time and credits.
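A sample check like this can be automated before scaling up. A sketch, assuming extracted rows are plain dicts; the thresholds and sentence heuristic are illustrative choices, not part of any pipeline's actual API:

```python
# Sketch of a pre-scale quality check on a small extraction sample:
# flag any column where too many values are "Unknown" or where answers
# read like sentences instead of clean, matchable values.
def audit_sample(rows, columns, max_unknown_rate=0.2, max_words=6):
    problems = []
    for col in columns:
        values = [row.get(col, "Unknown") for row in rows]
        unknown_rate = values.count("Unknown") / len(values)
        if unknown_rate > max_unknown_rate:
            problems.append(f"{col}: {unknown_rate:.0%} Unknown")
        if any(len(v.split()) > max_words for v in values):
            problems.append(f"{col}: narrative answer detected")
    return problems

sample = [
    {"brand": "Nike", "material": "Unknown"},
    {"brand": "Adidas", "material": "Unknown"},
    {"brand": "This appears to be a Puma product based on the logo",
     "material": "Mesh"},
]
issues = audit_sample(sample, ["brand", "material"])
```

Any flagged column means the prompt needs work before the full run, when the same defect would cost real time and credits.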
Extraction gives you precision on one field. Enrichment gives you breadth across many. The best pipelines use both — enrichment for the bulk of the work, extraction for the fields that need extra care.
Keep reading
- Matching with images and attributes — how extracted attributes feed into the full matching pipeline
- Extracting matchable attributes from product images — deep dive on product image extraction specifically
- PDF categorization and data extraction — extraction techniques for document files
- Matching real estate listings with photos — extraction applied to property listing photos