A sporting goods retailer is doing competitive price monitoring. They have their internal product catalog — 6,800 items with product names, prices, categories, and product photos. They’ve scraped a competitor’s website and collected 9,200 listings, also with product names, prices, and product images.

They run a text-based match on product names and prices. The result: 4,350 matched pairs, a 64% match rate against their catalog. Not bad. But they know there should be more overlap — this competitor carries most of the same brands.

A spot check reveals the problem. Their catalog says “Nike Air Max 90 Men’s Running Shoe — White/Black.” The competitor lists it as “Men’s Nike AM90 White Black Running Sneaker.” The text overlap is partial. The string similarity score is 0.61 — below the threshold. No match.

But both listings include a product photo. And both photos show the same shoe from a slightly different angle. To a human comparing the images, it’s an obvious match.

Many of the remaining 36% of unmatched products aren’t actually unmatched — the matching tool just can’t see them because it’s only looking at text.

Why files change the matching equation

Traditional data matching works on structured text: names, codes, descriptions, categories. Every comparison is text-to-text or number-to-number. This works well when the text is descriptive and consistent.

But text is often neither.

Product names are written by different people with different conventions. Category taxonomies diverge across organizations. Descriptions range from meticulous to absent. Key attributes are scattered unevenly — one dataset has brand in a separate column, the other buries it in the product name.

Files — images, PDFs, documents — contain information that text fields don’t capture. And often, the file is the most reliable identifier. A product photo shows exactly what the product is, regardless of what anyone chose to call it. A PDF spec sheet contains technical attributes that no one bothered to type into the spreadsheet. A certificate proves a compliance claim that would otherwise require manual verification.

File-based matching treats these files as data sources, not just attachments. An image is not just a reference — it’s input to AI that extracts attributes, generates descriptions, and enables visual comparison. A PDF is not just a document to store — it’s structured data waiting to be extracted and matched.

The file-based matching architecture

Files participate in four stages of the matching pipeline, each providing a different type of signal:

How file columns work in the matching pipeline

| Pipeline stage | What it does with files | Output |
|---|---|---|
| AI Completions | Extract a single attribute from the file (e.g., brand from a product photo) | One new text column per completion rule |
| AI Enrichment | Extract multiple attributes from the file in one call (e.g., brand, color, material, condition) | Multiple new text columns per enrichment rule |
| Embeddings | Generate a detailed text description of the file, then embed that description into a vector | Embedding vector for similarity comparison |
| LLM Confirmation | Show both files side by side and ask the AI whether they represent the same entity | Boolean match/no-match with reasoning |

Each stage builds on the previous. Completions and enrichments produce structured attributes. Embeddings capture semantic similarity. LLM confirmation resolves ambiguous pairs.

These stages layer on top of each other:

Completions and enrichments run first, during Stage 2 of the pipeline. They extract structured attributes from files — brand from a product photo, manufacturer from a spec sheet, party names from a contract. The extracted values become regular text columns that feed into all downstream matching.

Embeddings also run in Stage 2. For file columns, the system automatically generates a detailed natural-language description of each file’s content using AI, then embeds that description into a vector. You don’t need to configure the description step — just add the file column to embeddings and set a similarity threshold.
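The describe-then-embed step can be sketched as follows. Here `describe_file` stands in for the AI description call (stubbed with canned text), and a bag-of-words vector stands in for a real embedding model; only the shape of the process mirrors the pipeline.

```python
import math
from collections import Counter

def describe_file(file_ref: str) -> str:
    """Stubbed AI description of a file's visual content."""
    canned = {
        "catalog/airmax90.jpg": "nike air max 90 mens running shoe white black swoosh",
        "competitor/am90.jpg": "mens nike air max 90 running sneaker white black",
    }
    return canned[file_ref]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' of a description."""
    return Counter(text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Descriptions of the same shoe from two different listings end up
# semantically close, even though the original product names diverged.
sim = cosine(embed(describe_file("catalog/airmax90.jpg")),
             embed(describe_file("competitor/am90.jpg")))
```

With a real embedding model the vectors are dense and the descriptions far richer, but the principle is the same: two photos of the same product yield similar descriptions, and similar descriptions yield nearby vectors.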

LLM confirmation runs in Stage 3, after candidate matches have been identified through embedding similarity and pre-filters. For borderline pairs — high enough similarity to be interesting but not high enough to be certain — the AI examines both files side by side and makes a determination: same entity or different?
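The routing logic for Stage 3 can be sketched in a few lines. The cutoffs below are taken from the product-matching example later in this article (0.75–0.85) and are illustrative; in practice the thresholds are configurable per project.

```python
# Illustrative routing of candidate pairs by embedding similarity.
AUTO_ACCEPT = 0.85   # confident enough to match without an LLM call
SEND_TO_LLM = 0.75   # pairs in [0.75, 0.85) get side-by-side confirmation

def route(similarity: float) -> str:
    if similarity >= AUTO_ACCEPT:
        return "match"
    if similarity >= SEND_TO_LLM:
        return "llm_confirmation"
    return "no_match"
```

Only the borderline band pays the cost of an LLM call; clear matches and clear non-matches are resolved by similarity alone.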

Use case: product matching with photos

The sporting goods retailer from the opening configures their pipeline to use product images:

Step 1: Extract attributes from images. An AI enrichment rule extracts brand, product line, color, and product type from each product photo in both datasets. Now both datasets have consistent, AI-extracted brand and color fields — even though the original text was inconsistent.

Step 2: Embed image descriptions. Each product photo gets an auto-generated description (“Nike Air Max 90 men’s running shoe in white with black swoosh and accents, low-top silhouette, visible Air Max cushioning unit in heel”) that’s embedded alongside the text fields.

Step 3: Pre-filter on extracted brand. The AI extracted “Nike” from both images. String pre-filtering on the extracted brand field eliminates cross-brand comparisons, dramatically reducing the comparison space.

Step 4: Compare embeddings. The image-description embeddings for “Nike Air Max 90 Men’s Running Shoe — White/Black” and “Men’s Nike AM90 White Black Running Sneaker” are highly similar — because the AI described the same shoe from both photos, producing semantically similar descriptions regardless of the original text.

Step 5: LLM visual confirmation. For pairs where embedding similarity is 0.75-0.85 (promising but not conclusive), the pipeline shows both product images to the AI and asks: “Are these photos of the same product? Consider brand, model, color, and design details.” The AI sees both shoes and confirms: same product, different photos.
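Step 3’s pre-filter amounts to blocking on the AI-extracted brand field. A minimal sketch, with illustrative records:

```python
from itertools import product

# Only pairs that share an extracted brand are compared, which shrinks
# the candidate space from |catalog| x |competitor| to a small fraction.
catalog = [
    {"id": "C1", "brand": "Nike"},
    {"id": "C2", "brand": "Adidas"},
]
competitor = [
    {"id": "X1", "brand": "Nike"},
    {"id": "X2", "brand": "Asics"},
]

candidates = [
    (a["id"], b["id"])
    for a, b in product(catalog, competitor)
    if a["brand"].lower() == b["brand"].lower()
]
```

Of the four possible pairs here, only the Nike–Nike pair survives; the downstream embedding and LLM stages never see cross-brand comparisons.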

Match rate by matching approach

| Approach | Signals used | Match rate |
|---|---|---|
| Text-only matching | Product name + price + category | 64% |
| Text + image attributes | Adding AI-extracted brand, color, type | 78% |
| Text + image embeddings | Adding visual similarity comparison | 85% |
| Full pipeline with LLM visual | Adding side-by-side image confirmation | 91% |

Match rate against known product overlap between the two catalogs. Full pipeline finds 91% of true matches while maintaining >95% precision.

The jump from 64% to 91% is significant. Those additional 27 percentage points represent 1,836 product matches that text-only matching missed entirely — products the retailer was blind to in their competitive analysis.

Use case: company matching with annual reports

A private equity firm screens acquisition targets. They have two datasets: a CRM list of 2,400 companies (name, industry, estimated revenue, location) and a collection of 800 PDF annual reports gathered from various sources.

The text matching problem: company names in the CRM don’t match the legal entity names in the annual reports. “Acme Corp” in the CRM is “Acme Corporation International, LLC” on the annual report cover. “TechStart” in the CRM is “TechStart Holdings, Inc. d/b/a TechStart Solutions” in the report.

The solution: AI enrichment extracts structured fields from each annual report — legal entity name, revenue, headcount, primary industry, headquarters location, key executives. These extracted fields match against the CRM data with much higher accuracy than the raw company names alone.

The revenue field is particularly powerful. “Acme Corp” at “$450M estimated” in the CRM matches “Acme Corporation International” reporting “$447.2M” in the annual report. The combination of fuzzy name matching and numeric revenue proximity makes this a high-confidence match that name matching alone would miss.

Use case: real estate with listing photos

Two real estate data providers each have property listings. Addresses are formatted differently. MLS numbers are platform-specific. Square footage sometimes disagrees. But both datasets include listing photos.

The matching pipeline extracts visual attributes from listing photos: property style (colonial, ranch, modern), exterior material (brick, siding, stone), distinctive features (bay window, wraparound porch, red front door). These visual attributes, combined with approximate address matching and price proximity, identify the same property across platforms — even when the text data has errors or inconsistencies.

LLM confirmation is particularly powerful here. Showing the AI two kitchen photos from different listing platforms: “Same kitchen? Compare cabinet style, countertop material, backsplash, and appliances.” The AI confirms: “Same kitchen — white shaker cabinets, quartz countertops, subway tile backsplash, stainless steel KitchenAid appliances visible in both photos.”
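Approximate address matching plus price proximity can be sketched as follows; the abbreviation map and the 2% price tolerance are illustrative stand-ins for a fuller normalization step:

```python
import re

# Toy address normalization: lowercase, strip punctuation, and map
# common suffixes to abbreviations (not an exhaustive list).
ABBREV = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize_address(addr: str) -> str:
    tokens = re.sub(r"[^\w\s]", "", addr.lower()).split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

def same_listing(addr_a: str, price_a: float,
                 addr_b: str, price_b: float,
                 price_tol: float = 0.02) -> bool:
    return (normalize_address(addr_a) == normalize_address(addr_b)
            and abs(price_a - price_b) / max(price_a, price_b) <= price_tol)

hit = same_listing("123 Maple Street", 425_000,
                   "123 Maple St.", 425_000)
```

Pairs that survive this cheap block are the ones worth sending to visual comparison of the listing photos.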

Use case: document deduplication

A legal department has accumulated 15,000 contracts across SharePoint, a document management system, and email attachments. Many are duplicates — the same contract saved in multiple locations, sometimes with different filenames or minor formatting differences.

The pipeline:

  • AI enrichment extracts parties, effective date, term, value, and key clause summaries from each PDF
  • Embeddings capture the semantic content of each contract
  • Matching finds pairs with similar parties, dates, and content
  • LLM confirmation compares the two PDFs and identifies whether they’re the same contract, different versions, or related but distinct agreements

The output: clusters of duplicate contracts, tagged by version, with the most recent version identified. The legal team reviews the clusters instead of opening 15,000 PDFs.
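Turning confirmed duplicate pairs into clusters is a standard union-find pass; the contract filenames below are illustrative:

```python
def cluster(pairs: list[tuple[str, str]], items: list[str]) -> list[list[str]]:
    """Group items into clusters of size > 1 from pairwise duplicate links."""
    parent = {i: i for i in items}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for i in items:
        groups.setdefault(find(i), []).append(i)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Pairs confirmed by the LLM stage chain together transitively:
dupes = cluster(
    [("msa_v1.pdf", "msa_copy.pdf"), ("msa_copy.pdf", "msa_email.pdf")],
    ["msa_v1.pdf", "msa_copy.pdf", "msa_email.pdf", "nda.pdf"],
)
```

Two pairwise links produce one three-document cluster, while the unrelated NDA stays out of every cluster.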

Supported file types

| Type | Extensions | What AI extracts |
|---|---|---|
| Images | .jpg, .jpeg, .png, .webp, .gif | Visual attributes, brand, color, condition, descriptions |
| Documents | .pdf | Text content, tables, structured data, narrative summaries |
| Text files | .txt, .md, .csv | Direct text content for embedding and comparison |

Best practices for images:

  • Clear, well-lit product photos work best
  • One product per image (not lifestyle shots with multiple items)
  • Show the product from an angle that reveals brand and key features
  • Minimum 640px on the shortest side for reliable extraction
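The 640px guideline can be enforced before upload. This stdlib-only sketch reads width and height straight from a PNG’s IHDR header; a real check would use an image library and cover JPEG and WebP as well.

```python
import struct

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def png_dimensions(data: bytes) -> tuple[int, int]:
    # After the 8-byte signature comes the IHDR chunk: 4-byte length,
    # 4-byte type, then width and height as big-endian uint32s.
    if data[:8] != PNG_SIG:
        raise ValueError("not a PNG")
    return struct.unpack(">II", data[16:24])

def meets_minimum(data: bytes, min_side: int = 640) -> bool:
    width, height = png_dimensions(data)
    return min(width, height) >= min_side

def fake_png_header(width: int, height: int) -> bytes:
    # Signature + start of IHDR only; enough bytes for the dimension check.
    return PNG_SIG + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", width, height)

ok_header = fake_png_header(800, 640)      # shortest side exactly 640px
small_header = fake_png_header(320, 200)   # below the guideline
```

Screening out undersized images up front avoids spending AI extraction calls on photos that won’t yield reliable attributes.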

Best practices for PDFs:

  • Native (searchable) PDFs extract better than scanned documents
  • Documents under 20 pages process most reliably
  • Standard business document formats (invoices, spec sheets, certificates) have the highest extraction accuracy

Setting up a file-based matching project

The workflow is nearly identical to text-only matching — file support adds one column type, not a separate process.

  1. Create a project and upload your two CSVs. One or both should have a column with filenames.

  2. Upload files through the Project Files panel. Drag and drop individual files, or upload a ZIP archive that gets extracted automatically. The files are stored in your project and available for both datasets to reference.

  3. Mark file columns in the Type Definitions stage. Set the image or document column’s type to “file” instead of “text.” This tells the pipeline that product_photo.jpg is a reference to an actual file, not literal text.

  4. Configure extraction rules. In the AI Completions or AI Enrichment stages, create rules that reference the file column. The AI assistant can help you write extraction prompts — describe what you want to extract and it suggests the configuration.

  5. Add file columns to embeddings. Select the file column in the Embeddings stage and set a similarity threshold. The system handles the description-then-embed process automatically.

  6. Add LLM confirmation with file columns. Include the file columns in your LLM check configuration so the AI can compare files side by side for borderline matches.

  7. Run and review. The sample run (5 credits) processes a subset so you can verify the extraction quality and match accuracy before running the full dataset.

The key insight is that files are additive. Every existing matching capability — string matching, numeric comparison, embedding similarity, LLM checks — continues to work on your text and numeric columns. Files add a new layer of signals on top. You can start with text-only matching, verify it works, then add file columns to see if they improve the results.

For background on the underlying techniques, see our guides on image categorization, PDF extraction, and attribute extraction from product images.


Text matching shows you part of the picture. File-based matching shows you the whole thing. Add images, PDFs, and documents to your matching pipeline and find the matches your text data is missing.

Try file-based matching →

