You have a folder of 2,000 PDFs. Some are invoices. Some are contracts. Some are product spec sheets. Some are tax forms. They’re named things like scan_0847.pdf and document_final_v3_REAL.pdf. The filenames tell you nothing.

You need to classify every document, extract key fields, and route them to the right system. The traditional approach — writing regex rules against OCR’d text — handles maybe 60% of them. The rest have scanned handwriting, embedded images, complex tables, or charts that carry the critical information.

So you open ChatGPT and upload a PDF. “What type of document is this?” It reads the text, looks at the layout, and tells you it’s a commercial invoice from Acme Corp dated March 2026, with three line items totaling $14,200. It even reads the table correctly.

Can you do this for all 2,000 documents?

Yes, LLMs can read PDFs

Let’s start with the direct answer to the questions people are actually asking.

Can Gemini read PDFs? Yes. Gemini 2.5 Pro and Gemini 2.5 Flash accept PDF files directly as input. You can upload a PDF to the Gemini API (or Google AI Studio) and ask questions about it. Gemini processes each page as an image, understands the text, reads tables, interprets charts, and reasons about the layout. Gemini’s 1M token context window means it can handle documents up to several hundred pages in a single call.

Can ChatGPT understand documents? Yes. GPT-4o is natively multimodal — it processes images, including PDF pages rendered as images. In the ChatGPT interface, you can upload PDFs directly. Through the API, you convert pages to images and send them as part of the message. GPT-4o reads text, understands tables, and interprets visual elements.

Can Claude read PDFs? Yes. Claude (Opus, Sonnet, Haiku) accepts PDF files directly through the API. It processes both the text layer and visual elements of each page. Like Gemini, it handles tables, charts, and mixed-format documents.

Can open-source models do this? Increasingly, yes. Models like Llama 4, Qwen2.5-VL, and InternVL can process document images. They require self-hosting and more setup, but the capability exists.

The multimodal revolution means that every major LLM now understands documents the way a human does — by looking at them.

What “reading a PDF” actually means

When we say an LLM “reads” a PDF, it’s doing several things simultaneously that traditional OCR and rule-based systems handle separately (or not at all).

Text extraction. The obvious one. The LLM reads all the text on the page — headers, body text, footers, watermarks, marginalia. Unlike traditional OCR, it understands context. It knows that “Net 30” in an invoice footer means payment terms, not a product name.

Table comprehension. This is where LLMs pull ahead dramatically. A table with merged cells, spanning headers, and inconsistent formatting is a nightmare for rule-based parsers. An LLM looks at the table the same way you do — it understands that the bold row is a subtotal, that the indented rows are line items, and that the column labeled “Qty” contains quantities even when some cells use “x2” and others use “2 units.”

Image understanding. Logos, signatures, stamps, photos embedded in the document — an LLM can describe them, extract text from them (even stylized or handwritten text), and use them for classification. A document with the IRS eagle logo is probably a tax form. A document with a company’s product photo is probably a spec sheet.

Chart and graph interpretation. Bar charts, line graphs, pie charts — an LLM can read the values, understand the trends, and extract the data. A financial report with a revenue chart doesn’t need the data extracted separately; the LLM reads the chart directly.

Layout understanding. The spatial arrangement of elements on the page carries meaning. A two-column layout with a header and footer is a letter. A grid of cells with monetary values is an invoice. A dense block of text with numbered sections is a contract. LLMs understand these visual patterns.

LLM PDF understanding capabilities
| Capability | Traditional OCR | Rule-based parser | Multimodal LLM |
|---|---|---|---|
| Plain text extraction | Good | Good | Excellent |
| Handwritten text | Poor | None | Good |
| Simple tables | Fair | Good | Excellent |
| Complex/merged tables | Poor | Poor | Good |
| Embedded images | None | None | Excellent |
| Charts and graphs | None | None | Good |
| Layout classification | None | Manual rules | Excellent |
| Cross-page reasoning | None | Limited | Excellent |

Multimodal LLMs handle the full spectrum of PDF content that traditional tools process piecemeal or miss entirely.

Granular document classification with LLMs

Basic classification — “is this an invoice or a contract?” — is straightforward. Any LLM gets that right 95%+ of the time from the first page alone.

But real-world classification needs are much more granular. You don’t just need to know it’s an invoice. You need to know:

  • Is it a proforma invoice or a commercial invoice?
  • Is it domestic or international (customs declarations, HS codes)?
  • Is it associated with a purchase order or standalone?
  • Does it contain hazardous materials declarations?
  • Is it in USD, EUR, or another currency?
  • Is the vendor a preferred supplier or a new vendor?

This level of classification requires understanding text, tables, and visual cues together. A customs invoice has specific layout patterns and mandatory fields. A hazmat declaration has warning symbols. A purchase-order-linked invoice references a PO number in a specific location.

LLMs handle this because they process the document holistically — text, images, layout, context — and reason about all of it at once.

How to prompt for granular classification

The key is specificity. Vague prompts produce vague classifications. Structured prompts produce structured results.

Bad prompt:

What kind of document is this?

Good prompt:

Classify this document. Return a JSON object with:
- document_type: one of [commercial_invoice, proforma_invoice,
  purchase_order, contract, amendment, spec_sheet, certificate,
  tax_form, financial_report, correspondence, other]
- confidence: 0.0 to 1.0
- language: ISO 639-1 code
- currency: ISO 4217 code if financial, null otherwise
- contains_tables: boolean
- contains_images: boolean
- contains_signatures: boolean
- page_count: number
- key_entities: list of company/person names found
- summary: one-sentence description

With a structured prompt, the LLM doesn’t just classify — it extracts a rich metadata record from every document. This metadata becomes the basis for routing, matching, and downstream processing.
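Keeping the taxonomy in one place makes the prompt and the downstream validation agree by construction. A minimal sketch of generating the prompt above from a category list (the constant and function names are illustrative, not part of any API):

```python
# Canonical taxonomy, defined once and reused for prompting and validation.
DOCUMENT_TYPES = [
    "commercial_invoice", "proforma_invoice", "purchase_order",
    "contract", "amendment", "spec_sheet", "certificate",
    "tax_form", "financial_report", "correspondence", "other",
]

def build_classification_prompt(doc_types=DOCUMENT_TYPES):
    """Render the structured classification prompt from the taxonomy list."""
    return (
        "Classify this document. Return a JSON object with:\n"
        f"- document_type: one of [{', '.join(doc_types)}]\n"
        "- confidence: 0.0 to 1.0\n"
        "- language: ISO 639-1 code\n"
        "- currency: ISO 4217 code if financial, null otherwise\n"
        "- contains_tables: boolean\n"
        "- contains_images: boolean\n"
        "- contains_signatures: boolean\n"
        "- page_count: number\n"
        "- key_entities: list of company/person names found\n"
        "- summary: one-sentence description"
    )
```

When you add or rename a category, the prompt and the validator change together, which is exactly the consistency property batch classification needs.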

The model-by-model breakdown for PDF tasks

Each model has different strengths when it comes to document understanding.

LLM capabilities for PDF document processing
| Model | PDF input | Max pages | Table accuracy | Best for |
|---|---|---|---|---|
| Gemini 2.5 Pro | Native PDF | ~300 pages | Excellent | Long documents, complex tables |
| Gemini 2.5 Flash | Native PDF | ~300 pages | Good | High-volume classification |
| GPT-4o | Images (pages as PNG) | ~50 pages* | Excellent | Detailed extraction, reasoning |
| GPT-4o mini | Images (pages as PNG) | ~50 pages* | Good | Budget classification |
| Claude Opus 4 | Native PDF | ~100 pages | Excellent | Complex reasoning, long contracts |
| Claude Sonnet 4 | Native PDF | ~100 pages | Good | Balanced speed and accuracy |
| Claude Haiku 4.5 | Native PDF | ~100 pages | Fair | Fast, cheap pre-screening |

*GPT-4o requires converting PDF pages to images before sending. Max pages are practical limits based on context window and cost, not hard caps. Native PDF means the API accepts .pdf files directly.

Gemini has the largest context window and native PDF support, making it ideal for long documents — legal contracts, annual reports, multi-section filings. It handles 300+ pages without chunking.

GPT-4o requires converting PDF pages to images, which adds a preprocessing step but produces excellent results. Its reasoning capabilities are strong on complex documents with ambiguous classification.
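The preprocessing step amounts to rendering each page to an image and packing the images into the chat message as base64 data URLs. A sketch, assuming pages have already been rendered to PNG bytes (the rendering itself needs a library such as pdf2image and is not shown); the message shape follows the OpenAI chat format with `image_url` content parts:

```python
import base64

def pages_to_messages(page_pngs, prompt):
    """Build one chat message interleaving a text prompt with page images.

    page_pngs: list of raw PNG bytes, one per rendered PDF page.
    """
    content = [{"type": "text", "text": prompt}]
    for png in page_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

The returned list goes straight into the `messages` parameter of a chat completion call.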

Claude offers native PDF support with strong reasoning on nuanced documents. Particularly good at understanding contractual language and regulatory filings.

For high-volume classification where cost matters, the smaller models (Gemini Flash, GPT-4o mini, Claude Haiku) provide 80-90% of the accuracy at 10-20% of the cost.

Where single-document LLM classification breaks down

Just like with fuzzy matching, LLM document classification works brilliantly on one document. The problems emerge at scale.

1. Cost per page

Multimodal input is expensive. A single PDF page costs roughly 250-1,500 tokens depending on complexity — images use significantly more tokens than text.

Estimated cost to classify PDF documents (GPT-4o)
| Volume | Cost per doc | Total | Verdict |
|---|---|---|---|
| 100 documents (5 pg avg) | $0.03 | $3 | Trivial |
| 1,000 documents | $0.03 | $30 | Manageable |
| 10,000 documents | $0.03 | $300 | Adds up |
| 100,000 documents | $0.03 | $3,000 | Significant budget |

Costs assume 5 pages per document, processed as images with a structured extraction prompt. Gemini Flash and Claude Haiku reduce costs by 5-10x.

For classification alone, the cost per document is low — $0.02-0.05 each. But when you add extraction (pulling every field from every page), the cost per document can reach $0.10-0.50, and at 100,000 documents that’s $10,000-50,000.
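Back-of-envelope budgeting is worth doing before you start a batch. A sketch with illustrative defaults (the ~1,200 tokens per page and $5 per million input tokens are assumptions for the math, not quoted prices; plug in your model's actual rates):

```python
def estimate_cost(n_docs, pages_per_doc=5, tokens_per_page=1200,
                  price_per_mtok=5.00):
    """Rough input-token cost in dollars for classifying a batch of PDFs.

    Defaults are illustrative assumptions: ~1,200 tokens per page
    processed as an image, at $5 per million input tokens, which works
    out to about $0.03 per 5-page document.
    """
    total_tokens = n_docs * pages_per_doc * tokens_per_page
    return total_tokens / 1_000_000 * price_per_mtok
```

Under these assumptions, 10,000 documents come out to about $300, matching the table above.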

2. Speed and throughput

Processing a 5-page PDF through an LLM takes 3-10 seconds. With parallelism and rate limits, you can process maybe 20-50 documents per minute.

1,000 documents: 20-50 minutes. Fine. 10,000 documents: 3-8 hours. Tolerable. 100,000 documents: 1-3 days. You need a pipeline.
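The wall-clock math is simple enough to keep in a helper, so you can sanity-check a batch before kicking it off (the 30 docs/minute default is an assumed mid-range rate from the estimate above):

```python
def batch_hours(n_docs, docs_per_minute=30):
    """Wall-clock hours to process a batch at a given sustained rate,
    rate limits and parallelism already baked into docs_per_minute."""
    return n_docs / docs_per_minute / 60
```

At 30 documents per minute, 100,000 documents take roughly 55 hours of sustained processing, which is why retries, checkpointing, and resumability stop being optional at that scale.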

3. Consistency across batches

Ask an LLM to classify the same ambiguous document three times and you’ll get three slightly different answers. One run says “amendment,” another says “contract addendum,” a third says “supplemental agreement.” They’re all reasonable — but they break your downstream routing.

Traditional classifiers, once trained, produce the same label every time. LLMs require post-processing to normalize outputs, validate against allowed categories, and handle edge cases consistently.
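The usual fix is a normalization layer between the model and your router: map known free-text variants onto canonical labels and send anything unrecognized to a catch-all. A minimal sketch (the alias table is hypothetical; in practice you grow it from review corrections):

```python
# Hypothetical synonym table mapping the model's free-text variants
# onto canonical labels; extend it as corrections come in from review.
LABEL_ALIASES = {
    "contract addendum": "amendment",
    "supplemental agreement": "amendment",
    "invoice": "commercial_invoice",
}
ALLOWED = {"commercial_invoice", "proforma_invoice", "purchase_order",
           "contract", "amendment", "other"}

def normalize_label(raw):
    """Collapse case/punctuation variants, apply aliases, enforce taxonomy."""
    label = raw.strip().lower().replace("-", " ")
    label = LABEL_ALIASES.get(label, label).replace(" ", "_")
    return label if label in ALLOWED else "other"
```

Now "amendment", "contract addendum", and "supplemental agreement" all route the same way, and novel labels degrade safely to "other" instead of breaking downstream systems.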

4. No memory across documents

Each LLM call is independent. The model doesn’t know that the previous 50 documents were all invoices from the same vendor, or that document #847 is the amendment to document #312. Cross-document reasoning — understanding a document in the context of its batch — requires pipeline infrastructure the LLM doesn’t provide.

The right architecture for document classification at scale

The pattern is the same one that works for record matching: use LLMs where they add unique value, and use faster, cheaper tools everywhere else.

Tier 1: Metadata pre-classification (no LLM needed). File size, page count, text-to-image ratio, presence of form fields, PDF metadata tags. A 1-page PDF under 100KB with no images is almost certainly a letter or memo. A 50-page PDF with financial metadata tags is probably a report. These heuristics classify 20-30% of documents instantly, at zero cost.

Tier 2: Text-layer classification (cheap LLM or traditional NLP). Extract the text layer from the PDF (no OCR needed for native PDFs). Run it through a lightweight classifier — a fine-tuned model, keyword rules, or a cheap LLM like Gemini Flash Lite. This handles another 40-50% of documents where the text alone is sufficient for classification.

Tier 3: Full multimodal classification (frontier LLM). The remaining 20-30% — scanned documents, image-heavy PDFs, complex layouts where text extraction loses structure — get sent to a frontier multimodal model for full visual understanding. This is where Gemini Pro, GPT-4o, or Claude Opus earn their cost.
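The three tiers above reduce to a routing function over cheap PDF metadata. A sketch with hypothetical thresholds (tune them against your own corpus; the field names in `meta` are assumptions):

```python
def route_document(meta):
    """Pick a processing tier from cheap PDF metadata.

    meta: dict with page_count, file_kb, has_text_layer, and image_ratio
    (fraction of page area covered by images). Thresholds are illustrative.
    """
    # Tier 1: metadata heuristics resolve the obvious cases for free.
    if meta["page_count"] == 1 and meta["file_kb"] < 100 and meta["image_ratio"] == 0:
        return "tier1_metadata"
    # Tier 2: a native text layer is usually enough for classification.
    if meta["has_text_layer"] and meta["image_ratio"] < 0.2:
        return "tier2_text"
    # Tier 3: scanned or image-heavy docs need full multimodal reading.
    return "tier3_multimodal"
```

Everything routed to tiers 1 and 2 never touches a frontier model, which is where the cost savings in the comparison below come from.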

Cost comparison: every-doc LLM vs. tiered pipeline
| Approach | Description | Relative cost |
|---|---|---|
| Every doc → frontier LLM | Full multimodal processing on all documents | 100% |
| Tiered pipeline | Metadata + text-layer + LLM on hard cases only | 18% |

Relative cost for 10,000 documents. Tiered pipeline achieves the same accuracy by reserving expensive multimodal processing for documents that need it.

PDF classification for matching and deduplication

Here’s where document classification connects to the matching problem.

You have two datasets. One is a list of companies with their filings. The other is a folder of PDF documents — contracts, invoices, certificates — that need to be matched to the right company. Or you have two sets of documents that need to be deduplicated: find the pairs that are versions of the same underlying agreement.

This is file-based matching, and it’s a problem that pure text matching can’t solve.

Scenario 1: Matching documents to records. You have a CSV of vendors (name, address, tax ID) and a folder of invoices (PDFs). You need to match each invoice to the correct vendor. The LLM reads each invoice, extracts the vendor name, address, and tax ID, and the matching pipeline compares these extracted fields against your vendor list using the same string, embedding, and LLM confirmation pipeline used for CSV matching.

Scenario 2: Document deduplication. You have 5,000 contracts accumulated over 10 years. Many are duplicates — same agreement scanned at different times, or slight revisions of the same template. The LLM extracts key attributes (parties, dates, subject matter, dollar amounts), embeds the document summaries, and the matching pipeline finds pairs with high similarity. A human reviews the flagged duplicates.
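The similarity step of that dedup pipeline is plain vector math once the LLM has produced summaries and an embedding model has turned them into vectors. A sketch using toy vectors (any embedding model works; the brute-force O(n²) scan is fine at 5,000 documents):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidate_duplicates(embeddings, threshold=0.92):
    """Return index pairs whose summary embeddings exceed the threshold.

    embeddings: one vector per contract summary. The 0.92 threshold is
    an illustrative starting point; calibrate it on labeled pairs.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The flagged pairs are what go to a human for review; the LLM is never asked to compare all n² combinations directly.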

Scenario 3: Cross-format matching. You have product data in a CSV (name, SKU, description) and product spec sheets as PDFs. The LLM reads each spec sheet, extracts the product name, specifications, and model numbers, and the matching pipeline connects them to CSV records. The PDF might contain a product photo that helps disambiguate — “this spec sheet shows a red widget, and only one SKU in the CSV is a red widget.”

Document matching scenarios
| Scenario | Source A | Source B | What the LLM does | What the pipeline does |
|---|---|---|---|---|
| Invoice → Vendor | Vendor CSV | Invoice PDFs | Extract vendor name, address, amounts | Fuzzy match extracted fields to CSV |
| Contract dedup | Contract PDFs | Contract PDFs | Extract parties, dates, terms | Embed + cosine similarity on extractions |
| Spec sheet → Product | Product CSV | Spec sheet PDFs | Extract product name, specs, images | Match attributes + visual descriptions |
| Certificate → Entity | Company CSV | Cert PDFs | Read cert holder, issuer, dates | Match cert holder to company records |

In each case, the LLM handles the unstructured-to-structured conversion. The matching pipeline handles the structured comparison at scale.

The LLM’s job is the part that only an LLM can do: turn unstructured visual documents into structured data. The matching pipeline’s job is the part it does better than an LLM: compare thousands of structured records quickly, cheaply, and consistently.

Practical tips for PDF classification projects

Start with a sample. Classify 50 documents manually to understand your taxonomy. How many categories do you actually need? What are the ambiguous cases? What visual features distinguish similar document types?

Design your schema before prompting. Define your classification categories, extraction fields, and output format before writing a single prompt. A well-defined schema produces consistent outputs. A vague prompt produces creative but inconsistent results.

Use page 1 for classification, all pages for extraction. Most documents reveal their type on the first page. Send only page 1 for classification (cheaper, faster), then send the full document for field extraction only after you know what you’re looking for.

Validate LLM output programmatically. Parse the JSON response. Check that document_type is in your allowed list. Verify that dates are valid. Flag documents where confidence is below 0.8 for human review. Never trust LLM output without validation.
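A validation gate can be a short function that parses the reply, enforces the taxonomy, and routes low-confidence records to review. A minimal sketch (date checks and other field-level rules are omitted for brevity; the return convention is illustrative):

```python
import json

ALLOWED_TYPES = {
    "commercial_invoice", "proforma_invoice", "purchase_order", "contract",
    "amendment", "spec_sheet", "certificate", "tax_form",
    "financial_report", "correspondence", "other",
}

def validate_classification(raw_json, min_confidence=0.8):
    """Parse and sanity-check one LLM reply; return (record, status)."""
    try:
        rec = json.loads(raw_json)
    except json.JSONDecodeError:
        return None, "unparseable"
    if rec.get("document_type") not in ALLOWED_TYPES:
        return None, "unknown_type"
    conf = rec.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        return None, "bad_confidence"
    if conf < min_confidence:
        return rec, "needs_review"   # valid, but route to a human
    return rec, "ok"
```

Anything that isn't "ok" lands in a review queue instead of silently corrupting downstream routing.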

Build feedback loops. When a human corrects a classification, log the correction. After 100 corrections, you’ll know which document types the LLM struggles with — and you can improve your prompt or add examples for those categories.

Consider fine-tuning for high volume. If you’re classifying the same types of documents repeatedly (e.g., processing invoices daily), a fine-tuned model or a trained traditional classifier will be cheaper and more consistent than prompting a frontier model every time.

So, can Gemini read your PDFs?

Yes. And so can ChatGPT, Claude, and increasingly capable open-source models. Multimodal LLMs understand documents the way you do — they see the text, the tables, the images, the layout, and the context. They can classify, extract, summarize, and reason about PDFs with remarkable accuracy.

But “can it read a PDF” and “should you process 10,000 PDFs through it” are different questions.

For a handful of documents, upload them to your favorite chatbot. It’ll classify them accurately, extract the fields you need, and even read the charts.

For hundreds or thousands of documents, you need a pipeline — one that uses metadata heuristics and text-layer analysis for the easy cases, and reserves multimodal LLM processing for the documents that genuinely need visual understanding. The same tiered architecture that makes LLM-powered record matching practical makes LLM-powered document classification practical.


Match Data Studio supports file-based matching — upload PDFs and images alongside your CSVs, and the pipeline extracts attributes from documents using AI, then matches them against your structured data. The same hybrid architecture that handles text matching at scale handles document matching too.

Try it free →