vlm.md
← All Recipes · Document Understanding · Beginner

Extract Structured Data from Invoice Images

Have your agent automatically extract vendor, amount, date, and other fields from scanned invoices and output JSON. Includes real production gotchas.

4/29/2026 · vlm.md · Recommended models: GPT-4oClaude 3.5 SonnetGemini 1.5 Pro

Scenario

Your agent needs to process user-uploaded invoice images (PDF scans or phone photos) and automatically extract:

  • Vendor name
  • Invoice number
  • Invoice date
  • Subtotal, tax amount, total
  • Payment due date (if present)

Output structured JSON for downstream systems (ERP, accounting software).

ModelWhen to use
GPT-4oBest overall; strong on diverse invoice formats
Claude 3.5 SonnetMost consistent structured output; lowest JSON error rate
Gemini 1.5 ProBest value for multi-page or long documents

For non-English invoices, prefer GPT-4o or Claude. For English invoices, all three perform similarly.

Prompt Template

You are an invoice data extraction expert. Extract the following fields from the image and return ONLY valid JSON — no explanation, no markdown.

Required fields:
- vendor_name: seller/vendor name
- invoice_number: invoice ID or number
- invoice_date: date issued (YYYY-MM-DD)
- subtotal: pre-tax amount (number, no currency symbol)
- tax_amount: tax amount (number, no currency symbol)
- total_amount: total including tax (number, no currency symbol)
- currency: currency code (e.g. USD, EUR, CNY)
- due_date: payment due date (YYYY-MM-DD, or null if not present)

Return null for any field not found in the image.

Code

import base64
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = """You are an invoice data extraction expert. Extract the following fields from the image and return ONLY valid JSON — no explanation, no markdown.

Required fields:
- vendor_name: seller/vendor name
- invoice_number: invoice ID or number
- invoice_date: date issued (YYYY-MM-DD)
- subtotal: pre-tax amount (number, no currency symbol)
- tax_amount: tax amount (number, no currency symbol)
- total_amount: total including tax (number, no currency symbol)
- currency: currency code (e.g. USD, EUR, CNY)
- due_date: payment due date (YYYY-MM-DD, or null if not present)

Return null for any field not found in the image."""

def extract_invoice(image_path: str) -> dict:
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    suffix = Path(image_path).suffix.lower().lstrip(".")
    mime_type = {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png"}.get(suffix, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are an invoice extraction assistant. Output JSON only."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{image_data}"}},
                    {"type": "text", "text": PROMPT},
                ],
            },
        ],
        max_tokens=512,
    )

    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    result = extract_invoice("invoice.jpg")
    print(json.dumps(result, indent=2))

Run:

pip install openai
python extract_invoice.py

Expected output:

{
  "vendor_name": "Acme Corp",
  "invoice_number": "INV-2024-0042",
  "invoice_date": "2024-03-15",
  "subtotal": 1000.00,
  "tax_amount": 100.00,
  "total_amount": 1100.00,
  "currency": "USD",
  "due_date": "2024-04-15"
}

Gotchas

Gotcha 1: Model wraps JSON in a markdown code block

Without response_format={"type": "json_object"}, GPT-4o sometimes wraps the output in ```json ... ```. The code above uses json_object mode to prevent this. Note: this mode requires mentioning “JSON” somewhere in the system prompt.

Gotcha 2: Low-quality or skewed phone photos

VLMs handle light skew (<15°) well, but image size matters. Pre-resize large images before sending:

from PIL import Image

def preprocess(path: str) -> str:
    img = Image.open(path)
    if max(img.size) > 2048:
        img.thumbnail((2048, 2048))
        out = path.replace(".", "_resized.")
        img.save(out)
        return out
    return path

Gotcha 3: Multi-page PDFs

The OpenAI API doesn’t accept PDF files directly. Convert to images first:

from pdf2image import convert_from_path

def pdf_to_images(pdf_path: str) -> list[str]:
    pages = convert_from_path(pdf_path, dpi=200)
    paths = []
    for i, page in enumerate(pages):
        p = f"/tmp/invoice_page_{i}.jpg"
        page.save(p, "JPEG")
        paths.append(p)
    return paths

Extract each page separately, then merge results. For multi-page invoices, the total is usually on the last page.