Convert Image Tables to Structured JSON

Scenario

Your agent encounters tables embedded in screenshots, scanned documents, or presentation slides — not as HTML or Excel data, but as rasterized images. You need to extract them as structured data (list of dicts or CSV) for downstream processing.

Common use cases:

Extract fee rate tables from scanned contracts
Convert spec comparison tables in competitor slide decks to database rows
Process report data from legacy system screenshots

Recommended Models

Model	When to use
GPT-4o	Best overall; most accurate on complex multi-level headers
Claude 3.5 Sonnet	Most consistent JSON formatting; best handling of empty cells
Gemini 1.5 Pro	Best value for large tables or multi-page scanned documents

For simple tables all three are comparable. When headers are complex (merged cells, multi-row headers), prefer GPT-4o or Claude.

Prompt Template

You are a table data extraction expert. Extract the table from the image and return ONLY the JSON below — no explanation, no markdown.

Rules:
1. headers: Extract all column headers. For merged/spanning headers use "ParentHeader/ChildHeader" format (e.g. "Sales/Q1").
2. rows: Each row is a dict with column headers as keys.
3. Empty cells: Use null for genuinely empty cells. Use "__UNREADABLE__" if there is content you cannot read.
4. Number formatting: Preserve the original format (e.g. "1,234.56") — do not convert.

Return format:
{
  "headers": ["col1", "col2", ...],
  "rows": [
    {"col1": "value", "col2": "value", ...},
    ...
  ],
  "notes": "Any notes about table structure such as merged cells"
}

Code

import base64
import json
import csv
import io
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = """You are a table data extraction expert. Extract the table from the image and return ONLY the JSON below — no explanation, no markdown.

Rules:
1. headers: Extract all column headers. For merged/spanning headers use "ParentHeader/ChildHeader" format (e.g. "Sales/Q1").
2. rows: Each row is a dict with column headers as keys.
3. Empty cells: Use null for genuinely empty cells. Use "__UNREADABLE__" if there is content you cannot read.
4. Number formatting: Preserve the original format (e.g. "1,234.56") — do not convert.

Return format:
{
  "headers": ["col1", "col2", ...],
  "rows": [
    {"col1": "value", "col2": "value", ...},
    ...
  ],
  "notes": "Any notes about table structure such as merged cells"
}"""


def extract_table(image_path: str) -> dict:
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    suffix = Path(image_path).suffix.lower().lstrip(".")
    mime_type = {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png"}.get(
        suffix, "image/jpeg"
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "You are a table extraction assistant. Output valid JSON only.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{image_data}"},
                    },
                    {"type": "text", "text": PROMPT},
                ],
            },
        ],
        max_tokens=2048,
    )

    return json.loads(response.choices[0].message.content)


def table_to_csv(table_data: dict) -> str:
    """Convert extracted table to a CSV string."""
    headers = table_data.get("headers", [])
    rows = table_data.get("rows", [])

    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Replace null with empty string for CSV convention
        writer.writerow({k: ("" if v is None else v) for k, v in row.items()})
    return buf.getvalue()


def parse_number(value: str, locale: str = "en_US") -> float | None:
    """Parse a locale-formatted number string to float."""
    if value is None or value == "__UNREADABLE__":
        return None
    if locale == "en_US":
        # "1,234.56" -> 1234.56
        return float(value.replace(",", "").strip())
    elif locale == "de_DE":
        # "1.234,56" -> 1234.56
        return float(value.replace(".", "").replace(",", ".").strip())
    return None


if __name__ == "__main__":
    result = extract_table("table_screenshot.png")
    print("=== JSON result ===")
    print(json.dumps(result, indent=2))

    if result.get("notes"):
        print(f"\nNotes: {result['notes']}")

    print("\n=== CSV format ===")
    print(table_to_csv(result))

Run:

pip install openai
python extract_table.py

Expected output:

{
  "headers": ["Product", "Q1 Revenue", "Q2 Revenue", "YoY Growth"],
  "rows": [
    {"Product": "Product A", "Q1 Revenue": "1,234,567", "Q2 Revenue": "1,456,789", "YoY Growth": "18.0%"},
    {"Product": "Product B", "Q1 Revenue": "890,000", "Q2 Revenue": null, "YoY Growth": null},
    {"Product": "Product C", "Q1 Revenue": "__UNREADABLE__", "Q2 Revenue": "2,100,000", "YoY Growth": "5.3%"}
  ],
  "notes": "Original table has a merged header cell spanning Q1-Q3 columns; Product B Q2 cell is blank"
}

Gotchas

Gotcha 1: Merged header cells

When a table has two header rows (e.g., row 1 is “Sales Data” spanning three columns, row 2 is “Q1”, “Q2”, “Q3”), models typically extract only the bottom row and drop the grouping context. Using the Parent/Child header format preserves the hierarchy so downstream code can group columns semantically.

Gotcha 2: Empty cells vs missing cells are different

null (genuinely blank) and "__UNREADABLE__" (content present but illegible) are semantically different. Conflating them causes data quality issues: blank cells can be filled with defaults; unreadable cells need human review. The prompt above explicitly distinguishes them — make sure your post-processing does too.

Gotcha 3: Number format locale differences

1,234.56 (US/UK) and 1.234,56 (Germany/Brazil) use opposite separators. Calling float("1,234.56") raises a ValueError. The safest approach:

Ask the model to preserve the original format unchanged
Parse with locale-aware logic in code:

def parse_number(value: str, locale: str = "en_US") -> float | None:
    if not value:
        return None
    if locale == "en_US":
        return float(value.replace(",", ""))
    elif locale == "de_DE":
        return float(value.replace(".", "").replace(",", "."))
    return None

If the source locale is unknown, log the raw string and defer conversion until locale is confirmed rather than guessing.