Extracting structured tables from PDFs is harder than it looks.
PDF files do not store tables as structured data. Instead, they position text at specific coordinates on the page.
Table extraction tools must reconstruct the structure by determining which values belong in which rows and columns.
The problem becomes even harder when tables include multi-level headers, merged cells, or complex layouts.
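To make the reconstruction step concrete, here is a minimal sketch of the idea, not how LlamaParse, Marker, or Docling actually work. It assumes each word arrives with an x/y position where y grows down the page, and it uses made-up sample data. The names `Word`, `cluster`, `nearest`, and `rebuild_table`, and the tolerance values, are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word (assumed units: points)
    y: float  # vertical position, assumed to grow down the page


def cluster(values, tolerance):
    """Group sorted coordinates into clusters and return each cluster's mean.

    A new cluster starts whenever the gap to the previous value exceeds
    `tolerance`.
    """
    centres, group = [], []
    for v in sorted(values):
        if group and v - group[-1] > tolerance:
            centres.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        centres.append(sum(group) / len(group))
    return centres


def nearest(centres, value):
    """Index of the cluster centre closest to `value`."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def rebuild_table(words, row_tol=3.0, col_tol=10.0):
    """Place each word into a row/column grid inferred from its coordinates."""
    rows = cluster([w.y for w in words], row_tol)
    cols = cluster([w.x for w in words], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    # Visit words in reading order so multi-word cells join left to right.
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        r, c = nearest(rows, w.y), nearest(cols, w.x)
        grid[r][c] = (grid[r][c] + " " + w.text).strip()
    return grid


# Made-up sample: positions are slightly noisy, as they often are in real PDFs.
words = [
    Word("Item", 50.0, 100.0), Word("Qty", 200.0, 100.0),
    Word("Apples", 50.0, 120.5), Word("3", 200.0, 119.8),
    Word("Pears", 50.4, 140.0), Word("5", 201.0, 140.2),
]
table = rebuild_table(words)
for row in table:
    print(row)
```

A sketch like this only works for a simple grid. A merged header cell that spans several columns, or a second header level, no longer lines up with a single column position, which is one reason real tools need more than coordinate clustering.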
To explore this problem, I experimented with three tools designed for PDF table extraction: LlamaParse, Marker, and Docling. Each tool takes a different approach.
Performance overview:
• Docling: Fastest local option, but struggles with complex tables
• Marker: Handles complex layouts well and runs locally, but is much slower than the other two
• LlamaParse: Most accurate on complex tables and fastest overall, but requires a cloud API
In this article, I share the code, examples, and results from testing each tool.
Here's the link: bit.ly/4cBL2fG