Why AI still struggles with PDF files

Newspoint | Feb 24 2026 09:38:41 IST

Artificial Intelligence (AI) has made significant strides in recent years, but it still struggles with one of the most common file formats: PDF.

The problem was highlighted when tech entrepreneur Luke Igel and his friends tried to navigate through 20,000 pages of documents related to Jeffrey Epstein's estate.

They found the PDF viewer difficult to use, and poor OCR quality made it hard to search for specific information.

'Unsexy failures' of AI
Parsing challenges

Despite the rapid progress in AI's ability to solve complex software and physics problems, PDF remains a major challenge.

Edwin Chen, CEO of data company Surge, has called it one of AI's "unsexy failures" that limit its real-world applicability.

He discovered that even advanced models tasked with extracting information from a PDF often summarize it instead or confuse footnotes with body text.

Why PDFs are so difficult for AI
Extraction difficulties

Extracting information from PDFs is more complicated than it seems.

This is because PDFs were never designed to be machine-readable. They were developed by Adobe in the early '90s as a way to share documents while preserving their exact visual appearance, first on paper and later on screens.

Unlike formats such as HTML, which represent text in logical order, a PDF consists of character codes, coordinates, and other instructions for rendering an image of a page.
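To make the point concrete, here is a minimal sketch of why positioned text is harder to read than it looks. It does not parse a real PDF: the `Fragment` type, the `drawn` list, and both helper functions are hypothetical, invented for illustration. The one assumption taken from the format itself is that PDF coordinates are measured from the bottom-left corner of the page by default.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One piece of positioned text (hypothetical, for illustration only)."""
    x: float  # distance from the left edge of the page, in points
    y: float  # distance from the bottom edge of the page, in points
    text: str


# Hypothetical fragments, listed in the order a file happens to draw them.
# Nothing obliges that order to match the order a person would read.
drawn = [
    Fragment(72, 700, "Quarterly"),
    Fragment(72, 680, "Revenue rose"),
    Fragment(130, 700, "Report"),
    Fragment(150, 680, "4 percent."),
]


def in_draw_order(fragments):
    """Join the text exactly as stored, ignoring position."""
    return " ".join(f.text for f in fragments)


def in_reading_order(fragments):
    """Guess reading order: top to bottom (larger y first), then left to right."""
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    return " ".join(f.text for f in ordered)


print(in_draw_order(drawn))     # Quarterly Revenue rose Report 4 percent.
print(in_reading_order(drawn))  # Quarterly Report Revenue rose 4 percent.
```

Sorting by position recovers the sentence here only because the toy page has a single column. The reading order is a guess made by the extractor, not something the file states.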

OCR can convert images to text but struggles with layouts
OCR challenges

Optical character recognition (OCR) can convert these images of words back into machine-readable text, but it struggles with the multi-column layouts often found in academic papers, which leads to jumbled output.
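The jumbling can be illustrated with a small sketch. The word positions below are hypothetical, the gutter position is assumed to be known in advance, and neither function reflects how any particular OCR engine works. Real tools have to detect columns themselves, which is where much of the difficulty lies.

```python
# Each entry is (x, y, text): hypothetical positions on a two-column page,
# with y measured from the bottom of the page.
words = [
    (50, 700, "PDFs store"),
    (320, 700, "OCR reads"),
    (50, 680, "glyph positions."),
    (320, 680, "pixels back."),
]


def row_wise(items):
    """Naive ordering: read straight across the page, row by row."""
    ordered = sorted(items, key=lambda w: (-w[1], w[0]))
    return " ".join(text for _, _, text in ordered)


def column_aware(items, gutter_x):
    """Read the left column top to bottom, then the right column.

    gutter_x is an assumed, already-known x position separating the columns.
    """
    left = sorted((w for w in items if w[0] < gutter_x), key=lambda w: -w[1])
    right = sorted((w for w in items if w[0] >= gutter_x), key=lambda w: -w[1])
    return " ".join(text for _, _, text in left + right)


print(row_wise(words))           # PDFs store OCR reads glyph positions. pixels back.
print(column_aware(words, 300))  # PDFs store glyph positions. OCR reads pixels back.
```

The row-wise version interleaves the two columns, which is the kind of jumbled output described above.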

When given a PDF, an AI assistant like ChatGPT tries various extraction tools, but the results are often uneven because of the format's inherent difficulties.

Models rarely trained on PDFs
Training issues

Another problem is that models are rarely trained on PDFs.

This is changing as AI developers look for high-quality data, and PDFs provide a lot of it.

Government reports, textbooks, and academic papers are all commonly published in PDF format.

Researchers at the Allen Institute for AI last year suggested that "PDF documents have the potential to provide trillions of novel high-quality tokens for training language models."
