How Modern OCR Works: From Pixels to Structured Text

A technical deep-dive into how large vision models read documents, why classical OCR struggles with real-world documents, and what makes AI-powered extraction different.

David Reeves · November 12, 2025 · 8 min read

Optical Character Recognition has been a solved problem since the 1990s — or so the conventional wisdom goes. Ask anyone who has tried to extract data from a scanned mortgage document, a crumpled receipt, or a passport photographed in suboptimal lighting, and they'll tell you a different story.

Classical OCR engines, such as Tesseract's legacy (pre-LSTM) engine, work by segmenting an image into lines, then words, then characters, and classifying each character with a model trained on a fixed set of fonts. This works reasonably well for clean, typeset documents in well-supported languages. It falls apart almost everywhere else.
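
To make the first stage of that pipeline concrete, here is a minimal sketch of line segmentation using a horizontal projection profile: count the ink pixels in each row and treat runs of non-empty rows as text lines. It assumes a binarized image passed as a 2D array (1 = ink), and it illustrates the general technique rather than how any particular engine implements it.

```typescript
// Binarized image: rows of pixels, 1 = ink, 0 = background (an assumed input format).
type BinaryImage = number[][];

interface LineSpan {
  top: number;    // first row containing ink
  bottom: number; // last row containing ink (inclusive)
}

// Split a page into text lines using a horizontal projection profile:
// count ink pixels per row, then group consecutive non-empty rows.
function segmentLines(image: BinaryImage, minInk = 1): LineSpan[] {
  const profile = image.map((row) => row.reduce((sum, px) => sum + px, 0));
  const lines: LineSpan[] = [];
  let start = -1;

  profile.forEach((ink, y) => {
    if (ink >= minInk && start === -1) {
      start = y; // entering a text line
    } else if (ink < minInk && start !== -1) {
      lines.push({ top: start, bottom: y - 1 }); // leaving a text line
      start = -1;
    }
  });

  // Close a line that runs to the bottom edge of the image.
  if (start !== -1) lines.push({ top: start, bottom: image.length - 1 });
  return lines;
}
```

The approach depends on clean, empty rows between lines of text, which is exactly the assumption that real-world scans tend to violate.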

Why classical OCR struggles

The failure modes of classical OCR aren't random — they follow predictable patterns:

Skew and perspective distortion. A document photographed at a 15-degree angle will produce significantly degraded output from most classical engines. Pre-processing (deskewing, dewarping) helps but adds latency and its own failure modes.
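
A quick calculation shows why a 15-degree tilt hurts row-based line segmentation so badly. The sketch below computes how far a baseline drifts vertically across the width of a line; the 1000 px width and 40 px line spacing are illustrative assumptions, not measurements.

```typescript
// Vertical drift, in pixels, of a text baseline across `widthPx`
// when the page is rotated by `angleDegrees`.
function baselineDrift(widthPx: number, angleDegrees: number): number {
  return widthPx * Math.tan((angleDegrees * Math.PI) / 180);
}

// Illustrative numbers (assumptions): a 1000 px wide line, 40 px between baselines.
const drift = baselineDrift(1000, 15); // roughly 268 px
const linesCrossed = drift / 40;       // roughly 6.7 line heights
```

Under these assumptions a single line of text cuts across more than six "rows" of the page, so any method that expects horizontal gaps between lines no longer finds them.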

Noise and degradation. Coffee stains, wrinkles, low-quality scans, and JPEG compression artifacts all introduce pixel-level noise that confuses character segmentation.

Non-standard fonts and handwriting. Classical models are trained on finite font libraries. Anything unusual — a decorative header, a custom brand typeface, handwritten annotations — degrades accuracy substantially.

Layout complexity. Multi-column layouts, tables with merged cells, and documents where text flows around images require explicit layout analysis before character recognition can begin.

What AI-powered OCR does differently

Modern vision-language models approach document reading differently. Rather than a pipeline of discrete stages (segment → recognize → assemble), they treat the document as a holistic visual input and generate text as a sequence prediction problem.

The implications are significant:

  1. Context awareness. A vision model reading a receipt expects a dollar amount after "Total:". It can resolve ambiguous characters (is that a 0 or an O?) using semantic context, not just visual features.

  2. Layout agnostic. The model doesn't need to segment the document before reading it. It can attend to multiple regions simultaneously and understand spatial relationships implicitly.

  3. Zero-shot and few-shot generalization. Vision models often generalize to new document types, languages, and layouts without retraining, because they've learned rich visual and linguistic representations from diverse pre-training data.
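
The context-awareness point can be illustrated with a toy example. A vision-language model does this implicitly inside a learned network, not with explicit lookup tables, so treat the sketch below purely as an illustration of the idea: the final choice for an ambiguous glyph combines how it looks with what the surrounding text makes likely. All scores and priors here are made-up values.

```typescript
// Toy illustration: combine per-character visual scores with a context prior.
type Scores = Record<string, number>;

function disambiguate(visual: Scores, contextPrior: Scores): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [char, visualScore] of Object.entries(visual)) {
    // Characters missing from the prior get a small floor rather than zero.
    const score = visualScore * (contextPrior[char] ?? 0.01);
    if (score > bestScore) {
      best = char;
      bestScore = score;
    }
  }
  return best;
}

// The glyph looks almost equally like "0" and "O".
const visual: Scores = { "0": 0.48, "O": 0.52 };

// Hypothetical priors: digits are likely after "Total:", letters inside a name.
const afterTotal: Scores = { "0": 0.9, "O": 0.1 };
const insideName: Scores = { "0": 0.05, "O": 0.95 };

disambiguate(visual, afterTotal); // "0"
disambiguate(visual, insideName); // "O"
```

The same pixels produce different answers depending on where they appear, which a purely visual per-character classifier cannot do.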

How Quantilence OCR works

Our OCR endpoint sends your document to Claude with a precisely engineered prompt that elicits faithful text extraction. Claude's vision capabilities were trained on an enormous diversity of document types, giving it strong priors for reading commercial and government documents.

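// `client` is assumed to be an already-initialized Quantilence client;
// `documentBuffer` holds the raw bytes of the document.
const result = await client.ocr.extract({
  file: documentBuffer,
  options: {
    preserveLayout: true, // keep layout markers in the extracted text
    language: "auto",     // detect the document language automatically
  },
});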

The response includes the extracted text with layout markers preserved, an overall confidence score, and detected language metadata.
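
As a rough sketch of how a caller might consume such a response, the type below mirrors the three pieces of information just described. The field names, the 0 to 1 confidence range, and the `needsReview` helper are assumptions for illustration, not the documented API.

```typescript
// Hypothetical response shape; field names are assumptions, not the documented API.
interface OcrResult {
  text: string;       // extracted text, with layout markers if preserveLayout was set
  confidence: number; // overall confidence, assumed to be in the range 0..1
  language: string;   // detected language, assumed to be a code such as "en"
}

// Route low-confidence extractions to manual review instead of trusting them blindly.
function needsReview(result: OcrResult, threshold = 0.9): boolean {
  return result.confidence < threshold;
}
```

A threshold like this is a design choice that depends on how costly an extraction error is in your pipeline; 0.9 is only a placeholder.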

When to use OCR vs Document AI

OCR is the right tool when you need raw text — feeding into a search index, processing articles, or handling documents without a well-defined schema.

Document AI (our structured extraction product) is the right choice when you need specific, typed fields from documents with known structure: passport details, invoice line items, contract clauses.

The two products are complementary: many pipelines run OCR for full-text search and Document AI for structured field extraction in parallel.
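
One way to structure that parallel pattern is sketched below. The two calls are passed in as functions because their real signatures are not shown here; `runOcr`, `runDocumentAi`, and the result types are hypothetical names used only for illustration.

```typescript
// Hypothetical result types and call signatures, for illustration only.
type ExtractedFields = Record<string, string>;

interface PipelineResult {
  fullText: string;        // raw text, e.g. for a search index
  fields: ExtractedFields; // typed fields, e.g. invoice totals
}

async function processDocument(
  file: Uint8Array,
  runOcr: (file: Uint8Array) => Promise<string>,
  runDocumentAi: (file: Uint8Array) => Promise<ExtractedFields>,
): Promise<PipelineResult> {
  // Start both requests before awaiting either, so they run concurrently.
  const [fullText, fields] = await Promise.all([runOcr(file), runDocumentAi(file)]);
  return { fullText, fields };
}
```

Note that `Promise.all` rejects as soon as either call fails; if partial results are acceptable, `Promise.allSettled` is the usual alternative.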

Accuracy benchmarks

We measure OCR accuracy across four document classes:

| Document type | Accuracy |
|---------------|----------|
| Printed, clean | 99.4% |
| Printed, degraded | 97.8% |
| Handwritten | 89.2% |
| Mixed (printed + handwritten) | 94.6% |

Degraded documents include JPEG artifacts, light water damage, and moderate skew (up to 20 degrees). Accuracy is measured at the character level, computed from the edit distance between the extracted text and the ground truth.
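
For readers who want to run a comparable measurement on their own documents, here is a minimal sketch of a character-level accuracy metric based on Levenshtein edit distance. The normalization (one minus distance divided by ground-truth length) is a common convention and an assumption here; the exact formula behind the figures above is not specified.

```typescript
// Levenshtein distance: minimum insertions, deletions, and substitutions
// needed to turn `a` into `b`. Uses two rolling rows of the DP table.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete from a
        curr[j - 1] + 1,    // insert into a
        prev[j - 1] + cost, // substitute (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / ground-truth length, floored at 0.
function characterAccuracy(predicted: string, groundTruth: string): number {
  if (groundTruth.length === 0) return predicted.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(predicted, groundTruth) / groundTruth.length);
}

// Two wrong characters out of 13: accuracy is 1 - 2/13, about 0.846.
characterAccuracy("Tota1: $42.O0", "Total: $42.00");
```

This sketch compares UTF-16 code units, which is fine for most Latin-script text; scripts that rely on combining marks or surrogate pairs would need grapheme-aware splitting first.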

Conclusion

The gap between classical OCR and vision-model-based extraction is widest precisely where it matters most: real-world documents under non-ideal conditions. If you're building a document processing pipeline and accuracy on edge cases matters, the architecture choice is worth revisiting.

The Quantilence OCR API is available with 500 free requests per month. Try it with your own documents.