OCR Explained: How Scanners Make Documents Searchable

OCR is one of the key features to look for in a scan-to-SharePoint workflow. This guide explains what OCR is, how it works, and why it's the difference between a useful document archive and a digital filing cabinet you can't search.

What OCR Is (in Plain English)

When a scanner creates an image of a page, it produces a picture — a grid of coloured pixels that looks like text to a human, but contains no actual text data that a computer can read or search.

OCR (Optical Character Recognition) analyses that pixel image and identifies the characters in it — letters, numbers, punctuation — and converts them into actual text data. The output is a file that contains both the original scanned image (so it looks identical to the paper original) and an invisible text layer underneath it.

Without OCR: searching for "Invoice Steelworks Direct March 2025" finds nothing — the scanner created a picture, not a document. With OCR: the same search finds the document instantly, even if it was filed under a different name, because the text content is searchable.

Why OCR Matters for Document Management

Metadata (the fields you fill in when scanning — supplier name, document type, date) makes documents findable by the information you enter at scan time. OCR makes documents findable by their content — any word anywhere in the document.

In practice: you scan 500 invoices over a year. You configured the Supplier Name metadata field, so you can find all invoices from a specific supplier. But what if you remember a specific line item or product description, not the supplier? With OCR, searching for that phrase finds the right invoice. Without OCR, you're scrolling through all 500.
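
The difference between the two kinds of search can be sketched in a few lines of Python. This is an illustration only: the file names, the supplier field, and the sample text are made up, and SharePoint's own search index works differently under the hood. The point is simply that metadata search can only match what was typed in at scan time, while content search matches any word OCR recovered from the page.

```python
from collections import defaultdict

# Hypothetical archive: scan-time metadata plus the OCR text layer (if any).
documents = {
    "inv-001.pdf": {"supplier": "Steelworks Direct",
                    "ocr_text": "Invoice 40 x M12 hex bolts zinc plated"},
    "inv-002.pdf": {"supplier": "Acme Fasteners",
                    "ocr_text": "Invoice 12 x stainless steel hinge 75mm"},
    "inv-003.pdf": {"supplier": "Steelworks Direct",
                    "ocr_text": None},  # image-only scan, no OCR
}

def build_index(docs):
    """Map each lower-cased word to the documents whose OCR text contains it."""
    index = defaultdict(set)
    for name, doc in docs.items():
        for word in (doc["ocr_text"] or "").lower().split():
            index[word].add(name)
    return index

def search_content(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

def search_metadata(docs, supplier):
    """Return the documents whose Supplier Name field matches exactly."""
    return {name for name, doc in docs.items() if doc["supplier"] == supplier}

index = build_index(documents)
print(sorted(search_metadata(documents, "Steelworks Direct")))
print(sorted(search_content(index, "hex bolts")))
```

Note that inv-003.pdf can be found by supplier but never by content, because it has no text layer to index.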

For compliance purposes: OCR makes it possible to respond to a Subject Access Request by searching for an individual's name across your entire document archive — something that would take days manually but takes seconds with full-text search on an OCR-processed archive.

How OCR Works

Modern OCR has four stages:

Pre-processing: The scanned image is cleaned up — skew correction (straightening pages that went through the ADF at a slight angle), noise reduction, contrast enhancement. This significantly improves recognition accuracy.
Layout analysis: The OCR engine identifies text regions, distinguishing text from images, tables, and graphics on the page.
Character recognition: Each character in the text regions is analysed and identified. Older OCR engines matched character shapes against font pattern libraries; modern engines use neural networks trained on millions of document samples.
Output generation: The recognised text is embedded in the document as a hidden text layer, creating a searchable PDF while preserving the original image appearance.
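
The four stages above can be sketched as a toy pipeline in Python. Everything here is a simplifying assumption made for illustration: the 3x5 glyph templates, the fixed threshold of 128, and the column-gap segmentation are stand-ins, and the recognition step uses simple template matching rather than the neural networks a real engine would use. Skew correction is left out entirely.

```python
# Hypothetical 3x5 glyph templates ("#" = ink). Real engines use trained models.
TEMPLATES = {
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "0": ["###", "#.#", "#.#", "#.#", "###"],
}

def render(text, ink=30, paper=230):
    """Build a fake greyscale scan of `text` (stands in for the scanner)."""
    rows = [[] for _ in range(5)]
    for ch in text:
        for y in range(5):
            rows[y] += [ink if c == "#" else paper for c in TEMPLATES[ch][y]] + [paper]
    return rows

def binarise(grey, threshold=128):
    """Stage 1, pre-processing: grey levels (0 = black) become ink / no ink."""
    return [[pixel < threshold for pixel in row] for row in grey]

def segment_columns(bitmap):
    """Stage 2, layout analysis (toy): split glyphs at columns with no ink."""
    width = len(bitmap[0])
    spans, start = [], None
    for x in range(width):
        has_ink = any(row[x] for row in bitmap)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return [[row[a:b] for row in bitmap] for a, b in spans]

def recognise(glyph):
    """Stage 3, recognition (toy): template with the fewest differing pixels."""
    def distance(template):
        return sum(
            (t == "#") != g
            for t_row, g_row in zip(template, glyph)
            for t, g in zip(t_row, g_row)
        )
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))

def ocr(grey):
    bitmap = binarise(grey)
    glyphs = segment_columns(bitmap)
    text = "".join(recognise(g) for g in glyphs)
    # Stage 4, output: keep the original image and attach the text layer.
    return {"image": grey, "text": text}

scan = render("1070")
scan[0][3] = 200  # faint smudge between characters: removed by thresholding
scan[2][0] = 90   # dark speck touching the "1": survives as one wrong pixel

result = ocr(scan)
print(result["text"])
```

The smudge disappears at the thresholding stage, while the speck survives as a single wrong pixel and the "1" is still picked as the closest template. That is a small-scale version of why clean-up before recognition improves accuracy.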

OCR Quality — What Affects Accuracy

Scan resolution

300dpi (dots per inch) is the generally recommended minimum for reliable OCR. 400dpi improves accuracy for small or condensed fonts. Scanning at 200dpi produces images that look fine to the eye but have noticeably lower OCR accuracy, because each character is drawn with fewer pixels. Most modern document scanners default to 300dpi, but confirm your scanner settings haven't been lowered to reduce file size.
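
To see why resolution matters, it helps to work out how many pixels the OCR engine actually gets to look at. The short Python sketch below does the arithmetic for an A4 page; the function names are illustrative, and the font height is the nominal point size (1 point = 1/72 inch), so real lowercase letters are somewhat smaller than the figure shown.

```python
A4_MM = (210, 297)   # A4 page size in millimetres
MM_PER_INCH = 25.4

def page_pixels(dpi, size_mm=A4_MM):
    """Width and height of the scanned image in pixels."""
    width_mm, height_mm = size_mm
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

def font_height_pixels(point_size, dpi):
    """Nominal height of a font in pixels: points / 72 gives inches."""
    return point_size / 72 * dpi

for dpi in (200, 300, 400):
    w, h = page_pixels(dpi)
    small_print = font_height_pixels(8, dpi)
    print(f"{dpi}dpi: {w} x {h} px, 8pt text is about {small_print:.0f} px tall")
```

Going from 200dpi to 300dpi gives each character 1.5 times the height and width, so about 2.25 times as many pixels, which is the extra detail the recognition stage relies on for small print.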

Document condition

Faded ink, creased paper, and low-contrast originals all reduce OCR accuracy. A heavily worn job card from a noisy production environment will OCR less accurately than a clean printed invoice. For important documents with poor condition, check OCR output quality and consider manual verification of key fields.
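
One common way to check OCR output quality is to retype a small sample by hand and measure how far the OCR text is from it. The Python sketch below does this with edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into the other). The function names and the sample strings are illustrative assumptions, not part of any particular scanner or OCR product.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def character_accuracy(ocr_text, reference):
    """Share of the reference text that OCR got right, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference)
    return max(0.0, 1 - errors / len(reference))

# Hand-typed reference vs. what OCR produced from a worn original
# (capital I read as lower-case l, zero read as capital O).
reference = "Invoice 10452"
ocr_text = "lnvoice 1O452"
print(f"{character_accuracy(ocr_text, reference):.1%}")
```

Two wrong characters out of thirteen looks minor, but one of them falls inside the invoice number, which is why key fields on poor originals are worth verifying manually.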

Font type

Standard printed fonts (Arial, Times New Roman, Helvetica) achieve near-100% OCR accuracy on clean scans. Decorative or script fonts reduce accuracy. Handwriting is a separate category: modern AI-based handwriting recognition, known as Intelligent Character Recognition (ICR), typically achieves 70–85% accuracy on clear handwriting, and less on difficult scripts.

Searchable PDF vs Plain Text Output

Searchable PDF (also called image+text PDF): The scanned image is preserved exactly as it appears, with a hidden text layer added for search. This is the standard output for business document scanning: the document looks identical to the original, but is fully searchable. Searchable PDFs are often saved as PDF/A, the archival variant of PDF, but the two terms are not interchangeable, because a PDF/A file can also be image-only. File size is larger than image-only PDF but manageable (a typical A4 page at 300dpi is 50–150KB as searchable PDF).

Plain text or Word output: OCR extracts only the text, discarding the original image. Useful for data extraction workflows where you want to process the text content programmatically — but loses all formatting, tables, signatures, and stamps. Not appropriate for archival scanning where the appearance of the original matters.

When OCR Won't Help

OCR is powerful but not universal. Situations where it provides limited benefit:

Handwritten documents: Handwriting recognition accuracy is lower than printed text, particularly for difficult handwriting. For hand-completed forms, barcode-driven metadata is more reliable than relying on OCR to read handwritten field values.
Complex form layouts: OCR reads text sequentially and may struggle with multi-column forms, overlapping elements, or rotated text. Specialist forms-processing software handles these better.
Very poor quality originals: Documents that are heavily degraded, water damaged, or printed on low-quality paper may produce OCR text so garbled that searches miss the document or return misleading matches, which can be worse than no OCR at all.

For these cases, robust metadata entry at the point of scan — using barcode recognition or prompted metadata fields on the scanner touchscreen — provides more reliable findability than OCR search.