
Python: Converting Scanned PDFs to Searchable PDFs using OCR


Python OCR -> Convert to searchable PDF by embedding transparent text!

Hello, I'm Harumi from the "Association of People Who Want to Do Their Best on Local Machines".

Previously, I introduced a local OCR library, and it received a great response. This time, I'm writing an article as a practical application guide for that library.

https://zenn.dev/harumikun/articles/fb9c435acf4070

Scanned PDFs are Inconvenient

To give a brief explanation: a scanned PDF is a PDF file containing no text data at all, created by scanning paper documents with a scanner or multifunction printer.

What makes this troublesome is:

  • Text cannot be searched or copied

  • It is difficult to utilize with AI

I'm sure everyone in the working world feels this way without needing an explanation...

You can easily extract text by uploading such files to Google Drive or similar services, but, as is so often the case, internal company restrictions keep me from using them. So I wondered if there was a better way...🤔

Right, Let's Embed Text for Free Using Python

Setting the introduction aside, let's build a process to embed transparent text onto an existing PDF completely locally and convert it into a Searchable PDF.

Towards the end of the article, I also introduce a TUI tool that allows you to easily run the process explained here. If you just want to use it, feel free to skip to that section!

Installing Necessary Packages

We will use uv as the package manager.

```bash
uv init
uv add numpy opencv-python pypdf pypdfium2 reportlab onnxocr
```

If you prefer pip:

```bash
pip install numpy opencv-python pypdf pypdfium2 reportlab onnxocr
```
| Library | Purpose |
| --- | --- |
| numpy | Preprocessing for OCR |
| OpenCV | Preprocessing for OCR |
| pypdf | Merging the overlay PDF with the original PDF |
| pypdfium2 | Rendering PDF pages as images |
| reportlab | Creating a PDF with a drawn text layer |
| OnnxOCR | High-speed OCR processing using CPU inference |

Overall Flow

The conceptual diagram of the process is as follows:

I will break down each step and explain them with simple sample code.

1. Extracting PDF Pages as Images (PIL Image)

To perform OCR, the PDF must be converted into images.
We use pypdfium2 for this purpose.

```python:main.py
import pypdfium2 as pdfium

def render_pdf_to_image(pdf_path, dpi=300):
    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[0]  # Example: only the first page
    scale = dpi / 72
    pil_img = page.render(scale=scale).to_pil()
    return pil_img
```

Key Points

  • PDFs are 72dpi-based -> Scale up to around 300dpi for better OCR accuracy.

  • to_pil() converts it into a Pillow image, making it easy to handle with OpenCV or numpy.

2. Running OCR

We will perform OCR on the PDF images using OnnxOCR, which I introduced in a separate article.

```python:main.py
import numpy as np
import cv2
from onnxocr.onnx_paddleocr import ONNXPaddleOcr

ocr = ONNXPaddleOcr(use_gpu=False, lang="japan")

def run_ocr(pil_img):
    rgb = np.array(pil_img)
    bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)  # OpenCV expects BGR channel order
    results = ocr.ocr(bgr)
    return results
```

The execution results of OnnxOCR contain data structured as follows:

```python
[
  [
    [[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ["recognized text", confidence]
  ],
  ...
]
```

3. Normalizing OCR Results into "Rectangle + Text"

OCR results come back as quadrilaterals (four corner points). This step isn't strictly required, but simple rectangles (x1, y1, x2, y2) are easier to work with when embedding text into a PDF, so we reshape the data here to make the subsequent processing more manageable.

```python:main.py
def normalize_ocr_results(ocr_results):
    items = []
    for line in ocr_results[0]:
        quad = line[0]
        text = line[1][0]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        items.append({
            "text": text,
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return items
```
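To see what the normalized output looks like, here is a quick self-contained check with made-up OCR results (the helper is repeated so the snippet runs on its own; the coordinates and text are hypothetical):

```python
def normalize_ocr_results(ocr_results):
    # Collapse each quadrilateral into an axis-aligned bounding box
    items = []
    for line in ocr_results[0]:
        quad = line[0]
        text = line[1][0]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        items.append({"text": text, "bbox": (min(xs), min(ys), max(xs), max(ys))})
    return items

# Hypothetical OCR output: one page, one detected line with a slight tilt
mock_results = [[
    [[[10, 20], [110, 22], [110, 42], [10, 40]], ["Hello", 0.98]],
]]

items = normalize_ocr_results(mock_results)
print(items[0])  # {'text': 'Hello', 'bbox': (10, 20, 110, 42)}
```

Even when the quadrilateral is slightly tilted, the bounding box simply spans its extremes.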

4. Creating a "Transparent Text Layer PDF" with ReportLab

This is the most crucial part of the process.

By placing Invisible Text at the positions identified by the OCR and merging it with the original PDF, we create a "Searchable PDF."

```python:main.py
import io
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont

pdfmetrics.registerFont(UnicodeCIDFont("HeiseiKakuGo-W5"))  # Japanese CID font

def create_overlay_pdf(page_w, page_h, ocr_items):
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(page_w, page_h))
    c.setFillAlpha(0.0)  # Fully transparent

    for item in ocr_items:
        x1, y1, x2, y2 = item["bbox"]
        text = item["text"]

        fontsize = max(6, (y2 - y1) * 0.9)
        c.setFont("HeiseiKakuGo-W5", fontsize)

        # PDF coordinates have the origin at the bottom-left, so flip vertically
        baseline_y = page_h - y2
        c.drawString(x1, baseline_y, text)

    c.save()
    return buf.getvalue()
```

Key Points

  • Use setFillAlpha(0.0) to achieve transparency.

  • OCR coordinates use image coordinates (top is 0), whereas PDFs use bottom-up coordinates (bottom is 0). Therefore, vertical inversion is necessary.

  • The y-coordinate is calculated as page_h - y2.
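One more caveat: OCR coordinates are in image pixels at the rendering DPI (300 in this article), while ReportLab draws in PDF points (72 per inch), so the values need rescaling whenever the overlay page uses the original PDF's dimensions. A minimal conversion helper, as a sketch (`px_to_pt` is a hypothetical name, not part of any of the libraries used here):

```python
def px_to_pt(value, dpi=300):
    # PDF user space is 72 units per inch; the page was rendered at `dpi`
    return value * 72.0 / dpi

print(px_to_pt(300))   # 72.0 -> one inch
print(px_to_pt(1240))  # 297.6
```

Applying this to each bbox coordinate before drawing keeps the transparent text aligned with the original page geometry.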

5. Merging the Original PDF and Overlay PDF with PyPDF

This is the final step. Merging the transparent-text overlay we created with the original PDF completes the text-embedded (searchable) PDF.

```python:main.py
from pypdf import PdfReader, PdfWriter

def merge_overlay(original_pdf, overlay_bytes, output_pdf):
    reader = PdfReader(original_pdf)
    overlay_reader = PdfReader(io.BytesIO(overlay_bytes))

    writer = PdfWriter()

    page = reader.pages[0]
    overlay_page = overlay_reader.pages[0]

    page.merge_page(overlay_page)
    writer.add_page(page)

    with open(output_pdf, "wb") as f:
        writer.write(f)
```
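For reference, the steps could be wired together for a multi-page document roughly as sketched below. This is not the article's exact code: `make_searchable_pdf` is a hypothetical name, it assumes `run_ocr`, `normalize_ocr_results`, and `create_overlay_pdf` from the earlier steps are defined in the same `main.py`, and it rescales the pixel bboxes back to PDF points so the overlay matches the original page size. The heavy imports are deferred into the function so the sketch can be loaded even without every dependency installed.

```python
import io

def make_searchable_pdf(input_pdf, output_pdf, dpi=300):
    # Sketch of the full pipeline; run_ocr, normalize_ocr_results and
    # create_overlay_pdf are assumed to be defined as in the earlier steps.
    import pypdfium2 as pdfium
    from pypdf import PdfReader, PdfWriter

    scale = dpi / 72.0
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    pdf = pdfium.PdfDocument(input_pdf)

    for i, page in enumerate(reader.pages):
        pil_img = pdf[i].render(scale=scale).to_pil()
        items = normalize_ocr_results(run_ocr(pil_img))  # steps 2 and 3

        # Convert bboxes from image pixels back to PDF points
        pt_items = [
            {"text": it["text"],
             "bbox": tuple(v / scale for v in it["bbox"])}
            for it in items
        ]

        page_w = float(page.mediabox.width)
        page_h = float(page.mediabox.height)
        overlay_bytes = create_overlay_pdf(page_w, page_h, pt_items)  # step 4
        overlay_page = PdfReader(io.BytesIO(overlay_bytes)).pages[0]
        page.merge_page(overlay_page)
        writer.add_page(page)

    with open(output_pdf, "wb") as f:
        writer.write(f)
```

Usage would then simply be `make_searchable_pdf("scan.pdf", "searchable.pdf")`.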

Trying Out the Program

I couldn't find a good sample scanned PDF to post in this article, so I converted some document-like data into an image and then into a PDF to process it as a scanned PDF.


The text has become selectable thanks to the program!

Since the position and font size are determined based on the bbox returned during OCR, they are slightly misaligned with the actual text, but the transparent text is embedded quite well.

By the way, here is the result of copying and pasting the text. Although there are some minor recognition errors, I believe the accuracy is sufficient depending on the use case.

令和7年12月1日
関係者各位
口一力ルOCR実行委員会
委員長ハルミ
スキ一PDF変換大会の開催中止につて(お知らせ
拝啓時下ますます二清祥のこととお喜び申し上げます。平素は当会の運営につき
まして格別のご高配を賜り、厚く御礼申し上げます。
さて、開催を予定しておりました標記のスキPDF変換大会は、委員長の体調不
良のため、慎重な協議の結果、中止が決定いたしましたので、(ここに)お知せ
いたします。
実行委員一同、開催に向けて準備をしてまいりましたが、苦渋の決断をせさるを
得ない状況なりました。開を心待ちにしてくださっていた皆ま、関係者の皆
さまには大変ご迷惑かけすることとな「大変ご迷惑かけいたしますこ
と、深くおび申し上げす。なにとご理解(ご了承いただきますようご
理解(ご了承)のほよろしくお願い申し上げます。
敬具

1.イベト[催し物・行事名
2.開催日令和7年12月2日(日)
3.お問↓合わせ先:000-000-0000
以上

I won't go into too much detail here, as this part depends heavily on the performance of the OCR engine.

Released as a TUI Tool

Although the process is simple, I have created a TUI application to easily perform these operations and published it on PyPI. If you're interested, please give it a try.

The TUI is built using textual, allowing you to perform OCR on PDFs and generate searchable PDFs directly from the terminal. For more details, please see the README.


TUI tool screen (running in the terminal)

https://github.com/harumiWeb/pdfembed

```bash
pip install pdfembed
pdfembed
```

Conclusion

This was a brief explanatory article, but I wanted to share it because it was actually quite practical.
Most OCR libraries and services allow you to obtain bounding boxes along with character recognition, making it easy to leverage them effectively.

If image recognition models can be run practically on local machines, we should be able to create even more useful tools in the future. I'm looking forward to further advancements in AI! 🔥

See you again! 👋

  • Library for creating TUI tools

https://textual.textualize.io/
