Creating a single-page PDF with transparent text using the Google Cloud Vision API

Published on 2024/11/02
Python
PDF
OCR
tech
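Overview
I recently had the chance to add a transparent text layer to a PDF using the Google Cloud Vision API, so this article is a memo of how I did it.
Below is an example of searching the resulting PDF for "simple".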

Background
This article targets a PDF that consists of a single page.

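Procedure
Creating the image
Create the image that will be the OCR target.
With the default settings the rendered image came out blurry, so I doubled the resolution. The positioning step described later takes this scale factor into account.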
Install the following. (tqdm is listed because the script below imports it.)
requirements.txt
PyMuPDF
Pillow
tqdm
import fitz  # PyMuPDF
from PIL import Image
import json
from tqdm import tqdm
import io

# Input and output PDF files
input_pdf_path = "./input.pdf"  # single-page PDF file
output_pdf_path = "./output.pdf"

# Open the input PDF and load its single page
pdf_document = fitz.open(input_pdf_path)
page = pdf_document[0]  # select the first page

# Render the page as an image to be used for OCR
# pix = page.get_pixmap()  # default rendering (72 DPI), which came out blurry

zoom = 2.0

# Zoom matrix to raise the rendering resolution
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)

img = Image.open(io.BytesIO(pix.tobytes("png")))
img.save("./image.png")
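
As a rough illustration of what the zoom factor means: PDF user space is measured in points (1/72 inch), so rendering with an identity matrix corresponds to 72 DPI, and a zoom of 2.0 corresponds to 144 DPI. The helper names below are hypothetical and are not part of PyMuPDF, and the pixel size is an approximation that assumes the page size is given in points.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom: float) -> float:
    """Effective DPI when a page is rendered with a uniform zoom factor."""
    return PDF_POINTS_PER_INCH * zoom


def dpi_to_zoom(dpi: float) -> float:
    """Zoom factor needed to reach a target DPI."""
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple:
    """Approximate pixel size of the rendered image for a page size in points."""
    return round(width_pt * zoom), round(height_pt * zoom)
```

For example, a target of 300 DPI would need a zoom of roughly 4.17 rather than 2.0. Whatever factor is used for rendering has to be used again when converting OCR coordinates back to page coordinates.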

Google Cloud Vision API
Apply the Google Cloud Vision API to the image generated above. The API returns JSON like the following.
{
    "textAnnotations": [
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 141,
                        "y": 152
                    },
                    {
                        "x": 1082,
                        "y": 152
                    },
                    {
                        "x": 1082,
                        "y": 1410
                    },
                    {
                        "x": 141,
                        "y": 1410
                    }
                ]
            },
            "description": "Sample PDF...",
            "locale": "la"
        },
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 141,
                        "y": 159
                    },
                    {
                        "x": 363,
                        "y": 156
                    },
                    {
                        "x": 364,
                        "y": 216
                    },
                    {
                        "x": 142,
                        "y": 219
                    }
                ]
            },
            "description": "Sample"
        },
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 382,
                        "y": 156
                    },
                    {
                        "x": 506,
                        "y": 154
                    },
                    {
                        "x": 507,
                        "y": 213
                    },
                    {
                        "x": 383,
                        "y": 215
                    }
                ]
            },
            "description": "PDF"
        },
...
Save the JSON returned by the API under a name such as ./google_ocr.json.
Then load the OCR results as follows.
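
The article does not show how the API was called. As one possibility, here is a minimal sketch that builds the request body for the REST endpoint `images:annotate` with the `TEXT_DETECTION` feature. Both the use of REST and that feature type are assumptions, and sending the request (which needs an API key or OAuth token) is deliberately left out. The helper names are hypothetical.

```python
import base64

# REST endpoint of the Vision API (authentication is not shown here)
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request_body(image_bytes: bytes) -> dict:
    """Build the JSON body for a single-image text detection request."""
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded bytes
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }


def first_response(api_result: dict) -> dict:
    """Pick the per-image result, which is the object holding textAnnotations."""
    return api_result["responses"][0]
```

The REST result wraps each image's output in a `responses` list, so saving `first_response(...)` as ./google_ocr.json would give a file with `textAnnotations` at the top level, as in the excerpt above.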
json_path = "./google_ocr.json"

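# Load the OCR text data from the JSON file
with open(json_path, "r", encoding="utf-8") as f:
    response = json.load(f)

# The key can be absent when no text was detected, so fall back to an empty list
texts = response.get("textAnnotations", [])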
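
To make the structure of `textAnnotations` concrete: in the excerpt above, the first entry holds the text of the whole page and the remaining entries hold individual words. The sketch below separates the two. `split_annotations` is a hypothetical helper name, and treating a missing key as "no text" is an assumption.

```python
def split_annotations(response: dict) -> tuple:
    """Return (full_page_text, list_of_word_strings) from a Vision API result."""
    annotations = response.get("textAnnotations", [])
    if not annotations:
        return "", []
    # Entry 0 is the whole-page text; the rest are word-level entries
    full_text = annotations[0].get("description", "")
    words = [a.get("description", "") for a in annotations[1:]]
    return full_text, words


sample = {
    "textAnnotations": [
        {"description": "Sample PDF..."},
        {"description": "Sample"},
        {"description": "PDF"},
    ]
}
full_text, words = split_annotations(sample)
```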

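Creating the transparent text
The following script writes the OCR text into the PDF. One key point: I had to adjust the font size and check whether each piece of text fits inside its bounding box.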
# Get the page size
rect = page.rect

# Add the OCR text as transparent text
if texts:
    for text in tqdm(texts[1:]):  # skip texts[0], which holds the text of the whole page
        vertices = text["boundingPoly"]["vertices"]

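        # The JSON may omit "x" or "y" when the value is 0, so default to 0
        x_min = min(v.get("x", 0) for v in vertices)
        y_min = min(v.get("y", 0) for v in vertices)
        x_max = max(v.get("x", 0) for v in vertices)
        y_max = max(v.get("y", 0) for v in vertices)

        # Convert image pixels back to page coordinates by undoing the zoom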

        x_min = x_min / zoom
        y_min = y_min / zoom
        x_max = x_max / zoom
        y_max = y_max / zoom

        # Define the bounding box
        bbox_rect = fitz.Rect(x_min, y_min, x_max, y_max)
        content = text["description"]

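        # Initial font size
        fontsize = 10
        fits = False

        # Shrink the font size until the text fits in the box
        while fontsize > 0:
            # render_mode=3 makes the text invisible; a negative return value means
            # the text did not fit and nothing was written
            res = page.insert_textbox(bbox_rect, content, fontsize=fontsize, color=(0, 0, 0, 0), render_mode=3, align=1)
            if res >= 0:
                fits = True
                break
            fontsize -= 1  # reduce the font size and retry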

        if not fits:
            print(f"'{content}' could not fit in the rectangle.")

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
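
The two ideas in the script above, undoing the zoom and shrinking the font until the text fits, can be isolated as small helpers. This is only a sketch: the function names are hypothetical, and `try_insert` stands in for a call such as `page.insert_textbox(...)`, which returns a negative value when the text does not fit.

```python
def vertices_to_pdf_rect(vertices: list, zoom: float) -> tuple:
    """Convert pixel-space polygon vertices to (x0, y0, x1, y1) in page units."""
    # Missing "x"/"y" keys are treated as 0 (an assumption about the JSON)
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs) / zoom, min(ys) / zoom, max(xs) / zoom, max(ys) / zoom)


def largest_fitting_size(try_insert, start: int = 10, minimum: int = 1):
    """Try font sizes from `start` downward; return the first that fits, else None."""
    for size in range(start, minimum - 1, -1):
        if try_insert(size) >= 0:
            return size
    return None


# Vertices of the word "Sample" from the JSON excerpt, rendered at zoom 2.0
sample_vertices = [
    {"x": 141, "y": 159},
    {"x": 363, "y": 156},
    {"x": 364, "y": 216},
    {"x": 142, "y": 219},
]
rect = vertices_to_pdf_rect(sample_vertices, zoom=2.0)
```

Because the OCR polygon can be slightly tilted, taking the minimum and maximum of each coordinate yields the axis-aligned box that encloses it.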

Results
I used the PDF published at the following URL.
https://pdfobject.com/pdf/sample.pdf
As a result, I was able to create a PDF with transparent text.

Summary
I hope this article is useful when, for example, you need OCR for only specific pages.