OCR an image-only PDF, extract the text, and embed it as an invisible text layer (doing it for free with Google Colab × Google Vision API)

Published 2024/05/15

Python

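Background

I normally use the software bundled with my scanner to scan documents, run OCR, and produce PDF files with an invisible text layer.
However, the bundled software cannot convert image-only PDFs that were prepared separately.
Free web services exist, but uploading files to a web service I know little about makes me a bit uneasy.
Paid software such as Adobe Acrobat is also an option, but I don't want to spend money either 💸
So I decided to build it myself, with some effort (and with ChatGPT's help).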

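Specifications

PDF files are prepared in various ways, so they can be uploaded individually.
The PDF with the embedded invisible text is written to Google Drive, since it will ultimately be shared.
(After output, I move it manually.)
Precise text positioning and font sizes are not handled. (Currently, the layout breaks for vertically written documents.)
Various page sizes are supported (B5, for example).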
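
To make the page-size point concrete: PDF page dimensions (the mediabox) are expressed in points, where 1 point is 1/72 inch, while the OCR image is rendered at a fixed resolution. The following is a minimal sketch and not part of the original notebook. The helper names `mm_to_pt` and `pt_to_px` are hypothetical, and the B5 dimensions assume JIS B5 (182 × 257 mm).

```python
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def mm_to_pt(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * PT_PER_INCH


def pt_to_px(pt, dpi=300):
    """Convert PDF points to pixels at the given rendering resolution."""
    return round(pt / PT_PER_INCH * dpi)


# JIS B5 page (assumed 182 x 257 mm) expressed in points and in pixels at 300 dpi
b5_width_pt = mm_to_pt(182)
b5_height_pt = mm_to_pt(257)
print(round(b5_width_pt, 1), round(b5_height_pt, 1))
print(pt_to_px(b5_width_pt), pt_to_px(b5_height_pt))
```

Because the code scales OCR coordinates by the ratio between the page size and the image size, it should not need a table of known page sizes. The exact pixel size produced by a renderer may differ by a pixel from this estimate.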

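Preparation

Create the following folder under "My Drive" in Google Drive and save the Colab notebook there.
Colab Notebooks/PdfOcr/

Download the "IPAex fonts" and save "ipaexg.ttf" in the following folder.
Colab Notebooks/PdfOcr/fonts

Create a Google Cloud project, enable the Google Vision API, and save the service account credentials as "key.json" in the following folder.

Colab Notebooks/PdfOcr/key

Create the following folder as the destination for output files.
Colab Notebooks/PdfOcr/output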
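
Before running the notebook, it can help to confirm that this layout is in place. The following is a minimal sketch and not part of the original notebook. The function name `find_missing` is hypothetical, and `BASE_DIR` assumes Google Drive is mounted at `/content/drive`.

```python
from pathlib import Path

# Assumed mount point of Google Drive inside Colab
BASE_DIR = Path('/content/drive/MyDrive/Colab Notebooks/PdfOcr')


def find_missing(base_dir):
    """Return the required files and folders that do not exist under base_dir."""
    base = Path(base_dir)
    required = [
        base / 'fonts' / 'ipaexg.ttf',  # font used for the text layer
        base / 'key' / 'key.json',      # service account credentials
        base / 'output',                # destination folder for converted PDFs
    ]
    return [str(p) for p in required if not p.exists()]


missing = find_missing(BASE_DIR)
if missing:
    print('Missing:', *missing, sep='\n  ')
else:
    print('All required files and folders are in place.')
```

Running this in a cell right after mounting Drive should surface a misplaced font or key file before any Vision API call is made.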

Code

Installing the libraries

!pip install pdf2image
!pip install pillow
!pip install pdfplumber
!pip install reportlab
!pip install PyPDF2
!pip install google-cloud-vision

Code

from google.cloud import vision
from google.oauth2 import service_account
import PyPDF2
from reportlab.pdfgen import canvas
import pdfplumber
import io
import os
from PIL import Image
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
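from google.colab import drive, files

# Mount Google Drive so that the /content/drive/MyDrive paths below resolve
# (added as an assumption; not needed if Drive is already mounted from the sidebar)
drive.mount('/content/drive')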

# Google Cloud authentication
def get_vision_client():
    credentials = service_account.Credentials.from_service_account_file(
        '/content/drive/MyDrive/Colab Notebooks/PdfOcr/key/key.json'
    )
    return vision.ImageAnnotatorClient(credentials=credentials)

# Detect text in an image
def detect_text(path):
    client = get_vision_client()
    with io.open(path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)
    #response = client.text_detection(image=image)
    # Check for an API error before reading the annotations
    if response.error.message:
        raise Exception(f'{response.error.message}')
    texts = response.text_annotations
    return texts[1:] if texts else []  # The first element is the full text, so skip it

# Convert PDF pages to images
def pdf_to_images(pdf_path):
    images = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            image = page.to_image(resolution=300)
            images.append(image.original)
    return images

# OCR each page image and overlay the result on the original page as invisible text
def add_text_to_pdf(images, input_pdf_path, output_pdf_path):
    font_path = '/content/drive/MyDrive/Colab Notebooks/PdfOcr/fonts/ipaexg.ttf'
    pdfmetrics.registerFont(TTFont('IPAexGothic', font_path))

    existing_pdf = PyPDF2.PdfReader(input_pdf_path)
    output = PyPDF2.PdfWriter()

    for i, image in enumerate(images):
        img_path = f'/tmp/temp_image_{i}.jpg'
        image.save(img_path, 'JPEG')
        texts = detect_text(img_path)
        os.remove(img_path)

        page = existing_pdf.pages[i]

        rotation = page.get('/Rotate', 0) % 360  # Normalize the rotation angle to 0-359

        # Get the physical page dimensions from the mediabox
        can_width = pdf_width = float(page.mediabox.upper_right[0])
        can_height = pdf_height = float(page.mediabox.upper_right[1])

        # Set up the canvas for the invisible text
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize=(pdf_width, pdf_height))
        #can.setFillColorRGB(1, 0, 0)  # Use red text instead when debugging
        can.setFillColorRGB(0, 0, 0, 0)  # Invisible text (alpha = 0)

        # Rotate the text to match the page rotation
        can.saveState()
        if rotation in [90, 270]:
            can.translate(pdf_width/2, pdf_height/2)
            can.rotate(rotation)
            can.translate(-pdf_height/2, -pdf_width/2)

        # For a 90 or 270 degree rotation, swap width and height
        if rotation == 90 or rotation == 270:
            pdf_width, pdf_height = pdf_height, pdf_width

        for text in texts:
            if text.description.strip():
                vertices = text.bounding_poly.vertices
                x = vertices[0].x * pdf_width / image.width
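                y = pdf_height - (vertices[0].y * pdf_height / image.height)  # Flip the Y axis (PDF origin is bottom-left)
                # Compute the text height from the bounding box
                text_height = (vertices[2].y - vertices[0].y) * pdf_height / image.height
                y -= text_height  # Shift down by the text height so the text starts at the bottom of the box

                can.setFont('IPAexGothic', 12)  # Font size is fixed at 12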
                can.drawString(x, y, text.description)

        can.restoreState()
        can.save()
        packet.seek(0)
        new_pdf = PyPDF2.PdfReader(packet)
        new_page = page
        new_page.merge_page(new_pdf.pages[0])
        output.add_page(new_page)

    with open(output_pdf_path, 'wb') as f:
        output.write(f)
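
The coordinate handling inside the loop above can be isolated as a small pure function. The sketch below is an illustration and not the notebook's actual code. The function name `bbox_to_pdf` is hypothetical, the vertices are passed as plain `(x, y)` tuples instead of Vision API objects, and treating the scaled box height as a candidate font size is only an assumption about one way the fixed size of 12 might be relaxed.

```python
def bbox_to_pdf(vertices, image_size, page_size):
    """Map an OCR bounding box in image pixels to a PDF text position.

    Image coordinates start at the top-left with y pointing down;
    PDF coordinates start at the bottom-left with y pointing up.
    Returns (x, y, height) in PDF points, where (x, y) is the
    lower-left corner of the box.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size
    scale_x = page_w / img_w
    scale_y = page_h / img_h

    left = min(x for x, _ in vertices)
    top = min(y for _, y in vertices)
    bottom = max(y for _, y in vertices)

    x_pt = left * scale_x
    height_pt = (bottom - top) * scale_y
    y_pt = page_h - bottom * scale_y  # flip the y axis
    return x_pt, y_pt, height_pt


# A 1000 x 2000 px image of a 500 x 1000 pt page, with one word box
box = [(100, 200), (300, 200), (300, 260), (100, 260)]
print(bbox_to_pdf(box, (1000, 2000), (500, 1000)))  # → (50.0, 870.0, 30.0)
```

Using the minimum and maximum of the vertices avoids depending on the order in which the corners are listed. The returned height could be passed to `can.setFont` in place of 12, though glyph height and box height rarely match exactly, so some scaling factor would probably be needed.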

Run

def main():
    uploaded_files = files.upload()
    for filename, data in uploaded_files.items():
        input_pdf_path = f'/tmp/{filename}'
        with open(input_pdf_path, 'wb') as f:
            f.write(data)
        output_pdf_path = f'/content/drive/MyDrive/Colab Notebooks/PdfOcr/output/{filename}'
        images = pdf_to_images(input_pdf_path)
        add_text_to_pdf(images, input_pdf_path, output_pdf_path)
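        # files.upload() also saves each file to the working directory; clean up both copies
        os.remove(input_pdf_path)
        os.remove(filename)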

main()

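Closing thoughts

I had no prior experience with either Google Colab (or Python, for that matter) or the Google Vision API, but I managed to achieve what I set out to do.
Because of that inexperience I don't know the conventions and the code is messy. I'd like to clean it up, but I have no idea where to start 🤷‍♂️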
