
[Azure Document Intelligence / PyPDF2 / Python] How to get the page count of a document you are going to OCR

takekawa tomoki

Published 2024/09/28

Document Intelligence

tech

Date written
2024/09/28

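What this covers
I was developing an application that runs OCR on documents with Azure Document Intelligence, and I needed to implement a feature that gets the number of pages in a document.

I found myself wondering how to actually do that, so I wrote it up as an article.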

Prerequisites
Python 3.9.6
Set up Azure Document Intelligence so that it is ready to use, following the article below:
https://zenn.dev/headwaters/articles/8e23a752096c1e
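
The script in this article keeps the endpoint and key as placeholder strings in the source. As a minimal sketch, assuming you would rather read them from environment variables, something like the following could be used. The variable names `DOCUMENT_INTELLIGENCE_ENDPOINT` and `DOCUMENT_INTELLIGENCE_KEY` and the helper `load_settings` are hypothetical names chosen for illustration, not something Azure or the SDK defines.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names; use whatever names your environment defines.
ENDPOINT_VAR = "DOCUMENT_INTELLIGENCE_ENDPOINT"
KEY_VAR = "DOCUMENT_INTELLIGENCE_KEY"


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, api_key), raising if either variable is unset or empty."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]
```

With this in place, `endpoint, api_key = load_settings()` could stand in for the two hard-coded assignments in main.py.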

Code
Install the following library:
pip install pypdf2
Then run the following code:
main.py
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
import PyPDF2

# Set the Azure Document Intelligence endpoint and API key
endpoint = "your endpoint"
api_key = "your key"

# Initialize the client
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key)
)

# Path to the PDF file to analyze
file_path = "path to your PDF file"

# Use PyPDF2 to get the total number of pages
with open(file_path, "rb") as f:
    reader = PyPDF2.PdfReader(f)
    num_pages = len(reader.pages)
    print(f"Number of pages: {num_pages}")

    # Rewind the file stream so the upload starts from the first byte
    f.seek(0)

    # Analyze the document
    poller = document_analysis_client.begin_analyze_document(
        model_id="prebuilt-document", document=f
    )

# Wait for the analysis to finish and get the result
result = poller.result()

# Print the results
for page in result.pages:
    # Print the page number, then each line of text on that page
    print(f"Page number: {page.page_number}")
    for line in page.lines:
        print(line.content)
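
The script produces two views of the page count: `num_pages` from PyPDF2, and `result.pages` for the pages the service actually analyzed. These may differ, for instance if the service returned only a subset of pages. A minimal sketch of comparing the two, assuming a hypothetical helper `find_missing_pages` (not part of PyPDF2 or the Azure SDK) that works on plain integers so it can be tried without calling the service:

```python
from typing import Iterable, List


def find_missing_pages(num_pages: int, analyzed_page_numbers: Iterable[int]) -> List[int]:
    """Return the 1-based page numbers in 1..num_pages absent from the analysis result."""
    analyzed = set(analyzed_page_numbers)
    return [n for n in range(1, num_pages + 1) if n not in analyzed]


# Example with plain numbers: a 5-page PDF where only pages 1 and 2 came back.
missing = find_missing_pages(5, [1, 2])
print(missing)  # [3, 4, 5]
```

In main.py this could be called as `find_missing_pages(num_pages, (page.page_number for page in result.pages))`. An empty list would mean every page counted by PyPDF2 also appears in the OCR result.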

Headwaters

The tech blog of Headwaters Co., Ltd. We post on a wide range of Data & AI and app modernization topics, including generative AI, LLMs, Azure services and certifications, IoT, and XR.
