PDF Processing with Azure AI Document Intelligence

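What is Azure AI Document Intelligence?
https://learn.microsoft.com/ja-jp/azure/ai-services/document-intelligence/overview
When extracting text from documents such as PDFs with Azure AI Document Intelligence, independent of any particular use case, there are two main models to choose from.

Read model
Extracts text from the document. It does not return image data such as figures.

Layout model
Extracts layout information in addition to text. It also supports retrieving figures and tables.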
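
The choice between the two models can be expressed as a small helper. This is a minimal sketch, not part of the SDK: `TextExtractionModel` and `choose_model_id` are hypothetical names introduced here, and the only service-side values are the prebuilt model IDs `prebuilt-read` and `prebuilt-layout`.

```python
from enum import Enum


class TextExtractionModel(str, Enum):
    """Prebuilt model IDs for the two general-purpose extraction models."""

    READ = "prebuilt-read"
    LAYOUT = "prebuilt-layout"


def choose_model_id(need_tables_or_figures: bool) -> str:
    """Return the Layout model ID when structure is needed, otherwise Read."""
    if need_tables_or_figures:
        return TextExtractionModel.LAYOUT.value
    return TextExtractionModel.READ.value
```

Converting a PDF to Markdown with figures needs the Layout model, so `choose_model_id(True)` is the relevant case for the rest of this post.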

Yuki Yamada

Resource used
This walkthrough uses an Azure AI Document Intelligence resource configured as follows.

Item	Value
Region	Japan East
Pricing tier (SKU)	Standard

Client used
The client is the Python SDK (azure-ai-documentintelligence), v1.x.

It calls Document Intelligence v4.0 with API version 2024-11-30.
https://learn.microsoft.com/en-us/python/api/overview/azure/ai-documentintelligence-readme?view=azure-python
https://pypi.org/project/azure-ai-documentintelligence/


Creating the client

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# Resource settings
endpoint = "<resource endpoint>"
credential = AzureKeyCredential(key="<API key>")

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=credential
)
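
Rather than hard-coding the endpoint and key, they could be read from environment variables. The following is a minimal sketch under that assumption. The variable names `DOCUMENTINTELLIGENCE_ENDPOINT` and `DOCUMENTINTELLIGENCE_API_KEY` and the helper `load_settings` are choices made for this example, not something the SDK requires.

```python
import os
from typing import Mapping, Tuple

# Hypothetical environment variable names chosen for this sketch.
ENDPOINT_VAR = "DOCUMENTINTELLIGENCE_ENDPOINT"
KEY_VAR = "DOCUMENTINTELLIGENCE_API_KEY"


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key), raising a clear error if either is missing or empty."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[ENDPOINT_VAR], env[KEY_VAR]
```

The returned pair can then be passed to `AzureKeyCredential` and `DocumentIntelligenceClient` in place of the placeholder strings.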


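Converting a PDF to Markdown
Let's convert a PDF to Markdown.
The PDF is "Chapter 1: Problem Solving in the Information Society" (第1章　情報社会の問題解決) from "高等学校情報科「情報Ⅰ」教員研修用教材（本編）", the teacher training material for the high school "Information I" course published by MEXT (Japan's Ministry of Education, Culture, Sports, Science and Technology).
https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm
https://www.mext.go.jp/content/20200722-mxt_jogai02-100013300_003.pdf
The PDF file is 9.6 MB and has 43 pages.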
import os
from pathlib import Path

from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    AnalyzeOutputOption,
    AnalyzeResult,
)

current_path = os.getcwd()
target_file_path = os.path.join(Path(current_path).parent, "data/20200722-mxt_jogai02-100013300_003.pdf")

# Read the PDF into memory and wrap it in a request body
with open(target_file_path, "rb") as f:
    bytes_data = f.read()
analyze_request = AnalyzeDocumentRequest(bytes_source=bytes_data)

# Start extraction with the Layout model
poller = document_intelligence_client.begin_analyze_document(
    model_id="prebuilt-layout",
    body=analyze_request,
    output_content_format="markdown",  # return the content as Markdown
    output=[AnalyzeOutputOption.FIGURES],  # make extracted figures retrievable from a separate endpoint
)

# Block until the analysis finishes
result: AnalyzeResult = poller.result()
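
Once the analysis finishes, the Markdown text is available as `result.content`. A minimal sketch for writing it to disk follows. `save_markdown` is a hypothetical helper introduced here, and the output path in the usage note is an assumption.

```python
from pathlib import Path


def save_markdown(content: str, out_path: Path) -> Path:
    """Write Markdown text to out_path as UTF-8, creating parent directories."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(content, encoding="utf-8")
    return out_path
```

For example, `save_markdown(result.content, Path("results/chapter1.md"))` would store the converted chapter next to the extracted figures. Writing explicitly as UTF-8 matters here because the source document is Japanese.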


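Retrieving figures extracted by the Layout model
Specifying AnalyzeOutputOption.FIGURES in the output parameter when starting the extraction lets you retrieve the figures in the document through a separate API call.

Note: If this option is not specified, retrieving a figure fails with a 404 error.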
https://learn.microsoft.com/en-us/rest/api/aiservices/document-models/get-analyze-result-figure?view=rest-aiservices-v4.0 (2024-11-30)&viewFallbackFrom=rest-aiservices-v4.0 (2024-07-31-preview)&tabs=HTTP
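
For orientation, the request URL behind this API can be assembled as below. This is a sketch based on my reading of the REST reference linked above, so treat the exact path as an assumption to verify there. `figure_url` is a hypothetical helper, and the SDK builds this URL for you in normal use.

```python
from urllib.parse import quote


def figure_url(
    endpoint: str,
    model_id: str,
    result_id: str,
    figure_id: str,
    api_version: str = "2024-11-30",
) -> str:
    """Build the Get Analyze Result Figure URL for a single figure."""
    base = endpoint.rstrip("/")
    # Percent-encode each path segment so unexpected characters cannot break the path.
    path = "/documentintelligence/documentModels/{}/analyzeResults/{}/figures/{}".format(
        quote(model_id, safe=""),
        quote(result_id, safe=""),
        quote(figure_id, safe=""),
    )
    return f"{base}{path}?api-version={api_version}"
```

The URL makes the three required identifiers visible: the model ID, the result ID of the analyze operation, and the figure ID.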

Calling the Get Analyze Result Figure API
This API requires a resultId parameter, which is the ID of the result of the extraction started earlier.

The value is available from the poller.
# Get the result ID
result_id = poller.details["operation_id"]

# Save the extracted figures under results/img
img_dir_path = os.path.join(Path(current_path).parent, "results/img")
os.makedirs(img_dir_path, exist_ok=True)  # create the output directory if it does not exist
if result.figures:
    for figure in result.figures:
        assert figure.id
        # The response is an iterator of byte chunks
        raw_img = document_intelligence_client.get_analyze_result_figure(
            model_id=result.model_id,
            result_id=result_id,
            figure_id=figure.id,
        )
        saved_img_path = os.path.join(img_dir_path, f"img_{figure.id}.png")
        with open(saved_img_path, "wb") as f:
            f.writelines(raw_img)
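
The saving step can also be factored into a reusable function. The sketch below is one possible shape, not the author's method: `save_figure` is a hypothetical helper that takes any iterable of byte chunks (such as the value returned by `get_analyze_result_figure`), and the filename sanitizing is a defensive assumption about what a figure ID might contain.

```python
import re
from pathlib import Path
from typing import Iterable


def save_figure(chunks: Iterable[bytes], out_dir: Path, figure_id: str) -> Path:
    """Write byte chunks to out_dir/img_<figure_id>.png and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Replace characters that are unsafe in filenames with underscores.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", figure_id)
    out_path = out_dir / f"img_{safe_id}.png"
    with open(out_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return out_path
```

Inside the loop above, this would be called as `save_figure(raw_img, Path(img_dir_path), figure.id)`.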
