
[Azure Document Intelligence] How to get Markdown output

Published 2024/11/25

Written on

2024/11/25

What we'll do

I'm going to try getting Markdown-formatted output from the Azure Document Intelligence Layout model.

References

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=sample-code

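Prerequisites

  • An Azure Document Intelligence resource is already deployed (Standard tier)
  • The Layout model (prebuilt-layout) is used
  • The input is the sample invoice PDF (Invoice_1.pdf, the URL in the code below)

Library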

pip install azure-ai-documentintelligence

Code

main.py
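# Note: this listing targets the preview SDK available at the time of writing.
# As far as I know, release 1.0.0 renamed ContentFormat to DocumentContentFormat,
# so adjust the import if it fails on a newer version.
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)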

# Constants (replace the placeholders with your resource's endpoint and key)
ENDPOINT = "<endpoint>"
KEY = "<key>"
URL = "https://raw.githubusercontent.com/Azure/azure-sdk-for-python/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_forms/forms/Invoice_1.pdf"
MODEL = "prebuilt-layout"

def analyze_document(endpoint, key, url, model):
    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    poller = client.begin_analyze_document(
        model,
        AnalyzeDocumentRequest(url_source=url),
        output_content_format=ContentFormat.MARKDOWN,
    )

    result: AnalyzeResult = poller.result()

    print(f"Here's the full content in format {result.content_format}:\n")
    print(result.content)


if __name__ == "__main__":
    analyze_document(ENDPOINT, KEY, URL, MODEL)
Output

Here's the full content in format ContentFormat.MARKDOWN:

Contoso

Address:

1 Redmond way Suite

6000 Redmond, WA

99243

Invoice For: Microsoft

1020 Enterprise Way

Sunnayvale, CA 87659

<table>
<tr>
<th>Invoice Number</th>
<th>Invoice Date</th>
<th>Invoice Due Date</th>
<th>Charges</th>
<th>VAT ID</th>
</tr>
<tr>
<td>34278587</td>
<td>6/18/2017</td>
<td>6/24/2017</td>
<td>$56,651.49</td>
<td>PT</td>
</tr>
</table>
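
As the output above shows, tables come back as HTML <table> markup embedded in the Markdown. If the result is going into a preprocessing step, it can help to pull those tables out as rows. The following is only a minimal sketch using the standard library's html.parser; the names TableExtractor and extract_tables are hypothetical helpers, not part of the Azure SDK, and it assumes simple tables without nesting or rowspan/colspan.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> found in a Markdown/HTML string as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables: each is a list of rows (lists of cell text)
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (paragraphs, newlines between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self.tables:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(markdown_text):
    """Return all tables in the text as nested lists: tables -> rows -> cells."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables
```

Calling extract_tables(result.content) on the analysis result would then return the invoice table as a header row followed by a data row, while the surrounding paragraphs are left out.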

Writing the output to an .md file

main.py
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)

# Constants (replace the placeholders with your resource's endpoint and key)
ENDPOINT = "<endpoint>"
KEY = "<key>"
URL = "https://raw.githubusercontent.com/Azure/azure-sdk-for-python/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_forms/forms/Invoice_1.pdf"
MODEL = "prebuilt-layout"


def analyze_document(endpoint, key, url, model):
    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    poller = client.begin_analyze_document(
        model,
        AnalyzeDocumentRequest(url_source=url),
        output_content_format=ContentFormat.MARKDOWN,
    )

    result: AnalyzeResult = poller.result()

    # Save the Markdown content to a file
    output_filename = "output.md"
    with open(output_filename, "w", encoding="utf-8") as md_file:
        md_file.write(f"Here's the full content in format {result.content_format}:\n\n")
        md_file.write(result.content)

    print(f"Markdown content has been written to {output_filename}")


if __name__ == "__main__":
    analyze_document(ENDPOINT, KEY, URL, MODEL)

Looks good.

Summary

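I didn't know Document Intelligence could output Markdown, so this was a good thing to learn.
I'm thinking of building it into my preprocessing (though it's a bit pricey...).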

ヘッドウォータース
