😺
[Azure Document Intelligence] How to get output in Markdown
Written on
2024/11/25
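Goal
Use the Layout model of Azure Document Intelligence to get the analysis result as Markdown.
References
Prerequisites
- An Azure Document Intelligence resource (Standard tier) is already deployed
- The Layout model is used
- The input is the sample invoice PDF (Invoice_1.pdf) referenced by URL in the code
Library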
pip install azure-ai-documentintelligence
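Besides the package, the script needs the endpoint and key of the deployed resource. Here is a minimal sketch, assuming you keep them in environment variables instead of in the source file. The helper `load_settings` and the variable names `DOCUMENTINTELLIGENCE_ENDPOINT` and `DOCUMENTINTELLIGENCE_API_KEY` are assumptions for illustration, not something this article's code defines.

```python
import os


def load_settings(env=None):
    """Return (endpoint, key) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    try:
        endpoint = env["DOCUMENTINTELLIGENCE_ENDPOINT"]
        key = env["DOCUMENTINTELLIGENCE_API_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variable: {missing}") from None
    return endpoint, key
```

With this in place, `ENDPOINT, KEY = load_settings()` could stand in for the hard-coded placeholders, so the key never ends up in version control.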
Code
main.py
# Written against the beta SDK available in November 2024; newer releases
# may rename ContentFormat (for example to DocumentContentFormat).
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)

# Constants: replace the placeholders with your resource's endpoint and key
ENDPOINT = "<endpoint>"
KEY = "<key>"
URL = "https://raw.githubusercontent.com/Azure/azure-sdk-for-python/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_forms/forms/Invoice_1.pdf"
MODEL = "prebuilt-layout"


def analyze_document(endpoint, key, url, model):
    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    # Ask the service to return the document content as Markdown
    poller = client.begin_analyze_document(
        model,
        AnalyzeDocumentRequest(url_source=url),
        output_content_format=ContentFormat.MARKDOWN,
    )
    # Block until the long-running analysis finishes
    result: AnalyzeResult = poller.result()
    print(f"Here's the full content in format {result.content_format}:\n")
    print(result.content)


if __name__ == "__main__":
    analyze_document(ENDPOINT, KEY, URL, MODEL)
Output
Here's the full content in format ContentFormat.MARKDOWN:
Contoso
Address:
1 Redmond way Suite
6000 Redmond, WA
99243
Invoice For: Microsoft
1020 Enterprise Way
Sunnayvale, CA 87659
<table>
<tr>
<th>Invoice Number</th>
<th>Invoice Date</th>
<th>Invoice Due Date</th>
<th>Charges</th>
<th>VAT ID</th>
</tr>
<tr>
<td>34278587</td>
<td>6/18/2017</td>
<td>6/24/2017</td>
<td>$56,651.49</td>
<td>PT</td>
</tr>
</table>
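As the output shows, tables come back as HTML `<table>` markup embedded in the Markdown. If you want the cell values as data, one option is to pull them out with the standard library. The following is only a sketch: `TableExtractor` and `extract_tables` are hypothetical helpers, the sample string is a shortened copy of the output above, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a string as a list of rows of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (plain Markdown, newlines) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(markdown_text):
    """Return all tables found in the Markdown text."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables


sample = """Invoice For: Microsoft
<table>
<tr>
<th>Invoice Number</th>
<th>Invoice Date</th>
</tr>
<tr>
<td>34278587</td>
<td>6/18/2017</td>
</tr>
</table>
"""

# Treat the first row as the header and turn the rest into dicts.
header, *rows = extract_tables(sample)[0]
records = [dict(zip(header, row)) for row in rows]
print(records)
```

In the real script you would pass `result.content` instead of `sample`.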
Writing the output to an .md file
main.py
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)

# Constants: replace the placeholders with your resource's endpoint and key
ENDPOINT = "<endpoint>"
KEY = "<key>"
URL = "https://raw.githubusercontent.com/Azure/azure-sdk-for-python/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_forms/forms/Invoice_1.pdf"
MODEL = "prebuilt-layout"


def analyze_document(endpoint, key, url, model):
    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    # Ask the service to return the document content as Markdown
    poller = client.begin_analyze_document(
        model,
        AnalyzeDocumentRequest(url_source=url),
        output_content_format=ContentFormat.MARKDOWN,
    )
    result: AnalyzeResult = poller.result()

    # Save the Markdown content to a file
    output_filename = "output.md"
    with open(output_filename, "w", encoding="utf-8") as md_file:
        md_file.write(f"Here's the full content in format {result.content_format}:\n\n")
        md_file.write(result.content)
    print(f"Markdown content has been written to {output_filename}")


if __name__ == "__main__":
    analyze_document(ENDPOINT, KEY, URL, MODEL)
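Looks good.
Summary
I didn't know Document Intelligence could output Markdown, so this was a useful thing to learn.
I'm thinking of building it into my preprocessing step (even though it's a bit pricey).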
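One small thing about the script above: the "Here's the full content in format ..." line is written into the file too, so it becomes part of the Markdown document. If you only want the document content, a sketch like the following might be enough. `save_markdown` is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path


def save_markdown(content, path="output.md"):
    """Write Markdown text to `path` as UTF-8 and return the resulting Path."""
    target = Path(path)
    # Create the parent folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```

Inside `analyze_document`, calling `save_markdown(result.content)` would then take the place of the `with open(...)` block.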