
【Azure Document Intelligence】About the Layout Model

Published on 2024/11/25

Date written

2024/11/25

What I wanted to do

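I had been using the Read model of Azure Document Intelligence for OCR, but it could not detect checkboxes, and I was stuck on how to deal with that (a custom model is a bit pricey...).
While browsing the Azure portal, I came across the Layout model, whose description says it can detect checkboxes. So I decided to try it out.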

Prerequisites

  • An Azure Document Intelligence resource on the S0 (Standard) pricing tier has already been created
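The samples below hard-code the resource's endpoint and key as placeholders. As an alternative, here is a minimal sketch that reads them from environment variables instead. The variable names AZURE_DOCINTEL_ENDPOINT and AZURE_DOCINTEL_KEY and the helper load_credentials are hypothetical choices for illustration, not something defined by the SDK or the portal.

```python
import os


def load_credentials(
    endpoint_var: str = "AZURE_DOCINTEL_ENDPOINT",  # hypothetical variable name
    key_var: str = "AZURE_DOCINTEL_KEY",  # hypothetical variable name
) -> tuple[str, str]:
    """Read the S0 resource's endpoint and key from environment variables.

    Raises RuntimeError if either variable is missing or empty, so a
    misconfiguration is caught before any request is sent.
    """
    endpoint = os.environ.get(endpoint_var)
    key = os.environ.get(key_var)
    if not endpoint or not key:
        raise RuntimeError(f"Set {endpoint_var} and {key_var} before running the samples")
    return endpoint, key
```

Usage would be `endpoint, key = load_credentials()`, after which the two values are passed to `DocumentAnalysisClient` exactly as in the samples.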

Running the sample code

I'll try the "Layout" model (model ID prebuilt-layout) with the following sample code.

sample code

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

# sample document
formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-layout", formUrl)
result = poller.result()

for idx, style in enumerate(result.styles):
    print(
        "Document contains {} content".format(
            "handwritten" if style.is_handwritten else "no handwritten"
        )
    )

for page in result.pages:
    for line_idx, line in enumerate(page.lines):
        print(
            "...Line # {} has text content '{}'".format(
                line_idx, line.content.encode("utf-8")
            )
        )

    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' and has a confidence of {}".format(
                selection_mark.state, selection_mark.confidence
            )
        )

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )

    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )

print("----------------------------------------")

The sample code analyzes the following PDF:

https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/sample-layout.pdf

Output

Document contains handwritten content
...Line # 0 has text content 'b'UNITED STATES''
...Line # 1 has text content 'b'SECURITIES AND EXCHANGE COMMISSION''
...Line # 2 has text content 'b'Washington, D.C. 20549''
...Line # 3 has text content 'b'FORM 10-Q''
...Line # 4 has text content 'b'\xe2\x98\x92''
...Line # 5 has text content 'b'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF''
...Line # 6 has text content 'b'1934''
...Line # 7 has text content 'b'For the Quarterly Period Ended March 31, 2020''
...Line # 8 has text content 'b'OR''
...Line # 9 has text content 'b'\xe2\x98\x90''
...Line # 10 has text content 'b'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF''
...Line # 11 has text content 'b'1934''
...Line # 12 has text content 'b'For the Transition Period From''
...Line # 13 has text content 'b'to''
...Line # 14 has text content 'b'Commission File Number 001-37845''
...Line # 15 has text content 'b'MICROSOFT CORPORATION''
...Line # 16 has text content 'b'WASHINGTON''
...Line # 17 has text content 'b'(STATE OF INCORPORATION)''
...Line # 18 has text content 'b'91-1144442''
...Line # 19 has text content 'b'(I.R.S. ID)''
...Line # 20 has text content 'b'ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399''
...Line # 21 has text content 'b'(425) 882-8080''
...Line # 22 has text content 'b'www.microsoft.com/investor''
...Line # 23 has text content 'b'Securities registered pursuant to Section 12(b) of the Act:''
...Line # 24 has text content 'b'Title of each class''
...Line # 25 has text content 'b'Trading Symbol''
...Line # 26 has text content 'b'Name of exchange on which registered''
...Line # 27 has text content 'b'Common stock, $0.00000625 par value per share''
...Line # 28 has text content 'b'MSFT''
...Line # 29 has text content 'b'NASDAQ''
...Line # 30 has text content 'b'2.125% Notes due 2021''
...Line # 31 has text content 'b'MSFT''
...Line # 32 has text content 'b'NASDAQ''
...Line # 33 has text content 'b'3.125% Notes due 2028''
...Line # 34 has text content 'b'MSFT''
...Line # 35 has text content 'b'NASDAQ''
...Line # 36 has text content 'b'2.625% Notes due 2033''
...Line # 37 has text content 'b'MSFT''
...Line # 38 has text content 'b'NASDAQ''
...Line # 39 has text content 'b'Securities registered pursuant to Section 12(g) of the Act:''
...Line # 40 has text content 'b'NONE''
...Line # 41 has text content 'b'Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange''
...Line # 42 has text content 'b'Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has''
...Line # 43 has text content 'b'been subject to such filing requirements for the past 90 days. Yes''
...Line # 44 has text content 'b'No''
...Line # 45 has text content 'b'Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule''
...Line # 46 has text content 'b'405 of Regulation S-T (\xc2\xa7232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to''
...Line # 47 has text content 'b'submit such files). Yes''
...Line # 48 has text content 'b'No''
...Line # 49 has text content 'b'Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting''
...Line # 50 has text content 'b'company, or an emerging growth company. See the definitions of "large accelerated filer," "accelerated filer," "smaller reporting company,"''
...Line # 51 has text content 'b'and "emerging growth company" in Rule 12b-2 of the Exchange Act.''
...Line # 52 has text content 'b'Large accelerated filer''
...Line # 53 has text content 'b'Accelerated filer''
...Line # 54 has text content 'b'Non-accelerated filer''
...Line # 55 has text content 'b'Smaller reporting company''
...Line # 56 has text content 'b'Emerging growth company''
...Line # 57 has text content 'b'If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying''
...Line # 58 has text content 'b'with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.''
...Line # 59 has text content 'b'Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act).''
...Line # 60 has text content 'b'Yes''
...Line # 61 has text content 'b'No''
...Line # 62 has text content 'b"Indicate the number of shares outstanding of each of the issuer's classes of common stock, as of the latest practicable date."'
...Line # 63 has text content 'b'Class''
...Line # 64 has text content 'b'Outstanding as of April 24, 2020''
...Line # 65 has text content 'b'Common Stock, $0.00000625 par value per share''
...Line # 66 has text content 'b'7,583,440,247 shares''
...Selection mark is 'selected' and has a confidence of 0.98
...Selection mark is 'unselected' and has a confidence of 0.98
...Selection mark is 'selected' and has a confidence of 0.99
...Selection mark is 'unselected' and has a confidence of 0.995
...Selection mark is 'selected' and has a confidence of 0.99
...Selection mark is 'unselected' and has a confidence of 0.987
...Selection mark is 'selected' and has a confidence of 0.983
...Selection mark is 'unselected' and has a confidence of 0.982
...Selection mark is 'unselected' and has a confidence of 0.99
...Selection mark is 'unselected' and has a confidence of 0.995
...Selection mark is 'unselected' and has a confidence of 0.987
...Selection mark is 'unselected' and has a confidence of 0.99
...Selection mark is 'unselected' and has a confidence of 0.997
...Selection mark is 'selected' and has a confidence of 0.99
Table # 0 has 5 rows and 3 columns
...Cell[0][0] has content 'b'Title of each class''
...Cell[0][1] has content 'b'Trading Symbol''
...Cell[0][2] has content 'b'Name of exchange on which registered''
...Cell[1][0] has content 'b'Common stock, $0.00000625 par value per share''
...Cell[1][1] has content 'b'MSFT''
...Cell[1][2] has content 'b'NASDAQ''
...Cell[2][0] has content 'b'2.125% Notes due 2021''
...Cell[2][1] has content 'b'MSFT''
...Cell[2][2] has content 'b'NASDAQ''
...Cell[3][0] has content 'b'3.125% Notes due 2028''
...Cell[3][1] has content 'b'MSFT''
...Cell[3][2] has content 'b'NASDAQ''
...Cell[4][0] has content 'b'2.625% Notes due 2033''
...Cell[4][1] has content 'b'MSFT''
...Cell[4][2] has content 'b'NASDAQ''
Table # 1 has 2 rows and 2 columns
...Cell[0][0] has content 'b'Class''
...Cell[0][1] has content 'b'Outstanding as of April 24, 2020''
...Cell[1][0] has content 'b'Common Stock, $0.00000625 par value per share''
...Cell[1][1] has content 'b'7,583,440,247 shares''
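The b'...' prefixes and \x escapes in this output come from line.content.encode("utf-8"), which turns each string into a bytes object before it is formatted. Decoding a few of those byte strings shows that lines #4 and #9 are the checkbox glyphs themselves, and that \xc2\xa7 inside line #46 is the section sign. This small sketch uses only the standard library; the dictionary labels are just descriptions of where each value appears above.

```python
import unicodedata

# Byte strings copied from the output above (produced by .encode("utf-8"))
samples = {
    "Line # 4": b"\xe2\x98\x92",
    "Line # 9": b"\xe2\x98\x90",
    "part of Line # 46": b"\xc2\xa7",
}

for label, raw in samples.items():
    text = raw.decode("utf-8")
    # unicodedata.name returns the official Unicode name of the character
    print(label, repr(text), f"U+{ord(text):04X}", unicodedata.name(text))

# Expected output:
# Line # 4 '☒' U+2612 BALLOT BOX WITH X
# Line # 9 '☐' U+2610 BALLOT BOX
# part of Line # 46 '§' U+00A7 SECTION SIGN
```

Printing line.content directly, without .encode("utf-8"), avoids the escaping altogether in a UTF-8 capable terminal.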

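Explanation
  1. Text lines
    The lines of text extracted from the document are listed in reading order.

  2. Selection marks
    The state of each checkbox ('selected' or 'unselected') is shown together with a confidence score.

  3. Table contents
    The document contains two tables, which hold the following information:

    • Table # 0: the class of each security, its trading symbol, and the exchange on which it is registered.
    • Table # 1: the class of stock and the number of shares outstanding as of April 24, 2020.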
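The table output above is a flat list of cells, each with a row and column index. A minimal sketch of rebuilding it into a row-major grid, using only the attributes the sample already prints (row_count, column_count, row_index, column_index, content). table_to_grid is a hypothetical helper name, and the SimpleNamespace object is a stand-in for result.tables[1] rebuilt by hand from the output, not a real SDK object.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Turn a DocumentTable-like object into a row-major list of lists.

    Spanning cells are written at their top-left position only,
    which is a simplification.
    """
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Hypothetical stand-in for result.tables[1], rebuilt from the output above
table_1 = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Class"),
        SimpleNamespace(row_index=0, column_index=1, content="Outstanding as of April 24, 2020"),
        SimpleNamespace(row_index=1, column_index=0, content="Common Stock, $0.00000625 par value per share"),
        SimpleNamespace(row_index=1, column_index=1, content="7,583,440,247 shares"),
    ],
)

for row in table_to_grid(table_1):
    print(" | ".join(row))
```

Run as is, this prints the two rows "Class | Outstanding as of April 24, 2020" and "Common Stock, $0.00000625 par value per share | 7,583,440,247 shares". With a real result, table_to_grid(result.tables[0]) should work the same way.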

Analyzing a local PDF

local.py

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Path to the local PDF file (replace with your own)
document_path = "path/to/your.pdf"

with open(document_path, "rb") as document_file:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-layout", document=document_file
    )
    result = poller.result()

for idx, style in enumerate(result.styles):
    print(
        "Document contains {} content".format(
            "handwritten" if style.is_handwritten else "no handwritten"
        )
    )

for page in result.pages:
    for line_idx, line in enumerate(page.lines):
        print(
            "...Line # {} has text content '{}'".format(
                line_idx, line.content.encode("utf-8")
            )
        )
    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' and has a confidence of {}".format(
                selection_mark.state, selection_mark.confidence
            )
        )

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )

print("----------------------------------------")
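When analyzing several local files, it can help to wrap the steps of local.py in a function that fails early with a clear message if the path is wrong. A minimal sketch: analyze_local_pdf is a hypothetical helper name, and it calls only begin_analyze_document(model_id, document=...) and poller.result(), the same calls local.py uses.

```python
from pathlib import Path


def analyze_local_pdf(client, document_path, model_id="prebuilt-layout"):
    """Analyze a local file with a DocumentAnalysisClient and return the result.

    Raises FileNotFoundError before any request is sent if the path is wrong.
    """
    path = Path(document_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    with path.open("rb") as document_file:
        poller = client.begin_analyze_document(model_id, document=document_file)
        # result() blocks until the analysis has finished
        return poller.result()
```

With the client from local.py, `result = analyze_local_pdf(document_analysis_client, document_path)` replaces the `with open(...)` block, and the printing loops can stay unchanged.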

Summary

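Looking at the output, checkboxes are detected, but the sticking point is that it's hard to tell which item (key) each checkbox belongs to: each selection mark comes back with only a state and a confidence. On the other hand, being able to detect tables is really nice. Looks like I'll have to go with a custom model after all.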
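One rough workaround for the key problem is to use geometry. Assuming azure-ai-formrecognizer 3.2 or later, where both selection marks and lines expose a polygon of points with x and y, each mark can be attached to the nearest line of text. The sketch below is a naive heuristic, not something the SDK or the Layout model provides; the helper names, the 0.5 threshold, and the demo coordinates are hypothetical. It can misfire on dense forms, which is why a custom model may still be the more reliable route.

```python
from math import hypot
from types import SimpleNamespace

# ☐ (U+2610) and ☒ (U+2612): Layout also emits these glyphs as text lines
CHECKBOX_GLYPHS = {"\u2610", "\u2612"}


def bounds(polygon):
    """Axis-aligned bounds (x0, y0, x1, y1) of points exposing .x and .y."""
    xs = [p.x for p in polygon]
    ys = [p.y for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def distance_to_bounds(x, y, box):
    """Distance from (x, y) to a box; 0.0 if the point lies inside it."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return hypot(dx, dy)


def nearest_label(mark, lines, max_distance=0.5):
    """Guess a selection mark's label: the closest line that is not a glyph.

    max_distance is in page units (inches for PDFs) and is an arbitrary
    assumption. Returns None if no line is close enough.
    """
    x0, y0, x1, y1 = bounds(mark.polygon)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    best, best_dist = None, max_distance
    for line in lines:
        if line.content.strip() in CHECKBOX_GLYPHS:
            continue  # skip the ☐/☒ text line that sits on the mark itself
        dist = distance_to_bounds(cx, cy, bounds(line.polygon))
        if dist < best_dist:
            best, best_dist = line.content, dist
    return best


def rect(x0, y0, x1, y1):
    """Build a 4-point polygon; used only for the made-up demo data."""
    return [
        SimpleNamespace(x=x0, y=y0),
        SimpleNamespace(x=x1, y=y0),
        SimpleNamespace(x=x1, y=y1),
        SimpleNamespace(x=x0, y=y1),
    ]


# Hypothetical coordinates, loosely modelled on lines #4 to #6 of the output
mark = SimpleNamespace(state="selected", polygon=rect(1.0, 2.0, 1.2, 2.2))
lines = [
    SimpleNamespace(content="\u2612", polygon=rect(1.0, 2.0, 1.2, 2.2)),
    SimpleNamespace(content="QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d)", polygon=rect(1.4, 2.0, 7.0, 2.2)),
    SimpleNamespace(content="1934", polygon=rect(1.4, 2.3, 1.8, 2.5)),
]
print(mark.state, "->", nearest_label(mark, lines))

# With a real result (not run here):
# for page in result.pages:
#     for selection_mark in page.selection_marks:
#         print(selection_mark.state, nearest_label(selection_mark, page.lines))
```

Measuring to the line's bounding box rather than its centre keeps long labels from losing to short, unrelated lines just because their centre lies far to the right.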

ヘッドウォータース
