Analyzing a PDF that contains images with GPT-4o, using Layout-Parser

Try the following layout extraction sample.
Deep Layout Parsing Example: With the help of Deep Learning, layoutparser supports the analysis of very complex documents and processing of the hierarchical structure in the layouts.

For installation, see the following:
pip install layoutparser
pip install "layoutparser[effdet]"
pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
pip install "layoutparser[paddledetection]"
pip install "layoutparser[ocr]"

The following error occurred when defining the model:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File /Users/daikisoga/Development/GitHub/DevAI/layout-parser_sample.py:2
1 #%%
----> 2 model1 = lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
3 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
4 label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})
6 model2 = lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_R_50_FPN_3x/config',
7 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
8 label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})
10 model3 = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
11 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
12 label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})
File ~/Development/GitHub/DevAI/layout-parser/lib/python3.10/site-packages/layoutparser/models/layoutmodel.py:52, in BaseLayoutModel.__new__(cls, *args, **kwargs)
50 def __new__(cls, *args, **kwargs):
---> 52 cls._import_module()
53 return super().__new__(cls)
File ~/Development/GitHub/DevAI/layout-parser/lib/python3.10/site-packages/layoutparser/models/layoutmodel.py:40, in BaseLayoutModel._import_module(cls)
37 for m in cls.MODULES:
38 if importlib.util.find_spec(m["module_path"]):
39 setattr(
---> 40 cls, m["import_name"], importlib.import_module(m["module_path"])
41 )
...
52 fill: Fill color used when src_rect extends outside image
53 """
54 super().__init__()
AttributeError: module 'PIL.Image' has no attribute 'LINEAR'
PIL.Image.LINEAR was removed in Pillow 10 and later, so a downgrade is required.
Reference: https://github.com/Layout-Parser/layout-parser/issues/187
pip install pillow==9.5.0
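
An alternative to downgrading (not tried here) is to restore the removed alias before the Detectron2 model is created; in Pillow 9, LINEAR was simply an alias of BILINEAR, so a minimal sketch would be:

import PIL.Image

# Restore the alias removed in Pillow 10, for libraries that still reference it
if not hasattr(PIL.Image, "LINEAR"):
    PIL.Image.LINEAR = PIL.Image.BILINEAR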

Extract the text and the images.
Each extracted image is saved to a file, the image is replaced in the text by the path of the saved file, and the result is stored together with the text.
#%%
import layoutparser as lp
import fitz  # PyMuPDF
from PIL import Image
import io
import os

# Path to the PDF file
pdf_path = "Attention_is_All_You_Need.pdf"
document = fitz.open(pdf_path)

# Create a directory for the extracted images
output_dir = os.path.splitext(pdf_path)[0]
os.makedirs(output_dir, exist_ok=True)

# Path to the output text file
output_text_file = os.path.join(output_dir, "extracted_text.txt")

# Load the layout detection model
model = lp.Detectron2LayoutModel(
    config_path='lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)

full_text = []

# Process each page
for page_num in range(len(document)):
    page = document.load_page(page_num)
    zoom_x = 300 / 72
    zoom_y = 300 / 72
    mat = fitz.Matrix(zoom_x, zoom_y)
    pix = page.get_pixmap(matrix=mat)
    img = Image.open(io.BytesIO(pix.tobytes("png")))

    # Detect the layout of the page
    layout = model.detect(img)
    for block in layout:
        print(block)
        # Text blocks: extract the text
        if block.type == 'Text':
            x1, y1, x2, y2 = map(int, block.coordinates)
            text = page.get_textbox((x1, y1, x2, y2))
            full_text.append(text)
        # Figure blocks: crop the rendered page image and save it
        elif block.type == 'Figure':
            x1, y1, x2, y2 = map(int, block.coordinates)
            cropped_img = img.crop((x1, y1, x2, y2))
            # Counter for numbering the images within a page
            if not hasattr(page, 'image_counter'):
                page.image_counter = 1
            else:
                page.image_counter += 1
            image_filename = f"page_{page_num + 1}_image_{page.image_counter:02d}.png"
            image_path = os.path.join(output_dir, image_filename)
            cropped_img.save(image_path)
            full_text.append(f"Figure : {image_path}")

# Save the text to a file
with open(output_text_file, "w") as f:
    f.write("\n".join(full_text))
As the output below shows, the text extraction accuracy is poor (a likely cause is sketched after the output).
The extraction and saving of the images, on the other hand, appear to work well.
cture [5, 2, 35].
to a sequence
rates an output
auto-regressive
ng the next.
. In these models,
t positions grows
eNet. This makes
ansformer this is
ve resolution due
ead Attention as
ifferent positions
tention has been
e summarization,
8, 22].
ead of sequence-
on answering and
n model relying
t using sequence-
sformer, motivate
Figure : Attention_is_All_You_Need/page_3_image_01.png
Figure : Attention_is_All_You_Need/page_4_image_01.png
Figure : Attention_is_All_You_Need/page_4_image_02.png
Figure : Attention_is_All_You_Need/page_4_image_03.png
ecoder to attend to
o prevent leftward
We implement this
values in the input
itional encoding
o 10000 · 2π. We
earn to attend by
near function of
5)
(3)
s training steps,
mber. We used
bout 4.5 million
a shared source-
ntly larger WMT
2000 word-piece
gth. Each training
okens and 25000
ase models using
0.4 seconds. We
(described on the
or 300,000 steps
ried the learning
del, we use a rate of
ls = 0.1 [36]. This
and BLEU score.
Figure : Attention_is_All_You_Need/page_8_image_01.png
4
53
2
90
6
5
3
7
7
4
213
ction, but no
e dimensions,
single-head
heads.
Figure : Attention_is_All_You_Need/page_9_image_01.png
odel performs sur-
e exception of the
orms the Berkeley-
Figure : Attention_is_All_You_Need/page_10_image_01.png
ang, Bowen
Xiv preprint
r. Multi-task
els with
.
ccurate
(Volume
rds
on,
rts
tdi-
Figure : Attention_is_All_You_Need/page_13_image_01.png
Figure : Attention_is_All_You_Need/page_14_image_01.png
Figure : Attention_is_All_You_Need/page_14_image_02.png
Figure : Attention_is_All_You_Need/page_15_image_01.png
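
A likely cause of the truncation: the layout is detected on a page image rendered at 300 DPI, so block.coordinates are in that pixel space, whereas page.get_textbox() expects PDF points (72 DPI). Scaling the detected box back down before calling get_textbox() should therefore already help; a minimal sketch of that adjustment (an assumption, not something tested in these notes):

# Inside the 'Text' branch: convert 300-DPI pixel coordinates back to PDF points
scale = 72 / 300
rect = fitz.Rect(x1 * scale, y1 * scale, x2 * scale, y2 * scale)
text = page.get_textbox(rect)
full_text.append(text)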

To improve the text extraction accuracy, the code was changed to OCR each text block with pyocr.
Installation:
pip install pyocr
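pyocr is only a wrapper, so an OCR engine also has to be installed on the system; on macOS (which this environment appears to be, judging from the paths above) that would typically mean Tesseract:
brew install tesseract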
Modified code:
import pyocr
import pyocr.builders

# Get an available OCR tool (e.g. Tesseract)
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    exit(1)
tool = tools[0]

# OCR each text block and insert the text at its original position
# (text_blocks: the Text-type blocks of the layout detected above,
#  e.g. [b for b in layout if b.type == 'Text'])
for block in text_blocks:
    x1, y1, x2, y2 = map(int, block.coordinates)
    cropped_img = img.crop((x1, y1, x2, y2))
    text = tool.image_to_string(
        cropped_img,
        lang="eng",
        builder=pyocr.builders.TextBuilder()
    )
    full_text.append(text)
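
Note that, unlike the earlier get_textbox() call, this version runs OCR on a crop of the same 300-DPI page rendering that the layout was detected on, so the pixel coordinates line up with the crop exactly; that is presumably why the results below are much cleaner.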
Result:
The text extraction accuracy improved.
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-
aligned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [34].
[Image: page_3_image_1.png]
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position 7 can depend only on the known outputs at positions less than 7.
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(z + Sublayer(z)), where Sublayer(z:) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding