
Reading out-of-order text in PDFs cleanly with Python

Published 2023/04/28

Library used

$ pip install pymupdf

https://pymupdf.readthedocs.io/en/latest/

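PyMuPDF returns every block, line, and span with a bbox (x0, y0, x1, y1). Sorting them by the top-left corner of that bbox with the built-in sorted function (top to bottom, then left to right) is enough to restore the reading order.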

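import fitz  # PyMuPDF


def sort_by_bbox(chunks: list) -> list:
    # bbox is (x0, y0, x1, y1). sorted() is stable, so sorting by x0 and
    # then by y0 gives top-to-bottom order, left-to-right for equal y0.
    chunks = sorted(chunks, key=lambda x: x["bbox"][0])
    chunks = sorted(chunks, key=lambda x: x["bbox"][1])
    return chunks


doc = fitz.open("読み取りたいPDF.pdf")  # path to the PDF you want to read
for page in doc:
    text = page.get_text("dict")
    if "blocks" not in text:
        continue
    blocks = sort_by_bbox(text["blocks"])
    for block in blocks:
        # Image blocks have no "lines" key, so skip them.
        if "lines" not in block:
            continue
        lines = sort_by_bbox(block["lines"])
        block_text = ""
        for line in lines:
            for span in sort_by_bbox(line["spans"]):
                block_text += span["text"]
        print(block_text)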
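
To see what the two-pass sort does without opening a PDF, here is a minimal sketch on hand-made data. The sample chunks and the sort_by_bbox_single helper are hypothetical and not part of PyMuPDF. Only the "bbox" key mirrors the dict output, and the "text" key is there just to label each chunk. It also shows that a single sort with a (y0, x0) tuple key should give the same order as the two stable sorts.

```python
# Hypothetical sample data, deliberately shuffled.
# bbox is (x0, y0, x1, y1), as in PyMuPDF's dict output.
chunks = [
    {"bbox": (300.0, 50.0, 400.0, 62.0), "text": "B"},    # top row, right
    {"bbox": (50.0, 120.0, 150.0, 132.0), "text": "C"},   # second row, left
    {"bbox": (50.0, 50.0, 150.0, 62.0), "text": "A"},     # top row, left
    {"bbox": (300.0, 120.0, 400.0, 132.0), "text": "D"},  # second row, right
]


def sort_by_bbox(chunks):
    """Two stable sorts, as in the article: x0 first, then y0."""
    chunks = sorted(chunks, key=lambda c: c["bbox"][0])
    return sorted(chunks, key=lambda c: c["bbox"][1])


def sort_by_bbox_single(chunks):
    """Illustrative alternative: one pass with a (y0, x0) tuple key."""
    return sorted(chunks, key=lambda c: (c["bbox"][1], c["bbox"][0]))


two_pass = [c["text"] for c in sort_by_bbox(chunks)]
one_pass = [c["text"] for c in sort_by_bbox_single(chunks)]
print(two_pass)  # ['A', 'B', 'C', 'D']
print(one_pass)  # ['A', 'B', 'C', 'D']
```

Both variants compare y0 values exactly, so two chunks on the same visual row whose y0 differs by even a fraction of a point are ordered by y0 alone. Whether that matters depends on the PDF.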

Reference

https://note.nkmk.me/python-dict-list-sort/
