Reading PDF text in the right order with Python when it is stored out of order
Library used
$ pip install pymupdf
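PyMuPDF returns every block, line, and span with a bounding box (bbox), so sorting them by coordinates with the built-in sorted function, top to bottom and then left to right, is enough to restore the reading order.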
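import fitz  # PyMuPDF


def sort_by_bbox(chunks: list) -> list:
    # bbox is (x0, y0, x1, y1). sorted() is stable, so sorting by x0 first
    # and then by y0 orders items top to bottom, with ties left to right.
    chunks = sorted(chunks, key=lambda x: x["bbox"][0])
    chunks = sorted(chunks, key=lambda x: x["bbox"][1])
    return chunks


# Placeholder file name: replace it with the path to the PDF you want to read.
doc = fitz.open("読み取りたいPDF.pdf")
for page in doc:
    text = page.get_text("dict")
    if "blocks" not in text:
        continue
    blocks = sort_by_bbox(text["blocks"])
    for block in blocks:
        # Image blocks have no "lines" key, so skip them.
        if "lines" not in block:
            continue
        lines = sort_by_bbox(block["lines"])
        block_text = ""
        for line in lines:
            for span in sort_by_bbox(line["spans"]):
                block_text += span["text"]
        print(block_text)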
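To see what the two-pass sort does without opening a PDF, here is a minimal sketch that runs sort_by_bbox on a few made-up spans. The coordinates, the texts, and the helper name sort_by_bbox_single_pass are assumptions for illustration only. Because sorted is stable, the two passes should give the same order as a single sort keyed on (y0, x0).

```python
def sort_by_bbox(chunks: list) -> list:
    # Same two-pass approach as the article: sort by x0, then stably by y0.
    chunks = sorted(chunks, key=lambda x: x["bbox"][0])
    chunks = sorted(chunks, key=lambda x: x["bbox"][1])
    return chunks


def sort_by_bbox_single_pass(chunks: list) -> list:
    # Hypothetical equivalent: y0 is the primary key, x0 breaks ties.
    return sorted(chunks, key=lambda c: (c["bbox"][1], c["bbox"][0]))


# Made-up spans, listed in the shuffled order a PDF might store them.
spans = [
    {"text": "world", "bbox": (120.0, 50.0, 160.0, 62.0)},
    {"text": "second line", "bbox": (72.0, 70.0, 150.0, 82.0)},
    {"text": "Hello ", "bbox": (72.0, 50.0, 118.0, 62.0)},
]

ordered = sort_by_bbox(spans)
print([span["text"] for span in ordered])
```

Note that this only groups items whose y0 values match exactly. Spans that sit on the same visual line but have slightly different y0 values are ordered by y0 alone, so they can end up out of left-to-right order.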