
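Splitting a PDF by chapter for more effective embeddings

Published 2023/05/24

Introduction

As a PoC, I am currently building a LINE Bot that embeds the PDF instruction manual of a product and answers user questions through the ChatGPT API.

For embedding PDFs, the LangChain documentation describes extracting the text from the PDF and splitting it every fixed number of characters before embedding.

However, that approach can cut a chapter in two, and there were cases where the bot could not correctly answer user questions about content from the latter half of a chapter.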

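# The implementation is described below; this is an excerpt.
# This code splits the PDF into chunks of 1000 characters.
# Import paths as of LangChain in 2023; newer releases may have moved them.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader("")
documents = loader.load_and_split()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)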
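
To make the problem concrete, here is a minimal sketch in plain Python. It is a simplification, not LangChain's actual splitting logic: `split_fixed` and the sample manual text are hypothetical, and the chunk size is shrunk to 40 characters so the effect is visible on a short string.

```python
from typing import List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical manual text with two short chapters.
manual = (
    "1. Setup\nPlug in the device. Press the power button.\n"
    "2. Cleaning\nWipe with a dry cloth. Do not use water.\n"
)

chunks = split_fixed(manual, 40)
for index, chunk in enumerate(chunks):
    # The "2. Cleaning" heading and the end of that chapter land in different chunks.
    print(index, repr(chunk))
```

A question about "Do not use water." would retrieve a chunk that no longer contains the chapter heading it belongs to, which is the failure described above.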

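So I decided to look for a way to extract and split the text chapter by chapter.

I also considered an OCR-based implementation, but could not think of a way to split by chapter with it. If anyone has experience with this, I would be glad to hear about it.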

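Steps

1. Get the PDF's table of contents
2. Get the font sizes in the PDF
3. Find text at or above a certain font size that matches a table-of-contents entry
(from here on, this text is called a "title")
4. Until the next title appears, join the text in between into a single passage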

Step 1. Get the PDF's table of contents

Most PDF libraries make it easy to get the table of contents.
Here I use PyMuPDF.

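import fitz  # PyMuPDF

# Excerpt from a method of the extractor class, hence self.pdf_path.
doc = fitz.open(self.pdf_path)
# get_toc() returns a list of [level, title, page] entries; page is 1-based.
outline = doc.get_toc()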

https://github.com/pymupdf/PyMuPDF
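
Each entry returned by `get_toc()` has the shape `[level, title, page]`. The sketch below works on hypothetical sample data in that shape and keeps only chapter-level entries. Note that the article's code does not filter by level; `TocEntry`, `chapter_entries`, and `max_level` are names introduced here for illustration only.

```python
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    level: int  # 1 = top-level chapter, 2 = subsection, ...
    title: str
    page: int   # 1-based page number


def chapter_entries(toc: List[list], max_level: int = 1) -> List[TocEntry]:
    """Keep only the entries whose nesting level is at most max_level."""
    return [TocEntry(*entry[:3]) for entry in toc if entry[0] <= max_level]


# Hypothetical data in the shape returned by doc.get_toc().
sample_toc = [
    [1, "Safety", 3],
    [2, "Warnings", 3],
    [1, "Setup", 5],
]
chapters = chapter_entries(sample_toc)
for chapter in chapters:
    print(chapter.title, chapter.page)
```

Filtering like this may help when a manual has deeply nested subsections and only the chapters should become embedding units.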

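Step 2. Get the font sizes in the PDF

This is done with a library called pdfplumber.
The processing here is a little involved, so I explain it in several smaller steps.

Open the PDF

Passing pdf_path to open returns the PDF's information, which is received in a variable named pdf.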

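import pdfplumber

pdf_path = ""
with pdfplumber.open(pdf_path) as pdf:
    # Excerpt from inside a method of the extractor class, hence return and self.
    # ignore_minimum_font_size is explained in the next step.
    return self._extract_text_from_pages(pdf.pages, ignore_minimum_font_size)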

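Extract the page number, characters, and font sizes for each page

From the PDF information obtained above, get for each page its page number, the characters on the page, and the font size of each character.

The PDF also contained small characters that I wanted to leave out of the text, such as * and page numbers, so the function takes an argument, ignore_minimum_font_size, and ignores characters whose font size is below it.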

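def _extract_text_from_pages(pages, ignore_minimum_font_size):
    extracted_pages = []
    # Page numbers are 1-based, matching the page numbers in the table of contents.
    for page_num, page in enumerate(pages, start=1):
        # page.chars holds one dict per character. Keep its rounded size and text,
        # and drop characters smaller than ignore_minimum_font_size.
        page_content = [
            {"size": round(char["size"]), "text": char["text"]}
            for char in page.chars
            if round(char["size"]) >= ignore_minimum_font_size
        ]
        extracted_pages.append({"page_num": page_num, "content": page_content})
    return extracted_pages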

The function returns a list like this.

[
  {
    "page_num": 1,
    "content": [
      {
        "size": 9,
        "text": "a"
      },
      {
        "size": 9,
        "text": "b"
      }
    ]
  }
]
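
The article does not say how the font-size thresholds were chosen. One possible approach, sketched here on hypothetical data in the shape shown above, is to count characters per font size: the most frequent size is usually the body text, and larger sizes are heading candidates. `font_size_histogram` is a name introduced for this sketch, and it assumes the pages were extracted with ignore_minimum_font_size set to 0 so that nothing is filtered out beforehand.

```python
from collections import Counter
from typing import Any, Dict, List


def font_size_histogram(pages: List[Dict[str, Any]]) -> Counter:
    """Count how many characters appear at each (rounded) font size."""
    counts: Counter = Counter()
    for page in pages:
        for item in page["content"]:
            counts[item["size"]] += len(item["text"])
    return counts


# Hypothetical extracted pages: a two-character heading and three body characters.
pages = [
    {
        "page_num": 1,
        "content": [
            {"size": 18, "text": "S"},
            {"size": 18, "text": "e"},
            {"size": 9, "text": "a"},
            {"size": 9, "text": "b"},
            {"size": 9, "text": "c"},
        ],
    },
]

histogram = font_size_histogram(pages)
body_size = histogram.most_common(1)[0][0]  # most frequent size, likely body text
heading_sizes = sorted(size for size in histogram if size > body_size)
print(histogram, body_size, heading_sizes)
```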

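Combine text of the same size

As the output above shows, the text is split into single characters, so consecutive characters with the same font size are treated as one piece of text and joined together.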

from typing import Any, Dict, List

# Both functions are static helpers of the PDFExtractor class (no self parameter).
def _combine_same_size_text(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for page in pages:
        page["content"] = PDFExtractor._combine_text_of_same_size(page["content"])
    return pages

def _combine_text_of_same_size(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    combined_content: List[Dict[str, Any]] = []
    current_size: int = 0
    current_text: str = ""

    for item in content:
        # The font size changed: flush the run collected so far.
        if current_text and current_size != item["size"]:
            combined_content.append({"size": current_size, "text": current_text})
            current_text = ""
        current_size = item["size"]
        current_text = current_text + item["text"]
    # Flush the final run.
    if current_text:
        combined_content.append({"size": current_size, "text": current_text})
    return combined_content
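
As an alternative way to express the same idea, consecutive items with equal size can be grouped with `itertools.groupby` from the standard library. This is a sketch, not the author's implementation; `combine_runs` and the sample data are hypothetical, and it assumes every item has non-empty text.

```python
from itertools import groupby
from typing import Any, Dict, List


def combine_runs(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Join consecutive items that share the same font size into one item."""
    return [
        {"size": size, "text": "".join(item["text"] for item in run)}
        for size, run in groupby(content, key=lambda item: item["size"])
    ]


# Hypothetical per-character content for one page.
content = [
    {"size": 18, "text": "S"},
    {"size": 18, "text": "e"},
    {"size": 9, "text": "a"},
    {"size": 9, "text": "b"},
    {"size": 18, "text": "T"},
]
combined = combine_runs(content)
print(combined)
```

`groupby` only merges adjacent items, so the two size-18 runs stay separate, which is the behavior needed to keep headings apart from the body text between them.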

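Steps 3 and 4. Find text at or above a certain font size that matches a table-of-contents entry

Get the titles and page numbers of the current and next table-of-contents entries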

from typing import Any, Dict, List, Tuple

def _extract_section_from_content(
    self,
    contents: List[Dict[str, Any]],
    outlines: List[List[Any]],  # entries from get_toc(): [level, title, page]
    start_idx: int,
    end_idx: int,
    font_size: int,
) -> Tuple[str, str]:
    pagenum, next_pagenum = outlines[start_idx][2], outlines[end_idx][2]
    title, next_title = outlines[start_idx][1], outlines[end_idx][1]
    return self._find_section_in_contents(
        contents, pagenum, next_pagenum, title, next_title, font_size
    )
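
The excerpt above takes start_idx and end_idx but does not show the loop that supplies them. A minimal sketch of such a loop, assuming each entry is simply paired with the one after it, could look like this. `consecutive_pairs` and the sample outline are hypothetical; the last entry has no successor and would need separate handling, which this sketch does not cover.

```python
from typing import Iterator, List, Tuple


def consecutive_pairs(outlines: List[list]) -> Iterator[Tuple[int, int]]:
    """Yield (start_idx, end_idx) for each entry and the entry right after it."""
    for start_idx in range(len(outlines) - 1):
        yield start_idx, start_idx + 1


# Hypothetical outline in the shape returned by get_toc().
sample_outline = [[1, "Safety", 3], [1, "Setup", 5], [1, "Cleaning", 9]]

for start_idx, end_idx in consecutive_pairs(sample_outline):
    title, page = sample_outline[start_idx][1], sample_outline[start_idx][2]
    next_title, next_page = sample_outline[end_idx][1], sample_outline[end_idx][2]
    print(f"{title} (p.{page}) ends where {next_title} (p.{next_page}) begins")
```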

Find the matching title, then join everything up to the next title into one passage

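from typing import Any, Dict, List, Tuple

def _find_section_in_contents(
    contents: List[Dict[str, Any]],
    start_page: int,
    end_page: int,
    title: str,
    next_title: str,
    font_size: int,
) -> Tuple[str, str]:
    current_title = ""
    current_section = ""

    is_finished = False

    # Page numbers are 1-based, so start_page - 1 is the index of the first page to scan.
    for content in contents[start_page - 1 : end_page]:
        if is_finished:
            break
        for item in content["content"]:
            # Large text that contains the title marks the start of the section.
            if item["size"] >= font_size and title in item["text"]:
                current_title = item["text"]
            # Large text that contains the next title marks the end of the section.
            elif (
                current_title
                and item["size"] >= font_size
                and next_title in item["text"]
            ):
                is_finished = True
                break
            # Everything in between belongs to the current section.
            elif current_title:
                current_section += " " + item["text"]
    return current_title, current_section.strip()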
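
To see how the matching behaves, the following sketch runs the same logic on hypothetical toy data. `find_section` is a standalone copy written for experimentation (it returns early instead of using a flag), and the page contents are made up.

```python
from typing import Any, Dict, List, Tuple


def find_section(
    contents: List[Dict[str, Any]],
    start_page: int,
    end_page: int,
    title: str,
    next_title: str,
    font_size: int,
) -> Tuple[str, str]:
    """Collect the text between a title and the next title (standalone copy)."""
    current_title, current_section = "", ""
    for page in contents[start_page - 1 : end_page]:
        for item in page["content"]:
            is_heading = item["size"] >= font_size
            if is_heading and title in item["text"]:
                current_title = item["text"]
            elif current_title and is_heading and next_title in item["text"]:
                return current_title, current_section.strip()
            elif current_title:
                current_section += " " + item["text"]
    return current_title, current_section.strip()


# Hypothetical pages after same-size text has been combined.
contents = [
    {"page_num": 1, "content": [
        {"size": 18, "text": "Setup"},
        {"size": 9, "text": "Plug in the device."},
    ]},
    {"page_num": 2, "content": [
        {"size": 9, "text": "Press the power button."},
        {"size": 18, "text": "Cleaning"},
        {"size": 9, "text": "Wipe with a dry cloth."},
    ]},
]

section_title, section_text = find_section(contents, 1, 2, "Setup", "Cleaning", 18)
print(section_title, "|", section_text)
```

The section keeps the text that continues onto page 2 and stops at the "Cleaning" heading, so the whole chapter becomes a single unit for embedding.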

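Conclusion

With this implementation, I was able to split the text by chapter for many PDFs.
However, it does not work for every PDF, and a final visual check was still necessary.

Examples where it fails include PDFs whose extracted text is garbled in the first place, so the characters cannot be read correctly, and the table-of-contents section itself, which could not be extracted well because that page lists both a title and the next chapter's title.

For this project, I judged that users would rarely ask about the table of contents, so I did not extract the table-of-contents section.

The complete code combining all of this is available here, so please give it a try!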
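
Since a final visual check is needed anyway, one way to narrow it down is to flag sections whose extracted text is empty or suspiciously short. This is only a sketch under assumptions: `sections_needing_review`, the `min_chars` threshold, and the title-to-text mapping are hypothetical and not part of the original code.

```python
from typing import Dict, List


def sections_needing_review(sections: Dict[str, str], min_chars: int = 20) -> List[str]:
    """Return titles whose extracted text is shorter than min_chars."""
    return [
        title
        for title, text in sections.items()
        if len(text.strip()) < min_chars
    ]


# Hypothetical extraction result: title -> section text.
extracted = {
    "Setup": "Plug in the device. Press the power button.",
    "Cleaning": "",
    "Warranty": "See p.3",
}
to_check = sections_needing_review(extracted)
print(to_check)
```

A check like this would not detect garbled characters, but it can point to titles that were never matched in the page text.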
https://github.com/unochanel/PDFLoader