Building a VectorDB That Loads a Large Number of Files with LangChain (7)

Published on 2024/05/25

Python

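Introduction

Last time I put questions to a generative AI using three VectorDBs (Chroma, Qdrant, FAISS), but the results were disappointing.

So this time, instead of registering the file contents in the VectorDB as they are, I want to see what happens when the contents are filtered to some degree before being stored.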

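About the XML file format

As before, I use files from the Japan Patent Office (JPO) as sample input data. The data set also contains image files, CSV files and so on, but as in the previous articles I build the VectorDB only from the XML files, which contain the claim text.
To extract only the necessary parts of the XML files and store them in the VectorDB, I need to understand the format of the JPO's XML files.

Let's take a look at one of these XML files.

About namespaces

A sample of the XML files used this time is shown below.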

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../../../../../XSL/JPRegisteredPatentPublication.xsl"?>
<jppat:RegisteredPatentPublication xmlns:jpcom="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPCommon"
                                   xmlns:jppat="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPPatent"
                                   xmlns:com="http://www.wipo.int/standards/XMLSchema/ST96/Common"
                                   xmlns:pat="http://www.wipo.int/standards/XMLSchema/ST96/Patent"
                                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                   xsi:schemaLocation="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPPatent ../../../../../XSD/JPRegisteredPatentPublication_V1_0.xsd"
                                   com:languageCode="ja"
                                   com:st96Version="V3_1"
                                   com:ipoVersion="JP_V1_0">
   <com:IPOfficeCode>JP</com:IPOfficeCode>
   <jppat:RegisteredPatentPublicationBibliographicData com:languageCode="ja">
      <com:IPOfficeCode>JP</com:IPOfficeCode>
      <jppat:PatentPublicationIdentification>
         <com:IPOfficeCode>JP</com:IPOfficeCode>
         <pat:PublicationNumber>7354391</pat:PublicationNumber>
         <com:PublicationDate>2023-10-02</com:PublicationDate>
      </jppat:PatentPublicationIdentification>
      <pat:PlainLanguageDesignationText>特許公報(B2)</pat:PlainLanguageDesignationText>
      <com:RegistrationDate>2023-09-22</com:RegistrationDate>
      <jppat:ApplicationIdentification>
         <com:ApplicationNumber>
            <com:ApplicationNumberText>2022166159</com:ApplicationNumberText>
         </com:ApplicationNumber>
         <pat:FilingDate>2022-10-17</pat:FilingDate>
      </jppat:ApplicationIdentification>
      <pat:InventionTitle>半導体装置</pat:InventionTitle>
      <jppat:RegisteredPatentPublicationPartyBag>
         <jppat:ApplicantsRegisteredPractitionersBag>
            <jppat:ApplicantRegisteredPractitionerBag com:sequenceNumber="1">
               <jppat:Applicant com:sequenceNumber="1">
                  <com:PartyIdentifier>000153878</com:PartyIdentifier>
                  <jpcom:Contact>
                     <com:Name>
                        <com:EntityName>株式会社半導体エネルギー研究所</com:EntityName>
・・・・・

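Immediately after the opening tag of the root element there is a <com:IPOfficeCode> element.
The com on the left is a namespace prefix. Because the root element <jppat:RegisteredPatentPublication> declares xmlns:com="http://www.wipo.int/standards/XMLSchema/ST96/Common", the source code has to refer to com by the URI http://www.wipo.int/standards/XMLSchema/ST96/Common instead of by the prefix.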
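As a minimal sketch of what this replacement looks like in practice, the snippet below parses a tiny stand-in document with xml.etree.ElementTree and prints the tag as the parser stores it. The SAMPLE string is a made-up illustration, not a real JPO file; only the namespace URI is taken from the sample above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:com declaration in the sample.
COM = "http://www.wipo.int/standards/XMLSchema/ST96/Common"

# Tiny stand-in document (hypothetical, not a real JPO file).
SAMPLE = f'<root xmlns:com="{COM}"><com:IPOfficeCode>JP</com:IPOfficeCode></root>'

root = ET.fromstring(SAMPLE)
child = root[0]

# ElementTree drops the prefix and stores the tag as "{uri}localname".
print(child.tag)

# find() accepts either the expanded form or a prefix-to-URI map.
by_uri = root.find(f"{{{COM}}}IPOfficeCode")
by_prefix = root.find("com:IPOfficeCode", {"com": COM})
print(by_uri.text, by_prefix.text)
```

Either lookup style reaches the same element; the expanded "{uri}localname" form is the one used in the rest of this article.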

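Selecting the main tags

There are many kinds of tags; I guessed their meanings to be roughly as follows.

com:EntityName: name of the person or organization on the filing
com:P: body text
pat:PublicationNumber: publication number
com:PublicationDate: publication date
com:RegistrationDate: registration date
pat:ClaimText: claim text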
・・・

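I treat these as the main tags and store the content of the matching tags in the VectorDB.
To check whether the tag of each element is one of the main tags, I created a list. Having it hard-coded in the source is a bit ugly, though.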

# Namespace-qualified tag names to extract
# (PublicationNumber carries the pat: prefix in the sample XML)
name_spaces_tag_names = [
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PublicationNumber",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PublicationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}RegistrationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}ApplicationNumberText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PartyIdentifier",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PostalAddressText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PersonFullName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}P",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}FigureReference",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PlainLanguageDesignationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FilingDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}MainClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FurtherClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PatentClassificationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}SearchFieldText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}ClaimText",
]
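One way to make such a list less repetitive is to write the tags in their short prefix:name form and expand them in code. The following is only a sketch: qualify, NAMESPACES, short_names and qualified_tags are hypothetical names that do not appear in the original script, and only a few of the tags are shown.

```python
# Prefix -> namespace URI, copied from the xmlns declarations in the sample.
NAMESPACES = {
    "com": "http://www.wipo.int/standards/XMLSchema/ST96/Common",
    "pat": "http://www.wipo.int/standards/XMLSchema/ST96/Patent",
}


def qualify(prefixed_name: str) -> str:
    """Turn 'pat:ClaimText' into '{uri}ClaimText' (hypothetical helper)."""
    prefix, local_name = prefixed_name.split(":", 1)
    return f"{{{NAMESPACES[prefix]}}}{local_name}"


# Only a few of the tags are listed here for illustration.
short_names = ["com:EntityName", "com:P", "pat:InventionTitle", "pat:ClaimText"]
qualified_tags = [qualify(name) for name in short_names]

for tag in qualified_tags:
    print(tag)
```

With this approach a wrong prefix raises a KeyError immediately, and the short names can be compared directly against the XML sample.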

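Getting XML elements and the nested structure

Next, I thought about how to store the result of parsing the XML in the VectorDB.
XML tags are nested and have parent-child relationships, which are hard to express directly in a database table, so I ended up flattening the nested structure.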


def set_element(level, trees, el):
    trees.append({"tag" : el.tag, "attrib" : el.attrib, "content_page" :el.text})

def set_child(level, trees, el):
    set_element(level, trees, el)
    for child in el:
        set_child(level+1, trees, child)

def parse_and_get_element(input_file):
    tmp_elements = []
    new_elements = []
    tree = ET.parse(input_file)
    root = tree.getroot()
    set_child(1, tmp_elements, root)
    for name_space_tag_name in name_spaces_tag_names:
        for tmp_element in tmp_elements:
            if tmp_element["tag"] == name_space_tag_name:
                new_elements.append(tmp_element)
    return new_elements

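The program above calls set_child recursively and appends every element of the tree to tmp_elements. Finally it appends only the elements whose tag matches an entry in the name_spaces_tag_names list to new_elements and returns that list (return new_elements). Because the outer loop runs over name_spaces_tag_names, the returned list is grouped by tag in list order rather than in document order.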
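For comparison, the same flattening can be sketched with Element.iter(), which walks the tree depth-first without an explicit recursive function. This is an alternative under stated assumptions, not the code used in this article: collect_elements and the demo XML are hypothetical, the result comes back in document order instead of tag-list order, and a missing el.text is replaced with an empty string.

```python
import io
import xml.etree.ElementTree as ET


def collect_elements(source, wanted_tags):
    """Return dicts for elements whose tag is in wanted_tags, in document order."""
    wanted = set(wanted_tags)  # set lookup instead of a nested loop
    root = ET.parse(source).getroot()
    return [
        {"tag": el.tag, "attrib": dict(el.attrib), "content_page": el.text or ""}
        for el in root.iter()  # visits the root and all descendants
        if el.tag in wanted
    ]


COM = "http://www.wipo.int/standards/XMLSchema/ST96/Common"
# Made-up document with one nested and one top-level com:P element.
demo_xml = f'<r xmlns:com="{COM}"><a><com:P>first</com:P></a><com:P>second</com:P></r>'
demo = collect_elements(io.BytesIO(demo_xml.encode("utf-8")), [f"{{{COM}}}P"])
print([d["content_page"] for d in demo])
```

ET.parse accepts a file path as well as a file object, so the function can be called with the same paths that parse_and_get_element receives.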

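Storing metadata in the VectorDB

Of the three VectorDBs, I chose Chroma this time.

In an earlier article in this series I briefly compared the table structures of the three VectorDBs, and Chroma is the one whose table contents are the easiest to analyze.

The source code of the program that creates the VectorDB is shown below.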

chroma_retriever.py

import glob
import os
import xml.etree.ElementTree as ET
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

load_dotenv()

docs = []

# Namespace-qualified tag names to extract
# (PublicationNumber carries the pat: prefix in the sample XML)
name_spaces_tag_names = [
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PublicationNumber",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PublicationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}RegistrationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}ApplicationNumberText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PartyIdentifier",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PostalAddressText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PersonFullName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}P",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}FigureReference",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PlainLanguageDesignationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FilingDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}MainClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FurtherClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PatentClassificationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}SearchFieldText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}ClaimText",
]

def set_element(level, trees, el):
    trees.append({"tag" : el.tag, "attrib" : el.attrib, "content_page" :el.text})

def set_child(level, trees, el):
    set_element(level, trees, el)
    for child in el:
        set_child(level+1, trees, child)


def parse_and_get_element(input_file):
    tmp_elements = []
    new_elements = []
    tree = ET.parse(input_file)
    root = tree.getroot()
    set_child(1, tmp_elements, root)
    for name_space_tag_name in name_spaces_tag_names:
        for tmp_element in tmp_elements:
            if tmp_element["tag"] == name_space_tag_name:
                new_elements.append(tmp_element)
    return new_elements

title = ""
entryName = ""
patentCitationText = ""

files = glob.glob(os.path.join("C:\\Users\\ogiki\\JPB_2023185", "**/*.*"), recursive=True)
for file in files:
    base, ext = os.path.splitext(file)
    if ext == '.xml':
        # --- topic name (file name without extension) ---
        topic_name = os.path.splitext(os.path.basename(file))[0]
        # --- file path ---
        print(file)

        text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
        new_elements = parse_and_get_element(file)
        for new_element in new_elements:
            text = new_element["content_page"]
            tag = new_element["tag"]
            title = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle" else ""
            entryName = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName" else ""
            patentCitationText = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText" else ""

            documents = text_splitter.create_documents(texts=[text], metadatas=[{
                "name": topic_name, 
                "source": file, 
                "tag": tag, 
                "title": title,
                "entry_name": entryName, 
                "patent_citation_text" : patentCitationText}]
            )
            docs.extend(documents)


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma(persist_directory="C:\\Users\\ogiki\\vectorDB\\local_chroma", embedding_function=embeddings)

# Because of the token limit, add the documents 500 at a time.
# They were already split by create_documents, so they are added as-is,
# and stepping through the list avoids an empty final batch.
intv = 500
for start in range(0, len(docs), intv):
    db.add_documents(docs[start : start + intv])

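Regarding the part shown under "Excerpt" below, it does the following:
・when tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle" → the text is set in the title field of the metadata
・when tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName" → the text is set in the entry_name field of the metadata
・when tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText" → the text is set in the patent_citation_text field of the metadata
For any other tag, the corresponding field is left as an empty string.
I added these fields to the metadata because I expect that narrowing down by metadata attributes (title, entry_name and so on) before asking the generative AI will produce more accurate answers.
Incidentally, I previously used text-embedding-ada-002 as the embedding model, but it turned out to be very expensive, so I switched to text-embedding-3-small.

Excerpt

        for new_element in new_elements: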
            text = new_element["content_page"]
            tag = new_element["tag"]
            title = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle" else ""
            entryName = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName" else ""
            patentCitationText = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText" else ""

            documents = text_splitter.create_documents(texts=[text], metadatas=[{
                "name": topic_name, 
                "source": file, 
                "tag": tag, 
                "title": title,
                "entry_name": entryName, 
                "patent_citation_text" : patentCitationText}]
            )
            docs.extend(documents)
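In the excerpt above, each of the three fields is only filled on the document created from its own tag. If the aim is to filter every chunk of a patent by its title, one possible variant is to collect the file-level values first and attach them to all documents of that file. The sketch below shows that idea in plain Python; build_metadatas and FILE_LEVEL_FIELDS are hypothetical names, and keeping only the first non-empty value per field is an assumption (a file may well contain several EntityName or PatentCitationText elements).

```python
PAT = "http://www.wipo.int/standards/XMLSchema/ST96/Patent"
COM = "http://www.wipo.int/standards/XMLSchema/ST96/Common"

# Tag -> metadata key, matching the three fields used in the script.
FILE_LEVEL_FIELDS = {
    f"{{{PAT}}}InventionTitle": "title",
    f"{{{COM}}}EntityName": "entry_name",
    f"{{{COM}}}PatentCitationText": "patent_citation_text",
}


def build_metadatas(elements, topic_name, source):
    """Build one metadata dict per element, sharing the file-level fields."""
    shared = {key: "" for key in FILE_LEVEL_FIELDS.values()}
    # First pass: remember the first non-empty value seen for each field.
    for el in elements:
        key = FILE_LEVEL_FIELDS.get(el["tag"])
        if key and not shared[key] and el["content_page"]:
            shared[key] = el["content_page"]
    # Second pass: every element receives the same file-level values.
    return [
        {"name": topic_name, "source": source, "tag": el["tag"], **shared}
        for el in elements
    ]


# Made-up elements in the shape returned by parse_and_get_element.
demo_elements = [
    {"tag": f"{{{COM}}}P", "content_page": "Body text"},
    {"tag": f"{{{PAT}}}InventionTitle", "content_page": "半導体装置"},
]
demo_metadatas = build_metadatas(demo_elements, "sample", "sample.xml")
print(demo_metadatas[0]["title"])
```

The resulting list has one entry per element, so each entry could be passed alongside the matching text to create_documents.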

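Checking the database after the run

I actually ran the command and checked the database with DB Browser for SQLite.

In the key column you can see that, besides name and source, keys such as title and tag have been added. I expect that applying a filter on the search side before calling the generative AI will improve the accuracy of the answers.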
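As a rough sketch of what such a filter could look like, the helper below builds a metadata condition in the operator style that Chroma uses for where clauses ($eq, $and). build_where is a hypothetical helper, the name value "sample" is made up, and the commented-out search call assumes the db object from chroma_retriever.py; it is not executed here.

```python
def build_where(**conditions):
    """Build a Chroma-style metadata filter from keyword arguments."""
    clauses = [{key: {"$eq": value}} for key, value in conditions.items()]
    if not clauses:
        return None  # no filtering
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no $and wrapper
    return {"$and": clauses}


where = build_where(
    tag="{http://www.wipo.int/standards/XMLSchema/ST96/Patent}ClaimText",
    name="sample",
)
print(where)

# Possible use with the db object from chroma_retriever.py (not run here):
# results = db.similarity_search("question text", k=4, filter=where)
```

Filtering on tag alone already limits the search to, for example, claim text, which is the kind of narrowing described above.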

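Summary

The metadata for the VectorDB (SQLite) is now set up (tag, title and so on have been added).
Next time I plan to apply chainlit and modify the program so that the VectorDB data is narrowed down and the generative AI gives more accurate answers.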