
Trying LangChain's Getting Started on Google Colaboratory ⑧ Indexes

kun432

Indexes

https://langchain.readthedocs.io/en/latest/modules/indexes/getting_started.html

First of all, what is an embedding? I asked ChatGPT.

In language processing, an "embedding" is the process of converting each word in text data into a fixed-length numeric vector. This makes the text data easier to handle with machine learning algorithms.

Concretely, an embedding maps words into a vector space so that relationships between words can be captured. For example, "king" and "queen" are semantically related, and that relationship is reflected in their embedded vectors as well.

https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
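As a rough illustration (a minimal sketch of my own, not from the original article; it assumes an OpenAI API key is set and uses LangChain's OpenAIEmbeddings), you can embed a few words and compare them with cosine similarity:

# Sketch: comparing word embeddings with cosine similarity.
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = embeddings.embed_query("king")
queen = embeddings.embed_query("queen")
banana = embeddings.embed_query("banana")

print(cosine(king, queen))   # semantically related words -> higher similarity
print(cosine(king, banana))  # unrelated words -> lower similarity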

The vector store is the box where this vectorized data is kept.

LangChain uses Chroma by default as the vector store for indexing and searching embeddings. To use Chroma, you need to install chromadb first.

!pip install chromadb

A common use of a vector store is a QA database built on top of some text. The pieces needed to build such a QA database are:

  • Text Splitter
  • Embeddings
  • Vector stores

The process for building this QA database from the text is:

  1. Create the index
  2. Create the QA chain
  3. Ask questions!

Each of these steps breaks down further, but this chapter mainly covers step 1.

Load the required libraries and the text. I wanted to try this in Japanese, but it didn't quite work (the token limit was exceeded, among other things), so to start with I'll follow the documentation and use the State of the Union address.

Upload and load the following file.

https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt

from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader

loader = TextLoader('state_of_the_union.txt')

This creates a TextLoader object for the text file.

Creating the index takes just one line with VectorstoreIndexCreator.

from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])
DEBUG:Chroma:Logger created
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin

The index gets created like this.

# find .chroma -ls
  6424132      4 drwxr-xr-x   3 root     root         4096 Mar 22 09:50 .chroma
  6424133      4 drwxr-xr-x   2 root     root         4096 Mar 22 09:50 .chroma/index

Now let's run a query against this index.

query = "What did the president say about Ketanji Brown Jackson?"
index.query(query)

Result

DEBUG:Chroma:time to pre process our knn query: 1.9073486328125e-06
DEBUG:Chroma:time to run knn query: 0.0002830028533935547
 The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, from a family of public school educators and police officers, a consensus builder, and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

You can see that the passage around here is what was retrieved:

https://github.com/hwchase17/langchain/blob/master/docs/modules/state_of_the_union.txt#L534-L536

You can also get the result as an object that includes the source data.

query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
DEBUG:Chroma:time to pre process our knn query: 2.86102294921875e-06
DEBUG:Chroma:time to run knn query: 0.00025963783264160156
{'question': 'What did the president say about Ketanji Brown Jackson',
 'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
 'sources': 'state_of_the_union.txt'}

VectorstoreIndexCreator apparently returns a VectorStoreIndexWrapper object, which has the query and query_with_sources methods on it.

To access the vector store directly, apparently you can do the following.

index.vectorstore
<langchain.vectorstores.chroma.Chroma at 0x7fcf90186a30>
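Since this is just a Chroma vector store, you should be able to call its methods directly; a small sketch (my own, not in the original, though similarity_search does show up in the dir() output below):

# Sketch: querying the underlying vector store directly, bypassing the QA chain.
docs = index.vectorstore.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", k=2)
for d in docs:
    print(d.page_content[:100])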

Completely unrelated, but I got curious, so let's poke at the internals a bit (I'm a Python beginner).

type(index.vectorstore)
langchain.vectorstores.chroma.Chroma
type(index)
langchain.indexes.vectorstore.VectorStoreIndexWrapper
dir(index.vectorstore)
['_LANGCHAIN_DEFAULT_COLLECTION_NAME',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_client',
 '_client_settings',
 '_collection',
 '_embedding_function',
 '_persist_directory',
 'add_documents',
 'add_texts',
 'delete_collection',
 'from_documents',
 'from_texts',
 'max_marginal_relevance_search',
 'max_marginal_relevance_search_by_vector',
 'persist',
 'similarity_search',
 'similarity_search_by_vector',
 'similarity_search_with_score']
dir(index)
['Config',
 '__abstractmethods__',
 '__annotations__',
 '__class__',
 '__class_vars__',
 '__config__',
 '__custom_root_type__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__exclude_fields__',
 '__fields__',
 '__fields_set__',
 '__format__',
 '__ge__',
 '__get_validators__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__include_fields__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__json_encoder__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__post_root_validators__',
 '__pre_root_validators__',
 '__pretty__',
 '__private_attributes__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__repr_args__',
 '__repr_name__',
 '__repr_str__',
 '__rich_repr__',
 '__schema_cache__',
 '__setattr__',
 '__setstate__',
 '__signature__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__try_update_forward_refs__',
 '__validators__',
 '_abc_impl',
 '_calculate_keys',
 '_copy_and_set_values',
 '_decompose_class',
 '_enforce_dict_if_root',
 '_get_value',
 '_init_private_attributes',
 '_iter',
 'construct',
 'copy',
 'dict',
 'from_orm',
 'json',
 'parse_file',
 'parse_obj',
 'parse_raw',
 'query',
 'query_with_sources',
 'schema',
 'schema_json',
 'update_forward_refs',
 'validate',
 'vectorstore']

So VectorstoreIndexCreator makes it easy to build an index, but let's break down what it is doing internally.

For the loaded documents, VectorstoreIndexCreator does the following:

  1. Split the documents into chunks
  2. Create embeddings for each document
  3. Store the documents and embeddings in a vector store

Written out as code, it looks like this.

Load the documents.

documents = loader.load()

Split into chunks.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

Create the embeddings. This uses OpenAI's embeddings.

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Create the vector store.

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)

Create the chain for using the index for QA.

qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=db)

Search.

query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

VectorstoreIndexCreator is just a thin wrapper around these steps, and you can configure which text splitter, embeddings, and vector store it uses.

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma, 
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)

The part that failed with Japanese is probably OpenAIEmbeddings(), I suspect. So I tried chunk_size=500. For Japanese this may cut sentences in the middle, so it's not strictly correct, but anyway.

from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

db = Chroma.from_documents(texts, embeddings)

qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=db)

query = "マイナンバーカードについてはどうですか?"
qa.run(query)

Result

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.0002505779266357422
 政府は、マイナンバーカードについて、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していきます。

It works!

kun432

The docs have been updated and something called a Retriever has been added. Things move fast!

https://python.langchain.com/en/latest/modules/indexes/getting_started.html

This is a useful reference:

https://note.com/npaka/n/nf2849b26a524

I don't fully understand it yet, but my impression is that what used to be a chain tightly coupling the LLM and the vector DB now has an extra abstraction layer called Retrieval inserted in between.

So the description here has changed a bit. The substance isn't all that different from before, though, so let's just go through it again quickly.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
index = VectorstoreIndexCreator(text_splitter=CharacterTextSplitter(chunk_size=250)).from_loaders([loader])
query = "マイナンバーカードについてはどうですか?日本語で答えて。"
index.query(query)

Since chunk_size had to be adjusted for the Japanese text, the text_splitter is passed as a parameter to VectorstoreIndexCreator.

Result

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.5762786865234375e-06
DEBUG:Chroma:time to run knn query: 0.0003001689910888672
 マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していきます。

Return the answer along with its source.

index.query_with_sources(query)
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.0002799034118652344
{'question': 'マイナンバーカードについてはどうですか?日本語で答えて。',
 'answer': ' マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援するという取組を行うとしています。\n',
 'sources': 'japan_prime_minister_policy_statement_210.txt'}

Access the index (the vector store) directly.

index.vectorstore
<langchain.vectorstores.chroma.Chroma at 0x7f1528b0ec70>

To access the VectorStoreRetriever, do the following.

index.vectorstore.as_retriever()
VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f1528b0ec70>, search_type='similarity', search_kwargs={})
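The retriever exposes the generic retrieval interface, so as a quick sketch (my own example, not from the docs) you can fetch the relevant chunks without going through the QA chain:

# Sketch: using the retriever interface directly to fetch relevant chunks.
retriever = index.vectorstore.as_retriever()
docs = retriever.get_relevant_documents("マイナンバーカードについてはどうですか?")
for d in docs:
    print(d.page_content[:100])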

Breaking VectorstoreIndexCreator down gives this:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load the documents
loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
documents = loader.load()

# Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create the embeddings
embeddings = OpenAIEmbeddings()

# Embed the chunked documents into the vector DB
db = Chroma.from_documents(texts, embeddings)

# Create the retriever
retriever = db.as_retriever()

# Create the chain
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)

# Query
query = "マイナンバーカードについてはどうですか?日本語で答えて。"
qa.run(query)

Result

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.00033020973205566406
 マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していくとしています。

So instead of building the chain directly on the vector store, the flow is now: build the index, go through the retriever interface, and then build the chain on top of that.
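Since the chain only depends on the retriever interface, you can also tweak how documents are fetched without touching anything else. A small sketch (my own assumption, based on the search_type / search_kwargs fields shown above):

# Sketch: configuring the retriever instead of the chain or the vector store.
retriever = db.as_retriever(search_kwargs={"k": 2})  # return only the top 2 chunks
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
qa.run("マイナンバーカードについてはどうですか?日本語で答えて。")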

kun432

Utils

Utilities needed when working with Documents / Indexes:

  • Text Splitters
  • VectorStores
  • Embeddings
  • HyDE

Text Splitters

As the name says, this splits text into chunks. The important point is not just splitting, but splitting into semantically related units.

The process looks like this:

  1. Split the text into small, semantically meaningful chunks.
  2. Combine these small chunks into larger chunks up to a certain size.
  3. Once that size is reached, start a new chunk, keeping some overlap between chunks to preserve context across chunk boundaries.

The key questions are:

  • How is the text split?
  • How is the chunk size determined?

Several Text Splitters are provided that differ in how they answer these. The approach would be somewhat different for Japanese, so I'll work with the English text here (a small sketch of how chunk_size and chunk_overlap interact follows after the link).

https://github.com/hwchase17/langchain/blob/master/docs/modules/state_of_the_union.txt
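Before going through the individual splitters, here is a tiny sketch (my own example, not from the docs) of how chunk_size and chunk_overlap interact on a short string:

# Sketch: chunk_size / chunk_overlap behavior on a short string.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=25, chunk_overlap=10)
for chunk in splitter.split_text("LangChain splits long text into overlapping chunks for retrieval."):
    print(repr(chunk))
# Words at the end of one chunk reappear at the start of the next (the overlap),
# which preserves some context across chunk boundaries.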

Generic Recursive Text Splitting

["\n", "\n", " ", ""]で分割、つまり単語単位で分割して、文字数(len関数)で1チャンクとする。段落、つまり一般的に最も意味的に関連するテキストの塊として、最大の文字列でまとめる感じ。

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)

texts = text_splitter.create_documents([state_of_the_union])
print(len(texts[0].page_content), ": ", texts[0])
print(len(texts[1].page_content), ": ", texts[1])

Result

97 :  page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0

80 :  page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0

This seems well suited to English but not so much to Japanese.

Markdown Text Splitter

As the name suggests, this splits on markdown headings, blocks, and so on.

from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)

docs = markdown_splitter.create_documents([markdown_text])

docs

Result

[
  Document(page_content='# 🦜️🔗 LangChain\n\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content="Quick Install\n\n```bash\n# Hopefully this code block isn't split\npip install langchain", lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)
]
Latex Text Splitter

This splits on LaTeX sectioning commands and the like.

from langchain.text_splitter import LatexTextSplitter

latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""
latex_splitter = LatexTextSplitter(chunk_size=400, chunk_overlap=0)

docs = latex_splitter.create_documents([latex_text])

docs

Result

[
  Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='Introduction}\nLarge language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='History of LLMs}\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='Applications of LLMs}\nLLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n\n\\end{document}', lookup_str='', metadata={}, lookup_index=0)
]
Python Code Text Splitter

This splits on Python classes and functions.

from langchain.text_splitter import PythonCodeTextSplitter

python_text = """
class Foo:

    def bar():
    
    
def foo():

def testing_func():

def bar():
"""
python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)

docs = python_splitter.create_documents([python_text])

docs

Result

[
  Document(page_content='Foo:\n\n    def bar():', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='foo():\n\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),
  Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)
]
Character Text Splitting

This splits on a single separator (here "\n\n") and then merges the pieces up to the character count.

from langchain.text_splitter import CharacterTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 500,
    chunk_overlap  = 0,
    length_function = len,
)

texts = text_splitter.create_documents([state_of_the_union])
print(len(texts[0].page_content), ": ", texts[0])

Result

490 :  page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.' lookup_str='' metadata={} lookup_index=0

Metadata is also carried through the split.

from langchain.text_splitter import CharacterTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 500,
    chunk_overlap  = 0,
    length_function = len,
)

metadatas = [{"document": 1}, {"document": 2}]

texts = text_splitter.create_documents([state_of_the_union, state_of_the_union],metadatas=metadatas)
print(len(texts[0].page_content), ": ", texts[0])

Result

490 :  page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.' lookup_str='' metadata={'document': 1} lookup_index=0
HuggingFace Length Function

This uses a HuggingFace tokenizer. There is an upper limit on how many tokens you can pass to an LLM, and a HuggingFace tokenizer lets you measure the length accurately in tokens.

from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

print(len(texts[0]), ": ", texts[0])

Result

410 :  Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.
tiktoken (OpenAI) Length Function

You can also use tiktoken, OpenAI's open-source tokenizer.

The tiktoken package needs to be installed.

!pip install tiktoken

from langchain.text_splitter import CharacterTextSplitter
import tiktoken

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

print(len(texts[0]), ": ", texts[0])

Result

410 :  Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.
NLTK Text Splitter

This uses NLTK (Natural Language Toolkit), a Python natural language processing library.

The nltk package needs to be installed.

!pip install nltk

punkt is apparently a model; NLTK seems to use a PunktSentenceTokenizer internally, which is why it's required.

import nltk
nltk.download('punkt')

from langchain.text_splitter import NLTKTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
    
text_splitter = NLTKTextSplitter(chunk_size=500)

texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])

Result

490 :  Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Spacy Text Splitter

There is also spaCy, which is similar to NLTK.

from langchain.text_splitter import SpacyTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
    
text_splitter = SpacyTextSplitter(chunk_size=500)

texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])

Result

421 :  Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.  



Last year COVID-19 kept us apart.

This year we are finally together again. 



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans. 



With a duty to one another to the American people to the Constitution.
Token Text Splitter

Hmm, I'm not quite sure how this differs from the tiktoken one above. Internally it seems to split with tiktoken and also determine the chunk size with tiktoken.

from langchain.text_splitter import TokenTextSplitter

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Result

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an
Working with Japanese

This is beyond what the official docs cover, but for Japanese, I suspect doing this more accurately would require word segmentation (wakachi-gaki) and the like. A rough sketch of a sentence-based approach follows after the links below.

https://tech.mntsq.co.jp/entry/2021/02/26/120013

https://zenn.dev/robes/articles/b6708032855a9c
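For example, a crude way to split Japanese text on sentence boundaries is to use CharacterTextSplitter with "。" as the separator. A sketch of my own (a morphological analyzer such as MeCab or SudachiPy would be more accurate):

# Sketch: splitting Japanese text on sentence boundaries using "。" as the separator.
from langchain.text_splitter import CharacterTextSplitter

with open('japan_prime_minister_policy_statement_210.txt') as f:
    policy_statement = f.read()

text_splitter = CharacterTextSplitter(
    separator="。",
    chunk_size=250,
    chunk_overlap=0,
    length_function=len,
)
texts = text_splitter.split_text(policy_statement)
print(len(texts[0]), ": ", texts[0])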

VectorStores

A VectorStore is the box where the chunks produced by a text splitter are stored after being vectorized with embeddings. There are many vector stores, so this is just an overview.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

embeddings = OpenAIEmbeddings()

docsearch = Chroma.from_texts(texts, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

docs

The result of similarity_search():

[
    Document(page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='This is personal to me and Jill, to Kamala, and to so many of you. \n\nCancer is the #2 cause of death in America–second only to heart disease. \n\nLast month, I announced our plan to supercharge  \nthe Cancer Moonshot that President Obama asked me to lead six years ago. \n\nOur goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases.  \n\nMore support for patients and families.', lookup_str='', metadata={}, lookup_index=0)]

You can also add texts to the vector store with add_texts().

docsearch.add_texts(["Ankush went to Princeton"])
DEBUG:Chroma:Index saved to .chroma/index/index.bin
['49d6f6d6-cc01-11ed-b1d1-0242ac1c000c']

Try searching for it.

query = "Where did Ankush go to college?"
docs = docsearch.similarity_search(query)
docs
[
    Document(page_content='Ankush went to Princeton', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='The American Rescue Plan gave schools money to hire teachers and help students make up for lost learning.  \n\nI urge every parent to make sure your school does just that. And we can all play a part—sign up to be a tutor or a mentor. \n\nChildren were also struggling before the pandemic. Bullying, violence, trauma, and the harms of social media.', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='Heath’s widow Danielle is here with us tonight. They loved going to Ohio State football games. He loved building Legos with their daughter. \n\nBut cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. \n\nDanielle says Heath was a fighter to the very end. \n\nHe didn’t know how to stop fighting, and neither did she. \n\nThrough her pain she found purpose to demand we do better. \n\nTonight, Danielle—we are.', lookup_str='', metadata={}, lookup_index=0),
    Document(page_content='Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges. \n\nAnd let’s pass the PRO Act when a majority of workers want to form a union—they shouldn’t be stopped.  \n\nWhen we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.', lookup_str='', metadata={}, lookup_index=0)]

from_documents() initializes the vector store directly from documents. The point, I think, is that you can carry over metadata the documents already have.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)

embeddings = OpenAIEmbeddings()

documents = text_splitter.create_documents([state_of_the_union], metadatas=[{"source": "State of the Union"}])

docsearch = Chroma.from_documents(documents, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

docs
kun432

Document Loaders

Turning data into "documents" and combining them with a language model is more powerful than querying the LLM on its own. Document Loaders are the modules for doing that.

Most Document Loaders apparently use a Python package called "Unstructured". Unstructured is a package for turning all kinds of files (text, PowerPoint, images, HTML, PDF, and so on) into text data.

https://github.com/Unstructured-IO/unstructured

See also:

https://note.com/npaka/n/n7c9847e262a2
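For reference, unstructured can also be used on its own; a minimal sketch (my own example) of its auto-partition function, which detects the file type and returns a list of elements:

# Sketch: calling unstructured directly, outside of LangChain.
from unstructured.partition.auto import partition

elements = partition(filename="state_of_the_union.txt")
print(elements[0])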

The following Document Loaders are listed:

  • CoNLL-U
  • Airbyte JSON
  • AZLyrics
  • Blackboard
  • College Confidential
  • Copy Paste
  • CSV Loader
  • Specify a column to be used to identify the document source
  • Directory Loader
  • Email
  • EverNote
  • Facebook Chat
  • Figma
  • GCS Directory
  • GCS File Storage
  • GitBook
  • Google Drive
  • Gutenberg
  • Hacker News
  • HTML
  • iFixit
  • Images
  • IMSDb
  • Markdown
  • Notebook
  • Notion
  • Obsidian
  • PDF
  • PowerPoint
  • ReadTheDocs Documentation
  • Roam
  • s3 Directory
  • s3 File
  • Subtitle Files
  • Telegram
  • Unstructured File Loader
  • URL
  • Web Base
  • Word Documents
  • YouTube

See each DocumentLoader's page for details. Here I'll just pick out a few that caught my eye.

Copy Paste

This simply turns a string into a Document, for cases where you don't really need a DocumentLoader but still want a Document.

from langchain.docstore.document import Document

text = "..... put the text you copy pasted here......"

doc = Document(page_content=text)
doc
Document(page_content='..... put the text you copy pasted here......', lookup_str='', metadata={}, lookup_index=0)

You can also attach metadata.

from langchain.docstore.document import Document

text = "..... put the text you copy pasted here......"
metadata = {"source": "internet", "date": "Friday"}

doc = Document(page_content=text, metadata=metadata)
doc
Document(page_content='..... put the text you copy pasted here......', lookup_str='', metadata={'source': 'internet', 'date': 'Friday'}, lookup_index=0)

Unstructured File Loader

https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/unstructured_file.html

In the earlier docs, something called TextLoader was used to load text files.

from langchain.document_loaders import TextLoader

loader = TextLoader('state_of_the_union.txt')
doc = loader.load()
doc
[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n (...snip...) May God bless you all. May God protect our troops.', lookup_str='', metadata={'source': 'state_of_the_union.txt'}, lookup_index=0)] 

It was TextLoader because the file was a text file, but with Unstructured you don't have to think about that.

Install Unstructured.

!pip install unstructured

Run the following. It apparently uses NLTK internally; punkt got downloaded on the first run.

from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("state_of_the_union.txt")
doc = loader.load()
doc

The result is the same as before. You don't have to care about the type of file being loaded.

However, some files need a bit of extra preparation. For example, with a PDF:

!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf

from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper.pdf")
doc = loader.load()
doc

Running this gives:

ImportError: Following dependencies are missing: pdfminer. Please install them using `pip install unstructured[local-inference]`.


As the docs say, additional packages and libraries need to be installed. If that's not a problem, it might be fine to just install the whole lot. Installing everything listed, on Colaboratory, probably looks something like this (there may be some excess or shortfall):

!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install layoutparser[layoutmodels,tesseract]
!apt install -y libmagic-dev
!apt install -y poppler-utils libpoppler-dev 
!apt install -y tesseract-ocr libtesseract-dev tesseract-ocr-jpn tesseract-ocr-jpn-vert tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
!apt install -y libxml2-dev libxslt1-dev

Note: you will be prompted to restart the runtime.

After installing the required libraries and packages, running it again gives this:

Downloading model_final.pth: 100%
330M/330M [00:01<00:00, 276MB/s]
Downloading (…)50_FPN_3x/config.yml: 100%
5.37k/5.37k [00:00<00:00, 154kB/s]
100%|██████████| 7/7 [00:00<00:00, 13.05it/s]
100%|██████████| 6/6 [00:00<00:00, 66.60it/s]
100%|██████████| 8/8 [00:00<00:00, 120.98it/s]
100%|██████████| 6/6 [00:00<00:00, 121.00it/s]
100%|██████████| 9/9 [00:00<00:00, 114.39it/s]
100%|██████████| 5/5 [00:00<00:00, 11.70it/s]
100%|██████████| 9/9 [00:00<00:00, 112.82it/s]
100%|██████████| 8/8 [00:00<00:00, 128.23it/s]
100%|██████████| 8/8 [00:01<00:00,  7.60it/s]
100%|██████████| 7/7 [00:00<00:00, 10.86it/s]
100%|██████████| 8/8 [00:00<00:00,  8.64it/s]
100%|██████████| 5/5 [00:00<00:00, 59.70it/s]
100%|██████████| 5/5 [00:00<00:00, 13.26it/s]
100%|██████████| 6/6 [00:00<00:00, 31.68it/s]
100%|██████████| 2/2 [00:00<00:00, 94.04it/s]
100%|██████████| 1/1 [00:00<00:00, 49.08it/s]
[
  Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5\n\nAllen Institute for AI (...snip...)

Also, by default the loaded document is a single element, i.e. the whole file is loaded as one chunk.

len(doc)     # => 1 

If you pass mode="elements" when loading, the document is split into elements.

from langchain.document_loaders import UnstructuredFileLoader
import pprint

loader = UnstructuredFileLoader("state_of_the_union.txt", mode="elements")
doc = loader.load()
print(len(doc))
pprint.pprint(doc)
365
[
  Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
  Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
  Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)
from langchain.document_loaders import UnstructuredFileLoader
import pprint

loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements")
doc = loader.load()
print(len(doc))
pprint.pprint(doc)
315
[
  Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0),
  Document(page_content='Zejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)

This behavior can be configured with strategy: hi_res (the default) and fast. hi_res is accurate but slow; fast is the opposite. Not every document type supports both strategies; in those cases the strategy setting is ignored, and in some cases it falls back to fast.

With hi_res:

%%time

from langchain.document_loaders import UnstructuredFileLoader
import pprint

loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements", strategy="hi_res")
doc = loader.load()
doc[:5]
CPU times: user 2min 6s, sys: 13.7 s, total: 2min 19s
Wall time: 2min 25s

[
  Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0),
  Document(page_content='Zejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)

With fast:

%%time

from langchain.document_loaders import UnstructuredFileLoader
import pprint

loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements", strategy="fast")
doc = loader.load()
doc[:5]
CPU times: user 3.32 s, sys: 28 ms, total: 3.35 s
Wall time: 3.41 s
[
  Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
  Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
  Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
(...snip...)
kun432

A lot has changed, and lately I've been writing things directly without LangChain, so I'm closing this for now.

This scrap was closed on 2023/10/13.