Trying LangChain's Getting Started on Google Colaboratory ⑧ Indexes
Indexes
First of all, what is an embedding? I asked ChatGPT.
In natural language processing, an "embedding" is the process of converting each word in text data into a fixed-length numeric vector. This makes the text easier to handle with machine learning algorithms.
Concretely, embeddings map words into a vector space, which makes it possible to capture relationships between words. For example, "king" and "queen" are semantically related, and that relationship is reflected in their embedded vectors.
https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
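As a quick illustration (a minimal sketch of my own, not from the original article; it assumes the langchain and openai packages are installed and OPENAI_API_KEY is set), you can embed a few words and compare their vectors with cosine similarity:

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
v_king = np.array(embeddings.embed_query("king"))
v_queen = np.array(embeddings.embed_query("queen"))
v_banana = np.array(embeddings.embed_query("banana"))

def cosine(a, b):
    # cosine similarity: closer to 1.0 means more similar direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v_king, v_queen))   # expected to be relatively high
print(cosine(v_king, v_banana))  # expected to be lower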
The vector store is the box where this vectorized data is kept.
LangChain uses Chroma as the default vector store for indexing and searching embeddings. To use Chroma, the chromadb package has to be installed.
!pip install chromadb
A common example of using a vector store is a QA database built over some text. The building blocks needed to create such a QA database are:
- Text Splitters
- Embeddings
- Vectorstores
The process of building this QA database from text is:
- Create the index
- Create the QA chain
- Ask questions!
Each step breaks down into smaller steps, but this chapter mainly covers step 1, creating the index.
Load the required libraries and the text. I wanted to do this in Japanese, but it didn't quite work at first (the token count got exceeded, and so on), so to start with I'll follow the documentation and use the US President's State of the Union address.
Upload the following file and load it.
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
loader = TextLoader('state_of_the_union.txt')
This creates a TextLoader object pointing at the text file.
The index can then be created in a single line using VectorstoreIndexCreator.
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])
DEBUG:Chroma:Logger created
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
The index gets created like this:
# find .chroma -ls
6424132 4 drwxr-xr-x 3 root root 4096 Mar 22 09:50 .chroma
6424133 4 drwxr-xr-x 2 root root 4096 Mar 22 09:50 .chroma/index
Now let's run a query against this index.
query = "What did the president say about Ketanji Brown Jackson?"
index.query(query)
Result:
DEBUG:Chroma:time to pre process our knn query: 1.9073486328125e-06
DEBUG:Chroma:time to run knn query: 0.0002830028533935547
The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, from a family of public school educators and police officers, a consensus builder, and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.
You can tell which part of the source text was retrieved to produce this.
You can also get the answer as an object that includes the source data.
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
DEBUG:Chroma:time to pre process our knn query: 2.86102294921875e-06
DEBUG:Chroma:time to run knn query: 0.00025963783264160156
{'question': 'What did the president say about Ketanji Brown Jackson',
'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
'sources': 'state_of_the_union.txt'}
VectorstoreIndexCreator returns a VectorStoreIndexWrapper object, which exposes the query and query_with_sources methods.
To access the vector store directly, it seems you can do the following:
index.vectorstore
<langchain.vectorstores.chroma.Chroma at 0x7fcf90186a30>
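Since this is just a Chroma instance, its search methods can be called directly (a minimal sketch of my own, reusing the earlier question):

docs = index.vectorstore.similarity_search(
    "What did the president say about Ketanji Brown Jackson?",
    k=2,  # number of chunks to return
)
for d in docs:
    print(d.page_content[:80])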
This is completely beside the point, but I was curious, so let's poke around at the internals a bit (I'm a Python beginner).
type(index.vectorstore)
langchain.vectorstores.chroma.Chroma
type(index)
langchain.indexes.vectorstore.VectorStoreIndexWrapper
dir(index.vectorstore)
['_LANGCHAIN_DEFAULT_COLLECTION_NAME',
'__abstractmethods__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_impl',
'_client',
'_client_settings',
'_collection',
'_embedding_function',
'_persist_directory',
'add_documents',
'add_texts',
'delete_collection',
'from_documents',
'from_texts',
'max_marginal_relevance_search',
'max_marginal_relevance_search_by_vector',
'persist',
'similarity_search',
'similarity_search_by_vector',
'similarity_search_with_score']
dir(index)
['Config',
'__abstractmethods__',
'__annotations__',
'__class__',
'__class_vars__',
'__config__',
'__custom_root_type__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__exclude_fields__',
'__fields__',
'__fields_set__',
'__format__',
'__ge__',
'__get_validators__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__include_fields__',
'__init__',
'__init_subclass__',
'__iter__',
'__json_encoder__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__post_root_validators__',
'__pre_root_validators__',
'__pretty__',
'__private_attributes__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__repr_args__',
'__repr_name__',
'__repr_str__',
'__rich_repr__',
'__schema_cache__',
'__setattr__',
'__setstate__',
'__signature__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__try_update_forward_refs__',
'__validators__',
'_abc_impl',
'_calculate_keys',
'_copy_and_set_values',
'_decompose_class',
'_enforce_dict_if_root',
'_get_value',
'_init_private_attributes',
'_iter',
'construct',
'copy',
'dict',
'from_orm',
'json',
'parse_file',
'parse_obj',
'parse_raw',
'query',
'query_with_sources',
'schema',
'schema_json',
'update_forward_refs',
'validate',
'vectorstore']
So VectorstoreIndexCreator makes it easy to build an index, but what is it doing internally? Let's break it down.
VectorstoreIndexCreator does the following with the loaded documents:
- Split the documents into chunks
- Create an embedding for each chunk
- Store the chunks and their embeddings in a vector store
Written out in code, it looks like this.
Load the document:
documents = loader.load()
Split into chunks:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
Create the embeddings, using OpenAI's Embeddings:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
Create the vector store:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
Create the index, i.e. the chain that will be used for QA:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=db)
Search:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
VectorstoreIndexCreator is a simple wrapper around all of this, and lets you configure which text splitter, embeddings, and vector store to use.
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)
The part that failed with Japanese was probably OpenAIEmbeddings(), I suspect. So I tried chunk_size=500. For Japanese this may cut text mid-sentence, so it isn't strictly correct, but let's see.
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=db)
query = "マイナンバーカードについてはどうですか?"
qa.run(query)
Result:
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.0002505779266357422
政府は、マイナンバーカードについて、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していきます。
It worked!
The following were helpful references.
Update: the docs have since been revised and something called a Retriever has been added. Things move fast!
These are useful references on that.
I don't fully understand it yet, but whereas previously the LLM and the vector DB were tightly coupled in a VectorDB chain, an abstraction layer called Retrieval has now been inserted between them.
So the documentation here has changed a bit, but since the original content hasn't changed drastically, here's a quick re-run.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
index = VectorstoreIndexCreator(text_splitter=CharacterTextSplitter(chunk_size=250)).from_loaders([loader])
query = "マイナンバーカードについてはどうですか?日本語で答えて。"
index.query(query)
Because chunk_size needed adjusting for the Japanese text, a text_splitter is passed to VectorstoreIndexCreator as a parameter.
Result:
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.5762786865234375e-06
DEBUG:Chroma:time to run knn query: 0.0003001689910888672
マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していきます。
Return the answer along with its source:
index.query_with_sources(query)
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.0002799034118652344
{'question': 'マイナンバーカードについてはどうですか?日本語で答えて。',
'answer': ' マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援するという取組を行うとしています。\n',
'sources': 'japan_prime_minister_policy_statement_210.txt'}
Access the index directly:
index.vectorstore
<langchain.vectorstores.chroma.Chroma at 0x7f1528b0ec70>
To get at the VectorStoreRetriever:
index.vectorstore.as_retriever()
VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f1528b0ec70>, search_type='similarity', search_kwargs={})
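The retriever can then be used on its own to fetch relevant chunks (a minimal sketch of my own; get_relevant_documents is the retriever interface at this point in time, and the k value is just an example):

retriever = index.vectorstore.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("マイナンバーカードについてはどうですか?")
for d in docs:
    print(d.page_content[:80])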
Breaking VectorstoreIndexCreator down gives the following.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
# Load the document
loader = TextLoader('japan_prime_minister_policy_statement_210.txt')
documents = loader.load()
# Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Create the embeddings for the vector DB
embeddings = OpenAIEmbeddings()
# Embed the chunked documents into the vector DB
db = Chroma.from_documents(texts, embeddings)
# Create the index (retriever)
retriever = db.as_retriever()
# Create the chain
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
# Query
query = "マイナンバーカードについてはどうですか?日本語で答えて。"
qa.run(query)
Result:
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
DEBUG:Chroma:Index not found
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:Index saved to .chroma/index/index.bin
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.00033020973205566406
マイナンバーカードについては、健康保険証との一体化など、利便性の向上を飛躍的に進め、概ね全ての国民への普及のための取組を加速するとともに、地域でのデジタル技術の社会実装を重点的に支援していくとしています。
Rather than building the chain directly on the vector DB, you create an index exposed through the retriever interface and then build the chain on top of that.
Indexes' How-To
Let's go through these in order:
- Utils
- Vectorstores
- Retrievers
- Chains
Utils
Utils needed for working with Documents/Indexes:
- Text Splitters
- VectorStores
- Embeddings
- HyDE
Text Splitters
As the name suggests, these split text into chunks. The important point is not simply splitting, but splitting into semantically related units.
The process looks like this:
- Split the text into small, semantically meaningful chunks.
- Combine these small chunks into larger chunks up to a certain size.
- Once that size is reached, start a new chunk, keeping some overlap between chunks to preserve context (see the small sketch below).
The key questions are:
- How is the text split?
- How is the chunk size determined?
Several Text Splitters are provided, differing in how they handle these points. The approach would be somewhat different for Japanese, so I'll work through them with English text.
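To make the chunk_size / chunk_overlap behavior concrete, here is a tiny toy example of my own (not from the docs) using CharacterTextSplitter:

from langchain.text_splitter import CharacterTextSplitter

toy_text = "aaa bbb ccc ddd eee fff ggg hhh"
splitter = CharacterTextSplitter(
    separator=" ",    # split on spaces
    chunk_size=11,    # maximum chunk length, measured in characters
    chunk_overlap=4,  # trailing text carried over to the start of the next chunk
)
for chunk in splitter.split_text(toy_text):
    print(repr(chunk))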
Generic Recursive Text Splitting
["\n", "\n", " ", ""]
で分割、つまり単語単位で分割して、文字数(len関数)で1チャンクとする。段落、つまり一般的に最も意味的に関連するテキストの塊として、最大の文字列でまとめる感じ。
from langchain.text_splitter import RecursiveCharacterTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = len,
)
texts = text_splitter.create_documents([state_of_the_union])
print(len(texts[0].page_content), ": ", texts[0])
print(len(texts[1].page_content), ": ", texts[1])
Result:
97 : page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
80 : page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0
This works well for English but doesn't seem well suited to Japanese.
Markdown Text Splitter
As the name suggests, this splits on markdown headings, blocks, and so on.
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = """
# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## Quick Install
```bash
# Hopefully this code block isn't split
pip install langchain
```
As an open source project in a rapidly developing field, we are extremely open to contributions.
"""
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
docs
Result:
[
Document(page_content='# 🦜️🔗 LangChain\n\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),
Document(page_content="Quick Install\n\n```bash\n# Hopefully this code block isn't split\npip install langchain", lookup_str='', metadata={}, lookup_index=0),
Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)
]
Latex Text Splitter
Splits on LaTeX sections and the like.
from langchain.text_splitter import LatexTextSplitter
latex_text = """
\documentclass{article}
\begin{document}
\maketitle
\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.
\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.
\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.
\end{document}
"""
latex_splitter = LatexTextSplitter(chunk_size=400, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])
docs
Result:
[
Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='Introduction}\nLarge language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='History of LLMs}\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='Applications of LLMs}\nLLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n\n\\end{document}', lookup_str='', metadata={}, lookup_index=0)
]
Python Code Text Splitter
Splits on Python classes and functions.
from langchain.text_splitter import PythonCodeTextSplitter
python_text = """
class Foo:

    def bar():

def foo():

def testing_func():

def bar():
"""
python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)
docs = python_splitter.create_documents([python_text])
docs
Result:
[
Document(page_content='Foo:\n\n def bar():', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='foo():\n\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)
]
Character Text Splitting
Splits on a single separator (here "\n\n", i.e. paragraph breaks) and merges the pieces by character count.
from langchain.text_splitter import CharacterTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)
texts = text_splitter.create_documents([state_of_the_union])
print(len(texts[0].page_content), ": ", texts[0])
Result:
490 : page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.' lookup_str='' metadata={} lookup_index=0
Metadata is carried through to the split documents as well.
from langchain.text_splitter import CharacterTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)
metadatas = [{"document": 1}, {"document": 2}]
texts = text_splitter.create_documents([state_of_the_union, state_of_the_union],metadatas=metadatas)
print(len(texts[0].page_content), ": ", texts[0])
Result:
490 : page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.' lookup_str='' metadata={'document': 1} lookup_index=0
HuggingFace Length Function
Uses a HuggingFace tokenizer to measure length. There is a limit on the number of tokens you can pass to an LLM, and a HuggingFace tokenizer lets you measure that length accurately in tokens.
from transformers import GPT2TokenizerFast
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])
Result:
410 : Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
tiktoken (OpenAI) Length Function
You can also use tiktoken, OpenAI's open-source tokenizer.
The tiktoken package needs to be installed:
!pip install tiktoken
from langchain.text_splitter import CharacterTextSplitter
import tiktoken

with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])
Result:
410 : Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
NLTK Text Splitter
Uses NLTK (Natural Language Toolkit), a Python natural language processing library.
The nltk package needs to be installed:
!pip install nltk
punkt is apparently a model; NLTK seems to use PunktSentenceTokenizer internally, which is why it's needed.
import nltk
nltk.download('punkt')
from langchain.text_splitter import NLTKTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = NLTKTextSplitter(chunk_size=500)
texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])
Result:
490 : Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Spacy Text Splitter
spaCy is a similar tool to NLTK.
from langchain.text_splitter import SpacyTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = SpacyTextSplitter(chunk_size=500)
texts = text_splitter.split_text(state_of_the_union)
print(len(texts[0]), ": ", texts[0])
Result:
421 : Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
Token Text Splitter
Hmm, I can't quite tell how this differs from the tiktoken example above; internally it also seems to split with tiktoken and measure chunk size with tiktoken. (If I understand correctly, from_tiktoken_encoder still splits on a text separator and only uses tiktoken to measure length, whereas TokenTextSplitter slices the token sequence itself, so chunk boundaries fall on token boundaries.)
from langchain.text_splitter import TokenTextSplitter
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Result:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an
Working with Japanese
This goes beyond the official docs, but for Japanese I suspect doing this more precisely would require word segmentation (wakachi-gaki) or something similar. A rough sketch follows.
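As one rough approach (my own sketch, not from the docs): split on Japanese sentence boundaries by giving CharacterTextSplitter "。" as the separator.

from langchain.text_splitter import CharacterTextSplitter

japanese_text = "マイナンバーカードの利便性を高めます。健康保険証と一体化します。地域のデジタル化も支援します。"
splitter = CharacterTextSplitter(
    separator="。",   # split on the Japanese full stop
    chunk_size=50,    # measured in characters
    chunk_overlap=0,
)
for chunk in splitter.split_text(japanese_text):
    print(chunk)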
VectorStores
A VectorStore is the box where the chunks produced by a text splitter are vectorized with embeddings and stored.
There are many vector stores, so this is just an overview.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_texts(texts, embeddings)
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
docs
The results of similarity_search():
[
Document(page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='This is personal to me and Jill, to Kamala, and to so many of you. \n\nCancer is the #2 cause of death in America–second only to heart disease. \n\nLast month, I announced our plan to supercharge \nthe Cancer Moonshot that President Obama asked me to lead six years ago. \n\nOur goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases. \n\nMore support for patients and families.', lookup_str='', metadata={}, lookup_index=0)]
You can also add texts to the vector store with add_texts().
docsearch.add_texts(["Ankush went to Princeton"])
DEBUG:Chroma:Index saved to .chroma/index/index.bin
['49d6f6d6-cc01-11ed-b1d1-0242ac1c000c']
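Incidentally (my own note, not something covered here), add_texts also appears to accept per-text metadata:

docsearch.add_texts(
    ["This is an extra note"],
    metadatas=[{"source": "manual note"}],  # one metadata dict per text
)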
Let's search for it.
query = "Where did Ankush go to college?"
docs = docsearch.similarity_search(query)
docs
[
Document(page_content='Ankush went to Princeton', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='The American Rescue Plan gave schools money to hire teachers and help students make up for lost learning. \n\nI urge every parent to make sure your school does just that. And we can all play a part—sign up to be a tutor or a mentor. \n\nChildren were also struggling before the pandemic. Bullying, violence, trauma, and the harms of social media.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='Heath’s widow Danielle is here with us tonight. They loved going to Ohio State football games. He loved building Legos with their daughter. \n\nBut cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. \n\nDanielle says Heath was a fighter to the very end. \n\nHe didn’t know how to stop fighting, and neither did she. \n\nThrough her pain she found purpose to demand we do better. \n\nTonight, Danielle—we are.', lookup_str='', metadata={}, lookup_index=0),
Document(page_content='Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges. \n\nAnd let’s pass the PRO Act when a majority of workers want to form a union—they shouldn’t be stopped. \n\nWhen we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.', lookup_str='', metadata={}, lookup_index=0)]
With from_documents() you can initialize a vector store directly from documents. I think the point is that if the documents already carry metadata, it gets attached.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
with open('state_of_the_union.txt') as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
embeddings = OpenAIEmbeddings()
documents = text_splitter.create_documents([state_of_the_union], metadatas=[{"source": "State of the Union"}])
docsearch = Chroma.from_documents(documents, embeddings)
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
docs
The documentation has been completely rewritten since then; even the chapter structure has changed.
The Indexes section now consists of:
- Document Loaders
- Text Splitters
- Vectorstores
- Retrievers
Utils and the like always felt conceptually a bit fuzzy anyway, so maybe this is cleaner.
Text Splitters has already been covered, so let's do Document Loaders.
Document Loaders
Turning your data into "documents" and combining them with a language model is more powerful than querying a bare LLM. Document Loaders are the modules for doing that.
Most Document Loaders apparently use a Python package called "Unstructured", which converts all kinds of files (text, PowerPoint, images, HTML, PDF, and so on) into text data.
For reference, the following Document Loaders are covered:
- CoNLL-U
- Airbyte JSON
- AZLyrics
- Blackboard
- College Confidential
- Copy Paste
- CSV Loader
- Specify a column to be used to identify the document source
- Directory Loader
- EverNote
- Facebook Chat
- Figma
- GCS Directory
- GCS File Storage
- GitBook
- Google Drive
- Gutenberg
- Hacker News
- HTML
- iFixit
- Images
- IMSDb
- Markdown
- Notebook
- Notion
- Obsidian
- PowerPoint
- ReadTheDocs Documentation
- Roam
- s3 Directory
- s3 File
- Subtitle Files
- Telegram
- Unstructured File Loader
- URL
- Web Base
- Word Documents
- YouTube
See each Document Loader's page for details. Here I'll just pick out a few that caught my attention.
Copy Paste
Simply turns a string into a Document, for cases where you don't really need a DocumentLoader but still want a Document.
from langchain.docstore.document import Document
text = "..... put the text you copy pasted here......"
doc = Document(page_content=text)
doc
Document(page_content='..... put the text you copy pasted here......', lookup_str='', metadata={}, lookup_index=0)
You can attach metadata as well.
from langchain.docstore.document import Document
text = "..... put the text you copy pasted here......"
metadata = {"source": "internet", "date": "Friday"}
doc = Document(page_content=text, metadata=metadata)
doc
Document(page_content='..... put the text you copy pasted here......', lookup_str='', metadata={'source': 'internet', 'date': 'Friday'}, lookup_index=0)
Unstructured File Loader
In earlier versions of the docs, something called TextLoader was used to read text files.
from langchain.document_loaders import TextLoader
loader = TextLoader('state_of_the_union.txt')
doc = loader.load()
doc
[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n (...snip...) May God bless you all. May God protect our troops.', lookup_str='', metadata={'source': 'state_of_the_union.txt'}, lookup_index=0)]
TextLoader made sense because it was a text file, but with Unstructured you don't have to think about the file type at all.
Install Unstructured:
!pip install unstructured
Run the following. It apparently uses NLTK internally; punkt was downloaded on the first run.
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("state_of_the_union.txt")
doc = loader.load()
doc
The result is the same as before. You don't need to think about what kind of file you're loading.
However, some file types need a bit of preparation. For example, PDF:
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper.pdf")
doc = loader.load()
doc
Running this gives:
ImportError: Following dependencies are missing: pdfminer. Please install them using `pip install unstructured[local-inference]`.
As the docs explain, additional packages and libraries are needed. If that's not a problem, you might as well install the whole lot. Installing everything listed, on Colaboratory, probably looks like this (there may be some excess or omissions):
!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install layoutparser[layoutmodels,tesseract]
!apt install -y libmagic-dev
!apt install -y poppler-utils libpoppler-dev
!apt install -y tesseract-ocr libtesseract-dev tesseract-ocr-jpn tesseract-ocr-jpn-vert tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
!apt install -y libxml2-dev libxslt1-dev
Note: you'll be prompted to restart the runtime.
After installing the required libraries and packages, running it again gives:
Downloading model_final.pth: 100% 330M/330M [00:01<00:00, 276MB/s]
Downloading (…)50_FPN_3x/config.yml: 100% 5.37k/5.37k [00:00<00:00, 154kB/s]
(...snip: per-page progress bars...)
[
Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5\n\nAllen Institute for AI (...snip...)
Also, by default the loaded document has a single element, i.e. the whole file comes back as one Document.
len(doc) # => 1
If you pass mode="elements" when loading, the document is split into its elements.
from langchain.document_loaders import UnstructuredFileLoader
import pprint
loader = UnstructuredFileLoader("state_of_the_union.txt", mode="elements")
doc = loader.load()
print(len(doc))
pprint.pprint(doc)
365
[
Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': 'state_of_the_union.txt', 'filename': 'state_of_the_union.txt', 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)
from langchain.document_loaders import UnstructuredFileLoader
import pprint
loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements")
doc = loader.load()
print(len(doc))
pprint.pprint(doc)
315
[
Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0),
Document(page_content='Zejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)
How documents are partitioned can be controlled with strategy: hi_res (the default) or fast. hi_res is accurate but slow; fast is the opposite. Not every document type supports these strategies; where a strategy isn't supported the setting is ignored, and in some cases it falls back to fast.
With hi_res:
%%time
from langchain.document_loaders import UnstructuredFileLoader
import pprint
loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements", strategy="hi_res")
doc = loader.load()
doc[:5]
CPU times: user 2min 6s, sys: 13.7 s, total: 2min 19s
Wall time: 2min 25s
[
Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0),
Document(page_content='Zejiang Shen 1 ( (=) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'NarrativeText'}, lookup_index=0),
(...snip...)
With fast:
%%time
from langchain.document_loaders import UnstructuredFileLoader
import pprint
loader = UnstructuredFileLoader("layout-parser-paper.pdf", mode="elements", strategy="fast")
doc = loader.load()
doc[:5]
CPU times: user 3.32 s, sys: 28 ms, total: 3.35 s
Wall time: 3.41 s
[
Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper.pdf', 'filename': 'layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
(...snip...)
Vectorstores
Retrievers
A lot has changed since then, and lately I've been writing things without LangChain anyway, so I'll wrap this up here for now.