Trying Out the New LlamaIndex Getting Started
Note: some time had passed, so I later redid this from scratch.
I'm revisiting LlamaIndex for various reasons, and since the data structures changed substantially in LlamaIndex v0.6.0, I'm going through the Getting Started guide again.
I'll work in JupyterLab.
$ pip install jupyterlab ipywidgets
$ jupyter-lab --ip='0.0.0.0'
Installation
!pip install llama-index
Tutorial
The official repository includes sample data, so clone it. The dataset is the familiar Paul Graham essay.
!git clone https://github.com/jerryjliu/llama_index.git
%cd llama_index/examples/paul_graham_essay
Here's what's inside:
!ls
DavinciComparison.ipynb TestEssay.ipynb index_tree_insert.json
GPT4Comparison.ipynb data index_with_query.json
InsertDemo.ipynb index.json splitting_1.txt
KeywordTableComparison.ipynb index_list.json splitting_2.txt
SentenceSplittingDemo.ipynb index_table.json
The text file we'll use is under the data directory.
!ls data
paul_graham_essay.txt
First, set your OpenAI API key as an environment variable.
import os
os.environ["OPENAI_API_KEY"] = "XXXXXXXXXX"
Load the files under the data directory and build a vector index from them.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)
Now, query the index.
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
Result:
The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, attending RISD, living in a rent-stabilized apartment, building an online store builder, editing Lisp expressions, publishing essays online, writing essays, painting still life, working on spam filters, cooking for groups, and buying a building in Cambridge.
Enable logging
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Running the query code again now prints a large amount of log output. (Each line appears twice because the code above attaches two handlers: basicConfig already installs one StreamHandler, and addHandler adds another.) The logs show that the following happens:
- The user's input is sent to the embeddings API and converted into a vector
- The most similar nodes are retrieved from the vector-indexed Paul Graham data
- Those nodes are included in the prompt as context and passed to the completions API to generate the answer
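The retrieval step in that list boils down to ranking the stored embedding vectors by cosine similarity against the query vector, exactly the "Similarity score" values in the log below. A minimal self-contained sketch of that ranking (toy 3-dimensional vectors; real text-embedding-ada-002 embeddings have 1536 dimensions):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, node_vecs, k=2):
    # Rank stored node vectors by similarity to the query vector, best first.
    scored = sorted(node_vecs.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return scored[:k]

# Toy "embeddings" standing in for the indexed text chunks.
nodes = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.9, 0.1, 0.0],
    "node-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], nodes))  # node-a and node-b are the top-2 matches
```

The top-k nodes found this way are what get pasted into the prompt as context.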
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"input": ["What did the author do growing up?"], "model": "text-embedding-ada-002", "encoding_format": "base64"}' message='Post details'
api_version=None data='{"input": ["What did the author do growing up?"], "model": "text-embedding-ada-002", "encoding_format": "base64"}' message='Post details'
(snip)
DEBUG:llama_index.indices.utils:> Top 2 nodes:
> [Node ef2bceb9-2f7d-4881-919e-a67cb843f194] [Similarity score: 0.81405] I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essay...
> [Node 5eb083c4-4845-4f7e-b838-6a26e4921113] [Similarity score: 0.811131] page views. What on earth had happened? The referring urls showed that someone had posted it on S...
> Top 2 nodes:
> [Node ef2bceb9-2f7d-4881-919e-a67cb843f194] [Similarity score: 0.81405] I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essay...
> [Node 5eb083c4-4845-4f7e-b838-6a26e4921113] [Similarity score: 0.811131] page views. What on earth had happened? The referring urls showed that someone had posted it on S...
(snip)
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:openai:api_version=None data='{"prompt": ["Context information is below. \\n---------------------\\nI could write essays again, I wrote a bunch about topics I\'d had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy
(snip)
Jessica was in charge of marketing at a Boston investment
---------------------
Given the context information and not prior knowledge, answer the question: What did the author do growing up?
DEBUG:llama_index.indices.response.base_builder:> Initial response:
The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, attending RISD, living in a rent-stabilized apartment, building an online store builder, editing Lisp expressions, publishing essays online, writing essays, painting still life, working on spam filters, cooking for groups, and buying a building in Cambridge.
> Initial response:
The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, attending RISD, living in a rent-stabilized apartment, building an online store builder, editing Lisp expressions, publishing essays online, writing essays, painting still life, working on spam filters, cooking for groups, and buying a building in Cambridge.
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1880 tokens
> [get_response] Total LLM token usage: 1880 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, attending RISD, living in a rent-stabilized apartment, building an online store builder, editing Lisp expressions, publishing essays online, writing essays, painting still life, working on spam filters, cooking for groups, and buying a building in Cambridge.
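The token counter in the log is handy for cost tracking. As a rough sanity check, assuming the default LLM in llama_index 0.6 was text-davinci-003 billed at about $0.02 per 1K tokens at the time (both the model and the rate are assumptions on my part, not something shown in the logs):

```python
# Rough per-query cost estimate from the logged token count.
llm_tokens = 1880              # "Total LLM token usage" from the log above
price_per_1k = 0.02            # assumed text-davinci-003 rate, USD per 1K tokens
cost = llm_tokens / 1000 * price_per_1k
print(f"~${cost:.4f} per query")  # ~$0.0376
```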
By default the index is held in memory, so persist it to disk.
index.storage_context.persist()
A storage directory is created with files inside it.
!ls -l
!ls -l storage
The vector-indexed data is stored as JSON.
total 952
-rw-rw-r-- 1 kun432 kun432   7773 Jun  7 18:14 DavinciComparison.ipynb
-rw-rw-r-- 1 kun432 kun432  24692 Jun  7 18:14 GPT4Comparison.ipynb
-rw-rw-r-- 1 kun432 kun432   8402 Jun  7 18:14 InsertDemo.ipynb
-rw-rw-r-- 1 kun432 kun432  19039 Jun  7 18:14 KeywordTableComparison.ipynb
-rw-rw-r-- 1 kun432 kun432   6987 Jun  7 18:14 SentenceSplittingDemo.ipynb
-rw-rw-r-- 1 kun432 kun432  24866 Jun  7 18:14 TestEssay.ipynb
drwxrwxr-x 2 kun432 kun432   4096 Jun  7 18:59 data
-rw-rw-r-- 1 kun432 kun432 172219 Jun  7 18:14 index.json
-rw-rw-r-- 1 kun432 kun432 156103 Jun  7 18:14 index_list.json
-rw-rw-r-- 1 kun432 kun432 159574 Jun  7 18:14 index_table.json
-rw-rw-r-- 1 kun432 kun432  36104 Jun  7 18:14 index_tree_insert.json
-rw-rw-r-- 1 kun432 kun432 166103 Jun  7 18:14 index_with_query.json
-rw-rw-r-- 1 kun432 kun432  78252 Jun  7 18:14 splitting_1.txt
-rw-rw-r-- 1 kun432 kun432  75176 Jun  7 18:14 splitting_2.txt
drwxrwxr-x 2 kun432 kun432   4096 Jun  7 18:58 storage
total 772
-rw-rw-r-- 1 kun432 kun432  90833 Jun  7 18:58 docstore.json
-rw-rw-r-- 1 kun432 kun432   1927 Jun  7 18:58 index_store.json
-rw-rw-r-- 1 kun432 kun432 691051 Jun  7 18:58 vector_store.json
From the next session on, you can just load the index from the storage directory.
from llama_index import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
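Putting the two code paths together, a common pattern is to build and persist the index on the first run and load it from disk afterwards. A sketch based on the calls used above (the llama_index imports are kept inside the function so the snippet can be defined without the library installed):

```python
import os

PERSIST_DIR = "./storage"

def get_index():
    """Load the persisted index if it exists; otherwise build it and persist it."""
    if os.path.exists(PERSIST_DIR):
        from llama_index import StorageContext, load_index_from_storage
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)
    from llama_index import VectorStoreIndex, SimpleDirectoryReader
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index
```

With this in place, `query_engine = get_index().as_query_engine()` works the same way whether or not the index has been built before.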
This scrap was closed on 2023/10/19.