GPT index

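According to the description on GitHub, LLMs are models pre-trained on large amounts of data, and GPT index is a tool for augmenting them with your own data
- The two tasks needed to extend the model's knowledge are "data ingestion" and "data indexing"
- GPT index is what takes care of both
GPT index provides connectors for various data sources and formats (it can read things like pdf and html)
On top of that, the headline feature is that it only takes a few lines of Python
The (apparently official) documentation is here
- https://gpt-index.readthedocs.io/en/latest/index.html
After trying it, my impression is that it lets you chat with GPT after giving it the context of your own documents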
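
To make the "ingest, then index" idea a bit more concrete, here is a minimal sketch of what a vector index does conceptually: each chunk of text is turned into a vector, and a question is answered from the chunks whose vectors are closest to it. This is not GPT index's implementation. The `embed` function below is a toy word-count stand-in for a real embedding model, and `ToyVectorIndex` is a hypothetical name made up for this illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical miniature index: one vector per chunk, ranked by similarity."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def top_k(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "The Europa League group stage features several Japanese players.",
    "Flask is a small web framework for Python.",
    "Vector indexes store one embedding per chunk of text.",
]
toy_index = ToyVectorIndex(chunks)
best = toy_index.top_k("Which Japanese players are in the Europa League?", k=1)
print(best[0])
```

In the real library the vectors come from an embedding API and the retrieved chunks are handed to the LLM, but the retrieval step has roughly this shape.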

Getting it running

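You can pass in URLs, and the HTML returned for each one becomes the source data (the text is indexed, the model itself is not retrained)
I'll try it with my own note articles as the data
- They are articles about soccer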
- https://note.com/hiracky16/n/ned941eb19e78
- https://note.com/hiracky16/n/na42f2f5e8809
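
As a rough picture of what "HTML in, text out" means before indexing, the sketch below strips tags with the standard library only. It is an illustration under my own assumptions, not how `SimpleWebPageReader` works internally (I believe the real reader relies on a dedicated HTML-to-text package). `TextExtractor`, `html_to_text`, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>EL preview</h1><p>Group stage <b>highlights</b></p>"
    "<script>track()</script></body></html>"
)
text = html_to_text(sample)
print(text)
```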

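import logging
import os

from gpt_index import GPTSimpleVectorIndex, LLMPredictor, SimpleWebPageReader
from langchain.chat_models import ChatOpenAI

# Show INFO-level logs (token usage and the query result logged at the end)
logging.basicConfig(level=logging.INFO)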

urls = [
    'https://note.com/hiracky16/n/ned941eb19e78',
    'https://note.com/hiracky16/n/na42f2f5e8809'
]

def get_index():
    """Build a gpt index object from the web pages listed in `urls`
    Returns:
        GPTSimpleVectorIndex: index
    """
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
    documents = SimpleWebPageReader(html_to_text=True).load_data(urls)
    index = GPTSimpleVectorIndex(
        documents=documents,
        llm_predictor=llm_predictor
    )
    return index

def save_index(index: GPTSimpleVectorIndex):
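    """Save the index to disk as json (uploading it to GCS is not done here)
    Args:
        index (GPTSimpleVectorIndex): gpt index object
    Returns:
        str: path of the saved json file
    """
    save_index_path = "index/index.json"
    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(save_index_path), exist_ok=True)
    index.save_to_disk(save_index_path)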
    return save_index_path


index = get_index()
save_index_path = save_index(index)

# Smoke test: reload the saved index and ask "What are the highlights of the EL?"
index = GPTSimpleVectorIndex.load_from_disk(save_index_path)
result = index.query("EL の見どころは？")
logging.info(result)

> export OPENAI_API_KEY=xxxxx
> python create_index.py
WARNING:gpt_index.llm_predictor.base:Unknown max input size for gpt-3.5-turbo, using defaults.
Token indices sequence length is longer than the specified maximum sequence length for this model (3646 > 1024). Running this sequence through the model will result in indexing errors
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 9578 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 2589 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens
INFO:root:
EL の見どころは、今シーズンの豪華な試合（ソシエダ vs アーセナルなど）や、日本人選手同士の試合に期待できることです。また、CLやECLに影響されて出場資格や大会の方式などが変わっていることも面白いところです。

hiracky16

Making it easy to use from a web server

The index can be saved as json (the save_to_disk call shown earlier)
Here I load that file and implement a REST API in Flask that runs queries against it

from gpt_index import GPTSimpleVectorIndex, LLMPredictor
from langchain.chat_models import ChatOpenAI
import json
from flask import Flask, request, jsonify

INDEX_LOCAL_PATH = 'index/index.json'

app = Flask(__name__)
app.config['JSON_AS_ASCII'] = False

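@app.route('/query', methods=["POST"])
def query():
    # Parse the raw body ourselves so the curl call below works without a Content-Type header
    payload = json.loads(request.get_data())
    query_text = payload.get('query')
    if not query_text:
        return jsonify({'error': 'missing "query" field'}), 400
    result = index.query(query_text)
    return jsonify({'response': result.response}), 200

# Load the saved index once at startup, using the same chat model for queries
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
index = GPTSimpleVectorIndex.load_from_disk(INDEX_LOCAL_PATH, llm_predictor=llm_predictor)

if __name__ == '__main__':
    app.run(port=8000, debug=True)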

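Without app.config['JSON_AS_ASCII'] = False, jsonify escapes the Japanese text as \uXXXX sequences, so the raw response is not readable as Japanese (I got stuck on this for quite a while). As far as I know, newer Flask releases (2.3 and later) dropped this config key, and the equivalent there is app.json.ensure_ascii = False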
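
The escaping behavior can be reproduced with the standard `json` module alone, which is a reasonable stand-in for what the Flask setting toggles. The payload below is made up for the example. Both forms are valid JSON and decode to the same object, so the difference only matters when a human reads the raw response.

```python
import json

payload = {"response": "EL の見どころ"}

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(payload)
# ensure_ascii=False keeps the characters as they are
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same object
assert json.loads(escaped) == json.loads(readable) == payload
```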
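Here is the result of a query (the question asks for the highlights of the EL, and the answer comes back in Japanese because the source articles are Japanese)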

> curl -X POST 'http://localhost:8000/query' -d '{"query": "EL の見どころは？"}'
{
  "response": "EL の見どころは、22/23シーズンの注目ポイントや日本人選手同士の試合に期待できることなどです。また、ELの歴史を調べるとCLやECLに影響されて出場資格や大会の方式などが変わっていることに気づき、今後もややこしいが面白くするための施策が行われていることも知れます。"
}
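
The same request can be sent from Python instead of curl. The sketch below uses only the standard library. `build_query_request` and `ask` are hypothetical helper names, and the base URL assumes the server is running locally on port 8000 as above. Only the request construction runs here. `ask` needs the server to be up.

```python
import json
import urllib.request


def build_query_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the /query endpoint without sending it."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = "http://localhost:8000") -> str:
    """Send the question to the running server and return the 'response' field."""
    with urllib.request.urlopen(build_query_request(question, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


req = build_query_request("EL の見どころは？")
print(req.get_method(), req.full_url)
# With the Flask server running: print(ask("EL の見どころは？"))
```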


Comparison with ChatGPT

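The ChatGPT I compared against uses plain GPT-3.5, so naturally it does not know the context of the articles
So when I sent it a similar query, the answer it gave had nothing to do with them
ChatGPT is handy, but it is weak on questions where context has to be supplied (the prompt engineering that is popular lately?), so augmenting it with GPT index makes it more useful
Building an index of internal company documents might help with things like onboarding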
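
The difference between the two setups comes down to what ends up in the prompt. A minimal sketch of the idea, assuming some retrieval step has already picked the relevant chunks: the chunks are pasted into the prompt ahead of the question. The template wording and the `build_prompt` helper are hypothetical and written for this note. They are not the library's actual prompt.

```python
def build_prompt(context_chunks, question):
    """Put retrieved text in front of the question (hypothetical template)."""
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "This season's EL has some big fixtures, such as Sociedad vs Arsenal.",
    "Several matches may feature Japanese players on both sides.",
]
prompt = build_prompt(retrieved, "What are the highlights of the EL?")
print(prompt)
```

Plain ChatGPT only ever sees the question line, which is why its answer is unrelated to the articles.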

This scrap was closed on 2023/03/25