Closed on 2023/10/18 (4 comments)

Attaching additional knowledge to ChatGPT as an external add-on

ChatGPT

LangChain

shin_t_o_

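LangChain itself has already been written up thoroughly by the people who got there first, so there is no new information here.
The point is only this: if you carve out the Indexes (LangChain doc) part and keep it separately, you get something like swappable, instant knowledge, which seems like a nice property.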

shin_t_o_

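Any Document Loaders (LangChain doc)
Any Text Splitters (LangChain doc)

The Document[] object these produce is plain data that serializes directly to JSON, so if you build it up to this point and store it, at run time you only need to read it back.

I noticed this while experimentally building a LangChain extension in a React + Next.js web application, right around the point where running HNSWLib in-memory started to feel heavy.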
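As a rough illustration of the "store it, then just read it back" idea, here is a minimal sketch of a JSON round trip. `StoredDoc`, `serializeDocs`, and `parseDocs` are hypothetical names of mine, not LangChain APIs, and `StoredDoc` is only a local stand-in for the `{ pageContent, metadata }` shape. Writing the resulting string to disk and reading it back is left out.

```typescript
// Local stand-in for the { pageContent, metadata } shape (an assumption, not LangChain's class).
interface StoredDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Runtime check that an unknown value has the expected shape.
function isStoredDoc(value: unknown): value is StoredDoc {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { pageContent?: unknown; metadata?: unknown };
  return (
    typeof v.pageContent === "string" &&
    typeof v.metadata === "object" &&
    v.metadata !== null &&
    !Array.isArray(v.metadata)
  );
}

// Turn Document-shaped objects into a JSON string that can be written to a file.
function serializeDocs(docs: StoredDoc[]): string {
  // Copy only the two known fields so class instances become plain data.
  const plain = docs.map((d) => ({ pageContent: d.pageContent, metadata: d.metadata }));
  return JSON.stringify(plain, null, 2);
}

// Parse a stored JSON string back into Document-shaped objects, rejecting anything else.
function parseDocs(json: string): StoredDoc[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed) || !parsed.every(isStoredDoc)) {
    throw new Error("expected a JSON array of { pageContent, metadata } objects");
  }
  return parsed;
}
```

Validating on load is a design choice here: the stored file is an external input at run time, so a cheap shape check catches a stale or hand-edited file before it reaches the vector store.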

shin_t_o_

The Document object has the simple shape shown below.

export declare class Document<Metadata extends Record<string, any> = Record<string, any>> implements DocumentInput {
    pageContent: string;
    metadata: Metadata;
    constructor(fields: DocumentInput<Metadata>);
}

type Record<K extends keyof any, T> = {
    [P in K]: T;
};

So even without calling (new XXXLoader).load(), you can read the data yourself and shape it into this form, and it behaves the same.
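To make the "shape it yourself" route concrete, here is a small sketch that maps self-loaded records onto the `{ pageContent, metadata }` form. The `Note` type, the `noteToDoc` helper, and the sample data are all hypothetical; `PlainDoc` is a local stand-in for the Document shape so the snippet runs without LangChain installed.

```typescript
// Local stand-in for the Document shape (an assumption for this sketch).
type PlainDoc<M extends Record<string, unknown> = Record<string, unknown>> = {
  pageContent: string;
  metadata: M;
};

// Hypothetical record type for data read by your own code.
type Note = {
  title: string;
  body: string;
  path: string;
  created_at: string;
};

// Everything except the body text goes into metadata.
type NoteMetadata = {
  title: string;
  path: string;
  created_at: string;
};

// Shape one record into { pageContent, metadata } without any Loader.
function noteToDoc(note: Note): PlainDoc<NoteMetadata> {
  const { body, ...metadata } = note;
  return { pageContent: body, metadata };
}

// Made-up sample records standing in for whatever your own reader produces.
const notes: Note[] = [
  { title: "Setup", body: "Install the dependencies first.", path: "/setup", created_at: "2023-10-01" },
  { title: "Usage", body: "Run the dev server.", path: "/usage", created_at: "2023-10-02" },
];

const handBuiltDocs: PlainDoc<NoteMetadata>[] = notes.map(noteToDoc);
```

Since these objects are structurally the same as what a Loader returns, they should be usable wherever a `Document[]` is expected, though that is worth confirming against the LangChain version you are on.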

shin_t_o_

Take some documents, some splitter, and a suitable definition of the { pageContent, metadata } properties, save the resulting Document[] object locally, and you can load it whenever needed as "external knowledge".

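// Import paths as of the LangChain JS releases from 2023; newer releases moved these into other packages.
import { HNSWLib } from "langchain/vectorstores/hnswlib"
import { OpenAIEmbeddings } from "langchain/embeddings/openai"

// docs: source objects for the knowledge, prepared separately
// splitter: a text splitter, prepared separately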
const documents = (
  await Promise.all(
    docs.map((doc: DocType) => {
      return splitter.createDocuments(
        [doc.content], // pageContent
        [
          {
            title: doc.title,
            created_at: doc.created_at,
            path: doc.path,
          },
        ] // metadata
      )
    })
  )
).flat()

// For example, hand them to HNSWLib
const vectorStore = await HNSWLib.fromDocuments(
  documents,
  new OpenAIEmbeddings({ openAIApiKey: apiKey })
)
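For readers wondering what the `createDocuments(texts, metadatas)` call above does with its two arrays, here is a deliberately naive stand-in. It is not LangChain's implementation: `splitText` and `createChunkDocs` are hypothetical helpers that cut text into fixed-size windows with overlap, whereas the real splitters are smarter about where they break. The only point is that every chunk cut from `texts[i]` carries a copy of `metadatas[i]`.

```typescript
// Local stand-in for the Document shape (an assumption for this sketch).
interface ChunkDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical fixed-size splitter: windows of chunkSize characters, overlapping by chunkOverlap.
function splitText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Same call shape as the snippet above: texts[i] is paired with metadatas[i].
function createChunkDocs(
  texts: string[],
  metadatas: Record<string, unknown>[],
  chunkSize: number,
  chunkOverlap: number
): ChunkDoc[] {
  const result: ChunkDoc[] = [];
  for (let i = 0; i < texts.length; i++) {
    const text = texts[i] ?? "";
    for (const chunk of splitText(text, chunkSize, chunkOverlap)) {
      // Each chunk gets its own copy of the source's metadata.
      result.push({ pageContent: chunk, metadata: { ...(metadatas[i] ?? {}) } });
    }
  }
  return result;
}
```

Because the metadata travels with every chunk, a search hit on any fragment can still be traced back to its title and path, which is what makes the stored Document[] usable on its own later.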

This scrap was closed on 2023/10/18.