Full-text search with Cloudflare D1

Using the post below as a reference, I'll try this out myself.

https://leaves.chiba.dev/posts/246

Reading the SQLite FTS documentation
https://www.sqlite.org/fts5.html

Overview of FTS5

It is an error to add types, constraints or PRIMARY KEY declarations to a CREATE VIRTUAL TABLE statement used to create an FTS5 table. Once created, an FTS5 table may be populated using INSERT, UPDATE or DELETE statements like any other table. Like any other table with no PRIMARY KEY declaration, an FTS5 table has an implicit INTEGER PRIMARY KEY field named rowid.

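Because an FTS5 table is a virtual table, you cannot declare a PRIMARY KEY, column types, or constraints the way you would in a regular table definition. The implicit INTEGER rowid serves as the primary key.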

-- Query for all rows that contain at least once instance of the term
-- "fts5" (in any column). Return results in order from best to worst
-- match.
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;

With ORDER BY rank, rows are returned in order of match quality, best match first.

FTS5 Table Creation and Initialization

There are currently the following configuration options:
The "tokenize" option, used to configure a custom tokenizer.
The "prefix" option, used to add prefix indexes to an FTS5 table.
The "content" option, used to make the FTS5 table an external content or contentless table.
The "content_rowid" option, used to set the rowid field of an external content table.
The "columnsize" option, used to configure whether or not the size in tokens of each value in the FTS5 table is stored separately within the database.
The "detail" option. This option may be used to reduce the size of the FTS index on disk by omitting some information from it.

content and content_rowid are useful when the documents you want to search are already stored in another table. The following four built-in tokenizers are available.

The unicode61 tokenizer, based on the Unicode 6.1 standard. This is the default.
The ascii tokenizer, which assumes all characters outside of the ASCII codepoint range (0-127) are to be treated as token characters.
The porter tokenizer, which implements the porter stemming algorithm.
The trigram tokenizer, which treats each contiguous sequence of three characters as a token, allowing FTS5 to support more general substring matching.

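unicode61 tokenizer: treats whitespace and punctuation as separators and everything else as token characters
ascii tokenizer: similar to the unicode61 tokenizer, except in how it handles non-ASCII characters. See the documentation for the exact differences.
porter tokenizer: a wrapper around another tokenizer that normalizes the emitted tokens with the Porter stemmer
- Normalization => e.g. running, runs, and run are all treated as run (irregular forms such as ran are not handled, since the stemmer only strips suffixes)
trigram tokenizer: treats each contiguous sequence of three characters as one token
- Reference: https://www.space-i.com/post-blog/sqlite-fts-trigram-tokenizerでunigram＆bigram検索までサポート-日本語全文検索/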
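To get a feel for what the trigram tokenizer does with Japanese text, here is a minimal TypeScript sketch that produces the same kind of overlapping three-character tokens. The function name `trigrams` is made up for illustration; this is not SQLite's implementation, and it ignores the case folding the real tokenizer applies by default.

```typescript
// Illustration only: emulate the idea of the FTS5 trigram tokenizer by
// emitting every contiguous run of three characters as one token.
function trigrams(text: string): string[] {
  // Array.from splits by Unicode code point, so surrogate pairs stay intact.
  const chars = Array.from(text);
  const tokens: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    tokens.push(chars.slice(i, i + 3).join(""));
  }
  return tokens;
}

// "全文検索" has four characters, so it yields two overlapping tokens.
console.log(trigrams("全文検索")); // [ '全文検', '文検索' ]
// A two-character string yields no token at all.
console.log(trigrams("検索")); // []
```

The second call shows why short Japanese words are a problem here: the FTS5 documentation notes that a full-text query shorter than three characters does not match any rows, which is the gap the article linked above works around with unigram and bigram support.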

nissy-dev

Can I use drizzle?

drizzle does not support SQLite FTS, so this looks infeasible.

It apparently does support full-text search for PostgreSQL.

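Can the tokenizer be extended?

Blog posts about full-text search on SQLite point out that the default tokenizers handle Japanese poorly, so the authors load their own custom extensions. With D1 that looks impossible, since there is no way to ship an extension library to the servers that host D1.

I was also looking at another document, and the sample code for kagome, a morphological analysis tool, looked like a useful reference for this kind of case. It suggests there is a way to tokenize on the application side.

The sample implements search using two tables. The first table stores the text to be searched.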

CREATE TABLE IF NOT EXISTS contents_fts(docid INTEGER PRIMARY KEY AUTOINCREMENT, content TEXT);

The other is an FTS table that stores the segmented words joined with spaces. A search query is run against this table to obtain the docid, which is then used to query the first table.

CREATE VIRTUAL TABLE IF NOT EXISTS fts USING fts4(words);
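The two-table flow described above can be sketched in TypeScript as pure functions that only build the SQL and its parameters. This is my own sketch, not the kagome sample itself: the names `Tokenize`, `Statement`, `buildIndexStatements`, `buildSearchStatement`, and `whitespaceTokenize` are hypothetical, and the whitespace tokenizer is just a placeholder for a real morphological analyzer. Actually executing the statements (for example through a D1 binding) is left out.

```typescript
// A tokenizer turns raw text into a list of words (placeholder type).
type Tokenize = (text: string) => string[];

interface Statement {
  sql: string;
  params: (string | number)[];
}

// Placeholder tokenizer: splits on whitespace only. A real setup would
// plug in a morphological analyzer for Japanese here.
const whitespaceTokenize: Tokenize = (text) =>
  text.split(/\s+/).filter((word) => word.length > 0);

// Build the two INSERTs: the raw text goes into contents_fts, and the
// space-joined words go into the fts table under the same docid.
function buildIndexStatements(docid: number, content: string, tokenize: Tokenize): Statement[] {
  const words = tokenize(content).join(" ");
  return [
    { sql: "INSERT INTO contents_fts(docid, content) VALUES (?, ?)", params: [docid, content] },
    { sql: "INSERT INTO fts(docid, words) VALUES (?, ?)", params: [docid, words] },
  ];
}

// Build the search: tokenize the query the same way, match it against
// the fts table, and use the matching docids to fetch the original text.
// Assumption: tokens contain no FTS query syntax characters.
function buildSearchStatement(query: string, tokenize: Tokenize): Statement {
  const match = tokenize(query).join(" ");
  return {
    sql:
      "SELECT docid, content FROM contents_fts " +
      "WHERE docid IN (SELECT docid FROM fts WHERE fts MATCH ?)",
    params: [match],
  };
}
```

The important design point is that the same tokenizer must be applied at index time and at query time; otherwise the words stored in the FTS table and the words in the MATCH expression will not line up.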

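Is vector search also an option?

Cloudflare offers a dedicated service for handling vectors (Vectorize). Some people have tried Japanese search with it.

However, the vectorization step in those attempts used OpenAI's paid API, and doing the same thing for free with an LLM looks difficult, so I dropped this option. The models are likely to keep improving, so I'd like to try this once price and accuracy have settled down a bit.

As of now, none of the embedding models offered by Workers AI support Japanese either.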

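Morphological analysis in JS

lindera or sudachi would be nice, but maintenance seems to have stalled.

One could argue that Intl.Segmenter is good enough.

ICU's BreakIterator and ICU4X's Segmenter perform a relatively simple word segmentation based on the IPA dictionary.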
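As a rough sketch of the Intl.Segmenter idea, the following splits text into word-like segments and joins them with spaces, which is the shape an FTS table with a whitespace-based tokenizer would need. The function names `segmentWords` and `toFtsWords` are made up for illustration, and I'm assuming a runtime that ships Intl.Segmenter with Japanese dictionary data (recent Node.js does). The exact word boundaries depend on the runtime's ICU data, so no Japanese output is shown here.

```typescript
// Split text into word-like segments using the built-in Intl.Segmenter.
function segmentWords(text: string, locale: string = "ja"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments,
    // so only actual words are kept.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

// Join the segments with spaces so that a whitespace-based FTS tokenizer
// sees each segment as a separate token.
function toFtsWords(text: string): string {
  return segmentWords(text).join(" ");
}

console.log(toFtsWords("Hello, world!")); // "Hello world"
console.log(toFtsWords("今日は良い天気です。")); // boundaries depend on ICU data
```

Compared with a dedicated morphological analyzer, this needs no extra dependency or dictionary download, at the cost of coarser segmentation.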