⌨️

自動補完、IME で Deno KV を使う

2024/01/06に公開

はじめに

最近 Open Beta になった Deno KV (SQLite ベースのキーバリューストア) を使ってみました。
実際に使って感じた感想をコードベースの TIPS のような形で纏めます。
網羅的な説明をするものではありません。

Deno KV はまだ安定化していないので、この記事の内容を参考にするときは更新日時に注意してください。
利用するには --unstable-kv フラグが必要です。

https://github.com/vim-skk/skkeleton

マニュアル

https://docs.deno.com/kv/manual

目的

ddc-source-dictionary は単語が羅列された辞書ファイルを読み込み、自動補完の候補として提供します。
skkeleton は SKK 辞書を読み込み、漢字への変換で用います。

どちらも起動時のファイル I/O がボトルネックとなるので、改善できないだろうかと以前より考えていました。
そこでこの Deno KV を使って一度辞書の中身をデータベース化すれば、それ以降の起動が高速化されるだろうと思い実装してみました。

しかし、入力に一致する候補の収集は非常に頻繁に行われる操作であり、この速度によっては実用的とは言えません。
特に ddc-source-dictionary の元々の実装ではメモリ上で Trie 木を構築しており、候補の絞り込みは非常に高速です。
その比較結果もまとめて報告します。

cmp-dictionary で SQLite を使ったときに、候補の収集が遅くあまり推奨できる出来にならなかった苦い思い出。。

使い方

今回の目的から、揮発するインメモリな方法は意味がありません。
SQLite ベースのデータベースを用います。
Deno.openKv() の引数にデータベースを保存するパスを指定しましょう。

const database = await Deno.openKv("/path/to/database")

Deno KV はキーバリューストアです。最も基本的な操作は set と get でしょう。

対応する Key と Value をセットする (kv.set())
Key を使って対応する Value を取得する (kv.get())

await database.set(["ab", "c"], "def")
console.log((await database.get(["ab", "c"])).value) // -> "def"

今回のような、ある条件に一致する候補を全て取得する操作を行うためには kv.list() が便利です。
ただし、そのためにはキーのデータ構造に少し工夫が要ります。

キーのデータ構造

今回は string[] として扱います。

このキーですが、絞り込みにはキーしか使えないのでこちらにある程度情報を盛り込む必要があります。
具体的には

1つのデータベースで複数の辞書に対応するため、辞書のパスを入れる
- SQL と違いテーブルはありません
- そのためキーを階層化することでデータを分割します
辞書の更新を検知するために mtime を記録する
- [path, "mtime"] をキーにすることで、データベースが最新の辞書に対応したものかどうか判断できるようにしました
絞り込みに使う文字列を分割する
- 実際の単語のキーは [path, "word", ...word] のような構造にします
- 例えば ddc-source-dictionary で Vim という単語を登録するには、await database.set(["/path/to/dict", "word", "V", "i", "m"], "Vim") とします。
- これは kv.list() の制約で、prefix は配列としての前方一致しか見てくれないためです（{ prefix: ["word", "a"] } では ["word", "ab"] が見つからない）。

あたりでしょうか。

データの登録

当然ですが、データベースに値をセットするのもファイル I/O です。
愚直にやると初期化に莫大な時間がかかったので、Deno.AtomicOperation を使って工夫しました。
が、このあたりを効率的にやるにはまだ手法が整備されていないなという感想です（ベータ版なので仕方ない）。

今回 AtomicOperation を使う目的は I/O を減らすことです。
つまり commit() の頻度を最小化したいわけです。

しかし AtomicOperation で set() をまとめるには2つの制限があります。
1つ目は mutations の回数です。

コードとエラー

database.ts

const database = await Deno.openKv();
let atm = database.atomic();
for (let i = 0; i < 1001; i++) {
  atm = atm.set(["a"], "a");
}
await atm.commit();

$ deno run -A --unstable-kv ./database.ts
error: Uncaught (in promise) TypeError: too many mutations (max 1000)
await atm.commit();
          ^
    at doAtomicWriteInPlace (ext:deno_kv/01_db.ts:626:16)
    at AtomicOperation.commit (ext:deno_kv/01_db.ts:401:32)
    at file:///home/uga/dotfiles/nvim/database.ts:6:11
    at eventLoopTick (ext:core/01_core.js:182:7)

これは単純で、操作が 1000 回を越えると引っ掛かります。

もう1つが厄介で、Total key size の制限です。

コードとエラー

database.ts

const database = await Deno.openKv();
let atm = database.atomic();
for (let i = 0; i < 500; i++) {
  atm = atm.set(Array(100).fill("a"), "a");
}
await atm.commit();

$ deno run -A --unstable-kv ./database.ts
error: Uncaught (in promise) TypeError: total key size too large (max 81920 bytes)
await atm.commit();
          ^
    at doAtomicWriteInPlace (ext:deno_kv/01_db.ts:626:16)
    at AtomicOperation.commit (ext:deno_kv/01_db.ts:401:32)
    at file:///home/uga/dotfiles/nvim/database.ts:6:11
    at eventLoopTick (ext:core/01_core.js:182:7)

今度は500回しか操作していないのにエラーになりました。
total key size が 81920 bytes を越えたからとありますね。
これは set() で使われているキーのサイズの合計値から生じる制限なのですが、この「キーのサイズ」が曲者です。

key は TypeScript のレイヤーでは Deno.KvKey ですが、Rust のレイヤーで u8 のバイト列に変換されます。
この際、要素ごとに元の型を特定するための識別子や、配列の要素を区切るためのバイトが追加されます。

今回は string しか使わないのでこれに絞りますが、

配列の各要素に対して文字列であることの識別子 0x02 から開始
各要素の文字列をバイト列に変換し、0x00 を 0xFF でエスケープ
末尾に 0x00 を追加

となるので、['a', 'bc'] は 0x02 0x61 0x00 0x02 0x62 0x63 0x00 の7バイトになります。

ですので、キーのサイズを計算するには次のような関数が必要になります。

const Encoder = new TextEncoder();
function calcKeySize(key: string[]): number {
  let size = 0;
  for (const k of key) {
    const encoded = Encoder.encode(k);
    size += encoded.reduce((acc, cur) => acc + (cur === 0x00 ? 2 : 1), 2);
  }
  return size;
}

以上より、ファイル I/O を最小化するためには、操作が 1000 回を越えるか total key size が 81920 bytes を越えるかのどちらかを満たしたら commit() する、という判定を仕込む必要がありました。

ちなみに、上記の kv.list() で prefix が部分一致を見つけてくれない理由は恐らくこれだと思います。
["ab"] は 0x02 0x61 0x62 0x00 であり ["a"] は 0x02 0x61 0x00 ですから、確かに前方一致してませんね。

候補の検索

現在有効化されている辞書ごとに、prefix に前方一致する候補を収集します。
ここも素直にやると上手くいかない点がありました。

export class Dictionary {
  #db: Deno.Kv;
  #activePath: Map<string, boolean>;

  // ...

  async search(
    prefix: string,
  ): Promise<Item[]> {
    const items: Item[] = [];
    for (const [path, active] of this.#activePath) {
      if (!active) {
        continue;
      }
      for await (
        const entry of this.#db.list<string>({
          prefix: [path, "word", ...prefix],
          start: [path, "word", ...prefix],
        })
      ) {
        items.push({ word: entry.value });
      }
    }
    return items;
  }
}

prefix と start に同じものを指定していますね。
これは prefix だけだと完全一致する候補が取れないのに、start を入れると取れるようになるからです。
結構怪しい挙動なので issue として報告済みですが、反応がないのでバグか仕様かは分かりません。

テストコード

test.ts

const data = [
  { key: ["data"], value: 0 },
  { key: ["data", 1], value: 1 },
];

const db = await Deno.openKv("./test.db");
for (const d of data) {
  await db.set(d.key, d.value);
}

console.log("only prefix");
for await (const entry of db.list({ prefix: ["data"] })) {
  console.log(entry.value);
}

console.log("prefix and start");
for await (const entry of db.list({ prefix: ["data"], start: ["data"] })) {
  console.log(entry.value);
}

db.close();

result

$ deno run -A --unstable-kv ./test.ts
only prefix
1
prefix and start
0
1

速度比較

結果は以下のようになりました。
10万件ほどの候補から探索し、10 ms スケールで収集できるのは十分な速度ではないでしょうか？

テストコード

import { TextLineStream } from "https://deno.land/std@0.208.0/streams/mod.ts";

const path = "/usr/share/dict/words";
const database = await Deno.openKv("./database");
let atm = database.atomic();

const Encoder = new TextEncoder();
function calcKeySize(key: string[]): number {
  let size = 0;
  for (const k of key) {
    const encoded = Encoder.encode(k);
    size += encoded.reduce((acc, cur) => acc + (cur === 0x00 ? 2 : 1), 2);
  }
  return size;
}

let mutationCount = 0;
let totalKeySize = 0;
async function setAtom(key: string[], value: string) {
  const keySize = calcKeySize(key);
  if (mutationCount >= 1000 || totalKeySize + keySize > 81920) {
    await atm.commit();
    atm = database.atomic();
    mutationCount = 0;
    totalKeySize = 0;
  }
  atm = atm.set(key, value);
  mutationCount++;
  totalKeySize += keySize;
}

// Initialize database
let time = performance.now();

const stat = Deno.statSync(path);
const mtime = stat.mtime?.getTime();
if (!mtime || (await database.get([path, "mtime"])).value !== mtime) {
  const lineStream = Deno.openSync(path).readable
    .pipeThrough(new TextDecoderStream())
    .pipeThrough(new TextLineStream());
  for await (const line of lineStream) {
    for (const word of line.split(/\s+/)) {
      if (word !== "") {
        await setAtom([path, "word", ...word], word);
      }
    }
  }
  await atm.commit();
  await database.set([path, "mtime"], mtime);
}

console.log(`Initialize: ${performance.now() - time} ms`);

// Search words by prefix
const prefixes = "abcdefghijklmnopqrstuvwxyz".split("");

for (const prefix of prefixes) {
  time = performance.now();

  const items = [];
  for await (
    const entry of database.list<string>({
      prefix: [path, "word", prefix],
      start: [path, "word", prefix],
    })
  ) {
    items.push({ word: entry.value });
  }

  console.log(
    `Search by prefix '${prefix}': Number of words is ${items.length}: Time is ${
      performance.now() - time
    } ms`,
  );
}

Initialize: 1841.1240440000001 ms
Search by prefix 'a': Number of words is 4705: Time is 25.88997500000005 ms
Search by prefix 'b': Number of words is 4913: Time is 27.485771999999997 ms
Search by prefix 'c': Number of words is 8260: Time is 45.36883399999988 ms
Search by prefix 'd': Number of words is 5176: Time is 27.3865330000001 ms
Search by prefix 'e': Number of words is 3307: Time is 18.01066199999991 ms
Search by prefix 'f': Number of words is 3745: Time is 18.53727099999992 ms
Search by prefix 'g': Number of words is 2799: Time is 14.104373999999552 ms
Search by prefix 'h': Number of words is 3122: Time is 15.26544100000001 ms
Search by prefix 'i': Number of words is 3385: Time is 17.978665999999976 ms
Search by prefix 'j': Number of words is 777: Time is 3.8198680000000422 ms
Search by prefix 'k': Number of words is 621: Time is 3.063626000000113 ms
Search by prefix 'l': Number of words is 2644: Time is 12.752908999999818 ms
Search by prefix 'm': Number of words is 4496: Time is 22.978982000000087 ms
Search by prefix 'n': Number of words is 1560: Time is 7.876393999999891 ms
Search by prefix 'o': Number of words is 1967: Time is 10.591051999999763 ms
Search by prefix 'p': Number of words is 6822: Time is 34.23828300000014 ms
Search by prefix 'q': Number of words is 417: Time is 2.143696000000091 ms
Search by prefix 'r': Number of words is 4721: Time is 23.46596099999988 ms
Search by prefix 's': Number of words is 10070: Time is 52.34290299999975 ms
Search by prefix 't': Number of words is 4354: Time is 22.268744999999853 ms
Search by prefix 'u': Number of words is 1826: Time is 9.000050999999985 ms
Search by prefix 'v': Number of words is 1280: Time is 6.1433320000001 ms
Search by prefix 'w': Number of words is 2362: Time is 12.512147999999797 ms
Search by prefix 'x': Number of words is 57: Time is 0.5083490000001802 ms
Search by prefix 'y': Number of words is 285: Time is 1.6422410000000127 ms
Search by prefix 'z': Number of words is 151: Time is 0.8030710000002728 ms

あらかじめメモリに載せている Trie 木と比較しても 1 桁程度の差で済んでいます。
十分実用的な速度ですね。

テストコード

import { TextLineStream } from "https://deno.land/std@0.208.0/streams/mod.ts";

const path = "/usr/share/dict/words";

class TrieNode {
  readonly children: Record<string, TrieNode>;
  endOfWord: boolean;

  constructor() {
    this.children = {};
    this.endOfWord = false;
  }
}

class Trie {
  readonly root: TrieNode;

  constructor() {
    this.root = new TrieNode();
  }

  insert(word: string): void {
    let current = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      const node = current.children[ch] ?? new TrieNode();
      current.children[ch] = node;
      current = node;
    }
    current.endOfWord = true;
  }

  private searchPrefix(
    node: TrieNode,
    prefix: string,
    wordList: string[],
  ): void {
    if (node.endOfWord) {
      wordList.push(prefix);
    }
    for (const ch in node.children) {
      this.searchPrefix(node.children[ch], prefix + ch, wordList);
    }
  }

  search(prefix: string): string[] {
    let node = this.root;
    for (let i = 0; i < prefix.length; i++) {
      node = node.children[prefix.charAt(i)];
      if (node === undefined) {
        return [];
      }
    }
    const wordList: string[] = [];
    this.searchPrefix(node, prefix, wordList);
    return wordList;
  }
}

// Initialize database
let time = performance.now();

const trie = new Trie();
const lineStream = Deno.openSync(path).readable
  .pipeThrough(new TextDecoderStream())
  .pipeThrough(new TextLineStream());
for await (const line of lineStream) {
  for (const word of line.split(/\s+/)) {
    if (word !== "") {
      trie.insert(word);
    }
  }
}

console.log(`Initialize: ${performance.now() - time} ms`);

// Search words by prefix
const prefixes = "abcdefghijklmnopqrstuvwxyz".split("");

for (const prefix of prefixes) {
  time = performance.now();

  const items = trie.search(prefix).map((word) => ({ word }));

  console.log(
    `Search by prefix '${prefix}': Number of words is ${items.length}: Time is ${
      performance.now() - time
    } ms`,
  );
}

Initialize: 185.635855 ms
Search by prefix 'a': Number of words is 4705: Time is 1.5532199999999818 ms
Search by prefix 'b': Number of words is 4913: Time is 1.2792749999999842 ms
Search by prefix 'c': Number of words is 8260: Time is 1.5127829999999847 ms
Search by prefix 'd': Number of words is 5176: Time is 0.7586410000000114 ms
Search by prefix 'e': Number of words is 3307: Time is 0.48531500000001415 ms
Search by prefix 'f': Number of words is 3745: Time is 0.5526940000000025 ms
Search by prefix 'g': Number of words is 2799: Time is 0.4029250000000104 ms
Search by prefix 'h': Number of words is 3122: Time is 0.48266499999999724 ms
Search by prefix 'i': Number of words is 3385: Time is 0.5046949999999981 ms
Search by prefix 'j': Number of words is 777: Time is 0.11220900000000711 ms
Search by prefix 'k': Number of words is 621: Time is 0.09358899999998016 ms
Search by prefix 'l': Number of words is 2644: Time is 0.36822599999999284 ms
Search by prefix 'm': Number of words is 4496: Time is 0.7106530000000078 ms
Search by prefix 'n': Number of words is 1560: Time is 0.33206599999999753 ms
Search by prefix 'o': Number of words is 1967: Time is 0.32374599999999987 ms
Search by prefix 'p': Number of words is 6822: Time is 0.9599990000000105 ms
Search by prefix 'q': Number of words is 417: Time is 0.07502900000000068 ms
Search by prefix 'r': Number of words is 4721: Time is 0.7690519999999879 ms
Search by prefix 's': Number of words is 10070: Time is 1.6252819999999986 ms
Search by prefix 't': Number of words is 4354: Time is 0.850560999999999 ms
Search by prefix 'u': Number of words is 1826: Time is 0.48629499999998416 ms
Search by prefix 'v': Number of words is 1280: Time is 0.24391699999998195 ms
Search by prefix 'w': Number of words is 2362: Time is 0.40450500000000034 ms
Search by prefix 'x': Number of words is 57: Time is 0.016899999999992588 ms
Search by prefix 'y': Number of words is 285: Time is 0.059868999999991956 ms
Search by prefix 'z': Number of words is 151: Time is 0.03650000000001796 ms

終わりに

まだベータ版なので微妙なところもありますが、かなり良さそうな雰囲気は感じています。
今後の開発を応援しています。

GitHubで編集を提案

はじめに

目的

使い方

キーのデータ構造

データの登録

候補の検索

速度比較

終わりに

Discussion