Enabling Japanese search in OpenSearch

Published on 2022/05/26

Installing the Japanese-language plugins into a local OpenSearch and configuring an Analyzer took me a little longer than expected, so here are my notes.

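Preparing the Dockerfile and docker-compose.yml

The Dockerfile looks like this.
It would be a lot easier if the plugins could be specified through environment variables, with no Dockerfile at all, but...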

Dockerfile

FROM opensearchproject/opensearch:1.3.2
RUN /usr/share/opensearch/bin/opensearch-plugin install analysis-kuromoji
RUN /usr/share/opensearch/bin/opensearch-plugin install analysis-icu

docker-compose.yml is below.
Nothing special in it.

docker-compose.yml
version: "3.3"
services:
  search-engine:
    build:
      context: .
    environment:
      - discovery.type=single-node
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - plugins.security.disabled=true
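
Once the container is built and running, it may be worth confirming that both plugins were actually picked up before touching any Analyzer settings. Here is a minimal sketch using the `_cat/plugins` API. The `OPENSEARCH_URL` variable and the function names `list_plugins` and `has_plugin` are names I made up for illustration, and it assumes curl is run from inside the container on port 9200, like the other commands in this post.

```shell
#!/usr/bin/env bash
# Sketch: confirm the analysis plugins were installed into the image.
# OPENSEARCH_URL and the function names are illustrative assumptions.
OPENSEARCH_URL="${OPENSEARCH_URL:-localhost:9200}"

# List the installed plugins through the _cat API.
list_plugins() {
  curl -sS "${OPENSEARCH_URL}/_cat/plugins?v"
}

# Read a plugin listing on stdin; succeed if the given plugin name appears in it.
has_plugin() {
  grep -q -- "$1"
}

# Usage (run inside the container):
#   list_plugins | has_plugin analysis-kuromoji && echo "kuromoji is installed"
#   list_plugins | has_plugin analysis-icu && echo "icu is installed"
```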

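Creating the index and configuring the Analyzer

This time I create an index named books.

First, create it without any Analyzer settings. The curl commands are run from a shell inside the container.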

$ docker compose exec search-engine /bin/bash
$ curl -X PUT -H "Content-Type: application/json"  "localhost:9200/books/?pretty"

The following command shows how the index's Analyzer tokenizes a piece of text.

$ curl -H "Content-Type: application/json" "localhost:9200/books/_analyze?pretty" -d '{"text": "良いコードで学ぶ設計入門"}'

{
  "tokens" : [
    {
      "token" : "良",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "い",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<HIRAGANA>",
      "position" : 1
    },
    {
      "token" : "コード",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<KATAKANA>",
      "position" : 2
    },
    ...

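Japanese morphemes are ignored: kanji and hiragana are split into single-character tokens (only the katakana run コード stays in one piece).
Searching in this state would, for example, return every document that contains い.
(What I actually want is for 良い to be the search term.)

To have Japanese morphemes recognized, set the Tokenizer in the Analyzer settings to kuromoji_tokenizer.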

$ curl -X DELETE "localhost:9200/books?pretty"
$ curl -X PUT -H "Content-Type: application/json"  "localhost:9200/books?pretty" -d'
{
    "settings":{
        "analysis":{
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "kuromoji_tokenizer"
                }
            }
        }
    }
}'

settings.analysis.analyzer.default sets the index's default Analyzer.
An Analyzer can also be set per property in mappings.
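
For reference, the per-property variant could look roughly like the sketch below. The analyzer name `ja_text`, the field name `title`, the `OPENSEARCH_URL` variable, and the function name `create_books_index` are all hypothetical; I did not use this form in the steps that follow.

```shell
#!/usr/bin/env bash
# Sketch: a named custom analyzer applied to a single field via mappings.
# "ja_text" and "title" are hypothetical names chosen for illustration.
BOOKS_INDEX_BODY='{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "ja_text" }
    }
  }
}'

# Create the books index with the body above (delete any existing index first).
create_books_index() {
  curl -sS -X PUT -H "Content-Type: application/json" \
    "${OPENSEARCH_URL:-localhost:9200}/books?pretty" -d "$BOOKS_INDEX_BODY"
}
```

With this form only `title` uses kuromoji and other text fields keep the standard analyzer. Passing `"field": "title"` instead of relying on the index default in the `_analyze` request body should show the tokens for that field.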

$ curl  -H "Content-Type: application/json" "localhost:9200/books/_analyze?pretty" -d '{"text": "良いコードで学ぶ設計入門"}'
{
  "tokens" : [
    {
      "token" : "良い",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "コード",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "で",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
...
  ]
}

The text is now tokenized along Japanese morpheme boundaries!
This makes it possible to search by Japanese word.
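
To see the word-level matching in an actual search, something like the following sketch should work. The field name `title`, the `OPENSEARCH_URL` variable, and the function names are assumptions for illustration; the post itself only checks tokens with `_analyze`.

```shell
#!/usr/bin/env bash
# Sketch: index one document and run a match query against it.
# The "title" field and all function names are illustrative assumptions.
OPENSEARCH_URL="${OPENSEARCH_URL:-localhost:9200}"

# Index a sample document; refresh=true makes it searchable right away.
index_sample_book() {
  curl -sS -X POST -H "Content-Type: application/json" \
    "${OPENSEARCH_URL}/books/_doc?refresh=true&pretty" \
    -d '{"title": "良いコードで学ぶ設計入門"}'
}

# Build a match query body for field $1 and search text $2.
# No JSON escaping is done, so this is only suitable for simple terms.
match_query_body() {
  printf '{"query": {"match": {"%s": "%s"}}}' "$1" "$2"
}

# Search the title field of the books index for the text in $1.
search_books() {
  curl -sS -H "Content-Type: application/json" \
    "${OPENSEARCH_URL}/books/_search?pretty" \
    -d "$(match_query_body title "$1")"
}

# Usage (run inside the container):
#   index_sample_book
#   search_books "良い"   # should hit the sample document
#   search_books "い"     # should no longer hit, since い alone is not a token
```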

With this setup, however, the particle で also ends up as a search term.
To drop it, add kuromoji_part_of_speech as a Token Filter.

$ curl -X DELETE "localhost:9200/books?pretty"
$ curl -X PUT -H "Content-Type: application/json"  "localhost:9200/books?pretty" -d'
{
    "settings":{
        "analysis":{
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": ["kuromoji_part_of_speech"]
                }
            }
        }
    }
}'
$ curl  -H "Content-Type: application/json" "localhost:9200/books/_analyze?pretty" -d '{"text": "良いコードで学ぶ設計入門"}'
{
  "tokens" : [
    {
      "token" : "良い",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "コード",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "学ぶ",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "設計",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "入門",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 5
    }
  ]
}

で is gone from the tokens, and 良い, コード, 学ぶ, 設計, and 入門 are now the search tokens 👍
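
As a side note, deleting and recreating the index for every tweak gets tedious. The `_analyze` API also accepts a tokenizer and filters directly in the request body, so a combination can be tried without any index at all. A minimal sketch, where the `OPENSEARCH_URL` variable and the function names `analyze_body` and `analyze_adhoc` are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: try a tokenizer/filter combination without creating an index.
# Function names and OPENSEARCH_URL are illustrative assumptions.

# Build an _analyze request body for the text in $1.
# No JSON escaping is done, so this is only suitable for simple text.
analyze_body() {
  printf '{"tokenizer": "kuromoji_tokenizer", "filter": ["kuromoji_part_of_speech"], "text": "%s"}' "$1"
}

# Send the body to the index-less _analyze endpoint.
analyze_adhoc() {
  curl -sS -H "Content-Type: application/json" \
    "${OPENSEARCH_URL:-localhost:9200}/_analyze?pretty" \
    -d "$(analyze_body "$1")"
}

# Usage (run inside the container):
#   analyze_adhoc "良いコードで学ぶ設計入門"
```

Once the output looks right, the same tokenizer and filter names can be copied into settings.analysis.analyzer.default.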