Trying Full-Text Search with Amazon OpenSearch Service

Published on 2025/03/11

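Trying full-text search with OpenSearch

Introduction

I previously implemented a site search feature with Amazon OpenSearch Service. To refresh my memory and organize what I learned, this article summarizes how to use it and the features I relied on at the time.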

Test environment

  • Amazon OpenSearch Service
    • Amazon OpenSearch Serverless is available, which makes environment setup easy
  • Everything below was run from OpenSearch Dashboards

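Preparing for search

  • Create a collection
  • Create an index in the collection
  • Load the data to be searched into the collection and index it

Creating a collection

  • A collection is a logical grouping of one or more indexes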

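Creating an index

Name | Description
Index | A data structure built so that the contents of documents or databases can be searched efficiently

Example index definition for English documents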

PUT /en_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"],
          "char_filter": ["quote_remover"]
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "char_filter": {
        "quote_remover": {
          "type": "mapping",
          "mappings": [
            "\" => ",
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "custom_normalizer"
          }
        }
      },
      "content": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "custom_normalizer"
          }
        }
      }
    }
  }
}
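
Request bodies of this size are easy to break with a stray trailing comma or an unbalanced brace. As a minimal sketch (not part of the original setup), a body can be checked locally with Python's standard json module before it is pasted into the console. The helper name check_body is hypothetical.

```python
import json


def check_body(body: str):
    """Return (True, parsed_dict) for strict JSON, else (False, error text)."""
    try:
        return True, json.loads(body)
    except json.JSONDecodeError as err:
        # lineno / colno point at the character where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"


# A well-formed normalizer fragment
good = '{"custom_normalizer": {"type": "custom", "filter": ["lowercase", "asciifolding"]}}'
# The same fragment with a trailing comma after the filter list
bad = '{"custom_normalizer": {"type": "custom", "filter": ["lowercase", "asciifolding"],}}'

print(check_body(good)[0])  # True
print(check_body(bad)[0])   # False
```

The second element of the returned tuple carries the position of the problem, which helps when the definition spans dozens of lines.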

Index field types

https://opensearch.org/docs/latest/field-types/supported-field-types/index/

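Type | Description
text | Stores text split into tokens (words) for searching. Also supports fuzzy search that tolerates misspellings and typos, so relevant content can be found even when the spelling is not exact.
keyword | Suited to exact-match search, and used for filtering, aggregation (for example counts and grouping), and sorting. The text is not tokenized; the whole value is treated as a single token. This makes it a good fit for exact values such as tags, categories, and IDs, and lets the engine run precise lookups and data operations.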

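Processing before indexing

Analyzer

An analyzer applies the following steps in this order:

Step | Description
Character filter | Transforms the raw text with the specified char_filter, for example stripping HTML tags
Tokenizer | Splits the text into tokens with the specified tokenizer
Token filter | Transforms the tokens with the specified filter, for example lowercasing

Normalizer

Step | Description
Token filter | Transforms the value with the specified filter, for example lowercasing

Note: a normalizer has no tokenizer and treats the whole value as a single token. It accepts only a restricted set of filters (those that work character by character). In the index definition above, the normalizer uses token filters only and does not apply the quote_remover character filter.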
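
The analyzer and normalizer steps can be approximated in a few lines of Python. This is only a rough sketch for intuition, not OpenSearch's implementation: the regular expression stands in for the standard tokenizer, Unicode decomposition stands in for asciifolding (the real filter covers more characters), and all function names are hypothetical. It mirrors custom_analyzer and custom_normalizer from the index definition above.

```python
import re
import unicodedata

# Stand-in for the "quote_remover" mapping char filter: drop both quote characters
QUOTE_MAPPINGS = {'"': "", "'": ""}


def char_filter(text: str) -> str:
    """Character filter: runs on the raw text before tokenization."""
    for source, target in QUOTE_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def tokenize(text: str) -> list:
    """Very rough stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token: str) -> str:
    """Approximate asciifolding by stripping combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list:
    """text field: char filter -> tokenizer -> token filters (lowercase, folding)."""
    return [ascii_fold(token.lower()) for token in tokenize(char_filter(text))]


def normalize(value: str) -> str:
    """keyword field: no tokenizer, the whole value stays one token."""
    return ascii_fold(value.lower())


print(analyze("I don't have a Café"))  # ['i', 'dont', 'have', 'a', 'cafe']
print(normalize("I am a Cat"))         # i am a cat
```

The contrast between the two return types is the point: the text field yields a list of searchable tokens, while the keyword field yields one normalized string.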

Indexing the data

https://www.aozora.gr.jp/cards/000148/files/789_14547.html

PUT /en_index/_doc/1
{
  "title": "I am a cat",
  "content": "I am a cat. I don't have a name yet. I have no idea where I was born. All I remember is that he used to meow and cry in a dim, dank place. It was here that I first saw a human being. I was told later that they were the most ferocious and vicious of all human beings, called shosei. It is said that these 'shosei' would sometimes catch us and boil us for food. But at the time, I had no idea what to expect, so I didn't think of it as anything to be afraid of. It was just that when I was placed in his palm and lifted up, I felt a fluffy sensation. I guess that was the first time I saw the face of the calligrapher after I had settled down a little on his palm. I still remember how strange I thought it was. First of all, his face, which should have been decorated with hair, was so smooth that it looked like a medicine can. I have met many cats since then, but I have never encountered a cat with such a one-ringed face. Not only that, but the center of its face was so protruding. The smoke sometimes blew out from the hole. It seemed to be choking, which made me feel really weak. It was around this time that I finally learned that this was the kind of tobacco people drank."
}

https://www.aozora.gr.jp/cards/001095/files/57478_62862.html

PUT /en_index/_doc/2
{
  "title": "Ieyasu",
  "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
}

Fuzzy search

Fuzzy search with a single search term

POST /en_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "IEASU",
        "analyzer": "custom_analyzer",
        "fuzziness": "AUTO"
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Result

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.42411238,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 0.42411238,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "content": [
            "Tokugawa <em>Ieyasu</em> is known to be a raccoon dog.",
            "However, it cannot be said that <em>Ieyasu</em> was a native Mikawa raccoon.",
            "<em>Ieyasu</em> in his later years was by all accounts a raccoon, and it is said that <em>Ieyasu</em> had been a raccoon",
            "<em>Ieyasu</em> had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during",
            "that time, <em>Ieyasu</em> was already in control of the country, with the lords and princes leaning almost entirely"
          ]
        }
      }
    ]
  }
}

Fuzzy AND search with two search terms

Query for the search terms: ieasu hoge

POST /en_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ieasu hoge",
        "operator": "AND",
        "analyzer": "custom_analyzer",
        "fuzziness": "AUTO"
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Result

{
  "took": 36,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6435634,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 0.6435634,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "content": [
            "Tokugawa <em>Ieyasu</em> is known to be a raccoon dog.",
            "However, it cannot be said that <em>Ieyasu</em> was a native Mikawa raccoon.",
            "<em>Ieyasu</em> in his later years was by all accounts a raccoon, and it is said that <em>Ieyasu</em> had been a raccoon",
            "<em>Ieyasu</em> had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during",
            "his thirteenth year and remained a mere exile for thirty years until his middle age, with no other <em>hope</em>"
          ]
        }
      }
    ]
  }
}

Query for the search terms: ieasu hogefoo

POST /en_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ieasu hogefoo",
        "operator": "AND",
        "analyzer": "custom_analyzer",
        "fuzziness": "AUTO"
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Result

{
  "took": 24,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
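
The three results above are consistent with how "fuzziness": "AUTO" is documented to behave: terms of 1 or 2 characters must match exactly, terms of 3 to 5 characters may differ by one edit, and longer terms may differ by two edits. The following Python sketch reproduces that rule for single terms. It is an illustration only: it uses plain Levenshtein distance (the engine also counts a transposition as one edit by default), and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed by AUTO, based on the query term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


print(fuzzy_match("ieasu", "ieyasu"))  # True  (one inserted "y")
print(fuzzy_match("hoge", "hope"))     # True  (one substituted letter)
print(fuzzy_match("hogefoo", "hope"))  # False (four edits, two allowed)
```

This also explains why "hoge" unexpectedly highlighted "hope" in the AND query: a short nonsense word can still land within one edit of a real term.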

Fuzzy OR search with two search terms

Query for the search terms: ieasu hogefoo

POST /en_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ieasu hogefoo",
        "operator": "OR",
        "analyzer": "custom_analyzer",
        "fuzziness": "AUTO"
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Result

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.42411238,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 0.42411238,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "content": [
            "Tokugawa <em>Ieyasu</em> is known to be a raccoon dog.",
            "However, it cannot be said that <em>Ieyasu</em> was a native Mikawa raccoon.",
            "<em>Ieyasu</em> in his later years was by all accounts a raccoon, and it is said that <em>Ieyasu</em> had been a raccoon",
            "<em>Ieyasu</em> had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during",
            "that time, <em>Ieyasu</em> was already in control of the country, with the lords and princes leaning almost entirely"
          ]
        }
      }
    ]
  }
}
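
The AND and OR requests differ only in the operator value, so when the queries are generated from application code it can help to build the body in one place. A minimal Python sketch, assuming the body is later sent with whatever HTTP or OpenSearch client the application uses; build_match_query is a hypothetical helper, not part of any library.

```python
import json


def build_match_query(field, text, operator="OR", fuzziness="AUTO", analyzer=None):
    """Build a match query body with highlighting on the same field."""
    params = {"query": text, "operator": operator, "fuzziness": fuzziness}
    if analyzer is not None:
        params["analyzer"] = analyzer
    return {
        "query": {"match": {field: params}},
        "highlight": {"fields": {field: {}}},
    }


# Every term must match (fuzzily) for a document to be returned
and_body = build_match_query("content", "ieasu hogefoo",
                             operator="AND", analyzer="custom_analyzer")
# At least one term must match; OR is also the match query's default
or_body = build_match_query("content", "ieasu hogefoo",
                            analyzer="custom_analyzer")

print(json.dumps(or_body, indent=2))
```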

Match search

  • Search against the keyword-type field

Exact match

Query

GET /en_index/_search
{
  "query": {
    "wildcard": {
      "title.keyword":"i AM a cat"
    }
  },
  "highlight": {
    "fields": {
      "title.keyword": {}
    }
  }
}

Result

{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "en_index",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "I am a cat",
          "content": "I am a cat. I don't have a name yet. I have no idea where I was born. All I remember is that he used to meow and cry in a dim, dank place. It was here that I first saw a human being. I was told later that they were the most ferocious and vicious of all human beings, called shosei. It is said that these 'shosei' would sometimes catch us and boil us for food. But at the time, I had no idea what to expect, so I didn't think of it as anything to be afraid of. It was just that when I was placed in his palm and lifted up, I felt a fluffy sensation. I guess that was the first time I saw the face of the calligrapher after I had settled down a little on his palm. I still remember how strange I thought it was. First of all, his face, which should have been decorated with hair, was so smooth that it looked like a medicine can. I have met many cats since then, but I have never encountered a cat with such a one-ringed face. Not only that, but the center of its face was so protruding. The smoke sometimes blew out from the hole. It seemed to be choking, which made me feel really weak. It was around this time that I finally learned that this was the kind of tobacco people drank."
        },
        "highlight": {
          "title.keyword": [
            "<em>i am a cat</em>"
          ]
        }
      }
    ]
  }
}

Prefix match

Query

GET /en_index/_search
{
  "query": {
    "wildcard": {
      "title.keyword":"ieya*"
    }
  },
  "highlight": {
    "fields": {
      "title.keyword": {}
    }
  }
}

Result

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "title.keyword": [
            "<em>ieyasu</em>"
          ]
        }
      }
    ]
  }
}

Partial match

Query

GET /en_index/_search
{
  "query": {
    "wildcard": {
      "title.keyword":"*YA*"
    }
  },
  "highlight": {
    "fields": {
      "title.keyword": {}
    }
  }
}

Result

{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "title.keyword": [
            "<em>ieyasu</em>"
          ]
        }
      }
    ]
  }
}

Suffix match

Query

GET /en_index/_search
{
  "query": {
    "wildcard": {
      "title.keyword":"*YASU"
    }
  },
  "highlight": {
    "fields": {
      "title.keyword": {}
    }
  }
}

Result

{
  "took": 50,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "en_index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "Ieyasu",
          "content": "Tokugawa Ieyasu is known to be a raccoon dog. He was a tricks to destroy the Toyotomi family from Sekigahara to the battle of Osaka, and even his mother-in-law, who tormented his wife, was not an insolent raccoon, or she would not have made such a blatant accusation. This is because he is a relative of Raccoon Grandmother, and the people's intuition is well founded. However, it cannot be said that Ieyasu was a native Mikawa raccoon. Ieyasu in his later years was by all accounts a raccoon, and it is said that Ieyasu had been a raccoon until then, but if he had been a raccoon for more than fifty years, it would seem that his final performance would have been a little more clear-cut. Ieyasu had been in power for more than ten years between Sekigahara and the Battle of Osaka, and during that time, Ieyasu was already in control of the country, with the lords and princes leaning almost entirely toward him. He had been in power for more than ten years, and yet his methods were so blatant and clumsy that they were those of a first-time offender, not those of a habitual or natural-born criminal.Yoritomo was exiled to Izu in his thirteenth year and remained a mere exile for thirty years until his middle age, with no other hope than to write a love letter to the daughter of a powerful provincial family. Not only was he not directly responsible for the failures of his soldiers, but he also used his own orders to correct, prohibit, or even take advantage of the failures. In this way, he skillfully manipulated the flow of power from Kyoto to Kamakura without ever going to Kyoto."
        },
        "highlight": {
          "title.keyword": [
            "<em>ieyasu</em>"
          ]
        }
      }
    ]
  }
}
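
The four patterns used above (no wildcard, trailing *, surrounding *, leading *) can be emulated locally to check which titles a pattern should hit. This Python sketch translates a wildcard pattern into a regular expression, where * matches any run of characters and ? matches exactly one. Lowercasing both the pattern and the value is an assumption that stands in for custom_normalizer and agrees with the results above, where upper-case patterns matched. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str):
    """Translate * and ? into regex syntax and escape everything else."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern: str, value: str) -> bool:
    """Match the whole normalized value, as for a single keyword token."""
    return wildcard_to_regex(pattern.lower()).fullmatch(value.lower()) is not None


titles = ["I am a cat", "Ieyasu"]
for pattern in ["i AM a cat", "ieya*", "*YA*", "*YASU"]:
    print(pattern, [title for title in titles if wildcard_match(pattern, title)])
```

On a real index, patterns that begin with * tend to be expensive because many terms have to be scanned, so leading wildcards are worth using sparingly.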

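Common requirements

Multilingual support

Supported languages

Languages the author has worked with:

  • English
  • French
  • Spanish
  • Russian
  • Chinese (Traditional/Simplified)

A suitable tokenizer must be specified for each language

To split text into words (tokens), you need to specify a tokenizer suited to the language.
Languages that separate words with spaces, such as English and French, are easy to split into tokens. Languages without spaces between words, such as Japanese, Chinese, and Korean (the CJK languages), require word segmentation (splitting the text into meaningful words).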
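
The difference is easy to see with a naive whitespace split. The Python sketch below is only an illustration of the problem, not of how any OpenSearch tokenizer works: the English sentence breaks into words, while the Japanese sentence comes back as one unsplit string.

```python
english = "I am a cat. I don't have a name yet."
japanese = "吾輩は猫である。名前はまだない。"


def whitespace_tokens(text: str) -> list:
    """Split on whitespace only, with no language knowledge."""
    return text.split()


print(len(whitespace_tokens(english)))  # 10
print(whitespace_tokens(japanese))      # ['吾輩は猫である。名前はまだない。']
```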

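OpenSearch built-in tokenizers
Tokenizers supported by Amazon OpenSearch Serverless

Language | Name | Description
English, French, etc. | standard | Splits text into words on whitespace and punctuation
Japanese | kuromoji | Performs morphological analysis and splits text into meaningful units
Chinese (Traditional/Simplified) | icu | Splits text into meaningful units
Chinese (Simplified) | smartcn | Splits text into meaningful units
Korean | icu | Segments Hangul text into words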
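
When documents arrive in several languages, the tokenizer choice can be kept in one lookup table. A minimal Python sketch that builds the body of an _analyze request. The tokenizer names are the ones used in this article, while the language codes, the fallback to standard, and the helper name are assumptions made for illustration.

```python
import json

# Language code -> tokenizer name (codes are illustrative)
TOKENIZER_BY_LANGUAGE = {
    "en": "standard",
    "fr": "standard",
    "es": "standard",
    "ru": "standard",
    "zh": "icu_tokenizer",
    "ja": "kuromoji_tokenizer",
}


def build_analyze_body(language: str, text: str) -> dict:
    """Return the JSON body for POST _analyze for the given language."""
    # Assumption: unknown languages fall back to the standard tokenizer
    tokenizer = TOKENIZER_BY_LANGUAGE.get(language, "standard")
    return {"tokenizer": tokenizer, "text": text}


print(json.dumps(build_analyze_body("ja", "吾輩は猫である。"), ensure_ascii=False))
```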

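Features of the ICU (International Components for Unicode) tokenizer

  • Unicode-based multilingual support (notably the CJK languages Japanese, Chinese, and Korean, plus many other languages)
  • Advanced tokenization (words are split correctly; if a language-specific tokenizer exists, prefer that one)
  • Unicode normalization (converting special characters)
  • Folding (accent removal, katakana-to-hiragana conversion, and so on)
  • Can also be used for pinyin conversion (Chinese) and as preprocessing for spelling correction

Normalization, folding, and transliteration are provided by the ICU plugin's companion filters (icu_normalizer, icu_folding, icu_transform) rather than by the tokenizer itself.

English / French / Spanish / Russian

tokenizer : standard

English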

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "I am a cat. I don't have a name yet."
}

{
  "tokens": [
    { "token": "i", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 },
    { "token": "am", "start_offset": 2, "end_offset": 4, "type": "<ALPHANUM>", "position": 1 },
    { "token": "a", "start_offset": 5, "end_offset": 6, "type": "<ALPHANUM>", "position": 2 },
    { "token": "cat", "start_offset": 7, "end_offset": 10, "type": "<ALPHANUM>", "position": 3 },
    { "token": "i", "start_offset": 12, "end_offset": 13, "type": "<ALPHANUM>", "position": 4 },
    { "token": "don't", "start_offset": 14, "end_offset": 19, "type": "<ALPHANUM>", "position": 5 },
    { "token": "have", "start_offset": 20, "end_offset": 24, "type": "<ALPHANUM>", "position": 6 },
    { "token": "a", "start_offset": 25, "end_offset": 26, "type": "<ALPHANUM>", "position": 7 },
    { "token": "name", "start_offset": 27, "end_offset": 31, "type": "<ALPHANUM>", "position": 8 },
    { "token": "yet", "start_offset": 32, "end_offset": 35, "type": "<ALPHANUM>", "position": 9 }
  ]
}

French

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Je suis un chat. Je n'ai pas encore de nom."
}

{
  "tokens": [
    { "token": "je", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 },
    { "token": "suis", "start_offset": 3, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "un", "start_offset": 8, "end_offset": 10, "type": "<ALPHANUM>", "position": 2 },
    { "token": "chat", "start_offset": 11, "end_offset": 15, "type": "<ALPHANUM>", "position": 3 },
    { "token": "je", "start_offset": 17, "end_offset": 19, "type": "<ALPHANUM>", "position": 4 },
    { "token": "n'ai", "start_offset": 20, "end_offset": 24, "type": "<ALPHANUM>", "position": 5 },
    { "token": "pas", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 6 },
    { "token": "encore", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 7 },
    { "token": "de", "start_offset": 36, "end_offset": 38, "type": "<ALPHANUM>", "position": 8 },
    { "token": "nom", "start_offset": 39, "end_offset": 42, "type": "<ALPHANUM>", "position": 9 }
  ]
}

Spanish

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Soy un gato. Todavía no tengo nombre."
}

{
  "tokens": [
    { "token": "soy", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "un", "start_offset": 4, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "gato", "start_offset": 7, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "todavía", "start_offset": 13, "end_offset": 20, "type": "<ALPHANUM>", "position": 3 },
    { "token": "no", "start_offset": 21, "end_offset": 23, "type": "<ALPHANUM>", "position": 4 },
    { "token": "tengo", "start_offset": 24, "end_offset": 29, "type": "<ALPHANUM>", "position": 5 },
    { "token": "nombre", "start_offset": 30, "end_offset": 36, "type": "<ALPHANUM>", "position": 6 }
  ]
}

Russian

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Я - кошка. У меня пока нет имени."
}

{
  "tokens": [
    { "token": "я", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 },
    { "token": "кошка", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "у", "start_offset": 11, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 },
    { "token": "меня", "start_offset": 13, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "пока", "start_offset": 18, "end_offset": 22, "type": "<ALPHANUM>", "position": 4 },
    { "token": "нет", "start_offset": 23, "end_offset": 26, "type": "<ALPHANUM>", "position": 5 },
    { "token": "имени", "start_offset": 27, "end_offset": 32, "type": "<ALPHANUM>", "position": 6 }
  ]
}

Chinese (Traditional/Simplified)

tokenizer : icu_tokenizer

Chinese: Traditional

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "我是一隻貓。我還沒有名字。"
}

{
  "tokens": [
    { "token": "我是", "start_offset": 0, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "一隻", "start_offset": 2, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "貓", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "我", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "還沒有", "start_offset": 7, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "名字", "start_offset": 10, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 5 }
  ]
}

Chinese: Simplified

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "我是一只猫。我还没有名字。"
}

{
  "tokens": [
    { "token": "我是", "start_offset": 0, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "一只", "start_offset": 2, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "猫", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "我还", "start_offset": 6, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "没有", "start_offset": 8, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "名字", "start_offset": 10, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 5 }
  ]
}

Japanese

tokenizer : icu_tokenizer

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "吾輩は猫である。名前はまだない。"
}

{
  "tokens": [
    { "token": "吾輩", "start_offset": 0, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "は", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "猫", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "で", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "ある", "start_offset": 5, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "名前", "start_offset": 8, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "は", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "まだ", "start_offset": 11, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "ない", "start_offset": 13, "end_offset": 15, "type": "<IDEOGRAPHIC>", "position": 8 }
  ]
}

tokenizer : kuromoji_tokenizer

POST _analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "text": "吾輩は猫である。名前はまだない。"
}

{
  "tokens": [
    {"token": "吾輩", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "は", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1},
    {"token": "猫", "start_offset": 3, "end_offset": 4, "type": "word", "position": 2},
    {"token": "で", "start_offset": 4, "end_offset": 5, "type": "word", "position": 3},
    {"token": "ある", "start_offset": 5, "end_offset": 7, "type": "word", "position": 4},
    {"token": "名前", "start_offset": 8, "end_offset": 10, "type": "word", "position": 5},
    {"token": "は", "start_offset": 10, "end_offset": 11, "type": "word", "position": 6},
    {"token": "まだ", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7},
    {"token": "ない", "start_offset": 13, "end_offset": 15, "type": "word", "position": 8}
  ]
}

Converting special characters such as umlauts (ü), accents (é), and cedillas (ç) to ASCII

Use the asciifolding filter.

With preserve_original: false

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "asciifolding",
      "preserve_original": false
    }
  ],
  "text": "Café naïve façade rôle über"
}

{
  "tokens": [
    {"token": "Cafe", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
    {"token": "naive", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1},
    {"token": "facade", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 2},
    {"token": "role", "start_offset": 18, "end_offset": 22, "type": "<ALPHANUM>", "position": 3},
    {"token": "uber", "start_offset": 23, "end_offset": 27, "type": "<ALPHANUM>", "position": 4}
  ]
}

With preserve_original: true

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "asciifolding",
      "preserve_original": true
    }
  ],
  "text": "Café naïve façade rôle über"
}

{
  "tokens": [
    {"token": "Cafe", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
    {"token": "Café", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
    {"token": "naive", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1},
    {"token": "naïve", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1},
    {"token": "facade", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 2},
    {"token": "façade", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 2},
    {"token": "role", "start_offset": 18, "end_offset": 22, "type": "<ALPHANUM>", "position": 3},
    {"token": "rôle", "start_offset": 18, "end_offset": 22, "type": "<ALPHANUM>", "position": 3},
    {"token": "uber", "start_offset": 23, "end_offset": 27, "type": "<ALPHANUM>", "position": 4},
    {"token": "über", "start_offset": 23, "end_offset": 27, "type": "<ALPHANUM>", "position": 4}
  ]
}
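
The effect of preserve_original can be mimicked in Python to see why the second response contains ten tokens at five positions. This is a sketch under stated assumptions: Unicode decomposition approximates asciifolding for the accented Latin letters in this example only, and fold_tokens is a hypothetical helper.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Strip combining marks after decomposition (approximates asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_tokens(tokens, preserve_original=False):
    """Return (token, position) pairs; originals share the folded token's position."""
    result = []
    for position, token in enumerate(tokens):
        folded = ascii_fold(token)
        result.append((folded, position))
        if preserve_original and folded != token:
            result.append((token, position))
    return result


tokens = "Café naïve façade rôle über".split()
print([t for t, _ in fold_tokens(tokens)])
# ['Cafe', 'naive', 'facade', 'role', 'uber']
print(len(fold_tokens(tokens, preserve_original=True)))  # 10
```

Keeping the original at the same position lets a query for either "Café" or "Cafe" match, at the cost of a larger index.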

Synonym search

Example: absorbing spelling differences between American and British English

Index

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "color, colour",
            "favorite, favourite",
            "organize, organise",
            "realize, realise",
            "honor, honour",
            "theater, theatre",
            "analyze, analyse"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}

Loading data

  • Load documents covering both color (US) and colour (UK)
  • Load documents covering both favorite (US) and favourite (UK)
  • Load documents covering both organize (US) and organise (UK)

PUT my_index/_doc/1
{
  "content": "The color of the sky is blue."
}

PUT my_index/_doc/2
{
  "content": "The colour of the ocean is deep blue."
}

PUT my_index/_doc/3
{
  "content": "I love my favorite book."
}

PUT my_index/_doc/4
{
  "content": "My favourite movie is Inception."
}

PUT my_index/_doc/5
{
  "content": "We need to organize the event soon."
}

PUT my_index/_doc/6
{
  "content": "They will organise the festival next month."
}

Search

Result

{
  "took": 71,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.40997607,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 0.40997607,
        "_source": {
          "content": "The color of the sky is blue."
        }
      },
      {
        "_index": "my_index",
        "_id": "2",
        "_score": 0.40832296,
        "_source": {
          "content": "The colour of the ocean is deep blue."
        }
      }
    ]
  }
}
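
The request that produced this result is not shown. A match query on content for either "color" or "colour" would be one plausible candidate, but that is an assumption. What the result does show is that both spellings are treated as the same term. The Python sketch below imitates that behaviour with equivalence groups taken from the synonyms list. It is a toy model with hypothetical function names and does not reproduce scoring.

```python
import re

SYNONYM_RULES = ["color, colour", "favorite, favourite", "organize, organise"]


def build_synonym_table(rules):
    """Map every word to the full set of words in its synonym group."""
    table = {}
    for rule in rules:
        group = {word.strip() for word in rule.split(",")}
        for word in group:
            table[word] = group
    return table


def analyze(text, table):
    """Lowercase, split into words, then expand each token to its synonyms."""
    expanded = set()
    for token in re.findall(r"\w+", text.lower()):
        expanded |= table.get(token, {token})
    return expanded


def search(query, docs, table):
    """Return ids of documents sharing at least one expanded term with the query."""
    query_terms = analyze(query, table)
    return [doc_id for doc_id, text in docs.items()
            if query_terms & analyze(text, table)]


table = build_synonym_table(SYNONYM_RULES)
docs = {
    1: "The color of the sky is blue.",
    2: "The colour of the ocean is deep blue.",
    3: "I love my favorite book.",
    4: "My favourite movie is Inception.",
}
print(search("colour", docs, table))    # [1, 2]
print(search("favorite", docs, table))  # [3, 4]
```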

Impressions

  • The official documentation has not been translated, so reading it takes some effort
  • Developing features while chatting with ChatGPT works well
