
Aggregations with Different Keys and Values (Labels and IDs) in Elasticsearch

Published 2022/07/04

Elasticsearch

tech

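Overview

I am currently updating the search application for the Cultural Japan project, which required aggregating multilingual data. This article is a memo of what I found while investigating how to do that.

Data

As sample data, assume that the agential field (which represents a person) holds objects with id, ja, and en values, as shown below.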

{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}

Against this data, the goal is to filter by id while displaying either the ja or the en value, depending on the language setting.

Ideally, the aggregation would return data like the following.

(When ja is specified)

{
  "buckets": [
    {
      "key": "葛飾北斎",
      "id": "chname:葛飾北斎",
      "doc_count": 1
    }
  ]
}

(When en is specified)

{
  "buckets": [
    {
      "key": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎",
      "doc_count": 1
    }
  ]
}

Method 1: Using a nested aggregation

Following a reference article, I try a nested aggregation.

DELETE test 

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "agential",
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "agential.id": "chname:葛飾北斎"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "nested": {
        "path": "agential"
      },
      "aggs": {
        "id": {
          "terms": {
            "field": "agential.id"
          }
        },
        "label": {
          "terms": {
            "field": "agential.ja"
          }
        }
      }
    }
  }
}

In this case, the following result is returned.

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count" : 1,
      "label" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "葛飾北斎",
            "doc_count" : 1
          }
        ]
      },
      "id" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "chname:葛飾北斎",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}

aggregations.agential returns label and id as two separate bucket lists rather than as paired entries, so the result seems redundant.
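One possible variation, which was not tested in this investigation, is to place the label terms aggregation under the id terms aggregation so that each id bucket carries its own label. The Python sketch below assumes such a response shape; the sub-aggregation name label and the function flatten_id_label_buckets are hypothetical names chosen for illustration.

```python
from typing import Any


def flatten_id_label_buckets(nested_agg: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten id buckets that each contain a `label` sub-aggregation.

    Assumption: `nested_agg` is the content of aggregations.agential when the
    request nests a terms aggregation named `label` under the `id` terms
    aggregation (a hypothetical variation of Method 1).
    """
    flattened = []
    for id_bucket in nested_agg["id"]["buckets"]:
        label_buckets = id_bucket["label"]["buckets"]
        # Take the first label bucket; fall back to the id if there is none.
        label = label_buckets[0]["key"] if label_buckets else id_bucket["key"]
        flattened.append(
            {
                "key": label,
                "id": id_bucket["key"],
                "doc_count": id_bucket["doc_count"],
            }
        )
    return flattened


# Hand-written sample in the assumed shape (not an actual server response).
sample = {
    "doc_count": 1,
    "id": {
        "buckets": [
            {
                "key": "chname:葛飾北斎",
                "doc_count": 1,
                "label": {"buckets": [{"key": "葛飾北斎", "doc_count": 1}]},
            }
        ]
    },
}

print(flatten_id_label_buckets(sample))
```

The flattened entries have the same key, id, and doc_count fields as the ideal output described in the Data section.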

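Method 2: String concatenation

I try the approach described in the question of a reference post.

I prepared an fc-agential field and stored in it the id, ja, and en values joined with $$$ as the delimiter.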

DELETE test 

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      },
      "fc-agential": {
        "type": "keyword"
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ],
  "fc-agential": [
    "chname:葛飾北斎$$$葛飾北斎$$$Katsushika, Hokusai"
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "agential.id": "chname:葛飾北斎"
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "terms": {
        "field": "fc-agential"
      }
    }
  }
}

In this case, the following result is returned.

{
  "took" : 964,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "chname:葛飾北斎$$$葛飾北斎$$$Katsushika, Hokusai",
          "doc_count" : 1
        }
      ]
    }
  }
}

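Because nested is not used, the query and the result are more concise, but each returned key has to be split back into id, ja, and en after retrieval.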
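The joining and the post-processing step can be sketched in Python as follows. This is a minimal sketch that assumes the id, ja, en order and the $$$ delimiter used in this example, and that none of the values contain the delimiter. The function names join_agent and split_buckets are hypothetical.

```python
from typing import Any

DELIMITER = "$$$"
FIELDS = ("id", "ja", "en")  # order in which the values were joined


def join_agent(agent: dict[str, str]) -> str:
    """Build the value stored in fc-agential from one agential object."""
    return DELIMITER.join(agent[field] for field in FIELDS)


def split_buckets(buckets: list[dict[str, Any]], lang: str) -> list[dict[str, Any]]:
    """Turn joined-key buckets into {key, id, doc_count} for one language."""
    result = []
    for bucket in buckets:
        parts = bucket["key"].split(DELIMITER)
        if len(parts) != len(FIELDS):
            # A value contained the delimiter or a field was missing.
            raise ValueError(f"unexpected key format: {bucket['key']!r}")
        agent = dict(zip(FIELDS, parts))
        result.append(
            {"key": agent[lang], "id": agent["id"], "doc_count": bucket["doc_count"]}
        )
    return result


# Buckets copied from aggregations.agential.buckets in the result above.
buckets = [
    {"key": "chname:葛飾北斎$$$葛飾北斎$$$Katsushika, Hokusai", "doc_count": 1}
]

print(split_buckets(buckets, "ja"))
print(split_buckets(buckets, "en"))
```

Keeping the field order in a single constant means the indexing side and the display side cannot drift apart, which is the main risk of this approach.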

方法3: a nested top-hits aggregation?

I try the approach given as an answer in a reference post.

DELETE test 

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "agential.id": "chname:葛飾北斎"
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "terms": {
        "field": "agential.ja"
      },
      "aggs": {
        "doc": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

In this case, the following result is returned.

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "葛飾北斎",
          "doc_count" : 1,
          "doc" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 0.0,
              "hits" : [
                {
                  "_index" : "test",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_score" : 0.0,
                  "_source" : {
                    "agential" : [
                      {
                        "ja" : "葛飾北斎",
                        "en" : "Katsushika, Hokusai",
                        "id" : "chname:葛飾北斎"
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

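Each bucket in aggregations.agential includes one sample hit, from which the relationship between id and ja can be extracted. Compared with Method 2, the advantage is that no extra field such as fc-agential is needed, but this result also seems redundant.

Summary

This article summarized my investigation into aggregations with different keys and values (labels and IDs) in Elasticsearch. I hope it is helpful.

There are probably better approaches than the three listed above. If you know of one, I would appreciate hearing about it.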
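Extracting that relationship could look like the Python sketch below. It assumes the response shape shown above, with the top_hits sub-aggregation named doc; the function name buckets_from_top_hits is hypothetical. Because the sample hit's _source contains the whole agential array, the sketch looks for the entry whose label equals the bucket key.

```python
from typing import Any, Optional


def buckets_from_top_hits(
    buckets: list[dict[str, Any]], lang: str
) -> list[dict[str, Any]]:
    """Attach an id to each label bucket using its top_hits sample document."""
    result = []
    for bucket in buckets:
        hits = bucket["doc"]["hits"]["hits"]
        agents = hits[0]["_source"]["agential"] if hits else []
        # A document may hold several agents, so match on the label value.
        match: Optional[dict[str, str]] = next(
            (agent for agent in agents if agent.get(lang) == bucket["key"]), None
        )
        result.append(
            {
                "key": bucket["key"],
                "id": match["id"] if match else None,
                "doc_count": bucket["doc_count"],
            }
        )
    return result


# Trimmed version of aggregations.agential.buckets from the result above.
buckets = [
    {
        "key": "葛飾北斎",
        "doc_count": 1,
        "doc": {
            "hits": {
                "hits": [
                    {
                        "_source": {
                            "agential": [
                                {
                                    "ja": "葛飾北斎",
                                    "en": "Katsushika, Hokusai",
                                    "id": "chname:葛飾北斎",
                                }
                            ]
                        }
                    }
                ]
            }
        },
    }
]

print(buckets_from_top_hits(buckets, "ja"))
```

The lang argument must match the field used in the terms aggregation (agential.ja in the request above); otherwise no entry matches and id is left as None.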
