Trying out Japanese-SimCSE right away

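It has already drawn some attention: a large number of repositories named Japanese-SimCSE have appeared on HuggingFace.
No README has been written yet, so the details are unknown, but judging from the name I suspect these are Japanese language models trained in the same way as the original SimCSE.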

Judging from tokenizer_config.json, the underlying language model appears to be cl-tohoku/bert-base-japanese.

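If the training follows the original, my guess is:

Models with the -unsup suffix: contrastive learning where the same sentence encoded with a different dropout mask is the positive and other sentences are negatives
Models with the -sup suffix: contrastive learning on an NLI dataset, where the entailment sentence is the positive and the contradiction sentence is the negative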
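
The two training setups guessed above can be sketched as one contrastive (InfoNCE-style) loss. This is only a minimal plain-Python sketch under my own assumptions, not the training code of these models: the function names `cosine_similarity` and `contrastive_loss` are hypothetical, the temperature of 0.05 is the value used in the original SimCSE paper, and real training would run on batches of encoder outputs (in the unsupervised case, two forward passes of the same sentence with different dropout masks).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, hard_negatives=None, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] form a positive pair; every other positive
    in the batch acts as an in-batch negative. hard_negatives (optional)
    adds extra negatives, as in the supervised setup guessed above
    (contradiction sentences).
    """
    candidates = list(positives)
    if hard_negatives is not None:
        candidates += list(hard_negatives)

    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true positive (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" for illustration only.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each positive matches its anchor
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped

print(contrastive_loss(anchors, aligned))
print(contrastive_loss(anchors, misaligned))
```

The loss is small when each anchor is closest to its own positive and large when another sentence in the batch is closer, which is the behavior both the -unsup and -sup setups would rely on.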
For Japanese NLI datasets, the candidates I can think of are JSNLI from the Kurohashi Lab, JaNLI by Yanaka et al., and JNLI, which is included in JGLUE. I wonder which one was used.

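(All of this is loose guesswork on my part, so if an official publication from the university comes out at this year's annual meeting (年次大会), JSAI, or a similar venue, please refer to that instead.)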

Kaito Sugimoto

I will run it the same way the original SimCSE is run.
If you are just using it on Google Colab, !pip install transformers[ja] is all the setup you need.

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the supervised base model and its tokenizer from the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained("MU-Kindai/Japanese-SimCSE-BERT-base-sup")
model = AutoModel.from_pretrained("MU-Kindai/Japanese-SimCSE-BERT-base-sup")

texts = [
    "私はご飯が好きだ。",        # "I like rice."
    "私は卵かけご飯が好きだ。",  # "I like rice with raw egg."
    "私はご飯が嫌いだ。"         # "I dislike rice."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Use pooler_output as the sentence embedding, as in the original SimCSE example.
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# scipy's cosine() returns cosine distance, so similarity is 1 - distance.
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

Cosine similarity between "私はご飯が好きだ。" and "私は卵かけご飯が好きだ。" is: 0.691
Cosine similarity between "私はご飯が好きだ。" and "私はご飯が嫌いだ。" is: 0.820

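I was hoping the cosine similarity between 「私はご飯が好きだ。」 ("I like rice.") and 「私は卵かけご飯が好きだ。」 ("I like rice with raw egg.") would be higher than the one between 「私はご飯が好きだ。」 and 「私はご飯が嫌いだ。」 ("I dislike rice."), but it came out the other way around (0.691 vs. 0.820).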

That said, I may simply be running the model the wrong way, so it is probably best to wait for the official README.
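
One thing that might differ from the intended usage is how the sentence vector is pooled from the BERT outputs. This is purely a guess on my part and is not confirmed by the model authors. As a sketch of alternatives to `pooler_output`, the hypothetical helpers below take the `last_hidden_state` tensor and `attention_mask` that a transformers BERT model and tokenizer return, and compute [CLS] pooling and mask-aware mean pooling. Dummy tensors are used here so the block runs without downloading a model.

```python
import torch


def cls_pooling(last_hidden_state):
    """Use the hidden state of the first token ([CLS]) as the sentence vector."""
    return last_hidden_state[:, 0]


def mean_pooling(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Dummy batch: 1 sentence, 3 token positions (the last one is padding), hidden size 2.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

print(cls_pooling(hidden))         # vector of the first token
print(mean_pooling(hidden, mask))  # mean over the two non-padding tokens
```

With a real model, these would be applied to `model(**inputs).last_hidden_state` and `inputs["attention_mask"]`; whether either choice gives better similarities for these models is something I have not verified.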


import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Same procedure as before; only the model name changes (large instead of base).
tokenizer = AutoTokenizer.from_pretrained("MU-Kindai/Japanese-SimCSE-BERT-large-sup")
model = AutoModel.from_pretrained("MU-Kindai/Japanese-SimCSE-BERT-large-sup")

texts = [
    "私はご飯が好きだ。",        # "I like rice."
    "私は卵かけご飯が好きだ。",  # "I like rice with raw egg."
    "私はご飯が嫌いだ。"         # "I dislike rice."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Use pooler_output as the sentence embedding.
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# scipy's cosine() returns cosine distance, so similarity is 1 - distance.
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

Cosine similarity between "私はご飯が好きだ。" and "私は卵かけご飯が好きだ。" is: 0.676
Cosine similarity between "私はご飯が好きだ。" and "私はご飯が嫌いだ。" is: 0.855

I also tried the large model for good measure, but the trend in the scores is the same (0.676 vs. 0.855).

This scrap was closed on 2024/02/10.