Trying speaker identification with voice embeddings using "pyannote.audio"

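This started with the following scrap:
https://zenn.dev/kun432/scraps/42a5884eb18da3
As an aside, I constantly mistake pyannote for "pyannotate" and mistype it all over the place. The typo shows up here and there below as well, so please ignore it.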

kun432

I previously tried pyannote.audio for speaker diarization.
https://zenn.dev/kun432/scraps/da47e9a971b117
For speaker identification, pyannote.audio can apparently be used as a wrapper around a model published by WeSpeaker, and this combination seems to be in common use.
https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM
I tried WeSpeaker itself here:
https://zenn.dev/kun432/scraps/96f69faa30801e

kun432

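pyannote.audio seems to have changed quite a bit since I last tried it.
https://github.com/pyannote/pyannote-audio
pyannote.audio has gone from version 3 to version 4.
Along with that, the speaker diarization model pyannote/speaker-diarization-3.1 is now treated as Legacy.
The current (Latest) diarization models are the following three.

pyannote/speaker-diarization-community-1: the open-source version

pyannote/speaker-diarization-community-1-cloud: the model above, for use on pyannote's cloud

pyannote/speaker-diarization-precision-2: available only on pyannote's cloud, i.e. the commercial version

It looks like they want to push the cloud offering.
This time I'm doing speaker identification, so the models above are probably not directly relevant, but the speaker embedding model pyannote/wespeaker-voxceleb-resnet34-LM is listed under both Latest and Legacy, which is confusing.
https://huggingface.co/collections/pyannote/latest
https://huggingface.co/collections/pyannote/legacy
I'll just try it and see.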

kun432

Model
https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM
Excerpt from the model card, translated with PLaMo Translate.
Using this open-source model in production?

Consider switching to pyannoteAI for better and faster options.

🎹 Wrapper around wespeaker-voxceleb-resnet34-LM
This model requires pyannote.audio version 3.1 or higher.
This is a wrapper around WeSpeaker's pretrained speaker embedding model wespeaker-voxceleb-resnet34-LM, for use in pyannote.audio.

Basic usage
# instantiate the pretrained model
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
from pyannote.audio import Inference
inference = Inference(model, window="whole")
embedding1 = inference("speaker1.wav")
embedding2 = inference("speaker2.wav")
# `embeddingX` is a (1 x D) NumPy array extracted from the file as a whole.

from scipy.spatial.distance import cdist
distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
# `distance` is a `float` describing how dissimilar speakers 1 and 2 are.

Advanced usage
Running on GPU
import torch
inference.to(torch.device("cuda"))
embedding = inference("audio.wav")

Extract an embedding from an excerpt
from pyannote.audio import Inference
from pyannote.core import Segment
inference = Inference(model, window="whole")
excerpt = Segment(13.37, 19.81)
embedding = inference.crop("audio.wav", excerpt)
# `embedding` is a (1 x D) NumPy array extracted from the file excerpt.

Extract embeddings using a sliding window
from pyannote.audio import Inference
inference = Inference(model, window="sliding",
                     duration=3.0, step=1.0)
embeddings = inference("audio.wav")
# `embeddings` is a (N x D) pyannote.core.SlidingWindowFeature.
# `embeddings[i]` is the embedding of the i-th position of the sliding window,
# i.e. the one extracted from [i * step, i * step + duration].

License
According to this page:
The pretrained model in WeNet follows the license of its corresponding dataset. For example, the pretrained model on VoxCeleb follows the Creative Commons Attribution 4.0 International License, since that is the license of the VoxCeleb dataset; see https://mm.kaist.ac.kr/datasets/voxceleb/ for details.

kun432

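When I tried WeSpeaker I got stuck in various ways on Mac and ended up running it on Ubuntu 22.04 (RTX 4090), so I'll use Ubuntu this time too.

Create a virtual environment with uv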

mkdir pyannotate-audio-si-work && cd $_
uv venv -p 3.12 --seed

Install pyannote.audio. PyTorch gets installed along with it, so adding --torch-backend=auto seems like a good idea.

uv pip install pyannote.audio --torch-backend=auto

Output

(snip)
 + pyannote-audio==4.0.2
 + pyannote-core==6.0.1
 + pyannote-database==6.1.0
 + pyannote-metrics==4.0.0
 + pyannote-pipeline==4.0.0
 + pyannoteai-sdk==0.3.0
(snip)

Update 2025/12/02

Because of a change in pytorch-lightning, which pyannote.audio depends on, just loading the model now raises an error.

https://github.com/pyannote/pyannote-audio/issues/1960

For now, the following works around it.

uv pip install "lightning==2.5.6"

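I prepared the following sample audio files.

voice_lunch_jp_15sec.wav: roughly the first 15 seconds of a meetup I hosted in the past. I am the only speaker.
my_sample.wav: a recording of my own voice, i.e. the same voice as the sample above. About 7 seconds.
other_sample.wav: audio generated with some TTS, a female voice, i.e. a different voice from the samples above. About 5 seconds.

I wrote the script following the basic usage in the model card.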

sample.py

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cdist

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
embedding1 = inference("voice_lunch_jp_15sec.wav")
embedding2 = inference("my_sample.wav")

distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
print(distance)

uv run sample.py

Error.

Output

(snip)
Traceback (most recent call last):
  File "/work/pyannotate-audio-si-work/sample.py", line 11, in <module>
    distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/pyannotate-audio-si-work/.venv/lib/python3.12/site-packages/scipy/spatial/distance.py", line 3111, in cdist
    raise ValueError('XA must be a 2-dimensional array.')
ValueError: XA must be a 2-dimensional array.

cdist expects a 2D array of shape (number of vectors × dimensions), but what actually comes back appears to be a 1D array.

get_embedding.py

from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
embedding = inference("voice_lunch_jp_15sec.wav")

print(type(embedding))
print(embedding.shape)
print(embedding)

Output

<class 'numpy.ndarray'>
(256,)
[-3.55442613e-02 -2.13369317e-02 -2.39439517e-01  3.19852889e-01
  5.58351129e-02 -6.36775941e-02 -3.59039456e-01 -3.78422141e-01
 -1.36710778e-01  8.83216932e-02  1.82751536e-01 -1.49366736e-01
(snip)
 -1.44914836e-01 -3.18021588e-02  1.81312919e-01 -2.00816944e-01
 -1.04253344e-01 -3.56079936e-02  1.42500252e-01  1.87294692e-01
  1.70893297e-01  5.65223023e-03  1.15630023e-01  2.67696381e-01]

Reshaping it to 2D seems to be enough. This was a useful reference.

The fixed version.

sample.py

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cdist
import numpy as np

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
embedding1 = inference("voice_lunch_jp_15sec.wav")
embedding2 = inference("my_sample.wav")

distance = cdist(
    np.reshape(embedding1, (1, -1)),
    np.reshape(embedding2, (1, -1)),
    metric="cosine"
)[0,0]
print(distance)

Output

0.3658447026444155

Since this is a distance, a smaller value means more similar and a larger value means less similar.
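As a reminder of what this number means, here is a minimal sketch that computes cosine distance by hand with NumPy. The vectors are made-up 2D values, not real embeddings, and `cosine_distance` is a helper name I introduce for illustration. Cosine distance is 1 minus cosine similarity, so it ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction), which is why values above 1 can appear.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```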

Let's compare every combination of the voices.

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cdist
import numpy as np

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
embedding1 = inference("voice_lunch_jp_15sec.wav")
embedding2 = inference("my_sample.wav")
embedding3 = inference("other_sample.wav")

distance_1_2 = cdist(
    np.reshape(embedding1, (1, -1)),
    np.reshape(embedding2, (1, -1)),
    metric="cosine"
)[0,0]
distance_1_3 = cdist(
    np.reshape(embedding1, (1, -1)),
    np.reshape(embedding3, (1, -1)),
    metric="cosine"
)[0,0]
distance_2_3 = cdist(
    np.reshape(embedding2, (1, -1)),
    np.reshape(embedding3, (1, -1)),
    metric="cosine"
)[0,0]
print(distance_1_2)
print(distance_1_3)
print(distance_2_3)

Only other_sample.wav is a different voice, so only the pairs that include this file have a large distance, i.e. are far apart.

Output

0.3658447026444155
0.936110343014266
0.9485621400233913

The embeddings should all have the same shape, so this works too.

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
embedding1 = inference("voice_lunch_jp_15sec.wav")
embedding2 = inference("my_sample.wav")
embedding3 = inference("other_sample.wav")

distance_1_2 = cosine(embedding1, embedding2)
distance_1_3 = cosine(embedding1, embedding3)
distance_2_3 = cosine(embedding2, embedding3)

print(distance_1_2)
print(distance_1_3)
print(distance_2_3)

Output

0.36584467
0.9361103
0.94856215
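The pairwise comparisons above can also be done in a single cdist call by stacking the 1D embeddings into one 2D array. The following is a minimal sketch under that assumption. The three vectors are made-up 3-dimensional stand-ins for the real 256-dimensional embeddings.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-ins for the three 1-D embeddings returned by Inference (made-up values).
embedding1 = np.array([1.0, 0.0, 0.0])
embedding2 = np.array([0.9, 0.1, 0.0])
embedding3 = np.array([0.0, 0.0, 1.0])

# Stack the 1-D vectors into one (3 x D) array so cdist gets 2-D input.
stacked = np.stack([embedding1, embedding2, embedding3])
distances = cdist(stacked, stacked, metric="cosine")

# distances[i, j] is the cosine distance between embedding i+1 and embedding j+1.
print(distances.round(3))
```

The result is a symmetric matrix with zeros on the diagonal, so only the upper triangle needs to be read.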

kun432

Using a GPU

from pyannote.audio import Model, Inference
import torch

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
inference.to(torch.device("cuda"))

embedding = inference("voice_lunch_jp_15sec.wav")
print(embedding)

Output

[-3.55444402e-02 -2.13369466e-02 -2.39439368e-01  3.19852412e-01
  5.58353625e-02 -6.36780411e-02 -3.59039158e-01 -3.78422350e-01
 -1.36710376e-01  8.83217081e-02  1.82751566e-01 -1.49365991e-01
(snip)
 -1.44914642e-01 -3.18023823e-02  1.81312919e-01 -2.00816840e-01
 -1.04253560e-01 -3.56084034e-02  1.42499521e-01  1.87294424e-01
  1.70893446e-01  5.65260276e-03  1.15629695e-01  2.67695606e-01]

In my environment, though, the following warning was also printed.

Output

/work/pyannotate-audio-si-work/.venv/lib/python3.12/site-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy.
It can be re-enabled by calling
   >>> import torch
   >>> torch.backends.cuda.matmul.allow_tf32 = True
   >>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.

  warnings.warn(

This is the Issue mentioned in the warning above.

I had Dia explain it.

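This Issue is about how TF32 on Ampere-generation GPUs makes the results of CUDA PyTorch computations drift, which changes speaker diarization accuracy a lot, and about the countermeasure (disabling TF32).

Let me quickly sum up just the important points. Seriously, this one matters for reproducibility.

What was happening?

It was reported that the output of pyannote.audio's speaker diarization pipeline changes quite a bit depending on the CUDA version and the type of GPU (especially Ampere and later: A100, RTX 30/40 series). Specifically, even on the same audio, the relative difference in DER between CUDA 11.6 and 11.7 was about 12%, which is a really painful amount of variance in real-world use.

The culprit behind this variance was TF32 (TensorFloat-32), which is enabled by default on Ampere. It speeds up matrix multiplications and convolutions, but numerical precision drops slightly, and that seems to have affected the decision boundaries at inference time (segmentation, and embedding → clustering).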
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
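What clinched it was that the results lined up once these settings were applied.
Solution (the practical compromise)

Explicitly disable TF32 (the two lines above). Reportedly this greatly improves reproducibility even on Ampere, and the difference in DER went away.

pyannote has incorporated the fix as well. The related PR is here: fix(pipeline): fix reproducibility issue with Ampere CUDA device (https://github.com/pyannote/pyannote-audio/pull/1381)

That's why recent versions of pyannote.audio deliberately emit a warning like ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled ... at runtime and prioritize reproducibility. In other words, "the warning appears = TF32 has been turned off to protect reproducibility", and it's not a scary error.

Why is TF32 a problem?

Very roughly, TF32 is like "an FP32 look-alike that is a bit coarser than FP32 but faster". In training, the errors tend to average out and are hard to notice, but at inference, when a decision boundary sits close to a threshold, a tiny numerical difference easily affects cluster assignment and boundary times. So with long audio or sessions with many speakers, the differences in output become easier to see.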

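Practical tips (in my order of preference)

Start by disabling TF32. It's a little slower, but reproducibility is everything.

Pin your PyTorch/pyannote versions to minimize differences between environments.

Fixing the random seed (torch.manual_seed and so on) and also keeping a CPU run for comparison makes it easier to isolate differences.

If you use float16 or bfloat16, watch out for "which layers run at that precision". The embedding and clustering parts are sensitive to precision, and mixing with TF32 can make things drift even more, so keep it consistent.

Conclusion

It seems fairly certain that the reproducibility problem with pyannote on Ampere GPUs goes away once TF32 is disabled.

pyannote.audio has adopted that approach, and the warning is just a notice saying "we turned TF32 off for reproducibility".

In production, pinned environment + TF32 off + fixed seed is what really matters for stable operation.

Related PR link: fix(pipeline): fix reproducibility issue with Ampere CUDA device (https://github.com/pyannote/pyannote-audio/pull/1381)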

So that's because I'm on an RTX 4090.
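To get a feel for how much precision TF32 gives up, here is a rough software sketch in NumPy. It is only an approximation I wrote for illustration. TF32 keeps 10 mantissa bits instead of FP32's 23, and this sketch simply zeroes the lower 13 bits of each input, whereas real hardware may round rather than truncate. `truncate_to_tf32` is a hypothetical helper, and the inputs are random made-up values.

```python
import numpy as np


def truncate_to_tf32(x):
    """Keep only the top 10 mantissa bits of float32 values.

    Rough stand-in for TF32 inputs: clears the low 13 of the 23 mantissa bits.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)


rng = np.random.default_rng(0)
a = rng.random(256, dtype=np.float32)  # made-up positive values
b = rng.random(256, dtype=np.float32)

# Accumulate in float64 so the only difference is the input precision.
exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
approx = float(np.dot(truncate_to_tf32(a).astype(np.float64),
                      truncate_to_tf32(b).astype(np.float64)))

print(exact, approx, (exact - approx) / exact)
```

The relative error of a single dot product is small, but a model chains many such operations, and a value that lands near a decision threshold can end up on the other side.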

kun432

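Extracting a voice embedding from a specified segment

Specify a particular segment of the audio and extract a voice embedding from it.

The demo repository of Soniox, an ASR service I tried before, contains a recording of a conversation in a coffee shop between a staff member (female) and a customer (male) (coffee_shop.mp3), so I'll use that.

It is about 15 seconds long and the conversation goes as follows. (The output is from Soniox.)

Output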

Speaker 1:
[en] What is your best seller here?

Speaker 2:
[en] Our best seller here is cold brew iced coffee and lattes.

Speaker 1:
[en] Okay. And on a day like today, where it's snowing quite a bit, do a lot of people still order iced coffee?

Speaker 2:
[en] Here in Maine, yes.

Speaker 1:
[en] Really?

Speaker 2:
[en] Yes.

When I ran diarization with WeSpeaker, the result was as follows.

Output

0.600	1.225	0
1.225	2.000	1
2.200	5.400	1
5.600	8.225	1
8.225	10.475	0
10.475	10.900	1
11.100	12.225	0
12.225	12.475	1
12.475	12.975	0
12.975	16.070	1

From left to right, the columns are utterance start time, utterance end time, and speaker ID.
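To work with that three-column format in code, a minimal sketch like the following parses it into tuples and sums the speaking time per speaker ID. The data is the diarization result above. `parse_segments` is a helper name I made up, and it assumes the columns are separated by whitespace.

```python
from collections import defaultdict

# The diarization result shown above: start, end, speaker ID per line.
RESULT = """\
0.600 1.225 0
1.225 2.000 1
2.200 5.400 1
5.600 8.225 1
8.225 10.475 0
10.475 10.900 1
11.100 12.225 0
12.225 12.475 1
12.475 12.975 0
12.975 16.070 1
"""


def parse_segments(text):
    """Turn whitespace-separated lines into (start, end, speaker) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, speaker = line.split()
        segments.append((float(start), float(end), int(speaker)))
    return segments


segments = parse_segments(RESULT)

# Total speaking time per speaker ID.
totals = defaultdict(float)
for start, end, speaker in segments:
    totals[speaker] += end - start

for speaker, seconds in sorted(totals.items()):
    print(f"speaker {speaker}: {seconds:.3f} s")
```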

The file needs to be converted to WAV beforehand. I used ffmpeg.

ffmpeg -i coffee_shop.mp3 coffee_shop.wav
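To sanity-check a converted file, the standard library's `wave` module can report the sample rate, channel count, and duration. This is a minimal sketch that assumes the file is plain PCM WAV, which is the only kind `wave` reads. `wav_info` is a hypothetical helper, and the demo runs on a generated file rather than the real recording.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


# Demo on a generated file: 1 second of 16 kHz mono silence.
# With the real data you would call wav_info("coffee_shop.wav") instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

print(wav_info(demo_path))
```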

Specify the segment with Segment and pass it to crop().

get_embedding_excerpt.py

from pyannote.audio import Model, Inference
from pyannote.core import Segment
import torch

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
inference.to(torch.device("cuda"))

# Extract the embedding for the 0.6 to 1.2 s segment
excerpt = Segment(0.6, 1.2)
embedding = inference.crop("coffee_shop.wav", excerpt)

print(embedding)

Output

[-0.13705519 -0.13632329 -0.01416351 -0.12091928  0.20480378 -0.5321428
  0.7618925   0.34092367 -0.33739224 -0.02096016 -0.09304089 -0.44141424
  0.0870421  -0.11665244  0.3488831  -0.18615139 -0.22912133  0.07408561
(snip)
 -0.21359101 -0.0751253  -0.02356557  0.18021457 -0.03869215 -0.09787747
 -0.06918538  0.01051457 -0.48934582  0.28809938  0.14239529 -0.05401704
 -0.21172832  0.17631385 -0.20860355 -0.11044333]

I extracted a voice embedding for each of the male and female speakers, then compared a different segment against them.

get_similarity_excerpt.py

from pyannote.audio import Model, Inference
from pyannote.core import Segment
import torch
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(model, window="whole")
inference.to(torch.device("cuda"))

# 0.6 to 1.2 s (male)
excerpt1 = Segment(0.6, 1.2)
# 1.5 to 2.0 s (female)
excerpt2 = Segment(1.5, 2.0)

audio_file = "coffee_shop.wav"
embedding_speaker_1 = inference.crop(audio_file, excerpt1)
embedding_speaker_2 = inference.crop(audio_file, excerpt2)

# Whose voice is 8.22 to 10.47 s? (correct answer: speaker1)
excerpt_target = Segment(8.22, 10.47)
embedding = inference.crop(audio_file, excerpt_target)
distance_1 = cosine(embedding_speaker_1, embedding)
distance_2 = cosine(embedding_speaker_2, embedding)
print(distance_1)
print(distance_2)

Output

0.34368926
0.73376644
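Turning the two distances above into an actual decision could look like the following minimal sketch. It picks the nearest enrolled speaker and rejects the match if even the nearest one is too far away. `identify_speaker` and the threshold of 0.5 are my own assumptions for illustration, not something pyannote.audio provides, and the vectors are made-up 3-dimensional stand-ins for real embeddings.

```python
import numpy as np
from scipy.spatial.distance import cosine


def identify_speaker(enrolled, embedding, threshold=0.5):
    """Return (name, distance) of the closest enrolled speaker.

    The name is None if even the closest speaker is farther than `threshold`.
    """
    name, distance = min(
        ((n, float(cosine(e, embedding))) for n, e in enrolled.items()),
        key=lambda item: item[1],
    )
    return (name if distance <= threshold else None), distance


# Made-up 3-D vectors standing in for the 256-D embeddings.
enrolled = {
    "speaker1": np.array([1.0, 0.1, 0.0]),
    "speaker2": np.array([0.0, 1.0, 0.2]),
}

print(identify_speaker(enrolled, np.array([0.9, 0.2, 0.0])))   # close to speaker1
print(identify_speaker(enrolled, np.array([0.0, -0.1, 1.0])))  # close to neither
```

A suitable threshold depends on the model and the recordings, so it would need to be tuned on real data.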

kun432

Continuous extraction with a sliding window

Use a sliding window to extract voice embeddings at regular intervals along the time axis.

get_embedding_sliding_window.py

from pyannote.audio import Model, Inference
import torch

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference = Inference(
    model,
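    window="sliding",  # use a sliding window
    duration=3.0,      # length of audio used for each extraction
    step=1.0,          # interval between extractions
    # With these settings, a 3-second embedding is extracted every 1 second
    # (so consecutive windows overlap)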
)
inference.to(torch.device("cuda"))

# The embeddings come back as a (N x D) pyannote.core.SlidingWindowFeature
embeddings = inference("coffee_shop.wav")

print(len(embeddings))
for emb in embeddings:
    print(emb)

Output

15
(<Segment(0, 3)>, array([ 0.06812137, -0.03169199, -0.06895392, -0.20528592, -0.05826249,
       -0.18896598,  0.18403058, -0.01637547,  0.24632832, -0.06279602,
       -0.0029377 , -0.12221014,  0.17790255, -0.14109737,  0.04132403,
       (snip)
        0.11321513,  0.18786548, -0.01366159, -0.09218998,  0.07355268,
        0.19022995, -0.12693903,  0.12788726,  0.09362973, -0.01650035,
        0.00233414], dtype=float32))
(<Segment(1, 4)>, array([-0.11529491, -0.0222354 , -0.14654861, -0.22942688, -0.21800378,
       -0.2225831 , -0.07342491, -0.02099671,  0.2362471 , -0.10307837,
        0.04511642, -0.17565045,  0.23355868, -0.20085225, -0.17847827,
       (snip)
        0.01220889,  0.2482723 ,  0.10363033, -0.01709535, -0.08198233,
        0.1312869 , -0.13962512,  0.3145274 ,  0.02499342, -0.01383983,
        0.06099774], dtype=float32))
(<Segment(2, 5)>, array([-0.3737467 ,  0.14531204, -0.10091184, -0.26540813, -0.20971738,
       -0.1697787 ,  0.26921725, -0.11855548,  0.05749515, -0.24257778,
        0.1551341 ,  0.06508996,  0.0553912 ,  0.00234929, -0.1818005 ,
(snip)
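The Segment values in the output follow the rule from the model card: the i-th window covers [i * step, i * step + duration]. A minimal sketch of that arithmetic is below. Both function names are hypothetical, and `windows_covering` ignores where the file ends, so it can return indices past the last real window.

```python
import math


def window_bounds(i, duration=3.0, step=1.0):
    """Start and end time (seconds) of the i-th sliding-window position."""
    start = i * step
    return start, start + duration


def windows_covering(t, duration=3.0, step=1.0):
    """Indices i whose window [i*step, i*step + duration) contains time t."""
    first = max(0, math.floor((t - duration) / step) + 1)
    last = math.floor(t / step)
    return list(range(first, last + 1))


for i in range(3):
    print(i, window_bounds(i))

# Which windows include the moment t = 2.5 s?
print(windows_covering(2.5))
```

With duration=3.0 and step=1.0, each moment in the middle of the file is covered by three overlapping windows.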

I extracted the speaker embeddings in advance and compared the similarity for each time window.

from pyannote.audio import Model, Inference
from pyannote.core import Segment
import torch
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

inference_whole = Inference(model, window="whole")
inference_whole.to(torch.device("cuda"))

audio_file = "coffee_shop.wav"

excerpt1 = Segment(0.6, 1.2)  # 0.6 to 1.2 s (male)
excerpt2 = Segment(1.5, 2.0)  # 1.5 to 2.0 s (female)
embedding_speaker_1 = inference_whole.crop(audio_file, excerpt1)
embedding_speaker_2 = inference_whole.crop(audio_file, excerpt2)

inference_window = Inference(
    model,
    window="sliding",
    duration=1.0,
    step=1.0,
)
inference_window.to(torch.device("cuda"))
embeddings = inference_window(audio_file)

for idx, emb in enumerate(embeddings, start=1):
    d1 = cosine(embedding_speaker_1, emb[1])
    d2 = cosine(embedding_speaker_2, emb[1])
    print(f"{idx}: {emb[0]}:\n- {d1}\n- {d2}")

Output

1: [ 00:00:00.000 -->  00:00:01.000]:
- 0.2541273832321167
- 0.9037418961524963
2: [ 00:00:01.000 -->  00:00:02.000]:
- 0.4270187020301819
- 0.4240240454673767
3: [ 00:00:02.000 -->  00:00:03.000]:
- 1.0925642251968384
- 0.9653400182723999
4: [ 00:00:03.000 -->  00:00:04.000]:
- 0.8909101486206055
- 1.0497373342514038
5: [ 00:00:04.000 -->  00:00:05.000]:
- 1.0921273231506348
- 1.0000196695327759
6: [ 00:00:05.000 -->  00:00:06.000]:
- 1.0655171871185303
- 0.9393085837364197
7: [ 00:00:06.000 -->  00:00:07.000]:
- 1.0771349668502808
- 1.0247958898544312
8: [ 00:00:07.000 -->  00:00:08.000]:
- 0.7267137765884399
- 0.8838183283805847
9: [ 00:00:08.000 -->  00:00:09.000]:
- 0.5424906611442566
- 0.8837059736251831
10: [ 00:00:09.000 -->  00:00:10.000]:
- 0.5015945434570312
- 0.6589080691337585
11: [ 00:00:10.000 -->  00:00:11.000]:
- 0.38659197092056274
- 0.8526002168655396
12: [ 00:00:11.000 -->  00:00:12.000]:
- 0.4407532811164856
- 0.7973873615264893
13: [ 00:00:12.000 -->  00:00:13.000]:
- 0.530896008014679
- 0.7685302495956421
14: [ 00:00:13.000 -->  00:00:14.000]:
- 1.0029487609863281
- 1.03523588180542
15: [ 00:00:14.000 -->  00:00:15.000]:
- 1.0234917402267456
- 0.9617617726325989
16: [ 00:00:15.000 -->  00:00:16.000]:
- 0.8705372214317322
- 0.9916833639144897
17: [ 00:00:16.000 -->  00:00:17.000]:
- 0.9292110800743103
- 0.9241632223129272

In reality, speech can't be split at clean 1-second boundaries like this, so take it as a rough illustration.
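As a rough way to read the per-window distances above, the following minimal sketch labels each window with the nearer speaker, marks it "unknown" when both distances exceed a threshold, and merges consecutive windows with the same label. The distances are copied from the output above and rounded to 4 decimals. The threshold of 0.7 is an arbitrary value I picked for illustration, and `label_window` is a hypothetical helper.

```python
from itertools import groupby

# (distance to speaker 1, distance to speaker 2) per 1-second window,
# copied from the output above and rounded to 4 decimals.
DISTANCES = [
    (0.2541, 0.9037), (0.4270, 0.4240), (1.0926, 0.9653), (0.8909, 1.0497),
    (1.0921, 1.0000), (1.0655, 0.9393), (1.0771, 1.0248), (0.7267, 0.8838),
    (0.5425, 0.8837), (0.5016, 0.6589), (0.3866, 0.8526), (0.4408, 0.7974),
    (0.5309, 0.7685), (1.0029, 1.0352), (1.0235, 0.9618), (0.8705, 0.9917),
    (0.9292, 0.9242),
]

THRESHOLD = 0.7  # arbitrary cut-off for this sketch, not a tuned value


def label_window(d1, d2, threshold=THRESHOLD):
    """Nearest speaker, or "unknown" if both distances exceed the threshold."""
    if min(d1, d2) > threshold:
        return "unknown"
    return "speaker1" if d1 <= d2 else "speaker2"


labels = [label_window(d1, d2) for d1, d2 in DISTANCES]

# Merge consecutive windows with the same label into (start, end, label).
merged = []
position = 0
for label, run in groupby(labels):
    length = len(list(run))
    merged.append((position, position + length, label))
    position += length

for start, end, label in merged:
    print(f"{start:2d}-{end:2d} s: {label}")
```

With this threshold, the windows from 8 to 13 s are labeled speaker1, while most of the remaining windows fall into "unknown" because neither reference embedding is close.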
