Trying out "sesame/csm-1b"

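This made the rounds recently:
https://x.com/sesame/status/1895159087010324615
The demo you can actually try on the official site is impressive.
https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
Now the code and model have been released:
https://x.com/brendaniribe/status/1900306051867529584
https://x.com/brendaniribe/status/1900306296236040392
That said, what was released is the 1B model, so it probably won't reach the level of the demo on the official site. For now, I'll see how far it can go.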

kun432

GitHub repository
https://github.com/SesameAILabs/csm

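CSM
2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
A fine-tuned variant of CSM powers the interactive voice demo shown in the blog post.
A hosted Hugging Face space is also available for testing audio generation.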

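Requirements
A CUDA-compatible GPU
The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
Similarly, Python 3.10 is recommended, but newer versions may be fine
ffmpeg may be required for some audio operations
Access to the following Hugging Face models: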
Llama-3.2-1B
CSM-1B
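
Before cloning the repository, the non-GPU requirements above can be checked with a few lines of standard-library Python. This is only a minimal sketch: `check_environment` is a hypothetical helper name, not part of the CSM repository, and it does not check CUDA or Hugging Face access (once PyTorch is installed, `torch.cuda.is_available()` covers the GPU part).

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a dict describing whether the local setup matches the README's notes."""
    return {
        # The README recommends Python 3.10; newer versions may also work.
        "python_ok": sys.version_info[:2] >= min_python,
        # ffmpeg is only needed for some audio operations, so a missing
        # binary is a warning sign rather than a hard failure.
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```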


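FAQ
Does this model come with any voices?
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model?
CSM is trained as an audio generation model, not a general-purpose multimodal LLM. It cannot generate text. Using a separate LLM for text generation is recommended.
Does it support other languages?
The model has some capacity for non-English languages due to data contamination in the training data, but it is likely not to work well.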

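Misuse and abuse ⚠️
This project provides a high-quality speech generation model for research and educational purposes. While responsible and ethical use is encouraged, the following are explicitly prohibited:

Impersonation or fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.

Misinformation or deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.

Illegal or harmful activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.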
Model
https://huggingface.co/sesame/csm-1b


I'll follow the code given in the model card, running on a Colaboratory T4.

Clone the repository and install the packages.

!git clone https://github.com/SesameAILabs/csm
%cd csm
!pip install -r requirements.txt

The install above requires a runtime restart, so after restarting, move back into the cloned repository directory.

%cd csm

Download and load the model.

from generator import load_csm_1b

generator = load_csm_1b(device="cuda")

Generate audio.

import torchaudio

audio = generator.generate(
    text="Hi Bob, long time no see! did you catch the horse races this weekend?",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
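
As a rough guide to what `max_audio_length_ms=10_000` allows, the arithmetic below converts a duration into codec frames and output samples. It is a sketch under two assumptions that this post does not state: that the Mimi codec runs at 12.5 frames per second (80 ms per frame) and that `generator.sample_rate` is 24 kHz. I have not checked how the generator maps the parameter internally.

```python
FRAME_MS = 80          # assumption: Mimi emits 12.5 frames per second
SAMPLE_RATE = 24_000   # assumption: the value of generator.sample_rate


def audio_budget(max_audio_length_ms):
    """Return (frames, samples) that fit into the given duration."""
    frames = max_audio_length_ms // FRAME_MS
    samples = frames * FRAME_MS * SAMPLE_RATE // 1000
    return frames, samples


print(audio_budget(10_000))  # (125, 240000)
```

Under those assumptions, 10 seconds corresponds to at most 125 frames, so longer sentences may need a larger `max_audio_length_ms`.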

Play it back.

from IPython.display import display, Audio

display(Audio("audio.wav"))

Generated audio

Note that the voice comes out completely different on every generation. These are the results of running the same code several times.


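As noted above, if you do nothing special, the generated voice changes drastically on every run. Used this way, the speaker parameter does almost nothing.

That is where context comes in. With context, the next utterance is generated from the preceding conversation, so it can follow the flow of that conversation. This is also where the speaker parameter starts to matter.

First, the straightforward way.

Start by generating the first utterance.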

import torchaudio
from IPython.display import display, Audio

audio = generator.generate(
    text="Hey! it's been so long since we last caught up!",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("speaker_0_1.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
display(Audio("speaker_0_1.wav"))

To turn this audio into context, wrap it in a Segment, specifying the text, the audio, and the speaker ID. That makes it an audio prompt.

from generator import Segment

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

contexts = [
    Segment(
        # The text should be the transcript of the audio in speaker_0_1.wav
        text="Hey! it's been so long since we last caught up!",
        speaker=0,
        audio=load_audio("speaker_0_1.wav")
    )
]

Generate a new utterance from this context. The speaker is changed to 1.

audio = generator.generate(
    text="Yea! I've really missed our chats. How have you been?",
    speaker=1,
    context=contexts,
    max_audio_length_ms=10_000,
)

torchaudio.save("speaker_1_1.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
display(Audio("speaker_1_1.wav"))

After that, just keep repeating this.

contexts.append(
    Segment(
        text="Yea! I've really missed our chats. How have you been?",
        speaker=1,
        audio=load_audio("speaker_1_1.wav")
    )
)

audio = generator.generate(
    text="I've been doing well, thanks. I've also kept up with our shared passion for horse racing.",
    speaker=0,
    context=contexts,
    max_audio_length_ms=10_000,
)

torchaudio.save("speaker_0_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
display(Audio("speaker_0_2.wav"))

contexts.append(
    Segment(
        text="I've been doing well, thanks. I've also kept up with our shared passion for horse racing.",
        speaker=0,
        audio=load_audio("speaker_0_2.wav")
    )
)

audio = generator.generate(
    text="That's great to hear! I watched a few races recently and the competition was fierce.",
    speaker=1,
    context=contexts,
    max_audio_length_ms=10_000,
)

torchaudio.save("speaker_1_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
display(Audio("speaker_1_2.wav"))
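
The generate-then-append steps above can be folded into one helper, so the text passed to `generate` and the text stored in the context cannot drift apart. This is a minimal sketch, not code from the CSM repository: `say` is a hypothetical name, and `FakeGenerator` and `FakeSegment` are stand-ins so the block runs without a GPU. It assumes only what the code above shows, namely that `generate` accepts `text`, `speaker`, `context`, and `max_audio_length_ms`, and that `Segment` accepts `text`, `speaker`, and `audio` as keywords.

```python
from dataclasses import dataclass


def say(generator, make_segment, history, text, speaker, max_audio_length_ms=10_000):
    """Generate one utterance conditioned on history, then append it to history."""
    audio = generator.generate(
        text=text,
        speaker=speaker,
        context=list(history),  # pass a copy so our history is not mutated
        max_audio_length_ms=max_audio_length_ms,
    )
    history.append(make_segment(text=text, speaker=speaker, audio=audio))
    return audio


@dataclass
class FakeSegment:
    """Stand-in for generator.Segment (hypothetical, for illustration only)."""
    text: str
    speaker: int
    audio: object


class FakeGenerator:
    """Stand-in for the CSM generator; records the context size of each call."""

    def __init__(self):
        self.context_sizes = []

    def generate(self, text, speaker, context, max_audio_length_ms):
        self.context_sizes.append(len(context))
        return f"<audio for speaker {speaker}>"


fake = FakeGenerator()
history = []
say(fake, FakeSegment, history, "Hey! it's been so long since we last caught up!", speaker=0)
say(fake, FakeSegment, history, "Yea! I've really missed our chats. How have you been?", speaker=1)
print(fake.context_sizes)  # [0, 1]
```

With the real objects, the call would look like `say(generator, Segment, contexts, "...", speaker=1)`. This keeps the generated tensor in memory instead of reloading it from a WAV file.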

Putting it all together, it sounds like this, and you can hear that the conversation carries over from turn to turn.


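Done in one go, it looks like this. If you supply the initial audio prompts yourself, you should be able to have the conversation spoken in those voices, so this could be used to make something like a podcast. Combined with microphone input, STT, and an LLM, it could also be turned into a voice dialogue system.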

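import torch
import torchaudio
from IPython.display import display, Audio

from generator import Segment, load_csm_1b

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

# Auto-generate one utterance per speaker to use as the initial prompts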
default_text_a = "Hello, How's it going?"
default_text_b = "Hello, How's it going?"

audio_a = generator.generate(
    text=default_text_a,
    speaker=0,
    context=[],  # first call, so no context
    max_audio_length_ms=10_000,
)
audio_b = generator.generate(
    text=default_text_b,
    speaker=1,
    context=[],  # first call, so no context
    max_audio_length_ms=10_000,
)

# Build Segments to serve as the initial prompts
initial_prompt_a = Segment(text=default_text_a, speaker=0, audio=audio_a)
initial_prompt_b = Segment(text=default_text_b, speaker=1, audio=audio_b)

# Seed the context with the two initial utterances
conversation_segments = [initial_prompt_a, initial_prompt_b]

# Define the conversation script; each line is treated as an alternating turn
conversation_text = """\
Hi, it's been so long since we last caught up!
Yea, I know! I've really missed our chats. How have you been?
I've been doing well, thanks. I've also kept up with our shared passion for horse racing.
That's great to hear! I watched a few races recently and the competition was fierce.
Absolutely, the races were exhilarating, especially that unexpected win by an underdog.
I couldn't agree more—the strategy displayed by the jockey was simply brilliant.
Have you read about the new training techniques being introduced? They seem to be changing the game.
Yes, I saw an article on that! It looks like they could really enhance performance in future races.
I'm excited to see how these developments will shape upcoming events. It really adds another layer of thrill.
Definitely. We should plan to catch a race together soon and discuss all these new trends in person!
"""

utterances = [line.strip() for line in conversation_text.split("\n") if line.strip()]

# Generate each utterance
for i, utterance in enumerate(utterances):
    # Turns alternate between Speaker A (0) and Speaker B (1)
    speaker = i % 2
    print(f"Generating audio for Speaker {'A' if speaker == 0 else 'B'}: {utterance}")

    audio_tensor = generator.generate(
        text=utterance,
        speaker=speaker,
        context=conversation_segments,
        max_audio_length_ms=10_000,
    )

    segment = Segment(text=utterance, speaker=speaker, audio=audio_tensor)
    conversation_segments.append(segment)

# Concatenate all utterances and write them to a single wav file
# Note: the two utterances used as the initial prompts are excluded
all_audio_tensors = [segment.audio for segment in conversation_segments[2:]]
final_audio = torch.cat(all_audio_tensors, dim=0)

output_filename = "final_conversation.wav"
torchaudio.save(output_filename, final_audio.unsqueeze(0).cpu(), generator.sample_rate)

display(Audio("final_conversation.wav"))
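
The loop above passes the entire, ever-growing history as context on every call. If that turns out to be too long for the model or for the T4's memory (an assumption on my part; I have not confirmed where the limit is), one option is to keep the two initial voice prompts plus only the most recent turns. The sketch below uses hypothetical names (`trim_context`, `keep_head`, `keep_tail`) and works on any list, so it can be tried without the model.

```python
def trim_context(segments, keep_head=2, keep_tail=4):
    """Keep the first keep_head items (the voice prompts) and the last keep_tail."""
    if len(segments) <= keep_head + keep_tail:
        return list(segments)
    head = list(segments[:keep_head])
    # segments[-0:] would return everything, so handle keep_tail == 0 explicitly
    tail = list(segments[-keep_tail:]) if keep_tail > 0 else []
    return head + tail


# Integers stand in for Segment objects here
print(trim_context(list(range(12))))  # [0, 1, 8, 9, 10, 11]
```

In the loop, this would mean passing `context=trim_context(conversation_segments)` to `generate` while still appending every segment to `conversation_segments` for the final concatenation. Whether dropping older turns hurts voice consistency is something I have not tested.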


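Many things contribute to how fluent speech sounds. Among them, speaking naturally in response to the audio context is presumably this model's selling point.

That said, I think there are plenty of other factors. The currently released model alone does not cover all of them, and I suspect the Sesame demo includes other components beyond it.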