
Trying out "AivisSpeech", an easy-to-use Japanese TTS

https://x.com/izutorishima/status/1858885899108708560
Official site
https://aivis-project.com/
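From a quick look, the pieces seem to be:
AivisSpeech Engine: the core TTS engine.
AivisSpeech: the UI. It bundles AivisSpeech Engine.
AivisBuilder: a tool for creating voice models from audio data.
AIVM Generator: generates and edits AIVM, the file format for voice models.
AivisHub: a hub for publishing and sharing voice models.
Aivis API: an API server for voice models.
At least, that is my reading.
The GitHub repositories are here: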
https://github.com/Aivis-Project

First, I try it on my local Mac.
https://aivis-project.com/
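Download the installer from the site above and install it. Startup takes a little while, and in my case it failed the first time. After relaunching, it came up.
Enter some text and generate speech.
The audio is generated and played back. You can adjust accents and set parameters.
It can also export to an audio file.
Here is a sample: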
https://audio.com/kun432/audio/aivis-speech-sample-1
To change the style, click the icon.

While AivisSpeech is running, its API is running too. Opening http://127.0.0.1:10101/docs in a browser shows the API documentation.

Next, generate speech through the API. First, check the STYLE_ID.

curl http://127.0.0.1:10101/speakers | jq -r .

Output

[
  {
    "name": "Anneli",
    "speaker_uuid": "e756b8e4-b606-4e15-99b1-3f9c6a1b2317",
    "styles": [
      {
        "name": "ノーマル",
        "id": 888753760,
        "type": "talk"
      },
      {
        "name": "通常",
        "id": 888753761,
        "type": "talk"
      },
      {
        "name": "テンション高め",
        "id": 888753762,
        "type": "talk"
      },
      {
        "name": "落ち着き",
        "id": 888753763,
        "type": "talk"
      },
      {
        "name": "上機嫌",
        "id": 888753764,
        "type": "talk"
      },
      {
        "name": "怒り・悲しみ",
        "id": 888753765,
        "type": "talk"
      }
    ],
    "version": "1.0.0",
    "supported_features": {
      "permitted_synthesis_morphing": "NOTHING"
    }
  }
]
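The same lookup can be scripted. Below is a minimal Python sketch, assuming the `/speakers` response has the shape shown above. `list_styles` and `fetch_speakers` are hypothetical helper names for illustration, not part of AivisSpeech.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:10101"  # local engine address used in this scrap


def list_styles(speakers):
    """Flatten a /speakers response into (speaker name, style name, style id) tuples."""
    return [
        (speaker["name"], style["name"], style["id"])
        for speaker in speakers
        for style in speaker["styles"]
    ]


def fetch_speakers(base_url=BASE_URL):
    """GET /speakers from a running engine and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/speakers") as resp:
        return json.load(resp)
```

With the engine running, `list_styles(fetch_speakers())` should yield one tuple per style, and the third element is the value to pass as `speaker` in the later requests.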

Write the text to be spoken to a file.

echo -n "こんにちは。やっぱりドウデュースは強かったですね！ジャパンカップ優勝おめでとう、ドウデュース！" > text.txt

This apparently has to be converted into a query JSON first. The STYLE_ID is set here, as the speaker parameter.

curl -s -X POST "127.0.0.1:10101/audio_query?speaker=888753764" \
    --get \
    --data-urlencode text@text.txt > query.json
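The curl call above sends a POST whose text and speaker travel in the URL query string rather than in the body. A rough Python equivalent is sketched below using only the standard library. The function names are made up for illustration, and the empty POST body is an assumption based on how the curl command behaves.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:10101"  # local engine address used in this scrap


def audio_query_url(text, style_id, base_url=BASE_URL):
    """Build the /audio_query URL with speaker and text percent-encoded in the query string."""
    params = urllib.parse.urlencode({"speaker": style_id, "text": text})
    return f"{base_url}/audio_query?{params}"


def fetch_audio_query(text, style_id, base_url=BASE_URL):
    """POST to /audio_query with no body and return the decoded query as a dict."""
    req = urllib.request.Request(audio_query_url(text, style_id, base_url), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Percent-encoding matters here because the text is Japanese. `urlencode` plays the role that `--data-urlencode` plays in the curl version.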

The resulting query.json looks like this.

query.json

{
  "accent_phrases": [
    {
      "moras": [
        {
          "text": "コ",
          "consonant": "k",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ン",
          "consonant": null,
          "consonant_length": null,
          "vowel": "N",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ニ",
          "consonant": "n",
          "consonant_length": 0.0,
          "vowel": "i",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "チ",
          "consonant": "ch",
          "consonant_length": 0.0,
          "vowel": "i",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ワ",
          "consonant": "w",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": ".",
          "consonant": null,
          "consonant_length": null,
          "vowel": "pau",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 5,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "ヤ",
          "consonant": "y",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ッ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "cl",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "パ",
          "consonant": "p",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "リ",
          "consonant": "r",
          "consonant_length": 0.0,
          "vowel": "i",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 3,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "ド",
          "consonant": "d",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "オ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 1,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "デュ",
          "consonant": "dy",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ウ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ス",
          "consonant": "s",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ワ",
          "consonant": "w",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 1,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "ツ",
          "consonant": "ts",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ヨ",
          "consonant": "y",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "カ",
          "consonant": "k",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ッ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "cl",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "タ",
          "consonant": "t",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "デ",
          "consonant": "d",
          "consonant_length": 0.0,
          "vowel": "e",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ス",
          "consonant": "s",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ネ",
          "consonant": "n",
          "consonant_length": 0.0,
          "vowel": "e",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "!",
          "consonant": null,
          "consonant_length": null,
          "vowel": "pau",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 2,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "ジャ",
          "consonant": "j",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "パ",
          "consonant": "p",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ン",
          "consonant": null,
          "consonant_length": null,
          "vowel": "N",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "カ",
          "consonant": "k",
          "consonant_length": 0.0,
          "vowel": "a",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ッ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "cl",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "プ",
          "consonant": "p",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ユ",
          "consonant": "y",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ウ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ショ",
          "consonant": "sh",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "オ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 7,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "オ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "メ",
          "consonant": "m",
          "consonant_length": 0.0,
          "vowel": "e",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "デ",
          "consonant": "d",
          "consonant_length": 0.0,
          "vowel": "e",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ト",
          "consonant": "t",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "オ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": ",",
          "consonant": null,
          "consonant_length": null,
          "vowel": "pau",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 5,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "ド",
          "consonant": "d",
          "consonant_length": 0.0,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "オ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "o",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 1,
      "pause_mora": null,
      "is_interrogative": false
    },
    {
      "moras": [
        {
          "text": "デュ",
          "consonant": "dy",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ウ",
          "consonant": null,
          "consonant_length": null,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "ス",
          "consonant": "s",
          "consonant_length": 0.0,
          "vowel": "u",
          "vowel_length": 0.0,
          "pitch": 0.0
        },
        {
          "text": "!",
          "consonant": null,
          "consonant_length": null,
          "vowel": "pau",
          "vowel_length": 0.0,
          "pitch": 0.0
        }
      ],
      "accent": 1,
      "pause_mora": null,
      "is_interrogative": false
    }
  ],
  "speedScale": 1.0,
  "intonationScale": 1.0,
  "tempoDynamicsScale": 1.0,
  "pitchScale": 0.0,
  "volumeScale": 1.0,
  "prePhonemeLength": 0.1,
  "postPhonemeLength": 0.1,
  "pauseLength": null,
  "pauseLengthScale": 1.0,
  "outputSamplingRate": 44100,
  "outputStereo": false,
  "kana": "こんにちは。やっぱりドウデュースは強かったですね！ジャパンカップ優勝おめでとう、ドウデュース！"
}
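Because the query is plain JSON, it can be inspected or adjusted before synthesis. The sketch below assumes the structure shown above (`accent_phrases`, `moras`, `speedScale`). The helper names are hypothetical, and how strongly the engine reacts to a changed `speedScale` is not something verified here.

```python
import copy
import json


def load_query(path="query.json"):
    """Read a saved audio query from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def phrase_readings(query):
    """Return the katakana reading of each accent phrase by joining its mora texts."""
    return [
        "".join(mora["text"] for mora in phrase["moras"])
        for phrase in query["accent_phrases"]
    ]


def with_speed(query, speed_scale):
    """Return a copy of the query with speedScale replaced; the input is not modified."""
    adjusted = copy.deepcopy(query)
    adjusted["speedScale"] = speed_scale
    return adjusted
```

`phrase_readings(load_query())` is a quick way to check how the engine split and read the text without scrolling through every mora.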

Then use query.json to send a synthesis request.

curl -s -X POST "127.0.0.1:10101/synthesis?speaker=888753764" \
    -H "Content-Type: application/json" \
    -d @query.json > audio.wav

The resulting audio file sounds like this.
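For reference, the synthesis call can also be sketched in Python. This assumes the same endpoint and parameters as the curl command above. `synthesis_request` and `synthesize_to_file` are illustrative names, not an official client.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:10101"  # local engine address used in this scrap


def synthesis_request(query, style_id, base_url=BASE_URL):
    """Build, without sending, the POST /synthesis request carrying the query as a JSON body."""
    return urllib.request.Request(
        f"{base_url}/synthesis?speaker={style_id}",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(query, style_id, out_path, base_url=BASE_URL):
    """Send the request, write the returned audio bytes to out_path, and return the byte count."""
    with urllib.request.urlopen(synthesis_request(query, style_id, base_url)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Separating request construction from sending keeps the part that can be checked offline apart from the part that needs a running engine.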

It is said to be compatible with the VOICEVOX ENGINE HTTP API, so presumably this is simply how VOICEVOX's API is designed. Personally, I found the API a little quirky.

Next, start AivisSpeech Engine with Docker on Linux. The machine runs Ubuntu 22.04 with an RTX 4090 (24 GB VRAM).

mkdir -p ~/.local/share/AivisSpeech-Engine

docker run --rm --gpus all -p '10101:10101' \
    -v ~/.local/share/AivisSpeech-Engine:/home/user/.local/share/AivisSpeech-Engine-Dev \
    ghcr.io/aivis-project/aivisspeech-engine:nvidia-latest

If logs like the following appear, it is up.

Output

[2024/11/24 10:16:37] INFO:     Started server process [1]
[2024/11/24 10:16:37] INFO:     Waiting for application startup.
[2024/11/24 10:16:37] INFO:     Application startup complete.
[2024/11/24 10:16:37] INFO:     Uvicorn running on http://0.0.0.0:10101 (Press CTRL+C to quit)

From here it can be used by calling the API as described in the previous entry.
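Since the engine can take a while to start, a script that talks to it may want to wait until the API answers. Here is a minimal polling sketch. It uses `/speakers` as the readiness probe only because that endpoint appears earlier in this scrap, and the attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def engine_is_up(base_url="http://127.0.0.1:10101", timeout=2.0):
    """Return True if GET /speakers answers with HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(f"{base_url}/speakers", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_up(probe=engine_is_up, attempts=30, delay=1.0):
    """Call probe() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The probe is passed in as a parameter so the retry loop can be exercised without a running container.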

The nvidia-smi output looked like this.

Output

Sun Nov 24 19:24:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   41C    P0             50W /  450W |    3215MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1377      G   /usr/lib/xorg/Xorg                            167MiB |
|    0   N/A  N/A      1454      G   /usr/bin/gnome-shell                           15MiB |
|    0   N/A  N/A   3299193      C   ...aivisspeech-engine/.venv/bin/python       3008MiB |
+-----------------------------------------------------------------------------------------+

I would like to try building a voice model of my own, too.
