GladiaのSTTモデル「Solaria」を試す

音声は、私たちにとって最もパワフルで親密なコミュニケーション手段です。そして今回初めて、人間とAIが同じ言語を話すことができるようになりました。

Solariaをご紹介します。グローバル展開を目的として設計された初の音声AIモデル。即時性、正確性、真の多言語対応を実現します。

・ 94%の単語精度を誇る、クラス最高のリアルタイム文字起こし
・ Solaria独自の42言語を含む、100以上の言語でネイティブレベルの品質を実現
・自然で遅延のない会話のための270ミリ秒のレイテンシ

業界に特化した音声エージェントを構築しているか、あるいは高パフォーマンスのカスタマーエクスペリエンスを提供しているかに関わらず、どこでも誰とでもシームレスに会話ができる唯一の音声認識モデルで、未開拓のグローバル市場を開拓しましょう。

kun432

Gladiaは知らなかったのだが、どうやらSTTサービス専門の様子。
https://www.gladia.io/
サービスとしては以下の2つ。
Speech-to-Text API
Real-Time Transcription API
ざっと見た感じ、正確性もそうだけど、高速性がウリのようで、リアルタイムの方だと思うのだけど300ms以下を謳っている。
料金プランをざっと表にまとめてみた。
https://www.gladia.io/pricing


プラン
Free
Pro


料金
$0/月
※10時間/月替含まれる
バッチ: $0.612/時間
リアルタイム: $0.144/時間

バッチ文字起こし
✔
✔

話者分離
✔
✔

単語レベルのタイムスタンプ
ー
✔

リアルタイム文字起こし
✔
✔

100以上の言語のフルサポート
ー
✔

言語検出
ー
✔

Code Swithcing？
ー
✔

Code Translation？
ー
✔

句読点と大文字小文字の自動付与
ー
✔

カスタム辞書
ー
✔

デュアルチャネル文字起こし
ー
✔

SRT/VTT字幕フォーマット対応
ー
✔

無制限ファイルサイズ
✘
✔（多分）

無制限ファイル長
✘
✔（多分）

同時接続制限
✘
✔（多分）

上記以外にEnterpriseもあって、Contact Salesになっている。Enterpriseの場合は以下が利用できる様子。
データ保持期間やSLAの個別契約ができそう
クラウドのロケーションやプロバイダーがカスタム、オンプレも。閉域みたいなのもできるっぽい？
メールと電話でのサポート。専任のアカウントマネージャとサポートエンジニアが用意されるっぽい
とりあえず無料でどこまで行けるか確かめてみる

プラン	Free	Pro
料金	$0/月 ※10時間/月替含まれる	バッチ: $0.612/時間リアルタイム: $0.144/時間
バッチ文字起こし	✔	✔
話者分離	✔	✔
単語レベルのタイムスタンプ	ー	✔
リアルタイム文字起こし	✔	✔
100以上の言語のフルサポート	ー	✔
言語検出	ー	✔
Code Swithcing？	ー	✔
Code Translation？	ー	✔
句読点と大文字小文字の自動付与	ー	✔
カスタム辞書	ー	✔
デュアルチャネル文字起こし	ー	✔
SRT/VTT字幕フォーマット対応	ー	✔
無制限ファイルサイズ	✘	✔（多分）
無制限ファイル長	✘	✔（多分）
同時接続制限	✘	✔（多分）

kun432

各STTプロバイダの料金を以前調べたのだけど、最安クラスじゃないかな

kun432

 Getting Startedhttps://docs.gladia.io/chapters/introduction/getting-started
無料アカウントでまずは試してみる。今回はリアルタイムも試したいのでローカルのMacで。
とりまアカウント作成後に、APIキーが作成されているので、それを確認しておく

kun432

Asynchronous Speech-to-Text

おそらくこれがバッチ文字起こしだと思う。

まずはCLIで試す。サンプル音声として、自分が開催した勉強会のYouTube動画から冒頭4分30秒程度の音声を抜き出したオーディオファイルを用意した。

まずこのファイルをGladiaにアップロードする。XXX...に先ほど確認したGladiaのAPIキーをセットする。

export GLADIA_API_KEY=XXXXXXXXXX

curl --request POST \
  --url https://api.gladia.io/v2/upload \
  --header 'Content-Type: multipart/form-data' \
  --header "x-gladia-key: $GLADIA_API_KEY" \
  --form audio=@voice_lunch_jp_5min.wav

ファイルのURLが返ってくる。これが必要になる。

出力

{
    "audio_url":"https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "audio_metadata":{
        "id":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "filename":"voice_lunch_jp_5min.wav",
        "extension":"wav",
        "size":8600312,
        "audio_duration":268.757,
        "number_of_channels":1
    }
}

ではファイルの文字起こし。今回のサンプルだと話者分離は必要ないのだけど、ドキュメントどおりに設定してみた。enable_code_switchingってなんぞや？と思ったら、どうやら複数言語の話者が含まれる音声の場合に自動的に文字起こしの言語を切り替えるというものみたい。ということは、以下にはないけど、"code translation"は多分その場合の翻訳を制御するものだと思われる。

  curl --request POST \
    --url https://api.gladia.io/v2/pre-recorded \
    --header 'Content-Type: application/json' \
    --header "x-gladia-key: $GLADIA_API_KEY" \
    --data '{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "diarization": true,
    "diarization_config": {
      "number_of_speakers": 3,
      "min_speakers": 1,
      "max_speakers": 5
    },
    "translation": true,
    "translation_config": {
      "model": "base",
      "target_languages": ["ja", "en"]
    },
    "subtitles": true,
    "subtitles_config": {
      "formats": ["srt", "vtt"]
    },
    "detect_language": true,
    "enable_code_switching": false
    }
  '

実行すると以下のような結果が返る。直ぐに結果が返ってくるわけではなくて、非同期にジョブが実行されるので、結果に含まれるidとresult_urlを使って確認する様子。

出力

{
    "id":"YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY",
    "result_url":"https://api.gladia.io/v2/pre-recorded/YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY"
}

結果を取得するURLはhttps://api.gladia.io/v2/pre-recorded/<ID>となっていて、上記のresult_urlをそのまま使えば良い。

確認してみる。

curl --request GET \
  --url https://api.gladia.io/v2/pre-recorded/YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY \
    --header "x-gladia-key: $GLADIA_API_KEY" | jq -r .

出力

{
  "id": "YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY",
  "request_id": "G-dfb9a7fc",
  "version": 2,
  "status": "processing",
  "created_at": "2025-04-11T08:35:31.066Z",
  "completed_at": null,
  "custom_metadata": null,
  "error_code": null,
  "kind": "pre-recorded",
  "file": {
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "filename": "voice_lunch_jp_5min.wav",
    "source": null,
    "audio_duration": 268.757,
    "number_of_channels": 1
  },
  "request_params": {
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sentences": false,
    "subtitles": true,
    "moderation": false,
    "diarization": true,
    "translation": true,
    "audio_to_llm": false,
    "display_mode": false,
    "summarization": false,
    "audio_enhancer": true,
    "chapterization": false,
    "custom_spelling": false,
    "detect_language": true,
    "name_consistency": false,
    "subtitles_config": {
      "style": "default",
      "formats": [
        "srt",
        "vtt"
      ]
    },
    "diarization_config": {
      "enhanced": false,
      "max_speakers": 5,
      "min_speakers": 1,
      "number_of_speakers": 3
    },
    "sentiment_analysis": false,
    "translation_config": {
      "model": "base",
      "target_languages": [
        "ja",
        "en"
      ],
      "match_original_utterances": true
    },
    "diarization_enhanced": false,
    "punctuation_enhanced": false,
    "enable_code_switching": false,
    "named_entity_recognition": false,
    "speaker_reidentification": false,
    "accurate_words_timestamps": false,
    "skip_channel_deduplication": false,
    "structured_data_extraction": false
  },
  "result": null
}

なんどか繰り返していると、上の"status": "processing"が"status":"done"になり、結果が返ってくる。

出力

{
  "id": "YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY",
  "request_id": "G-dfb9a7fc",
  "version": 2,
  "status": "done",
  "created_at": "2025-04-11T08:35:31.066Z",
  "completed_at": "2025-04-11T08:36:26.986Z",
  "custom_metadata": null,
  "error_code": null,
  "kind": "pre-recorded",
  "file": {
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "filename": "voice_lunch_jp_5min.wav",
    "source": null,
    "audio_duration": 268.757,
    "number_of_channels": 1
  },
  "request_params": {
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sentences": false,
    "subtitles": true,
    "moderation": false,
    "diarization": true,
    "translation": true,
    "audio_to_llm": false,
    "display_mode": false,
    "summarization": false,
    "audio_enhancer": true,
    "chapterization": false,
    "custom_spelling": false,
    "detect_language": true,
    "name_consistency": false,
    "subtitles_config": {
      "style": "default",
      "formats": [
        "srt",
        "vtt"
      ]
    },
    "diarization_config": {
      "enhanced": false,
      "max_speakers": 5,
      "min_speakers": 1,
      "number_of_speakers": 3
    },
    "sentiment_analysis": false,
    "translation_config": {
      "model": "base",
      "target_languages": [
        "ja",
        "en"
      ],
      "match_original_utterances": true
    },
    "diarization_enhanced": false,
    "punctuation_enhanced": false,
    "enable_code_switching": false,
    "named_entity_recognition": false,
    "speaker_reidentification": false,
    "accurate_words_timestamps": false,
    "skip_channel_deduplication": false,
    "structured_data_extraction": false
  },
  "result": {
    "metadata": {
      "audio_duration": 268.757313,
      "number_of_distinct_channels": 1,
      "billing_time": 268.757313,
      "transcription_time": 55.92
    },
    "transcription": {
      "utterances": [
        {
          "words": [
            {
              "word": "はい",
              "start": 0.876,
              "end": 1.096,
              "confidence": 0.45
            },
            {
              "word": "じゃあ",
              "start": 1.356,
              "end": 1.696,
              "confidence": 0.46
            },
            {
              "word": "始め",
              "start": 1.776,
              "end": 1.976,
              "confidence": 0.43
            },
            {
              "word": "ます",
              "start": 2.056,
              "end": 2.336,
              "confidence": 0.86
            },
(snip)

リクエスト時の設定で、翻訳と字幕フォーマットを定義していると

(snip)
    "translation": true,
    "translation_config": {
      "model": "base",
      "target_languages": ["ja", "en"]
    },
    "subtitles": true,
    "subtitles_config": {
      "formats": ["srt", "vtt"]
    },
(snip)

それぞれの文字起こしが含まれためちゃめちゃでかいレスポンスが返ってくるので、一旦ファイルに出力して確認してみた。

curl --request GET \
    --url https://api.gladia.io/v2/pre-recorded/YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY \
    --header "x-gladia-key: $GLADIA_API_KEY" > result.json

まず文字起こし

jq -r .result.transcription.full_transcript result.json

はいじゃあ始めますちょっとまだ来られてない方もいらっしゃるんですけどボイスランチ JP始めます皆さん日曜日にお集まりいただきましてありがとうございます今日は久しぶりにですねオフラインということで今日はですねスペシャルなゲストをお二人来ていただいておりますということで今日ちょっとトピックに回りますけれどもボイスローの仕様であるブレイデン・リームさんとあとセールスフォースのカンバース・フェジョナルデザインのディレクターであるブレイク・ベネスさんに来ていただいてますということで日本に来ていただいてありがとうございました今日はちょっとこのお二人にまた後でいろいろと聞こうというコーナーがありますのでそこでまたいろいろと聞きたいと思います今日のアジェンダなんですけどもちょっと時間過ぎちゃいましたがまず最初にボイスランチJP についてというとあと会場のところですね少しご説明させていただいて一つ目のセッションでまず私の方からボイスローン 2022年の新機能とかですねその辺の話を少しさせていただいてその後2つ目のセッションでグレイデンさんとグルーさんにいろいろカンバセーショナルデザインですねについて何でも聞こうぜみたいなところを予定しておりますその後 15時から15時で一旦終了という形でさせていただいて一応ボイスランチJP確か記念撮影は必須ですよねなのでそれだけさせていただいてその後ちょっと1時間ぐらい簡単にお菓子と飲み物を用意してますので懇親会というのをそのままさせていただこうと思っていますボイスランチJPについてなんですけどもボイスランチはボイスUIとか音声関連ですねそういった技術に実際に携わっている人もしくは興味がある人のためのグローバルなコミュニティという形になっていてボイスランチの日本リージョンという形がボイスランチJPになっています過去もずっとやってますけどオンラインオフラインでいろんな音声のデザインだったり技術だったりというところで情報とかを共有してみんなで業界盛り上げていこうぜというようなことでやっております今日のハッシュタグですねシャープボイスランチJPでいろいろと自由に活かしてくださいあと会場ですね今回グラニカ様のお声で利用させていただいてますありがとうございますぜひこちらもシェアをお願いしたいですと配信のところもいろいろとやっていただいてますので非常に感謝しております今コロナで会場に来られる方とかもあまりいないということでされてないんですけれども通常はここでIoT機器とかガジェットとかを展示されているようなのでそういったものがあるとき今度ですねまた体験してみていただければなと思っていますというところであとすいませんトイレがこちらであとタバコ吸われる方はこちらのところになってますのでよろしくお願いしますはいということで最初の挨拶はこれでじゃあまず私の方のセッションからさせていただきますというところでボイスローアップデート2022とというところで、今年の新機能についてお話をします。自己紹介です。清水と申します。神戸でインフラのエンジニアをやってますので、普段はKubernetesとかAWSとかTerraformとかをいじってまして、最近ちょっとフリーランスになります。ちょっと調べてみたら、ボイスフローを一番最初に始めたのは2019年の頭ぐらいなので、大体4年弱ぐらいですね。いろいろと触ってまして。あと音声関連のコミュニティのところでは、ボイスランチAP。今回のやつですね以外に AJAG Amazon Alexa Japan User Group とかあとVoicelo の日本語ユーザーグループということでVFJUGというのをやっています日本語コミュニティの方はFacebookの方でやってますのでもしよろしければ見ていただければなと思いますあと2年ぐらい前にですね技術書店の方でここに今日スタッフで聞いていただいている皆さんとですね一緒に同人誌作ろうぜということで作ったんですけれども、もうこれちょっと2年ぐらい経って中身がだいぶ古くなってしまっているので、すでにちょっと販売は終了しております。今日ちょっと持ってきたかったんですけど、忘れてしまいました。はい、なのでこういうこともやっています。

字幕用の出力。SRTとVTTそれぞれ出力される。以下はSRT。

jq -r .result.transcription.subtitles[0].subtitles result.json

1
00:00:00.876 --> 00:00:07.598
はいじゃあ始めますちょっとまだ来られてない方もいらっしゃる
んですけどボイスランチ

2
00:00:08.118 --> 00:00:09.998
JP 始めます皆さん

3
00:00:10.579 --> 00:00:17.480
日曜日にお集まりいただきましてありがとうございます今日

4
00:00:17.581 --> 00:00:22.742
は久しぶりにですねオフラインということで今日はですね
スペシャルなゲストをお二人

5
00:00:23.442 --> 00:00:25.643
来ていただいておりますということで

6
00:00:26.483 --> 00:00:31.565
今日ちょっとトピックに回りますけれどもボイスローの仕様で
あるブレイデン・リームさんと

7
00:00:31.925 --> 00:00:36.109
あとセールスフォースのカンバース・フェジョナルデザインのディレクターで
ある

8
00:00:36.749 --> 00:00:38.891
ブレイク・ベネスさんに来ていただいてます

9
00:00:40.032 --> 00:00:44.656
ということで日本に来ていただいてありがとうございまし

10
00:00:45.517 --> 00:00:48.579
た今日はちょっとこのお二人にまた後でいろいろと聞こうと
いう

(snip)

単語ごとのタイムスタンプ、日本語の場合はなんとなくトークン単位で分割されているように思える。

jq -r .result.transcription.utterances result.json

[
  {
    "words": [
      {
        "word": "はい",
        "start": 0.876,
        "end": 1.096,
        "confidence": 0.45
      },
      {
        "word": "じゃあ",
        "start": 1.356,
        "end": 1.696,
        "confidence": 0.46
      },
      {
        "word": "始め",
        "start": 1.776,
        "end": 1.976,
        "confidence": 0.43
      },
      {
        "word": "ます",
        "start": 2.056,
        "end": 2.336,
        "confidence": 0.86
      },
      {
        "word": "ちょっと",
        "start": 2.356,
(snip)

で、上記の翻訳もそれぞれ含まれている。

jq -r .result.translation.results[1].full_transcript result.json

Okay, then let's begin. There are still some people who haven't arrived yet, but we'll start Voice Lunch JP. Thank you all for gathering on Sunday. Today, for the first time in a while, we're having an offline event. We have two special guests today. We have Braden Ream, who is the Voiceflow spec, and Blake Benes, who is the Director of Conversational Design at Salesforce. Thank you for coming to Japan. We have a corner where we'll be asking them various questions later, so I'm looking forward to hearing from them. Today's agenda is... we're a little behind schedule, but first, we'll talk about Voice Lunch JP. Then, we'll explain the venue. In the first session, I'll talk about Voiceflow's new features for 2022. After that, in the second session, we'll have a Q&A with Braden and Blake about conversational design. We'll end at 3:00 PM. We'll have a commemorative photo, which is a must for Voice Lunch JP. After that, we'll have a casual get-together with snacks and drinks for about an hour. About Voice Lunch JP, Voice Lunch is a global community for people who are actually involved in voice UI and voice-related technologies, or who are interested in them. Voice Lunch JP is the Japanese region of Voice Lunch. We've been doing this for a long time, both online and offline, sharing information about voice design and technology, and working together to grow the industry. Today's hashtag is #VoiceLunchJP, so feel free to use it. The venue is provided by Granica. Thank you very much. Please share this. We're also doing a lot of streaming, so we're very grateful. Due to the current situation with COVID-19, not many people are able to come to the venue. Normally, they display IoT devices and gadgets here, so if you have the chance, please come and experience them. The restrooms are over here. For those who smoke, the smoking area is over there. Please feel free to use them. Okay, that's it for the initial greetings. Now, let's start with my session. It's called "Voiceflow Update 2022," and I'll be talking about this year's new features. This is my introduction. My name is Shimizu. I'm an infrastructure engineer in Kobe, so I usually work with Kubernetes, AWS, Terraform, etc. Recently, I'm going to be a freelancer. I checked a little, and it seems Voiceflow was first started around the beginning of 2019, so it's been about 4 years, a little less. I've been touching a lot of things. Later, in the voice-related community, Voice Lunch AP. This one, right? Besides that, we have the AJAG Amazon Alexa Japan User Group, and also the Voicelo Japanese User Group, so we're doing something called VFJUG. The Japanese community is on Facebook, so if you'd like, please take a look. Also, about two years ago, we were at a technical bookstore, and we were talking with everyone here today, the staff, and we decided to make a doujinshi together. But it's been about two years, and the content is quite outdated, so we've already stopped selling it. I wanted to bring it today, but I forgot. Yes, so I'm doing this too.

jq -r .result.translation.results[1].subtitles[1].subtitles result.json

WEBVTT

1
00:00:00.876 --> 00:00:07.598
Okay, then let's begin. There are still
some people who haven't arrived yet, but

2
00:00:08.118 --> 00:00:09.998
we'll start Voice Lunch

3
00:00:10.579 --> 00:00:17.480
JP. Thank you all for gathering on Sunday.
Today,

4
00:00:17.581 --> 00:00:22.742
for the first time in a while, we're
having an

5
00:00:23.442 --> 00:00:25.503
offline event. We have

6
00:00:27.143 --> 00:00:31.405
special guests today. We have who is the
Voiceflow

7
00:00:32.586 --> 00:00:35.868
and Blake Benes, who

8
00:00:37.009 --> 00:00:38.891
the Director of Conversational

9
00:00:40.032 --> 00:00:44.656
Design at Salesforce. Thank you

10
00:00:45.517 --> 00:00:48.579
for coming to Japan. We have a corner

jq -r .result.translation.results[1].utterances result.json

[
  {
    "words": [
      {
        "word": "Okay,",
        "start": 0.876,
        "end": 1.096,
        "confidence": 0.45
      },
      {
        "word": " then",
        "start": 1.356,
        "end": 1.696,
        "confidence": 0.46
      },
      {
        "word": " let's",
        "start": 1.776,
        "end": 1.976,
        "confidence": 0.43
      },
      {
        "word": " begin.",
        "start": 2.056,
        "end": 2.336,
        "confidence": 0.86
      },
      {
        "word": " There",
        "start": 2.356,
(snip)

これだけ出力してくれると、動画の字幕をマルチリンガルでつけたりとかも一発でできそう。

kun432

バッチ文字起こしの機能を少し確認しておく

句読点の付与

文字起こし生成時に有効化する。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "punctuation_enhanced": true
}

単語レベルのタイムスタンプ

これはデフォルトで有効になっている。出力は1つ前のところを参照。

文章単位に分割

全文の文字起こし以外に、文単位で結果を出力することができる。単語レベルのタイムスタンプと同じような出力になるが、文章単位になる。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sentences": true
}

言語の検出

文字起こしする言語の自動判定。これもデフォルトで有効化されている。明示的に指定する場合は以下。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "detect_language": true
}

なお、自動判定せずに、手動で指定することもできる

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "detect_language": false,
    "language": "ja"
}

複数言語の検出（Code Switching）

1つの音声ファイルの中に複数の言語が含まれている場合、それぞれの言語を検出して文字起こしを切り替えるにはenable_code_switchingを有効化する。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "enable_code_switching": true
}

言語のリストを指定することもできる。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "enable_code_switching": true,
    "code_switching_config": {
        "languages": ["en", "es", "fr"]
    }
}

字幕ファイルフォーマット

SRTやVRTで出力することができる。1つ前の例はミニマムだけど、字幕の単位やフォーマットなんかも指定できるっぽい。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "subtitles": true,
    "subtitles_config": {
      "formats": ["srt", "vtt"],
      "minimum_duration": 1,
      "maximum_duration": 5,
      "maximum_characters_per_row": 42,
      "maximum_rows_per_caption": 2,
      "style": "compliance"
    }
}

コンテキストプロンプト

音声データの内容をプロンプトで与えることができる。おそらく文字起こしの精度が向上するのだと思う。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "context_prompt": "ゲーム・オブ・スローンズシリーズのサンサ・スタークとピーター・ベーリッシュの会話。"
}

カスタム語彙・スペル

特定の単語やフレーズをカスタム語彙として指定することで、文字起こしに反映させる。ここはサンプルのまんま。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "custom_vocabulary": true,
    "custom_vocabulary_config": {
        "vocabulary": [
            "Westeros", 
            {"value": "Stark"}, 
            { 
                "value": "Night's Watch",
                "pronunciations": ["Nightz Vatch"],
                "intensity": 0.4,
                "language": "de"
            }
        ],
        "default_intensity": 0.6
    }
}

また特定の発音に対するスペルを指定することもできる。

{
    "custom_spelling": true,
    "custom_spelling_config": {
        "spelling_dictionary": {
            "Gorish": ["ghorish", "gaurish", "gaureish"],
            "Data Science": ["data-science", "data science"],
            ".": ["period", "full stop"],
            "SQL": ["sequel"]
        }
    }
}

名前の一貫性

name_consistencyを有効にすると、名称を文字起こし内で統一してくれるみたい。人名などに有効。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "name_consistency": true
}

デュアルチャネル・複数チャネルの文字起こし

オーディオファイルが複数のチャネルで構成される場合、それぞれのチャネルごとに文字起こしが自動で行われる。ただし、それぞれのチャネルで発話している内容が違う場合には、そのチャネル数の分だけ料金が発生する。チャネルが複数あっても同じ発話なら1チャネル分の料金になるらしい。

メタデータ

リクエスト時にメタデータを付与しておくと、そのメタデータで結果を取得できる。

話者分離

diarizationを有効にしてリクエストすると、それぞれの話者ごとにIDが付与されて結果が返される。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "diarization": true
}

話者分離が難しいような音源？の場合、"enhanced dialization"を有効にすることで精度が上る可能性があるらしい。

{
    "audio_url": "https://api.gladia.io/file/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "diarization": true,
    "diarization_config": {
        "enhanced": true
    }
}

また、事前に話者数などがわかっている場合はそれを指定すると精度が上る可能性があるらしい。

{
    "diarization": true,
    "diarization_config": {
      "number_of_speakers": 3,
      "min_speakers": 1,
      "max_speakers": 5
    }
}

ここでは触れていないが、最初の例では翻訳なども行っていた。そのあたりは後ほど。

kun432

Real-time Speech-to-Text

リアルタイムの文字起こし。WebSocketを使用して行う様子。ドキュメントのサンプルコードはJavaScriptなのだけど、Pythonはないかな？と思って探してみたらちゃんとあった。

バッチとリアルタイムの両方のサンプルが用意されている。今回はリアルタイムの方を見てみる。

レポジトリクローン

git clone https://github.com/gladiaio/gladia-samples && cd gladia-samples

Pythonサンプルのソースがあるディレクトリに移動

cd python/src

仮想環境を作成する。

uv init -p 3.12.9

パッケージインストール

uv add -r requirements.txt

出力

 + certifi==2025.1.31
 + charset-normalizer==3.4.1
 + idna==3.10
 + pyaudio==0.2.14
 + python-dotenv==1.0.1
 + requests==2.32.3
 + urllib3==2.4.0
 + websockets==13.1

リアルタイムのサンプルはstreamingディレクトリにある

ls streaming/

出力

live-from-file-with-resume.py		live-from-microphone-with-resume.py
live-from-file.py			live-from-microphone.py

live-from-file.py: 音声ファイルから文字起こし
live-from-file-with-resume.py: 音声ファイルから文字起こし、レジューム機能付き
live-from-microphone.py: マイクからのリアルタイム文字起こし
live-from-microphone-with-resume.py: マイクからのリアルタイム文字起こし

リアルタイムは live-from-microphone.py が良さそう。マイクの設定などを自動で取得するように修正してみた。

live-from-microphone.py

import asyncio
import base64
import json
import signal
import sys
from datetime import time
from typing import Literal, TypedDict

import pyaudio
import requests
from websockets.asyncio.client import ClientConnection, connect
from websockets.exceptions import ConnectionClosedOK

## 定数
GLADIA_API_URL = "https://api.gladia.io"


## 型定義

class InitiateResponse(TypedDict):
    """初期化レスポンス"""
    id: str
    url: str

class LanguageConfiguration(TypedDict):
    """言語設定"""
    languages: list[str] | None
    code_switching: bool | None

class StreamingConfiguration(TypedDict):
    """ストリーミング設定
    以下は限定されたオプションの一部です。全てのオプションについては、APIドキュメントを参照してください。
    https://docs.gladia.io/ja/api-reference/v2/live/init
    """
    encoding: Literal["wav/pcm", "wav/alaw", "wav/ulaw"]
    bit_depth: Literal[8, 16, 24, 32]
    sample_rate: Literal[8_000, 16_000, 32_000, 44_100, 48_000]
    channels: int
    language_config: LanguageConfiguration | None


## ヘルパー関数

def get_gladia_key() -> str:
    """Gladia APIキーを取得"""
    if len(sys.argv) != 2 or not sys.argv[1]:
        print("You must provide a Gladia key as the first argument.")
        exit(1)
    return sys.argv[1]


# デフォルトのマイク設定を取得
def get_mic_settings():
    """デフォルトマイクの設定を取得する"""
    p = pyaudio.PyAudio()
    # デフォルト入力デバイスのインデックスを取得
    default_input_device_index = p.get_default_input_device_info()["index"]
    device_info = p.get_device_info_by_index(default_input_device_index)
    
    sample_rate = int(device_info["defaultSampleRate"])
    channels = int(device_info["maxInputChannels"])
    
    # チャンネル数をチェック
    if channels <= 0:
        raise ValueError("使用可能な入力チャンネルがありません。マイクが接続されているか確認してください。")
    
    # 情報を表示
    print(f"デフォルトマイク: {device_info['name']}")
    print(f"  最大入力チャンネル: {channels}")
    print(f"  デフォルトサンプルレート: {sample_rate} Hz")
    
    # フレームバッファサイズを計算（約100msの音声データ）
    frames_per_buffer = int(sample_rate * 0.1)
    
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "format": pyaudio.paInt16,  # 16ビット整数形式
        "frames_per_buffer": frames_per_buffer
    }


def init_live_session(config: StreamingConfiguration) -> InitiateResponse:
    """ライブセッションを開始"""
    gladia_key = get_gladia_key()
    response = requests.post(
        f"{GLADIA_API_URL}/v2/live",
        headers={"X-Gladia-Key": gladia_key},
        json=config,
        timeout=3,
    )
    if not response.ok:
        print(f"{response.status_code}: {response.text or response.reason}")
        exit(response.status_code)
    return response.json()


def format_duration(seconds: float) -> str:
    """秒をHH:MM:SS.mmm形式に変換"""
    milliseconds = int(seconds * 1_000)
    return time(
        hour=milliseconds // 3_600_000,
        minute=(milliseconds // 60_000) % 60,
        second=(milliseconds // 1_000) % 60,
        microsecond=milliseconds % 1_000 * 1_000,
    ).isoformat(timespec="milliseconds")


async def print_messages_from_socket(socket: ClientConnection) -> None:
    """WebSocketからメッセージを受信して、最終的なテキストを出力"""
    async for message in socket:
        content = json.loads(message)
        if content["type"] == "transcript" and content["data"]["is_final"]:
            start = format_duration(content["data"]["utterance"]["start"])
            end = format_duration(content["data"]["utterance"]["end"])
            text = content["data"]["utterance"]["text"].strip()
            print(f"{start} --> {end} | {text}")
        if content["type"] == "post_final_transcript":
            print("\n################ セッション終了 ################\n")
            print(json.dumps(content, indent=2, ensure_ascii=False))


async def stop_recording(websocket: ClientConnection) -> None:
    """録音を停止"""
    print(">>>>> 録音を終了します...")
    await websocket.send(json.dumps({"type": "stop_recording"}))
    await asyncio.sleep(0)


async def send_audio(socket: ClientConnection) -> None:
    """音声を送信"""
    stream = P.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )

    while True:
        data = stream.read(FRAMES_PER_BUFFER)
        data = base64.b64encode(data).decode("utf-8")
        json_data = json.dumps({"type": "audio_chunk", "data": {"chunk": str(data)}})
        try:
            await socket.send(json_data)
            await asyncio.sleep(0.1)  # Send audio every 100ms
        except ConnectionClosedOK:
            return


## サンプルコード
P = pyaudio.PyAudio()

# マイク設定を取得
mic_settings = get_mic_settings()

# Gladiaがサポートするサンプルレートを確認
supported_rates = StreamingConfiguration.__annotations__["sample_rate"].__args__
sample_rate = mic_settings["sample_rate"]

# マイクのサンプルレートがサポートされているか確認
if sample_rate not in supported_rates:
    # 最も近いサポートされているサンプルレートを選択
    closest_rate = min(supported_rates, key=lambda x: abs(x - sample_rate))
    print(f"\u8b66告: マイクのサンプルレート {sample_rate} Hz はサポートされていません")
    print(f"  代わりに {closest_rate} Hz を使用します")
    sample_rate = closest_rate

# ビット深度は16ビットに固定
bit_depth = 16

# ビット深度がサポートされているか確認
supported_bit_depths = StreamingConfiguration.__annotations__["bit_depth"].__args__
if bit_depth not in supported_bit_depths:
    raise ValueError(f"ビット深度 {bit_depth} はサポートされていません。サポートされている値: {supported_bit_depths}")

# エンコーディングは固定
encoding = "wav/pcm"

# エンコーディングがサポートされているか確認
supported_encodings = StreamingConfiguration.__annotations__["encoding"].__args__
if encoding not in supported_encodings:
    raise ValueError(f"エンコーディング {encoding} はサポートされていません。サポートされている値: {supported_encodings}")

# チャンネル数はマイクの設定から取得
channels = mic_settings["channels"]

print(f"使用する設定:")
print(f"  サンプルレート: {sample_rate} Hz")
print(f"  ビット深度: {bit_depth} bit")
print(f"  エンコーディング: {encoding}")
print(f"  チャンネル数: {channels}")
print(f"  フレームバッファサイズ: {mic_settings['frames_per_buffer']}")

# PyAudio設定
FORMAT = mic_settings["format"]
CHANNELS = channels
SAMPLE_RATE = sample_rate
FRAMES_PER_BUFFER = mic_settings["frames_per_buffer"]

# Gladia API設定
STREAMING_CONFIGURATION: StreamingConfiguration = {
    "encoding": encoding,
    "sample_rate": sample_rate,
    "bit_depth": bit_depth,
    "channels": channels,
    "language_config": {
        "languages": [],
        "code_switching": True,
    }
}

async def main():
    """メイン処理"""
    response = init_live_session(STREAMING_CONFIGURATION)
    async with connect(response["url"]) as websocket:
        print("\n################ セッション開始 ################\n")
        loop = asyncio.get_running_loop()
        loop.add_signal_handler(
            signal.SIGINT,
            loop.create_task,
            stop_recording(websocket),
        )

        send_audio_task = asyncio.create_task(send_audio(websocket))
        print_messages_task = asyncio.create_task(print_messages_from_socket(websocket))
        await asyncio.wait(
            [send_audio_task, print_messages_task],
        )


if __name__ == "__main__":
    asyncio.run(main())

では実行。コマンドライン引数でGladiaのAPIキーをセットする。

uv run streaming/live-from-microphone.py XXXXXXXXXX

以下の記事を参考に、新幹線のアナウンスを読み上げてみた結果はこんな感じ。

出力

デフォルトマイク: Jabra SPEAK 510 USB
  最大入力チャンネル: 1
  デフォルトサンプルレート: 16000 Hz
使用する設定:
  サンプルレート: 16000 Hz
  ビット深度: 16 bit
  エンコーディング: wav/pcm
  チャンネル数: 1
  フレームバッファサイズ: 1600

################ セッション開始 ################

00:00:02.556 --> 00:00:06.564 | 今日も新幹線をご利用くださいましてありがとうございます。
00:00:07.392 --> 00:00:10.408 | この電車はのぞみ号東京行きです
00:00:11.108 --> 00:00:13.484 | 途中の停車駅は京都
00:00:13.608 --> 00:00:16.944 | 名古屋、新横浜、品川です
^C>>>>> 録音を終了します...

################ セッション終了 ################

{
  "type": "post_final_transcript",
  "session_id": "c7db9295-3f54-46f3-beba-c3802c7e9b30",
  "created_at": "2025-04-11T12:07:34.313Z",
  "data": {
    "metadata": {
      "audio_duration": 20,
      "billing_time": 20,
      "number_of_distinct_channels": 1,
      "transcription_time": 22.518
    },
    "transcription": {
      "languages": [
        "ja"
      ],
      "utterances": [
        {
          "text": "今日も新幹線をご利用くださいましてありがとうございます。",
          "start": 2.556,
          "end": 6.564,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "今日も新幹線をご利用くださいましてありがとうございます。",
              "start": 2.556,
              "end": 6.564,
              "confidence": 1
            }
          ]
        },
        {
          "text": "この電車はのぞみ号東京行きです",
          "start": 7.392,
          "end": 10.408,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "この電車はのぞみ号東京行きです",
              "start": 7.392,
              "end": 10.408,
              "confidence": 1
            }
          ]
        },
        {
          "text": "途中の停車駅は京都",
          "start": 11.108,
          "end": 13.484,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "途中の停車駅は京都",
              "start": 11.108,
              "end": 13.484,
              "confidence": 1
            }
          ]
        },
        {
          "text": "名古屋、新横浜、品川です",
          "start": 13.608,
          "end": 16.944,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "名古屋、新横浜、品川です",
              "start": 13.608,
              "end": 16.944,
              "confidence": 1
            }
          ]
        }
      ],
      "full_transcript": "今日も新幹線をご利用くださいましてありがとうございます。この電車はのぞみ号東京行きです途中の停車駅は京都名古屋、新横浜、品川です"
    }
  }
}

少し修正して、翻訳も同時に行えるようにしてみる。

(snip)

class TranslationConfiguration(TypedDict):
    target_languages: list[str] | None

class RealtimeProcessing(TypedDict):
    translation: bool | None
    translation_config: TranslationConfiguration | None

class StreamingConfiguration(TypedDict):
    encoding: Literal["wav/pcm", "wav/alaw", "wav/ulaw"]
    bit_depth: Literal[8, 16, 24, 32]
    sample_rate: Literal[8_000, 16_000, 32_000, 44_100, 48_000]
    channels: int
    language_config: LanguageConfiguration | None
    realtime_processing: RealtimeProcessing | None

(snip)

async def print_messages_from_socket(socket: ClientConnection) -> None:
    async for message in socket:
        content = json.loads(message)
        if content["type"] == "transcript" and content["data"]["is_final"]:
            start = format_duration(content["data"]["utterance"]["start"])
            end = format_duration(content["data"]["utterance"]["end"])
            text = content["data"]["utterance"]["text"].strip()
            print(f"{start} --> {end} | {text}")
        # 翻訳の場合を追加
        if content["type"] == "translation":
            start = format_duration(content["data"]["utterance"]["start"])
            end = format_duration(content["data"]["utterance"]["end"])
            language = content["data"]["target_language"]
            translation = content["data"]["translated_utterance"]["text"].strip()
            print(f"{start} --> {end} | ({language}) {translation}")
        # ここまで
        if content["type"] == "post_final_transcript":
            print("\n################ セッション終了 ################\n")
            print(json.dumps(content, indent=2, ensure_ascii=False))

(snip)

STREAMING_CONFIGURATION: StreamingConfiguration = {
    "encoding": encoding,
    "sample_rate": sample_rate,
    "bit_depth": bit_depth,
    "channels": channels,
    "language_config": {
        "languages": ["ja"],
        "code_switching": True,
    },
    "realtime_processing": {
        "translation": True,
        "translation_config": {
          "target_languages": ["en", "ko"]
        }
    }
}
(snip)

再度実行してみた結果。

デフォルトマイク: Jabra SPEAK 510 USB
  最大入力チャンネル: 1
  デフォルトサンプルレート: 16000 Hz
使用する設定:
  サンプルレート: 16000 Hz
  ビット深度: 16 bit
  エンコーディング: wav/pcm
  チャンネル数: 1
  フレームバッファサイズ: 1600

################ セッション開始 ################

00:00:04.224 --> 00:00:08.135 | 今日も新幹線をご利用くださいましてありがとうございます。
00:00:04.224 --> 00:00:08.135 | (en) Thank you for using the Shinkansen today.
00:00:04.224 --> 00:00:08.135 | (ko) 오 늘 도 칸 센 을 용 해 셔 서 사 합 니 다.
00:00:09.793 --> 00:00:12.649 | この電車はのぞみ号東京行きです。
00:00:09.793 --> 00:00:12.649 | (en) This train is the Nozomi express bound for Tokyo.
00:00:09.793 --> 00:00:12.649 | (ko) こ の 電 車 は の ぞ み 号 東 京 行 き で す。
00:00:13.725 --> 00:00:16.229 | 途中の停車駅は京都
00:00:13.725 --> 00:00:16.229 | (en) The stop along the way is Kyoto.
00:00:13.725 --> 00:00:16.229 | (ko) 중 간 에 차 하 는 은 토 입 니 다.
00:00:16.513 --> 00:00:19.881 | 名古屋、新横浜、品川です
00:00:16.513 --> 00:00:19.881 | (en) Nagoya, Shin-Yokohama, Shinagawa.
00:00:16.513 --> 00:00:19.881 | (ko) 名 古 屋 、 横 浜 、 川 で す。
^C>>>>> 録音を終了します...

################ セッション終了 ################

{
  "type": "post_final_transcript",
  "session_id": "9da04472-1dfe-4eeb-9248-b9e7bdb9745c",
  "created_at": "2025-04-11T12:39:59.607Z",
  "data": {
    "metadata": {
      "audio_duration": 25,
      "billing_time": 25,
      "number_of_distinct_channels": 1,
      "transcription_time": 27.431
    },
    "transcription": {
      "languages": [
        "ja"
      ],
      "utterances": [
        {
          "text": "今日も新幹線をご利用くださいましてありがとうございます。",
          "start": 4.224,
          "end": 8.136,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "今日も新幹線をご利用くださいましてありがとうございます。",
              "start": 4.224,
              "end": 8.136,
              "confidence": 1
            }
          ]
        },
        {
          "text": "この電車はのぞみ号東京行きです。",
          "start": 9.793,
          "end": 12.649,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "この電車はのぞみ号東京行きです。",
              "start": 9.793,
              "end": 12.649,
              "confidence": 1
            }
          ]
        },
        {
          "text": "途中の停車駅は京都",
          "start": 13.725,
          "end": 16.229,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "途中の停車駅は京都",
              "start": 13.725,
              "end": 16.229,
              "confidence": 1
            }
          ]
        },
        {
          "text": "名古屋、新横浜、品川です",
          "start": 16.513,
          "end": 19.881,
          "language": "ja",
          "confidence": 1,
          "channel": 0,
          "words": [
            {
              "word": "名古屋、新横浜、品川です",
              "start": 16.513,
              "end": 19.881,
              "confidence": 1
            }
          ]
        }
      ],
      "full_transcript": "今日も新幹線をご利用くださいましてありがとうございます。この電車はのぞみ号東京行きです。途中の停車駅は京都名古屋、新横浜、品川です"
    }
  }
}

日本語がベースの場合、韓国語や中国語への翻訳はなんか失敗することがちらほらあるが、一応複数言語同時翻訳的なことができている。

kun432

リアルタイム文字起こしの機能については以下
https://docs.gladia.io/chapters/live-stt/features
概ねバッチ文字起こしと似たような感じだが、
言語の検出
複数言語の検出（Code Switching）
単語レベルのタイムスタンプ
カスタム語彙・スペル
翻訳
デュアルチャネル・複数チャネルの文字起こし
に対応していて、バッチに比べると少ないかも知れない。

kun432

 その他上で何度か試しているけど翻訳
https://docs.gladia.io/chapters/audio-intelligence/pages/translation
要約
https://docs.gladia.io/chapters/audio-intelligence/pages/summarization
名前付きエンティティの検出
https://docs.gladia.io/chapters/audio-intelligence/pages/named entity recognition
感情分析。おそらく音声ではなくて文字起こし結果からだと思う。
https://docs.gladia.io/chapters/audio-intelligence/pages/sentiment analysis
コンテンツモデレーション
https://docs.gladia.io/chapters/audio-intelligence/pages/moderation
複数のチャプターに分割して、ヘッドラインと要約をつけてくれる
https://docs.gladia.io/chapters/audio-intelligence/pages/chapterization
音声データに対してLLMでチャットができる
https://docs.gladia.io/chapters/audio-intelligence/pages/audio to llm
このあたりはLLMと連携してる感じのものが多い印象

kun432

でここまでやったことはPlayGroundでもできる。お手軽にできるのでとりあえず精度確かめるとかでいいんじゃないだろうか。

kun432

 まとめ新しいモデル「Solaria」を使っているのかどうかがどこにも明記されていないのだけど、Discordあたりをみるとデフォルトモデルが今は「Solaria」になっているらしい。
でそれはともかくとして、謳っている音声認識の高速性というのは、自分が過去試したストリーミング文字起こしのソリューションと比較しても、めちゃめちゃ速いとまでは感じなかった。レスポンスを見てたんだけど、他社でよくある「中間認識結果」がどうも返されずis_finalな最終認識結果だけが返ってくるように見えているので、ここは体感として大きいかもしれない。
品質的にはどうだろう？ElevenLabsのScribeはかなり良かったのでそれに比べるとやや劣るような印象を受けた。試行数1なので参考にはならないかもだけど。とりあえずWERのデータを公開して欲しいなと感じる。
あとPythonで書いてみたのだけど、Gladia固有のSDKを使わないで済む反面、WebSocketを書くのはちょっとしんどいなと思ってしまった。Whisperライクなインタフェースがあればなぁと感じた。
とはいえ、自分は同時翻訳はかなりすごいと感じた。失敗する場合もあるけども、文字起こししつつ同時に複数言語にほぼほぼリアルタイム翻訳できるってのは他には見ないのではないだろうか。これが必要なユースケースは必ずありそう。多国籍会議なんかだと良さそう。。

このスクラップは5ヶ月前にクローズされました