
[WhisperKit] Transcribing Audio on a MacBook Air ✍️

Published 2024/03/31

Swift

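Tsutsumi-san (@shu223), the owner of the "Engineers and Life Community" (エンジニアと人生コミュニティ) that I belong to, introduced WhisperKit at the Sansan mobile engineer LT meetup. I wasn't able to watch the talk itself, but the published slides caught my interest so much that I tried it out.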

What I tried

Transcribing an audio file on a MacBook Air
Transcribing streaming audio input from the microphone

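First, what is WhisperKit?

WhisperKit is a Swift package that integrates OpenAI's speech recognition model Whisper with Apple's Core ML framework, enabling efficient local inference on iPhone and Mac. Its main features are as follows.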

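Whisper inference can be run in iOS and macOS apps with just a few lines of code
The GPU and Neural Engine can be combined flexibly
- On iPhone, using the Neural Engine alone achieves the best energy efficiency and low latency
- On Mac, the GPU and Neural Engine can be used in parallel to maximize batch-processing throughput
Model accuracy has been validated on benchmarks such as librispeech
whisperkittools, a Python tool for optimizing and evaluating Whisper on a Mac, is also provided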

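With WhisperKit, audio from videos and live streams can be converted to text in real time. Downloading and using a model suited to a specific language can also be expected to give more accurate results.

By: Claude 3 Opus

...or so I'm told.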

Installing it

What is `whisperkit-cli`?

A tool for using WhisperKit from the command line.
It lets you transcribe audio files directly from the terminal.

Basically, following the steps in this README is all you need 👌
https://github.com/argmaxinc/WhisperKit

First, install whisperkit-cli

brew install whisperkit-cli

Next, clone the whisperkit repository in order to set up the model and other resources.

git clone https://github.com/argmaxinc/whisperkit.git
cd whisperkit

Then run the setup using the Makefile.

make setup

Downloading the model you want to use

You can specify any model you like.

About each model

This time I'll download the base model (145MB), which looked like a manageable size.

make download-model MODEL=base

That completes the setup.

Transcription

Preparing an audio file

I didn't have a suitable audio file on hand, so I used ffmpeg to extract the audio from a video and create one.

WhisperKit appears to support the following audio file formats:

wav
mp3
m4a
flac

I went with mp3 this time.
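
As a small sketch, a helper like the one below could check a file's extension against the formats listed above before passing it to whisperkit-cli. The function name `is_supported_audio` is hypothetical, and the extension list is simply the one given in this article; it only looks at the file name, not the actual contents.

```sh
#!/bin/sh
# Hypothetical helper (not part of WhisperKit): returns 0 if the file
# extension is one of wav, mp3, m4a, flac, and 1 otherwise.
is_supported_audio() {
    # Lowercase the name so that "Sample.MP3" is accepted as well.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.wav|*.mp3|*.m4a|*.flac) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage example: check a couple of file names.
for f in sample-audio.mp3 sample-video.mp4; do
    if is_supported_audio "$f"; then
        echo "$f: supported"
    else
        echo "$f: not supported"
    fi
done
```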

Extract the audio from a given video and save it as mp3

ffmpeg -i path-to-your-sample-video.mp4 -vn -ar 44100 -ac 2 -b:a 192k path-to-sample-audio.mp3
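
When converting several videos, it can help to derive the output path from the input path instead of typing it each time. The following is a minimal sketch; `audio_path_for` and `extract_audio` are hypothetical names, and the ffmpeg options are the same ones used in the command above (`-vn` drops the video stream, `-ar` sets the sample rate, `-ac` the number of channels, `-b:a` the audio bitrate).

```sh
#!/bin/sh
# Hypothetical helper: swap the extension of a video path for .mp3.
# Assumes the file name itself contains an extension.
audio_path_for() {
    printf '%s.mp3\n' "${1%.*}"
}

# Hypothetical wrapper around the ffmpeg command shown above.
extract_audio() {
    in=$1
    out=$(audio_path_for "$in")
    ffmpeg -i "$in" -vn -ar 44100 -ac 2 -b:a 192k "$out"
}

# Usage example: show the path the audio would be written to.
audio_path_for "videos/sample-video.mp4"
# To actually run the conversion (requires ffmpeg and a real file):
# extract_audio "videos/sample-video.mp4"
```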

Running the transcription on the audio file

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-base" --audio-path "/Users/sawatari/Desktop/iPhone_App_Development/PoC/WhisperKit-PoC/videos/sample-audio.mp3" --language "ja"

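Options

The --model-path option specifies the path to the downloaded model.
The --audio-path option specifies the path to the audio file.
The --language option specifies the language. (Without it, the audio gets recognized as English.)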

(Screenshot of the execution result)

To save the result to a file

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-base" --audio-path "/Users/sawatari/Desktop/iPhone_App_Development/PoC/WhisperKit-PoC/videos/sample-audio.mp3" --language "ja"  | tee transcribed.txt

The tee command writes whatever it receives on standard input to both standard output and a file. With -a, it appends to the file instead of overwriting it.
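
The difference between overwriting and appending is easy to confirm with a short experiment. This sketch uses a temporary directory and plain `echo` in place of the transcription command, so it can be run without WhisperKit.

```sh
#!/bin/sh
# Work in a temporary directory so nothing in the current directory is touched.
dir=$(mktemp -d)
log="$dir/transcribed.txt"

echo "first run"  | tee "$log"      # creates the file
echo "second run" | tee "$log"      # overwrites: the file now holds only "second run"
echo "third run"  | tee -a "$log"   # appends: the file now holds two lines

# Show what ended up in the file.
cat "$log"
```

Each `echo` line still appears on the terminal, because tee passes its input through to standard output as well as to the file.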

Transcribing streaming audio input

Next, I'll transcribe streaming audio input from the microphone.

Specifying the --stream option lets you transcribe streaming audio input from the microphone.

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-base" --language "ja" --stream |tee transcribed_stream.txt

Result

...
---
Unconfirmed segment: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどうかな?これはどうか?<|endoftext|>
Unconfirmed segment: <|startoftranscript|><|nocaptions|><|endoftext|>
Current text: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これは
---
Unconfirmed segment: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどうかな?これはどうか?<|endoftext|>
Unconfirmed segment: <|startoftranscript|><|nocaptions|><|endoftext|>
Current text: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどう
---
Unconfirmed segment: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどうかな?これはどうか?<|endoftext|>
Unconfirmed segment: <|startoftranscript|><|nocaptions|><|endoftext|>
Current text: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどうかな
---
...

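At present, it seems that every intermediate result of the stream processing gets printed, repeats included. The code that prints this output carries a TODO to that effect: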

// TODO: Print only net new text without any repeats
print("---")
for segment in newState.confirmedSegments {
    print("Confirmed segment: \(segment.text)")
}
for segment in newState.unconfirmedSegments {
    print("Unconfirmed segment: \(segment.text)")
}
print("Current text: \(newState.currentText)")

There is noticeable latency, but I was able to confirm that audio from the microphone can be transcribed as well.
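
Until the CLI prints only new text, the repeated output can be tidied up on the shell side. The filter below is a rough sketch that assumes the output format shown above: it keeps only the "Current text:" lines, strips the `<|...|>` special tokens, and drops consecutive duplicates. The name `clean_stream` is hypothetical. The help output also lists a --skip-special-tokens flag, which may make the token stripping unnecessary, though I have not checked how it behaves together with --stream.

```sh
#!/bin/sh
# Hypothetical filter for the stream output format shown above.
clean_stream() {
    grep '^Current text: ' \
        | sed -e 's/^Current text: //' -e 's/<|[^|]*|>//g' \
        | uniq
}

# Usage example with lines taken from the output above.
printf '%s\n' \
    '---' \
    'Unconfirmed segment: <|startoftranscript|><|nocaptions|><|endoftext|>' \
    'Current text: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これは' \
    '---' \
    'Current text: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>これはどう' \
    | clean_stream
```

With a live stream, the same filter could be placed after the `--stream` command in the pipeline; note that buffering in the pipe may make the text appear in bursts rather than line by line.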

`whisperkit-cli` options

The options of the whisperkit-cli transcribe command can be checked with the following:

whisperkit-cli help transcribe

OVERVIEW: Transcribe audio to text using WhisperKit
USAGE: whisperkit-cli transcribe [<options>] [<supress-tokens> ...]

ARGUMENTS:
  <supress-tokens>        Supress given tokens in the output

OPTIONS:
  --audio-path <audio-path>
                          Path to audio file (default: Tests/WhisperKitTests/Resources/jfk.wav)
  --model-path <model-path>
                          Path of model files
  --model <model>         Model to download if no modelPath is provided
  --download-model-path <download-model-path>
                          Path to save the downloaded model
  --download-tokenizer-path <download-tokenizer-path>
                          Path to save the downloaded tokenizer files
  --audio-encoder-compute-units <audio-encoder-compute-units>
                          Compute units for audio encoder model with {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine,random} (values: all, cpuAndGPU, cpuOnly, cpuAndNeuralEngine, random; default: cpuAndNeuralEngine)
  --text-decoder-compute-units <text-decoder-compute-units>
                          Compute units for text decoder model with {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine,random} (values: all, cpuAndGPU, cpuOnly, cpuAndNeuralEngine, random; default: cpuAndNeuralEngine)
  --verbose               Verbose mode
  --language <language>   Language spoken in the audio
  --temperature <temperature>
                          Temperature to use for sampling (default: 0.0)
  --temperature-increment-on-fallback <temperature-increment-on-fallback>
                          Temperature to increase on fallbacks during decoding (default: 0.2)
  --temperature-fallback-count <temperature-fallback-count>
                          Number of times to increase temperature when falling back during decoding (default: 5)
  --best-of <best-of>     Number of candidates when sampling with non-zero temperature (default: 5)
  --use-prefill-prompt    Force initial prompt tokens based on language, task, and timestamp options
  --use-prefill-cache     Use decoder prefill data for faster initial decoding
  --skip-special-tokens   Skip special tokens in the output
  --without-timestamps    Force no timestamps when decoding
  --word-timestamps       Add timestamps for each word in the output
  --compression-ratio-threshold <compression-ratio-threshold>
                          Gzip compression ratio threshold for decoding failure
  --logprob-threshold <logprob-threshold>
                          Average log probability threshold for decoding failure
  --no-speech-threshold <no-speech-threshold>
                          Probability threshold to consider a segment as silence
  --report                Output a report of the results
  --report-path <report-path>
                          Directory to save the report (default: .)
  --stream                Process audio directly from the microphone
  -h, --help              Show help information.
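
As an illustration of combining several of these options, the sketch below assembles a transcribe command and prints it without running it (a dry run). Every flag used appears in the help output above, but how they interact is an assumption that I have not verified beyond the help text. The function name `transcribe_cmd` and the `reports` directory are hypothetical; the model path is the one used earlier in this article.

```sh
#!/bin/sh
# Hypothetical dry-run helper: prints a whisperkit-cli command line
# for the given audio file and language instead of executing it.
transcribe_cmd() {
    audio=$1
    lang=${2:-ja}   # default to Japanese, as in this article
    printf 'whisperkit-cli transcribe --model-path "%s" --audio-path "%s" --language "%s" --skip-special-tokens --report --report-path "%s"\n' \
        "Models/whisperkit-coreml/openai_whisper-base" "$audio" "$lang" "reports"
}

# Usage example: print the command for a sample file.
transcribe_cmd "videos/sample-audio.mp3"
```

Once the printed command looks right, it can be pasted into the terminal and run as is.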

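Closing thoughts

Being able to do speech-to-text this easily is exciting, since there are all sorts of ways to put it to use 🎉
When transcribing streaming input in real time, I felt there was quite a bit of latency. This may depend on the environment.
Transcription from an audio file, on the other hand, felt quite responsive.
I only tried the base model this time, so I'd like to try the larger models too 💪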
