🎙️

[Whisper] Easy offline speech recognition, even without a GPU

Published on 2022/12/07

C++

Whisper

tech

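I came across a repository that makes it easy to try Whisper, OpenAI's high-performance speech recognition model, offline and without a GPU, so here is a quick introduction.

My environment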

Mac mini (M1, 2020)
Chip Apple M1
Memory 16 GB
macOS Monterey Version 12.6

Trying it out

Transcribing an audio file

Clone the repository

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

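Download a model

The available models are tiny, base, small, medium, and large. Accuracy improves from left to right, but so do the file size and the amount of memory used. See https://github.com/ggerganov/whisper.cpp#memory-usage for details.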

bash ./models/download-ggml-model.sh base
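As a minimal sketch, the download step can be wrapped so that a model is fetched only when it is missing. The helper names `model_path` and `fetch_model` are hypothetical, and the `models/ggml-<name>.bin` naming is an assumption based on the `models/ggml-base.bin` file used in this article.

```shell
# Map a model name to the file the download script is assumed to create.
# Assumption: files are named models/ggml-<name>.bin (as with ggml-base.bin).
model_path() {
  case "$1" in
    tiny|base|small|medium|large) printf 'models/ggml-%s.bin\n' "$1" ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Download the model only if its file does not exist yet, then print the path.
# Download progress goes to stderr so that stdout holds just the path.
fetch_model() {
  model_file="$(model_path "$1")" || return 1
  if [ ! -f "$model_file" ]; then
    bash ./models/download-ggml-model.sh "$1" >&2
  fi
  echo "$model_file"
}
```

With these helpers, something like `fetch_model small` would print `models/ggml-small.bin`, and the result can be passed to the `-m` option.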

Build

make

Run

Note that, at the time of writing, only 16-bit WAV files can be read.
If needed, convert your audio with a command such as the following.

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
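In the command above, `-ar 16000` sets the sample rate to 16 kHz, `-ac 1` downmixes to mono, and `-c:a pcm_s16le` selects 16-bit signed little-endian PCM. If there are several files to convert, the same command can be looped over a directory. The following is only a sketch; `wav_name` and `convert_all` are hypothetical helper names, and it assumes the inputs are `.mp3` files.

```shell
# Derive the output file name by swapping the extension: talk.mp3 -> talk.wav
wav_name() {
  printf '%s.wav\n' "${1%.*}"
}

# Convert every .mp3 file in the given directory to 16 kHz mono 16-bit WAV,
# using the same ffmpeg options as the single-file command.
convert_all() {
  for src in "$1"/*.mp3; do
    [ -e "$src" ] || continue   # the glob matched nothing
    ffmpeg -i "$src" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name "$src")"
  done
}
```

Calling `convert_all ~/Music` would then write a `.wav` file next to each `.mp3` file in that directory.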

./main -l ja -f ~/Music/output.wav -m models/ggml-base.bin

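Option	Description
-l	Language to recognize
-f	Audio file to read
-m	Model to use

That is all it takes to get a transcription.
There is also a -tr option; adding it translates the result into English (impressive).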
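To switch between transcription and translation without retyping the whole command, a small wrapper can help. This is a sketch under stated assumptions: `run_whisper` and the `WHISPER_DRY_RUN` variable are hypothetical names introduced here, and the language and model are fixed to the values used in this article.

```shell
# Run ./main on a WAV file; pass "translate" as the second argument to add -tr.
# Set WHISPER_DRY_RUN=1 to print the command instead of running it.
run_whisper() {
  wav="$1"
  mode="${2:-transcribe}"
  # Build the command in the positional parameters so paths with spaces survive.
  set -- ./main -l ja -f "$wav" -m models/ggml-base.bin
  if [ "$mode" = "translate" ]; then
    set -- "$@" -tr
  fi
  if [ "${WHISPER_DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `run_whisper ~/Music/output.wav translate` would run the same command as before with `-tr` appended.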

Output

whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing '/Users/kaisei/Music/output.wav' (96597 samples, 6.0 sec), 4 threads, 1 processors, lang = ja, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:05.740]  音声認識テスト中


whisper_print_timings:     load time =   121.41 ms
whisper_print_timings:      mel time =    13.86 ms
whisper_print_timings:   sample time =     1.96 ms
whisper_print_timings:   encode time =   371.95 ms / 61.99 ms per layer
whisper_print_timings:   decode time =    35.89 ms / 5.98 ms per layer
whisper_print_timings:    total time =   545.44 ms
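The timings above make it easy to estimate how fast this is relative to the length of the audio. As a rough sketch, the real-time factor is the processing time divided by the audio duration. The `rtf` helper is a hypothetical name, and the total time in this log appears to include the model load time.

```shell
# Real-time factor = processing time / audio duration (lower is faster).
# $1: total time in milliseconds, $2: audio length in seconds
rtf() {
  awk -v total_ms="$1" -v audio_s="$2" \
    'BEGIN { printf "%.3f\n", total_ms / (audio_s * 1000) }'
}

rtf 545.44 6.0   # values taken from the log above; prints 0.091
```

In other words, the 6.0 second clip was processed in about 0.55 seconds, roughly 11 times faster than real time on this machine with the base model.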

Real-time speech recognition

Real-time speech recognition was just as easy.

Requirements

SDL is apparently required, so install it with a package manager.

brew install sdl2

Build

make stream

Run

./stream -l ja -m models/ggml-base.bin

Output

audio_sdl_init: found 1 capture devices:
audio_sdl_init:    - Capture device #0: 'AT2020USB+'
audio_sdl_init: attempt to open default capture device ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init:     - sample rate:       16000
audio_sdl_init:     - format:            33056 (required: 33056)
audio_sdl_init:     - channels:          1 (required: 1)
audio_sdl_init:     - samples per frame: 1024
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

main: processing 48000 samples (step = 3.0 sec / len = 10.0 sec), 4 threads, lang = ja, task = transcribe, timestamps = 0 ...
main: n_new_line = 2

日本語の音声にしき
サクランボ
こんにちは
コンバンは
GPUがなくてもオフラインで
簡単音声認識
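The log line "processing 48000 samples (step = 3.0 sec / len = 10.0 sec)" follows from the 16 kHz sample rate: a 3.0 second step is 3.0 × 16000 = 48000 samples. The sketch below reproduces that arithmetic; `step_samples` is a hypothetical helper name. The commented-out command assumes that `stream` accepts `--step` and `--length` in milliseconds, as in the repository's README examples, so check `./stream --help` for your version before relying on it.

```shell
# Number of samples in one step at a 16 kHz sample rate.
# $1: step size in milliseconds
step_samples() {
  echo $(( $1 * 16000 / 1000 ))
}

step_samples 3000   # a 3.0 sec step; prints 48000

# Assumption: --step and --length take milliseconds in your build of stream.
# ./stream -l ja -m models/ggml-base.bin --step 3000 --length 10000
```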

It really is very easy to set up, so please give it a try!
