👋

PythonからWhisper.cppを呼び出す

2023/02/12に公開

タイトル通りです。
PythonでWhisperを使う方法で最も手軽なのは、open-whisperを使うことです。
https://github.com/openai/whisper#setup

ただ、私はM1 Mac miniを使用しているので、そこまでパフォーマンスが出ません。
そこで、以前は次のような記事を書きました。

Apple silicon に最適化されているため、open-whisperを使うよりスピードが出るのですが、Pythonに使い慣れた私にとってC++は少々扱いづらいです。
whisper.cppのREADMEをよく読んでみたところ、bindingsの章にPythonという項目があったので試してみました。

ただし、まだWIPとなっており、今回紹介する方法はissueの中に書かれている中で私の環境で動いたものです。

ビルド

whisper.cppをクローンする。

git clone https://github.com/ggerganov/whisper.cpp.git

次のコマンドでwhiper.cppをビルドし、soファイルを作ります。
ここで作ったsoファイルをPythonで読み込みます。

gcc -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c -fPIC -c ggml.c
g++ -O3 -std=c++11 -pthread --shared -fPIC -static-libstdc++ whisper.cpp ggml.o -o libwhisper.so

また、必要なモデルもダウンロードしといてください。

bash ./models/download-ggml-model.sh base

その他のモデル → https://github.com/ggerganov/whisper.cpp#memory-usage

Python

あとはPythonから読み込むだけです。
次のコードを書いて、必要に応じて音声ファイルとモデルの部分を書き換えると、Pythonから実行することができると思います。
ちなみに、雑に速度を測った時はopen-whisperの約二倍の速さになっていました。

スクリプト

import ctypes
import pathlib

# this is needed to read the WAV file properly from scipy.io import wavfile
from scipy.io import wavfile

libname = "libwhisper.so"
fname_model = "models/ggml-base.bin"
fname_wav = "samples/jfk.wav"

# this needs to match the C struct in whisper.h
class WhisperFullParams(ctypes.Structure):
    _fields_ = [
        ("strategy", ctypes.c_int),
        #
        ("n_max_text_ctx", ctypes.c_int),
        ("n_threads", ctypes.c_int),
        ("offset_ms", ctypes.c_int),
        ("duration_ms", ctypes.c_int),
        #
        ("translate", ctypes.c_bool),
        ("no_context", ctypes.c_bool),
        ("single_segment", ctypes.c_bool),
        ("print_special", ctypes.c_bool),
        ("print_progress", ctypes.c_bool),
        ("print_realtime", ctypes.c_bool),
        ("print_timestamps", ctypes.c_bool),
        #
        ("token_timestamps", ctypes.c_bool),
        ("thold_pt", ctypes.c_float),
        ("thold_ptsum", ctypes.c_float),
        ("max_len", ctypes.c_int),
        ("max_tokens", ctypes.c_int),
        #
        ("speed_up", ctypes.c_bool),
        ("audio_ctx", ctypes.c_int),
        #
        ("prompt_tokens", ctypes.c_void_p),
        ("prompt_n_tokens", ctypes.c_int),
        #
        ("language", ctypes.c_char_p),
        #
        ("suppress_blank", ctypes.c_bool),
        #
        ("temperature_inc", ctypes.c_float),
        ("entropy_thold", ctypes.c_float),
        ("logprob_thold", ctypes.c_float),
        ("no_speech_thold", ctypes.c_float),
        #
        ("greedy", ctypes.c_int * 1),
        ("beam_search", ctypes.c_int * 3),
        #
        ("new_segment_callback", ctypes.c_void_p),
        ("new_segment_callback_user_data", ctypes.c_void_p),
        #
        ("encoder_begin_callback", ctypes.c_void_p),
        ("encoder_begin_callback_user_data", ctypes.c_void_p),
    ]

if __name__ == "__main__":
    # load library and model
    libname = pathlib.Path().absolute() / libname
    whisper = ctypes.CDLL(libname)

    # tell Python what are the return types of the functions
    whisper.whisper_init_from_file.restype = ctypes.c_void_p
    whisper.whisper_full_default_params.restype = WhisperFullParams
    whisper.whisper_full_get_segment_text.restype = ctypes.c_char_p

    # initialize whisper.cpp context
    ctx = whisper.whisper_init_from_file(fname_model.encode("utf-8"))

    # get default whisper parameters and adjust as needed
    params = whisper.whisper_full_default_params()
    params.print_realtime = True
    params.print_progress = False

    # load WAV file
    samplerate, data = wavfile.read(fname_wav)

    # convert to 32-bit float
    data = data.astype("float32") / 32768.0

    # run the inference
    result = whisper.whisper_full(
        ctypes.c_void_p(ctx),
        params,
        data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        len(data),
    )

    if result != 0:
        print("Error: {}".format(result))
        exit(1)

    # print results from Python
    n_segments = whisper.whisper_full_n_segments(ctypes.c_void_p(ctx))
    for i in range(n_segments):
        t0 = whisper.whisper_full_get_segment_t0(ctypes.c_void_p(ctx), i)
        t1 = whisper.whisper_full_get_segment_t1(ctypes.c_void_p(ctx), i)
        txt = whisper.whisper_full_get_segment_text(ctypes.c_void_p(ctx), i)

        print(f"{t0/1000.0:.3f} - {t1/1000.0:.3f} : {txt.decode('utf-8')}")

    # free the memory
    whisper.whisper_free(ctypes.c_void_p(ctx))

出力結果

❯ python3 main.py
whisper_init_from_file: loading model from 'models/ggml-base.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB
[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.
0.000 - 0.760 :  And so my fellow Americans ask not what your country can do for you,
0.760 - 1.060 :  ask what you can do for your country.
time elapsed:0.8874366283416748

ビルド

Python

Discussion