Real-time speech recognition with ChatGPT, DeepL, and whisper.cpp (currently on hold)



wsl --list
Windows Subsystem for Linux Distributions:
Ubuntu-20.04 (Default)

1. Building whisper.cpp

First, build whisper.cpp in your own environment.

dev@DESKTOP-PN9CACL:~$ mkdir whisper
dev@DESKTOP-PN9CACL:~$ cd whisper
dev@DESKTOP-PN9CACL:~/whisper$ git clone https://github.com/ggerganov/whisper.cpp.git
dev@DESKTOP-PN9CACL:~/whisper$ cd whisper.cpp/
dev@DESKTOP-PN9CACL:~/whisper/whisper.cpp$ bash ./models/download-ggml-model.sh medium  # download a pretrained model; see https://github.com/ggerganov/whisper.cpp/tree/master/models#available-models
dev@DESKTOP-PN9CACL:~/whisper/whisper.cpp$ make

2. Generating Python3 code with DeepL + ChatGPT


(Original) Please write a Python3 program that captures audio from a microphone in real time.
the format to 16-bit integers, 1 channel (mono), a sampling rate of 44.1 kHz, and a buffer size of 1024 samples.
->(DeepL + minor manual edits) Write a program in Python3 to acquire real-time audio from a microphone and transcribe it.
Use libwhisper.so from https://github.com/ggerganov/whisper.cpp for transcription.
The format of the audio data to be acquired should be as follows:
the format to 16-bit integers, 1 channel (mono), a sampling rate of 44.1 kHz, and a buffer size of 1024 samples.
->(ChatGPT) To acquire real-time audio from a microphone and transcribe it using libwhisper.so, we can use the PyAudio library to access the microphone, and ctypes to load and call functions from the libwhisper.so library. Here is an example program that does this:


In this program, we first load the libwhisper.so library using ctypes. We then define the audio settings: the audio format is set to 16-bit integers, there is only one channel (mono), the sampling rate is 44.1 kHz, and the buffer size is 1024 samples.

We then define a callback function that will be called each time a new chunk of audio data is available from the microphone. In the callback function, we first convert the audio data to a numpy array. We then call the whisper() function from the libwhisper.so library, passing in the audio data as a ctypes void pointer and its length. The whisper() function transcribes the audio and returns the result as a C string, which we decode to a Python string and print.

We then create a PyAudio object and open the microphone stream, passing in the audio settings and the callback function. We start the stream and wait for it to finish, which happens when the user stops the program. Finally, we stop and close the stream and PyAudio object.
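Before the generated program itself, the ctypes pattern it relies on (loading a shared library and declaring, then calling, an exported C function) can be illustrated with the standard C library instead of libwhisper.so:

```python
import ctypes
import ctypes.util

# Load the standard C library -- the same mechanism used for libwhisper.so
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare strlen()'s signature so ctypes converts arguments and results correctly
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"whisper"))  # -> 7
```

Declaring `argtypes`/`restype` is the step that matters: without it, ctypes defaults every return value to a C `int`, which is exactly the kind of silent mismatch that bites when a library returns a pointer or a string.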

import pyaudio
import ctypes
import numpy as np
import time

# Load the libwhisper.so library
whisper_lib = ctypes.cdll.LoadLibrary('./libwhisper.so')
# Tell ctypes that whisper() returns a C string
whisper_lib.whisper.restype = ctypes.c_char_p

# Define the audio settings
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
CHUNK = 1024

# Define the callback function to process the audio data
def callback(in_data, frame_count, time_info, status):
    # Convert the audio data to a numpy array
    audio_data = np.frombuffer(in_data, dtype=np.int16)
    # Call the whisper() function from the libwhisper.so library to transcribe the audio
    result = whisper_lib.whisper(audio_data.ctypes.data_as(ctypes.c_void_p), len(audio_data))
    # Print the transcription result
    print(result.decode('utf-8'))
    return (in_data, pyaudio.paContinue)

# Create a PyAudio object and open the microphone stream
p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK,
                stream_callback=callback)

# Start the stream
stream.start_stream()

# Wait for the stream to finish
while stream.is_active():
    time.sleep(0.1)

# Stop and close the stream and PyAudio object
stream.stop_stream()
stream.close()
p.terminate()
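As it turns out, the generated program could not have worked even with a working microphone: whisper.cpp's public C API (whisper.h) exposes functions such as whisper_init_from_file() and whisper_full(), not a single whisper() function, and whisper_full() expects 32-bit float samples at a 16 kHz sampling rate rather than 16-bit integers at 44.1 kHz. A minimal sketch of the required format conversion, using linear interpolation for resampling (a simplification; a proper resampler would low-pass filter first):

```python
import numpy as np

def to_whisper_format(in_data: bytes, src_rate: int = 44100, dst_rate: int = 16000) -> np.ndarray:
    """Convert raw 16-bit mono PCM bytes to the float32 16 kHz format whisper_full() expects."""
    # Interpret the raw bytes as signed 16-bit samples
    samples = np.frombuffer(in_data, dtype=np.int16)
    # Normalize to [-1.0, 1.0]
    samples = samples.astype(np.float32) / 32768.0
    # Naive linear-interpolation resampling from src_rate to dst_rate
    n_out = int(len(samples) * dst_rate / src_rate)
    x_out = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(x_out, np.arange(len(samples)), samples).astype(np.float32)
```

A 1024-sample chunk at 44.1 kHz becomes roughly 371 samples at 16 kHz; in practice, chunks would also need to be accumulated into windows of several seconds before being handed to whisper_full().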

3. Generating libwhisper.so


dev@DESKTOP-PN9CACL:~/whisper/whisper.cpp$ make libwhisper.so
I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -pthread
I CC:       cc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
I CXX:      g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

g++ -I. -I./examples -O3 -std=c++11 -fPIC -pthread -shared -o libwhisper.so ggml.o whisper.o

4. Installing the missing libraries

dev@DESKTOP-PN9CACL:~/whisper/work$ sudo apt-get install portaudio19-dev
dev@DESKTOP-PN9CACL:~/whisper/work$ pip install pyaudio

5. Paste it into realtime.py and run -> failure


dev@DESKTOP-PN9CACL:~/whisper/work$ python3 realtime.py
Traceback (most recent call last):
  File "realtime.py", line 26, in <module>
    stream = p.open(format=FORMAT,
  File "/home/dev/.local/lib/python3.8/site-packages/pyaudio/__init__.py", line 639, in open
    stream = PyAudio.Stream(self, *args, **kwargs)
  File "/home/dev/.local/lib/python3.8/site-packages/pyaudio/__init__.py", line 441, in __init__
    self._stream = pa.open(**arguments)
OSError: [Errno -9996] Invalid input device (no default output device)

6. Investigating and resolving the failure with DeepL + ChatGPT

(Original) I am running Ubuntu 20.04 on WSL2. How do I enable voice input via a microphone?
->(DeepL) I am running Ubuntu 20.04 on wsl2. How do I enable voice input via microphone?
->(ChatGPT) In order to enable voice input via microphone in Ubuntu 20.04 on WSL2, you will need to perform the following steps:

Install PulseAudio and the PulseAudio Volume Control (pavucontrol) package. You can do this by running the following command in a terminal:

sudo apt-get install pulseaudio pavucontrol

Start PulseAudio by running the following command in the terminal:

pulseaudio --start

Launch pavucontrol by running the following command in the terminal:

pavucontrol

In the "Input Devices" tab of pavucontrol, you should see your microphone listed. Make sure it is not muted and that the volume level is high enough.

In the "Configuration" tab of pavucontrol, select "Analog Stereo Duplex" as the profile for your microphone.

Open the sound settings in Ubuntu by clicking on the speaker icon in the top right corner of the screen and selecting "Sound Settings".

In the "Input" tab of the sound settings, make sure that the correct microphone is selected as the input device.

After completing these steps, you should be able to use your microphone for voice input in Ubuntu 20.04 on WSL2.
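Note that the last two steps assume a desktop Ubuntu; a stock WSL2 instance has no sound server or GUI sound settings at all, which is consistent with the -9996 error above. One workaround used at the time (an assumption, not part of the ChatGPT answer: it requires a PulseAudio server running on the Windows host with its TCP module enabled) looks roughly like this on the WSL2 side:

```shell
# On the Windows host: run a PulseAudio server with network access enabled,
# e.g. in the server's default.pa:
#   load-module module-native-protocol-tcp auth-ip-acl=127.0.0.1;172.16.0.0/12

# On the WSL2 side: point PulseAudio clients at the Windows host
# (in WSL2, the host's IP is the nameserver in /etc/resolv.conf)
export PULSE_SERVER=tcp:$(grep nameserver /etc/resolv.conf | awk '{print $2}')

# Verify that the server is reachable and that an input source exists
pactl info
pactl list short sources
```

If `pactl list short sources` shows a microphone source, PyAudio should then be able to open an input stream through PulseAudio's ALSA compatibility layer.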