Generative AI Made Easy on Google Colaboratory [Part 5: Speech Recognition AI with faster-whisper]


Introduction

In this series, we are creating an environment that makes it easy for beginners to use Generative AI on Google Colaboratory.

In Part 5, we will set up an environment to use speech recognition AI.
The speech recognition AI we will use this time is faster-whisper.

Note that although this is Part 5, it is written so that you can understand the content even without reading Parts 1 to 4.

In the previous part, we touched upon Whisper, a speech recognition AI, via API. In Part 5, we will explore faster-whisper, a speech recognition model that operates at high speed in a local environment.

I have also written an article regarding voice dialogue systems using speech recognition AI below.
I would appreciate it if you could take a look if you are interested.
https://zenn.dev/asap/articles/5b1b7553fcaa76

What is faster-whisper?

https://github.com/SYSTRAN/faster-whisper

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than openai/whisper at the same accuracy, while using less memory. Efficiency can be further improved with 8-bit quantization on both CPU and GPU.

As stated in the repository above, faster-whisper is a high-speed version of OpenAI's Whisper speech recognition AI model.
In Part 4, we ran the Whisper API. Since that model runs on OpenAI's high-performance servers, it can achieve speech recognition very quickly; however, faster-whisper operates at comparable or even higher speeds even in a local environment.
Furthermore, it is a very lightweight model that can even be run on a CPU.

First, some background: speech recognition AI transcribes the human speech contained in audio into text.
When building a voice dialogue system, technology to transcribe audio into text is essential.
In a voice dialogue system, a Large Language Model (LLM) such as ChatGPT generates a response from this transcribed text, and a speech synthesis AI then voices that response, completing the dialogue loop.
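The pipeline described above (speech recognition -> LLM -> speech synthesis) can be sketched with placeholder functions. Note that transcribe, generate_reply, and synthesize are hypothetical stubs, not real library calls; in practice you would swap in faster-whisper, an LLM, and a speech synthesis engine.

```python
# Minimal sketch of a voice dialogue turn. All three stage functions
# are hypothetical placeholders standing in for real models.

def transcribe(audio: bytes) -> str:
    # Speech recognition step (e.g. faster-whisper) -- stubbed here.
    return "こんにちは"

def generate_reply(text: str) -> str:
    # LLM step (e.g. ChatGPT) -- stubbed here.
    return f"Reply to: {text}"

def synthesize(text: str) -> bytes:
    # Speech synthesis step -- stubbed here.
    return text.encode("utf-8")

def dialogue_turn(audio: bytes) -> bytes:
    # One full turn: audio in, synthesized reply audio out.
    user_text = transcribe(audio)
    reply_text = generate_reply(user_text)
    return synthesize(reply_text)

print(dialogue_turn(b"...").decode("utf-8"))  # Reply to: こんにちは
```

Each stage runs sequentially, which is why the per-stage processing speeds measured later in this article matter for the overall response latency.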

By the way, I covered speech synthesis AI in Part 1 and Large Language Models in Parts 2 and 3, so if you are interested, please take a look at those as well.

Also, this time, we will compare the processing speeds of faster-whisper and the Whisper API.
Furthermore, while there are various models for faster-whisper, we will use the large-v3 model, which is the highest performance model, and the kotoba-whisper-v1.0 model, which is specialized for Japanese.

Deliverables

Please see the repository below.
https://github.com/personabb/colab_AI_sample/tree/main/colab_fasterwhisper_sample

Explanation

The setup steps are as follows.
First, clone the repository mentioned above.

./
git clone https://github.com/personabb/colab_AI_sample.git

After that, place the cloned folder "colab_AI_sample" in an appropriate location in your My Drive.

Directory Structure

The following directory structure on Google Drive is assumed:

MyDrive/
    └ colab_AI_sample/
          └ colab_fasterwhisper_sample/
                  ├ configs/
                  |    └ config.ini
                  ├ inputs/
                  |    └ audio_recognition/
                  |             └ xxxxx.wav
                  ├ module/
                  |    └ module_whisper.py
                  └ Faster-whisper_sample.ipynb

  • The colab_AI_sample folder can be anything. It doesn't have to be at the root level; it can be multiple levels deep as shown below:
    • MyDrive/hogehoge/spamspam/hogespam/colab_AI_sample
  • Store the audio files to be transcribed in the audio_recognition folder.
    • The following formats should work:
      • mp3, mp4, mpeg, mpga, m4a, wav, webm
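As a convenience, you could check a file's extension against this list before transcribing. The helper below is just an illustrative sketch and is not part of the sample repository.

```python
import os

# Extensions the article lists as accepted for transcription input.
SUPPORTED_EXTS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def is_supported_audio(path: str) -> bool:
    # Compare the lowercased extension against the supported set.
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTS

print(is_supported_audio("inputs/audio_recognition/sample.wav"))  # True
print(is_supported_audio("inputs/audio_recognition/notes.txt"))   # False
```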

Preparation

To use the Whisper API, you need to obtain an OpenAI API key.
Please refer to the link below for how to obtain an API key.
https://qiita.com/shimmy-notat/items/1e22dcdaa06ea54208ac

Additionally, you need to register the obtained API key in Google Colab.
Please refer to the following article for registration.
https://note.com/npaka/n/n79bb63e17685

Usage Explanation

Open Faster-whisper_sample.ipynb with the Google Colaboratory app.
Right-click the file, and an "Open with" option will appear; select Google Colaboratory from there.

If it's not available, go to "Connect more apps," search for "Google Colaboratory" in the marketplace, and install it.

Once opened in Google Colaboratory, follow the notes in Faster-whisper_sample.ipynb and execute the cells in order from the top. It should run through to the end without issues, allowing you to perform speech-to-text transcription.

If you want to run everything again with different parameters, click "Runtime" -> "Restart session and run all."

Code Explanation

I will primarily explain the important Faster-whisper_sample.ipynb.

Faster-whisper_sample.ipynb

The corresponding code can be found below:
https://github.com/personabb/colab_AI_sample/blob/main/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb

Below is a cell-by-cell explanation.

1st Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Install modules required for Faster-Whisper
!pip install faster_whisper
# Install modules required for Whisper API
!pip install openai

In this cell, we install the necessary modules.
Since basic deep learning packages like PyTorch are already pre-installed in Google Colab, you only need to install the ones listed above.

2nd Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Mount Google Drive (requires authentication)
from google.colab import drive
drive.mount('/content/drive')

# Set OpenAI API key
# Refer to the following for setup:
# https://note.com/npaka/n/n79bb63e17685

from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

# Change current directory to the directory where this file exists.
import glob
import os
pwd = os.path.dirname(glob.glob('/content/drive/MyDrive/**/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb', recursive=True)[0])
print(pwd)

%cd $pwd
!pwd

In this cell, we mount the contents of Google Drive.
By mounting, it becomes possible to read and write files stored within Google Drive.

When mounting, you will need to grant permission from Colab.
A popup will appear; follow the instructions to allow mounting.

Additionally, we retrieve the OpenAI API key that was obtained and registered in Google Colab earlier and store it in api_key.

Following that, we change the current directory from / to /content/drive/MyDrive/**/colab_fasterwhisper_sample.
(** is a wildcard that matches any number of intermediate directories.)
Changing the current directory is not strictly necessary, but it makes specifying folder paths easier in subsequent steps.
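The recursive ** pattern used here can be demonstrated outside of Google Drive as well. The sketch below builds a temporary nested directory and locates the notebook with the same glob pattern the notebook cell uses.

```python
import glob
import os
import tempfile

# Demonstration of the ** wildcard: with recursive=True it matches
# any number of intermediate directories, so the notebook is found
# no matter how deeply the sample folder is nested.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "a", "b", "colab_fasterwhisper_sample")
    os.makedirs(target)
    open(os.path.join(target, "Faster-whisper_sample.ipynb"), "w").close()

    hits = glob.glob(
        os.path.join(root, "**", "colab_fasterwhisper_sample", "*.ipynb"),
        recursive=True,
    )
    print(os.path.dirname(hits[0]))  # .../a/b/colab_fasterwhisper_sample
```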

3rd Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Import the faster-whisper module
from module.module_whisper import FasterWhisperModel
import time

# Import modules for Whisper API
from openai import OpenAI
# Initialize the OpenAI client
client = OpenAI(api_key = api_key)

We import the FasterWhisperModel class from module/module_whisper.py as a module.
Details regarding its contents will be explained in a later chapter.

In addition, we initialize the OpenAI client to use the Whisper API for comparison with faster-whisper.

4th Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Configure the model settings.
# Settings are divided for CPU and GPU execution (due to processing time).
# If the same settings are used, transcription will have the same accuracy.

config_text = """
[FasterWhisper]
device = auto
language = ja

gpu_model_type = large-v3
gpu_beam_size = 1
gpu_compute_type = float16

cpu_model_type = small
cpu_beam_size = 1
cpu_compute_type = int8

use_kotoba = False
kotoba_model_type = kotoba-tech/kotoba-whisper-v1.0-faster
chunk_length = 15
condition_on_previous_text = False
"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)

In this cell, the content of the configuration file configs/config.ini is overwritten with the content of config_text.
Faster Whisper operates according to these settings.
For example, the following configurations are set:

  • Whether to use the machine's GPU or CPU is specified by device = auto.
    • If auto, the machine will use the GPU if one is available.
    • Alternatively, you can specify cuda or cpu.
  • The language is set via language = ja. Here, Japanese is configured.
  • Next, model settings are defined:
    • Different models are used depending on whether a GPU or CPU is utilized.
      • For GPU use, the high-performance large-v3 model is used.
      • For CPU use, the lightweight small model is used.
        • Additionally, for the CPU, the int8 data type is used to reduce the amount of computation.
  • Also, use_kotoba = False determines whether to use the Japanese-specialized model kotoba-whisper-v1.0.
  • By changing kotoba_model_type, you can change the specific faster-whisper model used.
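The device = auto behavior can be illustrated with a small helper that mirrors the logic in module_whisper.py (GPU availability is passed in explicitly here so the sketch runs without torch):

```python
def resolve_device(configured: str, cuda_available: bool) -> str:
    # "auto" picks cuda when a GPU is available, otherwise cpu;
    # an explicit "cuda" or "cpu" setting is used as-is.
    if configured == "auto":
        return "cuda" if cuda_available else "cpu"
    return configured

print(resolve_device("auto", True))   # cuda
print(resolve_device("auto", False))  # cpu
print(resolve_device("cpu", True))    # cpu
```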

5th Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Specify the file to transcribe. The following is a sample file.

wav_path = "./inputs/audio_recognition/MANA_yofukashi_QUESTION_007.wav"

In this cell, we set the path to the audio file for transcription.
Please ensure that the audio file you want to transcribe is stored at the location specified by this path.

Incidentally, the sample file set above is borrowed from the link below:
https://amitaro.net/voice/corpus-list/mana-corpus/

6th Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb

# Standard faster-whisper usage
whisper = FasterWhisperModel()
print("transcribe start!")
start = time.time()
text = whisper.audioFile2text(wav_path)
print("whisper_time:",time.time()-start)
print(text)

Here, we perform the transcription of the audio. If you are using a GPU, the model used will be large-v3, and if you are using a CPU, it will be the small model.
Transcription is executed by running this cell after changing the settings in the 4th cell.

Additionally, the processing speed is calculated using the time module.
Upon calculating the processing speed, the result when using the GPU was:
whisper_time: 0.7156243324279785

On the other hand, when using the CPU, the result was:
whisper_time: 5.328901052474976

Since the audio file is clear, both models were able to transcribe this level of audio accurately without any issues.
(Of course, for practical use, it is recommended to use a larger model.)

In cells 7-8, the configuration file is modified to use kotoba-whisper-v1.0, a model specialized for Japanese.

The processing speed was as follows:
whisper_time: 0.4141805171966553
The transcription itself was also fully accurate.

As shown in the results above, kotoba-whisper-v1.0 processed the audio faster than the large-v3 model.

As seen from the model card below, kotoba-whisper-v1.0 is a model fine-tuned based on the distil-large-v2 model, which is a distilled version of Whisper, leading to faster processing speeds.
https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0-faster

Furthermore, since it is specialized for Japanese, its Japanese speech recognition performance is comparable to that of large-v3.

9th Cell

./colab_AI_sample/colab_fasterwhisper_sample/Faster-whisper_sample.ipynb
# Using the Whisper API
print("transcribe start!")
start = time.time()
transcription = client.audio.transcriptions.create(model="whisper-1",
                        language="ja",
                        file=open(wav_path, "rb"),
                        response_format="verbose_json",
                        prompt = "こんにちは!今日は、あんまり、元気じゃないかな・・・君はどう?"
                        )
print("whisper_time:",time.time()-start)
print(transcription.text)

I also executed the Whisper-API for comparison.
The processing time was as follows:
whisper_time: 0.9151504039764404

Summary of Results

The processing times are summarized below (in seconds):
Whisper-API: 0.9151504039764404
faster-whisper (large-v3): 0.7156243324279785
faster-whisper (kotoba-whisper-v1.0): 0.4141805171966553
faster-whisper (small, CPU): 5.328901052474976

When creating a voice dialogue system, the processing speed of each step, including speech recognition, is extremely important.
As shown above, the kotoba-whisper-v1.0 model had high performance and was fast in terms of processing time.
On the other hand, if you do not mind the API usage fees, using the Whisper-API is also an option, as you don't have to worry about local machine resources.

Additionally, another strength of faster-whisper is its ability to transcribe within a realistic time frame even when using a CPU. If real-time performance is not required, it is possible to perform transcription even on a regular laptop.

module/module_whisper.py

Next, I will explain the contents of the module imported from Faster-whisper_sample.ipynb.

The full code is shown below.

Full Code
./colab_AI_sample/colab_fasterwhisper_sample/module/module_whisper.py

from faster_whisper import WhisperModel
import numpy as np
import torch

import os
import configparser
# Module for checking file existence
import errno

class FasterWhisperconfig:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Load the ini file
        self.config_ini = configparser.ConfigParser()
        
        # Raise an error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)
        
        self.config_ini.read(config_ini_path, encoding='utf-8')
        FasterWhisper_items = self.config_ini.items('FasterWhisper')
        self.FasterWhisper_config_dict = dict(FasterWhisper_items)

class FasterWhisperModel:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):
        FasterWhisper_config = FasterWhisperconfig(config_ini_path = config_ini_path)
        config_dict = FasterWhisper_config.FasterWhisper_config_dict

        if device is not None:
            self.DEVICE = device
        else:
            device = config_dict["device"]

            self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.DEVICE = device
            
        self.BEAM_SIZE = int(config_dict["gpu_beam_size"]) if self.DEVICE == "cuda" else int(config_dict["cpu_beam_size"])
        self.language = config_dict["language"]
        self.COMPUTE_TYPE = config_dict["gpu_compute_type"] if self.DEVICE == "cuda" else config_dict["cpu_compute_type"]
        self.MODEL_TYPE = config_dict["gpu_model_type"] if self.DEVICE == "cuda" else config_dict["cpu_model_type"]
        self.kotoba_chunk_length = int(config_dict["chunk_length"])
        self.kotoba_condition_on_previous_text = config_dict["condition_on_previous_text"]
        if self.kotoba_condition_on_previous_text == "True":
            self.kotoba_condition_on_previous_text = True
        else:
            self.kotoba_condition_on_previous_text = False

        if config_dict["use_kotoba"] == "True":
            self.use_kotoba = True
        else:
            self.use_kotoba = False

        if not self.use_kotoba:
            self.model = WhisperModel(self.MODEL_TYPE, device=self.DEVICE, compute_type=self.COMPUTE_TYPE)
        else:
            self.MODEL_TYPE = config_dict["kotoba_model_type"]
            #self.model = WhisperModel(self.MODEL_TYPE, device=self.DEVICE, compute_type=self.cotoba_compute_type)
            self.model = WhisperModel(self.MODEL_TYPE)

            
    def audioFile2text(self, file_path):
        result = ""
        if not self.use_kotoba:
            segments, _ = self.model.transcribe(file_path, beam_size=self.BEAM_SIZE,language=self.language)
        else:
            segments, _ = self.model.transcribe(file_path, beam_size=self.BEAM_SIZE,language=self.language, chunk_length=self.kotoba_chunk_length, condition_on_previous_text=self.kotoba_condition_on_previous_text)
        
        for segment in segments:
            result += segment.text

        return result

Now, let's go through it section by section.

FasterWhisperconfig Class

./colab_AI_sample/colab_fasterwhisper_sample/module/module_whisper.py

class FasterWhisperconfig:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Load the ini file
        self.config_ini = configparser.ConfigParser()
        
        # Raise an error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)
        
        self.config_ini.read(config_ini_path, encoding='utf-8')
        FasterWhisper_items = self.config_ini.items('FasterWhisper')
        self.FasterWhisper_config_dict = dict(FasterWhisper_items)

Here, the configuration file specified by config_ini_path = './configs/config.ini' is loaded into FasterWhisper_config_dict. Since it is loaded as a dictionary type, it becomes possible to access the contents of the configuration file as a Python dictionary.
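This dict conversion can be seen in isolation with a minimal sketch (the ini content below is a shortened excerpt of the article's config_text):

```python
import configparser

# A shortened excerpt of the article's config for illustration.
config_text = """
[FasterWhisper]
device = auto
gpu_beam_size = 1
use_kotoba = False
"""

config = configparser.ConfigParser()
config.read_string(config_text)

# items() on a section yields (key, value) pairs, which dict() turns
# into a plain dictionary. Every value comes back as a string.
config_dict = dict(config.items("FasterWhisper"))

print(config_dict["device"])               # auto
print(type(config_dict["gpu_beam_size"]))  # <class 'str'>
```

Because every value is a string, later code must convert types itself, e.g. int(config_dict["gpu_beam_size"]) or comparing use_kotoba against the string "True", exactly as module_whisper.py does.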

init Method of the FasterWhisperModel Class

./colab_AI_sample/colab_fasterwhisper_sample/module/module_whisper.py

class FasterWhisperModel:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):
        FasterWhisper_config = FasterWhisperconfig(config_ini_path = config_ini_path)
        config_dict = FasterWhisper_config.FasterWhisper_config_dict

        if device is not None:
            self.DEVICE = device
        else:
            device = config_dict["device"]

            self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.DEVICE = device
            
        self.BEAM_SIZE = int(config_dict["gpu_beam_size"]) if self.DEVICE == "cuda" else int(config_dict["cpu_beam_size"])
        self.language = config_dict["language"]
        self.COMPUTE_TYPE = config_dict["gpu_compute_type"] if self.DEVICE == "cuda" else config_dict["cpu_compute_type"]
        self.MODEL_TYPE = config_dict["gpu_model_type"] if self.DEVICE == "cuda" else config_dict["cpu_model_type"]
        self.kotoba_chunk_length = int(config_dict["chunk_length"])
        self.kotoba_condition_on_previous_text = config_dict["condition_on_previous_text"]
        if self.kotoba_condition_on_previous_text == "True":
            self.kotoba_condition_on_previous_text = True
        else:
            self.kotoba_condition_on_previous_text = False

        if config_dict["use_kotoba"] == "True":
            self.use_kotoba = True
        else:
            self.use_kotoba = False

        if not self.use_kotoba:
            self.model = WhisperModel(self.MODEL_TYPE, device=self.DEVICE, compute_type=self.COMPUTE_TYPE)
        else:
            self.MODEL_TYPE = config_dict["kotoba_model_type"]
            #self.model = WhisperModel(self.MODEL_TYPE, device=self.DEVICE, compute_type=self.cotoba_compute_type)
            self.model = WhisperModel(self.MODEL_TYPE)

First, the contents of the configuration file are stored in config_dict. Since this is a dictionary type, you can retrieve the settings as strings using keys like config_dict["device"]. Note that since everything is retrieved as a string, you need to perform type conversion if you need int or bool types.

Next, the following processing steps are performed:

  • Specify the device on which to run the model.
  • Retrieve various settings from the configuration file.
  • Define the model.
    • Initialize the appropriate model based on the configuration file settings.

audioFile2text Method of the FasterWhisperModel Class

./colab_AI_sample/colab_fasterwhisper_sample/module/module_whisper.py

class FasterWhisperModel:
    ...        
    def audioFile2text(self, file_path):
        result = ""
        if not self.use_kotoba:
            segments, _ = self.model.transcribe(file_path, beam_size=self.BEAM_SIZE,language=self.language)
        else:
            segments, _ = self.model.transcribe(file_path, beam_size=self.BEAM_SIZE,language=self.language, chunk_length=self.kotoba_chunk_length, condition_on_previous_text=self.kotoba_condition_on_previous_text)
        
        for segment in segments:
            result += segment.text

        return result

In this method, the transcribe method of the faster-whisper model is called to perform speech recognition. It uses the appropriate arguments based on the model specified in the configuration file.

Since faster-whisper processes audio longer than 30 seconds by splitting it into segments, the text generated from each segment is appended to the result variable and then returned.
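The segment objects returned by transcribe() carry start and end timestamps (in seconds) along with the recognized text. The concatenation loop above can be illustrated with stand-in segment objects, so faster-whisper itself does not need to be imported here:

```python
from dataclasses import dataclass

# Stand-in for faster_whisper's segment objects, which expose
# start/end timestamps in seconds and the recognized text.
@dataclass
class Segment:
    start: float
    end: float
    text: str

segments = [
    Segment(0.0, 12.5, "This is the first segment."),
    Segment(12.5, 27.0, " And this is the second."),
]

# Same pattern as audioFile2text: append each segment's text in order.
result = ""
for segment in segments:
    result += segment.text

print(result)  # This is the first segment. And this is the second.
```

Keeping the per-segment timestamps instead of discarding them would also allow building subtitle output, but for a voice dialogue system the joined text alone is sufficient.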

Summary

In this article, we created an environment to easily use Generative AI on Google Colaboratory for beginners.

In Part 5, we enabled the use of faster-whisper, a type of speech recognition AI. During this process, we compared its processing speed with the Whisper API covered in the previous Part 4.

Furthermore, I have also written an article about a voice dialogue system using speech recognition AI at the link below. If you are interested, please take a look.
https://zenn.dev/asap/articles/5b1b7553fcaa76

In the next Part 6, I would like to set up the environment for the image generation AI, Stable Diffusion.
