Building a Speech Recognition Batch Processing System: Local Implementation

Kei TAKAHATA
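Published on 2024/10/08

Preface
I am an AI engineer who places a high priority on building mechanisms that make trial and error easy.
Personally, when there are several values that can change, I prefer to prepare a configuration file.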

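Command-line args are also convenient, but I want to keep the execution command as short as possible.
In the previous article, I summarized the steps for running speech recognition with ReazonSpeech and Whisper on Google Colaboratory.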

https://zenn.dev/goals_techblog/articles/953920904e941c
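This time, I will build a batch processing system that runs speech recognition on audio files according to the contents of a configuration file.

Please note that this is not an online processing system.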

https://req-definer.com/entry/batch-and-online

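Building the speech recognition batch processing system
This chapter explains the environment setup, directory structure, and overall flow of the speech recognition batch processing system. It is designed so that both ReazonSpeech and Whisper can be used as the speech recognition model.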

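Environment setup
First, set up an environment for speech recognition.
This article uses a conda environment on a Mac.

For reference, here are the specs of the PC I used for the implementation.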


Item	Detail
Machine	MacBook Pro
Chip	Apple M1 Pro
Memory	16GB
OS	macOS Sonoma version 14.6.1

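The rest of this article assumes that Homebrew and Miniforge are installed and that the brew and conda commands are available.
For reference, the installation steps for Homebrew and Miniforge are described in the following article.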
https://zenn.dev/goals_techblog/articles/03d36b417e6de9
Create and activate the conda environment.
# Create a conda environment named speech_recognition for speech recognition
% conda create -n speech_recognition python=3.10

# Activate the speech_recognition environment
% conda activate speech_recognition
Install the required libraries.
(speech_recognition)% brew install ffmpeg

# Move to the workspace you work in
(speech_recognition)% cd workspace

# Clone the reazonspeech code
(speech_recognition)% git clone https://github.com/reazon-research/reazonspeech

# Install huggingface-hub
# (the version is pinned because it did not work when no version was specified)
(speech_recognition)% pip install huggingface-hub==0.23.2

# Install the libraries required by ReazonSpeech
(speech_recognition)% pip install --no-warn-conflicts reazonspeech/pkg/nemo-asr

# Install the libraries required by Whisper
(speech_recognition)% pip install git+https://github.com/openai/whisper.git
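
Once the installs finish, a quick check can confirm that everything is visible from the active environment. The following is a minimal sketch and not part of the original setup: `check_environment` is a hypothetical helper that uses only the standard library to look for the `ffmpeg` executable and for the top-level modules imported later in this article (`yaml`, `whisper`, `reazonspeech`).

```python
import importlib.util
import shutil


def check_environment(commands=("ffmpeg",), modules=("yaml", "whisper", "reazonspeech")):
    """Return a dict mapping each command/module name to True if it can be found."""
    status = {}
    for command in commands:
        # shutil.which returns None when the executable is not on PATH
        status[command] = shutil.which(command) is not None
    for module in modules:
        # find_spec returns None when a top-level module cannot be located
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

Anything reported as MISSING points back to the corresponding install command above.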

Directory structure
I used the following directory structure.
workspace/
    └── speech_recognition/
        ├── 00_data/
        │   └── 01_input_data/
        │       ├── 01_test.wav
        │       ├── 02_test.wav
        │       ├── ...
        │       └── xx_test.wav
        ├── 01_code/
        │   ├── main.py
        │   ├── speech_recognition_setting.yaml
        │   └── using_main/
        │       └── speech_recognition.py
        └── reazonspeech

Overall flow
The overall flow is as follows.


Implementation code
From here, I describe each file and show its implementation code.

main file
main.py instantiates SpeechRecognition and runs it.
 main.py
import using_main.speech_recognition as speech_recognition


def main():
    speech_recog = speech_recognition.SpeechRecognition()
    result = speech_recog.speech_recognition()
    return


if __name__ == "__main__":
    main()


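Configuration file
speech_recognition_setting.yaml holds the settings for speech recognition.

The dir section groups the directory settings.
You can choose whether to run speech recognition on every file in input_dir or only on specified files.
You can also select multiple models to run speech recognition with.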
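
The choice between "every file" and "specified files only" can be sketched as a small function. This is a minimal illustration that assumes the `target_type` and `file_name_list` keys from the settings file in this article; `select_target_files` is a hypothetical name, and `pathlib` is used here instead of the `glob` module.

```python
from pathlib import Path


def select_target_files(input_dir, target_type, file_name_list=None):
    """Return the list of audio file paths to process."""
    input_dir = Path(input_dir)
    if target_type == "all":
        # Every .wav file directly under input_dir, in sorted order
        return sorted(input_dir.glob("*.wav"))
    if target_type == "designation":
        if not file_name_list:
            raise ValueError("file_name_list is required when target_type is 'designation'.")
        # Keep the order given in the settings file
        return [input_dir / name for name in file_name_list]
    raise ValueError(f"target_type is not supported: {target_type!r}")
```

Raising an error for an unknown `target_type` keeps a typo in the settings file from silently processing nothing.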

Whisper comes in several model sizes, so they are specified with the prefix "whisper_", as in "whisper_base" and "whisper_large".
You can also choose whether to run on the CPU or on a GPU.
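
As a rough sketch of the naming convention, an entry of `model_list` can be split into an engine name and a variant. `parse_model_name` is a hypothetical helper written for this illustration only; it does not check whether the variant is a model name that Whisper actually provides.

```python
def parse_model_name(model):
    """Split a model_list entry into (engine, variant)."""
    prefix = "whisper_"
    if model == "reazon":
        # ReazonSpeech has no variant in this settings format
        return ("reazon", None)
    if model.startswith(prefix) and len(model) > len(prefix):
        # "whisper_base" -> ("whisper", "base")
        return ("whisper", model[len(prefix):])
    raise ValueError(f"model is not supported: {model!r}")
```

Checking the prefix with `startswith` rejects a bare "whisper" entry early instead of passing an invalid model name on to the library.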
Note: ReazonSpeech has been confirmed to run with "mps", but I have not yet been able to run Whisper with it.

Reference: https://github.com/pytorch/pytorch/issues/87886
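
Because the available devices differ from machine to machine, one option is to fall back to the CPU when the requested device is not available. The following is a minimal sketch under that assumption; `resolve_device` is a hypothetical helper, and the availability flags are passed in as arguments so the function itself does not depend on PyTorch.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Return `requested` if it is usable, otherwise fall back to "cpu"."""
    if requested.startswith("cuda") and cuda_available:
        return requested
    if requested == "mps" and mps_available:
        return requested
    # "cpu" itself, or any device that is not available on this machine
    return "cpu"
```

In practice the two flags could be taken from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. This only checks availability and does not work around the Whisper issue referenced above, so keeping `whisper_device` set to "cpu" remains the safe choice there.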
 speech_recognition_setting.yaml
dir:
  base_dir: "/workspace/"
  data_dir: "00_data"
  input_dir: "01_input_data"
  code_dir: "01_code"

# To run on every file in input_dir
target_type: "all"
# To run only on specified files
# target_type: "designation"
# file_name_list: ["xx_test.wav"]

# List of models to run speech recognition with
model_list: ["reazon", "whisper_base", "whisper_large"]
# Device settings
reazon_device: "cpu" # "cuda:0" or "mps" or "cpu"
whisper_device: "cpu" # "cuda:0" or "mps" or "cpu"
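
Since every run depends on this file, it can help to check the loaded settings before any model is loaded. The following is a minimal sketch, assuming the settings have already been read into a dict with the keys shown above; `validate_settings` is a hypothetical helper and not part of the article's implementation.

```python
ALLOWED_TARGET_TYPES = {"all", "designation"}


def validate_settings(setting_dict):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    dir_settings = setting_dict.get("dir") or {}
    for key in ("base_dir", "data_dir", "input_dir"):
        if key not in dir_settings:
            problems.append(f"dir.{key} is missing")
    target_type = setting_dict.get("target_type")
    if target_type not in ALLOWED_TARGET_TYPES:
        problems.append(f"target_type must be one of {sorted(ALLOWED_TARGET_TYPES)}")
    elif target_type == "designation" and not setting_dict.get("file_name_list"):
        problems.append("file_name_list is required when target_type is 'designation'")
    model_list = setting_dict.get("model_list") or []
    if not model_list:
        problems.append("model_list is empty")
    for model in model_list:
        if model != "reazon" and not model.startswith("whisper_"):
            problems.append(f"unknown model: {model}")
    for key in ("reazon_device", "whisper_device"):
        if key not in setting_dict:
            problems.append(f"{key} is missing")
    return problems
```

Returning a list of messages rather than raising on the first problem lets all mistakes in the file be fixed in one pass.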

SpeechRecognition file
speech_recognition.py runs speech recognition based on the contents of the configuration file.
 speech_recognition.py
import os
import glob

import yaml
import whisper

from reazonspeech.nemo.asr import transcribe, audio_from_path, load_model


class SpeechRecognition:
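    """
    Class that runs speech recognition.

    Attributes
    ----------
    setting_dict : dict
        Dictionary holding the settings
    base_dir_path : str
        Path of the base directory
    data_dir_path : str
        Path of the data directory
    input_dir_path : str
        Path of the input directory
    """
    def __init__(self):
        """Initialize the instance from the settings file."""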
        self.setting_dict = self.__read_setting_yaml()
        self.base_dir_path = self.setting_dict["dir"]["base_dir"]
        self.data_dir_path = self.__generate_data_dir_path()
        self.input_dir_path = self.__generate_input_dir_path()
        return

    def __read_setting_yaml(self):
        """Load the settings from the YAML file."""
        with open("/workspace/01_code/speech_recognition_setting.yaml", "r") as yml:
            setting_dict = yaml.safe_load(yml)
        return setting_dict

    def __generate_data_dir_path(self):
        """Build the path of the data directory."""
        data_dir = self.setting_dict["dir"]["data_dir"]
        data_dir_path = os.path.join(self.base_dir_path, data_dir)
        return data_dir_path

    def __generate_input_dir_path(self):
        """Build the path of the input directory."""
        input_dir = self.setting_dict["dir"]["input_dir"]
        input_dir_path = os.path.join(self.data_dir_path, input_dir)
        return input_dir_path

    def __get_all_file_path_list(self, dir_path):
        """Get a list of the paths of all .wav files in the given directory."""
        key = os.path.join(dir_path, "*.wav")
        file_path_list = glob.glob(key)
        file_path_list.sort()
        return file_path_list

    def __generate_designation_file_path_list(self, dir_path, file_name_list):
        """Build the list of file paths for the given list of file names."""
        file_path_list = []
        for file_name in file_name_list:
            file_path = os.path.join(dir_path, file_name)
            file_path_list.append(file_path)
        return file_path_list

    def __generate_target_file_path_list(self, setting_dict):
        """Build the list of target file paths."""
        target_type = setting_dict["target_type"]
        if target_type == "all":
            target_file_path_list = self.__get_all_file_path_list(self.input_dir_path)
        elif target_type == "designation":
            file_name_list = setting_dict["file_name_list"]
            target_file_path_list = self.__generate_designation_file_path_list(self.input_dir_path, file_name_list)
        else:
            raise ValueError("target_type is not supported.")
        return target_file_path_list

    def __exec_reazon(self, file_path, setting_dict):
        """Run speech recognition on the given audio file with ReazonSpeech."""
        audio = audio_from_path(file_path)
        # Fetch the ReazonSpeech model from Hugging Face
        device = setting_dict["reazon_device"]
        model = load_model(device=device)
        result = transcribe(model, audio)
        return result

    def __print_speech_segment_reazon(self, result):
        """Print the ReazonSpeech recognition result segment by segment."""
        for seg in result.segments:
            print("%5.2f %5.2f %s" % (seg.start_seconds, seg.end_seconds, seg.text))
        return

    def __exec_whisper(self, file_path, setting_dict, load_model=None):
        """Run speech recognition with Whisper."""
        device = setting_dict["whisper_device"]
        if load_model is None:
            model = whisper.load_model(name="base", device=device)
        else:
            model = whisper.load_model(name=load_model, device=device)
        result = model.transcribe(file_path)
        return result

    def __print_speech_segment_whisper(self, result):
        """Print the Whisper recognition result segment by segment."""
        for seg in result["segments"]:
            # seg_id avoids shadowing the built-in id()
            seg_id, start, end, text = [seg[key] for key in ["id", "start", "end", "text"]]
            print(f"{seg_id:03}: {start:5.1f} - {end:5.1f} | {text}")
        return

    def __exec_speech_recognition(self, file_path, model, setting_dict):
        """Run speech recognition on the given audio file with the given model."""
        if model == "reazon":
            result = self.__exec_reazon(file_path, setting_dict)
            self.__print_speech_segment_reazon(result)
        elif "whisper" in model:
            model_name = model.replace("whisper_", "")
            result = self.__exec_whisper(file_path, load_model=model_name, setting_dict=setting_dict)
            self.__print_speech_segment_whisper(result)
        else:
            raise ValueError("model is not supported.")
        return result

    def speech_recognition(self):
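        """Run speech recognition according to the settings.

        Returns the result of the last file/model pair that was processed.
        """
        # Stays None when there are no target files or no models
        result = None
        file_path_list = self.__generate_target_file_path_list(self.setting_dict)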
        for file_path in file_path_list:
            print("file_path: ", file_path)
            model_list = self.setting_dict["model_list"]
            for model in model_list:
                print("model: ", model)
                result = self.__exec_speech_recognition(file_path, model, self.setting_dict)
        return result
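
As written, the class loads the model again for every audio file. When many files are processed, one option is to load each model once and reuse it. The following is a minimal sketch of that idea, not the article's implementation: `ModelCache` is a hypothetical helper, and the loader is passed in as a callable so the sketch does not depend on any particular library.

```python
class ModelCache:
    """Load each model at most once and reuse it for later files."""

    def __init__(self, loader):
        # loader: callable taking (name, device) and returning a model object
        self._loader = loader
        self._models = {}

    def get(self, name, device):
        """Return the cached model for (name, device), loading it on first use."""
        key = (name, device)
        if key not in self._models:
            self._models[key] = self._loader(name, device)
        return self._models[key]
```

For Whisper, the loader could be something like `lambda name, device: whisper.load_model(name=name, device=device)`, which matches the call already used in `__exec_whisper`. Note that cached models stay in memory, which matters on a 16GB machine when several large models are listed.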


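Afterword
It is a simple one, but I built a batch processing system that runs speech recognition according to the contents of a configuration file.

Even when the number of models you want to use grows, I believe the system can be put to wider use by extending its internals.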
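
One way such an extension could look is a small registry that maps a model-name prefix to the function that handles it, so adding a model means adding one function. This is a sketch under that assumption; `RECOGNIZERS`, `register`, `dispatch`, and the "dummy" recognizer are all hypothetical names used only for illustration.

```python
RECOGNIZERS = {}


def register(prefix):
    """Decorator that registers a recognizer function under a model-name prefix."""
    def decorator(func):
        RECOGNIZERS[prefix] = func
        return func
    return decorator


def dispatch(model, file_path):
    """Call the recognizer whose prefix matches the model name."""
    for prefix, func in RECOGNIZERS.items():
        if model == prefix or model.startswith(prefix + "_"):
            return func(model, file_path)
    raise ValueError(f"model is not supported: {model!r}")


@register("dummy")
def run_dummy(model, file_path):
    # Placeholder recognizer used only to illustrate the pattern
    return f"{model} processed {file_path}"
```

With this layout, the if/elif chain in `__exec_speech_recognition` would not need to change each time a model is added.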

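References
・Differences between batch processing and online processing

https://req-definer.com/entry/batch-and-online
・Converting a directory structure to Markdown

https://tree.nathanfriend.io/
・Site for creating sequence diagrams

https://sequencediagram.org/
・How to write sequence diagrams

https://sequencediagram.org/instructions.html
・Running Whisper on mps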

https://github.com/pytorch/pytorch/issues/87886
Goals Tech Blog (Publication)
The tech blog of Goals Inc., which develops the "HANZO" series of SaaS products for the food service industry.