🤖

Building an English Conversation App That Runs Locally #1

Published on 2025/08/02

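Small talk

Summer means udon, no contest‼︎ So I went out and had some udon.
Back when I exercised regularly I could eat a ton, but since I stopped, I eat less and large portions have become a struggle…
I want a stomach that can handle lots of good food, so maybe I should start exercising properly again.
On top of that, I'm actually thinner than when I was exercising, which has been a bit of a shock lately.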

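What are we building?

I've had more chances to use English lately, and I've started to think it would be fun to actually be good at it.
Around the same time, I keep seeing applications that let you practice English conversation with an AI.
They look great, but they charge a monthly fee, which is a high hurdle for a broke university student…

Wait, an English conversation app like this seems like it would be easy to build.
So I'm going to build one that costs nothing and runs locally!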

Technical topics in this post

How to use whisper
How to use Qwen2.5
How to use parler-tts

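Current progress

1. Implementing the Speaker with Speech-to-Text, an LLM, and Text-to-Speech ← this post
2. Implementing the Speaker with Qwen2.5-Omni
3. Implementing the database
4. Implementing the Web API
5. Implementing the frontend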

Environment

M2 MacBook Air (24GB memory, 10-core GPU)
python 3.12.7
- torch: 2.7.1
- transformers: 4.46.1 etc...

Speaker implementation approach

Convert the speech to text with Speech-to-Text
Generate the conversational reply with the LLM
Convert the reply to an audio file with Text-to-Speech
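
The three steps above amount to a simple chain of function calls. Here's a minimal sketch of that flow, assuming each client exposes a run() method that maps one string to another. The stub classes (EchoSTT, EchoLLM, EchoTTS) and the run_pipeline helper are hypothetical stand-ins for illustration, not the real models.

```python
from typing import Protocol


class Client(Protocol):
    """Anything with a run() method that maps one string to another."""

    def run(self, value: str, /) -> str: ...


# Hypothetical stubs standing in for the real models.
class EchoSTT:
    def run(self, value: str, /) -> str:
        # Pretend the audio file at this path contained this sentence.
        return f"transcript of {value}"


class EchoLLM:
    def run(self, value: str, /) -> str:
        return f"reply to [{value}]"


class EchoTTS:
    def __init__(self) -> None:
        self.last_prompt = ""

    def run(self, value: str, /) -> str:
        # Pretend the reply was synthesized and saved to this path.
        self.last_prompt = value
        return "reply.wav"


def run_pipeline(stt: Client, llm: Client, tts: Client, input_audio_path: str) -> str:
    """audio path -> text -> reply text -> output audio path."""
    text = stt.run(input_audio_path)
    response = llm.run(text)
    return tts.run(response)


if __name__ == "__main__":
    print(run_pipeline(EchoSTT(), EchoLLM(), EchoTTS(), "test.mp3"))
```

Keeping every stage behind the same run() shape makes it easy to swap a model later without touching the rest of the chain.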

Implementing each client

1. Speech-to-Text

This client uses openai/whisper-large-v3-turbo.
I implemented the Client class based on the model's Hugging Face page!

The run() method takes the path to an audio file and transcribes that audio to text.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


class STTClient:
    def __init__(self, device: torch.device):
        # Settings
        torch_dtype = torch.float32
        model_id = "openai/whisper-large-v3-turbo"

        # Initialize the model
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True,
            use_safetensors=True
        ).to(device)

        # Initialize the processor
        processor = AutoProcessor.from_pretrained(model_id)

        # Initialize the pipeline
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            torch_dtype=torch_dtype,
            device=device,
        )

    def run(self, audio_file_path: str) -> str:
        # Transcribe the audio file, telling Whisper the speech is English
        result = self.pipe(
            audio_file_path,
            generate_kwargs={"language": "en"}
        )
        return result["text"]

Usage looks like this!

if __name__ == "__main__":
    # Pick the device
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    # Run
    client = STTClient(device)
    text = client.run("test.mp3")
    print(text)

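2. LLM

This client uses Qwen/Qwen2.5-3B-Instruct.
I implemented the Client class based on the model's Hugging Face page!

To keep the conversation history, the class holds a conversation variable, and _delete_old_conversation() limits how many messages are remembered.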

from typing import Dict, List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Maximum number of messages kept in the history.
# NOTE: the original post does not show this value; 10 is an assumed placeholder.
MAX_CONVERSATION_HISTORY_LENGTH = 10


class LLMClient:
    def __init__(self, device: torch.device):
        # Initialize the model
        model_id = "Qwen/Qwen2.5-3B-Instruct"
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype="auto"
        ).to(device)

        # Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Prompt state
        self.system_prompt = {
            "role": "system",
            "content": "You are an English teacher in Japan. Let's practice English conversation together."
        }
        self.conversation = []

    def run(self, prompt: str) -> str:
        # Drop old history
        self._delete_old_conversation()

        # Build the messages
        user_prompt = {
            "role": "user",
            "content": prompt
        }
        self.conversation.append(user_prompt)
        messages = [self.system_prompt] + self.conversation

        # Generate the reply
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)

        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=512
        )
        # Strip the prompt tokens so only the newly generated tokens remain
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Save the reply
        assistant_prompt = {
            "role": "assistant",
            "content": response
        }
        self.conversation.append(assistant_prompt)

        return response

    def _delete_old_conversation(self) -> None:
        """ Drop old conversation history. """
        if len(self.conversation) > MAX_CONVERSATION_HISTORY_LENGTH:
            self.conversation = self.conversation[-MAX_CONVERSATION_HISTORY_LENGTH:]

    def get_conversation(self) -> List[Dict[str, str]]:
        return self.conversation

    def set_conversation(self, conversation: List[Dict[str, str]]) -> None:
        self.conversation = conversation

Usage looks like this!

if __name__ == "__main__":
    # Pick the device
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    # Run
    client = LLMClient(device)
    text = client.run("Hello, I'm Tom. I like soccer.")
    print(text)
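
For reference, here's how the history limit behaves in isolation. This is a minimal sketch with no model involved: the trim_history function and the limit of 4 are made up for illustration and mirror the slicing in _delete_old_conversation(). One thing to watch: because run() trims before appending the new user message, an even limit keeps the history starting with a user turn, while an odd limit would leave an assistant reply as the oldest message.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(conversation: List[Message], max_length: int) -> List[Message]:
    """Keep only the most recent max_length messages."""
    if max_length <= 0:
        # Guard: conversation[-0:] would return the whole list.
        return []
    if len(conversation) > max_length:
        return conversation[-max_length:]
    return conversation


# Build three user/assistant exchanges (six messages).
history: List[Message] = []
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_length=4)
print([m["content"] for m in trimmed])
# ['question 2', 'answer 2', 'question 3', 'answer 3']
```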

3. Text-to-Speech

This client uses parler-tts/parler-tts-mini-v1.
I implemented the Client class based on the model's Hugging Face page!

The run() method takes the reply text, generates and saves an audio file, and returns the path to that file.

import os
from datetime import datetime

import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Directory where generated audio is saved.
# NOTE: the original post does not show this value; "outputs" is an assumed placeholder.
OUTPUT_DIR_PATH = "outputs"


class TTSClient:
    def __init__(self, device: torch.device):
        # Settings
        model_id = "parler-tts/parler-tts-mini-v1"

        # Initialize the model
        self.model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)

        # Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Description of the voice to generate
        self.description = (
            "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. "
            "The recording is of very high quality, with the speaker's voice sounding clear and very close up."
        )

    def run(self, prompt: str) -> str:
        """
        Generate speech for the given text and save it as a .wav file.

        Args:
            prompt: text to be spoken

        Returns:
            str: path to the output file
        """
        # Get the token ids
        input_ids = self.tokenizer(self.description, return_tensors="pt").input_ids.to(self.model.device)
        prompt_input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.model.device)

        # Generate the audio
        generation = self.model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
        audio_arr = generation.cpu().numpy().squeeze()

        # Save the audio (create the output directory if it does not exist yet)
        os.makedirs(OUTPUT_DIR_PATH, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_file_name = f"{timestamp}.wav"
        output_file_path = os.path.join(OUTPUT_DIR_PATH, output_file_name)

        sf.write(output_file_path, audio_arr, self.model.config.sampling_rate)

        return output_file_path

Usage looks like this!

if __name__ == "__main__":
    # Pick the device
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    # Run
    client = TTSClient(device)
    output_file_path = client.run("Hello, I'm Tom. Nice to meet you.")
    print(output_file_path)
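
One detail in run() worth a closer look is the output file name. A timestamp with one-second resolution can collide if two replies are generated within the same second, and the output directory has to exist before the file is written. Here's a small sketch of one way to handle both. The build_output_path helper and the "outputs" directory name are assumptions for illustration, not part of the original code.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(output_dir: str, now: Optional[datetime] = None) -> Path:
    """Return a .wav path under output_dir, creating the directory if needed."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # %f adds microseconds, so two calls in the same second get distinct names.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S_%f")
    return directory / f"{timestamp}.wav"


if __name__ == "__main__":
    print(build_output_path("outputs"))
```

The optional now argument is only there so the naming can be checked with a fixed time.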

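Implementing the Speaker

Now I'll implement the Speaker using the Clients built so far!
__init__() picks the device to run on via a helper method.
speak() takes the path to an input audio file and returns the path to the output audio file, using the Clients in the following order.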

STTClient
LLMClient
TTSClient

Implementing `__init__()`

import torch


class Speaker:
    def __init__(self):
        self.conversation = []
        self.device = self._get_device()

    def _get_device(self) -> torch.device:
        """ Return the device to run on. """
        if torch.cuda.is_available():
            device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            device = torch.device("mps")
        else:
            device = torch.device("cpu")

        return device

Implementing `speak()`

class Speaker:
    # (__init__() and _get_device() omitted here)

    def speak(self, input_audio_path: str) -> str:
        """
        Generate a spoken reply to the input audio.

        Args:
            input_audio_path: path to the input audio file

        Returns:
            str: path to the output audio file
        """
        # audio -> text
        client = STTClient(self.device)
        text = client.run(input_audio_path)

        # text -> text
        client = LLMClient(self.device)
        client.set_conversation(self.conversation)
        response = client.run(text)
        self.conversation = client.get_conversation()

        # text -> audio
        client = TTSClient(self.device)
        output_audio_path = client.run(response)

        return output_audio_path
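
Note that speak() as written constructs all three clients on every call, so the model weights are loaded again each turn. If memory allows, one option is to build each client once and reuse it. The sketch below shows that idea with lazy, cached construction. CachedSpeaker, the factories argument, and the counting stub are hypothetical, and how much time this would save on the setup above has not been measured.

```python
from typing import Callable, Dict, Protocol


class Client(Protocol):
    def run(self, value: str, /) -> str: ...


class CachedSpeaker:
    """Builds each client on first use and reuses it on later calls."""

    def __init__(self, factories: Dict[str, Callable[[], Client]]):
        # factories maps "stt" / "llm" / "tts" to zero-argument constructors,
        # e.g. {"stt": lambda: STTClient(device), ...} in the real app.
        self._factories = factories
        self._clients: Dict[str, Client] = {}

    def _get(self, name: str) -> Client:
        if name not in self._clients:
            self._clients[name] = self._factories[name]()
        return self._clients[name]

    def speak(self, input_audio_path: str) -> str:
        text = self._get("stt").run(input_audio_path)
        response = self._get("llm").run(text)
        return self._get("tts").run(response)


# Hypothetical stub that counts how many times it was constructed.
class CountingClient:
    created = 0

    def __init__(self) -> None:
        CountingClient.created += 1

    def run(self, value: str, /) -> str:
        return value


speaker = CachedSpeaker({name: CountingClient for name in ("stt", "llm", "tts")})
speaker.speak("first.mp3")
speaker.speak("second.mp3")
print(CountingClient.created)  # 3: one per client, not one per call
```

With a long-lived LLMClient the history stays inside the client, so the set_conversation() / get_conversation() round trip would no longer be needed. The trade-off is that all three models stay in memory at the same time.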

Results

A reply was generated in about one minute!
That's slow, but I think it can't be helped given my PC's specs.
I wonder if there's a way to make it faster...
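
Before trying to speed things up, it helps to know which stage takes the time. Here's a minimal sketch that times each stage with time.perf_counter(). The timed helper and the sleep-based dummy stages are hypothetical stand-ins for the STT, LLM, and TTS calls.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def timed(label: str, func: Callable[[], T], timings: Dict[str, float]) -> T:
    """Run func(), store its wall-clock duration in timings[label], return its result."""
    start = time.perf_counter()
    result = func()
    timings[label] = time.perf_counter() - start
    return result


# Dummy stages; in the real app these would be the three client.run() calls.
def fake_stt() -> str:
    time.sleep(0.01)
    return "hello"


def fake_llm(text: str) -> str:
    time.sleep(0.02)
    return text.upper()


def fake_tts(text: str) -> str:
    time.sleep(0.01)
    return "out.wav"


timings: Dict[str, float] = {}
text = timed("stt", fake_stt, timings)
response = timed("llm", lambda: fake_llm(text), timings)
path = timed("tts", lambda: fake_tts(response), timings)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Timing model construction separately from the run() calls would also show how much of the minute goes to loading weights rather than inference.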

References

Source code

Currently in progress
I'm implementing the Speaker with Speech-to-Speech using Qwen/Qwen2.5-Omni-3B, but the response time doesn't look practical…
