
Trying Speech Recognition with the Web Speech API (with a React Hooks Implementation)


🗣️ What is the Web Speech API?

The Web Speech API is a browser API for handling speech, and it consists of
the following two modules.

  • SpeechRecognition (speech recognition / Speech-to-Text)
     Converts speech to text in real time.
  • SpeechSynthesis (speech synthesis / Text-to-Speech)
     Reads text aloud as speech.

This article implements transcription using SpeechRecognition.
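Before wrapping it in a React hook, here is a rough sketch of the raw browser API (a minimal sketch only: Chromium still exposes the constructor under the webkit prefix, and in TypeScript this compiles once the type definitions from section 1 below are in place):

// Speech-to-Text: create a recognizer, attach onresult, and start it.
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!Recognition) throw new Error("SpeechRecognition is not supported in this browser");

const recognition = new Recognition();
recognition.lang = "ja-JP";
recognition.interimResults = true;
recognition.onresult = (event) => {
  // The last entry holds the most recent (possibly interim) result.
  const latest = event.results[event.results.length - 1];
  console.log(latest[0].transcript);
};
recognition.start();

// Text-to-Speech is the other half of the API: hand text to speechSynthesis.
speechSynthesis.speak(new SpeechSynthesisUtterance("Hello"));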

1. Type Definitions

  • First, define the TypeScript types.
src/types/SpeechRecognition.ts
export interface SpeechRecognitionEvent extends Event {
  results: SpeechRecognitionResultList;
}

export interface SpeechRecognitionResultList {
  length: number;
  item(index: number): SpeechRecognitionResult;
  [index: number]: SpeechRecognitionResult;
}

export interface SpeechRecognitionResult {
  length: number;
  item(index: number): SpeechRecognitionAlternative;
  [index: number]: SpeechRecognitionAlternative;
  isFinal: boolean;
}

export interface SpeechRecognitionAlternative {
  transcript: string;
  confidence: number;
}

export interface SpeechRecognitionErrorEvent extends Event {
  error: string;
  message: string;
}

export interface SpeechRecognition extends EventTarget {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  start(): void;
  stop(): void;
  abort(): void;
  onstart: ((this: SpeechRecognition, ev: Event) => void) | null;
  onresult: ((this: SpeechRecognition, ev: SpeechRecognitionEvent) => void) | null;
  onerror: ((this: SpeechRecognition, ev: SpeechRecognitionErrorEvent) => void) | null;
  onend: ((this: SpeechRecognition, ev: Event) => void) | null;
}

/**
 * Configuration options for the useSpeechRecognition hook
 */
export interface SpeechRecognitionOptions {
  /** Whether to keep recognizing continuously (default: true) */
  continuous?: boolean;
  /** Whether to receive interim results (default: true) */
  interimResults?: boolean;
  /** Language to recognize (default: "ja-JP") */
  lang?: string;
  /** Maximum listening time in milliseconds: recognition stops automatically once it elapses */
  maxListeningMs?: number;
  /** How long continuous silence may last, in milliseconds, before recognition stops */
  silenceTimeoutMs?: number;
}

  • Also add a global declaration that extends the window object.
src/types/global.d.ts
// The type-only import makes this file a module, which `declare global` requires.
import type { SpeechRecognition } from "./SpeechRecognition";

declare global {
  interface Window {
    // Optional because unsupported browsers expose neither constructor.
    SpeechRecognition?: new () => SpeechRecognition;
    webkitSpeechRecognition?: new () => SpeechRecognition;
  }
}

2. Implementing the useSpeechRecognition Hook

  • Create a custom hook that makes the Web Speech API easy to use (a minimal usage example follows the hook).
src/hooks/useSpeechRecognition.ts
"use client";
import { useState, useEffect, useCallback, useRef } from "react";
import type {
  SpeechRecognition,
  SpeechRecognitionEvent,
  SpeechRecognitionErrorEvent,
  SpeechRecognitionOptions,
} from "@/types/SpeechRecognition"; // ✅ import the type definitions

export const useSpeechRecognition = (options?: SpeechRecognitionOptions) => {
  const [transcript, setTranscript] = useState("");
  const [listening, setListening] = useState(false);
  const [browserSupportsSpeechRecognition, setBrowserSupport] = useState(false);
  const [status, setStatus] = useState<"idle" | "connecting" | "connected" | "error">("idle");
  const [errorMessage, setErrorMessage] = useState<string>("");

  const recognitionRef = useRef<SpeechRecognition | null>(null);
  const maxTimerRef = useRef<number | null>(null);
  const silenceTimerRef = useRef<number | null>(null);

  const stableOptions = JSON.stringify(options ?? {});

  useEffect(() => {
    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!SpeechRecognition) {
      setBrowserSupport(false);
      return;
    }

    setBrowserSupport(true);
    recognitionRef.current = new SpeechRecognition();

    const continuous = options?.continuous ?? true;
    const interimResults = options?.interimResults ?? true;
    const lang = options?.lang ?? "ja-JP";
    const silenceTimeoutMs = options?.silenceTimeoutMs ?? 10 * 1000;
    const maxListeningMs = options?.maxListeningMs ?? 60 * 1000;

    recognitionRef.current.lang = lang;
    recognitionRef.current.continuous = continuous;
    recognitionRef.current.interimResults = interimResults;

    const startSilenceTimer = () => {
      if (!silenceTimeoutMs) return;
      if (silenceTimerRef.current !== null) clearTimeout(silenceTimerRef.current);
      silenceTimerRef.current = window.setTimeout(() => recognitionRef.current?.stop(), silenceTimeoutMs);
    };

    const startMaxTimer = () => {
      if (!maxListeningMs) return;
      if (maxTimerRef.current !== null) clearTimeout(maxTimerRef.current);
      maxTimerRef.current = window.setTimeout(() => recognitionRef.current?.stop(), maxListeningMs);
    };

    recognitionRef.current.onstart = () => {
      startMaxTimer();
      startSilenceTimer();
      setListening(true);
      setStatus("connected");
      setErrorMessage("");
    };

    recognitionRef.current.onresult = (event: SpeechRecognitionEvent) => {
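      // Concatenate every result so far (final and interim) into a single running transcript.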
      let fullTranscript = "";
      for (let i = 0; i < event.results.length; i++) {
        fullTranscript += event.results[i][0].transcript;
      }
      setTranscript(fullTranscript);
      startSilenceTimer();
    };

    recognitionRef.current.onend = () => {
      clearTimeout(maxTimerRef.current ?? undefined);
      clearTimeout(silenceTimerRef.current ?? undefined);
      setListening(false);
      setStatus("idle");
    };

    recognitionRef.current.onerror = (event: SpeechRecognitionErrorEvent) => {
      console.error("Speech recognition error:", event.error);
      setErrorMessage(event.message || event.error || "Speech recognition error");
      setStatus("error");
      setListening(false);
    };

    return () => recognitionRef.current?.abort();
  }, [stableOptions]);

  const startListening = useCallback(() => {
    if (recognitionRef.current && !listening) {
      setTranscript("");
      setStatus("connecting");
      setErrorMessage("");
      try {
        recognitionRef.current.start();
      } catch {
        setStatus("error");
        setErrorMessage("Failed to start speech recognition");
      }
    }
  }, [listening]);

  const stopListening = useCallback(() => recognitionRef.current?.stop(), []);
  const abortListening = useCallback(() => recognitionRef.current?.abort(), []);
  const resetTranscript = useCallback(() => setTranscript(""), []);

  return {
    transcript,
    listening,
    browserSupportsSpeechRecognition,
    startListening,
    stopListening,
    abortListening,
    resetTranscript,
    status,
    errorMessage,
  };
};
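Using the hook from a component can be as small as the following (a hypothetical MiniDemo component for illustration; the full demo page in section 3 adds settings, status, and error handling):

"use client";

import { useSpeechRecognition } from "@/hooks/useSpeechRecognition";

// Hypothetical minimal consumer of the hook above.
export function MiniDemo() {
  const { transcript, listening, startListening, stopListening } =
    useSpeechRecognition({ lang: "ja-JP" });

  return (
    <div>
      <button onClick={listening ? stopListening : startListening}>
        {listening ? "Stop" : "Start"}
      </button>
      <p>{transcript}</p>
    </div>
  );
}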

3. Building the UI

  • Build a simple UI for testing SpeechRecognition.
src/app/speechRecognition/page.tsx
"use client";

import { useState, useMemo } from "react";
import {
  useSpeechRecognition,
} from "@/hooks/useSpeechRecognition";
import type { SpeechRecognitionOptions } from "@/types/SpeechRecognition";

export default function SpeechDemo() {
  // Options passed in from the UI (kept to the bare minimum)
  const [lang, setLang] = useState<SpeechRecognitionOptions["lang"]>("ja-JP");
  const [silenceTimeoutMs, setSilenceTimeoutMs] = useState<number>(3000);
  const [maxListeningMs, setMaxListeningMs] = useState<number>(60000);

  const options = useMemo<SpeechRecognitionOptions>(() => ({
    lang,
    silenceTimeoutMs,
    maxListeningMs,
    continuous: true,
    interimResults: true,
  }), [lang, silenceTimeoutMs, maxListeningMs]);

  const {
    transcript,
    listening,
    browserSupportsSpeechRecognition,
    startListening,
    stopListening,
    abortListening,
    resetTranscript,
    status,
    errorMessage,
  } = useSpeechRecognition(options);

  const copy = async () => {
    try {
      await navigator.clipboard.writeText(transcript);
    } catch {
      // no-op
    }
  };

  return (
    <div className="mx-auto max-w-3xl space-y-6 p-4">
      <header className="space-y-2">
        <h1 className="text-2xl font-bold">🎙️ Web Speech API Demo</h1>
        <p className="text-sm text-gray-500">
          Transcribes speech to text in the browser alone (experimental API / Chrome recommended)
        </p>
        {!browserSupportsSpeechRecognition && (
          <p className="rounded-md bg-red-50 p-3 text-sm text-red-700">
            Your browser does not support the Web Speech API (SpeechRecognition).
            Please try a Chromium-based browser.
          </p>
        )}
      </header>

      {/* Settings */}
      <section className="rounded-xl border p-4">
        <h2 className="mb-3 font-semibold">Settings</h2>
        <div className="grid gap-4 sm:grid-cols-3">
          <label className="block text-sm">
            <span className="mb-1 block text-gray-600">Language</span>
            <select
              className="w-full rounded-lg border px-3 py-2"
              value={lang}
              onChange={(e) => setLang(e.target.value)}
            >
              <option value="ja-JP">日本語 (ja-JP)</option>
              <option value="en-US">English (en-US)</option>
              <option value="zh-CN">中文 (zh-CN)</option>
              <option value="ko-KR">한국어 (ko-KR)</option>
            </select>
          </label>

          <label className="block text-sm">
            <span className="mb-1 block text-gray-600">Silence timeout (ms)</span>
            <input
              type="number"
              min={0}
              step={500}
              className="w-full rounded-lg border px-3 py-2"
              value={silenceTimeoutMs}
              onChange={(e) => setSilenceTimeoutMs(Number(e.target.value) || 0)}
            />
          </label>

          <label className="block text-sm">
            <span className="mb-1 block text-gray-600">Max listening time (ms)</span>
            <input
              type="number"
              min={0}
              step={1000}
              className="w-full rounded-lg border px-3 py-2"
              value={maxListeningMs}
              onChange={(e) => setMaxListeningMs(Number(e.target.value) || 0)}
            />
          </label>
        </div>
      </section>

      {/* Controls */}
      <section className="rounded-xl border p-4">
        <h2 className="mb-3 font-semibold">Controls</h2>
        <div className="flex flex-wrap items-center gap-2">
          <button
            onClick={startListening}
            disabled={!browserSupportsSpeechRecognition || listening}
            className="rounded-lg bg-black px-4 py-2 text-white disabled:opacity-50"
          >
            ▶️ Start
          </button>
          <button
            onClick={stopListening}
            disabled={!browserSupportsSpeechRecognition || !listening}
            className="rounded-lg bg-gray-800 px-4 py-2 text-white disabled:opacity-50"
          >
            ⏹ Stop
          </button>
          <button
            onClick={abortListening}
            disabled={!browserSupportsSpeechRecognition || !listening}
            className="rounded-lg bg-gray-600 px-4 py-2 text-white disabled:opacity-50"
          >
            🛑 Abort
          </button>
          <button
            onClick={resetTranscript}
            className="rounded-lg border px-4 py-2"
          >
            🧹 Clear
          </button>
          <button
            onClick={copy}
            disabled={!transcript}
            className="rounded-lg border px-4 py-2 disabled:opacity-50"
          >
            📋 Copy
          </button>
        </div>

        {/* Status display */}
        <div className="mt-3 flex flex-wrap items-center gap-2 text-sm">
          <span className="rounded-full border px-3 py-1">
            Status: <strong>{status}</strong>
          </span>
          <span
            className={
              "rounded-full px-3 py-1 " +
              (listening ? "bg-green-100 text-green-700" : "bg-gray-100 text-gray-600")
            }
          >
            {listening ? "🎤 Listening" : "⏸ Idle"}
          </span>
          <span className="text-gray-500">Lang: {lang}</span>
        </div>

        {errorMessage && (
          <p className="mt-2 rounded-md bg-red-50 p-3 text-sm text-red-700">
            {errorMessage}
          </p>
        )}
      </section>

      {/* Result */}
      <section className="rounded-xl border p-4">
        <h2 className="mb-3 font-semibold">Result (Transcript)</h2>
        <textarea
          className="h-40 w-full resize-none rounded-lg border p-3 font-mono text-sm"
          value={transcript}
          readOnly
          placeholder="The transcription result will appear here…"
        />
      </section>

      <footer className="text-xs text-gray-500">
        ※ Microphone permission is required. Running over HTTPS is recommended (localhost also works).
      </footer>
    </div>
  );
}

4. Testing

  • "開始"を押しマイク入力を開始。
  • マイクが起動し、音声入力がリアルタイムで反映されている事を確認。
  • "停止"を押し入力を止める。

Summary

  • With the Web Speech API, speech recognition can be achieved in the browser alone.
  • Configuring a silence timeout and a maximum listening time improves the UX (user experience).
  • At the moment it runs most reliably in Chromium-based browsers, which are also recommended for development and testing.
