A Script That Downloads YouTube Audio and Transcribes It with Whisper

Published on 2025/02/19
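Introduction

This article introduces a Python script that downloads only the audio track of a YouTube video and transcribes it with Whisper, the speech recognition model provided by OpenAI.

Whisper is free to use and its accuracy is high, which makes it very handy for small transcription jobs.
The article walks through the following steps:
1. Installing the required libraries
2. Downloading only the audio from YouTube
3. Transcribing with Whisper
4. Writing the full text and the timestamped segments to files
The complete code is included as-is at the end of the article, so feel free to try it out.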

Libraries used
yt-dlp

A library for downloading YouTube videos.

ffmpeg-python

A wrapper that makes FFmpeg easier to drive from Python.

openai-whisper (whisper)

The library for the speech recognition model provided by OpenAI.

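Environment setup

The script is expected to run in the following environment.
Python 3.8 or later (the original setup targeted 3.7, but current openai-whisper releases require 3.8 or later)
pip (the Python package manager)
The FFmpeg executable, available on PATH (yt-dlp's audio extraction and Whisper's audio loading both call it, and the ffmpeg-python package does not bundle it)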
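The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of the original script: the function name `check_environment` is hypothetical, and the default minimum version of 3.8 is an assumption based on current openai-whisper packaging.

```python
import shutil
import sys


def check_environment(min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means OK.

    min_python is an assumed minimum, adjust it to match your setup.
    """
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or later is required")
    # yt-dlp's audio extraction and Whisper both shell out to ffmpeg,
    # so the executable has to be discoverable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
```

If the function returns an empty list, the interpreter version and FFmpeg are in place and the pip steps can follow.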

Installing the libraries

# First, uninstall any existing "whisper" package (a different, unrelated package)
pip uninstall whisper

# Next, install openai-whisper
pip install openai-whisper

# Install yt-dlp and ffmpeg-python
pip install yt-dlp ffmpeg-python

# (Upgrade pip as well if needed)
pip install --upgrade pip
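To confirm that the pip commands above took effect, the installed versions can be queried from Python. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the `REQUIRED` list are illustrative and not part of the original script.

```python
from importlib import metadata

# Distribution names as passed to pip in the commands above.
REQUIRED = ["openai-whisper", "yt-dlp", "ffmpeg-python"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```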
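Given the URL of a YouTube video, the script proceeds as follows:
1. Download only the audio of the YouTube video and save it as an mp3 file

2. Transcribe it with Whisper (processing speed and accuracy depend on model_size)

3. Write the transcription to a text file (transcript_full.txt) and a JSON file (transcript_segments.json)

Save the code below under a file name such as youtube.py and run it in an environment where the libraries above are installed.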

Usage:

pip uninstall whisper
pip install openai-whisper
pip install yt-dlp ffmpeg-python

(To be safe, also run pip install --upgrade pip.)
python youtube.py
youtube.py
import yt_dlp
import whisper
import json
import os

def download_youtube_audio(url: str, output_filename: str) -> str:
    """
    Download only the audio track of the given YouTube video and
    save it to output_filename in mp3 format.

    Returns:
        audio_filepath (str): path of the saved file (mp3)
    """
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": "%(title)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        # prepare_filename() returns the name before post-processing
        # (e.g. .webm), so swap the extension for .mp3
        downloaded_filepath = ydl.prepare_filename(info)
        base, _ = os.path.splitext(downloaded_filepath)
        audio_filepath = base + ".mp3"

    # Rename the mp3 to the requested file name.
    # os.replace overwrites an existing target on every platform,
    # whereas os.rename fails on Windows if the target already exists.
    if os.path.exists(audio_filepath):
        os.replace(audio_filepath, output_filename)
        return output_filename
    else:
        return audio_filepath

def transcribe_audio(audio_filepath: str, model_size: str = "base") -> dict:
    """
    Run speech recognition with a Whisper model and return the result.
    model_size: 'tiny' | 'base' | 'small' | 'medium' | 'large'
    """
    model = whisper.load_model(model_size)
    # On CPU, Whisper warns that FP16 is not supported and falls back to FP32.
    # To avoid the warning, pass fp16=False explicitly:
    # result = model.transcribe(audio_filepath, fp16=False)
    result = model.transcribe(audio_filepath)
    return result

if __name__ == "__main__":
    # Example: youtube_url = "https://www.youtube.com/watch?v=AQ7jzvKgnLw"
    youtube_url = "https://www.youtube.com/watch?v=LX63YKORWa8"
    output_audio = "youtube_audio.mp3"

    # 1. Download the YouTube audio
    audio_path = download_youtube_audio(youtube_url, output_audio)
    print(f"Audio saved as: {audio_path}")

    # 2. Transcribe with Whisper
    transcript_data = transcribe_audio(audio_path, model_size="base")

    # 3. Write the full text and the segments to files
    full_text = transcript_data["text"]
    segments = transcript_data["segments"]  # includes timestamps

    with open("transcript_full.txt", "w", encoding="utf-8") as f:
        f.write(full_text)

    with open("transcript_segments.json", "w", encoding="utf-8") as f:
        json.dump(segments, f, ensure_ascii=False, indent=2)

    print("Transcription complete! Full text and segmented JSON saved.")
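The script downloads under the video title and renames the file afterwards. As an alternative sketch, the output template can be derived from the requested file name so that yt-dlp writes the mp3 to the target path directly and no rename is needed. The helper `build_ydl_opts` is hypothetical and not part of the original script; the option keys are the same ones the script already uses.

```python
import os


def build_ydl_opts(output_filename: str) -> dict:
    """Build yt-dlp options that write the extracted audio to output_filename.

    The extension is left to yt-dlp via %(ext)s, so after FFmpegExtractAudio
    runs the final file ends with the preferred codec's extension.
    """
    stem, ext = os.path.splitext(output_filename)
    codec = ext.lstrip(".") or "mp3"  # fall back to mp3 if no extension given
    return {
        "format": "bestaudio/best",
        # A literal % in an output template must be escaped as %%.
        "outtmpl": stem.replace("%", "%%") + ".%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": codec,
            "preferredquality": "192",
        }],
    }


if __name__ == "__main__":
    print(build_ydl_opts("youtube_audio.mp3"))
```

The resulting dictionary would be passed to `yt_dlp.YoutubeDL(...)` in place of `ydl_opts`. The trade-off is that the video title no longer appears in any intermediate file name.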

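How to use

1. Prepare a script file (for example, youtube.py) and paste in the code above.

2. Run it from a terminal or command prompt as follows.
python youtube.py
3. When processing finishes, the following files are generated in the same folder.

youtube_audio.mp3

transcript_full.txt

transcript_segments.json

transcript_full.txt contains the full transcribed text, and transcript_segments.json contains the segment information with timestamps.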
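One use for the timestamped segments is generating subtitles. The sketch below converts a list of segments into SRT text, assuming each segment is a dictionary with `start`, `end` (seconds) and `text` keys, which is how Whisper's segments appear in the JSON written by the script. The function names are illustrative.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments into the text of an SRT subtitle file."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)
```

To use it, load transcript_segments.json with `json.load`, pass the resulting list to `segments_to_srt`, and write the returned string to a file such as transcript.srt.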
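Conclusion

This article introduced a script that extracts only the audio from a YouTube video and transcribes it easily with Whisper.

Whisper comes in several model sizes, so pick one that fits your situation and available resources.
tiny and base are lightweight and run comparatively fast.

large is the most accurate, but it needs considerably more memory and CPU/GPU resources.

Give it a try!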
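The size trade-off can be turned into a simple selection rule. The sketch below picks the largest model that fits a given memory budget. The memory figures are rough approximations assumed for illustration (roughly in line with the VRAM guidance in the Whisper README), not measured values, and `pick_model_size` is a hypothetical helper.

```python
# (model name, approximate memory needed in GB), largest first.
# These numbers are assumptions for illustration; check them for your setup.
APPROX_MEMORY_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model_size(available_gb: float) -> str:
    """Return the largest model whose approximate requirement fits the budget."""
    for name, needed_gb in APPROX_MEMORY_GB:
        if available_gb >= needed_gb:
            return name
    # Nothing fits comfortably: fall back to the smallest model.
    return "tiny"


if __name__ == "__main__":
    for budget in (0.5, 1, 2, 6, 12):
        print(budget, "GB ->", pick_model_size(budget))
```

The returned name can be passed as `model_size` to `transcribe_audio`. Memory is only one factor: on a CPU-only machine, processing time may be the tighter constraint.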