Implementing a Minimalist Audio Summary Feature for Claude Code Responses
Isn't reading all those long responses a chore?
I've been using Claude Code every day, and there's been one thing that has subtly bothered me for a while.
When I send over research tasks, it returns a 3,000-character wall of text, and if I start a design discussion, it outputs multiple proposals, comparison tables, and even a recommendation—all at once. I'm grateful for the information, but it's exhausting to stare at the screen and track every word. I've always thought that if I could just listen to the key points, I could grasp the content while looking at other screens.
However, in my environment, I run multiple Claude Code instances in parallel, so a simple implementation like "read aloud automatically when the response is complete" doesn't work. It would turn into a nightmare where every instance starts talking at the same time.
So, this is the story of how I ended up with a solution.
I settled on a configuration using only the macOS say command and a /speak slash command. Zero dependencies, 5-minute setup, and zero cost. I’ll share the journey, including how I tried several new Japanese Neural voices and they all failed, how 360 wpm turned out to be surprisingly listenable, and my brief detour into the CLAUDE.md marker convention approach before circling back.
There are several OSS projects with similar ideas in the English-speaking world (like Vox and AgentVibes). Since this isn't entirely new, what I want to write about in this article isn't the code itself, but "why this design was chosen."
What I wanted to do and what I gave up on
Simply put, I wanted two things:
- I want to hear the key points of long responses via audio.
- I want to listen while watching other screens.
There's a catch: I use git worktree to switch branches and run 3–4 Claude Code instances in parallel. If each instance starts talking on its own, it’s chaos. It interrupts me with "Task complete" when I’m focused on something else, or one session starts reading key points while another finishes and talks over it. Just imagining it sounds noisy.
So, the requirements became:
- Read only the key points. The full text is too long.
- Do not talk automatically. I want to trigger speech at my own timing.
- Since I use it daily, I want minimal dependencies and setup.
A quick look at existing implementations
I wanted to avoid reinventing the wheel, so I looked at what was already out there. There were a few.
| Implementation | Method | Scope | Parallel Support |
|---|---|---|---|
| Claude Code hooks for voice notifications (Zenn/yatabis) | Stop hook | Full text | Not mentioned |
| Claude Code hooks for task status (DevelopersIO) | Stop hook | Notifications only | Not mentioned |
| AgentVibes (GitHub) | PostToolUse hook | Markdown removal | Has speech lock |
| Vox (rtk-ai) | /speak command + hook | Unknown | Speaks via hook |
| Real-time audio for Claude Code responses (Zenn/ytka) | Hook | Cloud TTS | Not mentioned |
Most Japanese articles use a hook-based approach, and they read the full text. The TTS engines used are typically "quality-focused" ones like Kokoro, VOICEVOX, Style-Bert-VITS2, or Aivis.
In the English-speaking world, Vox goes as far as creating a /speak slash command, and honestly, I had an "oh no" moment when I found it. AgentVibes also includes the idea of automatically adding TTS instructions to CLAUDE.md. However, I couldn't find any Japanese articles covering the "on-demand execution + macOS say" combination, so I decided to write this down as a note to myself.
Three design decisions
Here are the conclusions up front.
- Use a /speak slash command instead of a hook.
- Fix the TTS engine to the macOS say command.
- Generate the key points on the spot, at the moment /speak is called.
The default speed is 360 wpm (more on this later).
Why I dropped hooks
It comes down to the parallel session problem. If you set up a Stop hook, every running Claude Code instance (all 3 or 4 of them) starts talking whenever its response finishes, which makes it impossible to get any work done.
You could keep the hook approach and manage a flag so that only one session speaks, but each worktree runs in its own shell, so juggling environment variables per worktree is a hassle. The least-effort option was simply to type /speak by hand in the session I actually want to hear.
Why I used the macOS say command
There are two reasons: I wanted to keep the setup simple, and I didn't want to incur any running costs.
Candidates included VOICEVOX, ElevenLabs, Google Cloud TTS, and OpenAI TTS, and all of them beat say in terms of quality. But what I want to do is "pick up key points while listening in the background," and announcer-level quality isn't necessary. As long as it's intelligible, it's enough.
I didn't feel like dealing with the trouble of setting up resident processes, managing billing or free-tier limits for API calls, or sending text to the cloud just for this use case. If I use say, which is already built into macOS, everything is solved: no installation, no API, no fees, and it works offline. If I want to adjust the quality, I can just fine-tune it later using CLAUDE_SPEAK_VOICE and CLAUDE_SPEAK_RATE.
I'll touch on other options in the "Other routes" section later.
Generating key points on-demand when /speak is called
Actually, this was a different approach at first. I hit a snag, so I'll explain the process that led to the current state.
There are about three ways to think about extracting key points. One is to have a separate model (like Haiku) summarize the entire response afterward. This works for any response, but it increases API costs and latency. Another is to mechanically slice off the first or last N characters. This is easy to implement but misses the point if the summary lies in the middle.
What remains is having Claude itself write a summary during the response generation phase. I chose this at first.
I added a rule to CLAUDE.md to "embed a <!-- TTS_SUMMARY --> block at the end of the response," and the /speak side would extract it using sed. I thought it was a solid plan since it would have zero additional LLM costs or latency.
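For reference, the extraction side of that plan can be sketched in a few lines. The exact marker layout is not shown in this article, so the opening `<!-- TTS_SUMMARY` line, the closing `-->` line, and the function name `extract_tts_summary` are all assumptions made for illustration.

```shell
# Assumed layout (not confirmed by the article):
#   <!-- TTS_SUMMARY
#   ...summary lines...
#   -->
extract_tts_summary() {
  # Read a response on stdin, keep the marker block, then drop the two marker lines.
  sed -n '/^<!-- TTS_SUMMARY/,/^-->/p' | sed '1d;$d'
}

# Hypothetical response text used only to exercise the function.
response='Long answer body.
<!-- TTS_SUMMARY
What I did: compared three options.
What I need you to judge: which option to adopt.
-->'

printf '%s\n' "$response" | extract_tts_summary
```

If the response contains no marker block, the function prints nothing, so the caller would have to decide what to read in that case.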
It turned out to be intrusive when I actually ran it.
A summary block was embedded at the end of every response body. Since it's an HTML comment, it's invisible when the Markdown is rendered, but the Claude Code terminal shows it as plain text. Even for responses I had no intention of listening to, "What I did: ..." and "What I need you to judge: ..." would dutifully appear and clutter the screen.
So, I changed direction and switched to a method where Claude Code looks at its immediately preceding response and summarizes it on the spot the moment /speak is called. No changes to CLAUDE.md. The response body remains clean. The latency of /speak increases by a few seconds, but since it only runs when I use it, it's not a practical issue.
Running it when called was more straightforward than preparing it with a rules file. I should have done this from the beginning.
Implementation
There are two files. I don't add anything to CLAUDE.md.
~/.claude/commands/speak.md
Claude Code slash commands are automatically recognized just by placing them in ~/.claude/commands/<name>.md. This is the core of /speak.
---
description: Read the key points of the previous response aloud using the macOS say command (on-demand execution)
---
# /speak — Read the previous response aloud
You will now read your own previous response aloud. Do not generate a new answer; mechanically follow the steps below.
## Steps
### 1. Generate text to be read
Look at your own previous response and summarize it on the spot into the following two-line structure. No markers or prior preparation are needed (generate from scratch every time this command is called).
Done: {1-3 sentences: Lead with the conclusion about what you investigated/decided/implemented}
Requires decision: {1-3 sentences: The point the user needs to select/approve next. Omit this line if not applicable}
Summary rules:
- Each line should be 1-3 sentences, in Japanese, polite form (desu/masu).
- If the previous response is a short reply of 1-2 sentences, use it as-is without summarizing.
- Do not include emojis, URLs, file paths, code blocks, or brackets. Use only Japanese periods "。" and commas "、".
- Always convert English words, abbreviations, code names, and file names to Katakana or Japanese readings (because the macOS say command's Kyoko misreads English). Examples:
  - Claude → クロード, Claude Code → クロードコード
  - TTS → ティーティーエス, LLM → エルエルエム, API → エーピーアイ, URL → ユーアールエル
  - CLAUDE.md → クロードエムディー, speak.md → スピークエムディー
  - `/speak` → スラッシュスピーク, `/speak-stop` → スラッシュスピークストップ
  - Kyoko → 京子, VOICEVOX → ボイスボックス, ElevenLabs → イレブンラボ
  - macOS → マックオーエス, zshrc → ゼットエスエイチアールシー
  - hook → フック, worktree → ワークツリー, wpm → ダブリューピーエム
  - Convert others based on the same principle.
- Numbers can remain as Arabic numerals (e.g., 360).
### 2. Format the text
Remove or replace the following:
- Remove markdown symbols (`#`, `*`, `` ` ``, `>`, `-`, `[...](...)`, etc.)
- Remove emojis
- Remove URLs, absolute paths, and the contents of code blocks
- Replace line breaks with "、"
- Replace consecutive spaces with a single space
- Delete single quotes (`'`) (to avoid say command escape accidents)
### 3. Stop existing speech
Execute the following using the Bash tool (ignore failures):
pkill -f "say -v" 2>/dev/null; true
### 4. Speak (background, non-blocking)
Execute the following using the Bash tool. Always use a subshell + `&` to detach and avoid blocking Claude Code. Do not use `run_in_background` (as the subshell already detaches).
Command format (embed the text from Step 2 directly into `<TEXT>` enclosed in single quotes):
( say -v "${CLAUDE_SPEAK_VOICE:-Kyoko}" -r "${CLAUDE_SPEAK_RATE:-360}" '<TEXT>' >/dev/null 2>&1 & )
### 5. Confirmation output
The response to the user should be exactly one line. Format:
Reading: <formatted text>
No extra explanation, introduction, emojis, or apologies allowed. If reading fails, return only `Reading failed: <reason>`.
## Prohibitions
- Generating a new answer (except for summarization)
- Reciting steps or long explanations
- Synchronously waiting for Bash commands (forgetting `&` is forbidden; `run_in_background: true` is also forbidden)
- Retrying on the first failure
Step 1 is where the previous response gets summarized at call time. Since I no longer rely on pre-defined rules in CLAUDE.md, the "Done" and "Requires decision" structure is produced fresh here on every call. The Katakana conversion rules for English words also live here. This way I don't contaminate the CLAUDE.md files of other projects, and turning the feature off is just a matter of deleting the file.
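In /speak, Claude itself performs the cleanup described in Step 2. If you wanted to do a similar cleanup mechanically instead, a minimal sketch might look like the following. The function name `format_for_say` and the exact set of stripped characters are assumptions, and the sketch does not attempt to remove emojis, URLs, or code block contents.

```shell
format_for_say() {
  # Drop leading list markers, strip common markdown symbols and single quotes,
  # collapse repeated spaces, trim leading spaces, then join lines with "、".
  sed -e 's/^[[:space:]]*- //' \
      -e "s/[#*\`>']//g" \
      -e 's/  */ /g' \
      -e 's/^ *//' |
    awk 'NR > 1 { printf "、" } { printf "%s", $0 } END { print "" }'
}

printf '%s\n' '# 完了' '- `say` を  使います。' | format_for_say
# prints: 完了、say を 使います。
```

Deleting single quotes matters most here, because Step 4 embeds the text inside a single-quoted shell argument.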
In Step 4, I included a subshell and & because, if forgotten, Claude Code would wait for the say command to finish speaking the entire audio before proceeding to the next turn. I learned that the hard way.
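The effect of the subshell plus `&` can be checked without any audio. In this small sketch, `sleep 2` stands in for a long-running say call (an assumption made so it runs on any system), and the timing shows that the calling shell gets control back right away.

```shell
# `sleep 2` is a stand-in for `say ...`; it is not part of /speak itself.
start=$(date +%s)
( sleep 2 >/dev/null 2>&1 & )   # detached: the calling shell does not wait
end=$(date +%s)
elapsed=$((end - start))
echo "returned after ${elapsed}s"
```

The redirection to /dev/null is probably doing useful work as well: a background process that keeps the caller's output pipe open can keep a tool that reads that pipe waiting, even though the shell itself has returned.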
~/.claude/commands/speak-stop.md
There are many times when you think, "Wait, stop," after it starts talking, so I included a stop command.
---
description: Instantly stop the current speech (kill the say process running from /speak)
---
# /speak-stop — Stop reading
Execute the following command using the Bash tool:
pkill -f "say -v" 2>/dev/null; true
After execution, the response to the user should be exactly one line: `Stopped`. No extra explanation, introduction, or emojis allowed.
If the `say` process does not exist (meaning playback has already finished), return `Stopped` as well (do not differentiate).
That's it. Two files, under 50 lines. No touching CLAUDE.md or zshrc.
Kyoko was the only one that worked
The first thing I tried once it was working was say -v Kyoko. Kyoko is a Japanese voice that has been in macOS for a long time, and frankly, it sounds extremely robotic. That distinct, emotionless intonation, like "konnichiwa, watashino, namaewa, kyoko, desu."
I thought, "There must be better voices," and looked it up, only to find that eight new Neural-based Japanese voices have been added since macOS Sonoma. Checking with say -v '?' shows the following:
Eddy (日本語(日本)) ja_JP
Flo (日本語(日本)) ja_JP
Grandma (日本語(日本)) ja_JP
Grandpa (日本語(日本)) ja_JP
Kyoko ja_JP
Reed (日本語(日本)) ja_JP
Rocko (日本語(日本)) ja_JP
Sandy (日本語(日本)) ja_JP
Shelley (日本語(日本)) ja_JP
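To get a list like the one above without scrolling through every installed voice, the output of say -v '?' can be filtered by locale. This is a small sketch: the function name `list_ja_voices` is hypothetical, and taking the first word as the voice name is an assumption that holds for the Japanese voices listed here but not for voice names that contain spaces.

```shell
# Keep only ja_JP lines from `say -v '?'`-style output on stdin
# and print the first word of each line (the voice name).
list_ja_voices() {
  awk '/ja_JP/ { print $1 }'
}

# On macOS: say -v '?' | list_ja_voices
# Sample input so the filter can be tried on any system:
printf '%s\n' \
  'Eddy (日本語(日本)) ja_JP' \
  'Kyoko ja_JP' \
  'Samantha en_US' | list_ja_voices
```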
Since Eddy, Flo, and Reed had good reputations in the English-speaking world, I wrote an audition script with high expectations.
for voice in Eddy Flo Reed Sandy Shelley Rocko Grandma Grandpa Kyoko; do
  echo "=== $voice ==="
  say -v "$voice" -r 230 "${voice}です。クロードコードの応答を音声で読み上げます。機械音声に聞こえませんか。"
done
The result.
Everything except Kyoko was unusable.
The problem was that they mumbled and I couldn't understand what they were saying. That's a bigger issue than whether they sounded robotic or not. Kyoko might be very robotic and occasionally misread things, but because her voice is clear, I can understand the meaning properly.
If you want anything better, you have no choice but to go with ElevenLabs or VOICEVOX. If you want to stay free, Kyoko is the one left standing.
360 wpm turned out to be surprisingly comfortable
You can specify the speed (in wpm) for say using the -r option. The default is around 200. I initially set it to 230.
My impression after using it was that it was slow. It takes about 10 seconds to read 150 characters at 230 wpm. Those 10 seconds of listening without being able to do anything else felt surprisingly stressful.
So, I tried adjusting the speed.
for r in 250 280 300 330 360; do
  echo "=== rate=$r ==="
  say -v Kyoko -r "$r" "スピードテスト。これは速度${r}での読み上げサンプルです。聞き取りやすさを確認してください。"
done
I expected 280 or 300 to be the sweet spot. When I actually compared them, 360 felt just right. It was unexpected.
I think a big factor is that I'm used to listening to YouTube at 2x speed, so my ears are accustomed to it. I feel no discomfort even at 360 wpm. In fact, anything below 300 feels "slow." Since this is entirely a matter of personal preference, I encourage you to listen for yourself and find what you like. I also tried 400 and above, but at that point it becomes difficult to understand. The practical upper limit for Kyoko seems to be around 350-380.
I set the default in speak.md to ${CLAUDE_SPEAK_RATE:-360}.
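As a back-of-the-envelope check on what the higher rate buys, the sketch below scales the one measurement given above (about 10 seconds at 230). It assumes reading time is inversely proportional to the -r value, which is an assumption rather than something measured here, and `estimate_seconds` is a hypothetical helper name.

```shell
estimate_seconds() {
  # usage: estimate_seconds <measured_seconds> <measured_rate> <new_rate>
  awk -v s="$1" -v r0="$2" -v r1="$3" 'BEGIN { printf "%.1f\n", s * r0 / r1 }'
}

estimate_seconds 10 230 360   # prints 6.4
```

Under that assumption, the same 150 characters would take a little over 6 seconds at 360 instead of about 10.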
How to use it in practice
It is almost anticlimactically simple.
Ask a question that is likely to return a long response. Once the response arrives, type /speak. Kyoko will read the key points (what was done / what needs a decision) at 360 wpm. If it gets annoying, type /speak-stop.
That's it.
If you want to change the voice or speed, you can override them via environment variables. It's easiest to write them in your ~/.zshrc.
export CLAUDE_SPEAK_VOICE=Kyoko # If you want to change to another voice
export CLAUDE_SPEAK_RATE=340 # Lower it if 360 is too fast
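The overrides work through the `${VAR:-default}` expansion used in speak.md: the default applies only when the variable is unset or empty. The sketch below prints the resolved values instead of calling say, and `show_say_settings` is a hypothetical helper name.

```shell
# Print the voice and rate that the expansions in speak.md would resolve to.
show_say_settings() {
  echo "voice=${CLAUDE_SPEAK_VOICE:-Kyoko} rate=${CLAUDE_SPEAK_RATE:-360}"
}

unset CLAUDE_SPEAK_VOICE CLAUDE_SPEAK_RATE
show_say_settings        # voice=Kyoko rate=360

CLAUDE_SPEAK_RATE=340
show_say_settings        # voice=Kyoko rate=340
```

Exports added to ~/.zshrc only reach processes started afterwards, so a Claude Code session that is already running will most likely need a restart to see new values.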
Even if you are running multiple Claude Code instances in parallel, only the session where you typed /speak will speak, leaving the other sessions silent. This is the biggest difference from the automatic hook-based speech method, and for me, it was the deciding factor.
Other routes are also perfectly valid options
I've written about my choices so far, but this is by no means the only solution. Different use cases call for different combinations. I've summarized some representative cases.
If you want to detect task completion and have the entire text read automatically: If you don't mind parallel instances, or rather, if you want to monitor progress through sound, this is the straightforward path. The main approach is to use a Stop hook for automatic speech, and yatabis-san's Zenn article is the most compact summary in Japanese.
If you don't want to compromise on Japanese sound quality: Using cloud TTS is the practical solution. ElevenLabs offers a free tier of 10k characters/month, and the Japanese in their eleven_multilingual_v2 model is reasonably natural. Google Cloud TTS's Neural2 is exceptional with a free tier of 1 million characters/month, and the Japanese quality is high. However, the setup is heavier on the GCP side. There is also OpenAI TTS, but the impression was that their Japanese is a step below ElevenLabs or Google Neural2. When implementing, you can place a small shell script to call the API in ~/.claude/scripts/ and swap it for say in your /speak command. If you set it up as a wrapper and allow switching via an environment variable like CLAUDE_SPEAK_ENGINE=elevenlabs, it's convenient to switch back depending on your mood.
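A wrapper along those lines could be sketched as below. Everything here is illustrative: `speak_text` is a hypothetical function name, and `speak-elevenlabs.sh` is a hypothetical script you would write yourself to call the API; neither exists in any of the tools mentioned.

```shell
# Hypothetical wrapper: pick a TTS backend from CLAUDE_SPEAK_ENGINE (default: say).
speak_text() {
  text=$1
  case "${CLAUDE_SPEAK_ENGINE:-say}" in
    say)
      say -v "${CLAUDE_SPEAK_VOICE:-Kyoko}" -r "${CLAUDE_SPEAK_RATE:-360}" "$text"
      ;;
    elevenlabs)
      # Hypothetical user-written script that sends the text to the cloud API.
      "$HOME/.claude/scripts/speak-elevenlabs.sh" "$text"
      ;;
    *)
      echo "unknown engine: ${CLAUDE_SPEAK_ENGINE}" >&2
      return 1
      ;;
  esac
}

# Usage (on macOS): speak_text "テストです。"
```

With this shape, /speak would call the wrapper instead of say directly, and switching back is a matter of unsetting CLAUDE_SPEAK_ENGINE.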
For a fully local, free, and decent quality option: VOICEVOX is a candidate. Since these are character voices like Zundamon, preferences may vary for business use, but the Japanese quality is among the best for local TTS. The downside is that you need to keep the engine running, which consumes 1-2GB of memory. Kokoro TTS is also lightweight and runs fast on Apple Silicon (refer to DevelopersIO's Kokoro article). There is also a configuration using Style-Bert-VITS2, which is mentioned in yatabis-san's article.
If you want to have fun assigning different voices to multiple agents: The English-language OSS projects AgentVibes and Vox are built with this use case in mind. If you like the idea of "giving each agent a vocal personality" (BMAD framework-esque), they are worth checking out.
My approach in this article is for people who don't fit any of these categories: speech at a time of your choosing, key points only, zero dependencies. Read it with this map in mind: if you want automatic speech via hooks, take the first route; if you want quality, the second; if you want a fully local setup, the third; and if you want character voices, the fourth.
Conclusion
Which answer is "correct" simply depends on how you use the tool, and each option has an environment it suits. In my case, I happened to be running Claude Code instances in parallel and automatic speech got in the way, so I ended up at the opposite extreme with this minimal configuration.
Two files of code, five minutes of setup, zero dependencies, zero runtime costs. If you're interested, just place two files in ~/.claude/commands/. Trying it out is instantaneous, and so is throwing it away if you don't like it.
For those who want to have the full text read aloud by hooks, I recommend starting with yatabis-san's article.
References
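- Claude Codeのタスク完了時にその内容を音声で説明してもらう ["Having Claude Code explain what it did by voice when a task completes"] (Zenn/yatabis)
- [小ネタ] Claude Codeで入力待ち・処理完了を音声で知らせるHooks ["Quick tip: Hooks that announce by voice when Claude Code is waiting for input or has finished processing"] (DevelopersIO)
- Claude Codeの応答をリアルタイム音声化 - Aivis Cloud API ["Real-time voice output of Claude Code responses - Aivis Cloud API"] (Zenn/ytka)
- Claude Code / Claude Desktopで音声対話環境を構築する【Mac】 ["Building a voice conversation environment with Claude Code / Claude Desktop (Mac)"] (Zenn/fivot)
- I tried local voice reading of Claude Code's response with Kokoro TTS (DevelopersIO)
- paulpreibisch/AgentVibes (GitHub)
- Vox — Give Claude Code a Voice (rtk-ai)
- 離席駆動開発: Macに喋らせAlexaで指示する ["Away-from-desk-driven development: making the Mac talk and giving instructions via Alexa"] (Qiita)
- Claude Code Hooks 公式ドキュメント ["Claude Code Hooks official documentation"]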