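A Lightweight Speech Synthesis API Server in Go and Gin (Using AquesTalk)

Published on 2025/02/03
Hello.

This article walks through a speech synthesis API server for Windows, implemented with Go and the Gin framework.

The project uses the AquesTalk DLLs (old-license version) and exposes speech synthesis through a simple HTTP API.

The old license permits commercial use, which is a reason to keep using that version.

Be sure to check both the license of the repository itself and the AquesTalk license.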
https://github.com/Lqm1/aquestalk-server

Project overview and directory structure

This project is a lightweight speech synthesis API server designed exclusively for Windows.

The directory structure is as follows.
└── ./
    ├── .github
    │   └── workflows
    │       └── build.yml
    ├── cmd
    │   └── aquestalk-server
    │       └── main.go
    ├── pkg
    │   └── aquestalk
    │       ├── bin
    │       │   ├── dvd
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── f1
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── f2
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── imd1
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── jgr
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── m1
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── m2
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   ├── r1
    │       │   │   ├── AquesTalk.dll
    │       │   │   └── AquesTalkDa.dll
    │       │   └── .gitignore
    │       └── aquestalk.go
    ├── scripts
    │   ├── build.bat
    │   └── build.sh
    ├── .gitignore
    ├── AqLicense.txt
    ├── go.mod
    ├── go.sum
    ├── LICENSE
    ├── Makefile
    └── README.md
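cmd/aquestalk-server/main.go

The entry point. It uses the Gin framework to accept HTTP requests, validate the input, and run speech synthesis.
pkg/aquestalk/aquestalk.go

A wrapper that embeds the AquesTalk DLLs in the project, loads them dynamically, and performs speech synthesis.
.github/workflows/build.yml and Makefile

Configuration files that improve development efficiency, for example automated builds in a CI/CD environment.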

API server implementation details

First, let's look at the main parts of main.go, the entry point of the API server.
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/c7e715d1b04b17683718fb1e8944cc28/aquestalk-server/pkg/aquestalk"
	"github.com/gin-gonic/gin"
)

// allowedVoices lists the voices the API accepts.
var allowedVoices = map[string]bool{
	"dvd":  true,
	"f1":   true,
	"f2":   true,
	"imd1": true,
	"jgr":  true,
	"m1":   true,
	"m2":   true,
	"r1":   true,
}

type SpeechRequest struct {
	Model          string  `json:"model" binding:"required"`
	Input          string  `json:"input" binding:"required"`
	Voice          string  `json:"voice" binding:"required"`
	ResponseFormat string  `json:"response_format,omitempty"`
	Speed          float64 `json:"speed,omitempty"`
}

func main() {
	r := gin.Default()

	r.POST("/v1/audio/speech", func(c *gin.Context) {
		var req SpeechRequest
		if err := c.ShouldBindJSON(&req); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}

		// Only the "tts-1" model is supported
		if req.Model != "tts-1" {
			c.JSON(http.StatusBadRequest, gin.H{
				"error": "only 'tts-1' model is supported",
			})
			return
		}

		// Only "wav" is supported as response_format
		if req.ResponseFormat != "" && req.ResponseFormat != "wav" {
			c.JSON(http.StatusBadRequest, gin.H{
				"error": "'response_format' must be 'wav'",
			})
			return
		}

		// Check the voice
		if !allowedVoices[req.Voice] {
			c.JSON(http.StatusBadRequest, gin.H{
				"error": "invalid voice specified",
			})
			return
		}

		// Check the input length (1 to 4096 characters, counted as runes rather than bytes)
		if n := len([]rune(req.Input)); n == 0 || n > 4096 {
			c.JSON(http.StatusBadRequest, gin.H{
				"error": "input must be between 1 and 4096 characters",
			})
			return
		}

		// Check the speed (range 0.5 to 3.0)
		if req.Speed != 0 && (req.Speed < 0.5 || req.Speed > 3.0) {
			c.JSON(http.StatusBadRequest, gin.H{
				"error": "speed must be between 0.5 and 3.0",
			})
			return
		}

		// The default speed is 1.0
		speed := 1.0
		if req.Speed != 0 {
			speed = req.Speed
		}

		// Initialize AquesTalk (load the DLL for the requested voice)
		aq, err := aquestalk.New(req.Voice)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("aquestalk init failed: %v", err),
			})
			return
		}
		defer aq.Close()

		// The DLL expects the speed as an integer scaled by 100 (e.g. 1.0 -> 100)
		speedParam := int(speed * 100)

		// Synthesize the speech
		wav, err := aq.Synthe(req.Input, speedParam)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("synthesis failed: %v", err),
			})
			return
		}

		// Return the synthesized WAV data
		c.Data(http.StatusOK, "audio/wav", wav)
	})

	// Use 8080 when the PORT environment variable is not set
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
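	// Report a failure to start (for example, when the port is already in use)
	if err := r.Run(":" + port); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}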
}

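Key points

Validation and error handling

The JSON request is bound to the SpeechRequest struct, and model, response_format, voice, input, and speed are each checked strictly.
DLL-based speech synthesis

aquestalk.New() extracts the DLL to a temporary directory and loads it; speech is then synthesized by calling AquesTalk_Synthe inside the DLL.

Note: the AquesTalk DLLs used here are the old-license version. Please check the details yourself.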
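
The validation rules described above can be pulled out of the handler into a pure function, which makes them easy to unit test without Gin. The following is a minimal sketch, not the repository's code: the names speechParams, voices, and validate are hypothetical, counting the input in runes is an assumption about what "characters" means, and rounding with math.Round (instead of truncating) is a choice made here so that a speed such as 0.57 maps to 57.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"unicode/utf8"
)

// speechParams mirrors the SpeechRequest fields that are validated (hypothetical type).
type speechParams struct {
	Model          string
	Input          string
	Voice          string
	ResponseFormat string
	Speed          float64
}

// voices is the same set of voice names as allowedVoices in main.go.
var voices = map[string]bool{
	"dvd": true, "f1": true, "f2": true, "imd1": true,
	"jgr": true, "m1": true, "m2": true, "r1": true,
}

// validate applies the checks in the same order as the handler and
// returns the integer speed parameter for the DLL (speed x 100).
func validate(p speechParams) (int, error) {
	if p.Model != "tts-1" {
		return 0, errors.New("only 'tts-1' model is supported")
	}
	if p.ResponseFormat != "" && p.ResponseFormat != "wav" {
		return 0, errors.New("'response_format' must be 'wav'")
	}
	if !voices[p.Voice] {
		return 0, errors.New("invalid voice specified")
	}
	// Assumption: the limit is meant in characters, so count runes, not bytes.
	if n := utf8.RuneCountInString(p.Input); n == 0 || n > 4096 {
		return 0, errors.New("input must be between 1 and 4096 characters")
	}
	speed := p.Speed
	if speed == 0 {
		speed = 1.0 // an omitted speed means the default
	}
	if speed < 0.5 || speed > 3.0 {
		return 0, errors.New("speed must be between 0.5 and 3.0")
	}
	// Round so that floating-point error does not shave one unit off.
	return int(math.Round(speed * 100)), nil
}

func main() {
	param, err := validate(speechParams{Model: "tts-1", Input: "こんにちわ", Voice: "f1"})
	fmt.Println(param, err)
}
```

A handler could call such a function once and translate any returned error into a 400 response, keeping the HTTP layer thin.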

DLL wrapper implementation

Next, let's look at the wrapper that drives the AquesTalk DLL.
package aquestalk

import (
	"embed"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"unsafe"

	"golang.org/x/text/encoding/japanese"
)

// The go:embed directive embeds the DLL files into the binary.
// The DLL for each voice lives at bin/<voice>/AquesTalk.dll.
//
//go:embed bin/*/AquesTalk.dll
var dllFS embed.FS

type AquesTalk struct {
	dll          *syscall.DLL
	syntheProc   *syscall.Proc
	freeWaveProc *syscall.Proc
	tempDir      string // temporary directory holding the extracted DLL
}

// New extracts the DLL for the given voice to a temporary directory and loads it.
func New(voice string) (*AquesTalk, error) {
	dllPathInEmbed := fmt.Sprintf("bin/%s/AquesTalk.dll", voice)
	dllData, err := dllFS.ReadFile(dllPathInEmbed)
	if err != nil {
		return nil, fmt.Errorf("DLL not found for voice %s: %w", voice, err)
	}

	// Create a temporary directory
	tempDir, err := os.MkdirTemp("", "aquestalk-*")
	if err != nil {
		return nil, fmt.Errorf("failed to create temp dir: %w", err)
	}

	// Write the DLL out as a temporary file
	tempDLLPath := filepath.Join(tempDir, "AquesTalk.dll")
	if err := os.WriteFile(tempDLLPath, dllData, 0644); err != nil {
		os.RemoveAll(tempDir)
		return nil, fmt.Errorf("failed to write DLL: %w", err)
	}

	// Load the DLL
	dll, err := syscall.LoadDLL(tempDLLPath)
	if err != nil {
		os.RemoveAll(tempDir)
		return nil, fmt.Errorf("DLL load error: %w", err)
	}

	// Look up the function pointers in the DLL
	syntheProc, err := dll.FindProc("AquesTalk_Synthe")
	if err != nil {
		dll.Release()
		os.RemoveAll(tempDir)
		return nil, fmt.Errorf("AquesTalk_Synthe not found: %w", err)
	}

	freeWaveProc, err := dll.FindProc("AquesTalk_FreeWave")
	if err != nil {
		dll.Release()
		os.RemoveAll(tempDir)
		return nil, fmt.Errorf("AquesTalk_FreeWave not found: %w", err)
	}

	return &AquesTalk{
		dll:          dll,
		syntheProc:   syntheProc,
		freeWaveProc: freeWaveProc,
		tempDir:      tempDir,
	}, nil
}

// Close releases the DLL and removes the temporary directory.
func (a *AquesTalk) Close() error {
	if a.dll != nil {
		a.dll.Release()
		a.dll = nil
	}
	if a.tempDir != "" {
		os.RemoveAll(a.tempDir)
		a.tempDir = ""
	}
	return nil
}

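// Synthe takes a phonetic symbol string (koe) and a speed, and calls the DLL's AquesTalk_Synthe function to synthesize speech.
// The koe received here is not plain text but AquesTalk's own phonetic notation for reading aloud.
// The phonetic string must be encoded as Shift-JIS.
// Note: converting kanji and similar text into the phonetic notation requires a separate dynamic library, AqKanji2Koe.
// Software such as 棒読みちゃん (BouyomiChan) converts text in bulk using the IME or similar means.
// This repository does not implement the conversion from kanji to the phonetic notation, so implement your own conversion before passing text to the API.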
func (a *AquesTalk) Synthe(koe string, speed int) ([]byte, error) {
	enc := japanese.ShiftJIS.NewEncoder()
	koe, err := enc.String(koe)
	if err != nil {
		return nil, fmt.Errorf("failed to convert koe to sjis: %w", err)
	}

	ckoe, err := syscall.BytePtrFromString(koe)
	if err != nil {
		return nil, fmt.Errorf("invalid parameter: %w", err)
	}

	var size int
	// Call the DLL function
	ret, _, _ := a.syntheProc.Call(
		uintptr(unsafe.Pointer(ckoe)),
		uintptr(speed),
		uintptr(unsafe.Pointer(&size)),
	)

	if ret == 0 {
		return nil, fmt.Errorf("synthesis failed (code: %d)", size)
	}

	// View the returned buffer with unsafe.Slice and copy the WAV data into Go memory
	wavData := unsafe.Slice((*byte)(unsafe.Pointer(ret)), size)
	data := make([]byte, len(wavData))
	copy(data, wavData)

	// Free the WAV buffer now that it has been copied
	a.freeWaveProc.Call(ret)

	return data, nil
}

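Key points

Embedding and dynamically loading the DLL

//go:embed embeds the DLL files into the built binary; at request time they are extracted to a temporary directory and loaded.
About the phonetic symbol string

The koe that Synthe receives is not ordinary text but AquesTalk's own phonetic notation.

This string must be encoded as Shift-JIS, and converting kanji and similar text requires a separate mechanism such as AqKanji2Koe.

This repository does not implement that conversion, so prepare the string yourself before passing it to the API.
Memory handling

unsafe.Slice creates a view over the WAV data returned by the DLL. The data is copied into a Go-managed slice, and the DLL's buffer is then released with AquesTalk_FreeWave.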

Build and CI setup

The project uses the following setup to produce the Windows executable aquestalk-server.exe.
Build scripts (scripts/build.bat / build.sh)

They set GOOS=windows and GOARCH=386 and run go build.

The old-license AquesTalk DLLs are 32-bit, so the target GOARCH must be 386.
Makefile

It detects the OS and calls the appropriate build script.
ifeq ($(OS),Windows_NT)
    BUILD_SCRIPT = scripts\build.bat
else
    BUILD_SCRIPT = scripts/build.sh
endif

.PHONY: build
build:
	@echo "Running build script: $(BUILD_SCRIPT)"
	$(BUILD_SCRIPT)

GitHub Actions

.github/workflows/build.yml runs an automated build on pushes to the main branch and on pull requests, and uploads the resulting executable as an artifact.
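
Because the 32-bit DLLs only load into a windows/386 process, a binary built for another target fails only when the first request tries to load a DLL. One way to surface that earlier is a startup check against runtime.GOOS and runtime.GOARCH. This is a hedged sketch and not part of the repository; checkTarget is a hypothetical helper.

```go
package main

import (
	"fmt"
	"runtime"
)

// checkTarget returns an error unless the given target is windows/386,
// the only combination that can load the 32-bit AquesTalk DLLs.
func checkTarget(goos, goarch string) error {
	if goos != "windows" || goarch != "386" {
		return fmt.Errorf("unsupported target %s/%s: the 32-bit AquesTalk DLLs need windows/386", goos, goarch)
	}
	return nil
}

func main() {
	// runtime.GOOS and runtime.GOARCH describe the target this binary was built for.
	if err := checkTarget(runtime.GOOS, runtime.GOARCH); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("target OK")
}
```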

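Starting the API easily with the released EXE

Even without a Go environment, you can download the Windows executable aquestalk-server.exe from the GitHub releases and start the API server right away.

Startup steps

Download the EXE from the GitHub releases

Get aquestalk-server.exe from the releases page.
Run the EXE

Double-click the downloaded aquestalk-server.exe or run it from the Command Prompt; the server starts on port 8080 by default.
Use the API

With the server running, call the API with curl as follows to synthesize speech.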
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "おはようございます。",
    "voice": "f1",
    "response_format": "wav",
    "speed": 1.0
  }' \
  --output speech.wav
This example matches the usage example in README.md; the API can also be called from the OpenAI client libraries.
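
The same request can be built from Go with only the standard library. The sketch below constructs the request but does not send it, so it runs without a server; speechRequest and newSpeechRequest are hypothetical names, and the base URL assumes the default port 8080.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// speechRequest carries the same JSON fields as the curl example.
type speechRequest struct {
	Model          string  `json:"model"`
	Input          string  `json:"input"`
	Voice          string  `json:"voice"`
	ResponseFormat string  `json:"response_format,omitempty"`
	Speed          float64 `json:"speed,omitempty"`
}

// newSpeechRequest builds a POST request for the /v1/audio/speech endpoint.
func newSpeechRequest(baseURL string, body speechRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newSpeechRequest("http://localhost:8080", speechRequest{
		Model:          "tts-1",
		Input:          "おはようございます。",
		Voice:          "f1",
		ResponseFormat: "wav",
		Speed:          1.0,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

To actually call the server, pass the request to http.DefaultClient.Do and write the response body to a file such as speech.wav when the status is 200.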

Performance measurement with JMeter

Thread Group settings
Number of Threads (users): 1000
Ramp-up period (seconds): 10
Loop Count: 1



Label	# Samples	Average (ms)	Min (ms)	Max (ms)	Std. Dev.	Error %	Throughput (req/s)	Received KB/sec	Sent KB/sec	Avg. Bytes
HTTP Request	1000	5	3	14	1.60	0.000%	99.86020	1232.45	25.06	12638.0
TOTAL	1000	5	3	14	1.60	0.000%	99.86020	1232.45	25.06	12638.0
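
As a sanity check on the results, the Received KB/sec figure appears to follow from the Throughput and Avg. Bytes figures as throughput × average bytes / 1024. The small sketch below reproduces that arithmetic with the reported numbers; the function name receivedKBPerSec is hypothetical, and the formula is inferred from the numbers rather than taken from the JMeter documentation.

```go
package main

import "fmt"

// receivedKBPerSec derives received kilobytes per second from the
// throughput (requests per second) and the average response size (bytes).
func receivedKBPerSec(throughput, avgBytes float64) float64 {
	return throughput * avgBytes / 1024
}

func main() {
	// Values taken from the results: Throughput 99.86020, Avg. Bytes 12638.0.
	fmt.Printf("%.2f\n", receivedKBPerSec(99.86020, 12638.0)) // prints 1232.45
}
```

The match also confirms that the throughput figure is per second, which is close to the 100 requests per second offered by starting 1000 threads over a 10-second ramp-up.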


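License notes

Repository license

The source code of this project is provided under the license stated in the repository (for example, GPL 3.0).

Before using or modifying it, be sure to check the repository's license.
AquesTalk license

The AquesTalk DLLs used in this project are the old-license version; be sure to check the conditions of use in the official AquesTalk license document (for example, AqLicense.txt).

Take appropriate steps yourself to avoid violating the license.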

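Summary

This article described a lightweight speech synthesis API server implemented with Go and Gin.

The main points are as follows.
Validation and error handling

Strict checks on request parameters, combined with DLL-based speech synthesis.
Dynamic DLL loading

//go:embed embeds the DLLs into the binary, and they are extracted to a temporary directory and loaded at run time.
Handling of the phonetic symbol string

Note that the koe passed to AquesTalk.Synthe is not ordinary text but AquesTalk's own phonetic notation, which is encoded as Shift-JIS before it reaches the DLL.

Conversion from kanji and similar text is not implemented, so implement it yourself as needed.
Easy API server startup

With the executable available from the GitHub releases, you can start the API server even without a Go environment.
License checks

Be sure to check the licenses of both the repository and AquesTalk, and use them according to their terms.
Please try reading the code and verifying the behavior yourself. Questions and feedback are welcome.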

Reference links

AquesTalk official blog
AquesTalk FAQ
Gin Web Framework
Go embed package
golang.org/x/text/encoding/japanese
GitHub Actions documentation