Implementing Style-Bert-VITS2 as an API


Introduction

Style-Bert-VITS2 (SBV2) is a very high-performance speech synthesis AI model.
https://github.com/litagin02/Style-Bert-VITS2

Also, thankfully, it is released under licenses that allow commercial use: AGPL-3.0 and LGPL-3.0.
The following article is very helpful regarding these licenses:
https://qiita.com/tatsumi_t2/items/3da688a5123d37986331

And the following is stated in that article:

If you incorporate software made with AGPL into your own software, the license propagates to that software as well.

In the case of usage through communication over a network such as an API call, it is not considered incorporation and the license does not seem to propagate.

In other words, if you incorporate SBV2 into your software, you need to release the source code of the entire software. However, if you use SBV2 through network communication such as an API, that does not seem to be the case. Therefore, this time I will describe how to call SBV2 from another server via API. Also, since that code itself needs to be released due to the license, I will publish it below.
https://github.com/personabb/sbv2_api

(Updated August 29, 2024) Regarding the AGPL v3 license

Preparation

Development Environment

The author's development environment is an M2 Mac with 16GB RAM.

Environment Setup

We will use Python 3.11 (3.12 has been confirmed not to work). Please set up an environment where Python 3.11 is available.
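Since only 3.11 is confirmed to work, a small guard at the top of your scripts can catch a wrong interpreter early. This helper is not part of the repository; it is just a convenience sketch:

```python
import sys

def is_supported_python(version_info=sys.version_info):
    """Return True only for Python 3.11, the version confirmed to work with SBV2."""
    return tuple(version_info[:2]) == (3, 11)

# Warn early rather than failing deep inside the library.
if not is_supported_python():
    print("Warning: Python 3.11 is required; 3.12 is known not to work.")
```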

Cloning the Repository

git clone https://github.com/personabb/sbv2_api.git

Obtaining Necessary Voice Models

Obtain the "amitaro" model and "jvnv-F1-jp" model and store them as follows.


sbv2_api/
    ├ model_assets/
    |      ├ amitaro/
    |      |       ├ amitaro.safetensors
    |      |       ├ config.json
    |      |       └ style_vectors.npy
    |      └ jvnv-F1-jp/
    |              ├ jvnv-F1-jp_e160_s14000.safetensors
    |              ├ config.json
    |              └ style_vectors.npy
    ├ dict_data/
    |     └ default.csv
    ├ sbv2_api.py
    └ client.py

The "amitaro" and "jvnv-F1-jp" models are the default models for SBV2.

https://zenn.dev/asap/articles/f8c0621cdd74cc#環境構築
Once you finish the "Environment Setup" chapter of the article above, the folders for these models should be in the "model_assets" folder of the "Style-Bert-VITS2" repository, so copy them from there.

These models are used because the code below calls them, but a model you trained yourself works just as well. In that case, rewrite the corresponding part of the code below.
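Before starting the server, you can check that a model folder actually contains the three assets the server code looks up with glob (one `.safetensors`, one `.json`, one `.npy`). This validator is not in the repository; it simply mirrors the folder layout above:

```python
from pathlib import Path

def validate_model_dir(model_path: str) -> list[str]:
    """Return the glob patterns that have no match in the model folder.

    Mirrors the server's lookups: a *.safetensors weight file, a config
    *.json, and a style_vectors *.npy must all be present.
    """
    missing = []
    for pattern in ("*.safetensors", "*.json", "*.npy"):
        if not list(Path(model_path).glob(pattern)):
            missing.append(pattern)
    return missing
```

For example, `validate_model_dir("model_assets/amitaro")` returns an empty list if the folder is complete.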

Installing Necessary Packages

pip install numpy==1.26.4
pip install style-bert-vits2
pip install sounddevice
pip install "fastapi[all]"

Code Implementation

Server Side

sbv2_api.py
import os
import numpy as np
from pathlib import Path
from style_bert_vits2.nlp import bert_models
from style_bert_vits2.constants import Languages
from style_bert_vits2.tts_model import TTSModel
from style_bert_vits2.logging import logger
from style_bert_vits2.nlp.japanese.user_dict import update_dict
import torch
import sounddevice as sd  # needed by the text2speech helper below
from pydantic import BaseModel

from fastapi import FastAPI, Depends, Header
from typing import List, Dict, Any
from fastapi.security.api_key import APIKeyHeader
import uvicorn
import json
import time
import glob

device = "cuda" if torch.cuda.is_available() else "cpu"
update_user_dict = False
default_dict_path = "dict_data/default.csv"
compiled_dict_path = "dict_data/user.dic"
bert_models_model = "ku-nlp/deberta-v2-large-japanese-char-wwm"
bert_models_tokenizer = "ku-nlp/deberta-v2-large-japanese-char-wwm"


class SBV2:
    def __init__(self, model_path):
        logger.remove()

        if update_user_dict:
            print("loading user dict")
            update_dict(default_dict_path = Path(default_dict_path), compiled_dict_path = Path(compiled_dict_path))
        

        if device == "auto":
            self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
        else:
            self.DEVICE = device

        bert_models.load_model(Languages.JP, bert_models_model)
        bert_models.load_tokenizer(Languages.JP, bert_models_tokenizer)

        style_file = glob.glob(f'{model_path}/*.npy',recursive=True)[0]
        config_file = glob.glob(f'{model_path}/*.json',recursive=True)[0]
        model_file = glob.glob(f'{model_path}/*.safetensors',recursive=True)[0]

        print(style_file)
        print(config_file)
        print(model_file)

        
        self.model_TTS = TTSModel(
            model_path=model_file,
            config_path=config_file,
            style_vec_path=style_file,
            device=self.DEVICE
        )

    def call_TTS(self,message):
        sr, audio = self.model_TTS.infer(text=message)

        return sr, audio
    
    def text2speech(self,message):
        sr, audio = self.model_TTS.infer(text=message)
        sd.play(audio, sr)
        sd.wait()

app = FastAPI()

class SBV2_inputs(BaseModel):
    text: str

class SBV2_init(BaseModel):
    modelname: str

# Dictionary to manage instances for each user
user_instances: Dict[str, Dict] = {}

class Dependencies:
    def __init__(self,api_key, model):
        model_path = f"model_assets/{model}"
        self.sbv2 = SBV2(model_path = model_path)

    def get_sbv2(self):
        return self.sbv2


def get_user_dependencies(api_key: str,model = None):
    # Register a new one if the API key has not been registered in the past
    if api_key not in user_instances:
        if model is None:
            raise Exception("model is required for the first time initialization")
        user_instances[api_key] = Dependencies(api_key, model)
        
    # If registered, return it as is
    return user_instances[api_key]

API_KEY_NAME = "api_key"
api_key_header = APIKeyHeader(name=API_KEY_NAME, auto_error=True)
def get_api_key(api_key: str = Depends(api_key_header)):
    return api_key

print("server started")

@app.post("/initialize/")
async def initialize(
    inputs: SBV2_init,
    api_key: str =  Depends(get_api_key)
    ):
    dependencies = get_user_dependencies(api_key, inputs.modelname)
    # The first execution takes a long time, possibly due to torch.nn.utils.weight_norm's FutureWarning, so complete the first execution at the time of initialization.
    _, _ = dependencies.get_sbv2().call_TTS("Initialization")
    return {"message": "Initialized"}

@app.post("/process/")
async def process_data(
    inputs: SBV2_inputs,
    api_key: str = Depends(get_api_key),    
):
    dependencies = get_user_dependencies(api_key)
    start_tts = time.time()
    sr, audio = dependencies.get_sbv2().call_TTS(inputs.text)
    print(f"Time taken for TTS: {time.time() - start_tts}")
    return {"audio": audio.tolist(), "sr": sr}


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8001)

Explanation

Updating the User Dictionary

if update_user_dict:
    print("loading user dict")
    update_dict(default_dict_path = Path(default_dict_path), compiled_dict_path = Path(compiled_dict_path))

In the part above, the user dictionary is updated.
If you want to register a user dictionary, set update_user_dict to True and place the dictionary at the path specified in default_dict_path = "dict_data/default.csv".
(Initially, update_user_dict=False is set, so no user dictionary is registered.)

For information on user dictionaries, please see the following:
https://zenn.dev/asap/articles/f8c0621cdd74cc#辞書登録

Regarding api_key

# Dictionary to manage instances for each user
user_instances: Dict[str, Dict] = {}

class Dependencies:
    def __init__(self,api_key, model):
        model_path = f"model_assets/{model}"
        self.sbv2 = SBV2(model_path = model_path)

    def get_sbv2(self):   
        return self.sbv2

def get_user_dependencies(api_key: str,model = None):
    # Register a new one if the API key has not been registered in the past
    if api_key not in user_instances:
        if model is None:
            raise Exception("model is required for the first time initialization")
        user_instances[api_key] = Dependencies(api_key, model)
        
    # If registered, return it as is
    return user_instances[api_key]

In this part, a new SBV2 instance is created for each new api_key and saved in the user_instances dictionary; for an api_key that has already been registered, the saved instance is reused.

By implementing it this way, you can switch and run multiple voice models by changing the api_key.
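Stripped of the TTS details, this is a simple keyed instance cache: the first call with a given key builds an instance, and later calls reuse it. A minimal sketch of the pattern (the names are illustrative, not from the repository):

```python
class InstanceCache:
    """Create one object per key on first use, then reuse it on later calls."""

    def __init__(self, factory):
        self._factory = factory   # callable that builds an instance from a model name
        self._instances = {}      # api_key -> instance

    def get(self, api_key, model=None):
        if api_key not in self._instances:
            if model is None:
                raise ValueError("model is required on first use of this api_key")
            self._instances[api_key] = self._factory(model)
        return self._instances[api_key]
```

The server code does the same thing with `user_instances` and `Dependencies`; the factory there loads the voice model, which is why the first call per key is slow and every later call is fast.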

Separating Initialization and Execution

@app.post("/initialize/")
async def initialize(
    inputs: SBV2_init,
    api_key: str =  Depends(get_api_key)
    ):
    dependencies = get_user_dependencies(api_key, inputs.modelname)
    # The first execution takes a long time, possibly due to torch.nn.utils.weight_norm's FutureWarning, so complete the first execution at the time of initialization.
    _, _ = dependencies.get_sbv2().call_TTS("Initialization")
    return {"message": "Initialized"}

@app.post("/process/")
async def process_data(
    inputs: SBV2_inputs,
    api_key: str = Depends(get_api_key),    
):
    dependencies = get_user_dependencies(api_key)
    start_tts = time.time()
    sr, audio = dependencies.get_sbv2().call_TTS(inputs.text)
    print(f"Time taken for TTS: {time.time() - start_tts}")
    return {"audio": audio.tolist(), "sr": sr}

As shown here, initialization (/initialize/) and execution (/process/) are separated.
At the time of initialization, the api_key and the name of the voice model are obtained to create the SBV2 instance and save it to the user_instances dictionary.

Also, SBV2 has an issue where the processing speed becomes slightly slower during the first execution after instance creation, possibly due to the Warning mentioned below. Therefore, the first execution is also performed during initialization.
(This might just be in my environment, so please comment it out if it's unnecessary.)

Relevant part

# The first execution takes a long time, possibly due to torch.nn.utils.weight_norm's FutureWarning, so complete the first execution at the time of initialization.
_, _ = dependencies.get_sbv2().call_TTS("Initialization")

Warning

FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.

During execution (/process/), the server receives the api_key and the text to be spoken, retrieves the instance created for that api_key from the dictionary, and synthesizes speech for the text.

Client Side

client.py
import requests
import numpy as np
import sounddevice as sd

def init_abv2_api(api_key = "sbv2_amitaro", model_name = "amitaro"):
    init_url = "http://127.0.0.1:8001/initialize/"

    # Initialize the instance on the server side
    headers = {"api_key": api_key}

    init_inputs = {
        "modelname": model_name,
    }

    init_response = requests.post(init_url, json=init_inputs, headers=headers)
    if init_response.status_code == 200:
        print("Initialization successful.")
    else:
        print("Initialization failed.")
        exit(1)

def call_TTS_API(text,api_key = "sbv2_amitaro"):
    url = "http://127.0.0.1:8001/process/"
    headers = {"api_key": api_key}

    inputs = {
        "text": text,
    }

    response = requests.post(url, json=inputs, headers=headers)
    # Parse the response as JSON data
    data = response.json() 

    audio = data['audio']
    audio = np.array(audio, dtype=np.float32)
    audio = audio / 32768.0
    sr = data['sr']

    return audio, sr

if __name__ == "__main__":
    init_abv2_api(api_key = "sbv2_amitaro", model_name = "amitaro")
    init_abv2_api(api_key = "sbv2_jvnv-F1-jp", model_name = "jvnv-F1-jp")

    audio, sr = call_TTS_API("こんにちは。",api_key = "sbv2_amitaro")
    sd.play(audio, sr)
    sd.wait()

    audio, sr = call_TTS_API("こんにちは。",api_key = "sbv2_jvnv-F1-jp")
    sd.play(audio, sr)
    sd.wait()

Explanation

Initialization Function

As mentioned above, once the client-side preparations are complete, an initialization request is sent to the server to have it generate an instance.
At this point, the api_key and model_name are sent to the server.
Since this api_key and model_name are uniquely linked, it is necessary to specify the api_key for the voice model you wish to call in the execution function described later.

It is called as follows during execution:

init_abv2_api(api_key = "sbv2_amitaro", model_name = "amitaro")
init_abv2_api(api_key = "sbv2_jvnv-F1-jp", model_name = "jvnv-F1-jp")

In this case, as shown above, the two models provided by default in SBV2 are called.
The model by Amitaro is particularly recommended due to its high sound quality.
It was trained using voice materials provided by Amitaro on the following site:
https://amitaro.net/voice/livevoice/

For model_name, specify the name of a folder directly under model_assets. The code then searches that folder for the safetensors weights, config, and style vectors.

If you want to use SBV2 pre-trained weights that you have prepared yourself, specify the folder name where the model files you want to use are stored.

Execution Function

def call_TTS_API(text,api_key = "sbv2_amitaro"):
    url = "http://127.0.0.1:8001/process/"
    headers = {"api_key": api_key}

    inputs = {
        "text": text,
    }

    response = requests.post(url, json=inputs, headers=headers)
    # Parse the response as JSON data
    data = response.json() 

    audio = data['audio']
    audio = np.array(audio, dtype=np.float32)
    audio = audio / 32768.0
    sr = data['sr']

    return audio, sr

This is a function that calls the call_TTS method of the SBV2 class on the server side via API. By specifying the api_key linked to the voice model you wish to call and the text you want to speak as arguments, you can obtain the audio waveform audio and the sampling rate sr after synthesis.

The function is called as follows, including the playback of the obtained audio waveform:

audio, sr = call_TTS_API("こんにちは。",api_key = "sbv2_amitaro")
sd.play(audio, sr)
sd.wait()
audio, sr = call_TTS_API("こんにちは。",api_key = "sbv2_jvnv-F1-jp")
sd.play(audio, sr)
sd.wait()

The top snippet speaks "Konnichiwa" (Hello) in Amitaro's voice, and the bottom one speaks it in the "jvnv-F1-jp" voice. Furthermore, if you perform initialization in advance, you can synthesize speech with various voices simply by changing the api_key when calling the call_TTS_API function.
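If you want to keep the synthesized audio instead of (or in addition to) playing it, the float32 waveform returned by call_TTS_API can be written to a WAV file with the standard library. This helper is an extra convenience, not part of the repository, and assumes the audio is normalized to [-1, 1] as in the client code above:

```python
import wave
import numpy as np

def save_wav(path, audio, sr):
    """Write a mono float32 waveform (range [-1, 1]) as 16-bit PCM WAV."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sr)
        wf.writeframes(pcm.tobytes())
```

For example, `save_wav("hello.wav", audio, sr)` right after the call_TTS_API call.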

Execution

Open two terminals where the environment has been set up, and run the following commands in each terminal:

python sbv2_api.py
python client.py

In the terminal where you ran sbv2_api.py, wait for the following to be displayed before executing the client.py command:

server started
INFO:     Started server process [11246]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

When you execute the client.py command, "Konnichiwa" should be played back in the voices of the two speakers.
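Instead of watching the server log by hand, the client can poll until the port accepts connections. The host and port below match the code above; the helper itself is not in the repository, just a sketch:

```python
import socket
import time

def wait_for_server(host="127.0.0.1", port=8001, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to the server succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_server()` at the top of client.py before init_abv2_api would let you start both scripts in any order.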

Summary

In this article, I described how to call SBV2 from another server via API. By implementing it this way, you can call SBV2 through API calls while specifying the voice from the client side.

SBV2 itself is a very high-performance speech synthesis model. I have also written an article about detailed usage as shown below, so please give it a try.
https://zenn.dev/asap/articles/f8c0621cdd74cc

Thank you for reading this far!
