
Building an On-Premise Transcription Server with Streamlit, Flask, and Whisper (Synchronous Processing)


✨ Introduction

You want to try transcribing meetings, but...

  • You're worried about information leakage from sending recording data externally...
  • The transcription accuracy of Teams or Zoom is mediocre...

In such cases, if you have even a modest GPU (e.g., a GTX 1650 with about 4 GB of VRAM), you can achieve high-accuracy transcription entirely within your company's local environment!

In this article, I built a simple transcription app using Streamlit + Flask + Whisper.

📂 Configuration Overview

This system consists of the following three Python files:

  • app_flask.py: Provides the UI for users to upload audio files
  • server_flask.py: Receives files and calls the processing script
  • transcribe_flask.py: Transcribes audio files using Whisper

app_flask.py
import streamlit as st
import requests
import time

# Function to convert seconds to "X min Y sec" format
def convert_seconds(seconds):
    minutes = seconds // 60  # Calculate minutes (integer division)
    remaining_seconds = seconds % 60  # Calculate remaining seconds
    return f"{int(minutes)}分{int(remaining_seconds)}秒"

# Set the application title and description
st.title("音声文字起こし")
st.write('**Whisperを利用して音声データを文字起こしすることが出来ます。**')

# Server URL setting
server_url = "http://localhost:5000/"

# Check if the server is accessible
with st.spinner("サーバーのチェック中..."):  
    try:
        requests.get(server_url)  # GET request to the server
    except requests.ConnectionError:
        st.error('サーバーが立ち上がっていません。', icon="🚨")  # Display error message
        st.stop()  # Stop the app

# Radio button for model selection
model = st.radio("model",["汎用モデル","チューニングモデル"])

# Audio file upload feature
audio_file = st.file_uploader("音声ファイルをアップロードしてください", type=["mp3", "wav", "m4a", "mp4"],key="audio_file_trancribe")

# Process when a file is uploaded
if audio_file:
    # Set the output file name
    st.session_state.transcribe_file_name = "文字起こし結果_"+audio_file.name.replace(".mp4","").replace(".mp3","").replace(".wav","").replace(".m4a","")+".txt"
    
    # Toggle button for providing training data
    if button_save_audio := st.toggle("学習用の為音声ファイルをフィードバックする",key="button_save_audio",
                                        help="音声文字起こしモデルを改善するために、音声ファイルを集めています。音声ファイルを学習データとして利用させていただきます。"):
        st.subheader("ご協力ありがとうございます🤗")
        st.balloons()  # Display balloon effect
    
    # Transcription start button
    button_trans_start = st.button("文字起こしを開始する",type="primary")
    if button_trans_start:
        start_time = time.time()  # Record processing start time
        with st.spinner("**文字起こしを実行中...**"):
            try:
                # Send data to server and request transcription execution
                response = requests.post(
                    server_url+"transcribe_server",
                    files={"audio": audio_file.getvalue()},  # Audio file data
                    data={"model": model,"save_audio":button_save_audio,"file_name":audio_file.name},  # Additional data
                )

                if response.status_code == 200:  # In case of success
                    end_time = time.time()  # Record processing end time
                    st.session_state.execution_time = end_time - start_time  # Calculate execution time
                    st.session_state.transcribe_data = response.json()  # Save result data
                else:
                    st.write("Error: ", response.text)  # Display error message
            except requests.ConnectionError:
                st.error('サーバーとの接続が解除されました。', icon="🚨")
                st.stop()

# Display transcription results
if "transcribe_data" in st.session_state:
    st.write("実行時間:", convert_seconds(st.session_state.execution_time))  # Display execution time
    probability = round(float(st.session_state.transcribe_data['language_probability']) * 100, 1)  # Calculate language detection confidence
    st.write(f"検出言語: {st.session_state.transcribe_data['language']} 信用度 {probability}%")  # Display detected language and confidence
    
    # Download button for transcription results
    st.download_button(label="文字起こし結果をダウンロードする",data=st.session_state.transcribe_data["full_text"],file_name=st.session_state.transcribe_file_name,icon=":material/download:")
    
    # Display transcription results
    st.markdown("**文字起こし結果**")
    st.markdown(st.session_state.transcribe_data["time_line"], unsafe_allow_html=True)  # Display transcription results with timestamps

server_flask.py
from flask import Flask, request
import os
import logging
logging.basicConfig(level=logging.DEBUG)
app = Flask(__name__)
from transcribe_flask import transcribe

# Endpoint for transcription API (for POST requests)
@app.route('/transcribe_server', methods=['POST'])
def transcribe_server():
    try:
        # Get data from request
        audio_file = request.files['audio']  # Audio file
        model = request.form['model']  # Model to use
        save_audio = request.form['save_audio']  # Audio save flag
        file_name = request.form['file_name']  # File name
        
        # Temporarily save the audio file
        audio_file.save(file_name)
        
        # Process for saving audio for training
        if save_audio == "True":
            with open(file_name, 'rb') as f_src:  # Read original file
                destination_path = "path_to_save"  # Destination path (placeholder: replace with your training-data location)
                with open(destination_path, 'wb') as f_dst:  # Write to destination file
                    f_dst.write(f_src.read())
        
        # Process for the general-purpose model
        if model == "汎用モデル":
            result = transcribe(audio_file=file_name)  # Execute transcription
        else:
            # The tuned model is not implemented in this article; without this
            # branch, `result` would be undefined for the second radio option
            os.remove(file_name)  # Clean up the temporary file
            return "チューニングモデルは未実装です", 501

        # Delete temporary file
        os.remove(file_name)
        return result  # Return results

    except Exception as e:
        return str(e), 500  # Return 500 error if an exception occurs

# Main program (when executed directly)
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)  # Start server (listen on all interfaces)
transcribe_flask.py
from flask import jsonify
from faster_whisper import WhisperModel

# Function to convert seconds to "X min Y sec" format
def convert_seconds(seconds):
    minutes = seconds // 60  # Calculate minutes (integer division)
    remaining_seconds = seconds % 60  # Calculate remaining seconds
    return f"{int(minutes)}分{int(remaining_seconds)}秒"

# Function to transcribe audio files
def transcribe(audio_file):
    # Initialize Whisper model (large-v3) on GPU
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    
    # Execute transcription of the audio file
    segments, info = model.transcribe(audio_file,
                                      language = "ja",  # Specify Japanese
                                      beam_size=5,  # Beam search width (for improved accuracy)
                                      vad_filter=True,  # Enable Voice Activity Detection filter
                                      without_timestamps=True,  # Do not predict timestamp tokens (segment start/end are still returned)
                                      prompt_reset_on_temperature=0,  # Temperature threshold for prompt reset
                                      # initial_prompt=""  # Initial prompt (unused this time)
                                      )

    # Initialize variables to store results
    full_text=""  # Full text
    time_line=""  # Text with timestamps
    
    # Process each segment (sentence)
    for segment in segments:
        # Generate text with timestamps (format: Start time -> End time)
        time_line+="[%s -> %s] %s" % (convert_seconds(segment.start), convert_seconds(segment.end), segment.text)+"  \n"
        # Add to full text
        full_text+=segment.text+"\n"

    # Create data to return in JSON format
    result = {
        "language": info.language,  # Detected language
        "language_probability": info.language_probability,  # Language detection confidence
        "time_line":time_line,  # Text with timestamps
        "full_text":full_text  # Full text
    }

    # Return results in JSON format
    return jsonify(result)

https://github.com/tsuzukia21/st-transcribe

The processing flow is as follows:

[User]
   ↓ (Upload audio file)
[app_flask.py]
   ↓ (POST submission)
[server_flask.py]
   ↓ (Transcribe with Whisper)
[transcribe_flask.py]
   ↓ (Return JSON result)
[Display text results]

📅 How to Run

1. Install Required Libraries

pip install flask faster-whisper streamlit requests

Running faster_whisper on a GPU requires CUDA and cuDNN to be installed separately; their setup is not covered in this article. Please refer to the following articles or search online:
https://siro-yamaneko.hatenablog.jp/entry/2024/08/03/210826
https://qiita.com/tf63/items/0c6da72fe749319423b4
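If no NVIDIA GPU is available, faster-whisper can also run on the CPU (compute_type="int8" keeps it usable for short files). Below is a minimal device-selection sketch; pick_device is a hypothetical helper, and it only assumes that the ctranslate2 package (installed together with faster-whisper) may or may not be importable:

```python
def pick_device() -> tuple[str, str]:
    """Return (device, compute_type) for WhisperModel based on what is available."""
    try:
        import ctranslate2  # installed as a dependency of faster-whisper
        if ctranslate2.get_cuda_device_count() > 0:
            return "cuda", "float16"  # GPU path, as used in this article
    except ImportError:
        pass  # ctranslate2 missing: fall through to the CPU settings
    return "cpu", "int8"  # CPU fallback with 8-bit quantization

device, compute_type = pick_device()
print(device, compute_type)
# e.g. model = WhisperModel("large-v3", device=device, compute_type=compute_type)
```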

2. Start the App and Server

python server_flask.py
streamlit run app_flask.py

3. Access via Browser

http://localhost:8501

🔹 Explanation of Each File

app_flask.py

app_flask.py provides the UI that lets users select an audio file in their browser and upload it to the server. Here, it is implemented simply with Streamlit.

The main processing steps are as follows:

Check if the transcription server is running

# Check if the server is accessible
with st.spinner("Checking the server..."):  
    try:
        requests.get(server_url)  # GET request to the server
    except requests.ConnectionError:
        st.error('The server is not running.', icon="🚨")  # Display error message
        st.stop()  # Stop the app

This application is configured to send HTTP requests to the server (server_flask.py) started with Flask.
Therefore, the transcription process will not work properly if the server is not running.

You need to start server_flask.py beforehand, as follows:

python server_flask.py

Requests from app_flask.py are sent to http://localhost:5000/transcribe_server by default.
Therefore, if you have changed the port or host, you need to correct the URL accordingly.
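One caveat: requests.get(server_url) is called without a timeout, so the app can hang if the server accepts the connection but never answers (requests.get also accepts a timeout= argument). Here is a sketch of the same health probe with a timeout using only the standard library; server_alive is a hypothetical helper, not part of the repo:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def server_alive(url: str, timeout: float = 2.0) -> bool:
    # Probe the server; any HTTP response (even an error status) means it is up
    try:
        urlopen(url, timeout=timeout)
        return True
    except HTTPError:
        return True   # server responded with an HTTP error status: still alive
    except OSError:
        return False  # connection refused, DNS failure, timeout, etc.

print(server_alive("http://localhost:5000/"))  # False unless server_flask.py is running
```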

File Upload UI

audio_file = st.file_uploader("Please upload an audio file", type=["mp3", "wav", "m4a", "mp4"],key="audio_file_trancribe")

This part uses Streamlit's file_uploader component to let the user select a file. The type argument restricts the file formats that can be uploaded.

Display "Start Transcription" button and start processing when clicked

button_trans_start = st.button("Start transcription",type="primary")

st.button creates the button UI.

Send the file to the Flask server

response = requests.post(
    server_url+"transcribe_server",
    files={"audio": audio_file.getvalue()},  # Audio file data
    data={"model": model,"save_audio":button_save_audio,"file_name":audio_file.name},  # Additional data
)

In this part, the file is sent to the /transcribe_server endpoint of the server_flask.py started with Flask. An HTTP request is sent via requests.post, and the file is specified in the files argument as a dictionary.
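As a side note, requests also accepts a (filename, bytes) tuple in files; with that form the server could read the original name from request.files['audio'].filename instead of the separate file_name form field. A sketch of that alternative payload shape (a variation, not the article's code):

```python
# The filename travels inside the multipart part itself, so the extra
# "file_name" form field of the article's version becomes unnecessary
files = {"audio": ("meeting.mp4", b"...audio bytes...")}
data = {"model": "汎用モデル", "save_audio": "False"}

# requests.post(server_url + "transcribe_server", files=files, data=data)
filename, payload = files["audio"]
print(filename)  # → meeting.mp4
```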

Display the text returned from the server

if "transcribe_data" in st.session_state:
    st.write("Execution time:", convert_seconds(st.session_state.execution_time))  # Display execution time
    probability = round(float(st.session_state.transcribe_data['language_probability']) * 100, 1)  # Calculate language detection confidence
    st.write(f"Detected language: {st.session_state.transcribe_data['language']} confidence {probability}%")  # Display detected language and confidence
    
    # Download button for transcription results
    st.download_button(label="Download transcription results",data=st.session_state.transcribe_data["full_text"],file_name=st.session_state.transcribe_file_name,icon=":material/download:")
    
    # Display transcription results
    st.markdown("**Transcription Results**")
    st.markdown(st.session_state.transcribe_data["time_line"], unsafe_allow_html=True)  # Display transcription results with timestamps

The results transcribed on the server side are displayed directly on the screen. The response is in JSON format, and the transcription of the content uploaded by the user is displayed here. The JSON format will be described later.
Additionally, transcription results can be downloaded in text format via the download button.
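For reference, the JSON body the app consumes has the shape below (the values are illustrative, but the four field names match transcribe_flask.py), and the probability line is computed exactly as in the snippet above:

```python
import json

# Illustrative response body; field names match the result dict in transcribe_flask.py
raw = json.dumps({
    "language": "ja",
    "language_probability": 0.98,
    "time_line": "[0分0秒 -> 0分5秒] こんにちは  \n",
    "full_text": "こんにちは\n",
})

data = json.loads(raw)
probability = round(float(data["language_probability"]) * 100, 1)
print(f"検出言語: {data['language']} 信用度 {probability}%")  # → 検出言語: ja 信用度 98.0%
```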

Supplement: About Fine-tuning

Although fine-tuning is not implemented this time, publicly available Whisper models often fail to transcribe company-specific or technical terms correctly. When actually operating a transcription server, I recommend fine-tuning the Whisper model. The following articles are helpful on how to fine-tune:
https://zenn.dev/k_sone/articles/e0c08268986ac2
https://zenn.dev/k_sone/articles/4d137d58dd06a6

Training data is necessary for fine-tuning, so it would be ideal to collect it if permission is granted.

    # Toggle button for providing training data
    if button_save_audio := st.toggle("Provide audio file feedback for training",key="button_save_audio",
                                        help="We are collecting audio files to improve the transcription model. Audio files will be used as training data."):
        st.subheader("Thank you for your cooperation 🤗")
        st.balloons()  # Display balloon effect

If you don't need fine-tuning, you can just comment this part out.

server_flask.py

This file is responsible for receiving the audio file uploaded by the user, calling the program (transcribe_flask.py) that actually performs the transcription process, and returning the result to the user.

The main processing steps are as follows:

1. Creating an endpoint

In Flask, by writing something like @app.route('/transcribe_server', methods=['POST']), you can define the processing for when a specific URL is accessed. Here, it handles the process of receiving the posted audio file.

2. Saving the uploaded file

The uploaded file can be received via request.files['audio']. This is saved locally as a temporary file.

def transcribe_server():
    try:
        # Get data from request
        audio_file = request.files['audio']  # Audio file
        model = request.form['model']  # Model to use
        save_audio = request.form['save_audio']  # Audio save flag
        file_name = request.form['file_name']  # File name
        
        # Temporarily save the audio file
        audio_file.save(file_name)
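One caution here: audio_file.save(file_name) writes to whatever name the client sent, so a malicious file_name such as ../../something could escape the working directory. A defensive sketch using only the standard library (safe_temp_path is a hypothetical helper, not in the repo); the server would then call audio_file.save(safe_temp_path(file_name)):

```python
import os
import tempfile

def safe_temp_path(client_name: str) -> str:
    # Keep only the extension of the client-supplied name and let
    # tempfile choose a unique path outside the client's control
    ext = os.path.splitext(os.path.basename(client_name))[1]
    fd, path = tempfile.mkstemp(suffix=ext)
    os.close(fd)  # close the handle; the caller writes to the path
    return path

path = safe_temp_path("../../etc/cron.mp3")
print(os.path.basename(path).endswith(".mp3"))  # → True
os.remove(path)
```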

3. Calling transcribe_flask.py for conversion

The transcribe function defined in another module (transcribe_flask.py) is called. That module uses Whisper to convert the audio file into text.

from transcribe_flask import transcribe

result = transcribe(audio_file=file_name)

The imported function is called in this manner.

4. Loading the result and returning it as a response

It returns the transcription results generated by transcribe_flask.py as a response.

transcribe_flask.py

This file is the script that actually performs the process of transcribing audio using Whisper.

The main processing steps are as follows:

1. Opening the audio file received as an argument

When called by the Flask server side, a filename (path) is passed, so the audio file is read based on that.

2. Transcribing with the Whisper model

Using faster-whisper, a CTranslate2-based reimplementation of OpenAI's Whisper, the audio is transcribed into text.

from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

3. Returning the text as JSON

The full transcription can be read from result["full_text"], alongside the timestamped text and the detected-language info.

result = {
    "language": info.language,  # Detected language
    "language_probability": info.language_probability,  # Language detection confidence
    "time_line":time_line,  # Text with timestamps
    "full_text":full_text  # Full text
}
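One detail to watch: convert_seconds as written reports a 62-minute recording as "62分0秒", since it never rolls minutes over into hours. A divmod-based variant that adds hours for long meetings (a sketch, not the repo's code):

```python
def convert_seconds(seconds: float) -> str:
    # Split into hours / minutes / seconds; show hours only when nonzero
    minutes, sec = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}時間{minutes}分{sec}秒"
    return f"{minutes}分{sec}秒"

print(convert_seconds(75))    # → 1分15秒
print(convert_seconds(3700))  # → 1時間1分40秒
```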

📋 Conclusion

There is a high demand for safely transcribing audio within a company or automating the creation of meeting minutes.
By combining Flask, Whisper, and Streamlit as shown this time, you can easily build an on-premise transcription environment!

This implementation uses "synchronous processing," so the screen shows a loading spinner ⏳ from the moment the audio is sent until the conversion finishes, and the user has to wait for completion.

Switching to asynchronous processing would let the server reply immediately with "received," then later report progress like "now at 30%" or show partial transcription results. This difference between synchronous and asynchronous processing completely changes the UX. In the next article, I will build an audio processing app using asynchronous processing!
https://github.com/tsuzukia21/st-transcribe
