Building an On-Premise Transcription Server with Streamlit, Flask, and Whisper (Synchronous Processing)
✨ Introduction
You want to try transcribing meetings, but...
- You're worried about information leakage from sending recording data externally...
- The transcription accuracy of Teams or Zoom is mediocre...
In such cases, if you have even an entry-level GPU (e.g., a GTX 1650 with about 4 GB of VRAM), you can run high-accuracy transcription entirely within your company's local environment!
In this article, I built a simple transcription app using Streamlit + Flask + Whisper.
📂 Configuration Overview
This system consists of the following three Python files:
| File Name | Role |
|---|---|
| app_flask.py | Provides the UI for users to upload audio files |
| server_flask.py | Receives files and calls the processing script |
| transcribe_flask.py | Transcribes audio files using Whisper |
app_flask.py

```python
import streamlit as st
import requests
import time

# Function to convert seconds to "X min Y sec" format
def convert_seconds(seconds):
    minutes = seconds // 60  # Calculate minutes (integer division)
    remaining_seconds = seconds % 60  # Calculate remaining seconds
    return f"{int(minutes)}分{int(remaining_seconds)}秒"

# Set the application title and description
st.title("音声文字起こし")
st.write('**Whisperを利用して音声データを文字起こしすることが出来ます。**')

# Server URL setting
server_url = "http://localhost:5000/"

# Check if the server is accessible
with st.spinner("サーバーのチェック中..."):
    try:
        requests.get(server_url)  # GET request to the server
    except requests.ConnectionError:
        st.error('サーバーが立ち上がっていません。', icon="🚨")  # Display error message
        st.stop()  # Stop the app

# Radio button for model selection
model = st.radio("model", ["汎用モデル", "チューニングモデル"])

# Audio file upload feature
audio_file = st.file_uploader("音声ファイルをアップロードしてください", type=["mp3", "wav", "m4a", "mp4"], key="audio_file_trancribe")

# Process when a file is uploaded
if audio_file:
    # Set the output file name
    st.session_state.transcribe_file_name = "文字起こし結果_" + audio_file.name.replace(".mp4", "").replace(".mp3", "").replace(".wav", "").replace(".m4a", "") + ".txt"

    # Toggle button for providing training data
    if button_save_audio := st.toggle("学習用の為音声ファイルをフィードバックする", key="button_save_audio",
            help="音声文字起こしモデルを改善するために、音声ファイルを集めています。音声ファイルを学習データとして利用させていただきます。"):
        st.subheader("ご協力ありがとうございます🤗")
        st.balloons()  # Display balloon effect

    # Transcription start button
    button_trans_start = st.button("文字起こしを開始する", type="primary")
    if button_trans_start:
        start_time = time.time()  # Record processing start time
        with st.spinner("**文字起こしを実行中...**"):
            try:
                # Send data to the server and request transcription
                response = requests.post(
                    server_url + "transcribe_server",
                    files={"audio": audio_file.getvalue()},  # Audio file data
                    data={"model": model, "save_audio": button_save_audio, "file_name": audio_file.name},  # Additional data
                )
                if response.status_code == 200:  # In case of success
                    end_time = time.time()  # Record processing end time
                    st.session_state.execution_time = end_time - start_time  # Calculate execution time
                    st.session_state.transcribe_data = response.json()  # Save result data
                else:
                    st.write("Error: ", response.text)  # Display error message
            except requests.ConnectionError:
                st.error('サーバーとの接続が解除されました。', icon="🚨")
                st.stop()

# Display transcription results
if "transcribe_data" in st.session_state:
    st.write("実行時間:", convert_seconds(st.session_state.execution_time))  # Display execution time
    probability = round(float(st.session_state.transcribe_data['language_probability']) * 100, 1)  # Language detection confidence
    st.write(f"検出言語: {st.session_state.transcribe_data['language']} 信用度 {probability}%")  # Display detected language and confidence
    # Download button for transcription results
    st.download_button(label="文字起こし結果をダウンロードする", data=st.session_state.transcribe_data["full_text"], file_name=st.session_state.transcribe_file_name, icon=":material/download:")
    # Display transcription results
    st.markdown("**文字起こし結果**")
    st.markdown(st.session_state.transcribe_data["time_line"], unsafe_allow_html=True)  # Display results with timestamps
```
server_flask.py

```python
from flask import Flask, request
import os
import logging

logging.basicConfig(level=logging.DEBUG)
app = Flask(__name__)

from transcribe_flask import transcribe

# Endpoint for the transcription API (POST requests)
@app.route('/transcribe_server', methods=['POST'])
def transcribe_server():
    try:
        # Get data from the request
        audio_file = request.files['audio']  # Audio file
        model = request.form['model']  # Model to use
        save_audio = request.form['save_audio']  # Audio save flag
        file_name = request.form['file_name']  # File name

        # Temporarily save the audio file
        audio_file.save(file_name)

        # Save a copy of the audio for training
        if save_audio == "True":
            with open(file_name, 'rb') as f_src:  # Read the original file
                destination_path = "path_to_save"  # Destination path (placeholder)
                with open(destination_path, 'wb') as f_dst:  # Write to the destination file
                    f_dst.write(f_src.read())

        # Process with the general-purpose model
        if model == "汎用モデル":
            result = transcribe(audio_file=file_name)  # Execute transcription
        else:
            # The tuned model is not implemented in this article; fall back to the general-purpose model
            result = transcribe(audio_file=file_name)

        # Delete the temporary file
        os.remove(file_name)
        return result  # Return the result
    except Exception as e:
        return str(e), 500  # Return a 500 error if an exception occurs

# Main program (when executed directly)
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)  # Start the server (listen on all interfaces)
```
transcribe_flask.py

```python
from flask import jsonify
from faster_whisper import WhisperModel

# Function to convert seconds to "X min Y sec" format
def convert_seconds(seconds):
    minutes = seconds // 60  # Calculate minutes (integer division)
    remaining_seconds = seconds % 60  # Calculate remaining seconds
    return f"{int(minutes)}分{int(remaining_seconds)}秒"

# Function to transcribe audio files
def transcribe(audio_file):
    # Initialize the Whisper model (large-v3) on the GPU
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    # Execute transcription of the audio file
    segments, info = model.transcribe(audio_file,
        language="ja",  # Specify Japanese
        beam_size=5,  # Beam search width (for improved accuracy)
        vad_filter=True,  # Enable the Voice Activity Detection filter
        without_timestamps=True,  # Decode without timestamp tokens
        prompt_reset_on_temperature=0,  # Temperature threshold for resetting the prompt
        # initial_prompt=""  # Initial prompt (unused this time)
    )

    # Initialize variables to store results
    full_text = ""  # Full text
    time_line = ""  # Text with timestamps

    # Process each segment (sentence)
    for segment in segments:
        # Generate text with timestamps (format: start time -> end time)
        time_line += "[%s -> %s] %s" % (convert_seconds(segment.start), convert_seconds(segment.end), segment.text) + " \n"
        # Add to the full text
        full_text += segment.text + "\n"

    # Create the data to return in JSON format
    result = {
        "language": info.language,  # Detected language
        "language_probability": info.language_probability,  # Language detection confidence
        "time_line": time_line,  # Text with timestamps
        "full_text": full_text,  # Full text
    }

    # Return the result in JSON format
    return jsonify(result)
```
The processing flow is as follows:
[User]
↓ (Upload audio file)
[app_flask.py]
↓ (POST submission)
[server_flask.py]
↓ (Transcribe with Whisper)
[transcribe_flask.py]
↓
[Display text results]

📅 How to Run
1. Install Required Libraries
```
pip install flask faster_whisper streamlit
```
To run faster_whisper on a GPU, CUDA and cuDNN must be set up separately. That setup is not covered in this article; please refer to online resources.
2. Start the App and Server
```
python server_flask.py
streamlit run app_flask.py
```

Note that Streamlit apps are launched with `streamlit run`, not `python`.
3. Access via Browser
http://localhost:8501

🔹 Explanation of Each File
app_flask.py
app_flask.py is responsible for providing the UI for users to select an audio file on their browser and upload it to the server. This time, it is implemented simply using Streamlit.
The main processing steps are as follows:
Check if the transcription server is running
```python
# Check if the server is accessible
with st.spinner("Checking the server..."):
    try:
        requests.get(server_url)  # GET request to the server
    except requests.ConnectionError:
        st.error('The server is not running.', icon="🚨")  # Display error message
        st.stop()  # Stop the app
```
This application is configured to send HTTP requests to the server (server_flask.py) started with Flask.
Therefore, the transcription process will not work properly if the server is not running.
You need to start server_flask.py beforehand, as follows:
```
python server_flask.py
```
Requests from app_flask.py are sent to http://localhost:5000/transcribe_server by default.
Therefore, if you have changed the port or host, you need to correct the URL accordingly.
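If the host or port differs between environments, one option is to read the URL from an environment variable instead of hard-coding it. Here is a minimal sketch; the variable name `TRANSCRIBE_SERVER_URL` is my own choice, not part of the original app:

```python
import os

def get_server_url(default: str = "http://localhost:5000/") -> str:
    """Read the transcription server URL from an environment variable,
    falling back to the local default. TRANSCRIBE_SERVER_URL is a
    hypothetical variable name, not defined by the original app."""
    url = os.environ.get("TRANSCRIBE_SERVER_URL", default)
    # Ensure a trailing slash so endpoint paths can be appended directly
    return url if url.endswith("/") else url + "/"
```

With this helper, `server_url = get_server_url()` replaces the hard-coded assignment, and operators can repoint the UI without touching the code.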
File Upload UI
```python
audio_file = st.file_uploader("Please upload an audio file", type=["mp3", "wav", "m4a", "mp4"], key="audio_file_trancribe")
```
This part uses Streamlit's file_uploader component to let the user select a file. The type argument restricts the file formats that can be uploaded.
Display "Start Transcription" button and start processing when clicked
```python
button_trans_start = st.button("Start transcription", type="primary")
```
st.button creates the button UI.
Send the file to the Flask server
```python
response = requests.post(
    server_url + "transcribe_server",
    files={"audio": audio_file.getvalue()},  # Audio file data
    data={"model": model, "save_audio": button_save_audio, "file_name": audio_file.name},  # Additional data
)
```
In this part, the file is sent to the /transcribe_server endpoint of the server_flask.py started with Flask. An HTTP request is sent via requests.post, and the file is specified in the files argument as a dictionary.
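Because raw bytes are passed here, Flask sees a part with no filename, which is why the app sends `file_name` separately in `data`. An alternative is to attach the original filename directly to the multipart part; a sketch, where the helper name `build_upload` is my own:

```python
def build_upload(audio_bytes: bytes, file_name: str) -> dict:
    """Build a `files` payload for requests.post that carries the original
    filename and a generic MIME type, so Flask's request.files['audio']
    exposes a proper filename attribute."""
    return {"audio": (file_name, audio_bytes, "application/octet-stream")}

# Usage (assumes the Flask server from this article is running):
# response = requests.post(server_url + "transcribe_server",
#                          files=build_upload(audio_file.getvalue(), audio_file.name),
#                          data={"model": model, "save_audio": button_save_audio})
```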
Display the text returned from the server
```python
if "transcribe_data" in st.session_state:
    st.write("Execution time:", convert_seconds(st.session_state.execution_time))  # Display execution time
    probability = round(float(st.session_state.transcribe_data['language_probability']) * 100, 1)  # Language detection confidence
    st.write(f"Detected language: {st.session_state.transcribe_data['language']} confidence {probability}%")  # Display detected language and confidence
    # Download button for transcription results
    st.download_button(label="Download transcription results", data=st.session_state.transcribe_data["full_text"], file_name=st.session_state.transcribe_file_name, icon=":material/download:")
    # Display transcription results
    st.markdown("**Transcription Results**")
    st.markdown(st.session_state.transcribe_data["time_line"], unsafe_allow_html=True)  # Display results with timestamps
```
The results transcribed on the server side are displayed directly on the screen. The response is in JSON format, and the transcription of the content uploaded by the user is displayed here. The JSON format will be described later.
Additionally, transcription results can be downloaded in text format via the download button.
Supplement: About Fine-tuning
Although not implemented this time, off-the-shelf Whisper models often fail to transcribe company-specific or technical terms correctly. When actually operating a transcription server, I recommend fine-tuning the Whisper model; guides on the method are available online.
Training data is necessary for fine-tuning, so it would be ideal to collect it if permission is granted.
```python
# Toggle button for providing training data
if button_save_audio := st.toggle("Provide audio file feedback for training", key="button_save_audio",
        help="We are collecting audio files to improve the transcription model. Audio files will be used as training data."):
    st.subheader("Thank you for your cooperation 🤗")
    st.balloons()  # Display balloon effect
```
If you don't need fine-tuning, you can just comment this part out.
server_flask.py
This file is responsible for receiving the audio file uploaded by the user, calling the program (transcribe_flask.py) that actually performs the transcription process, and returning the result to the user.
The main processing steps are as follows:
1. Creating an endpoint
In Flask, by writing something like @app.route('/transcribe_server', methods=['POST']), you can define the processing for when a specific URL is accessed. Here, it handles the process of receiving the posted audio file.
2. Saving the uploaded file
The uploaded file can be received via request.files['audio']. This is saved locally as a temporary file.
```python
def transcribe_server():
    try:
        # Get data from the request
        audio_file = request.files['audio']  # Audio file
        model = request.form['model']  # Model to use
        save_audio = request.form['save_audio']  # Audio save flag
        file_name = request.form['file_name']  # File name
        # Temporarily save the audio file
        audio_file.save(file_name)
```
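One caveat: saving with the user-supplied `file_name` verbatim allows path traversal (a client could send a name like `../../something`). A minimal stdlib sketch of a safer save path; in a real Flask app, `werkzeug.utils.secure_filename` does this job, and the helper name here is my own:

```python
import os
import re
import tempfile

def safe_temp_path(user_file_name: str) -> str:
    """Return a path inside the system temp directory whose basename is a
    sanitized version of the user-supplied name, so uploads can never
    escape the temp directory."""
    base = os.path.basename(user_file_name.replace("\\", "/"))  # drop directory parts
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)  # keep only safe characters
    if base in ("", ".", ".."):
        base = "upload"  # fallback for degenerate names
    return os.path.join(tempfile.gettempdir(), base)
```

The server would then call `audio_file.save(safe_temp_path(file_name))` instead of saving to the raw name.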
3. Calling transcribe_flask.py for conversion
Another Python script (transcribe_flask.py) is executed. This script uses Whisper to convert the audio file into text.
```python
from transcribe_flask import transcribe

result = transcribe(audio_file=file_name)
```
The external script is invoked in this manner.
4. Loading the result and returning it as a response
It returns the transcription results generated by transcribe_flask.py as a response.
transcribe_flask.py
This file is the script that actually performs the process of transcribing audio using Whisper.
The main processing steps are as follows:
1. Opening the audio file received as an argument
When called by the Flask server side, a filename (path) is passed, so the audio file is read based on that.
2. Transcribing with the Whisper model
The audio is transcribed into text using OpenAI's Whisper model, here loaded through the faster-whisper library.
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
```
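Since `device="cuda"` fails on machines without a GPU, a small helper can select the model arguments based on the available hardware. A sketch under that assumption; the helper name is my own, and `int8` is a common CPU choice in faster-whisper for memory and speed:

```python
def pick_whisper_config(cuda_available: bool) -> dict:
    """Return WhisperModel keyword arguments suited to the hardware:
    float16 on GPU, int8 quantization on CPU."""
    if cuda_available:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}

# Usage sketch (faster-whisper):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3", **pick_whisper_config(cuda_available=True))
```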
3. Saving the text as JSON
The transcription result is obtained in the format of result["full_text"].
```python
result = {
    "language": info.language,  # Detected language
    "language_probability": info.language_probability,  # Language detection confidence
    "time_line": time_line,  # Text with timestamps
    "full_text": full_text,  # Full text
}
```
📋 Conclusion
There is a high demand for safely transcribing audio within a company or automating the creation of meeting minutes.
By combining Flask, Whisper, and Streamlit as shown this time, you can easily build an on-premise transcription environment!
This implementation uses "synchronous processing," so the screen will show a loading spinner ⏳ from the moment the audio is sent until the conversion is finished, requiring the user to wait for completion. Switching this to asynchronous processing would allow the server to immediately reply with "received," and then later notify the user of progress like "now at 30%" or display partial transcription results. This is a significant difference between synchronous and asynchronous processing, and it completely changes the UX. In the next article, I will create an audio processing app using asynchronous processing!
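As a preview of the idea only (not the next article's actual code), the asynchronous pattern boils down to returning a job ID immediately and doing the work in a background thread, which the client then polls. A minimal sketch, where `worker` stands in for the real Whisper call:

```python
import threading
import uuid

# job_id -> {"status": "processing" | "done", "result": str | None}
jobs = {}

def submit(audio_path: str, worker) -> str:
    """Accept a transcription job and return its ID immediately.
    The actual work runs in a background thread; a /status endpoint
    would look up `jobs[job_id]` to report progress to the client."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "processing", "result": None}

    def run():
        text = worker(audio_path)  # the real app would call Whisper here
        jobs[job_id] = {"status": "done", "result": text}

    threading.Thread(target=run, daemon=True).start()
    return job_id
```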