Speech Recognition Survey

Service	Cost (100 hours)	Cost (1,000 hours)	Link	Limits
Whisper (OpenAI)	approx. 6,000-10,000 yen	approx. 60,000-100,000 yen	Calculated at 0.006 USD or 0.01 USD per minute. https://community.openai.com/t/api-model-whisper-real-cost/469816/11	25 MB
Whisper (Groq)	approx. 500 yen	approx. 5,000 yen	https://console.groq.com/settings/billing	Requests per Minute: 20, Requests per Day: 2,000, Tokens per Minute: 15,000, 25 MB
AmiVoice (general-purpose model)	approx. 10,000 yen	99,000 yen	https://acp.amivoice.com/amivoice_api/price/	-
Amazon Transcribe	approx. 20,000 yen (0.024 USD/min * 60 min * 100 h = 144 USD)	approx. 200,000 yen	https://aws.amazon.com/jp/transcribe/pricing/	-
Azure Speech Service (Standard, real-time)	approx. 16,000 yen	approx. 160,000 yen	https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/#pricing	-
Azure Speech Service (Standard, batch)	approx. 3,000 yen	approx. 30,000 yen	https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/#pricing	-
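
The per-minute arithmetic behind the table can be sketched as follows. The per-minute rates are the ones quoted in the table; the exchange rate of 150 yen per USD is an assumption made only for illustration, and the function name is hypothetical.

```python
def transcription_cost(hours, usd_per_minute, jpy_per_usd=150.0):
    """Return (cost in USD, cost in JPY) for `hours` of audio billed per minute.

    jpy_per_usd is an assumed exchange rate, not a value from the table.
    """
    minutes = hours * 60
    usd = minutes * usd_per_minute
    return usd, usd * jpy_per_usd


# Rates quoted in the table: Whisper (OpenAI) 0.006 USD/min, Amazon Transcribe 0.024 USD/min.
for name, rate in [("Whisper (OpenAI)", 0.006), ("Amazon Transcribe", 0.024)]:
    usd, jpy = transcription_cost(100, rate)
    print(f"{name}: {usd:.0f} USD, about {jpy:,.0f} yen per 100 hours")
```

Multiplying the hours by ten gives the 1,000-hour column, since none of the listed prices include a volume discount in this simple model.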

Azure and Google Cloud (I have not looked up Google Cloud's pricing) both offer a batch mode, so they are likely to come out cheaper.

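About batch (Azure's explanation)

Batch transcription jobs are scheduled on a best-effort basis. At peak hours, it can take up to 30 minutes or longer for a transcription job to start processing. For most of its execution time, a transcription has the status Running. This is because a job is assigned the Running status as soon as it moves to the batch transcription backend system. This assignment happens almost instantly when the base model is used, and with a slight delay for custom models. As a result, the time a transcription job spends in the Running state does not correspond to the actual transcription time; it also includes the waiting time in internal queues.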
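
Because the Running state also covers queue time, a client usually just polls until the job reaches a final state. The sketch below shows that loop in a service-agnostic way. `fetch_status` is a hypothetical callable that returns the current status string, and treating "NotStarted" and "Running" as the only non-final states is an assumption, not something confirmed by the text above.

```python
import time


def wait_for_transcription(fetch_status, poll_seconds=30.0,
                           timeout_seconds=3600.0, sleep=time.sleep):
    """Poll until the job leaves the queued/running states.

    fetch_status: hypothetical callable returning the current status string.
    Returns (final_status, seconds_waited). The wait includes queue time,
    so it is an upper bound on the actual transcription time.
    """
    pending = ("NotStarted", "Running")  # assumed non-final states
    waited = 0.0
    while True:
        status = fetch_status()
        if status not in pending:
            return status, waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited:.0f} s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake status source and sleeping disabled, for illustration only.
fake_states = iter(["NotStarted", "Running", "Running", "Succeeded"])
result = wait_for_transcription(lambda: next(fake_states),
                                poll_seconds=1.0, sleep=lambda s: None)
print(result)
```

Given the note about peak hours, a timeout well above 30 minutes seems prudent for anything run unattended.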

sergicalsix

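Technology selection criteria

Response time
- Whether it can handle multiple concurrent requests, including in the real-time case
Cost and data volume

Speech recognition quality
- Whether it can recognize industry-specific terminology
- Whether it can cope with the audio quality found in practice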

Improving transcription accuracy

Data cleaning

Chunking
- VAD (voice activity detection)
  - Signal-strength based
  - Spectrum based
  - Zero-crossing based
  - N-cross method
  - MarbleNet
  - SileroVAD

https://github.com/wiseman/py-webrtcvad
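
The two simplest approaches in the list above, signal strength and zero crossings, can be sketched in a few lines of standard-library Python. This is a minimal illustration, not how py-webrtcvad, MarbleNet, or SileroVAD work internally. The function names, the 30 ms frame size, and the RMS threshold of 0.02 are all assumptions chosen for the synthetic example.

```python
import math


def frame_features(samples, frame_len):
    """Return (rms, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((rms, crossings / (frame_len - 1)))
    return feats


def speech_segments(samples, sample_rate, frame_ms=30, rms_threshold=0.02):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    feats = frame_features(samples, frame_len)
    segments, start = [], None
    for i, (rms, _zcr) in enumerate(feats):
        active = rms > rms_threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_len / sample_rate, i * frame_len / sample_rate))
            start = None
    if start is not None:
        segments.append((start * frame_len / sample_rate, len(feats) * frame_len / sample_rate))
    return segments


# Synthetic input: 0.3 s of silence, 0.6 s of a 220 Hz tone, 0.3 s of silence.
RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(9600)]
signal = [0.0] * 4800 + tone + [0.0] * 4800
print(speech_segments(signal, RATE))
```

Only the RMS value drives the decision here; the zero-crossing rate is returned as a second feature because it can help tell noisy unvoiced sounds from low-frequency background, at the cost of one more threshold to tune.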

Noise removal

Vocabulary addition

Amazon Transcribe

Model improvement

Whisper

Amazon Transcribe

Prompt improvement

Post-processing

Improving system accuracy

Speaker diarization

https://fukugyouhistory.tokyo/?p=623

Whisper

How Whisper works

Is it viable to extract only the no-speech handling logic?

Streaming with Whisper

Fast processing

Prompts

Tuning

Whisper

Amazon Transcribe

Accuracy metrics

"Recognition accuracy" vs. "recognition rate"
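
The difference between the two terms can be made concrete with a small sketch. It assumes the common convention in which the recognition rate (correct rate) ignores insertion errors, (N - S - D) / N, while the recognition accuracy also penalizes them, (N - S - D - I) / N, where N is the reference length and S, D, I are substitutions, deletions, and insertions. Whether a particular vendor uses exactly these definitions is not confirmed by this note. The function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) from a minimum edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to classify each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def recognition_scores(ref, hyp):
    """Compute correct rate, accuracy, and error rate over token sequences."""
    s, d, i = align_counts(ref, hyp)
    n = len(ref)
    return {
        "correct_rate": (n - s - d) / n,   # ignores insertions
        "accuracy": (n - s - d - i) / n,   # penalizes insertions
        "error_rate": (s + d + i) / n,     # WER, or CER for character tokens
    }


ref = "the cat sat on the mat".split()
hyp = "the cat sat on on the hat".split()
print(recognition_scores(ref, hyp))
```

For Japanese, passing `list(text)` instead of a word list gives character-level scores, which avoids depending on a tokenizer.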

Accuracy comparison experiments

Azure vs GCP vs Whisper

https://www.smartaxe.co.jp/blog/3565

AmiVoice vs Whisper

Whisper vs Amazon

Papers / articles

Accuracy metrics and applications

SpecAugment
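
As a rough illustration of the masking part of SpecAugment, the sketch below zeroes one random band of frequency rows and one random band of time columns in a spectrogram stored as a list of lists. Time warping, the third component of the method, is left out. The function name, the mask sizes, and the choice of 0.0 as the mask value are assumptions made for this example.

```python
import random


def spec_augment(spec, max_freq_mask=2, max_time_mask=3, rng=None):
    """Return a masked copy of spec (rows = frequency bins, columns = time steps).

    One frequency mask and one time mask are applied; masked cells become 0.0.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: blank f consecutive rows starting at f0.
    f = rng.randint(0, max_freq_mask)
    f0 = rng.randint(0, n_freq - f)
    for r in range(f0, f0 + f):
        out[r] = [0.0] * n_time

    # Time mask: blank t consecutive columns starting at t0.
    t = rng.randint(0, max_time_mask)
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for c in range(t0, t0 + t):
            row[c] = 0.0
    return out


# Usage on a dummy 8 x 10 spectrogram filled with ones.
spec = [[1.0] * 10 for _ in range(8)]
augmented = spec_augment(spec, rng=random.Random(0))
for row in augmented:
    print("".join("#" if v else "." for v in row))
```

Because the masks are drawn fresh on every call, the same utterance yields a different training example each epoch.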

Environment

ReazonSpeech

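Following the documentation as written produces an error at the step that loads the model from Hugging Face.
As a temporary workaround, I used the following requirements.txt.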

requirements.txt

absl-py==2.1.0
aiohttp==3.9.5
aiosignal==1.3.1
altair==5.3.0
anthropic==0.28.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.1
blinker==1.8.2
braceexpand==0.1.7
cachetools==5.3.3
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
comm==0.2.2
contourpy==1.2.1
cycler==0.12.1
Cython==3.0.10
decorator==4.4.2
Distance==0.1.3
distro==1.9.0
docker-pycreds==0.4.0
docopt==0.6.2
editdistance==0.8.1
exceptiongroup==1.2.1
executing==2.0.1
filelock==3.14.0
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
g2p-en==2.1.0
gitdb==4.0.11
GitPython==3.1.43
grpcio==1.64.1
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.23.3
hydra-core==1.3.2
idna==3.7
imageio==2.34.1
imageio-ffmpeg==0.5.1
importlib_metadata==7.1.0
importlib_resources==6.4.0
inflect==7.2.1
ipython==8.18.1
ipywidgets==8.1.3
jedi==0.19.1
Jinja2==3.1.4
jiter==0.4.1
jiwer==3.0.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyterlab_widgets==3.0.11
kaldi-python-io==1.2.2
kaldiio==2.18.0
kiwisolver==1.4.5
lazy_loader==0.4
Levenshtein==0.25.1
librosa==0.10.2.post1
lightning-utilities==0.11.2
llvmlite==0.42.0
loguru==0.7.2
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.3
matplotlib==3.9.0
matplotlib-inline==0.1.7
mdurl==0.1.2
more-itertools==10.2.0
moviepy==1.0.3
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
nemo-toolkit==1.21.0
networkx==3.2.1
nltk==3.8.1
numba==0.59.1
numpy==1.23.5
omegaconf==2.3.0
onnx==1.16.1
openai==0.28.0
packaging==24.0
pandas==2.2.2
parso==0.8.4
pexpect==4.9.0
pillow==10.3.0
plac==1.4.3
platformdirs==4.2.2
pooch==1.8.2
proglog==0.1.10
prompt_toolkit==3.0.46
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pyannote.core==5.0.0
pyannote.database==5.1.0
pyannote.metrics==3.2.1
pyarrow==16.1.0
pybind11==2.12.0
pycparser==2.22
pydantic==1.10.15
pydeck==0.9.1
pydub==0.25.1
Pygments==2.18.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytorch-lightning==2.0.7
pytz==2024.1
PyYAML==6.0.1
rapidfuzz==3.9.3
reazonspeech-nemo-asr @ file:///Users/XXX/playground/mtg/ReazonSpeech/pkg/nemo-asr
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rpds-py==0.18.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
sacremoses==0.1.1
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
sentry-sdk==2.5.1
setproctitle==1.3.3
shellingham==1.5.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.12.1
sox==1.4.1
soxr==0.3.7
stack-data==0.6.3
streamlit==1.35.0
sympy==1.12.1
tabulate==0.9.0
tenacity==8.3.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
termcolor==2.4.0
text-unidecode==1.3
texterrors==0.4.4
threadpoolctl==3.5.0
tokenizers==0.13.3
toml==0.10.2
toolz==0.12.1
torch==2.3.1
torchmetrics==1.4.0.post0
tornado==6.4.1
tqdm==4.66.4
traitlets==5.14.3
transformers==4.33.3
typeguard==4.3.0
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.1
wandb==0.17.1
watchdog==4.0.1
wcwidth==0.2.13
webdataset==0.1.62
Werkzeug==3.0.3
wget==3.2
widgetsnbextension==4.0.11
wrapt==1.16.0
yarl==1.9.4
youtokentome==1.0.6
zipp==3.19.2