Closed on 2024/05/05 · 13

A thread for studying WhisperKit

iOS

Swift

Whisper

Ryo24

Swift CLI

# install
git clone https://github.com/argmaxinc/whisperkit.git
cd whisperkit

# setup
make setup
make download-model MODEL=large-v3  # A

Running A creates the Models/whisperkit-coreml directory, and several Whisper model directories are created inside it.

Ryo24

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --audio-path ./sample.mp3

I ran the command above, but got this:

Fetching https://github.com/huggingface/swift-transformers.git
Fetching https://github.com/apple/swift-argument-parser.git
Fetched https://github.com/apple/swift-argument-parser.git from cache (3.83s)
Fetched https://github.com/huggingface/swift-transformers.git from cache (3.83s)
Computing version for https://github.com/apple/swift-argument-parser.git
Computed https://github.com/apple/swift-argument-parser.git at 1.3.0 (0.24s)
Computing version for https://github.com/huggingface/swift-transformers.git
Computed https://github.com/huggingface/swift-transformers.git at 0.1.7 (0.21s)
Computed https://github.com/apple/swift-argument-parser.git at 1.3.0 (0.00s)
Computed https://github.com/huggingface/swift-transformers.git at 0.1.7 (0.00s)
Creating working copy for https://github.com/huggingface/swift-transformers.git
Working copy of https://github.com/huggingface/swift-transformers.git resolved at 0.1.7
Creating working copy for https://github.com/apple/swift-argument-parser.git
Working copy of https://github.com/apple/swift-argument-parser.git resolved at 1.3.0
warning: 'whisperkit': found 1 file(s) which are unhandled; explicitly declare them as resources or exclude from the target
    /Users/<username>/dev/whisperkit/sample.mp3
Building for debugging...
[112/112] Applying whisperkit-cli
Build complete! (11.02s)
Error: Unable to load model: file:///Users/<username>/dev/whisperkit/Models/whisperkit-coreml/openai_whisper-large-v3/MelSpectrogram.mlmodelc/. Compile the model with Xcode or `MLModel.compileModel(at:)`.

The run fails with the error on the last line.
MelSpectrogram.mlmodelc does exist, so going by the error message, it seems I have to compile the model either with Xcode or with `MLModel.compileModel(at:)`.

Ryo24

Does `MLModel.compileModel(at:)` refer to a Swift method?

Ryo24

If this article is right, .mlmodelc is the already-compiled format, so I don't understand what the error is asking for.

Ryo24

brew install whisperkit-cli

I followed this article and installed the CLI with Homebrew, but nothing changed:

Error: Unable to load model: file:///Users/<username>/dev/whisperkit/Models/whisperkit-coreml/openai_whisper-base/MelSpectrogram.mlmodelc/. Compile the model with Xcode or `MLModel.compileModel(at:)`.

So if anything, the problem may be on the CLI side. Or maybe the model cannot be read because of insufficient permissions?

Ryo24

Running it with sudo made no difference, so it is not a permissions issue. The problem is most likely on the CLI side.

Ryo24

So it turns out I have to run git lfs pull inside the Models/whisperkit-coreml directory...
I wish the README said so... (maybe I'll open a PR later)
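A quick way to check for this situation: a file that Git LFS has not fetched yet is a small text stub whose first line is `version https://git-lfs.github.com/spec/v1`. The sketch below lists such stubs under a directory. The function name find_lfs_pointers is made up for this note, and it assumes the model files are tracked by Git LFS, which the fix above suggests.

```shell
#!/bin/sh
# List files under a directory that are still Git LFS pointer stubs.
# find_lfs_pointers is a hypothetical helper name, not part of WhisperKit or Git LFS.
find_lfs_pointers() {
  # -size -4 limits the search to files smaller than four 512-byte blocks,
  # since pointer stubs are only a few lines of text.
  # grep exits non-zero when nothing matches; that is not an error here.
  find "$1" -type f -size -4 \
    -exec grep -l '^version https://git-lfs\.github\.com/spec/v1' {} + || true
}
```

If `find_lfs_pointers Models/whisperkit-coreml/openai_whisper-large-v3` prints any paths, those files are still stubs, and running git lfs pull in Models/whisperkit-coreml should replace them with the real data.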

Ryo24

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --audio-path ./sample.mp3

↓ sample.mp3 is the commercial script (soap) from the page below.
https://pro-video.jp/voice/announce/

Result

無添加のシャボン玉石鹸ならもう安心!天然の保湿成分が含まれるため、肌にうるおいを与え、健やかに保ちます。お肌のことでお悩みの方は、ぜひ一度、無添加シャボン玉石鹸をお試しください。お求めは、0120-0055-95まで。

Even the kanji are transcribed correctly, so the accuracy is quite good.

Ryo24

Streaming mode

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --stream

Result

Unconfirmed segment: <|startoftranscript|><|nospeech|><|endoftext|>
Unconfirmed segment: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>マイクテースマイクテースこんにちは<|endoftext|>
Current text: 
---
Unconfirmed segment: <|startoftranscript|><|nospeech|><|endoftext|>
Unconfirmed segment: <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>マイクテースマイクテースこんにちは<|endoftext|>
Current text: Waiting for speech...

My impression is that, at least for Japanese, accuracy is lower than with file input.
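To compare the two modes on equal footing, the same audio file could be run both ways, since the CLI help lists a --stream-simulated option. The sketch below only prints the two commands by default and runs them when RUN=1. compare_modes, run_or_print, MODEL_PATH and RUN are names made up for this note, and it assumes --stream-simulated takes its input from --audio-path, which the help text implies but does not state.

```shell
#!/bin/sh
# Hypothetical default; override MODEL_PATH to point at another model directory.
MODEL_PATH="${MODEL_PATH:-Models/whisperkit-coreml/openai_whisper-large-v3}"

# Print the command by default; execute it only when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Transcribe the same file twice: once as a plain file, once as a simulated stream.
compare_modes() {
  audio="$1"
  run_or_print swift run whisperkit-cli transcribe \
    --model-path "$MODEL_PATH" --audio-path "$audio"
  run_or_print swift run whisperkit-cli transcribe \
    --model-path "$MODEL_PATH" --audio-path "$audio" --stream-simulated
}
```

Calling `compare_modes ./sample.mp3` prints both commands. With the same audio in both runs, any difference in the text comes from the streaming path rather than from the microphone.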

Ryo24

Output of swift run whisperkit-cli transcribe --help

OVERVIEW: Transcribe audio to text using WhisperKit

USAGE: whisperkit-cli transcribe [<options>] [<supress-tokens> ...]

ARGUMENTS:
  <supress-tokens>        Supress given tokens in the output

OPTIONS:
  --audio-path <audio-path>
                          Paths to audio files
  --audio-folder <audio-folder>
                          Path to a folder containing audio files
  --model-path <model-path>
                          Path of model files
  --model <model>         Model to download if no modelPath is provided
  --model-prefix <model-prefix>
                          Text to add in front of the model name to specify
                          between different types of the same variant (values:
                          "openai", "distil") (default: openai)
  --download-model-path <download-model-path>
                          Path to save the downloaded model
  --download-tokenizer-path <download-tokenizer-path>
                          Path to save the downloaded tokenizer files
  --audio-encoder-compute-units <audio-encoder-compute-units>
                          Compute units for audio encoder model with
                          {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine,random}
                          (values: all, cpuAndGPU, cpuOnly, cpuAndNeuralEngine,
                          random; default: cpuAndNeuralEngine)
  --text-decoder-compute-units <text-decoder-compute-units>
                          Compute units for text decoder model with
                          {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine,random}
                          (values: all, cpuAndGPU, cpuOnly, cpuAndNeuralEngine,
                          random; default: cpuAndNeuralEngine)
  --verbose               Verbose mode
  --task <task>           Task to perform (transcribe or translate) (default:
                          transcribe)
  --language <language>   Language spoken in the audio
  --temperature <temperature>
                          Temperature to use for sampling (default: 0.0)
  --temperature-increment-on-fallback <temperature-increment-on-fallback>
                          Temperature to increase on fallbacks during decoding
                          (default: 0.2)
  --temperature-fallback-count <temperature-fallback-count>
                          Number of times to increase temperature when falling
                          back during decoding (default: 5)
  --best-of <best-of>     Number of candidates when sampling with non-zero
                          temperature (default: 5)
  --use-prefill-prompt    Force initial prompt tokens based on language, task,
                          and timestamp options
  --use-prefill-cache     Use decoder prefill data for faster initial decoding
  --skip-special-tokens   Skip special tokens in the output
  --without-timestamps    Force no timestamps when decoding
  --word-timestamps       Add timestamps for each word in the output
  --prefix <prefix>       Force prefix text when decoding
  --prompt <prompt>       Condition on this text when decoding
  --compression-ratio-threshold <compression-ratio-threshold>
                          Gzip compression ratio threshold for decoding failure
  --logprob-threshold <logprob-threshold>
                          Average log probability threshold for decoding failure
  --first-token-log-prob-threshold <first-token-log-prob-threshold>
                          Log probability threshold for first token decoding
                          failure
  --no-speech-threshold <no-speech-threshold>
                          Probability threshold to consider a segment as silence
  --report                Output a report of the results
  --report-path <report-path>
                          Directory to save the report (default: .)
  --stream                Process audio directly from the microphone
  --stream-simulated      Simulate streaming transcription using the input
                          audio file
  --concurrent-worker-count <concurrent-worker-count>
                          Maximum concurrent inference, might be helpful when
                          processing more than 1 audio file at the same time. 0
                          means unlimited (default: 0)
  -h, --help              Show help information.
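
Going by the help text above, several options can be combined in one call. As a rough sketch, the function below builds a command that transcribes every file in a folder, fixes the language, and writes a report. transcribe_folder, MODEL_PATH and RUN are names made up for this note. Passing the language as the code ja is an assumption based on the `<|ja|>` token in the streaming output; the help text does not say which format --language expects.

```shell
#!/bin/sh
# Hypothetical default; override MODEL_PATH to point at another model directory.
MODEL_PATH="${MODEL_PATH:-Models/whisperkit-coreml/openai_whisper-large-v3}"

# Build a transcribe command for a folder of audio files.
# Prints the command by default; executes it only when RUN=1.
transcribe_folder() {
  folder="$1"
  report_dir="${2:-.}"   # same default as --report-path in the help output
  set -- swift run whisperkit-cli transcribe \
    --model-path "$MODEL_PATH" \
    --audio-folder "$folder" \
    --language ja \
    --report --report-path "$report_dir"
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}
```

For example, `transcribe_folder ./audio ./reports` prints the full command, and `RUN=1 transcribe_folder ./audio ./reports` would actually run it.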