Speech recognition from the command line with julius
After trying a few things, I found that:
- The latest version of the julius engine itself is v4.6 (installable via Homebrew, etc.). It does not run on its own, though: an acoustic model, word dictionary, and language model are needed separately.
- A dictation kit that bundles the acoustic model, word dictionary, and language model is distributed, but its latest version is v4.5.
With this combination, things just did not work properly.
The dictation kit ships with a v4.5 julius binary, so I will use that instead.
For reference, here is how it fails with the v4.6 engine and the v4.5 dictation kit.
$ which julius
/usr/local/bin/julius
$ julius -help
Julius rev.4.6 - based on JuliusLib rev.4.6 (fast)
$ julius -nostrip -C `brew --prefix julius-dictation-kit`/share/main.jconf -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf
(snip)
STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Error: adin_darwin: cannot set InputUnit's EnableIO(Input)
ERROR: m_adin: failed to ready input device
$ ./bin/osx/julius -help
Julius rev.4.5 - based on JuliusLib rev.4.5 (fast)
$ ./bin/osx/julius -nostrip -C `brew --prefix julius-dictation-kit`/share/main.jconf -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf
(snip)
STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Stat: adin_darwin: using device "Built-in Microphone" for input
Stat: adin_darwin: sample rate 44100.000000
2 channels
32-bit sample
Stat: adin_darwin: 512 buffer frames
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 44100.000000
Stat: adin_darwin: format flags: [float][packed]
Stat: adin_darwin: bytes per packet: 8
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 8
Stat: adin_darwin: channels per frame: 2
Stat: adin_darwin: bits per channel: 32
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 44100.000000
Stat: adin_darwin: format flags: [signed integer][packed]
Stat: adin_darwin: bytes per packet: 2
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 2
Stat: adin_darwin: channels per frame: 1
Stat: adin_darwin: bits per channel: 16
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: input device's buffer size (# of samples): 512
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 16000.000000
Stat: adin_darwin: format flags: [signed integer][packed]
Stat: adin_darwin: bytes per packet: 2
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 2
Stat: adin_darwin: channels per frame: 1
Stat: adin_darwin: bits per channel: 16
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: CoreAudio: initialized
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension :
- Compiled by : gcc -g -O2
Library configuration: version 4.5
- Audio input
primary A/D-in driver : coreaudio (MacOSX CoreAudio)
available drivers :
wavefile formats : RAW and WAV only
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
SSE AVX FMA
FMA is available maximum on this cpu, use it
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/jnas-tri-3k16-gid.binhmm
hmmmapfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/logicalTri
Language Model:
- LM00 "_default"
vocabulary filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.htkdic
n-gram filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.bingram (binary format)
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = ON
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
initial mean from file = N/A
beginning data weight = 100.00
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 16 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use 3-best of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=n-gram
N-gram info:
spec = 3-gram, backward (right-to-left)
OOV word = <unk>(id=2)
wordset size = 59084
1-gram entries = 59084 ( 0.5 MB)
2-gram entries = 2476660 ( 27.7 MB) (64% are valid contexts)
3-gram entries = 7894442 ( 52.8 MB)
LR 2-gram entries= 2476660 ( 9.7 MB)
pass1 = given additional forward 2-gram
Vocabulary Info:
vocabulary size = 64274 words, 366102 models
average word len = 5.7 models, 17.1 states
maximum state num = 54 nodes per word
transparent words = not exist
words under class = 9444 words
Parameters:
(-silhead)head sil word = 0: "<s> @0.000000 [] silB(silB)"
(-siltail)tail sil word = 1: "</s> @0.000000 [。] silE(silE)"
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 415714
root node num = 632
(148 hi-freq. words are separated from tree lexicon)
leaf node num = 64274
fact. node num = 64274
Inter-word N-gram cache:
root node to be cached = 195 / 631 (isolated only)
word ends to be cached = 59084 (all)
max. allocation size = 46MB
(-lmp) pass1 LM weight = 8.0 ins. penalty = -2.0
(-lmp2) pass2 LM weight = 8.0 ins. penalty = -2.0
(-transp)trans. penalty = +0.0 per word
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 1500
(-bs)score pruning thres= disabled
(-n)search candidate num= 30
(-s) search stack size = 500
(-m) search overflow = after 10000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 100
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 30 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use 3-best of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
factoring score: 1-gram prob. (statically assigned beforehand)
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = off
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = < 800 msec
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
------
### read waveform input
STAT: AD-in thread created
<<< please speak >>>
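As an untested aside: the v4.6 failure above happens when initializing the microphone (EnableIO on the CoreAudio input unit), so feeding a pre-recorded file instead of the microphone might sidestep it. Julius can read audio files with -input rawfile; the models used here expect 16 kHz, 16-bit, mono audio, so a recording would need to be converted first. The file names below are placeholders:
$ ffmpeg -i recording.m4a -ar 16000 -ac 1 -acodec pcm_s16le sample.wav
$ julius -nostrip -C `brew --prefix julius-dictation-kit`/share/main.jconf -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf -input rawfile
With -input rawfile and no file list, julius prompts on standard input for the name of the file to recognize.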
Downloading the dictation kit
There are three Julius speech recognition packages:
- Dictation kit (dictation-kit)
  - General-purpose models
  - Provides two acoustic models, a DNN-HMM and a GMM-HMM, trained on JNAS and the simulated-lecture data of the Corpus of Spontaneous Japanese (CSJ)
  - A general-purpose language model built from NINJAL's BCCWJ
- Spontaneous speech model kit (ssr-kit)
  - Models aimed at spontaneous speech recognition
  - DNN-HMM acoustic model trained on JNAS and the CSJ simulated-lecture data
  - Language model built from the CSJ simulated-lecture and academic-presentation data
- Lecture speech model kit (lsr-kit)
  - Models aimed at lectures in large rooms and similar settings
  - DNN-HMM acoustic model trained on the CSJ academic-presentation data
  - Language model built from the CSJ simulated-lecture and academic-presentation data
This time I will use the dictation kit.
$ wget https://osdn.net/dl/julius/dictation-kit-4.5.zip
$ unzip dictation-kit-4.5.zip
$ cd dictation-kit-4.5
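For orientation, these are the files in the kit that the commands and logs in this note touch (a partial listing; the split of settings between the jconf files is inferred from how they are used below, not from the kit's documentation):
dictation-kit-4.5/
  bin/osx/julius    # bundled julius v4.5 binary for macOS
  main.jconf        # common settings and the language model (bccwj.60k)
  am-gmm.jconf      # selects the GMM-HMM acoustic model
  am-dnn.jconf      # selects the DNN-HMM acoustic model
  julius.dnnconf    # DNN configuration passed with -dnnconf
  model/phone_m/    # acoustic model files
  model/lang_m/     # language model and dictionary files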
Open the Mac's Sound settings and set the input to the built-in microphone. In my case I normally use a Jabra SPEAK 510, but it was not recognized as an input device.
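As an aside, if you want to check from the terminal which audio devices macOS currently exposes (this only lists them, it does not change the selection), the standard system_profiler command can do it:
$ system_profiler SPAudioDataType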
The bundled julius binaries have no execute permission, so add it:
$ chmod +x ./bin/osx/*
Run it. Start with the GMM version.
$ ./bin/osx/julius -C main.jconf -C am-gmm.jconf -nostrip
If the following appears, julius is waiting for speech input.
<<< please speak >>>
Say something into the microphone. I tried 「今日はいいお天気ですね」 ("Nice weather today, isn't it?").
pass1_best: 今日 は 理論 的 です ね 。
pass1_best_wordseq: <s> 今日+名詞 は+助詞 理論+名詞 的+接尾辞 です+助動詞 ね+助詞 </s>
pass1_best_phonemeseq: silB | ky o: | w a | r i r o N | t e k i | d e s u | n e | silE
pass1_best_score: -4753.997559
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 85769 generated, 2579 pushed, 560 nodes popped in 178
sentence1: 今日 は いい お 天気 です ね 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 お+接頭辞 天気+名詞 です+助動詞 ね+助詞 </s>
phseq1: silB | ky o: | w a | i: | o | t e N k i | d e s u | n e | silE
cmscore1: 0.670 0.249 0.146 0.085 0.136 0.626 0.693 0.044 1.000
score1: -4751.155762
The first pass (pass1_best) gets it wrong, but the final second-pass result (sentence1) is correct. Next, the DNN version.
$ ./bin/osx/julius -C main.jconf -C am-dnn.jconf -dnnconf julius.dnnconf -nostrip
This one is also recognized correctly.
pass1_best: 今日 は いい の 元気 です ね 。
pass1_best_wordseq: <s> 今日+名詞 は+助詞 いい+形容詞 の+助詞 元気+名詞 です+助動詞 ね+助詞 </s>
pass1_best_phonemeseq: sp_S | ky_B o:_E | w_B a_E | i:_S | n_B o_E | g_B e_I N_I k_I i_E | d_B e_I s_I u_E | n_B e_E | sp_S
pass1_best_score: 116.675194
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 29090 generated, 2051 pushed, 437 nodes popped in 187
sentence1: 今日 は いい お 天気 です ね 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 お+接頭辞 天気+名詞 です+助動詞 ね+助詞 </s>
phseq1: sp_S | ky_B o:_E | w_B a_E | i:_S | o_S | t_B e_I N_I k_I i_E | d_B e_I s_I u_E | n_B e_E | sp_S
cmscore1: 0.699 0.767 0.464 0.747 0.641 0.451 0.938 0.939 1.000
score1: 139.660751
From what I tried, the DNN version seemed to have higher recognition accuracy than the GMM version.