
Speech recognition from the command line with julius

kun432

https://julius.osdn.jp/

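I tried various setups, and found that with this combination:

  • The latest version of julius itself is v4.6 (installable via homebrew and similar). It does not run on its own, though: an acoustic model, a word dictionary, and a language model are required separately.
  • A dictation kit containing the acoustic model, word dictionary, and language model is distributed, but its latest version is v4.5.

things did not work properly for me.

The dictation kit ships with a v4.5 build of julius, so I will use that one for testing.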
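
Because two version numbers are in play here, it can help to read the revision off the `julius -help` banner before pairing a binary with a kit. The following is a minimal Python sketch, assuming the banner has the form `Julius rev.X.Y - based on JuliusLib rev.X.Y (fast)` as printed in this scrap. `parse_julius_version` and `KIT_VERSION` are hypothetical names used for illustration, not part of Julius.

```python
import re

# Kit version used in this scrap (dictation-kit v4.5).
KIT_VERSION = (4, 5)


def parse_julius_version(banner):
    """Return the Julius revision found in a banner line as a tuple of ints, or None."""
    match = re.search(r"\bJulius rev\.(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Banner lines copied from the `julius -help` output shown in this scrap.
banners = {
    "homebrew julius": "Julius rev.4.6 - based on JuliusLib rev.4.6 (fast)",
    "kit-bundled julius": "Julius rev.4.5 - based on JuliusLib rev.4.5 (fast)",
}

for name, banner in banners.items():
    version = parse_julius_version(banner)
    status = "same as kit" if version == KIT_VERSION else "differs from kit"
    print(f"{name}: {version} ({status})")
```

A version difference alone does not prove incompatibility. The check only flags the combination that failed in this scrap.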

For reference, here is how it fails when julius itself is v4.6 and the dictation kit is v4.5.

$ which julius
/usr/local/bin/julius

$ julius -help
Julius rev.4.6 - based on JuliusLib rev.4.6 (fast)

$ julius -nostrip -C `brew --prefix julius-dictation-kit`/share/main.jconf   -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf

(snip)

STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Error: adin_darwin: cannot set InputUnit's EnableIO(Input)
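ERROR: m_adin: failed to ready input device

With the v4.5 binary bundled in the kit, by contrast, the input device initializes and julius waits for speech:

$ ./bin/osx/julius -help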
Julius rev.4.5 - based on JuliusLib rev.4.5 (fast)

$ ./bin/osx/julius -nostrip -C `brew --prefix julius-dictation-kit`/share/main.jconf   -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf


(snip)

STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Stat: adin_darwin: using device "Built-in Microphone" for input
Stat: adin_darwin: sample rate 44100.000000
	2 channels
	32-bit sample
Stat: adin_darwin: 512 buffer frames
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 44100.000000
Stat: adin_darwin: format flags: [float][packed]
Stat: adin_darwin: bytes per packet: 8
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 8
Stat: adin_darwin: channels per frame: 2
Stat: adin_darwin: bits per channel: 32
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 44100.000000
Stat: adin_darwin: format flags: [signed integer][packed]
Stat: adin_darwin: bytes per packet: 2
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 2
Stat: adin_darwin: channels per frame: 1
Stat: adin_darwin: bits per channel: 16
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: input device's buffer size (# of samples): 512
Stat: adin_darwin: ----- details of stream -----
Stat: adin_darwin: sample rate: 16000.000000
Stat: adin_darwin: format flags: [signed integer][packed]
Stat: adin_darwin: bytes per packet: 2
Stat: adin_darwin: frames per packet: 1
Stat: adin_darwin: bytes per frame: 2
Stat: adin_darwin: channels per frame: 1
Stat: adin_darwin: bits per channel: 16
Stat: adin_darwin: -----------------------------------
Stat: adin_darwin: CoreAudio: initialized
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    :
 -  Compiled by  : gcc -g -O2
Library configuration: version 4.5
 - Audio input
    primary A/D-in driver   : coreaudio (MacOSX CoreAudio)
    available drivers       :
    wavefile formats        : RAW and WAV only
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : short (2 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    FMA is available maximum on this cpu, use it


------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
	hmmfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/jnas-tri-3k16-gid.binhmm
	hmmmapfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/logicalTri

 Language Model:
 - LM00 "_default"
	vocabulary filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.htkdic
	n-gram  filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.bingram (binary format)

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
	       parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
	sample frequency = 16000 Hz
	   sample period =  625  (1 = 100ns)
	     window size =  400 samples (25.0 ms)
	     frame shift =  160 samples (10.0 ms)
	    pre-emphasis = 0.97
	    # filterbank = 24
	   cepst. lifter = 22
	      raw energy = False
	energy normalize = False
	    delta window = 2 frames (20.0 ms) around
	     hi freq cut = OFF
	     lo freq cut = OFF
	 zero mean frame = ON
	       use power = OFF
	             CVN = OFF
	            VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
  initial mean from file = N/A
   beginning data weight = 100.00
 cep. var. normalization = no

	 base setup from = Julius defaults

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
	      model type = context dependency handling ON
      training parameter = MFCC_E_N_D_Z
	   vector length = 25
	number of stream = 1
	     stream info = [0-24]
	cov. matrix type = DIAGC
	   duration type = NULLD
	max mixture size = 16 Gaussians
     max length of model = 5 states
     logical base phones = 43
       model skip trans. = not exist, no multi-path handling

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use 3-best of same LC)

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=n-gram

 N-gram info:
	            spec = 3-gram, backward (right-to-left)
	        OOV word = <unk>(id=2)
	    wordset size = 59084
	  1-gram entries =      59084  (  0.5 MB)
	  2-gram entries =    2476660  ( 27.7 MB) (64% are valid contexts)
	  3-gram entries =    7894442  ( 52.8 MB)
	LR 2-gram entries=    2476660  (  9.7 MB)
	           pass1 = given additional forward 2-gram

 Vocabulary Info:
        vocabulary size  = 64274 words, 366102 models
        average word len = 5.7 models, 17.1 states
       maximum state num = 54 nodes per word
       transparent words = not exist
       words under class = 9444 words

 Parameters:
	(-silhead)head sil word = 0: "<s> @0.000000 [] silB(silB)"
	(-siltail)tail sil word = 1: "</s> @0.000000 [。] silE(silE)"

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
	 total node num = 415714
	  root node num =    632
	(148 hi-freq. words are separated from tree lexicon)
	  leaf node num =  64274
	 fact. node num =  64274

 Inter-word N-gram cache:
	root node to be cached = 195 / 631 (isolated only)
	word ends to be cached = 59084 (all)
	  max. allocation size = 46MB
	(-lmp)  pass1 LM weight = 8.0  ins. penalty = -2.0
	(-lmp2) pass2 LM weight = 8.0  ins. penalty = -2.0
	(-transp)trans. penalty = +0.0 per word
	(-cmalpha)CM alpha coef = 0.050000

 Search parameters:
	    multi-path handling = no
	(-b) trellis beam width = 1500
	(-bs)score pruning thres= disabled
	(-n)search candidate num= 30
	(-s)  search stack size = 500
	(-m)    search overflow = after 10000 hypothesis poped
	        2nd pass method = searching sentence, generating N-best
	(-b2)  pass2 beam width = 100
	(-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
	(-sb)2nd scan beamthres = 80.0 (in logscore)
	(-n)        search till = 30 candidates found
	(-output)    and output = 1 candidates out of above
	 IWCD handling:
	   1st pass: approximation (use 3-best of same LC)
	   2nd pass: loose (apply when hypo. is popped and scanned)
	 factoring score: 1-gram prob. (statically assigned beforehand)
	short pause segmentation = off
	fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

	1st pass input processing = real time, on-the-fly
	1st pass method = 1-best approx. generating indexed trellis
	output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
	             input type = waveform
	           input source = microphone
	    device API          = default
	          sampling freq. = 16000 Hz
	         threaded A/D-in = supported, on
	   zero frames stripping = off
	         silence cutting = on
	             level thres = 2000 / 32767
	         zerocross thres = 60 / sec.
	             head margin = 300 msec.
	             tail margin = 400 msec.
	              chunk size = 1000 samples
	       FVAD switch value = -1 (disabled)
	    long-term DC removal = off
	    level scaling factor = 1.00 (disabled)
	      reject short input = < 800 msec
	      reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
	*************************************************************
	* Cepstral mean normalization for real-time decoding:       *
	* NOTICE: The first input may not be recognized, since      *
	*         no initial mean is available on startup.          *
	*************************************************************

------
### read waveform input
STAT: AD-in thread created
<<< please speak >>>
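
The sample counts and durations in the log above are all tied together by the 16000 Hz sampling rate. As a quick sanity check, the sketch below redoes that arithmetic in Python using only numbers copied from the log. The constant and function names are my own, chosen for illustration.

```python
# Values copied from the julius startup log in this scrap.
SAMPLE_RATE_HZ = 16000        # "sample frequency = 16000 Hz"
SAMPLE_PERIOD_UNITS = 625     # "sample period = 625 (1 = 100ns)"
WINDOW_SAMPLES = 400          # "window size = 400 samples (25.0 ms)"
SHIFT_SAMPLES = 160           # "frame shift = 160 samples (10.0 ms)"
MAX_INPUT_SAMPLES = 320000    # "max. length of an input : 320000 samples"


def samples_to_ms(samples, rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample count to milliseconds at the given sampling rate."""
    return samples * 1000 / rate_hz


# The sample period is given in units of 100 ns.
sample_period_ns = SAMPLE_PERIOD_UNITS * 100
print("sample period (ns):", sample_period_ns)
print("1 / sampling rate (ns):", 1_000_000_000 / SAMPLE_RATE_HZ)

print("window size (ms):", samples_to_ms(WINDOW_SAMPLES))
print("frame shift (ms):", samples_to_ms(SHIFT_SAMPLES))
print("max input length (s):", samples_to_ms(MAX_INPUT_SAMPLES) / 1000)

# Device-side stream in the log: 2 channels of 32-bit samples per frame.
device_bytes_per_frame = 2 * 32 // 8
print("device bytes per frame:", device_bytes_per_frame)
```

The last value matches the "bytes per frame: 8" line of the 44100 Hz device stream, which is then converted to 16-bit mono at 16000 Hz according to the log.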
kun432

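Downloading the dictation kit

https://julius.osdn.jp/index.php?q=dictation-kit.html

There are three Julius speech recognition packages.

  • Dictation kit (dictation-kit)
    • General-purpose models
    • Provides two acoustic models, a DNN-HMM and a GMM-HMM, trained on JNAS and the simulated public speaking data of the Corpus of Spontaneous Japanese
    • A general-purpose language model built from BCCWJ by the National Institute for Japanese Language and Linguistics
  • Spontaneous speech model kit (ssr-kit)
    • Models aimed at recognizing spontaneous speech
    • A DNN-HMM acoustic model trained on JNAS and the simulated public speaking data of the Corpus of Spontaneous Japanese
    • A language model built from the simulated public speaking data and academic presentation data of the Corpus of Spontaneous Japanese
  • Lecture speech model kit (lsr-kit)
    • Models targeting lectures given in large rooms and similar settings
    • A DNN-HMM acoustic model trained on the academic presentation data of the Corpus of Spontaneous Japanese
    • A language model built from the simulated public speaking data and academic presentation data of the Corpus of Spontaneous Japanese

This time I will use the dictation kit.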

$ wget https://osdn.net/dl/julius/dictation-kit-4.5.zip
$ unzip dictation-kit-4.5.zip
$ cd dictation-kit-4.5

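Open the Sound preferences on the Mac and set the input to the built-in microphone. I normally use a Jabra SPEAK 510, but in my case julius did not recognize it as an input device.

The bundled julius binaries do not have execute permission, so add it.

$ chmod +x ./bin/osx/*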

Run it. Start with the GMM version.

$ ./bin/osx/julius -C main.jconf -C am-gmm.jconf -nostrip

When the following is displayed, julius is waiting for speech input.

<<< please speak >>>

Say something. I tried "今日はいいお天気ですね" ("Nice weather today, isn't it?").

pass1_best:  今日 は 理論 的 です ね 。
pass1_best_wordseq: <s> 今日+名詞 は+助詞 理論+名詞 的+接尾辞 です+助動詞 ね+助詞 </s>
pass1_best_phonemeseq: silB | ky o: | w a | r i r o N | t e k i | d e s u | n e | silE
pass1_best_score: -4753.997559
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 85769 generated, 2579 pushed, 560 nodes popped in 178
sentence1:  今日 は いい お 天気 です ね 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 お+接頭辞 天気+名詞 です+助動詞 ね+助詞 </s>
phseq1: silB | ky o: | w a | i: | o | t e N k i | d e s u | n e | silE
cmscore1: 0.670 0.249 0.146 0.085 0.136 0.626 0.693 0.044 1.000
score1: -4751.155762
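
When scripting around julius, usually only the recognized text matters, not the full result block. The sketch below pulls the `pass1_best:` and `sentence1:` fields out of captured output and joins the space-separated words back into a sentence. It assumes the output has already been captured as text in the format shown above. `extract_field` is a hypothetical helper, not something Julius provides.

```python
def extract_field(output, key):
    """Return the joined text of every line that starts with '<key>:'."""
    prefix = key + ":"
    results = []
    for line in output.splitlines():
        if line.startswith(prefix):
            # Words are space-separated in the output; Japanese needs no spaces.
            results.append("".join(line[len(prefix):].split()))
    return results


# Result lines copied from the GMM run in this scrap.
sample_output = """\
pass1_best:  今日 は 理論 的 です ね 。
pass1_best_wordseq: <s> 今日+名詞 は+助詞 理論+名詞 的+接尾辞 です+助動詞 ね+助詞 </s>
pass1_best_score: -4753.997559
sentence1:  今日 は いい お 天気 です ね 。
score1: -4751.155762
"""

print("1st pass:", extract_field(sample_output, "pass1_best"))
print("final   :", extract_field(sample_output, "sentence1"))
```

Matching on the key plus the colon keeps `pass1_best:` from also matching `pass1_best_wordseq:` and `pass1_best_score:`. In this run the first pass misheard part of the sentence, and the second pass corrected it.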

Next, the DNN version.

$ ./bin/osx/julius -C main.jconf -C am-dnn.jconf -dnnconf julius.dnnconf -nostrip

This one is recognized correctly as well.

pass1_best:  今日 は いい の 元気 です ね 。
pass1_best_wordseq: <s> 今日+名詞 は+助詞 いい+形容詞 の+助詞 元気+名詞 です+助動詞 ね+助詞 </s>
pass1_best_phonemeseq: sp_S | ky_B o:_E | w_B a_E | i:_S | n_B o_E | g_B e_I N_I k_I i_E | d_B e_I s_I u_E | n_B e_E | sp_S
pass1_best_score: 116.675194
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 29090 generated, 2051 pushed, 437 nodes popped in 187
sentence1:  今日 は いい お 天気 です ね 。
wseq1: <s> 今日+名詞 は+助詞 いい+形容詞 お+接頭辞 天気+名詞 です+助動詞 ね+助詞 </s>
phseq1: sp_S | ky_B o:_E | w_B a_E | i:_S | o_S | t_B e_I N_I k_I i_E | d_B e_I s_I u_E | n_B e_E | sp_S
cmscore1: 0.699 0.767 0.464 0.747 0.641 0.451 0.938 0.939 1.000
score1: 139.660751
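
The GMM and DNN invocations differ only in the acoustic model config and the extra `-dnnconf` option. If the two are to be switched from a script, the command line can be assembled as in the sketch below. The flags and file names are exactly those used in this scrap. `build_julius_command` is a hypothetical helper, and the relative paths assume the current directory is the unpacked `dictation-kit-4.5`.

```python
import shlex


def build_julius_command(acoustic_model="gmm", julius_bin="./bin/osx/julius"):
    """Build the julius argument list for the GMM or DNN setup used in this scrap."""
    if acoustic_model not in ("gmm", "dnn"):
        raise ValueError(f"unknown acoustic model: {acoustic_model!r}")
    cmd = [julius_bin, "-C", "main.jconf", "-C", f"am-{acoustic_model}.jconf"]
    if acoustic_model == "dnn":
        # The DNN run in this scrap additionally passes the DNN config file.
        cmd += ["-dnnconf", "julius.dnnconf"]
    cmd.append("-nostrip")
    return cmd


for model in ("gmm", "dnn"):
    print(shlex.join(build_julius_command(model)))
```

The returned list can be handed to `subprocess.run(cmd, cwd=...)` with `cwd` pointing at the kit directory. That part is left out here because it needs a working julius and microphone.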

After trying various utterances, my impression was that the DNN version recognizes speech more accurately.
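
The `cmscore1:` lines in the two result blocks above point in the same direction, at least for this one utterance. The sketch below averages the per-word confidence scores from each run. The score lines are copied from this scrap, and `parse_cmscores` is a hypothetical helper. Confidence scores are not accuracy, and a single sentence is not a benchmark, so this is only a rough illustration of the impression.

```python
from statistics import mean


def parse_cmscores(line):
    """Parse a 'cmscore1: ...' line into a list of per-word confidence floats."""
    prefix = "cmscore1:"
    if not line.startswith(prefix):
        raise ValueError("not a cmscore1 line")
    return [float(value) for value in line[len(prefix):].split()]


# cmscore1 lines copied from the GMM and DNN runs in this scrap.
gmm_line = "cmscore1: 0.670 0.249 0.146 0.085 0.136 0.626 0.693 0.044 1.000"
dnn_line = "cmscore1: 0.699 0.767 0.464 0.747 0.641 0.451 0.938 0.939 1.000"

gmm_scores = parse_cmscores(gmm_line)
dnn_scores = parse_cmscores(dnn_line)

print(f"GMM mean confidence: {mean(gmm_scores):.3f}")
print(f"DNN mean confidence: {mean(dnn_scores):.3f}")
```

Both runs produced the same final sentence, so the difference here is in how confident the decoder was, not in the transcript.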

This scrap was closed on 2023/03/22.