Kaldi GMM-HMM pipeline
As the baseline for ASR, the GMM-HMM system usually also provides the alignments for the later TDNN training. Let's walk through this pipeline in detail.
steps/train_mono.sh
steps/train_mono.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono || exit 1;
Flat start and monophone training, with delta-delta features.
Run log:
steps/train_mono.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
...
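The "Aligning data equally (pass 0)" step can be pictured with a small sketch: with no trained model yet, each utterance's frames are simply divided evenly over the units of its transcript. This is a simplification (Kaldi splits over HMM states, not whole phones), and the phone names here are made up:

```python
# Toy sketch of the flat-start "equal alignment": split num_frames evenly
# over the transcript units, giving earlier units the remainder frames.
def equal_align(num_frames, phones):
    base, rem = divmod(num_frames, len(phones))
    ali = []
    for i, p in enumerate(phones):
        ali.extend([p] * (base + (1 if i < rem else 0)))
    return ali

print(equal_align(7, ["sil", "n", "i", "h", "ao"]))
# → ['sil', 'sil', 'n', 'n', 'i', 'h', 'ao']
```

This crude alignment is only the starting point; the realignment passes below quickly replace it.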
utils/mkgraph.sh
# This script creates a fully expanded decoding graph (HCLG) that represents
# all the language-model, pronunciation dictionary (lexicon), context-dependency,
# and HMM structure in our model. The output is a Finite State Transducer
# that has word-ids on the output, and pdf-ids on the input (these are indexes
# that resolve to Gaussian Mixture Models).
mkgraph.sh needs, under lang_test: L.fst, G.fst, phones.txt, words.txt, phones/silence.csl and phones/disambig.int,
plus tree and final.mdl under exp/tri.
tree-info exp/mono/tree
tree-info exp/mono/tree
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst
fstisstochastic data/lang_test/tmp/LG.fst
-0.0663446 -0.0666824
[info]: LG not stochastic.
fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test/phones/disambig.int --write-disambig-syms=data/lang_test/tmp/disambig_ilabels_1_0.int data/lang_test/tmp/ilabels_1_0.105996 data/lang_test/tmp/LG.fst
fstisstochastic data/lang_test/tmp/CLG_1_0.fst
-0.0663446 -0.0666824
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono/graph/disambig_tid.int --transition-scale=1.0 data/lang_test/tmp/ilabels_1_0 exp/mono/tree exp/mono/final.mdl
fsttablecompose exp/mono/graph/Ha.fst data/lang_test/tmp/CLG_1_0.fst
fstrmepslocal
fstrmsymbols exp/mono/graph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/mono/graph/HCLGa.fst
0.000157882 -0.132761
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono/final.mdl exp/mono/graph/HCLGa.fst
As the log shows, L_disambig.fst and G.fst are first composed into LG.fst.
LG.fst is then composed with the context FST described by ilabels_1_0.105996 and the disambiguation symbols, forming CLG_1_0.fst.
make-h-transducer then builds Ha.fst,
which is composed with CLG_1_0.fst to give HCLGa.fst.
Finally, add-self-loops produces the final HCLG.fst.
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph
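The composition chain above can be illustrated with a toy transducer composition in Python. This is a bare-bones product construction only, with none of the machinery real composition needs (no epsilon handling, no weights, no determinization), and the phone/word symbols are made up:

```python
# Toy transducer composition, in the spirit of the L o G step of mkgraph.sh.
# An FST is a dict: "trans" is a list of (src, in_label, out_label, dst)
# with state 0 as the start, and "final" is the set of final states.
def compose(a, b):
    """Match a's output labels against b's input labels; states of the
    result are pairs of states of the inputs (no epsilons, no weights)."""
    trans, final = [], set()
    for (s1, i1, o1, d1) in a["trans"]:
        for (s2, i2, o2, d2) in b["trans"]:
            if o1 == i2:
                trans.append(((s1, s2), i1, o2, (d1, d2)))
    for f1 in a["final"]:
        for f2 in b["final"]:
            final.add((f1, f2))
    return {"trans": trans, "final": final}

# L: a one-entry lexicon mapping phone "ax" to word "a"
L = {"trans": [(0, "ax", "a", 1)], "final": {1}}
# G: a one-word grammar accepting "a"
G = {"trans": [(0, "a", "a", 1)], "final": {1}}

LG = compose(L, G)
print(LG["trans"])  # → [((0, 0), 'ax', 'a', (1, 1))]
```

The resulting LG maps phone sequences directly to word sequences, which is exactly why the later H and C compositions can end up with pdf-ids on the input and word-ids on the output.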
steps/decode.sh
steps/decode.sh --cmd slurm.pl --mem 4G --config conf/decode.config --nj 10 exp/mono/graph data/dev exp/mono/decode_dev
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd slurm.pl --mem 4G exp/mono/graph exp/mono/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,16,133) and mean=49.3
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_lattice_depth_stats.log
+ steps/score_kaldi.sh --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ steps/scoring/score_kaldi_cer.sh --stage 2 --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh --stage 2 --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ echo 'local/score.sh: Done'
local/score.sh: Done
Usage:
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/mono/graph data/dev exp/mono/decode_dev
steps/align_si.sh
steps/align_si.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono exp/mono_ali
Get alignments from the monophone system.
steps/align_si.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono exp/mono_ali
steps/align_si.sh: feature type is delta
steps/align_si.sh: aligning data in data/train using model from exp/mono, putting alignments in exp/mono_ali
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/mono_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/mono_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.
Align the data with the trained monophone model.
The ali.*.gz files store, for each utterance, a sequence of transition-ids. Each transition-id corresponds one-to-one to a (transition-state, transition-index) pair,
and each transition-state corresponds one-to-one to a (phone-id, hmm-state, pdf-id) triple.
The hmm-state is typically one of (0, 1, 2) (a 3-state topology), and the pdf-id is the index of a leaf of the decision tree, i.e. one distinct acoustic class.
A given hmm-state has two pdfs, (forward pdf, self-loop pdf); normally they are equal, both being that state's pdf-id,
but in chain models they differ.
The transition-index records whether the frame was produced by the self-loop or by the forward transition.
From this we can see that the alignment file maps every feature vector to a concrete phone state.
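The index chain above can be made concrete with a toy lookup. The tables here are hypothetical, hard-coded for a single 3-state phone (real Kaldi derives them from the topo and the tree), and the self-loop/forward ordering within a transition-state is illustrative:

```python
# Toy illustration of transition-id -> (transition-state, transition-index)
#                                   -> (phone, hmm-state, pdf-id).
# One transition-state per (phone, hmm-state); pdf-ids are tree-leaf indexes.
transition_states = {
    1: ("a", 0, 100),   # (phone, hmm-state, pdf-id)
    2: ("a", 1, 101),
    3: ("a", 2, 102),
}

def decode_transition_id(tid):
    """Each transition-state owns 2 transition-ids here:
    index 0 = self-loop, index 1 = forward (ordering is illustrative)."""
    tstate = (tid - 1) // 2 + 1
    tindex = (tid - 1) % 2
    phone, hmm_state, pdf = transition_states[tstate]
    return {"phone": phone, "hmm_state": hmm_state, "pdf_id": pdf,
            "kind": "self-loop" if tindex == 0 else "forward"}

# a fake 4-frame alignment: phone "a", three frames in hmm-state 0
# (two self-loops, then the forward transition), then one frame in state 1
for tid in [1, 1, 2, 4]:
    print(decode_transition_id(tid))
```

In a non-chain model both pdfs of a state resolve to the same pdf-id, which is why this toy stores a single pdf per transition-state.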
steps/train_deltas.sh
steps/train_deltas.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/mono_ali exp/tri1
Arguments:
numleaves=$1
totgauss=$2
data=$3
lang=$4
alidir=$5
dir=$6
train_deltas.sh takes numleaves (the number of tree leaves), totgauss (the total number of Gaussians), the training data, lang (only the lexicon is used; G.fst is not needed), and alidir (the alignment directory);
the final argument, dir, is the output directory.
for f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do
[ ! -f $f ] && echo "train_deltas.sh: no such file $f" && exit 1;
done
Run log:
steps/train_deltas.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/mono_ali exp/tri1
steps/train_deltas.sh: accumulating tree stats
steps/train_deltas.sh: getting questions for tree-building, via clustering
steps/train_deltas.sh: building the tree
steps/train_deltas.sh: converting alignments from exp/mono_ali to use current tree
steps/train_deltas.sh: compiling graphs of transcripts
steps/train_deltas.sh: training pass 1
steps/train_deltas.sh: training pass 2
...
steps/train_deltas.sh: training pass 10
steps/train_deltas.sh: aligning data
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri1
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri1/log/analyze_alignments.log
1 warnings in exp/tri1/log/compile_questions.log
846 warnings in exp/tri1/log/acc.*.*.log
3076 warnings in exp/tri1/log/align.*.*.log
1 warnings in exp/tri1/log/build_tree.log
exp/tri1: nj=10 align prob=-79.45 over 150.17h [retry=0.5%, fail=0.0%] states=2064 gauss=20050 tree-impr=4.49
steps/train_deltas.sh: Done training system with delta+delta-delta features in exp/tri1
The overall flow is:
accumulating tree stats
Via acc-tree-stats and sum-tree-stats: using the alignments, accumulate feature statistics for each monophone in each of its contexts.
getting questions for tree-building, via clustering
Via cluster-phones and compile-questions: generate the decision-tree questions by clustering the phones.
building the tree
Via build-tree.
gmm-init-model and gmm-mixup
Using the earlier alignments, build the initial model: roughly, take the mean and variance of the features assigned to each decision-tree leaf as that state's initial model.
converting alignments from $alidir to use current tree
Via convert-ali (which uses $alidir/final.mdl, $dir/1.mdl and $dir/tree to turn $alidir/ali.*.gz into $dir/ali.*.gz), remapping each utterance's original alignment onto the new tree.
compiling graphs of transcripts
Via compile-train-graphs, which produces $dir/fsts.JOB.gz: each utterance's transcript is mapped through the lexicon and, combined with the context, topo and tree, compiled into an FST whose basic unit is the HMM state.
Training then iterates over the alignments on these graphs.
Next comes the iterative training proper: 35 iterations by default, with realignment (a new ali.*.gz) at iterations 10, 20 and 30.
On the other iterations the alignment stays fixed: gmm-acc-stats-ali accumulates the per-state statistics (the E step) and gmm-est re-estimates the GMM of every state (the M step).
incgauss=$[($totgauss-$numgauss)/$max_iter_inc], where max_iter_inc=25 and numgauss starts equal to numleaves; the number of Gaussians grows over the first 25 iterations.
So training starts with a single Gaussian per leaf and adds incgauss Gaussians per iteration, up to iteration 25.
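The mixing-up schedule can be re-derived with the numbers from this run (numleaves=2500, totgauss=20000; max_iter_inc=25 and the 35-iteration count are the script defaults mentioned above):

```python
# Re-deriving the Gaussian mixing-up schedule of train_deltas.sh in Python.
numleaves, totgauss, max_iter_inc, num_iters = 2500, 20000, 25, 35

numgauss = numleaves                                # start: 1 Gaussian per leaf
incgauss = (totgauss - numgauss) // max_iter_inc    # Gaussians added per iteration
for it in range(1, num_iters + 1):
    if it <= max_iter_inc:                          # growth stops after iter 25
        numgauss += incgauss
print(incgauss, numgauss)  # → 700 20000
```

This matches the log above to within rounding: the final model reports gauss=20050, essentially the requested totgauss.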
OK, now for some intuition about why train_deltas works better.
Each monophone is pronounced differently depending on its context, and train_mono fits all of those variants with a single GMM, which is asking a lot of the model.
The triphone system divides and conquers: each distinct context gets its own GMM. For example, one GMM models w_a_n and another models b_a_i, and both of their outputs ultimately map back to the phone a.
By using different models for the different realizations of the same monophone, we resolve its variability, and frames that truly belong to a are more easily aligned to a.
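A quick numerical sketch of this intuition with synthetic 1-D "features" (the cluster means, variances and context names are all made up): frames of one phone in two contexts form two clusters, and two context-dependent Gaussians fit them better than one pooled Gaussian.

```python
# Pooled vs. per-context Gaussians on synthetic two-cluster data.
import math
import random

random.seed(0)

def fit(xs):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def avg_loglik(xs, m, v):
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x in xs) / len(xs)

ctx1 = [random.gauss(-2.0, 0.5) for _ in range(500)]   # e.g. "a" in w_a_n
ctx2 = [random.gauss(+2.0, 0.5) for _ in range(500)]   # e.g. "a" in b_a_i

pooled = avg_loglik(ctx1 + ctx2, *fit(ctx1 + ctx2))          # train_mono style
split = (avg_loglik(ctx1, *fit(ctx1))
         + avg_loglik(ctx2, *fit(ctx2))) / 2                 # triphone style
print(pooled < split)  # → True: per-context models fit the frames better
```

The pooled model is forced to put its mass between the clusters, so every frame is poorly explained; the per-context models sit on top of their own cluster.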
steps/train_lda_mllt.sh
LDA+MLLT refers to the way we transform the features after computing
the MFCCs: we splice across several frames, reduce the dimension (to 40
by default) using Linear Discriminant Analysis, and then later estimate,
over multiple iterations, a diagonalizing transform known as MLLT or STC.
See http://kaldi-asr.org/doc/transform.html for more explanation.
In other words: after the MFCCs are extracted, several neighboring frames are spliced together, the dimension is reduced to 40 with LDA, and then, over multiple iterations, a diagonalizing transform (MLLT/STC) is estimated on top.
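A minimal sketch of the splice-then-project front end, assuming 13-dim MFCCs, ±4 frames of context, and a 40-dim output. The projection matrix here is random, purely to show the shapes; the real LDA(+MLLT) matrix is estimated from per-class statistics:

```python
# Splice neighboring frames, then apply a linear projection (toy sizes).
import random

random.seed(0)
FEAT_DIM, CONTEXT, OUT_DIM = 13, 4, 40

def splice(frames, context=CONTEXT):
    """Concatenate each frame with its +/-context neighbors; edges are
    padded by repeating the first/last frame."""
    n = len(frames)
    out = []
    for t in range(n):
        spliced = []
        for k in range(t - context, t + context + 1):
            spliced.extend(frames[min(max(k, 0), n - 1)])
        out.append(spliced)
    return out

def project(frames, matrix):
    """Apply a linear transform (stand-in for the LDA+MLLT matrix)."""
    return [[sum(m * x for m, x in zip(row, f)) for row in matrix]
            for f in frames]

frames = [[random.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(100)]
spliced = splice(frames)                          # 100 frames of 13*9 = 117 dims
lda = [[random.gauss(0, 0.1) for _ in range(FEAT_DIM * (2 * CONTEXT + 1))]
       for _ in range(OUT_DIM)]                   # random here, estimated in Kaldi
reduced = project(spliced, lda)                   # 100 frames of 40 dims
print(len(spliced[0]), len(reduced[0]))  # → 117 40
```

The GMMs of the LDA+MLLT system are then trained on these 40-dim vectors instead of the raw delta features.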
steps/train_lda_mllt.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri2_ali exp/tri3a
steps/train_lda_mllt.sh: Accumulating LDA statistics.
steps/train_lda_mllt.sh: Accumulating tree stats
steps/train_lda_mllt.sh: Getting questions for tree clustering.
steps/train_lda_mllt.sh: Building the tree
steps/train_lda_mllt.sh: Initializing the model
steps/train_lda_mllt.sh: Converting alignments from exp/tri2_ali to use current tree
steps/train_lda_mllt.sh: Compiling graphs of transcripts
Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a/log/analyze_alignments.log
333 warnings in exp/tri3a/log/acc.*.*.log
1422 warnings in exp/tri3a/log/align.*.*.log
7 warnings in exp/tri3a/log/lda_acc.*.log
1 warnings in exp/tri3a/log/build_tree.log
1 warnings in exp/tri3a/log/compile_questions.log
exp/tri3a: nj=10 align prob=-48.75 over 150.18h [retry=0.3%, fail=0.0%] states=2136 gauss=20035 tree-impr=5.07 lda-sum=24.62 mllt:impr,logdet=0.96,1.40
steps/train_lda_mllt.sh: Done training system with LDA+MLLT features in exp/tri3a
The difference from train_deltas is one extra step: after the raw features are read, a feature transform is applied, and training runs on the transformed features.
steps/align_fmllr.sh
Alignment, with three extra stages appended after the usual alignment steps:
steps/align_fmllr.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/tri3a exp/tri3a_ali
steps/align_fmllr.sh: feature type is lda
steps/align_fmllr.sh: compiling training graphs
steps/align_fmllr.sh: aligning data in data/train using exp/tri3a/final.mdl and speaker-independent features.
steps/align_fmllr.sh: computing fMLLR transforms
steps/align_fmllr.sh: doing final alignment.
steps/align_fmllr.sh: done aligning data.
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a_ali/log/analyze_alignments.log
283 warnings in exp/tri3a_ali/log/align_pass1.*.log
4 warnings in exp/tri3a_ali/log/fmllr.*.log
305 warnings in exp/tri3a_ali/log/align_pass2.*.log
The data is first aligned once to produce pre_ali.gz; pre_ali.gz is then used to compute the fMLLR transforms, written as trans.JOB;
finally the speaker-independent features (sifeat) are combined with trans to compute the final alignment.
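What trans.JOB contains is one affine feature transform per speaker. A minimal sketch of applying one, with a toy 3-dim feature and a made-up matrix (fMLLR stores a d x (d+1) matrix W acting on the extended vector [x; 1]):

```python
# Applying an fMLLR (constrained MLLR) transform: y = A x + b, with W = [A | b].
FEAT_DIM = 3  # toy dimension; the LDA+MLLT system above would use 40

def apply_fmllr(W, x):
    """Append 1 to the feature vector and multiply by the d x (d+1) matrix."""
    xa = x + [1.0]
    return [sum(w * v for w, v in zip(row, xa)) for row in W]

# hypothetical speaker transform: identity A, bias b = (1, 2, 3)
W = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 2.0],
     [0.0, 0.0, 1.0, 3.0]]
print(apply_fmllr(W, [0.5, -0.5, 0.0]))  # → [1.5, 1.5, 3.0]
```

The second alignment pass simply runs on these transformed features instead of the speaker-independent ones.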
steps/train_sat.sh
This script trains the speaker-adapted (SAT) system. As with train_lda_mllt, the change is again in how the features are transformed. It accepts alignments produced either with fMLLR features or
with raw MFCC features: it checks whether the alignment directory contains trans.* files, uses them if present, and otherwise computes the fMLLR trans files from the raw features itself.
The transformed features are then used for acc-tree-stats (for the decision-tree clustering) and build-tree, and all subsequent training likewise runs on the transformed features.
steps/train_sat.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a
steps/train_sat.sh: feature type is lda
steps/train_sat.sh: Using transforms from exp/tri3a_ali
steps/train_sat.sh: Accumulating tree stats
steps/train_sat.sh: Getting questions for tree clustering.
steps/train_sat.sh: Building the tree
steps/train_sat.sh: Initializing the model
steps/train_sat.sh: Converting alignments from exp/tri3a_ali to use current tree
steps/train_sat.sh: Compiling graphs of transcripts
Pass 1
Pass 2
Estimating fMLLR transforms
Pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri4a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri4a/log/analyze_alignments.log
1 warnings in exp/tri4a/log/build_tree.log
851 warnings in exp/tri4a/log/acc.*.*.log
1 warnings in exp/tri4a/log/compile_questions.log
53 warnings in exp/tri4a/log/fmllr.*.*.log
1850 warnings in exp/tri4a/log/align.*.*.log
steps/train_sat.sh: Likelihood evolution:
-49.2987 -49.1057 -48.9924 -48.802 -48.3585 -47.9407 -47.6175 -47.3928 -47.2062 -46.8194 -46.6762 -46.4616 -46.3543 -46.2751 -46.2059 -46.1392 -46.075 -46.0135 -45.9561 -45.825 -45.7553 -45.7127 -45.6753 -45.6406 -45.6078 -45.577 -45.5471 -45.517 -45.4879 -45.4149 -45.3743 -45.3532 -45.3395 -45.3303
exp/tri4a: nj=10 align prob=-48.28 over 150.17h [retry=0.4%, fail=0.0%] states=2152 gauss=20024 fmllr-impr=0.62 over 115.48h tree-impr=7.06
steps/train_sat.sh: done training SAT system in exp/tri4a