Kaldi GMM-HMM pipeline
As the baseline for ASR, the GMM-HMM system usually also provides the alignments for the later TDNN training. Let's walk through this pipeline in detail.
steps/train_mono.sh
steps/train_mono.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono || exit 1;
Flat start and monophone training, with delta-delta features.
Run log:
steps/train_mono.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
...
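The "Aligning data equally (pass 0)" step can be pictured with a small sketch: with no trained model yet, each utterance's frames are simply divided evenly over the units of its transcript. This is a simplification (Kaldi splits over HMM states, not whole phones), and the phone names here are made up:

```python
# Toy sketch of the flat-start "equal alignment": split num_frames evenly
# over the transcript units, giving earlier units the remainder frames.
def equal_align(num_frames, phones):
    base, rem = divmod(num_frames, len(phones))
    ali = []
    for i, p in enumerate(phones):
        ali.extend([p] * (base + (1 if i < rem else 0)))
    return ali

print(equal_align(7, ["sil", "n", "i", "h", "ao"]))
# → ['sil', 'sil', 'n', 'n', 'i', 'h', 'ao']
```

This crude alignment is only the starting point; the realignment passes below quickly replace it.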
utils/mkgraph.sh
# This script creates a fully expanded decoding graph (HCLG) that represents
# all the language-model, pronunciation dictionary (lexicon), context-dependency,
# and HMM structure in our model. The output is a Finite State Transducer
# that has word-ids on the output, and pdf-ids on the input (these are indexes
# that resolve to Gaussian Mixture Models).
mkgraph.sh needs, under lang_test: L.fst, G.fst, phones.txt, words.txt, phones/silence.csl and phones/disambig.int,
plus tree and final.mdl under exp/tri.
tree-info exp/mono/tree
tree-info exp/mono/tree
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst
fstisstochastic data/lang_test/tmp/LG.fst
-0.0663446 -0.0666824
[info]: LG not stochastic.
fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test/phones/disambig.int --write-disambig-syms=data/lang_test/tmp/disambig_ilabels_1_0.int data/lang_test/tmp/ilabels_1_0.105996 data/lang_test/tmp/LG.fst
fstisstochastic data/lang_test/tmp/CLG_1_0.fst
-0.0663446 -0.0666824
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono/graph/disambig_tid.int --transition-scale=1.0 data/lang_test/tmp/ilabels_1_0 exp/mono/tree exp/mono/final.mdl
fsttablecompose exp/mono/graph/Ha.fst data/lang_test/tmp/CLG_1_0.fst
fstrmepslocal
fstrmsymbols exp/mono/graph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/mono/graph/HCLGa.fst
0.000157882 -0.132761
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono/final.mdl exp/mono/graph/HCLGa.fst
As the log shows, L_disambig.fst and G.fst are first composed into LG.fst.
LG.fst is then composed with the context FST described by ilabels_1_0.105996 and the disambiguation symbols, forming CLG_1_0.fst.
make-h-transducer then builds Ha.fst,
which is composed with CLG_1_0.fst to give HCLGa.fst.
Finally, add-self-loops produces the final HCLG.fst.
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph
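The composition chain above can be illustrated with a toy transducer composition in Python. This is a bare-bones product construction only, with none of the machinery real composition needs (no epsilon handling, no weights, no determinization), and the phone/word symbols are made up:

```python
# Toy transducer composition, in the spirit of the L o G step of mkgraph.sh.
# An FST is a dict: "trans" is a list of (src, in_label, out_label, dst)
# with state 0 as the start, and "final" is the set of final states.
def compose(a, b):
    """Match a's output labels against b's input labels; states of the
    result are pairs of states of the inputs (no epsilons, no weights)."""
    trans, final = [], set()
    for (s1, i1, o1, d1) in a["trans"]:
        for (s2, i2, o2, d2) in b["trans"]:
            if o1 == i2:
                trans.append(((s1, s2), i1, o2, (d1, d2)))
    for f1 in a["final"]:
        for f2 in b["final"]:
            final.add((f1, f2))
    return {"trans": trans, "final": final}

# L: a one-entry lexicon mapping phone "ax" to word "a"
L = {"trans": [(0, "ax", "a", 1)], "final": {1}}
# G: a one-word grammar accepting "a"
G = {"trans": [(0, "a", "a", 1)], "final": {1}}

LG = compose(L, G)
print(LG["trans"])  # → [((0, 0), 'ax', 'a', (1, 1))]
```

The resulting LG maps phone sequences directly to word sequences, which is exactly why the later H and C compositions can end up with pdf-ids on the input and word-ids on the output.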
steps/decode.sh
steps/decode.sh --cmd slurm.pl --mem 4G --config conf/decode.config --nj 10 exp/mono/graph data/dev exp/mono/decode_dev
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd slurm.pl --mem 4G exp/mono/graph exp/mono/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,16,133) and mean=49.3
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_lattice_depth_stats.log
+ steps/score_kaldi.sh --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ steps/scoring/score_kaldi_cer.sh --stage 2 --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh --stage 2 --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ echo 'local/score.sh: Done'
local/score.sh: Done
Usage:
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/mono/graph data/dev exp/mono/decode_dev
steps/align_si.sh
steps/align_si.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono exp/mono_ali
Get alignments from the monophone system.
steps/align_si.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono exp/mono_ali
steps/align_si.sh: feature type is delta
steps/align_si.sh: aligning data in data/train using model from exp/mono, putting alignments in exp/mono_ali
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/mono_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/mono_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.
Align the data with the trained monophone model.
The ali.*.gz files store, for each utterance, a sequence of transition-ids. Each transition-id corresponds one-to-one to a (transition-state, transition-index) pair,
and each transition-state corresponds one-to-one to a (phone-id, hmm-state, pdf-id) triple.
The hmm-state is typically one of (0, 1, 2) (a 3-state topology), and the pdf-id is the index of a leaf of the decision tree, i.e. one distinct acoustic class.
A given hmm-state has two pdfs, (forward pdf, self-loop pdf); normally they are equal, both being that state's pdf-id,
but in chain models they differ.
The transition-index records whether the frame was produced by the self-loop or by the forward transition.
From this we can see that the alignment file maps every feature vector to a concrete phone state.
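The index chain above can be made concrete with a toy lookup. The tables here are hypothetical, hard-coded for a single 3-state phone (real Kaldi derives them from the topo and the tree), and the self-loop/forward ordering within a transition-state is illustrative:

```python
# Toy illustration of transition-id -> (transition-state, transition-index)
#                                   -> (phone, hmm-state, pdf-id).
# One transition-state per (phone, hmm-state); pdf-ids are tree-leaf indexes.
transition_states = {
    1: ("a", 0, 100),   # (phone, hmm-state, pdf-id)
    2: ("a", 1, 101),
    3: ("a", 2, 102),
}

def decode_transition_id(tid):
    """Each transition-state owns 2 transition-ids here:
    index 0 = self-loop, index 1 = forward (ordering is illustrative)."""
    tstate = (tid - 1) // 2 + 1
    tindex = (tid - 1) % 2
    phone, hmm_state, pdf = transition_states[tstate]
    return {"phone": phone, "hmm_state": hmm_state, "pdf_id": pdf,
            "kind": "self-loop" if tindex == 0 else "forward"}

# a fake 4-frame alignment: phone "a", three frames in hmm-state 0
# (two self-loops, then the forward transition), then one frame in state 1
for tid in [1, 1, 2, 4]:
    print(decode_transition_id(tid))
```

In a non-chain model both pdfs of a state resolve to the same pdf-id, which is why this toy stores a single pdf per transition-state.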
steps/train_deltas.sh
steps/train_deltas.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/mono_ali exp/tri1
Arguments:
numleaves=$1
totgauss=$2
data=$3
lang=$4
alidir=$5
dir=$6
train_deltas.sh takes numleaves (the number of tree leaves), totgauss (the total number of Gaussians), the training data, lang (only the lexicon is used; G.fst is not needed), and alidir (the alignment directory);
the final argument, dir, is the output directory.
for f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do
[ ! -f $f ] && echo "train_deltas.sh: no such file $f" && exit 1;
done
Run log:
steps/train_deltas.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/mono_ali exp/tri1
steps/train_deltas.sh: accumulating tree stats
steps/train_deltas.sh: getting questions for tree-building, via clustering
steps/train_deltas.sh: building the tree
steps/train_deltas.sh: converting alignments from exp/mono_ali to use current tree
steps/train_deltas.sh: compiling graphs of transcripts
steps/train_deltas.sh: training pass 1
steps/train_deltas.sh: training pass 2
...
steps/train_deltas.sh: training pass 10
steps/train_deltas.sh: aligning data
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri1
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri1/log/analyze_alignments.log
1 warnings in exp/tri1/log/compile_questions.log
846 warnings in exp/tri1/log/acc.*.*.log
3076 warnings in exp/tri1/log/align.*.*.log
1 warnings in exp/tri1/log/build_tree.log
exp/tri1: nj=10 align prob=-79.45 over 150.17h [retry=0.5%, fail=0.0%] states=2064 gauss=20050 tree-impr=4.49
steps/train_deltas.sh: Done training system with delta+delta-delta features in exp/tri1
The overall flow is:
accumulating tree stats
Via acc-tree-stats and sum-tree-stats: using the alignments, accumulate feature statistics for each monophone in each of its contexts.
getting questions for tree-building, via clustering
Via cluster-phones and compile-questions: generate the decision-tree questions by clustering the phones.
building the tree
Via build-tree.
gmm-init-model and gmm-mixup
Using the earlier alignments, build the initial model: roughly, take the mean and variance of the features assigned to each decision-tree leaf as that state's initial model.
converting alignments from $alidir to use current tree
Via convert-ali (which uses $alidir/final.mdl, $dir/1.mdl and $dir/tree to turn $alidir/ali.*.gz into $dir/ali.*.gz), remapping each utterance's original alignment onto the new tree.
compiling graphs of transcripts
Via compile-train-graphs, which produces $dir/fsts.JOB.gz: each utterance's transcript is mapped through the lexicon and, combined with the context, topo and tree, compiled into an FST whose basic unit is the HMM state.
Training then iterates over the alignments on these graphs.
Next comes the iterative training proper: 35 iterations by default, with realignment (a new ali.*.gz) at iterations 10, 20 and 30.
On the other iterations the alignment stays fixed: gmm-acc-stats-ali accumulates the per-state statistics (the E step) and gmm-est re-estimates the GMM of every state (the M step).
incgauss=$[($totgauss-$numgauss)/$max_iter_inc], where max_iter_inc=25 and numgauss starts equal to numleaves; the number of Gaussians grows over the first 25 iterations.
So training starts with a single Gaussian per leaf and adds incgauss Gaussians per iteration, up to iteration 25.
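The mixing-up schedule can be re-derived with the numbers from this run (numleaves=2500, totgauss=20000; max_iter_inc=25 and the 35-iteration count are the script defaults mentioned above):

```python
# Re-deriving the Gaussian mixing-up schedule of train_deltas.sh in Python.
numleaves, totgauss, max_iter_inc, num_iters = 2500, 20000, 25, 35

numgauss = numleaves                                # start: 1 Gaussian per leaf
incgauss = (totgauss - numgauss) // max_iter_inc    # Gaussians added per iteration
for it in range(1, num_iters + 1):
    if it <= max_iter_inc:                          # growth stops after iter 25
        numgauss += incgauss
print(incgauss, numgauss)  # → 700 20000
```

This matches the log above to within rounding: the final model reports gauss=20050, essentially the requested totgauss.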
OK, now for some intuition about why train_deltas works better.
Each monophone is pronounced differently depending on its context, and train_mono fits all of those variants with a single GMM, which is asking a lot of the model.
The triphone system divides and conquers: each distinct context gets its own GMM. For example, one GMM models w_a_n and another models b_a_i, and both of their outputs ultimately map back to the phone a.
By using different models for the different realizations of the same monophone, we resolve its variability, and frames that truly belong to a are more easily aligned to a.
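A quick numerical sketch of this intuition with synthetic 1-D "features" (the cluster means, variances and context names are all made up): frames of one phone in two contexts form two clusters, and two context-dependent Gaussians fit them better than one pooled Gaussian.

```python
# Pooled vs. per-context Gaussians on synthetic two-cluster data.
import math
import random

random.seed(0)

def fit(xs):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def avg_loglik(xs, m, v):
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x in xs) / len(xs)

ctx1 = [random.gauss(-2.0, 0.5) for _ in range(500)]   # e.g. "a" in w_a_n
ctx2 = [random.gauss(+2.0, 0.5) for _ in range(500)]   # e.g. "a" in b_a_i

pooled = avg_loglik(ctx1 + ctx2, *fit(ctx1 + ctx2))          # train_mono style
split = (avg_loglik(ctx1, *fit(ctx1))
         + avg_loglik(ctx2, *fit(ctx2))) / 2                 # triphone style
print(pooled < split)  # → True: per-context models fit the frames better
```

The pooled model is forced to put its mass between the clusters, so every frame is poorly explained; the per-context models sit on top of their own cluster.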
steps/train_lda_mllt.sh
LDA+MLLT refers to the way we transform the features after computing
the MFCCs: we splice across several frames, reduce the dimension (to 40
by default) using Linear Discriminant Analysis, and then later estimate,
over multiple iterations, a diagonalizing transform known as MLLT or STC.
See http://kaldi-asr.org/doc/transform.html for more explanation.
In other words: after the MFCCs are extracted, several neighboring frames are spliced together, the dimension is reduced to 40 with LDA, and then, over multiple iterations, a diagonalizing transform (MLLT/STC) is estimated on top.
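A minimal sketch of the splice-then-project front end, assuming 13-dim MFCCs, ±4 frames of context, and a 40-dim output. The projection matrix here is random, purely to show the shapes; the real LDA(+MLLT) matrix is estimated from per-class statistics:

```python
# Splice neighboring frames, then apply a linear projection (toy sizes).
import random

random.seed(0)
FEAT_DIM, CONTEXT, OUT_DIM = 13, 4, 40

def splice(frames, context=CONTEXT):
    """Concatenate each frame with its +/-context neighbors; edges are
    padded by repeating the first/last frame."""
    n = len(frames)
    out = []
    for t in range(n):
        spliced = []
        for k in range(t - context, t + context + 1):
            spliced.extend(frames[min(max(k, 0), n - 1)])
        out.append(spliced)
    return out

def project(frames, matrix):
    """Apply a linear transform (stand-in for the LDA+MLLT matrix)."""
    return [[sum(m * x for m, x in zip(row, f)) for row in matrix]
            for f in frames]

frames = [[random.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(100)]
spliced = splice(frames)                          # 100 frames of 13*9 = 117 dims
lda = [[random.gauss(0, 0.1) for _ in range(FEAT_DIM * (2 * CONTEXT + 1))]
       for _ in range(OUT_DIM)]                   # random here, estimated in Kaldi
reduced = project(spliced, lda)                   # 100 frames of 40 dims
print(len(spliced[0]), len(reduced[0]))  # → 117 40
```

The GMMs of the LDA+MLLT system are then trained on these 40-dim vectors instead of the raw delta features.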
steps/train_lda_mllt.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri2_ali exp/tri3a
steps/train_lda_mllt.sh: Accumulating LDA statistics.
steps/train_lda_mllt.sh: Accumulating tree stats
steps/train_lda_mllt.sh: Getting questions for tree clustering.
steps/train_lda_mllt.sh: Building the tree
steps/train_lda_mllt.sh: Initializing the model
steps/train_lda_mllt.sh: Converting alignments from exp/tri2_ali to use current tree
steps/train_lda_mllt.sh: Compiling graphs of transcripts
Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a/log/analyze_alignments.log
333 warnings in exp/tri3a/log/acc.*.*.log
1422 warnings in exp/tri3a/log/align.*.*.log
7 warnings in exp/tri3a/log/lda_acc.*.log
1 warnings in exp/tri3a/log/build_tree.log
1 warnings in exp/tri3a/log/compile_questions.log
exp/tri3a: nj=10 align prob=-48.75 over 150.18h [retry=0.3%, fail=0.0%] states=2136 gauss=20035 tree-impr=5.07 lda-sum=24.62 mllt:impr,logdet=0.96,1.40
steps/train_lda_mllt.sh: Done training system with LDA+MLLT features in exp/tri3a
The difference from train_deltas is one extra step: after the raw features are read, a feature transform is applied, and training runs on the transformed features.
steps/align_fmllr.sh
Alignment, with three extra stages appended after the usual alignment steps:
steps/align_fmllr.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/tri3a exp/tri3a_ali
steps/align_fmllr.sh: feature type is lda
steps/align_fmllr.sh: compiling training graphs
steps/align_fmllr.sh: aligning data in data/train using exp/tri3a/final.mdl and speaker-independent features.
steps/align_fmllr.sh: computing fMLLR transforms
steps/align_fmllr.sh: doing final alignment.
steps/align_fmllr.sh: done aligning data.
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a_ali/log/analyze_alignments.log
283 warnings in exp/tri3a_ali/log/align_pass1.*.log
4 warnings in exp/tri3a_ali/log/fmllr.*.log
305 warnings in exp/tri3a_ali/log/align_pass2.*.log
The data is first aligned once to produce pre_ali.gz; pre_ali.gz is then used to compute the fMLLR transforms, written as trans.JOB;
finally the speaker-independent features (sifeat) are combined with trans to compute the final alignment.
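What trans.JOB contains is one affine feature transform per speaker. A minimal sketch of applying one, with a toy 3-dim feature and a made-up matrix (fMLLR stores a d x (d+1) matrix W acting on the extended vector [x; 1]):

```python
# Applying an fMLLR (constrained MLLR) transform: y = A x + b, with W = [A | b].
FEAT_DIM = 3  # toy dimension; the LDA+MLLT system above would use 40

def apply_fmllr(W, x):
    """Append 1 to the feature vector and multiply by the d x (d+1) matrix."""
    xa = x + [1.0]
    return [sum(w * v for w, v in zip(row, xa)) for row in W]

# hypothetical speaker transform: identity A, bias b = (1, 2, 3)
W = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 2.0],
     [0.0, 0.0, 1.0, 3.0]]
print(apply_fmllr(W, [0.5, -0.5, 0.0]))  # → [1.5, 1.5, 3.0]
```

The second alignment pass simply runs on these transformed features instead of the speaker-independent ones.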
steps/train_sat.sh
This script trains the speaker-adapted (SAT) system. As with train_lda_mllt, the change is again in how the features are transformed. It accepts alignments produced either with fMLLR features or
with raw MFCC features: it checks whether the alignment directory contains trans.* files, uses them if present, and otherwise computes the fMLLR trans files from the raw features itself.
The transformed features are then used for acc-tree-stats (for the decision-tree clustering) and build-tree, and all subsequent training likewise runs on the transformed features.
steps/train_sat.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a
steps/train_sat.sh: feature type is lda
steps/train_sat.sh: Using transforms from exp/tri3a_ali
steps/train_sat.sh: Accumulating tree stats
steps/train_sat.sh: Getting questions for tree clustering.
steps/train_sat.sh: Building the tree
steps/train_sat.sh: Initializing the model
steps/train_sat.sh: Converting alignments from exp/tri3a_ali to use current tree
steps/train_sat.sh: Compiling graphs of transcripts
Pass 1
Pass 2
Estimating fMLLR transforms
Pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri4a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri4a/log/analyze_alignments.log
1 warnings in exp/tri4a/log/build_tree.log
851 warnings in exp/tri4a/log/acc.*.*.log
1 warnings in exp/tri4a/log/compile_questions.log
53 warnings in exp/tri4a/log/fmllr.*.*.log
1850 warnings in exp/tri4a/log/align.*.*.log
steps/train_sat.sh: Likelihood evolution:
-49.2987 -49.1057 -48.9924 -48.802 -48.3585 -47.9407 -47.6175 -47.3928 -47.2062 -46.8194 -46.6762 -46.4616 -46.3543 -46.2751 -46.2059 -46.1392 -46.075 -46.0135 -45.9561 -45.825 -45.7553 -45.7127 -45.6753 -45.6406 -45.6078 -45.577 -45.5471 -45.517 -45.4879 -45.4149 -45.3743 -45.3532 -45.3395 -45.3303
exp/tri4a: nj=10 align prob=-48.28 over 150.17h [retry=0.4%, fail=0.0%] states=2152 gauss=20024 fmllr-impr=0.62 over 115.48h tree-impr=7.06
steps/train_sat.sh: done training SAT system in exp/tri4a