
kaldi-GMM-HMM pipeline

By 清夏qx, 2019-09-16

The GMM-HMM system is the ASR baseline and usually supplies the alignments for the later TDNN stage. Let's walk through this pipeline in detail.

steps/train_mono.sh

steps/train_mono.sh --cmd "$train_cmd" --nj 10 \
  data/train data/lang exp/mono || exit 1;

Flat start and monophone training, with delta-delta features.

Process:

steps/train_mono.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
...
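Under the hood, this is roughly what the script calls (a sketch only; the real script passes more options, $feats stands for its feature pipeline, and the dimension 39 assumes 13 MFCCs plus deltas and delta-deltas):

# Flat start: initialize a single-Gaussian-per-state monophone model from the topology.
gmm-init-mono data/lang/topo 39 exp/mono/0.mdl exp/mono/tree
# Pass 0: no trained model yet, so spread each utterance's frames evenly over its graph.
align-equal-compiled "ark:gunzip -c exp/mono/fsts.1.gz|" "$feats" "ark:|gzip -c >exp/mono/ali.1.gz"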

utils/mkgraph.sh

# This script creates a fully expanded decoding graph (HCLG) that represents
# all the language-model, pronunciation dictionary (lexicon), context-dependency,
# and HMM structure in our model. The output is a Finite State Transducer
# that has word-ids on the output, and pdf-ids on the input (these are indexes
# that resolve to Gaussian Mixture Models).

mkgraph needs, under lang_test: L.fst, G.fst, phones.txt, words.txt, phones/silence.csl and phones/disambig.int, plus tree and final.mdl under the model directory (exp/tri, exp/mono, etc.).

tree-info exp/mono/tree
tree-info exp/mono/tree
fstminimizeencoded
fstpushspecial
fstdeterminizestar --use-log=true
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst
fstisstochastic data/lang_test/tmp/LG.fst
-0.0663446 -0.0666824
[info]: LG not stochastic.
fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test/phones/disambig.int --write-disambig-syms=data/lang_test/tmp/disambig_ilabels_1_0.int data/lang_test/tmp/ilabels_1_0.105996 data/lang_test/tmp/LG.fst
fstisstochastic data/lang_test/tmp/CLG_1_0.fst
-0.0663446 -0.0666824
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono/graph/disambig_tid.int --transition-scale=1.0 data/lang_test/tmp/ilabels_1_0 exp/mono/tree exp/mono/final.mdl
fsttablecompose exp/mono/graph/Ha.fst data/lang_test/tmp/CLG_1_0.fst
fstrmepslocal
fstrmsymbols exp/mono/graph/disambig_tid.int
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic exp/mono/graph/HCLGa.fst
0.000157882 -0.132761
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono/final.mdl exp/mono/graph/HCLGa.fst

As the log shows, L_disambig.fst and G.fst are first composed into LG.fst. LG.fst is then composed with the context transducer (ilabels_1_0.105996 together with the disambiguation symbols) to form CLG_1_0.fst. Next, make-h-transducer builds Ha.fst, which is composed with CLG_1_0.fst to give HCLGa.fst. Finally, add-self-loops produces the final HCLG.fst.
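In the notation of the Kaldi graph-creation documentation (http://kaldi-asr.org/doc/graph.html), the whole recipe is:

HCLG = asl(min(rds(det(H' o min(det(C o min(det(L o G))))))))

where o is composition, det and min are determinization and minimization, rds removes the disambiguation symbols, asl adds the self-loops, and H' (Ha.fst above) is the H transducer without self-loops.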

utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph

steps/decode.sh

steps/decode.sh --cmd slurm.pl --mem 4G --config conf/decode.config --nj 10 exp/mono/graph data/dev exp/mono/decode_dev
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd slurm.pl --mem 4G exp/mono/graph exp/mono/decode_dev
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,16,133) and mean=49.3
steps/diagnostic/analyze_lats.sh: see stats in exp/mono/decode_dev/log/analyze_lattice_depth_stats.log
+ steps/score_kaldi.sh --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/score_kaldi.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ steps/scoring/score_kaldi_cer.sh --stage 2 --cmd 'slurm.pl --mem 4G' data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh --stage 2 --cmd slurm.pl --mem 4G data/dev exp/mono/graph exp/mono/decode_dev
steps/scoring/score_kaldi_cer.sh: scoring with word insertion penalty=0.0,0.5,1.0
+ echo 'local/score.sh: Done'
local/score.sh: Done

Usage:

steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
  exp/mono/graph data/dev exp/mono/decode_dev
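After scoring, the usual way to read off the best result (utils/best_wer.sh ships with the standard Kaldi recipes) is:

# Pick the best WER/CER over the grid of LM weights and insertion penalties:
grep WER exp/mono/decode_dev/wer_* | utils/best_wer.sh
grep WER exp/mono/decode_dev/cer_* | utils/best_wer.sh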

steps/align_si.sh

steps/align_si.sh --cmd "$train_cmd" --nj 10 \
  data/train data/lang exp/mono exp/mono_ali

Get alignments from the monophone system.

steps/align_si.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/mono exp/mono_ali
steps/align_si.sh: feature type is delta
steps/align_si.sh: aligning data in data/train using model from exp/mono, putting alignments in exp/mono_ali
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/mono_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/mono_ali/log/analyze_alignments.log
steps/align_si.sh: done aligning data.

This aligns the training data with the mono model we just trained.

The ali files map each utterance to a sequence of transition-ids. Each transition-id corresponds one-to-one to a (transition-state, transition-index) pair, and each transition-state corresponds one-to-one to a (phone-id, hmm-state, pdf-id) triple.

Here hmm-state is usually 0, 1 or 2 (the three states of the standard topology), and pdf-id is the index of a leaf of the tree, i.e. a distinct acoustic class.

A given hmm-state has two associated pdfs (the forward pdf and the self-loop pdf). Normally they are equal, both being that state's pdf-id; in the chain model they differ. The transition-index records whether the frame came from the self-loop or from the forward transition.

So an ali file ties every single feature vector to a concrete phone state.
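To see this concretely, the alignments can be inspected with standard Kaldi tools (paths follow the run above):

# Dump the raw transition-id sequence of the first utterance:
copy-int-vector "ark:gunzip -c exp/mono_ali/ali.1.gz|" ark,t:- | head -n 1
# Show how each transition-id decomposes into phone, hmm-state, pdf and transition-index:
show-transitions data/lang/phones.txt exp/mono/final.mdl | head
# Collapse an alignment down to its phone sequence:
ali-to-phones exp/mono/final.mdl "ark:gunzip -c exp/mono_ali/ali.1.gz|" ark,t:- | head -n 1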

steps/train_deltas.sh

steps/train_deltas.sh --cmd "$train_cmd" \
  2500 20000 data/train data/lang exp/mono_ali exp/tri1

Arguments:

numleaves=$1
totgauss=$2
data=$3
lang=$4
alidir=$5
dir=$6

train_deltas.sh takes numleaves (the number of tree leaves), totgauss (the total number of Gaussians), the training data, the lang directory (only the lexicon is used; G.fst is not needed), and alidir (the alignment directory). The final dir is the output directory. The script first checks that the inputs exist:

for f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do
  [ ! -f $f ] && echo "train_deltas.sh: no such file $f" && exit 1;
done

Run log:

steps/train_deltas.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/mono_ali exp/tri1
steps/train_deltas.sh: accumulating tree stats
steps/train_deltas.sh: getting questions for tree-building, via clustering
steps/train_deltas.sh: building the tree
steps/train_deltas.sh: converting alignments from exp/mono_ali to use current tree
steps/train_deltas.sh: compiling graphs of transcripts
steps/train_deltas.sh: training pass 1
steps/train_deltas.sh: training pass 2
...
steps/train_deltas.sh: training pass 10
steps/train_deltas.sh: aligning data
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri1
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri1/log/analyze_alignments.log
1 warnings in exp/tri1/log/compile_questions.log
846 warnings in exp/tri1/log/acc.*.*.log
3076 warnings in exp/tri1/log/align.*.*.log
1 warnings in exp/tri1/log/build_tree.log
exp/tri1: nj=10 align prob=-79.45 over 150.17h [retry=0.5%, fail=0.0%] states=2064 gauss=20050 tree-impr=4.49
steps/train_deltas.sh: Done training system with delta+delta-delta features in exp/tri1

The overall flow is:

accumulating tree stats: via acc-tree-stats and sum-tree-stats, use the alignments to accumulate, for each monophone, the feature statistics of its different contexts.

getting questions for tree-building, via clustering: via cluster-phones and compile-questions, cluster the phones into the question sets used to build the decision tree.

building the tree: via build-tree, followed by gmm-init-model and gmm-mixup. Using the earlier alignments, build the initial model; roughly, the mean and variance of the features assigned to each tree leaf become that state's initial model.

converting alignments from $alidir to use current tree: via convert-ali (which uses ali/final.mdl, dir/1.mdl and dir/tree to turn ali/ali.gz into dir/ali.gz), remap each utterance's old alignment onto the new tree.

compiling graphs of transcripts: via compile-train-graphs, which produces dir/fsts.JOB.gz. Each utterance's transcript is mapped through the lexicon and then, using the context, topo and tree, expanded into an FST whose basic units are HMM states. Training then iterates on top of these graphs, driven by the alignments.

Next comes the iterative training proper. By default there are 35 iterations; at iterations 10, 20 and 30 the data is realigned and a fresh ali.gz is generated. On the other iterations the alignments stay fixed: states are aligned, the expectation is accumulated, then maximized, i.e. gmm-acc-stats-ali followed by gmm-est, which trains the GMM of each state.

The Gaussian count grows by incgauss=$[($totgauss-$numgauss)/$max_iter_inc] per iteration, where max_iter_inc=25 and numgauss starts equal to numleaves. In other words, each leaf starts with a single Gaussian, and incgauss Gaussians are added per iteration up to iteration 25.
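With the invocation above (numleaves=2500, totgauss=20000), incgauss = (20000 - 2500) / 25 = 700, so the model grows from 2500 to about 20000 Gaussians over the first 25 passes. The core loop looks roughly like this (a simplified sketch of steps/train_deltas.sh, with parallelization and most options stripped):

x=1
while [ $x -lt $num_iters ]; do
  if echo $realign_iters | grep -w $x >/dev/null; then
    # Realignment pass: redo the state-level alignment with the current model.
    gmm-align-compiled $scale_opts $dir/$x.mdl "ark:gunzip -c $dir/fsts.JOB.gz|" \
      "$feats" "ark:|gzip -c >$dir/ali.JOB.gz"
  fi
  # E-step: accumulate per-state Gaussian statistics from the fixed alignments.
  gmm-acc-stats-ali $dir/$x.mdl "$feats" "ark:gunzip -c $dir/ali.JOB.gz|" $dir/$x.JOB.acc
  # M-step: re-estimate, splitting Gaussians up to the current target count.
  gmm-est --mix-up=$numgauss $dir/$x.mdl "gmm-sum-accs - $dir/$x.*.acc|" $dir/$[$x+1].mdl
  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss]
  x=$[$x+1]
done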

OK, now for an intuitive take on why train_deltas works better. Each monophone is pronounced differently depending on its context; in train_mono, a single GMM must fit all of those context-dependent variants, which is asking a lot of one model. With triphones we divide and conquer: each distinct context gets its own GMM for its own features. For example, one GMM models w_a_n and another models b_a_i, and the outputs of both ultimately map to a. Using different models for the different realizations of the same monophone resolves its variability, and frames that really belong to a become much easier to align to a.

steps/train_lda_mllt.sh

LDA+MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then later estimate, over multiple iterations, a diagonalizing transform known as MLLT or STC. See http://kaldi-asr.org/doc/transform.html for more explanation.

That is: after the MFCCs are extracted, LDA+MLLT splices neighboring frames together, uses LDA to reduce the dimension to 40, and then, over multiple iterations, estimates a diagonalizing transform (MLLT/STC) on top.
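The resulting feature pipeline, as a sketch (the real script applies CMVN before splicing, omitted here; the +/-3 frame context is the usual default, and final.mat is the matrix the script writes):

# Splice 3 frames of context on each side, then apply the LDA+MLLT matrix:
splice-feats --left-context=3 --right-context=3 scp:data/train/feats.scp ark:- | \
  transform-feats exp/tri3a/final.mat ark:- ark:-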

steps/train_lda_mllt.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri2_ali exp/tri3a
steps/train_lda_mllt.sh: Accumulating LDA statistics.
steps/train_lda_mllt.sh: Accumulating tree stats
steps/train_lda_mllt.sh: Getting questions for tree clustering.
steps/train_lda_mllt.sh: Building the tree
steps/train_lda_mllt.sh: Initializing the model
steps/train_lda_mllt.sh: Converting alignments from exp/tri2_ali to use current tree
steps/train_lda_mllt.sh: Compiling graphs of transcripts
Training pass 1
Training pass 2
steps/train_lda_mllt.sh: Estimating MLLT
Training pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a/log/analyze_alignments.log
333 warnings in exp/tri3a/log/acc.*.*.log
1422 warnings in exp/tri3a/log/align.*.*.log
7 warnings in exp/tri3a/log/lda_acc.*.log
1 warnings in exp/tri3a/log/build_tree.log
1 warnings in exp/tri3a/log/compile_questions.log
exp/tri3a: nj=10 align prob=-48.75 over 150.18h [retry=0.3%, fail=0.0%] states=2136 gauss=20035 tree-impr=5.07 lda-sum=24.62 mllt:impr,logdet=0.96,1.40
steps/train_lda_mllt.sh: Done training system with LDA+MLLT features in exp/tri3a

The difference from train_deltas is one extra step: after the raw features are read, a feature transform is applied, and training runs on the transformed features.

steps/align_fmllr.sh

Alignment again, but with three extra stages appended after the original alignment procedure.

steps/align_fmllr.sh --cmd slurm.pl --mem 4G --nj 10 data/train data/lang exp/tri3a exp/tri3a_ali
steps/align_fmllr.sh: feature type is lda
steps/align_fmllr.sh: compiling training graphs
steps/align_fmllr.sh: aligning data in data/train using exp/tri3a/final.mdl and speaker-independent features.
steps/align_fmllr.sh: computing fMLLR transforms
steps/align_fmllr.sh: doing final alignment.
steps/align_fmllr.sh: done aligning data.
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri3a_ali
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3a_ali/log/analyze_alignments.log
283 warnings in exp/tri3a_ali/log/align_pass1.*.log
4 warnings in exp/tri3a_ali/log/fmllr.*.log
305 warnings in exp/tri3a_ali/log/align_pass2.*.log

It first aligns once, producing pre_ali.gz; pre_ali.gz is then used to compute the fMLLR transforms, producing trans.JOB; finally the speaker-independent features (sifeats) and the transforms together are used to compute the final alignment.
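The fMLLR stage looks roughly like this (a simplified sketch of steps/align_fmllr.sh, where $sifeats is the speaker-independent feature pipeline, $mdl the model, and $silphonelist the silence phones):

# Convert the first-pass alignment to posteriors, zero out silence frames,
# then estimate one fMLLR transform per speaker:
ali-to-post "ark:gunzip -c $dir/pre_ali.JOB.gz|" ark:- | \
  weight-silence-post 0.0 $silphonelist $mdl ark:- ark:- | \
  gmm-est-fmllr --spk2utt=ark:$sdata/JOB/spk2utt $mdl "$sifeats" ark:- ark:$dir/trans.JOB
# Final pass: align again on the fMLLR-adapted features.
gmm-align-compiled $mdl "ark:gunzip -c $dir/fsts.JOB.gz|" \
  "$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |" \
  "ark:|gzip -c >$dir/ali.JOB.gz"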

steps/train_sat.sh

This script trains the speaker-adapted (SAT) system. As before, it only changes how the features are transformed, much like train_lda_mllt. It accepts alignments with or without fMLLR; what matters is whether the alignment directory contains trans files. If they exist they are used; if not, the script runs another pass that estimates the fMLLR trans from the raw features. The transformed features then feed acc-tree-stats (for decision-tree clustering) and build-tree, and all subsequent training passes also run on the transformed features.
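The check is roughly the following (a sketch of the logic near the top of steps/train_sat.sh):

if [ -f $alidir/trans.1 ]; then
  echo "$0: Using transforms from $alidir"
  # Adapt the speaker-independent features with the existing transforms:
  feats="$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$alidir/trans.JOB ark:- ark:- |"
else
  # No transforms in the alignment dir: estimate fMLLR from the alignments first.
  ...
fi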

steps/train_sat.sh --cmd slurm.pl --mem 4G 2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a
steps/train_sat.sh: feature type is lda
steps/train_sat.sh: Using transforms from exp/tri3a_ali
steps/train_sat.sh: Accumulating tree stats
steps/train_sat.sh: Getting questions for tree clustering.
steps/train_sat.sh: Building the tree
steps/train_sat.sh: Initializing the model
steps/train_sat.sh: Converting alignments from exp/tri3a_ali to use current tree
steps/train_sat.sh: Compiling graphs of transcripts
Pass 1
Pass 2
Estimating fMLLR transforms
Pass 3
...
steps/diagnostic/analyze_alignments.sh --cmd slurm.pl --mem 4G data/lang exp/tri4a
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri4a/log/analyze_alignments.log
1 warnings in exp/tri4a/log/build_tree.log
851 warnings in exp/tri4a/log/acc.*.*.log
1 warnings in exp/tri4a/log/compile_questions.log
53 warnings in exp/tri4a/log/fmllr.*.*.log
1850 warnings in exp/tri4a/log/align.*.*.log
steps/train_sat.sh: Likelihood evolution:
-49.2987 -49.1057 -48.9924 -48.802 -48.3585 -47.9407 -47.6175 -47.3928 -47.2062 -46.8194 -46.6762 -46.4616 -46.3543 -46.2751 -46.2059 -46.1392 -46.075 -46.0135 -45.9561 -45.825 -45.7553 -45.7127 -45.6753 -45.6406 -45.6078 -45.577 -45.5471 -45.517 -45.4879 -45.4149 -45.3743 -45.3532 -45.3395 -45.3303
exp/tri4a: nj=10 align prob=-48.28 over 150.17h [retry=0.4%, fail=0.0%] states=2152 gauss=20024 fmllr-impr=0.62 over 115.48h tree-impr=7.06
steps/train_sat.sh: done training SAT system in exp/tri4a
