A Hands-On Guide to Image Captioning with TensorFlow | Tutorial + Code

Posted by 量子學園 in the Painting (繪畫) category | Date: 2017-03-30

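Describing what is in a picture is a skill we humans master in kindergarten. Machines have been working at it for years, and they can now finally produce simple descriptions of images.

O'Reilly and the TensorFlow team have jointly published a tutorial that explains in detail how to train an image caption generator on the Flickr30k dataset, building on Google's Show and Tell model. The model is built, trained, and tested entirely in TensorFlow.

If you can't quite recall what O'Reilly is, 量子位 is happy to jog your memory:

Now, on to the tutorial:

(Compiled and edited by 王新民 | Produced by 量子位, WeChat public account QbitAI)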

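Preparation

Install TensorFlow;

Install the pandas, opencv2, and Jupyter libraries;

Download the image embeddings and image captions for the Flickr30k dataset.

The README of the tutorial's GitHub repository (mlberkeley/oreilly-captions) has download links for the libraries, the image embeddings, and the image captions.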

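The image caption generation model

△ Schematic of the image caption generation network.

The network takes an image of a horse as input and passes it through a deep convolutional neural network (Deep CNN) and a language-generating recurrent neural network (RNN); training this pipeline yields the caption generation model.

This is the architecture of the network we are going to train. The deep CNN encodes each input image as a 4,096-dimensional vector, and the RNN language model decodes that vector into a description of the image.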

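Caption generation as an extension of image classification

Image classification is a classic computer vision task with many strong, well-established models. A classification model recognizes the objects in an image by piecing together visual information about the shapes and objects it contains.

Machine learning models can also be applied to other computer vision tasks, such as object detection and image segmentation. These tasks require the model not only to recognize what is in the image, but also to learn and interpret its 2D spatial structure, and to combine the two kinds of information to locate objects in the image. To generate captions, we need to answer two questions:

1. How can we build on existing, successful image classification models to extract the important information from an image?

2. How can our model combine that information, once it understands the image, to generate a caption?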

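Using transfer learning

We can use an existing model to help extract information from images. Transfer learning lets us take the data transformations learned by neural networks trained on other tasks and apply them to our own data.

In our experiment, the VGG-16 image classification model takes 224×224-pixel images as input and produces a 4,096-dimensional feature vector, which the network's final classification layers use to classify the image.

We can reuse the feature extraction layers of VGG-16 as the image encoder of our captioning network. For this article we treat those layers as a black box and work from precomputed 4,096-dimensional features, so no image has to be passed through VGG-16 during training, which speeds up training of the overall network.

The code that loads the VGG features and the image captions is fairly simple: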

import numpy as np
import pandas as pd

def get_data(annotation_path, feature_path):
    # Captions come from a tab-separated file; features are a memory-mapped .npy array
    annotations = pd.read_table(annotation_path, sep='\t', header=None, names=['image', 'caption'])
    return np.load(feature_path, 'r'), annotations['caption'].values
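To make the expected input concrete, here is a minimal sketch of the annotation format that the `sep` and `names=['image', 'caption']` arguments of `get_data` suggest: one image identifier and one caption per line, separated by a tab. The sample rows and the helper `parse_annotations` are hypothetical, and the sketch uses only the standard library instead of pandas.

```python
import csv
import io

# Hypothetical sample rows; the real annotation file is downloaded with the dataset.
SAMPLE_TSV = (
    "img_001.jpg#0\tA brown horse runs across a field .\n"
    "img_001.jpg#1\tA horse gallops on grass .\n"
    "img_002.jpg#0\tTwo children play on a beach .\n"
)

def parse_annotations(text):
    """Return (image_ids, captions) parsed from 'image<TAB>caption' lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    image_ids, captions = [], []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        image_ids.append(row[0])
        captions.append(row[1])
    return image_ids, captions

image_ids, captions = parse_annotations(SAMPLE_TSV)
print(len(captions), "captions; first:", captions[0])
```

Note that one image can appear on several lines, each with a different caption.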

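Understanding captions

Now that we have a representation of the image, the model needs to learn to decode that representation into an understandable caption.

Because text is sequential, we use an RNN with LSTM cells and train the network to predict the next word of a caption given the words that precede it.

Long short-term memory (LSTM) cells let the model pick out the key information in the sequence of caption words, selectively remembering some content and forgetting what is not useful. TensorFlow provides a wrapper function that creates an LSTM layer for given input and output dimensions.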
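The "remember some things, forget others" behaviour can be illustrated with a toy LSTM step. This is a sketch under strong simplifying assumptions: the input, hidden state, and cell state are single numbers, and the function `lstm_step` and its hand-picked parameters are hypothetical, not the TensorFlow wrapper mentioned above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Forget gate open, input gate closed: the cell keeps its old memory.
keep = {"input": (0.0, 0.0, -10.0), "forget": (0.0, 0.0, 10.0),
        "output": (0.0, 0.0, 10.0), "candidate": (1.0, 0.0, 0.0)}
# Forget gate closed, input gate open: the cell overwrites its memory.
overwrite = dict(keep, input=(0.0, 0.0, 10.0), forget=(0.0, 0.0, -10.0))

_, c_keep = lstm_step(x=2.0, h_prev=0.0, c_prev=0.8, params=keep)
_, c_new = lstm_step(x=2.0, h_prev=0.0, c_prev=0.8, params=overwrite)
print(c_keep, c_new)
```

With the first parameter set the cell state stays close to its previous value of 0.8; with the second it moves close to tanh(2.0), the candidate computed from the new input. In a trained network the gate values are learned rather than set by hand.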

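To turn words into fixed-length representations suitable as LSTM input, we use an embedding layer that learns to map each word to a 256-dimensional feature vector; this is the word embedding. Word embeddings represent words as vectors, so that words with similar vectors are also semantically similar.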
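As a rough illustration of what the embedding layer does, the sketch below builds a vocabulary and looks up one 256-dimensional vector per word. The helpers `build_vocab`, `make_embedding_table`, and `embed`, the `<start>` token at id 0, and the random initial values are assumptions for this example; in the real model the table is a trainable variable.

```python
import random

EMBED_DIM = 256  # embedding size quoted in the text

def build_vocab(captions):
    """Map each distinct word to an integer id; id 0 is reserved for a start token."""
    word_to_id = {"<start>": 0}
    for caption in captions:
        for word in caption.lower().split():
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id

def make_embedding_table(n_words, dim, seed=0):
    """One row of `dim` small random numbers per word (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_words)]

def embed(caption, word_to_id, table):
    """Look up the embedding row of every word in the caption."""
    return [table[word_to_id[w]] for w in caption.lower().split()]

captions = ["a horse runs", "a dog runs fast"]
word_to_id = build_vocab(captions)
table = make_embedding_table(len(word_to_id), EMBED_DIM)
vectors = embed("a horse runs", word_to_id, table)
print(len(vectors), "words ->", len(vectors[0]), "dimensions each")
```

The lookup is just row indexing, which is what `tf.nn.embedding_lookup` does on the word embedding matrix in the model code.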

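In the image classifier built on VGG-16, the 4,096-dimensional vector produced by the feature extraction layers is passed to a softmax layer for classification. Because the LSTM cells in our model take 256-dimensional text features as input, we need to convert the image representation into the representation used for the caption sequence. We therefore add an embedding layer that maps the 4,096-dimensional image features into the same 256-dimensional vector space as the text features.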
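The image embedding layer is a single linear map. Here is a minimal standard-library sketch of that projection, assuming randomly initialised weights and a zero bias; the helper names `make_projection` and `project` are hypothetical, and a real implementation would use a matrix library and learn the weights during training.

```python
import random

DIM_IN, DIM_EMBED = 4096, 256  # sizes quoted in the text

def make_projection(dim_in, dim_out, seed=0):
    """Random (untrained) weight matrix of shape dim_in x dim_out and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim_out)] for _ in range(dim_in)]
    bias = [0.0] * dim_out
    return weights, bias

def project(features, weights, bias):
    """Compute features @ weights + bias for a single image."""
    out = list(bias)
    for x, row in zip(features, weights):
        if x == 0.0:
            continue  # zero features contribute nothing
        for j, w in enumerate(row):
            out[j] += x * w
    return out

weights, bias = make_projection(DIM_IN, DIM_EMBED)
image_features = [random.Random(1).random() for _ in range(DIM_IN)]  # stand-in for VGG-16 output
image_embedding = project(image_features, weights, bias)
print(len(image_features), "->", len(image_embedding))
```

After this step the image lives in the same 256-dimensional space as the word embeddings, so it can be fed to the LSTM as if it were the first "word" of the caption.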

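Building and training the model

The figure below shows how the image captioning model works:

In the figure, {s0, s1, …, sN} are the caption words we are trying to predict, and {wes0, wes1, …, wesN-1} are the word embedding vectors of each word. The LSTM outputs {p1, p2, …, pN} are the probability distributions the model produces for the next word, given the words seen so far. The training objective is to maximize the sum of the log probabilities of the correct words, which is equivalent to minimizing the cross-entropy loss.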
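The objective can be sketched in a few lines of plain Python. This is an illustration of the idea for a single caption, not the tutorial's TensorFlow code: `caption_loss` is a hypothetical helper that averages the negative log probability of each target word over the steps that are not masked out, so minimizing it maximizes the sum of log probabilities.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def caption_loss(step_logits, target_ids, mask):
    """Average negative log probability of the target words over unmasked steps."""
    total, count = 0.0, 0.0
    for logits, target, keep in zip(step_logits, target_ids, mask):
        probs = softmax(logits)
        total += -math.log(probs[target]) * keep
        count += keep
    return total / count

# Toy vocabulary of 3 words, 3 time steps; the last step is padding (mask = 0).
step_logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1], [0.0, 0.0, 0.0]]
target_ids = [0, 1, 2]
mask = [1.0, 1.0, 0.0]
print(caption_loss(step_logits, target_ids, mask))
```

The loss shrinks as the model assigns higher probability to the correct next word at each unmasked step, and padded positions do not contribute.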

# Method of the caption generator class; uses the TensorFlow 1.x graph API (import tensorflow as tf)
def build_model(self):
    # declaring the placeholders for our extracted image feature vectors, our caption, and our mask
    # (describes how long our caption is with an array of 0/1 values of length `maxlen`)
    img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])
    caption_placeholder = tf.placeholder(tf.int32, [self.batch_size, self.n_lstm_steps])
    mask = tf.placeholder(tf.float32, [self.batch_size, self.n_lstm_steps])

    # getting an initial LSTM embedding from our image embedding
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias

    # setting initial state of our LSTM
    state = self.lstm.zero_state(self.batch_size, dtype=tf.float32)

    total_loss = 0.0
    with tf.variable_scope("RNN"):
        for i in range(self.n_lstm_steps):
            if i > 0:
                # if this isn't the first iteration of our LSTM we need to get the word_embedding
                # corresponding to the (i-1)th word in our caption
                with tf.device("/cpu:0"):
                    current_embedding = tf.nn.embedding_lookup(self.word_embedding, caption_placeholder[:, i-1]) + self.embedding_bias
            else:
                # if this is the first iteration of our LSTM we utilize the embedded image as our input
                current_embedding = image_embedding

            if i > 0:
                # allows us to reuse the LSTM tensor variable on each iteration
                tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(current_embedding, state)
            print(out, self.word_encoding, self.word_encoding_bias)

            if i > 0:
                # get the one-hot representation of the next word in our caption
                labels = tf.expand_dims(caption_placeholder[:, i], 1)
                ix_range = tf.range(0, self.batch_size, 1)
                ixs = tf.expand_dims(ix_range, 1)
                concat = tf.concat([ixs, labels], 1)
                onehot = tf.sparse_to_dense(
                    concat, tf.stack([self.batch_size, self.n_words]), 1.0, 0.0)

                # perform a softmax classification to generate the next word in the caption
                logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
                xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logit, labels=onehot)
                # zero out the loss at padded positions
                xentropy = xentropy * mask[:, i]

                loss = tf.reduce_sum(xentropy)
                total_loss += loss

        # average the loss over the number of real (unmasked) words
        total_loss = total_loss / tf.reduce_sum(mask[:, 1:])
        return total_loss, img, caption_placeholder, mask
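The `caption_placeholder` and `mask` inputs of `build_model` both have a fixed width of `n_lstm_steps`, so captions of different lengths have to be padded. The sketch below shows one plausible way to build the two arrays; the helper `encode_captions`, the toy vocabulary, and padding with id 0 are assumptions for illustration, and the tutorial's actual preprocessing may differ (for example in how it handles start and end tokens).

```python
def encode_captions(captions, word_to_id, n_steps):
    """Return (ids, mask): fixed-length rows of word ids and 0/1 rows marking real words."""
    ids, mask = [], []
    for caption in captions:
        row = [word_to_id[w] for w in caption.lower().split()][:n_steps]
        pad = n_steps - len(row)
        ids.append(row + [0] * pad)                  # pad short captions with id 0
        mask.append([1.0] * len(row) + [0.0] * pad)  # 1.0 = real word, 0.0 = padding
    return ids, mask

toy_vocab = {"a": 1, "horse": 2, "runs": 3}  # hypothetical vocabulary
ids, mask = encode_captions(["a horse", "a horse runs"], toy_vocab, n_steps=4)
for id_row, mask_row in zip(ids, mask):
    print(id_row, mask_row)
```

Multiplying the per-step loss by the mask, as `build_model` does, means padded positions never contribute to the gradient.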

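Generating captions through inference

After training, we have a model that gives the probability of the next word, given the image and the caption words so far. How do we use this network to generate new captions?

The simplest approach is to take an input image and iteratively output the most likely next word, building up a single caption.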

# Method of the same class as `build_model`; uses the TensorFlow 1.x graph API
def build_generator(self, maxlen, batchsize=1):
    # same setup as `build_model` function
    # (note: the placeholder uses self.batch_size, so this assumes self.batch_size == batchsize)
    img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias
    state = self.lstm.zero_state(batchsize, dtype=tf.float32)

    # declare list to hold the words of our generated captions
    all_words = []
    print(state, image_embedding, img)

    with tf.variable_scope("RNN"):
        # in the first iteration we have no previous word, so we directly pass in the image embedding
        # and set the `previous_word` to the embedding of the start token ([0]) for the future iterations
        output, state = self.lstm(image_embedding, state)
        previous_word = tf.nn.embedding_lookup(self.word_embedding, [0]) + self.embedding_bias

        for i in range(maxlen):
            tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(previous_word, state)

            # score every word in the vocabulary and pick the most likely one
            logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
            best_word = tf.argmax(logit, 1)

            with tf.device("/cpu:0"):
                # get the embedding of the best_word to use as input to the next iteration of our LSTM
                previous_word = tf.nn.embedding_lookup(self.word_embedding, best_word)

            previous_word += self.embedding_bias

            all_words.append(best_word)

    return img, all_words

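In many cases this approach works well. But greedily picking the most likely word at each step does not guarantee the most likely caption overall, and the result may not be a coherent, fluent sentence.

One way around this is an algorithm called beam search. It iteratively takes the set of the k best sentences of length t, generates candidate sentences of length t+1 from them, and keeps only the k best of those candidates. This lets the algorithm explore a larger space of good captions while keeping inference computationally tractable. In the example below, the algorithm maintains a list of the k=2 best candidate sentences, shown as the bold paths at each vertical time step.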
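The difference between greedy decoding and beam search can be seen on a toy example. The sketch below is a simplified beam search over a hypothetical bigram table, `TOY_MODEL`, that stands in for the trained network; the probabilities are made up so that the greedy choice (k=1) misses the most probable caption while k=2 finds it.

```python
import math

# Hypothetical "language model": P(next word | previous word).
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"horse": 0.9, "dog": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
    "horse": {"<end>": 1.0},
}

def beam_search(model, k, max_len=10):
    """Return (words, log_prob) of the best caption found with beam width k."""
    beams = [(["<start>"], 0.0)]
    finished = []
    for _ in range(max_len):
        # extend every sentence in the beam by every possible next word
        candidates = []
        for words, score in beams:
            for word, prob in model[words[-1]].items():
                candidates.append((words + [word], score + math.log(prob)))
        # keep only the k highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:k]:
            if words[-1] == "<end>":
                finished.append((words, score))
            else:
                beams.append((words, score))
        if not beams:
            break
    best_words, best_score = max(finished or beams, key=lambda c: c[1])
    caption = [w for w in best_words if w not in ("<start>", "<end>")]
    return caption, best_score

for k in (1, 2):
    caption, score = beam_search(TOY_MODEL, k)
    print("k =", k, caption, math.exp(score))
```

With k=1 the search commits to "a" because it is the most likely first word, and ends with a caption of probability 0.3. With k=2 it also keeps "the" alive and finds "the horse", whose overall probability of 0.36 is higher.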

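Limitations and discussion

A neural image caption generator provides a useful framework for learning to map images to natural-language descriptions. By training on a large collection of images paired with captions, the model learns to capture relevant semantic information from visual features.

However, with a static image embedding, the caption generator relies on image features that are useful for image classification, which are not necessarily the features most useful for generating captions. To increase the amount of task-relevant information in each feature, we can train the image embedding model (the VGG-16 network that encodes the features) as part of the caption generation model, so that backpropagation fine-tunes the image encoder to better serve caption generation.

In addition, if we look closely at the generated captions, we notice that they are rather generic and vary little from one another. Take the following image as an example: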