[Attention series] Using attention modules for multimodal fusion

Posted by 子非魚 in Photography, 2022-01-04

Title: Attention Bottlenecks for Multimodal Fusion

Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Affiliation: Google Research

Published at: NeurIPS 2021

Keywords: multimodal fusion, attention, audiovisual fusion in video

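One-sentence summary: The paper proposes the Multimodal Bottleneck Transformer (MBT), which uses self-attention to exchange information between modalities at intermediate layers. To reduce computation, the information each modality most needs to share with the other is encoded in a small set of latent bottleneck tokens (4 in the paper's experiments), and these tokens take part in self-attention with each modality's tokens separately to carry the exchange.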

0. Abstract

Task: video classification, multimodal fusion, audio-visual classification

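Problem:

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. In stark contrast, machine perception models are typically modality-specific and optimised for unimodal benchmarks, so late-stage fusion of the final representations or predictions from each modality ("late fusion") is still the dominant paradigm for multimodal video classification.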

Solution:

Instead, we introduce a novel transformer-based architecture that places "fusion bottlenecks" at multiple layers for modality fusion. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense relevant information in each modality and share what is necessary.

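We find that this strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple audio-visual classification benchmarks, including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.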

1. Introduction

Task: audiovisual fusion in video, with two modalities: audio and video.

Why use a transformer: it can model dense correlations between tokens.

Representative transformers on several classic tasks:

image classification (ViT [16])

video classification (ViViT [6])

audio classification (AST [23])

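How can early fusion be used for audiovisual fusion?

Feed a sequence of both visual and auditory patches directly into the transformer. Because the two modalities are combined before they enter the model, this is early fusion. Such an "early fusion" model allows attention to flow freely between different spatio-temporal regions of the image, as well as across frequency and time in the audio spectrogram.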
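Early fusion as described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes NumPy, a single attention head with no learned projections (queries, keys and values are the tokens themselves), and made-up token counts and dimensions.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with no learned weights."""
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
n_rgb, n_spec, dim = 8, 6, 16  # hypothetical sizes, chosen only for illustration
rgb_tokens = rng.normal(size=(n_rgb, dim))    # stand-in for visual patch tokens
spec_tokens = rng.normal(size=(n_spec, dim))  # stand-in for spectrogram patch tokens

# Early fusion: concatenate both modalities before the first attention layer,
# so every token can attend to every other token regardless of modality.
fused_in = np.concatenate([rgb_tokens, spec_tokens], axis=0)
fused_out, weights = self_attention(fused_in)

# The top-right block holds the visual-to-audio attention weights.
cross_modal_block = weights[:n_rgb, n_rgb:]
```

The attention matrix has one row and one column per token of either modality, which is where the cost discussed next comes from.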

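Problems and drawbacks of early fusion:

While theoretically appealing, we hypothesise that full pairwise attention at all layers of the model is not necessary, because audio and visual inputs contain dense, fine-grained information, much of which is redundant. This is particularly true for video, as shown by the performance of the "factorised" versions of [6]. Such a model also does not scale well to longer videos, because the cost of pairwise attention is quadratic in the token sequence length.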
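The scaling argument can be made concrete with simple counting. The sketch below only counts query-key pairs in one full self-attention layer; the tokens-per-second figure is a made-up assumption, not a number from the paper.

```python
def attention_pairs(num_tokens):
    """Query-key pairs scored by one full self-attention layer."""
    return num_tokens ** 2

# Hypothetical rate: combined audio and video tokens produced per second of input.
tokens_per_second = 100

for seconds in (4, 8, 16):
    num_tokens = tokens_per_second * seconds
    print(seconds, num_tokens, attention_pairs(num_tokens))
```

Each doubling of the clip length doubles the token count and quadruples the number of pairs.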

How to solve this:

To mitigate this, we propose two ways to restrict the attention flow in our model.

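The first follows a common paradigm in multimodal learning: restrict cross-modal flow to the later layers of the network, allowing the early layers to specialise in learning and extracting unimodal patterns. This is referred to as "mid fusion" (Figure 1, middle left), and the layer at which cross-modal interactions are introduced is called the fusion layer. The two extreme versions are "early fusion" (all layers are cross-modal) and "late fusion" (all layers are unimodal), which we compare against as baselines. [That is, the conventional early, mid and late fusion.]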
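A rough sketch of the fusion-layer idea follows. It assumes NumPy and weight-free single-head attention; the names `encode` and `fusion_layer` are hypothetical, and the convention that `fusion_layer = 0` gives early fusion while `fusion_layer = num_layers` keeps every attention layer unimodal is an assumption made here for illustration.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with no learned weights."""
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def encode(rgb, spec, num_layers, fusion_layer):
    """Unimodal attention before `fusion_layer`, joint attention from it onwards."""
    n_rgb = rgb.shape[0]
    for layer in range(num_layers):
        if layer < fusion_layer:
            # Each modality attends only to its own tokens.
            rgb, spec = self_attention(rgb), self_attention(spec)
        else:
            # Cross-modal layer: full pairwise attention over both modalities.
            joint = self_attention(np.concatenate([rgb, spec], axis=0))
            rgb, spec = joint[:n_rgb], joint[n_rgb:]
    return rgb, spec

rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))   # hypothetical visual tokens
spec = rng.normal(size=(6, 16))  # hypothetical spectrogram tokens

early = encode(rgb, spec, num_layers=4, fusion_layer=0)
mid = encode(rgb, spec, num_layers=4, fusion_layer=2)
late = encode(rgb, spec, num_layers=4, fusion_layer=4)  # outputs still need a final fusion step
```

With `fusion_layer=4` the visual outputs do not depend on the audio tokens at all, so any fusion has to happen afterwards on the final representations or predictions.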

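Our second idea (and main contribution) is to restrict cross-modal attention flow between tokens within a layer. We do this by allowing free attention flow within a modality, but forcing the model to collate and condense the information from each modality before sharing it with the other. The core idea is to introduce a small set of latent fusion units that form an "attention bottleneck", through which cross-modal interactions within a layer must pass. We show that this Multimodal Bottleneck Transformer (MBT) outperforms or matches its unrestricted counterpart, but at a lower computational cost.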
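One way to picture a bottleneck layer is sketched below. It is a simplification under stated assumptions: NumPy, weight-free single-head attention, made-up sizes, and a hypothetical function name `bottleneck_fusion_layer`. Averaging the two per-modality bottleneck updates is the combination rule assumed here.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with no learned weights."""
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def bottleneck_fusion_layer(rgb, spec, bottleneck):
    """Each modality attends over its own tokens plus the shared bottleneck tokens."""
    n_rgb, n_spec = rgb.shape[0], spec.shape[0]
    rgb_out = self_attention(np.concatenate([rgb, bottleneck], axis=0))
    spec_out = self_attention(np.concatenate([spec, bottleneck], axis=0))
    # Assumed combination rule: average the two modality-specific bottleneck updates.
    new_bottleneck = 0.5 * (rgb_out[n_rgb:] + spec_out[n_spec:])
    return rgb_out[:n_rgb], spec_out[:n_spec], new_bottleneck

rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))         # hypothetical visual tokens
spec = rng.normal(size=(6, 16))        # hypothetical spectrogram tokens
bottleneck = rng.normal(size=(4, 16))  # a small number of bottleneck tokens

for _ in range(2):  # two stacked bottleneck layers
    rgb, spec, bottleneck = bottleneck_fusion_layer(rgb, spec, bottleneck)
```

Visual and audio tokens never attend to each other directly. Audio information reaches the visual tokens only after it has been written into the bottleneck tokens in one layer and read back in the next.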

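Contributions:

Our model restricts the flow of cross-modal information between latent units through tight fusion "bottlenecks", which force the model to collect and "condense" the most relevant inputs in each modality (and therefore share only what is necessary with the other modality). This avoids the quadratic scaling cost of full pairwise attention and leads to performance gains with less compute.

We apply MBT to image and spectrogram patches (Figure 2), and explore a number of ablations related to the fusion layer, the sampling of inputs and the data size.

We achieve state-of-the-art results on the audio-visual classification datasets AudioSet [21], Epic-Kitchens-100 [12] and VGGSound [10]. On AudioSet, we outperform the previous state of the art by 5.9 mAP (a 12.7% relative improvement).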
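The compute claim in the first contribution can be checked with a back-of-the-envelope count of query-key pairs per layer. The token counts below are made up for illustration, and the function names are hypothetical; only the counting logic matters.

```python
def full_pairwise_pairs(n_rgb, n_spec):
    """Pairs scored when all tokens of both modalities attend to each other."""
    return (n_rgb + n_spec) ** 2

def unimodal_pairs(n_rgb, n_spec):
    """Pairs scored when each modality attends only to itself."""
    return n_rgb ** 2 + n_spec ** 2

def bottleneck_pairs(n_rgb, n_spec, n_bottleneck):
    """Pairs scored when each modality attends to itself plus the bottleneck tokens."""
    return (n_rgb + n_bottleneck) ** 2 + (n_spec + n_bottleneck) ** 2

n_rgb, n_spec, n_bottleneck = 800, 200, 4  # hypothetical token counts

full = full_pairwise_pairs(n_rgb, n_spec)
unimodal = unimodal_pairs(n_rgb, n_spec)
bottleneck = bottleneck_pairs(n_rgb, n_spec, n_bottleneck)
print(full, unimodal, bottleneck)
```

With these counts the bottleneck layer scores 688,032 pairs against 1,000,000 for full pairwise attention, and only 8,032 more than two fully separate unimodal layers.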

2. Related Work: Multimodal transformers

The self-attention operation of transformers provides a natural mechanism to connect multimodal signals.

Multimodal transformers have been applied to various tasks including audio enhancement [17, 53], speech recognition [24], image segmentation [58, 53], cross-modal sequence generation [39, 37, 49], image and video retrieval [25, 20, 8], visual navigation [46] and image/video captioning/classification [41, 52, 51, 36, 28].

For many works, the inputs to transformers are the output representations of single-modality CNNs [35, 20]. Unlike these works, we use transformer blocks throughout, using only a single convolutional layer to rasterise 2D patches. In other words, where many works feed CNN-learned representations into a transformer, here the one convolutional layer only turns patches into tokens and everything after it is a transformer block.
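A convolution whose kernel size equals its stride, applied to non-overlapping patches, is the same computation as flattening each patch and applying one shared linear projection. The sketch below shows that form with NumPy; the patch size, image size and embedding width are illustrative assumptions, and `patchify` and `embed_patches` are hypothetical helper names.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) array into flattened non-overlapping patch rows."""
    height, width, channels = image.shape
    assert height % patch == 0 and width % patch == 0
    grid = image.reshape(height // patch, patch, width // patch, patch, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, channels)
    return grid.reshape(-1, patch * patch * channels)

def embed_patches(image, patch, weight, bias):
    """Project every flattened patch to a token with one shared linear map."""
    return patchify(image, patch) @ weight + bias

rng = np.random.default_rng(0)
patch, dim = 16, 32                    # illustrative patch size and token width
frame = rng.normal(size=(32, 32, 3))   # stand-in for an RGB frame
spectrogram = rng.normal(size=(32, 64, 1))  # stand-in for a one-channel spectrogram

frame_w = rng.normal(size=(patch * patch * 3, dim))
spec_w = rng.normal(size=(patch * patch * 1, dim))
bias = np.zeros(dim)

frame_tokens = embed_patches(frame, patch, frame_w, bias)
spec_tokens = embed_patches(spectrogram, patch, spec_w, bias)
```

Both modalities end up as sequences of tokens with the same width, which is what lets the same transformer blocks process frames and spectrograms.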

The tokens from different modalities are usually combined directly as inputs to the transformers [38]; for example, the recently released Perceiver model [29] introduces an iterative attention mechanism which takes concatenated raw multimodal signals as inputs, which corresponds to our "early fusion" baseline. In contrast, we carefully examine the impact of different modality fusion strategies, including limiting cross-modal attention flow to later layers of our model, and "channeling" cross-modal connections through bottlenecks in our proposed Multimodal Bottleneck Transformer (MBT).

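In short: earlier work mostly concatenates tokens from different modalities and feeds them straight into the transformer. The recent Perceiver model [29] proposes an iterative attention mechanism that takes concatenated raw multimodal signals as input, which is what this paper calls early fusion. This paper instead compares fusion strategies carefully, including restricting cross-modal attention flow to the deeper layers and routing cross-modal connections through bottlenecks.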

3. Method