Deep Learning: A Collection of CNN Tricks

Author: sticky. Posted under: Painting. Date: 2020-05-03

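Over the past couple of days I came across a gem of a paper from CVPR 2019: Bag of Tricks for Image Classification with Convolutional Neural Networks. It covers effective tricks for improving an existing baseline (ResNet-50) from three angles:

Training efficiently on new hardware

Making a few small adjustments to the model, starting from ResNet-50

Some training techniques

What follows is a brief review of the paper.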

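1. Efficient training on new hardware

1.1 Background

When ResNet was first proposed, the hardware of the time forced many performance-related trade-offs. With the rapid progress in hardware (GPUs in particular) over the past few years, many of those trade-offs have changed. These include:

Using a larger batch size, for example going from 256 to 1024

Using lower numerical precision, for example going from FP32 to FP16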

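1.2 Using a larger batch size

Using a larger batch size can slow training progress. For convex problems, the convergence rate decreases as the batch size increases. In other words, for the same number of epochs, a larger batch size may lead to lower validation accuracy. The following tricks address this problem.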

Linear scaling learning rate: for example, if the initial learning rate is 0.1 for a batch size of 256, then when the batch size is increased to b, the initial learning rate should be raised to 0.1 × b/256.

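Learning rate warmup: for example, use 5 epochs for warmup, increasing the learning rate linearly from 0 to the initial learning rate over those 5 epochs, and then start the normal decay.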

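Zero $\gamma$: in the batch normalization (BN) layers of a residual block, BN first standardizes its input $x$ to get $\hat{x}$, and then applies the linear transformation $\gamma \hat{x} + \beta$, where $\gamma$ and $\beta$ are learnable parameters, normally initialized to 1s and 0s respectively. Here, $\gamma$ is initialized to 0 instead. In the paper this applies to the BN layer at the end of each residual block, so every block initially just returns its input, which makes the network easier to train in the early stage.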

No bias decay: to avoid overfitting, weight decay is usually applied to both the weights and the biases. Here, decay is applied only to the weights and not to the biases.
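The first two heuristics above, linear scaling and warmup, are simple enough to write down directly. Below is a minimal sketch in plain Python, not the paper's code. The function names `scaled_lr` and `warmup_lr` are made up for illustration, and passing a fractional `epoch` is an assumption that lets the ramp be updated more often than once per epoch.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_lr(epoch, target_lr, warmup_epochs=5):
    """Ramp linearly from 0 to target_lr over the first warmup_epochs epochs.

    `epoch` may be fractional (hypothetical convention for per-batch updates).
    """
    if epoch >= warmup_epochs:
        return target_lr  # warmup is over; the regular decay takes over from here
    return target_lr * epoch / warmup_epochs


# Batch size 1024 with the 0.1-per-256 rule from the text: 0.1 * 1024 / 256
target = scaled_lr(1024)

# Learning rate at the start of epochs 0..6: rises linearly, then stays at target
schedule = [warmup_lr(e, target) for e in range(7)]
print(target, schedule)
```

In a real training loop the value returned after warmup would be passed on to whatever decay schedule is in use, rather than held constant as in this sketch.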

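1.3 Using lower numerical precision

Neural networks used to be trained with 32-bit floating point precision (FP32). Newer hardware, however, has enhanced arithmetic logic units for lower-precision data types. For example, the Nvidia V100 offers 14 TFLOPS in FP32 but 100 TFLOPS in FP16. As a result, switching to FP16 speeds up overall training by a factor of 2 to 3: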

Comparison of the training time and validation accuracy for ResNet-50 between the baseline (BS=256 with FP32) and a more hardware efficient setting (BS=1024 with FP16).

The breakdown effect for each effective training heuristic on ResNet-50.

2. Model tweaks

The architecture of ResNet-50. The convolution kernel size, output channel size and stride size (default is 1) are illustrated, similar for pooling layers.

The changes mainly target the downsampling block and the input stem (the parts marked in the figure above):

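The downsampling block is changed mainly because a 1×1 conv with stride=2 ignores 3/4 of the input feature map. To fix this while keeping the output shape unchanged, the first two convs in path A become a 1×1 conv with stride=1 and a 3×3 conv with stride=2, which gives ResNet-B; path B is replaced with a 2×2 AvgPool with stride=2 followed by a 1×1 conv with stride=1, which gives ResNet-D.

The input stem is changed mainly because a 7×7 conv costs 5.4 times as much computation as a 3×3 conv. The 7×7 conv is therefore replaced with three consecutive 3×3 convs, which gives ResNet-C.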
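The two numbers quoted above, 3/4 and 5.4, can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check. The helper names and the 56×56 feature-map size are illustrative assumptions, and the cost ratio assumes equal input and output channel counts for both kernels.

```python
def fraction_read_by_strided_1x1(height, width, stride=2):
    """Fraction of input positions that a 1x1 conv with the given stride reads."""
    visited = {
        (row, col)
        for row in range(0, height, stride)
        for col in range(0, width, stride)
    }
    return len(visited) / (height * width)


def kernel_cost_ratio(large_kernel, small_kernel):
    """Ratio of multiply-adds per output position for two square kernels."""
    return (large_kernel ** 2) / (small_kernel ** 2)


read = fraction_read_by_strided_1x1(56, 56)  # 56x56 is an arbitrary example size
ignored = 1 - read                           # share of the feature map never touched
ratio = kernel_cost_ratio(7, 3)              # 49 / 9, about 5.4

print(f"read={read:.2f} ignored={ignored:.2f} cost ratio={ratio:.2f}")
```

A 1×1 kernel sees exactly one position per output, so with stride 2 it samples every other row and every other column, which is where the 1/4 read and 3/4 ignored split comes from.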

Three ResNet tweaks. ResNet-B modifies the downsampling block of ResNet. ResNet-C further modifies the input stem. On top of that, ResNet-D again modifies the downsampling block.

Compare ResNet-50 with three model tweaks on model size, FLOPs and ImageNet validation accuracy.

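3. Training refinements

3.1 Cosine Learning Rate Decay

The usual learning rate decay strategy has been "step decay", where the learning rate is cut by a constant factor only once every fixed number of epochs. With cosine decay, the learning rate instead decreases continuously as training progresses:

$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta$$

where $\eta$ is the initial learning rate, $t$ is the current batch index and $T$ is the total number of batches (the warmup stage is ignored).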

Visualization of learning rate schedules with warm-up. Top: cosine and step schedules for batch size 1024. Bottom: Top-1 validation accuracy curve with regard to the two schedules.
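To make the contrast between the two schedules concrete, here is a minimal sketch of both in plain Python. The function names are hypothetical, and the step-decay settings (multiply by 0.1 every 30 epochs) and the 120-epoch run are illustrative assumptions, not values taken from this post. Warmup is left out to keep the example short.

```python
import math


def cosine_lr(t, total_steps, base_lr):
    """Cosine decay: 0.5 * (1 + cos(t * pi / T)) * base_lr, from base_lr down to 0."""
    return 0.5 * (1 + math.cos(math.pi * t / total_steps)) * base_lr


def step_lr(epoch, base_lr, step_size=30, factor=0.1):
    """Step decay: multiply by `factor` once every `step_size` epochs (assumed values)."""
    return base_lr * factor ** (epoch // step_size)


total_epochs = 120  # illustrative training length
for epoch in (0, 29, 30, 60, 119):
    print(epoch, step_lr(epoch, 0.4), cosine_lr(epoch, total_epochs, 0.4))
```

The step schedule stays flat between drops, while the cosine schedule changes at every step: slowly near the start and end, and roughly linearly in the middle.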

3.2 Label Smoothing

3.3 Knowledge Distillation

3.4 Mixup Training

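In mixup, two samples $(x_i, y_i)$ and $(x_j, y_j)$ are drawn at random each time, and a new training sample is generated from them by weighted linear interpolation:

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda) y_j$$

where $\lambda \in [0, 1]$ is a random number drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution.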
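The interpolation can be sketched in a few lines of standard-library Python. This is a toy version operating on plain lists rather than image tensors. The function name `mixup`, the `rng` parameter and the default `alpha=0.2` are assumptions made for illustration, and labels are assumed to be one-hot vectors.

```python
import random


def mixup(sample_i, sample_j, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    x_i, y_i = sample_i
    x_j, y_j = sample_j
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mixed = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mixed = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mixed, y_mixed, lam


# Two toy samples with 2-dimensional features and 2 classes
sample_a = ([1.0, 0.0], [1.0, 0.0])
sample_b = ([0.0, 1.0], [0.0, 1.0])

x_new, y_new, lam = mixup(sample_a, sample_b, rng=random.Random(0))
print(lam, x_new, y_new)
```

Because both labels are one-hot and the weights sum to 1, the mixed label is still a valid probability distribution over the classes.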

3.5 Experiment Results

The validation accuracies on ImageNet for stacking training refinements one by one. We repeat each refinement on ResNet-50-D for 4 times with different initialization, and report the mean and standard deviation in the table.
