
Contrastive Learning for Label-Efficient Semantic Segmentation

Author: 筱凡小築 | Published: 2022-04-30


Abstract

Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with the label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.


Introduction

In the recent past, various approaches based on Convolutional Neural Networks (CNNs) have reported excellent results on several semantic segmentation datasets by leveraging large amounts of dense pixel-level annotations. However, labeling images with pixel-level annotations is time-consuming and expensive. For example, the average time taken to annotate a single image in the Cityscapes dataset is 90 minutes. Figure pascal_drop shows how the performance of a DeepLabV3+ model trained on the PASCAL VOC 2012 dataset using the cross-entropy loss drops as the number of training images decreases. This happens because CNNs trained with the cross-entropy loss can easily overfit to small amounts of labeled data, as the cross-entropy loss does not explicitly encourage intra-class compactness or large margins between classes. To address this issue, we propose to first pretrain the CNN feature extractor using a pixel-wise class label-based contrastive loss (referred to as contrastive pretraining), and then fine-tune the entire network including the softmax classifier using the cross-entropy loss (referred to as softmax fine-tuning). Figures emb_pascal20_baseline and emb_pascal20_proposed show the distributions of various classes in the softmax input feature spaces of models trained with the cross-entropy loss and the proposed strategy, respectively, using 2118 labeled images from the PASCAL VOC 2012 dataset. The mean IOU values of the corresponding models on the PASCAL VOC 2012 validation dataset are 39.1 and 62.7, respectively. The class support regions are more compact and separated when trained with the proposed strategy, leading to a better performance.
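To make the pretraining objective concrete, below is a minimal Python (PyTorch) sketch of a pixel-wise label-based (supervised) contrastive loss; the function name, tensor shapes, and reduction are our assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def pixel_contrastive_loss(features, labels, temperature=0.07):
        # features: (N, D) unit-normalized pixel embeddings from the projection head.
        # labels:   (N,) integer class label for each pixel.
        logits = features @ features.t() / temperature   # pairwise similarities, (N, N)
        logits.fill_diagonal_(-1e9)                      # exclude self-similarity
        # Positives: distinct pixels that share the same class label.
        pos = (labels[:, None] == labels[None, :]).float()
        pos.fill_diagonal_(0)
        log_prob = F.log_softmax(logits, dim=1)          # contrast against all other pixels
        # Average log-probability of positives per anchor; anchors with
        # no positives contribute zero to the mean.
        loss = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
        return loss.mean()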


We use t-SNE for generating the visualizations.
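A minimal sketch of how such a visualization can be produced with scikit-learn, assuming pixel_feats and pixel_labels are hypothetical arrays of sampled softmax-input pixel features and their class labels:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    emb = TSNE(n_components=2, perplexity=30).fit_transform(pixel_feats)  # (N, 2)
    plt.scatter(emb[:, 0], emb[:, 1], c=pixel_labels, s=1, cmap='tab20')
    plt.show()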


We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings. Our main contributions are as follows.

Simple approach: We propose a simple contrastive learning-based training strategy for improving the performance of semantic segmentation models. We consider the simplicity of our training strategy as its main strength since it can be easily adopted by existing and future semantic segmentation approaches.

Strong results: We show that label-based contrastive pretraining results in large performance gains on two widely-used semantic segmentation datasets in both fully-supervised and semi-supervised settings, especially when the amount of labeled data is limited.

Detailed analyses: We show visualizations of class distributions in the feature spaces of trained models to provide insights into why the proposed training strategy works better (see Figures emb_pascal20_baseline and emb_pascal20_proposed). We also present various ablation studies that justify our design choices.


Related works

These approaches learn representations in a discriminative fashion by contrasting positive pairs against negative pairs. Recently, several approaches based on contrastive loss have been proposed for self-supervised visual representation learning. These approaches treat each instance as a class and use contrastive loss-based instance discrimination for representation learning. Specifically, they use augmented versions of an instance to form the positive pair, and other instances to form negative pairs for the contrastive loss. Noting that using a large number of negatives is crucial for the success of contrastive loss-based representation learning, various recent approaches use memory banks to store the representations. While some recent contrastive approaches have attempted to attribute their success to maximization of mutual information, it has been argued that their success cannot be attributed to the properties of mutual information alone. Recently, a supervised contrastive loss was proposed for the task of image classification. Since CNNs were introduced to solve the semantic segmentation problem, several deep CNN-based approaches have been proposed that gradually improved the performance using large amounts of pixel-level annotations. However, collecting dense pixel-level annotations is difficult and costly. To address this issue, several existing works focus on leveraging weaker forms of supervision such as bounding boxes, scribbles, points, and image-level labels, either exclusively or along with dense pixel-level supervision. While this work also focuses on improving semantic segmentation performance when the amount of pixel-level annotations is limited, we do not use any additional forms of annotations.


Instead, we propose a contrastive learning-based pretraining strategy that achieves significant performance gains without the need for any additional data. Another line of work that deals with limited labeled data includes semi-supervised approaches that leverage unlabeled images. While some of these works use generative adversarial training to leverage unlabeled images, others use pseudo labels and consistency-based regularization with various data augmentations. The proposed contrastive pretraining strategy is complementary to these approaches and can be used in conjunction with them. We demonstrate this in this paper by showing the effectiveness of contrastive pretraining in the semi-supervised setting by using pseudo labels. Recently, an approach was proposed to train the CNN feature extractor of a semantic segmentation model by maximizing the log likelihood of extracted pixel features under a mixture of vMF distributions model. During inference, they first segment the pixel features extracted from an image using spherical K-Means clustering, and then perform a k-nearest neighbor search for each segment to retrieve the labels from segments in the training set. While this approach is shown to improve the performance when compared to the widely-used pixel-wise softmax training, it is very complicated as it uses a two-stage expectation-maximization algorithm for training. In comparison, the proposed training strategy is simple, and can be easily adopted by existing and future semantic segmentation approaches.


Experiments


Datasets and metrics

PASCAL VOC 2012: This dataset consists of 10,582 training (including the additional annotations provided with the augmented training set), 1,449 validation, and 1,456 test images with pixel-level annotations for 20 foreground object classes and one background class. The performance is measured in terms of pixel Intersection-Over-Union (IOU) averaged across the 21 classes. Cityscapes: This dataset contains high quality pixel-level annotations of 5,000 images collected in street scenes from 50 different cities. Following the standard evaluation protocol, 19 semantic labels belonging to 7 super categories (ground, construction, object, nature, human, vehicle and sky) are used for evaluation, and the void label is ignored. The performance is measured in terms of pixel IOU averaged across the 19 classes. The training, validation, and test splits contain 2975, 500, and 1525 images, respectively. For both datasets, we perform experiments in both fully-supervised and semi-supervised settings, varying the amounts of labeled and unlabeled training data.
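For reference, the metric can be computed as in the following NumPy sketch; the ignore_label convention (255 for void pixels) is an assumption borrowed from the common DeepLab setup:

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_label=255):
        # pred, gt: integer arrays of predicted and ground-truth pixel labels.
        valid = gt != ignore_label            # drop void pixels from the evaluation
        pred, gt = pred[valid], gt[valid]
        ious = []
        for c in range(num_classes):          # 21 for PASCAL VOC 2012, 19 for Cityscapes
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:
                ious.append(inter / union)
        return float(np.mean(ious))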


Model architecture

Our feature extractor follows the DeepLabV3+ encoder-decoder architecture with the ResNet50-based encoder of DeepLabV3. The output spatial resolution of the feature extractor is four times lower than the input resolution. Our projection head consists of three 1×1 convolution layers with 256 channels, followed by a unit-normalization layer. The first two layers in the projection head use the ReLU activation function.
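The description above maps to a small module like the following PyTorch sketch; the input channel count is an assumption (256 matches the DeepLabV3+ decoder output), and this is an illustration rather than the authors' code:

    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionHead(nn.Module):
        # Three 1x1 convolutions with 256 channels; ReLU after the first two,
        # unit (L2) normalization over the channel dimension at the output.
        def __init__(self, in_channels=256, width=256):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, width, kernel_size=1)
            self.conv2 = nn.Conv2d(width, width, kernel_size=1)
            self.conv3 = nn.Conv2d(width, width, kernel_size=1)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            return F.normalize(self.conv3(x), dim=1)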


Training and inference

Following prior work, we use 513×513 random crops extracted from preprocessed (random left-right flipping and scaling) input images for training. All the models are trained from scratch using asynchronous stochastic gradient descent on 8 replicas with minibatches of size 16, weight decay of 4e-5, momentum of 0.9, and cosine learning rate decay. For contrastive pretraining, we use an initial learning rate of 0.1 and 300K training steps. For softmax fine-tuning, we use an initial learning rate of 0.007, with 300K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 50K training steps in other settings (we observed overfitting with longer training when the number of labeled images is low). When we use softmax training without contrastive pretraining, we use an initial learning rate of 0.03, with 600K training steps when the number of labeled images is above 2500 in the case of the PASCAL VOC 2012 dataset and above 1000 in the case of the Cityscapes dataset, and 300K training steps in other settings. The temperature parameter τ of the contrastive loss is set to 0.07 in all the experiments. We use color distortions from prior work for contrastive pretraining, and random brightness and contrast adjustments for softmax fine-tuning (using hue and saturation adjustments while training the softmax classifier resulted in a slight drop in the performance). For generating pseudo labels, we use a threshold of 0.8 for all the foreground classes of the PASCAL VOC 2012 and Cityscapes datasets, and a threshold of 0.97 for the background class of the PASCAL VOC 2012 dataset.
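The pseudo-label generation step can be sketched as follows; the function name and the background_id=0 convention (PASCAL VOC; Cityscapes has no background class, so only the foreground threshold applies there) are our assumptions:

    import torch

    def make_pseudo_labels(probs, fg_thresh=0.8, bg_thresh=0.97,
                           background_id=0, ignore_label=255):
        # probs: (C, H, W) softmax probabilities predicted for one unlabeled image.
        conf, label = probs.max(dim=0)              # per-pixel confidence and argmax class
        thresh = torch.full_like(conf, fg_thresh)
        thresh[label == background_id] = bg_thresh  # stricter threshold for background
        label[conf < thresh] = ignore_label         # low-confidence pixels are ignored
        return label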


For a 513×513 input, our feature extractor produces a 129×129 feature map. Since the memory complexity of the contrastive loss is quadratic in the number of pixels, to avoid GPU memory issues, we resize the feature map to 65×65 using bilinear resizing before computing the contrastive loss. The corresponding low-resolution label map is obtained from the original label map using nearest neighbor downsampling. For softmax training, we follow common practice and upsample the logits from 129×129 to 513×513 using bilinear resizing before computing the pixel-wise cross-entropy loss. Since the model is fully-convolutional, during inference, we directly run it on an input image and upsample the output logits to the input resolution using bilinear resizing.
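In a PyTorch-style implementation this resizing logic would look roughly as follows (tensor names are illustrative):

    import torch.nn.functional as F

    # feat: (B, D, 129, 129) feature map; labels: (B, 513, 513) ground-truth map.
    feat_small = F.interpolate(feat, size=(65, 65), mode='bilinear',
                               align_corners=False)
    labels_small = F.interpolate(labels[:, None].float(), size=(65, 65),
                                 mode='nearest').squeeze(1).long()
    # For softmax training, logits are upsampled back to the input resolution instead:
    logits_up = F.interpolate(logits, size=(513, 513), mode='bilinear',
                              align_corners=False)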


Results - Fully-supervised setting

Figures cityscapes_fs and pascal_fs show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the fully-supervised setting. Contrastive pretraining reduces the amount of labeled data needed by roughly 2× while improving the performance, and in some cases matches baselines trained with 5× more data (5295 images). These results clearly demonstrate the effectiveness of the proposed label-based contrastive pretraining. The performance improvements seen on the PASCAL VOC 2012 dataset are much higher than the improvements seen on the Cityscapes dataset. Figure visual_results shows some segmentation results of models trained with and without label-based contrastive pretraining using 2118 labeled images from the PASCAL VOC 2012 dataset. Contrastive pretraining improves the segmentation results by reducing the confusion between background and various foreground classes, and also the confusion between different foreground classes.


Results - Semi-supervised setting

Figures cityscapes_ss and pascal_ss show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by contrastive pretraining in the semi-supervised setting. Here again, contrastive pretraining reduces the amount of labeled data needed by roughly 2× while improving the performance on the PASCAL VOC 2012 dataset.


Figures cityscapes_pl_improv and pascal_pl_improv show the performance improvements on the validation splits of the Cityscapes and PASCAL VOC 2012 datasets, respectively, obtained by using pseudo labels. Though pseudo labeling is a straightforward approach for making use of unlabeled data, it gives impressive performance gains, both with and without contrastive pretraining.


Ablation studies

In this section, we perform various ablation studies under the fully-supervised setting on the Cityscapes and PASCAL VOC 2012 datasets with 596 and 2118 labeled training images, respectively.


Importance of distortions for contrastive loss

In the case of contrastive loss-based self-supervised learning, distortions are necessary to generate positive pairs. But in the case of label-based contrastive learning, positive pairs can be generated using labels, and hence it is unclear how important distortions are. In this work, we use the color distortions from a recent self-supervised learning method that worked well for the downstream task of image classification. Table distortions shows the effect of using these distortions in the contrastive pretraining stage. We can see a small performance gain on the Cityscapes dataset and no gain on the PASCAL VOC 2012 dataset (differences lower than 0.5 are too small to draw any conclusion). These results suggest that distortions that work well for image recognition may not work for semantic segmentation.
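A commonly used color-distortion recipe from self-supervised learning (SimCLR-style, which we assume is the one referred to here) can be sketched with torchvision; the strength s is an assumed value, not stated in the text:

    from torchvision import transforms

    s = 1.0  # distortion strength; assumed for illustration
    color_distort = transforms.Compose([
        transforms.RandomApply(
            [transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
    ])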


Contrastive loss variants

The pixel-wise label-based contrastive loss used in this work is first computed separately for each image and then averaged across all the images in a minibatch. We refer to this as the single image variant. An alternative option is to consider all the pixels in a minibatch as a single bag of pixels for computing the contrastive loss. We refer to this as the batch variant. Note that the memory complexity of the contrastive loss is quadratic in the number of pixels. Hence, to avoid GPU memory issues, we randomly sample 10K pixels from the entire minibatch for computing the batch variant of the contrastive loss. Table loss_variants compares the performances of these two variants.
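A sketch of the batch variant with random pixel subsampling, reusing the pixel_contrastive_loss sketch from the introduction (shapes are our assumptions):

    import torch

    def batch_variant_loss(features, labels, num_samples=10000, temperature=0.07):
        # features: (B, D, H, W) pixel embeddings; labels: (B, H, W) class labels.
        feats = features.permute(0, 2, 3, 1).reshape(-1, features.shape[1])
        labs = labels.reshape(-1)
        idx = torch.randperm(feats.shape[0])[:num_samples]  # cap the quadratic cost
        return pixel_contrastive_loss(feats[idx], labs[idx], temperature)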


Pretraining on external classification dataset

To study the effect of additional pretraining on a large-scale image classification dataset, we compare the models trained from scratch with the models trained from ImageNet-pretrained weights in Table imagenet_results. When contrastive pretraining is not used, ImageNet pretraining results in large performance gains on both the Cityscapes and PASCAL VOC 2012 datasets. However, when contrastive pretraining is used, the performance gain due to ImageNet pretraining is limited (only 1.1 points on the Cityscapes dataset and no improvement on the PASCAL VOC 2012 dataset). Also, the results in the second and third rows show that contrastive pretraining, which does not use any additional labels, outperforms ImageNet pretraining (which uses more than a million additional image labels) by 3.8 points on the PASCAL VOC 2012 dataset, and is only slightly worse (1.3 points) on the Cityscapes dataset. These results clearly demonstrate the effectiveness of contrastive pretraining in reducing the need for labeled data.


Performance on test splits

Table city_test shows the performance improvements on the test splits of the Cityscapes and PASCAL VOC 2012 datasets obtained by contrastive pretraining in the fully-supervised setting. Similar to the results on the validation splits, label-based contrastive pretraining leads to significant performance improvements on the test splits.


Conclusions and future work

Deep CNN-based semantic segmentation models trained with the cross-entropy loss easily overfit to small amounts of training data, and hence perform poorly when trained with limited labeled data. To address this issue, we proposed a simple and effective contrastive learning-based training strategy in which we first pretrain the feature extractor of the model using a pixel-wise label-based contrastive loss and then fine-tune the entire network including the softmax classifier using the cross-entropy loss. This training approach increases both intra-class compactness and inter-class separability, thereby enabling a better pixel classifier. We performed experiments on two widely-used semantic segmentation datasets, namely PASCAL VOC 2012 and Cityscapes, in both fully-supervised and semi-supervised settings. In both settings, we achieved large performance gains on both datasets by using contrastive pretraining, especially when the amount of labeled data is limited. In this work, we used a simple pseudo labeling-based approach to leverage unlabeled images in the semi-supervised setting. We thank Yukun Zhu and Liang-Chieh Chen from Google for their support with the DeepLab codebase.

