Understanding Linux Memory Management in Depth (6): Memory Compaction
Recap
The previous chapter covered the buddy system's allocation and free paths, but only under the assumption that memory was plentiful. As pages get used heavily, large numbers of fragments or to-be-reclaimed pages can accumulate, until no sufficiently large run of contiguous page frames can be found. This chapter looks at the mechanisms the kernel uses to keep allocations succeeding in that situation. Linux has two main mechanisms: memory compaction and memory reclaim, where reclaim further divides into fast reclaim, direct reclaim, and kswapd reclaim. For reasons of space, this chapter covers only memory compaction. Without further ado, let's dive in.
Data Structures
Memory compaction
/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant
 * MIGRATE_SYNC will block when migrating pages
 * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
 *	with the CPU. Instead, page copy happens outside the migratepage()
 *	callback and is likely using a DMA engine. See migrate_vma() and HMM
 *	(mm/hmm.c) for users of this mode.
 */
enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY,
};
Only three page-frame types participate in compaction: MIGRATE_MOVABLE, MIGRATE_CMA and MIGRATE_RECLAIMABLE. Compaction runs in one of four modes:
Asynchronous mode (MIGRATE_ASYNC): no blocking operation is allowed; as soon as blocking or rescheduling would be required, compaction stops. In this mode only MIGRATE_MOVABLE and MIGRATE_CMA page frames are processed, never MIGRATE_RECLAIMABLE ones, because those are mostly file pages, and compacting file pages may involve dirty-page writeback, which can block.
Light-synchronous mode (MIGRATE_SYNC_LIGHT): most blocking operations are allowed, but not waiting on writeback of dirty file pages, since writeback can take a very long time.
Synchronous mode (MIGRATE_SYNC): blocking is allowed while migrating page frames, i.e. the run may wait for page writeback to complete before returning a result. This is the most expensive mode. It scans the whole zone and does not skip pageblocks marked with the PB_migrate_skip bit.
No-copy synchronous mode (MIGRATE_SYNC_NO_COPY): like synchronous mode, blocking is allowed during migration, but the CPU does not copy the pages; the copy happens outside the migratepage() callback (typically via a DMA engine).
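As an aside, the mode/migratetype restriction above can be sketched as a tiny userspace model. Everything below (the `MT_*` names and `suitable_source()`) is invented for illustration and is not a kernel API; it only mirrors the rule that async compaction avoids reclaimable (mostly file-backed) pageblocks.

```c
#include <assert.h>
#include <stdbool.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC,
		    MIGRATE_SYNC_NO_COPY };
/* Illustrative stand-ins for the kernel's pageblock migratetypes */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA };

/* Which pageblock types a compaction pass will consider as sources:
 * unmovable blocks never, and async mode additionally avoids
 * reclaimable blocks, whose migration might trigger blocking writeback. */
static bool suitable_source(enum migrate_mode mode, enum migratetype mt)
{
	if (mt == MT_UNMOVABLE)
		return false;		/* never compactable */
	if (mode == MIGRATE_ASYNC)
		return mt == MT_MOVABLE || mt == MT_CMA;
	return true;			/* sync modes also take reclaimable */
}
```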
struct zone {
	...
#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 * compact_order_failed is the minimum compaction failed order.
	 */
	// Deferral counter: once the number of deferred attempts reaches
	// 1 << compact_defer_shift, no further deferral is allowed
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	// Lowest order at which compaction of this zone is expected to fail.
	// If the current order >= compact_order_failed, deferral is allowed
	// (to raise the compaction success rate); below it, compaction starts
	// right away.
	// On success, compact_order_failed is set to order + 1;
	// on failure, it is set to order.
	int			compact_order_failed;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	// Cached start pfn for the free-page scanner
	unsigned long		compact_cached_free_pfn;
	/* pfn where compaction migration scanner should start */
	// Cached start pfn for the migration scanner, one slot each for
	// async (0) and sync (1) runs
	unsigned long		compact_cached_migrate_pfn[ASYNC_AND_SYNC];
	// Initial pfn for the migration scanner
	unsigned long		compact_init_migrate_pfn;
	// Initial pfn for the free scanner
	unsigned long		compact_init_free_pfn;
#endif
	...
};
Because compaction is expensive, Linux implements several ways to cut its cost, mainly: 1. run compaction less often; 2. skip pages that were already scanned. The deferral mechanism postpones compaction attempts that are predicted to fail; during scanning, compaction distinguishes whole-zone scans from partial scans, and a partial scan resumes from the positions cached by the previous run, skipping pages that were already scanned.
struct compact_control {
	// Free pages collected while scanning pageblocks
	struct list_head freepages;	/* List of free pages to migrate to */
	// In-use pages to be migrated away
	struct list_head migratepages;	/* List of pages being migrated */
	// Number of pages on freepages
	unsigned int nr_freepages;	/* Number of isolated free pages */
	// Number of pages on migratepages
	unsigned int nr_migratepages;	/* Number of pages to migrate */
	// Start pfn for isolating free pages
	unsigned long free_pfn;		/* isolate_freepages search base */
	// Start pfn for isolating migratable pages
	unsigned long migrate_pfn;	/* isolate_migratepages search base */
	// Start pfn for the fast search
	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
	// The zone being compacted
	struct zone *zone;
	// Pages scanned so far by the migration scanner
	unsigned long total_migrate_scanned;
	// Pages scanned so far by the free scanner
	unsigned long total_free_scanned;
	unsigned short fast_search_fail;/* failures to use free list searches */
	// Order used for the fast search
	short search_order;		/* order to start a fast search at */
	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
	// The order the failed allocation actually needs
	int order;			/* order a direct compactor needs */
	// Migratetype of the current allocation
	int migratetype;		/* migratetype of direct compactor */
	// Allocation flags of the current allocation
	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
	// Highest zone index the current allocation may use
	const int highest_zoneidx;	/* zone index of a direct compactor */
	// Compaction mode for this run
	enum migrate_mode mode;		/* Async or sync migration mode */
	// Whether to ignore the pageblock skip hints
	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
	bool no_set_skip_hint;		/* Don't mark blocks for skipping */
	// Whether to scan blocks even if they look unsuitable
	bool ignore_block_suitable;	/* Scan blocks considered unsuitable */
	// True for direct compaction; false when run from kcompactd or
	// triggered through /proc
	bool direct_compaction;		/* False from kcompactd or /proc/... */
	// True for kcompactd's proactive compaction
	bool proactive_compaction;	/* kcompactd proactive compaction */
	// Whether the whole zone is being scanned
	bool whole_zone;		/* Whole zone should/has been scanned */
};
This is the control structure for compaction. The to-be-migrated frames and the free frames found by the scanners are isolated out of the buddy system. A fully successful compaction run moves every to-be-migrated frame into the free frames. Since one run may still not yield enough contiguous memory, several runs may be needed.
Algorithm
(Figure: memory compaction)
The figure above is a simplified model of compaction. Suppose an allocation requests 4 contiguous page frames; the zone has enough free frames, but fragmentation is so severe that no 4 of them are contiguous. After compaction, the request for 4 contiguous frames can be satisfied. The rest of this section walks through the compaction flow in detail, in combination with the data structures above.
__alloc_pages_direct_compact
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, enum compact_result *compact_result)
{
	struct page *page = NULL;
	unsigned long pflags;
	unsigned int noreclaim_flag;

	// An order-0 request needs no compaction
	if (!order)
		return NULL;

	// Try to compact memory
	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
								prio, &page);

	if (page)
		// Prepare the newly obtained page
		prep_new_page(page, order, gfp_mask, alloc_flags);

	/* Try get a page from the freelist if available */
	// Even if compaction did not hand back a page directly, retry the
	// freelist (a fallback allocation may now succeed)
	if (!page)
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (page) {
		struct zone *zone = page_zone(page);

		// Why set this to false??? (presumably so the skip hints are
		// not flushed right away after a success)
		zone->compact_blockskip_flush = false;
		// On success, compact_considered and compact_defer_shift are
		// reset to 0; if compact_order_failed <= order, it is set to
		// order + 1
		compaction_defer_reset(zone, order, true);
		count_vm_event(COMPACTSUCCESS);
		return page;
	}

	...

	return NULL;
}
This function is the entry point of direct compaction. Its parameters are the allocation order, the allocation flags, the allocation context, the compaction priority, and an out-parameter for the compaction result. The first few have the same meaning as in the allocation path; the priority decides which compaction mode this run uses. Note that the function contains both a compaction flow and an allocation flow, i.e. compaction and allocation are bound together. The triggers for compaction can be summarized as:
After the kswapd task reclaims memory, compaction may be triggered.
The slow allocation path finds that the zone cannot provide enough contiguous page frames.
A manual trigger via /proc/sys/vm/compact_memory.
try_to_compact_pages
// mm/compaction.c
/**
 * try_to_compact_pages - Direct compact to satisfy a high-order allocation
 * @gfp_mask: The GFP mask of the current allocation
 * @order: The order of the current allocation
 * @alloc_flags: The allocation flags of the current allocation
 * @ac: The context of current allocation
 * @prio: Determines how hard direct compaction should try to succeed
 * @capture: Pointer to free page created by compaction will be stored here
 *
 * This is the main entry point for direct page compaction.
 */
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, struct page **capture)
{
	int may_perform_io = gfp_mask & __GFP_IO;
	struct zoneref *z;
	struct zone *zone;
	// The default result is "skipped"
	enum compact_result rc = COMPACT_SKIPPED;

	/*
	 * Check if the GFP flags allow compaction - GFP_NOIO is really
	 * tricky context because the migration might require IO
	 */
	// If IO is forbidden, do not compact at all (migrating pages in a
	// no-IO context could deadlock)
	if (!may_perform_io)
		return COMPACT_SKIPPED;

	/* Compact each zone in the list */
	// Walk every zone in the zonelist
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->highest_zoneidx, ac->nodemask) {
		enum compact_result status;

		if (prio > MIN_COMPACT_PRIORITY
					&& compaction_deferred(zone, order)) {
			// COMPACT_DEFERRED means the attempt was postponed
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;
		}

		// Compact this zone, then try to allocate from it
		status = compact_zone_order(zone, order, gfp_mask, prio,
				alloc_flags, ac->highest_zoneidx, capture);
		rc = max(status, rc);

		/* The allocation should succeed, stop compacting */
		if (status == COMPACT_SUCCESS) {
			/*
			 * We think the allocation will succeed in this zone,
			 * but it is not certain, hence the false. The caller
			 * will repeat this with true if allocation indeed
			 * succeeds in this zone.
			 */
			// Compaction finished and the zone's free frames now
			// cover the request, but they are not guaranteed to be
			// contiguous, so success is not certain, hence "false".
			// compact_order_failed becomes order + 1, so requests
			// below that order will skip deferral next time.
			compaction_defer_reset(zone, order, false);

			break;
		}

		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
					status == COMPACT_PARTIAL_SKIPPED))
			/*
			 * We think that allocation won't succeed in this zone
			 * so we defer compaction there. If it ends up
			 * succeeding after all, it will be reset.
			 */
			defer_compaction(zone, order);

		/*
		 * We might have stopped compacting due to need_resched() in
		 * async compaction, or due to a fatal signal detected. In that
		 * case do not try further zones
		 */
		// Async mode must not block: if the current task needs to be
		// rescheduled, abort the whole compaction pass
		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
					|| fatal_signal_pending(current))
			break;
	}

	return rc;
}
The function walks the zone list and evaluates each zone's state. If a zone is unsuitable for compaction, it is deferred, i.e. skipped. If the zone qualifies, it is compacted, and an allocation is attempted afterwards. The pass terminates once one zone's compaction succeeds, when async mode would need to reschedule, or when a fatal signal is pending. The end conditions can be summarized as:
The zone already has enough contiguous memory, so no compaction is needed; the allocation can be served from this zone directly and the pass ends.
The zone cannot allocate while respecting the watermark, so it is skipped. In async mode, if rescheduling is needed, the pass ends immediately; otherwise it ends after all zones have been visited.
After migrating page frames in a zone, if compaction succeeded and memory is then allocated, the pass ends. If the allocation still fails, the next zone is tried, until all zones have been visited.
compaction_deferred
// mm/compaction.c
/* Returns true if compaction should be skipped this time */
bool compaction_deferred(struct zone *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	// compact_order_failed records the lowest order at which compaction
	// has failed; only at or above it is compaction expected to fail, so
	// only those orders may be deferred (to avoid the cost of a likely
	// failure)
	if (order < zone->compact_order_failed)
		return false;

	/* Avoid possible overflow */
	// Deferral counter: once the number of deferrals reaches the preset
	// limit (1 << zone->compact_defer_shift), stop deferring
	if (++zone->compact_considered >= defer_limit) {
		// avoid overflow
		zone->compact_considered = defer_limit;
		return false;
	}

	return true;
}
The deferral helper: compaction is deferred only when both of the following hold:
The current order is >= compact_order_failed, meaning this attempt would likely fail too. compact_order_failed is set to order when compaction fails and to order + 1 when it succeeds; it is never set back to 0.
The deferral counter has not reached the limit; the counter is compact_considered and the limit is 1 << compact_defer_shift. Each time compaction fails and is deferred (defer_compaction), the counter resets to 0 and compact_defer_shift is incremented; when compaction succeeds and the page-frame allocation succeeds as well, both values are reset to 0.
The deferral policy raises the compaction success rate and reduces the performance lost to failed compaction runs.
compact_zone_order
// mm/compaction.c
static enum compact_result compact_zone_order(struct zone *zone, int order,
		gfp_t gfp_mask, enum compact_priority prio,
		unsigned int alloc_flags, int highest_zoneidx,
		struct page **capture)
{
	enum compact_result ret;
	struct compact_control cc = {
		.order = order,
		.search_order = order,
		.gfp_mask = gfp_mask,
		.zone = zone,
		// Async priority uses async mode; any other priority uses
		// light-sync mode
		.mode = (prio == COMPACT_PRIO_ASYNC) ?
					MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
		.alloc_flags = alloc_flags,
		.highest_zoneidx = highest_zoneidx,
		.direct_compaction = true,
		// At the lowest (most aggressive) priority, scan the whole
		// zone and do not skip pageblocks marked PB_migrate_skip
		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
	};
	struct capture_control capc = {
		.cc = &cc,
		.page = NULL,
	};

	/*
	 * Make sure the structs are really initialized before we expose the
	 * capture control, in case we are interrupted and the interrupt handler
	 * frees a page.
	 */
	barrier();
	WRITE_ONCE(current->capture_control, &capc);

	// Compact the zone
	ret = compact_zone(&cc, &capc);

	VM_BUG_ON(!list_empty(&cc.freepages));
	VM_BUG_ON(!list_empty(&cc.migratepages));

	/*
	 * Make sure we hide capture control first before we read the captured
	 * page pointer, otherwise an interrupt could free and capture a page
	 * and we would leak it.
	 */
	WRITE_ONCE(current->capture_control, NULL);
	*capture = READ_ONCE(capc.page);

	return ret;
}
This builds the compact_control for the run: async priority selects async mode; any non-async priority selects light-sync mode.
compact_zone
// mm/compaction.c
static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
	enum compact_result ret;
	// Start of the zone: where the migration scanner begins
	unsigned long start_pfn = cc->zone->zone_start_pfn;
	// End of the zone: where the free scanner begins
	unsigned long end_pfn = zone_end_pfn(cc->zone);
	unsigned long last_migrated_pfn;
	// Whether this is a (light-)sync run
	const bool sync = cc->mode != MIGRATE_ASYNC;
	bool update_cached;

	/*
	 * These counters track activities during zone compaction. Initialize
	 * them before compacting a new zone.
	 */
	// Initialize the compaction control state
	cc->total_migrate_scanned = 0;
	cc->total_free_scanned = 0;
	cc->nr_migratepages = 0;
	cc->nr_freepages = 0;
	INIT_LIST_HEAD(&cc->freepages);
	INIT_LIST_HEAD(&cc->migratepages);

	// Migratetype of the current allocation
	cc->migratetype = gfp_migratetype(cc->gfp_mask);
	// Check whether this zone is suitable for compaction
	ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
							cc->highest_zoneidx);
	/* Compaction is likely to fail */
	// COMPACT_SUCCESS here means there is already enough memory, so no
	// compaction is needed; COMPACT_SKIPPED means there is too little
	// free memory, so compaction cannot work either
	if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
		return ret;

	/* huh, compaction_suitable is returning something unexpected */
	VM_BUG_ON(ret != COMPACT_CONTINUE);

	/*
	 * Clear pageblock skip if there were failures recently and compaction
	 * is about to be retried after being deferred.
	 */
	// If both the deferral count and the limit have maxed out, and the
	// zone recently had a successful full scan, reset the zone's cached
	// state: compact_init_migrate_pfn, compact_cached_migrate_pfn,
	// compact_cached_free_pfn, etc. As I understand it, this forces a
	// whole-zone scan to raise the success rate.
	if (compaction_restarting(cc->zone, cc->order))
		__reset_isolation_suitable(cc->zone);

	/*
	 * Setup to move all movable pages to the end of the zone. Used cached
	 * information on where the scanners should start (unless we explicitly
	 * want to compact the whole zone), but check that it is initialised
	 * by ensuring the values are within zone boundaries.
	 */
	cc->fast_start_pfn = 0;
	// For a whole-zone scan, point the migration scanner at the zone's
	// first page frame and the free scanner at the start of the last
	// pageblock (sync mode scans the whole zone)
	if (cc->whole_zone) {
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
	} else {
		// Partial scan (async or light-sync): resume from the cached
		// start positions
		cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = cc->zone->compact_cached_free_pfn;
		// If a cached value is no longer valid, reset it
		if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
			cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
			cc->zone->compact_cached_free_pfn = cc->free_pfn;
		}
		if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
			cc->migrate_pfn = start_pfn;
			cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
			cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
		}

		// If the migration scanner starts from the very beginning,
		// this is effectively a whole-zone scan, so mark it as such
		if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
			cc->whole_zone = true;
	}

	last_migrated_pfn = 0;

	/*
	 * Migrate has separate cached PFNs for ASYNC and SYNC* migration on
	 * the basis that some migrations will fail in ASYNC mode. However,
	 * if the cached PFNs match and pageblocks are skipped due to having
	 * no isolation candidates, then the sync state does not matter.
	 * Until a pageblock with isolation candidates is found, keep the
	 * cached PFNs in sync to avoid revisiting the same blocks.
	 */
	update_cached = !sync &&
		cc->zone->compact_cached_migrate_pfn[0] ==
		cc->zone->compact_cached_migrate_pfn[1];

	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync);

	migrate_prep_local();

	// Decide whether compaction has finished. It ends when:
	// 1. the migration scanner and the free scanner have met;
	// 2. for proactive compaction (as I understand it, only the
	//    kcompactd case): if kswapd is also reclaiming on this node,
	//    stop; otherwise keep compacting until the user-configured
	//    fragmentation score is reached;
	// 3. enough free frames have appeared for the request, or the
	//    request is for movable frames and CMA has enough room.
	// Note: compaction triggered manually through /proc only checks the
	// first condition.
	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
		int err;
		unsigned long start_pfn = cc->migrate_pfn;

		/*
		 * Avoid multiple rescans which can happen if a page cannot be
		 * isolated (dirty/writeback in async mode) or if the migrated
		 * pages are being allocated before the pageblock is cleared.
		 * The first rescan will capture the entire pageblock for
		 * migration. If it fails, it'll be marked skip and scanning
		 * will proceed as normal.
		 */
		cc->rescan = false;
		if (pageblock_start_pfn(last_migrated_pfn) ==
		    pageblock_start_pfn(start_pfn)) {
			cc->rescan = true;
		}

		// Scan for movable page frames, detach them from the LRU and
		// collect them on migratepages
		switch (isolate_migratepages(cc)) {
		case ISOLATE_ABORT:
			// Isolation was aborted: put the pages on
			// migratepages back onto the LRU
			ret = COMPACT_CONTENDED;
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			goto out;
		case ISOLATE_NONE:
			// No movable page frame was found in this pageblock
			if (update_cached) {
				cc->zone->compact_cached_migrate_pfn[1] =
					cc->zone->compact_cached_migrate_pfn[0];
			}

			/*
			 * We haven't isolated and migrated anything, but
			 * there might still be unflushed migrations from
			 * previous cc->order aligned block.
			 */
			goto check_drain;
		case ISOLATE_SUCCESS:
			update_cached = false;
			last_migrated_pfn = start_pfn;
			;
		}

		// Migrate the pages on migratepages into free pages
		err = migrate_pages(&cc->migratepages, compaction_alloc,
				compaction_free, (unsigned long)cc, cc->mode,
				MR_COMPACTION);

		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
							&cc->migratepages);

		/* All pages were either migrated or will be released */
		cc->nr_migratepages = 0;
		if (err) {
			// Migration failed: put the remaining movable pages
			// back onto the LRU
			putback_movable_pages(&cc->migratepages);
			/*
			 * migrate_pages() may return -ENOMEM when scanners meet
			 * and we want compact_finished() to detect it
			 */
			if (err == -ENOMEM && !compact_scanners_met(cc)) {
				ret = COMPACT_CONTENDED;
				goto out;
			}
			/*
			 * We failed to migrate at least one page in the current
			 * order-aligned block, so skip the rest of it.
			 */
			if (cc->direct_compaction &&
						(cc->mode == MIGRATE_ASYNC)) {
				cc->migrate_pfn = block_end_pfn(
						cc->migrate_pfn - 1, cc->order);
				/* Draining pcplists is useless in this case */
				last_migrated_pfn = 0;
			}
		}

check_drain:
		/*
		 * Has the migration scanner moved away from the previous
		 * cc->order aligned block where we migrated from? If yes,
		 * flush the pages that were freed, so that they can merge and
		 * compact_finished() can detect immediately if allocation
		 * would succeed.
		 */
		if (cc->order > 0 && last_migrated_pfn) {
			unsigned long current_block_start =
				block_start_pfn(cc->migrate_pfn, cc->order);

			if (last_migrated_pfn < current_block_start) {
				lru_add_drain_cpu_zone(cc->zone);
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}

		/* Stop if a page has been captured */
		if (capc && capc->page) {
			ret = COMPACT_SUCCESS;
			break;
		}
	}

out:
	/*
	 * Release free pages and update where the free scanner should restart,
	 * so we don't leave any returned pages behind in the next attempt.
	 */
	// If the free list still holds pages, release them back to the buddy
	// system and update the cached free-scanner position
	if (cc->nr_freepages > 0) {
		unsigned long free_pfn = release_freepages(&cc->freepages);

		cc->nr_freepages = 0;
		VM_BUG_ON(free_pfn == 0);
		/* The cached pfn is always the first in a pageblock */
		free_pfn = pageblock_start_pfn(free_pfn);
		/*
		 * Only go back, not forward. The cached pfn might have been
		 * already reset to zone end in compact_finished()
		 */
		if (free_pfn > cc->zone->compact_cached_free_pfn)
			cc->zone->compact_cached_free_pfn = free_pfn;
	}

	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync, ret);

	return ret;
}
This is the core function of compaction, and it operates on a single zone. It does three things: 1. decide whether the zone is suitable for compaction; 2. isolate the page frames to be migrated; 3. carry out the migration. (Goodness, one function packing in this much work :joy:) The three parts are analysed one by one below.
Deciding whether the zone is suitable for compaction
// mm/compaction.c
enum compact_result compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx)
{
	enum compact_result ret;
	int fragindex;

	ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
				    zone_page_state(zone, NR_FREE_PAGES));
	/*
	 * fragmentation index determines if allocation failures are due to
	 * low memory or external fragmentation
	 *
	 * index of -1000 would imply allocations might succeed depending on
	 * watermarks, but we already failed the high-order watermark check
	 * index towards 0 implies failure is due to lack of memory
	 * index towards 1000 implies failure is due to fragmentation
	 *
	 * Only compact if a failure would be due to fragmentation. Also
	 * ignore fragindex for non-costly orders where the alternative to
	 * a successful reclaim/compaction is OOM. Fragindex and the
	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
	 * excessive compaction for costly orders, but it should not be at the
	 * expense of system stability.
	 */
	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
		// "Score" the current fragmentation:
		// 1. a score towards 0 means the allocation failed for lack
		//    of memory;
		// 2. a score towards 1000 means it failed because of
		//    fragmentation.
		// Compaction only makes sense in the fragmentation case.
		fragindex = fragmentation_index(zone, order);
		// sysctl_extfrag_threshold is set via the virtual file system
		// (/proc/sys/vm/extfrag_threshold), range 0~1000; compaction
		// only proceeds when the index exceeds it
		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
			ret = COMPACT_NOT_SUITABLE_ZONE;
	}

	trace_mm_compaction_suitable(zone, order, ret);
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;

	return ret;
}

/*
 * compaction_suitable: Is this suitable to run compaction on this zone now?
 * Returns
 *   COMPACT_SKIPPED  - If there are too few free pages for compaction
 *   COMPACT_SUCCESS  - If the allocation would succeed without compaction
 *   COMPACT_CONTINUE - If compaction should run now
 */
static enum compact_result __compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx,
					unsigned long wmark_target)
{
	unsigned long watermark;

	// Compaction triggered through the virtual file system
	// (/proc/sys/vm/compact_memory) is forced
	if (is_via_compact_memory(order))
		return COMPACT_CONTINUE;

	// Watermark implied by alloc_flags, i.e. the minimum number of free
	// pages required, usually the MIN watermark
	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	/*
	 * If watermarks for high-order allocation are already met, there
	 * should be no need for compaction at all.
	 */
	// If the zone can already serve a 2^order allocation while staying
	// above the watermark, no compaction is needed
	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
								alloc_flags))
		return COMPACT_SUCCESS;

	/*
	 * Watermarks for order-0 must be met for compaction to be able to
	 * isolate free pages for migration targets. This means that the
	 * watermark and alloc_flags have to match, or be more pessimistic than
	 * the check in __isolate_free_page(). We don't use the direct
	 * compactor's alloc_flags, as they are not relevant for freepage
	 * isolation. We however do use the direct compactor's highest_zoneidx
	 * to skip over zones where lowmem reserves would prevent allocation
	 * even if compaction succeeds.
	 * For costly orders, we require low watermark instead of min for
	 * compaction to proceed to increase its chances.
	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
	 * suitable migration targets
	 */
	// At this point the allocation cannot be served at the watermark
	// implied by alloc_flags; pick the low or min watermark depending
	// on the order
	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	// Add the headroom compaction itself needs as migration targets
	// (compact_gap(order), i.e. twice the request size)
	watermark += compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
						ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;

	return COMPACT_CONTINUE;
}
Compaction is unnecessary in the following cases:
The zone can already serve enough contiguous page frames while respecting the required watermark, so there is no point paying for compaction. This presumably covers the case where the fast path failed earlier, but other tasks freed pages while we were in the slow path.
Failing case 1, the zone cannot allocate even a single page frame at the low or min watermark (plus the compaction gap): the zone has essentially no free pages left, so compaction is pointless too, as there would be nowhere to migrate pages to.
Failing cases 1 and 2, the fragmentation is scored. We must estimate whether the allocation failure was caused by lack of memory or by fragmentation; compaction only makes sense in the latter case. The user-set threshold (/proc/sys/vm/extfrag_threshold) is also honored: a large value means the user wants little compaction (performance first); a small value means the user wants compaction to run readily (memory first).
Outside these three cases, compaction proceeds, whether triggered manually or by an allocation failure.
Isolating the page frames to migrate
// mm/compaction.c
/*
 * Isolate all pages that can be migrated from the first suitable block,
 * starting at the block pointed to by the migrate scanner pfn within
 * compact_control.
 */
static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
	unsigned long block_start_pfn;
	unsigned long block_end_pfn;
	unsigned long low_pfn;
	struct page *page;
	// Whether unevictable page frames may be isolated; in non-fully-sync
	// modes, prefer pages whose migration will not block
	const isolate_mode_t isolate_mode =
		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
	bool fast_find_block;

	/*
	 * Start at where we last stopped, or beginning of the zone as
	 * initialized by compact_zone(). The first failure will use
	 * the lowest PFN as the starting point for linear scanning.
	 */
	low_pfn = fast_find_migrateblock(cc);
	block_start_pfn = pageblock_start_pfn(low_pfn);
	if (block_start_pfn < cc->zone->zone_start_pfn)
		block_start_pfn = cc->zone->zone_start_pfn;

	/*
	 * fast_find_migrateblock marks a pageblock skipped so to avoid
	 * the isolation_suitable check below, check whether the fast
	 * search was successful.
	 */
	fast_find_block = low_pfn != cc->migrate_pfn &&
			!cc->fast_search_fail;

	/* Only scan within a pageblock boundary */
	block_end_pfn = pageblock_end_pfn(low_pfn);

	/*
	 * Iterate over whole pageblocks until we find the first suitable.
	 * Do not cross the free scanner.
	 */
	// Scan one pageblock at a time, over [block_start_pfn, block_end_pfn]
	for (; block_end_pfn <= cc->free_pfn;
			fast_find_block = false,
			low_pfn = block_end_pfn,
			block_start_pfn = block_end_pfn,
			block_end_pfn += pageblock_nr_pages) {

		/*
		 * This can potentially iterate a massively long zone with
		 * many pageblocks unsuitable, so periodically check if we
		 * need to schedule.
		 */
		// Avoid hogging the CPU: reschedule after every 32 pageblocks
		// scanned
		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
			cond_resched();

		// Get the first page of the pageblock
		// [block_start_pfn, block_end_pfn]
		page = pageblock_pfn_to_page(block_start_pfn,
						block_end_pfn, cc->zone);
		// If the page is unusable, move on to the next pageblock
		if (!page)
			continue;

		/*
		 * If isolation recently failed, do not retry. Only check the
		 * pageblock once. COMPACT_CLUSTER_MAX causes a pageblock
		 * to be visited multiple times. Assume skip was checked
		 * before making it "skip" so other compaction instances do
		 * not scan the same block.
		 */
		// If the PB_migrate_skip hint is ignored (sync mode, the
		// kcompactd task, manual trigger, or range allocation),
		// proceed without checking; otherwise skip pageblocks whose
		// PB_migrate_skip bit is set
		if (IS_ALIGNED(low_pfn, pageblock_nr_pages) &&
		    !fast_find_block && !isolation_suitable(cc, page))
			continue;

		/*
		 * For async compaction, also only scan in MOVABLE blocks
		 * without huge pages. Async compaction is optimistic to see
		 * if the minimum amount of work satisfies the allocation.
		 * The cached PFN is updated as it's possible that all
		 * remaining blocks between source and target are unsuitable
		 * and the compaction scanners fail to meet.
		 */
		// 1. A compound page at least as big as a pageblock holds no
		//    fragmentation, so skip it.
		// 2. In async mode, blocking is forbidden, so only
		//    MIGRATE_MOVABLE or MIGRATE_CMA pageblocks are handled:
		//    for a MIGRATE_MOVABLE request the block must be MOVABLE
		//    or CMA; for any other request the block's migratetype
		//    must match the request's, otherwise skip it.
		// 3. In non-async modes, or for kcompactd/manual runs, the
		//    block goes on to be isolated.
		if (!suitable_migration_source(cc, page)) {
			// Update the cached scan position when skip hints are
			// in use
			update_cached_migrate(cc, block_end_pfn);
			continue;
		}

		/* Perform the isolation */
		// Do the isolation: pick the in-use pages out of
		// [low_pfn, block_end_pfn) and put them on cc->migratepages;
		// the return value is the last pfn scanned and handled
		low_pfn = isolate_migratepages_block(cc, low_pfn,
						block_end_pfn, isolate_mode);

		// If the node already has too many isolated page frames,
		// isolation aborts in three situations:
		// 1. cc still holds pages that have not been migrated;
		// 2. the mode is async;
		// 3. a fatal signal is pending.
		if (!low_pfn)
			return ISOLATE_ABORT;

		/*
		 * Either we isolated something and proceed with migration. Or
		 * we failed and compact_zone should decide if we should
		 * continue or not.
		 */
		break;
	}

	/* Record where migration scanner will be restarted. */
	// Remember where the next scan should resume
	cc->migrate_pfn = low_pfn;

	// Success if any page was isolated, otherwise nothing was found
	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
// mm/compaction.c
/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *				  a single pageblock
 * @cc:		Compaction control structure.
 * @low_pfn:	The first PFN to isolate
 * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	pg_data_t *pgdat = cc->zone->zone_pgdat;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;
	bool skip_on_failure = false;
	unsigned long next_skip_pfn = 0;
	bool skip_updated = false;

	/*
	 * Ensure that there are not too many pages isolated from the LRU
	 * list by either parallel reclaimers or compaction. If there are,
	 * delay for some time until fewer pages are isolated
	 */
	// Balance against reclaim: do not isolate too many pages at once
	while (unlikely(too_many_isolated(pgdat))) {
		/* stop isolation if there are still pages not migrated */
		if (cc->nr_migratepages)
			return 0;

		/* async migration should just abort */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	cond_resched();

	// For direct compaction in async mode (not kcompactd or manual
	// runs), skip the rest of an order-aligned block once isolation
	// fails within it
	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
		skip_on_failure = true;
		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
	}

	/* Time to isolate some pages for migration */
	for (; low_pfn < end_pfn; low_pfn++) {

		if (skip_on_failure && low_pfn >= next_skip_pfn) {
			/*
			 * We have isolated all migration candidates in the
			 * previous order-aligned block, and did not skip it due
			 * to failure. We should migrate the pages now and
			 * hopefully succeed compaction.
			 */
			if (nr_isolated)
				break;

			/*
			 * We failed to isolate in the previous order-aligned
			 * block. Set the new boundary to the end of the
			 * current block. Note we can't simply increase
			 * next_skip_pfn by 1 << order, as low_pfn might have
			 * been incremented by a higher number due to skipping
			 * a compound or a high-order buddy page in the
			 * previous loop iteration.
			 */
			next_skip_pfn = block_end_pfn(low_pfn, cc->order);
		}

		/*
		 * Periodically drop the lock (if held) regardless of its
		 * contention, to give chance to IRQs. Abort completely if
		 * a fatal signal is pending.
		 */
		// On a fatal signal, drop the lock here and abort the scan
		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(&pgdat->lru_lock,
					    flags, &locked, cc)) {
			low_pfn = 0;
			goto fatal_pending;
		}

		// Invalid page frame number
		if (!pfn_valid_within(low_pfn))
			goto isolate_fail;
		// Valid page frame: count it as scanned
		nr_scanned++;

		// Page descriptor for this pfn
		page = pfn_to_page(low_pfn);

		/*
		 * Check if the pageblock has already been marked skipped.
		 * Only the aligned PFN is checked as the caller isolates
		 * COMPACT_CLUSTER_MAX at a time so the second call must
		 * not falsely conclude that the block should be skipped.
		 */
		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
				low_pfn = end_pfn;
				goto isolate_abort;
			}
			valid_page = page;
		}

		/*
		 * Skip if free. We read page order here without zone lock
		 * which is generally unsafe, but the race window is small and
		 * the worst thing that can happen is that we skip some
		 * potential isolation targets.
		 */
		// A page still in the buddy system is unused, so skip it;
		// migration only concerns pages in use
		if (PageBuddy(page)) {
			unsigned long freepage_order = buddy_order_unsafe(page);

			/*
			 * Without lock, we cannot be sure that what we got is
			 * a valid page order. Consider only values in the
			 * valid order range to prevent low_pfn overflow.
			 */
			// Step over the whole free buddy chunk
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted unless we are attempting
		 * an allocation much larger than the huge page size (eg CMA).
		 * We can potentially save a lot of iterations if we skip them
		 * at once. The check is racy, but we can consider only valid
		 * values and the only danger is skipping too much.
		 */
		// Transparent or ordinary huge pages are also skipped, unless
		// we are allocating a contiguous range (alloc_contig)
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

		/*
		 * Check may be lockless but that's ok as we recheck later.
		 * It's possible to migrate LRU and non-lru movable pages.
		 * Skip any other type of page
		 */
		// Reaching here means the page is in use. Usually a non-LRU
		// page is already isolated or unmovable, but some non-LRU
		// pages are nevertheless movable (__PageMovable driver pages)
		if (!PageLRU(page)) {
			/*
			 * __PageMovable can return false positive so we need
			 * to verify it under page_lock.
			 */
			// A movable, not-yet-isolated page should be isolated
			if (unlikely(__PageMovable(page)) &&
					!PageIsolated(page)) {
				if (locked) {
					spin_unlock_irqrestore(&pgdat->lru_lock,
									flags);
					locked = false;
				}

				// If the page turns out not to be movable, is
				// already isolated, or is being freed, take
				// the failure path; otherwise isolate it and
				// mark it as isolated
				if (!isolate_movable_page(page, isolate_mode))
					goto isolate_success;
			}

			goto isolate_fail;
		}

		/*
		 * Migration will fail if an anonymous page is pinned in memory,
		 * so avoid taking lru_lock and isolating it unnecessarily in an
		 * admittedly racy check.
		 */
		// An anonymous page with more references than mappings is
		// pinned and must not be migrated, so skip it
		if (!page_mapping(page) &&
		    page_count(page) > page_mapcount(page))
			goto isolate_fail;

		/*
		 * Only allow to migrate anonymous pages in GFP_NOFS context
		 * because those do not depend on fs locks.
		 */
		// In a GFP_NOFS context only anonymous pages may be migrated,
		// since fs locks must not be taken
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			goto isolate_fail;

		/* If we already hold the lock, we can skip some rechecking */
		// Take the lock if not yet held, and redo some checks
		if (!locked) {
			locked = compact_lock_irqsave(&pgdat->lru_lock,
								&flags, cc);

			/* Try get exclusive access under lock */
			if (!skip_updated) {
				skip_updated = true;
				if (test_and_set_skip(cc, page, low_pfn))
					goto isolate_abort;
			}

			/* Recheck PageLRU and PageCompound under lock */
			if (!PageLRU(page))
				goto isolate_fail;

			/*
			 * Page become compound since the non-locked check,
			 * and it's on LRU. It can only be a THP so the order
			 * is safe to read and it's 0 for tail pages.
			 */
			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
				low_pfn += compound_nr(page) - 1;
				goto isolate_fail;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, pgdat);

		/* Try isolate the page */
		// Detach the page from the LRU and clear its LRU flag
		if (__isolate_lru_page(page, isolate_mode) != 0)
			goto isolate_fail;

		/* The whole page is taken off the LRU; skip the tail pages. */
		// If the isolated page is a compound page (only when
		// allocating a specific range?), step over its tail pages
		if (PageCompound(page))
			low_pfn += compound_nr(page) - 1;

		/* Successfully isolated */
		// Remove the page from its LRU list (possibly a memcg lruvec)
		del_page_from_lru_list(page, lruvec, page_lru(page));
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_lru(page),
				thp_nr_pages(page));

isolate_success:
		// Add the page to the isolation list
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages += compound_nr(page);
		nr_isolated += compound_nr(page);

		/*
		 * Avoid isolating too much unless this block is being
		 * rescanned (e.g. dirty/writeback pages, parallel allocation)
		 * or a lock is contended. For contention, isolate quickly to
		 * potentially remove one source of contention.
		 */
		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
		    !cc->rescan && !cc->contended) {
			++low_pfn;
			break;
		}

		continue;

isolate_fail:
		// Unless failures end the current block, keep scanning this
		// pageblock
		if (!
skip_on_failure ) continue ; /* * We have isolated some pages, but then failed。 Release them * instead of migrating, as we cannot form the cc->order buddy * page anyway。 */ // 該pageblock已經失敗,並且需要跳過了,則將已經隔離出來的page放回到對應的連結串列中(大頁的、非non-lru、lru中等) if ( nr_isolated ) { if ( locked ) { spin_unlock_irqrestore ( & pgdat -> lru_lock , flags ); locked = false ; } putback_movable_pages ( & cc -> migratepages ); cc -> nr_migratepages = 0 ; nr_isolated = 0 ; } // 沒有隔離到page,跳到下一個pageblock繼續遍歷 if ( low_pfn < next_skip_pfn ) { low_pfn = next_skip_pfn - 1 ; /* * The check near the loop beginning would have updated * next_skip_pfn too, but this is a bit simpler。 */ next_skip_pfn += 1UL << cc -> order ; } } /* * The PageBuddy() check could have potentially brought us outside * the range to be scanned。 */ if ( unlikely ( low_pfn > end_pfn )) low_pfn = end_pfn ; isolate_abort : // 隔離停止,如果page已經加鎖,則進行解鎖 if ( locked ) spin_unlock_irqrestore ( & pgdat -> lru_lock , flags ); /* * Updated the cached scanner pfn once the pageblock has been scanned * Pages will either be migrated in which case there is no point * scanning in the near future or migration failed in which case the * failure reason may persist。 The block is marked for skipping if * there were no pages isolated in the block or if the block is * rescanned twice in a row。 */ // pageblock隔離成功,設定該pageblock的skip屬性,下次跳過該pageblock的處理 if ( low_pfn == end_pfn && ( ! nr_isolated || cc -> rescan )) { if ( valid_page && ! 
skip_updated ) set_pageblock_skip ( valid_page ); update_cached_migrate ( cc , low_pfn ); } trace_mm_compaction_isolate_migratepages ( start_pfn , low_pfn , nr_scanned , nr_isolated ); fatal_pending : cc -> total_migrate_scanned += nr_scanned ; if ( nr_isolated ) count_compact_events ( COMPACTISOLATED , nr_isolated ); return low_pfn ; } 隔離前,需要明確需要在那種模式下進行,有如下三種模式: /* Isolate unmapped pages */ // 隔離沒有對映的頁 #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2) /* Isolate for asynchronous migration */ // 隔離不會阻塞的頁 #define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x4) /* Isolate unevictable pages */ // 隔離不可回收的頁 #define ISOLATE_UNEVICTABLE ((__force isolate_mode_t)0x8) 每次進行掃描時,都是以pageblock為單元進行的。在非全zone掃描場景,會使用zone的掃描快取compact_cached_free_pfn和compact_cached_migrate_pfn,這兩個值分別記錄上次掃描pageblock後的位置。當一個pageblock無法隔離到頁框,該pageblock會標記為PB_migrate_skip,那麼下次掃描的時候,可能會跳過該pageblock(同步、手動觸發、kcompactd任務和指定範圍頁框申請的場景下不會跳過)。下面是隔離操作的大致流程: 在開始的時候,migrate_pfn、compact_cached_migrate_pfn都是指向zone的起始頁幀start_pfn,而free_pfn、compact_cached_free_pfn都是指向最後一個pageblock的起始頁幀。在啟動碎片整理掃描時,發現pageblock[1]本身記憶體不足,則將其設定成PG_migrate_skip並跳過該pageblock。當繼續掃描pageblock[2]時,發現能隔離出x個頁框,同時也會將其置為PG_migrate_skip。這時會啟動空閒頁框掃描,如果pageblock[n]能隔離出y個頁框,則進行遷移並將compact_cached_free_pfn置為pageblock[n-1]的起始頁幀號。如果x > y,則需要繼續啟動空閒頁框的掃描。最終當compact_cached_migrate_pfn和compact_cached_free_pfn指向了同一個pageblock時,則結束。 下面總結隔離結束的條件: 當zone的所有pageblock都無需掃描,則結束。 當zone已經隔離了太多頁面時,並且隔離連結串列中還有沒處理完的頁框,或當前是非同步模式,或捕獲到致命訊號,則結束。 成功從某個pageblock隔離到頁框,這是正常結束場景。 那標記為PB_migrate_skip的pageblock,誰來負責清理呢?主要有如下兩種場景: compact_cached_free_pfn和compact_cached_migrate_pfn相遇時(指向同一個pageblock),則會設定compact_blockskip_flush為true。當kswapd準備睡眠的時候,會清除該zone的所有PB_migrate_skip。這也很好理解,如果再不清除,下次就沒pageblock掃描了。 非kswapd場景下,當推遲次數達到最大,並且閾值也達到最大時,也會清除zone的PB_migrate_skip。相關實現在__reset_isolation_suitable中。 實行頁框遷移操作 第二步結束後,如果是正常場景,即隔離到頁面時,會進行頁面遷移操作。 /* * migrate_pages - migrate the pages specified in a list, to the free pages * supplied as the target for the page migration * * @from: The 
 *			list of pages to be migrated.
 * @get_new_page:	The function used to allocate free pages to be used
 *			as the target of the page migration.
 * @put_new_page:	The function used to free target pages if migration
 *			fails, or NULL if no special handling is necessary.
 * @private:		Private data to be passed on to get_new_page()
 * @mode:		The migration mode that specifies the constraints for
 *			page migration, if any.
 * @reason:		The reason for page migration.
 *
 * The function returns after 10 attempts or if no pages are movable any more
 * because the list has become empty or no retryable pages exist any more.
 * The caller should call putback_movable_pages() to return pages to the LRU
 * or free list only if ret != 0.
 *
 * Returns the number of pages that were not migrated, or an error code.
 */
// Parameters:
// from          list of pages to migrate
// get_new_page  function that allocates a free target page
// put_new_page  function that frees a target page when migration fails
// private       argument passed to the two functions above
// mode          migration mode
// reason        reason for the migration
int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason)
{
	...
	int swapwrite = current->flags & PF_SWAPWRITE;
	int rc, nr_subpages;

	// Page migration requires the current task to be able to write to
	// the swap area
	if (!swapwrite)
		current->flags |= PF_SWAPWRITE;

	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
		retry = 0;
		thp_retry = 0;

		// Walk the "from" list; page is the current entry, page2 the
		// next one
		list_for_each_entry_safe(page, page2, from, lru) {
			// Non-huge-page case
			rc = unmap_and_move(get_new_page, put_new_page,
						private, page, pass > 2, mode,
						reason);
	...
	if (!swapwrite)
		current->flags &= ~PF_SWAPWRITE;

	return rc;
}

// Perform the migration: move the pages on migratepages to free pages
err = migrate_pages(&cc->migratepages, compaction_alloc,
		compaction_free, (unsigned long)cc, cc->mode,
		MR_COMPACTION);

Migration distinguishes between huge-page and regular-page cases. Considering only the regular-page case, the checks are:

1. If transparent huge pages are not supported but the current page happens to be one, return an error directly.
2. If the page is only on the LRU and no longer used, free it directly.
3. If the page is in use, allocate a free page, unmap the existing page, and move its contents to the new free page.
4. If step 3 fails, put the page back on its original list or clear its isolated flag, and free the newly allocated page.

Allocating a free page is implemented as follows:

/*
 * This is a migrate-callback that "allocates" freepages by taking pages
 * from the isolated freelists in the block we are migrating to.
 */
static struct page *compaction_alloc(struct page *migratepage,
					unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct page *freepage;

	// If the free-page list is empty, try to isolate some free pages
	if (list_empty(&cc->freepages)) {
		// Isolate free page frames, much like isolating
		// to-be-migrated frames
		isolate_freepages(cc);

		// No frame could be isolated, so fail
		if (list_empty(&cc->freepages))
			return NULL;
	}

	// Take the first page off the list and hand it to the caller
	freepage = list_entry(cc->freepages.next, struct page, lru);
	list_del(&freepage->lru);
	cc->nr_freepages--;

	return freepage;
}

Likewise, the free path:

/*
 * This is a migrate-callback that "frees" freepages back to the isolated
 * freelist. All pages on the freelist are from the same zone, so there is no
 * special handling needed for NUMA.
 */
static void compaction_free(struct page *page, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;

	list_add(&page->lru, &cc->freepages);
	cc->nr_freepages++;
}

This is simple enough that it needs no further explanation. Note, however, that if free pages are still left in cc after migration completes, they must be released as well.

One point worth examining in detail: when a page is in the middle of being migrated and is accessed at exactly that moment, how does the kernel guarantee nothing goes wrong?

__unmap_and_move

static int __unmap_and_move(struct page *page, struct page *newpage,
				int force, enum migrate_mode mode)
{
	int rc = -EAGAIN;
	int page_was_mapped = 0;
	struct anon_vma *anon_vma = NULL;
	bool is_lru = !__PageMovable(page);

	// Try to lock the old page (set PG_locked); processes can still
	// access the page at this point
	if (!
	    trylock_page(page)) {
		// Locking failed. If this is not a forced operation, or the
		// mode is asynchronous, return immediately: the lock_page()
		// below would block
		if (!force || mode == MIGRATE_ASYNC)
			goto out;

		/*
		 * It's not safe for direct compaction to call lock_page.
		 * For example, during page readahead pages are added locked
		 * to the LRU. Later, when the IO completes the pages are
		 * marked uptodate and unlocked. However, the queueing
		 * could be merging multiple pages for one bio (e.g.
		 * mpage_readahead). If an allocation happens for the
		 * second or third page, the process can end up locking
		 * the same page twice and deadlocking. Rather than
		 * trying to be clever about what pages can be locked,
		 * avoid the use of lock_page for direct compaction
		 * altogether.
		 */
		if (current->flags & PF_MEMALLOC)
			goto out;

		// Block waiting for the lock: (light) sync modes only
		lock_page(page);
	}

	// If the page is under writeback, only forced, fully synchronous
	// migration waits for the writeback to finish; async and light-sync
	// modes do not wait
	if (PageWriteback(page)) {
		/*
		 * Only in the case of a full synchronous migration is it
		 * necessary to wait for PageWriteback. In the async case,
		 * the retry loop is too short and in the sync-light case,
		 * the overhead of stalling is too much
		 */
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			// Async and light-sync modes give up immediately
			rc = -EBUSY;
			goto out_unlock;
		}
		// Even in sync mode, wait for writeback only when forced
		if (!force)
			goto out_unlock;
		wait_on_page_writeback(page);
	}

	/*
	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
	 * we cannot notice that anon_vma is freed while we migrates a page.
	 * This get_anon_vma() delays freeing anon_vma pointer until the end
	 * of migration. File cache pages are no problem because of page_lock()
	 * File Caches may use write_page() or lock_page() in migration, then,
	 * just care Anon page here.
	 *
	 * Only page_get_anon_vma() understands the subtleties of
	 * getting a hold on an anon_vma from outside one of its mms.
	 * But if we cannot get anon_vma, then we won't need it anyway,
	 * because that implies that the anon page is no longer mapped
	 * (and cannot be remapped so long as we hold the page lock).
	 */
	// For an anonymous, non-KSM page, take a reference on its anon_vma
	if (PageAnon(page) && !PageKsm(page))
		anon_vma = page_get_anon_vma(page);

	/*
	 * Block others from accessing the new page when we get around to
	 * establishing additional references. We are usually the only one
	 * holding a reference to newpage at this point. We used to have a BUG
	 * here if trylock_page(newpage) fails, but would like to allow for
	 * cases where there might be a race with the previous use of newpage.
	 * This is much like races on refcount of oldpage: just don't BUG().
	 */
	// Try to lock the new page so it cannot be used while the migration
	// is in progress
	if (unlikely(!trylock_page(newpage)))
		goto out_unlock;

	// A non-LRU movable page has no user mappings, so it can be moved
	// directly, with no unmap step
	if (unlikely(!is_lru)) {
		rc = move_to_new_page(newpage, page, mode);
		goto out_unlock_both;
	}

	/*
	 * Corner case handling:
	 * 1. When a new swap-cache page is read into, it is added to the LRU
	 * and treated as swapcache but it has no rmap yet.
	 * Calling try_to_unmap() against a page->mapping==NULL page will
	 * trigger a BUG. So handle it here.
	 * 2. An orphaned page (see truncate_complete_page) might have
	 * fs-private metadata. The page can be picked up due to memory
	 * offlining. Everywhere else except page reclaim, the page is
	 * invisible to the vm, so the page can not be migrated. So try to
	 * free the metadata, so the page can be freed.
	 */
	// mapping == NULL means no unmap is needed. Two cases:
	// 1. An anonymous page being swapped out; it has already been
	//    unmapped.
	// 2. An orphaned page, for instance because its memory is being
	//    offlined; such a page cannot be used, so it can be freed
	//    directly. If it carries fs-private data, free that first.
	if (!page->mapping) {
		VM_BUG_ON_PAGE(PageAnon(page), page);
		if (page_has_private(page)) {
			try_to_free_buffers(page);
			goto out_unlock_both;
		}
	} else if (page_mapped(page)) {
		/* Establish migration ptes */
		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
				page);
		// Unmap the page from every process that maps it (found via
		// the reverse mapping).
		// TTU_MIGRATION: the unmap is for page migration
		// TTU_IGNORE_MLOCK: mlocked pages may also be unmapped
		// After the unmap, any access to the page will block
		try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
		page_was_mapped = 1;
	}

	// The page can be migrated only when it is no longer mapped
	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page, mode);

	// If the page was unmapped above, every process that mapped it now
	// holds a "special" migration page-table entry. Once the migration
	// finishes, that special entry is rewritten to point at the migrated
	// page, and accesses to the page can proceed normally again
	if (page_was_mapped)
		remove_migration_ptes(page,
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
	unlock_page(newpage);
out_unlock:
	/* Drop an anon_vma reference if we took one */
	if (anon_vma)
		put_anon_vma(anon_vma);
	unlock_page(page);
out:
	/*
	 * If migration is successful, decrease refcount of the newpage
	 * which will not free the page because new page owner increased
	 * refcounter. As well, if it is LRU page, add the page to LRU
	 * list in here. Use the old state of the isolated source page to
	 * determine if we migrated a LRU page. newpage was already unlocked
	 * and possibly modified by its owner - don't rely on the page
	 * state.
	 */
	// On success, put the new page back on the LRU
	if (rc == MIGRATEPAGE_SUCCESS) {
		if (unlikely(!is_lru))
			put_page(newpage);
		else
			putback_lru_page(newpage);
	}

	return rc;
}

Within migrate_pages, one of the steps unmaps a page that is still in use; once migration finishes, the page-table entries must be redirected to the migrated page. This is exactly what __unmap_and_move does, and it is the key to keeping migration transparent. The flow in detail:

1. Lock the old page (set PG_locked). The lock is taken because migration is about to start and no other task may modify the page's contents, although the page can still be read. Note that if the trylock fails, the page has probably been locked by another task and we must wait for the lock to be released; asynchronous mode does not wait and returns immediately, while the (light) synchronous modes wait.
2. After locking, if the page is under writeback, neither asynchronous nor light-synchronous mode blocks waiting for the writeback to complete. Only forced, fully synchronous migration waits.
3. Lock the new page so that no other task can operate on it. If locking fails, bail out.
4. If the old page is not on an LRU list, it is not mapped by any process and can be moved directly, with no unmap step.
5. Unmap the old page: find every process that maps it and replace the corresponding page-table entry with a "special" migration entry. From this point on, any process that touches the page walks into this special entry and then tries to lock the page; since the page was locked in step 1, these tasks sleep until the lock is released.
6. Move the page contents to the new frame.
7. Rewrite the "special" entries from step 5 to point at the migrated page.
8. Unlock both the old page and the new page, waking the tasks that went to sleep in step 5.

Summary

This chapter covered the details of memory compaction. After the system has been running for a long time, fragmentation is inevitable, and too much of it lowers the success rate of contiguous-memory allocations. Because compaction migrates page frames, only the MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE and MIGRATE_CMA page-frame types are compacted. Compaction is triggered in the following cases:

The "fast path" fails to allocate contiguous memory and allocation enters the "slow path", which runs compaction.
The kswapd task runs compaction after reclaiming memory.
Manual trigger: writing 1 to /proc/sys/vm/compact_memory.
When allocating contiguous page frames in a specific range, and part of that range is already in use, the used frames must be migrated away via compaction.
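The manual-trigger path above can be observed from user space: writing 1 to /proc/sys/vm/compact_memory compacts all zones, and /proc/buddyinfo shows, per zone, the number of free blocks of each order, so comparing its high-order columns before and after gives a rough measure of what compaction achieved. The sketch below sums the free pages sitting in order >= 4 blocks of the Normal zone. The numbers in the here-doc are a hypothetical captured sample so that the script runs anywhere; on a live Linux system you would replace the here-doc with `cat /proc/buddyinfo` (and run `echo 1 > /proc/sys/vm/compact_memory` as root beforehand).

```shell
# In /proc/buddyinfo, fields 5..NF of each line are the free-block
# counts for order 0, 1, 2, ... of that zone. A block of order k
# holds 2^k contiguous pages.
high_order_pages=$(awk '/Normal/ {
    n = 0
    for (i = 5; i <= NF; i++) {
        order = i - 5
        if (order >= 4)          # only count blocks of order >= 4
            n += $i * 2 ^ order  # pages contained in those blocks
    }
    print n
}' <<'EOF'
Node 0, zone   Normal   212   141    75    33    18     9     4     2     1     1     0
EOF
)
echo "free pages in order>=4 blocks: $high_order_pages"
```

Rerunning the same computation after triggering compaction should show this number growing as scattered movable pages are merged into larger free blocks.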