
Understanding Linux Memory Management in Depth (6): Memory Compaction

By LZT · 2022-11-03

Recap

The previous chapter covered how the buddy system allocates and frees memory, but always under the assumption that memory was plentiful. As pages get used over time, heavy fragmentation or a large backlog of pages waiting to be reclaimed can make it impossible to find enough contiguous page frames. This chapter looks at the mechanisms the system relies on to keep allocations succeeding in that situation. Linux mainly has two: memory compaction and memory reclaim, where reclaim further splits into fast reclaim, direct reclaim and kswapd reclaim. For reasons of space, this chapter covers only memory compaction. Without further ado, let's get into it.

Data Structures

Memory Compaction

/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant
 * MIGRATE_SYNC will block when migrating pages
 * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
 *	with the CPU. Instead, page copy happens outside the migratepage()
 *	callback and is likely using a DMA engine. See migrate_vma() and HMM
 *	(mm/hmm.c) for users of this mode.
 */
enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY,
};

Only three page-frame types can be compacted: MIGRATE_MOVABLE, MIGRATE_CMA and MIGRATE_RECLAIMABLE. Compaction runs in one of four modes:

Asynchronous (MIGRATE_ASYNC): no blocking operation is allowed; as soon as blocking or rescheduling would be needed, compaction stops. In this mode only MIGRATE_MOVABLE and MIGRATE_CMA page frames are handled, not MIGRATE_RECLAIMABLE ones, because those are mostly file pages, and compacting file pages may involve writing back dirty pages, which blocks. (A small helper related to this check is sketched after this list.)

Light synchronous (MIGRATE_SYNC_LIGHT): most blocking operations are allowed, but it will not block waiting for dirty file pages to be written back, since writeback can take a long time.

Synchronous (MIGRATE_SYNC): blocking is allowed while migrating page frames, i.e. it waits for page writeback to finish before returning; this is the most expensive mode. In this mode the whole zone is scanned and pageblocks marked PB_migrate_skip are not skipped.

Non-copying synchronous (MIGRATE_SYNC_NO_COPY): like synchronous mode, blocking is allowed during migration, but the CPU does not copy the page contents (a DMA engine does, as the comment above explains).
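For reference, here is a minimal sketch (assuming a recent mainline kernel, include/linux/mmzone.h) of the helper that decides whether a pageblock's migratetype counts as "movable" for a MIGRATE_MOVABLE request; async compaction relies on this kind of check when restricting itself to MOVABLE/CMA blocks:

/* Sketch of is_migrate_movable() (include/linux/mmzone.h); shown for reference only. */
static inline bool is_migrate_movable(int mt)
{
	/* CMA pageblocks are treated as movable as well */
	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}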

struct zone {
	...
#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 * compact_order_failed is the minimum compaction failed order.
	 */
	// number of times compaction has been deferred; once it reaches
	// 1 << compact_defer_shift, no further deferral is allowed
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	// lowest order at which compaction of this zone may fail:
	// if the requested order >= compact_order_failed, deferral is allowed
	// (this raises the compaction success rate); below it, compaction runs directly.
	// on a successful compaction, compact_order_failed is set to order + 1;
	// on a failed compaction, compact_order_failed is set to order
	int			compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	// cached start pfn for the free-page scanner
	unsigned long		compact_cached_free_pfn;
	/* pfn where compaction migration scanner should start */
	// cached start pfns for the migration scanner, one for async (0) and one for sync (1)
	unsigned long		compact_cached_migrate_pfn[ASYNC_AND_SYNC];
	// initial pfn for the migration scanner
	unsigned long		compact_init_migrate_pfn;
	// initial pfn for the free-page scanner
	unsigned long		compact_init_free_pfn;
#endif
	...
};

Compaction is expensive, so Linux uses a few tricks to keep the cost down, mainly: (1) running compaction less often, and (2) skipping pages that have already been scanned. The deferral mechanism postpones compaction runs that are predicted to fail; and a scan is either a whole-zone scan or a partial scan, where a partial scan resumes from the cached positions and skips pages that were already scanned.
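The deferral bookkeeping itself lives in defer_compaction(), which this article does not list. As a reference, here is a simplified sketch of its mainline implementation (mm/compaction.c, tracing omitted); on every failed compaction it restarts the attempt counter and widens the deferral window:

/* Simplified sketch of defer_compaction() (mm/compaction.c), tracing omitted. */
void defer_compaction(struct zone *zone, int order)
{
	zone->compact_considered = 0;		/* restart the attempt counter */
	zone->compact_defer_shift++;		/* double the deferral window */

	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;	/* remember the lowest failing order */

	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;	/* cap the window */
}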

struct compact_control {
	// list of free pages isolated while scanning pageblocks
	struct list_head freepages;	/* List of free pages to migrate to */
	// list of pages to migrate, isolated while scanning pageblocks
	struct list_head migratepages;	/* List of pages being migrated */
	// number of pages on the freepages list
	unsigned int nr_freepages;	/* Number of isolated free pages */
	// number of pages on the migratepages list
	unsigned int nr_migratepages;	/* Number of pages to migrate */
	// start pfn for isolating free pages
	unsigned long free_pfn;	/* isolate_freepages search base */
	// start pfn for isolating migratable pages
	unsigned long migrate_pfn;	/* isolate_migratepages search base */
	// start pfn for the fast scan
	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
	// the zone being compacted
	struct zone *zone;
	// pages scanned so far by the migration scanner
	unsigned long total_migrate_scanned;
	// pages scanned so far by the free scanner
	unsigned long total_free_scanned;
	unsigned short fast_search_fail;/* failures to use free list searches */
	// order used for the fast search
	short search_order;		/* order to start a fast search at */
	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
	// the order whose failed allocation triggered this compaction
	int order;			/* order a direct compactor needs */
	// migratetype of the current allocation
	int migratetype;		/* migratetype of direct compactor */
	// allocation flags of the current allocation
	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
	// highest zone index allowed for the current allocation
	const int highest_zoneidx;	/* zone index of a direct compactor */
	// migration mode for this compaction run
	enum migrate_mode mode;		/* Async or sync migration mode */
	// whether to ignore the pageblock skip hint
	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
	bool no_set_skip_hint;		/* Don't mark blocks for skipping */
	// whether to scan pageblocks considered unsuitable
	bool ignore_block_suitable;	/* Scan blocks considered unsuitable */
	// whether this is direct compaction; false when run from kcompactd
	// or triggered via /proc
	bool direct_compaction;		/* False from kcompactd or /proc/... */
	// whether this is proactive compaction
	bool proactive_compaction;	/* kcompactd proactive compaction */
	// whether the whole zone should be scanned
	bool whole_zone;		/* Whole zone should/has been scanned */
};

This is the control structure for a compaction run: the migratable page frames and free page frames found by the scanners are isolated from the buddy system onto its two lists. A fully successful compaction run moves every page frame awaiting migration into the free page frames. Because a single run may still not yield enough contiguous memory, several compaction runs may be needed.

Algorithm

Memory Compaction

[Figure: a simplified model of memory compaction]

The figure above is a simplified model of compaction. Suppose an allocation asks for 4 contiguous page frames; the zone has enough free frames overall, but fragmentation is so bad that no 4 of them are contiguous. After compaction, the request for 4 contiguous frames can be satisfied. The rest of this section walks through the compaction flow in detail, using the data structures above.

__alloc_pages_direct_compact

static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, enum compact_result *compact_result)
{
	struct page *page = NULL;
	unsigned long pflags;
	unsigned int noreclaim_flag;

	// an order-0 request never needs compaction
	if (!order)
		return NULL;

	// try to compact
	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
								prio, &page);

	if (page)
		// prepare the new page
		prep_new_page(page, order, gfp_mask, alloc_flags);

	/* Try get a page from the freelist if available */
	// even if compaction did not produce a page directly, still try the
	// freelist - it may succeed now (possibly via fallback?)
	if (!page)
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (page) {
		struct zone *zone = page_zone(page);

		// why is this set to false???
		zone->compact_blockskip_flush = false;
		// on success, compact_considered and compact_defer_shift are reset to 0,
		// and compact_order_failed is set to order + 1 if it was <= order
		compaction_defer_reset(zone, order, true);
		count_vm_event(COMPACTSUCCESS);
		return page;
	}

	...

	return NULL;
}

This function is the entry point for compaction. Its parameters are the allocation order, the allocation flags, the allocation context, the compaction priority and the compaction result. The first few mean the same as during allocation; the compaction priority decides which compaction mode this run uses. Note that the function contains both a compaction step and an allocation step, i.e. compaction and allocation are bound together here. To summarise, compaction is triggered when:

the kswapd task may trigger compaction after reclaiming memory;

on the slow allocation path, a zone cannot provide enough contiguous page frames;

it is triggered manually via /proc/sys/vm/compact_memory (this path is recognised by a tiny helper, sketched below).
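The manual trigger path passes order == -1 down into compaction, and the code detects it with a small helper; a sketch of the mainline version (mm/compaction.c):

/*
 * Sketch of is_via_compact_memory() (mm/compaction.c): an order of -1 marks a
 * compaction request that came from writing to /proc/sys/vm/compact_memory,
 * which is always allowed to proceed.
 */
static inline bool is_via_compact_memory(int order)
{
	return order == -1;
}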

try_to_compact_pages

// mm/compaction.c
/**
 * try_to_compact_pages - Direct compact to satisfy a high-order allocation
 * @gfp_mask: The GFP mask of the current allocation
 * @order: The order of the current allocation
 * @alloc_flags: The allocation flags of the current allocation
 * @ac: The context of current allocation
 * @prio: Determines how hard direct compaction should try to succeed
 * @capture: Pointer to free page created by compaction will be stored here
 *
 * This is the main entry point for direct page compaction.
 */
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, struct page **capture)
{
	int may_perform_io = gfp_mask & __GFP_IO;
	struct zoneref *z;
	struct zone *zone;
	// the default result is "skipped"
	enum compact_result rc = COMPACT_SKIPPED;

	/*
	 * Check if the GFP flags allow compaction - GFP_NOIO is really
	 * tricky context because the migration might require IO
	 */
	// if IO is not allowed, do not compact (migrating in a no-IO context
	// could deadlock)
	if (!may_perform_io)
		return COMPACT_SKIPPED;

	/* Compact each zone in the list */
	// walk every zone in the zonelist
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->highest_zoneidx, ac->nodemask) {
		enum compact_result status;

		if (prio > MIN_COMPACT_PRIORITY
					&& compaction_deferred(zone, order)) {
			// COMPACT_DEFERRED: compaction was deferred for this zone
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;
		}

		// compact this zone and try to allocate from it
		status = compact_zone_order(zone, order, gfp_mask, prio,
				alloc_flags, ac->highest_zoneidx, capture);
		rc = max(status, rc);

		/* The allocation should succeed, stop compacting */
		// compaction finished for this zone
		if (status == COMPACT_SUCCESS) {
			/*
			 * We think the allocation will succeed in this zone,
			 * but it is not certain, hence the false. The caller
			 * will repeat this with true if allocation indeed
			 * succeeds in this zone.
			 */
			// at this point compaction has finished and the zone now has
			// enough free pages for the request, but they are not necessarily
			// contiguous, so the allocation may still fail - hence "false".
			// compact_order_failed is set to order + 1, so a later request
			// with a smaller order may be deferred.
			compaction_defer_reset(zone, order, false);

			break;
		}

		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
					status == COMPACT_PARTIAL_SKIPPED))
			/*
			 * We think that allocation won't succeed in this zone
			 * so we defer compaction there. If it ends up
			 * succeeding after all, it will be reset.
			 */
			defer_compaction(zone, order);

		/*
		 * We might have stopped compacting due to need_resched() in
		 * async compaction, or due to a fatal signal detected. In that
		 * case do not try further zones
		 */
		// async mode must not block: if the current task needs to be
		// rescheduled, stop the whole compaction attempt
		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
					|| fatal_signal_pending(current))
			break;
	}

	return rc;
}

The zone list is walked and each zone's state is evaluated. If a zone is not a good candidate for compaction, it is deferred, i.e. skipped. If a zone looks suitable, it is compacted and an allocation is attempted afterwards. The loop stops when compaction of one zone succeeds, or, in async mode, when the task needs to be rescheduled, or when a fatal signal is pending. The termination conditions are:

the zone already has enough contiguous memory, so no compaction is needed; allocate from it directly and stop;

the zone cannot allocate while respecting the watermark, so it is skipped; in async mode, if rescheduling is needed, stop immediately, otherwise stop after all zones have been visited;

after migrating page frames in a zone: if compaction succeeded and the allocation succeeds, stop; otherwise move on to the next zone until all zones have been visited.

compaction_deferred

// mm/compaction.c
/* Returns true if compaction should be skipped this time */
bool compaction_deferred(struct zone *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	// compact_order_failed records the lowest order at which compaction
	// failed; only requests at or above it may be deferred (to avoid paying
	// for a compaction run that is likely to fail again)
	if (order < zone->compact_order_failed)
		return false;

	/* Avoid possible overflow */
	// deferral counter: once it has reached the current limit
	// (1 << zone->compact_defer_shift), deferring is no longer allowed
	if (++zone->compact_considered >= defer_limit) {
		// avoid overflow
		zone->compact_considered = defer_limit;
		return false;
	}

	return true;
}

This is the compaction deferral check; compaction is deferred only when both of the following hold:

the requested order is greater than or equal to compact_order_failed, meaning this compaction run is also expected to fail. compact_order_failed is set to order when compaction fails and to order + 1 when it succeeds; it is never reset to 0.

the deferral counter has not yet reached its limit; the counter is compact_considered and the limit is 1 << compact_defer_shift. When compaction fails and is deferred, the counter is reset to 0 and compact_defer_shift is incremented; when compaction succeeds and the subsequent allocation also succeeds, both values are reset to 0.

The deferral policy exists to raise the success rate of compaction and to reduce the performance lost to compaction runs that fail.
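The counterpart that rewinds this state is compaction_defer_reset(), called in the two places shown earlier; a simplified sketch of the mainline version (mm/compaction.c, tracing omitted):

/* Simplified sketch of compaction_defer_reset() (mm/compaction.c), tracing omitted. */
void compaction_defer_reset(struct zone *zone, int order, bool alloc_success)
{
	if (alloc_success) {
		/* the allocation really succeeded: drop all deferral state */
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;	/* smaller orders may now defer */
}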

compact_zone_order

// mm/compaction.c
static enum compact_result compact_zone_order(struct zone *zone, int order,
		gfp_t gfp_mask, enum compact_priority prio,
		unsigned int alloc_flags, int highest_zoneidx,
		struct page **capture)
{
	enum compact_result ret;
	struct compact_control cc = {
		.order = order,
		.search_order = order,
		.gfp_mask = gfp_mask,
		.zone = zone,
		// the async priority uses async mode, everything else uses light sync
		.mode = (prio == COMPACT_PRIO_ASYNC) ?
					MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
		.alloc_flags = alloc_flags,
		.highest_zoneidx = highest_zoneidx,
		.direct_compaction = true,
		// sync (lowest) priority scans the whole zone and does not skip
		// pageblocks marked PB_migrate_skip
		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
	};
	struct capture_control capc = {
		.cc = &cc,
		.page = NULL,
	};

	/*
	 * Make sure the structs are really initialized before we expose the
	 * capture control, in case we are interrupted and the interrupt handler
	 * frees a page.
	 */
	barrier();
	WRITE_ONCE(current->capture_control, &capc);

	// compact this zone
	ret = compact_zone(&cc, &capc);

	VM_BUG_ON(!list_empty(&cc.freepages));
	VM_BUG_ON(!list_empty(&cc.migratepages));

	/*
	 * Make sure we hide capture control first before we read the captured
	 * page pointer, otherwise an interrupt could free and capture a page
	 * and we would leak it.
	 */
	WRITE_ONCE(current->capture_control, NULL);
	*capture = READ_ONCE(capc.page);

	return ret;
}

This builds the parameters for the compaction run: the async priority uses asynchronous mode, anything else uses light synchronous mode.

compact_zone

// mm/compaction.c
static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
	enum compact_result ret;
	// start of the zone, where the scan for movable pages begins
	unsigned long start_pfn = cc->zone->zone_start_pfn;
	// end of the zone, where the scan for free pages begins
	unsigned long end_pfn = zone_end_pfn(cc->zone);
	unsigned long last_migrated_pfn;
	// whether this is a (light) sync run
	const bool sync = cc->mode != MIGRATE_ASYNC;
	bool update_cached;

	/*
	 * These counters track activities during zone compaction. Initialize
	 * them before compacting a new zone.
	 */
	// initialize the compaction control counters and lists
	cc->total_migrate_scanned = 0;
	cc->total_free_scanned = 0;
	cc->nr_migratepages = 0;
	cc->nr_freepages = 0;
	INIT_LIST_HEAD(&cc->freepages);
	INIT_LIST_HEAD(&cc->migratepages);

	// migratetype of the current allocation
	cc->migratetype = gfp_migratetype(cc->gfp_mask);
	// check whether this zone qualifies for compaction
	ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
							cc->highest_zoneidx);
	/* Compaction is likely to fail */
	// COMPACT_SUCCESS here means there is already enough memory, so no compaction is needed;
	// COMPACT_SKIPPED here means there is not even enough memory to compact
	if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
		return ret;

	/* huh, compaction_suitable is returning something unexpected */
	VM_BUG_ON(ret != COMPACT_CONTINUE);

	/*
	 * Clear pageblock skip if there were failures recently and compaction
	 * is about to be retried after being deferred.
	 */
	// if the deferral count and window have both hit their maximum and a
	// deferred compaction is about to be retried, reset the zone's cached
	// compact_init_migrate_pfn / compact_cached_migrate_pfn /
	// compact_cached_free_pfn; as I understand it this forces a whole-zone
	// scan to raise the success rate
	if (compaction_restarting(cc->zone, cc->order))
		__reset_isolation_suitable(cc->zone);

	/*
	 * Setup to move all movable pages to the end of the zone. Used cached
	 * information on where the scanners should start (unless we explicitly
	 * want to compact the whole zone), but check that it is initialised
	 * by ensuring the values are within zone boundaries.
	 */
	cc->fast_start_pfn = 0;
	// for a whole-zone scan the migration scanner starts at the first pfn of
	// the zone and the free scanner at the start of the last pageblock;
	// sync mode scans the whole zone
	if (cc->whole_zone) {
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
	} else {
		// otherwise (async or light sync) resume from the cached positions
		cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = cc->zone->compact_cached_free_pfn;
		// if the cached values are out of range, reset them to valid values
		if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
			cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
			cc->zone->compact_cached_free_pfn = cc->free_pfn;
		}
		if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
			cc->migrate_pfn = start_pfn;
			cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
			cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
		}

		// if the migration scan starts from the very beginning, this is in
		// effect a whole-zone scan, so record that
		if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
			cc->whole_zone = true;
	}

	last_migrated_pfn = 0;

	/*
	 * Migrate has separate cached PFNs for ASYNC and SYNC* migration on
	 * the basis that some migrations will fail in ASYNC mode. However,
	 * if the cached PFNs match and pageblocks are skipped due to having
	 * no isolation candidates, then the sync state does not matter.
	 * Until a pageblock with isolation candidates is found, keep the
	 * cached PFNs in sync to avoid revisiting the same blocks.
	 */
	update_cached = !sync &&
		cc->zone->compact_cached_migrate_pfn[0] ==
		cc->zone->compact_cached_migrate_pfn[1];

	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync);

	migrate_prep_local();

	// compact_finished() decides when compaction of the zone is done:
	// 1. the migration scanner and the free scanner have met;
	// 2. for proactive compaction (as I understand it, only the kcompactd case):
	//    stop if kswapd is reclaiming this node, otherwise keep going until the
	//    configured low fragmentation score is reached;
	// 3. enough free pages have appeared, or a movable allocation can be served
	//    from CMA, so compaction can stop.
	// Note: manually triggered compaction only checks the first condition.
	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
		int err;
		unsigned long start_pfn = cc->migrate_pfn;

		/*
		 * Avoid multiple rescans which can happen if a page cannot be
		 * isolated (dirty/writeback in async mode) or if the migrated
		 * pages are being allocated before the pageblock is cleared.
		 * The first rescan will capture the entire pageblock for
		 * migration. If it fails, it'll be marked skip and scanning
		 * will proceed as normal.
		 */
		cc->rescan = false;
		if (pageblock_start_pfn(last_migrated_pfn) ==
		    pageblock_start_pfn(start_pfn)) {
			cc->rescan = true;
		}

		// scan for movable pages, take them off the LRU and put them on
		// the migratepages list
		switch (isolate_migratepages(cc)) {
		case ISOLATE_ABORT:
			// isolation was aborted: put the pages on migratepages back on the LRU
			ret = COMPACT_CONTENDED;
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			goto out;
		case ISOLATE_NONE:
			// no movable page was found in this pageblock
			if (update_cached) {
				cc->zone->compact_cached_migrate_pfn[1] =
					cc->zone->compact_cached_migrate_pfn[0];
			}

			/*
			 * We haven't isolated and migrated anything, but
			 * there might still be unflushed migrations from
			 * previous cc->order aligned block.
			 */
			goto check_drain;
		case ISOLATE_SUCCESS:
			update_cached = false;
			last_migrated_pfn = start_pfn;
		}

		// perform the migration: move the pages on migratepages into free pages
		err = migrate_pages(&cc->migratepages, compaction_alloc,
				compaction_free, (unsigned long)cc, cc->mode,
				MR_COMPACTION);

		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
							&cc->migratepages);

		/* All pages were either migrated or will be released */
		cc->nr_migratepages = 0;
		if (err) {
			// migration failed: put the remaining movable pages back on the LRU
			putback_movable_pages(&cc->migratepages);
			/*
			 * migrate_pages() may return -ENOMEM when scanners meet
			 * and we want compact_finished() to detect it
			 */
			if (err == -ENOMEM && !compact_scanners_met(cc)) {
				ret = COMPACT_CONTENDED;
				goto out;
			}
			/*
			 * We failed to migrate at least one page in the current
			 * order-aligned block, so skip the rest of it.
			 */
			if (cc->direct_compaction &&
						(cc->mode == MIGRATE_ASYNC)) {
				cc->migrate_pfn = block_end_pfn(
						cc->migrate_pfn - 1, cc->order);
				/* Draining pcplists is useless in this case */
				last_migrated_pfn = 0;
			}
		}

check_drain:
		/*
		 * Has the migration scanner moved away from the previous
		 * cc->order aligned block where we migrated from? If yes,
		 * flush the pages that were freed, so that they can merge and
		 * compact_finished() can detect immediately if allocation
		 * would succeed.
		 */
		if (cc->order > 0 && last_migrated_pfn) {
			unsigned long current_block_start =
				block_start_pfn(cc->migrate_pfn, cc->order);

			if (last_migrated_pfn < current_block_start) {
				lru_add_drain_cpu_zone(cc->zone);
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}

		/* Stop if a page has been captured */
		if (capc && capc->page) {
			ret = COMPACT_SUCCESS;
			break;
		}
	}

out:
	/*
	 * Release free pages and update where the free scanner should restart,
	 * so we don't leave any returned pages behind in the next attempt.
	 */
	// if pages are still on the free list, release them back to the buddy system
	if (cc->nr_freepages > 0) {
		unsigned long free_pfn = release_freepages(&cc->freepages);

		cc->nr_freepages = 0;
		VM_BUG_ON(free_pfn == 0);
		/* The cached pfn is always the first in a pageblock */
		free_pfn = pageblock_start_pfn(free_pfn);
		/*
		 * Only go back, not forward. The cached pfn might have been
		 * already reset to zone end in compact_finished()
		 */
		if (free_pfn > cc->zone->compact_cached_free_pfn)
			cc->zone->compact_cached_free_pfn = free_pfn;
	}

	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync, ret);

	return ret;
}

This is the core compaction function; it operates on a single zone. It does three things: (1) decides whether the zone qualifies for compaction; (2) isolates the page frames to migrate; (3) performs the page migration. (Quite a lot for one function :joy:.) The three parts are analysed one by one below.

Checking whether the zone qualifies for compaction

// mm/compaction.c
enum compact_result compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx)
{
	enum compact_result ret;
	int fragindex;

	ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
				    zone_page_state(zone, NR_FREE_PAGES));
	/*
	 * fragmentation index determines if allocation failures are due to
	 * low memory or external fragmentation
	 *
	 * index of -1000 would imply allocations might succeed depending on
	 * watermarks, but we already failed the high-order watermark check
	 * index towards 0 implies failure is due to lack of memory
	 * index towards 1000 implies failure is due to fragmentation
	 *
	 * Only compact if a failure would be due to fragmentation. Also
	 * ignore fragindex for non-costly orders where the alternative to
	 * a successful reclaim/compaction is OOM. Fragindex and the
	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
	 * excessive compaction for costly orders, but it should not be at the
	 * expense of system stability.
	 */
	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
		// "score" the current fragmentation:
		// 1. a value towards 0 means the allocation failed because memory is low
		// 2. a value towards 1000 means the allocation failed because of fragmentation
		// compaction is only worthwhile when the failure is caused by fragmentation
		fragindex = fragmentation_index(zone, order);
		// sysctl_extfrag_threshold comes from /proc/sys/vm/extfrag_threshold
		// (range 0-1000); compaction only runs when the index exceeds it
		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
			ret = COMPACT_NOT_SUITABLE_ZONE;
	}

	trace_mm_compaction_suitable(zone, order, ret);
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;

	return ret;
}

/*
 * compaction_suitable: Is this suitable to run compaction on this zone now?
 * Returns
 *   COMPACT_SKIPPED  - If there are too few free pages for compaction
 *   COMPACT_SUCCESS  - If the allocation would succeed without compaction
 *   COMPACT_CONTINUE - If compaction should run now
 */
static enum compact_result __compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx,
					unsigned long wmark_target)
{
	unsigned long watermark;

	// compaction triggered via /proc/sys/vm/compact_memory is always performed
	if (is_via_compact_memory(order))
		return COMPACT_CONTINUE;

	// the watermark (usually the MIN watermark) required by alloc_flags
	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	/*
	 * If watermarks for high-order allocation are already met, there
	 * should be no need for compaction at all.
	 */
	// if the zone can already satisfy a 2^order allocation while staying
	// above the watermark, no compaction is needed
	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
								alloc_flags))
		return COMPACT_SUCCESS;

	/*
	 * Watermarks for order-0 must be met for compaction to be able to
	 * isolate free pages for migration targets. This means that the
	 * watermark and alloc_flags have to match, or be more pessimistic than
	 * the check in __isolate_free_page(). We don't use the direct
	 * compactor's alloc_flags, as they are not relevant for freepage
	 * isolation. We however do use the direct compactor's highest_zoneidx
	 * to skip over zones where lowmem reserves would prevent allocation
	 * even if compaction succeeds.
	 * For costly orders, we require low watermark instead of min for
	 * compaction to proceed to increase its chances.
	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
	 * suitable migration targets
	 */
	// at this point the allocation cannot succeed at the watermark implied by
	// alloc_flags, so fall back to the low or min watermark depending on order
	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	// add the extra gap the zone is estimated to need (???)
	watermark += compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
						ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;

	return COMPACT_CONTINUE;
}

Compaction is unnecessary in the following cases:

The zone can already allocate enough contiguous page frames while respecting the required watermark, so there is no point paying for compaction. This covers the case where the fast path failed earlier but other tasks freed pages while we were on the slow path.

If case 1 does not hold and the zone cannot even allocate a single page frame at the low or min watermark, compaction is also pointless: the zone has essentially no free pages left to use as migration targets.

If neither case 1 nor case 2 applies, the fragmentation is scored. The point is to estimate whether the allocation failure comes from a lack of memory or from fragmentation; only a fragmentation-caused failure makes compaction worthwhile. The user-configured threshold (/proc/sys/vm/extfrag_threshold) is also honoured: a large value means the user prefers not to compact (performance first), a small value means the user wants compaction to run more often (memory first).

Outside these three cases, compaction always runs, whether it was triggered manually or by memory pressure.
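The compact_gap() that the code above questions is just a rule of thumb for the working room compaction needs; a sketch of the mainline helper (include/linux/compaction.h):

/*
 * Sketch of compact_gap() (include/linux/compaction.h): compaction wants
 * roughly twice the requested size as headroom - 2^order pages to satisfy
 * the allocation itself plus up to 2^order free pages to migrate into.
 */
static inline unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}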

Isolating the page frames to migrate

// mm/compaction.c
/*
 * Isolate all pages that can be migrated from the first suitable block,
 * starting at the block pointed to by the migrate scanner pfn within
 * compact_control.
 */
static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
	unsigned long block_start_pfn;
	unsigned long block_end_pfn;
	unsigned long low_pfn;
	struct page *page;
	// whether unevictable pages may be isolated, and, outside full sync mode,
	// prefer pages whose isolation will not block
	const isolate_mode_t isolate_mode =
		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
	bool fast_find_block;

	/*
	 * Start at where we last stopped, or beginning of the zone as
	 * initialized by compact_zone(). The first failure will use
	 * the lowest PFN as the starting point for linear scanning.
	 */
	low_pfn = fast_find_migrateblock(cc);
	block_start_pfn = pageblock_start_pfn(low_pfn);
	if (block_start_pfn < cc->zone->zone_start_pfn)
		block_start_pfn = cc->zone->zone_start_pfn;

	/*
	 * fast_find_migrateblock marks a pageblock skipped so to avoid
	 * the isolation_suitable check below, check whether the fast
	 * search was successful.
	 */
	fast_find_block = low_pfn != cc->migrate_pfn && !cc->fast_search_fail;

	/* Only scan within a pageblock boundary */
	block_end_pfn = pageblock_end_pfn(low_pfn);

	/*
	 * Iterate over whole pageblocks until we find the first suitable.
	 * Do not cross the free scanner.
	 */
	// scan one pageblock at a time, covering [block_start_pfn, block_end_pfn)
	for (; block_end_pfn <= cc->free_pfn;
			fast_find_block = false,
			low_pfn = block_end_pfn,
			block_start_pfn = block_end_pfn,
			block_end_pfn += pageblock_nr_pages) {

		/*
		 * This can potentially iterate a massively long zone with
		 * many pageblocks unsuitable, so periodically check if we
		 * need to schedule.
		 */
		// avoid hogging the CPU: reschedule roughly every 32 pageblocks
		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
			cond_resched();

		// get the first page of the pageblock [block_start_pfn, block_end_pfn)
		page = pageblock_pfn_to_page(block_start_pfn,
						block_end_pfn, cc->zone);
		// if the page is not usable, move on to the next pageblock
		if (!page)
			continue;

		/*
		 * If isolation recently failed, do not retry. Only check the
		 * pageblock once. COMPACT_CLUSTER_MAX causes a pageblock
		 * to be visited multiple times. Assume skip was checked
		 * before making it "skip" so other compaction instances do
		 * not scan the same block.
		 */
		// if the skip hint is ignored (sync mode, kcompactd, manual trigger,
		// or an allocation of a fixed pfn range), fall through; otherwise skip
		// pageblocks whose PB_migrate_skip bit is set
		if (IS_ALIGNED(low_pfn, pageblock_nr_pages) && !fast_find_block &&
		    !isolation_suitable(cc, page))
			continue;

		/*
		 * For async compaction, also only scan in MOVABLE blocks
		 * without huge pages. Async compaction is optimistic to see
		 * if the minimum amount of work satisfies the allocation.
		 * The cached PFN is updated as it's possible that all
		 * remaining blocks between source and target are unsuitable
		 * and the compaction scanners fail to meet.
		 */
		// 1. a compound page larger than a pageblock has no internal
		//    fragmentation, so skip it;
		// 2. async mode must not block and only handles MIGRATE_MOVABLE /
		//    MIGRATE_CMA blocks: if the request is MIGRATE_MOVABLE the block
		//    must be MOVABLE or CMA, otherwise the block type must match the
		//    requested type, else the block is skipped;
		// 3. in non-async mode, or for kcompactd / manual compaction, the
		//    block is always isolated.
		if (!suitable_migration_source(cc, page)) {
			// if skip hints are in use, record the scan position
			update_cached_migrate(cc, block_end_pfn);
			continue;
		}

		/* Perform the isolation */
		// do the isolation: take the in-use pages in [low_pfn, block_end_pfn)
		// off their LRU lists onto cc->migratepages; returns the last pfn handled
		low_pfn = isolate_migratepages_block(cc, low_pfn, block_end_pfn,
						isolate_mode);

		// isolation is aborted when the node already has too many isolated
		// pages and either:
		// 1. cc still holds pages that have not been migrated yet,
		// 2. we are in async mode, or
		// 3. a fatal signal is pending
		if (!low_pfn)
			return ISOLATE_ABORT;

		/*
		 * Either we isolated something and proceed with migration. Or
		 * we failed and compact_zone should decide if we should
		 * continue or not.
		 */
		break;
	}

	/* Record where migration scanner will be restarted. */
	// record where the next scan should start
	cc->migrate_pfn = low_pfn;

	// success if any page was isolated, otherwise none
	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

// mm/compaction.c
/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *				  a single pageblock
 * @cc:		Compaction control structure.
 * @low_pfn:	The first PFN to isolate
 * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	pg_data_t *pgdat = cc->zone->zone_pgdat;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;
	bool skip_on_failure = false;
	unsigned long next_skip_pfn = 0;
	bool skip_updated = false;

	/*
	 * Ensure that there are not too many pages isolated from the LRU
	 * list by either parallel reclaimers or compaction. If there are,
	 * delay for some time until fewer pages are isolated
	 */
	// keep reclaim and compaction balanced: do not isolate too many pages at once
	while (unlikely(too_many_isolated(pgdat))) {
		/* stop isolation if there are still pages not migrated */
		if (cc->nr_migratepages)
			return 0;

		/* async migration should just abort */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	cond_resched();

	// for direct compaction in async mode, skip the rest of an order-aligned
	// block as soon as isolation fails inside it
	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
		skip_on_failure = true;
		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
	}

	/* Time to isolate some pages for migration */
	for (; low_pfn < end_pfn; low_pfn++) {

		if (skip_on_failure && low_pfn >= next_skip_pfn) {
			/*
			 * We have isolated all migration candidates in the
			 * previous order-aligned block, and did not skip it due
			 * to failure. We should migrate the pages now and
			 * hopefully succeed compaction.
			 */
			if (nr_isolated)
				break;

			/*
			 * We failed to isolate in the previous order-aligned
			 * block. Set the new boundary to the end of the
			 * current block. Note we can't simply increase
			 * next_skip_pfn by 1 << order, as low_pfn might have
			 * been incremented by a higher number due to skipping
			 * a compound or a high-order buddy page in the
			 * previous loop iteration.
			 */
			next_skip_pfn = block_end_pfn(low_pfn, cc->order);
		}

		/*
		 * Periodically drop the lock (if held) regardless of its
		 * contention, to give chance to IRQs. Abort completely if
		 * a fatal signal is pending.
		 */
		// if a fatal signal is caught here, drop the lock and abort the scan
		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(&pgdat->lru_lock,
					    flags, &locked, cc)) {
			low_pfn = 0;
			goto fatal_pending;
		}

		// invalid page frame number
		if (!pfn_valid_within(low_pfn))
			goto isolate_fail;
		// low_pfn is valid, so count it as scanned
		nr_scanned++;

		// page descriptor for this page frame number
		page = pfn_to_page(low_pfn);

		/*
		 * Check if the pageblock has already been marked skipped.
		 * Only the aligned PFN is checked as the caller isolates
		 * COMPACT_CLUSTER_MAX at a time so the second call must
		 * not falsely conclude that the block should be skipped.
		 */
		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
				low_pfn = end_pfn;
				goto isolate_abort;
			}
			valid_page = page;
		}

		/*
		 * Skip if free. We read page order here without zone lock
		 * which is generally unsafe, but the race window is small and
		 * the worst thing that can happen is that we skip some
		 * potential isolation targets.
		 */
		// a page still in the buddy system is unused, so skip it; migration
		// only applies to pages that are in use
		if (PageBuddy(page)) {
			unsigned long freepage_order = buddy_order_unsafe(page);

			/*
			 * Without lock, we cannot be sure that what we got is
			 * a valid page order. Consider only values in the
			 * valid order range to prevent low_pfn overflow.
			 */
			// skip over the whole free buddy chunk
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted unless we are attempting
		 * an allocation much larger than the huge page size (eg CMA).
		 * We can potentially save a lot of iterations if we skip them
		 * at once. The check is racy, but we can consider only valid
		 * values and the only danger is skipping too much.
		 */
		// transparent or ordinary huge pages are also skipped, unless we are
		// allocating a contiguous range larger than a huge page
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

		/*
		 * Check may be lockless but that's ok as we recheck later.
		 * It's possible to migrate LRU and non-lru movable pages.
		 * Skip any other type of page
		 */
		// reaching here, the page is in use; a page not on an LRU list is
		// usually already isolated or unmovable, but some non-LRU pages are
		// still movable???
		if (!PageLRU(page)) {
			/*
			 * __PageMovable can return false positive so we need
			 * to verify it under page_lock.
			 */
			// if the page is movable and not isolated yet, try to isolate it
			if (unlikely(__PageMovable(page)) &&
					!PageIsolated(page)) {
				if (locked) {
					spin_unlock_irqrestore(&pgdat->lru_lock,
									flags);
					locked = false;
				}

				// if the page is not movable, already isolated, or being
				// freed, take the failure path; otherwise isolate it and
				// mark it as isolated
				if (!isolate_movable_page(page, isolate_mode))
					goto isolate_success;
			}

			goto isolate_fail;
		}

		/*
		 * Migration will fail if an anonymous page is pinned in memory,
		 * so avoid taking lru_lock and isolating it unnecessarily in an
		 * admittedly racy check.
		 */
		// an anonymous page whose reference count exceeds its map count is
		// "pinned" and must not be migrated, so skip it
		if (!page_mapping(page) &&
		    page_count(page) > page_mapcount(page))
			goto isolate_fail;

		/*
		 * Only allow to migrate anonymous pages in GFP_NOFS context
		 * because those do not depend on fs locks.
		 */
		// in a GFP_NOFS context only anonymous pages may be migrated,
		// because fs locks must not be taken
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			goto isolate_fail;

		/* If we already hold the lock, we can skip some rechecking */
		// if the lru lock is not held yet, take it and redo some checks
		if (!locked) {
			locked = compact_lock_irqsave(&pgdat->lru_lock,
								&flags, cc);

			/* Try get exclusive access under lock */
			if (!skip_updated) {
				skip_updated = true;
				if (test_and_set_skip(cc, page, low_pfn))
					goto isolate_abort;
			}

			/* Recheck PageLRU and PageCompound under lock */
			if (!PageLRU(page))
				goto isolate_fail;

			/*
			 * Page become compound since the non-locked check,
			 * and it's on LRU. It can only be a THP so the order
			 * is safe to read and it's 0 for tail pages.
			 */
			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
				low_pfn += compound_nr(page) - 1;
				goto isolate_fail;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, pgdat);

		/* Try isolate the page */
		// take the page off the LRU and clear its LRU flag
		if (__isolate_lru_page(page, isolate_mode) != 0)
			goto isolate_fail;

		/* The whole page is taken off the LRU; skip the tail pages. */
		// if the isolated page is compound (only in the contiguous-range
		// allocation case?), skip its tail pages
		if (PageCompound(page))
			low_pfn += compound_nr(page) - 1;

		/* Successfully isolated */
		// remove it from its (memcg) lruvec list
		del_page_from_lru_list(page, lruvec, page_lru(page));
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_lru(page),
				thp_nr_pages(page));

isolate_success:
		// add the page to the isolation list
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages += compound_nr(page);
		nr_isolated += compound_nr(page);

		/*
		 * Avoid isolating too much unless this block is being
		 * rescanned (e.g. dirty/writeback pages, parallel allocation)
		 * or a lock is contended. For contention, isolate quickly to
		 * potentially remove one source of contention.
		 */
		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
		    !cc->rescan && !cc->contended) {
			++low_pfn;
			break;
		}

		continue;

isolate_fail:
		// if failures do not force skipping, keep scanning this pageblock
		if (!skip_on_failure)
			continue;

		/*
		 * We have isolated some pages, but then failed. Release them
		 * instead of migrating, as we cannot form the cc->order buddy
		 * page anyway.
		 */
		// this order-aligned block has failed and will be skipped, so put the
		// pages already isolated back where they came from (huge-page lists,
		// non-LRU movable state, or the LRU)
		if (nr_isolated) {
			if (locked) {
				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
				locked = false;
			}
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			nr_isolated = 0;
		}

		// nothing isolated here: jump to the next order-aligned block
		if (low_pfn < next_skip_pfn) {
			low_pfn = next_skip_pfn - 1;
			/*
			 * The check near the loop beginning would have updated
			 * next_skip_pfn too, but this is a bit simpler.
			 */
			next_skip_pfn += 1UL << cc->order;
		}
	}

	/*
	 * The PageBuddy() check could have potentially brought us outside
	 * the range to be scanned.
	 */
	if (unlikely(low_pfn > end_pfn))
		low_pfn = end_pfn;

isolate_abort:
	// isolation stopped: drop the lru lock if it is held
	if (locked)
		spin_unlock_irqrestore(&pgdat->lru_lock, flags);

	/*
	 * Updated the cached scanner pfn once the pageblock has been scanned
	 * Pages will either be migrated in which case there is no point
	 * scanning in the near future or migration failed in which case the
	 * failure reason may persist. The block is marked for skipping if
	 * there were no pages isolated in the block or if the block is
	 * rescanned twice in a row.
	 */
	// the pageblock has been fully scanned: mark it PB_migrate_skip so it can
	// be skipped next time, and update the cached scanner position
	if (low_pfn == end_pfn && (!nr_isolated || cc->rescan)) {
		if (valid_page && !skip_updated)
			set_pageblock_skip(valid_page);
		update_cached_migrate(cc, low_pfn);
	}

	trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
						nr_scanned, nr_isolated);

fatal_pending:
	cc->total_migrate_scanned += nr_scanned;
	if (nr_isolated)
		count_compact_events(COMPACTISOLATED, nr_isolated);

	return low_pfn;
}

Before isolating, the isolation mode has to be decided; there are three mode flags:

/* Isolate unmapped pages */
// isolate pages that are not mapped
#define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x2)
/* Isolate for asynchronous migration */
// isolate only pages that will not block
#define ISOLATE_ASYNC_MIGRATE	((__force isolate_mode_t)0x4)
/* Isolate unevictable pages */
// isolate unevictable pages as well
#define ISOLATE_UNEVICTABLE	((__force isolate_mode_t)0x8)

Scanning always proceeds one pageblock at a time. When the scan is not a whole-zone scan, the zone's cached positions compact_cached_free_pfn and compact_cached_migrate_pfn are used; they record where the previous free-page scan and migration scan stopped. When no page frame can be isolated from a pageblock, it is marked PB_migrate_skip, so the next scan may skip it (it is not skipped in sync mode, for manual triggers, for the kcompactd task, or for fixed-range allocations). The isolation flow looks roughly like this:

[Figure: the migration and free scanners converging across a zone's pageblocks]

At the start, migrate_pfn and compact_cached_migrate_pfn both point at the zone's first page frame start_pfn, while free_pfn and compact_cached_free_pfn point at the first frame of the last pageblock. Once the scan starts, suppose pageblock[1] has nothing worth isolating: it is marked PB_migrate_skip and skipped. Scanning on into pageblock[2], suppose x page frames can be isolated; that block is also marked PB_migrate_skip. The free-page scanner then runs: if pageblock[n] yields y free frames, the migration happens and compact_cached_free_pfn is set to the first frame of pageblock[n-1]. If x > y, the free-page scan has to continue. Eventually, when compact_cached_migrate_pfn and compact_cached_free_pfn land in the same pageblock (the check is sketched below), compaction of the zone ends.
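The "two scanners meet" condition used here, and by compact_zone() above via compact_scanners_met(), compares the two positions at pageblock granularity; a sketch of the mainline helper (mm/compaction.c):

/*
 * Sketch of compact_scanners_met() (mm/compaction.c): compaction of a zone is
 * over once the free scanner (moving down) and the migration scanner (moving
 * up) end up in the same or crossing pageblocks.
 */
static inline bool compact_scanners_met(struct compact_control *cc)
{
	return (cc->free_pfn >> pageblock_order)
		<= (cc->migrate_pfn >> pageblock_order);
}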

To summarise, isolation ends when:

every pageblock in the zone has been scanned or skipped;

the zone already has too many isolated pages and either the isolation list still has unprocessed frames, or we are in async mode, or a fatal signal is pending;

page frames were successfully isolated from a pageblock - the normal case.

So who clears the pageblocks marked PB_migrate_skip? There are two cases:

when compact_cached_free_pfn and compact_cached_migrate_pfn meet (point into the same pageblock), compact_blockskip_flush is set to true; when kswapd is about to sleep it clears all of the zone's PB_migrate_skip bits. That makes sense: if they were never cleared, eventually there would be no pageblock left to scan.

outside kswapd, when the deferral count and the deferral window have both reached their maximum (the check is sketched below), the zone's PB_migrate_skip bits are also cleared; the implementation is in __reset_isolation_suitable.
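The "deferral count and window both at their maximum" test used on that second path is compaction_restarting(); a sketch of the mainline helper (mm/compaction.c):

/*
 * Sketch of compaction_restarting() (mm/compaction.c): true when compaction
 * for this order has been deferred as long as it possibly can be, i.e. the
 * deferral window has hit COMPACT_MAX_DEFER_SHIFT and the counter has caught
 * up with it - at that point compact_zone() calls __reset_isolation_suitable()
 * to clear the PB_migrate_skip bits and rescan the whole zone.
 */
bool compaction_restarting(struct zone *zone, int order)
{
	if (order < zone->compact_order_failed)
		return false;

	return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
		zone->compact_considered >= 1UL << zone->compact_defer_shift;
}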

Performing the page migration

After step two, in the normal case, i.e. when pages have been isolated, the page migration itself is performed.

/*
 * migrate_pages - migrate the pages specified in a list, to the free pages
 *		   supplied as the target for the page migration
 *
 * @from:		The list of pages to be migrated.
 * @get_new_page:	The function used to allocate free pages to be used
 *			as the target of the page migration.
 * @put_new_page:	The function used to free target pages if migration
 *			fails, or NULL if no special handling is necessary.
 * @private:		Private data to be passed on to get_new_page()
 * @mode:		The migration mode that specifies the constraints for
 *			page migration, if any.
 * @reason:		The reason for page migration.
 *
 * The function returns after 10 attempts or if no pages are movable any more
 * because the list has become empty or no retryable pages exist any more.
 * The caller should call putback_movable_pages() to return pages to the LRU
 * or free list only if ret != 0.
 *
 * Returns the number of pages that were not migrated, or an error code.
 */
// parameters:
//   from          list of pages to migrate
//   get_new_page  function that provides a free target page
//   put_new_page  function that releases a target page when migration fails
//   private       argument passed to the two callbacks above
//   mode          migration mode
//   reason        reason for the migration
int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason)
{
	...
	int swapwrite = current->flags & PF_SWAPWRITE;
	int rc, nr_subpages;

	// page migration may need to write to the swap area, so the task
	// temporarily gets PF_SWAPWRITE
	if (!swapwrite)
		current->flags |= PF_SWAPWRITE;

	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
		retry = 0;
		thp_retry = 0;

		// walk the "from" list; page is the current entry, page2 the next one
		list_for_each_entry_safe(page, page2, from, lru) {
			// the non-huge-page case
			rc = unmap_and_move(get_new_page, put_new_page,
					private, page, pass > 2, mode, reason);
	...
	if (!swapwrite)
		current->flags &= ~PF_SWAPWRITE;

	return rc;
}

// the call site in compact_zone(): migrate the pages on cc->migratepages into
// free pages supplied by compaction_alloc()
err = migrate_pages(&cc->migratepages, compaction_alloc,
		compaction_free, (unsigned long)cc, cc->mode,
		MR_COMPACTION);

Migration distinguishes huge pages from ordinary pages. Considering only the ordinary-page case here, the logic is:

if transparent huge pages are not supported but the current page happens to be one, return an error;

if the page is only on the LRU and not actually in use, simply free it;

if the page is in use, allocate a free page, unmap the existing page, and move its contents into the new free page;

if step three fails, put the page back on its original list (or clear its isolated state) and release the newly allocated page.

Here is how a free page is obtained:

/*
 * This is a migrate-callback that "allocates" freepages by taking pages
 * from the isolated freelists in the block we are migrating to.
 */
static struct page *compaction_alloc(struct page *migratepage,
					unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct page *freepage;

	// if the free-page list is empty, try to isolate some free pages
	if (list_empty(&cc->freepages)) {
		// isolate free pages, in much the same way the pages to migrate were isolated
		isolate_freepages(cc);

		// nothing could be isolated, report failure
		if (list_empty(&cc->freepages))
			return NULL;
	}

	// take the first page off the list and hand it to the caller
	freepage = list_entry(cc->freepages.next, struct page, lru);
	list_del(&freepage->lru);
	cc->nr_freepages--;

	return freepage;
}

The release path, used when migration fails, is the mirror image:

/*
 * This is a migrate-callback that "frees" freepages back to the isolated
 * freelist. All pages on the freelist are from the same zone, so there is no
 * special handling needed for NUMA.
 */
static void compaction_free(struct page *page, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;

	list_add(&page->lru, &cc->freepages);
	cc->nr_freepages++;
}

These are simple enough not to need further explanation. Note that once migration is finished, any free pages still left in cc also have to be released.

One point worth digging into: while a page is being migrated, what happens if some task needs to access it at exactly that moment, and how is correctness guaranteed?

__unmap_and_move

static int __unmap_and_move(struct page *page, struct page *newpage,
				int force, enum migrate_mode mode)
{
	int rc = -EAGAIN;
	int page_was_mapped = 0;
	struct anon_vma *anon_vma = NULL;
	bool is_lru = !__PageMovable(page);

	// try to lock the old page (set PG_locked); processes can still access it
	if (!trylock_page(page)) {
		// locking failed: unless forced, or if this is async mode, return,
		// because the lock_page() below would block
		if (!force || mode == MIGRATE_ASYNC)
			goto out;

		/*
		 * It's not safe for direct compaction to call lock_page.
		 * For example, during page readahead pages are added locked
		 * to the LRU. Later, when the IO completes the pages are
		 * marked uptodate and unlocked. However, the queueing
		 * could be merging multiple pages for one bio (e.g.
		 * mpage_readahead). If an allocation happens for the
		 * second or third page, the process can end up locking
		 * the same page twice and deadlocking. Rather than
		 * trying to be clever about what pages can be locked,
		 * avoid the use of lock_page for direct compaction
		 * altogether.
		 */
		if (current->flags & PF_MEMALLOC)
			goto out;

		// (light) sync mode: block waiting for the lock
		lock_page(page);
	}

	// if the page is under writeback, only full sync mode with force set
	// waits for the writeback to finish; async and light sync do not wait
	if (PageWriteback(page)) {
		/*
		 * Only in the case of a full synchronous migration is it
		 * necessary to wait for PageWriteback. In the async case,
		 * the retry loop is too short and in the sync-light case,
		 * the overhead of stalling is too much
		 */
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			// async mode gives up immediately
			rc = -EBUSY;
			goto out_unlock;
		}
		// only sync mode with force set waits for the writeback to complete
		if (!force)
			goto out_unlock;
		wait_on_page_writeback(page);
	}

	/*
	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
	 * we cannot notice that anon_vma is freed while we migrates a page.
	 * This get_anon_vma() delays freeing anon_vma pointer until the end
	 * of migration. File cache pages are no problem because of page_lock()
	 * File Caches may use write_page() or lock_page() in migration, then,
	 * just care Anon page here.
	 *
	 * Only page_get_anon_vma() understands the subtleties of
	 * getting a hold on an anon_vma from outside one of its mms.
	 * But if we cannot get anon_vma, then we won't need it anyway,
	 * because that implies that the anon page is no longer mapped
	 * (and cannot be remapped so long as we hold the page lock).
	 */
	// for an anonymous, non-KSM page, take a reference on its anon_vma
	if (PageAnon(page) && !PageKsm(page))
		anon_vma = page_get_anon_vma(page);

	/*
	 * Block others from accessing the new page when we get around to
	 * establishing additional references. We are usually the only one
	 * holding a reference to newpage at this point. We used to have a BUG
	 * here if trylock_page(newpage) fails, but would like to allow for
	 * cases where there might be a race with the previous use of newpage.
	 * This is much like races on refcount of oldpage: just don't BUG().
	 */
	// lock the new page too, so nothing can use it while we migrate into it
	if (unlikely(!trylock_page(newpage)))
		goto out_unlock;

	// a page not on an LRU list is not mapped by any process (???), so it can
	// be moved directly with no unmap step
	if (unlikely(!is_lru)) {
		rc = move_to_new_page(newpage, page, mode);
		goto out_unlock_both;
	}

	/*
	 * Corner case handling:
	 * 1. When a new swap-cache page is read into, it is added to the LRU
	 * and treated as swapcache but it has no rmap yet.
	 * Calling try_to_unmap() against a page->mapping==NULL page will
	 * trigger a BUG. So handle it here.
	 * 2. An orphaned page (see truncate_complete_page) might have
	 * fs-private metadata. The page can be picked up due to memory
	 * offlining. Everywhere else except page reclaim, the page is
	 * invisible to the vm, so the page can not be migrated. So try to
	 * free the metadata, so the page can be freed.
	 */
	// if mapping is NULL there is nothing to unmap (the page may be in the
	// middle of being reclaimed???). Two cases:
	// 1. an anonymous page being swapped out - it has already been unmapped;
	// 2. an orphaned page, e.g. one being "offlined" - it cannot be used and
	//    can be freed directly; if it carries private data, free that first
	if (!page->mapping) {
		VM_BUG_ON_PAGE(PageAnon(page), page);
		if (page_has_private(page)) {
			try_to_free_buffers(page);
			goto out_unlock_both;
		}
	} else if (page_mapped(page)) {
		/* Establish migration ptes */
		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
				page);
		// unmap the page from every process that maps it (via reverse mapping)
		// TTU_MIGRATION: the unmap is for page migration
		// TTU_IGNORE_MLOCK: mlocked pages may be handled too
		// after the unmap, any access to the page will block
		try_to_unmap(page,
			TTU_MIGRATION|TTU_IGNORE_MLOCK);
		page_was_mapped = 1;
	}

	// the page can only be migrated once it is no longer mapped
	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page, mode);

	// if the page had been unmapped above, every process mapping it now holds
	// a "special" migration PTE; once migration is done, those entries are
	// rewritten to point at the migrated page - only then can accesses to the
	// page proceed normally again
	if (page_was_mapped)
		remove_migration_ptes(page,
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
	unlock_page(newpage);
out_unlock:
	/* Drop an anon_vma reference if we took one */
	if (anon_vma)
		put_anon_vma(anon_vma);
	unlock_page(page);
out:
	/*
	 * If migration is successful, decrease refcount of the newpage
	 * which will not free the page because new page owner increased
	 * refcounter. As well, if it is LRU page, add the page to LRU
	 * list in here. Use the old state of the isolated source page to
	 * determine if we migrated a LRU page. newpage was already unlocked
	 * and possibly modified by its owner - don't rely on the page
	 * state.
	 */
	// on success, put newpage back on the LRU (or just drop the reference
	// for a non-LRU page)
	if (rc == MIGRATEPAGE_SUCCESS) {
		if (unlikely(!is_lru))
			put_page(newpage);
		else
			putback_lru_page(newpage);
	}

	return rc;
}

Within migrate_pages there is a step that unmaps the in-use page and, after migration, points the page-table entries at the migrated page; that step is __unmap_and_move, and it is the key to keeping concurrent access correct. The flow in detail:

Lock the old page (set PG_locked). The lock is taken because migration is about to start and other tasks must not modify the contents, although they can still access it. If locking fails, another task probably already holds the lock and we must wait for it to be released: async mode does not wait and returns immediately, while (light) sync mode waits.

After locking, if the page is under writeback, neither async nor light sync mode waits for the writeback to finish; only sync mode with force set does.

Lock the new page so no other task can operate on it; if that fails, bail out.

If the old page is not on an LRU list, it is not mapped by any process and can be migrated directly, with no unmap step.

Unmap the old page: find every process that maps it and replace the corresponding page-table entries with a "special" migration entry. From this point on, any process touching the page finds that special entry and tries to lock the page; since step 1 already locked it, those tasks wait until the lock is released (a sketch of this fault-side wait follows the list).

Migrate the page frame.

Rewrite the "special" page-table entries from step 5 so they point at the migrated page.

Unlock both the old and the new page, waking up the tasks that were waiting in step 5.
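How does a task that touches the old page during step 5 actually end up waiting? Its page fault finds the "special" migration entry and parks on the page lock taken in step 1. The following is only an illustrative sketch of that fault path, loosely based on do_swap_page() in mm/memory.c; the function name is hypothetical and the surrounding details and locking are elided:

/*
 * Illustrative sketch only (simplified from do_swap_page() in mm/memory.c):
 * a fault on a PTE that holds a migration entry waits for the migration to
 * finish, then retries the access against the new page.
 */
vm_fault_t do_swap_page_sketch(struct vm_fault *vmf)
{
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	if (is_migration_entry(entry)) {
		/* sleeps until the old page's PG_locked (taken in step 1) is released */
		migration_entry_wait(vmf->vma->vm_mm, vmf->pmd, vmf->address);
		return 0;	/* retry the fault; the PTE now points at the new page */
	}
	/* ... normal swap-in handling ... */
	return 0;
}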

Summary

This chapter went through the details of memory compaction. After a system has been running for a long time, fragmentation is unavoidable, and once it gets bad enough it hurts the success rate of contiguous allocations. Because compaction works by migrating page frames, only MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE and MIGRATE_CMA page frames are compacted. It is triggered in the following cases:

the "fast path" cannot allocate contiguous memory, so compaction runs on the "slow path";

the kswapd task runs compaction after reclaiming memory;

manual trigger, by writing 1 to /proc/sys/vm/compact_memory;

allocating contiguous page frames in a specified range, when some frames in that range are already in use and must be migrated out via compaction.
