Understanding Linux Memory Management in Depth (6): Memory Compaction
Recap
The previous chapter covered the buddy system's allocation and free paths, but only under the assumption that memory was plentiful. As pages get used heavily, large numbers of fragments or to-be-reclaimed pages can accumulate, until no sufficiently large run of contiguous page frames can be found. This chapter looks at the mechanisms the kernel uses to keep allocations succeeding in that situation. Linux has two main mechanisms: memory compaction and memory reclaim, where reclaim further divides into fast reclaim, direct reclaim, and kswapd reclaim. For reasons of space, this chapter covers only memory compaction. Without further ado, let's dive in.
Data Structures
Memory compaction
/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant
 * MIGRATE_SYNC will block when migrating pages
 * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
 *	with the CPU. Instead, page copy happens outside the migratepage()
 *	callback and is likely using a DMA engine. See migrate_vma() and HMM
 *	(mm/hmm.c) for users of this mode.
 */
enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY,
};
Only three page-frame types participate in compaction: MIGRATE_MOVABLE, MIGRATE_CMA and MIGRATE_RECLAIMABLE. Compaction runs in one of four modes:
Asynchronous mode (MIGRATE_ASYNC): no blocking operation is allowed; as soon as blocking or rescheduling would be required, compaction stops. In this mode only MIGRATE_MOVABLE and MIGRATE_CMA page frames are processed, never MIGRATE_RECLAIMABLE ones, because those are mostly file pages, and compacting file pages may involve dirty-page writeback, which can block.
Light-synchronous mode (MIGRATE_SYNC_LIGHT): most blocking operations are allowed, but not waiting on writeback of dirty file pages, since writeback can take a very long time.
Synchronous mode (MIGRATE_SYNC): blocking is allowed while migrating page frames, i.e. the run may wait for page writeback to complete before returning a result. This is the most expensive mode. It scans the whole zone and does not skip pageblocks marked with the PB_migrate_skip bit.
No-copy synchronous mode (MIGRATE_SYNC_NO_COPY): like synchronous mode, blocking is allowed during migration, but the CPU does not copy the pages; the copy happens outside the migratepage() callback (typically via a DMA engine).
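As an aside, the mode/migratetype restriction above can be sketched as a tiny userspace model. Everything below (the `MT_*` names and `suitable_source()`) is invented for illustration and is not a kernel API; it only mirrors the rule that async compaction avoids reclaimable (mostly file-backed) pageblocks.

```c
#include <assert.h>
#include <stdbool.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC,
		    MIGRATE_SYNC_NO_COPY };
/* Illustrative stand-ins for the kernel's pageblock migratetypes */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA };

/* Which pageblock types a compaction pass will consider as sources:
 * unmovable blocks never, and async mode additionally avoids
 * reclaimable blocks, whose migration might trigger blocking writeback. */
static bool suitable_source(enum migrate_mode mode, enum migratetype mt)
{
	if (mt == MT_UNMOVABLE)
		return false;		/* never compactable */
	if (mode == MIGRATE_ASYNC)
		return mt == MT_MOVABLE || mt == MT_CMA;
	return true;			/* sync modes also take reclaimable */
}
```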
struct zone {
	...
#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 * compact_order_failed is the minimum compaction failed order.
	 */
	// Deferral counter: once the number of deferred attempts reaches
	// 1 << compact_defer_shift, no further deferral is allowed
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	// Lowest order at which compaction of this zone is expected to fail.
	// If the current order >= compact_order_failed, deferral is allowed
	// (to raise the compaction success rate); below it, compaction starts
	// right away.
	// On success, compact_order_failed is set to order + 1;
	// on failure, it is set to order.
	int			compact_order_failed;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	// Cached start pfn for the free-page scanner
	unsigned long		compact_cached_free_pfn;
	/* pfn where compaction migration scanner should start */
	// Cached start pfn for the migration scanner, one slot each for
	// async (0) and sync (1) runs
	unsigned long		compact_cached_migrate_pfn[ASYNC_AND_SYNC];
	// Initial pfn for the migration scanner
	unsigned long		compact_init_migrate_pfn;
	// Initial pfn for the free scanner
	unsigned long		compact_init_free_pfn;
#endif
	...
};
Because compaction is expensive, Linux implements several ways to cut its cost, mainly: 1. run compaction less often; 2. skip pages that were already scanned. The deferral mechanism postpones compaction attempts that are predicted to fail; during scanning, compaction distinguishes whole-zone scans from partial scans, and a partial scan resumes from the positions cached by the previous run, skipping pages that were already scanned.
struct compact_control {
	// Free pages collected while scanning pageblocks
	struct list_head freepages;	/* List of free pages to migrate to */
	// In-use pages to be migrated away
	struct list_head migratepages;	/* List of pages being migrated */
	// Number of pages on freepages
	unsigned int nr_freepages;	/* Number of isolated free pages */
	// Number of pages on migratepages
	unsigned int nr_migratepages;	/* Number of pages to migrate */
	// Start pfn for isolating free pages
	unsigned long free_pfn;		/* isolate_freepages search base */
	// Start pfn for isolating migratable pages
	unsigned long migrate_pfn;	/* isolate_migratepages search base */
	// Start pfn for the fast search
	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
	// The zone being compacted
	struct zone *zone;
	// Pages scanned so far by the migration scanner
	unsigned long total_migrate_scanned;
	// Pages scanned so far by the free scanner
	unsigned long total_free_scanned;
	unsigned short fast_search_fail;/* failures to use free list searches */
	// Order used for the fast search
	short search_order;		/* order to start a fast search at */
	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
	// The order the failed allocation actually needs
	int order;			/* order a direct compactor needs */
	// Migratetype of the current allocation
	int migratetype;		/* migratetype of direct compactor */
	// Allocation flags of the current allocation
	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
	// Highest zone index the current allocation may use
	const int highest_zoneidx;	/* zone index of a direct compactor */
	// Compaction mode for this run
	enum migrate_mode mode;		/* Async or sync migration mode */
	// Whether to ignore the pageblock skip hints
	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
	bool no_set_skip_hint;		/* Don't mark blocks for skipping */
	// Whether to scan blocks even if they look unsuitable
	bool ignore_block_suitable;	/* Scan blocks considered unsuitable */
	// True for direct compaction; false when run from kcompactd or
	// triggered through /proc
	bool direct_compaction;		/* False from kcompactd or /proc/... */
	// True for kcompactd's proactive compaction
	bool proactive_compaction;	/* kcompactd proactive compaction */
	// Whether the whole zone is being scanned
	bool whole_zone;		/* Whole zone should/has been scanned */
};
This is the control structure for compaction. The to-be-migrated frames and the free frames found by the scanners are isolated out of the buddy system. A fully successful compaction run moves every to-be-migrated frame into the free frames. Since one run may still not yield enough contiguous memory, several runs may be needed.
Algorithm
(Figure: memory compaction)
The figure above is a simplified model of compaction. Suppose an allocation requests 4 contiguous page frames; the zone has enough free frames, but fragmentation is so severe that no 4 of them are contiguous. After compaction, the request for 4 contiguous frames can be satisfied. The rest of this section walks through the compaction flow in detail, in combination with the data structures above.
__alloc_pages_direct_compact
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, enum compact_result *compact_result)
{
	struct page *page = NULL;
	unsigned long pflags;
	unsigned int noreclaim_flag;

	// An order-0 request needs no compaction
	if (!order)
		return NULL;

	// Try to compact memory
	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
								prio, &page);

	if (page)
		// Prepare the newly obtained page
		prep_new_page(page, order, gfp_mask, alloc_flags);

	/* Try get a page from the freelist if available */
	// Even if compaction did not hand back a page directly, retry the
	// freelist (a fallback allocation may now succeed)
	if (!page)
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (page) {
		struct zone *zone = page_zone(page);

		// Why set this to false??? (presumably so the skip hints are
		// not flushed right away after a success)
		zone->compact_blockskip_flush = false;
		// On success, compact_considered and compact_defer_shift are
		// reset to 0; if compact_order_failed <= order, it is set to
		// order + 1
		compaction_defer_reset(zone, order, true);
		count_vm_event(COMPACTSUCCESS);
		return page;
	}

	...

	return NULL;
}
This function is the entry point of direct compaction. Its parameters are the allocation order, the allocation flags, the allocation context, the compaction priority, and an out-parameter for the compaction result. The first few have the same meaning as in the allocation path; the priority decides which compaction mode this run uses. Note that the function contains both a compaction flow and an allocation flow, i.e. compaction and allocation are bound together. The triggers for compaction can be summarized as:
After the kswapd task reclaims memory, compaction may be triggered.
The slow allocation path finds that the zone cannot provide enough contiguous page frames.
A manual trigger via /proc/sys/vm/compact_memory.
try_to_compact_pages
// mm/compaction.c
/**
 * try_to_compact_pages - Direct compact to satisfy a high-order allocation
 * @gfp_mask: The GFP mask of the current allocation
 * @order: The order of the current allocation
 * @alloc_flags: The allocation flags of the current allocation
 * @ac: The context of current allocation
 * @prio: Determines how hard direct compaction should try to succeed
 * @capture: Pointer to free page created by compaction will be stored here
 *
 * This is the main entry point for direct page compaction.
 */
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, struct page **capture)
{
	int may_perform_io = gfp_mask & __GFP_IO;
	struct zoneref *z;
	struct zone *zone;
	// The default result is "skipped"
	enum compact_result rc = COMPACT_SKIPPED;

	/*
	 * Check if the GFP flags allow compaction - GFP_NOIO is really
	 * tricky context because the migration might require IO
	 */
	// If IO is forbidden, do not compact at all (migrating pages in a
	// no-IO context could deadlock)
	if (!may_perform_io)
		return COMPACT_SKIPPED;

	/* Compact each zone in the list */
	// Walk every zone in the zonelist
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->highest_zoneidx, ac->nodemask) {
		enum compact_result status;

		if (prio > MIN_COMPACT_PRIORITY
					&& compaction_deferred(zone, order)) {
			// COMPACT_DEFERRED means the attempt was postponed
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;
		}

		// Compact this zone, then try to allocate from it
		status = compact_zone_order(zone, order, gfp_mask, prio,
				alloc_flags, ac->highest_zoneidx, capture);
		rc = max(status, rc);

		/* The allocation should succeed, stop compacting */
		if (status == COMPACT_SUCCESS) {
			/*
			 * We think the allocation will succeed in this zone,
			 * but it is not certain, hence the false. The caller
			 * will repeat this with true if allocation indeed
			 * succeeds in this zone.
			 */
			// Compaction finished and the zone's free frames now
			// cover the request, but they are not guaranteed to be
			// contiguous, so success is not certain, hence "false".
			// compact_order_failed becomes order + 1, so requests
			// below that order will skip deferral next time.
			compaction_defer_reset(zone, order, false);

			break;
		}

		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
					status == COMPACT_PARTIAL_SKIPPED))
			/*
			 * We think that allocation won't succeed in this zone
			 * so we defer compaction there. If it ends up
			 * succeeding after all, it will be reset.
			 */
			defer_compaction(zone, order);

		/*
		 * We might have stopped compacting due to need_resched() in
		 * async compaction, or due to a fatal signal detected. In that
		 * case do not try further zones
		 */
		// Async mode must not block: if the current task needs to be
		// rescheduled, abort the whole compaction pass
		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
					|| fatal_signal_pending(current))
			break;
	}

	return rc;
}
The function walks the zone list and evaluates each zone's state. If a zone is unsuitable for compaction, it is deferred, i.e. skipped. If the zone qualifies, it is compacted, and an allocation is attempted afterwards. The pass terminates once one zone's compaction succeeds, when async mode would need to reschedule, or when a fatal signal is pending. The end conditions can be summarized as:
The zone already has enough contiguous memory, so no compaction is needed; the allocation can be served from this zone directly and the pass ends.
The zone cannot allocate while respecting the watermark, so it is skipped. In async mode, if rescheduling is needed, the pass ends immediately; otherwise it ends after all zones have been visited.
After migrating page frames in a zone, if compaction succeeded and memory is then allocated, the pass ends. If the allocation still fails, the next zone is tried, until all zones have been visited.
compaction_deferred
// mm/compaction.c
/* Returns true if compaction should be skipped this time */
bool compaction_deferred(struct zone *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	// compact_order_failed records the lowest order at which compaction
	// has failed; only at or above it is compaction expected to fail, so
	// only those orders may be deferred (to avoid the cost of a likely
	// failure)
	if (order < zone->compact_order_failed)
		return false;

	/* Avoid possible overflow */
	// Deferral counter: once the number of deferrals reaches the preset
	// limit (1 << zone->compact_defer_shift), stop deferring
	if (++zone->compact_considered >= defer_limit) {
		// avoid overflow
		zone->compact_considered = defer_limit;
		return false;
	}

	return true;
}
The deferral helper: compaction is deferred only when both of the following hold:
The current order is >= compact_order_failed, meaning this attempt would likely fail too. compact_order_failed is set to order when compaction fails and to order + 1 when it succeeds; it is never set back to 0.
The deferral counter has not reached the limit; the counter is compact_considered and the limit is 1 << compact_defer_shift. Each time compaction fails and is deferred (defer_compaction), the counter resets to 0 and compact_defer_shift is incremented; when compaction succeeds and the page-frame allocation succeeds as well, both values are reset to 0.
The deferral policy raises the compaction success rate and reduces the performance lost to failed compaction runs.
compact_zone_order
// mm/compaction.c
static enum compact_result compact_zone_order(struct zone *zone, int order,
		gfp_t gfp_mask, enum compact_priority prio,
		unsigned int alloc_flags, int highest_zoneidx,
		struct page **capture)
{
	enum compact_result ret;
	struct compact_control cc = {
		.order = order,
		.search_order = order,
		.gfp_mask = gfp_mask,
		.zone = zone,
		// Async priority uses async mode; any other priority uses
		// light-sync mode
		.mode = (prio == COMPACT_PRIO_ASYNC) ?
					MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
		.alloc_flags = alloc_flags,
		.highest_zoneidx = highest_zoneidx,
		.direct_compaction = true,
		// At the lowest (most aggressive) priority, scan the whole
		// zone and do not skip pageblocks marked PB_migrate_skip
		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
	};
	struct capture_control capc = {
		.cc = &cc,
		.page = NULL,
	};

	/*
	 * Make sure the structs are really initialized before we expose the
	 * capture control, in case we are interrupted and the interrupt handler
	 * frees a page.
	 */
	barrier();
	WRITE_ONCE(current->capture_control, &capc);

	// Compact the zone
	ret = compact_zone(&cc, &capc);

	VM_BUG_ON(!list_empty(&cc.freepages));
	VM_BUG_ON(!list_empty(&cc.migratepages));

	/*
	 * Make sure we hide capture control first before we read the captured
	 * page pointer, otherwise an interrupt could free and capture a page
	 * and we would leak it.
	 */
	WRITE_ONCE(current->capture_control, NULL);
	*capture = READ_ONCE(capc.page);

	return ret;
}
This builds the compact_control for the run: async priority selects async mode; any non-async priority selects light-sync mode.
compact_zone
// mm/compaction.c
static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
	enum compact_result ret;
	// Start of the zone: where the migration scanner begins
	unsigned long start_pfn = cc->zone->zone_start_pfn;
	// End of the zone: where the free scanner begins
	unsigned long end_pfn = zone_end_pfn(cc->zone);
	unsigned long last_migrated_pfn;
	// Whether this is a (light-)sync run
	const bool sync = cc->mode != MIGRATE_ASYNC;
	bool update_cached;

	/*
	 * These counters track activities during zone compaction. Initialize
	 * them before compacting a new zone.
	 */
	// Initialize the compaction control state
	cc->total_migrate_scanned = 0;
	cc->total_free_scanned = 0;
	cc->nr_migratepages = 0;
	cc->nr_freepages = 0;
	INIT_LIST_HEAD(&cc->freepages);
	INIT_LIST_HEAD(&cc->migratepages);

	// Migratetype of the current allocation
	cc->migratetype = gfp_migratetype(cc->gfp_mask);
	// Check whether this zone is suitable for compaction
	ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
							cc->highest_zoneidx);
	/* Compaction is likely to fail */
	// COMPACT_SUCCESS here means there is already enough memory, so no
	// compaction is needed; COMPACT_SKIPPED means there is too little
	// free memory, so compaction cannot work either
	if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
		return ret;

	/* huh, compaction_suitable is returning something unexpected */
	VM_BUG_ON(ret != COMPACT_CONTINUE);

	/*
	 * Clear pageblock skip if there were failures recently and compaction
	 * is about to be retried after being deferred.
	 */
	// If both the deferral count and the limit have maxed out, and the
	// zone recently had a successful full scan, reset the zone's cached
	// state: compact_init_migrate_pfn, compact_cached_migrate_pfn,
	// compact_cached_free_pfn, etc. As I understand it, this forces a
	// whole-zone scan to raise the success rate.
	if (compaction_restarting(cc->zone, cc->order))
		__reset_isolation_suitable(cc->zone);

	/*
	 * Setup to move all movable pages to the end of the zone. Used cached
	 * information on where the scanners should start (unless we explicitly
	 * want to compact the whole zone), but check that it is initialised
	 * by ensuring the values are within zone boundaries.
	 */
	cc->fast_start_pfn = 0;
	// For a whole-zone scan, point the migration scanner at the zone's
	// first page frame and the free scanner at the start of the last
	// pageblock (sync mode scans the whole zone)
	if (cc->whole_zone) {
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
	} else {
		// Partial scan (async or light-sync): resume from the cached
		// start positions
		cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = cc->zone->compact_cached_free_pfn;
		// If a cached value is no longer valid, reset it
		if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
			cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
			cc->zone->compact_cached_free_pfn = cc->free_pfn;
		}
		if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
			cc->migrate_pfn = start_pfn;
			cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
			cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
		}

		// If the migration scanner starts from the very beginning,
		// this is effectively a whole-zone scan, so mark it as such
		if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
			cc->whole_zone = true;
	}

	last_migrated_pfn = 0;

	/*
	 * Migrate has separate cached PFNs for ASYNC and SYNC* migration on
	 * the basis that some migrations will fail in ASYNC mode. However,
	 * if the cached PFNs match and pageblocks are skipped due to having
	 * no isolation candidates, then the sync state does not matter.
	 * Until a pageblock with isolation candidates is found, keep the
	 * cached PFNs in sync to avoid revisiting the same blocks.
	 */
	update_cached = !sync &&
		cc->zone->compact_cached_migrate_pfn[0] ==
		cc->zone->compact_cached_migrate_pfn[1];

	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync);

	migrate_prep_local();

	// Decide whether compaction has finished. It ends when:
	// 1. the migration scanner and the free scanner have met;
	// 2. for proactive compaction (as I understand it, only the
	//    kcompactd case): if kswapd is also reclaiming on this node,
	//    stop; otherwise keep compacting until the user-configured
	//    fragmentation score is reached;
	// 3. enough free frames have appeared for the request, or the
	//    request is for movable frames and CMA has enough room.
	// Note: compaction triggered manually through /proc only checks the
	// first condition.
	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
		int err;
		unsigned long start_pfn = cc->migrate_pfn;

		/*
		 * Avoid multiple rescans which can happen if a page cannot be
		 * isolated (dirty/writeback in async mode) or if the migrated
		 * pages are being allocated before the pageblock is cleared.
		 * The first rescan will capture the entire pageblock for
		 * migration. If it fails, it'll be marked skip and scanning
		 * will proceed as normal.
		 */
		cc->rescan = false;
		if (pageblock_start_pfn(last_migrated_pfn) ==
		    pageblock_start_pfn(start_pfn)) {
			cc->rescan = true;
		}

		// Scan for movable page frames, detach them from the LRU and
		// collect them on migratepages
		switch (isolate_migratepages(cc)) {
		case ISOLATE_ABORT:
			// Isolation was aborted: put the pages on
			// migratepages back onto the LRU
			ret = COMPACT_CONTENDED;
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			goto out;
		case ISOLATE_NONE:
			// No movable page frame was found in this pageblock
			if (update_cached) {
				cc->zone->compact_cached_migrate_pfn[1] =
					cc->zone->compact_cached_migrate_pfn[0];
			}

			/*
			 * We haven't isolated and migrated anything, but
			 * there might still be unflushed migrations from
			 * previous cc->order aligned block.
			 */
			goto check_drain;
		case ISOLATE_SUCCESS:
			update_cached = false;
			last_migrated_pfn = start_pfn;
			;
		}

		// Migrate the pages on migratepages into free pages
		err = migrate_pages(&cc->migratepages, compaction_alloc,
				compaction_free, (unsigned long)cc, cc->mode,
				MR_COMPACTION);

		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
							&cc->migratepages);

		/* All pages were either migrated or will be released */
		cc->nr_migratepages = 0;
		if (err) {
			// Migration failed: put the remaining movable pages
			// back onto the LRU
			putback_movable_pages(&cc->migratepages);
			/*
			 * migrate_pages() may return -ENOMEM when scanners meet
			 * and we want compact_finished() to detect it
			 */
			if (err == -ENOMEM && !compact_scanners_met(cc)) {
				ret = COMPACT_CONTENDED;
				goto out;
			}
			/*
			 * We failed to migrate at least one page in the current
			 * order-aligned block, so skip the rest of it.
			 */
			if (cc->direct_compaction &&
						(cc->mode == MIGRATE_ASYNC)) {
				cc->migrate_pfn = block_end_pfn(
						cc->migrate_pfn - 1, cc->order);
				/* Draining pcplists is useless in this case */
				last_migrated_pfn = 0;
			}
		}

check_drain:
		/*
		 * Has the migration scanner moved away from the previous
		 * cc->order aligned block where we migrated from? If yes,
		 * flush the pages that were freed, so that they can merge and
		 * compact_finished() can detect immediately if allocation
		 * would succeed.
		 */
		if (cc->order > 0 && last_migrated_pfn) {
			unsigned long current_block_start =
				block_start_pfn(cc->migrate_pfn, cc->order);

			if (last_migrated_pfn < current_block_start) {
				lru_add_drain_cpu_zone(cc->zone);
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}

		/* Stop if a page has been captured */
		if (capc && capc->page) {
			ret = COMPACT_SUCCESS;
			break;
		}
	}

out:
	/*
	 * Release free pages and update where the free scanner should restart,
	 * so we don't leave any returned pages behind in the next attempt.
	 */
	// If the free list still holds pages, release them back to the buddy
	// system and update the cached free-scanner position
	if (cc->nr_freepages > 0) {
		unsigned long free_pfn = release_freepages(&cc->freepages);

		cc->nr_freepages = 0;
		VM_BUG_ON(free_pfn == 0);
		/* The cached pfn is always the first in a pageblock */
		free_pfn = pageblock_start_pfn(free_pfn);
		/*
		 * Only go back, not forward. The cached pfn might have been
		 * already reset to zone end in compact_finished()
		 */
		if (free_pfn > cc->zone->compact_cached_free_pfn)
			cc->zone->compact_cached_free_pfn = free_pfn;
	}

	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync, ret);

	return ret;
}
This is the core function of compaction, and it operates on a single zone. It does three things: 1. decide whether the zone is suitable for compaction; 2. isolate the page frames to be migrated; 3. carry out the migration. (Goodness, one function packing in this much work :joy:) The three parts are analysed one by one below.
Deciding whether the zone is suitable for compaction
// mm/compaction.c
enum compact_result compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx)
{
	enum compact_result ret;
	int fragindex;

	ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
				    zone_page_state(zone, NR_FREE_PAGES));
	/*
	 * fragmentation index determines if allocation failures are due to
	 * low memory or external fragmentation
	 *
	 * index of -1000 would imply allocations might succeed depending on
	 * watermarks, but we already failed the high-order watermark check
	 * index towards 0 implies failure is due to lack of memory
	 * index towards 1000 implies failure is due to fragmentation
	 *
	 * Only compact if a failure would be due to fragmentation. Also
	 * ignore fragindex for non-costly orders where the alternative to
	 * a successful reclaim/compaction is OOM. Fragindex and the
	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
	 * excessive compaction for costly orders, but it should not be at the
	 * expense of system stability.
	 */
	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
		// "Score" the current fragmentation:
		// 1. a score towards 0 means the allocation failed for lack
		//    of memory;
		// 2. a score towards 1000 means it failed because of
		//    fragmentation.
		// Compaction only makes sense in the fragmentation case.
		fragindex = fragmentation_index(zone, order);
		// sysctl_extfrag_threshold is set via the virtual file system
		// (/proc/sys/vm/extfrag_threshold), range 0~1000; compaction
		// only proceeds when the index exceeds it
		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
			ret = COMPACT_NOT_SUITABLE_ZONE;
	}

	trace_mm_compaction_suitable(zone, order, ret);
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;

	return ret;
}

/*
 * compaction_suitable: Is this suitable to run compaction on this zone now?
 * Returns
 *   COMPACT_SKIPPED  - If there are too few free pages for compaction
 *   COMPACT_SUCCESS  - If the allocation would succeed without compaction
 *   COMPACT_CONTINUE - If compaction should run now
 */
static enum compact_result __compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx,
					unsigned long wmark_target)
{
	unsigned long watermark;

	// Compaction triggered through the virtual file system
	// (/proc/sys/vm/compact_memory) is forced
	if (is_via_compact_memory(order))
		return COMPACT_CONTINUE;

	// Watermark implied by alloc_flags, i.e. the minimum number of free
	// pages required, usually the MIN watermark
	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	/*
	 * If watermarks for high-order allocation are already met, there
	 * should be no need for compaction at all.
	 */
	// If the zone can already serve a 2^order allocation while staying
	// above the watermark, no compaction is needed
	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
								alloc_flags))
		return COMPACT_SUCCESS;

	/*
	 * Watermarks for order-0 must be met for compaction to be able to
	 * isolate free pages for migration targets. This means that the
	 * watermark and alloc_flags have to match, or be more pessimistic than
	 * the check in __isolate_free_page(). We don't use the direct
	 * compactor's alloc_flags, as they are not relevant for freepage
	 * isolation. We however do use the direct compactor's highest_zoneidx
	 * to skip over zones where lowmem reserves would prevent allocation
	 * even if compaction succeeds.
	 * For costly orders, we require low watermark instead of min for
	 * compaction to proceed to increase its chances.
	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
	 * suitable migration targets
	 */
	// At this point the allocation cannot be served at the watermark
	// implied by alloc_flags; pick the low or min watermark depending
	// on the order
	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	// Add the headroom compaction itself needs as migration targets
	// (compact_gap(order), i.e. twice the request size)
	watermark += compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
						ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;

	return COMPACT_CONTINUE;
}
Compaction is unnecessary in the following cases:
The zone can already serve enough contiguous page frames while respecting the required watermark, so there is no point paying for compaction. This presumably covers the case where the fast path failed earlier, but other tasks freed pages while we were in the slow path.
Failing case 1, the zone cannot allocate even a single page frame at the low or min watermark (plus the compaction gap): the zone has essentially no free pages left, so compaction is pointless too, as there would be nowhere to migrate pages to.
Failing cases 1 and 2, the fragmentation is scored. We must estimate whether the allocation failure was caused by lack of memory or by fragmentation; compaction only makes sense in the latter case. The user-set threshold (/proc/sys/vm/extfrag_threshold) is also honored: a large value means the user wants little compaction (performance first); a small value means the user wants compaction to run readily (memory first).
Outside these three cases, compaction proceeds, whether triggered manually or by an allocation failure.
Isolating the page frames to migrate
// mm/compaction.c
/*
 * Isolate all pages that can be migrated from the first suitable block,
 * starting at the block pointed to by the migrate scanner pfn within
 * compact_control.
 */
static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
	unsigned long block_start_pfn;
	unsigned long block_end_pfn;
	unsigned long low_pfn;
	struct page *page;
	// Whether unevictable page frames may be isolated; in non-fully-sync
	// modes, prefer pages whose migration will not block
	const isolate_mode_t isolate_mode =
		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
	bool fast_find_block;

	/*
	 * Start at where we last stopped, or beginning of the zone as
	 * initialized by compact_zone(). The first failure will use
	 * the lowest PFN as the starting point for linear scanning.
	 */
	low_pfn = fast_find_migrateblock(cc);
	block_start_pfn = pageblock_start_pfn(low_pfn);
	if (block_start_pfn < cc->zone->zone_start_pfn)
		block_start_pfn = cc->zone->zone_start_pfn;

	/*
	 * fast_find_migrateblock marks a pageblock skipped so to avoid
	 * the isolation_suitable check below, check whether the fast
	 * search was successful.
	 */
	fast_find_block = low_pfn != cc->migrate_pfn &&
			!cc->fast_search_fail;

	/* Only scan within a pageblock boundary */
	block_end_pfn = pageblock_end_pfn(low_pfn);

	/*
	 * Iterate over whole pageblocks until we find the first suitable.
	 * Do not cross the free scanner.
	 */
	// Scan one pageblock at a time, over [block_start_pfn, block_end_pfn]
	for (; block_end_pfn <= cc->free_pfn;
			fast_find_block = false,
			low_pfn = block_end_pfn,
			block_start_pfn = block_end_pfn,
			block_end_pfn += pageblock_nr_pages) {

		/*
		 * This can potentially iterate a massively long zone with
		 * many pageblocks unsuitable, so periodically check if we
		 * need to schedule.
		 */
		// Avoid hogging the CPU: reschedule after every 32 pageblocks
		// scanned
		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
			cond_resched();

		// Get the first page of the pageblock
		// [block_start_pfn, block_end_pfn]
		page = pageblock_pfn_to_page(block_start_pfn,
						block_end_pfn, cc->zone);
		// If the page is unusable, move on to the next pageblock
		if (!page)
			continue;

		/*
		 * If isolation recently failed, do not retry. Only check the
		 * pageblock once. COMPACT_CLUSTER_MAX causes a pageblock
		 * to be visited multiple times. Assume skip was checked
		 * before making it "skip" so other compaction instances do
		 * not scan the same block.
		 */
		// If the PB_migrate_skip hint is ignored (sync mode, the
		// kcompactd task, manual trigger, or range allocation),
		// proceed without checking; otherwise skip pageblocks whose
		// PB_migrate_skip bit is set
		if (IS_ALIGNED(low_pfn, pageblock_nr_pages) &&
		    !fast_find_block && !isolation_suitable(cc, page))
			continue;

		/*
		 * For async compaction, also only scan in MOVABLE blocks
		 * without huge pages. Async compaction is optimistic to see
		 * if the minimum amount of work satisfies the allocation.
		 * The cached PFN is updated as it's possible that all
		 * remaining blocks between source and target are unsuitable
		 * and the compaction scanners fail to meet.
		 */
		// 1. A compound page at least as big as a pageblock holds no
		//    fragmentation, so skip it.
		// 2. In async mode, blocking is forbidden, so only
		//    MIGRATE_MOVABLE or MIGRATE_CMA pageblocks are handled:
		//    for a MIGRATE_MOVABLE request the block must be MOVABLE
		//    or CMA; for any other request the block's migratetype
		//    must match the request's, otherwise skip it.
		// 3. In non-async modes, or for kcompactd/manual runs, the
		//    block goes on to be isolated.
		if (!suitable_migration_source(cc, page)) {
			// Update the cached scan position when skip hints are
			// in use
			update_cached_migrate(cc, block_end_pfn);
			continue;
		}

		/* Perform the isolation */
		// Do the isolation: pick the in-use pages out of
		// [low_pfn, block_end_pfn) and put them on cc->migratepages;
		// the return value is the last pfn scanned and handled
		low_pfn = isolate_migratepages_block(cc, low_pfn,
						block_end_pfn, isolate_mode);

		// If the node already has too many isolated page frames,
		// isolation aborts in three situations:
		// 1. cc still holds pages that have not been migrated;
		// 2. the mode is async;
		// 3. a fatal signal is pending.
		if (!low_pfn)
			return ISOLATE_ABORT;

		/*
		 * Either we isolated something and proceed with migration. Or
		 * we failed and compact_zone should decide if we should
		 * continue or not.
		 */
		break;
	}

	/* Record where migration scanner will be restarted. */
	// Remember where the next scan should resume
	cc->migrate_pfn = low_pfn;

	// Success if any page was isolated, otherwise nothing was found
	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
// mm/compaction.c
/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *				  a single pageblock
 * @cc:		Compaction control structure.
 * @low_pfn:	The first PFN to isolate
 * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	pg_data_t *pgdat = cc->zone->zone_pgdat;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;
	bool skip_on_failure = false;
	unsigned long next_skip_pfn = 0;
	bool skip_updated = false;

	/*
	 * Ensure that there are not too many pages isolated from the LRU
	 * list by either parallel reclaimers or compaction. If there are,
	 * delay for some time until fewer pages are isolated
	 */
	// Balance against reclaim: do not isolate too many pages at once
	while (unlikely(too_many_isolated(pgdat))) {
		/* stop isolation if there are still pages not migrated */
		if (cc->nr_migratepages)
			return 0;

		/* async migration should just abort */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	cond_resched();

	// For direct compaction in async mode (not kcompactd or manual
	// runs), skip the rest of an order-aligned block once isolation
	// fails within it
	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
		skip_on_failure = true;
		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
	}

	/* Time to isolate some pages for migration */
	for (; low_pfn < end_pfn; low_pfn++) {

		if (skip_on_failure && low_pfn >= next_skip_pfn) {
			/*
			 * We have isolated all migration candidates in the
			 * previous order-aligned block, and did not skip it due
			 * to failure. We should migrate the pages now and
			 * hopefully succeed compaction.
			 */
			if (nr_isolated)
				break;

			/*
			 * We failed to isolate in the previous order-aligned
			 * block. Set the new boundary to the end of the
			 * current block. Note we can't simply increase
			 * next_skip_pfn by 1 << order, as low_pfn might have
			 * been incremented by a higher number due to skipping
			 * a compound or a high-order buddy page in the
			 * previous loop iteration.
			 */
			next_skip_pfn = block_end_pfn(low_pfn, cc->order);
		}

		/*
		 * Periodically drop the lock (if held) regardless of its
		 * contention, to give chance to IRQs. Abort completely if
		 * a fatal signal is pending.
		 */
		// On a fatal signal, drop the lock here and abort the scan
		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(&pgdat->lru_lock,
					    flags, &locked, cc)) {
			low_pfn = 0;
			goto fatal_pending;
		}

		// Invalid page frame number
		if (!pfn_valid_within(low_pfn))
			goto isolate_fail;
		// Valid page frame: count it as scanned
		nr_scanned++;

		// Page descriptor for this pfn
		page = pfn_to_page(low_pfn);

		/*
		 * Check if the pageblock has already been marked skipped.
		 * Only the aligned PFN is checked as the caller isolates
		 * COMPACT_CLUSTER_MAX at a time so the second call must
		 * not falsely conclude that the block should be skipped.
		 */
		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
				low_pfn = end_pfn;
				goto isolate_abort;
			}
			valid_page = page;
		}

		/*
		 * Skip if free. We read page order here without zone lock
		 * which is generally unsafe, but the race window is small and
		 * the worst thing that can happen is that we skip some
		 * potential isolation targets.
		 */
		// A page still in the buddy system is unused, so skip it;
		// migration only concerns pages in use
		if (PageBuddy(page)) {
			unsigned long freepage_order = buddy_order_unsafe(page);

			/*
			 * Without lock, we cannot be sure that what we got is
			 * a valid page order. Consider only values in the
			 * valid order range to prevent low_pfn overflow.
			 */
			// Step over the whole free buddy chunk
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted unless we are attempting
		 * an allocation much larger than the huge page size (eg CMA).
		 * We can potentially save a lot of iterations if we skip them
		 * at once. The check is racy, but we can consider only valid
		 * values and the only danger is skipping too much.
		 */
		// Transparent or ordinary huge pages are also skipped, unless
		// we are allocating a contiguous range (alloc_contig)
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

		/*
		 * Check may be lockless but that's ok as we recheck later.
		 * It's possible to migrate LRU and non-lru movable pages.
		 * Skip any other type of page
		 */
		// Reaching here means the page is in use. Usually a non-LRU
		// page is already isolated or unmovable, but some non-LRU
		// pages are nevertheless movable (__PageMovable driver pages)
		if (!PageLRU(page)) {
			/*
			 * __PageMovable can return false positive so we need
			 * to verify it under page_lock.
			 */
			// A movable, not-yet-isolated page should be isolated
			if (unlikely(__PageMovable(page)) &&
					!PageIsolated(page)) {
				if (locked) {
					spin_unlock_irqrestore(&pgdat->lru_lock,
									flags);
					locked = false;
				}

				// If the page turns out not to be movable, is
				// already isolated, or is being freed, take
				// the failure path; otherwise isolate it and
				// mark it as isolated
				if (!isolate_movable_page(page, isolate_mode))
					goto isolate_success;
			}

			goto isolate_fail;
		}

		/*
		 * Migration will fail if an anonymous page is pinned in memory,
		 * so avoid taking lru_lock and isolating it unnecessarily in an
		 * admittedly racy check.
		 */
		// An anonymous page with more references than mappings is
		// pinned and must not be migrated, so skip it
		if (!page_mapping(page) &&
		    page_count(page) > page_mapcount(page))
			goto isolate_fail;

		/*
		 * Only allow to migrate anonymous pages in GFP_NOFS context
		 * because those do not depend on fs locks.
		 */
		// In a GFP_NOFS context only anonymous pages may be migrated,
		// since fs locks must not be taken
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			goto isolate_fail;

		/* If we already hold the lock, we can skip some rechecking */
		// Take the lock if not yet held, and redo some checks
		if (!locked) {
			locked = compact_lock_irqsave(&pgdat->lru_lock,
								&flags, cc);

			/* Try get exclusive access under lock */
			if (!skip_updated) {
				skip_updated = true;
				if (test_and_set_skip(cc, page, low_pfn))
					goto isolate_abort;
			}

			/* Recheck PageLRU and PageCompound under lock */
			if (!PageLRU(page))
				goto isolate_fail;

			/*
			 * Page become compound since the non-locked check,
			 * and it's on LRU. It can only be a THP so the order
			 * is safe to read and it's 0 for tail pages.
			 */
			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
				low_pfn += compound_nr(page) - 1;
				goto isolate_fail;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, pgdat);

		/* Try isolate the page */
		// Detach the page from the LRU and clear its LRU flag
		if (__isolate_lru_page(page, isolate_mode) != 0)
			goto isolate_fail;

		/* The whole page is taken off the LRU; skip the tail pages. */
		// If the isolated page is a compound page (only when
		// allocating a specific range?), step over its tail pages
		if (PageCompound(page))
			low_pfn += compound_nr(page) - 1;

		/* Successfully isolated */
		// Remove the page from its LRU list (possibly a memcg lruvec)
		del_page_from_lru_list(page, lruvec, page_lru(page));
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_lru(page),
				thp_nr_pages(page));

isolate_success:
		// Add the page to the isolation list
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages += compound_nr(page);
		nr_isolated += compound_nr(page);

		/*
		 * Avoid isolating too much unless this block is being
		 * rescanned (e.g. dirty/writeback pages, parallel allocation)
		 * or a lock is contended. For contention, isolate quickly to
		 * potentially remove one source of contention.
		 */
		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
		    !cc->rescan && !cc->contended) {
			++low_pfn;
			break;
		}

		continue;

isolate_fail:
		// Unless failures end the current block, keep scanning this
		// pageblock
		if (!
skip_on_failure ) continue ; /* * We have isolated some pages, but then failed。 Release them * instead of migrating, as we cannot form the cc->order buddy * page anyway。 */ // 該pageblock已經失敗,並且需要跳過了,則將已經隔離出來的page放回到對應的連結串列中(大頁的、非non-lru、lru中等) if ( nr_isolated ) { if ( locked ) { spin_unlock_irqrestore ( & pgdat -> lru_lock , flags ); locked = false ; } putback_movable_pages ( & cc -> migratepages ); cc -> nr_migratepages = 0 ; nr_isolated = 0 ; } // 沒有隔離到page,跳到下一個pageblock繼續遍歷 if ( low_pfn < next_skip_pfn ) { low_pfn = next_skip_pfn - 1 ; /* * The check near the loop beginning would have updated * next_skip_pfn too, but this is a bit simpler。 */ next_skip_pfn += 1UL << cc -> order ; } } /* * The PageBuddy() check could have potentially brought us outside * the range to be scanned。 */ if ( unlikely ( low_pfn > end_pfn )) low_pfn = end_pfn ; isolate_abort : // 隔離停止,如果page已經加鎖,則進行解鎖 if ( locked ) spin_unlock_irqrestore ( & pgdat -> lru_lock , flags ); /* * Updated the cached scanner pfn once the pageblock has been scanned * Pages will either be migrated in which case there is no point * scanning in the near future or migration failed in which case the * failure reason may persist。 The block is marked for skipping if * there were no pages isolated in the block or if the block is * rescanned twice in a row。 */ // pageblock隔離成功,設定該pageblock的skip屬性,下次跳過該pageblock的處理 if ( low_pfn == end_pfn && ( ! nr_isolated || cc -> rescan )) { if ( valid_page && ! 
skip_updated ) set_pageblock_skip ( valid_page ); update_cached_migrate ( cc , low_pfn ); } trace_mm_compaction_isolate_migratepages ( start_pfn , low_pfn , nr_scanned , nr_isolated ); fatal_pending : cc -> total_migrate_scanned += nr_scanned ; if ( nr_isolated ) count_compact_events ( COMPACTISOLATED , nr_isolated ); return low_pfn ; } 隔離前,需要明確需要在那種模式下進行,有如下三種模式: /* Isolate unmapped pages */ // 隔離沒有對映的頁 #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2) /* Isolate for asynchronous migration */ // 隔離不會阻塞的頁 #define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x4) /* Isolate unevictable pages */ // 隔離不可回收的頁 #define ISOLATE_UNEVICTABLE ((__force isolate_mode_t)0x8) 每次進行掃描時,都是以pageblock為單元進行的。在非全zone掃描場景,會使用zone的掃描快取compact_cached_free_pfn和compact_cached_migrate_pfn,這兩個值分別記錄上次掃描pageblock後的位置。當一個pageblock無法隔離到頁框,該pageblock會標記為PB_migrate_skip,那麼下次掃描的時候,可能會跳過該pageblock(同步、手動觸發、kcompactd任務和指定範圍頁框申請的場景下不會跳過)。下面是隔離操作的大致流程: 在開始的時候,migrate_pfn、compact_cached_migrate_pfn都是指向zone的起始頁幀start_pfn,而free_pfn、compact_cached_free_pfn都是指向最後一個pageblock的起始頁幀。在啟動碎片整理掃描時,發現pageblock[1]本身記憶體不足,則將其設定成PG_migrate_skip並跳過該pageblock。當繼續掃描pageblock[2]時,發現能隔離出x個頁框,同時也會將其置為PG_migrate_skip。這時會啟動空閒頁框掃描,如果pageblock[n]能隔離出y個頁框,則進行遷移並將compact_cached_free_pfn置為pageblock[n-1]的起始頁幀號。如果x > y,則需要繼續啟動空閒頁框的掃描。最終當compact_cached_migrate_pfn和compact_cached_free_pfn指向了同一個pageblock時,則結束。 下面總結隔離結束的條件: 當zone的所有pageblock都無需掃描,則結束。 當zone已經隔離了太多頁面時,並且隔離連結串列中還有沒處理完的頁框,或當前是非同步模式,或捕獲到致命訊號,則結束。 成功從某個pageblock隔離到頁框,這是正常結束場景。 那標記為PB_migrate_skip的pageblock,誰來負責清理呢?主要有如下兩種場景: compact_cached_free_pfn和compact_cached_migrate_pfn相遇時(指向同一個pageblock),則會設定compact_blockskip_flush為true。當kswapd準備睡眠的時候,會清除該zone的所有PB_migrate_skip。這也很好理解,如果再不清除,下次就沒pageblock掃描了。 非kswapd場景下,當推遲次數達到最大,並且閾值也達到最大時,也會清除zone的PB_migrate_skip。相關實現在__reset_isolation_suitable中。 實行頁框遷移操作 第二步結束後,如果是正常場景,即隔離到頁面時,會進行頁面遷移操作。 /* * migrate_pages - migrate the pages specified in a list, to the free pages * supplied as the target for the page migration * * @from: The 
 *			list of pages to be migrated.
 * @get_new_page:	The function used to allocate free pages to be used
 *			as the target of the page migration.
 * @put_new_page:	The function used to free target pages if migration
 *			fails, or NULL if no special handling is necessary.
 * @private:		Private data to be passed on to get_new_page()
 * @mode:		The migration mode that specifies the constraints for
 *			page migration, if any.
 * @reason:		The reason for page migration.
 *
 * The function returns after 10 attempts or if no pages are movable any more
 * because the list has become empty or no retryable pages exist any more.
 * The caller should call putback_movable_pages() to return pages to the LRU
 * or free list only if ret != 0.
 *
 * Returns the number of pages that were not migrated, or an error code.
 */
// Parameters:
// from          list of pages to migrate
// get_new_page  function that allocates a free target page
// put_new_page  function that frees a target page when migration fails
// private       argument passed to the two functions above
// mode          migration mode
// reason        reason for the migration
int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason)
{
	...
	int swapwrite = current->flags & PF_SWAPWRITE;
	int rc, nr_subpages;

	// Page migration requires the current task to be able to write to
	// the swap area
	if (!swapwrite)
		current->flags |= PF_SWAPWRITE;

	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
		retry = 0;
		thp_retry = 0;

		// Walk the "from" list; page is the current entry, page2 the
		// next one
		list_for_each_entry_safe(page, page2, from, lru) {
			// Non-huge-page case
			rc = unmap_and_move(get_new_page, put_new_page,
						private, page, pass > 2, mode,
						reason);
	...
	if (!swapwrite)
		current->flags &= ~PF_SWAPWRITE;

	return rc;
}

// Perform the migration: move the pages on migratepages to free pages
err = migrate_pages(&cc->migratepages, compaction_alloc,
		compaction_free, (unsigned long)cc, cc->mode,
		MR_COMPACTION);

Migration distinguishes between huge-page and regular-page cases. Considering only the regular-page case, the checks are:

1. If transparent huge pages are not supported but the current page happens to be one, return an error directly.
2. If the page is only on the LRU and no longer used, free it directly.
3. If the page is in use, allocate a free page, unmap the existing page, and move its contents to the new free page.
4. If step 3 fails, put the page back on its original list or clear its isolated flag, and free the newly allocated page.

Allocating a free page is implemented as follows:

/*
 * This is a migrate-callback that "allocates" freepages by taking pages
 * from the isolated freelists in the block we are migrating to.
 */
static struct page *compaction_alloc(struct page *migratepage,
					unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct page *freepage;

	// If the free-page list is empty, try to isolate some free pages
	if (list_empty(&cc->freepages)) {
		// Isolate free page frames, much like isolating
		// to-be-migrated frames
		isolate_freepages(cc);

		// No frame could be isolated, so fail
		if (list_empty(&cc->freepages))
			return NULL;
	}

	// Take the first page off the list and hand it to the caller
	freepage = list_entry(cc->freepages.next, struct page, lru);
	list_del(&freepage->lru);
	cc->nr_freepages--;

	return freepage;
}

Likewise, the free path:

/*
 * This is a migrate-callback that "frees" freepages back to the isolated
 * freelist. All pages on the freelist are from the same zone, so there is no
 * special handling needed for NUMA.
 */
static void compaction_free(struct page *page, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;

	list_add(&page->lru, &cc->freepages);
	cc->nr_freepages++;
}

This is simple enough that it needs no further explanation. Note, however, that if free pages are still left in cc after migration completes, they must be released as well.

One point worth examining in detail: when a page is in the middle of being migrated and is accessed at exactly that moment, how does the kernel guarantee nothing goes wrong?

__unmap_and_move

static int __unmap_and_move(struct page *page, struct page *newpage,
				int force, enum migrate_mode mode)
{
	int rc = -EAGAIN;
	int page_was_mapped = 0;
	struct anon_vma *anon_vma = NULL;
	bool is_lru = !__PageMovable(page);

	// Try to lock the old page (set PG_locked); processes can still
	// access the page at this point
	if (!
	    trylock_page(page)) {
		// Locking failed. If this is not a forced operation, or the
		// mode is asynchronous, return immediately: the lock_page()
		// below would block
		if (!force || mode == MIGRATE_ASYNC)
			goto out;

		/*
		 * It's not safe for direct compaction to call lock_page.
		 * For example, during page readahead pages are added locked
		 * to the LRU. Later, when the IO completes the pages are
		 * marked uptodate and unlocked. However, the queueing
		 * could be merging multiple pages for one bio (e.g.
		 * mpage_readahead). If an allocation happens for the
		 * second or third page, the process can end up locking
		 * the same page twice and deadlocking. Rather than
		 * trying to be clever about what pages can be locked,
		 * avoid the use of lock_page for direct compaction
		 * altogether.
		 */
		if (current->flags & PF_MEMALLOC)
			goto out;

		// Block waiting for the lock: (light) sync modes only
		lock_page(page);
	}

	// If the page is under writeback, only forced, fully synchronous
	// migration waits for the writeback to finish; async and light-sync
	// modes do not wait
	if (PageWriteback(page)) {
		/*
		 * Only in the case of a full synchronous migration is it
		 * necessary to wait for PageWriteback. In the async case,
		 * the retry loop is too short and in the sync-light case,
		 * the overhead of stalling is too much
		 */
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			// Async and light-sync modes give up immediately
			rc = -EBUSY;
			goto out_unlock;
		}
		// Even in sync mode, wait for writeback only when forced
		if (!force)
			goto out_unlock;
		wait_on_page_writeback(page);
	}

	/*
	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
	 * we cannot notice that anon_vma is freed while we migrates a page.
	 * This get_anon_vma() delays freeing anon_vma pointer until the end
	 * of migration. File cache pages are no problem because of page_lock()
	 * File Caches may use write_page() or lock_page() in migration, then,
	 * just care Anon page here.
	 *
	 * Only page_get_anon_vma() understands the subtleties of
	 * getting a hold on an anon_vma from outside one of its mms.
	 * But if we cannot get anon_vma, then we won't need it anyway,
	 * because that implies that the anon page is no longer mapped
	 * (and cannot be remapped so long as we hold the page lock).
	 */
	// For an anonymous, non-KSM page, take a reference on its anon_vma
	if (PageAnon(page) && !PageKsm(page))
		anon_vma = page_get_anon_vma(page);

	/*
	 * Block others from accessing the new page when we get around to
	 * establishing additional references. We are usually the only one
	 * holding a reference to newpage at this point. We used to have a BUG
	 * here if trylock_page(newpage) fails, but would like to allow for
	 * cases where there might be a race with the previous use of newpage.
	 * This is much like races on refcount of oldpage: just don't BUG().
	 */
	// Try to lock the new page so it cannot be used while the migration
	// is in progress
	if (unlikely(!trylock_page(newpage)))
		goto out_unlock;

	// A non-LRU movable page has no user mappings, so it can be moved
	// directly, with no unmap step
	if (unlikely(!is_lru)) {
		rc = move_to_new_page(newpage, page, mode);
		goto out_unlock_both;
	}

	/*
	 * Corner case handling:
	 * 1. When a new swap-cache page is read into, it is added to the LRU
	 * and treated as swapcache but it has no rmap yet.
	 * Calling try_to_unmap() against a page->mapping==NULL page will
	 * trigger a BUG. So handle it here.
	 * 2. An orphaned page (see truncate_complete_page) might have
	 * fs-private metadata. The page can be picked up due to memory
	 * offlining. Everywhere else except page reclaim, the page is
	 * invisible to the vm, so the page can not be migrated. So try to
	 * free the metadata, so the page can be freed.
	 */
	// mapping == NULL means no unmap is needed. Two cases:
	// 1. An anonymous page being swapped out; it has already been
	//    unmapped.
	// 2. An orphaned page, for instance because its memory is being
	//    offlined; such a page cannot be used, so it can be freed
	//    directly. If it carries fs-private data, free that first.
	if (!page->mapping) {
		VM_BUG_ON_PAGE(PageAnon(page), page);
		if (page_has_private(page)) {
			try_to_free_buffers(page);
			goto out_unlock_both;
		}
	} else if (page_mapped(page)) {
		/* Establish migration ptes */
		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
				page);
		// Unmap the page from every process that maps it (found via
		// the reverse mapping).
		// TTU_MIGRATION: the unmap is for page migration
		// TTU_IGNORE_MLOCK: mlocked pages may also be unmapped
		// After the unmap, any access to the page will block
		try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
		page_was_mapped = 1;
	}

	// The page can be migrated only when it is no longer mapped
	if (!page_mapped(page))
		rc = move_to_new_page(newpage, page, mode);

	// If the page was unmapped above, every process that mapped it now
	// holds a "special" migration page-table entry. Once the migration
	// finishes, that special entry is rewritten to point at the migrated
	// page, and accesses to the page can proceed normally again
	if (page_was_mapped)
		remove_migration_ptes(page,
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
	unlock_page(newpage);
out_unlock:
	/* Drop an anon_vma reference if we took one */
	if (anon_vma)
		put_anon_vma(anon_vma);
	unlock_page(page);
out:
	/*
	 * If migration is successful, decrease refcount of the newpage
	 * which will not free the page because new page owner increased
	 * refcounter. As well, if it is LRU page, add the page to LRU
	 * list in here. Use the old state of the isolated source page to
	 * determine if we migrated a LRU page. newpage was already unlocked
	 * and possibly modified by its owner - don't rely on the page
	 * state.
	 */
	// On success, put the new page back on the LRU
	if (rc == MIGRATEPAGE_SUCCESS) {
		if (unlikely(!is_lru))
			put_page(newpage);
		else
			putback_lru_page(newpage);
	}

	return rc;
}

Within migrate_pages, one of the steps unmaps a page that is still in use; once migration finishes, the page-table entries must be redirected to the migrated page. This is exactly what __unmap_and_move does, and it is the key to keeping migration transparent. The flow in detail:

1. Lock the old page (set PG_locked). The lock is taken because migration is about to start and no other task may modify the page's contents, although the page can still be read. Note that if the trylock fails, the page has probably been locked by another task and we must wait for the lock to be released; asynchronous mode does not wait and returns immediately, while the (light) synchronous modes wait.
2. After locking, if the page is under writeback, neither asynchronous nor light-synchronous mode blocks waiting for the writeback to complete. Only forced, fully synchronous migration waits.
3. Lock the new page so that no other task can operate on it. If locking fails, bail out.
4. If the old page is not on an LRU list, it is not mapped by any process and can be moved directly, with no unmap step.
5. Unmap the old page: find every process that maps it and replace the corresponding page-table entry with a "special" migration entry. From this point on, any process that touches the page walks into this special entry and then tries to lock the page; since the page was locked in step 1, these tasks sleep until the lock is released.
6. Move the page contents to the new frame.
7. Rewrite the "special" entries from step 5 to point at the migrated page.
8. Unlock both the old page and the new page, waking the tasks that went to sleep in step 5.

Summary

This chapter covered the details of memory compaction. After the system has been running for a long time, fragmentation is inevitable, and too much of it lowers the success rate of contiguous-memory allocations. Because compaction migrates page frames, only the MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE and MIGRATE_CMA page-frame types are compacted. Compaction is triggered in the following cases:

The "fast path" fails to allocate contiguous memory and allocation enters the "slow path", which runs compaction.
The kswapd task runs compaction after reclaiming memory.
Manual trigger: writing 1 to /proc/sys/vm/compact_memory.
When allocating contiguous page frames in a specific range, and part of that range is already in use, the used frames must be migrated away via compaction.
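The manual-trigger path above can be observed from user space: writing 1 to /proc/sys/vm/compact_memory compacts all zones, and /proc/buddyinfo shows, per zone, the number of free blocks of each order, so comparing its high-order columns before and after gives a rough measure of what compaction achieved. The sketch below sums the free pages sitting in order >= 4 blocks of the Normal zone. The numbers in the here-doc are a hypothetical captured sample so that the script runs anywhere; on a live Linux system you would replace the here-doc with `cat /proc/buddyinfo` (and run `echo 1 > /proc/sys/vm/compact_memory` as root beforehand).

```shell
# In /proc/buddyinfo, fields 5..NF of each line are the free-block
# counts for order 0, 1, 2, ... of that zone. A block of order k
# holds 2^k contiguous pages.
high_order_pages=$(awk '/Normal/ {
    n = 0
    for (i = 5; i <= NF; i++) {
        order = i - 5
        if (order >= 4)          # only count blocks of order >= 4
            n += $i * 2 ^ order  # pages contained in those blocks
    }
    print n
}' <<'EOF'
Node 0, zone   Normal   212   141    75    33    18     9     4     2     1     1     0
EOF
)
echo "free pages in order>=4 blocks: $high_order_pages"
```

Rerunning the same computation after triggering compaction should show this number growing as scattered movable pages are merged into larger free blocks.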