| .. | .. |
|---|
| 4 | 4 | Page migration |
|---|
| 5 | 5 | ============== |
|---|
| 6 | 6 | |
|---|
| 7 | | -Page migration allows the moving of the physical location of pages between |
|---|
| 8 | | -nodes in a numa system while the process is running. This means that the |
|---|
| 7 | +Page migration allows moving the physical location of pages between |
|---|
| 8 | +nodes in a NUMA system while the process is running. This means that the |
|---|
| 9 | 9 | virtual addresses that the process sees do not change. However, the |
|---|
| 10 | 10 | system rearranges the physical location of those pages. |
|---|
| 11 | 11 | |
|---|
| 12 | | -The main intend of page migration is to reduce the latency of memory access |
|---|
| 12 | +Also see :ref:`Heterogeneous Memory Management (HMM) <hmm>` |
|---|
| 13 | +for migrating pages to or from device private memory. |
|---|
| 14 | + |
|---|
| 15 | +The main intent of page migration is to reduce the latency of memory accesses |
|---|
| 13 | 16 | by moving pages near to the processor where the process accessing that memory |
|---|
| 14 | 17 | is running. |
|---|
| 15 | 18 | |
|---|
| 16 | 19 | Page migration allows a process to manually relocate the node on which its |
|---|
| 17 | 20 | pages are located through the MF_MOVE and MF_MOVE_ALL options while setting |
|---|
| 18 | | -a new memory policy via mbind(). The pages of process can also be relocated |
|---|
| 21 | +a new memory policy via mbind(). The pages of a process can also be relocated |
|---|
| 19 | 22 | from another process using the sys_migrate_pages() function call. The |
|---|
| 20 | | -migrate_pages function call takes two sets of nodes and moves pages of a |
|---|
| 23 | +migrate_pages() function call takes two sets of nodes and moves pages of a |
|---|
| 21 | 24 | process that are located on the from nodes to the destination nodes. |
|---|
| 22 | 25 | Page migration functions are provided by the numactl package by Andi Kleen |
|---|
| 23 | 26 | (a version later than 0.9.3 is required. Get it from |
|---|
| 24 | | -ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma |
|---|
| 25 | | -which provides an interface similar to other numa functionality for page |
|---|
| 27 | +https://github.com/numactl/numactl.git). numactl provides libnuma |
|---|
| 28 | +which provides an interface similar to other NUMA functionality for page |
|---|
| 26 | 29 | migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the |
|---|
| 27 | 30 | pages of a process are located. See also the numa_maps documentation in the |
|---|
| 28 | 31 | proc(5) man page. |
|---|
| .. | .. |
|---|
| 30 | 33 | Manual migration is useful if for example the scheduler has relocated |
|---|
| 31 | 34 | a process to a processor on a distant node. A batch scheduler or an |
|---|
| 32 | 35 | administrator may detect the situation and move the pages of the process |
|---|
| 33 | | -nearer to the new processor. The kernel itself does only provide |
|---|
| 36 | +nearer to the new processor. The kernel itself only provides |
|---|
| 34 | 37 | manual page migration support. Automatic page migration may be implemented |
|---|
| 35 | 38 | through user space processes that move pages. A special function call |
|---|
| 36 | 39 | "move_pages" allows the moving of individual pages within a process. |
|---|
| 37 | | -A NUMA profiler may f.e. obtain a log showing frequent off node |
|---|
| 40 | +For example, a NUMA profiler may obtain a log showing frequent off-node |
|---|
| 38 | 41 | accesses and may use the result to move pages to more advantageous |
|---|
| 39 | 42 | locations. |
|---|
| 40 | 43 | |
|---|
| 41 | 44 | Larger installations usually partition the system using cpusets into |
|---|
| 42 | 45 | sections of nodes. Paul Jackson has equipped cpusets with the ability to |
|---|
| 43 | 46 | move pages when a task is moved to another cpuset (See |
|---|
| 44 | | -Documentation/cgroup-v1/cpusets.txt). |
|---|
| 45 | | -Cpusets allows the automation of process locality. If a task is moved to |
|---|
| 47 | +:ref:`CPUSETS <cpusets>`). |
|---|
| 48 | +Cpusets allow the automation of process locality. If a task is moved to |
|---|
| 46 | 49 | a new cpuset then also all its pages are moved with it so that the |
|---|
| 47 | 50 | performance of the process does not sink dramatically. Also the pages |
|---|
| 48 | 51 | of processes in a cpuset are moved if the allowed memory nodes of a |
|---|
| .. | .. |
|---|
| 67 | 70 | Lists of pages to be migrated are generated by scanning over |
|---|
| 68 | 71 | pages and moving them into lists. This is done by |
|---|
| 69 | 72 | calling isolate_lru_page(). |
|---|
| 70 | | - Calling isolate_lru_page increases the references to the page |
|---|
| 73 | + Calling isolate_lru_page() increases the references to the page |
|---|
| 71 | 74 | so that it cannot vanish while the page migration occurs. |
|---|
| 72 | | - It also prevents the swapper or other scans to encounter |
|---|
| 75 | + It also prevents the swapper or other scans from encountering |
|---|
| 73 | 76 | the page. |
|---|
| 74 | 77 | |
|---|
| 75 | 78 | 2. We need to have a function of type new_page_t that can be |
|---|
| .. | .. |
|---|
| 91 | 94 | |
|---|
| 92 | 95 | Steps: |
|---|
| 93 | 96 | |
|---|
| 94 | | -1. Lock the page to be migrated |
|---|
| 97 | +1. Lock the page to be migrated. |
|---|
| 95 | 98 | |
|---|
| 96 | 99 | 2. Ensure that writeback is complete. |
|---|
| 97 | 100 | |
|---|
| 98 | 101 | 3. Lock the new page that we want to move to. It is locked so that accesses to |
|---|
| 99 | | - this (not yet uptodate) page immediately lock while the move is in progress. |
|---|
| 102 | + this (not yet up-to-date) page immediately block while the move is in progress. |
|---|
| 100 | 103 | |
|---|
| 101 | 104 | 4. All the page table references to the page are converted to migration |
|---|
| 102 | 105 | entries. This decreases the mapcount of a page. If the resulting |
|---|
| 103 | 106 | mapcount is not zero then we do not migrate the page. All user space |
|---|
| 104 | | - processes that attempt to access the page will now wait on the page lock. |
|---|
| 107 | + processes that attempt to access the page will now wait on the page lock |
|---|
| 108 | + or wait for the migration page table entry to be removed. |
|---|
| 105 | 109 | |
|---|
| 106 | 110 | 5. The i_pages lock is taken. This will cause all processes trying |
|---|
| 107 | 111 | to access the page via the mapping to block on the spinlock. |
|---|
| 108 | 112 | |
|---|
| 109 | | -6. The refcount of the page is examined and we back out if references remain |
|---|
| 110 | | - otherwise we know that we are the only one referencing this page. |
|---|
| 113 | +6. The refcount of the page is examined and we back out if references remain. |
|---|
| 114 | + Otherwise, we know that we are the only one referencing this page. |
|---|
| 111 | 115 | |
|---|
| 112 | 116 | 7. The radix tree is checked and if it does not contain the pointer to this |
|---|
| 113 | 117 | page then we back out because someone else modified the radix tree. |
|---|
| .. | .. |
|---|
| 134 | 138 | |
|---|
| 135 | 139 | 15. Queued up writeback on the new page is triggered. |
|---|
| 136 | 140 | |
|---|
| 137 | | -16. If migration entries were page then replace them with real ptes. Doing |
|---|
| 138 | | - so will enable access for user space processes not already waiting for |
|---|
| 139 | | - the page lock. |
|---|
| 141 | +16. If migration entries were inserted into the page table, then replace them |
|---|
| 142 | + with real ptes. Doing so will enable access for user space processes not |
|---|
| 143 | + already waiting for the page lock. |
|---|
| 140 | 144 | |
|---|
| 141 | | -19. The page locks are dropped from the old and new page. |
|---|
| 145 | +17. The page locks are dropped from the old and new page. |
|---|
| 142 | 146 | Processes waiting on the page lock will redo their page faults |
|---|
| 143 | 147 | and will reach the new page. |
|---|
| 144 | 148 | |
|---|
| 145 | | -20. The new page is moved to the LRU and can be scanned by the swapper |
|---|
| 146 | | - etc again. |
|---|
| 149 | +18. The new page is moved to the LRU and can be scanned by the swapper, |
|---|
| 150 | + etc. again. |
|---|
| 147 | 151 | |
|---|
| 148 | 152 | Non-LRU page migration |
|---|
| 149 | 153 | ====================== |
|---|
| 150 | 154 | |
|---|
| 151 | | -Although original migration aimed for reducing the latency of memory access |
|---|
| 152 | | -for NUMA, compaction who want to create high-order page is also main customer. |
|---|
| 155 | +Although migration was originally aimed at reducing the latency of memory accesses |
|---|
| 156 | +for NUMA, compaction also uses migration to create high-order pages. |
|---|
| 153 | 157 | |
|---|
| 154 | 158 | Current problem of the implementation is that it is designed to migrate only |
|---|
| 155 | | -*LRU* pages. However, there are potential non-lru pages which can be migrated |
|---|
| 159 | +*LRU* pages. However, there are potential non-LRU pages which can be migrated |
|---|
| 156 | 160 | in drivers, for example, zsmalloc, virtio-balloon pages. |
|---|
| 157 | 161 | |
|---|
| 158 | 162 | For virtio-balloon pages, some parts of migration code path have been hooked |
|---|
| 159 | 163 | up and added virtio-balloon specific functions to intercept migration logics. |
|---|
| 160 | 164 | It's too specific to a driver so other drivers who want to make their pages |
|---|
| 161 | | -movable would have to add own specific hooks in migration path. |
|---|
| 165 | +movable would have to add their own specific hooks in the migration path. |
|---|
| 162 | 166 | |
|---|
| 163 | | -To overclome the problem, VM supports non-LRU page migration which provides |
|---|
| 167 | +To overcome the problem, VM supports non-LRU page migration which provides |
|---|
| 164 | 168 | generic functions for non-LRU movable pages without driver specific hooks |
|---|
| 165 | | -migration path. |
|---|
| 169 | +in the migration path. |
|---|
| 166 | 170 | |
|---|
| 167 | | -If a driver want to make own pages movable, it should define three functions |
|---|
| 171 | +If a driver wants to make its pages movable, it should define three functions |
|---|
| 168 | 172 | which are function pointers of struct address_space_operations. |
|---|
| 169 | 173 | |
|---|
| 170 | 174 | 1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);`` |
|---|
| 171 | 175 | |
|---|
| 172 | | - What VM expects on isolate_page function of driver is to return *true* |
|---|
| 173 | | - if driver isolates page successfully. On returing true, VM marks the page |
|---|
| 176 | + What the VM expects from the driver's isolate_page() function is to return *true* |
|---|
| 177 | + if the driver isolates the page successfully. On returning true, the VM marks the page |
|---|
| 174 | 178 | as PG_isolated so concurrent isolation in several CPUs skip the page |
|---|
| 175 | 179 | for isolation. If a driver cannot isolate the page, it should return *false*. |
|---|
| 176 | 180 | |
|---|
| 177 | 181 | Once page is successfully isolated, VM uses page.lru fields so driver |
|---|
| 178 | | - shouldn't expect to preserve values in that fields. |
|---|
| 182 | + shouldn't expect to preserve values in those fields. |
|---|
| 179 | 183 | |
|---|
| 180 | 184 | 2. ``int (*migratepage) (struct address_space *mapping,`` |
|---|
| 181 | 185 | | ``struct page *newpage, struct page *oldpage, enum migrate_mode);`` |
|---|
| 182 | 186 | |
|---|
| 183 | | - After isolation, VM calls migratepage of driver with isolated page. |
|---|
| 184 | | - The function of migratepage is to move content of the old page to new page |
|---|
| 187 | + After isolation, VM calls migratepage() of driver with the isolated page. |
|---|
| 188 | + The function of migratepage() is to move the contents of the old page to the |
|---|
| 189 | + new page |
|---|
| 185 | 190 | and set up fields of struct page newpage. Keep in mind that you should |
|---|
| 186 | 191 | indicate to the VM the oldpage is no longer movable via __ClearPageMovable() |
|---|
| 187 | | - under page_lock if you migrated the oldpage successfully and returns |
|---|
| 192 | + under page_lock if you migrated the oldpage successfully and returned |
|---|
| 188 | 193 | MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver |
|---|
| 189 | 194 | can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time |
|---|
| 190 | | - because VM interprets -EAGAIN as "temporal migration failure". On returning |
|---|
| 191 | | - any error except -EAGAIN, VM will give up the page migration without retrying |
|---|
| 192 | | - in this time. |
|---|
| 195 | + because VM interprets -EAGAIN as "temporary migration failure". On returning |
|---|
| 196 | + any error except -EAGAIN, VM will give up the page migration without |
|---|
| 197 | + retrying. |
|---|
| 193 | 198 | |
|---|
| 194 | | - Driver shouldn't touch page.lru field VM using in the functions. |
|---|
| 199 | + Driver shouldn't touch the page.lru field while in the migratepage() function. |
|---|
| 195 | 200 | |
|---|
| 196 | 201 | 3. ``void (*putback_page)(struct page *);`` |
|---|
| 197 | 202 | |
|---|
| 198 | | - If migration fails on isolated page, VM should return the isolated page |
|---|
| 199 | | - to the driver so VM calls driver's putback_page with migration failed page. |
|---|
| 200 | | - In this function, driver should put the isolated page back to the own data |
|---|
| 203 | + If migration fails on the isolated page, VM should return the isolated page |
|---|
| 204 | + to the driver so VM calls the driver's putback_page() with the isolated page. |
|---|
| 205 | + In this function, the driver should put the isolated page back into its own data |
|---|
| 201 | 206 | structure. |
|---|
| 202 | 207 | |
|---|
| 203 | | -4. non-lru movable page flags |
|---|
| 208 | +4. non-LRU movable page flags |
|---|
| 204 | 209 | |
|---|
| 205 | | - There are two page flags for supporting non-lru movable page. |
|---|
| 210 | + There are two page flags for supporting non-LRU movable pages. |
|---|
| 206 | 211 | |
|---|
| 207 | 212 | * PG_movable |
|---|
| 208 | 213 | |
|---|
| 209 | | - Driver should use the below function to make page movable under page_lock:: |
|---|
| 214 | + Driver should use the function below to make page movable under page_lock:: |
|---|
| 210 | 215 | |
|---|
| 211 | 216 | void __SetPageMovable(struct page *page, struct address_space *mapping) |
|---|
| 212 | 217 | |
|---|
| 213 | 218 | It needs argument of address_space for registering migration |
|---|
| 214 | 219 | family functions which will be called by VM. Exactly speaking, |
|---|
| 215 | | - PG_movable is not a real flag of struct page. Rather than, VM |
|---|
| 216 | | - reuses page->mapping's lower bits to represent it. |
|---|
| 220 | + PG_movable is not a real flag of struct page. Rather, VM |
|---|
| 221 | + reuses the page->mapping's lower bits to represent it:: |
|---|
| 217 | 222 | |
|---|
| 218 | | -:: |
|---|
| 219 | 223 | #define PAGE_MAPPING_MOVABLE 0x2 |
|---|
| 220 | 224 | page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; |
|---|
| 221 | 225 | |
|---|
| 222 | 226 | so driver shouldn't access page->mapping directly. Instead, driver should |
|---|
| 223 | | - use page_mapping which mask off the low two bits of page->mapping under |
|---|
| 224 | | - page lock so it can get right struct address_space. |
|---|
| 227 | + use page_mapping() which masks off the low two bits of page->mapping under |
|---|
| 228 | + page lock so it can get the right struct address_space. |
|---|
| 225 | 229 | |
|---|
| 226 | | - For testing of non-lru movable page, VM supports __PageMovable function. |
|---|
| 227 | | - However, it doesn't guarantee to identify non-lru movable page because |
|---|
| 228 | | - page->mapping field is unified with other variables in struct page. |
|---|
| 229 | | - As well, if driver releases the page after isolation by VM, page->mapping |
|---|
| 230 | | - doesn't have stable value although it has PAGE_MAPPING_MOVABLE |
|---|
| 231 | | - (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether |
|---|
| 232 | | - page is LRU or non-lru movable once the page has been isolated. Because |
|---|
| 233 | | - LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also |
|---|
| 234 | | - good for just peeking to test non-lru movable pages before more expensive |
|---|
| 235 | | - checking with lock_page in pfn scanning to select victim. |
|---|
| 230 | + For testing of non-LRU movable pages, VM supports __PageMovable() function. |
|---|
| 231 | + However, it doesn't guarantee to identify non-LRU movable pages because |
|---|
| 232 | + the page->mapping field is unified with other variables in struct page. |
|---|
| 233 | + If the driver releases the page after isolation by VM, page->mapping |
|---|
| 234 | + doesn't have a stable value although it has PAGE_MAPPING_MOVABLE set |
|---|
| 235 | + (look at __ClearPageMovable). But __PageMovable() is cheap to call whether |
|---|
| 236 | + page is LRU or non-LRU movable once the page has been isolated because LRU |
|---|
| 237 | + pages can never have PAGE_MAPPING_MOVABLE set in page->mapping. It is also |
|---|
| 238 | + good for just peeking to test non-LRU movable pages before more expensive |
|---|
| 239 | + checking with lock_page() in pfn scanning to select a victim. |
|---|
| 236 | 240 | |
|---|
| 237 | | - For guaranteeing non-lru movable page, VM provides PageMovable function. |
|---|
| 238 | | - Unlike __PageMovable, PageMovable functions validates page->mapping and |
|---|
| 239 | | - mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden |
|---|
| 240 | | - destroying of page->mapping. |
|---|
| 241 | + To reliably identify a non-LRU movable page, the VM provides the PageMovable() function. |
|---|
| 242 | + Unlike __PageMovable(), PageMovable() validates page->mapping and |
|---|
| 243 | + mapping->a_ops->isolate_page under lock_page(). The lock_page() prevents |
|---|
| 244 | + sudden destroying of page->mapping. |
|---|
| 241 | 245 | |
|---|
| 242 | | - Driver using __SetPageMovable should clear the flag via __ClearMovablePage |
|---|
| 243 | | - under page_lock before the releasing the page. |
|---|
| 246 | + Drivers using __SetPageMovable() should clear the flag via |
|---|
| 247 | + __ClearPageMovable() under the page lock before releasing the page. |
|---|
| 244 | 248 | |
|---|
| 245 | 249 | * PG_isolated |
|---|
| 246 | 250 | |
|---|
| 247 | 251 | To prevent concurrent isolation among several CPUs, VM marks isolated page |
|---|
| 248 | | - as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru |
|---|
| 249 | | - movable page, it can skip it. Driver doesn't need to manipulate the flag |
|---|
| 250 | | - because VM will set/clear it automatically. Keep in mind that if driver |
|---|
| 251 | | - sees PG_isolated page, it means the page have been isolated by VM so it |
|---|
| 252 | | - shouldn't touch page.lru field. |
|---|
| 253 | | - PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag |
|---|
| 254 | | - for own purpose. |
|---|
| 252 | + as PG_isolated under lock_page(). So if a CPU encounters PG_isolated |
|---|
| 253 | + non-LRU movable page, it can skip it. Driver doesn't need to manipulate the |
|---|
| 254 | + flag because VM will set/clear it automatically. Keep in mind that if the |
|---|
| 255 | + driver sees a PG_isolated page, it means the page has been isolated by the |
|---|
| 256 | + VM so it shouldn't touch the page.lru field. |
|---|
| 257 | + The PG_isolated flag is aliased with the PG_reclaim flag so drivers |
|---|
| 258 | + shouldn't use PG_isolated for their own purposes. |
|---|
| 259 | + |
|---|
| 260 | +Monitoring Migration |
|---|
| 261 | +==================== |
|---|
| 262 | + |
|---|
| 263 | +The following events (counters) can be used to monitor page migration. |
|---|
| 264 | + |
|---|
| 265 | +1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a |
|---|
| 266 | + page was migrated. If the page was a non-THP page, then this counter is |
|---|
| 267 | + increased by one. If the page was a THP, then this counter is increased by |
|---|
| 268 | + the number of THP subpages. For example, migration of a single 2MB THP that |
|---|
| 269 | + has 4KB-size base pages (subpages) will cause this counter to increase by |
|---|
| 270 | + 512. |
|---|
| 271 | + |
|---|
| 272 | +2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for |
|---|
| 273 | + PGMIGRATE_SUCCESS, above: this will be increased by the number of subpages, |
|---|
| 274 | + if it was a THP. |
|---|
| 275 | + |
|---|
| 276 | +3. THP_MIGRATION_SUCCESS: A THP was migrated without being split. |
|---|
| 277 | + |
|---|
| 278 | +4. THP_MIGRATION_FAIL: A THP could neither be migrated nor split. |
|---|
| 279 | + |
|---|
| 280 | +5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had |
|---|
| 281 | + to be split. After splitting, a migration retry was used for its sub-pages. |
|---|
| 282 | + |
|---|
| 283 | +THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or |
|---|
| 284 | +PGMIGRATE_FAIL events. For example, a THP migration failure will cause both |
|---|
| 285 | +THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase. |
|---|
| 255 | 286 | |
|---|
| 256 | 287 | Christoph Lameter, May 8, 2006. |
|---|
| 257 | 288 | Minchan Kim, Mar 28, 2016. |
|---|