.. | .. |
---|
4 | 4 | Page migration |
---|
5 | 5 | ============== |
---|
6 | 6 | |
---|
7 | | -Page migration allows the moving of the physical location of pages between |
---|
8 | | -nodes in a numa system while the process is running. This means that the |
---|
| 7 | +Page migration allows moving the physical location of pages between |
---|
| 8 | +nodes in a NUMA system while the process is running. This means that the |
---|
9 | 9 | virtual addresses that the process sees do not change. However, the |
---|
10 | 10 | system rearranges the physical location of those pages. |
---|
11 | 11 | |
---|
12 | | -The main intend of page migration is to reduce the latency of memory access |
---|
| 12 | +Also see :ref:`Heterogeneous Memory Management (HMM) <hmm>` |
---|
| 13 | +for migrating pages to or from device private memory. |
---|
| 14 | + |
---|
| 15 | +The main intent of page migration is to reduce the latency of memory accesses |
---|
13 | 16 | by moving pages near to the processor where the process accessing that memory |
---|
14 | 17 | is running. |
---|
15 | 18 | |
---|
16 | 19 | Page migration allows a process to manually relocate the node on which its |
---|
17 | 20 | pages are located through the MF_MOVE and MF_MOVE_ALL options while setting |
---|
18 | | -a new memory policy via mbind(). The pages of process can also be relocated |
---|
| 21 | +a new memory policy via mbind(). The pages of a process can also be relocated |
---|
19 | 22 | from another process using the sys_migrate_pages() function call. The |
---|
20 | | -migrate_pages function call takes two sets of nodes and moves pages of a |
---|
| 23 | +migrate_pages() function call takes two sets of nodes and moves pages of a |
---|
21 | 24 | process that are located on the from nodes to the destination nodes. |
---|
22 | 25 | Page migration functions are provided by the numactl package by Andi Kleen |
---|
23 | 26 | (a version later than 0.9.3 is required. Get it from |
---|
24 | | -ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma |
---|
25 | | -which provides an interface similar to other numa functionality for page |
---|
| 27 | +https://github.com/numactl/numactl.git). numactl provides libnuma |
---|
| 28 | +which provides an interface similar to other NUMA functionality for page |
---|
26 | 29 | migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the |
---|
27 | 30 | pages of a process are located. See also the numa_maps documentation in the |
---|
28 | 31 | proc(5) man page. |
---|
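As a rough illustration of reviewing page placement through ``/proc/<pid>/numa_maps``, the Python sketch below sums the per-node page counts of a process. It assumes that each mapping line carries ``N<node>=<pages>`` tokens, as described in proc(5). The helper name ``pages_per_node`` and the sample text are hypothetical and serve only as an illustration.

```python
import re
from collections import Counter

# Matches per-node tokens such as "N0=9". The token shape follows the
# numa_maps description in proc(5); treat it as an assumption of this sketch.
_NODE_TOKEN = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> tokens over all mappings in numa_maps text."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = _NODE_TOKEN.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


# Hypothetical sample; on a live system read /proc/<pid>/numa_maps instead.
SAMPLE = """\
00400000 default file=/bin/cat mapped=9 N0=9
7f0000000000 default anon=12 dirty=12 N0=4 N1=8
"""

if __name__ == "__main__":
    print(pages_per_node(SAMPLE))
```

Comparing the totals before and after a migration request gives a quick check of whether pages actually moved to the intended nodes.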
.. | .. |
---|
30 | 33 | Manual migration is useful if for example the scheduler has relocated |
---|
31 | 34 | a process to a processor on a distant node. A batch scheduler or an |
---|
32 | 35 | administrator may detect the situation and move the pages of the process |
---|
33 | | -nearer to the new processor. The kernel itself does only provide |
---|
| 36 | +nearer to the new processor. The kernel itself only provides |
---|
34 | 37 | manual page migration support. Automatic page migration may be implemented |
---|
35 | 38 | through user space processes that move pages. A special function call |
---|
36 | 39 | "move_pages" allows the moving of individual pages within a process. |
---|
37 | | -A NUMA profiler may f.e. obtain a log showing frequent off node |
---|
| 40 | +For example, a NUMA profiler may obtain a log showing frequent off-node |
---|
38 | 41 | accesses and may use the result to move pages to more advantageous |
---|
39 | 42 | locations. |
---|
40 | 43 | |
---|
41 | 44 | Larger installations usually partition the system using cpusets into |
---|
42 | 45 | sections of nodes. Paul Jackson has equipped cpusets with the ability to |
---|
43 | 46 | move pages when a task is moved to another cpuset (See |
---|
44 | | -Documentation/cgroup-v1/cpusets.txt). |
---|
45 | | -Cpusets allows the automation of process locality. If a task is moved to |
---|
| 47 | +:ref:`CPUSETS <cpusets>`). |
---|
| 48 | +Cpusets allow the automation of process locality. If a task is moved to |
---|
46 | 49 | a new cpuset then also all its pages are moved with it so that the |
---|
47 | 50 | performance of the process does not sink dramatically. Also the pages |
---|
48 | 51 | of processes in a cpuset are moved if the allowed memory nodes of a |
---|
.. | .. |
---|
67 | 70 | Lists of pages to be migrated are generated by scanning over |
---|
68 | 71 | pages and moving them into lists. This is done by |
---|
69 | 72 | calling isolate_lru_page(). |
---|
70 | | - Calling isolate_lru_page increases the references to the page |
---|
| 73 | + Calling isolate_lru_page() increases the references to the page |
---|
71 | 74 | so that it cannot vanish while the page migration occurs. |
---|
72 | | - It also prevents the swapper or other scans to encounter |
---|
| 75 | + It also prevents the swapper or other scans from encountering |
---|
73 | 76 | the page. |
---|
74 | 77 | |
---|
75 | 78 | 2. We need to have a function of type new_page_t that can be |
---|
.. | .. |
---|
91 | 94 | |
---|
92 | 95 | Steps: |
---|
93 | 96 | |
---|
94 | | -1. Lock the page to be migrated |
---|
| 97 | +1. Lock the page to be migrated. |
---|
95 | 98 | |
---|
96 | 99 | 2. Ensure that writeback is complete. |
---|
97 | 100 | |
---|
98 | 101 | 3. Lock the new page that we want to move to. It is locked so that accesses to |
---|
99 | | - this (not yet uptodate) page immediately lock while the move is in progress. |
---|
| 102 | + this (not yet up-to-date) page immediately block while the move is in progress. |
---|
100 | 103 | |
---|
101 | 104 | 4. All the page table references to the page are converted to migration |
---|
102 | 105 | entries. This decreases the mapcount of a page. If the resulting |
---|
103 | 106 | mapcount is not zero then we do not migrate the page. All user space |
---|
104 | | - processes that attempt to access the page will now wait on the page lock. |
---|
| 107 | + processes that attempt to access the page will now wait on the page lock |
---|
| 108 | + or wait for the migration page table entry to be removed. |
---|
105 | 109 | |
---|
106 | 110 | 5. The i_pages lock is taken. This will cause all processes trying |
---|
107 | 111 | to access the page via the mapping to block on the spinlock. |
---|
108 | 112 | |
---|
109 | | -6. The refcount of the page is examined and we back out if references remain |
---|
110 | | - otherwise we know that we are the only one referencing this page. |
---|
| 113 | +6. The refcount of the page is examined and we back out if references remain. |
---|
| 114 | + Otherwise, we know that we are the only one referencing this page. |
---|
111 | 115 | |
---|
112 | 116 | 7. The radix tree is checked and if it does not contain the pointer to this |
---|
113 | 117 | page then we back out because someone else modified the radix tree. |
---|
.. | .. |
---|
134 | 138 | |
---|
135 | 139 | 15. Queued up writeback on the new page is triggered. |
---|
136 | 140 | |
---|
137 | | -16. If migration entries were page then replace them with real ptes. Doing |
---|
138 | | - so will enable access for user space processes not already waiting for |
---|
139 | | - the page lock. |
---|
| 141 | +16. If migration entries were inserted into the page table, then replace them |
---|
| 142 | + with real ptes. Doing so will enable access for user space processes not |
---|
| 143 | + already waiting for the page lock. |
---|
140 | 144 | |
---|
141 | | -19. The page locks are dropped from the old and new page. |
---|
| 145 | +17. The page locks are dropped from the old and new page. |
---|
142 | 146 | Processes waiting on the page lock will redo their page faults |
---|
143 | 147 | and will reach the new page. |
---|
144 | 148 | |
---|
145 | | -20. The new page is moved to the LRU and can be scanned by the swapper |
---|
146 | | - etc again. |
---|
| 149 | +18. The new page is moved to the LRU and can be scanned by the swapper, |
---|
| 150 | + etc. again. |
---|
147 | 151 | |
---|
148 | 152 | Non-LRU page migration |
---|
149 | 153 | ====================== |
---|
150 | 154 | |
---|
151 | | -Although original migration aimed for reducing the latency of memory access |
---|
152 | | -for NUMA, compaction who want to create high-order page is also main customer. |
---|
| 155 | +Although migration originally aimed for reducing the latency of memory accesses |
---|
| 156 | +for NUMA, compaction also uses migration to create high-order pages. |
---|
153 | 157 | |
---|
154 | 158 | Current problem of the implementation is that it is designed to migrate only |
---|
155 | | -*LRU* pages. However, there are potential non-lru pages which can be migrated |
---|
| 159 | +*LRU* pages. However, there are potential non-LRU pages which can be migrated |
---|
156 | 160 | in drivers, for example, zsmalloc, virtio-balloon pages. |
---|
157 | 161 | |
---|
158 | 162 | For virtio-balloon pages, some parts of migration code path have been hooked |
---|
159 | 163 | up and added virtio-balloon specific functions to intercept migration logics. |
---|
160 | 164 | It's too specific to a driver so other drivers who want to make their pages |
---|
161 | | -movable would have to add own specific hooks in migration path. |
---|
| 165 | +movable would have to add their own specific hooks in the migration path. |
---|
162 | 166 | |
---|
163 | | -To overclome the problem, VM supports non-LRU page migration which provides |
---|
| 167 | +To overcome the problem, VM supports non-LRU page migration which provides |
---|
164 | 168 | generic functions for non-LRU movable pages without driver specific hooks |
---|
165 | | -migration path. |
---|
| 169 | +in the migration path. |
---|
166 | 170 | |
---|
167 | | -If a driver want to make own pages movable, it should define three functions |
---|
| 171 | +If a driver wants to make its pages movable, it should define three functions |
---|
168 | 172 | which are function pointers of struct address_space_operations. |
---|
169 | 173 | |
---|
170 | 174 | 1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);`` |
---|
171 | 175 | |
---|
172 | | - What VM expects on isolate_page function of driver is to return *true* |
---|
173 | | - if driver isolates page successfully. On returing true, VM marks the page |
---|
| 176 | + What VM expects from isolate_page() function of driver is to return *true* |
---|
| 177 | + if driver isolates the page successfully. On returning true, VM marks the page |
---|
174 | 178 | as PG_isolated so concurrent isolation in several CPUs skip the page |
---|
175 | 179 | for isolation. If a driver cannot isolate the page, it should return *false*. |
---|
176 | 180 | |
---|
177 | 181 | Once page is successfully isolated, VM uses page.lru fields so driver |
---|
178 | | - shouldn't expect to preserve values in that fields. |
---|
| 182 | + shouldn't expect to preserve values in those fields. |
---|
179 | 183 | |
---|
180 | 184 | 2. ``int (*migratepage) (struct address_space *mapping,`` |
---|
181 | 185 | | ``struct page *newpage, struct page *oldpage, enum migrate_mode);`` |
---|
182 | 186 | |
---|
183 | | - After isolation, VM calls migratepage of driver with isolated page. |
---|
184 | | - The function of migratepage is to move content of the old page to new page |
---|
| 187 | + After isolation, VM calls migratepage() of driver with the isolated page. |
---|
| 188 | + The function of migratepage() is to move the contents of the old page to the |
---|
| 189 | + new page |
---|
185 | 190 | and set up fields of struct page newpage. Keep in mind that you should |
---|
186 | 191 | indicate to the VM the oldpage is no longer movable via __ClearPageMovable() |
---|
187 | | - under page_lock if you migrated the oldpage successfully and returns |
---|
| 192 | + under page_lock if you migrated the oldpage successfully and returned |
---|
188 | 193 | MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver |
---|
189 | 194 | can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time |
---|
190 | | - because VM interprets -EAGAIN as "temporal migration failure". On returning |
---|
191 | | - any error except -EAGAIN, VM will give up the page migration without retrying |
---|
192 | | - in this time. |
---|
| 195 | + because VM interprets -EAGAIN as "temporary migration failure". On returning |
---|
| 196 | + any error except -EAGAIN, VM will give up the page migration without |
---|
| 197 | + retrying. |
---|
193 | 198 | |
---|
194 | | - Driver shouldn't touch page.lru field VM using in the functions. |
---|
| 199 | + Driver shouldn't touch the page.lru field while in the migratepage() function. |
---|
195 | 200 | |
---|
196 | 201 | 3. ``void (*putback_page)(struct page *);`` |
---|
197 | 202 | |
---|
198 | | - If migration fails on isolated page, VM should return the isolated page |
---|
199 | | - to the driver so VM calls driver's putback_page with migration failed page. |
---|
200 | | - In this function, driver should put the isolated page back to the own data |
---|
| 203 | + If migration fails on the isolated page, VM should return the isolated page |
---|
| 204 | + to the driver so VM calls the driver's putback_page() with the isolated page. |
---|
| 205 | + In this function, the driver should put the isolated page back into its own data |
---|
201 | 206 | structure. |
---|
202 | 207 | |
---|
203 | | -4. non-lru movable page flags |
---|
| 208 | +4. non-LRU movable page flags |
---|
204 | 209 | |
---|
205 | | - There are two page flags for supporting non-lru movable page. |
---|
| 210 | + There are two page flags for supporting non-LRU movable page. |
---|
206 | 211 | |
---|
207 | 212 | * PG_movable |
---|
208 | 213 | |
---|
209 | | - Driver should use the below function to make page movable under page_lock:: |
---|
| 214 | + Driver should use the function below to make page movable under page_lock:: |
---|
210 | 215 | |
---|
211 | 216 | void __SetPageMovable(struct page *page, struct address_space *mapping) |
---|
212 | 217 | |
---|
213 | 218 | It needs argument of address_space for registering migration |
---|
214 | 219 | family functions which will be called by VM. Exactly speaking, |
---|
215 | | - PG_movable is not a real flag of struct page. Rather than, VM |
---|
216 | | - reuses page->mapping's lower bits to represent it. |
---|
| 220 | + PG_movable is not a real flag of struct page. Rather, VM |
---|
| 221 | + reuses the page->mapping's lower bits to represent it:: |
---|
217 | 222 | |
---|
218 | | -:: |
---|
219 | 223 | #define PAGE_MAPPING_MOVABLE 0x2 |
---|
220 | 224 | page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; |
---|
221 | 225 | |
---|
222 | 226 | so driver shouldn't access page->mapping directly. Instead, driver should |
---|
223 | | - use page_mapping which mask off the low two bits of page->mapping under |
---|
224 | | - page lock so it can get right struct address_space. |
---|
| 227 | + use page_mapping() which masks off the low two bits of page->mapping under |
---|
| 228 | + page lock so it can get the right struct address_space. |
---|
225 | 229 | |
---|
226 | | - For testing of non-lru movable page, VM supports __PageMovable function. |
---|
227 | | - However, it doesn't guarantee to identify non-lru movable page because |
---|
228 | | - page->mapping field is unified with other variables in struct page. |
---|
229 | | - As well, if driver releases the page after isolation by VM, page->mapping |
---|
230 | | - doesn't have stable value although it has PAGE_MAPPING_MOVABLE |
---|
231 | | - (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether |
---|
232 | | - page is LRU or non-lru movable once the page has been isolated. Because |
---|
233 | | - LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also |
---|
234 | | - good for just peeking to test non-lru movable pages before more expensive |
---|
235 | | - checking with lock_page in pfn scanning to select victim. |
---|
| 230 | + For testing of non-LRU movable pages, VM supports __PageMovable() function. |
---|
| 231 | + However, it doesn't guarantee to identify non-LRU movable pages because |
---|
| 232 | + the page->mapping field is unified with other variables in struct page. |
---|
| 233 | + If the driver releases the page after isolation by VM, page->mapping |
---|
| 234 | + doesn't have a stable value although it has PAGE_MAPPING_MOVABLE set |
---|
| 235 | + (look at __ClearPageMovable()). But __PageMovable() is a cheap way to tell whether |
---|
| 236 | + a page is LRU or non-LRU movable once the page has been isolated, because LRU |
---|
| 237 | + pages can never have PAGE_MAPPING_MOVABLE set in page->mapping. It is also |
---|
| 238 | + good for just peeking to test non-LRU movable pages before more expensive |
---|
| 239 | + checking with lock_page() in pfn scanning to select a victim. |
---|
236 | 240 | |
---|
237 | | - For guaranteeing non-lru movable page, VM provides PageMovable function. |
---|
238 | | - Unlike __PageMovable, PageMovable functions validates page->mapping and |
---|
239 | | - mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden |
---|
240 | | - destroying of page->mapping. |
---|
| 241 | + For guaranteeing non-LRU movable page, VM provides PageMovable() function. |
---|
| 242 | + Unlike __PageMovable(), PageMovable() validates page->mapping and |
---|
| 243 | + mapping->a_ops->isolate_page under lock_page(). The lock_page() prevents |
---|
| 244 | + sudden destroying of page->mapping. |
---|
241 | 245 | |
---|
242 | | - Driver using __SetPageMovable should clear the flag via __ClearMovablePage |
---|
243 | | - under page_lock before the releasing the page. |
---|
| 246 | + Drivers using __SetPageMovable() should clear the flag via |
---|
| 247 | + __ClearPageMovable() under lock_page() before releasing the page. |
---|
244 | 248 | |
---|
245 | 249 | * PG_isolated |
---|
246 | 250 | |
---|
247 | 251 | To prevent concurrent isolation among several CPUs, VM marks isolated page |
---|
248 | | - as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru |
---|
249 | | - movable page, it can skip it. Driver doesn't need to manipulate the flag |
---|
250 | | - because VM will set/clear it automatically. Keep in mind that if driver |
---|
251 | | - sees PG_isolated page, it means the page have been isolated by VM so it |
---|
252 | | - shouldn't touch page.lru field. |
---|
253 | | - PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag |
---|
254 | | - for own purpose. |
---|
| 252 | + as PG_isolated under lock_page(). So if a CPU encounters PG_isolated |
---|
| 253 | + non-LRU movable page, it can skip it. Driver doesn't need to manipulate the |
---|
| 254 | + flag because VM will set/clear it automatically. Keep in mind that if the |
---|
| 255 | + driver sees a PG_isolated page, it means the page has been isolated by the |
---|
| 256 | + VM so it shouldn't touch the page.lru field. |
---|
| 257 | + The PG_isolated flag is aliased with the PG_reclaim flag so drivers |
---|
| 258 | + shouldn't use PG_isolated for their own purposes. |
---|
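The way PG_movable is folded into the low bits of page->mapping can be modelled in userspace. The following Python sketch uses plain integers in place of kernel pointers. ``PAGE_MAPPING_MOVABLE`` has the value quoted above, while ``LOW_BITS_MASK`` and the three helper names are assumptions made for this illustration and are not kernel APIs.

```python
# Integer model of page->mapping; real kernel code operates on C pointers.
PAGE_MAPPING_MOVABLE = 0x2   # value quoted in the text above
LOW_BITS_MASK = 0x3          # assumption: "the low two bits" of page->mapping


def set_movable(mapping_addr):
    """Model of tagging an aligned address_space address as movable."""
    if (mapping_addr & LOW_BITS_MASK) != 0:
        raise ValueError("address must have its low two bits clear")
    return mapping_addr | PAGE_MAPPING_MOVABLE


def looks_movable(mapping_value):
    """Cheap check in the spirit of __PageMovable(): inspect only the low bits."""
    return (mapping_value & LOW_BITS_MASK) == PAGE_MAPPING_MOVABLE


def real_mapping(mapping_value):
    """Model of masking off the low two bits to recover the address."""
    return mapping_value & ~LOW_BITS_MASK


if __name__ == "__main__":
    tagged = set_movable(0x1000)
    print(hex(tagged), looks_movable(tagged), hex(real_mapping(tagged)))
```

The model shows why a driver that dereferences the tagged value directly would end up two bytes past the real struct address_space, and why the accessor that masks the bits has to be used instead.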
| 259 | + |
---|
| 260 | +Monitoring Migration |
---|
| 261 | +==================== |
---|
| 262 | + |
---|
| 263 | +The following events (counters) can be used to monitor page migration. |
---|
| 264 | + |
---|
| 265 | +1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a |
---|
| 266 | + page was migrated. If the page was a non-THP page, then this counter is |
---|
| 267 | + increased by one. If the page was a THP, then this counter is increased by |
---|
| 268 | + the number of THP subpages. For example, migration of a single 2MB THP that |
---|
| 269 | + has 4KB-size base pages (subpages) will cause this counter to increase by |
---|
| 270 | + 512. |
---|
| 271 | + |
---|
| 272 | +2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for |
---|
| 273 | + PGMIGRATE_SUCCESS, above: this will be increased by the number of subpages, |
---|
| 274 | + if it was a THP. |
---|
| 275 | + |
---|
| 276 | +3. THP_MIGRATION_SUCCESS: A THP was migrated without being split. |
---|
| 277 | + |
---|
| 278 | +4. THP_MIGRATION_FAIL: A THP could neither be migrated nor split. |
---|
| 279 | + |
---|
| 280 | +5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had |
---|
| 281 | + to be split. After splitting, a migration retry was used for its subpages. |
---|
| 282 | + |
---|
| 283 | +THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or |
---|
| 284 | +PGMIGRATE_FAIL events. For example, a THP migration failure will cause both |
---|
| 285 | +THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase. |
---|
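One possible way to watch these events from user space is to read them from ``/proc/vmstat``. The Python sketch below assumes the counters are exported there under lowercase names such as ``pgmigrate_success``, which should be verified on the running kernel. The helper names and the sample text are hypothetical. It also reproduces the subpage arithmetic from the PGMIGRATE_SUCCESS description.

```python
# Assumed lowercase /proc/vmstat names for the events listed above.
MIGRATION_COUNTERS = (
    "pgmigrate_success",
    "pgmigrate_fail",
    "thp_migration_success",
    "thp_migration_fail",
    "thp_migration_split",
)


def migration_counters(vmstat_text):
    """Extract the migration counters from "name value" lines."""
    found = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in MIGRATION_COUNTERS:
            found[parts[0]] = int(parts[1])
    return found


def thp_subpages(thp_bytes=2 * 1024 * 1024, base_page_bytes=4 * 1024):
    """Number of base pages that one THP contributes to PGMIGRATE_* counters."""
    return thp_bytes // base_page_bytes


# Hypothetical sample; on Linux use open("/proc/vmstat").read() instead.
SAMPLE_VMSTAT = """\
nr_free_pages 12345
pgmigrate_success 1024
pgmigrate_fail 3
thp_migration_success 2
thp_migration_fail 0
thp_migration_split 1
"""

if __name__ == "__main__":
    print(migration_counters(SAMPLE_VMSTAT))
    print(thp_subpages())
```

Sampling the counters twice and subtracting gives the number of events in the interval. A 2MB THP with 4KB base pages accounts for 512 of the PGMIGRATE_SUCCESS increments, matching the example above.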
255 | 286 | |
---|
256 | 287 | Christoph Lameter, May 8, 2006. |
---|
257 | 288 | Minchan Kim, Mar 28, 2016. |
---|