| .. | .. |
|---|
| 1 | | -.. hmm: |
|---|
| 1 | +.. _hmm: |
|---|
| 2 | 2 | |
|---|
| 3 | 3 | ===================================== |
|---|
| 4 | 4 | Heterogeneous Memory Management (HMM) |
|---|
| .. | .. |
|---|
| 10 | 10 | this document). |
|---|
| 11 | 11 | |
|---|
| 12 | 12 | HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
|---|
| 13 | | -allowing a device to transparently access program address coherently with |
|---|
| 13 | +allowing a device to transparently access program addresses coherently with |
|---|
| 14 | 14 | the CPU, meaning that any valid pointer on the CPU is also a valid pointer
|---|
| 15 | 15 | for the device. This is becoming mandatory to simplify the use of advanced |
|---|
| 16 | 16 | heterogeneous computing where GPU, DSP, or FPGA are used to perform various |
|---|
| .. | .. |
|---|
| 22 | 22 | section gives an overview of the HMM design. The fourth section explains how |
|---|
| 23 | 23 | CPU page-table mirroring works and the purpose of HMM in this context. The |
|---|
| 24 | 24 | fifth section deals with how device memory is represented inside the kernel. |
|---|
| 25 | | -Finally, the last section presents a new migration helper that allows lever- |
|---|
| 26 | | -aging the device DMA engine. |
|---|
| 25 | +Finally, the last section presents a new migration helper that allows |
|---|
| 26 | +leveraging the device DMA engine. |
|---|
| 27 | 27 | |
|---|
| 28 | 28 | .. contents:: :local: |
|---|
| 29 | 29 | |
|---|
| .. | .. |
|---|
| 39 | 39 | i.e., one in which any application memory region can be used by a device |
|---|
| 40 | 40 | transparently. |
|---|
| 41 | 41 | |
|---|
| 42 | | -Split address space happens because device can only access memory allocated |
|---|
| 43 | | -through device specific API. This implies that all memory objects in a program |
|---|
| 42 | +Split address space happens because devices can only access memory allocated |
|---|
| 43 | +through a device specific API. This implies that all memory objects in a program |
|---|
| 44 | 44 | are not equal from the device point of view, which complicates large programs
|---|
| 45 | 45 | that rely on a wide set of libraries. |
|---|
| 46 | 46 | |
|---|
| 47 | | -Concretely this means that code that wants to leverage devices like GPUs needs |
|---|
| 48 | | -to copy object between generically allocated memory (malloc, mmap private, mmap |
|---|
| 47 | +Concretely, this means that code that wants to leverage devices like GPUs needs |
|---|
| 48 | +to copy objects between generically allocated memory (malloc, mmap private, mmap |
|---|
| 49 | 49 | share) and memory allocated through the device driver API (this still ends up |
|---|
| 50 | 50 | with an mmap but of the device file). |
|---|
| 51 | 51 | |
|---|
| 52 | 52 | For flat data sets (array, grid, image, ...) this isn't too hard to achieve but |
|---|
| 53 | | -complex data sets (list, tree, ...) are hard to get right. Duplicating a |
|---|
| 53 | +for complex data sets (list, tree, ...) it's hard to get right. Duplicating a |
|---|
| 54 | 54 | complex data set requires re-mapping all the pointer relations between each of its
|---|
| 55 | | -elements. This is error prone and program gets harder to debug because of the |
|---|
| 55 | +elements. This is error prone and programs get harder to debug because of the |
|---|
| 56 | 56 | duplicate data set and addresses. |
|---|
| 57 | 57 | |
|---|
| 58 | 58 | Split address space also means that libraries cannot transparently use data |
|---|
| .. | .. |
|---|
| 77 | 77 | |
|---|
| 78 | 78 | I/O buses cripple shared address spaces due to a few limitations. Most I/O |
|---|
| 79 | 79 | buses only allow basic memory access from device to main memory; even cache |
|---|
| 80 | | -coherency is often optional. Access to device memory from CPU is even more |
|---|
| 80 | +coherency is often optional. Access to device memory from a CPU is even more |
|---|
| 81 | 81 | limited. More often than not, it is not cache coherent. |
|---|
| 82 | 82 | |
|---|
| 83 | 83 | If we only consider the PCIE bus, then a device can access main memory (often |
|---|
| 84 | 84 | through an IOMMU) and be cache coherent with the CPUs. However, it only allows |
|---|
| 85 | | -a limited set of atomic operations from device on main memory. This is worse |
|---|
| 85 | +a limited set of atomic operations from the device on main memory. This is worse |
|---|
| 86 | 86 | in the other direction: the CPU can only access a limited range of the device |
|---|
| 87 | 87 | memory and cannot perform atomic operations on it. Thus device memory cannot |
|---|
| 88 | 88 | be considered the same as regular memory from the kernel point of view. |
|---|
| .. | .. |
|---|
| 93 | 93 | order of magnitude higher latency than when the device accesses its own memory. |
|---|
| 94 | 94 | |
|---|
| 95 | 95 | Some platforms are developing new I/O buses or additions/modifications to PCIE |
|---|
| 96 | | -to address some of these limitations (OpenCAPI, CCIX). They mainly allow two- |
|---|
| 97 | | -way cache coherency between CPU and device and allow all atomic operations the |
|---|
| 96 | +to address some of these limitations (OpenCAPI, CCIX). They mainly allow |
|---|
| 97 | +two-way cache coherency between CPU and device and allow all atomic operations the |
|---|
| 98 | 98 | architecture supports. Sadly, not all platforms are following this trend and |
|---|
| 99 | 99 | some major architectures are left without hardware solutions to these problems. |
|---|
| 100 | 100 | |
|---|
| 101 | 101 | So for shared address space to make sense, not only must we allow devices to |
|---|
| 102 | 102 | access any memory but we must also permit any memory to be migrated to device |
|---|
| 103 | | -memory while device is using it (blocking CPU access while it happens). |
|---|
| 103 | +memory while the device is using it (blocking CPU access while it happens). |
|---|
| 104 | 104 | |
|---|
| 105 | 105 | |
|---|
| 106 | 106 | Shared address space and migration |
|---|
| 107 | 107 | ================================== |
|---|
| 108 | 108 | |
|---|
| 109 | | -HMM intends to provide two main features. First one is to share the address |
|---|
| 109 | +HMM intends to provide two main features. The first one is to share the address |
|---|
| 110 | 110 | space by duplicating the CPU page table in the device page table so the same |
|---|
| 111 | 111 | address points to the same physical memory for any valid main memory address in |
|---|
| 112 | 112 | the process address space. |
|---|
| .. | .. |
|---|
| 121 | 121 | hardware specific details to the device driver. |
|---|
| 122 | 122 | |
|---|
| 123 | 123 | The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that |
|---|
| 124 | | -allows allocating a struct page for each page of the device memory. Those pages |
|---|
| 124 | +allows allocating a struct page for each page of device memory. Those pages |
|---|
| 125 | 125 | are special because the CPU cannot map them. However, they allow migrating |
|---|
| 126 | 126 | main memory to device memory using existing migration mechanisms and everything |
|---|
| 127 | | -looks like a page is swapped out to disk from the CPU point of view. Using a |
|---|
| 128 | | -struct page gives the easiest and cleanest integration with existing mm mech- |
|---|
| 129 | | -anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE |
|---|
| 127 | +looks like a page that is swapped out to disk from the CPU point of view. Using a |
|---|
| 128 | +struct page gives the easiest and cleanest integration with existing mm |
|---|
| 129 | +mechanisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE |
|---|
| 130 | 130 | memory for the device memory and second to perform migration. Policy decisions |
|---|
| 131 | | -of what and when to migrate things is left to the device driver. |
|---|
| 131 | +of what and when to migrate are left to the device driver.
|---|
| 132 | 132 | |
|---|
| 133 | 133 | Note that any CPU access to a device page triggers a page fault and a migration |
|---|
| 134 | 134 | back to main memory. For example, when a page backing a given CPU address A is |
|---|
| .. | .. |
|---|
| 136 | 136 | address A triggers a page fault and initiates a migration back to main memory. |
|---|
| 137 | 137 | |
|---|
| 138 | 138 | With these two features, HMM not only allows a device to mirror process address |
|---|
| 139 | | -space and keeping both CPU and device page table synchronized, but also lever- |
|---|
| 140 | | -ages device memory by migrating the part of the data set that is actively being |
|---|
| 139 | +space and keeps both CPU and device page tables synchronized, but also |
|---|
| 140 | +leverages device memory by migrating the part of the data set that is actively being |
|---|
| 141 | 141 | used by the device. |
|---|
| 142 | 142 | |
|---|
| 143 | 143 | |
|---|
| .. | .. |
|---|
| 147 | 147 | Address space mirroring's main objective is to allow duplication of a range of |
|---|
| 148 | 148 | CPU page table into a device page table; HMM helps keep both synchronized. A |
|---|
| 149 | 149 | device driver that wants to mirror a process address space must start with the |
|---|
| 150 | | -registration of an hmm_mirror struct:: |
|---|
| 150 | +registration of a mmu_interval_notifier:: |
|---|
| 151 | 151 | |
|---|
| 152 | | - int hmm_mirror_register(struct hmm_mirror *mirror, |
|---|
| 153 | | - struct mm_struct *mm); |
|---|
| 154 | | - int hmm_mirror_register_locked(struct hmm_mirror *mirror, |
|---|
| 155 | | - struct mm_struct *mm); |
|---|
| 152 | + int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub, |
|---|
| 153 | + struct mm_struct *mm, unsigned long start, |
|---|
| 154 | + unsigned long length, |
|---|
| 155 | + const struct mmu_interval_notifier_ops *ops); |
|---|
| 156 | 156 | |
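
When the driver stops mirroring the address space it removes the subscription
again; for example, with the same ``interval_sub`` used in the usage pattern
below::

   mmu_interval_notifier_remove(&interval_sub);
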
|---|
| 157 | | - |
|---|
| 158 | | -The locked variant is to be used when the driver is already holding mmap_sem |
|---|
| 159 | | -of the mm in write mode. The mirror struct has a set of callbacks that are used |
|---|
| 160 | | -to propagate CPU page tables:: |
|---|
| 161 | | - |
|---|
| 162 | | - struct hmm_mirror_ops { |
|---|
| 163 | | - /* sync_cpu_device_pagetables() - synchronize page tables |
|---|
| 164 | | - * |
|---|
| 165 | | - * @mirror: pointer to struct hmm_mirror |
|---|
| 166 | | - * @update_type: type of update that occurred to the CPU page table |
|---|
| 167 | | - * @start: virtual start address of the range to update |
|---|
| 168 | | - * @end: virtual end address of the range to update |
|---|
| 169 | | - * |
|---|
| 170 | | - * This callback ultimately originates from mmu_notifiers when the CPU |
|---|
| 171 | | - * page table is updated. The device driver must update its page table |
|---|
| 172 | | - * in response to this callback. The update argument tells what action |
|---|
| 173 | | - * to perform. |
|---|
| 174 | | - * |
|---|
| 175 | | - * The device driver must not return from this callback until the device |
|---|
| 176 | | - * page tables are completely updated (TLBs flushed, etc); this is a |
|---|
| 177 | | - * synchronous call. |
|---|
| 178 | | - */ |
|---|
| 179 | | - void (*update)(struct hmm_mirror *mirror, |
|---|
| 180 | | - enum hmm_update action, |
|---|
| 181 | | - unsigned long start, |
|---|
| 182 | | - unsigned long end); |
|---|
| 183 | | - }; |
|---|
| 184 | | - |
|---|
| 185 | | -The device driver must perform the update action to the range (mark range |
|---|
| 186 | | -read only, or fully unmap, ...). The device must be done with the update before |
|---|
| 187 | | -the driver callback returns. |
|---|
| 157 | +During the ops->invalidate() callback the device driver must perform the
|---|
| 158 | +update action on the range (mark the range read only, fully unmap it, etc.). The
|---|
| 159 | +device must complete the update before the driver callback returns.
|---|
| 188 | 160 | |
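
For example, a driver might implement the ops->invalidate() callback and
register the notifier as follows. This is only a sketch: ``struct
driver_mirror`` and the device page table handling are hypothetical driver
constructs, and the mutex plays the role of the ``driver->update`` lock used
in the usage pattern below::

   /* Sketch, assuming <linux/mmu_notifier.h>; driver_* names are
    * illustrative placeholders, not part of the HMM API. */
   struct driver_mirror {
           struct mmu_interval_notifier notifier;
           struct mutex update;    /* protects the device page table */
   };

   static bool driver_invalidate(struct mmu_interval_notifier *interval_sub,
                                 const struct mmu_notifier_range *range,
                                 unsigned long cur_seq)
   {
           struct driver_mirror *mirror =
                   container_of(interval_sub, struct driver_mirror, notifier);

           if (mmu_notifier_range_blockable(range))
                   mutex_lock(&mirror->update);
           else if (!mutex_trylock(&mirror->update))
                   return false;

           /* Publish the new sequence count while holding the driver lock so
            * that a concurrent mmu_interval_read_retry() sees this
            * invalidation. */
           mmu_interval_set_seq(interval_sub, cur_seq);

           /* Unmap or write protect range->start..range->end in the device
            * page table and flush the device TLB (driver specific). */

           mutex_unlock(&mirror->update);
           return true;
   }

   static const struct mmu_interval_notifier_ops driver_mirror_ops = {
           .invalidate = driver_invalidate,
   };
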
|---|
| 189 | 161 | When the device driver wants to populate a range of virtual addresses, it can |
|---|
| 190 | | -use either:: |
|---|
| 162 | +use:: |
|---|
| 191 | 163 | |
|---|
| 192 | | - int hmm_vma_get_pfns(struct vm_area_struct *vma, |
|---|
| 193 | | - struct hmm_range *range, |
|---|
| 194 | | - unsigned long start, |
|---|
| 195 | | - unsigned long end, |
|---|
| 196 | | - hmm_pfn_t *pfns); |
|---|
| 197 | | - int hmm_vma_fault(struct vm_area_struct *vma, |
|---|
| 198 | | - struct hmm_range *range, |
|---|
| 199 | | - unsigned long start, |
|---|
| 200 | | - unsigned long end, |
|---|
| 201 | | - hmm_pfn_t *pfns, |
|---|
| 202 | | - bool write, |
|---|
| 203 | | - bool block); |
|---|
| 164 | + int hmm_range_fault(struct hmm_range *range); |
|---|
| 204 | 165 | |
|---|
| 205 | | -The first one (hmm_vma_get_pfns()) will only fetch present CPU page table |
|---|
| 206 | | -entries and will not trigger a page fault on missing or non-present entries. |
|---|
| 207 | | -The second one does trigger a page fault on missing or read-only entry if the |
|---|
| 208 | | -write parameter is true. Page faults use the generic mm page fault code path |
|---|
| 209 | | -just like a CPU page fault. |
|---|
| 166 | +It will trigger a page fault on missing or read-only entries if write access is |
|---|
| 167 | +requested (see below). Page faults use the generic mm page fault code path just |
|---|
| 168 | +like a CPU page fault. |
|---|
| 210 | 169 | |
|---|
| 211 | 170 | hmm_range_fault() copies CPU page table entries into its pfns array argument. Each
|---|
| 212 | 171 | entry in that array corresponds to an address in the virtual range. HMM |
|---|
| 213 | 172 | provides a set of flags to help the driver identify special CPU page table |
|---|
| 214 | 173 | entries. |
|---|
| 215 | 174 | |
|---|
| 216 | | -Locking with the update() callback is the most important aspect the driver must |
|---|
| 217 | | -respect in order to keep things properly synchronized. The usage pattern is:: |
|---|
| 175 | +Locking within the ops->invalidate() callback is the most important
|---|
| 176 | +aspect the driver must respect in order to keep things properly synchronized. |
|---|
| 177 | +The usage pattern is:: |
|---|
| 218 | 178 | |
|---|
| 219 | 179 | int driver_populate_range(...) |
|---|
| 220 | 180 | { |
|---|
| 221 | 181 | struct hmm_range range; |
|---|
| 222 | 182 | ... |
|---|
| 183 | + |
|---|
| 184 | + range.notifier = &interval_sub; |
|---|
| 185 | + range.start = ...; |
|---|
| 186 | + range.end = ...; |
|---|
| 187 | + range.hmm_pfns = ...; |
|---|
| 188 | + |
|---|
| 189 | + if (!mmget_not_zero(interval_sub.mm))
|---|
| 190 | + return -EFAULT; |
|---|
| 191 | + |
|---|
| 223 | 192 | again: |
|---|
| 224 | | - ret = hmm_vma_get_pfns(vma, &range, start, end, pfns); |
|---|
| 225 | | - if (ret) |
|---|
| 193 | + range.notifier_seq = mmu_interval_read_begin(&interval_sub); |
|---|
| 194 | + mmap_read_lock(mm); |
|---|
| 195 | + ret = hmm_range_fault(&range); |
|---|
| 196 | + if (ret) { |
|---|
| 197 | + mmap_read_unlock(mm); |
|---|
| 198 | + if (ret == -EBUSY) |
|---|
| 199 | + goto again; |
|---|
| 226 | 200 | return ret; |
|---|
| 201 | + } |
|---|
| 202 | + mmap_read_unlock(mm); |
|---|
| 203 | + |
|---|
| 227 | 204 | take_lock(driver->update); |
|---|
| 228 | | - if (!hmm_vma_range_done(vma, &range)) { |
|---|
| 205 | + if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
|---|
| 229 | 206 | release_lock(driver->update); |
|---|
| 230 | 207 | goto again; |
|---|
| 231 | 208 | } |
|---|
| 232 | 209 | |
|---|
| 233 | | - // Use pfns array content to update device page table |
|---|
| 210 | + /* Use pfns array content to update device page table, |
|---|
| 211 | + * under the update lock */ |
|---|
| 234 | 212 | |
|---|
| 235 | 213 | release_lock(driver->update); |
|---|
| 236 | 214 | return 0; |
|---|
| 237 | 215 | } |
|---|
| 238 | 216 | |
|---|
| 239 | 217 | The driver->update lock is the same lock that the driver takes inside its |
|---|
| 240 | | -update() callback. That lock must be held before hmm_vma_range_done() to avoid |
|---|
| 241 | | -any race with a concurrent CPU page table update. |
|---|
| 218 | +invalidate() callback. That lock must be held before calling |
|---|
| 219 | +mmu_interval_read_retry() to avoid any race with a concurrent CPU page table |
|---|
| 220 | +update. |
|---|
| 242 | 221 | |
|---|
| 243 | | -HMM implements all this on top of the mmu_notifier API because we wanted a |
|---|
| 244 | | -simpler API and also to be able to perform optimizations latter on like doing |
|---|
| 245 | | -concurrent device updates in multi-devices scenario. |
|---|
| 222 | +Leverage default_flags and pfn_flags_mask |
|---|
| 223 | +========================================= |
|---|
| 246 | 224 | |
|---|
| 247 | | -HMM also serves as an impedance mismatch between how CPU page table updates |
|---|
| 248 | | -are done (by CPU write to the page table and TLB flushes) and how devices |
|---|
| 249 | | -update their own page table. Device updates are a multi-step process. First, |
|---|
| 250 | | -appropriate commands are written to a buffer, then this buffer is scheduled for |
|---|
| 251 | | -execution on the device. It is only once the device has executed commands in |
|---|
| 252 | | -the buffer that the update is done. Creating and scheduling the update command |
|---|
| 253 | | -buffer can happen concurrently for multiple devices. Waiting for each device to |
|---|
| 254 | | -report commands as executed is serialized (there is no point in doing this |
|---|
| 255 | | -concurrently). |
|---|
| 225 | +The hmm_range struct has two fields, default_flags and pfn_flags_mask, that specify
|---|
| 226 | +fault or snapshot policy for the whole range instead of having to set them |
|---|
| 227 | +for each entry in the pfns array. |
|---|
| 228 | + |
|---|
| 229 | +For instance, if the device driver wants pages for a range with at least read
|---|
| 230 | +permission, it sets:: |
|---|
| 231 | + |
|---|
| 232 | + range->default_flags = HMM_PFN_REQ_FAULT; |
|---|
| 233 | + range->pfn_flags_mask = 0; |
|---|
| 234 | + |
|---|
| 235 | +and calls hmm_range_fault() as described above. This will fault in all pages
|---|
| 236 | +in the range with at least read permission. |
|---|
| 237 | + |
|---|
| 238 | +Now let's say the driver wants to do the same except for one page in the range for |
|---|
| 239 | +which it wants to have write permission. The driver then sets::
|---|
| 240 | + |
|---|
| 241 | + range->default_flags = HMM_PFN_REQ_FAULT; |
|---|
| 242 | + range->pfn_flags_mask = HMM_PFN_REQ_WRITE; |
|---|
| 243 | + range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;
|---|
| 244 | + |
|---|
| 245 | +With this, HMM will fault in all pages with at least read permission (i.e.,
|---|
| 246 | +valid), and for the address == range->start + (index_of_write << PAGE_SHIFT) it
|---|
| 247 | +will fault with write permission, i.e., if the CPU pte does not have write
|---|
| 248 | +permission set then HMM will call handle_mm_fault().
|---|
| 249 | + |
|---|
| 250 | +After hmm_range_fault() completes, the flag bits are set to the current state of
|---|
| 251 | +the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the page is
|---|
| 252 | +writable. |
|---|
| 256 | 253 | |
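
For example, once hmm_range_fault() has succeeded and
mmu_interval_read_retry() confirms the snapshot is still valid, a driver might
consume the array like this (a sketch; device_map_page() is a hypothetical
driver helper)::

   /* Sketch, assuming <linux/hmm.h>. */
   unsigned long npages = (range.end - range.start) >> PAGE_SHIFT;
   unsigned long i;

   for (i = 0; i < npages; i++) {
           unsigned long entry = range.hmm_pfns[i];

           if (!(entry & HMM_PFN_VALID))
                   continue;       /* nothing usable mapped at this address */
           device_map_page(range.start + (i << PAGE_SHIFT),
                           hmm_pfn_to_page(entry),
                           !!(entry & HMM_PFN_WRITE));
   }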
|---|
| 257 | 254 | |
|---|
| 258 | 255 | Represent and manage device memory from core kernel point of view |
|---|
| 259 | 256 | ================================================================= |
|---|
| 260 | 257 | |
|---|
| 261 | | -Several different designs were tried to support device memory. First one used |
|---|
| 262 | | -a device specific data structure to keep information about migrated memory and |
|---|
| 263 | | -HMM hooked itself in various places of mm code to handle any access to |
|---|
| 258 | +Several different designs were tried to support device memory. The first one |
|---|
| 259 | +used a device specific data structure to keep information about migrated memory |
|---|
| 260 | +and HMM hooked itself in various places of mm code to handle any access to |
|---|
| 264 | 261 | addresses that were backed by device memory. It turns out that this ended up |
|---|
| 265 | 262 | replicating most of the fields of struct page and also needed many kernel code |
|---|
| 266 | 263 | paths to be updated to understand this new kind of memory. |
|---|
| .. | .. |
|---|
| 271 | 268 | unaware of the difference. We only need to make sure that no one ever tries to |
|---|
| 272 | 269 | map those pages from the CPU side. |
|---|
| 273 | 270 | |
|---|
| 274 | | -HMM provides a set of helpers to register and hotplug device memory as a new |
|---|
| 275 | | -region needing a struct page. This is offered through a very simple API:: |
|---|
| 276 | | - |
|---|
| 277 | | - struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, |
|---|
| 278 | | - struct device *device, |
|---|
| 279 | | - unsigned long size); |
|---|
| 280 | | - void hmm_devmem_remove(struct hmm_devmem *devmem); |
|---|
| 281 | | - |
|---|
| 282 | | -The hmm_devmem_ops is where most of the important things are:: |
|---|
| 283 | | - |
|---|
| 284 | | - struct hmm_devmem_ops { |
|---|
| 285 | | - void (*free)(struct hmm_devmem *devmem, struct page *page); |
|---|
| 286 | | - int (*fault)(struct hmm_devmem *devmem, |
|---|
| 287 | | - struct vm_area_struct *vma, |
|---|
| 288 | | - unsigned long addr, |
|---|
| 289 | | - struct page *page, |
|---|
| 290 | | - unsigned flags, |
|---|
| 291 | | - pmd_t *pmdp); |
|---|
| 292 | | - }; |
|---|
| 293 | | - |
|---|
| 294 | | -The first callback (free()) happens when the last reference on a device page is |
|---|
| 295 | | -dropped. This means the device page is now free and no longer used by anyone. |
|---|
| 296 | | -The second callback happens whenever the CPU tries to access a device page |
|---|
| 297 | | -which it cannot do. This second callback must trigger a migration back to |
|---|
| 298 | | -system memory. |
|---|
| 299 | | - |
|---|
| 300 | | - |
|---|
| 301 | 271 | Migration to and from device memory |
|---|
| 302 | 272 | =================================== |
|---|
| 303 | 273 | |
|---|
| 304 | | -Because the CPU cannot access device memory, migration must use the device DMA |
|---|
| 305 | | -engine to perform copy from and to device memory. For this we need a new |
|---|
| 306 | | -migration helper:: |
|---|
| 274 | +Because the CPU cannot access device memory directly, the device driver must |
|---|
| 275 | +use hardware DMA or device specific load/store instructions to migrate data. |
|---|
| 276 | +The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() |
|---|
| 277 | +functions are designed to make drivers easier to write and to centralize common |
|---|
| 278 | +code across drivers. |
|---|
| 307 | 279 | |
|---|
| 308 | | - int migrate_vma(const struct migrate_vma_ops *ops, |
|---|
| 309 | | - struct vm_area_struct *vma, |
|---|
| 310 | | - unsigned long mentries, |
|---|
| 311 | | - unsigned long start, |
|---|
| 312 | | - unsigned long end, |
|---|
| 313 | | - unsigned long *src, |
|---|
| 314 | | - unsigned long *dst, |
|---|
| 315 | | - void *private); |
|---|
| 280 | +Before migrating pages to device private memory, special device private |
|---|
| 281 | +``struct page`` entries need to be created. These will be used as special "swap"
|---|
| 282 | +page table entries so that a CPU process will fault if it tries to access |
|---|
| 283 | +a page that has been migrated to device private memory. |
|---|
| 316 | 284 | |
|---|
| 317 | | -Unlike other migration functions it works on a range of virtual address, there |
|---|
| 318 | | -are two reasons for that. First, device DMA copy has a high setup overhead cost |
|---|
| 319 | | -and thus batching multiple pages is needed as otherwise the migration overhead |
|---|
| 320 | | -makes the whole exercise pointless. The second reason is because the |
|---|
| 321 | | -migration might be for a range of addresses the device is actively accessing. |
|---|
| 285 | +These can be allocated and freed with:: |
|---|
| 322 | 286 | |
|---|
| 323 | | -The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy()) |
|---|
| 324 | | -controls destination memory allocation and copy operation. Second one is there |
|---|
| 325 | | -to allow the device driver to perform cleanup operations after migration:: |
|---|
| 287 | + struct resource *res; |
|---|
| 288 | + struct dev_pagemap pagemap; |
|---|
| 326 | 289 | |
|---|
| 327 | | - struct migrate_vma_ops { |
|---|
| 328 | | - void (*alloc_and_copy)(struct vm_area_struct *vma, |
|---|
| 329 | | - const unsigned long *src, |
|---|
| 330 | | - unsigned long *dst, |
|---|
| 331 | | - unsigned long start, |
|---|
| 332 | | - unsigned long end, |
|---|
| 333 | | - void *private); |
|---|
| 334 | | - void (*finalize_and_map)(struct vm_area_struct *vma, |
|---|
| 335 | | - const unsigned long *src, |
|---|
| 336 | | - const unsigned long *dst, |
|---|
| 337 | | - unsigned long start, |
|---|
| 338 | | - unsigned long end, |
|---|
| 339 | | - void *private); |
|---|
| 340 | | - }; |
|---|
| 290 | + res = request_free_mem_region(&iomem_resource, /* number of bytes */, |
|---|
| 291 | + "name of driver resource"); |
|---|
| 292 | + pagemap.type = MEMORY_DEVICE_PRIVATE; |
|---|
| 293 | + pagemap.range.start = res->start; |
|---|
| 294 | + pagemap.range.end = res->end; |
|---|
| 295 | + pagemap.nr_range = 1; |
|---|
| 296 | + pagemap.ops = &device_devmem_ops; |
|---|
| 297 | + memremap_pages(&pagemap, numa_node_id()); |
|---|
| 341 | 298 | |
|---|
| 342 | | -It is important to stress that these migration helpers allow for holes in the |
|---|
| 343 | | -virtual address range. Some pages in the range might not be migrated for all |
|---|
| 344 | | -the usual reasons (page is pinned, page is locked, ...). This helper does not |
|---|
| 345 | | -fail but just skips over those pages. |
|---|
| 299 | + memunmap_pages(&pagemap); |
|---|
| 300 | + release_mem_region(pagemap.range.start, range_len(&pagemap.range)); |
|---|
| 346 | 301 | |
|---|
| 347 | | -The alloc_and_copy() might decide to not migrate all pages in the |
|---|
| 348 | | -range (for reasons under the callback control). For those, the callback just |
|---|
| 349 | | -has to leave the corresponding dst entry empty. |
|---|
| 302 | +There are also devm_request_free_mem_region(), devm_memremap_pages(), |
|---|
| 303 | +devm_memunmap_pages(), and devm_release_mem_region() when the resources can |
|---|
| 304 | +be tied to a ``struct device``. |
|---|
| 350 | 305 | |
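
The ``device_devmem_ops`` used above is a ``struct dev_pagemap_ops``. A
minimal sketch, with the driver specific callback bodies left as comments::

   /* Sketch, assuming <linux/memremap.h>. */
   static void device_page_free(struct page *page)
   {
           /* Return the backing device memory to the driver's allocator. */
   }

   static vm_fault_t device_migrate_to_ram(struct vm_fault *vmf)
   {
           /* Migrate the device private page at vmf->address back to system
            * memory, typically with the migrate_vma_*() helpers described
            * below; return VM_FAULT_SIGBUS on failure. */
           return 0;
   }

   static const struct dev_pagemap_ops device_devmem_ops = {
           .page_free      = device_page_free,
           .migrate_to_ram = device_migrate_to_ram,
   };
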
|---|
| 351 | | -Finally, the migration of the struct page might fail (for file backed page) for |
|---|
| 352 | | -various reasons (failure to freeze reference, or update page cache, ...). If |
|---|
| 353 | | -that happens, then the finalize_and_map() can catch any pages that were not |
|---|
| 354 | | -migrated. Note those pages were still copied to a new page and thus we wasted |
|---|
| 355 | | -bandwidth but this is considered as a rare event and a price that we are |
|---|
| 356 | | -willing to pay to keep all the code simpler. |
|---|
| 306 | +The overall migration steps are similar to migrating NUMA pages within system |
|---|
| 307 | +memory (see :ref:`Page migration <page_migration>`) but the steps are split |
|---|
| 308 | +between device driver specific code and shared common code (a condensed sketch follows the numbered steps):
|---|
| 357 | 309 | |
|---|
| 310 | +1. ``mmap_read_lock()`` |
|---|
| 311 | + |
|---|
| 312 | + The device driver has to pass a ``struct vm_area_struct`` to |
|---|
| 313 | + migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to |
|---|
| 314 | + be held for the duration of the migration. |
|---|
| 315 | + |
|---|
| 316 | +2. ``migrate_vma_setup(struct migrate_vma *args)`` |
|---|
| 317 | + |
|---|
| 318 | + The device driver initializes the ``struct migrate_vma`` fields and passes |
|---|
| 319 | + the pointer to migrate_vma_setup(). The ``args->flags`` field is used to |
|---|
| 320 | + filter which source pages should be migrated. For example, setting |
|---|
| 321 | + ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and |
|---|
| 322 | + ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in |
|---|
| 323 | + device private memory. If the latter flag is set, the ``args->pgmap_owner`` |
|---|
| 324 | + field is used to identify device private pages owned by the driver. This |
|---|
| 325 | + avoids trying to migrate device private pages residing in other devices. |
|---|
| 326 | + Currently only anonymous private VMA ranges can be migrated to or from |
|---|
| 327 | + system memory and device private memory. |
|---|
| 328 | + |
|---|
| 329 | + One of the first steps migrate_vma_setup() does is to invalidate other
|---|
| 330 | + devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
|---|
| 331 | + ``mmu_notifier_invalidate_range_end()`` calls around the page table |
|---|
| 332 | + walks to fill in the ``args->src`` array with PFNs to be migrated. |
|---|
| 333 | + The ``invalidate_range_start()`` callback is passed a |
|---|
| 334 | + ``struct mmu_notifier_range`` with the ``event`` field set to |
|---|
| 335 | + ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to |
|---|
| 336 | + the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is |
|---|
| 337 | + allows the device driver to skip the invalidation callback and only |
|---|
| 338 | + invalidate device private MMU mappings that are actually migrating. |
|---|
| 339 | + This is explained more in the next section. |
|---|
| 340 | + |
|---|
| 341 | + While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()`` |
|---|
| 342 | + entry results in a valid "zero" PFN stored in the ``args->src`` array. |
|---|
| 343 | + This lets the driver allocate device private memory and clear it instead |
|---|
| 344 | + of copying a page of zeros. Valid PTE entries to system memory or |
|---|
| 345 | + device private struct pages will be locked with ``lock_page()``, isolated |
|---|
| 346 | + from the LRU (if system memory since device private pages are not on |
|---|
| 347 | + the LRU), unmapped from the process, and a special migration PTE is |
|---|
| 348 | + inserted in place of the original PTE. |
|---|
| 349 | + migrate_vma_setup() also clears the ``args->dst`` array. |
|---|
| 350 | + |
|---|
| 351 | +3. The device driver allocates destination pages and copies source pages to |
|---|
| 352 | + destination pages. |
|---|
| 353 | + |
|---|
| 354 | + The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE`` |
|---|
| 355 | + bit is set and skips entries that are not migrating. The device driver |
|---|
| 356 | + can also choose to skip migrating a page by not filling in the ``dst`` |
|---|
| 357 | + array for that page. |
|---|
| 358 | + |
|---|
| 359 | + The driver then allocates either a device private struct page or a |
|---|
| 360 | + system memory page, locks the page with ``lock_page()``, and fills in the |
|---|
| 361 | + ``dst`` array entry with:: |
|---|
| 362 | + |
|---|
| 363 | + dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; |
|---|
| 364 | + |
|---|
| 365 | + Now that the driver knows that this page is being migrated, it can |
|---|
| 366 | + invalidate device private MMU mappings and copy device private memory |
|---|
| 367 | + to system memory or another device private page. The core Linux kernel |
|---|
| 368 | + handles CPU page table invalidations so the device driver only has to |
|---|
| 369 | + invalidate its own MMU mappings. |
|---|
| 370 | + |
|---|
| 371 | + The driver can use ``migrate_pfn_to_page(src[i])`` to get the |
|---|
| 372 | + ``struct page`` of the source and either copy the source page to the |
|---|
| 373 | + destination or clear the destination device private memory if the pointer |
|---|
| 374 | + is ``NULL`` meaning the source page was not populated in system memory. |
|---|
| 375 | + |
|---|
| 376 | +4. ``migrate_vma_pages()`` |
|---|
| 377 | + |
|---|
| 378 | + This step is where the migration is actually "committed". |
|---|
| 379 | + |
|---|
| 380 | + If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this |
|---|
| 381 | + is where the newly allocated page is inserted into the CPU's page table. |
|---|
| 382 | + This can fail if a CPU thread faults on the same page. However, the page |
|---|
| 383 | + table is locked and only one of the new pages will be inserted. |
|---|
| 384 | + The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared |
|---|
| 385 | + if it loses the race. |
|---|
| 386 | + |
|---|
| 387 | + If the source page was locked, isolated, etc., the source ``struct page``
|---|
| 388 | + information is now copied to the destination ``struct page``, finalizing the
|---|
| 389 | + migration on the CPU side. |
|---|
| 390 | + |
|---|
| 391 | +5. Device driver updates device MMU page tables for pages still migrating, |
|---|
| 392 | + rolling back pages not migrating. |
|---|
| 393 | + |
|---|
| 394 | + If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device |
|---|
| 395 | + driver can update the device MMU and set the write enable bit if the |
|---|
| 396 | + ``MIGRATE_PFN_WRITE`` bit is set. |
|---|
| 397 | + |
|---|
| 398 | +6. ``migrate_vma_finalize()`` |
|---|
| 399 | + |
|---|
| 400 | + This step replaces the special migration page table entry with the new |
|---|
| 401 | + page's page table entry and releases the reference to the source and |
|---|
| 402 | + destination ``struct page``. |
|---|
| 403 | + |
|---|
| 404 | +7. ``mmap_read_unlock()`` |
|---|
| 405 | + |
|---|
| 406 | + The lock can now be released. |
|---|
| 358 | 407 | |
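
Putting the steps together, a condensed driver-side sketch of migrating a
small range of system memory into device private memory might look like the
following. The driver_* helpers and the fixed 64-page window are illustrative
assumptions, not part of the API::

   /* Sketch, assuming <linux/migrate.h> and <linux/mm.h>; the range is
    * assumed to span at most 64 pages. */
   static int driver_migrate_to_device(struct vm_area_struct *vma,
                                       unsigned long start, unsigned long end)
   {
           unsigned long src_pfns[64] = { 0 };
           unsigned long dst_pfns[64] = { 0 };
           struct migrate_vma args = {
                   .vma    = vma,
                   .start  = start,
                   .end    = end,
                   .src    = src_pfns,
                   .dst    = dst_pfns,
                   .flags  = MIGRATE_VMA_SELECT_SYSTEM,
           };
           unsigned long i;
           int ret;

           mmap_read_lock(vma->vm_mm);                     /* step 1 */
           ret = migrate_vma_setup(&args);                 /* step 2 */
           if (ret)
                   goto out_unlock;

           for (i = 0; i < args.npages; i++) {             /* step 3 */
                   struct page *spage, *dpage;

                   if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                           continue;
                   dpage = driver_alloc_device_page();     /* hypothetical */
                   if (!dpage)
                           continue;
                   lock_page(dpage);
                   spage = migrate_pfn_to_page(args.src[i]);
                   if (spage)
                           driver_copy_to_device(dpage, spage);  /* DMA copy */
                   else
                           driver_clear_device_page(dpage);      /* zero fill */
                   args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
                                 MIGRATE_PFN_LOCKED;
           }

           migrate_vma_pages(&args);                       /* step 4 */
           driver_update_device_ptes(&args);               /* step 5, hypothetical */
           migrate_vma_finalize(&args);                    /* step 6 */
   out_unlock:
           mmap_read_unlock(vma->vm_mm);                   /* step 7 */
           return ret;
   }
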
|---|
| 359 | 408 | Memory cgroup (memcg) and rss accounting |
|---|
| 360 | 409 | ======================================== |
|---|
| 361 | 410 | |
|---|
| 362 | | -For now device memory is accounted as any regular page in rss counters (either |
|---|
| 411 | +For now, device memory is accounted as any regular page in rss counters (either |
|---|
| 363 | 412 | anonymous if device page is used for anonymous, file if device page is used for |
|---|
| 364 | | -file backed page or shmem if device page is used for shared memory). This is a |
|---|
| 413 | +file backed page, or shmem if device page is used for shared memory). This is a |
|---|
| 365 | 414 | deliberate choice to keep existing applications, that might start using device |
|---|
| 366 | 415 | memory without knowing about it, running unimpacted. |
|---|
| 367 | 416 | |
|---|
| .. | .. |
|---|
| 381 | 430 | resource control. |
|---|
| 382 | 431 | |
|---|
| 383 | 432 | |
|---|
| 384 | | -Note that device memory can never be pinned by device driver nor through GUP |
|---|
| 433 | +Note that device memory can never be pinned by a device driver nor through GUP |
|---|
| 385 | 434 | and thus such memory is always freed upon process exit, or when the last
|---|
| 386 | 435 | reference is dropped in the case of shared memory or file backed memory.
|---|