hc
2023-12-11 d2ccde1c8e90d38cee87a1b0309ad2827f3fd30d
kernel/Documentation/admin-guide/mm/userfaultfd.rst
....@@ -12,114 +12,192 @@
1212 memory page faults, something otherwise only the kernel code could do.
1313
1414 For example userfaults allows a proper and more optimal implementation
15
-of the PROT_NONE+SIGSEGV trick.
15
+of the ``PROT_NONE+SIGSEGV`` trick.
1616
1717 Design
1818 ======
1919
20
-Userfaults are delivered and resolved through the userfaultfd syscall.
20
+Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
2121
22
-The userfaultfd (aside from registering and unregistering virtual
22
+The ``userfaultfd`` (aside from registering and unregistering virtual
2323 memory ranges) provides two primary functionalities:
2424
25
-1) read/POLLIN protocol to notify a userland thread of the faults
25
+1) ``read/POLLIN`` protocol to notify a userland thread of the faults
2626 happening
2727
28
-2) various UFFDIO_* ioctls that can manage the virtual memory regions
29
- registered in the userfaultfd that allows userland to efficiently
28
+2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
29
+ registered in the ``userfaultfd`` that allows userland to efficiently
3030 resolve the userfaults it receives via 1) or to manage the virtual
3131 memory in the background
3232
3333 The real advantage of userfaults if compared to regular virtual memory
3434 management of mremap/mprotect is that the userfaults in all their
3535 operations never involve heavyweight structures like vmas (in fact the
36
-userfaultfd runtime load never takes the mmap_sem for writing).
36
+``userfaultfd`` runtime load never takes the mmap_lock for writing).
3737
3838 Vmas are not suitable for page- (or hugepage) granular fault tracking
3939 when dealing with virtual address spaces that could span
4040 Terabytes. Too many vmas would be needed for that.
4141
42
-The userfaultfd once opened by invoking the syscall, can also be
42
+The ``userfaultfd`` once opened by invoking the syscall, can also be
4343 passed using unix domain sockets to a manager process, so the same
4444 manager process could handle the userfaults of a multitude of
4545 different processes without them being aware about what is going on
46
-(well of course unless they later try to use the userfaultfd
46
+(well of course unless they later try to use the ``userfaultfd``
4747 themselves on the same region the manager is already tracking, which
48
-is a corner case that would currently return -EBUSY).
48
+is a corner case that would currently return ``-EBUSY``).
4949
5050 API
5151 ===
5252
53
-When first opened the userfaultfd must be enabled invoking the
54
-UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
55
-a later API version) which will specify the read/POLLIN protocol
56
-userland intends to speak on the UFFD and the uffdio_api.features
57
-userland requires. The UFFDIO_API ioctl if successful (i.e. if the
58
-requested uffdio_api.api is spoken also by the running kernel and the
53
+When first opened the ``userfaultfd`` must be enabled invoking the
54
+``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
55
+a later API version) which will specify the ``read/POLLIN`` protocol
56
+userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
57
+userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the
58
+requested ``uffdio_api.api`` is spoken also by the running kernel and the
5959 requested features are going to be enabled) will return into
60
-uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
60
+``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of
6161 respectively all the available features of the read(2) protocol and
6262 the generic ioctl available.
6363
64
-The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
65
-defines what memory types are supported by the userfaultfd and what
66
-events, except page fault notifications, may be generated.
64
+The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
65
+defines what memory types are supported by the ``userfaultfd`` and what
66
+events, except page fault notifications, may be generated:
6767
68
-If the kernel supports registering userfaultfd ranges on hugetlbfs
69
-virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
70
-uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
71
-set if the kernel supports registering userfaultfd ranges on shared
72
-memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
73
-MAP_SHARED, memfd_create, etc).
68
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various other events
69
+ other than page faults are supported. These events are described in more
70
+ detail below in the `Non-cooperative userfaultfd`_ section.
7471
75
-The userland application that wants to use userfaultfd with hugetlbfs
76
-or shared memory need to set the corresponding flag in
77
-uffdio_api.features to enable those features.
72
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
73
+ indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
74
+ registrations for hugetlbfs and shared memory (covering all shmem APIs,
75
+ i.e. tmpfs, ``IPCSHM``, ``/dev/zero`` ``MAP_SHARED``, ``memfd_create``,
76
+ etc) virtual memory areas, respectively.
7877
79
-If the userland desires to receive notifications for events other than
80
-page faults, it has to verify that uffdio_api.features has appropriate
81
-UFFD_FEATURE_EVENT_* bits set. These events are described in more
82
-detail below in "Non-cooperative userfaultfd" section.
78
+- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
79
+ ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
80
+ areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
81
+ support for shmem virtual memory areas.
8382
84
-Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
85
-be invoked (if present in the returned uffdio_api.ioctls bitmask) to
86
-register a memory range in the userfaultfd by setting the
87
-uffdio_register structure accordingly. The uffdio_register.mode
83
+The userland application should set the feature flags it intends to use
84
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
85
+enabled if supported.
86
+
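Reviewer sketch of the handshake described above, assuming a Linux system with ``<linux/userfaultfd.h>`` available. The helper names ``uffd_open`` and ``uffd_features_ok`` are hypothetical and not part of the kernel API; only the syscall, the ``UFFDIO_API`` ioctl and the ``uffdio_api`` fields come from the text.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when every bit in `wanted` is present in the
 * features bitmask the kernel returned, 0 otherwise. */
static int uffd_features_ok(uint64_t returned, uint64_t wanted)
{
    return (returned & wanted) == wanted;
}

/* Hypothetical helper: open a userfaultfd and run the UFFDIO_API handshake,
 * requesting the `wanted` feature bits. On success returns the descriptor and
 * leaves the kernel's reply (features and ioctls bitmasks) in *reply.
 * On failure returns -1 with errno set. */
static int uffd_open(uint64_t wanted, struct uffdio_api *reply)
{
    assert(reply != NULL);

    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return -1;

    memset(reply, 0, sizeof(*reply));
    reply->api = UFFD_API;
    reply->features = wanted;
    if (ioctl(fd, UFFDIO_API, reply) < 0) {
        int saved = errno;  /* keep the ioctl error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A caller that only wants to probe could pass ``wanted = 0`` and then test the reply with ``uffd_features_ok(reply.features, UFFD_FEATURE_MISSING_SHMEM)`` before deciding which registrations to attempt.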
87
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
88
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
89
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
90
+``uffdio_register`` structure accordingly. The ``uffdio_register.mode``
8891 bitmask will specify to the kernel which kind of faults to track for
89
-the range (UFFDIO_REGISTER_MODE_MISSING would track missing
90
-pages). The UFFDIO_REGISTER ioctl will return the
91
-uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
92
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
93
+``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
9294 userfaults on the range registered. Not all ioctls will necessarily be
93
-supported for all memory types depending on the underlying virtual
94
-memory backend (anonymous memory vs tmpfs vs real filebacked
95
-mappings).
95
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
96
+hugetlbfs), or all types of intercepted faults.
9697
97
-Userland can use the uffdio_register.ioctls to manage the virtual
98
+Userland can use the ``uffdio_register.ioctls`` to manage the virtual
9899 address space in the background (to add or potentially also remove
99
-memory from the userfaultfd registered range). This means a userfault
100
+memory from the ``userfaultfd`` registered range). This means a userfault
100101 could be triggering just before userland maps in the background the
101102 user-faulted page.
102103
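Reviewer sketch of the registration step, assuming the descriptor has already completed the ``UFFDIO_API`` handshake. The helpers ``uffd_fill_register``, ``uffd_register`` and ``uffd_ioctl_supported`` are hypothetical names; the ``_UFFDIO_*`` bit numbers used to test the returned bitmask come from ``<linux/userfaultfd.h>``.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: fill a registration request for [addr, addr + len). */
static void uffd_fill_register(struct uffdio_register *reg, void *addr,
                               size_t len, uint64_t mode)
{
    assert(len != 0);
    memset(reg, 0, sizeof(*reg));
    reg->range.start = (uint64_t)(uintptr_t)addr;
    reg->range.len = len;
    reg->mode = mode;  /* e.g. UFFDIO_REGISTER_MODE_MISSING */
}

/* Hypothetical helper: returns 1 if ioctl number `nr` (e.g. _UFFDIO_COPY)
 * is advertised in an ioctls bitmask returned by the kernel. */
static int uffd_ioctl_supported(uint64_t ioctls, unsigned int nr)
{
    return (int)((ioctls >> nr) & 1);
}

/* Hypothetical helper: register the range and hand back the bitmask of
 * ioctls usable on it. Returns 0 on success, -1 with errno set on failure. */
static int uffd_register(int uffd, void *addr, size_t len, uint64_t mode,
                         uint64_t *ioctls)
{
    struct uffdio_register reg;

    uffd_fill_register(&reg, addr, len, mode);
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;
    *ioctls = reg.ioctls;
    return 0;
}
```

Checking ``uffd_ioctl_supported(ioctls, _UFFDIO_ZEROPAGE)`` after registration is one way to honor the note that not every ioctl is available for every memory type.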
103
-The primary ioctl to resolve userfaults is UFFDIO_COPY. That
104
-atomically copies a page into the userfault registered range and wakes
105
-up the blocked userfaults (unless uffdio_copy.mode &
106
-UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
107
-UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
108
-half copied page since it'll keep userfaulting until the copy has
109
-finished.
104
+Resolving Userfaults
105
+--------------------
106
+
107
+There are three basic ways to resolve userfaults:
108
+
109
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
110
+ userspace.
111
+
112
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
113
+
114
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
115
+
116
+These operations are atomic in the sense that they guarantee nothing can
117
+see a half-populated page, since readers will keep userfaulting until the
118
+operation has finished.
119
+
120
+By default, these wake up userfaults blocked on the range in question.
121
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
122
+that waking will be done separately at some later time.
123
+
124
+Which ioctl to choose depends on the kind of page fault, and what we'd
125
+like to do to resolve it:
126
+
127
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
128
+ resolved by either providing a new page (``UFFDIO_COPY``), or mapping
129
+ the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
130
+ the zero page for a missing fault. With userfaultfd, userspace can
131
+ decide what content to provide before the faulting thread continues.
132
+
133
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
134
+ the page cache). Userspace has the option of modifying the page's
135
+ contents before resolving the fault. Once the contents are correct
136
+ (modified or not), userspace asks the kernel to map the page and let the
137
+ faulting thread continue with ``UFFDIO_CONTINUE``.
138
+
139
+Notes:
140
+
141
+- You can tell which kind of fault occurred by examining
142
+ ``pagefault.flags`` within the ``uffd_msg``, checking for the
143
+ ``UFFD_PAGEFAULT_FLAG_*`` flags.
144
+
145
+- None of the page-delivering ioctls default to the range that you
146
+ registered with. You must fill in all fields for the appropriate
147
+ ioctl struct including the range.
148
+
149
+- You get the address of the access that triggered the missing page
150
+ event out of a struct uffd_msg that you read in the thread from the
151
+ uffd. You can supply as many pages as you want with these IOCTLs.
152
+ Keep in mind that unless you used DONTWAKE then the first of any of
153
+ those IOCTLs wakes up the faulting thread.
154
+
155
+- Be sure to test for all errors including
156
+ (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
157
+ supplied were incorrect.
158
+
159
+Write Protect Notifications
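Reviewer sketch of the resolution notes above: classify a fault from ``pagefault.flags``, fill every field of ``uffdio_copy`` (nothing defaults to the registered range), and check ``POLLERR`` before reading. All function names are hypothetical. The sketch assumes that a fault with neither the WP nor the MINOR flag is a missing fault, and it guards the MINOR flag with ``#ifdef`` because older headers may not define it.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

enum fault_kind { FAULT_MISSING, FAULT_WP, FAULT_MINOR };

/* Map pagefault.flags to the kind of fault that was intercepted. */
static enum fault_kind classify_fault(uint64_t flags)
{
#ifdef UFFD_PAGEFAULT_FLAG_MINOR
    if (flags & UFFD_PAGEFAULT_FLAG_MINOR)
        return FAULT_MINOR;
#endif
    if (flags & UFFD_PAGEFAULT_FLAG_WP)
        return FAULT_WP;
    return FAULT_MISSING;
}

/* Fill a UFFDIO_COPY request for the page containing fault_addr. */
static void fill_copy(struct uffdio_copy *copy, uint64_t fault_addr,
                      const void *src, uint64_t page_size)
{
    /* page_size must be a non-zero power of two for the mask below */
    assert(page_size && (page_size & (page_size - 1)) == 0);
    memset(copy, 0, sizeof(*copy));
    copy->dst = fault_addr & ~(page_size - 1);  /* round down to page start */
    copy->src = (uint64_t)(uintptr_t)src;
    copy->len = page_size;
    copy->mode = 0;  /* default: wake the faulting thread when done */
}

/* Wait for one message. Returns 1 with *msg filled, 0 if there was nothing
 * to read (EAGAIN on a non-blocking descriptor), -1 on error or POLLERR. */
static int read_fault(int uffd, struct uffd_msg *msg)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };

    if (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR))
        return -1;
    ssize_t n = read(uffd, msg, sizeof(*msg));
    if (n < 0)
        return errno == EAGAIN ? 0 : -1;
    return n == (ssize_t)sizeof(*msg) ? 1 : -1;
}

/* Resolve a missing fault by copying one page from src. */
static int resolve_missing(int uffd, const struct uffd_msg *msg,
                           const void *src, uint64_t page_size)
{
    struct uffdio_copy copy;

    fill_copy(&copy, msg->arg.pagefault.address, src, page_size);
    return ioctl(uffd, UFFDIO_COPY, &copy);
}
```

Keeping ``fill_copy`` separate from the ioctl call makes the address rounding easy to check in isolation; a handler thread would loop on ``read_fault`` and dispatch on ``classify_fault``.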
160
+---------------------------
161
+
162
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
163
+signal handler.
164
+
165
+Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
166
+Instead of using mprotect(2) you use
167
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
168
+while ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
169
+in the struct passed in. The range does not default to and does not
170
+have to be identical to the range you registered with. You can write
171
+protect as many ranges as you like (inside the registered range).
172
+Then, in the thread reading from uffd the struct will have
173
+``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
174
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
175
+again while ``uffdio_writeprotect.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP``
176
+set. This wakes up the thread which will continue to run with writes. This
177
+allows you to do the bookkeeping about the write in the uffd reading
178
+thread before the ioctl.
179
+
180
+If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
181
+``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
182
+which you supply a page and undo write protect. Note that there is a
183
+difference between writes into a WP area and into a !WP area. The
184
+former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
185
+``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but
186
+you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
187
+used.
110188
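Reviewer sketch of the write-protect sequence described above, assuming headers recent enough to define ``UFFDIO_WRITEPROTECT``. The helpers ``fill_wp``, ``uffd_set_wp`` and ``is_wp_fault`` are hypothetical names. Protecting a range sets ``UFFDIO_WRITEPROTECT_MODE_WP`` in the ``mode`` field; resolving a write fault issues the same ioctl with that bit clear.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fill a write-protect request. protect != 0 arms write protection on the
 * range; protect == 0 removes it, which also wakes the faulting thread. */
static void fill_wp(struct uffdio_writeprotect *wp, uint64_t start,
                    uint64_t len, int protect)
{
    assert(len != 0);
    memset(wp, 0, sizeof(*wp));
    wp->range.start = start;  /* must be given explicitly, no default */
    wp->range.len = len;
    wp->mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
}

/* Apply or remove write protection. Returns the ioctl result. */
static int uffd_set_wp(int uffd, uint64_t start, uint64_t len, int protect)
{
    struct uffdio_writeprotect wp;

    fill_wp(&wp, start, len, protect);
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

/* Returns 1 if the message reports a write to a write-protected page. */
static int is_wp_fault(const struct uffd_msg *msg)
{
    return msg->event == UFFD_EVENT_PAGEFAULT &&
           (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) != 0;
}
```

A reader thread would do its bookkeeping when ``is_wp_fault`` returns 1 and only then call ``uffd_set_wp(uffd, page_start, page_size, 0)`` to let the writer continue.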
111189 QEMU/KVM
112190 ========
113191
114
-QEMU/KVM is using the userfaultfd syscall to implement postcopy live
192
+QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live
115193 migration. Postcopy live migration is one form of memory
116194 externalization consisting of a virtual machine running with part or
117195 all of its memory residing on a different node in the cloud. The
118
-userfaultfd abstraction is generic enough that not a single line of
196
+``userfaultfd`` abstraction is generic enough that not a single line of
119197 KVM kernel code had to be modified in order to add postcopy live
120198 migration to QEMU.
121199
122
-Guest async page faults, FOLL_NOWAIT and all other GUP features work
200
+Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
123201 just fine in combination with userfaults. Userfaults trigger async
124202 page faults in the guest scheduler so those guest processes that
125203 aren't waiting for userfaults (i.e. network bound) can keep running in
....@@ -132,19 +210,19 @@
132210 The implementation of postcopy live migration currently uses one
133211 single bidirectional socket but in the future two different sockets
134212 will be used (to reduce the latency of the userfaults to the minimum
135
-possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
213
+possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
136214
137215 The QEMU in the source node writes all pages that it knows are missing
138216 in the destination node, into the socket, and the migration thread of
139
-the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
140
-ioctls on the userfaultfd in order to map the received pages into the
141
-guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
217
+the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
218
+ioctls on the ``userfaultfd`` in order to map the received pages into the
219
+guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page).
142220
143221 A different postcopy thread in the destination node listens with
144
-poll() to the userfaultfd in parallel. When a POLLIN event is
222
+poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
145223 generated after a userfault triggers, the postcopy thread read() from
146
-the userfaultfd and receives the fault address (or -EAGAIN in case the
147
-userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
224
+the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
225
+userfault was already resolved and woken by a ``UFFDIO_COPY|ZEROPAGE`` run
148226 by the parallel QEMU migration thread).
149227
150228 After the QEMU postcopy thread (running in the destination node) gets
....@@ -155,7 +233,7 @@
155233 (just the time to flush the tcp_wmem queue through the network) the
156234 migration thread in the QEMU running in the destination node will
157235 receive the page that triggered the userfault and it'll map it as
158
-usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
236
+usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
159237 was spontaneously sent by the source or if it was an urgent page
160238 requested through a userfault).
161239
....@@ -168,74 +246,74 @@
168246 over it when receiving incoming userfaults. After sending each page of
169247 course the bitmap is updated accordingly. It's also useful to avoid
170248 sending the same page twice (in case the userfault is read by the
171
-postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
249
+postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
172250 thread).
173251
174252 Non-cooperative userfaultfd
175253 ===========================
176254
177
-When the userfaultfd is monitored by an external manager, the manager
255
+When the ``userfaultfd`` is monitored by an external manager, the manager
178256 must be able to track changes in the process virtual memory
179257 layout. Userfaultfd can notify the manager about such changes using
180258 the same read(2) protocol as for the page fault notifications. The
181259 manager has to explicitly enable these events by setting appropriate
182
-bits in uffdio_api.features passed to UFFDIO_API ioctl:
260
+bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
183261
184
-UFFD_FEATURE_EVENT_FORK
185
- enable userfaultfd hooks for fork(). When this feature is
186
- enabled, the userfaultfd context of the parent process is
262
+``UFFD_FEATURE_EVENT_FORK``
263
+ enable ``userfaultfd`` hooks for fork(). When this feature is
264
+ enabled, the ``userfaultfd`` context of the parent process is
187265 duplicated into the newly created process. The manager
188
- receives UFFD_EVENT_FORK with file descriptor of the new
189
- userfaultfd context in the uffd_msg.fork.
266
+ receives ``UFFD_EVENT_FORK`` with file descriptor of the new
267
+ ``userfaultfd`` context in the ``uffd_msg.fork``.
190268
191
-UFFD_FEATURE_EVENT_REMAP
269
+``UFFD_FEATURE_EVENT_REMAP``
192270 enable notifications about mremap() calls. When the
193271 non-cooperative process moves a virtual memory area to a
194272 different location, the manager will receive
195
- UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
273
+ ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
196274 new addresses of the area and its original length.
197275
198
-UFFD_FEATURE_EVENT_REMOVE
276
+``UFFD_FEATURE_EVENT_REMOVE``
199277 enable notifications about madvise(MADV_REMOVE) and
200
- madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
201
- be generated upon these calls to madvise. The uffd_msg.remove
278
+ madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
279
+ be generated upon these calls to madvise(). The ``uffd_msg.remove``
202280 will contain start and end addresses of the removed area.
203281
204
-UFFD_FEATURE_EVENT_UNMAP
282
+``UFFD_FEATURE_EVENT_UNMAP``
205283 enable notifications about memory unmapping. The manager will
206
- get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
284
+ get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
207285 end addresses of the unmapped area.
208286
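Reviewer sketch of how a manager might decode the four events listed above into bookkeeping actions. The ``vm_change`` struct, the ``vm_action`` names and ``decode_event`` are hypothetical; only the ``UFFD_EVENT_*`` constants and the ``uffd_msg`` fields come from ``<linux/userfaultfd.h>``.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Feature bits a manager would request in uffdio_api.features. */
#define MANAGER_EVENT_FEATURES                              \
    (UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |   \
     UFFD_FEATURE_EVENT_REMOVE | UFFD_FEATURE_EVENT_UNMAP)

enum vm_action {
    VM_NONE,           /* not a layout event (e.g. a page fault) */
    VM_TRACK_CHILD,    /* fork: start monitoring child_uffd */
    VM_MOVE_RANGE,     /* mremap: [start, end) now begins at new_start */
    VM_ZERO_ON_FAULT,  /* madvise removal: range stays monitored */
    VM_FORGET_RANGE    /* unmap: stop issuing UFFDIO_COPY on the range */
};

struct vm_change {
    enum vm_action action;
    uint64_t start, end, new_start;
    int child_uffd;
};

/* Translate one message read from the userfaultfd into a vm_change. */
static struct vm_change decode_event(const struct uffd_msg *msg)
{
    struct vm_change c;

    assert(msg != NULL);
    memset(&c, 0, sizeof(c));
    c.child_uffd = -1;
    switch (msg->event) {
    case UFFD_EVENT_FORK:
        c.action = VM_TRACK_CHILD;
        c.child_uffd = (int)msg->arg.fork.ufd;
        break;
    case UFFD_EVENT_REMAP:
        c.action = VM_MOVE_RANGE;
        c.start = msg->arg.remap.from;
        c.end = msg->arg.remap.from + msg->arg.remap.len;
        c.new_start = msg->arg.remap.to;
        break;
    case UFFD_EVENT_REMOVE:
        c.action = VM_ZERO_ON_FAULT;
        c.start = msg->arg.remove.start;
        c.end = msg->arg.remove.end;
        break;
    case UFFD_EVENT_UNMAP:
        c.action = VM_FORGET_RANGE;
        c.start = msg->arg.remove.start;
        c.end = msg->arg.remove.end;
        break;
    default:
        break;
    }
    return c;
}
```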
209
-Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
287
+Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
210288 are pretty similar, they quite differ in the action expected from the
211
-userfaultfd manager. In the former case, the virtual memory is
289
+``userfaultfd`` manager. In the former case, the virtual memory is
212290 removed, but the area is not, the area remains monitored by the
213
-userfaultfd, and if a page fault occurs in that area it will be
291
+``userfaultfd``, and if a page fault occurs in that area it will be
214292 delivered to the manager. The proper resolution for such page fault is
215293 to zeromap the faulting address. However, in the latter case, when an
216294 area is unmapped, either explicitly (with munmap() system call), or
217295 implicitly (e.g. during mremap()), the area is removed and in turn the
218
-userfaultfd context for such area disappears too and the manager will
296
+``userfaultfd`` context for such area disappears too and the manager will
219297 not get further userland page faults from the removed area. Still, the
220298 notification is required in order to prevent manager from using
221
-UFFDIO_COPY on the unmapped area.
299
+``UFFDIO_COPY`` on the unmapped area.
222300
223301 Unlike userland page faults which have to be synchronous and require
224302 explicit or implicit wakeup, all the events are delivered
225303 asynchronously and the non-cooperative process resumes execution as
226
-soon as manager executes read(). The userfaultfd manager should
227
-carefully synchronize calls to UFFDIO_COPY with the events
228
-processing. To aid the synchronization, the UFFDIO_COPY ioctl will
229
-return -ENOSPC when the monitored process exits at the time of
230
-UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
231
-its virtual memory layout simultaneously with outstanding UFFDIO_COPY
304
+soon as manager executes read(). The ``userfaultfd`` manager should
305
+carefully synchronize calls to ``UFFDIO_COPY`` with the events
306
+processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
307
+return ``-ENOSPC`` when the monitored process exits at the time of
308
+``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed
309
+its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY``
232310 operation.
233311
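Reviewer sketch of how a manager could react to the two ``UFFDIO_COPY`` error codes named above. From userspace the ioctl returns -1 and the code arrives in ``errno``. The enum and ``classify_copy_result`` are hypothetical names. Handling ``ESRCH`` alongside ``ENOSPC`` is an assumption not stated in this text (some kernel versions are reported to use it for an exited process) and should be checked against ioctl_userfaultfd(2).

```c
#include <assert.h>
#include <errno.h>

enum copy_outcome {
    COPY_OK,              /* page installed */
    COPY_PROCESS_GONE,    /* monitored process exited: drop its state */
    COPY_LAYOUT_CHANGED,  /* layout changed: drain pending events, retry */
    COPY_FAILED           /* any other error */
};

/* Classify the result of ioctl(uffd, UFFDIO_COPY, ...).
 * `ret` is the ioctl return value, `err` the errno captured right after. */
static enum copy_outcome classify_copy_result(int ret, int err)
{
    assert(ret == 0 || ret == -1);
    if (ret == 0)
        return COPY_OK;
    switch (err) {
    case ENOSPC:  /* as documented in this text */
    case ESRCH:   /* assumption: see the note above the block */
        return COPY_PROCESS_GONE;
    case ENOENT:
        return COPY_LAYOUT_CHANGED;
    default:
        return COPY_FAILED;
    }
}
```

On ``COPY_LAYOUT_CHANGED`` a single-threaded manager would typically go back to read() and process the queued remap, remove or unmap event before deciding whether the copy is still needed.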
234312 The current asynchronous model of the event delivery is optimal for
235
-single threaded non-cooperative userfaultfd manager implementations. A
313
+single threaded non-cooperative ``userfaultfd`` manager implementations. A
236314 synchronous event delivery model can be added later as a new
237
-userfaultfd feature to facilitate multithreading enhancements of the
238
-non cooperative manager, for example to allow UFFDIO_COPY ioctls to
315
+``userfaultfd`` feature to facilitate multithreading enhancements of the
316
+non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
239317 run in parallel to the event reception. Single threaded
240318 implementations should continue to use the current async event
241319 delivery model instead.