| .. | .. |
|---|
| 40 | 40 | appropriate (malloc, mmap, huge pages, etc). This memory area is then |
|---|
| 41 | 41 | registered with the kernel using the new setsockopt XDP_UMEM_REG. The |
|---|
| 42 | 42 | UMEM also has two rings: the FILL ring and the COMPLETION ring. The |
|---|
| 43 | | -fill ring is used by the application to send down addr for the kernel |
|---|
| 43 | +FILL ring is used by the application to send down addr for the kernel |
|---|
| 44 | 44 | to fill in with RX packet data. References to these frames will then |
|---|
| 45 | 45 | appear in the RX ring once each packet has been received. The |
|---|
| 46 | | -completion ring, on the other hand, contains frame addr that the |
|---|
| 46 | +COMPLETION ring, on the other hand, contains frame addr that the |
|---|
| 47 | 47 | kernel has transmitted completely and can now be used again by user |
|---|
| 48 | 48 | space, for either TX or RX. Thus, the frame addrs appearing in the |
|---|
| 49 | | -completion ring are addrs that were previously transmitted using the |
|---|
| 49 | +COMPLETION ring are addrs that were previously transmitted using the |
|---|
| 50 | 50 | TX ring. In summary, the RX and FILL rings are used for the RX path |
|---|
| 51 | 51 | and the TX and COMPLETION rings are used for the TX path. |
|---|
| 52 | 52 | |
|---|
| .. | .. |
|---|
| 91 | 91 | ======== |
|---|
| 92 | 92 | |
|---|
| 93 | 93 | In order to use an AF_XDP socket, a number of associated objects need |
|---|
| 94 | | -to be setup. |
|---|
| 94 | +to be setup. These objects and their options are explained in the |
|---|
| 95 | +following sections. |
|---|
| 95 | 96 | |
|---|
| 96 | | -Jonathan Corbet has also written an excellent article on LWN, |
|---|
| 97 | | -"Accelerating networking with AF_XDP". It can be found at |
|---|
| 98 | | -https://lwn.net/Articles/750845/. |
|---|
| 97 | +For an overview on how AF_XDP works, you can also take a look at the |
|---|
| 98 | +Linux Plumbers paper from 2018 on the subject: |
|---|
| 99 | +http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do |
|---|
| 100 | +NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt |
|---|
| 101 | +at AF_XDP. Nearly everything changed since then. Jonathan Corbet has |
|---|
| 102 | +also written an excellent article on LWN, "Accelerating networking |
|---|
| 103 | +with AF_XDP". It can be found at https://lwn.net/Articles/750845/. |
|---|
| 99 | 104 | |
|---|
| 100 | 105 | UMEM |
|---|
| 101 | 106 | ---- |
|---|
| .. | .. |
|---|
| 113 | 118 | struct sockaddr_xdp member sxdp_flags, and passing the file descriptor |
|---|
| 114 | 119 | of A to struct sockaddr_xdp member sxdp_shared_umem_fd. |
|---|
| 115 | 120 | |
|---|
| 116 | | -The UMEM has two single-producer/single-consumer rings, that are used |
|---|
| 121 | +The UMEM has two single-producer/single-consumer rings that are used |
|---|
| 117 | 122 | to transfer ownership of UMEM frames between the kernel and the |
|---|
| 118 | 123 | user-space application. |
|---|
| 119 | 124 | |
|---|
| 120 | 125 | Rings |
|---|
| 121 | 126 | ----- |
|---|
| 122 | 127 | |
|---|
| 123 | | -There are a four different kind of rings: Fill, Completion, RX and |
|---|
| 128 | +There are four different kinds of rings: FILL, COMPLETION, RX and |
|---|
| 124 | 129 | TX. All rings are single-producer/single-consumer, so the user-space |
|---|
| 125 | 130 | application needs explicit synchronization if multiple |
|---|
| 126 | 131 | processes/threads are reading/writing to them. |
|---|
| 127 | 132 | |
|---|
| 128 | | -The UMEM uses two rings: Fill and Completion. Each socket associated |
|---|
| 133 | +The UMEM uses two rings: FILL and COMPLETION. Each socket associated |
|---|
| 129 | 134 | with the UMEM must have an RX queue, TX queue or both. Say, that there |
|---|
| 130 | 135 | is a setup with four sockets (all doing TX and RX). Then there will be |
|---|
| 131 | | -one Fill ring, one Completion ring, four TX rings and four RX rings. |
|---|
| 136 | +one FILL ring, one COMPLETION ring, four TX rings and four RX rings. |
|---|
| 132 | 137 | |
|---|
| 133 | 138 | The rings are head(producer)/tail(consumer) based rings. A producer |
|---|
| 134 | 139 | writes the data ring at the index pointed out by struct xdp_ring |
|---|
| .. | .. |
|---|
| 146 | 151 | UMEM Fill Ring |
|---|
| 147 | 152 | ~~~~~~~~~~~~~~ |
|---|
| 148 | 153 | |
|---|
| 149 | | -The Fill ring is used to transfer ownership of UMEM frames from |
|---|
| 154 | +The FILL ring is used to transfer ownership of UMEM frames from |
|---|
| 150 | 155 | user-space to kernel-space. The UMEM addrs are passed in the ring. As |
|---|
| 151 | 156 | an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has |
|---|
| 152 | 157 | 16 chunks and can pass addrs between 0 and 64k. |
|---|
| 153 | 158 | |
|---|
| 154 | 159 | Frames passed to the kernel are used for the ingress path (RX rings). |
|---|
| 155 | 160 | |
|---|
| 156 | | -The user application produces UMEM addrs to this ring. Note that the |
|---|
| 157 | | -kernel will mask the incoming addr. E.g. for a chunk size of 2k, the |
|---|
| 158 | | -log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 |
|---|
| 159 | | -and 3000 refers to the same chunk. |
|---|
| 161 | +The user application produces UMEM addrs to this ring. Note that, if |
|---|
| 162 | +running the application with aligned chunk mode, the kernel will mask |
|---|
| 163 | +the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of |
|---|
| 164 | +the addr will be masked off, meaning that 2048, 2050 and 3000 refer |
|---|
| 165 | +to the same chunk. If the user application is run in the unaligned |
|---|
| 166 | +chunks mode, then the incoming addr will be left untouched. |
|---|
| 160 | 167 | |
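
The aligned-mode masking described above can be sketched in user space as follows. This is only an illustration of the arithmetic, not kernel code: the helper name chunk_base_addr() is hypothetical and is not part of the kernel or libbpf API, and it assumes a power-of-two chunk size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel or libbpf function): returns the
 * start address of the chunk that addr falls into, mimicking the
 * masking the text describes for aligned chunk mode.
 */
static uint64_t chunk_base_addr(uint64_t addr, uint64_t chunk_size)
{
	/* The mask trick only works for power-of-two chunk sizes. */
	assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);

	/* Clear the low log2(chunk_size) bits of the address. */
	return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to the chunk starting at 2048, while 4096 starts the next chunk.
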
|---|
| 161 | 168 | |
|---|
| 162 | | -UMEM Completetion Ring |
|---|
| 163 | | -~~~~~~~~~~~~~~~~~~~~~~ |
|---|
| 169 | +UMEM Completion Ring |
|---|
| 170 | +~~~~~~~~~~~~~~~~~~~~ |
|---|
| 164 | 171 | |
|---|
| 165 | | -The Completion Ring is used transfer ownership of UMEM frames from |
|---|
| 166 | | -kernel-space to user-space. Just like the Fill ring, UMEM indicies are |
|---|
| 172 | +The COMPLETION Ring is used to transfer ownership of UMEM frames from |
|---|
| 173 | +kernel-space to user-space. Just like the FILL ring, UMEM indices are |
|---|
| 167 | 174 | used. |
|---|
| 168 | 175 | |
|---|
| 169 | 176 | Frames passed from the kernel to user-space are frames that have been |
|---|
| .. | .. |
|---|
| 179 | 186 | is a struct xdp_desc descriptor. The descriptor contains UMEM offset |
|---|
| 180 | 187 | (addr) and the length of the data (len). |
|---|
| 181 | 188 | |
|---|
| 182 | | -If no frames have been passed to kernel via the Fill ring, no |
|---|
| 189 | +If no frames have been passed to kernel via the FILL ring, no |
|---|
| 183 | 190 | descriptors will (or can) appear on the RX ring. |
|---|
| 184 | 191 | |
|---|
| 185 | 192 | The user application consumes struct xdp_desc descriptors from this |
|---|
| .. | .. |
|---|
| 197 | 204 | The user application produces struct xdp_desc descriptors to this |
|---|
| 198 | 205 | ring. |
|---|
| 199 | 206 | |
|---|
| 207 | +Libbpf |
|---|
| 208 | +====== |
|---|
| 209 | + |
|---|
| 210 | +Libbpf is a helper library for eBPF and XDP that makes using these |
|---|
| 211 | +technologies a lot simpler. It also contains specific helper functions |
|---|
| 212 | +in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It |
|---|
| 213 | +contains two types of functions: those that can be used to make the |
|---|
| 214 | +setup of AF_XDP socket easier and ones that can be used in the data |
|---|
| 215 | +plane to access the rings safely and quickly. To see an example on how |
|---|
| 216 | +to use this API, please take a look at the sample application in |
|---|
| 217 | +samples/bpf/xdpsock_user.c which uses libbpf for both setup and data |
|---|
| 218 | +plane operations. |
|---|
| 219 | + |
|---|
| 220 | +We recommend that you use this library unless you have become a power |
|---|
| 221 | +user. It will make your program a lot simpler. |
|---|
| 222 | + |
|---|
| 200 | 223 | XSKMAP / BPF_MAP_TYPE_XSKMAP |
|---|
| 201 | | ----------------------------- |
|---|
| 224 | +============================ |
|---|
| 202 | 225 | |
|---|
| 203 | 226 | On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that |
|---|
| 204 | 227 | is used in conjunction with bpf_redirect_map() to pass the ingress |
|---|
| .. | .. |
|---|
| 214 | 237 | successfully pass data to the socket. Please refer to the sample |
|---|
| 215 | 238 | application (samples/bpf/) for an example. |
|---|
| 216 | 239 | |
|---|
| 240 | +Configuration Flags and Socket Options |
|---|
| 241 | +====================================== |
|---|
| 242 | + |
|---|
| 243 | +These are the various configuration flags that can be used to control |
|---|
| 244 | +and monitor the behavior of AF_XDP sockets. |
|---|
| 245 | + |
|---|
| 246 | +XDP_COPY and XDP_ZERO_COPY bind flags |
|---|
| 247 | +------------------------------------- |
|---|
| 248 | + |
|---|
| 249 | +When you bind to a socket, the kernel will first try to use zero-copy |
|---|
| 250 | +mode. If zero-copy is not supported, it will fall back on using copy |
|---|
| 251 | +mode, i.e. copying all packets out to user space. But if you would |
|---|
| 252 | +like to force a certain mode, you can use the following flags. If you |
|---|
| 253 | +pass the XDP_COPY flag to the bind call, the kernel will force the |
|---|
| 254 | +socket into copy mode. If it cannot use copy mode, the bind call will |
|---|
| 255 | +fail with an error. Conversely, the XDP_ZERO_COPY flag will force the |
|---|
| 256 | +socket into zero-copy mode or fail. |
|---|
| 257 | + |
|---|
| 258 | +XDP_SHARED_UMEM bind flag |
|---|
| 259 | +------------------------- |
|---|
| 260 | + |
|---|
| 261 | +This flag enables you to bind multiple sockets to the same UMEM. It |
|---|
| 262 | +works on the same queue id, between queue ids and between |
|---|
| 263 | +netdevs/devices. In this mode, each socket has their own RX and TX |
|---|
| 264 | +rings as usual, but you are going to have one or more FILL and |
|---|
| 265 | +COMPLETION ring pairs. You have to create one of these pairs per |
|---|
| 266 | +unique netdev and queue id tuple that you bind to. |
|---|
| 267 | + |
|---|
| 268 | +We start with the case where we would like to share a UMEM between |
|---|
| 269 | +sockets bound to the same netdev and queue id. The UMEM (tied to the |
|---|
| 270 | +first socket created) will only have a single FILL ring and a single |
|---|
| 271 | +COMPLETION ring as there is only one unique netdev,queue_id tuple that |
|---|
| 272 | +we have bound to. To use this mode, create the first socket and bind |
|---|
| 273 | +it in the normal way. Create a second socket and create an RX and a TX |
|---|
| 274 | +ring, or at least one of them, but no FILL or COMPLETION rings as the |
|---|
| 275 | +ones from the first socket will be used. In the bind call, set the |
|---|
| 276 | +XDP_SHARED_UMEM option and provide the initial socket's fd in the |
|---|
| 277 | +sxdp_shared_umem_fd field. You can attach an arbitrary number of extra |
|---|
| 278 | +sockets this way. |
|---|
| 279 | + |
|---|
| 280 | +Which socket will a packet then arrive on? This is decided by the XDP |
|---|
| 281 | +program. Put all the sockets in the XSK_MAP and just indicate which |
|---|
| 282 | +index in the array you would like to send each packet to. A simple |
|---|
| 283 | +round-robin example of distributing packets is shown below: |
|---|
| 284 | + |
|---|
| 285 | +.. code-block:: c |
|---|
| 286 | + |
|---|
| 287 | + #include <linux/bpf.h> |
|---|
| 288 | + #include "bpf_helpers.h" |
|---|
| 289 | + |
|---|
| 290 | + #define MAX_SOCKS 16 |
|---|
| 291 | + |
|---|
| 292 | + struct { |
|---|
| 293 | + __uint(type, BPF_MAP_TYPE_XSKMAP); |
|---|
| 294 | + __uint(max_entries, MAX_SOCKS); |
|---|
| 295 | + __uint(key_size, sizeof(int)); |
|---|
| 296 | + __uint(value_size, sizeof(int)); |
|---|
| 297 | + } xsks_map SEC(".maps"); |
|---|
| 298 | + |
|---|
| 299 | + static unsigned int rr; |
|---|
| 300 | + |
|---|
| 301 | + SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) |
|---|
| 302 | + { |
|---|
| 303 | + rr = (rr + 1) & (MAX_SOCKS - 1); |
|---|
| 304 | + |
|---|
| 305 | + return bpf_redirect_map(&xsks_map, rr, XDP_DROP); |
|---|
| 306 | + } |
|---|
| 307 | + |
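
One detail of the round-robin program above is that the wrap-around uses a bit mask instead of a modulo, which is only correct when MAX_SOCKS is a power of two. The user-space sketch below isolates that index calculation so it can be checked outside of BPF; next_index() is a hypothetical name introduced here for illustration.

```c
#include <assert.h>

#define MAX_SOCKS 16

/* The mask below only wraps correctly for a power-of-two MAX_SOCKS. */
static_assert((MAX_SOCKS & (MAX_SOCKS - 1)) == 0,
	      "MAX_SOCKS must be a power of two");

/* Hypothetical helper: same arithmetic as the rr update in the XDP program. */
static unsigned int next_index(unsigned int rr)
{
	return (rr + 1) & (MAX_SOCKS - 1);
}
```

Starting from 0, the index walks 1, 2, ..., 15 and then wraps back to 0, so every slot of the XSKMAP is visited in turn.
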
|---|
| 308 | +Note, that since there is only a single set of FILL and COMPLETION |
|---|
| 309 | +rings, and they are single producer, single consumer rings, you need |
|---|
| 310 | +to make sure that multiple processes or threads do not use these rings |
|---|
| 311 | +concurrently. There are no synchronization primitives in the |
|---|
| 312 | +libbpf code that protects multiple users at this point in time. |
|---|
| 313 | + |
|---|
| 314 | +Libbpf uses this mode if you create more than one socket tied to the |
|---|
| 315 | +same UMEM. However, note that you need to supply the |
|---|
| 316 | +XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the |
|---|
| 317 | +xsk_socket__create calls and load your own XDP program as there is no |
|---|
| 318 | +built in one in libbpf that will route the traffic for you. |
|---|
| 319 | + |
|---|
| 320 | +The second case is when you share a UMEM between sockets that are |
|---|
| 321 | +bound to different queue ids and/or netdevs. In this case you have to |
|---|
| 322 | +create one FILL ring and one COMPLETION ring for each unique |
|---|
| 323 | +netdev,queue_id pair. Let us say you want to create two sockets bound |
|---|
| 324 | +to two different queue ids on the same netdev. Create the first socket |
|---|
| 325 | +and bind it in the normal way. Create a second socket and create an RX |
|---|
| 326 | +and a TX ring, or at least one of them, and then one FILL and |
|---|
| 327 | +COMPLETION ring for this socket. Then in the bind call, set the |
|---|
| 328 | +XDP_SHARED_UMEM option and provide the initial socket's fd in the |
|---|
| 329 | +sxdp_shared_umem_fd field as you registered the UMEM on that |
|---|
| 330 | +socket. These two sockets will now share one and the same UMEM. |
|---|
| 331 | + |
|---|
| 332 | +There is no need to supply an XDP program like the one in the previous |
|---|
| 333 | +case where sockets were bound to the same queue id and |
|---|
| 334 | +device. Instead, use the NIC's packet steering capabilities to steer |
|---|
| 335 | +the packets to the right queue. In the previous example, there is only |
|---|
| 336 | +one queue shared among sockets, so the NIC cannot do this steering. It |
|---|
| 337 | +can only steer between queues. |
|---|
| 338 | + |
|---|
| 339 | +In libbpf, you need to use the xsk_socket__create_shared() API as it |
|---|
| 340 | +takes a reference to a FILL ring and a COMPLETION ring that will be |
|---|
| 341 | +created for you and bound to the shared UMEM. You can use this |
|---|
| 342 | +function for all the sockets you create, or you can use it for the |
|---|
| 343 | +second and following ones and use xsk_socket__create() for the first |
|---|
| 344 | +one. Both methods yield the same result. |
|---|
| 345 | + |
|---|
| 346 | +Note that a UMEM can be shared between sockets on the same queue id |
|---|
| 347 | +and device, as well as between queues on the same device and between |
|---|
| 348 | +devices at the same time. |
|---|
| 349 | + |
|---|
| 350 | +XDP_USE_NEED_WAKEUP bind flag |
|---|
| 351 | +----------------------------- |
|---|
| 352 | + |
|---|
| 353 | +This option adds support for a new flag called need_wakeup that is |
|---|
| 354 | +present in the FILL ring and the TX ring, the rings for which user |
|---|
| 355 | +space is a producer. When this option is set in the bind call, the |
|---|
| 356 | +need_wakeup flag will be set if the kernel needs to be explicitly |
|---|
| 357 | +woken up by a syscall to continue processing packets. If the flag is |
|---|
| 358 | +zero, no syscall is needed. |
|---|
| 359 | + |
|---|
| 360 | +If the flag is set on the FILL ring, the application needs to call |
|---|
| 361 | +poll() to be able to continue to receive packets on the RX ring. This |
|---|
| 362 | +can happen, for example, when the kernel has detected that there are no |
|---|
| 363 | +more buffers on the FILL ring and no buffers left on the RX HW ring of |
|---|
| 364 | +the NIC. In this case, interrupts are turned off as the NIC cannot |
|---|
| 365 | +receive any packets (as there are no buffers to put them in), and the |
|---|
| 366 | +need_wakeup flag is set so that user space can put buffers on the |
|---|
| 367 | +FILL ring and then call poll() so that the kernel driver can put these |
|---|
| 368 | +buffers on the HW ring and start to receive packets. |
|---|
| 369 | + |
|---|
| 370 | +If the flag is set for the TX ring, it means that the application |
|---|
| 371 | +needs to explicitly notify the kernel to send any packets put on the |
|---|
| 372 | +TX ring. This can be accomplished either by a poll() call, as in the |
|---|
| 373 | +RX path, or by calling sendto(). |
|---|
| 374 | + |
|---|
| 375 | +An example of how to use this flag can be found in |
|---|
| 376 | +samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers |
|---|
| 377 | +would look like this for the TX path: |
|---|
| 378 | + |
|---|
| 379 | +.. code-block:: c |
|---|
| 380 | + |
|---|
| 381 | + if (xsk_ring_prod__needs_wakeup(&my_tx_ring)) |
|---|
| 382 | + sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0); |
|---|
| 383 | + |
|---|
| 384 | +I.e., only use the syscall if the flag is set. |
|---|
| 385 | + |
|---|
| 386 | +We recommend that you always enable this mode as it usually leads to |
|---|
| 387 | +better performance especially if you run the application and the |
|---|
| 388 | +driver on the same core, but also if you use different cores for the |
|---|
| 389 | +application and the kernel driver, as it reduces the number of |
|---|
| 390 | +syscalls needed for the TX path. |
|---|
| 391 | + |
|---|
| 392 | +XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts |
|---|
| 393 | +------------------------------------------------------ |
|---|
| 394 | + |
|---|
| 395 | +These setsockopts set the number of descriptors that the RX, TX, |
|---|
| 396 | +FILL, and COMPLETION rings respectively should have. It is mandatory |
|---|
| 397 | +to set the size of at least one of the RX and TX rings. If you set |
|---|
| 398 | +both, you will be able to both receive and send traffic from your |
|---|
| 399 | +application, but if you only want to do one of them, you can save |
|---|
| 400 | +resources by only setting up one of them. Both the FILL ring and the |
|---|
| 401 | +COMPLETION ring are mandatory as you need to have a UMEM tied to your |
|---|
| 402 | +socket. But if the XDP_SHARED_UMEM flag is used, any socket after the |
|---|
| 403 | +first one does not have a UMEM and should in that case not have any |
|---|
| 404 | +FILL or COMPLETION rings created as the ones from the shared UMEM will |
|---|
| 405 | +be used. Note, that the rings are single-producer single-consumer, so |
|---|
| 406 | +do not try to access them from multiple processes at the same |
|---|
| 407 | +time. See the XDP_SHARED_UMEM section. |
|---|
| 408 | + |
|---|
| 409 | +In libbpf, you can create Rx-only and Tx-only sockets by supplying |
|---|
| 410 | +NULL to the rx and tx arguments, respectively, to the |
|---|
| 411 | +xsk_socket__create function. |
|---|
| 412 | + |
|---|
| 413 | +If you create a Tx-only socket, we recommend that you do not put any |
|---|
| 414 | +packets on the fill ring. If you do this, drivers might think you are |
|---|
| 415 | +going to receive something when you in fact will not, and this can |
|---|
| 416 | +negatively impact performance. |
|---|
| 417 | + |
|---|
| 418 | +XDP_UMEM_REG setsockopt |
|---|
| 419 | +----------------------- |
|---|
| 420 | + |
|---|
| 421 | +This setsockopt registers a UMEM to a socket. This is the area that |
|---|
| 422 | +contains all the buffers that packets can reside in. The call takes a |
|---|
| 423 | +pointer to the beginning of this area and the size of it. Moreover, it |
|---|
| 424 | +also has a parameter called chunk_size that is the size that the UMEM is |
|---|
| 425 | +divided into. It can only be 2K or 4K at the moment. If you have a |
|---|
| 426 | +UMEM area that is 128K and a chunk size of 2K, this means that you |
|---|
| 427 | +will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM |
|---|
| 428 | +area and that your largest packet size can be 2K. |
|---|
| 429 | + |
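
The relationship between UMEM size, chunk size and the number of buffers can be written out as a small sketch. The helper names umem_chunk_count() and umem_chunk_addr() are hypothetical and exist only for this illustration; they are not part of the AF_XDP uapi or libbpf.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of chunks (buffers) a UMEM area provides. */
static uint64_t umem_chunk_count(uint64_t umem_size, uint64_t chunk_size)
{
	assert(chunk_size != 0);
	return umem_size / chunk_size;
}

/* Hypothetical helper: UMEM-relative addr at which chunk number idx starts. */
static uint64_t umem_chunk_addr(uint64_t idx, uint64_t chunk_size)
{
	return idx * chunk_size;
}
```

For the example in the text, umem_chunk_count(128 * 1024, 2048) gives 64 chunks, and the chunk with index 3 starts at addr 6144.
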
|---|
| 430 | +There is also an option to set the headroom of each single buffer in |
|---|
| 431 | +the UMEM. If you set this to N bytes, it means that the packet will |
|---|
| 432 | +start N bytes into the buffer leaving the first N bytes for the |
|---|
| 433 | +application to use. The final option is the flags field, but it will |
|---|
| 434 | +be dealt with in separate sections for each UMEM flag. |
|---|
| 435 | + |
|---|
| 436 | +SO_BINDTODEVICE setsockopt |
|---|
| 437 | +-------------------------- |
|---|
| 438 | + |
|---|
| 439 | +This is a generic SOL_SOCKET option that can be used to tie an AF_XDP |
|---|
| 440 | +socket to a particular network interface. It is useful when a socket |
|---|
| 441 | +is created by a privileged process and passed to a non-privileged one. |
|---|
| 442 | +Once the option is set, kernel will refuse attempts to bind that socket |
|---|
| 443 | +to a different interface. Updating the value requires CAP_NET_RAW. |
|---|
| 444 | + |
|---|
| 445 | +XDP_STATISTICS getsockopt |
|---|
| 446 | +------------------------- |
|---|
| 447 | + |
|---|
| 448 | +Gets drop statistics of a socket that can be useful for debug |
|---|
| 449 | +purposes. The supported statistics are shown below: |
|---|
| 450 | + |
|---|
| 451 | +.. code-block:: c |
|---|
| 452 | + |
|---|
| 453 | + struct xdp_statistics { |
|---|
| 454 | + __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ |
|---|
| 455 | + __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ |
|---|
| 456 | + __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ |
|---|
| 457 | + }; |
|---|
| 458 | + |
|---|
| 459 | +XDP_OPTIONS getsockopt |
|---|
| 460 | +---------------------- |
|---|
| 461 | + |
|---|
| 462 | +Gets options from an XDP socket. The only one supported so far is |
|---|
| 463 | +XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. |
|---|
| 464 | + |
|---|
| 217 | 465 | Usage |
|---|
| 218 | 466 | ===== |
|---|
| 219 | 467 | |
|---|
| 220 | | -In order to use AF_XDP sockets there are two parts needed. The |
|---|
| 468 | +In order to use AF_XDP sockets two parts are needed. The |
|---|
| 221 | 469 | user-space application and the XDP program. For a complete setup and |
|---|
| 222 | 470 | usage example, please refer to the sample application. The user-space |
|---|
| 223 | | -side is xdpsock_user.c and the XDP side xdpsock_kern.c. |
|---|
| 471 | +side is xdpsock_user.c and the XDP side is part of libbpf. |
|---|
| 224 | 472 | |
|---|
| 225 | | -Naive ring dequeue and enqueue could look like this:: |
|---|
| 473 | +The XDP code sample included in tools/lib/bpf/xsk.c is the following: |
|---|
| 474 | + |
|---|
| 475 | +.. code-block:: c |
|---|
| 476 | + |
|---|
| 477 | + SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) |
|---|
| 478 | + { |
|---|
| 479 | + int index = ctx->rx_queue_index; |
|---|
| 480 | + |
|---|
| 481 | + // A set entry here means that the corresponding queue_id |
|---|
| 482 | + // has an active AF_XDP socket bound to it. |
|---|
| 483 | + if (bpf_map_lookup_elem(&xsks_map, &index)) |
|---|
| 484 | + return bpf_redirect_map(&xsks_map, index, 0); |
|---|
| 485 | + |
|---|
| 486 | + return XDP_PASS; |
|---|
| 487 | + } |
|---|
| 488 | + |
|---|
| 489 | +A simple but not so performant ring dequeue and enqueue could look |
|---|
| 490 | +like this: |
|---|
| 491 | + |
|---|
| 492 | +.. code-block:: c |
|---|
| 226 | 493 | |
|---|
| 227 | 494 | // struct xdp_rxtx_ring { |
|---|
| 228 | 495 | // __u32 *producer; |
|---|
| .. | .. |
|---|
| 271 | 538 | return 0; |
|---|
| 272 | 539 | } |
|---|
| 273 | 540 | |
|---|
| 274 | | - |
|---|
| 275 | | -For a more optimized version, please refer to the sample application. |
|---|
| 541 | +But please use the libbpf functions as they are optimized and ready to |
|---|
| 542 | +use. They will make your life easier. |
|---|
| 276 | 543 | |
|---|
| 277 | 544 | Sample application |
|---|
| 278 | 545 | ================== |
|---|
| 279 | 546 | |
|---|
| 280 | 547 | There is a xdpsock benchmarking/test application included that |
|---|
| 281 | | -demonstrates how to use AF_XDP sockets with both private and shared |
|---|
| 282 | | -UMEMs. Say that you would like your UDP traffic from port 4242 to end |
|---|
| 283 | | -up in queue 16, that we will enable AF_XDP on. Here, we use ethtool |
|---|
| 284 | | -for this:: |
|---|
| 548 | +demonstrates how to use AF_XDP sockets with private UMEMs. Say that |
|---|
| 549 | +you would like your UDP traffic from port 4242 to end up in queue 16, |
|---|
| 550 | +that we will enable AF_XDP on. Here, we use ethtool for this:: |
|---|
| 285 | 551 | |
|---|
| 286 | 552 | ethtool -N p3p2 rx-flow-hash udp4 fn |
|---|
| 287 | 553 | ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ |
|---|
| .. | .. |
|---|
| 294 | 560 | |
|---|
| 295 | 561 | For XDP_SKB mode, use the switch "-S" instead of "-N" and all options |
|---|
| 296 | 562 | can be displayed with "-h", as usual. |
|---|
| 563 | + |
|---|
| 564 | +This sample application uses libbpf to make the setup and usage of |
|---|
| 565 | +AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is |
|---|
| 566 | +really used to make something more advanced, take a look at the libbpf |
|---|
| 567 | +code in tools/lib/bpf/xsk.[ch]. |
|---|
| 568 | + |
|---|
| 569 | +FAQ |
|---|
| 570 | +======= |
|---|
| 571 | + |
|---|
| 572 | +Q: I am not seeing any traffic on the socket. What am I doing wrong? |
|---|
| 573 | + |
|---|
| 574 | +A: When a netdev of a physical NIC is initialized, Linux usually |
|---|
| 575 | + allocates one RX and TX queue pair per core. So on an 8 core system, |
|---|
| 576 | + queue ids 0 to 7 will be allocated, one per core. In the AF_XDP |
|---|
| 577 | + bind call or the xsk_socket__create libbpf function call, you |
|---|
| 578 | + specify a specific queue id to bind to and it is only the traffic |
|---|
| 579 | + towards that queue you are going to get on your socket. So in the |
|---|
| 580 | + example above, if you bind to queue 0, you are NOT going to get any |
|---|
| 581 | + traffic that is distributed to queues 1 through 7. If you are |
|---|
| 582 | + lucky, you will see the traffic, but usually it will end up on one |
|---|
| 583 | + of the queues you have not bound to. |
|---|
| 584 | + |
|---|
| 585 | + There are a number of ways to solve the problem of getting the |
|---|
| 586 | + traffic you want to the queue id you bound to. If you want to see |
|---|
| 587 | + all the traffic, you can force the netdev to only have 1 queue, queue |
|---|
| 588 | + id 0, and then bind to queue 0. You can use ethtool to do this:: |
|---|
| 589 | + |
|---|
| 590 | + sudo ethtool -L <interface> combined 1 |
|---|
| 591 | + |
|---|
| 592 | + If you want to only see part of the traffic, you can program the |
|---|
| 593 | + NIC through ethtool to filter out your traffic to a single queue id |
|---|
| 594 | + that you can bind your XDP socket to. Here is one example in which |
|---|
| 595 | + UDP traffic to and from port 4242 are sent to queue 2:: |
|---|
| 596 | + |
|---|
| 597 | + sudo ethtool -N <interface> rx-flow-hash udp4 fn |
|---|
| 598 | + sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ |
|---|
| 599 | + 4242 action 2 |
|---|
| 600 | + |
|---|
| 601 | + A number of other ways are possible all up to the capabilities of |
|---|
| 602 | + the NIC you have. |
|---|
| 603 | + |
|---|
| 604 | +Q: Can I use the XSKMAP to implement a switch between different umems |
|---|
| 605 | + in copy mode? |
|---|
| 606 | + |
|---|
| 607 | +A: The short answer is no, that is not supported at the moment. The |
|---|
| 608 | + XSKMAP can only be used to switch traffic coming in on queue id X |
|---|
| 609 | + to sockets bound to the same queue id X. The XSKMAP can contain |
|---|
| 610 | + sockets bound to different queue ids, for example X and Y, but only |
|---|
| 611 | + traffic coming in from queue id Y can be directed to sockets bound |
|---|
| 612 | + to the same queue id Y. In zero-copy mode, you should use the |
|---|
| 613 | + switch, or other distribution mechanism, in your NIC to direct |
|---|
| 614 | + traffic to the correct queue id and socket. |
|---|
| 615 | + |
|---|
| 616 | +Q: My packets are sometimes corrupted. What is wrong? |
|---|
| 617 | + |
|---|
| 618 | +A: Care has to be taken not to feed the same buffer in the UMEM into |
|---|
| 619 | + more than one ring at the same time. If you for example feed the |
|---|
| 620 | + same buffer into the FILL ring and the TX ring at the same time, the |
|---|
| 621 | + NIC might receive data into the buffer at the same time it is |
|---|
| 622 | + sending it. This will cause some packets to become corrupted. Same |
|---|
| 623 | + thing goes for feeding the same buffer into the FILL rings |
|---|
| 624 | + belonging to different queue ids or netdevs bound with the |
|---|
| 625 | + XDP_SHARED_UMEM flag. |
|---|
| 297 | 626 | |
|---|
| 298 | 627 | Credits |
|---|
| 299 | 628 | ======= |
|---|
| .. | .. |
|---|
| 309 | 638 | - Michael S. Tsirkin |
|---|
| 310 | 639 | - Qi Z Zhang |
|---|
| 311 | 640 | - Willem de Bruijn |
|---|
| 312 | | - |
|---|