@@ -3,7 +3,7 @@
 ============================
 
 By: David Howells <dhowells@redhat.com>
-    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
+    Paul E. McKenney <paulmck@linux.ibm.com>
     Will Deacon <will.deacon@arm.com>
     Peter Zijlstra <peterz@infradead.org>
 
@@ -63,7 +63,6 @@
 
      - Compiler barrier.
      - CPU memory barriers.
-     - MMIO write barrier.
 
 (*) Implicit kernel memory barriers.
 
@@ -75,7 +74,6 @@
 (*) Inter-CPU acquiring barrier effects.
 
      - Acquires vs memory accesses.
-     - Acquires vs I/O accesses.
 
 (*) Where are memory barriers needed?
 
@@ -187,7 +185,7 @@
 	===============         ===============
 	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
 	B = 4;                  Q = P;
-	P = &B                  D = *Q;
+	P = &B;                 D = *Q;
 
 There is an obvious data dependency here, as the value loaded into D depends on
 the address retrieved from P by CPU 2.  At the end of the sequence, any of the
@@ -471,8 +469,7 @@
      operations after the ACQUIRE operation will appear to happen after the
      ACQUIRE operation with respect to the other components of the system.
      ACQUIRE operations include LOCK operations and both smp_load_acquire()
-     and smp_cond_acquire() operations.  The later builds the necessary ACQUIRE
-     semantics from relying on a control dependency and smp_rmb().
+     and smp_cond_load_acquire() operations.
 
      Memory operations that occur before an ACQUIRE operation may appear to
      happen after it completes.
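
For illustration of the smp_cond_load_acquire() operation mentioned above, a
minimal sketch (the field and flag names are hypothetical): it spins until the
condition holds and then gives the final load ACQUIRE semantics, so later
accesses cannot be reordered before it:

	/* Wait until another CPU publishes STATE_READY; VAL names the
	   freshly loaded value inside the condition expression. */
	smp_cond_load_acquire(&obj->state, VAL == STATE_READY);
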
@@ -493,10 +490,9 @@
      happen before it completes.
 
      The use of ACQUIRE and RELEASE operations generally precludes the need
-     for other sorts of memory barrier (but note the exceptions mentioned in
-     the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
-     pair is -not- guaranteed to act as a full memory barrier.  However, after
-     an ACQUIRE on a given variable, all memory accesses preceding any prior
+     for other sorts of memory barrier.  In addition, a RELEASE+ACQUIRE pair is
+     -not- guaranteed to act as a full memory barrier.  However, after an
+     ACQUIRE on a given variable, all memory accesses preceding any prior
      RELEASE on that same variable are guaranteed to be visible.  In other
      words, within a given variable's critical section, all accesses of all
      previous critical sections for that variable are guaranteed to have
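
As a sketch of this guarantee (names hypothetical), a writer can publish data
with a RELEASE store and a reader can pick it up with an ACQUIRE load; a
reader that observes the flag is then guaranteed to also observe the data
written before the RELEASE:

	/* CPU 1 */
	obj->payload = compute();
	smp_store_release(&obj->ready, 1);

	/* CPU 2 */
	if (smp_load_acquire(&obj->ready))
		do_something(obj->payload);	/* sees compute()'s result */
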
@@ -549,20 +545,20 @@
 
 [*] For information on bus mastering DMA and coherency please read:
 
-     Documentation/PCI/pci.txt
-     Documentation/DMA-API-HOWTO.txt
-     Documentation/DMA-API.txt
+     Documentation/driver-api/pci/pci.rst
+     Documentation/core-api/dma-api-howto.rst
+     Documentation/core-api/dma-api.rst
 
 
 DATA DEPENDENCY BARRIERS (HISTORICAL)
 -------------------------------------
 
-As of v4.15 of the Linux kernel, an smp_read_barrier_depends() was
-added to READ_ONCE(), which means that about the only people who
-need to pay attention to this section are those working on DEC Alpha
-architecture-specific code and those working on READ_ONCE() itself.
-For those who need it, and for those who are interested in the history,
-here is the story of data-dependency barriers.
+As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
+DEC Alpha, which means that about the only people who need to pay attention
+to this section are those working on DEC Alpha architecture-specific code
+and those working on READ_ONCE() itself.  For those who need it, and for
+those who are interested in the history, here is the story of
+data-dependency barriers.
 
 The usage requirements of data dependency barriers are a little subtle, and
 it's not always obvious that they're needed.  To illustrate, consider the
@@ -573,7 +569,7 @@
 	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
 	B = 4;
 	<write barrier>
-	WRITE_ONCE(P, &B)
+	WRITE_ONCE(P, &B);
 			      Q = READ_ONCE(P);
 			      D = *Q;
 
@@ -588,7 +584,7 @@
 
 	(Q == &B) and (D == 2) ????
 
-Whilst this may seem like a failure of coherency or causality maintenance, it
+While this may seem like a failure of coherency or causality maintenance, it
 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
 Alpha).
 
@@ -624,7 +620,7 @@
 until they are certain (1) that the write will actually happen, (2)
 of the location of the write, and (3) of the value to be written.
 But please carefully read the "CONTROL DEPENDENCIES" section and the
-Documentation/RCU/rcu_dereference.txt file:  The compiler can and does
+Documentation/RCU/rcu_dereference.rst file:  The compiler can and does
 break dependencies in a great many highly creative ways.
 
 	CPU 1		      CPU 2
@@ -1513,8 +1509,6 @@
 
   (*) CPU memory barriers.
 
-  (*) MMIO write barrier.
-
 
 COMPILER BARRIER
 ----------------
@@ -1727,7 +1721,7 @@
      and WRITE_ONCE() are more selective:  With READ_ONCE() and
      WRITE_ONCE(), the compiler need only forget the contents of the
      indicated memory locations, while with barrier() the compiler must
-     discard the value of all memory locations that it has currented
+     discard the value of all memory locations that it has currently
      cached in any machine registers.  Of course, the compiler must also
      respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
      though the CPU of course need not do so.
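
As a minimal sketch of the difference (the flag name is hypothetical), both
of the following loops force the compiler to re-read the flag on every
iteration, but READ_ONCE() does so without invalidating every other
register-cached value:

	while (!READ_ONCE(dev->ready))	/* forget only dev->ready */
		cpu_relax();

	while (!dev->ready) {
		barrier();		/* forget everything cached */
		cpu_relax();
	}
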
@@ -1839,7 +1833,7 @@
 to issue the loads in the correct order (eg. `a[b]` would have to load
 the value of b before loading a[b]), however there is no guarantee in
 the C specification that the compiler may not speculate the value of b
-(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1)
+(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
 tmp = a[b]; ).  There is also the problem of a compiler reloading b after
 having loaded a[b], thus having a newer copy of b than a[b].  A consensus
 has not yet been reached about these problems, however the READ_ONCE()
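
As a sketch of the fix alluded to above (using the same variables), loading b
with READ_ONCE() forbids the compiler from speculating its value and from
reloading it behind the dependent access:

	tmp = READ_ONCE(b);	/* b is loaded exactly once, and first */
	val = a[tmp];
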
@@ -1874,12 +1868,16 @@
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
-     These are for use with atomic (such as add, subtract, increment and
-     decrement) functions that don't return a value, especially when used for
-     reference counting.  These functions do not imply memory barriers.
+     These are for use with atomic RMW functions that do not imply memory
+     barriers, but where the code needs a memory barrier.  Examples of
+     atomic RMW functions that do not imply a memory barrier include add,
+     subtract, (failed) conditional operations and the _relaxed functions,
+     but not atomic_read() or atomic_set().  A common example where a
+     memory barrier may be required is when atomic ops are used for
+     reference counting.
 
-     These are also used for atomic bitop functions that do not return a
-     value (such as set_bit and clear_bit).
+     These are also used for atomic RMW bitop functions that do not imply a
+     memory barrier (such as set_bit and clear_bit).
 
      As an example, consider a piece of code that marks an object as being dead
      and then decrements the object's reference count:
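
Such code is essentially the following sketch (field names hypothetical):
the barrier orders the store marking the object dead before the atomic
decrement, which by itself would imply no ordering:

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);
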
@@ -1934,24 +1932,23 @@
      here.
 
      See the subsection "Kernel I/O barrier effects" for more information on
-     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
-     information on consistent memory.
+     relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
+     more information on consistent memory.
 
+ (*) pmem_wmb();
 
-MMIO WRITE BARRIER
-------------------
+     This is for use with persistent memory to ensure that stores for which
+     modifications are written to persistent storage have reached a platform
+     durability domain.
 
-The Linux kernel also has a special barrier for use with memory-mapped I/O
-writes:
+     For example, after a non-temporal write to a pmem region, we use pmem_wmb()
+     to ensure that stores have reached a platform durability domain.  This
+     ensures that stores have updated persistent storage before any data
+     access or data transfer caused by subsequent instructions is initiated.
+     This is in addition to the ordering done by wmb().
 
-	mmiowb();
-
-This is a variation on the mandatory write barrier that causes writes to weakly
-ordered I/O regions to be partially ordered.  Its effects may go beyond the
-CPU->Hardware interface and actually affect the hardware at some level.
-
-See the subsection "Acquires vs I/O accesses" for more information.
-
+     For loads from persistent memory, existing read memory barriers are
+     sufficient to ensure read ordering.
 
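
As a hedged sketch of pmem_wmb() usage (the structure and fields are
hypothetical, and the flush helper is only one way the non-temporal stores
might be issued), a driver might flush record data to a pmem region and rely
on pmem_wmb() to order it before a subsequent validity-flag store:

	memcpy_flushcache(rec->data, buf, len);	/* non-temporal stores to pmem */
	pmem_wmb();		/* data reaches the durability domain... */
	rec->valid = 1;		/* ...before this store is initiated */
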
 ===============================
 IMPLICIT KERNEL MEMORY BARRIERS
@@ -2009,7 +2006,7 @@
 
 Certain locking variants of the ACQUIRE operation may fail, either due to
 being unable to get the lock immediately, or due to receiving an unblocked
-signal whilst asleep waiting for the lock to become available.  Failed
+signal while asleep waiting for the lock to become available.  Failed
 locks do not imply any sort of barrier.
 
 [!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
@@ -2318,75 +2315,6 @@
 	*E, *F or *G following RELEASE Q
 
 
-
-ACQUIRES VS I/O ACCESSES
-------------------------
-
-Under certain circumstances (especially involving NUMA), I/O accesses within
-two spinlocked sections on two different CPUs may be seen as interleaved by the
-PCI bridge, because the PCI bridge does not necessarily participate in the
-cache-coherence protocol, and is therefore incapable of issuing the required
-read memory barriers.
-
-For example:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	writel(1, DATA);
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					writel(5, DATA);
-					spin_unlock(Q);
-
-may be seen by the PCI bridge as follows:
-
-	STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
-
-which would probably cause the hardware to malfunction.
-
-
-What is necessary here is to intervene with an mmiowb() before dropping the
-spinlock, for example:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	writel(1, DATA);
-	mmiowb();
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					writel(5, DATA);
-					mmiowb();
-					spin_unlock(Q);
-
-this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
-before either of the stores issued on CPU 2.
-
-
-Furthermore, following a store by a load from the same device obviates the need
-for the mmiowb(), because the load forces the store to complete before the load
-is performed:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	a = readl(DATA);
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					b = readl(DATA);
-					spin_unlock(Q);
-
-
-See Documentation/driver-api/device-io.rst for more information.
-
-
 =================================
 WHERE ARE MEMORY BARRIERS NEEDED?
 =================================
@@ -2509,7 +2437,7 @@
 ATOMIC OPERATIONS
 -----------------
 
-Whilst they are technically interprocessor interaction considerations, atomic
+While they are technically interprocessor interaction considerations, atomic
 operations are noted specially as some of them imply full memory barriers and
 some don't, but they're very heavily relied on as a group throughout the
 kernel.
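
For instance (a sketch using the standard atomics), a value-returning RMW
operation is fully ordered while a non-value-returning one implies no barrier
at all:

	atomic_inc(&obj->refs);			/* no implied memory barrier */

	if (atomic_dec_and_test(&obj->refs))	/* fully ordered RMW */
		kfree(obj);
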
@@ -2532,17 +2460,10 @@
 
 Inside of the Linux kernel, I/O should be done through the appropriate accessor
 routines - such as inb() or writel() - which know how to make such accesses
-appropriately sequential.  Whilst this, for the most part, renders the explicit
-use of memory barriers unnecessary, there are a couple of situations where they
-might be needed:
-
- (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
-     so for _all_ general drivers locks should be used and mmiowb() must be
-     issued prior to unlocking the critical section.
-
- (2) If the accessor functions are used to refer to an I/O memory window with
-     relaxed memory access properties, then _mandatory_ memory barriers are
-     required to enforce ordering.
+appropriately sequential.  While this, for the most part, renders the explicit
+use of memory barriers unnecessary, if the accessor functions are used to refer
+to an I/O memory window with relaxed memory access properties, then _mandatory_
+memory barriers are required to enforce ordering.
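
As a hedged sketch of the relaxed-window case (the device and register names
are hypothetical): stores through a write-combining mapping obtained with
ioremap_wc() are not ordered by the accessors themselves on all
architectures, so a mandatory barrier is needed before the doorbell write
that tells the device to look at them:

	memcpy_toio(wc_buf, frame, frame_size);	/* writes to a WC window */
	wmb();					/* mandatory barrier */
	writel(DOORBELL_KICK, regs + REG_DOORBELL);
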
 
 See Documentation/driver-api/device-io.rst for more information.
 
@@ -2556,7 +2477,7 @@
 
 This may be alleviated - at least in part - by disabling local interrupts (a
 form of locking), such that the critical operations are all contained within
-the interrupt-disabled section in the driver.  Whilst the driver's interrupt
+the interrupt-disabled section in the driver.  While the driver's interrupt
 routine is executing, the driver's core may not run on the same CPU, and its
 interrupt is not permitted to happen again until the current interrupt has been
 handled, thus the interrupt handler does not need to lock against that.
@@ -2587,8 +2508,7 @@
 
 Normally this won't be a problem because the I/O accesses done inside such
 sections will include synchronous load operations on strictly ordered I/O
-registers that form implicit I/O barriers.  If this isn't sufficient then an
-mmiowb() may need to be used explicitly.
+registers that form implicit I/O barriers.
 
 
 A similar situation may occur between an interrupt routine and two routines
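
A minimal sketch of such an implicit barrier (register names hypothetical):
reading back a strictly ordered register cannot complete until earlier posted
writes to the same device have arrived, so the read doubles as a flush:

	writel(mask, regs + REG_IRQ_MASK);
	readl(regs + REG_IRQ_MASK);	/* flush the posted write */
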
@@ -2600,71 +2520,114 @@
 KERNEL I/O BARRIER EFFECTS
 ==========================
 
-When accessing I/O memory, drivers should use the appropriate accessor
-functions:
-
- (*) inX(), outX():
-
-     These are intended to talk to I/O space rather than memory space, but
-     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
-     do indeed have special I/O space access cycles and instructions, but many
-     CPUs don't have such a concept.
-
-     The PCI bus, amongst others, defines an I/O space concept which - on such
-     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
-     space.  However, it may also be mapped as a virtual I/O space in the CPU's
-     memory map, particularly on those CPUs that don't support alternate I/O
-     spaces.
-
-     Accesses to this space may be fully synchronous (as on i386), but
-     intermediary bridges (such as the PCI host bridge) may not fully honour
-     that.
-
-     They are guaranteed to be fully ordered with respect to each other.
-
-     They are not guaranteed to be fully ordered with respect to other types of
-     memory and I/O operation.
+Interfacing with peripherals via I/O accesses is deeply architecture and device
+specific.  Therefore, drivers which are inherently non-portable may rely on
+specific behaviours of their target systems in order to achieve synchronization
+in the most lightweight manner possible.  For drivers intending to be portable
+between multiple architectures and bus implementations, the kernel offers a
+series of accessor functions that provide various degrees of ordering
+guarantees:
 
  (*) readX(), writeX():
 
-     Whether these are guaranteed to be fully ordered and uncombined with
-     respect to each other on the issuing CPU depends on the characteristics
-     defined for the memory window through which they're accessing.  On later
-     i386 architecture machines, for example, this is controlled by way of the
-     MTRR registers.
+     The readX() and writeX() MMIO accessors take a pointer to the
+     peripheral being accessed as an __iomem * parameter.  For pointers
+     mapped with the default I/O attributes (e.g. those returned by
+     ioremap()), the ordering guarantees are as follows:
 
-     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
-     provided they're not accessing a prefetchable device.
+     1. All readX() and writeX() accesses to the same peripheral are ordered
+        with respect to each other.  This ensures that MMIO register accesses
+        by the same CPU thread to a particular device will arrive in program
+        order.
 
-     However, intermediary hardware (such as a PCI bridge) may indulge in
-     deferral if it so wishes; to flush a store, a load from the same location
-     is preferred[*], but a load from the same device or from configuration
-     space should suffice for PCI.
+     2. A writeX() issued by a CPU thread holding a spinlock is ordered
+        before a writeX() to the same peripheral from another CPU thread
+        issued after a later acquisition of the same spinlock.  This ensures
+        that MMIO register writes to a particular device issued while holding
+        a spinlock will arrive in an order consistent with acquisitions of
+        the lock.
 
-     [*] NOTE! attempting to load from the same location as was written to may
-         cause a malfunction - consider the 16550 Rx/Tx serial registers for
-         example.
+     3. A writeX() by a CPU thread to the peripheral will first wait for the
+        completion of all prior writes to memory either issued by, or
+        propagated to, the same thread.  This ensures that writes by the CPU
+        to an outbound DMA buffer allocated by dma_alloc_coherent() will be
+        visible to a DMA engine when the CPU writes to its MMIO control
+        register to trigger the transfer.
 
-     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
-     force stores to be ordered.
+     4. A readX() by a CPU thread from the peripheral will complete before
+        any subsequent reads from memory by the same thread can begin.  This
+        ensures that reads by the CPU from an incoming DMA buffer allocated
+        by dma_alloc_coherent() will not see stale data after reading from
+        the DMA engine's MMIO status register to establish that the DMA
+        transfer has completed.
 
-     Please refer to the PCI specification for more information on interactions
-     between PCI transactions.
+     5. A readX() by a CPU thread from the peripheral will complete before
+        any subsequent delay() loop can begin execution on the same thread.
+        This ensures that two MMIO register writes by the CPU to a peripheral
+        will arrive at least 1us apart if the first write is immediately read
+        back with readX() and udelay(1) is called prior to the second
+        writeX():
 
-     (*) readX_relaxed(), writeX_relaxed()
+        writel(42, DEVICE_REGISTER_0); // Arrives at the device...
+        readl(DEVICE_REGISTER_0);
+        udelay(1);
+        writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
 
-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees.  Specifically, they do not guarantee ordering with
-     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
-     ordering with respect to LOCK or UNLOCK operations.  If the latter is
-     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
-     the same peripheral are guaranteed to be ordered with respect to each
-     other.
+     The ordering properties of __iomem pointers obtained with non-default
+     attributes (e.g. those returned by ioremap_wc()) are specific to the
+     underlying architecture and therefore the guarantees listed above cannot
+     generally be relied upon for accesses to these types of mappings.
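
For illustration of guarantee 3 above (the device, register names and
descriptor layout are hypothetical), no explicit barrier is needed between
filling a coherent DMA descriptor and writing the control register that
starts the transfer:

	desc = dma_alloc_coherent(dev, sizeof(*desc), &dma_handle, GFP_KERNEL);

	desc->addr = cpu_to_le64(buf_dma);	/* plain stores to the buffer... */
	desc->len = cpu_to_le32(buf_len);
	writel(DMA_START, regs + REG_CTRL);	/* ...are visible to the device first */
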
 
-     (*) ioreadX(), iowriteX()
+ (*) readX_relaxed(), writeX_relaxed():
 
-     These will perform appropriately for the type of access they're actually
-     doing, be it inX()/outX() or readX()/writeX().
+     These are similar to readX() and writeX(), but provide weaker memory
+     ordering guarantees.  Specifically, they do not guarantee ordering with
+     respect to locking, normal memory accesses or delay() loops (i.e.
+     bullets 2-5 above) but they are still guaranteed to be ordered with
+     respect to other accesses from the same CPU thread to the same
+     peripheral when operating on __iomem pointers mapped with the default
+     I/O attributes.
+
+ (*) readsX(), writesX():
+
+     The readsX() and writesX() MMIO accessors are designed for accessing
+     register-based, memory-mapped FIFOs residing on peripherals that are not
+     capable of performing DMA.  Consequently, they provide only the ordering
+     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
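
A short sketch of the string accessors (the register offset is hypothetical),
draining a 32-bit FIFO into a buffer with repeated reads of one register:

	/* Read 'words' consecutive 32-bit values from the same register. */
	readsl(regs + REG_FIFO_DATA, buf, words);
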
+
+ (*) inX(), outX():
+
+     The inX() and outX() accessors are intended to access legacy port-mapped
+     I/O peripherals, which may require special instructions on some
+     architectures (notably x86).  The port number of the peripheral being
+     accessed is passed as an argument.
+
+     Since many CPU architectures ultimately access these peripherals via an
+     internal virtual memory mapping, the portable ordering guarantees
+     provided by inX() and outX() are the same as those provided by readX()
+     and writeX() respectively when accessing a mapping with the default I/O
+     attributes.
+
+     Device drivers may expect outX() to emit a non-posted write transaction
+     that waits for a completion response from the I/O peripheral before
+     returning.  This is not guaranteed by all architectures and is therefore
+     not part of the portable ordering semantics.
+
+ (*) insX(), outsX():
+
+     As above, the insX() and outsX() accessors provide the same ordering
+     guarantees as readsX() and writesX() respectively when accessing a
+     mapping with the default I/O attributes.
+
+ (*) ioreadX(), iowriteX():
+
+     These will perform appropriately for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
+
+With the exception of the string accessors (insX(), outsX(), readsX() and
+writesX()), all of the above assume that the underlying peripheral is
+little-endian and will therefore perform byte-swapping operations on big-endian
+architectures.
 
 
 ========================================
@@ -2759,144 +2722,6 @@
 the use of any special device communication instructions the CPU may have.
 
 
-CACHE COHERENCY
----------------
-
-Life isn't quite as simple as it may appear above, however: for while the
-caches are expected to be coherent, there's no guarantee that that coherency
-will be ordered.  This means that whilst changes made on one CPU will
-eventually become visible on all CPUs, there's no guarantee that they will
-become apparent in the same order on those other CPUs.
-
-
-Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
-has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
-
-	            :
-	            :                          +--------+
-	            :      +---------+         |        |
-	+--------+  : +--->| Cache A |<------->|        |
-	|        |  : |    +---------+         |        |
-	|  CPU 1 |<---+                        |        |
-	|        |  : |    +---------+         |        |
-	+--------+  : +--->| Cache B |<------->|        |
-	            :      +---------+         |        |
-	            :                          | Memory |
-	            :      +---------+         | System |
-	+--------+  : +--->| Cache C |<------->|        |
-	|        |  : |    +---------+         |        |
-	|  CPU 2 |<---+                        |        |
-	|        |  : |    +---------+         |        |
-	+--------+  : +--->| Cache D |<------->|        |
-	            :      +---------+         |        |
-	            :                          +--------+
-	            :
-
-Imagine the system has the following properties:
-
- (*) an odd-numbered cache line may be in cache A, cache C or it may still be
-     resident in memory;
-
- (*) an even-numbered cache line may be in cache B, cache D or it may still be
-     resident in memory;
-
- (*) whilst the CPU core is interrogating one cache, the other cache may be
-     making use of the bus to access the rest of the system - perhaps to
-     displace a dirty cacheline or to do a speculative load;
-
- (*) each cache has a queue of operations that need to be applied to that cache
-     to maintain coherency with the rest of the system;
-
- (*) the coherency queue is not flushed by normal loads to lines already
-     present in the cache, even though the contents of the queue may
-     potentially affect those loads.
-
-Imagine, then, that two writes are made on the first CPU, with a write barrier
-between them to guarantee that they will appear to reach that CPU's caches in
-the requisite order:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-					u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();			Make sure change to v is visible before
-					 change to p
-	<A:modify v=2>			v is now in cache A exclusively
-	p = &v;
-	<B:modify p=&v>			p is now in cache B exclusively
-
-The write memory barrier forces the other CPUs in the system to perceive that
-the local CPU's caches have apparently been updated in the correct order.  But
-now imagine that the second CPU wants to read those values:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-	...
-			q = p;
-			x = *q;
-
-The above pair of reads may then fail to happen in the expected order, as the
-cacheline holding p may get updated in one of the second CPU's caches whilst
-the update to the cacheline holding v is delayed in the other of the second
-CPU's caches by some other cache event:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-					u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();
-	<A:modify v=2>	<C:busy>
-			<C:queue v=2>
-	p = &v;		q = p;
-			<D:request p>
-	<B:modify p=&v>	<D:commit p=&v>
-			<D:read p>
-			x = *q;
-			<C:read *q>	Reads from v before v updated in cache
-			<C:unbusy>
-			<C:commit v=2>
-
-Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
-no guarantee that, without intervention, the order of update will be the same
-as that committed on CPU 1.
-
-
-To intervene, we need to interpolate a data dependency barrier or a read
-barrier between the loads (which as of v4.15 is supplied unconditionally
-by the READ_ONCE() macro).  This will force the cache to commit its
-coherency queue before processing any further requests:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-					u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();
-	<A:modify v=2>	<C:busy>
-			<C:queue v=2>
-	p = &v;		q = p;
-			<D:request p>
-	<B:modify p=&v>	<D:commit p=&v>
-			<D:read p>
-			smp_read_barrier_depends()
-			<C:unbusy>
-			<C:commit v=2>
-			x = *q;
-			<C:read *q>	Reads from v after v updated in cache
-
-
-This sort of problem can be encountered on DEC Alpha processors as they have a
-split cache that improves performance by making better use of the data bus.
-Whilst most CPUs do imply a data dependency barrier on the read when a memory
-access depends on a read, not all do, so it may not be relied on.
-
-Other CPUs may also have split caches, but must coordinate between the various
-cachelets for normal memory accesses.  The semantics of the Alpha removes the
-need for hardware coordination in the absence of memory barriers, which
-permitted Alpha to sport higher CPU clock rates back in the day.  However,
-please note that (again, as of v4.15) smp_read_barrier_depends() should not
-be used except in Alpha arch-specific code and within the READ_ONCE() macro.
-
-
 CACHE COHERENCY VS DMA
 ----------------------
 
@@ -2975,7 +2800,7 @@
      thus cutting down on transaction setup costs (memory and PCI devices may
      both be able to do this); and
 
- (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
+ (*) the CPU's data cache may affect the ordering, and while cache-coherency
      mechanisms may alleviate this - once the store has actually hit the cache
      - there's no guarantee that the coherency management will be propagated in
      order to other CPUs.
@@ -3060,10 +2885,8 @@
 changes vs new data occur in the right order.
 
 The Alpha defines the Linux kernel's memory model, although as of v4.15
-the Linux kernel's addition of smp_read_barrier_depends() to READ_ONCE()
-greatly reduced Alpha's impact on the memory model.
-
-See the subsection on "Cache Coherency" above.
+the Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly
+reduced its impact on the memory model.
 
 
 VIRTUAL MACHINE GUESTS
---|