| .. | .. |
|---|
| 3 | 3 | ============================ |
|---|
| 4 | 4 | |
|---|
| 5 | 5 | By: David Howells <dhowells@redhat.com> |
|---|
| 6 | | - Paul E. McKenney <paulmck@linux.vnet.ibm.com> |
|---|
| 6 | + Paul E. McKenney <paulmck@linux.ibm.com> |
|---|
| 7 | 7 | Will Deacon <will.deacon@arm.com> |
|---|
| 8 | 8 | Peter Zijlstra <peterz@infradead.org> |
|---|
| 9 | 9 | |
|---|
| .. | .. |
|---|
| 63 | 63 | |
|---|
| 64 | 64 | - Compiler barrier. |
|---|
| 65 | 65 | - CPU memory barriers. |
|---|
| 66 | | - - MMIO write barrier. |
|---|
| 67 | 66 | |
|---|
| 68 | 67 | (*) Implicit kernel memory barriers. |
|---|
| 69 | 68 | |
|---|
| .. | .. |
|---|
| 75 | 74 | (*) Inter-CPU acquiring barrier effects. |
|---|
| 76 | 75 | |
|---|
| 77 | 76 | - Acquires vs memory accesses. |
|---|
| 78 | | - - Acquires vs I/O accesses. |
|---|
| 79 | 77 | |
|---|
| 80 | 78 | (*) Where are memory barriers needed? |
|---|
| 81 | 79 | |
|---|
| .. | .. |
|---|
| 187 | 185 | =============== =============== |
|---|
| 188 | 186 | { A == 1, B == 2, C == 3, P == &A, Q == &C } |
|---|
| 189 | 187 | B = 4; Q = P; |
|---|
| 190 | | - P = &B D = *Q; |
|---|
| 188 | + P = &B; D = *Q; |
|---|
| 191 | 189 | |
|---|
| 192 | 190 | There is an obvious data dependency here, as the value loaded into D depends on |
|---|
| 193 | 191 | the address retrieved from P by CPU 2. At the end of the sequence, any of the |
|---|
| .. | .. |
|---|
| 471 | 469 | operations after the ACQUIRE operation will appear to happen after the |
|---|
| 472 | 470 | ACQUIRE operation with respect to the other components of the system. |
|---|
| 473 | 471 | ACQUIRE operations include LOCK operations and both smp_load_acquire() |
|---|
| 474 | | - and smp_cond_acquire() operations. The later builds the necessary ACQUIRE |
|---|
| 475 | | - semantics from relying on a control dependency and smp_rmb(). |
|---|
| 472 | + and smp_cond_load_acquire() operations. |
|---|
| 476 | 473 | |
|---|
| 477 | 474 | Memory operations that occur before an ACQUIRE operation may appear to |
|---|
| 478 | 475 | happen after it completes. |
|---|
| .. | .. |
|---|
| 493 | 490 | happen before it completes. |
|---|
| 494 | 491 | |
|---|
| 495 | 492 | The use of ACQUIRE and RELEASE operations generally precludes the need |
|---|
| 496 | | - for other sorts of memory barrier (but note the exceptions mentioned in |
|---|
| 497 | | - the subsection "MMIO write barrier"). In addition, a RELEASE+ACQUIRE |
|---|
| 498 | | - pair is -not- guaranteed to act as a full memory barrier. However, after |
|---|
| 499 | | - an ACQUIRE on a given variable, all memory accesses preceding any prior |
|---|
| 493 | + for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is |
|---|
| 494 | + -not- guaranteed to act as a full memory barrier. However, after an |
|---|
| 495 | + ACQUIRE on a given variable, all memory accesses preceding any prior |
|---|
| 500 | 496 | RELEASE on that same variable are guaranteed to be visible. In other |
|---|
| 501 | 497 | words, within a given variable's critical section, all accesses of all |
|---|
| 502 | 498 | previous critical sections for that variable are guaranteed to have |
|---|
| .. | .. |
|---|
| 549 | 545 | |
|---|
| 550 | 546 | [*] For information on bus mastering DMA and coherency please read: |
|---|
| 551 | 547 | |
|---|
| 552 | | - Documentation/PCI/pci.txt |
|---|
| 553 | | - Documentation/DMA-API-HOWTO.txt |
|---|
| 554 | | - Documentation/DMA-API.txt |
|---|
| 548 | + Documentation/driver-api/pci/pci.rst |
|---|
| 549 | + Documentation/core-api/dma-api-howto.rst |
|---|
| 550 | + Documentation/core-api/dma-api.rst |
|---|
| 555 | 551 | |
|---|
| 556 | 552 | |
|---|
| 557 | 553 | DATA DEPENDENCY BARRIERS (HISTORICAL) |
|---|
| 558 | 554 | ------------------------------------- |
|---|
| 559 | 555 | |
|---|
| 560 | | -As of v4.15 of the Linux kernel, an smp_read_barrier_depends() was |
|---|
| 561 | | -added to READ_ONCE(), which means that about the only people who |
|---|
| 562 | | -need to pay attention to this section are those working on DEC Alpha |
|---|
| 563 | | -architecture-specific code and those working on READ_ONCE() itself. |
|---|
| 564 | | -For those who need it, and for those who are interested in the history, |
|---|
| 565 | | -here is the story of data-dependency barriers. |
|---|
| 556 | +As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for |
|---|
| 557 | +DEC Alpha, which means that about the only people who need to pay attention |
|---|
| 558 | +to this section are those working on DEC Alpha architecture-specific code |
|---|
| 559 | +and those working on READ_ONCE() itself. For those who need it, and for |
|---|
| 560 | +those who are interested in the history, here is the story of |
|---|
| 561 | +data-dependency barriers. |
|---|
| 566 | 562 | |
|---|
| 567 | 563 | The usage requirements of data dependency barriers are a little subtle, and |
|---|
| 568 | 564 | it's not always obvious that they're needed. To illustrate, consider the |
|---|
| .. | .. |
|---|
| 573 | 569 | { A == 1, B == 2, C == 3, P == &A, Q == &C } |
|---|
| 574 | 570 | B = 4; |
|---|
| 575 | 571 | <write barrier> |
|---|
| 576 | | - WRITE_ONCE(P, &B) |
|---|
| 572 | + WRITE_ONCE(P, &B); |
|---|
| 577 | 573 | Q = READ_ONCE(P); |
|---|
| 578 | 574 | D = *Q; |
|---|
| 579 | 575 | |
|---|
| .. | .. |
|---|
| 588 | 584 | |
|---|
| 589 | 585 | (Q == &B) and (D == 2) ???? |
|---|
| 590 | 586 | |
|---|
| 591 | | -Whilst this may seem like a failure of coherency or causality maintenance, it |
|---|
| 587 | +While this may seem like a failure of coherency or causality maintenance, it |
|---|
| 592 | 588 | isn't, and this behaviour can be observed on certain real CPUs (such as the DEC |
|---|
| 593 | 589 | Alpha). |
|---|
| 594 | 590 | |
|---|
| .. | .. |
|---|
| 624 | 620 | until they are certain (1) that the write will actually happen, (2) |
|---|
| 625 | 621 | of the location of the write, and (3) of the value to be written. |
|---|
| 626 | 622 | But please carefully read the "CONTROL DEPENDENCIES" section and the |
|---|
| 627 | | -Documentation/RCU/rcu_dereference.txt file: The compiler can and does |
|---|
| 623 | +Documentation/RCU/rcu_dereference.rst file: The compiler can and does |
|---|
| 628 | 624 | break dependencies in a great many highly creative ways. |
|---|
| 629 | 625 | |
|---|
| 630 | 626 | CPU 1 CPU 2 |
|---|
| .. | .. |
|---|
| 1513 | 1509 | |
|---|
| 1514 | 1510 | (*) CPU memory barriers. |
|---|
| 1515 | 1511 | |
|---|
| 1516 | | - (*) MMIO write barrier. |
|---|
| 1517 | | - |
|---|
| 1518 | 1512 | |
|---|
| 1519 | 1513 | COMPILER BARRIER |
|---|
| 1520 | 1514 | ---------------- |
|---|
| .. | .. |
|---|
| 1727 | 1721 | and WRITE_ONCE() are more selective: With READ_ONCE() and |
|---|
| 1728 | 1722 | WRITE_ONCE(), the compiler need only forget the contents of the |
|---|
| 1729 | 1723 | indicated memory locations, while with barrier() the compiler must |
|---|
| 1730 | | - discard the value of all memory locations that it has currented |
|---|
| 1724 | + discard the value of all memory locations that it has currently |
|---|
| 1731 | 1725 | cached in any machine registers. Of course, the compiler must also |
|---|
| 1732 | 1726 | respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur, |
|---|
| 1733 | 1727 | though the CPU of course need not do so. |
|---|
| .. | .. |
|---|
| 1839 | 1833 | to issue the loads in the correct order (eg. `a[b]` would have to load |
|---|
| 1840 | 1834 | the value of b before loading a[b]), however there is no guarantee in |
|---|
| 1841 | 1835 | the C specification that the compiler may not speculate the value of b |
|---|
| 1842 | | -(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1) |
|---|
| 1836 | +(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1) |
|---|
| 1843 | 1837 | tmp = a[b]; ). There is also the problem of a compiler reloading b after |
|---|
| 1844 | 1838 | having loaded a[b], thus having a newer copy of b than a[b]. A consensus |
|---|
| 1845 | 1839 | has not yet been reached about these problems, however the READ_ONCE() |
|---|
| .. | .. |
|---|
| 1874 | 1868 | (*) smp_mb__before_atomic(); |
|---|
| 1875 | 1869 | (*) smp_mb__after_atomic(); |
|---|
| 1876 | 1870 | |
|---|
| 1877 | | - These are for use with atomic (such as add, subtract, increment and |
|---|
| 1878 | | - decrement) functions that don't return a value, especially when used for |
|---|
| 1879 | | - reference counting. These functions do not imply memory barriers. |
|---|
| 1871 | + These are for use with atomic RMW functions that do not imply memory |
|---|
| 1872 | + barriers, but where the code needs a memory barrier. Examples of atomic |
|---|
| 1873 | + RMW functions that do not imply a memory barrier are e.g. add, |
|---|
| 1874 | + subtract, (failed) conditional operations, _relaxed functions, |
|---|
| 1875 | + but not atomic_read or atomic_set. A common example where a memory |
|---|
| 1876 | + barrier may be required is when atomic ops are used for reference |
|---|
| 1877 | + counting. |
|---|
| 1880 | 1878 | |
|---|
| 1881 | | - These are also used for atomic bitop functions that do not return a |
|---|
| 1882 | | - value (such as set_bit and clear_bit). |
|---|
| 1879 | + These are also used for atomic RMW bitop functions that do not imply a |
|---|
| 1880 | + memory barrier (such as set_bit and clear_bit). |
|---|
| 1883 | 1881 | |
|---|
| 1884 | 1882 | As an example, consider a piece of code that marks an object as being dead |
|---|
| 1885 | 1883 | and then decrements the object's reference count: |
|---|
| .. | .. |
|---|
| 1934 | 1932 | here. |
|---|
| 1935 | 1933 | |
|---|
| 1936 | 1934 | See the subsection "Kernel I/O barrier effects" for more information on |
|---|
| 1937 | | - relaxed I/O accessors and the Documentation/DMA-API.txt file for more |
|---|
| 1938 | | - information on consistent memory. |
|---|
| 1935 | + relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for |
|---|
| 1936 | + more information on consistent memory. |
|---|
| 1939 | 1937 | |
|---|
| 1938 | + (*) pmem_wmb(); |
|---|
| 1940 | 1939 | |
|---|
| 1941 | | -MMIO WRITE BARRIER |
|---|
| 1942 | | ------------------- |
|---|
| 1940 | + This is for use with persistent memory to ensure that stores whose |
|---|
| 1941 | + modifications are written to persistent storage have reached a platform |
|---|
| 1942 | + durability domain. |
|---|
| 1943 | 1943 | |
|---|
| 1944 | | -The Linux kernel also has a special barrier for use with memory-mapped I/O |
|---|
| 1945 | | -writes: |
|---|
| 1944 | + For example, after a non-temporal write to a pmem region, we use pmem_wmb() |
|---|
| 1945 | + to ensure that stores have reached a platform durability domain. This ensures |
|---|
| 1946 | + that stores have updated persistent storage before any data access or |
|---|
| 1947 | + data transfer caused by subsequent instructions is initiated. This is |
|---|
| 1948 | + in addition to the ordering done by wmb(). |
|---|
| 1946 | 1949 | |
|---|
| 1947 | | - mmiowb(); |
|---|
| 1948 | | - |
|---|
| 1949 | | -This is a variation on the mandatory write barrier that causes writes to weakly |
|---|
| 1950 | | -ordered I/O regions to be partially ordered. Its effects may go beyond the |
|---|
| 1951 | | -CPU->Hardware interface and actually affect the hardware at some level. |
|---|
| 1952 | | - |
|---|
| 1953 | | -See the subsection "Acquires vs I/O accesses" for more information. |
|---|
| 1954 | | - |
|---|
| 1950 | + For loads from persistent memory, existing read memory barriers are sufficient |
|---|
| 1951 | + to ensure read ordering. |
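
 For illustration, here is a minimal, hedged sketch (the pmem_buf, record and
 doorbell register below are hypothetical and not taken from this document;
 memcpy_flushcache() is assumed to be available) of persisting a record
 before telling a device to consume it:

	memcpy_flushcache(pmem_buf, &record, sizeof(record));
	pmem_wmb();		/* the record reaches the durability domain... */
	writel(RECORD_READY, dev->doorbell);	/* ...before the device is told */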
|---|
| 1955 | 1952 | |
|---|
| 1956 | 1953 | =============================== |
|---|
| 1957 | 1954 | IMPLICIT KERNEL MEMORY BARRIERS |
|---|
| .. | .. |
|---|
| 2009 | 2006 | |
|---|
| 2010 | 2007 | Certain locking variants of the ACQUIRE operation may fail, either due to |
|---|
| 2011 | 2008 | being unable to get the lock immediately, or due to receiving an unblocked |
|---|
| 2012 | | - signal whilst asleep waiting for the lock to become available. Failed |
|---|
| 2009 | + signal while asleep waiting for the lock to become available. Failed |
|---|
| 2013 | 2010 | locks do not imply any sort of barrier. |
|---|
| 2014 | 2011 | |
|---|
| 2015 | 2012 | [!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only |
|---|
| .. | .. |
|---|
| 2318 | 2315 | *E, *F or *G following RELEASE Q |
|---|
| 2319 | 2316 | |
|---|
| 2320 | 2317 | |
|---|
| 2321 | | - |
|---|
| 2322 | | -ACQUIRES VS I/O ACCESSES |
|---|
| 2323 | | ------------------------- |
|---|
| 2324 | | - |
|---|
| 2325 | | -Under certain circumstances (especially involving NUMA), I/O accesses within |
|---|
| 2326 | | -two spinlocked sections on two different CPUs may be seen as interleaved by the |
|---|
| 2327 | | -PCI bridge, because the PCI bridge does not necessarily participate in the |
|---|
| 2328 | | -cache-coherence protocol, and is therefore incapable of issuing the required |
|---|
| 2329 | | -read memory barriers. |
|---|
| 2330 | | - |
|---|
| 2331 | | -For example: |
|---|
| 2332 | | - |
|---|
| 2333 | | - CPU 1 CPU 2 |
|---|
| 2334 | | - =============================== =============================== |
|---|
| 2335 | | - spin_lock(Q) |
|---|
| 2336 | | - writel(0, ADDR) |
|---|
| 2337 | | - writel(1, DATA); |
|---|
| 2338 | | - spin_unlock(Q); |
|---|
| 2339 | | - spin_lock(Q); |
|---|
| 2340 | | - writel(4, ADDR); |
|---|
| 2341 | | - writel(5, DATA); |
|---|
| 2342 | | - spin_unlock(Q); |
|---|
| 2343 | | - |
|---|
| 2344 | | -may be seen by the PCI bridge as follows: |
|---|
| 2345 | | - |
|---|
| 2346 | | - STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5 |
|---|
| 2347 | | - |
|---|
| 2348 | | -which would probably cause the hardware to malfunction. |
|---|
| 2349 | | - |
|---|
| 2350 | | - |
|---|
| 2351 | | -What is necessary here is to intervene with an mmiowb() before dropping the |
|---|
| 2352 | | -spinlock, for example: |
|---|
| 2353 | | - |
|---|
| 2354 | | - CPU 1 CPU 2 |
|---|
| 2355 | | - =============================== =============================== |
|---|
| 2356 | | - spin_lock(Q) |
|---|
| 2357 | | - writel(0, ADDR) |
|---|
| 2358 | | - writel(1, DATA); |
|---|
| 2359 | | - mmiowb(); |
|---|
| 2360 | | - spin_unlock(Q); |
|---|
| 2361 | | - spin_lock(Q); |
|---|
| 2362 | | - writel(4, ADDR); |
|---|
| 2363 | | - writel(5, DATA); |
|---|
| 2364 | | - mmiowb(); |
|---|
| 2365 | | - spin_unlock(Q); |
|---|
| 2366 | | - |
|---|
| 2367 | | -this will ensure that the two stores issued on CPU 1 appear at the PCI bridge |
|---|
| 2368 | | -before either of the stores issued on CPU 2. |
|---|
| 2369 | | - |
|---|
| 2370 | | - |
|---|
| 2371 | | -Furthermore, following a store by a load from the same device obviates the need |
|---|
| 2372 | | -for the mmiowb(), because the load forces the store to complete before the load |
|---|
| 2373 | | -is performed: |
|---|
| 2374 | | - |
|---|
| 2375 | | - CPU 1 CPU 2 |
|---|
| 2376 | | - =============================== =============================== |
|---|
| 2377 | | - spin_lock(Q) |
|---|
| 2378 | | - writel(0, ADDR) |
|---|
| 2379 | | - a = readl(DATA); |
|---|
| 2380 | | - spin_unlock(Q); |
|---|
| 2381 | | - spin_lock(Q); |
|---|
| 2382 | | - writel(4, ADDR); |
|---|
| 2383 | | - b = readl(DATA); |
|---|
| 2384 | | - spin_unlock(Q); |
|---|
| 2385 | | - |
|---|
| 2386 | | - |
|---|
| 2387 | | -See Documentation/driver-api/device-io.rst for more information. |
|---|
| 2388 | | - |
|---|
| 2389 | | - |
|---|
| 2390 | 2318 | ================================= |
|---|
| 2391 | 2319 | WHERE ARE MEMORY BARRIERS NEEDED? |
|---|
| 2392 | 2320 | ================================= |
|---|
| .. | .. |
|---|
| 2509 | 2437 | ATOMIC OPERATIONS |
|---|
| 2510 | 2438 | ----------------- |
|---|
| 2511 | 2439 | |
|---|
| 2512 | | -Whilst they are technically interprocessor interaction considerations, atomic |
|---|
| 2440 | +While they are technically interprocessor interaction considerations, atomic |
|---|
| 2513 | 2441 | operations are noted specially as some of them imply full memory barriers and |
|---|
| 2514 | 2442 | some don't, but they're very heavily relied on as a group throughout the |
|---|
| 2515 | 2443 | kernel. |
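
As a rough, hedged illustration (Documentation/atomic_t.txt is the
authoritative reference; the atomic_t 'v' and the int 'old' below are
illustrative), atomic RMW operations that return a value are fully ordered,
while those that return nothing imply no ordering and may need an explicit
barrier:

	atomic_inc(&v);			/* RMW, no return value: no ordering implied */
	smp_mb__after_atomic();		/* ...so add ordering explicitly if needed */

	old = atomic_inc_return(&v);	/* RMW returning a value: fully ordered */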
|---|
| .. | .. |
|---|
| 2532 | 2460 | |
|---|
| 2533 | 2461 | Inside of the Linux kernel, I/O should be done through the appropriate accessor |
|---|
| 2534 | 2462 | routines - such as inb() or writel() - which know how to make such accesses |
|---|
| 2535 | | -appropriately sequential. Whilst this, for the most part, renders the explicit |
|---|
| 2536 | | -use of memory barriers unnecessary, there are a couple of situations where they |
|---|
| 2537 | | -might be needed: |
|---|
| 2538 | | - |
|---|
| 2539 | | - (1) On some systems, I/O stores are not strongly ordered across all CPUs, and |
|---|
| 2540 | | - so for _all_ general drivers locks should be used and mmiowb() must be |
|---|
| 2541 | | - issued prior to unlocking the critical section. |
|---|
| 2542 | | - |
|---|
| 2543 | | - (2) If the accessor functions are used to refer to an I/O memory window with |
|---|
| 2544 | | - relaxed memory access properties, then _mandatory_ memory barriers are |
|---|
| 2545 | | - required to enforce ordering. |
|---|
| 2463 | +appropriately sequential. While this, for the most part, renders the explicit |
|---|
| 2464 | +use of memory barriers unnecessary, if the accessor functions are used to refer |
|---|
| 2465 | +to an I/O memory window with relaxed memory access properties, then _mandatory_ |
|---|
| 2466 | +memory barriers are required to enforce ordering. |
|---|
| 2546 | 2467 | |
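As a hedged sketch of that last point (the write-combining frame buffer
mapping and the control register below are hypothetical), a mandatory
barrier can be used to order accesses made through such a window:

	fb = ioremap_wc(fb_phys, fb_size);	/* relaxed, write-combining window */

	memcpy_toio(fb, frame, frame_size);	/* weakly ordered stores to the window */
	wmb();					/* mandatory barrier orders them... */
	writel(FB_FLIP, ctrl_reg);		/* ...before the device is kicked */
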
|---|
| 2547 | 2468 | See Documentation/driver-api/device-io.rst for more information. |
|---|
| 2548 | 2469 | |
|---|
| .. | .. |
|---|
| 2556 | 2477 | |
|---|
| 2557 | 2478 | This may be alleviated - at least in part - by disabling local interrupts (a |
|---|
| 2558 | 2479 | form of locking), such that the critical operations are all contained within |
|---|
| 2559 | | -the interrupt-disabled section in the driver. Whilst the driver's interrupt |
|---|
| 2480 | +the interrupt-disabled section in the driver. While the driver's interrupt |
|---|
| 2560 | 2481 | routine is executing, the driver's core may not run on the same CPU, and its |
|---|
| 2561 | 2482 | interrupt is not permitted to happen again until the current interrupt has been |
|---|
| 2562 | 2483 | handled, thus the interrupt handler does not need to lock against that. |
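
A minimal sketch, assuming a hypothetical device with two MMIO registers
that are also touched by its interrupt handler (this only excludes the
local CPU's interrupts, matching the "at least in part" caveat above):

	unsigned long flags;

	local_irq_save(flags);		/* the handler cannot run on this CPU... */
	writew(val, dev->data_reg);	/* ...so these two accesses cannot be */
	writew(CMD_GO, dev->cmd_reg);	/* interleaved with those in the handler */
	local_irq_restore(flags);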
|---|
| .. | .. |
|---|
| 2587 | 2508 | |
|---|
| 2588 | 2509 | Normally this won't be a problem because the I/O accesses done inside such |
|---|
| 2589 | 2510 | sections will include synchronous load operations on strictly ordered I/O |
|---|
| 2590 | | -registers that form implicit I/O barriers. If this isn't sufficient then an |
|---|
| 2591 | | -mmiowb() may need to be used explicitly. |
|---|
| 2511 | +registers that form implicit I/O barriers. |
|---|
| 2592 | 2512 | |
|---|
| 2593 | 2513 | |
|---|
| 2594 | 2514 | A similar situation may occur between an interrupt routine and two routines |
|---|
| .. | .. |
|---|
| 2600 | 2520 | KERNEL I/O BARRIER EFFECTS |
|---|
| 2601 | 2521 | ========================== |
|---|
| 2602 | 2522 | |
|---|
| 2603 | | -When accessing I/O memory, drivers should use the appropriate accessor |
|---|
| 2604 | | -functions: |
|---|
| 2605 | | - |
|---|
| 2606 | | - (*) inX(), outX(): |
|---|
| 2607 | | - |
|---|
| 2608 | | - These are intended to talk to I/O space rather than memory space, but |
|---|
| 2609 | | - that's primarily a CPU-specific concept. The i386 and x86_64 processors |
|---|
| 2610 | | - do indeed have special I/O space access cycles and instructions, but many |
|---|
| 2611 | | - CPUs don't have such a concept. |
|---|
| 2612 | | - |
|---|
| 2613 | | - The PCI bus, amongst others, defines an I/O space concept which - on such |
|---|
| 2614 | | - CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O |
|---|
| 2615 | | - space. However, it may also be mapped as a virtual I/O space in the CPU's |
|---|
| 2616 | | - memory map, particularly on those CPUs that don't support alternate I/O |
|---|
| 2617 | | - spaces. |
|---|
| 2618 | | - |
|---|
| 2619 | | - Accesses to this space may be fully synchronous (as on i386), but |
|---|
| 2620 | | - intermediary bridges (such as the PCI host bridge) may not fully honour |
|---|
| 2621 | | - that. |
|---|
| 2622 | | - |
|---|
| 2623 | | - They are guaranteed to be fully ordered with respect to each other. |
|---|
| 2624 | | - |
|---|
| 2625 | | - They are not guaranteed to be fully ordered with respect to other types of |
|---|
| 2626 | | - memory and I/O operation. |
|---|
| 2523 | +Interfacing with peripherals via I/O accesses is deeply architecture and device |
|---|
| 2524 | +specific. Therefore, drivers which are inherently non-portable may rely on |
|---|
| 2525 | +specific behaviours of their target systems in order to achieve synchronization |
|---|
| 2526 | +in the most lightweight manner possible. For drivers intending to be portable |
|---|
| 2527 | +between multiple architectures and bus implementations, the kernel offers a |
|---|
| 2528 | +series of accessor functions that provide various degrees of ordering |
|---|
| 2529 | +guarantees: |
|---|
| 2627 | 2530 | |
|---|
| 2628 | 2531 | (*) readX(), writeX(): |
|---|
| 2629 | 2532 | |
|---|
| 2630 | | - Whether these are guaranteed to be fully ordered and uncombined with |
|---|
| 2631 | | - respect to each other on the issuing CPU depends on the characteristics |
|---|
| 2632 | | - defined for the memory window through which they're accessing. On later |
|---|
| 2633 | | - i386 architecture machines, for example, this is controlled by way of the |
|---|
| 2634 | | - MTRR registers. |
|---|
| 2533 | + The readX() and writeX() MMIO accessors take a pointer to the |
|---|
| 2534 | + peripheral being accessed as an __iomem * parameter. For pointers |
|---|
| 2535 | + mapped with the default I/O attributes (e.g. those returned by |
|---|
| 2536 | + ioremap()), the ordering guarantees are as follows: |
|---|
| 2635 | 2537 | |
|---|
| 2636 | | - Ordinarily, these will be guaranteed to be fully ordered and uncombined, |
|---|
| 2637 | | - provided they're not accessing a prefetchable device. |
|---|
| 2538 | + 1. All readX() and writeX() accesses to the same peripheral are ordered |
|---|
| 2539 | + with respect to each other. This ensures that MMIO register accesses |
|---|
| 2540 | + by the same CPU thread to a particular device will arrive in program |
|---|
| 2541 | + order. |
|---|
| 2638 | 2542 | |
|---|
| 2639 | | - However, intermediary hardware (such as a PCI bridge) may indulge in |
|---|
| 2640 | | - deferral if it so wishes; to flush a store, a load from the same location |
|---|
| 2641 | | - is preferred[*], but a load from the same device or from configuration |
|---|
| 2642 | | - space should suffice for PCI. |
|---|
| 2543 | + 2. A writeX() issued by a CPU thread holding a spinlock is ordered |
|---|
| 2544 | + before a writeX() to the same peripheral from another CPU thread |
|---|
| 2545 | + issued after a later acquisition of the same spinlock. This ensures |
|---|
| 2546 | + that MMIO register writes to a particular device issued while holding |
|---|
| 2547 | + a spinlock will arrive in an order consistent with acquisitions of |
|---|
| 2548 | + the lock. |
|---|
| 2643 | 2549 | |
|---|
| 2644 | | - [*] NOTE! attempting to load from the same location as was written to may |
|---|
| 2645 | | - cause a malfunction - consider the 16550 Rx/Tx serial registers for |
|---|
| 2646 | | - example. |
|---|
| 2550 | + 3. A writeX() by a CPU thread to the peripheral will first wait for the |
|---|
| 2551 | + completion of all prior writes to memory either issued by, or |
|---|
| 2552 | + propagated to, the same thread. This ensures that writes by the CPU |
|---|
| 2553 | + to an outbound DMA buffer allocated by dma_alloc_coherent() will be |
|---|
| 2554 | + visible to a DMA engine when the CPU writes to its MMIO control |
|---|
| 2555 | + register to trigger the transfer (see the sketch at the end of this section). |
|---|
| 2647 | 2556 | |
|---|
| 2648 | | - Used with prefetchable I/O memory, an mmiowb() barrier may be required to |
|---|
| 2649 | | - force stores to be ordered. |
|---|
| 2557 | + 4. A readX() by a CPU thread from the peripheral will complete before |
|---|
| 2558 | + any subsequent reads from memory by the same thread can begin. This |
|---|
| 2559 | + ensures that reads by the CPU from an incoming DMA buffer allocated |
|---|
| 2560 | + by dma_alloc_coherent() will not see stale data after reading from |
|---|
| 2561 | + the DMA engine's MMIO status register to establish that the DMA |
|---|
| 2562 | + transfer has completed. |
|---|
| 2650 | 2563 | |
|---|
| 2651 | | - Please refer to the PCI specification for more information on interactions |
|---|
| 2652 | | - between PCI transactions. |
|---|
| 2564 | + 5. A readX() by a CPU thread from the peripheral will complete before |
|---|
| 2565 | + any subsequent delay() loop can begin execution on the same thread. |
|---|
| 2566 | + This ensures that two MMIO register writes by the CPU to a peripheral |
|---|
| 2567 | + will arrive at least 1us apart if the first write is immediately read |
|---|
| 2568 | + back with readX() and udelay(1) is called prior to the second |
|---|
| 2569 | + writeX(): |
|---|
| 2653 | 2570 | |
|---|
| 2654 | | - (*) readX_relaxed(), writeX_relaxed() |
|---|
| 2571 | + writel(42, DEVICE_REGISTER_0); // Arrives at the device... |
|---|
| 2572 | + readl(DEVICE_REGISTER_0); |
|---|
| 2573 | + udelay(1); |
|---|
| 2574 | + writel(42, DEVICE_REGISTER_1); // ...at least 1us before this. |
|---|
| 2655 | 2575 | |
|---|
| 2656 | | - These are similar to readX() and writeX(), but provide weaker memory |
|---|
| 2657 | | - ordering guarantees. Specifically, they do not guarantee ordering with |
|---|
| 2658 | | - respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee |
|---|
| 2659 | | - ordering with respect to LOCK or UNLOCK operations. If the latter is |
|---|
| 2660 | | - required, an mmiowb() barrier can be used. Note that relaxed accesses to |
|---|
| 2661 | | - the same peripheral are guaranteed to be ordered with respect to each |
|---|
| 2662 | | - other. |
|---|
| 2576 | + The ordering properties of __iomem pointers obtained with non-default |
|---|
| 2577 | + attributes (e.g. those returned by ioremap_wc()) are specific to the |
|---|
| 2578 | + underlying architecture and therefore the guarantees listed above cannot |
|---|
| 2579 | + generally be relied upon for accesses to these types of mappings. |
|---|
| 2663 | 2580 | |
|---|
| 2664 | | - (*) ioreadX(), iowriteX() |
|---|
| 2581 | + (*) readX_relaxed(), writeX_relaxed(): |
|---|
| 2665 | 2582 | |
|---|
| 2666 | | - These will perform appropriately for the type of access they're actually |
|---|
| 2667 | | - doing, be it inX()/outX() or readX()/writeX(). |
|---|
| 2583 | + These are similar to readX() and writeX(), but provide weaker memory |
|---|
| 2584 | + ordering guarantees. Specifically, they do not guarantee ordering with |
|---|
| 2585 | + respect to locking, normal memory accesses or delay() loops (i.e. |
|---|
| 2586 | + bullets 2-5 above) but they are still guaranteed to be ordered with |
|---|
| 2587 | + respect to other accesses from the same CPU thread to the same |
|---|
| 2588 | + peripheral when operating on __iomem pointers mapped with the default |
|---|
| 2589 | + I/O attributes. |
|---|
| 2590 | + |
|---|
| 2591 | + (*) readsX(), writesX(): |
|---|
| 2592 | + |
|---|
| 2593 | + The readsX() and writesX() MMIO accessors are designed for accessing |
|---|
| 2594 | + register-based, memory-mapped FIFOs residing on peripherals that are not |
|---|
| 2595 | + capable of performing DMA. Consequently, they provide only the ordering |
|---|
| 2596 | + guarantees of readX_relaxed() and writeX_relaxed(), as documented above. |
|---|
| 2597 | + |
|---|
| 2598 | + (*) inX(), outX(): |
|---|
| 2599 | + |
|---|
| 2600 | + The inX() and outX() accessors are intended to access legacy port-mapped |
|---|
| 2601 | + I/O peripherals, which may require special instructions on some |
|---|
| 2602 | + architectures (notably x86). The port number of the peripheral being |
|---|
| 2603 | + accessed is passed as an argument. |
|---|
| 2604 | + |
|---|
| 2605 | + Since many CPU architectures ultimately access these peripherals via an |
|---|
| 2606 | + internal virtual memory mapping, the portable ordering guarantees |
|---|
| 2607 | + provided by inX() and outX() are the same as those provided by readX() |
|---|
| 2608 | + and writeX() respectively when accessing a mapping with the default I/O |
|---|
| 2609 | + attributes. |
|---|
| 2610 | + |
|---|
| 2611 | + Device drivers may expect outX() to emit a non-posted write transaction |
|---|
| 2612 | + that waits for a completion response from the I/O peripheral before |
|---|
| 2613 | + returning. This is not guaranteed by all architectures and is therefore |
|---|
| 2614 | + not part of the portable ordering semantics. |
|---|
| 2615 | + |
|---|
| 2616 | + (*) insX(), outsX(): |
|---|
| 2617 | + |
|---|
| 2618 | + As above, the insX() and outsX() accessors provide the same ordering |
|---|
| 2619 | + guarantees as readsX() and writesX() respectively when accessing a |
|---|
| 2620 | + mapping with the default I/O attributes. |
|---|
| 2621 | + |
|---|
| 2622 | + (*) ioreadX(), iowriteX(): |
|---|
| 2623 | + |
|---|
| 2624 | + These will perform appropriately for the type of access they're actually |
|---|
| 2625 | + doing, be it inX()/outX() or readX()/writeX(). |
|---|
| 2626 | + |
|---|
| 2627 | +With the exception of the string accessors (insX(), outsX(), readsX() and |
|---|
| 2628 | +writesX()), all of the above assume that the underlying peripheral is |
|---|
| 2629 | +little-endian and will therefore perform byte-swapping operations on big-endian |
|---|
| 2630 | +architectures. |
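
As a hedged sketch of guarantee 3 of readX()/writeX() above (the device,
descriptor layout and DOORBELL register offset below are hypothetical):

	struct my_desc *desc;
	dma_addr_t desc_dma;

	desc = dma_alloc_coherent(dev, sizeof(*desc), &desc_dma, GFP_KERNEL);

	desc->addr = cpu_to_le64(buf_dma);	/* plain stores to coherent memory */
	desc->len  = cpu_to_le32(buf_len);
	writel(RING_DOORBELL, base + DOORBELL);	/* ordered after the stores above,
						   so the DMA engine observes a
						   complete descriptor */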
|---|
| 2668 | 2631 | |
|---|
| 2669 | 2632 | |
|---|
| 2670 | 2633 | ======================================== |
|---|
| .. | .. |
|---|
| 2759 | 2722 | the use of any special device communication instructions the CPU may have. |
|---|
| 2760 | 2723 | |
|---|
| 2761 | 2724 | |
|---|
| 2762 | | -CACHE COHERENCY |
|---|
| 2763 | | ---------------- |
|---|
| 2764 | | - |
|---|
| 2765 | | -Life isn't quite as simple as it may appear above, however: for while the |
|---|
| 2766 | | -caches are expected to be coherent, there's no guarantee that that coherency |
|---|
| 2767 | | -will be ordered. This means that whilst changes made on one CPU will |
|---|
| 2768 | | -eventually become visible on all CPUs, there's no guarantee that they will |
|---|
| 2769 | | -become apparent in the same order on those other CPUs. |
|---|
| 2770 | | - |
|---|
| 2771 | | - |
|---|
| 2772 | | -Consider dealing with a system that has a pair of CPUs (1 & 2), each of which |
|---|
| 2773 | | -has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): |
|---|
| 2774 | | - |
|---|
| 2775 | | - : |
|---|
| 2776 | | - : +--------+ |
|---|
| 2777 | | - : +---------+ | | |
|---|
| 2778 | | - +--------+ : +--->| Cache A |<------->| | |
|---|
| 2779 | | - | | : | +---------+ | | |
|---|
| 2780 | | - | CPU 1 |<---+ | | |
|---|
| 2781 | | - | | : | +---------+ | | |
|---|
| 2782 | | - +--------+ : +--->| Cache B |<------->| | |
|---|
| 2783 | | - : +---------+ | | |
|---|
| 2784 | | - : | Memory | |
|---|
| 2785 | | - : +---------+ | System | |
|---|
| 2786 | | - +--------+ : +--->| Cache C |<------->| | |
|---|
| 2787 | | - | | : | +---------+ | | |
|---|
| 2788 | | - | CPU 2 |<---+ | | |
|---|
| 2789 | | - | | : | +---------+ | | |
|---|
| 2790 | | - +--------+ : +--->| Cache D |<------->| | |
|---|
| 2791 | | - : +---------+ | | |
|---|
| 2792 | | - : +--------+ |
|---|
| 2793 | | - : |
|---|
| 2794 | | - |
|---|
| 2795 | | -Imagine the system has the following properties: |
|---|
| 2796 | | - |
|---|
| 2797 | | - (*) an odd-numbered cache line may be in cache A, cache C or it may still be |
|---|
| 2798 | | - resident in memory; |
|---|
| 2799 | | - |
|---|
| 2800 | | - (*) an even-numbered cache line may be in cache B, cache D or it may still be |
|---|
| 2801 | | - resident in memory; |
|---|
| 2802 | | - |
|---|
| 2803 | | - (*) whilst the CPU core is interrogating one cache, the other cache may be |
|---|
| 2804 | | - making use of the bus to access the rest of the system - perhaps to |
|---|
| 2805 | | - displace a dirty cacheline or to do a speculative load; |
|---|
| 2806 | | - |
|---|
| 2807 | | - (*) each cache has a queue of operations that need to be applied to that cache |
|---|
| 2808 | | - to maintain coherency with the rest of the system; |
|---|
| 2809 | | - |
|---|
| 2810 | | - (*) the coherency queue is not flushed by normal loads to lines already |
|---|
| 2811 | | - present in the cache, even though the contents of the queue may |
|---|
| 2812 | | - potentially affect those loads. |
|---|
| 2813 | | - |
|---|
| 2814 | | -Imagine, then, that two writes are made on the first CPU, with a write barrier |
|---|
| 2815 | | -between them to guarantee that they will appear to reach that CPU's caches in |
|---|
| 2816 | | -the requisite order: |
|---|
| 2817 | | - |
|---|
| 2818 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2819 | | - =============== =============== ======================================= |
|---|
| 2820 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2821 | | - v = 2; |
|---|
| 2822 | | - smp_wmb(); Make sure change to v is visible before |
|---|
| 2823 | | - change to p |
|---|
| 2824 | | - <A:modify v=2> v is now in cache A exclusively |
|---|
| 2825 | | - p = &v; |
|---|
| 2826 | | - <B:modify p=&v> p is now in cache B exclusively |
|---|
| 2827 | | - |
|---|
| 2828 | | -The write memory barrier forces the other CPUs in the system to perceive that |
|---|
| 2829 | | -the local CPU's caches have apparently been updated in the correct order. But |
|---|
| 2830 | | -now imagine that the second CPU wants to read those values: |
|---|
| 2831 | | - |
|---|
| 2832 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2833 | | - =============== =============== ======================================= |
|---|
| 2834 | | - ... |
|---|
| 2835 | | - q = p; |
|---|
| 2836 | | - x = *q; |
|---|
| 2837 | | - |
|---|
| 2838 | | -The above pair of reads may then fail to happen in the expected order, as the |
|---|
| 2839 | | -cacheline holding p may get updated in one of the second CPU's caches whilst |
|---|
| 2840 | | -the update to the cacheline holding v is delayed in the other of the second |
|---|
| 2841 | | -CPU's caches by some other cache event: |
|---|
| 2842 | | - |
|---|
| 2843 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2844 | | - =============== =============== ======================================= |
|---|
| 2845 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2846 | | - v = 2; |
|---|
| 2847 | | - smp_wmb(); |
|---|
| 2848 | | - <A:modify v=2> <C:busy> |
|---|
| 2849 | | - <C:queue v=2> |
|---|
| 2850 | | - p = &v; q = p; |
|---|
| 2851 | | - <D:request p> |
|---|
| 2852 | | - <B:modify p=&v> <D:commit p=&v> |
|---|
| 2853 | | - <D:read p> |
|---|
| 2854 | | - x = *q; |
|---|
| 2855 | | - <C:read *q> Reads from v before v updated in cache |
|---|
| 2856 | | - <C:unbusy> |
|---|
| 2857 | | - <C:commit v=2> |
|---|
| 2858 | | - |
|---|
| 2859 | | -Basically, whilst both cachelines will be updated on CPU 2 eventually, there's |
|---|
| 2860 | | -no guarantee that, without intervention, the order of update will be the same |
|---|
| 2861 | | -as that committed on CPU 1. |
|---|
| 2862 | | - |
|---|
| 2863 | | - |
|---|
| 2864 | | -To intervene, we need to interpolate a data dependency barrier or a read |
|---|
| 2865 | | -barrier between the loads (which as of v4.15 is supplied unconditionally |
|---|
| 2866 | | -by the READ_ONCE() macro). This will force the cache to commit its |
|---|
| 2867 | | -coherency queue before processing any further requests: |
|---|
| 2868 | | - |
|---|
| 2869 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2870 | | - =============== =============== ======================================= |
|---|
| 2871 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2872 | | - v = 2; |
|---|
| 2873 | | - smp_wmb(); |
|---|
| 2874 | | - <A:modify v=2> <C:busy> |
|---|
| 2875 | | - <C:queue v=2> |
|---|
| 2876 | | - p = &v; q = p; |
|---|
| 2877 | | - <D:request p> |
|---|
| 2878 | | - <B:modify p=&v> <D:commit p=&v> |
|---|
| 2879 | | - <D:read p> |
|---|
| 2880 | | - smp_read_barrier_depends() |
|---|
| 2881 | | - <C:unbusy> |
|---|
| 2882 | | - <C:commit v=2> |
|---|
| 2883 | | - x = *q; |
|---|
| 2884 | | - <C:read *q> Reads from v after v updated in cache |
|---|
| 2885 | | - |
|---|
| 2886 | | - |
|---|
| 2887 | | -This sort of problem can be encountered on DEC Alpha processors as they have a |
|---|
| 2888 | | -split cache that improves performance by making better use of the data bus. |
|---|
| 2889 | | -Whilst most CPUs do imply a data dependency barrier on the read when a memory |
|---|
| 2890 | | -access depends on a read, not all do, so it may not be relied on. |
|---|
| 2891 | | - |
|---|
| 2892 | | -Other CPUs may also have split caches, but must coordinate between the various |
|---|
| 2893 | | -cachelets for normal memory accesses. The semantics of the Alpha removes the |
|---|
| 2894 | | -need for hardware coordination in the absence of memory barriers, which |
|---|
| 2895 | | -permitted Alpha to sport higher CPU clock rates back in the day. However, |
|---|
| 2896 | | -please note that (again, as of v4.15) smp_read_barrier_depends() should not |
|---|
| 2897 | | -be used except in Alpha arch-specific code and within the READ_ONCE() macro. |
|---|
| 2898 | | - |
|---|
| 2899 | | - |
|---|
| 2900 | 2725 | CACHE COHERENCY VS DMA |
|---|
| 2901 | 2726 | ---------------------- |
|---|
| 2902 | 2727 | |
|---|
| .. | .. |
|---|
| 2975 | 2800 | thus cutting down on transaction setup costs (memory and PCI devices may |
|---|
| 2976 | 2801 | both be able to do this); and |
|---|
| 2977 | 2802 | |
|---|
| 2978 | | - (*) the CPU's data cache may affect the ordering, and whilst cache-coherency |
|---|
| 2803 | + (*) the CPU's data cache may affect the ordering, and while cache-coherency |
|---|
| 2979 | 2804 | mechanisms may alleviate this - once the store has actually hit the cache |
|---|
| 2980 | 2805 | - there's no guarantee that the coherency management will be propagated in |
|---|
| 2981 | 2806 | order to other CPUs. |
|---|
| .. | .. |
|---|
| 3060 | 2885 | changes vs new data occur in the right order. |
|---|
| 3061 | 2886 | |
|---|
| 3062 | 2887 | The Alpha defines the Linux kernel's memory model, although as of v4.15 |
|---|
| 3063 | | -the Linux kernel's addition of smp_read_barrier_depends() to READ_ONCE() |
|---|
| 3064 | | -greatly reduced Alpha's impact on the memory model. |
|---|
| 3065 | | - |
|---|
| 3066 | | -See the subsection on "Cache Coherency" above. |
|---|
| 2888 | +the Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly |
|---|
| 2889 | +reduced its impact on the memory model. |
|---|
| 3067 | 2890 | |
|---|
| 3068 | 2891 | |
|---|
| 3069 | 2892 | VIRTUAL MACHINE GUESTS |
|---|