2024-10-12 a5969cabbb4660eab42b6ef0412cbbd1200cf14d
kernel/Documentation/memory-barriers.txt
@@ -3,7 +3,7 @@
 ============================

 By: David Howells <dhowells@redhat.com>
-    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
+    Paul E. McKenney <paulmck@linux.ibm.com>
     Will Deacon <will.deacon@arm.com>
     Peter Zijlstra <peterz@infradead.org>

@@ -63,7 +63,6 @@

      - Compiler barrier.
      - CPU memory barriers.
-     - MMIO write barrier.

  (*) Implicit kernel memory barriers.

@@ -75,7 +74,6 @@
  (*) Inter-CPU acquiring barrier effects.

      - Acquires vs memory accesses.
-     - Acquires vs I/O accesses.

  (*) Where are memory barriers needed?

@@ -187,7 +185,7 @@
 	===============	===============
 	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
 	B = 4;		Q = P;
-	P = &B		D = *Q;
+	P = &B;		D = *Q;

 There is an obvious data dependency here, as the value loaded into D depends on
 the address retrieved from P by CPU 2. At the end of the sequence, any of the
@@ -471,8 +469,7 @@
      operations after the ACQUIRE operation will appear to happen after the
      ACQUIRE operation with respect to the other components of the system.
      ACQUIRE operations include LOCK operations and both smp_load_acquire()
-     and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
-     semantics from relying on a control dependency and smp_rmb().
+     and smp_cond_load_acquire() operations.

      Memory operations that occur before an ACQUIRE operation may appear to
      happen after it completes.
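
smp_cond_load_acquire(ptr, cond_expr) spins until cond_expr is true for the
value loaded from ptr, then returns that value with ACQUIRE semantics. A
minimal sketch of the pattern it subsumes (flag is a hypothetical shared
variable; VAL is the macro's placeholder for the loaded value):

	/* Wait for another CPU to set flag; accesses after this call
	 * cannot be reordered before the final load. */
	smp_cond_load_acquire(&flag, VAL != 0);

	/* Roughly the open-coded equivalent described by the removed text:
	 * a control dependency plus smp_rmb(). */
	while (!READ_ONCE(flag))
		cpu_relax();
	smp_rmb();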
@@ -493,10 +490,9 @@
      happen before it completes.

      The use of ACQUIRE and RELEASE operations generally precludes the need
-     for other sorts of memory barrier (but note the exceptions mentioned in
-     the subsection "MMIO write barrier"). In addition, a RELEASE+ACQUIRE
-     pair is -not- guaranteed to act as a full memory barrier. However, after
-     an ACQUIRE on a given variable, all memory accesses preceding any prior
+     for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is
+     -not- guaranteed to act as a full memory barrier. However, after an
+     ACQUIRE on a given variable, all memory accesses preceding any prior
      RELEASE on that same variable are guaranteed to be visible. In other
      words, within a given variable's critical section, all accesses of all
      previous critical sections for that variable are guaranteed to have
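
A hedged sketch of that visibility guarantee, with a spinlock providing the
ACQUIRE/RELEASE pair (lock and shared_data are illustrative, and CPU 2 is
assumed to acquire the lock after CPU 1 has released it):

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(&lock);		/* ACQUIRE */
	WRITE_ONCE(shared_data, 1);
	spin_unlock(&lock);		/* RELEASE */
					spin_lock(&lock);	/* ACQUIRE */
					r1 = READ_ONCE(shared_data);
					/* r1 is guaranteed to be 1 */
					spin_unlock(&lock);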
@@ -549,20 +545,20 @@

 [*] For information on bus mastering DMA and coherency please read:

-     Documentation/PCI/pci.txt
-     Documentation/DMA-API-HOWTO.txt
-     Documentation/DMA-API.txt
+     Documentation/driver-api/pci/pci.rst
+     Documentation/core-api/dma-api-howto.rst
+     Documentation/core-api/dma-api.rst


 DATA DEPENDENCY BARRIERS (HISTORICAL)
 -------------------------------------

-As of v4.15 of the Linux kernel, an smp_read_barrier_depends() was
-added to READ_ONCE(), which means that about the only people who
-need to pay attention to this section are those working on DEC Alpha
-architecture-specific code and those working on READ_ONCE() itself.
-For those who need it, and for those who are interested in the history,
-here is the story of data-dependency barriers.
+As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
+DEC Alpha, which means that about the only people who need to pay attention
+to this section are those working on DEC Alpha architecture-specific code
+and those working on READ_ONCE() itself. For those who need it, and for
+those who are interested in the history, here is the story of
+data-dependency barriers.

 The usage requirements of data dependency barriers are a little subtle, and
 it's not always obvious that they're needed. To illustrate, consider the
@@ -573,7 +569,7 @@
 	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
 	B = 4;
 	<write barrier>
-	WRITE_ONCE(P, &B)
+	WRITE_ONCE(P, &B);
 			Q = READ_ONCE(P);
 			D = *Q;

@@ -588,7 +584,7 @@

 	(Q == &B) and (D == 2) ????

-Whilst this may seem like a failure of coherency or causality maintenance, it
+While this may seem like a failure of coherency or causality maintenance, it
 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
 Alpha).

@@ -624,7 +620,7 @@
 until they are certain (1) that the write will actually happen, (2)
 of the location of the write, and (3) of the value to be written.
 But please carefully read the "CONTROL DEPENDENCIES" section and the
-Documentation/RCU/rcu_dereference.txt file: The compiler can and does
+Documentation/RCU/rcu_dereference.rst file: The compiler can and does
 break dependencies in a great many highly creative ways.

 	CPU 1			CPU 2
@@ -1513,8 +1509,6 @@

  (*) CPU memory barriers.

- (*) MMIO write barrier.
-

 COMPILER BARRIER
 ----------------
@@ -1727,7 +1721,7 @@
      and WRITE_ONCE() are more selective: With READ_ONCE() and
      WRITE_ONCE(), the compiler need only forget the contents of the
      indicated memory locations, while with barrier() the compiler must
-     discard the value of all memory locations that it has currented
+     discard the value of all memory locations that it has currently
      cached in any machine registers. Of course, the compiler must also
      respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
      though the CPU of course need not do so.
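
A short sketch of that difference (stop_flag is a hypothetical shared
variable):

	/* barrier() makes the compiler discard ALL register-cached values,
	 * so stop_flag is reloaded from memory on every iteration. */
	while (!stop_flag)
		barrier();

	/* READ_ONCE() only forces stop_flag itself to be reloaded; other
	 * values may validly stay cached in registers across iterations. */
	while (!READ_ONCE(stop_flag))
		cpu_relax();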
@@ -1839,7 +1833,7 @@
 to issue the loads in the correct order (eg. `a[b]` would have to load
 the value of b before loading a[b]), however there is no guarantee in
 the C specification that the compiler may not speculate the value of b
-(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1)
+(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
 tmp = a[b]; ). There is also the problem of a compiler reloading b after
 having loaded a[b], thus having a newer copy of b than a[b]. A consensus
 has not yet been reached about these problems, however the READ_ONCE()
@@ -1874,12 +1868,16 @@
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();

-     These are for use with atomic (such as add, subtract, increment and
-     decrement) functions that don't return a value, especially when used for
-     reference counting. These functions do not imply memory barriers.
+     These are for use with atomic RMW functions that do not imply memory
+     barriers, but where the code needs a memory barrier. Examples of atomic
+     RMW functions that do not imply a memory barrier are, e.g., add,
+     subtract, (failed) conditional operations and _relaxed functions,
+     but not atomic_read or atomic_set. A common example where a memory
+     barrier may be required is when atomic ops are used for reference
+     counting.

-     These are also used for atomic bitop functions that do not return a
-     value (such as set_bit and clear_bit).
+     These are also used for atomic RMW bitop functions that do not imply a
+     memory barrier (such as set_bit and clear_bit).

      As an example, consider a piece of code that marks an object as being dead
      and then decrements the object's reference count:
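
The example the file goes on to give follows this pattern (obj being a
hypothetical refcounted object):

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);

This makes sure that the death mark on the object is perceived to be set
*before* the reference counter is decremented.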
@@ -1934,24 +1932,23 @@
      here.

      See the subsection "Kernel I/O barrier effects" for more information on
-     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
-     information on consistent memory.
+     relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
+     more information on consistent memory.

+ (*) pmem_wmb();

-MMIO WRITE BARRIER
-------------------
+     This is for use with persistent memory to ensure that stores whose
+     modifications are written to persistent storage have reached a platform
+     durability domain.

-The Linux kernel also has a special barrier for use with memory-mapped I/O
-writes:
+     For example, after a non-temporal write to a pmem region, we use pmem_wmb()
+     to ensure that stores have reached a platform durability domain. This
+     ensures that stores have updated persistent storage before any data access
+     or data transfer caused by subsequent instructions is initiated. This is
+     in addition to the ordering done by wmb().

-	mmiowb();
-
-This is a variation on the mandatory write barrier that causes writes to weakly
-ordered I/O regions to be partially ordered. Its effects may go beyond the
-CPU->Hardware interface and actually affect the hardware at some level.
-
-See the subsection "Acquires vs I/O accesses" for more information.
-
+     For loads from persistent memory, existing read memory barriers are
+     sufficient to ensure read ordering.

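A hedged sketch of the pmem_wmb() usage just described (pmem_dst is an
assumed pointer into a persistent-memory mapping; memcpy_flushcache() is one
way to generate the non-temporal stores):

	memcpy_flushcache(pmem_dst, buf, len);	/* non-temporal stores to pmem */
	pmem_wmb();	/* stores reach the platform durability domain before
			 * any subsequent data access or transfer begins */
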
 ===============================
 IMPLICIT KERNEL MEMORY BARRIERS
@@ -2009,7 +2006,7 @@

 Certain locking variants of the ACQUIRE operation may fail, either due to
 being unable to get the lock immediately, or due to receiving an unblocked
-signal whilst asleep waiting for the lock to become available. Failed
+signal while asleep waiting for the lock to become available. Failed
 locks do not imply any sort of barrier.

 [!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
@@ -2318,75 +2315,6 @@
 	*E, *F or *G following RELEASE Q


-
-ACQUIRES VS I/O ACCESSES
-------------------------
-
-Under certain circumstances (especially involving NUMA), I/O accesses within
-two spinlocked sections on two different CPUs may be seen as interleaved by the
-PCI bridge, because the PCI bridge does not necessarily participate in the
-cache-coherence protocol, and is therefore incapable of issuing the required
-read memory barriers.
-
-For example:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	writel(1, DATA);
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					writel(5, DATA);
-					spin_unlock(Q);
-
-may be seen by the PCI bridge as follows:
-
-	STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
-
-which would probably cause the hardware to malfunction.
-
-
-What is necessary here is to intervene with an mmiowb() before dropping the
-spinlock, for example:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	writel(1, DATA);
-	mmiowb();
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					writel(5, DATA);
-					mmiowb();
-					spin_unlock(Q);
-
-this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
-before either of the stores issued on CPU 2.
-
-
-Furthermore, following a store by a load from the same device obviates the need
-for the mmiowb(), because the load forces the store to complete before the load
-is performed:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	spin_lock(Q)
-	writel(0, ADDR)
-	a = readl(DATA);
-	spin_unlock(Q);
-					spin_lock(Q);
-					writel(4, ADDR);
-					b = readl(DATA);
-					spin_unlock(Q);
-
-
-See Documentation/driver-api/device-io.rst for more information.
-
-
 =================================
 WHERE ARE MEMORY BARRIERS NEEDED?
 =================================
@@ -2509,7 +2437,7 @@
 ATOMIC OPERATIONS
 -----------------

-Whilst they are technically interprocessor interaction considerations, atomic
+While they are technically interprocessor interaction considerations, atomic
 operations are noted specially as some of them imply full memory barriers and
 some don't, but they're very heavily relied on as a group throughout the
 kernel.
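
For instance (an illustrative refcounting fragment; free_object() is
hypothetical), a value-returning atomic RMW such as atomic_dec_and_test()
implies a full memory barrier on either side of the operation, while
atomic_dec() implies no barrier at all:

	atomic_dec(&obj->weak_count);			/* no implied barrier */

	if (atomic_dec_and_test(&obj->ref_count))	/* fully ordered */
		free_object(obj);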
@@ -2532,17 +2460,10 @@

 Inside of the Linux kernel, I/O should be done through the appropriate accessor
 routines - such as inb() or writel() - which know how to make such accesses
-appropriately sequential. Whilst this, for the most part, renders the explicit
-use of memory barriers unnecessary, there are a couple of situations where they
-might be needed:
-
- (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
-     so for _all_ general drivers locks should be used and mmiowb() must be
-     issued prior to unlocking the critical section.
-
- (2) If the accessor functions are used to refer to an I/O memory window with
-     relaxed memory access properties, then _mandatory_ memory barriers are
-     required to enforce ordering.
+appropriately sequential. While this, for the most part, renders the explicit
+use of memory barriers unnecessary, if the accessor functions are used to refer
+to an I/O memory window with relaxed memory access properties, then _mandatory_
+memory barriers are required to enforce ordering.

 See Documentation/driver-api/device-io.rst for more information.

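A hedged sketch of the relaxed-window case (dev, buf and the register name
are illustrative): a mandatory barrier is what orders a normal memory store
before a relaxed MMIO write that kicks the device:

	buf->status = READY;			/* normal (DMA) memory write */
	wmb();					/* mandatory write barrier */
	writel_relaxed(GO, dev->base + CTRL);	/* relaxed MMIO doorbell */
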
@@ -2556,7 +2477,7 @@

 This may be alleviated - at least in part - by disabling local interrupts (a
 form of locking), such that the critical operations are all contained within
-the interrupt-disabled section in the driver. Whilst the driver's interrupt
+the interrupt-disabled section in the driver. While the driver's interrupt
 routine is executing, the driver's core may not run on the same CPU, and its
 interrupt is not permitted to happen again until the current interrupt has been
 handled, thus the interrupt handler does not need to lock against that.
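
A minimal sketch of that interrupt-disabling form of locking (the critical
operations stand for whatever state the driver core shares with its handler):

	unsigned long flags;

	local_irq_save(flags);		/* keep this CPU's interrupts out */
	/* ... operate on state shared with the interrupt handler ... */
	local_irq_restore(flags);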
@@ -2587,8 +2508,7 @@

 Normally this won't be a problem because the I/O accesses done inside such
 sections will include synchronous load operations on strictly ordered I/O
-registers that form implicit I/O barriers. If this isn't sufficient then an
-mmiowb() may need to be used explicitly.
+registers that form implicit I/O barriers.


 A similar situation may occur between an interrupt routine and two routines
@@ -2600,71 +2520,114 @@
 KERNEL I/O BARRIER EFFECTS
 ==========================

-When accessing I/O memory, drivers should use the appropriate accessor
-functions:
-
- (*) inX(), outX():
-
-     These are intended to talk to I/O space rather than memory space, but
-     that's primarily a CPU-specific concept. The i386 and x86_64 processors
-     do indeed have special I/O space access cycles and instructions, but many
-     CPUs don't have such a concept.
-
-     The PCI bus, amongst others, defines an I/O space concept which - on such
-     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
-     space. However, it may also be mapped as a virtual I/O space in the CPU's
-     memory map, particularly on those CPUs that don't support alternate I/O
-     spaces.
-
-     Accesses to this space may be fully synchronous (as on i386), but
-     intermediary bridges (such as the PCI host bridge) may not fully honour
-     that.
-
-     They are guaranteed to be fully ordered with respect to each other.
-
-     They are not guaranteed to be fully ordered with respect to other types of
-     memory and I/O operation.
+Interfacing with peripherals via I/O accesses is deeply architecture and device
+specific. Therefore, drivers which are inherently non-portable may rely on
+specific behaviours of their target systems in order to achieve synchronization
+in the most lightweight manner possible. For drivers intending to be portable
+between multiple architectures and bus implementations, the kernel offers a
+series of accessor functions that provide various degrees of ordering
+guarantees:

  (*) readX(), writeX():

-     Whether these are guaranteed to be fully ordered and uncombined with
-     respect to each other on the issuing CPU depends on the characteristics
-     defined for the memory window through which they're accessing. On later
-     i386 architecture machines, for example, this is controlled by way of the
-     MTRR registers.
+     The readX() and writeX() MMIO accessors take a pointer to the
+     peripheral being accessed as an __iomem * parameter. For pointers
+     mapped with the default I/O attributes (e.g. those returned by
+     ioremap()), the ordering guarantees are as follows:

-     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
-     provided they're not accessing a prefetchable device.
+     1. All readX() and writeX() accesses to the same peripheral are ordered
+	with respect to each other. This ensures that MMIO register accesses
+	by the same CPU thread to a particular device will arrive in program
+	order.

-     However, intermediary hardware (such as a PCI bridge) may indulge in
-     deferral if it so wishes; to flush a store, a load from the same location
-     is preferred[*], but a load from the same device or from configuration
-     space should suffice for PCI.
+     2. A writeX() issued by a CPU thread holding a spinlock is ordered
+	before a writeX() to the same peripheral from another CPU thread
+	issued after a later acquisition of the same spinlock. This ensures
+	that MMIO register writes to a particular device issued while holding
+	a spinlock will arrive in an order consistent with acquisitions of
+	the lock.

-     [*] NOTE! attempting to load from the same location as was written to may
-	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
-	 example.
+     3. A writeX() by a CPU thread to the peripheral will first wait for the
+	completion of all prior writes to memory either issued by, or
+	propagated to, the same thread. This ensures that writes by the CPU
+	to an outbound DMA buffer allocated by dma_alloc_coherent() will be
+	visible to a DMA engine when the CPU writes to its MMIO control
+	register to trigger the transfer (see the sketch after this list).

-     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
-     force stores to be ordered.
+     4. A readX() by a CPU thread from the peripheral will complete before
+	any subsequent reads from memory by the same thread can begin. This
+	ensures that reads by the CPU from an incoming DMA buffer allocated
+	by dma_alloc_coherent() will not see stale data after reading from
+	the DMA engine's MMIO status register to establish that the DMA
+	transfer has completed.

-     Please refer to the PCI specification for more information on interactions
-     between PCI transactions.
+     5. A readX() by a CPU thread from the peripheral will complete before
+	any subsequent delay() loop can begin execution on the same thread.
+	This ensures that two MMIO register writes by the CPU to a peripheral
+	will arrive at least 1us apart if the first write is immediately read
+	back with readX() and udelay(1) is called prior to the second
+	writeX():

- (*) readX_relaxed(), writeX_relaxed()
+	writel(42, DEVICE_REGISTER_0); // Arrives at the device...
+	readl(DEVICE_REGISTER_0);
+	udelay(1);
+	writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees. Specifically, they do not guarantee ordering with
-     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
-     ordering with respect to LOCK or UNLOCK operations. If the latter is
-     required, an mmiowb() barrier can be used. Note that relaxed accesses to
-     the same peripheral are guaranteed to be ordered with respect to each
-     other.
+     The ordering properties of __iomem pointers obtained with non-default
+     attributes (e.g. those returned by ioremap_wc()) are specific to the
+     underlying architecture and therefore the guarantees listed above cannot
+     generally be relied upon for accesses to these types of mappings.

- (*) ioreadX(), iowriteX()
+ (*) readX_relaxed(), writeX_relaxed():

-     These will perform appropriately for the type of access they're actually
-     doing, be it inX()/outX() or readX()/writeX().
+     These are similar to readX() and writeX(), but provide weaker memory
+     ordering guarantees. Specifically, they do not guarantee ordering with
+     respect to locking, normal memory accesses or delay() loops (i.e.
+     bullets 2-5 above) but they are still guaranteed to be ordered with
+     respect to other accesses from the same CPU thread to the same
+     peripheral when operating on __iomem pointers mapped with the default
+     I/O attributes.
+
+ (*) readsX(), writesX():
+
+     The readsX() and writesX() MMIO accessors are designed for accessing
+     register-based, memory-mapped FIFOs residing on peripherals that are not
+     capable of performing DMA. Consequently, they provide only the ordering
+     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
+
+ (*) inX(), outX():
+
+     The inX() and outX() accessors are intended to access legacy port-mapped
+     I/O peripherals, which may require special instructions on some
+     architectures (notably x86). The port number of the peripheral being
+     accessed is passed as an argument.
+
+     Since many CPU architectures ultimately access these peripherals via an
+     internal virtual memory mapping, the portable ordering guarantees
+     provided by inX() and outX() are the same as those provided by readX()
+     and writeX() respectively when accessing a mapping with the default I/O
+     attributes.
+
+     Device drivers may expect outX() to emit a non-posted write transaction
+     that waits for a completion response from the I/O peripheral before
+     returning. This is not guaranteed by all architectures and is therefore
+     not part of the portable ordering semantics.
+
+ (*) insX(), outsX():
+
+     As above, the insX() and outsX() accessors provide the same ordering
+     guarantees as readsX() and writesX() respectively when accessing a
+     mapping with the default I/O attributes.
+
+ (*) ioreadX(), iowriteX():
+
+     These will perform appropriately for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
+
+With the exception of the string accessors (insX(), outsX(), readsX() and
+writesX()), all of the above assume that the underlying peripheral is
+little-endian and will therefore perform byte-swapping operations on big-endian
+architectures.
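
A hedged sketch of the DMA doorbell pattern from bullet 3 of readX()/writeX()
above (the descriptor layout and register names are invented for
illustration):

	/* desc points into a buffer obtained from dma_alloc_coherent(). */
	desc->addr = cpu_to_le64(dma_addr);
	desc->len  = cpu_to_le32(len);

	/* writel() waits for the descriptor stores above to complete, so the
	 * DMA engine sees them once the doorbell write arrives. */
	writel(DOORBELL_KICK, dev->mmio + DOORBELL_REG);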


 ========================================
@@ -2759,144 +2722,6 @@
 the use of any special device communication instructions the CPU may have.

-
-CACHE COHERENCY
----------------
-
-Life isn't quite as simple as it may appear above, however: for while the
-caches are expected to be coherent, there's no guarantee that that coherency
-will be ordered. This means that whilst changes made on one CPU will
-eventually become visible on all CPUs, there's no guarantee that they will
-become apparent in the same order on those other CPUs.
-
-
-Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
-has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
-
-	            :
-	            :                          +--------+
-	            :      +---------+         |        |
-	+--------+  : +--->| Cache A |<------->|        |
-	|        |  : |    +---------+         |        |
-	|  CPU 1 |<---+                        |        |
-	|        |  : |    +---------+         |        |
-	+--------+  : +--->| Cache B |<------->|        |
-	            :      +---------+         |        |
-	            :                          | Memory |
-	            :      +---------+         | System |
-	+--------+  : +--->| Cache C |<------->|        |
-	|        |  : |    +---------+         |        |
-	|  CPU 2 |<---+                        |        |
-	|        |  : |    +---------+         |        |
-	+--------+  : +--->| Cache D |<------->|        |
-	            :      +---------+         |        |
-	            :                          +--------+
-	            :
-
-Imagine the system has the following properties:
-
- (*) an odd-numbered cache line may be in cache A, cache C or it may still be
-     resident in memory;
-
- (*) an even-numbered cache line may be in cache B, cache D or it may still be
-     resident in memory;
-
- (*) whilst the CPU core is interrogating one cache, the other cache may be
-     making use of the bus to access the rest of the system - perhaps to
-     displace a dirty cacheline or to do a speculative load;
-
- (*) each cache has a queue of operations that need to be applied to that cache
-     to maintain coherency with the rest of the system;
-
- (*) the coherency queue is not flushed by normal loads to lines already
-     present in the cache, even though the contents of the queue may
-     potentially affect those loads.
-
-Imagine, then, that two writes are made on the first CPU, with a write barrier
-between them to guarantee that they will appear to reach that CPU's caches in
-the requisite order:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-			u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();			Make sure change to v is visible before
-					 change to p
-	<A:modify v=2>			v is now in cache A exclusively
-	p = &v;
-	<B:modify p=&v>			p is now in cache B exclusively
-
-The write memory barrier forces the other CPUs in the system to perceive that
-the local CPU's caches have apparently been updated in the correct order. But
-now imagine that the second CPU wants to read those values:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-	...
-			q = p;
-			x = *q;
-
-The above pair of reads may then fail to happen in the expected order, as the
-cacheline holding p may get updated in one of the second CPU's caches whilst
-the update to the cacheline holding v is delayed in the other of the second
-CPU's caches by some other cache event:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-			u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();
-	<A:modify v=2>	<C:busy>
-			<C:queue v=2>
-	p = &v;		q = p;
-			<D:request p>
-	<B:modify p=&v>	<D:commit p=&v>
-			<D:read p>
-			x = *q;
-			<C:read *q>	Reads from v before v updated in cache
-			<C:unbusy>
-			<C:commit v=2>
-
-Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
-no guarantee that, without intervention, the order of update will be the same
-as that committed on CPU 1.
-
-
-To intervene, we need to interpolate a data dependency barrier or a read
-barrier between the loads (which as of v4.15 is supplied unconditionally
-by the READ_ONCE() macro). This will force the cache to commit its
-coherency queue before processing any further requests:
-
-	CPU 1		CPU 2		COMMENT
-	===============	===============	=======================================
-			u == 0, v == 1 and p == &u, q == &u
-	v = 2;
-	smp_wmb();
-	<A:modify v=2>	<C:busy>
-			<C:queue v=2>
-	p = &v;		q = p;
-			<D:request p>
-	<B:modify p=&v>	<D:commit p=&v>
-			<D:read p>
-			smp_read_barrier_depends()
-			<C:unbusy>
-			<C:commit v=2>
-			x = *q;
-			<C:read *q>	Reads from v after v updated in cache
-
-
-This sort of problem can be encountered on DEC Alpha processors as they have a
-split cache that improves performance by making better use of the data bus.
-Whilst most CPUs do imply a data dependency barrier on the read when a memory
-access depends on a read, not all do, so it may not be relied on.
-
-Other CPUs may also have split caches, but must coordinate between the various
-cachelets for normal memory accesses. The semantics of the Alpha removes the
-need for hardware coordination in the absence of memory barriers, which
-permitted Alpha to sport higher CPU clock rates back in the day. However,
-please note that (again, as of v4.15) smp_read_barrier_depends() should not
-be used except in Alpha arch-specific code and within the READ_ONCE() macro.
-
-
 CACHE COHERENCY VS DMA
 ----------------------

@@ -2975,7 +2800,7 @@
      thus cutting down on transaction setup costs (memory and PCI devices may
      both be able to do this); and

- (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
+ (*) the CPU's data cache may affect the ordering, and while cache-coherency
      mechanisms may alleviate this - once the store has actually hit the cache
      - there's no guarantee that the coherency management will be propagated in
      order to other CPUs.
@@ -3060,10 +2885,8 @@
 changes vs new data occur in the right order.

 The Alpha defines the Linux kernel's memory model, although as of v4.15
-the Linux kernel's addition of smp_read_barrier_depends() to READ_ONCE()
-greatly reduced Alpha's impact on the memory model.
-
-See the subsection on "Cache Coherency" above.
+the Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly
+reduced its impact on the memory model.


 VIRTUAL MACHINE GUESTS