| .. | .. |
|---|
| 3 | 3 | ============================ |
|---|
| 4 | 4 | |
|---|
| 5 | 5 | By: David Howells <dhowells@redhat.com> |
|---|
| 6 | | - Paul E. McKenney <paulmck@linux.vnet.ibm.com> |
|---|
| 6 | + Paul E. McKenney <paulmck@linux.ibm.com> |
|---|
| 7 | 7 | Will Deacon <will.deacon@arm.com> |
|---|
| 8 | 8 | Peter Zijlstra <peterz@infradead.org> |
|---|
| 9 | 9 | |
|---|
| .. | .. |
|---|
| 63 | 63 | |
|---|
| 64 | 64 | - Compiler barrier. |
|---|
| 65 | 65 | - CPU memory barriers. |
|---|
| 66 | | - - MMIO write barrier. |
|---|
| 67 | 66 | |
|---|
| 68 | 67 | (*) Implicit kernel memory barriers. |
|---|
| 69 | 68 | |
|---|
| .. | .. |
|---|
| 75 | 74 | (*) Inter-CPU acquiring barrier effects. |
|---|
| 76 | 75 | |
|---|
| 77 | 76 | - Acquires vs memory accesses. |
|---|
| 78 | | - - Acquires vs I/O accesses. |
|---|
| 79 | 77 | |
|---|
| 80 | 78 | (*) Where are memory barriers needed? |
|---|
| 81 | 79 | |
|---|
| .. | .. |
|---|
| 187 | 185 | =============== =============== |
|---|
| 188 | 186 | { A == 1, B == 2, C == 3, P == &A, Q == &C } |
|---|
| 189 | 187 | B = 4; Q = P; |
|---|
| 190 | | - P = &B D = *Q; |
|---|
| 188 | + P = &B; D = *Q; |
|---|
| 191 | 189 | |
|---|
| 192 | 190 | There is an obvious data dependency here, as the value loaded into D depends on |
|---|
| 193 | 191 | the address retrieved from P by CPU 2. At the end of the sequence, any of the |
|---|
| .. | .. |
|---|
| 471 | 469 | operations after the ACQUIRE operation will appear to happen after the |
|---|
| 472 | 470 | ACQUIRE operation with respect to the other components of the system. |
|---|
| 473 | 471 | ACQUIRE operations include LOCK operations and both smp_load_acquire() |
|---|
| 474 | | - and smp_cond_acquire() operations. The later builds the necessary ACQUIRE |
|---|
| 475 | | - semantics from relying on a control dependency and smp_rmb(). |
|---|
| 472 | + and smp_cond_load_acquire() operations. |
|---|
| 476 | 473 | |
|---|
| 477 | 474 | Memory operations that occur before an ACQUIRE operation may appear to |
|---|
| 478 | 475 | happen after it completes. |
|---|
| .. | .. |
|---|
| 493 | 490 | happen before it completes. |
|---|
| 494 | 491 | |
|---|
| 495 | 492 | The use of ACQUIRE and RELEASE operations generally precludes the need |
|---|
| 496 | | - for other sorts of memory barrier (but note the exceptions mentioned in |
|---|
| 497 | | - the subsection "MMIO write barrier"). In addition, a RELEASE+ACQUIRE |
|---|
| 498 | | - pair is -not- guaranteed to act as a full memory barrier. However, after |
|---|
| 499 | | - an ACQUIRE on a given variable, all memory accesses preceding any prior |
|---|
| 493 | + for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is |
|---|
| 494 | + -not- guaranteed to act as a full memory barrier. However, after an |
|---|
| 495 | + ACQUIRE on a given variable, all memory accesses preceding any prior |
|---|
| 500 | 496 | RELEASE on that same variable are guaranteed to be visible. In other |
|---|
| 501 | 497 | words, within a given variable's critical section, all accesses of all |
|---|
| 502 | 498 | previous critical sections for that variable are guaranteed to have |
|---|
| .. | .. |
|---|
| 549 | 545 | |
|---|
| 550 | 546 | [*] For information on bus mastering DMA and coherency please read: |
|---|
| 551 | 547 | |
|---|
| 552 | | - Documentation/PCI/pci.txt |
|---|
| 553 | | - Documentation/DMA-API-HOWTO.txt |
|---|
| 554 | | - Documentation/DMA-API.txt |
|---|
| 548 | + Documentation/driver-api/pci/pci.rst |
|---|
| 549 | + Documentation/core-api/dma-api-howto.rst |
|---|
| 550 | + Documentation/core-api/dma-api.rst |
|---|
| 555 | 551 | |
|---|
| 556 | 552 | |
|---|
| 557 | 553 | DATA DEPENDENCY BARRIERS (HISTORICAL) |
|---|
| 558 | 554 | ------------------------------------- |
|---|
| 559 | 555 | |
|---|
| 560 | | -As of v4.15 of the Linux kernel, an smp_read_barrier_depends() was |
|---|
| 561 | | -added to READ_ONCE(), which means that about the only people who |
|---|
| 562 | | -need to pay attention to this section are those working on DEC Alpha |
|---|
| 563 | | -architecture-specific code and those working on READ_ONCE() itself. |
|---|
| 564 | | -For those who need it, and for those who are interested in the history, |
|---|
| 565 | | -here is the story of data-dependency barriers. |
|---|
| 556 | +As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for |
|---|
| 557 | +DEC Alpha, which means that about the only people who need to pay attention |
|---|
| 558 | +to this section are those working on DEC Alpha architecture-specific code |
|---|
| 559 | +and those working on READ_ONCE() itself. For those who need it, and for |
|---|
| 560 | +those who are interested in the history, here is the story of |
|---|
| 561 | +data-dependency barriers. |
|---|
| 566 | 562 | |
|---|
| 567 | 563 | The usage requirements of data dependency barriers are a little subtle, and |
|---|
| 568 | 564 | it's not always obvious that they're needed. To illustrate, consider the |
|---|
| .. | .. |
|---|
| 573 | 569 | { A == 1, B == 2, C == 3, P == &A, Q == &C } |
|---|
| 574 | 570 | B = 4; |
|---|
| 575 | 571 | <write barrier> |
|---|
| 576 | | - WRITE_ONCE(P, &B) |
|---|
| 572 | + WRITE_ONCE(P, &B); |
|---|
| 577 | 573 | Q = READ_ONCE(P); |
|---|
| 578 | 574 | D = *Q; |
|---|
| 579 | 575 | |
|---|
| .. | .. |
|---|
| 588 | 584 | |
|---|
| 589 | 585 | (Q == &B) and (D == 2) ???? |
|---|
| 590 | 586 | |
|---|
| 591 | | -Whilst this may seem like a failure of coherency or causality maintenance, it |
|---|
| 587 | +While this may seem like a failure of coherency or causality maintenance, it |
|---|
| 592 | 588 | isn't, and this behaviour can be observed on certain real CPUs (such as the DEC |
|---|
| 593 | 589 | Alpha). |
|---|
| 594 | 590 | |
|---|
| .. | .. |
|---|
| 624 | 620 | until they are certain (1) that the write will actually happen, (2) |
|---|
| 625 | 621 | of the location of the write, and (3) of the value to be written. |
|---|
| 626 | 622 | But please carefully read the "CONTROL DEPENDENCIES" section and the |
|---|
| 627 | | -Documentation/RCU/rcu_dereference.txt file: The compiler can and does |
|---|
| 623 | +Documentation/RCU/rcu_dereference.rst file: The compiler can and does |
|---|
| 628 | 624 | break dependencies in a great many highly creative ways. |
|---|
| 629 | 625 | |
|---|
| 630 | 626 | CPU 1 CPU 2 |
|---|
| .. | .. |
|---|
| 1513 | 1509 | |
|---|
| 1514 | 1510 | (*) CPU memory barriers. |
|---|
| 1515 | 1511 | |
|---|
| 1516 | | - (*) MMIO write barrier. |
|---|
| 1517 | | - |
|---|
| 1518 | 1512 | |
|---|
| 1519 | 1513 | COMPILER BARRIER |
|---|
| 1520 | 1514 | ---------------- |
|---|
| .. | .. |
|---|
| 1727 | 1721 | and WRITE_ONCE() are more selective: With READ_ONCE() and |
|---|
| 1728 | 1722 | WRITE_ONCE(), the compiler need only forget the contents of the |
|---|
| 1729 | 1723 | indicated memory locations, while with barrier() the compiler must |
|---|
| 1730 | | - discard the value of all memory locations that it has currented |
|---|
| 1724 | + discard the value of all memory locations that it has currently |
|---|
| 1731 | 1725 | cached in any machine registers. Of course, the compiler must also |
|---|
| 1732 | 1726 | respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur, |
|---|
| 1733 | 1727 | though the CPU of course need not do so. |
|---|
| .. | .. |
|---|
| 1839 | 1833 | to issue the loads in the correct order (eg. `a[b]` would have to load |
|---|
| 1840 | 1834 | the value of b before loading a[b]), however there is no guarantee in |
|---|
| 1841 | 1835 | the C specification that the compiler may not speculate the value of b |
|---|
| 1842 | | -(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1) |
|---|
| 1836 | +(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1) |
|---|
| 1843 | 1837 | tmp = a[b]; ). There is also the problem of a compiler reloading b after |
|---|
| 1844 | 1838 | having loaded a[b], thus having a newer copy of b than a[b]. A consensus |
|---|
| 1845 | 1839 | has not yet been reached about these problems, however the READ_ONCE() |
|---|
| .. | .. |
|---|
| 1874 | 1868 | (*) smp_mb__before_atomic(); |
|---|
| 1875 | 1869 | (*) smp_mb__after_atomic(); |
|---|
| 1876 | 1870 | |
|---|
| 1877 | | - These are for use with atomic (such as add, subtract, increment and |
|---|
| 1878 | | - decrement) functions that don't return a value, especially when used for |
|---|
| 1879 | | - reference counting. These functions do not imply memory barriers. |
|---|
| 1871 | + These are for use with atomic RMW functions that do not imply memory |
|---|
| 1872 | + barriers, but where the code needs a memory barrier. Examples of atomic |
|---|
| 1873 | + RMW functions that do not imply a memory barrier are e.g. add, |
|---|
| 1874 | + subtract, (failed) conditional operations, _relaxed functions, |
|---|
| 1875 | + but not atomic_read or atomic_set. A common example where a memory |
|---|
| 1876 | + barrier may be required is when atomic ops are used for reference |
|---|
| 1877 | + counting. |
|---|
| 1880 | 1878 | |
|---|
| 1881 | | - These are also used for atomic bitop functions that do not return a |
|---|
| 1882 | | - value (such as set_bit and clear_bit). |
|---|
| 1879 | + These are also used for atomic RMW bitop functions that do not imply a |
|---|
| 1880 | + memory barrier (such as set_bit and clear_bit). |
|---|
| 1883 | 1881 | |
|---|
| 1884 | 1882 | As an example, consider a piece of code that marks an object as being dead |
|---|
| 1885 | 1883 | and then decrements the object's reference count: |
|---|
| .. | .. |
|---|
| 1934 | 1932 | here. |
|---|
| 1935 | 1933 | |
|---|
| 1936 | 1934 | See the subsection "Kernel I/O barrier effects" for more information on |
|---|
| 1937 | | - relaxed I/O accessors and the Documentation/DMA-API.txt file for more |
|---|
| 1938 | | - information on consistent memory. |
|---|
| 1935 | + relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for |
|---|
| 1936 | + more information on consistent memory. |
|---|
| 1939 | 1937 | |
|---|
| 1938 | + (*) pmem_wmb(); |
|---|
| 1940 | 1939 | |
|---|
| 1941 | | -MMIO WRITE BARRIER |
|---|
| 1942 | | ------------------- |
|---|
| 1940 | + This is for use with persistent memory to ensure that stores whose |
|---|
| 1941 | + modifications are written to persistent storage have reached a platform |
|---|
| 1942 | + durability domain. |
|---|
| 1943 | 1943 | |
|---|
| 1944 | | -The Linux kernel also has a special barrier for use with memory-mapped I/O |
|---|
| 1945 | | -writes: |
|---|
| 1944 | + For example, after a non-temporal write to a pmem region, we use pmem_wmb() |
|---|
| 1945 | + to ensure that stores have reached a platform durability domain. This ensures |
|---|
| 1946 | + that stores have updated persistent storage before any data access or |
|---|
| 1947 | + data transfer caused by subsequent instructions is initiated. This is |
|---|
| 1948 | + in addition to the ordering done by wmb(). |
|---|
| 1946 | 1949 | |
|---|
| 1947 | | - mmiowb(); |
|---|
| 1948 | | - |
|---|
| 1949 | | -This is a variation on the mandatory write barrier that causes writes to weakly |
|---|
| 1950 | | -ordered I/O regions to be partially ordered. Its effects may go beyond the |
|---|
| 1951 | | -CPU->Hardware interface and actually affect the hardware at some level. |
|---|
| 1952 | | - |
|---|
| 1953 | | -See the subsection "Acquires vs I/O accesses" for more information. |
|---|
| 1954 | | - |
|---|
| 1950 | + For loads from persistent memory, existing read memory barriers are sufficient |
|---|
| 1951 | + to ensure read ordering. |
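
 For illustration, here is a minimal, hedged sketch (the pmem_buf, record and
 doorbell register below are hypothetical and not taken from this document;
 memcpy_flushcache() is assumed to be available) of persisting a record
 before telling a device to consume it:

	memcpy_flushcache(pmem_buf, &record, sizeof(record));
	pmem_wmb();		/* the record reaches the durability domain... */
	writel(RECORD_READY, dev->doorbell);	/* ...before the device is told */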
|---|
| 1955 | 1952 | |
|---|
| 1956 | 1953 | =============================== |
|---|
| 1957 | 1954 | IMPLICIT KERNEL MEMORY BARRIERS |
|---|
| .. | .. |
|---|
| 2009 | 2006 | |
|---|
| 2010 | 2007 | Certain locking variants of the ACQUIRE operation may fail, either due to |
|---|
| 2011 | 2008 | being unable to get the lock immediately, or due to receiving an unblocked |
|---|
| 2012 | | - signal whilst asleep waiting for the lock to become available. Failed |
|---|
| 2009 | + signal while asleep waiting for the lock to become available. Failed |
|---|
| 2013 | 2010 | locks do not imply any sort of barrier. |
|---|
| 2014 | 2011 | |
|---|
| 2015 | 2012 | [!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only |
|---|
| .. | .. |
|---|
| 2318 | 2315 | *E, *F or *G following RELEASE Q |
|---|
| 2319 | 2316 | |
|---|
| 2320 | 2317 | |
|---|
| 2321 | | - |
|---|
| 2322 | | -ACQUIRES VS I/O ACCESSES |
|---|
| 2323 | | ------------------------- |
|---|
| 2324 | | - |
|---|
| 2325 | | -Under certain circumstances (especially involving NUMA), I/O accesses within |
|---|
| 2326 | | -two spinlocked sections on two different CPUs may be seen as interleaved by the |
|---|
| 2327 | | -PCI bridge, because the PCI bridge does not necessarily participate in the |
|---|
| 2328 | | -cache-coherence protocol, and is therefore incapable of issuing the required |
|---|
| 2329 | | -read memory barriers. |
|---|
| 2330 | | - |
|---|
| 2331 | | -For example: |
|---|
| 2332 | | - |
|---|
| 2333 | | - CPU 1 CPU 2 |
|---|
| 2334 | | - =============================== =============================== |
|---|
| 2335 | | - spin_lock(Q) |
|---|
| 2336 | | - writel(0, ADDR) |
|---|
| 2337 | | - writel(1, DATA); |
|---|
| 2338 | | - spin_unlock(Q); |
|---|
| 2339 | | - spin_lock(Q); |
|---|
| 2340 | | - writel(4, ADDR); |
|---|
| 2341 | | - writel(5, DATA); |
|---|
| 2342 | | - spin_unlock(Q); |
|---|
| 2343 | | - |
|---|
| 2344 | | -may be seen by the PCI bridge as follows: |
|---|
| 2345 | | - |
|---|
| 2346 | | - STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5 |
|---|
| 2347 | | - |
|---|
| 2348 | | -which would probably cause the hardware to malfunction. |
|---|
| 2349 | | - |
|---|
| 2350 | | - |
|---|
| 2351 | | -What is necessary here is to intervene with an mmiowb() before dropping the |
|---|
| 2352 | | -spinlock, for example: |
|---|
| 2353 | | - |
|---|
| 2354 | | - CPU 1 CPU 2 |
|---|
| 2355 | | - =============================== =============================== |
|---|
| 2356 | | - spin_lock(Q) |
|---|
| 2357 | | - writel(0, ADDR) |
|---|
| 2358 | | - writel(1, DATA); |
|---|
| 2359 | | - mmiowb(); |
|---|
| 2360 | | - spin_unlock(Q); |
|---|
| 2361 | | - spin_lock(Q); |
|---|
| 2362 | | - writel(4, ADDR); |
|---|
| 2363 | | - writel(5, DATA); |
|---|
| 2364 | | - mmiowb(); |
|---|
| 2365 | | - spin_unlock(Q); |
|---|
| 2366 | | - |
|---|
| 2367 | | -this will ensure that the two stores issued on CPU 1 appear at the PCI bridge |
|---|
| 2368 | | -before either of the stores issued on CPU 2. |
|---|
| 2369 | | - |
|---|
| 2370 | | - |
|---|
| 2371 | | -Furthermore, following a store by a load from the same device obviates the need |
|---|
| 2372 | | -for the mmiowb(), because the load forces the store to complete before the load |
|---|
| 2373 | | -is performed: |
|---|
| 2374 | | - |
|---|
| 2375 | | - CPU 1 CPU 2 |
|---|
| 2376 | | - =============================== =============================== |
|---|
| 2377 | | - spin_lock(Q) |
|---|
| 2378 | | - writel(0, ADDR) |
|---|
| 2379 | | - a = readl(DATA); |
|---|
| 2380 | | - spin_unlock(Q); |
|---|
| 2381 | | - spin_lock(Q); |
|---|
| 2382 | | - writel(4, ADDR); |
|---|
| 2383 | | - b = readl(DATA); |
|---|
| 2384 | | - spin_unlock(Q); |
|---|
| 2385 | | - |
|---|
| 2386 | | - |
|---|
| 2387 | | -See Documentation/driver-api/device-io.rst for more information. |
|---|
| 2388 | | - |
|---|
| 2389 | | - |
|---|
| 2390 | 2318 | ================================= |
|---|
| 2391 | 2319 | WHERE ARE MEMORY BARRIERS NEEDED? |
|---|
| 2392 | 2320 | ================================= |
|---|
| .. | .. |
|---|
| 2509 | 2437 | ATOMIC OPERATIONS |
|---|
| 2510 | 2438 | ----------------- |
|---|
| 2511 | 2439 | |
|---|
| 2512 | | -Whilst they are technically interprocessor interaction considerations, atomic |
|---|
| 2440 | +While they are technically interprocessor interaction considerations, atomic |
|---|
| 2513 | 2441 | operations are noted specially as some of them imply full memory barriers and |
|---|
| 2514 | 2442 | some don't, but they're very heavily relied on as a group throughout the |
|---|
| 2515 | 2443 | kernel. |
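
As a rough, hedged illustration (Documentation/atomic_t.txt is the
authoritative reference; the atomic_t 'v' and the int 'old' below are
illustrative), atomic RMW operations that return a value are fully ordered,
while those that return nothing imply no ordering and may need an explicit
barrier:

	atomic_inc(&v);			/* RMW, no return value: no ordering implied */
	smp_mb__after_atomic();		/* ...so add ordering explicitly if needed */

	old = atomic_inc_return(&v);	/* RMW returning a value: fully ordered */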
|---|
| .. | .. |
|---|
| 2532 | 2460 | |
|---|
| 2533 | 2461 | Inside of the Linux kernel, I/O should be done through the appropriate accessor |
|---|
| 2534 | 2462 | routines - such as inb() or writel() - which know how to make such accesses |
|---|
| 2535 | | -appropriately sequential. Whilst this, for the most part, renders the explicit |
|---|
| 2536 | | -use of memory barriers unnecessary, there are a couple of situations where they |
|---|
| 2537 | | -might be needed: |
|---|
| 2538 | | - |
|---|
| 2539 | | - (1) On some systems, I/O stores are not strongly ordered across all CPUs, and |
|---|
| 2540 | | - so for _all_ general drivers locks should be used and mmiowb() must be |
|---|
| 2541 | | - issued prior to unlocking the critical section. |
|---|
| 2542 | | - |
|---|
| 2543 | | - (2) If the accessor functions are used to refer to an I/O memory window with |
|---|
| 2544 | | - relaxed memory access properties, then _mandatory_ memory barriers are |
|---|
| 2545 | | - required to enforce ordering. |
|---|
| 2463 | +appropriately sequential. While this, for the most part, renders the explicit |
|---|
| 2464 | +use of memory barriers unnecessary, if the accessor functions are used to refer |
|---|
| 2465 | +to an I/O memory window with relaxed memory access properties, then _mandatory_ |
|---|
| 2466 | +memory barriers are required to enforce ordering. |
|---|
| 2546 | 2467 | |
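As a hedged sketch of that last point (the write-combining frame buffer
mapping and the control register below are hypothetical), a mandatory
barrier can be used to order accesses made through such a window:

	fb = ioremap_wc(fb_phys, fb_size);	/* relaxed, write-combining window */

	memcpy_toio(fb, frame, frame_size);	/* weakly ordered stores to the window */
	wmb();					/* mandatory barrier orders them... */
	writel(FB_FLIP, ctrl_reg);		/* ...before the device is kicked */
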
|---|
| 2547 | 2468 | See Documentation/driver-api/device-io.rst for more information. |
|---|
| 2548 | 2469 | |
|---|
| .. | .. |
|---|
| 2556 | 2477 | |
|---|
| 2557 | 2478 | This may be alleviated - at least in part - by disabling local interrupts (a |
|---|
| 2558 | 2479 | form of locking), such that the critical operations are all contained within |
|---|
| 2559 | | -the interrupt-disabled section in the driver. Whilst the driver's interrupt |
|---|
| 2480 | +the interrupt-disabled section in the driver. While the driver's interrupt |
|---|
| 2560 | 2481 | routine is executing, the driver's core may not run on the same CPU, and its |
|---|
| 2561 | 2482 | interrupt is not permitted to happen again until the current interrupt has been |
|---|
| 2562 | 2483 | handled, thus the interrupt handler does not need to lock against that. |
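
A minimal sketch, assuming a hypothetical device with two MMIO registers
that are also touched by its interrupt handler (this only excludes the
local CPU's interrupts, matching the "at least in part" caveat above):

	unsigned long flags;

	local_irq_save(flags);		/* the handler cannot run on this CPU... */
	writew(val, dev->data_reg);	/* ...so these two accesses cannot be */
	writew(CMD_GO, dev->cmd_reg);	/* interleaved with those in the handler */
	local_irq_restore(flags);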
|---|
| .. | .. |
|---|
| 2587 | 2508 | |
|---|
| 2588 | 2509 | Normally this won't be a problem because the I/O accesses done inside such |
|---|
| 2589 | 2510 | sections will include synchronous load operations on strictly ordered I/O |
|---|
| 2590 | | -registers that form implicit I/O barriers. If this isn't sufficient then an |
|---|
| 2591 | | -mmiowb() may need to be used explicitly. |
|---|
| 2511 | +registers that form implicit I/O barriers. |
|---|
| 2592 | 2512 | |
|---|
| 2593 | 2513 | |
|---|
| 2594 | 2514 | A similar situation may occur between an interrupt routine and two routines |
|---|
| .. | .. |
|---|
| 2600 | 2520 | KERNEL I/O BARRIER EFFECTS |
|---|
| 2601 | 2521 | ========================== |
|---|
| 2602 | 2522 | |
|---|
| 2603 | | -When accessing I/O memory, drivers should use the appropriate accessor |
|---|
| 2604 | | -functions: |
|---|
| 2605 | | - |
|---|
| 2606 | | - (*) inX(), outX(): |
|---|
| 2607 | | - |
|---|
| 2608 | | - These are intended to talk to I/O space rather than memory space, but |
|---|
| 2609 | | - that's primarily a CPU-specific concept. The i386 and x86_64 processors |
|---|
| 2610 | | - do indeed have special I/O space access cycles and instructions, but many |
|---|
| 2611 | | - CPUs don't have such a concept. |
|---|
| 2612 | | - |
|---|
| 2613 | | - The PCI bus, amongst others, defines an I/O space concept which - on such |
|---|
| 2614 | | - CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O |
|---|
| 2615 | | - space. However, it may also be mapped as a virtual I/O space in the CPU's |
|---|
| 2616 | | - memory map, particularly on those CPUs that don't support alternate I/O |
|---|
| 2617 | | - spaces. |
|---|
| 2618 | | - |
|---|
| 2619 | | - Accesses to this space may be fully synchronous (as on i386), but |
|---|
| 2620 | | - intermediary bridges (such as the PCI host bridge) may not fully honour |
|---|
| 2621 | | - that. |
|---|
| 2622 | | - |
|---|
| 2623 | | - They are guaranteed to be fully ordered with respect to each other. |
|---|
| 2624 | | - |
|---|
| 2625 | | - They are not guaranteed to be fully ordered with respect to other types of |
|---|
| 2626 | | - memory and I/O operation. |
|---|
| 2523 | +Interfacing with peripherals via I/O accesses is deeply architecture and device |
|---|
| 2524 | +specific. Therefore, drivers which are inherently non-portable may rely on |
|---|
| 2525 | +specific behaviours of their target systems in order to achieve synchronization |
|---|
| 2526 | +in the most lightweight manner possible. For drivers intending to be portable |
|---|
| 2527 | +between multiple architectures and bus implementations, the kernel offers a |
|---|
| 2528 | +series of accessor functions that provide various degrees of ordering |
|---|
| 2529 | +guarantees: |
|---|
| 2627 | 2530 | |
|---|
| 2628 | 2531 | (*) readX(), writeX(): |
|---|
| 2629 | 2532 | |
|---|
| 2630 | | - Whether these are guaranteed to be fully ordered and uncombined with |
|---|
| 2631 | | - respect to each other on the issuing CPU depends on the characteristics |
|---|
| 2632 | | - defined for the memory window through which they're accessing. On later |
|---|
| 2633 | | - i386 architecture machines, for example, this is controlled by way of the |
|---|
| 2634 | | - MTRR registers. |
|---|
| 2533 | + The readX() and writeX() MMIO accessors take a pointer to the |
|---|
| 2534 | + peripheral being accessed as an __iomem * parameter. For pointers |
|---|
| 2535 | + mapped with the default I/O attributes (e.g. those returned by |
|---|
| 2536 | + ioremap()), the ordering guarantees are as follows: |
|---|
| 2635 | 2537 | |
|---|
| 2636 | | - Ordinarily, these will be guaranteed to be fully ordered and uncombined, |
|---|
| 2637 | | - provided they're not accessing a prefetchable device. |
|---|
| 2538 | + 1. All readX() and writeX() accesses to the same peripheral are ordered |
|---|
| 2539 | + with respect to each other. This ensures that MMIO register accesses |
|---|
| 2540 | + by the same CPU thread to a particular device will arrive in program |
|---|
| 2541 | + order. |
|---|
| 2638 | 2542 | |
|---|
| 2639 | | - However, intermediary hardware (such as a PCI bridge) may indulge in |
|---|
| 2640 | | - deferral if it so wishes; to flush a store, a load from the same location |
|---|
| 2641 | | - is preferred[*], but a load from the same device or from configuration |
|---|
| 2642 | | - space should suffice for PCI. |
|---|
| 2543 | + 2. A writeX() issued by a CPU thread holding a spinlock is ordered |
|---|
| 2544 | + before a writeX() to the same peripheral from another CPU thread |
|---|
| 2545 | + issued after a later acquisition of the same spinlock. This ensures |
|---|
| 2546 | + that MMIO register writes to a particular device issued while holding |
|---|
| 2547 | + a spinlock will arrive in an order consistent with acquisitions of |
|---|
| 2548 | + the lock. |
|---|
| 2643 | 2549 | |
|---|
| 2644 | | - [*] NOTE! attempting to load from the same location as was written to may |
|---|
| 2645 | | - cause a malfunction - consider the 16550 Rx/Tx serial registers for |
|---|
| 2646 | | - example. |
|---|
| 2550 | + 3. A writeX() by a CPU thread to the peripheral will first wait for the |
|---|
| 2551 | + completion of all prior writes to memory either issued by, or |
|---|
| 2552 | + propagated to, the same thread. This ensures that writes by the CPU |
|---|
| 2553 | + to an outbound DMA buffer allocated by dma_alloc_coherent() will be |
|---|
| 2554 | + visible to a DMA engine when the CPU writes to its MMIO control |
|---|
| 2555 | + register to trigger the transfer (see the sketch at the end of this section). |
|---|
| 2647 | 2556 | |
|---|
| 2648 | | - Used with prefetchable I/O memory, an mmiowb() barrier may be required to |
|---|
| 2649 | | - force stores to be ordered. |
|---|
| 2557 | + 4. A readX() by a CPU thread from the peripheral will complete before |
|---|
| 2558 | + any subsequent reads from memory by the same thread can begin. This |
|---|
| 2559 | + ensures that reads by the CPU from an incoming DMA buffer allocated |
|---|
| 2560 | + by dma_alloc_coherent() will not see stale data after reading from |
|---|
| 2561 | + the DMA engine's MMIO status register to establish that the DMA |
|---|
| 2562 | + transfer has completed. |
|---|
| 2650 | 2563 | |
|---|
| 2651 | | - Please refer to the PCI specification for more information on interactions |
|---|
| 2652 | | - between PCI transactions. |
|---|
| 2564 | + 5. A readX() by a CPU thread from the peripheral will complete before |
|---|
| 2565 | + any subsequent delay() loop can begin execution on the same thread. |
|---|
| 2566 | + This ensures that two MMIO register writes by the CPU to a peripheral |
|---|
| 2567 | + will arrive at least 1us apart if the first write is immediately read |
|---|
| 2568 | + back with readX() and udelay(1) is called prior to the second |
|---|
| 2569 | + writeX(): |
|---|
| 2653 | 2570 | |
|---|
| 2654 | | - (*) readX_relaxed(), writeX_relaxed() |
|---|
| 2571 | + writel(42, DEVICE_REGISTER_0); // Arrives at the device... |
|---|
| 2572 | + readl(DEVICE_REGISTER_0); |
|---|
| 2573 | + udelay(1); |
|---|
| 2574 | + writel(42, DEVICE_REGISTER_1); // ...at least 1us before this. |
|---|
| 2655 | 2575 | |
|---|
| 2656 | | - These are similar to readX() and writeX(), but provide weaker memory |
|---|
| 2657 | | - ordering guarantees. Specifically, they do not guarantee ordering with |
|---|
| 2658 | | - respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee |
|---|
| 2659 | | - ordering with respect to LOCK or UNLOCK operations. If the latter is |
|---|
| 2660 | | - required, an mmiowb() barrier can be used. Note that relaxed accesses to |
|---|
| 2661 | | - the same peripheral are guaranteed to be ordered with respect to each |
|---|
| 2662 | | - other. |
|---|
| 2576 | + The ordering properties of __iomem pointers obtained with non-default |
|---|
| 2577 | + attributes (e.g. those returned by ioremap_wc()) are specific to the |
|---|
| 2578 | + underlying architecture and therefore the guarantees listed above cannot |
|---|
| 2579 | + generally be relied upon for accesses to these types of mappings. |
|---|
| 2663 | 2580 | |
|---|
| 2664 | | - (*) ioreadX(), iowriteX() |
|---|
| 2581 | + (*) readX_relaxed(), writeX_relaxed(): |
|---|
| 2665 | 2582 | |
|---|
| 2666 | | - These will perform appropriately for the type of access they're actually |
|---|
| 2667 | | - doing, be it inX()/outX() or readX()/writeX(). |
|---|
| 2583 | + These are similar to readX() and writeX(), but provide weaker memory |
|---|
| 2584 | + ordering guarantees. Specifically, they do not guarantee ordering with |
|---|
| 2585 | + respect to locking, normal memory accesses or delay() loops (i.e. |
|---|
| 2586 | + bullets 2-5 above) but they are still guaranteed to be ordered with |
|---|
| 2587 | + respect to other accesses from the same CPU thread to the same |
|---|
| 2588 | + peripheral when operating on __iomem pointers mapped with the default |
|---|
| 2589 | + I/O attributes. |
|---|
| 2590 | + |
|---|
| 2591 | + (*) readsX(), writesX(): |
|---|
| 2592 | + |
|---|
| 2593 | + The readsX() and writesX() MMIO accessors are designed for accessing |
|---|
| 2594 | + register-based, memory-mapped FIFOs residing on peripherals that are not |
|---|
| 2595 | + capable of performing DMA. Consequently, they provide only the ordering |
|---|
| 2596 | + guarantees of readX_relaxed() and writeX_relaxed(), as documented above. |
|---|
| 2597 | + |
|---|
| 2598 | + (*) inX(), outX(): |
|---|
| 2599 | + |
|---|
| 2600 | + The inX() and outX() accessors are intended to access legacy port-mapped |
|---|
| 2601 | + I/O peripherals, which may require special instructions on some |
|---|
| 2602 | + architectures (notably x86). The port number of the peripheral being |
|---|
| 2603 | + accessed is passed as an argument. |
|---|
| 2604 | + |
|---|
| 2605 | + Since many CPU architectures ultimately access these peripherals via an |
|---|
| 2606 | + internal virtual memory mapping, the portable ordering guarantees |
|---|
| 2607 | + provided by inX() and outX() are the same as those provided by readX() |
|---|
| 2608 | + and writeX() respectively when accessing a mapping with the default I/O |
|---|
| 2609 | + attributes. |
|---|
| 2610 | + |
|---|
| 2611 | + Device drivers may expect outX() to emit a non-posted write transaction |
|---|
| 2612 | + that waits for a completion response from the I/O peripheral before |
|---|
| 2613 | + returning. This is not guaranteed by all architectures and is therefore |
|---|
| 2614 | + not part of the portable ordering semantics. |
|---|
| 2615 | + |
|---|
| 2616 | + (*) insX(), outsX(): |
|---|
| 2617 | + |
|---|
| 2618 | + As above, the insX() and outsX() accessors provide the same ordering |
|---|
| 2619 | + guarantees as readsX() and writesX() respectively when accessing a |
|---|
| 2620 | + mapping with the default I/O attributes. |
|---|
| 2621 | + |
|---|
| 2622 | + (*) ioreadX(), iowriteX(): |
|---|
| 2623 | + |
|---|
| 2624 | + These will perform appropriately for the type of access they're actually |
|---|
| 2625 | + doing, be it inX()/outX() or readX()/writeX(). |
|---|
| 2626 | + |
|---|
| 2627 | +With the exception of the string accessors (insX(), outsX(), readsX() and |
|---|
| 2628 | +writesX()), all of the above assume that the underlying peripheral is |
|---|
| 2629 | +little-endian and will therefore perform byte-swapping operations on big-endian |
|---|
| 2630 | +architectures. |
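
As a hedged sketch of guarantee 3 of readX()/writeX() above (the device,
descriptor layout and DOORBELL register offset below are hypothetical):

	struct my_desc *desc;
	dma_addr_t desc_dma;

	desc = dma_alloc_coherent(dev, sizeof(*desc), &desc_dma, GFP_KERNEL);

	desc->addr = cpu_to_le64(buf_dma);	/* plain stores to coherent memory */
	desc->len  = cpu_to_le32(buf_len);
	writel(RING_DOORBELL, base + DOORBELL);	/* ordered after the stores above,
						   so the DMA engine observes a
						   complete descriptor */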
|---|
| 2668 | 2631 | |
|---|
| 2669 | 2632 | |
|---|
| 2670 | 2633 | ======================================== |
|---|
| .. | .. |
|---|
| 2759 | 2722 | the use of any special device communication instructions the CPU may have. |
|---|
| 2760 | 2723 | |
|---|
| 2761 | 2724 | |
|---|
| 2762 | | -CACHE COHERENCY |
|---|
| 2763 | | ---------------- |
|---|
| 2764 | | - |
|---|
| 2765 | | -Life isn't quite as simple as it may appear above, however: for while the |
|---|
| 2766 | | -caches are expected to be coherent, there's no guarantee that that coherency |
|---|
| 2767 | | -will be ordered. This means that whilst changes made on one CPU will |
|---|
| 2768 | | -eventually become visible on all CPUs, there's no guarantee that they will |
|---|
| 2769 | | -become apparent in the same order on those other CPUs. |
|---|
| 2770 | | - |
|---|
| 2771 | | - |
|---|
| 2772 | | -Consider dealing with a system that has a pair of CPUs (1 & 2), each of which |
|---|
| 2773 | | -has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): |
|---|
| 2774 | | - |
|---|
| 2775 | | - : |
|---|
| 2776 | | - : +--------+ |
|---|
| 2777 | | - : +---------+ | | |
|---|
| 2778 | | - +--------+ : +--->| Cache A |<------->| | |
|---|
| 2779 | | - | | : | +---------+ | | |
|---|
| 2780 | | - | CPU 1 |<---+ | | |
|---|
| 2781 | | - | | : | +---------+ | | |
|---|
| 2782 | | - +--------+ : +--->| Cache B |<------->| | |
|---|
| 2783 | | - : +---------+ | | |
|---|
| 2784 | | - : | Memory | |
|---|
| 2785 | | - : +---------+ | System | |
|---|
| 2786 | | - +--------+ : +--->| Cache C |<------->| | |
|---|
| 2787 | | - | | : | +---------+ | | |
|---|
| 2788 | | - | CPU 2 |<---+ | | |
|---|
| 2789 | | - | | : | +---------+ | | |
|---|
| 2790 | | - +--------+ : +--->| Cache D |<------->| | |
|---|
| 2791 | | - : +---------+ | | |
|---|
| 2792 | | - : +--------+ |
|---|
| 2793 | | - : |
|---|
| 2794 | | - |
|---|
| 2795 | | -Imagine the system has the following properties: |
|---|
| 2796 | | - |
|---|
| 2797 | | - (*) an odd-numbered cache line may be in cache A, cache C or it may still be |
|---|
| 2798 | | - resident in memory; |
|---|
| 2799 | | - |
|---|
| 2800 | | - (*) an even-numbered cache line may be in cache B, cache D or it may still be |
|---|
| 2801 | | - resident in memory; |
|---|
| 2802 | | - |
|---|
| 2803 | | - (*) whilst the CPU core is interrogating one cache, the other cache may be |
|---|
| 2804 | | - making use of the bus to access the rest of the system - perhaps to |
|---|
| 2805 | | - displace a dirty cacheline or to do a speculative load; |
|---|
| 2806 | | - |
|---|
| 2807 | | - (*) each cache has a queue of operations that need to be applied to that cache |
|---|
| 2808 | | - to maintain coherency with the rest of the system; |
|---|
| 2809 | | - |
|---|
| 2810 | | - (*) the coherency queue is not flushed by normal loads to lines already |
|---|
| 2811 | | - present in the cache, even though the contents of the queue may |
|---|
| 2812 | | - potentially affect those loads. |
|---|
| 2813 | | - |
|---|
| 2814 | | -Imagine, then, that two writes are made on the first CPU, with a write barrier |
|---|
| 2815 | | -between them to guarantee that they will appear to reach that CPU's caches in |
|---|
| 2816 | | -the requisite order: |
|---|
| 2817 | | - |
|---|
| 2818 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2819 | | - =============== =============== ======================================= |
|---|
| 2820 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2821 | | - v = 2; |
|---|
| 2822 | | - smp_wmb(); Make sure change to v is visible before |
|---|
| 2823 | | - change to p |
|---|
| 2824 | | - <A:modify v=2> v is now in cache A exclusively |
|---|
| 2825 | | - p = &v; |
|---|
| 2826 | | - <B:modify p=&v> p is now in cache B exclusively |
|---|
| 2827 | | - |
|---|
| 2828 | | -The write memory barrier forces the other CPUs in the system to perceive that |
|---|
| 2829 | | -the local CPU's caches have apparently been updated in the correct order. But |
|---|
| 2830 | | -now imagine that the second CPU wants to read those values: |
|---|
| 2831 | | - |
|---|
| 2832 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2833 | | - =============== =============== ======================================= |
|---|
| 2834 | | - ... |
|---|
| 2835 | | - q = p; |
|---|
| 2836 | | - x = *q; |
|---|
| 2837 | | - |
|---|
| 2838 | | -The above pair of reads may then fail to happen in the expected order, as the |
|---|
| 2839 | | -cacheline holding p may get updated in one of the second CPU's caches whilst |
|---|
| 2840 | | -the update to the cacheline holding v is delayed in the other of the second |
|---|
| 2841 | | -CPU's caches by some other cache event: |
|---|
| 2842 | | - |
|---|
| 2843 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2844 | | - =============== =============== ======================================= |
|---|
| 2845 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2846 | | - v = 2; |
|---|
| 2847 | | - smp_wmb(); |
|---|
| 2848 | | - <A:modify v=2> <C:busy> |
|---|
| 2849 | | - <C:queue v=2> |
|---|
| 2850 | | - p = &v; q = p; |
|---|
| 2851 | | - <D:request p> |
|---|
| 2852 | | - <B:modify p=&v> <D:commit p=&v> |
|---|
| 2853 | | - <D:read p> |
|---|
| 2854 | | - x = *q; |
|---|
| 2855 | | - <C:read *q> Reads from v before v updated in cache |
|---|
| 2856 | | - <C:unbusy> |
|---|
| 2857 | | - <C:commit v=2> |
|---|
| 2858 | | - |
|---|
| 2859 | | -Basically, whilst both cachelines will be updated on CPU 2 eventually, there's |
|---|
| 2860 | | -no guarantee that, without intervention, the order of update will be the same |
|---|
| 2861 | | -as that committed on CPU 1. |
|---|
| 2862 | | - |
|---|
| 2863 | | - |
|---|
| 2864 | | -To intervene, we need to interpolate a data dependency barrier or a read |
|---|
| 2865 | | -barrier between the loads (which as of v4.15 is supplied unconditionally |
|---|
| 2866 | | -by the READ_ONCE() macro). This will force the cache to commit its |
|---|
| 2867 | | -coherency queue before processing any further requests: |
|---|
| 2868 | | - |
|---|
| 2869 | | - CPU 1 CPU 2 COMMENT |
|---|
| 2870 | | - =============== =============== ======================================= |
|---|
| 2871 | | - u == 0, v == 1 and p == &u, q == &u |
|---|
| 2872 | | - v = 2; |
|---|
| 2873 | | - smp_wmb(); |
|---|
| 2874 | | - <A:modify v=2> <C:busy> |
|---|
| 2875 | | - <C:queue v=2> |
|---|
| 2876 | | - p = &v; q = p; |
|---|
| 2877 | | - <D:request p> |
|---|
| 2878 | | - <B:modify p=&v> <D:commit p=&v> |
|---|
| 2879 | | - <D:read p> |
|---|
| 2880 | | - smp_read_barrier_depends() |
|---|
| 2881 | | - <C:unbusy> |
|---|
| 2882 | | - <C:commit v=2> |
|---|
| 2883 | | - x = *q; |
|---|
| 2884 | | - <C:read *q> Reads from v after v updated in cache |
|---|
| 2885 | | - |
|---|
| 2886 | | - |
|---|
| 2887 | | -This sort of problem can be encountered on DEC Alpha processors as they have a |
|---|
| 2888 | | -split cache that improves performance by making better use of the data bus. |
|---|
| 2889 | | -Whilst most CPUs do imply a data dependency barrier on the read when a memory |
|---|
| 2890 | | -access depends on a read, not all do, so it may not be relied on. |
|---|
| 2891 | | - |
|---|
| 2892 | | -Other CPUs may also have split caches, but must coordinate between the various |
|---|
| 2893 | | -cachelets for normal memory accesses. The semantics of the Alpha removes the |
|---|
| 2894 | | -need for hardware coordination in the absence of memory barriers, which |
|---|
| 2895 | | -permitted Alpha to sport higher CPU clock rates back in the day. However, |
|---|
| 2896 | | -please note that (again, as of v4.15) smp_read_barrier_depends() should not |
|---|
| 2897 | | -be used except in Alpha arch-specific code and within the READ_ONCE() macro. |
|---|
| 2898 | | - |
|---|
| 2899 | | - |
|---|
| 2900 | 2725 | CACHE COHERENCY VS DMA |
|---|
| 2901 | 2726 | ---------------------- |
|---|
| 2902 | 2727 | |
|---|
| .. | .. |
|---|
| 2975 | 2800 | thus cutting down on transaction setup costs (memory and PCI devices may |
|---|
| 2976 | 2801 | both be able to do this); and |
|---|
| 2977 | 2802 | |
|---|
| 2978 | | - (*) the CPU's data cache may affect the ordering, and whilst cache-coherency |
|---|
| 2803 | + (*) the CPU's data cache may affect the ordering, and while cache-coherency |
|---|
| 2979 | 2804 | mechanisms may alleviate this - once the store has actually hit the cache |
|---|
| 2980 | 2805 | - there's no guarantee that the coherency management will be propagated in |
|---|
| 2981 | 2806 | order to other CPUs. |
|---|
| .. | .. |
|---|
| 3060 | 2885 | changes vs new data occur in the right order. |
|---|
| 3061 | 2886 | |
|---|
| 3062 | 2887 | The Alpha defines the Linux kernel's memory model, although as of v4.15 |
|---|
| 3063 | | -the Linux kernel's addition of smp_read_barrier_depends() to READ_ONCE() |
|---|
| 3064 | | -greatly reduced Alpha's impact on the memory model. |
|---|
| 3065 | | - |
|---|
| 3066 | | -See the subsection on "Cache Coherency" above. |
|---|
| 2888 | +the Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly |
|---|
| 2889 | +reduced its impact on the memory model. |
|---|
| 3067 | 2890 | |
|---|
| 3068 | 2891 | |
|---|
| 3069 | 2892 | VIRTUAL MACHINE GUESTS |
|---|