.. | .. |
---|
27 | 27 | 19. AND THEN THERE WAS ALPHA |
---|
28 | 28 | 20. THE HAPPENS-BEFORE RELATION: hb |
---|
29 | 29 | 21. THE PROPAGATES-BEFORE RELATION: pb |
---|
30 | | - 22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb |
---|
31 | | - 23. ODDS AND ENDS |
---|
| 30 | + 22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb |
---|
| 31 | + 23. LOCKING |
---|
| 32 | + 24. PLAIN ACCESSES AND DATA RACES |
---|
| 33 | + 25. ODDS AND ENDS |
---|
32 | 34 | |
---|
33 | 35 | |
---|
34 | 36 | |
---|
.. | .. |
---|
204 | 206 | P0 stores 1 to buf before storing 1 to flag, since it executes |
---|
205 | 207 | its instructions in order. |
---|
206 | 208 | |
---|
207 | | - Since an instruction (in this case, P1's store to flag) cannot |
---|
| 209 | + Since an instruction (in this case, P0's store to flag) cannot |
---|
208 | 210 | execute before itself, the specified outcome is impossible. |
---|
209 | 211 | |
---|
210 | 212 | However, real computer hardware almost never follows the Sequential |
---|
.. | .. |
---|
353 | 355 | Optimizing compilers have great freedom in the way they translate |
---|
354 | 356 | source code to object code. They are allowed to apply transformations |
---|
355 | 357 | that add memory accesses, eliminate accesses, combine them, split them |
---|
356 | | -into pieces, or move them around. Faced with all these possibilities, |
---|
357 | | -the LKMM basically gives up. It insists that the code it analyzes |
---|
358 | | -must contain no ordinary accesses to shared memory; all accesses must |
---|
359 | | -be performed using READ_ONCE(), WRITE_ONCE(), or one of the other |
---|
360 | | -atomic or synchronization primitives. These primitives prevent a |
---|
361 | | -large number of compiler optimizations. In particular, it is |
---|
362 | | -guaranteed that the compiler will not remove such accesses from the |
---|
363 | | -generated code (unless it can prove the accesses will never be |
---|
364 | | -executed), it will not change the order in which they occur in the |
---|
365 | | -code (within limits imposed by the C standard), and it will not |
---|
366 | | -introduce extraneous accesses. |
---|
| 358 | +into pieces, or move them around. The use of READ_ONCE(), WRITE_ONCE(), |
---|
| 359 | +or one of the other atomic or synchronization primitives prevents a |
---|
| 360 | +large number of compiler optimizations. In particular, it is guaranteed |
---|
| 361 | +that the compiler will not remove such accesses from the generated code |
---|
| 362 | +(unless it can prove the accesses will never be executed), it will not |
---|
| 363 | +change the order in which they occur in the code (within limits imposed |
---|
| 364 | +by the C standard), and it will not introduce extraneous accesses. |
---|
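| | + |
---|
| | +As a minimal sketch of the difference (with flag standing for a shared |
---|
| | +variable), a plain read in a busy-wait loop may be hoisted out of the |
---|
| | +loop by the compiler, so that flag is loaded only once, whereas |
---|
| | +READ_ONCE() forces a fresh load on every iteration: |
---|
| | + |
---|
| | +	while (!flag)			/* plain: may spin forever */ |
---|
| | +		; |
---|
| | + |
---|
| | +	while (!READ_ONCE(flag))	/* marked: reloaded each time */ |
---|
| | +		; |
---|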
367 | 365 | |
---|
368 | | -This explains why the MP and SB examples above used READ_ONCE() and |
---|
369 | | -WRITE_ONCE() rather than ordinary memory accesses. Thanks to this |
---|
370 | | -usage, we can be certain that in the MP example, P0's write event to |
---|
371 | | -buf really is po-before its write event to flag, and similarly for the |
---|
372 | | -other shared memory accesses in the examples. |
---|
| 366 | +The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather |
---|
| 367 | +than ordinary memory accesses. Thanks to this usage, we can be certain |
---|
| 368 | +that in the MP example, the compiler won't reorder P0's write event to |
---|
| 369 | +buf and P0's write event to flag, and similarly for the other shared |
---|
| 370 | +memory accesses in the examples. |
---|
373 | 371 | |
---|
374 | | -Private variables are not subject to this restriction. Since they are |
---|
375 | | -not shared between CPUs, they can be accessed normally without |
---|
376 | | -READ_ONCE() or WRITE_ONCE(), and there will be no ill effects. In |
---|
377 | | -fact, they need not even be stored in normal memory at all -- in |
---|
378 | | -principle a private variable could be stored in a CPU register (hence |
---|
379 | | -the convention that these variables have names starting with the |
---|
380 | | -letter 'r'). |
---|
| 372 | +Since private variables are not shared between CPUs, they can be |
---|
| 373 | +accessed normally without READ_ONCE() or WRITE_ONCE(). In fact, they |
---|
| 374 | +need not even be stored in normal memory at all -- in principle a |
---|
| 375 | +private variable could be stored in a CPU register (hence the convention |
---|
| 376 | +that these variables have names starting with the letter 'r'). |
---|
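| | + |
---|
| | +For instance, in a sketch like this (with x standing for a shared |
---|
| | +variable), only the access to x needs to be marked: |
---|
| | + |
---|
| | +	int r1;			/* private: no marking needed */ |
---|
| | + |
---|
| | +	r1 = READ_ONCE(x);	/* shared x gets a marked access */ |
---|
| | +	r1 = r1 + 1;		/* plain accesses to r1 are harmless */ |
---|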
381 | 377 | |
---|
382 | 378 | |
---|
383 | 379 | A WARNING |
---|
.. | .. |
---|
423 | 419 | |
---|
424 | 420 | The object code might call f(5) either before or after g(6); the |
---|
425 | 421 | memory model cannot assume there is a fixed program order relation |
---|
426 | | -between them. (In fact, if the functions are inlined then the |
---|
| 422 | +between them. (In fact, if the function calls are inlined then the |
---|
427 | 423 | compiler might even interleave their object code.) |
---|
428 | 424 | |
---|
429 | 425 | |
---|
.. | .. |
---|
503 | 499 | |
---|
504 | 500 | For our purposes, a memory location's initial value is treated as |
---|
505 | 501 | though it had been written there by an imaginary initial store that |
---|
506 | | -executes on a separate CPU before the program runs. |
---|
| 502 | +executes on a separate CPU before the main program runs. |
---|
507 | 503 | |
---|
508 | 504 | Usage of the rf relation implicitly assumes that loads will always |
---|
509 | 505 | read from a single store. It doesn't apply properly in the presence |
---|
.. | .. |
---|
861 | 857 | maintaining cache coherence and the fact that a CPU can't operate on a |
---|
862 | 858 | value before it knows what that value is, among other things. |
---|
863 | 859 | |
---|
864 | | -The formal version of the LKMM is defined by five requirements, or |
---|
| 860 | +The formal version of the LKMM is defined by six requirements, or |
---|
865 | 861 | axioms: |
---|
866 | 862 | |
---|
867 | 863 | Sequential consistency per variable: This requires that the |
---|
.. | .. |
---|
881 | 877 | grace periods obey the rules of RCU, in particular, the |
---|
882 | 878 | Grace-Period Guarantee. |
---|
883 | 879 | |
---|
| 880 | + Plain-coherence: This requires that plain memory accesses |
---|
| 881 | + (those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey |
---|
| 882 | + the operational model's rules regarding cache coherence. |
---|
| 883 | + |
---|
884 | 884 | The first and second are quite common; they can be found in many |
---|
885 | 885 | memory models (such as those for C11/C++11). The "happens-before" and |
---|
886 | 886 | "propagation" axioms have analogs in other memory models as well. The |
---|
887 | | -"rcu" axiom is specific to the LKMM. |
---|
| 887 | +"rcu" and "plain-coherence" axioms are specific to the LKMM. |
---|
888 | 888 | |
---|
889 | 889 | Each of these axioms is discussed below. |
---|
890 | 890 | |
---|
.. | .. |
---|
959 | 959 | THE PRESERVED PROGRAM ORDER RELATION: ppo |
---|
960 | 960 | ----------------------------------------- |
---|
961 | 961 | |
---|
962 | | -There are many situations where a CPU is obligated to execute two |
---|
| 962 | +There are many situations where a CPU is obliged to execute two |
---|
963 | 963 | instructions in program order. We amalgamate them into the ppo (for |
---|
964 | 964 | "preserved program order") relation, which links the po-earlier |
---|
965 | 965 | instruction to the po-later instruction and is thus a sub-relation of |
---|
.. | .. |
---|
1067 | 1067 | violating the write-write coherence rule by requiring the CPU not to |
---|
1068 | 1068 | send the W write to the memory subsystem at all!) |
---|
1069 | 1069 | |
---|
1070 | | -There is one last example of preserved program order in the LKMM: when |
---|
1071 | | -a load-acquire reads from an earlier store-release. For example: |
---|
1072 | | - |
---|
1073 | | - smp_store_release(&x, 123); |
---|
1074 | | - r1 = smp_load_acquire(&x); |
---|
1075 | | - |
---|
1076 | | -If the smp_load_acquire() ends up obtaining the 123 value that was |
---|
1077 | | -stored by the smp_store_release(), the LKMM says that the load must be |
---|
1078 | | -executed after the store; the store cannot be forwarded to the load. |
---|
1079 | | -This requirement does not arise from the operational model, but it |
---|
1080 | | -yields correct predictions on all architectures supported by the Linux |
---|
1081 | | -kernel, although for differing reasons. |
---|
1082 | | - |
---|
1083 | | -On some architectures, including x86 and ARMv8, it is true that the |
---|
1084 | | -store cannot be forwarded to the load. On others, including PowerPC |
---|
1085 | | -and ARMv7, smp_store_release() generates object code that starts with |
---|
1086 | | -a fence and smp_load_acquire() generates object code that ends with a |
---|
1087 | | -fence. The upshot is that even though the store may be forwarded to |
---|
1088 | | -the load, it is still true that any instruction preceding the store |
---|
1089 | | -will be executed before the load or any following instructions, and |
---|
1090 | | -the store will be executed before any instruction following the load. |
---|
1091 | | - |
---|
1092 | 1070 | |
---|
1093 | 1071 | AND THEN THERE WAS ALPHA |
---|
1094 | 1072 | ------------------------ |
---|
.. | .. |
---|
1144 | 1122 | In practice, this difficulty is solved by inserting a special fence |
---|
1145 | 1123 | between P1's two loads when the kernel is compiled for the Alpha |
---|
1146 | 1124 | architecture. In fact, as of version 4.15, the kernel automatically |
---|
1147 | | -adds this fence (called smp_read_barrier_depends() and defined as |
---|
1148 | | -nothing at all on non-Alpha builds) after every READ_ONCE() and atomic |
---|
1149 | | -load. The effect of the fence is to cause the CPU not to execute any |
---|
1150 | | -po-later instructions until after the local cache has finished |
---|
1151 | | -processing all the stores it has already received. Thus, if the code |
---|
1152 | | -was changed to: |
---|
| 1125 | +adds this fence after every READ_ONCE() and atomic load on Alpha. The |
---|
| 1126 | +effect of the fence is to cause the CPU not to execute any po-later |
---|
| 1127 | +instructions until after the local cache has finished processing all |
---|
| 1128 | +the stores it has already received. Thus, if the code was changed to: |
---|
1153 | 1129 | |
---|
1154 | 1130 | P1() |
---|
1155 | 1131 | { |
---|
.. | .. |
---|
1168 | 1144 | directly. |
---|
1169 | 1145 | |
---|
1170 | 1146 | The LKMM requires that smp_rmb(), acquire fences, and strong fences |
---|
1171 | | -share this property with smp_read_barrier_depends(): They do not allow |
---|
1172 | | -the CPU to execute any po-later instructions (or po-later loads in the |
---|
1173 | | -case of smp_rmb()) until all outstanding stores have been processed by |
---|
1174 | | -the local cache. In the case of a strong fence, the CPU first has to |
---|
1175 | | -wait for all of its po-earlier stores to propagate to every other CPU |
---|
1176 | | -in the system; then it has to wait for the local cache to process all |
---|
1177 | | -the stores received as of that time -- not just the stores received |
---|
1178 | | -when the strong fence began. |
---|
| 1147 | +share this property: They do not allow the CPU to execute any po-later |
---|
| 1148 | +instructions (or po-later loads in the case of smp_rmb()) until all |
---|
| 1149 | +outstanding stores have been processed by the local cache. In the |
---|
| 1150 | +case of a strong fence, the CPU first has to wait for all of its |
---|
| 1151 | +po-earlier stores to propagate to every other CPU in the system; then |
---|
| 1152 | +it has to wait for the local cache to process all the stores received |
---|
| 1153 | +as of that time -- not just the stores received when the strong fence |
---|
| 1154 | +began. |
---|
1179 | 1155 | |
---|
1180 | 1156 | And of course, none of this matters for any architecture other than |
---|
1181 | 1157 | Alpha. |
---|
.. | .. |
---|
1323 | 1299 | rfe link. You can concoct more exotic examples, containing more than |
---|
1324 | 1300 | one fence, although this quickly leads to diminishing returns in terms |
---|
1325 | 1301 | of complexity. For instance, here's an example containing a coe link |
---|
1326 | | -followed by two fences and an rfe link, utilizing the fact that |
---|
| 1302 | +followed by two cumul-fences and an rfe link, utilizing the fact that |
---|
1327 | 1303 | release fences are A-cumulative: |
---|
1328 | 1304 | |
---|
1329 | 1305 | int x, y, z; |
---|
.. | .. |
---|
1355 | 1331 | link from P0's store to its load. This is because P0's store gets |
---|
1356 | 1332 | overwritten by P1's store since x = 2 at the end (a coe link), the |
---|
1357 | 1333 | smp_wmb() ensures that P1's store to x propagates to P2 before the |
---|
1358 | | -store to y does (the first fence), the store to y propagates to P2 |
---|
| 1334 | +store to y does (the first cumul-fence), the store to y propagates to P2 |
---|
1359 | 1335 | before P2's load and store execute, P2's smp_store_release() |
---|
1360 | 1336 | guarantees that the stores to x and y both propagate to P0 before the |
---|
1361 | | -store to z does (the second fence), and P0's load executes after the |
---|
| 1337 | +store to z does (the second cumul-fence), and P0's load executes after the |
---|
1362 | 1338 | store to z has propagated to P0 (an rfe link). |
---|
1363 | 1339 | |
---|
1364 | 1340 | In summary, the fact that the hb relation links memory access events |
---|
.. | .. |
---|
1451 | 1427 | the content of the LKMM's "propagation" axiom. |
---|
1452 | 1428 | |
---|
1453 | 1429 | |
---|
1454 | | -RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb |
---|
1455 | | ----------------------------------------------------- |
---|
| 1430 | +RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb |
---|
| 1431 | +------------------------------------------------------------------------ |
---|
1456 | 1432 | |
---|
1457 | 1433 | RCU (Read-Copy-Update) is a powerful synchronization mechanism. It |
---|
1458 | 1434 | rests on two concepts: grace periods and read-side critical sections. |
---|
.. | .. |
---|
1467 | 1443 | Grace-Period Guarantee, which states that a critical section can never |
---|
1468 | 1444 | span a full grace period. In more detail, the Guarantee says: |
---|
1469 | 1445 | |
---|
1470 | | - If a critical section starts before a grace period then it |
---|
1471 | | - must end before the grace period does. In addition, every |
---|
1472 | | - store that propagates to the critical section's CPU before the |
---|
1473 | | - end of the critical section must propagate to every CPU before |
---|
1474 | | - the end of the grace period. |
---|
| 1446 | + For any critical section C and any grace period G, at least |
---|
| 1447 | + one of the following statements must hold: |
---|
1475 | 1448 | |
---|
1476 | | - If a critical section ends after a grace period ends then it |
---|
1477 | | - must start after the grace period does. In addition, every |
---|
1478 | | - store that propagates to the grace period's CPU before the |
---|
1479 | | - start of the grace period must propagate to every CPU before |
---|
1480 | | - the start of the critical section. |
---|
| 1449 | +(1) C ends before G does, and in addition, every store that |
---|
| 1450 | + propagates to C's CPU before the end of C must propagate to |
---|
| 1451 | + every CPU before G ends. |
---|
| 1452 | + |
---|
| 1453 | +(2) G starts before C does, and in addition, every store that |
---|
| 1454 | + propagates to G's CPU before the start of G must propagate |
---|
| 1455 | + to every CPU before C starts. |
---|
| 1456 | + |
---|
| 1457 | +In particular, it is not possible for a critical section to both start |
---|
| 1458 | +before and end after a grace period. |
---|
1481 | 1459 | |
---|
1482 | 1460 | Here is a simple example of RCU in action: |
---|
1483 | 1461 | |
---|
.. | .. |
---|
1504 | 1482 | never end with r1 = 1 and r2 = 0. The reasoning is as follows. r1 = 1 |
---|
1505 | 1483 | means that P0's store to x propagated to P1 before P1 called |
---|
1506 | 1484 | synchronize_rcu(), so P0's critical section must have started before |
---|
1507 | | -P1's grace period. On the other hand, r2 = 0 means that P0's store to |
---|
1508 | | -y, which occurs before the end of the critical section, did not |
---|
1509 | | -propagate to P1 before the end of the grace period, violating the |
---|
1510 | | -Guarantee. |
---|
| 1485 | +P1's grace period, contrary to part (2) of the Guarantee. On the |
---|
| 1486 | +other hand, r2 = 0 means that P0's store to y, which occurs before the |
---|
| 1487 | +end of the critical section, did not propagate to P1 before the end of |
---|
| 1488 | +the grace period, contrary to part (1). Together the results violate |
---|
| 1489 | +the Guarantee. |
---|
1511 | 1490 | |
---|
1512 | 1491 | In the kernel's implementations of RCU, the requirements for stores |
---|
1513 | 1492 | to propagate to every CPU are fulfilled by placing strong fences at |
---|
.. | .. |
---|
1525 | 1504 | are pretty obvious, as in the example above, but the details aren't |
---|
1526 | 1505 | entirely clear. The LKMM formalizes this notion by means of the |
---|
1527 | 1506 | rcu-link relation. rcu-link encompasses a very general notion of |
---|
1528 | | -"before": Among other things, X ->rcu-link Z includes cases where X |
---|
1529 | | -happens-before or is equal to some event Y which is equal to or comes |
---|
1530 | | -before Z in the coherence order. When Y = Z this says that X ->rfe Z |
---|
1531 | | -implies X ->rcu-link Z. In addition, when Y = X it says that X ->fr Z |
---|
1532 | | -and X ->co Z each imply X ->rcu-link Z. |
---|
| 1507 | +"before": If E and F are RCU fence events (i.e., rcu_read_lock(), |
---|
| 1508 | +rcu_read_unlock(), or synchronize_rcu()) then among other things, |
---|
| 1509 | +E ->rcu-link F includes cases where E is po-before some memory-access |
---|
| 1510 | +event X, F is po-after some memory-access event Y, and we have any of |
---|
| 1511 | +X ->rfe Y, X ->co Y, or X ->fr Y. |
---|
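| | + |
---|
| | +For example, consider the following sketch (an illustration of the |
---|
| | +rfe case). If r1 = 1 then E is po-before X, X ->rfe Y, and F is |
---|
| | +po-after Y, and so E ->rcu-link F: |
---|
| | + |
---|
| | +int x; |
---|
| | + |
---|
| | +P0() |
---|
| | +{ |
---|
| | +	E: synchronize_rcu(); |
---|
| | +	X: WRITE_ONCE(x, 1); |
---|
| | +} |
---|
| | + |
---|
| | +P1() |
---|
| | +{ |
---|
| | +	int r1; |
---|
| | + |
---|
| | +	Y: r1 = READ_ONCE(x); |
---|
| | +	F: rcu_read_lock(); |
---|
| | +	rcu_read_unlock(); |
---|
| | +} |
---|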
1533 | 1512 | |
---|
1534 | 1513 | The formal definition of the rcu-link relation is more than a little |
---|
1535 | 1514 | obscure, and we won't give it here. It is closely related to the pb |
---|
.. | .. |
---|
1537 | 1516 | a somewhat lengthy formal proof. Pretty much all you need to know |
---|
1538 | 1517 | about rcu-link is the information in the preceding paragraph. |
---|
1539 | 1518 | |
---|
1540 | | -The LKMM also defines the gp and rscs relations. They bring grace |
---|
1541 | | -periods and read-side critical sections into the picture, in the |
---|
| 1519 | +The LKMM also defines the rcu-gp and rcu-rscsi relations. They bring |
---|
| 1520 | +grace periods and read-side critical sections into the picture, in the |
---|
1542 | 1521 | following way: |
---|
1543 | 1522 | |
---|
1544 | | - E ->gp F means there is a synchronize_rcu() fence event S such |
---|
1545 | | - that E ->po S and either S ->po F or S = F. In simple terms, |
---|
1546 | | - there is a grace period po-between E and F. |
---|
| 1523 | + E ->rcu-gp F means that E and F are in fact the same event, |
---|
| 1524 | + and that event is a synchronize_rcu() fence (i.e., a grace |
---|
| 1525 | + period). |
---|
1547 | 1526 | |
---|
1548 | | - E ->rscs F means there is a critical section delimited by an |
---|
1549 | | - rcu_read_lock() fence L and an rcu_read_unlock() fence U, such |
---|
1550 | | - that E ->po U and either L ->po F or L = F. You can think of |
---|
1551 | | - this as saying that E and F are in the same critical section |
---|
1552 | | - (in fact, it also allows E to be po-before the start of the |
---|
1553 | | - critical section and F to be po-after the end). |
---|
| 1527 | + E ->rcu-rscsi F means that E and F are the rcu_read_unlock() |
---|
| 1528 | + and rcu_read_lock() fence events delimiting some read-side |
---|
| 1529 | + critical section. (The 'i' at the end of the name emphasizes |
---|
| 1530 | + that this relation is "inverted": It links the end of the |
---|
| 1531 | + critical section to the start.) |
---|
1554 | 1532 | |
---|
1555 | 1533 | If we think of the rcu-link relation as standing for an extended |
---|
1556 | | -"before", then X ->gp Y ->rcu-link Z says that X executes before a |
---|
1557 | | -grace period which ends before Z executes. (In fact it covers more |
---|
1558 | | -than this, because it also includes cases where X executes before a |
---|
1559 | | -grace period and some store propagates to Z's CPU before Z executes |
---|
1560 | | -but doesn't propagate to some other CPU until after the grace period |
---|
1561 | | -ends.) Similarly, X ->rscs Y ->rcu-link Z says that X is part of (or |
---|
1562 | | -before the start of) a critical section which starts before Z |
---|
1563 | | -executes. |
---|
| 1534 | +"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a |
---|
| 1535 | +grace period which ends before Z begins. (In fact it covers more than |
---|
| 1536 | +this, because it also includes cases where some store propagates to |
---|
| 1537 | +Z's CPU before Z begins but doesn't propagate to some other CPU until |
---|
| 1538 | +after X ends.) Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is |
---|
| 1539 | +the end of a critical section which starts before Z begins. |
---|
1564 | 1540 | |
---|
1565 | | -The LKMM goes on to define the rcu-fence relation as a sequence of gp |
---|
1566 | | -and rscs links separated by rcu-link links, in which the number of gp |
---|
1567 | | -links is >= the number of rscs links. For example: |
---|
| 1541 | +The LKMM goes on to define the rcu-order relation as a sequence of |
---|
| 1542 | +rcu-gp and rcu-rscsi links separated by rcu-link links, in which the |
---|
| 1543 | +number of rcu-gp links is >= the number of rcu-rscsi links. For |
---|
| 1544 | +example: |
---|
1568 | 1545 | |
---|
1569 | | - X ->gp Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V |
---|
| 1546 | + X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V |
---|
1570 | 1547 | |
---|
1571 | | -would imply that X ->rcu-fence V, because this sequence contains two |
---|
1572 | | -gp links and only one rscs link. (It also implies that X ->rcu-fence T |
---|
1573 | | -and Z ->rcu-fence V.) On the other hand: |
---|
| 1548 | +would imply that X ->rcu-order V, because this sequence contains two |
---|
| 1549 | +rcu-gp links and one rcu-rscsi link. (It also implies that |
---|
| 1550 | +X ->rcu-order T and Z ->rcu-order V.) On the other hand: |
---|
1574 | 1551 | |
---|
1575 | | - X ->rscs Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V |
---|
| 1552 | + X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V |
---|
1576 | 1553 | |
---|
1577 | | -does not imply X ->rcu-fence V, because the sequence contains only |
---|
1578 | | -one gp link but two rscs links. |
---|
| 1554 | +does not imply X ->rcu-order V, because the sequence contains only |
---|
| 1555 | +one rcu-gp link but two rcu-rscsi links. |
---|
1579 | 1556 | |
---|
1580 | | -The rcu-fence relation is important because the Grace Period Guarantee |
---|
1581 | | -means that rcu-fence acts kind of like a strong fence. In particular, |
---|
1582 | | -if W is a write and we have W ->rcu-fence Z, the Guarantee says that W |
---|
1583 | | -will propagate to every CPU before Z executes. |
---|
| 1557 | +The rcu-order relation is important because the Grace Period Guarantee |
---|
| 1558 | +means that rcu-order links act kind of like strong fences. In |
---|
| 1559 | +particular, E ->rcu-order F implies not only that E begins before F |
---|
| 1560 | +ends, but also that any write po-before E will propagate to every CPU |
---|
| 1561 | +before any instruction po-after F can execute. (However, it does not |
---|
| 1562 | +imply that E must execute before F; in fact, each synchronize_rcu() |
---|
| 1563 | +fence event is linked to itself by rcu-order as a degenerate case.) |
---|
1584 | 1564 | |
---|
1585 | 1565 | To prove this in full generality requires some intellectual effort. |
---|
1586 | 1566 | We'll consider just a very simple case: |
---|
1587 | 1567 | |
---|
1588 | | - W ->gp X ->rcu-link Y ->rscs Z. |
---|
| 1568 | + G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F. |
---|
1589 | 1569 | |
---|
1590 | | -This formula means that there is a grace period G and a critical |
---|
1591 | | -section C such that: |
---|
| 1570 | +This formula means that G and W are the same event (a grace period), |
---|
| 1571 | +and there are events X, Y and a read-side critical section C such that: |
---|
1592 | 1572 | |
---|
1593 | | - 1. W is po-before G; |
---|
| 1573 | + 1. G = W is po-before or equal to X; |
---|
1594 | 1574 | |
---|
1595 | | - 2. X is equal to or po-after G; |
---|
| 1575 | + 2. X comes "before" Y in some sense (including rfe, co and fr); |
---|
1596 | 1576 | |
---|
1597 | | - 3. X comes "before" Y in some sense; |
---|
| 1577 | + 3. Y is po-before Z; |
---|
1598 | 1578 | |
---|
1599 | | - 4. Y is po-before the end of C; |
---|
| 1579 | + 4. Z is the rcu_read_unlock() event marking the end of C; |
---|
1600 | 1580 | |
---|
1601 | | - 5. Z is equal to or po-after the start of C. |
---|
| 1581 | + 5. F is the rcu_read_lock() event marking the start of C. |
---|
1602 | 1582 | |
---|
1603 | | -From 2 - 4 we deduce that the grace period G ends before the critical |
---|
1604 | | -section C. Then the second part of the Grace Period Guarantee says |
---|
1605 | | -not only that G starts before C does, but also that W (which executes |
---|
1606 | | -on G's CPU before G starts) must propagate to every CPU before C |
---|
1607 | | -starts. In particular, W propagates to every CPU before Z executes |
---|
1608 | | -(or finishes executing, in the case where Z is equal to the |
---|
1609 | | -rcu_read_lock() fence event which starts C.) This sort of reasoning |
---|
1610 | | -can be expanded to handle all the situations covered by rcu-fence. |
---|
| 1583 | +From 1 - 4 we deduce that the grace period G ends before the critical |
---|
| 1584 | +section C. Then part (2) of the Grace Period Guarantee says not only |
---|
| 1585 | +that G starts before C does, but also that any write which executes on |
---|
| 1586 | +G's CPU before G starts must propagate to every CPU before C starts. |
---|
| 1587 | +In particular, the write propagates to every CPU before F finishes |
---|
| 1588 | +executing and hence before any instruction po-after F can execute. |
---|
| 1589 | +This sort of reasoning can be extended to handle all the situations |
---|
| 1590 | +covered by rcu-order. |
---|
| 1591 | + |
---|
| 1592 | +The rcu-fence relation is a simple extension of rcu-order. While |
---|
| 1593 | +rcu-order only links certain fence events (calls to synchronize_rcu(), |
---|
| 1594 | +rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events |
---|
| 1595 | +that are separated by an rcu-order link. This is analogous to the way |
---|
| 1596 | +the strong-fence relation links events that are separated by an |
---|
| 1597 | +smp_mb() fence event (as mentioned above, rcu-order links act kind of |
---|
| 1598 | +like strong fences). Written symbolically, X ->rcu-fence Y means |
---|
| 1599 | +there are fence events E and F such that: |
---|
| 1600 | + |
---|
| 1601 | + X ->po E ->rcu-order F ->po Y. |
---|
| 1602 | + |
---|
| 1603 | +From the discussion above, we see this implies not only that X |
---|
| 1604 | +executes before Y, but also (if X is a store) that X propagates to |
---|
| 1605 | +every CPU before Y executes. Thus rcu-fence is sort of a |
---|
| 1606 | +"super-strong" fence: Unlike the original strong fences (smp_mb() and |
---|
| 1607 | +synchronize_rcu()), rcu-fence is able to link events on different |
---|
| 1608 | +CPUs. (Perhaps this fact should lead us to say that rcu-fence isn't |
---|
| 1609 | +really a fence at all!) |
---|
1611 | 1610 | |
---|
1612 | 1611 | Finally, the LKMM defines the RCU-before (rb) relation in terms of |
---|
1613 | 1612 | rcu-fence. This is done in essentially the same way as the pb |
---|
1614 | 1613 | relation was defined in terms of strong-fence. We will omit the |
---|
1615 | | -details; the end result is that E ->rb F implies E must execute before |
---|
1616 | | -F, just as E ->pb F does (and for much the same reasons). |
---|
| 1614 | +details; the end result is that E ->rb F implies E must execute |
---|
| 1615 | +before F, just as E ->pb F does (and for much the same reasons). |
---|
1617 | 1616 | |
---|
1618 | 1617 | Putting this all together, the LKMM expresses the Grace Period |
---|
1619 | 1618 | Guarantee by requiring that the rb relation does not contain a cycle. |
---|
1620 | | -Equivalently, this "rcu" axiom requires that there are no events E and |
---|
1621 | | -F with E ->rcu-link F ->rcu-fence E. Or to put it a third way, the |
---|
1622 | | -axiom requires that there are no cycles consisting of gp and rscs |
---|
1623 | | -alternating with rcu-link, where the number of gp links is >= the |
---|
1624 | | -number of rscs links. |
---|
| 1619 | +Equivalently, this "rcu" axiom requires that there are no events E |
---|
| 1620 | +and F with E ->rcu-link F ->rcu-order E. Or to put it a third way, |
---|
| 1621 | +the axiom requires that there are no cycles consisting of rcu-gp and |
---|
| 1622 | +rcu-rscsi alternating with rcu-link, where the number of rcu-gp links |
---|
| 1623 | +is >= the number of rcu-rscsi links. |
---|
1625 | 1624 | |
---|
1626 | 1625 | Justifying the axiom isn't easy, but it is in fact a valid |
---|
1627 | 1626 | formalization of the Grace Period Guarantee. We won't attempt to go |
---|
1628 | 1627 | through the detailed argument, but the following analysis gives a |
---|
1629 | | -taste of what is involved. Suppose we have a violation of the first |
---|
1630 | | -part of the Guarantee: A critical section starts before a grace |
---|
1631 | | -period, and some store propagates to the critical section's CPU before |
---|
1632 | | -the end of the critical section but doesn't propagate to some other |
---|
1633 | | -CPU until after the end of the grace period. |
---|
| 1628 | +taste of what is involved. Suppose both parts of the Guarantee are |
---|
| 1629 | +violated: A critical section starts before a grace period, and some |
---|
| 1630 | +store propagates to the critical section's CPU before the end of the |
---|
| 1631 | +critical section but doesn't propagate to some other CPU until after |
---|
| 1632 | +the end of the grace period. |
---|
1634 | 1633 | |
---|
1635 | 1634 | Putting symbols to these ideas, let L and U be the rcu_read_lock() and |
---|
1636 | 1635 | rcu_read_unlock() fence events delimiting the critical section in |
---|
1637 | 1636 | question, and let S be the synchronize_rcu() fence event for the grace |
---|
1638 | 1637 | period. Saying that the critical section starts before S means there |
---|
1639 | | -are events E and F where E is po-after L (which marks the start of the |
---|
1640 | | -critical section), E is "before" F in the sense of the rcu-link |
---|
1641 | | -relation, and F is po-before the grace period S: |
---|
| 1638 | +are events Q and R where Q is po-after L (which marks the start of the |
---|
| 1639 | +critical section), Q is "before" R in the sense used by the rcu-link |
---|
| 1640 | +relation, and R is po-before the grace period S. Thus we have: |
---|
1642 | 1641 | |
---|
1643 | | - L ->po E ->rcu-link F ->po S. |
---|
| 1642 | + L ->rcu-link S. |
---|
1644 | 1643 | |
---|
1645 | | -Let W be the store mentioned above, let Z come before the end of the |
---|
| 1644 | +Let W be the store mentioned above, let Y come before the end of the |
---|
1646 | 1645 | critical section and witness that W propagates to the critical |
---|
1647 | | -section's CPU by reading from W, and let Y on some arbitrary CPU be a |
---|
1648 | | -witness that W has not propagated to that CPU, where Y happens after |
---|
| 1646 | +section's CPU by reading from W, and let Z on some arbitrary CPU be a |
---|
| 1647 | +witness that W has not propagated to that CPU, where Z happens after |
---|
1649 | 1648 | some event X which is po-after S. Symbolically, this amounts to: |
---|
1650 | 1649 | |
---|
1651 | | - S ->po X ->hb* Y ->fr W ->rf Z ->po U. |
---|
| 1650 | + S ->po X ->hb* Z ->fr W ->rf Y ->po U. |
---|
1652 | 1651 | |
---|
1653 | | -The fr link from Y to W indicates that W has not propagated to Y's CPU |
---|
1654 | | -at the time that Y executes. From this, it can be shown (see the |
---|
1655 | | -discussion of the rcu-link relation earlier) that X and Z are related |
---|
1656 | | -by rcu-link, yielding: |
---|
| 1652 | +The fr link from Z to W indicates that W has not propagated to Z's CPU |
---|
| 1653 | +at the time that Z executes. From this, it can be shown (see the |
---|
| 1654 | +discussion of the rcu-link relation earlier) that S and U are related |
---|
| 1655 | +by rcu-link: |
---|
1657 | 1656 | |
---|
1658 | | - S ->po X ->rcu-link Z ->po U. |
---|
| 1657 | + S ->rcu-link U. |
---|
1659 | 1658 | |
---|
1660 | | -The formulas say that S is po-between F and X, hence F ->gp X. They |
---|
1661 | | -also say that Z comes before the end of the critical section and E |
---|
1662 | | -comes after its start, hence Z ->rscs E. From all this we obtain: |
---|
| 1659 | +Since S is a grace period we have S ->rcu-gp S, and since L and U are |
---|
| 1660 | +the start and end of the critical section C we have U ->rcu-rscsi L. |
---|
| 1661 | +From this we obtain: |
---|
1663 | 1662 | |
---|
1664 | | - F ->gp X ->rcu-link Z ->rscs E ->rcu-link F, |
---|
| 1663 | + S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S, |
---|
1665 | 1664 | |
---|
1666 | 1665 | a forbidden cycle. Thus the "rcu" axiom rules out this violation of |
---|
1667 | 1666 | the Grace Period Guarantee. |
---|
1668 | 1667 | |
---|
1669 | 1668 | For something a little more down-to-earth, let's see how the axiom |
---|
1670 | 1669 | works out in practice. Consider the RCU code example from above, this |
---|
1671 | | -time with statement labels added to the memory access instructions: |
---|
| 1670 | +time with statement labels added: |
---|
1672 | 1671 | |
---|
1673 | 1672 | int x, y; |
---|
1674 | 1673 | |
---|
1675 | 1674 | P0() |
---|
1676 | 1675 | { |
---|
1677 | | - rcu_read_lock(); |
---|
1678 | | - W: WRITE_ONCE(x, 1); |
---|
1679 | | - X: WRITE_ONCE(y, 1); |
---|
1680 | | - rcu_read_unlock(); |
---|
| 1676 | + L: rcu_read_lock(); |
---|
| 1677 | + X: WRITE_ONCE(x, 1); |
---|
| 1678 | + Y: WRITE_ONCE(y, 1); |
---|
| 1679 | + U: rcu_read_unlock(); |
---|
1681 | 1680 | } |
---|
1682 | 1681 | |
---|
1683 | 1682 | P1() |
---|
1684 | 1683 | { |
---|
1685 | 1684 | int r1, r2; |
---|
1686 | 1685 | |
---|
1687 | | - Y: r1 = READ_ONCE(x); |
---|
1688 | | - synchronize_rcu(); |
---|
1689 | | - Z: r2 = READ_ONCE(y); |
---|
| 1686 | + Z: r1 = READ_ONCE(x); |
---|
| 1687 | + S: synchronize_rcu(); |
---|
| 1688 | + W: r2 = READ_ONCE(y); |
---|
1690 | 1689 | } |
---|
1691 | 1690 | |
---|
1692 | 1691 | |
---|
1693 | | -If r2 = 0 at the end then P0's store at X overwrites the value that |
---|
1694 | | -P1's load at Z reads from, so we have Z ->fre X and thus Z ->rcu-link X. |
---|
1695 | | -In addition, there is a synchronize_rcu() between Y and Z, so therefore |
---|
1696 | | -we have Y ->gp Z. |
---|
| 1692 | +If r2 = 0 at the end then P0's store at Y overwrites the value that |
---|
| 1693 | +P1's load at W reads from, so we have W ->fre Y. Since S ->po W and |
---|
| 1694 | +also Y ->po U, we get S ->rcu-link U. In addition, S ->rcu-gp S |
---|
| 1695 | +because S is a grace period. |
---|
1697 | 1696 | |
---|
1698 | | -If r1 = 1 at the end then P1's load at Y reads from P0's store at W, |
---|
1699 | | -so we have W ->rcu-link Y. In addition, W and X are in the same critical |
---|
1700 | | -section, so therefore we have X ->rscs W. |
---|
| 1697 | +If r1 = 1 at the end then P1's load at Z reads from P0's store at X, |
---|
| 1698 | +so we have X ->rfe Z. Together with L ->po X and Z ->po S, this |
---|
| 1699 | +yields L ->rcu-link S. And since L and U are the start and end of a |
---|
| 1700 | +critical section, we have U ->rcu-rscsi L. |
---|
1701 | 1701 | |
---|
1702 | | -Then X ->rscs W ->rcu-link Y ->gp Z ->rcu-link X is a forbidden cycle, |
---|
1703 | | -violating the "rcu" axiom. Hence the outcome is not allowed by the |
---|
1704 | | -LKMM, as we would expect. |
---|
| 1702 | +Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a |
---|
| 1703 | +forbidden cycle, violating the "rcu" axiom. Hence the outcome is not |
---|
| 1704 | +allowed by the LKMM, as we would expect. |
---|
1705 | 1705 | |
---|
1706 | 1706 | For contrast, let's see what can happen in a more complicated example: |
---|
1707 | 1707 | |
---|
.. | .. |
---|
1711 | 1711 | { |
---|
1712 | 1712 | int r0; |
---|
1713 | 1713 | |
---|
1714 | | - rcu_read_lock(); |
---|
1715 | | - W: r0 = READ_ONCE(x); |
---|
1716 | | - X: WRITE_ONCE(y, 1); |
---|
1717 | | - rcu_read_unlock(); |
---|
| 1714 | + L0: rcu_read_lock(); |
---|
| 1715 | + r0 = READ_ONCE(x); |
---|
| 1716 | + WRITE_ONCE(y, 1); |
---|
| 1717 | + U0: rcu_read_unlock(); |
---|
1718 | 1718 | } |
---|
1719 | 1719 | |
---|
1720 | 1720 | P1() |
---|
1721 | 1721 | { |
---|
1722 | 1722 | int r1; |
---|
1723 | 1723 | |
---|
1724 | | - Y: r1 = READ_ONCE(y); |
---|
1725 | | - synchronize_rcu(); |
---|
1726 | | - Z: WRITE_ONCE(z, 1); |
---|
| 1724 | + r1 = READ_ONCE(y); |
---|
| 1725 | + S1: synchronize_rcu(); |
---|
| 1726 | + WRITE_ONCE(z, 1); |
---|
1727 | 1727 | } |
---|
1728 | 1728 | |
---|
1729 | 1729 | P2() |
---|
1730 | 1730 | { |
---|
1731 | 1731 | int r2; |
---|
1732 | 1732 | |
---|
1733 | | - rcu_read_lock(); |
---|
1734 | | - U: r2 = READ_ONCE(z); |
---|
1735 | | - V: WRITE_ONCE(x, 1); |
---|
1736 | | - rcu_read_unlock(); |
---|
| 1733 | + L2: rcu_read_lock(); |
---|
| 1734 | + r2 = READ_ONCE(z); |
---|
| 1735 | + WRITE_ONCE(x, 1); |
---|
| 1736 | + U2: rcu_read_unlock(); |
---|
1737 | 1737 | } |
---|
1738 | 1738 | |
---|
1739 | 1739 | If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows |
---|
1740 | | -that W ->rscs X ->rcu-link Y ->gp Z ->rcu-link U ->rscs V ->rcu-link W. |
---|
1741 | | -However this cycle is not forbidden, because the sequence of relations |
---|
1742 | | -contains fewer instances of gp (one) than of rscs (two). Consequently |
---|
1743 | | -the outcome is allowed by the LKMM. The following instruction timing |
---|
1744 | | -diagram shows how it might actually occur: |
---|
| 1740 | +that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi |
---|
| 1741 | +L2 ->rcu-link U0. However this cycle is not forbidden, because the |
---|
| 1742 | +sequence of relations contains fewer instances of rcu-gp (one) than of |
---|
| 1743 | +rcu-rscsi (two). Consequently the outcome is allowed by the LKMM. |
---|
| 1744 | +The following instruction timing diagram shows how it might actually |
---|
| 1745 | +occur: |
---|
1745 | 1746 | |
---|
1746 | 1747 | P0 P1 P2 |
---|
1747 | 1748 | -------------------- -------------------- -------------------- |
---|
1748 | 1749 | rcu_read_lock() |
---|
1749 | | -X: WRITE_ONCE(y, 1) |
---|
1750 | | - Y: r1 = READ_ONCE(y) |
---|
| 1750 | +WRITE_ONCE(y, 1) |
---|
| 1751 | + r1 = READ_ONCE(y) |
---|
1751 | 1752 | synchronize_rcu() starts |
---|
1752 | 1753 | . rcu_read_lock() |
---|
1753 | | - . V: WRITE_ONCE(x, 1) |
---|
1754 | | -W: r0 = READ_ONCE(x) . |
---|
| 1754 | + . WRITE_ONCE(x, 1) |
---|
| 1755 | +r0 = READ_ONCE(x) . |
---|
1755 | 1756 | rcu_read_unlock() . |
---|
1756 | 1757 | synchronize_rcu() ends |
---|
1757 | | - Z: WRITE_ONCE(z, 1) |
---|
1758 | | - U: r2 = READ_ONCE(z) |
---|
| 1758 | + WRITE_ONCE(z, 1) |
---|
| 1759 | + r2 = READ_ONCE(z) |
---|
1759 | 1760 | rcu_read_unlock() |
---|
1760 | 1761 | |
---|
1761 | 1762 | This requires P0 and P2 to execute their loads and stores out of |
---|
.. | .. |
---|
1764 | 1765 | section in P0 both starts before P1's grace period does and ends |
---|
1765 | 1766 | before it does, and the critical section in P2 both starts after P1's |
---|
1766 | 1767 | grace period does and ends after it does. |
---|
| 1768 | + |
---|
| 1769 | +Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in |
---|
| 1770 | +addition to normal RCU. The ideas involved are much the same as |
---|
| 1771 | +above, with new relations srcu-gp and srcu-rscsi added to represent |
---|
| 1772 | +SRCU grace periods and read-side critical sections. There is a |
---|
| 1773 | +restriction on the srcu-gp and srcu-rscsi links that can appear in an |
---|
| 1774 | +rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp |
---|
| 1775 | +links having the same SRCU domain with proper nesting); the details |
---|
| 1776 | +are relatively unimportant. |
---|
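| | + |
---|
| | +By way of illustration, here is a sketch of the earlier RCU example |
---|
| | +converted to SRCU (assuming both processes use the same srcu_struct |
---|
| | +ss); as before, the outcome r1 = 1 and r2 = 0 would be forbidden: |
---|
| | + |
---|
| | +int x, y; |
---|
| | +struct srcu_struct ss; |
---|
| | + |
---|
| | +P0() |
---|
| | +{ |
---|
| | +	int idx; |
---|
| | + |
---|
| | +	idx = srcu_read_lock(&ss); |
---|
| | +	WRITE_ONCE(x, 1); |
---|
| | +	WRITE_ONCE(y, 1); |
---|
| | +	srcu_read_unlock(&ss, idx); |
---|
| | +} |
---|
| | + |
---|
| | +P1() |
---|
| | +{ |
---|
| | +	int r1, r2; |
---|
| | + |
---|
| | +	r1 = READ_ONCE(x); |
---|
| | +	synchronize_srcu(&ss); |
---|
| | +	r2 = READ_ONCE(y); |
---|
| | +} |
---|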
| 1777 | + |
---|
| 1778 | + |
---|
| 1779 | +LOCKING |
---|
| 1780 | +------- |
---|
| 1781 | + |
---|
| 1782 | +The LKMM includes locking. In fact, there is special code for locking |
---|
| 1783 | +in the formal model, added in order to make tools run faster. |
---|
| 1784 | +However, this special code is intended to be more or less equivalent |
---|
| 1785 | +to concepts we have already covered. A spinlock_t variable is treated |
---|
| 1786 | +the same as an int, and spin_lock(&s) is treated almost the same as: |
---|
| 1787 | + |
---|
| 1788 | + while (cmpxchg_acquire(&s, 0, 1) != 0) |
---|
| 1789 | + cpu_relax(); |
---|
| 1790 | + |
---|
| 1791 | +This waits until s is equal to 0 and then atomically sets it to 1, |
---|
| 1792 | +and the read part of the cmpxchg operation acts as an acquire fence. |
---|
| 1793 | +An alternate way to express the same thing would be: |
---|
| 1794 | + |
---|
| 1795 | + r = xchg_acquire(&s, 1); |
---|
| 1796 | + |
---|
| 1797 | +along with a requirement that at the end, r = 0. Similarly, |
---|
| 1798 | +spin_trylock(&s) is treated almost the same as: |
---|
| 1799 | + |
---|
| 1800 | + return !cmpxchg_acquire(&s, 0, 1); |
---|
| 1801 | + |
---|
| 1802 | +which atomically sets s to 1 if it is currently equal to 0 and returns |
---|
| 1803 | +true if it succeeds (the read part of the cmpxchg operation acts as an |
---|
| 1804 | +acquire fence only if the operation is successful). spin_unlock(&s) |
---|
| 1805 | +is treated almost the same as: |
---|
| 1806 | + |
---|
| 1807 | + smp_store_release(&s, 0); |
---|
| 1808 | + |
---|
| 1809 | +The "almost" qualifiers above need some explanation. In the LKMM, the |
---|
| 1810 | +store-release in a spin_unlock() and the load-acquire which forms the |
---|
| 1811 | +first half of the atomic rmw update in a spin_lock() or a successful |
---|
| 1812 | +spin_trylock() -- we can call these things lock-releases and |
---|
| 1813 | +lock-acquires -- have two properties beyond those of ordinary releases |
---|
| 1814 | +and acquires. |
---|
| 1815 | + |
---|
| 1816 | +First, when a lock-acquire reads from a lock-release, the LKMM |
---|
| 1817 | +requires that every instruction po-before the lock-release must |
---|
| 1818 | +execute before any instruction po-after the lock-acquire. This would |
---|
| 1819 | +naturally hold if the release and acquire operations were on different |
---|
| 1820 | +CPUs, but the LKMM says it holds even when they are on the same CPU. |
---|
| 1821 | +For example: |
---|
| 1822 | + |
---|
| 1823 | + int x, y; |
---|
| 1824 | + spinlock_t s; |
---|
| 1825 | + |
---|
| 1826 | + P0() |
---|
| 1827 | + { |
---|
| 1828 | + int r1, r2; |
---|
| 1829 | + |
---|
| 1830 | + spin_lock(&s); |
---|
| 1831 | + r1 = READ_ONCE(x); |
---|
| 1832 | + spin_unlock(&s); |
---|
| 1833 | + spin_lock(&s); |
---|
| 1834 | + r2 = READ_ONCE(y); |
---|
| 1835 | + spin_unlock(&s); |
---|
| 1836 | + } |
---|
| 1837 | + |
---|
| 1838 | + P1() |
---|
| 1839 | + { |
---|
| 1840 | + WRITE_ONCE(y, 1); |
---|
| 1841 | + smp_wmb(); |
---|
| 1842 | + WRITE_ONCE(x, 1); |
---|
| 1843 | + } |
---|
| 1844 | + |
---|
| 1845 | +Here the second spin_lock() reads from the first spin_unlock(), and |
---|
| 1846 | +therefore the load of x must execute before the load of y. Thus we |
---|
| 1847 | +cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the |
---|
| 1848 | +MP pattern). |
---|
| 1849 | + |
---|
| 1850 | +This requirement does not apply to ordinary release and acquire |
---|
| 1851 | +fences, only to lock-related operations. For instance, suppose P0() |
---|
| 1852 | +in the example had been written as: |
---|
| 1853 | + |
---|
| 1854 | + P0() |
---|
| 1855 | + { |
---|
| 1856 | + int r1, r2, r3; |
---|
| 1857 | + |
---|
| 1858 | + r1 = READ_ONCE(x); |
---|
| 1859 | + smp_store_release(&s, 1); |
---|
| 1860 | + r3 = smp_load_acquire(&s); |
---|
| 1861 | + r2 = READ_ONCE(y); |
---|
| 1862 | + } |
---|
| 1863 | + |
---|
| 1864 | +Then the CPU would be allowed to forward the s = 1 value from the |
---|
| 1865 | +smp_store_release() to the smp_load_acquire(), executing the |
---|
| 1866 | +instructions in the following order: |
---|
| 1867 | + |
---|
| 1868 | + r3 = smp_load_acquire(&s); // Obtains r3 = 1 |
---|
| 1869 | + r2 = READ_ONCE(y); |
---|
| 1870 | + r1 = READ_ONCE(x); |
---|
| 1871 | + smp_store_release(&s, 1); // Value is forwarded |
---|
| 1872 | + |
---|
| 1873 | +and thus it could load y before x, obtaining r2 = 0 and r1 = 1. |
---|
| 1874 | + |
---|
| 1875 | +Second, when a lock-acquire reads from a lock-release, and some other |
---|
| 1876 | +stores W and W' occur po-before the lock-release and po-after the |
---|
| 1877 | +lock-acquire respectively, the LKMM requires that W must propagate to |
---|
| 1878 | +each CPU before W' does. For example, consider: |
---|
| 1879 | + |
---|
| 1880 | + int x, y; |
---|
| 1881 | +	spinlock_t s; |
---|
| 1882 | + |
---|
| 1883 | + P0() |
---|
| 1884 | + { |
---|
| 1885 | + spin_lock(&s); |
---|
| 1886 | + WRITE_ONCE(x, 1); |
---|
| 1887 | + spin_unlock(&s); |
---|
| 1888 | + } |
---|
| 1889 | + |
---|
| 1890 | + P1() |
---|
| 1891 | + { |
---|
| 1892 | + int r1; |
---|
| 1893 | + |
---|
| 1894 | + spin_lock(&s); |
---|
| 1895 | + r1 = READ_ONCE(x); |
---|
| 1896 | + WRITE_ONCE(y, 1); |
---|
| 1897 | + spin_unlock(&s); |
---|
| 1898 | + } |
---|
| 1899 | + |
---|
| 1900 | + P2() |
---|
| 1901 | + { |
---|
| 1902 | + int r2, r3; |
---|
| 1903 | + |
---|
| 1904 | + r2 = READ_ONCE(y); |
---|
| 1905 | + smp_rmb(); |
---|
| 1906 | + r3 = READ_ONCE(x); |
---|
| 1907 | + } |
---|
| 1908 | + |
---|
| 1909 | +If r1 = 1 at the end then the spin_lock() in P1 must have read from |
---|
| 1910 | +the spin_unlock() in P0. Hence the store to x must propagate to P2 |
---|
| 1911 | +before the store to y does, so we cannot have r2 = 1 and r3 = 0. |
---|
| 1912 | + |
---|
| 1913 | +These two special requirements for lock-release and lock-acquire do |
---|
| 1914 | +not arise from the operational model. Nevertheless, kernel developers |
---|
| 1915 | +have come to expect and rely on them because they do hold on all |
---|
| 1916 | +architectures supported by the Linux kernel, albeit for various |
---|
| 1917 | +differing reasons. |
---|
| 1918 | + |
---|
| 1919 | + |
---|
| 1920 | +PLAIN ACCESSES AND DATA RACES |
---|
| 1921 | +----------------------------- |
---|
| 1922 | + |
---|
| 1923 | +In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y), |
---|
| 1924 | +smp_load_acquire(&z), and so on are collectively referred to as |
---|
| 1925 | +"marked" accesses, because they are all annotated with special |
---|
| 1926 | +operations of one kind or another. Ordinary C-language memory |
---|
| 1927 | +accesses such as x or y = 0 are simply called "plain" accesses. |
---|
| 1928 | + |
---|
| 1929 | +Early versions of the LKMM had nothing to say about plain accesses. |
---|
| 1930 | +The C standard allows compilers to assume that the variables affected |
---|
| 1931 | +by plain accesses are not concurrently read or written by any other |
---|
| 1932 | +threads or CPUs. This leaves compilers free to implement all manner |
---|
| 1933 | +of transformations or optimizations of code containing plain accesses, |
---|
| 1934 | +making such code very difficult for a memory model to handle. |
---|
| 1935 | + |
---|
| 1936 | +Here is just one example of a possible pitfall: |
---|
| 1937 | + |
---|
| 1938 | + int a = 6; |
---|
| 1939 | + int *x = &a; |
---|
| 1940 | + |
---|
| 1941 | + P0() |
---|
| 1942 | + { |
---|
| 1943 | + int *r1; |
---|
| 1944 | + int r2 = 0; |
---|
| 1945 | + |
---|
| 1946 | + r1 = x; |
---|
| 1947 | + if (r1 != NULL) |
---|
| 1948 | + r2 = READ_ONCE(*r1); |
---|
| 1949 | + } |
---|
| 1950 | + |
---|
| 1951 | + P1() |
---|
| 1952 | + { |
---|
| 1953 | + WRITE_ONCE(x, NULL); |
---|
| 1954 | + } |
---|
| 1955 | + |
---|
| 1956 | +On the face of it, one would expect that when this code runs, the only |
---|
| 1957 | +possible final values for r2 are 6 and 0, depending on whether or not |
---|
| 1958 | +P1's store to x propagates to P0 before P0's load from x executes. |
---|
| 1959 | +But since P0's load from x is a plain access, the compiler may decide |
---|
| 1960 | +to carry out the load twice (for the comparison against NULL, then again |
---|
| 1961 | +for the READ_ONCE()) and eliminate the temporary variable r1. The |
---|
| 1962 | +object code generated for P0 could therefore end up looking rather |
---|
| 1963 | +like this: |
---|
| 1964 | + |
---|
| 1965 | + P0() |
---|
| 1966 | + { |
---|
| 1967 | + int r2 = 0; |
---|
| 1968 | + |
---|
| 1969 | + if (x != NULL) |
---|
| 1970 | + r2 = READ_ONCE(*x); |
---|
| 1971 | + } |
---|
| 1972 | + |
---|
| 1973 | +And now it is obvious that this code runs the risk of dereferencing a |
---|
| 1974 | +NULL pointer, because P1's store to x might propagate to P0 after the |
---|
| 1975 | +test against NULL has been made but before the READ_ONCE() executes. |
---|
| 1976 | +If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x", |
---|
| 1977 | +the compiler would not have performed this optimization and there |
---|
| 1978 | +would be no possibility of a NULL-pointer dereference. |
---|
| 1979 | + |
---|
| 1980 | +Given the possibility of transformations like this one, the LKMM |
---|
| 1981 | +doesn't try to predict all possible outcomes of code containing plain |
---|
| 1982 | +accesses. It is instead content to determine whether the code |
---|
| 1983 | +violates the compiler's assumptions, which would render the ultimate |
---|
| 1984 | +outcome undefined. |
---|
| 1985 | + |
---|
| 1986 | +In technical terms, the compiler is allowed to assume that when the |
---|
| 1987 | +program executes, there will not be any data races. A "data race" |
---|
| 1988 | +occurs when there are two memory accesses such that: |
---|
| 1989 | + |
---|
| 1990 | +1. they access the same location, |
---|
| 1991 | + |
---|
| 1992 | +2. at least one of them is a store, |
---|
| 1993 | + |
---|
| 1994 | +3. at least one of them is plain, |
---|
| 1995 | + |
---|
| 1996 | +4. they occur on different CPUs (or in different threads on the |
---|
| 1997 | + same CPU), and |
---|
| 1998 | + |
---|
| 1999 | +5. they execute concurrently. |
---|
| 2000 | + |
---|
| 2001 | +In the literature, two accesses are said to "conflict" if they satisfy |
---|
| 2002 | +1 and 2 above. We'll go a little farther and say that two accesses |
---|
| 2003 | +are "race candidates" if they satisfy 1 - 4. Thus, whether or not two |
---|
| 2004 | +race candidates actually do race in a given execution depends on |
---|
| 2005 | +whether they are concurrent. |
---|
| 2006 | + |
---|
| 2007 | +The LKMM tries to determine whether a program contains race candidates |
---|
| 2008 | +which may execute concurrently; if it does then the LKMM says there is |
---|
| 2009 | +a potential data race and makes no predictions about the program's |
---|
| 2010 | +outcome. |
---|
| 2011 | + |
---|
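| | +For example, here is a minimal sketch of a potential data race: |
---|
| | +conditions 1 - 4 all hold, and nothing prevents the two accesses from |
---|
| | +executing concurrently: |
---|
| | + |
---|
| | +	int x; |
---|
| | + |
---|
| | +	P0() |
---|
| | +	{ |
---|
| | +		WRITE_ONCE(x, 1); |
---|
| | +	} |
---|
| | + |
---|
| | +	P1() |
---|
| | +	{ |
---|
| | +		int r1; |
---|
| | + |
---|
| | +		r1 = x;		/* plain read racing with P0's store */ |
---|
| | +	} |
---|
| | + |
---|
| | +The LKMM would flag this as a potential data race and make no |
---|
| | +prediction about the final value of r1. |
---|
| | + |
---|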
| 2012 | +Determining whether two accesses are race candidates is easy; you can |
---|
| 2013 | +see that all the concepts involved in the definition above are already |
---|
| 2014 | +part of the memory model. The hard part is telling whether they may |
---|
| 2015 | +execute concurrently. The LKMM takes a conservative attitude, |
---|
| 2016 | +assuming that accesses may be concurrent unless it can prove they |
---|
| 2017 | +are not. |
---|
| 2018 | + |
---|
| 2019 | +If two memory accesses aren't concurrent then one must execute before |
---|
| 2020 | +the other. Therefore the LKMM decides two accesses aren't concurrent |
---|
| 2021 | +if they can be connected by a sequence of hb, pb, and rb links |
---|
| 2022 | +(together referred to as xb, for "executes before"). However, there |
---|
| 2023 | +are two complicating factors. |
---|
| 2024 | + |
---|
| 2025 | +If X is a load and X executes before a store Y, then indeed there is |
---|
| 2026 | +no danger of X and Y being concurrent. After all, Y can't have any |
---|
| 2027 | +effect on the value obtained by X until the memory subsystem has |
---|
| 2028 | +propagated Y from its own CPU to X's CPU, which won't happen until |
---|
some time after Y executes and thus after X executes. But if X is a
store, then even if X executes before Y it is still possible that X
will propagate to Y's CPU just as Y is executing. In such a case X
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.

Therefore when X is a store, for X and Y to be non-concurrent the LKMM
requires not only that X must execute before Y but also that X must
propagate to Y's CPU before Y executes. (Or vice versa, of course, if
Y executes before X -- then Y must propagate to X's CPU before X
executes if Y is a store.) This is expressed by the visibility
relation (vis), where X ->vis Y is defined to hold if there is an
intermediate event Z such that:

	X is connected to Z by a possibly empty sequence of
	cumul-fence links followed by an optional rfe link (if none of
	these links are present, X and Z are the same event),

and either:

	Z is connected to Y by a strong-fence link followed by a
	possibly empty sequence of xb links,

or:

	Z is on the same CPU as Y and is connected to Y by a possibly
	empty sequence of xb links (again, if the sequence is empty it
	means Z and Y are the same event).

The motivations behind this definition are straightforward:

	cumul-fence memory barriers force stores that are po-before
	the barrier to propagate to other CPUs before stores that are
	po-after the barrier.

	An rfe link from an event W to an event R says that R reads
	from W, which certainly means that W must have propagated to
	R's CPU before R executed.

	strong-fence memory barriers force stores that are po-before
	the barrier, or that propagate to the barrier's CPU before the
	barrier executes, to propagate to all CPUs before any events
	po-after the barrier can execute.

To see how this works out in practice, consider our old friend, the MP
pattern (with fences and statement labels, but without the conditional
test):

	int buf = 0, flag = 0;

	P0()
	{
		X: WRITE_ONCE(buf, 1);
		smp_wmb();
		W: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Z: r1 = READ_ONCE(flag);
		smp_rmb();
		Y: r2 = READ_ONCE(buf);
	}

The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
assuming r1 = 1 at the end, there is an rfe link from W to Z. This
means that the store to buf must propagate from P0 to P1 before Z
executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
provides an xb link from Z to Y (i.e., it forces Z to execute before
Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
executes.
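Spelled out as a chain of relations (using the statement labels from
the listing above), the argument is simply:

	X ->cumul-fence W ->rfe Z ->xb Y

which instantiates the vis definition with Z itself as the
intermediate event: X reaches Z by one cumul-fence link followed by
one rfe link, and Z reaches Y, on the same CPU, by a single xb link.
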
The second complicating factor mentioned above arises from the fact
that when we are considering data races, some of the memory accesses
are plain. Now, although we have not said so explicitly, up to this
point most of the relations defined by the LKMM (ppo, hb, prop,
cumul-fence, pb, and so on -- including vis) apply only to marked
accesses.

There are good reasons for this restriction. The compiler is not
allowed to apply fancy transformations to marked accesses, and
consequently each such access in the source code corresponds more or
less directly to a single machine instruction in the object code. But
plain accesses are a different story; the compiler may combine them,
split them up, duplicate them, eliminate them, invent new ones, and
who knows what else. Seeing a plain access in the source code tells
you almost nothing about what machine instructions will end up in the
object code.
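As a simple illustration of this freedom (a hypothetical example, not
the output of any particular compiler), two adjacent plain loads in
the source code:

	r1 = x;
	r2 = x;

may be fused so that the object code loads x only once:

	r1 = x;
	r2 = r1;

so you cannot even count on two distinct load instructions being
present.
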
Fortunately, the compiler isn't completely free; it is subject to some
limitations. For one, it is not allowed to introduce a data race into
the object code if the source code does not already contain a data
race (if it could, memory models would be useless and no multithreaded
code would be safe!). For another, it cannot move a plain access past
a compiler barrier.

A compiler barrier is a kind of fence, but as the name implies, it
only affects the compiler; it does not necessarily have any effect on
how instructions are executed by the CPU. In Linux kernel source
code, the barrier() function is a compiler barrier. It doesn't give
rise directly to any machine instructions in the object code; rather,
it affects how the compiler generates the rest of the object code.
Given source code like this:

	... some memory accesses ...
	barrier();
	... some other memory accesses ...

the barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end up po-before
any machine instructions corresponding to the second group of accesses
-- even if some of the accesses are plain. (Of course, the CPU may
then execute some of those accesses out of program order, but we
already know how to deal with such issues.) Without the barrier()
there would be no such guarantee; the two groups of accesses could be
intermingled or even reversed in the object code.
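For instance, in this made-up fragment (where a and b are assumed to
be ordinary shared variables):

	a = 1;		/* plain store */
	barrier();
	b = 2;		/* plain store */

the barrier() ensures that the machine instructions implementing the
store to a all come po-before those implementing the store to b.
Without it, the compiler would be free to emit the store to b first.
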
The LKMM doesn't say much about the barrier() function, but it does
require that all fences are also compiler barriers. In addition, it
requires that the ordering properties of memory barriers such as
smp_rmb() or smp_store_release() apply to plain accesses as well as to
marked accesses.

This is the key to analyzing data races. Consider the MP pattern
again, now using plain accesses for buf:

	int buf = 0, flag = 0;

	P0()
	{
		U: buf = 1;
		smp_wmb();
		X: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Y: r1 = READ_ONCE(flag);
		if (r1) {
			smp_rmb();
			V: r2 = buf;
		}
	}

This program does not contain a data race. Although the U and V
accesses are race candidates, the LKMM can prove they are not
concurrent as follows:

	The smp_wmb() fence in P0 is both a compiler barrier and a
	cumul-fence. It guarantees that no matter what hash of
	machine instructions the compiler generates for the plain
	access U, all those instructions will be po-before the fence.
	Consequently U's store to buf, no matter how it is carried out
	at the machine level, must propagate to P1 before X's store to
	flag does.

	X and Y are both marked accesses. Hence an rfe link from X to
	Y is a valid indicator that X propagated to P1 before Y
	executed, i.e., X ->vis Y. (And if there is no rfe link then
	r1 will be 0, so V will not be executed and ipso facto won't
	race with U.)

	The smp_rmb() fence in P1 is a compiler barrier as well as a
	fence. It guarantees that all the machine-level instructions
	corresponding to the access V will be po-after the fence, and
	therefore any loads among those instructions will execute
	after the fence does and hence after Y does.

Thus U's store to buf is forced to propagate to P1 before V's load
executes (assuming V does execute), ruling out the possibility of a
data race between them.
This analysis illustrates how the LKMM deals with plain accesses in
general. Suppose R is a plain load and we want to show that R
executes before some marked access E. We can do this by finding a
marked access X such that R and X are ordered by a suitable fence and
X ->xb* E. If E was also a plain access, we would also look for a
marked access Y such that X ->xb* Y, and Y and E are ordered by a
fence. We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.

In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
This distinction is needed because some fences affect only loads
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same. And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself).
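Here is a minimal sketch of these notions, with hypothetical variables
x and y (the fences are "suitable" in the sense above because
smp_rmb() orders loads and smp_wmb() orders stores):

	r1 = x;			/* R: plain load */
	smp_rmb();
	r2 = READ_ONCE(y);	/* X: marked load; R is r-post-bounded by X */

and, for pre-bounding:

	WRITE_ONCE(y, 1);	/* Y: marked store */
	smp_wmb();
	x = 1;			/* E: plain store; E is w-pre-bounded by Y */

In each case the fence orders the plain access against the nearby
marked access, and it is the marked access that participates in the
xb* part of the argument.
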
The need to distinguish between r- and w-bounding raises yet another
issue. When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:

	x = 1;

the compiler is theoretically allowed to generate object code that
looks like:

	if (x != 1)
		x = 1;

thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious. We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this. Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed. This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location. Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load. Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do. Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.
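To see why, consider a plain load in the source code (x and r1 are
hypothetical):

	r1 = x;

If the compiler were to augment it with a store, generating object
code along the lines of:

	r1 = x;
	x = r1;		/* invented store */

then the invented store could race with a plain load of x running
concurrently on another CPU, creating a data race not present in the
source code.
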
The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load. That
is, in the sequence:

	p = READ_ONCE(ptr);
	r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load. This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold. Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races. For example:

	int a = 1, b;
	int *ptr = &a;

	P0()
	{
		b = 2;
		rcu_assign_pointer(ptr, &b);
	}

	P1()
	{
		int *p;
		int r;

		rcu_read_lock();
		p = rcu_dereference(ptr);
		r = *p;
		rcu_read_unlock();
	}

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods. They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order. However, rcu_dereference()
is only equivalent to READ_ONCE(). While it is a marked access, it is
not a fence or compiler barrier. Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one. In particular, comparisons between the pointer and
other known addresses can cause trouble. If you have something like:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = *p;

then the compiler just might generate object code resembling:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = x;

or even:

	rtemp = x;
	p = rcu_dereference(ptr);
	if (p == &x)
		r = rtemp;

which would invalidate the memory model's assumption, since the CPU
could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).

Finally, it turns out there is a situation in which a plain write does
not need to be w-post-bounded: when it is separated from the other
race-candidate access by a fence. At first glance this may seem
impossible. After all, to be race candidates the two accesses must
be on different CPUs, and fences don't link events on different CPUs.
Well, normal fences don't -- but rcu-fence can! Here's an example:

	int x, y;

	P0()
	{
		WRITE_ONCE(x, 1);
		synchronize_rcu();
		y = 3;
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) == 0)
			y = 2;
		rcu_read_unlock();
	}

Do the plain stores to y race? Clearly not if P1 reads a non-zero
value for x, so let's assume the READ_ONCE(x) does obtain 0. This
means that the read-side critical section in P1 must finish executing
before the grace period in P0 does, because RCU's Grace-Period
Guarantee says that otherwise P0's store to x would have propagated to
P1 before the critical section started and so would have been visible
to the READ_ONCE(). (Another way of putting it is that the fre link
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
between those two events.)

This means there is an rcu-fence link from P1's "y = 2" store to P0's
"y = 3" store, and consequently the first must propagate from P1 to P0
before the second can execute. Therefore the two stores cannot be
concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.

Putting all this material together yields the following picture. For
race-candidate stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a

	w-post-bounded ; vis ; w-pre-bounded

sequence. If W is plain then they also have to be linked by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence, and if W' is plain then they also have to be linked by a

	w-post-bounded ; vis ; r-pre-bounded

sequence. For race-candidate load R and store W, the LKMM says the
two accesses don't race if R can be linked to W by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence or if W can be linked to R by a

	w-post-bounded ; vis ; r-pre-bounded

sequence. For the cases involving a vis link, the LKMM also accepts
sequences in which W is linked to W' or R by a

	strong-fence ; xb* ; {w and/or r}-pre-bounded

sequence with no post-bounding, and in every case the LKMM also allows
the link simply to be a fence with no bounding at all. If no sequence
of the appropriate sort exists, the LKMM says that the accesses race.
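To see the scheme in action, the plain-access MP example above can be
mapped onto it (in outline; U and V are the race-candidate store and
load, and X and Y are the marked accesses):

	U ->w-post-bounded X	(U is ordered before X by smp_wmb())
	X ->vis Y		(the rfe link from X to Y)
	Y ->r-pre-bounded V	(Y is ordered before V by smp_rmb())

This is exactly a w-post-bounded ; vis ; r-pre-bounded sequence from U
to V, so the LKMM says the two accesses do not race.
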
There is one more part of the LKMM related to plain accesses (although
not to data races) we should discuss. Recall that many relations such
as hb are limited to marked accesses only. As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) don't apply to plain
accesses. Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

	If R and W are race candidates and it is possible to link R to
	W by one of the xb* sequences listed above, then W ->rfe R is
	not allowed (i.e., a load cannot read from a store that it
	executes before, even if one or both is plain).

	If W and R are race candidates and it is possible to link W to
	R by one of the vis sequences listed above, then R ->fre W is
	not allowed (i.e., if a store is visible to a load then the
	load must read from that store or one coherence-after it).

	If W and W' are race candidates and it is possible to link W
	to W' by one of the vis sequences listed above, then W' ->co W
	is not allowed (i.e., if one store is visible to a second then
	the second must come after the first in the coherence order).
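These rules have concrete consequences. In the plain-access MP
example above, whenever r1 = 1 the store U is linked to the load V by
one of the vis sequences, so the second rule forbids V ->fre U;
schematically:

	U ->vis-sequence V  implies  not (V ->fre U)

In other words, V must read from U or from some store coherence-after
it, so the outcome r1 = 1 && r2 = 0 is forbidden even though U and V
are plain accesses.
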
This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS

[...]

smp_store_release() -- which is basically how the Linux kernel treats
them.

Although we said that plain accesses are not linked by the ppo
relation, they do contribute to it indirectly. Namely, when there is
an address dependency from a marked load R to a plain store W,
followed by smp_wmb() and then a marked store W', the LKMM creates a
ppo link from R to W'. The reasoning behind this is perhaps a little
shaky, but essentially it says there is no way to generate object code
for this source code in which W' could execute before R. Just as with
pre-bounding by address dependencies, it is possible for the compiler
to undermine this relation if sufficient care is not taken.
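In code form, the pattern in question looks like this (a sketch, with
hypothetical variables ptr and x):

	p = READ_ONCE(ptr);	/* R: marked load */
	*p = 27;		/* W: plain store, address-dependent on R */
	smp_wmb();
	WRITE_ONCE(x, 1);	/* W': marked store; the LKMM gives R ->ppo W' */

The CPU cannot carry out the plain store until it knows the address in
p, and the smp_wmb() keeps the machine-level stores implementing W
po-before W'.
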
There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock(). The LKMM uses fence events with special
[...]
	smp_mb__after_spinlock() orders po-earlier lock acquisition
	events and the events preceding them against all po-later
	events.

Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:
---|