.. | .. |
---|
9 | 9 | conventions of cgroup v2. It describes all userland-visible aspects |
---|
10 | 10 | of cgroup including core and specific controller behaviors. All |
---|
11 | 11 | future changes must be reflected in this document. Documentation for |
---|
12 | | -v1 is available under Documentation/cgroup-v1/. |
---|
| 12 | +v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`. |
---|
13 | 13 | |
---|
14 | 14 | .. CONTENTS |
---|
15 | 15 | |
---|
.. | .. |
---|
54 | 54 | 5-3-3. IO Latency |
---|
55 | 55 | 5-3-3-1. How IO Latency Throttling Works |
---|
56 | 56 | 5-3-3-2. IO Latency Interface Files |
---|
| 57 | + 5-3-4. IO Priority |
---|
57 | 58 | 5-4. PID |
---|
58 | 59 | 5-4-1. PID Interface Files |
---|
59 | | - 5-5. Device |
---|
60 | | - 5-6. RDMA |
---|
61 | | - 5-6-1. RDMA Interface Files |
---|
62 | | - 5-7. Misc |
---|
63 | | - 5-7-1. perf_event |
---|
| 60 | + 5-5. Cpuset |
---|
| 61 | + 5-5-1. Cpuset Interface Files |
---|
| 62 | + 5-6. Device |
---|
| 63 | + 5-7. RDMA |
---|
| 64 | + 5-7-1. RDMA Interface Files |
---|
| 65 | + 5-8. HugeTLB |
---|
| 66 | + 5-8-1. HugeTLB Interface Files |
---|
| 67 | + 5-9. Misc |
---|
| 68 | + 5-9-1. perf_event |
---|
64 | 69 | 5-N. Non-normative information |
---|
65 | 70 | 5-N-1. CPU controller root cgroup process behaviour |
---|
66 | 71 | 5-N-2. IO controller root cgroup process behaviour |
---|
.. | .. |
---|
174 | 179 | through remount from the init namespace. The mount option is |
---|
175 | 180 | ignored on non-init namespace mounts. Please refer to the |
---|
176 | 181 | Delegation section for details. |
---|
| 182 | + |
---|
| 183 | + memory_localevents |
---|
| 184 | + |
---|
| 185 | + Only populate memory.events with data for the current cgroup, |
---|
| 186 | + and not any subtrees. This is legacy behaviour; the default |
---|
| 187 | + behaviour without this option is to include subtree counts. |
---|
| 188 | + This option is system wide and can only be set on mount or |
---|
| 189 | + modified through remount from the init namespace. The mount |
---|
| 190 | + option is ignored on non-init namespace mounts. |
---|
| 191 | + |
---|
| 192 | + memory_recursiveprot |
---|
| 193 | + |
---|
| 194 | + Recursively apply memory.min and memory.low protection to |
---|
| 195 | + entire subtrees, without requiring explicit downward |
---|
| 196 | + propagation into leaf cgroups. This allows protecting entire |
---|
| 197 | + subtrees from one another, while retaining free competition |
---|
| 198 | + within those subtrees. This should have been the default |
---|
| 199 | + behavior but is a mount-option to avoid regressing setups |
---|
| 200 | + relying on the original semantics (e.g. specifying bogusly |
---|
| 201 | + high 'bypass' protection values at higher tree levels). |
---|
177 | 202 | |
---|
178 | 203 | |
---|
179 | 204 | Organizing Processes and Threads |
---|
.. | .. |
---|
604 | 629 | Protections |
---|
605 | 630 | ----------- |
---|
606 | 631 | |
---|
607 | | -A cgroup is protected to be allocated upto the configured amount of |
---|
608 | | -the resource if the usages of all its ancestors are under their |
---|
| 632 | +A cgroup is protected up to the configured amount of the resource |
---|
| 633 | +as long as the usages of all its ancestors are under their |
---|
609 | 634 | protected levels. Protections can be hard guarantees or best effort |
---|
610 | 635 | soft boundaries. Protections can also be over-committed in which case |
---|
611 | 636 | only upto the amount available to the parent is protected among |
---|
.. | .. |
---|
690 | 715 | - Settings for a single feature should be contained in a single file. |
---|
691 | 716 | |
---|
692 | 717 | - The root cgroup should be exempt from resource control and thus |
---|
693 | | - shouldn't have resource control interface files. Also, |
---|
694 | | - informational files on the root cgroup which end up showing global |
---|
695 | | - information available elsewhere shouldn't exist. |
---|
| 718 | + shouldn't have resource control interface files. |
---|
696 | 719 | |
---|
697 | 720 | - The default time unit is microseconds. If a different unit is ever |
---|
698 | 721 | used, an explicit unit suffix must be present. |
---|
.. | .. |
---|
868 | 891 | populated |
---|
869 | 892 | 1 if the cgroup or its descendants contains any live |
---|
870 | 893 | processes; otherwise, 0. |
---|
| 894 | + frozen |
---|
| 895 | + 1 if the cgroup is frozen; otherwise, 0. |
---|
871 | 896 | |
---|
872 | 897 | cgroup.max.descendants |
---|
873 | 898 | A read-write single value files. The default is "max". |
---|
.. | .. |
---|
901 | 926 | A dying cgroup can consume system resources not exceeding |
---|
902 | 927 | limits, which were active at the moment of cgroup deletion. |
---|
903 | 928 | |
---|
| 929 | + cgroup.freeze |
---|
| 930 | + A read-write single value file which exists on non-root cgroups. |
---|
| 931 | + Allowed values are "0" and "1". The default is "0". |
---|
| 932 | + |
---|
| 933 | + Writing "1" to the file causes freezing of the cgroup and all |
---|
| 934 | + descendant cgroups. This means that all belonging processes will |
---|
| 935 | + be stopped and will not run until the cgroup is explicitly |
---|
| 936 | + unfrozen. Freezing of the cgroup may take some time; when this action |
---|
| 937 | + is completed, the "frozen" value in the cgroup.events control file |
---|
| 938 | + will be updated to "1" and the corresponding notification will be |
---|
| 939 | + issued. |
---|
| 940 | + |
---|
| 941 | + A cgroup can be frozen either by its own settings, or by settings |
---|
| 942 | + of any ancestor cgroups. If any ancestor cgroup is frozen, the |
---|
| 943 | + cgroup will remain frozen. |
---|
| 944 | + |
---|
| 945 | + Processes in the frozen cgroup can be killed by a fatal signal. |
---|
| 946 | + They also can enter and leave a frozen cgroup: either by an explicit |
---|
| 947 | + move by a user, or if freezing of the cgroup races with fork(). |
---|
| 948 | + If a process is moved to a frozen cgroup, it stops. If a process is |
---|
| 949 | + moved out of a frozen cgroup, it becomes running. |
---|
| 950 | + |
---|
| 951 | + Frozen status of a cgroup doesn't affect any cgroup tree operations: |
---|
| 952 | + it's possible to delete a frozen (and empty) cgroup, as well as |
---|
| 953 | + create new sub-cgroups. |
---|
904 | 954 | |
---|
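As a rough illustration of the cgroup.freeze and cgroup.events interaction described in this hunk, the following Python sketch parses the flat-keyed contents of cgroup.events and reports whether freezing has completed. The helper names `parse_flat_keyed` and `is_frozen` are hypothetical, and the sample text is made up for illustration. A real tool would read the file from the cgroup directory and wait for the file modified notification instead of parsing a string.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup file ("KEY VALUE" per line) into a dict."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            entries[parts[0]] = parts[1]
    return entries


def is_frozen(events_text):
    """Return True once the "frozen" key in cgroup.events reads 1."""
    return parse_flat_keyed(events_text).get("frozen") == "1"


# Illustrative contents only; not captured from a real system.
sample = "populated 1\nfrozen 0\n"
print(is_frozen(sample))
```

Because freezing may take some time, a caller that has written "1" to cgroup.freeze would typically re-check `is_frozen` after each notification on cgroup.events.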
905 | 955 | Controllers |
---|
906 | 956 | =========== |
---|
.. | .. |
---|
934 | 984 | All time durations are in microseconds. |
---|
935 | 985 | |
---|
936 | 986 | cpu.stat |
---|
937 | | - A read-only flat-keyed file which exists on non-root cgroups. |
---|
| 987 | + A read-only flat-keyed file. |
---|
938 | 988 | This file exists whether the controller is enabled or not. |
---|
939 | 989 | |
---|
940 | 990 | It always reports the following three stats: |
---|
.. | .. |
---|
983 | 1033 | A read-only nested-key file which exists on non-root cgroups. |
---|
984 | 1034 | |
---|
985 | 1035 | Shows pressure stall information for CPU. See |
---|
986 | | - Documentation/accounting/psi.txt for details. |
---|
| 1036 | + :ref:`Documentation/accounting/psi.rst <psi>` for details. |
---|
987 | 1037 | |
---|
988 | 1038 | cpu.uclamp.min |
---|
989 | 1039 | A read-write single value file which exists on non-root cgroups. |
---|
.. | .. |
---|
1058 | 1108 | is within its effective min boundary, the cgroup's memory |
---|
1059 | 1109 | won't be reclaimed under any conditions. If there is no |
---|
1060 | 1110 | unprotected reclaimable memory available, OOM killer |
---|
1061 | | - is invoked. |
---|
| 1111 | + is invoked. Above the effective min boundary (or |
---|
| 1112 | + effective low boundary if it is higher), pages are reclaimed |
---|
| 1113 | + proportionally to the overage, reducing reclaim pressure for |
---|
| 1114 | + smaller overages. |
---|
1062 | 1115 | |
---|
1063 | | - Effective min boundary is limited by memory.min values of |
---|
| 1116 | + Effective min boundary is limited by memory.min values of |
---|
1064 | 1117 | all ancestor cgroups. If there is memory.min overcommitment |
---|
1065 | 1118 | (child cgroup or cgroups are requiring more protected memory |
---|
1066 | 1119 | than parent will allow), then each child cgroup will get |
---|
.. | .. |
---|
1079 | 1132 | |
---|
1080 | 1133 | Best-effort memory protection. If the memory usage of a |
---|
1081 | 1134 | cgroup is within its effective low boundary, the cgroup's |
---|
1082 | | - memory won't be reclaimed unless memory can be reclaimed |
---|
1083 | | - from unprotected cgroups. |
---|
| 1135 | + memory won't be reclaimed unless there is no reclaimable |
---|
| 1136 | + memory available in unprotected cgroups. |
---|
| 1137 | + Above the effective low boundary (or |
---|
| 1138 | + effective min boundary if it is higher), pages are reclaimed |
---|
| 1139 | + proportionally to the overage, reducing reclaim pressure for |
---|
| 1140 | + smaller overages. |
---|
1084 | 1141 | |
---|
1085 | 1142 | Effective low boundary is limited by memory.low values of |
---|
1086 | 1143 | all ancestor cgroups. If there is memory.low overcommitment |
---|
.. | .. |
---|
1114 | 1171 | Under certain circumstances, the usage may go over the limit |
---|
1115 | 1172 | temporarily. |
---|
1116 | 1173 | |
---|
| 1174 | + In the default configuration, regular 0-order allocations always |
---|
| 1175 | + succeed unless the OOM killer chooses the current task as a victim. |
---|
| 1176 | + |
---|
| 1177 | + Some kinds of allocations don't invoke the OOM killer. |
---|
| 1178 | + The caller could retry them differently, return -ENOMEM to |
---|
| 1179 | + userspace, or silently ignore the failure in cases like disk readahead. |
---|
| 1180 | + |
---|
1117 | 1181 | This is the ultimate protection mechanism. As long as the |
---|
1118 | 1182 | high limit is used and monitored properly, this limit's |
---|
1119 | 1183 | utility is limited to providing the final safety net. |
---|
.. | .. |
---|
1142 | 1206 | otherwise, a value change in this file generates a file |
---|
1143 | 1207 | modified event. |
---|
1144 | 1208 | |
---|
| 1209 | + Note that all fields in this file are hierarchical and the |
---|
| 1210 | + file modified event can be generated due to an event down the |
---|
| 1211 | + hierarchy. For the local events at the cgroup level see |
---|
| 1212 | + memory.events.local. |
---|
| 1213 | + |
---|
1145 | 1214 | low |
---|
1146 | 1215 | The number of times the cgroup is reclaimed due to |
---|
1147 | 1216 | high memory pressure even though its usage is under |
---|
.. | .. |
---|
1165 | 1234 | The number of time the cgroup's memory usage was |
---|
1166 | 1235 | reached the limit and allocation was about to fail. |
---|
1167 | 1236 | |
---|
1168 | | - Depending on context result could be invocation of OOM |
---|
1169 | | - killer and retrying allocation or failing allocation. |
---|
1170 | | - |
---|
1171 | | - Failed allocation in its turn could be returned into |
---|
1172 | | - userspace as -ENOMEM or silently ignored in cases like |
---|
1173 | | - disk readahead. For now OOM in memory cgroup kills |
---|
1174 | | - tasks iff shortage has happened inside page fault. |
---|
| 1237 | + This event is not raised if the OOM killer is not |
---|
| 1238 | + considered as an option, e.g. for failed high-order |
---|
| 1239 | + allocations or if caller asked to not retry attempts. |
---|
1175 | 1240 | |
---|
1176 | 1241 | oom_kill |
---|
1177 | 1242 | The number of processes belonging to this cgroup |
---|
1178 | 1243 | killed by any kind of OOM killer. |
---|
| 1244 | + |
---|
| 1245 | + memory.events.local |
---|
| 1246 | + Similar to memory.events but the fields in the file are local |
---|
| 1247 | + to the cgroup i.e. not hierarchical. The file modified event |
---|
| 1248 | + generated on this file reflects only the local events. |
---|
1179 | 1249 | |
---|
1180 | 1250 | memory.stat |
---|
1181 | 1251 | A read-only flat-keyed file which exists on non-root cgroups. |
---|
.. | .. |
---|
1190 | 1260 | can show up in the middle. Don't rely on items remaining in a |
---|
1191 | 1261 | fixed position; use the keys to look up specific values! |
---|
1192 | 1262 | |
---|
| 1263 | + If an entry has no per-node counter (and thus does not show in |
---|
| 1264 | + memory.numa_stat), we use 'npn' (non-per-node) as the tag |
---|
| 1265 | + to indicate that it will not show in memory.numa_stat. |
---|
| 1266 | + |
---|
1193 | 1267 | anon |
---|
1194 | 1268 | Amount of memory used in anonymous mappings such as |
---|
1195 | 1269 | brk(), sbrk(), and mmap(MAP_ANONYMOUS) |
---|
.. | .. |
---|
1201 | 1275 | kernel_stack |
---|
1202 | 1276 | Amount of memory allocated to kernel stacks. |
---|
1203 | 1277 | |
---|
1204 | | - slab |
---|
1205 | | - Amount of memory used for storing in-kernel data |
---|
1206 | | - structures. |
---|
| 1278 | + percpu(npn) |
---|
| 1279 | + Amount of memory used for storing per-cpu kernel |
---|
| 1280 | + data structures. |
---|
1207 | 1281 | |
---|
1208 | | - sock |
---|
| 1282 | + sock(npn) |
---|
1209 | 1283 | Amount of memory used in network transmission buffers |
---|
1210 | 1284 | |
---|
1211 | 1285 | shmem |
---|
.. | .. |
---|
1223 | 1297 | Amount of cached filesystem data that was modified and |
---|
1224 | 1298 | is currently being written back to disk |
---|
1225 | 1299 | |
---|
| 1300 | + anon_thp |
---|
| 1301 | + Amount of memory used in anonymous mappings backed by |
---|
| 1302 | + transparent hugepages |
---|
| 1303 | + |
---|
1226 | 1304 | inactive_anon, active_anon, inactive_file, active_file, unevictable |
---|
1227 | 1305 | Amount of memory, swap-backed and filesystem-backed, |
---|
1228 | 1306 | on the internal memory management lists used by the |
---|
1229 | | - page reclaim algorithm |
---|
| 1307 | + page reclaim algorithm. |
---|
| 1308 | + |
---|
| 1309 | + As these represent internal list state (e.g. shmem pages are on anon |
---|
| 1310 | + memory management lists), inactive_foo + active_foo may not be equal to |
---|
| 1311 | + the value for the foo counter, since the foo counter is type-based, not |
---|
| 1312 | + list-based. |
---|
1230 | 1313 | |
---|
1231 | 1314 | slab_reclaimable |
---|
1232 | 1315 | Part of "slab" that might be reclaimed, such as |
---|
.. | .. |
---|
1236 | 1319 | Part of "slab" that cannot be reclaimed on memory |
---|
1237 | 1320 | pressure. |
---|
1238 | 1321 | |
---|
1239 | | - pgfault |
---|
1240 | | - Total number of page faults incurred |
---|
| 1322 | + slab(npn) |
---|
| 1323 | + Amount of memory used for storing in-kernel data |
---|
| 1324 | + structures. |
---|
1241 | 1325 | |
---|
1242 | | - pgmajfault |
---|
1243 | | - Number of major page faults incurred |
---|
| 1326 | + workingset_refault_anon |
---|
| 1327 | + Number of refaults of previously evicted anonymous pages. |
---|
1244 | 1328 | |
---|
1245 | | - workingset_refault |
---|
| 1329 | + workingset_refault_file |
---|
| 1330 | + Number of refaults of previously evicted file pages. |
---|
1246 | 1331 | |
---|
1247 | | - Number of refaults of previously evicted pages |
---|
| 1332 | + workingset_activate_anon |
---|
| 1333 | + Number of refaulted anonymous pages that were immediately |
---|
| 1334 | + activated. |
---|
1248 | 1335 | |
---|
1249 | | - workingset_activate |
---|
| 1336 | + workingset_activate_file |
---|
| 1337 | + Number of refaulted file pages that were immediately activated. |
---|
1250 | 1338 | |
---|
1251 | | - Number of refaulted pages that were immediately activated |
---|
| 1339 | + workingset_restore_anon |
---|
| 1340 | + Number of restored anonymous pages which have been detected as |
---|
| 1341 | + an active workingset before they got reclaimed. |
---|
| 1342 | + |
---|
| 1343 | + workingset_restore_file |
---|
| 1344 | + Number of restored file pages which have been detected as an |
---|
| 1345 | + active workingset before they got reclaimed. |
---|
1252 | 1346 | |
---|
1253 | 1347 | workingset_nodereclaim |
---|
1254 | | - |
---|
1255 | 1348 | Number of times a shadow node has been reclaimed |
---|
1256 | 1349 | |
---|
1257 | | - pgrefill |
---|
| 1350 | + pgfault(npn) |
---|
| 1351 | + Total number of page faults incurred |
---|
1258 | 1352 | |
---|
| 1353 | + pgmajfault(npn) |
---|
| 1354 | + Number of major page faults incurred |
---|
| 1355 | + |
---|
| 1356 | + pgrefill(npn) |
---|
1259 | 1357 | Amount of scanned pages (in an active LRU list) |
---|
1260 | 1358 | |
---|
1261 | | - pgscan |
---|
1262 | | - |
---|
| 1359 | + pgscan(npn) |
---|
1263 | 1360 | Amount of scanned pages (in an inactive LRU list) |
---|
1264 | 1361 | |
---|
1265 | | - pgsteal |
---|
1266 | | - |
---|
| 1362 | + pgsteal(npn) |
---|
1267 | 1363 | Amount of reclaimed pages |
---|
1268 | 1364 | |
---|
1269 | | - pgactivate |
---|
1270 | | - |
---|
| 1365 | + pgactivate(npn) |
---|
1271 | 1366 | Amount of pages moved to the active LRU list |
---|
1272 | 1367 | |
---|
1273 | | - pgdeactivate |
---|
| 1368 | + pgdeactivate(npn) |
---|
| 1369 | + Amount of pages moved to the inactive LRU list |
---|
1274 | 1370 | |
---|
1275 | | - Amount of pages moved to the inactive LRU lis |
---|
1276 | | - |
---|
1277 | | - pglazyfree |
---|
1278 | | - |
---|
| 1371 | + pglazyfree(npn) |
---|
1279 | 1372 | Amount of pages postponed to be freed under memory pressure |
---|
1280 | 1373 | |
---|
1281 | | - pglazyfreed |
---|
1282 | | - |
---|
| 1374 | + pglazyfreed(npn) |
---|
1283 | 1375 | Amount of reclaimed lazyfree pages |
---|
| 1376 | + |
---|
| 1377 | + thp_fault_alloc(npn) |
---|
| 1378 | + Number of transparent hugepages which were allocated to satisfy |
---|
| 1379 | + a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE |
---|
| 1380 | + is not set. |
---|
| 1381 | + |
---|
| 1382 | + thp_collapse_alloc(npn) |
---|
| 1383 | + Number of transparent hugepages which were allocated to allow |
---|
| 1384 | + collapsing an existing range of pages. This counter is not |
---|
| 1385 | + present when CONFIG_TRANSPARENT_HUGEPAGE is not set. |
---|
| 1386 | + |
---|
| 1387 | + memory.numa_stat |
---|
| 1388 | + A read-only nested-keyed file which exists on non-root cgroups. |
---|
| 1389 | + |
---|
| 1390 | + This breaks down the cgroup's memory footprint into different |
---|
| 1391 | + types of memory, type-specific details, and other information |
---|
| 1392 | + per node on the state of the memory management system. |
---|
| 1393 | + |
---|
| 1394 | + This is useful for providing visibility into the NUMA locality |
---|
| 1395 | + information within a memcg since the pages are allowed to be |
---|
| 1396 | + allocated from any physical node. One use case is evaluating |
---|
| 1397 | + application performance by combining this information with the |
---|
| 1398 | + application's CPU allocation. |
---|
| 1399 | + |
---|
| 1400 | + All memory amounts are in bytes. |
---|
| 1401 | + |
---|
| 1402 | + The output format of memory.numa_stat is:: |
---|
| 1403 | + |
---|
| 1404 | + type N0=<bytes in node 0> N1=<bytes in node 1> ... |
---|
| 1405 | + |
---|
| 1406 | + The entries are ordered to be human readable, and new entries |
---|
| 1407 | + can show up in the middle. Don't rely on items remaining in a |
---|
| 1408 | + fixed position; use the keys to look up specific values! |
---|
| 1409 | + |
---|
| 1410 | + For the meaning of the entries, refer to memory.stat. |
---|
1284 | 1411 | |
---|
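To make the memory.numa_stat output format above concrete, here is a minimal Python sketch that turns lines of the form `type N0=<bytes> N1=<bytes> ...` into a nested dictionary keyed by entry name and node number. The function name `parse_numa_stat` and the sample values are assumptions made for illustration. As the text advises, entries are looked up by key and never by position.

```python
def parse_numa_stat(text):
    """Parse "type N0=<bytes> N1=<bytes> ..." lines into {type: {node: bytes}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        per_node = {}
        for item in fields[1:]:
            node, _, value = item.partition("=")
            # "N0" -> node 0; the leading "N" is dropped.
            per_node[int(node[1:])] = int(value)
        stats[fields[0]] = per_node
    return stats


# Made-up sample values, in bytes, for a two-node system.
sample = "anon N0=4096 N1=8192\nfile N0=0 N1=12288\n"
stats = parse_numa_stat(sample)
print(stats["anon"][1])
```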
1285 | 1412 | memory.swap.current |
---|
1286 | 1413 | A read-only single value file which exists on non-root |
---|
.. | .. |
---|
1288 | 1415 | |
---|
1289 | 1416 | The total amount of swap currently being used by the cgroup |
---|
1290 | 1417 | and its descendants. |
---|
| 1418 | + |
---|
| 1419 | + memory.swap.high |
---|
| 1420 | + A read-write single value file which exists on non-root |
---|
| 1421 | + cgroups. The default is "max". |
---|
| 1422 | + |
---|
| 1423 | + Swap usage throttle limit. If a cgroup's swap usage exceeds |
---|
| 1424 | + this limit, all its further allocations will be throttled to |
---|
| 1425 | + allow userspace to implement custom out-of-memory procedures. |
---|
| 1426 | + |
---|
| 1427 | + This limit marks a point of no return for the cgroup. It is NOT |
---|
| 1428 | + designed to manage the amount of swapping a workload does |
---|
| 1429 | + during regular operation. Compare to memory.swap.max, which |
---|
| 1430 | + prohibits swapping past a set amount, but lets the cgroup |
---|
| 1431 | + continue unimpeded as long as other memory can be reclaimed. |
---|
| 1432 | + |
---|
| 1433 | + Healthy workloads are not expected to reach this limit. |
---|
1291 | 1434 | |
---|
1292 | 1435 | memory.swap.max |
---|
1293 | 1436 | A read-write single value file which exists on non-root |
---|
.. | .. |
---|
1301 | 1444 | The following entries are defined. Unless specified |
---|
1302 | 1445 | otherwise, a value change in this file generates a file |
---|
1303 | 1446 | modified event. |
---|
| 1447 | + |
---|
| 1448 | + high |
---|
| 1449 | + The number of times the cgroup's swap usage was over |
---|
| 1450 | + the high threshold. |
---|
1304 | 1451 | |
---|
1305 | 1452 | max |
---|
1306 | 1453 | The number of times the cgroup's swap usage was about |
---|
.. | .. |
---|
1321 | 1468 | A read-only nested-key file which exists on non-root cgroups. |
---|
1322 | 1469 | |
---|
1323 | 1470 | Shows pressure stall information for memory. See |
---|
1324 | | - Documentation/accounting/psi.txt for details. |
---|
| 1471 | + :ref:`Documentation/accounting/psi.rst <psi>` for details. |
---|
1325 | 1472 | |
---|
1326 | 1473 | |
---|
1327 | 1474 | Usage Guidelines |
---|
.. | .. |
---|
1381 | 1528 | ~~~~~~~~~~~~~~~~~~ |
---|
1382 | 1529 | |
---|
1383 | 1530 | io.stat |
---|
1384 | | - A read-only nested-keyed file which exists on non-root |
---|
1385 | | - cgroups. |
---|
| 1531 | + A read-only nested-keyed file. |
---|
1386 | 1532 | |
---|
1387 | 1533 | Lines are keyed by $MAJ:$MIN device numbers and not ordered. |
---|
1388 | 1534 | The following nested keys are defined. |
---|
.. | .. |
---|
1396 | 1542 | dios Number of discard IOs |
---|
1397 | 1543 | ====== ===================== |
---|
1398 | 1544 | |
---|
1399 | | - An example read output follows: |
---|
| 1545 | + An example read output follows:: |
---|
1400 | 1546 | |
---|
1401 | 1547 | 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 |
---|
1402 | 1548 | 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 |
---|
| 1549 | + |
---|
| 1550 | + io.cost.qos |
---|
| 1551 | + A read-write nested-keyed file which exists only on the root |
---|
| 1552 | + cgroup. |
---|
| 1553 | + |
---|
| 1554 | + This file configures the Quality of Service of the IO cost |
---|
| 1555 | + model based controller (CONFIG_BLK_CGROUP_IOCOST) which |
---|
| 1556 | + currently implements "io.weight" proportional control. Lines |
---|
| 1557 | + are keyed by $MAJ:$MIN device numbers and not ordered. The |
---|
| 1558 | + line for a given device is populated on the first write for |
---|
| 1559 | + the device on "io.cost.qos" or "io.cost.model". The following |
---|
| 1560 | + nested keys are defined. |
---|
| 1561 | + |
---|
| 1562 | + ====== ===================================== |
---|
| 1563 | + enable Weight-based control enable |
---|
| 1564 | + ctrl "auto" or "user" |
---|
| 1565 | + rpct Read latency percentile [0, 100] |
---|
| 1566 | + rlat Read latency threshold |
---|
| 1567 | + wpct Write latency percentile [0, 100] |
---|
| 1568 | + wlat Write latency threshold |
---|
| 1569 | + min Minimum scaling percentage [1, 10000] |
---|
| 1570 | + max Maximum scaling percentage [1, 10000] |
---|
| 1571 | + ====== ===================================== |
---|
| 1572 | + |
---|
| 1573 | + The controller is disabled by default and can be enabled by |
---|
| 1574 | + setting "enable" to 1. "rpct" and "wpct" parameters default |
---|
| 1575 | + to zero and the controller uses internal device saturation |
---|
| 1576 | + state to adjust the overall IO rate between "min" and "max". |
---|
| 1577 | + |
---|
| 1578 | + When a better control quality is needed, latency QoS |
---|
| 1579 | + parameters can be configured. For example:: |
---|
| 1580 | + |
---|
| 1581 | + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 |
---|
| 1582 | + |
---|
| 1583 | + shows that on sdb, the controller is enabled, will consider |
---|
| 1584 | + the device saturated if the 95th percentile of read completion |
---|
| 1585 | + latencies is above 75ms or write 150ms, and adjust the overall |
---|
| 1586 | + IO issue rate between 50% and 150% accordingly. |
---|
| 1587 | + |
---|
| 1588 | + The lower the saturation point, the better the latency QoS at |
---|
| 1589 | + the cost of aggregate bandwidth. The narrower the allowed |
---|
| 1590 | + adjustment range between "min" and "max", the more conformant |
---|
| 1591 | + to the cost model the IO behavior. Note that the IO issue |
---|
| 1592 | + base rate may be far off from 100% and setting "min" and "max" |
---|
| 1593 | + blindly can lead to a significant loss of device capacity or |
---|
| 1594 | + control quality. "min" and "max" are useful for regulating |
---|
| 1595 | + devices which show wide temporary behavior changes - e.g. an |
---|
| 1596 | + SSD which accepts writes at line speed for a while and |
---|
| 1597 | + then completely stalls for multiple seconds. |
---|
| 1598 | + |
---|
| 1599 | + When "ctrl" is "auto", the parameters are controlled by the |
---|
| 1600 | + kernel and may change automatically. Setting "ctrl" to "user" |
---|
| 1601 | + or setting any of the percentile and latency parameters puts |
---|
| 1602 | + it into "user" mode and disables the automatic changes. The |
---|
| 1603 | + automatic mode can be restored by setting "ctrl" to "auto". |
---|
| 1604 | + |
---|
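The io.cost.qos example line above can be read mechanically. The sketch below, in Python, splits a nested-keyed line into its `$MAJ:$MIN` device key and a dictionary of parameters, then converts the latency thresholds to milliseconds. The helper name `parse_nested_keyed_line` is hypothetical. The conversion assumes the thresholds are in microseconds, which matches the document's default time unit and its reading of `rlat=75000` as 75ms.

```python
def parse_nested_keyed_line(line):
    """Split "MAJ:MIN key=value ..." into (device, {key: value})."""
    fields = line.split()
    device = fields[0]
    params = dict(item.split("=", 1) for item in fields[1:])
    return device, params


line = ("8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
        "wpct=95.00 wlat=150000 min=50.00 max=150.0")
device, qos = parse_nested_keyed_line(line)

# Assumed unit: microseconds, so divide by 1000 to get milliseconds.
read_latency_ms = int(qos["rlat"]) / 1000
write_latency_ms = int(qos["wlat"]) / 1000
print(device, read_latency_ms, write_latency_ms)
```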
| 1605 | + io.cost.model |
---|
| 1606 | + A read-write nested-keyed file which exists only on the root |
---|
| 1607 | + cgroup. |
---|
| 1608 | + |
---|
| 1609 | + This file configures the cost model of the IO cost model based |
---|
| 1610 | + controller (CONFIG_BLK_CGROUP_IOCOST) which currently |
---|
| 1611 | + implements "io.weight" proportional control. Lines are keyed |
---|
| 1612 | + by $MAJ:$MIN device numbers and not ordered. The line for a |
---|
| 1613 | + given device is populated on the first write for the device on |
---|
| 1614 | + "io.cost.qos" or "io.cost.model". The following nested keys |
---|
| 1615 | + are defined. |
---|
| 1616 | + |
---|
| 1617 | + ===== ================================ |
---|
| 1618 | + ctrl "auto" or "user" |
---|
| 1619 | + model The cost model in use - "linear" |
---|
| 1620 | + ===== ================================ |
---|
| 1621 | + |
---|
| 1622 | + When "ctrl" is "auto", the kernel may change all parameters |
---|
| 1623 | + dynamically. When "ctrl" is set to "user" or any other |
---|
| 1624 | + parameters are written to, "ctrl" becomes "user" and the |
---|
| 1625 | + automatic changes are disabled. |
---|
| 1626 | + |
---|
| 1627 | + When "model" is "linear", the following model parameters are |
---|
| 1628 | + defined. |
---|
| 1629 | + |
---|
| 1630 | + ============= ======================================== |
---|
| 1631 | + [r|w]bps The maximum sequential IO throughput |
---|
| 1632 | + [r|w]seqiops The maximum 4k sequential IOs per second |
---|
| 1633 | + [r|w]randiops The maximum 4k random IOs per second |
---|
| 1634 | + ============= ======================================== |
---|
| 1635 | + |
---|
| 1636 | + From the above, the builtin linear model determines the base |
---|
| 1637 | + costs of a sequential and random IO and the cost coefficient |
---|
| 1638 | + for the IO size. While simple, this model can cover most |
---|
| 1639 | + common device classes acceptably. |
---|
| 1640 | + |
---|
| 1641 | + The IO cost model isn't expected to be accurate in absolute |
---|
| 1642 | + sense and is scaled to the device behavior dynamically. |
---|
| 1643 | + |
---|
| 1644 | + If needed, tools/cgroup/iocost_coef_gen.py can be used to |
---|
| 1645 | + generate device-specific coefficients. |
---|
1403 | 1646 | |
---|
1404 | 1647 | io.weight |
---|
1405 | 1648 | A read-write flat-keyed file which exists on non-root cgroups. |
---|
.. | .. |
---|
1464 | 1707 | A read-only nested-key file which exists on non-root cgroups. |
---|
1465 | 1708 | |
---|
1466 | 1709 | Shows pressure stall information for IO. See |
---|
1467 | | - Documentation/accounting/psi.txt for details. |
---|
| 1710 | + :ref:`Documentation/accounting/psi.rst <psi>` for details. |
---|
1468 | 1711 | |
---|
1469 | 1712 | |
---|
1470 | 1713 | Writeback |
---|
.. | .. |
---|
1485 | 1728 | of the two is enforced. |
---|
1486 | 1729 | |
---|
1487 | 1730 | cgroup writeback requires explicit support from the underlying |
---|
1488 | | -filesystem. Currently, cgroup writeback is implemented on ext2, ext4 |
---|
1489 | | -and btrfs. On other filesystems, all writeback IOs are attributed to |
---|
1490 | | -the root cgroup. |
---|
| 1731 | +filesystem. Currently, cgroup writeback is implemented on ext2, ext4, |
---|
| 1732 | +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are |
---|
| 1733 | +attributed to the root cgroup. |
---|
1491 | 1734 | |
---|
1492 | 1735 | There are inherent differences in memory and writeback management |
---|
1493 | 1736 | which affects how cgroup ownership is tracked. Memory is tracked per |
---|
.. | .. |
---|
1537 | 1780 | |
---|
1538 | 1781 | The limits are only applied at the peer level in the hierarchy. This means that |
---|
1539 | 1782 | in the diagram below, only groups A, B, and C will influence each other, and |
---|
1540 | | -groups D and F will influence each other. Group G will influence nobody. |
---|
| 1783 | +groups D and F will influence each other. Group G will influence nobody:: |
---|
1541 | 1784 | |
---|
1542 | 1785 | [root] |
---|
1543 | 1786 | / | \ |
---|
.. | .. |
---|
1606 | 1849 | duration of time between evaluation events. Windows only elapse |
---|
1607 | 1850 | with IO activity. Idle periods extend the most recent window. |
---|
1608 | 1851 | |
---|
| 1852 | +IO Priority |
---|
| 1853 | +~~~~~~~~~~~ |
---|
| 1854 | + |
---|
| 1855 | +A single attribute controls the behavior of the I/O priority cgroup policy, |
---|
| 1856 | +namely the io.prio.class attribute. The following values are accepted for |
---|
| 1857 | +that attribute: |
---|
| 1858 | + |
---|
| 1859 | + no-change |
---|
| 1860 | + Do not modify the I/O priority class. |
---|
| 1861 | + |
---|
| 1862 | + none-to-rt |
---|
| 1863 | + For requests that do not have an I/O priority class (NONE), |
---|
| 1864 | + change the I/O priority class into RT. Do not modify |
---|
| 1865 | + the I/O priority class of other requests. |
---|
| 1866 | + |
---|
| 1867 | + restrict-to-be |
---|
| 1868 | + For requests that do not have an I/O priority class or that have I/O |
---|
| 1869 | + priority class RT, change it into BE. Do not modify the I/O priority |
---|
| 1870 | + class of requests that have priority class IDLE. |
---|
| 1871 | + |
---|
| 1872 | + idle |
---|
| 1873 | + Change the I/O priority class of all requests into IDLE, the lowest |
---|
| 1874 | + I/O priority class. |
---|
| 1875 | + |
---|
| 1876 | +The following numerical values are associated with the I/O priority policies: |
---|
| 1877 | + |
---|
| 1878 | ++----------------+---+ |
---|
| 1879 | +| no-change      | 0 | |
---|
| 1880 | ++----------------+---+ |
---|
| 1881 | +| none-to-rt     | 1 | |
---|
| 1882 | ++----------------+---+ |
---|
| 1883 | +| restrict-to-be | 2 | |
---|
| 1884 | ++----------------+---+ |
---|
| 1885 | +| idle           | 3 | |
---|
| 1886 | ++----------------+---+ |
---|
| 1887 | + |
---|
| 1888 | +The numerical value that corresponds to each I/O priority class is as follows: |
---|
| 1889 | + |
---|
| 1890 | ++-------------------------------+---+ |
---|
| 1891 | +| IOPRIO_CLASS_NONE | 0 | |
---|
| 1892 | ++-------------------------------+---+ |
---|
| 1893 | +| IOPRIO_CLASS_RT (real-time) | 1 | |
---|
| 1894 | ++-------------------------------+---+ |
---|
| 1895 | +| IOPRIO_CLASS_BE (best effort) | 2 | |
---|
| 1896 | ++-------------------------------+---+ |
---|
| 1897 | +| IOPRIO_CLASS_IDLE | 3 | |
---|
| 1898 | ++-------------------------------+---+ |
---|
| 1899 | + |
---|
| 1900 | +The algorithm to set the I/O priority class for a request is as follows: |
---|
| 1901 | + |
---|
| 1902 | +- Translate the I/O priority class policy into a number. |
---|
| 1903 | +- Change the request I/O priority class into the maximum of the I/O priority |
---|
| 1904 | + class policy number and the numerical I/O priority class. |
---|
| 1905 | + |
---|
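The two-step algorithm above can be sketched in a few lines. This is only an illustrative model, not kernel code: the helper name ``effective_class`` is hypothetical, the policy names are the ones listed as accepted values earlier in this section, and the numbers come from the two tables.

```python
# Numeric values as given in the policy and priority class tables.
POLICY = {"no-change": 0, "none-to-rt": 1, "restrict-to-be": 2, "idle": 3}
IOPRIO_CLASS = {"NONE": 0, "RT": 1, "BE": 2, "IDLE": 3}
CLASS_NAME = {value: name for name, value in IOPRIO_CLASS.items()}


def effective_class(policy, request_class):
    """Return the class name a request ends up with under a policy.

    Hypothetical helper: translate the policy into a number, then take
    the maximum of that number and the request's numerical class.
    """
    return CLASS_NAME[max(POLICY[policy], IOPRIO_CLASS[request_class])]


if __name__ == "__main__":
    for policy in POLICY:
        for cls in IOPRIO_CLASS:
            print(f"{policy:15s} {cls:5s} -> {effective_class(policy, cls)}")
```

Because a larger number means a lower priority, taking the maximum can only keep or lower the priority of a request that already has a class; the one apparent promotion is a request with no class (NONE) being assigned RT under "none-to-rt".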
1609 | 1906 | PID |
---|
1610 | 1907 | --- |
---|
1611 | 1908 | |
---|
.. | .. |
---|
1646 | 1943 | of a new process would cause a cgroup policy to be violated. |
---|
1647 | 1944 | |
---|
1648 | 1945 | |
---|
| 1946 | +Cpuset |
---|
| 1947 | +------ |
---|
| 1948 | + |
---|
| 1949 | +The "cpuset" controller provides a mechanism for constraining |
---|
| 1950 | +the CPU and memory node placement of tasks to only the resources |
---|
| 1951 | +specified in the cpuset interface files in a task's current cgroup. |
---|
| 1952 | +This is especially valuable on large NUMA systems where placing jobs |
---|
| 1953 | +on properly sized subsets of the systems with careful processor and |
---|
| 1954 | +memory placement to reduce cross-node memory access and contention |
---|
| 1955 | +can improve overall system performance. |
---|
| 1956 | + |
---|
| 1957 | +The "cpuset" controller is hierarchical. That means a cgroup |
---|
| 1958 | +cannot use CPUs or memory nodes not allowed in its parent. |
---|
| 1959 | + |
---|
| 1960 | + |
---|
| 1961 | +Cpuset Interface Files |
---|
| 1962 | +~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 1963 | + |
---|
| 1964 | + cpuset.cpus |
---|
| 1965 | + A read-write multiple values file which exists on non-root |
---|
| 1966 | + cpuset-enabled cgroups. |
---|
| 1967 | + |
---|
| 1968 | + It lists the requested CPUs to be used by tasks within this |
---|
| 1969 | + cgroup. The actual list of CPUs to be granted, however, is |
---|
| 1970 | + subjected to constraints imposed by its parent and can differ |
---|
| 1971 | + from the requested CPUs. |
---|
| 1972 | + |
---|
| 1973 | + The CPU numbers are comma-separated numbers or ranges. |
---|
| 1974 | + For example:: |
---|
| 1975 | + |
---|
| 1976 | + # cat cpuset.cpus |
---|
| 1977 | + 0-4,6,8-10 |
---|
| 1978 | + |
---|
| 1979 | + An empty value indicates that the cgroup is using the same |
---|
| 1980 | + setting as the nearest cgroup ancestor with a non-empty |
---|
| 1981 | + "cpuset.cpus" or all the available CPUs if none is found. |
---|
| 1982 | + |
---|
| 1983 | + The value of "cpuset.cpus" stays constant until the next update |
---|
| 1984 | + and won't be affected by any CPU hotplug events. |
---|
| 1985 | + |
---|
| 1986 | + cpuset.cpus.effective |
---|
| 1987 | + A read-only multiple values file which exists on all |
---|
| 1988 | + cpuset-enabled cgroups. |
---|
| 1989 | + |
---|
| 1990 | + It lists the onlined CPUs that are actually granted to this |
---|
| 1991 | + cgroup by its parent. These CPUs are allowed to be used by |
---|
| 1992 | + tasks within the current cgroup. |
---|
| 1993 | + |
---|
| 1994 | + If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows |
---|
| 1995 | + all the CPUs from the parent cgroup that can be available to |
---|
| 1996 | + be used by this cgroup. Otherwise, it should be a subset of |
---|
| 1997 | + "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" |
---|
| 1998 | + can be granted. In this case, it will be treated just like an |
---|
| 1999 | + empty "cpuset.cpus". |
---|
| 2000 | + |
---|
| 2001 | + Its value will be affected by CPU hotplug events. |
---|
| 2002 | + |
---|
| 2003 | + cpuset.mems |
---|
| 2004 | + A read-write multiple values file which exists on non-root |
---|
| 2005 | + cpuset-enabled cgroups. |
---|
| 2006 | + |
---|
| 2007 | + It lists the requested memory nodes to be used by tasks within |
---|
| 2008 | + this cgroup. The actual list of memory nodes granted, however, |
---|
| 2009 | + is subjected to constraints imposed by its parent and can differ |
---|
| 2010 | + from the requested memory nodes. |
---|
| 2011 | + |
---|
| 2012 | + The memory node numbers are comma-separated numbers or ranges. |
---|
| 2013 | + For example:: |
---|
| 2014 | + |
---|
| 2015 | + # cat cpuset.mems |
---|
| 2016 | + 0-1,3 |
---|
| 2017 | + |
---|
| 2018 | + An empty value indicates that the cgroup is using the same |
---|
| 2019 | + setting as the nearest cgroup ancestor with a non-empty |
---|
| 2020 | + "cpuset.mems" or all the available memory nodes if none |
---|
| 2021 | + is found. |
---|
| 2022 | + |
---|
| 2023 | + The value of "cpuset.mems" stays constant until the next update |
---|
| 2024 | + and won't be affected by any memory node hotplug events. |
---|
| 2025 | + |
---|
| 2026 | + cpuset.mems.effective |
---|
| 2027 | + A read-only multiple values file which exists on all |
---|
| 2028 | + cpuset-enabled cgroups. |
---|
| 2029 | + |
---|
| 2030 | + It lists the onlined memory nodes that are actually granted to |
---|
| 2031 | + this cgroup by its parent. These memory nodes are allowed to |
---|
| 2032 | + be used by tasks within the current cgroup. |
---|
| 2033 | + |
---|
| 2034 | + If "cpuset.mems" is empty, it shows all the memory nodes from the |
---|
| 2035 | + parent cgroup that will be available to be used by this cgroup. |
---|
| 2036 | + Otherwise, it should be a subset of "cpuset.mems" unless none of |
---|
| 2037 | + the memory nodes listed in "cpuset.mems" can be granted. In this |
---|
| 2038 | + case, it will be treated just like an empty "cpuset.mems". |
---|
| 2039 | + |
---|
| 2040 | + Its value will be affected by memory node hotplug events. |
---|
| 2041 | + |
---|
| 2042 | + cpuset.cpus.partition |
---|
| 2043 | + A read-write single value file which exists on non-root |
---|
| 2044 | + cpuset-enabled cgroups. This flag is owned by the parent cgroup |
---|
| 2045 | + and is not delegatable. |
---|
| 2046 | + |
---|
| 2047 | + It accepts only the following input values when written to. |
---|
| 2048 | + |
---|
| 2049 | + "root" - a partition root |
---|
| 2050 | + "member" - a non-root member of a partition |
---|
| 2051 | + |
---|
| 2052 | + When set to be a partition root, the current cgroup is the |
---|
| 2053 | + root of a new partition or scheduling domain that comprises |
---|
| 2054 | + itself and all its descendants except those that are separate |
---|
| 2055 | + partition roots themselves and their descendants. The root |
---|
| 2056 | + cgroup is always a partition root. |
---|
| 2057 | + |
---|
| 2058 | + There are constraints on where a partition root can be set. |
---|
| 2059 | + It can only be set in a cgroup if all the following conditions |
---|
| 2060 | + are true. |
---|
| 2061 | + |
---|
| 2062 | + 1) The "cpuset.cpus" is not empty and the list of CPUs is |
---|
| 2063 | + exclusive, i.e. they are not shared by any of its siblings. |
---|
| 2064 | + 2) The parent cgroup is a partition root. |
---|
| 2065 | + 3) The "cpuset.cpus" is also a proper subset of the parent's |
---|
| 2066 | + "cpuset.cpus.effective". |
---|
| 2067 | + 4) There are no child cgroups with cpuset enabled. This is for |
---|
| 2068 | + eliminating corner cases that have to be handled if such a |
---|
| 2069 | + condition is allowed. |
---|
| 2070 | + |
---|
| 2071 | + Setting it to partition root will take the CPUs away from the |
---|
| 2072 | + effective CPUs of the parent cgroup. Once it is set, this |
---|
| 2073 | + file cannot be reverted back to "member" if there are any child |
---|
| 2074 | + cgroups with cpuset enabled. |
---|
| 2075 | + |
---|
| 2076 | + A parent partition cannot distribute all its CPUs to its |
---|
| 2077 | + child partitions. There must be at least one CPU left in the |
---|
| 2078 | + parent partition. |
---|
| 2079 | + |
---|
| 2080 | + Once it becomes a partition root, changes to "cpuset.cpus" are |
---|
| 2081 | + generally allowed as long as the first condition above is true, |
---|
| 2082 | + the change will not take away all the CPUs from the parent |
---|
| 2083 | + partition and the new "cpuset.cpus" value is a superset of its |
---|
| 2084 | + children's "cpuset.cpus" values. |
---|
| 2085 | + |
---|
| 2086 | + Sometimes, external factors like changes to ancestors' |
---|
| 2087 | + "cpuset.cpus" or cpu hotplug can cause the state of the partition |
---|
| 2088 | + root to change. On read, the "cpuset.cpus.partition" file |
---|
| 2089 | + can show the following values. |
---|
| 2090 | + |
---|
| 2091 | + "member" Non-root member of a partition |
---|
| 2092 | + "root" Partition root |
---|
| 2093 | + "root invalid" Invalid partition root |
---|
| 2094 | + |
---|
| 2095 | + It is a partition root if the first 2 partition root conditions |
---|
| 2096 | + above are true and at least one CPU from "cpuset.cpus" is |
---|
| 2097 | + granted by the parent cgroup. |
---|
| 2098 | + |
---|
| 2099 | + A partition root can become invalid if none of the CPUs requested |
---|
| 2100 | + in "cpuset.cpus" can be granted by the parent cgroup or the |
---|
| 2101 | + parent cgroup is no longer a partition root itself. In this |
---|
| 2102 | + case, it is not a real partition even though the restriction |
---|
| 2103 | + of the first partition root condition above will still apply. |
---|
| 2104 | + The cpu affinity of all the tasks in the cgroup will then be |
---|
| 2105 | + associated with CPUs in the nearest ancestor partition. |
---|
| 2106 | + |
---|
| 2107 | + An invalid partition root can be transitioned back to a |
---|
| 2108 | + real partition root if at least one of the requested CPUs |
---|
| 2109 | + can now be granted by its parent. In this case, the cpu |
---|
| 2110 | + affinity of all the tasks in the formerly invalid partition |
---|
| 2111 | + will be associated with the CPUs of the newly formed partition. |
---|
| 2112 | + Changing the partition state of an invalid partition root to |
---|
| 2113 | + "member" is always allowed even if child cpusets are present. |
---|
| 2114 | + |
---|
| 2115 | + |
---|
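As a rough illustration of the list format and of how "cpuset.cpus" relates to "cpuset.cpus.effective", the following sketch parses a list such as ``0-4,6,8-10`` and models the fallback described above. The function names ``parse_list`` and ``effective`` are hypothetical, and the model deliberately ignores partitions and hotplug; it is a simplified reading of the text, not the kernel's implementation.

```python
def parse_list(text):
    """Parse a comma-separated list of numbers or ranges into a set.

    An empty string yields an empty set, matching an empty interface file.
    """
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        # A single number "6" is treated as the range 6-6.
        result.update(range(int(low), int(high or low) + 1))
    return result


def effective(requested, parent_effective):
    """Model the effective set granted to a cgroup by its parent.

    Assumption: the grant is the intersection with the parent's effective
    set; if the request is empty or nothing can be granted, the cgroup
    falls back to everything the parent has.
    """
    granted = requested & parent_effective
    return granted if granted else set(parent_effective)


if __name__ == "__main__":
    parent = parse_list("0-7")
    print(sorted(effective(parse_list("0-4,6,8-10"), parent)))
    print(sorted(effective(parse_list(""), parent)))
```

The same parsing applies to "cpuset.mems", since memory node lists use the same comma-separated numbers-or-ranges format.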
1649 | 2116 | Device controller |
---|
1650 | 2117 | ----------------- |
---|
1651 | 2118 | |
---|
.. | .. |
---|
1674 | 2141 | ---- |
---|
1675 | 2142 | |
---|
1676 | 2143 | The "rdma" controller regulates the distribution and accounting of |
---|
1677 | | -of RDMA resources. |
---|
| 2144 | +RDMA resources. |
---|
1678 | 2145 | |
---|
1679 | 2146 | RDMA Interface Files |
---|
1680 | 2147 | ~~~~~~~~~~~~~~~~~~~~ |
---|
.. | .. |
---|
1709 | 2176 | mlx4_0 hca_handle=1 hca_object=20 |
---|
1710 | 2177 | ocrdma1 hca_handle=1 hca_object=23 |
---|
1711 | 2178 | |
---|
| 2179 | +HugeTLB |
---|
| 2180 | +------- |
---|
| 2181 | + |
---|
| 2182 | +The HugeTLB controller allows limiting HugeTLB usage per control group and |
---|
| 2183 | +enforces the controller limit during page fault. |
---|
| 2184 | + |
---|
| 2185 | +HugeTLB Interface Files |
---|
| 2186 | +~~~~~~~~~~~~~~~~~~~~~~~ |
---|
| 2187 | + |
---|
| 2188 | + hugetlb.<hugepagesize>.current |
---|
| 2189 | + Show current usage for "hugepagesize" hugetlb. It exists for all |
---|
| 2190 | + cgroups except root. |
---|
| 2191 | + |
---|
| 2192 | + hugetlb.<hugepagesize>.max |
---|
| 2193 | + Set/show the hard limit of "hugepagesize" hugetlb usage. |
---|
| 2194 | + The default value is "max". It exists for all cgroups except root. |
---|
| 2195 | + |
---|
| 2196 | + hugetlb.<hugepagesize>.events |
---|
| 2197 | + A read-only flat-keyed file which exists on non-root cgroups. |
---|
| 2198 | + |
---|
| 2199 | + max |
---|
| 2200 | + The number of allocation failures due to the HugeTLB limit |
---|
| 2201 | + |
---|
| 2202 | + hugetlb.<hugepagesize>.events.local |
---|
| 2203 | + Similar to hugetlb.<hugepagesize>.events but the fields in the file |
---|
| 2204 | + are local to the cgroup i.e. not hierarchical. The file modified event |
---|
| 2205 | + generated on this file reflects only the local events. |
---|
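A flat-keyed file such as hugetlb.<hugepagesize>.events holds one "key value" pair per line, so a monitoring script can read it with very little code. The sketch below is only an example: the helper names are hypothetical, and the mount point ``/sys/fs/cgroup``, the cgroup name ``job`` and the page size string ``2MB`` are assumptions that will differ between systems.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse flat-keyed content ("key value" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        counters[key] = int(value)
    return counters


def hugetlb_limit_failures(cgroup_dir, hugepagesize):
    """Return the "max" counter from hugetlb.<hugepagesize>.events.

    cgroup_dir and hugepagesize are supplied by the caller; the file only
    exists on non-root cgroups with the hugetlb controller enabled.
    """
    path = Path(cgroup_dir) / f"hugetlb.{hugepagesize}.events"
    return parse_flat_keyed(path.read_text()).get("max", 0)


if __name__ == "__main__":
    # Assumed location and page size; adjust for the system at hand.
    try:
        print(hugetlb_limit_failures("/sys/fs/cgroup/job", "2MB"))
    except OSError as err:
        print(f"events file not available: {err}")
```

Reading hugetlb.<hugepagesize>.events.local the same way would give the count for the cgroup alone rather than for its whole subtree.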
1712 | 2206 | |
---|
1713 | 2207 | Misc |
---|
1714 | 2208 | ---- |
---|
.. | .. |
---|
1915 | 2409 | |
---|
1916 | 2410 | wbc_init_bio(@wbc, @bio) |
---|
1917 | 2411 | Should be called for each bio carrying writeback data and |
---|
1918 | | - associates the bio with the inode's owner cgroup. Can be |
---|
1919 | | - called anytime between bio allocation and submission. |
---|
| 2412 | + associates the bio with the inode's owner cgroup and the |
---|
| 2413 | + corresponding request queue. This must be called after |
---|
| 2414 | + a queue (device) has been associated with the bio and |
---|
| 2415 | + before submission. |
---|
1920 | 2416 | |
---|
1921 | | - wbc_account_io(@wbc, @page, @bytes) |
---|
| 2417 | + wbc_account_cgroup_owner(@wbc, @page, @bytes) |
---|
1922 | 2418 | Should be called for each data segment being written out. |
---|
1923 | 2419 | While this function doesn't care exactly when it's called |
---|
1924 | 2420 | during the writeback session, it's the easiest and most |
---|
.. | .. |
---|
1935 | 2431 | the writeback session is holding shared resources, e.g. a journal |
---|
1936 | 2432 | entry, may lead to priority inversion. There is no one easy solution |
---|
1937 | 2433 | for the problem. Filesystems can try to work around specific problem |
---|
1938 | | -cases by skipping wbc_init_bio() or using bio_associate_blkcg() |
---|
| 2434 | +cases by skipping wbc_init_bio() and using bio_associate_blkg() |
---|
1939 | 2435 | directly. |
---|
1940 | 2436 | |
---|
1941 | 2437 | |
---|
.. | .. |
---|
2145 | 2641 | becomes self-defeating. |
---|
2146 | 2642 | |
---|
2147 | 2643 | The memory.low boundary on the other hand is a top-down allocated |
---|
2148 | | -reserve. A cgroup enjoys reclaim protection when it's within its low, |
---|
2149 | | -which makes delegation of subtrees possible. |
---|
| 2644 | +reserve. A cgroup enjoys reclaim protection when it's within its |
---|
| 2645 | +effective low, which makes delegation of subtrees possible. It also |
---|
| 2646 | +enjoys having reclaim pressure proportional to its overage when |
---|
| 2647 | +above its effective low. |
---|
2150 | 2648 | |
---|
2151 | 2649 | The original high boundary, the hard limit, is defined as a strict |
---|
2152 | 2650 | limit that can not budge, even if the OOM killer has to be called. |
---|