2024-10-12 a5969cabbb4660eab42b6ef0412cbbd1200cf14d
kernel/Documentation/admin-guide/cgroup-v2.rst
....@@ -9,7 +9,7 @@
99 conventions of cgroup v2. It describes all userland-visible aspects
1010 of cgroup including core and specific controller behaviors. All
1111 future changes must be reflected in this document. Documentation for
12
-v1 is available under Documentation/cgroup-v1/.
12
+v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
1313
1414 .. CONTENTS
1515
....@@ -54,13 +54,18 @@
5454 5-3-3. IO Latency
5555 5-3-3-1. How IO Latency Throttling Works
5656 5-3-3-2. IO Latency Interface Files
57
+ 5-3-4. IO Priority
5758 5-4. PID
5859 5-4-1. PID Interface Files
59
- 5-5. Device
60
- 5-6. RDMA
61
- 5-6-1. RDMA Interface Files
62
- 5-7. Misc
63
- 5-7-1. perf_event
60
+ 5-5. Cpuset
61
+ 5-5-1. Cpuset Interface Files
62
+ 5-6. Device
63
+ 5-7. RDMA
64
+ 5-7-1. RDMA Interface Files
65
+ 5-8. HugeTLB
66
+ 5-8-1. HugeTLB Interface Files
67
+ 5-9. Misc
68
+ 5-9-1. perf_event
6469 5-N. Non-normative information
6570 5-N-1. CPU controller root cgroup process behaviour
6671 5-N-2. IO controller root cgroup process behaviour
....@@ -174,6 +179,26 @@
174179 through remount from the init namespace. The mount option is
175180 ignored on non-init namespace mounts. Please refer to the
176181 Delegation section for details.
182
+
183
+ memory_localevents
184
+
185
+ Only populate memory.events with data for the current cgroup,
186
+ and not any subtrees. This is legacy behaviour; the default
187
+ behaviour without this option is to include subtree counts.
188
+ This option is system wide and can only be set on mount or
189
+ modified through remount from the init namespace. The mount
190
+ option is ignored on non-init namespace mounts.
191
+
192
+ memory_recursiveprot
193
+
194
+ Recursively apply memory.min and memory.low protection to
195
+ entire subtrees, without requiring explicit downward
196
+ propagation into leaf cgroups. This allows protecting entire
197
+ subtrees from one another, while retaining free competition
198
+ within those subtrees. This should have been the default
199
+ behavior but is a mount-option to avoid regressing setups
200
+ relying on the original semantics (e.g. specifying bogusly
201
+ high 'bypass' protection values at higher tree levels).
177202
178203
179204 Organizing Processes and Threads
....@@ -604,8 +629,8 @@
604629 Protections
605630 -----------
606631
607
-A cgroup is protected to be allocated upto the configured amount of
608
-the resource if the usages of all its ancestors are under their
632
+A cgroup is protected up to the configured amount of the resource
633
+as long as the usages of all its ancestors are under their
609634 protected levels. Protections can be hard guarantees or best effort
610635 soft boundaries. Protections can also be over-committed in which case
611636 only up to the amount available to the parent is protected among
....@@ -690,9 +715,7 @@
690715 - Settings for a single feature should be contained in a single file.
691716
692717 - The root cgroup should be exempt from resource control and thus
693
- shouldn't have resource control interface files. Also,
694
- informational files on the root cgroup which end up showing global
695
- information available elsewhere shouldn't exist.
718
+ shouldn't have resource control interface files.
696719
697720 - The default time unit is microseconds. If a different unit is ever
698721 used, an explicit unit suffix must be present.
....@@ -868,6 +891,8 @@
868891 populated
869892 1 if the cgroup or its descendants contains any live
870893 processes; otherwise, 0.
894
+ frozen
895
+ 1 if the cgroup is frozen; otherwise, 0.
871896
872897 cgroup.max.descendants
873898 A read-write single value files. The default is "max".
....@@ -901,6 +926,31 @@
901926 A dying cgroup can consume system resources not exceeding
902927 limits, which were active at the moment of cgroup deletion.
903928
929
+ cgroup.freeze
930
+ A read-write single value file which exists on non-root cgroups.
931
+ Allowed values are "0" and "1". The default is "0".
932
+
933
+ Writing "1" to the file causes freezing of the cgroup and all
934
+ descendant cgroups. This means that all processes belonging to it
935
+ will be stopped and will not run until the cgroup is explicitly
936
+ unfrozen. Freezing of the cgroup may take some time; when this action
937
+ is completed, the "frozen" value in the cgroup.events control file
938
+ will be updated to "1" and the corresponding notification will be
939
+ issued.
940
+
941
+ A cgroup can be frozen either by its own settings, or by settings
942
+ of any ancestor cgroups. If any ancestor cgroup is frozen, the
943
+ cgroup will remain frozen.
944
+
945
+ Processes in the frozen cgroup can be killed by a fatal signal.
946
+ They also can enter and leave a frozen cgroup: either by an explicit
947
+ move by a user, or if freezing of the cgroup races with fork().
948
+ If a process is moved to a frozen cgroup, it stops. If a process is
949
+ moved out of a frozen cgroup, it resumes running.
950
+
951
+ Frozen status of a cgroup doesn't affect any cgroup tree operations:
952
+ it's possible to delete a frozen (and empty) cgroup, as well as
953
+ create new sub-cgroups.
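The self-or-ancestor rule described above can be sketched in a few lines of illustrative Python (not kernel code; the class and method names are invented): a cgroup reports "frozen" in cgroup.events whenever its own cgroup.freeze is set or any ancestor's is.

```python
# Hypothetical model of cgroup.freeze semantics: a cgroup is
# effectively frozen when its own cgroup.freeze is 1 or any
# ancestor's is.
class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.freeze = 0              # value written to cgroup.freeze

    def frozen(self):
        # "frozen" field reported in cgroup.events
        node = self
        while node is not None:
            if node.freeze:
                return 1
            node = node.parent
        return 0

root = Cgroup()
a = Cgroup(parent=root)
b = Cgroup(parent=a)

a.freeze = 1                  # freeze the middle cgroup
print(b.frozen())             # descendants report frozen -> 1
a.freeze = 0
print(b.frozen())             # no ancestor frozen any more -> 0
```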
904954
905955 Controllers
906956 ===========
....@@ -934,7 +984,7 @@
934984 All time durations are in microseconds.
935985
936986 cpu.stat
937
- A read-only flat-keyed file which exists on non-root cgroups.
987
+ A read-only flat-keyed file.
938988 This file exists whether the controller is enabled or not.
939989
940990 It always reports the following three stats:
....@@ -983,7 +1033,7 @@
9831033 A read-only nested-key file which exists on non-root cgroups.
9841034
9851035 Shows pressure stall information for CPU. See
986
- Documentation/accounting/psi.txt for details.
1036
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
9871037
9881038 cpu.uclamp.min
9891039 A read-write single value file which exists on non-root cgroups.
....@@ -1058,9 +1108,12 @@
10581108 is within its effective min boundary, the cgroup's memory
10591109 won't be reclaimed under any conditions. If there is no
10601110 unprotected reclaimable memory available, OOM killer
1061
- is invoked.
1111
+ is invoked. Above the effective min boundary (or
1112
+ effective low boundary if it is higher), pages are reclaimed
1113
+ proportionally to the overage, reducing reclaim pressure for
1114
+ smaller overages.
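The "proportional to the overage" behaviour added in this hunk can be illustrated with a toy formula (this is not the kernel's exact arithmetic, only a sketch of the proportionality): reclaim pressure grows with how far usage exceeds the effective protection.

```python
# Illustrative only: reclaim pressure scaled by how far usage
# exceeds the effective protection boundary. The real kernel
# formula differs in detail.
def scan_fraction(usage, protection):
    if usage <= protection:
        return 0.0               # fully protected: no reclaim
    return (usage - protection) / usage

print(scan_fraction(100, 80))    # small overage -> light pressure
print(scan_fraction(100, 20))    # large overage -> heavy pressure
print(scan_fraction(50, 80))     # under protection -> 0.0
```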
10621115
1063
- Effective min boundary is limited by memory.min values of
1116
+ Effective min boundary is limited by memory.min values of
10641117 all ancestor cgroups. If there is memory.min overcommitment
10651118 (child cgroup or cgroups are requiring more protected memory
10661119 than parent will allow), then each child cgroup will get
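The overcommitment case described here can be sketched as follows (illustrative Python, not the kernel implementation; the assumption, based on the surrounding text, is that each child's share is proportional to its usage below its own memory.min):

```python
# Hedged sketch of effective min under overcommitment: when the
# children together claim more than the parent's effective
# protection, each child gets a proportional share.
def effective_min(parent_eff, children):
    # children: list of (usage, memory_min) pairs
    claims = [min(usage, mmin) for usage, mmin in children]
    total = sum(claims)
    if total <= parent_eff:
        return claims            # no overcommitment
    return [parent_eff * c / total for c in claims]

# Parent grants 100; children claim 80 + 40 = 120 (overcommitted),
# so each gets a proportional slice of the parent's 100.
print(effective_min(100, [(80, 100), (40, 60)]))
```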
....@@ -1079,8 +1132,12 @@
10791132
10801133 Best-effort memory protection. If the memory usage of a
10811134 cgroup is within its effective low boundary, the cgroup's
1082
- memory won't be reclaimed unless memory can be reclaimed
1083
- from unprotected cgroups.
1135
+ memory won't be reclaimed unless there is no reclaimable
1136
+ memory available in unprotected cgroups.
1137
+ Above the effective low boundary (or
1138
+ effective min boundary if it is higher), pages are reclaimed
1139
+ proportionally to the overage, reducing reclaim pressure for
1140
+ smaller overages.
10841141
10851142 Effective low boundary is limited by memory.low values of
10861143 all ancestor cgroups. If there is memory.low overcommitment
....@@ -1114,6 +1171,13 @@
11141171 Under certain circumstances, the usage may go over the limit
11151172 temporarily.
11161173
1174
+ In the default configuration, regular 0-order allocations always
1175
+ succeed unless the OOM killer chooses the current task as a victim.
1176
+
1177
+ Some kinds of allocations don't invoke the OOM killer. The
1178
+ caller may retry them differently, return -ENOMEM to userspace,
1179
+ or silently ignore the failure in cases like disk readahead.
1180
+
11171181 This is the ultimate protection mechanism. As long as the
11181182 high limit is used and monitored properly, this limit's
11191183 utility is limited to providing the final safety net.
....@@ -1142,6 +1206,11 @@
11421206 otherwise, a value change in this file generates a file
11431207 modified event.
11441208
1209
+ Note that all fields in this file are hierarchical and the
1210
+ file modified event can be generated due to an event down the
1211
+ hierarchy. For the local events at the cgroup level, see
1212
+ memory.events.local.
1213
+
11451214 low
11461215 The number of times the cgroup is reclaimed due to
11471216 high memory pressure even though its usage is under
....@@ -1165,17 +1234,18 @@
11651234 The number of times the cgroup's memory usage
11661235 reached the limit and allocation was about to fail.
11671236
1168
- Depending on context result could be invocation of OOM
1169
- killer and retrying allocation or failing allocation.
1170
-
1171
- Failed allocation in its turn could be returned into
1172
- userspace as -ENOMEM or silently ignored in cases like
1173
- disk readahead. For now OOM in memory cgroup kills
1174
- tasks iff shortage has happened inside page fault.
1237
+ This event is not raised if the OOM killer is not
1238
+ considered as an option, e.g. for failed high-order
1239
+ allocations or if the caller asked not to retry.
11751240
11761241 oom_kill
11771242 The number of processes belonging to this cgroup
11781243 killed by any kind of OOM killer.
1244
+
1245
+ memory.events.local
1246
+ Similar to memory.events but the fields in the file are local
1247
+ to the cgroup, i.e. not hierarchical. The file modified event
1248
+ generated on this file reflects only the local events.
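The difference between memory.events and memory.events.local can be modelled in a short illustrative sketch (invented names, not kernel code): an event raised in a cgroup shows up in its own local counter and in the hierarchical counter of itself and every ancestor.

```python
# Illustrative model of hierarchical vs. local event counters.
class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.events_local = 0        # memory.events.local
        self.events = 0              # memory.events (hierarchical)

    def raise_event(self):
        self.events_local += 1
        node = self
        while node is not None:      # propagate up the hierarchy
            node.events += 1
            node = node.parent

parent = Cgroup()
child = Cgroup(parent=parent)
child.raise_event()
print(parent.events, parent.events_local)   # 1 0
print(child.events, child.events_local)     # 1 1
```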
11791249
11801250 memory.stat
11811251 A read-only flat-keyed file which exists on non-root cgroups.
....@@ -1190,6 +1260,10 @@
11901260 can show up in the middle. Don't rely on items remaining in a
11911261 fixed position; use the keys to look up specific values!
11921262
1263
+ If an entry has no per-node counter (and is therefore not
1264
+ shown in memory.numa_stat), the 'npn' (non-per-node) tag is
1265
+ used to indicate that it will not show in memory.numa_stat.
1266
+
11931267 anon
11941268 Amount of memory used in anonymous mappings such as
11951269 brk(), sbrk(), and mmap(MAP_ANONYMOUS)
....@@ -1201,11 +1275,11 @@
12011275 kernel_stack
12021276 Amount of memory allocated to kernel stacks.
12031277
1204
- slab
1205
- Amount of memory used for storing in-kernel data
1206
- structures.
1278
+ percpu(npn)
1279
+ Amount of memory used for storing per-cpu kernel
1280
+ data structures.
12071281
1208
- sock
1282
+ sock(npn)
12091283 Amount of memory used in network transmission buffers
12101284
12111285 shmem
....@@ -1223,10 +1297,19 @@
12231297 Amount of cached filesystem data that was modified and
12241298 is currently being written back to disk
12251299
1300
+ anon_thp
1301
+ Amount of memory used in anonymous mappings backed by
1302
+ transparent hugepages
1303
+
12261304 inactive_anon, active_anon, inactive_file, active_file, unevictable
12271305 Amount of memory, swap-backed and filesystem-backed,
12281306 on the internal memory management lists used by the
1229
- page reclaim algorithm
1307
+ page reclaim algorithm.
1308
+
1309
+ As these represent internal list state (e.g. shmem pages are on anon
1310
+ memory management lists), inactive_foo + active_foo may not be equal to
1311
+ the value for the foo counter, since the foo counter is type-based, not
1312
+ list-based.
12301313
12311314 slab_reclaimable
12321315 Part of "slab" that might be reclaimed, such as
....@@ -1236,51 +1319,95 @@
12361319 Part of "slab" that cannot be reclaimed on memory
12371320 pressure.
12381321
1239
- pgfault
1240
- Total number of page faults incurred
1322
+ slab(npn)
1323
+ Amount of memory used for storing in-kernel data
1324
+ structures.
12411325
1242
- pgmajfault
1243
- Number of major page faults incurred
1326
+ workingset_refault_anon
1327
+ Number of refaults of previously evicted anonymous pages.
12441328
1245
- workingset_refault
1329
+ workingset_refault_file
1330
+ Number of refaults of previously evicted file pages.
12461331
1247
- Number of refaults of previously evicted pages
1332
+ workingset_activate_anon
1333
+ Number of refaulted anonymous pages that were immediately
1334
+ activated.
12481335
1249
- workingset_activate
1336
+ workingset_activate_file
1337
+ Number of refaulted file pages that were immediately activated.
12501338
1251
- Number of refaulted pages that were immediately activated
1339
+ workingset_restore_anon
1340
+ Number of restored anonymous pages which have been detected as
1341
+ an active workingset before they got reclaimed.
1342
+
1343
+ workingset_restore_file
1344
+ Number of restored file pages which have been detected as an
1345
+ active workingset before they got reclaimed.
12521346
12531347 workingset_nodereclaim
1254
-
12551348 Number of times a shadow node has been reclaimed
12561349
1257
- pgrefill
1350
+ pgfault(npn)
1351
+ Total number of page faults incurred
12581352
1353
+ pgmajfault(npn)
1354
+ Number of major page faults incurred
1355
+
1356
+ pgrefill(npn)
12591357 Amount of scanned pages (in an active LRU list)
12601358
1261
- pgscan
1262
-
1359
+ pgscan(npn)
12631360 Amount of scanned pages (in an inactive LRU list)
12641361
1265
- pgsteal
1266
-
1362
+ pgsteal(npn)
12671363 Amount of reclaimed pages
12681364
1269
- pgactivate
1270
-
1365
+ pgactivate(npn)
12711366 Amount of pages moved to the active LRU list
12721367
1273
- pgdeactivate
1368
+ pgdeactivate(npn)
1369
+ Amount of pages moved to the inactive LRU list
12741370
1275
- Amount of pages moved to the inactive LRU lis
1276
-
1277
- pglazyfree
1278
-
1371
+ pglazyfree(npn)
12791372 Amount of pages postponed to be freed under memory pressure
12801373
1281
- pglazyfreed
1282
-
1374
+ pglazyfreed(npn)
12831375 Amount of reclaimed lazyfree pages
1376
+
1377
+ thp_fault_alloc(npn)
1378
+ Number of transparent hugepages which were allocated to satisfy
1379
+ a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
1380
+ is not set.
1381
+
1382
+ thp_collapse_alloc(npn)
1383
+ Number of transparent hugepages which were allocated to allow
1384
+ collapsing an existing range of pages. This counter is not
1385
+ present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
1386
+
1387
+ memory.numa_stat
1388
+ A read-only nested-keyed file which exists on non-root cgroups.
1389
+
1390
+ This breaks down the cgroup's memory footprint into different
1391
+ types of memory, type-specific details, and other information
1392
+ per node on the state of the memory management system.
1393
+
1394
+ This is useful for providing visibility into the NUMA locality
1395
+ information within a memcg since the pages are allowed to be
1396
+ allocated from any physical node. One use case is evaluating
1397
+ application performance by combining this information with the
1398
+ application's CPU allocation.
1399
+
1400
+ All memory amounts are in bytes.
1401
+
1402
+ The output format of memory.numa_stat is::
1403
+
1404
+ type N0=<bytes in node 0> N1=<bytes in node 1> ...
1405
+
1406
+ The entries are ordered to be human readable, and new entries
1407
+ can show up in the middle. Don't rely on items remaining in a
1408
+ fixed position; use the keys to look up specific values!
1409
+
1410
+ The entries correspond to the ones in memory.stat.
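A small parser for the memory.numa_stat format shown above (the sample line is made up for illustration):

```python
# Parse "type N0=<bytes in node 0> N1=<bytes in node 1> ..." lines
# into a nested dictionary keyed by stat name, then node.
def parse_numa_stat(text):
    stats = {}
    for line in text.splitlines():
        key, *nodes = line.split()
        stats[key] = {n.split('=')[0]: int(n.split('=')[1]) for n in nodes}
    return stats

sample = "anon N0=1048576 N1=4096\nfile N0=0 N1=8192"
stats = parse_numa_stat(sample)
print(stats["anon"]["N1"])    # 4096
```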
12841411
12851412 memory.swap.current
12861413 A read-only single value file which exists on non-root
....@@ -1288,6 +1415,22 @@
12881415
12891416 The total amount of swap currently being used by the cgroup
12901417 and its descendants.
1418
+
1419
+ memory.swap.high
1420
+ A read-write single value file which exists on non-root
1421
+ cgroups. The default is "max".
1422
+
1423
+ Swap usage throttle limit. If a cgroup's swap usage exceeds
1424
+ this limit, all its further allocations will be throttled to
1425
+ allow userspace to implement custom out-of-memory procedures.
1426
+
1427
+ This limit marks a point of no return for the cgroup. It is NOT
1428
+ designed to manage the amount of swapping a workload does
1429
+ during regular operation. Compare to memory.swap.max, which
1430
+ prohibits swapping past a set amount, but lets the cgroup
1431
+ continue unimpeded as long as other memory can be reclaimed.
1432
+
1433
+ Healthy workloads are not expected to reach this limit.
12911434
12921435 memory.swap.max
12931436 A read-write single value file which exists on non-root
....@@ -1301,6 +1444,10 @@
13011444 The following entries are defined. Unless specified
13021445 otherwise, a value change in this file generates a file
13031446 modified event.
1447
+
1448
+ high
1449
+ The number of times the cgroup's swap usage was over
1450
+ the high threshold.
13041451
13051452 max
13061453 The number of times the cgroup's swap usage was about
....@@ -1321,7 +1468,7 @@
13211468 A read-only nested-key file which exists on non-root cgroups.
13221469
13231470 Shows pressure stall information for memory. See
1324
- Documentation/accounting/psi.txt for details.
1471
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
13251472
13261473
13271474 Usage Guidelines
....@@ -1381,8 +1528,7 @@
13811528 ~~~~~~~~~~~~~~~~~~
13821529
13831530 io.stat
1384
- A read-only nested-keyed file which exists on non-root
1385
- cgroups.
1531
+ A read-only nested-keyed file.
13861532
13871533 Lines are keyed by $MAJ:$MIN device numbers and not ordered.
13881534 The following nested keys are defined.
....@@ -1396,10 +1542,107 @@
13961542 dios Number of discard IOs
13971543 ====== =====================
13981544
1399
- An example read output follows:
1545
+ An example read output follows::
14001546
14011547 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
14021548 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
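The nested-keyed io.stat format in the example above parses the same way as other nested-keyed files; a short illustrative parser:

```python
# Parse io.stat lines: "$MAJ:$MIN key=value key=value ..."
def parse_io_stat(text):
    devices = {}
    for line in text.splitlines():
        dev, *pairs = line.split()
        devices[dev] = {k: int(v) for k, v in (p.split('=') for p in pairs)}
    return devices

sample = ("8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0\n"
          "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021")
stats = parse_io_stat(sample)
print(stats["8:16"]["rbytes"])   # 1459200
```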
1549
+
1550
+ io.cost.qos
1551
+ A read-write nested-keyed file which exists only on the root
1552
+ cgroup.
1553
+
1554
+ This file configures the Quality of Service of the IO cost
1555
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
1556
+ currently implements "io.weight" proportional control. Lines
1557
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
1558
+ line for a given device is populated on the first write for
1559
+ the device on "io.cost.qos" or "io.cost.model". The following
1560
+ nested keys are defined.
1561
+
1562
+ ====== =====================================
1563
+ enable Weight-based control enable
1564
+ ctrl "auto" or "user"
1565
+ rpct Read latency percentile [0, 100]
1566
+ rlat Read latency threshold
1567
+ wpct Write latency percentile [0, 100]
1568
+ wlat Write latency threshold
1569
+ min Minimum scaling percentage [1, 10000]
1570
+ max Maximum scaling percentage [1, 10000]
1571
+ ====== =====================================
1572
+
1573
+ The controller is disabled by default and can be enabled by
1574
+ setting "enable" to 1. "rpct" and "wpct" parameters default
1575
+ to zero and the controller uses internal device saturation
1576
+ state to adjust the overall IO rate between "min" and "max".
1577
+
1578
+ When a better control quality is needed, latency QoS
1579
+ parameters can be configured. For example::
1580
+
1581
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
1582
+
1583
+ shows that on sdb, the controller is enabled, will consider
1584
+ the device saturated if the 95th percentile of read completion
1585
+ latencies is above 75ms or write 150ms, and adjust the overall
1586
+ IO issue rate between 50% and 150% accordingly.
1587
+
1588
+ The lower the saturation point, the better the latency QoS at
1589
+ the cost of aggregate bandwidth. The narrower the allowed
1590
+ adjustment range between "min" and "max", the more closely
1591
+ the IO behavior conforms to the cost model. Note that the IO issue
1592
+ base rate may be far off from 100% and setting "min" and "max"
1593
+ blindly can lead to a significant loss of device capacity or
1594
+ control quality. "min" and "max" are useful for regulating
1595
+ devices which show wide temporary behavior changes - e.g. a
1596
+ ssd which accepts writes at the line speed for a while and
1597
+ then completely stalls for multiple seconds.
1598
+
1599
+ When "ctrl" is "auto", the parameters are controlled by the
1600
+ kernel and may change automatically. Setting "ctrl" to "user"
1601
+ or setting any of the percentile and latency parameters puts
1602
+ it into "user" mode and disables the automatic changes. The
1603
+ automatic mode can be restored by setting "ctrl" to "auto".
1604
+
1605
+ io.cost.model
1606
+ A read-write nested-keyed file which exists only on the root
1607
+ cgroup.
1608
+
1609
+ This file configures the cost model of the IO cost model based
1610
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
1611
+ implements "io.weight" proportional control. Lines are keyed
1612
+ by $MAJ:$MIN device numbers and not ordered. The line for a
1613
+ given device is populated on the first write for the device on
1614
+ "io.cost.qos" or "io.cost.model". The following nested keys
1615
+ are defined.
1616
+
1617
+ ===== ================================
1618
+ ctrl "auto" or "user"
1619
+ model The cost model in use - "linear"
1620
+ ===== ================================
1621
+
1622
+ When "ctrl" is "auto", the kernel may change all parameters
1623
+ dynamically. When "ctrl" is set to "user" or any other
1624
+ parameters are written to, "ctrl" becomes "user" and the
1625
+ automatic changes are disabled.
1626
+
1627
+ When "model" is "linear", the following model parameters are
1628
+ defined.
1629
+
1630
+ ============= ========================================
1631
+ [r|w]bps The maximum sequential IO throughput
1632
+ [r|w]seqiops The maximum 4k sequential IOs per second
1633
+ [r|w]randiops The maximum 4k random IOs per second
1634
+ ============= ========================================
1635
+
1636
+ From the above, the builtin linear model determines the base
1637
+ costs of a sequential and random IO and the cost coefficient
1638
+ for the IO size. While simple, this model can cover most
1639
+ common device classes acceptably.
1640
+
1641
+ The IO cost model isn't expected to be accurate in absolute
1642
+ sense and is scaled to the device behavior dynamically.
1643
+
1644
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
1645
+ generate device-specific coefficients.
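A rough sketch of how a linear model can price a single IO (this is not the kernel's actual iocost implementation; the function and parameter names are invented): a fixed per-IO base cost derived from the iops parameters plus a size-proportional part derived from the throughput parameter.

```python
# Illustrative linear IO cost: per-IO base cost plus transfer time.
def io_cost(nbytes, random, bps, seqiops, randiops):
    base = 1.0 / (randiops if random else seqiops)   # seconds per IO
    return base + nbytes / bps                       # plus size cost

# Hypothetical device: 200 MB/s, 50k seq IOPS, 10k rand IOPS.
seq = io_cost(4096, False, 200e6, 50e3, 10e3)
rnd = io_cost(4096, True, 200e6, 50e3, 10e3)
print(rnd > seq)     # random IO costs more under this model -> True
```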
14031646
14041647 io.weight
14051648 A read-write flat-keyed file which exists on non-root cgroups.
....@@ -1464,7 +1707,7 @@
14641707 A read-only nested-key file which exists on non-root cgroups.
14651708
14661709 Shows pressure stall information for IO. See
1467
- Documentation/accounting/psi.txt for details.
1710
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
14681711
14691712
14701713 Writeback
....@@ -1485,9 +1728,9 @@
14851728 of the two is enforced.
14861729
14871730 cgroup writeback requires explicit support from the underlying
1488
-filesystem. Currently, cgroup writeback is implemented on ext2, ext4
1489
-and btrfs. On other filesystems, all writeback IOs are attributed to
1490
-the root cgroup.
1731
+filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
1732
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
1733
+attributed to the root cgroup.
14911734
14921735 There are inherent differences in memory and writeback management
14931736 which affects how cgroup ownership is tracked. Memory is tracked per
....@@ -1537,7 +1780,7 @@
15371780
15381781 The limits are only applied at the peer level in the hierarchy. This means that
15391782 in the diagram below, only groups A, B, and C will influence each other, and
1540
-groups D and F will influence each other. Group G will influence nobody.
1783
+groups D and F will influence each other. Group G will influence nobody::
15411784
15421785 [root]
15431786 / | \
....@@ -1606,6 +1849,60 @@
16061849 duration of time between evaluation events. Windows only elapse
16071850 with IO activity. Idle periods extend the most recent window.
16081851
1852
+IO Priority
1853
+~~~~~~~~~~~
1854
+
1855
+A single attribute controls the behavior of the I/O priority cgroup policy,
1856
+namely the blkio.prio.class attribute. The following values are accepted for
1857
+that attribute:
1858
+
1859
+ no-change
1860
+ Do not modify the I/O priority class.
1861
+
1862
+ none-to-rt
1863
+ For requests that do not have an I/O priority class (NONE),
1864
+ change the I/O priority class into RT. Do not modify
1865
+ the I/O priority class of other requests.
1866
+
1867
+ restrict-to-be
1868
+ For requests that do not have an I/O priority class or that have I/O
1869
+ priority class RT, change it into BE. Do not modify the I/O priority
1870
+ class of requests that have priority class IDLE.
1871
+
1872
+ idle
1873
+ Change the I/O priority class of all requests into IDLE, the lowest
1874
+ I/O priority class.
1875
+
1876
+The following numerical values are associated with the I/O priority policies:
1877
+
1878
++-------------+---+
1879
+| no-change | 0 |
1880
++-------------+---+
1881
+| none-to-rt | 1 |
1882
++-------------+---+
1883
+| rt-to-be | 2 |
1884
++-------------+---+
1885
+| all-to-idle | 3 |
1886
++-------------+---+
1887
+
1888
+The numerical value that corresponds to each I/O priority class is as follows:
1889
+
1890
++-------------------------------+---+
1891
+| IOPRIO_CLASS_NONE | 0 |
1892
++-------------------------------+---+
1893
+| IOPRIO_CLASS_RT (real-time) | 1 |
1894
++-------------------------------+---+
1895
+| IOPRIO_CLASS_BE (best effort) | 2 |
1896
++-------------------------------+---+
1897
+| IOPRIO_CLASS_IDLE | 3 |
1898
++-------------------------------+---+
1899
+
1900
+The algorithm to set the I/O priority class for a request is as follows:
1901
+
1902
+- Translate the I/O priority class policy into a number.
1903
+- Change the request I/O priority class into the maximum of the I/O priority
1904
+ class policy number and the numerical I/O priority class.
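The two-step algorithm above reduces to a single max() over the two numerical tables; a minimal illustrative sketch (the constant names are paraphrased from the tables, not kernel identifiers):

```python
# Numerical values from the tables above.
NO_CHANGE, NONE_TO_RT, RESTRICT_TO_BE, IDLE = 0, 1, 2, 3
CLASS_NONE, CLASS_RT, CLASS_BE, CLASS_IDLE = 0, 1, 2, 3

def apply_prio_policy(policy, ioprio_class):
    # The request's class becomes the maximum of the policy number
    # and its current numerical I/O priority class.
    return max(policy, ioprio_class)

print(apply_prio_policy(NONE_TO_RT, CLASS_NONE))     # NONE -> RT (1)
print(apply_prio_policy(RESTRICT_TO_BE, CLASS_RT))   # RT -> BE (2)
print(apply_prio_policy(RESTRICT_TO_BE, CLASS_IDLE)) # IDLE unchanged (3)
```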
1905
+
16091906 PID
16101907 ---
16111908
....@@ -1646,6 +1943,176 @@
16461943 of a new process would cause a cgroup policy to be violated.
16471944
16481945
1946
+Cpuset
1947
+------
1948
+
1949
+The "cpuset" controller provides a mechanism for constraining
1950
+the CPU and memory node placement of tasks to only the resources
1951
+specified in the cpuset interface files in a task's current cgroup.
1952
+This is especially valuable on large NUMA systems where placing jobs
1953
+on properly sized subsets of the systems with careful processor and
1954
+memory placement to reduce cross-node memory access and contention
1955
+can improve overall system performance.
1956
+
1957
+The "cpuset" controller is hierarchical. That means a cgroup
1958
+cannot use CPUs or memory nodes not allowed in its parent.
1959
+
1960
+
1961
+Cpuset Interface Files
1962
+~~~~~~~~~~~~~~~~~~~~~~
1963
+
1964
+ cpuset.cpus
1965
+ A read-write multiple values file which exists on non-root
1966
+ cpuset-enabled cgroups.
1967
+
1968
+ It lists the requested CPUs to be used by tasks within this
1969
+ cgroup. The actual list of CPUs to be granted, however, is
1970
+ subjected to constraints imposed by its parent and can differ
1971
+ from the requested CPUs.
1972
+
1973
+ The CPU numbers are comma-separated numbers or ranges.
1974
+ For example::
1975
+
1976
+ # cat cpuset.cpus
1977
+ 0-4,6,8-10
1978
+
1979
+ An empty value indicates that the cgroup is using the same
1980
+ setting as the nearest cgroup ancestor with a non-empty
1981
+ "cpuset.cpus" or all the available CPUs if none is found.
1982
+
1983
+ The value of "cpuset.cpus" stays constant until the next update
1984
+ and won't be affected by any CPU hotplug events.
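The comma-separated number/range format used by cpuset.cpus (and cpuset.mems) can be parsed with a short helper; an illustrative sketch:

```python
# Expand a cpuset-style list such as "0-4,6,8-10" into a sorted
# list of individual CPU (or memory node) numbers.
def parse_cpulist(s):
    cpus = set()
    for part in s.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

print(parse_cpulist("0-4,6,8-10"))   # [0, 1, 2, 3, 4, 6, 8, 9, 10]
```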
1985
+
1986
+ cpuset.cpus.effective
1987
+ A read-only multiple values file which exists on all
1988
+ cpuset-enabled cgroups.
1989
+
1990
+ It lists the onlined CPUs that are actually granted to this
1991
+ cgroup by its parent. These CPUs are allowed to be used by
1992
+ tasks within the current cgroup.
1993
+
1994
+ If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
1995
+ all the CPUs from the parent cgroup that can be available to
1996
+ be used by this cgroup. Otherwise, it should be a subset of
1997
+ "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
1998
+ can be granted. In this case, it will be treated just like an
1999
+ empty "cpuset.cpus".
2000
+
2001
+ Its value will be affected by CPU hotplug events.
2002
+
2003
+ cpuset.mems
2004
+ A read-write multiple values file which exists on non-root
2005
+ cpuset-enabled cgroups.
2006
+
2007
+ It lists the requested memory nodes to be used by tasks within
2008
+ this cgroup. The actual list of memory nodes granted, however,
2009
+ is subjected to constraints imposed by its parent and can differ
2010
+ from the requested memory nodes.
2011
+
2012
+ The memory node numbers are comma-separated numbers or ranges.
2013
+ For example::
2014
+
2015
+ # cat cpuset.mems
2016
+ 0-1,3
2017
+
2018
+ An empty value indicates that the cgroup is using the same
2019
+ setting as the nearest cgroup ancestor with a non-empty
2020
+ "cpuset.mems" or all the available memory nodes if none
2021
+ is found.
2022
+
2023
+ The value of "cpuset.mems" stays constant until the next update
2024
+ and won't be affected by any memory nodes hotplug events.
2025
+
2026
+ cpuset.mems.effective
2027
+ A read-only multiple values file which exists on all
2028
+ cpuset-enabled cgroups.
2029
+
2030
+ It lists the onlined memory nodes that are actually granted to
2031
+ this cgroup by its parent. These memory nodes are allowed to
2032
+ be used by tasks within the current cgroup.
2033
+
2034
+ If "cpuset.mems" is empty, it shows all the memory nodes from the
2035
+ parent cgroup that will be available to be used by this cgroup.
2036
+ Otherwise, it should be a subset of "cpuset.mems" unless none of
2037
+ the memory nodes listed in "cpuset.mems" can be granted. In this
2038
+ case, it will be treated just like an empty "cpuset.mems".
2039
+
2040
+ Its value will be affected by memory nodes hotplug events.
2041
+
2042
+ cpuset.cpus.partition
2043
+ A read-write single value file which exists on non-root
2044
+ cpuset-enabled cgroups. This flag is owned by the parent cgroup
2045
+ and is not delegatable.
2046
+
2047
+ It accepts only the following input values when written to.
2048
+
2049
+ "root" - a partition root
2050
+ "member" - a non-root member of a partition
2051
+
2052
+ When set to be a partition root, the current cgroup is the
2053
+ root of a new partition or scheduling domain that comprises
2054
+ itself and all its descendants except those that are separate
2055
+ partition roots themselves and their descendants. The root
2056
+ cgroup is always a partition root.
2057
+
2058
+ There are constraints on where a partition root can be set.
2059
+ It can only be set in a cgroup if all the following conditions
2060
+ are true.
2061
+
2062
+ 1) The "cpuset.cpus" is not empty and the list of CPUs are
2063
+ exclusive, i.e. they are not shared by any of its siblings.
2064
+ 2) The parent cgroup is a partition root.
2065
+ 3) The "cpuset.cpus" is also a proper subset of the parent's
2066
+ "cpuset.cpus.effective".
2067
+ 4) There is no child cgroups with cpuset enabled. This is for
2068
+ eliminating corner cases that have to be handled if such a
2069
+ condition is allowed.
2070
+
2071
+ Setting it to partition root will take the CPUs away from the
2072
+ effective CPUs of the parent cgroup. Once it is set, this
2073
+ file cannot be reverted back to "member" if there are any child
2074
+ cgroups with cpuset enabled.
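The four conditions for becoming a partition root can be checked as a predicate; a hedged sketch (the function signature and representation as Python sets are invented for illustration):

```python
# Sketch of the partition-root conditions listed above.
def can_become_partition_root(cpus, sibling_cpus, parent_is_root,
                              parent_effective, has_cpuset_children):
    return (bool(cpus)                                        # 1) non-empty...
            and all(cpus.isdisjoint(s) for s in sibling_cpus) # ...and exclusive
            and parent_is_root                                # 2) parent is a partition root
            and cpus < parent_effective                       # 3) proper subset of parent's effective CPUs
            and not has_cpuset_children)                      # 4) no cpuset-enabled children

print(can_become_partition_root({0, 1}, [{2, 3}], True, {0, 1, 2, 3}, False))  # True
print(can_become_partition_root({0, 1}, [{1, 2}], True, {0, 1, 2, 3}, False))  # False (not exclusive)
```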
2075
+
2076
+ A parent partition cannot distribute all its CPUs to its
2077
+ child partitions. There must be at least one cpu left in the
2078
+ parent partition.
2079
+
2080
+ Once becoming a partition root, changes to "cpuset.cpus" is
2081
+ generally allowed as long as the first condition above is true,
2082
+ the change will not take away all the CPUs from the parent
2083
+ partition and the new "cpuset.cpus" value is a superset of its
2084
+ children's "cpuset.cpus" values.
2085
+
2086
+ Sometimes, external factors like changes to ancestors'
2087
+ "cpuset.cpus" or cpu hotplug can cause the state of the partition
2088
+ root to change. On read, the "cpuset.sched.partition" file
2089
+ can show the following values.
2090
+
2091
+ "member" Non-root member of a partition
2092
+ "root" Partition root
2093
+ "root invalid" Invalid partition root
2094
+
2095
+ It is a partition root if the first 2 partition root conditions
2096
+ above are true and at least one CPU from "cpuset.cpus" is
2097
+ granted by the parent cgroup.
2098
+
2099
+ A partition root can become invalid if none of CPUs requested
2100
+ in "cpuset.cpus" can be granted by the parent cgroup or the
2101
+ parent cgroup is no longer a partition root itself. In this
2102
+ case, it is not a real partition even though the restriction
2103
+ of the first partition root condition above will still apply.
2104
+ The cpu affinity of all the tasks in the cgroup will then be
2105
+ associated with CPUs in the nearest ancestor partition.
2106
+
2107
+ An invalid partition root can be transitioned back to a
2108
+ real partition root if at least one of the requested CPUs
2109
+ can now be granted by its parent. In this case, the cpu
2110
+ affinity of all the tasks in the formerly invalid partition
2111
+ will be associated to the CPUs of the newly formed partition.
2112
+ Changing the partition state of an invalid partition root to
2113
+ "member" is always allowed even if child cpusets are present.
2114
+
2115
+
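The four conditions for accepting a write of "root" can be sketched as a small model. This is an illustration only, not kernel code; the `Cpuset` class and its fields are hypothetical stand-ins for the cgroup state described above:

```python
from dataclasses import dataclass, field

@dataclass
class Cpuset:
    # Hypothetical stand-in for a cpuset-enabled cgroup, not a kernel structure.
    cpus: set = field(default_factory=set)            # "cpuset.cpus"
    effective_cpus: set = field(default_factory=set)  # "cpuset.cpus.effective"
    is_partition_root: bool = False
    cpuset_enabled: bool = True
    parent: "Cpuset" = None
    children: list = field(default_factory=list)

def can_become_partition_root(cg: Cpuset) -> bool:
    """Model of the four conditions for writing "root" to cpuset.cpus.partition."""
    p = cg.parent
    # 1) "cpuset.cpus" is non-empty and exclusive, i.e. shared with no sibling.
    if not cg.cpus or any(cg.cpus & s.cpus for s in p.children if s is not cg):
        return False
    # 2) The parent cgroup is itself a partition root.
    if not p.is_partition_root:
        return False
    # 3) "cpuset.cpus" is a proper subset of the parent's "cpuset.cpus.effective",
    #    so the parent keeps at least one CPU.
    if not cg.cpus < p.effective_cpus:
        return False
    # 4) There are no child cgroups with cpuset enabled.
    return not any(c.cpuset_enabled for c in cg.children)

# Example: the parent partition owns CPUs 0-3; the child asks for CPUs 0-1.
root = Cpuset(cpus={0, 1, 2, 3}, effective_cpus={0, 1, 2, 3},
              is_partition_root=True)
child = Cpuset(cpus={0, 1}, parent=root)
root.children.append(child)
print(can_become_partition_root(child))  # True
```

A sibling requesting an overlapping CPU set would fail condition 1, which matches the exclusivity requirement above.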
Device controller
-----------------

....@@ -1674,7 +2141,7 @@
----

The "rdma" controller regulates the distribution and accounting of
-of RDMA resources.
+RDMA resources.

RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~
....@@ -1709,6 +2176,33 @@
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23

+HugeTLB
+-------
+
+The HugeTLB controller allows limiting HugeTLB usage per control group and
+enforces the controller limit during page fault.
+
+HugeTLB Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ hugetlb.<hugepagesize>.current
+ Show current usage for "hugepagesize" hugetlb. It exists for all
+ cgroups except the root cgroup.
+
+ hugetlb.<hugepagesize>.max
+ Set/show the hard limit of "hugepagesize" hugetlb usage.
+ The default value is "max". It exists for all cgroups except the
+ root cgroup.
+
+ hugetlb.<hugepagesize>.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+
+ max
+ The number of allocation failures due to the HugeTLB limit
+
+ hugetlb.<hugepagesize>.events.local
+ Similar to hugetlb.<hugepagesize>.events but the fields in the file
+ are local to the cgroup, i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
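The events files use the same flat-keyed "key value" per-line layout as other cgroup v2 files, so they can be read generically. A minimal sketch; the file path and sample contents are made up for illustration:

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse a cgroup v2 flat-keyed file, e.g. hugetlb.<hugepagesize>.events:
    one "key value" pair per line, values being integer counters."""
    out = {}
    for line in text.splitlines():
        key, value = line.split()
        out[key] = int(value)
    return out

# Hypothetical contents of /sys/fs/cgroup/job/hugetlb.2MB.events:
sample = "max 7\n"
print(parse_flat_keyed(sample))  # {'max': 7}
```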

Misc
----
....@@ -1915,10 +2409,12 @@

wbc_init_bio(@wbc, @bio)
 Should be called for each bio carrying writeback data and
- associates the bio with the inode's owner cgroup. Can be
- called anytime between bio allocation and submission.
+ associates the bio with the inode's owner cgroup and the
+ corresponding request queue. This must be called after
+ a queue (device) has been associated with the bio and
+ before submission.

- wbc_account_io(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @page, @bytes)
 Should be called for each data segment being written out.
 While this function doesn't care exactly when it's called
 during the writeback session, it's the easiest and most
....@@ -1935,7 +2431,7 @@
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.


....@@ -2145,8 +2641,10 @@
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
-reserve. A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve. A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.

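The "pressure proportional to overage" behaviour can be illustrated with a toy model. This is a deliberate simplification for intuition only, not the arithmetic the kernel actually uses in reclaim:

```python
def reclaim_scan_fraction(usage: int, effective_low: int) -> float:
    """Toy model of proportional reclaim pressure: a cgroup at or below
    its effective low is fully protected; above it, the fraction of its
    usage eligible for scanning grows with the overage and approaches 1
    as the protection becomes negligible relative to usage."""
    if usage <= effective_low:
        return 0.0  # within the protected reserve, no reclaim pressure
    return (usage - effective_low) / usage

# A cgroup at twice its effective low sees half of its usage eligible:
print(reclaim_scan_fraction(200, 100))  # 0.5
print(reclaim_scan_fraction(100, 100))  # 0.0
```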
The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.