From a5969cabbb4660eab42b6ef0412cbbd1200cf14d Mon Sep 17 00:00:00 2001 From: hc <hc@nodka.com> Date: Sat, 12 Oct 2024 07:10:09 +0000 Subject: [PATCH] 修改led为gpio --- kernel/Documentation/admin-guide/cgroup-v2.rst | 634 +++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 files changed, 566 insertions(+), 68 deletions(-) diff --git a/kernel/Documentation/admin-guide/cgroup-v2.rst b/kernel/Documentation/admin-guide/cgroup-v2.rst index 15779d9..540ee6d 100644 --- a/kernel/Documentation/admin-guide/cgroup-v2.rst +++ b/kernel/Documentation/admin-guide/cgroup-v2.rst @@ -9,7 +9,7 @@ conventions of cgroup v2. It describes all userland-visible aspects of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for -v1 is available under Documentation/cgroup-v1/. +v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`. .. CONTENTS @@ -54,13 +54,18 @@ 5-3-3. IO Latency 5-3-3-1. How IO Latency Throttling Works 5-3-3-2. IO Latency Interface Files + 5-3-4. IO Priority 5-4. PID 5-4-1. PID Interface Files - 5-5. Device - 5-6. RDMA - 5-6-1. RDMA Interface Files - 5-7. Misc - 5-7-1. perf_event + 5-5. Cpuset + 5.5-1. Cpuset Interface Files + 5-6. Device + 5-7. RDMA + 5-7-1. RDMA Interface Files + 5-8. HugeTLB + 5.8-1. HugeTLB Interface Files + 5-8. Misc + 5-8-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -174,6 +179,26 @@ through remount from the init namespace. The mount option is ignored on non-init namespace mounts. Please refer to the Delegation section for details. + + memory_localevents + + Only populate memory.events with data for the current cgroup, + and not any subtrees. This is legacy behaviour, the default + behaviour without this option is to include subtree counts. + This option is system wide and can only be set on mount or + modified through remount from the init namespace. The mount + option is ignored on non-init namespace mounts. + + memory_recursiveprot + + Recursively apply memory.min and memory.low protection to + entire subtrees, without requiring explicit downward + propagation into leaf cgroups. This allows protecting entire + subtrees from one another, while retaining free competition + within those subtrees. This should have been the default + behavior but is a mount-option to avoid regressing setups + relying on the original semantics (e.g. specifying bogusly + high 'bypass' protection values at higher tree levels). Organizing Processes and Threads @@ -604,8 +629,8 @@ Protections ----------- -A cgroup is protected to be allocated upto the configured amount of -the resource if the usages of all its ancestors are under their +A cgroup is protected upto the configured amount of the resource +as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case only upto the amount available to the parent is protected among @@ -690,9 +715,7 @@ - Settings for a single feature should be contained in a single file. - The root cgroup should be exempt from resource control and thus - shouldn't have resource control interface files. Also, - informational files on the root cgroup which end up showing global - information available elsewhere shouldn't exist. + shouldn't have resource control interface files. - The default time unit is microseconds. If a different unit is ever used, an explicit unit suffix must be present. @@ -868,6 +891,8 @@ populated 1 if the cgroup or its descendants contains any live processes; otherwise, 0. + frozen + 1 if the cgroup is frozen; otherwise, 0. cgroup.max.descendants A read-write single value files. The default is "max". @@ -901,6 +926,31 @@ A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + cgroup.freeze + A read-write single value file which exists on non-root cgroups. + Allowed values are "0" and "1". The default is "0". + + Writing "1" to the file causes freezing of the cgroup and all + descendant cgroups. This means that all belonging processes will + be stopped and will not run until the cgroup will be explicitly + unfrozen. Freezing of the cgroup may take some time; when this action + is completed, the "frozen" value in the cgroup.events control file + will be updated to "1" and the corresponding notification will be + issued. + + A cgroup can be frozen either by its own settings, or by settings + of any ancestor cgroups. If any of ancestor cgroups is frozen, the + cgroup will remain frozen. + + Processes in the frozen cgroup can be killed by a fatal signal. + They also can enter and leave a frozen cgroup: either by an explicit + move by a user, or if freezing of the cgroup races with fork(). + If a process is moved to a frozen cgroup, it stops. If a process is + moved out of a frozen cgroup, it becomes running. + + Frozen status of a cgroup doesn't affect any cgroup tree operations: + it's possible to delete a frozen (and empty) cgroup, as well as + create new sub-cgroups. Controllers =========== @@ -934,7 +984,7 @@ All time durations are in microseconds. cpu.stat - A read-only flat-keyed file which exists on non-root cgroups. + A read-only flat-keyed file. This file exists whether the controller is enabled or not. It always reports the following three stats: @@ -983,7 +1033,7 @@ A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for CPU. See - Documentation/accounting/psi.txt for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. cpu.uclamp.min A read-write single value file which exists on non-root cgroups. @@ -1058,9 +1108,12 @@ is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer - is invoked. + is invoked. Above the effective min boundary (or + effective low boundary if it is higher), pages are reclaimed + proportionally to the overage, reducing reclaim pressure for + smaller overages. - Effective min boundary is limited by memory.min values of + Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get @@ -1079,8 +1132,12 @@ Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup's - memory won't be reclaimed unless memory can be reclaimed - from unprotected cgroups. + memory won't be reclaimed unless there is no reclaimable + memory available in unprotected cgroups. + Above the effective low boundary (or + effective min boundary if it is higher), pages are reclaimed + proportionally to the overage, reducing reclaim pressure for + smaller overages. Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment @@ -1114,6 +1171,13 @@ Under certain circumstances, the usage may go over the limit temporarily. + In default configuration regular 0-order allocations always + succeed unless OOM killer chooses current task as a victim. + + Some kinds of allocations don't invoke the OOM killer. + Caller could retry them differently, return into userspace + as -ENOMEM or silently ignore in cases like disk readahead. + This is the ultimate protection mechanism. As long as the high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. @@ -1142,6 +1206,11 @@ otherwise, a value change in this file generates a file modified event. + Note that all fields in this file are hierarchical and the + file modified event can be generated due to an event down the + hierarchy. For for the local events at the cgroup level see + memory.events.local. + low The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under @@ -1165,17 +1234,18 @@ The number of time the cgroup's memory usage was reached the limit and allocation was about to fail. - Depending on context result could be invocation of OOM - killer and retrying allocation or failing allocation. - - Failed allocation in its turn could be returned into - userspace as -ENOMEM or silently ignored in cases like - disk readahead. For now OOM in memory cgroup kills - tasks iff shortage has happened inside page fault. + This event is not raised if the OOM killer is not + considered as an option, e.g. for failed high-order + allocations or if caller asked to not retry attempts. oom_kill The number of processes belonging to this cgroup killed by any kind of OOM killer. + + memory.events.local + Similar to memory.events but the fields in the file are local + to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. memory.stat A read-only flat-keyed file which exists on non-root cgroups. @@ -1190,6 +1260,10 @@ can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values! + If the entry has no per-node counter(or not show in the + mempry.numa_stat). We use 'npn'(non-per-node) as the tag + to indicate that it will not show in the mempry.numa_stat. + anon Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS) @@ -1201,11 +1275,11 @@ kernel_stack Amount of memory allocated to kernel stacks. - slab - Amount of memory used for storing in-kernel data - structures. + percpu(npn) + Amount of memory used for storing per-cpu kernel + data structures. - sock + sock(npn) Amount of memory used in network transmission buffers shmem @@ -1223,10 +1297,19 @@ Amount of cached filesystem data that was modified and is currently being written back to disk + anon_thp + Amount of memory used in anonymous mappings backed by + transparent hugepages + inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the - page reclaim algorithm + page reclaim algorithm. + + As these represent internal list state (eg. shmem pages are on anon + memory management lists), inactive_foo + active_foo may not be equal to + the value for the foo counter, since the foo counter is type-based, not + list-based. slab_reclaimable Part of "slab" that might be reclaimed, such as @@ -1236,51 +1319,95 @@ Part of "slab" that cannot be reclaimed on memory pressure. - pgfault - Total number of page faults incurred + slab(npn) + Amount of memory used for storing in-kernel data + structures. - pgmajfault - Number of major page faults incurred + workingset_refault_anon + Number of refaults of previously evicted anonymous pages. - workingset_refault + workingset_refault_file + Number of refaults of previously evicted file pages. - Number of refaults of previously evicted pages + workingset_activate_anon + Number of refaulted anonymous pages that were immediately + activated. - workingset_activate + workingset_activate_file + Number of refaulted file pages that were immediately activated. - Number of refaulted pages that were immediately activated + workingset_restore_anon + Number of restored anonymous pages which have been detected as + an active workingset before they got reclaimed. + + workingset_restore_file + Number of restored file pages which have been detected as an + active workingset before they got reclaimed. workingset_nodereclaim - Number of times a shadow node has been reclaimed - pgrefill + pgfault(npn) + Total number of page faults incurred + pgmajfault(npn) + Number of major page faults incurred + + pgrefill(npn) Amount of scanned pages (in an active LRU list) - pgscan - + pgscan(npn) Amount of scanned pages (in an inactive LRU list) - pgsteal - + pgsteal(npn) Amount of reclaimed pages - pgactivate - + pgactivate(npn) Amount of pages moved to the active LRU list - pgdeactivate + pgdeactivate(npn) + Amount of pages moved to the inactive LRU list - Amount of pages moved to the inactive LRU lis - - pglazyfree - + pglazyfree(npn) Amount of pages postponed to be freed under memory pressure - pglazyfreed - + pglazyfreed(npn) Amount of reclaimed lazyfree pages + + thp_fault_alloc(npn) + Number of transparent hugepages which were allocated to satisfy + a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE + is not set. + + thp_collapse_alloc(npn) + Number of transparent hugepages which were allocated to allow + collapsing an existing range of pages. This counter is not + present when CONFIG_TRANSPARENT_HUGEPAGE is not set. + + memory.numa_stat + A read-only nested-keyed file which exists on non-root cgroups. + + This breaks down the cgroup's memory footprint into different + types of memory, type-specific details, and other information + per node on the state of the memory management system. + + This is useful for providing visibility into the NUMA locality + information within an memcg since the pages are allowed to be + allocated from any physical node. One of the use case is evaluating + application performance by combining this information with the + application's CPU allocation. + + All memory amounts are in bytes. + + The output format of memory.numa_stat is:: + + type N0=<bytes in node 0> N1=<bytes in node 1> ... + + The entries are ordered to be human readable, and new entries + can show up in the middle. Don't rely on items remaining in a + fixed position; use the keys to look up specific values! + + The entries can refer to the memory.stat. memory.swap.current A read-only single value file which exists on non-root @@ -1288,6 +1415,22 @@ The total amount of swap currently being used by the cgroup and its descendants. + + memory.swap.high + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Swap usage throttle limit. If a cgroup's swap usage exceeds + this limit, all its further allocations will be throttled to + allow userspace to implement custom out-of-memory procedures. + + This limit marks a point of no return for the cgroup. It is NOT + designed to manage the amount of swapping a workload does + during regular operation. Compare to memory.swap.max, which + prohibits swapping past a set amount, but lets the cgroup + continue unimpeded as long as other memory can be reclaimed. + + Healthy workloads are not expected to reach this limit. memory.swap.max A read-write single value file which exists on non-root @@ -1301,6 +1444,10 @@ The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event. + + high + The number of times the cgroup's swap usage was over + the high threshold. max The number of times the cgroup's swap usage was about @@ -1321,7 +1468,7 @@ A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for memory. See - Documentation/accounting/psi.txt for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. Usage Guidelines @@ -1381,8 +1528,7 @@ ~~~~~~~~~~~~~~~~~~ io.stat - A read-only nested-keyed file which exists on non-root - cgroups. + A read-only nested-keyed file. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined. @@ -1396,10 +1542,107 @@ dios Number of discard IOs ====== ===================== - An example read output follows: + An example read output follows:: 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + + io.cost.qos + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the Quality of Service of the IO cost + model based controller (CONFIG_BLK_CGROUP_IOCOST) which + currently implements "io.weight" proportional control. Lines + are keyed by $MAJ:$MIN device numbers and not ordered. The + line for a given device is populated on the first write for + the device on "io.cost.qos" or "io.cost.model". The following + nested keys are defined. + + ====== ===================================== + enable Weight-based control enable + ctrl "auto" or "user" + rpct Read latency percentile [0, 100] + rlat Read latency threshold + wpct Write latency percentile [0, 100] + wlat Write latency threshold + min Minimum scaling percentage [1, 10000] + max Maximum scaling percentage [1, 10000] + ====== ===================================== + + The controller is disabled by default and can be enabled by + setting "enable" to 1. "rpct" and "wpct" parameters default + to zero and the controller uses internal device saturation + state to adjust the overall IO rate between "min" and "max". + + When a better control quality is needed, latency QoS + parameters can be configured. For example:: + + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 + + shows that on sdb, the controller is enabled, will consider + the device saturated if the 95th percentile of read completion + latencies is above 75ms or write 150ms, and adjust the overall + IO issue rate between 50% and 150% accordingly. + + The lower the saturation point, the better the latency QoS at + the cost of aggregate bandwidth. The narrower the allowed + adjustment range between "min" and "max", the more conformant + to the cost model the IO behavior. Note that the IO issue + base rate may be far off from 100% and setting "min" and "max" + blindly can lead to a significant loss of device capacity or + control quality. "min" and "max" are useful for regulating + devices which show wide temporary behavior changes - e.g. a + ssd which accepts writes at the line speed for a while and + then completely stalls for multiple seconds. + + When "ctrl" is "auto", the parameters are controlled by the + kernel and may change automatically. Setting "ctrl" to "user" + or setting any of the percentile and latency parameters puts + it into "user" mode and disables the automatic changes. The + automatic mode can be restored by setting "ctrl" to "auto". + + io.cost.model + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the cost model of the IO cost model based + controller (CONFIG_BLK_CGROUP_IOCOST) which currently + implements "io.weight" proportional control. Lines are keyed + by $MAJ:$MIN device numbers and not ordered. The line for a + given device is populated on the first write for the device on + "io.cost.qos" or "io.cost.model". The following nested keys + are defined. + + ===== ================================ + ctrl "auto" or "user" + model The cost model in use - "linear" + ===== ================================ + + When "ctrl" is "auto", the kernel may change all parameters + dynamically. When "ctrl" is set to "user" or any other + parameters are written to, "ctrl" become "user" and the + automatic changes are disabled. + + When "model" is "linear", the following model parameters are + defined. + + ============= ======================================== + [r|w]bps The maximum sequential IO throughput + [r|w]seqiops The maximum 4k sequential IOs per second + [r|w]randiops The maximum 4k random IOs per second + ============= ======================================== + + From the above, the builtin linear model determines the base + costs of a sequential and random IO and the cost coefficient + for the IO size. While simple, this model can cover most + common device classes acceptably. + + The IO cost model isn't expected to be accurate in absolute + sense and is scaled to the device behavior dynamically. + + If needed, tools/cgroup/iocost_coef_gen.py can be used to + generate device-specific coefficients. io.weight A read-write flat-keyed file which exists on non-root cgroups. @@ -1464,7 +1707,7 @@ A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for IO. See - Documentation/accounting/psi.txt for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. Writeback @@ -1485,9 +1728,9 @@ of the two is enforced. cgroup writeback requires explicit support from the underlying -filesystem. Currently, cgroup writeback is implemented on ext2, ext4 -and btrfs. On other filesystems, all writeback IOs are attributed to -the root cgroup. +filesystem. Currently, cgroup writeback is implemented on ext2, ext4, +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are +attributed to the root cgroup. There are inherent differences in memory and writeback management which affects how cgroup ownership is tracked. Memory is tracked per @@ -1537,7 +1780,7 @@ The limits are only applied at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and -groups D and F will influence each other. Group G will influence nobody. +groups D and F will influence each other. Group G will influence nobody:: [root] / | \ @@ -1606,6 +1849,60 @@ duration of time between evaluation events. Windows only elapse with IO activity. Idle periods extend the most recent window. +IO Priority +~~~~~~~~~~~ + +A single attribute controls the behavior of the I/O priority cgroup policy, +namely the blkio.prio.class attribute. The following values are accepted for +that attribute: + + no-change + Do not modify the I/O priority class. + + none-to-rt + For requests that do not have an I/O priority class (NONE), + change the I/O priority class into RT. Do not modify + the I/O priority class of other requests. + + restrict-to-be + For requests that do not have an I/O priority class or that have I/O + priority class RT, change it into BE. Do not modify the I/O priority + class of requests that have priority class IDLE. + + idle + Change the I/O priority class of all requests into IDLE, the lowest + I/O priority class. + +The following numerical values are associated with the I/O priority policies: + ++-------------+---+ +| no-change | 0 | ++-------------+---+ +| none-to-rt | 1 | ++-------------+---+ +| rt-to-be | 2 | ++-------------+---+ +| all-to-idle | 3 | ++-------------+---+ + +The numerical value that corresponds to each I/O priority class is as follows: + ++-------------------------------+---+ +| IOPRIO_CLASS_NONE | 0 | ++-------------------------------+---+ +| IOPRIO_CLASS_RT (real-time) | 1 | ++-------------------------------+---+ +| IOPRIO_CLASS_BE (best effort) | 2 | ++-------------------------------+---+ +| IOPRIO_CLASS_IDLE | 3 | ++-------------------------------+---+ + +The algorithm to set the I/O priority class for a request is as follows: + +- Translate the I/O priority class policy into a number. +- Change the request I/O priority class into the maximum of the I/O priority + class policy number and the numerical I/O priority class. + PID --- @@ -1646,6 +1943,176 @@ of a new process would cause a cgroup policy to be violated. +Cpuset +------ + +The "cpuset" controller provides a mechanism for constraining +the CPU and memory node placement of tasks to only the resources +specified in the cpuset interface files in a task's current cgroup. +This is especially valuable on large NUMA systems where placing jobs +on properly sized subsets of the systems with careful processor and +memory placement to reduce cross-node memory access and contention +can improve overall system performance. + +The "cpuset" controller is hierarchical. That means the controller +cannot use CPUs or memory nodes not allowed in its parent. + + +Cpuset Interface Files +~~~~~~~~~~~~~~~~~~~~~~ + + cpuset.cpus + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested CPUs to be used by tasks within this + cgroup. The actual list of CPUs to be granted, however, is + subjected to constraints imposed by its parent and can differ + from the requested CPUs. + + The CPU numbers are comma-separated numbers or ranges. + For example:: + + # cat cpuset.cpus + 0-4,6,8-10 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.cpus" or all the available CPUs if none is found. + + The value of "cpuset.cpus" stays constant until the next update + and won't be affected by any CPU hotplug events. + + cpuset.cpus.effective + A read-only multiple values file which exists on all + cpuset-enabled cgroups. + + It lists the onlined CPUs that are actually granted to this + cgroup by its parent. These CPUs are allowed to be used by + tasks within the current cgroup. + + If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows + all the CPUs from the parent cgroup that can be available to + be used by this cgroup. Otherwise, it should be a subset of + "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" + can be granted. In this case, it will be treated just like an + empty "cpuset.cpus". + + Its value will be affected by CPU hotplug events. + + cpuset.mems + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested memory nodes to be used by tasks within + this cgroup. The actual list of memory nodes granted, however, + is subjected to constraints imposed by its parent and can differ + from the requested memory nodes. + + The memory node numbers are comma-separated numbers or ranges. + For example:: + + # cat cpuset.mems + 0-1,3 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.mems" or all the available memory nodes if none + is found. + + The value of "cpuset.mems" stays constant until the next update + and won't be affected by any memory nodes hotplug events. + + cpuset.mems.effective + A read-only multiple values file which exists on all + cpuset-enabled cgroups. + + It lists the onlined memory nodes that are actually granted to + this cgroup by its parent. These memory nodes are allowed to + be used by tasks within the current cgroup. + + If "cpuset.mems" is empty, it shows all the memory nodes from the + parent cgroup that will be available to be used by this cgroup. + Otherwise, it should be a subset of "cpuset.mems" unless none of + the memory nodes listed in "cpuset.mems" can be granted. In this + case, it will be treated just like an empty "cpuset.mems". + + Its value will be affected by memory nodes hotplug events. + + cpuset.cpus.partition + A read-write single value file which exists on non-root + cpuset-enabled cgroups. This flag is owned by the parent cgroup + and is not delegatable. + + It accepts only the following input values when written to. + + "root" - a partition root + "member" - a non-root member of a partition + + When set to be a partition root, the current cgroup is the + root of a new partition or scheduling domain that comprises + itself and all its descendants except those that are separate + partition roots themselves and their descendants. The root + cgroup is always a partition root. + + There are constraints on where a partition root can be set. + It can only be set in a cgroup if all the following conditions + are true. + + 1) The "cpuset.cpus" is not empty and the list of CPUs are + exclusive, i.e. they are not shared by any of its siblings. + 2) The parent cgroup is a partition root. + 3) The "cpuset.cpus" is also a proper subset of the parent's + "cpuset.cpus.effective". + 4) There is no child cgroups with cpuset enabled. This is for + eliminating corner cases that have to be handled if such a + condition is allowed. + + Setting it to partition root will take the CPUs away from the + effective CPUs of the parent cgroup. Once it is set, this + file cannot be reverted back to "member" if there are any child + cgroups with cpuset enabled. + + A parent partition cannot distribute all its CPUs to its + child partitions. There must be at least one cpu left in the + parent partition. + + Once becoming a partition root, changes to "cpuset.cpus" is + generally allowed as long as the first condition above is true, + the change will not take away all the CPUs from the parent + partition and the new "cpuset.cpus" value is a superset of its + children's "cpuset.cpus" values. + + Sometimes, external factors like changes to ancestors' + "cpuset.cpus" or cpu hotplug can cause the state of the partition + root to change. On read, the "cpuset.sched.partition" file + can show the following values. + + "member" Non-root member of a partition + "root" Partition root + "root invalid" Invalid partition root + + It is a partition root if the first 2 partition root conditions + above are true and at least one CPU from "cpuset.cpus" is + granted by the parent cgroup. + + A partition root can become invalid if none of CPUs requested + in "cpuset.cpus" can be granted by the parent cgroup or the + parent cgroup is no longer a partition root itself. In this + case, it is not a real partition even though the restriction + of the first partition root condition above will still apply. + The cpu affinity of all the tasks in the cgroup will then be + associated with CPUs in the nearest ancestor partition. + + An invalid partition root can be transitioned back to a + real partition root if at least one of the requested CPUs + can now be granted by its parent. In this case, the cpu + affinity of all the tasks in the formerly invalid partition + will be associated to the CPUs of the newly formed partition. + Changing the partition state of an invalid partition root to + "member" is always allowed even if child cpusets are present. + + Device controller ----------------- @@ -1674,7 +2141,7 @@ ---- The "rdma" controller regulates the distribution and accounting of -of RDMA resources. +RDMA resources. RDMA Interface Files ~~~~~~~~~~~~~~~~~~~~ @@ -1709,6 +2176,33 @@ mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +HugeTLB +------- + +The HugeTLB controller allows to limit the HugeTLB usage per control group and +enforces the controller limit during page fault. + +HugeTLB Interface Files +~~~~~~~~~~~~~~~~~~~~~~~ + + hugetlb.<hugepagesize>.current + Show current usage for "hugepagesize" hugetlb. It exists for all + the cgroup except root. + + hugetlb.<hugepagesize>.max + Set/show the hard limit of "hugepagesize" hugetlb usage. + The default value is "max". It exists for all the cgroup except root. + + hugetlb.<hugepagesize>.events + A read-only flat-keyed file which exists on non-root cgroups. + + max + The number of allocation failure due to HugeTLB limit + + hugetlb.<hugepagesize>.events.local + Similar to hugetlb.<hugepagesize>.events but the fields in the file + are local to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. Misc ---- @@ -1915,10 +2409,12 @@ wbc_init_bio(@wbc, @bio) Should be called for each bio carrying writeback data and - associates the bio with the inode's owner cgroup. Can be - called anytime between bio allocation and submission. + associates the bio with the inode's owner cgroup and the + corresponding request queue. This must be called after + a queue (device) has been associated with the bio and + before submission. - wbc_account_io(@wbc, @page, @bytes) + wbc_account_cgroup_owner(@wbc, @page, @bytes) Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most @@ -1935,7 +2431,7 @@ the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem -cases by skipping wbc_init_bio() or using bio_associate_blkcg() +cases by skipping wbc_init_bio() and using bio_associate_blkg() directly. @@ -2145,8 +2641,10 @@ becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated -reserve. A cgroup enjoys reclaim protection when it's within its low, -which makes delegation of subtrees possible. +reserve. A cgroup enjoys reclaim protection when it's within its +effective low, which makes delegation of subtrees possible. It also +enjoys having reclaim pressure proportional to its overage when +above its effective low. The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. -- Gitblit v1.6.2