2024-10-12 a5969cabbb4660eab42b6ef0412cbbd1200cf14d
kernel/Documentation/admin-guide/cgroup-v2.rst
....@@ -9,7 +9,7 @@
99 conventions of cgroup v2. It describes all userland-visible aspects
1010 of cgroup including core and specific controller behaviors. All
1111 future changes must be reflected in this document. Documentation for
12
-v1 is available under Documentation/cgroup-v1/.
12
+v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
1313
1414 .. CONTENTS
1515
....@@ -54,13 +54,18 @@
5454 5-3-3. IO Latency
5555 5-3-3-1. How IO Latency Throttling Works
5656 5-3-3-2. IO Latency Interface Files
57
+ 5-3-4. IO Priority
5758 5-4. PID
5859 5-4-1. PID Interface Files
59
- 5-5. Device
60
- 5-6. RDMA
61
- 5-6-1. RDMA Interface Files
62
- 5-7. Misc
63
- 5-7-1. perf_event
60
+ 5-5. Cpuset
61
+ 5-5-1. Cpuset Interface Files
62
+ 5-6. Device
63
+ 5-7. RDMA
64
+ 5-7-1. RDMA Interface Files
65
+ 5-8. HugeTLB
66
+ 5-8-1. HugeTLB Interface Files
67
+ 5-9. Misc
68
+ 5-9-1. perf_event
6469 5-N. Non-normative information
6570 5-N-1. CPU controller root cgroup process behaviour
6671 5-N-2. IO controller root cgroup process behaviour
....@@ -174,6 +179,26 @@
174179 through remount from the init namespace. The mount option is
175180 ignored on non-init namespace mounts. Please refer to the
176181 Delegation section for details.
182
+
183
+ memory_localevents
184
+
185
+ Only populate memory.events with data for the current cgroup,
186
+ and not any subtrees. This is legacy behaviour; the default
187
+ behaviour without this option is to include subtree counts.
188
+ This option is system wide and can only be set on mount or
189
+ modified through remount from the init namespace. The mount
190
+ option is ignored on non-init namespace mounts.
191
+
192
+ memory_recursiveprot
193
+
194
+ Recursively apply memory.min and memory.low protection to
195
+ entire subtrees, without requiring explicit downward
196
+ propagation into leaf cgroups. This allows protecting entire
197
+ subtrees from one another, while retaining free competition
198
+ within those subtrees. This should have been the default
199
+ behavior but is a mount-option to avoid regressing setups
200
+ relying on the original semantics (e.g. specifying bogusly
201
+ high 'bypass' protection values at higher tree levels).
177202
178203
179204 Organizing Processes and Threads
....@@ -604,8 +629,8 @@
604629 Protections
605630 -----------
606631
607
-A cgroup is protected to be allocated upto the configured amount of
608
-the resource if the usages of all its ancestors are under their
632
+A cgroup is protected up to the configured amount of the resource
633
+as long as the usages of all its ancestors are under their
609634 protected levels. Protections can be hard guarantees or best effort
610635 soft boundaries. Protections can also be over-committed in which case
611636 only up to the amount available to the parent is protected among
....@@ -690,9 +715,7 @@
690715 - Settings for a single feature should be contained in a single file.
691716
692717 - The root cgroup should be exempt from resource control and thus
693
- shouldn't have resource control interface files. Also,
694
- informational files on the root cgroup which end up showing global
695
- information available elsewhere shouldn't exist.
718
+ shouldn't have resource control interface files.
696719
697720 - The default time unit is microseconds. If a different unit is ever
698721 used, an explicit unit suffix must be present.
....@@ -868,6 +891,8 @@
868891 populated
869892 1 if the cgroup or its descendants contains any live
870893 processes; otherwise, 0.
894
+ frozen
895
+ 1 if the cgroup is frozen; otherwise, 0.
871896
872897 cgroup.max.descendants
873898 A read-write single value files. The default is "max".
....@@ -901,6 +926,31 @@
901926 A dying cgroup can consume system resources not exceeding
902927 limits, which were active at the moment of cgroup deletion.
903928
929
+ cgroup.freeze
930
+ A read-write single value file which exists on non-root cgroups.
931
+ Allowed values are "0" and "1". The default is "0".
932
+
933
+ Writing "1" to the file causes freezing of the cgroup and all
934
+ descendant cgroups. This means that all processes belonging to it
935
+ will be stopped and will not run until the cgroup is explicitly
936
+ unfrozen. Freezing of the cgroup may take some time; when this action
937
+ is completed, the "frozen" value in the cgroup.events control file
938
+ will be updated to "1" and the corresponding notification will be
939
+ issued.
940
+
941
+ A cgroup can be frozen either by its own settings, or by settings
942
+ of any ancestor cgroups. If any ancestor cgroup is frozen, the
943
+ cgroup will remain frozen.
944
+
945
+ Processes in the frozen cgroup can be killed by a fatal signal.
946
+ They also can enter and leave a frozen cgroup: either by an explicit
947
+ move by a user, or if freezing of the cgroup races with fork().
948
+ If a process is moved to a frozen cgroup, it stops. If a process is
949
+ moved out of a frozen cgroup, it resumes running.
950
+
951
+ Frozen status of a cgroup doesn't affect any cgroup tree operations:
952
+ it's possible to delete a frozen (and empty) cgroup, as well as
953
+ create new sub-cgroups.
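The self-or-ancestor rule described above can be sketched in a few lines of illustrative Python (not kernel code; the class and method names are invented): a cgroup reports "frozen" in cgroup.events whenever its own cgroup.freeze is set or any ancestor's is.

```python
# Hypothetical model of cgroup.freeze semantics: a cgroup is
# effectively frozen when its own cgroup.freeze is 1 or any
# ancestor's is.
class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.freeze = 0              # value written to cgroup.freeze

    def frozen(self):
        # "frozen" field reported in cgroup.events
        node = self
        while node is not None:
            if node.freeze:
                return 1
            node = node.parent
        return 0

root = Cgroup()
a = Cgroup(parent=root)
b = Cgroup(parent=a)

a.freeze = 1                  # freeze the middle cgroup
print(b.frozen())             # descendants report frozen -> 1
a.freeze = 0
print(b.frozen())             # no ancestor frozen any more -> 0
```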
904954
905955 Controllers
906956 ===========
....@@ -934,7 +984,7 @@
934984 All time durations are in microseconds.
935985
936986 cpu.stat
937
- A read-only flat-keyed file which exists on non-root cgroups.
987
+ A read-only flat-keyed file.
938988 This file exists whether the controller is enabled or not.
939989
940990 It always reports the following three stats:
....@@ -983,7 +1033,7 @@
9831033 A read-only nested-key file which exists on non-root cgroups.
9841034
9851035 Shows pressure stall information for CPU. See
986
- Documentation/accounting/psi.txt for details.
1036
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
9871037
9881038 cpu.uclamp.min
9891039 A read-write single value file which exists on non-root cgroups.
....@@ -1058,9 +1108,12 @@
10581108 is within its effective min boundary, the cgroup's memory
10591109 won't be reclaimed under any conditions. If there is no
10601110 unprotected reclaimable memory available, OOM killer
1061
- is invoked.
1111
+ is invoked. Above the effective min boundary (or
1112
+ effective low boundary if it is higher), pages are reclaimed
1113
+ proportionally to the overage, reducing reclaim pressure for
1114
+ smaller overages.
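The "proportional to the overage" behaviour added in this hunk can be illustrated with a toy formula (this is not the kernel's exact arithmetic, only a sketch of the proportionality): reclaim pressure grows with how far usage exceeds the effective protection.

```python
# Illustrative only: reclaim pressure scaled by how far usage
# exceeds the effective protection boundary. The real kernel
# formula differs in detail.
def scan_fraction(usage, protection):
    if usage <= protection:
        return 0.0               # fully protected: no reclaim
    return (usage - protection) / usage

print(scan_fraction(100, 80))    # small overage -> light pressure
print(scan_fraction(100, 20))    # large overage -> heavy pressure
print(scan_fraction(50, 80))     # under protection -> 0.0
```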
10621115
1063
- Effective min boundary is limited by memory.min values of
1116
+ Effective min boundary is limited by memory.min values of
10641117 all ancestor cgroups. If there is memory.min overcommitment
10651118 (child cgroup or cgroups are requiring more protected memory
10661119 than parent will allow), then each child cgroup will get
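The overcommitment case described here can be sketched as follows (illustrative Python, not the kernel implementation; the assumption, based on the surrounding text, is that each child's share is proportional to its usage below its own memory.min):

```python
# Hedged sketch of effective min under overcommitment: when the
# children together claim more than the parent's effective
# protection, each child gets a proportional share.
def effective_min(parent_eff, children):
    # children: list of (usage, memory_min) pairs
    claims = [min(usage, mmin) for usage, mmin in children]
    total = sum(claims)
    if total <= parent_eff:
        return claims            # no overcommitment
    return [parent_eff * c / total for c in claims]

# Parent grants 100; children claim 80 + 40 = 120 (overcommitted),
# so each gets a proportional slice of the parent's 100.
print(effective_min(100, [(80, 100), (40, 60)]))
```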
....@@ -1079,8 +1132,12 @@
10791132
10801133 Best-effort memory protection. If the memory usage of a
10811134 cgroup is within its effective low boundary, the cgroup's
1082
- memory won't be reclaimed unless memory can be reclaimed
1083
- from unprotected cgroups.
1135
+ memory won't be reclaimed unless there is no reclaimable
1136
+ memory available in unprotected cgroups.
1137
+ Above the effective low boundary (or
1138
+ effective min boundary if it is higher), pages are reclaimed
1139
+ proportionally to the overage, reducing reclaim pressure for
1140
+ smaller overages.
10841141
10851142 Effective low boundary is limited by memory.low values of
10861143 all ancestor cgroups. If there is memory.low overcommitment
....@@ -1114,6 +1171,13 @@
11141171 Under certain circumstances, the usage may go over the limit
11151172 temporarily.
11161173
1174
+ In the default configuration, regular 0-order allocations always
1175
+ succeed unless the OOM killer chooses the current task as a victim.
1176
+
1177
+ Some kinds of allocations don't invoke the OOM killer. The
1178
+ caller may retry them differently, return -ENOMEM to userspace,
1179
+ or silently ignore the failure in cases like disk readahead.
1180
+
11171181 This is the ultimate protection mechanism. As long as the
11181182 high limit is used and monitored properly, this limit's
11191183 utility is limited to providing the final safety net.
....@@ -1142,6 +1206,11 @@
11421206 otherwise, a value change in this file generates a file
11431207 modified event.
11441208
1209
+ Note that all fields in this file are hierarchical and the
1210
+ file modified event can be generated due to an event down the
1211
+ hierarchy. For the local events at the cgroup level, see
1212
+ memory.events.local.
1213
+
11451214 low
11461215 The number of times the cgroup is reclaimed due to
11471216 high memory pressure even though its usage is under
....@@ -1165,17 +1234,18 @@
11651234 The number of times the cgroup's memory usage
11661235 reached the limit and allocation was about to fail.
11671236
1168
- Depending on context result could be invocation of OOM
1169
- killer and retrying allocation or failing allocation.
1170
-
1171
- Failed allocation in its turn could be returned into
1172
- userspace as -ENOMEM or silently ignored in cases like
1173
- disk readahead. For now OOM in memory cgroup kills
1174
- tasks iff shortage has happened inside page fault.
1237
+ This event is not raised if the OOM killer is not
1238
+ considered as an option, e.g. for failed high-order
1239
+ allocations or if the caller asked not to retry.
11751240
11761241 oom_kill
11771242 The number of processes belonging to this cgroup
11781243 killed by any kind of OOM killer.
1244
+
1245
+ memory.events.local
1246
+ Similar to memory.events but the fields in the file are local
1247
+ to the cgroup, i.e. not hierarchical. The file modified event
1248
+ generated on this file reflects only the local events.
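The difference between memory.events and memory.events.local can be modelled in a short illustrative sketch (invented names, not kernel code): an event raised in a cgroup shows up in its own local counter and in the hierarchical counter of itself and every ancestor.

```python
# Illustrative model of hierarchical vs. local event counters.
class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.events_local = 0        # memory.events.local
        self.events = 0              # memory.events (hierarchical)

    def raise_event(self):
        self.events_local += 1
        node = self
        while node is not None:      # propagate up the hierarchy
            node.events += 1
            node = node.parent

parent = Cgroup()
child = Cgroup(parent=parent)
child.raise_event()
print(parent.events, parent.events_local)   # 1 0
print(child.events, child.events_local)     # 1 1
```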
11791249
11801250 memory.stat
11811251 A read-only flat-keyed file which exists on non-root cgroups.
....@@ -1190,6 +1260,10 @@
11901260 can show up in the middle. Don't rely on items remaining in a
11911261 fixed position; use the keys to look up specific values!
11921262
1263
+ If an entry has no per-node counter (and is therefore not
1264
+ shown in memory.numa_stat), the 'npn' (non-per-node) tag is
1265
+ used to indicate that it will not show in memory.numa_stat.
1266
+
11931267 anon
11941268 Amount of memory used in anonymous mappings such as
11951269 brk(), sbrk(), and mmap(MAP_ANONYMOUS)
....@@ -1201,11 +1275,11 @@
12011275 kernel_stack
12021276 Amount of memory allocated to kernel stacks.
12031277
1204
- slab
1205
- Amount of memory used for storing in-kernel data
1206
- structures.
1278
+ percpu(npn)
1279
+ Amount of memory used for storing per-cpu kernel
1280
+ data structures.
12071281
1208
- sock
1282
+ sock(npn)
12091283 Amount of memory used in network transmission buffers
12101284
12111285 shmem
....@@ -1223,10 +1297,19 @@
12231297 Amount of cached filesystem data that was modified and
12241298 is currently being written back to disk
12251299
1300
+ anon_thp
1301
+ Amount of memory used in anonymous mappings backed by
1302
+ transparent hugepages
1303
+
12261304 inactive_anon, active_anon, inactive_file, active_file, unevictable
12271305 Amount of memory, swap-backed and filesystem-backed,
12281306 on the internal memory management lists used by the
1229
- page reclaim algorithm
1307
+ page reclaim algorithm.
1308
+
1309
+ As these represent internal list state (e.g. shmem pages are on anon
1310
+ memory management lists), inactive_foo + active_foo may not be equal to
1311
+ the value for the foo counter, since the foo counter is type-based, not
1312
+ list-based.
12301313
12311314 slab_reclaimable
12321315 Part of "slab" that might be reclaimed, such as
....@@ -1236,51 +1319,95 @@
12361319 Part of "slab" that cannot be reclaimed on memory
12371320 pressure.
12381321
1239
- pgfault
1240
- Total number of page faults incurred
1322
+ slab(npn)
1323
+ Amount of memory used for storing in-kernel data
1324
+ structures.
12411325
1242
- pgmajfault
1243
- Number of major page faults incurred
1326
+ workingset_refault_anon
1327
+ Number of refaults of previously evicted anonymous pages.
12441328
1245
- workingset_refault
1329
+ workingset_refault_file
1330
+ Number of refaults of previously evicted file pages.
12461331
1247
- Number of refaults of previously evicted pages
1332
+ workingset_activate_anon
1333
+ Number of refaulted anonymous pages that were immediately
1334
+ activated.
12481335
1249
- workingset_activate
1336
+ workingset_activate_file
1337
+ Number of refaulted file pages that were immediately activated.
12501338
1251
- Number of refaulted pages that were immediately activated
1339
+ workingset_restore_anon
1340
+ Number of restored anonymous pages which have been detected as
1341
+ an active workingset before they got reclaimed.
1342
+
1343
+ workingset_restore_file
1344
+ Number of restored file pages which have been detected as an
1345
+ active workingset before they got reclaimed.
12521346
12531347 workingset_nodereclaim
1254
-
12551348 Number of times a shadow node has been reclaimed
12561349
1257
- pgrefill
1350
+ pgfault(npn)
1351
+ Total number of page faults incurred
12581352
1353
+ pgmajfault(npn)
1354
+ Number of major page faults incurred
1355
+
1356
+ pgrefill(npn)
12591357 Amount of scanned pages (in an active LRU list)
12601358
1261
- pgscan
1262
-
1359
+ pgscan(npn)
12631360 Amount of scanned pages (in an inactive LRU list)
12641361
1265
- pgsteal
1266
-
1362
+ pgsteal(npn)
12671363 Amount of reclaimed pages
12681364
1269
- pgactivate
1270
-
1365
+ pgactivate(npn)
12711366 Amount of pages moved to the active LRU list
12721367
1273
- pgdeactivate
1368
+ pgdeactivate(npn)
1369
+ Amount of pages moved to the inactive LRU list
12741370
1275
- Amount of pages moved to the inactive LRU lis
1276
-
1277
- pglazyfree
1278
-
1371
+ pglazyfree(npn)
12791372 Amount of pages postponed to be freed under memory pressure
12801373
1281
- pglazyfreed
1282
-
1374
+ pglazyfreed(npn)
12831375 Amount of reclaimed lazyfree pages
1376
+
1377
+ thp_fault_alloc(npn)
1378
+ Number of transparent hugepages which were allocated to satisfy
1379
+ a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
1380
+ is not set.
1381
+
1382
+ thp_collapse_alloc(npn)
1383
+ Number of transparent hugepages which were allocated to allow
1384
+ collapsing an existing range of pages. This counter is not
1385
+ present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
1386
+
1387
+ memory.numa_stat
1388
+ A read-only nested-keyed file which exists on non-root cgroups.
1389
+
1390
+ This breaks down the cgroup's memory footprint into different
1391
+ types of memory, type-specific details, and other information
1392
+ per node on the state of the memory management system.
1393
+
1394
+ This is useful for providing visibility into the NUMA locality
1395
+ information within a memcg since the pages are allowed to be
1396
+ allocated from any physical node. One use case is evaluating
1397
+ application performance by combining this information with the
1398
+ application's CPU allocation.
1399
+
1400
+ All memory amounts are in bytes.
1401
+
1402
+ The output format of memory.numa_stat is::
1403
+
1404
+ type N0=<bytes in node 0> N1=<bytes in node 1> ...
1405
+
1406
+ The entries are ordered to be human readable, and new entries
1407
+ can show up in the middle. Don't rely on items remaining in a
1408
+ fixed position; use the keys to look up specific values!
1409
+
1410
+ The entries correspond to the ones in memory.stat.
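A small parser for the memory.numa_stat format shown above (the sample line is made up for illustration):

```python
# Parse "type N0=<bytes in node 0> N1=<bytes in node 1> ..." lines
# into a nested dictionary keyed by stat name, then node.
def parse_numa_stat(text):
    stats = {}
    for line in text.splitlines():
        key, *nodes = line.split()
        stats[key] = {n.split('=')[0]: int(n.split('=')[1]) for n in nodes}
    return stats

sample = "anon N0=1048576 N1=4096\nfile N0=0 N1=8192"
stats = parse_numa_stat(sample)
print(stats["anon"]["N1"])    # 4096
```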
12841411
12851412 memory.swap.current
12861413 A read-only single value file which exists on non-root
....@@ -1288,6 +1415,22 @@
12881415
12891416 The total amount of swap currently being used by the cgroup
12901417 and its descendants.
1418
+
1419
+ memory.swap.high
1420
+ A read-write single value file which exists on non-root
1421
+ cgroups. The default is "max".
1422
+
1423
+ Swap usage throttle limit. If a cgroup's swap usage exceeds
1424
+ this limit, all its further allocations will be throttled to
1425
+ allow userspace to implement custom out-of-memory procedures.
1426
+
1427
+ This limit marks a point of no return for the cgroup. It is NOT
1428
+ designed to manage the amount of swapping a workload does
1429
+ during regular operation. Compare to memory.swap.max, which
1430
+ prohibits swapping past a set amount, but lets the cgroup
1431
+ continue unimpeded as long as other memory can be reclaimed.
1432
+
1433
+ Healthy workloads are not expected to reach this limit.
12911434
12921435 memory.swap.max
12931436 A read-write single value file which exists on non-root
....@@ -1301,6 +1444,10 @@
13011444 The following entries are defined. Unless specified
13021445 otherwise, a value change in this file generates a file
13031446 modified event.
1447
+
1448
+ high
1449
+ The number of times the cgroup's swap usage was over
1450
+ the high threshold.
13041451
13051452 max
13061453 The number of times the cgroup's swap usage was about
....@@ -1321,7 +1468,7 @@
13211468 A read-only nested-key file which exists on non-root cgroups.
13221469
13231470 Shows pressure stall information for memory. See
1324
- Documentation/accounting/psi.txt for details.
1471
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
13251472
13261473
13271474 Usage Guidelines
....@@ -1381,8 +1528,7 @@
13811528 ~~~~~~~~~~~~~~~~~~
13821529
13831530 io.stat
1384
- A read-only nested-keyed file which exists on non-root
1385
- cgroups.
1531
+ A read-only nested-keyed file.
13861532
13871533 Lines are keyed by $MAJ:$MIN device numbers and not ordered.
13881534 The following nested keys are defined.
....@@ -1396,10 +1542,107 @@
13961542 dios Number of discard IOs
13971543 ====== =====================
13981544
1399
- An example read output follows:
1545
+ An example read output follows::
14001546
14011547 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
14021548 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
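The nested-keyed io.stat format in the example above parses the same way as other nested-keyed files; a short illustrative parser:

```python
# Parse io.stat lines: "$MAJ:$MIN key=value key=value ..."
def parse_io_stat(text):
    devices = {}
    for line in text.splitlines():
        dev, *pairs = line.split()
        devices[dev] = {k: int(v) for k, v in (p.split('=') for p in pairs)}
    return devices

sample = ("8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0\n"
          "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021")
stats = parse_io_stat(sample)
print(stats["8:16"]["rbytes"])   # 1459200
```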
1549
+
1550
+ io.cost.qos
1551
+ A read-write nested-keyed file which exists only on the root
1552
+ cgroup.
1553
+
1554
+ This file configures the Quality of Service of the IO cost
1555
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
1556
+ currently implements "io.weight" proportional control. Lines
1557
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
1558
+ line for a given device is populated on the first write for
1559
+ the device on "io.cost.qos" or "io.cost.model". The following
1560
+ nested keys are defined.
1561
+
1562
+ ====== =====================================
1563
+ enable Weight-based control enable
1564
+ ctrl "auto" or "user"
1565
+ rpct Read latency percentile [0, 100]
1566
+ rlat Read latency threshold
1567
+ wpct Write latency percentile [0, 100]
1568
+ wlat Write latency threshold
1569
+ min Minimum scaling percentage [1, 10000]
1570
+ max Maximum scaling percentage [1, 10000]
1571
+ ====== =====================================
1572
+
1573
+ The controller is disabled by default and can be enabled by
1574
+ setting "enable" to 1. "rpct" and "wpct" parameters default
1575
+ to zero and the controller uses internal device saturation
1576
+ state to adjust the overall IO rate between "min" and "max".
1577
+
1578
+ When a better control quality is needed, latency QoS
1579
+ parameters can be configured. For example::
1580
+
1581
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
1582
+
1583
+ shows that on sdb, the controller is enabled, will consider
1584
+ the device saturated if the 95th percentile of read completion
1585
+ latencies is above 75ms or write 150ms, and adjust the overall
1586
+ IO issue rate between 50% and 150% accordingly.
1587
+
1588
+ The lower the saturation point, the better the latency QoS at
1589
+ the cost of aggregate bandwidth. The narrower the allowed
1590
+ adjustment range between "min" and "max", the more closely
1591
+ the IO behavior conforms to the cost model. Note that the IO issue
1592
+ base rate may be far off from 100% and setting "min" and "max"
1593
+ blindly can lead to a significant loss of device capacity or
1594
+ control quality. "min" and "max" are useful for regulating
1595
+ devices which show wide temporary behavior changes - e.g. a
1596
+ ssd which accepts writes at the line speed for a while and
1597
+ then completely stalls for multiple seconds.
1598
+
1599
+ When "ctrl" is "auto", the parameters are controlled by the
1600
+ kernel and may change automatically. Setting "ctrl" to "user"
1601
+ or setting any of the percentile and latency parameters puts
1602
+ it into "user" mode and disables the automatic changes. The
1603
+ automatic mode can be restored by setting "ctrl" to "auto".
1604
+
1605
+ io.cost.model
1606
+ A read-write nested-keyed file which exists only on the root
1607
+ cgroup.
1608
+
1609
+ This file configures the cost model of the IO cost model based
1610
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
1611
+ implements "io.weight" proportional control. Lines are keyed
1612
+ by $MAJ:$MIN device numbers and not ordered. The line for a
1613
+ given device is populated on the first write for the device on
1614
+ "io.cost.qos" or "io.cost.model". The following nested keys
1615
+ are defined.
1616
+
1617
+ ===== ================================
1618
+ ctrl "auto" or "user"
1619
+ model The cost model in use - "linear"
1620
+ ===== ================================
1621
+
1622
+ When "ctrl" is "auto", the kernel may change all parameters
1623
+ dynamically. When "ctrl" is set to "user" or any other
1624
+ parameters are written to, "ctrl" becomes "user" and the
1625
+ automatic changes are disabled.
1626
+
1627
+ When "model" is "linear", the following model parameters are
1628
+ defined.
1629
+
1630
+ ============= ========================================
1631
+ [r|w]bps The maximum sequential IO throughput
1632
+ [r|w]seqiops The maximum 4k sequential IOs per second
1633
+ [r|w]randiops The maximum 4k random IOs per second
1634
+ ============= ========================================
1635
+
1636
+ From the above, the builtin linear model determines the base
1637
+ costs of a sequential and random IO and the cost coefficient
1638
+ for the IO size. While simple, this model can cover most
1639
+ common device classes acceptably.
1640
+
1641
+ The IO cost model isn't expected to be accurate in absolute
1642
+ sense and is scaled to the device behavior dynamically.
1643
+
1644
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
1645
+ generate device-specific coefficients.
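A rough sketch of how a linear model can price a single IO (this is not the kernel's actual iocost implementation; the function and parameter names are invented): a fixed per-IO base cost derived from the iops parameters plus a size-proportional part derived from the throughput parameter.

```python
# Illustrative linear IO cost: per-IO base cost plus transfer time.
def io_cost(nbytes, random, bps, seqiops, randiops):
    base = 1.0 / (randiops if random else seqiops)   # seconds per IO
    return base + nbytes / bps                       # plus size cost

# Hypothetical device: 200 MB/s, 50k seq IOPS, 10k rand IOPS.
seq = io_cost(4096, False, 200e6, 50e3, 10e3)
rnd = io_cost(4096, True, 200e6, 50e3, 10e3)
print(rnd > seq)     # random IO costs more under this model -> True
```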
14031646
14041647 io.weight
14051648 A read-write flat-keyed file which exists on non-root cgroups.
....@@ -1464,7 +1707,7 @@
14641707 A read-only nested-key file which exists on non-root cgroups.
14651708
14661709 Shows pressure stall information for IO. See
1467
- Documentation/accounting/psi.txt for details.
1710
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
14681711
14691712
14701713 Writeback
....@@ -1485,9 +1728,9 @@
14851728 of the two is enforced.
14861729
14871730 cgroup writeback requires explicit support from the underlying
1488
-filesystem. Currently, cgroup writeback is implemented on ext2, ext4
1489
-and btrfs. On other filesystems, all writeback IOs are attributed to
1490
-the root cgroup.
1731
+filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
1732
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
1733
+attributed to the root cgroup.
14911734
14921735 There are inherent differences in memory and writeback management
14931736 which affects how cgroup ownership is tracked. Memory is tracked per
....@@ -1537,7 +1780,7 @@
15371780
15381781 The limits are only applied at the peer level in the hierarchy. This means that
15391782 in the diagram below, only groups A, B, and C will influence each other, and
1540
-groups D and F will influence each other. Group G will influence nobody.
1783
+groups D and F will influence each other. Group G will influence nobody::
15411784
15421785 [root]
15431786 / | \
....@@ -1606,6 +1849,60 @@
16061849 duration of time between evaluation events. Windows only elapse
16071850 with IO activity. Idle periods extend the most recent window.
16081851
1852
+IO Priority
1853
+~~~~~~~~~~~
1854
+
1855
+A single attribute controls the behavior of the I/O priority cgroup policy,
1856
+namely the blkio.prio.class attribute. The following values are accepted for
1857
+that attribute:
1858
+
1859
+ no-change
1860
+ Do not modify the I/O priority class.
1861
+
1862
+ none-to-rt
1863
+ For requests that do not have an I/O priority class (NONE),
1864
+ change the I/O priority class into RT. Do not modify
1865
+ the I/O priority class of other requests.
1866
+
1867
+ restrict-to-be
1868
+ For requests that do not have an I/O priority class or that have I/O
1869
+ priority class RT, change it into BE. Do not modify the I/O priority
1870
+ class of requests that have priority class IDLE.
1871
+
1872
+ idle
1873
+ Change the I/O priority class of all requests into IDLE, the lowest
1874
+ I/O priority class.
1875
+
1876
+The following numerical values are associated with the I/O priority policies:
1877
+
1878
++-------------+---+
1879
+| no-change | 0 |
1880
++-------------+---+
1881
+| none-to-rt | 1 |
1882
++-------------+---+
1883
+| rt-to-be | 2 |
1884
++-------------+---+
1885
+| all-to-idle | 3 |
1886
++-------------+---+
1887
+
1888
+The numerical value that corresponds to each I/O priority class is as follows:
1889
+
1890
++-------------------------------+---+
1891
+| IOPRIO_CLASS_NONE | 0 |
1892
++-------------------------------+---+
1893
+| IOPRIO_CLASS_RT (real-time) | 1 |
1894
++-------------------------------+---+
1895
+| IOPRIO_CLASS_BE (best effort) | 2 |
1896
++-------------------------------+---+
1897
+| IOPRIO_CLASS_IDLE | 3 |
1898
++-------------------------------+---+
1899
+
1900
+The algorithm to set the I/O priority class for a request is as follows:
1901
+
1902
+- Translate the I/O priority class policy into a number.
1903
+- Change the request I/O priority class into the maximum of the I/O priority
1904
+ class policy number and the numerical I/O priority class.
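The two-step algorithm above reduces to a single max() over the two numerical tables; a minimal illustrative sketch (the constant names are paraphrased from the tables, not kernel identifiers):

```python
# Numerical values from the tables above.
NO_CHANGE, NONE_TO_RT, RESTRICT_TO_BE, IDLE = 0, 1, 2, 3
CLASS_NONE, CLASS_RT, CLASS_BE, CLASS_IDLE = 0, 1, 2, 3

def apply_prio_policy(policy, ioprio_class):
    # The request's class becomes the maximum of the policy number
    # and its current numerical I/O priority class.
    return max(policy, ioprio_class)

print(apply_prio_policy(NONE_TO_RT, CLASS_NONE))     # NONE -> RT (1)
print(apply_prio_policy(RESTRICT_TO_BE, CLASS_RT))   # RT -> BE (2)
print(apply_prio_policy(RESTRICT_TO_BE, CLASS_IDLE)) # IDLE unchanged (3)
```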
1905
+
16091906 PID
16101907 ---
16111908
....@@ -1646,6 +1943,176 @@
16461943 of a new process would cause a cgroup policy to be violated.
16471944
16481945
1946
+Cpuset
1947
+------
1948
+
1949
+The "cpuset" controller provides a mechanism for constraining
1950
+the CPU and memory node placement of tasks to only the resources
1951
+specified in the cpuset interface files in a task's current cgroup.
1952
+This is especially valuable on large NUMA systems where placing jobs
1953
+on properly sized subsets of the systems with careful processor and
1954
+memory placement to reduce cross-node memory access and contention
1955
+can improve overall system performance.
1956
+
1957
+The "cpuset" controller is hierarchical. That means a cgroup
1958
+cannot use CPUs or memory nodes not allowed in its parent.
1959
+
1960
+
1961
+Cpuset Interface Files
1962
+~~~~~~~~~~~~~~~~~~~~~~
1963
+
1964
+ cpuset.cpus
1965
+ A read-write multiple values file which exists on non-root
1966
+ cpuset-enabled cgroups.
1967
+
1968
+ It lists the requested CPUs to be used by tasks within this
1969
+ cgroup. The actual list of CPUs to be granted, however, is
1970
+ subjected to constraints imposed by its parent and can differ
1971
+ from the requested CPUs.
1972
+
1973
+ The CPU numbers are comma-separated numbers or ranges.
1974
+ For example::
1975
+
1976
+ # cat cpuset.cpus
1977
+ 0-4,6,8-10
1978
+
1979
+ An empty value indicates that the cgroup is using the same
1980
+ setting as the nearest cgroup ancestor with a non-empty
1981
+ "cpuset.cpus" or all the available CPUs if none is found.
1982
+
1983
+ The value of "cpuset.cpus" stays constant until the next update
1984
+ and won't be affected by any CPU hotplug events.
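The comma-separated number/range format used by cpuset.cpus (and cpuset.mems) can be parsed with a short helper; an illustrative sketch:

```python
# Expand a cpuset-style list such as "0-4,6,8-10" into a sorted
# list of individual CPU (or memory node) numbers.
def parse_cpulist(s):
    cpus = set()
    for part in s.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

print(parse_cpulist("0-4,6,8-10"))   # [0, 1, 2, 3, 4, 6, 8, 9, 10]
```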
1985
+
1986
+ cpuset.cpus.effective
1987
+ A read-only multiple values file which exists on all
1988
+ cpuset-enabled cgroups.
1989
+
1990
+ It lists the onlined CPUs that are actually granted to this
1991
+ cgroup by its parent. These CPUs are allowed to be used by
1992
+ tasks within the current cgroup.
1993
+
1994
+ If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
1995
+ all the CPUs from the parent cgroup that can be available to
1996
+ be used by this cgroup. Otherwise, it should be a subset of
1997
+ "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
1998
+ can be granted. In this case, it will be treated just like an
1999
+ empty "cpuset.cpus".
2000
+
2001
+ Its value will be affected by CPU hotplug events.
2002
+
2003
+ cpuset.mems
2004
+ A read-write multiple values file which exists on non-root
2005
+ cpuset-enabled cgroups.
2006
+
2007
+ It lists the requested memory nodes to be used by tasks within
2008
+ this cgroup. The actual list of memory nodes granted, however,
2009
+ is subjected to constraints imposed by its parent and can differ
2010
+ from the requested memory nodes.
2011
+
2012
+ The memory node numbers are comma-separated numbers or ranges.
2013
+ For example::
2014
+
2015
+ # cat cpuset.mems
2016
+ 0-1,3
2017
+
2018
+ An empty value indicates that the cgroup is using the same
2019
+ setting as the nearest cgroup ancestor with a non-empty
2020
+ "cpuset.mems" or all the available memory nodes if none
2021
+ is found.
2022
+
2023
+ The value of "cpuset.mems" stays constant until the next update
2024
+ and won't be affected by any memory nodes hotplug events.
2025
+
2026
+ cpuset.mems.effective
2027
+ A read-only multiple values file which exists on all
2028
+ cpuset-enabled cgroups.
2029
+
2030
+ It lists the onlined memory nodes that are actually granted to
2031
+ this cgroup by its parent. These memory nodes are allowed to
2032
+ be used by tasks within the current cgroup.
2033
+
2034
+ If "cpuset.mems" is empty, it shows all the memory nodes from the
2035
+ parent cgroup that will be available to be used by this cgroup.
2036
+ Otherwise, it should be a subset of "cpuset.mems" unless none of
2037
+ the memory nodes listed in "cpuset.mems" can be granted. In this
2038
+ case, it will be treated just like an empty "cpuset.mems".
2039
+
2040
+ Its value will be affected by memory nodes hotplug events.
2041
+
2042
+ cpuset.cpus.partition
2043
+ A read-write single value file which exists on non-root
2044
+ cpuset-enabled cgroups. This flag is owned by the parent cgroup
2045
+ and is not delegatable.
2046
+
2047
+ It accepts only the following input values when written to.
2048
+
2049
+ "root" - a partition root
2050
+ "member" - a non-root member of a partition
2051
+
2052
+ When set to be a partition root, the current cgroup is the
2053
+ root of a new partition or scheduling domain that comprises
2054
+ itself and all its descendants except those that are separate
2055
+ partition roots themselves and their descendants. The root
2056
+ cgroup is always a partition root.
2057
+
2058
+ There are constraints on where a partition root can be set.
2059
+ It can only be set in a cgroup if all the following conditions
2060
+ are true.
2061
+
2062
+ 1) The "cpuset.cpus" is not empty and the list of CPUs are
2063
+ exclusive, i.e. they are not shared by any of its siblings.
2064
+ 2) The parent cgroup is a partition root.
2065
+ 3) The "cpuset.cpus" is also a proper subset of the parent's
2066
+ "cpuset.cpus.effective".
2067
+ 4) There is no child cgroups with cpuset enabled. This is for
2068
+ eliminating corner cases that have to be handled if such a
2069
+ condition is allowed.
2070
+
2071
+ Setting it to partition root will take the CPUs away from the
2072
+ effective CPUs of the parent cgroup. Once it is set, this
2073
+ file cannot be reverted back to "member" if there are any child
2074
+ cgroups with cpuset enabled.
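The four conditions for becoming a partition root can be checked as a predicate; a hedged sketch (the function signature and representation as Python sets are invented for illustration):

```python
# Sketch of the partition-root conditions listed above.
def can_become_partition_root(cpus, sibling_cpus, parent_is_root,
                              parent_effective, has_cpuset_children):
    return (bool(cpus)                                        # 1) non-empty...
            and all(cpus.isdisjoint(s) for s in sibling_cpus) # ...and exclusive
            and parent_is_root                                # 2) parent is a partition root
            and cpus < parent_effective                       # 3) proper subset of parent's effective CPUs
            and not has_cpuset_children)                      # 4) no cpuset-enabled children

print(can_become_partition_root({0, 1}, [{2, 3}], True, {0, 1, 2, 3}, False))  # True
print(can_become_partition_root({0, 1}, [{1, 2}], True, {0, 1, 2, 3}, False))  # False (not exclusive)
```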
2075
+
2076
+ A parent partition cannot distribute all its CPUs to its
2077
+ child partitions. There must be at least one cpu left in the
2078
+ parent partition.
2079
+
2080
+ Once becoming a partition root, changes to "cpuset.cpus" is
2081
+ generally allowed as long as the first condition above is true,
2082
+ the change will not take away all the CPUs from the parent
2083
+ partition and the new "cpuset.cpus" value is a superset of its
2084
+ children's "cpuset.cpus" values.
2085
+
2086
+ Sometimes, external factors like changes to ancestors'
2087
+ "cpuset.cpus" or cpu hotplug can cause the state of the partition
2088
+ root to change. On read, the "cpuset.sched.partition" file
2089
+ can show the following values.
2090
+
2091
+ "member" Non-root member of a partition
2092
+ "root" Partition root
2093
+ "root invalid" Invalid partition root
2094
+
2095
+ It is a partition root if the first 2 partition root conditions
2096
+ above are true and at least one CPU from "cpuset.cpus" is
2097
+ granted by the parent cgroup.
2098
+
2099
+ A partition root can become invalid if none of CPUs requested
2100
+ in "cpuset.cpus" can be granted by the parent cgroup or the
2101
+ parent cgroup is no longer a partition root itself. In this
2102
+ case, it is not a real partition even though the restriction
2103
+ of the first partition root condition above will still apply.
2104
+ The cpu affinity of all the tasks in the cgroup will then be
2105
+ associated with CPUs in the nearest ancestor partition.
2106
+
2107
+ An invalid partition root can be transitioned back to a
2108
+ real partition root if at least one of the requested CPUs
2109
+ can now be granted by its parent. In this case, the cpu
2110
+ affinity of all the tasks in the formerly invalid partition
2111
+ will be associated to the CPUs of the newly formed partition.
2112
+ Changing the partition state of an invalid partition root to
2113
+ "member" is always allowed even if child cpusets are present.
2114
+
2115
+
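The four conditions for accepting a write of "root" can be sketched as a small model. This is an illustration only, not kernel code; the `Cpuset` class and its fields are hypothetical stand-ins for the cgroup state described above:

```python
from dataclasses import dataclass, field

@dataclass
class Cpuset:
    # Hypothetical stand-in for a cpuset-enabled cgroup, not a kernel structure.
    cpus: set = field(default_factory=set)            # "cpuset.cpus"
    effective_cpus: set = field(default_factory=set)  # "cpuset.cpus.effective"
    is_partition_root: bool = False
    cpuset_enabled: bool = True
    parent: "Cpuset" = None
    children: list = field(default_factory=list)

def can_become_partition_root(cg: Cpuset) -> bool:
    """Model of the four conditions for writing "root" to cpuset.cpus.partition."""
    p = cg.parent
    # 1) "cpuset.cpus" is non-empty and exclusive, i.e. shared with no sibling.
    if not cg.cpus or any(cg.cpus & s.cpus for s in p.children if s is not cg):
        return False
    # 2) The parent cgroup is itself a partition root.
    if not p.is_partition_root:
        return False
    # 3) "cpuset.cpus" is a proper subset of the parent's "cpuset.cpus.effective",
    #    so the parent keeps at least one CPU.
    if not cg.cpus < p.effective_cpus:
        return False
    # 4) There are no child cgroups with cpuset enabled.
    return not any(c.cpuset_enabled for c in cg.children)

# Example: the parent partition owns CPUs 0-3; the child asks for CPUs 0-1.
root = Cpuset(cpus={0, 1, 2, 3}, effective_cpus={0, 1, 2, 3},
              is_partition_root=True)
child = Cpuset(cpus={0, 1}, parent=root)
root.children.append(child)
print(can_become_partition_root(child))  # True
```

A sibling requesting an overlapping CPU set would fail condition 1, which matches the exclusivity requirement above.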
Device controller
-----------------

....@@ -1674,7 +2141,7 @@
----

The "rdma" controller regulates the distribution and accounting of
-of RDMA resources.
+RDMA resources.

RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~
....@@ -1709,6 +2176,33 @@
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23

+HugeTLB
+-------
+
+The HugeTLB controller allows limiting HugeTLB usage per control group and
+enforces the controller limit during page fault.
+
+HugeTLB Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ hugetlb.<hugepagesize>.current
+ Show current usage for "hugepagesize" hugetlb. It exists for all
+ cgroups except the root cgroup.
+
+ hugetlb.<hugepagesize>.max
+ Set/show the hard limit of "hugepagesize" hugetlb usage.
+ The default value is "max". It exists for all cgroups except the
+ root cgroup.
+
+ hugetlb.<hugepagesize>.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+
+ max
+ The number of allocation failures due to the HugeTLB limit
+
+ hugetlb.<hugepagesize>.events.local
+ Similar to hugetlb.<hugepagesize>.events but the fields in the file
+ are local to the cgroup, i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
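The events files use the same flat-keyed "key value" per-line layout as other cgroup v2 files, so they can be read generically. A minimal sketch; the file path and sample contents are made up for illustration:

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse a cgroup v2 flat-keyed file, e.g. hugetlb.<hugepagesize>.events:
    one "key value" pair per line, values being integer counters."""
    out = {}
    for line in text.splitlines():
        key, value = line.split()
        out[key] = int(value)
    return out

# Hypothetical contents of /sys/fs/cgroup/job/hugetlb.2MB.events:
sample = "max 7\n"
print(parse_flat_keyed(sample))  # {'max': 7}
```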

Misc
----
....@@ -1915,10 +2409,12 @@

wbc_init_bio(@wbc, @bio)
 Should be called for each bio carrying writeback data and
- associates the bio with the inode's owner cgroup. Can be
- called anytime between bio allocation and submission.
+ associates the bio with the inode's owner cgroup and the
+ corresponding request queue. This must be called after
+ a queue (device) has been associated with the bio and
+ before submission.

- wbc_account_io(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @page, @bytes)
 Should be called for each data segment being written out.
 While this function doesn't care exactly when it's called
 during the writeback session, it's the easiest and most
....@@ -1935,7 +2431,7 @@
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.


....@@ -2145,8 +2641,10 @@
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
-reserve. A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve. A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.

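The "pressure proportional to overage" behaviour can be illustrated with a toy model. This is a deliberate simplification for intuition only, not the arithmetic the kernel actually uses in reclaim:

```python
def reclaim_scan_fraction(usage: int, effective_low: int) -> float:
    """Toy model of proportional reclaim pressure: a cgroup at or below
    its effective low is fully protected; above it, the fraction of its
    usage eligible for scanning grows with the overage and approaches 1
    as the protection becomes negligible relative to usage."""
    if usage <= effective_low:
        return 0.0  # within the protected reserve, no reclaim pressure
    return (usage - effective_low) / usage

# A cgroup at twice its effective low sees half of its usage eligible:
print(reclaim_scan_fraction(200, 100))  # 0.5
print(reclaim_scan_fraction(100, 100))  # 0.0
```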
The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.