hc
2024-02-20 102a0743326a03cd1a1202ceda21e175b7d3575c
kernel/tools/perf/Documentation/perf-stat.txt
....@@ -47,6 +47,10 @@
4747 param1 and param2 are defined as formats for the PMU in
4848 /sys/bus/event_source/devices/<pmu>/format/*
4949
50
+ 'percore' is an event qualifier that sums up the event counts for both
51
+ hardware threads in a core. For example:
52
+ perf stat -A -a -e cpu/event,percore=1/,otherevent ...
53
+
5054 - a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
5155 where M, N, K are numbers (in decimal, hex, octal format).
5256 Acceptable values for each of 'config', 'config1' and 'config2'
....@@ -54,7 +58,7 @@
5458 /sys/bus/event_source/devices/<pmu>/format/*
5559
5660 Note that the last two syntaxes support prefix and glob matching in
57
- the PMU name to simplify creation of events accross multiple instances
61
+ the PMU name to simplify creation of events across multiple instances
5862 of the same type of PMU in large systems (e.g. memory controller PMUs).
5963 Multiple PMU instances are typical for uncore PMUs, so the prefix
6064 'uncore_' is also ignored when performing this match.
....@@ -71,14 +75,23 @@
7175 --tid=<tid>::
7276 stat events on existing thread id (comma separated list)
7377
78
+ifdef::HAVE_LIBPFM[]
79
+--pfm-events events::
80
+Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
81
+including support for event filters. For example '--pfm-events
82
+inst_retired:any_p:u:c=1:i'. More than one event can be passed to the
83
+option using the comma separator. Hardware events and generic hardware
84
+events cannot be mixed together. The latter must be used with the -e
85
+option. The -e option and this one can be mixed and matched. Events
86
+can be grouped using the {} notation.
87
+endif::HAVE_LIBPFM[]
7488
7589 -a::
7690 --all-cpus::
7791 system-wide collection from all CPUs (default if no target is specified)
7892
79
--c::
80
---scale::
81
- scale/normalize counter values
93
+--no-scale::
94
+ Don't scale/normalize counter values
8295
8396 -d::
8497 --detailed::
....@@ -94,7 +107,9 @@
94107
95108 -B::
96109 --big-num::
97
- print large numbers with thousands' separators according to locale
110
+ print large numbers with thousands' separators according to locale.
111
+ Enabled by default. Use "--no-big-num" to disable.
112
+ Default setting can be changed with "perf config stat.big-num=false".
98113
99114 -C::
100115 --cpu=::
....@@ -151,6 +166,11 @@
151166 If wanting to monitor, say, 'cycles' for a cgroup and also for system wide, this
152167 command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.
153168
169
+--for-each-cgroup name::
170
+Expand event list for each cgroup in "name" (allow multiple cgroups separated
171
+by comma). This has the same effect as repeating the -e and -G options for
172
+each event x name. This option cannot be used with -G/--cgroup option.
173
+
154174 -o file::
155175 --output file::
156176 Print the output into the designated file.
....@@ -165,6 +185,47 @@
165185 3>results perf stat --log-fd 3 -- $cmd
166186 3>>results perf stat --log-fd 3 --append -- $cmd
167187
188
+--control=fifo:ctl-fifo[,ack-fifo]::
189
+--control=fd:ctl-fd[,ack-fd]::
190
+ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
191
+Listen on ctl-fd descriptor for command to control measurement ('enable': enable events,
192
+'disable': disable events). Measurements can be started with events disabled using
193
+--delay=-1 option. Optionally send control command completion ('ack\n') to ack-fd descriptor
194
+to synchronize with the controlling process. Example of bash shell script to enable and
195
+disable events during measurements:
196
+
197
+ #!/bin/bash
198
+
199
+ ctl_dir=/tmp/
200
+
201
+ ctl_fifo=${ctl_dir}perf_ctl.fifo
202
+ test -p ${ctl_fifo} && unlink ${ctl_fifo}
203
+ mkfifo ${ctl_fifo}
204
+ exec {ctl_fd}<>${ctl_fifo}
205
+
206
+ ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
207
+ test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
208
+ mkfifo ${ctl_ack_fifo}
209
+ exec {ctl_fd_ack}<>${ctl_ack_fifo}
210
+
211
+ perf stat -D -1 -e cpu-cycles -a -I 1000 \
212
+ --control fd:${ctl_fd},${ctl_fd_ack} \
213
+ -- sleep 30 &
214
+ perf_pid=$!
215
+
216
+ sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
217
+ sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
218
+
219
+ exec {ctl_fd_ack}>&-
220
+ unlink ${ctl_ack_fifo}
221
+
222
+ exec {ctl_fd}>&-
223
+ unlink ${ctl_fifo}
224
+
225
+ wait -n ${perf_pid}
226
+ exit $?
227
+
228
+
168229 --pre::
169230 --post::
170231 Pre and post measurement hooks, e.g.:
....@@ -176,6 +237,8 @@
176237 Print count deltas every N milliseconds (minimum: 1ms)
177238 The overhead percentage could be high in some cases, for instance with small, sub 100ms intervals. Use with caution.
178239 example: 'perf stat -I 1000 -e cycles -a sleep 5'
240
+
241
+If the metric exists, it is calculated from the counts generated in this interval and the metric is printed after #.
179242
180243 --interval-count times::
181244 Print count deltas for fixed number of times.
....@@ -201,6 +264,13 @@
201264 socket number and the number of online processors on that socket. This is
202265 useful to gauge the amount of aggregation.
203266
267
+--per-die::
268
+Aggregate counts per processor die for system-wide mode measurements. This
269
+is a useful mode to detect imbalance between dies. To enable this mode,
270
+use --per-die in addition to -a (system-wide). The output includes the
271
+die number and the number of online processors on that die. This is
272
+useful to gauge the amount of aggregation.
273
+
204274 --per-core::
205275 Aggregate counts per physical processor for system-wide mode measurements. This
206276 is a useful mode to detect imbalance between physical cores. To enable this mode,
....@@ -211,15 +281,40 @@
211281 Aggregate counts per monitored threads, when monitoring threads (-t option)
212282 or processes (-p option).
213283
284
+--per-node::
285
+Aggregate counts per NUMA node for system-wide mode measurements. This
286
+is a useful mode to detect imbalance between NUMA nodes. To enable this
287
+mode, use --per-node in addition to -a (system-wide).
288
+
214289 -D msecs::
215290 --delay msecs::
216
-After starting the program, wait msecs before measuring. This is useful to
217
-filter out the startup phase of the program, which is often very different.
291
+After starting the program, wait msecs before measuring (-1: start with events
292
+disabled). This is useful to filter out the startup phase of the program,
293
+which is often very different.
218294
219295 -T::
220296 --transaction::
221297
222298 Print statistics of transactional execution if supported.
299
+
300
+--metric-no-group::
301
+By default, events to compute a metric are placed in weak groups. The
302
+group tries to enforce scheduling all or none of the events. The
303
+--metric-no-group option places events outside of groups and may
304
+increase the chance of the event being scheduled - leading to more
305
+accuracy. However, as events may not be scheduled together accuracy
306
+for metrics like instructions per cycle can be lower - as both events
307
+may no longer be measured at the same time.
308
+
309
+--metric-no-merge::
310
+By default metric events in different weak groups can be shared if one
311
+group contains all the events needed by another. In such cases one
312
+group will be eliminated reducing event multiplexing and making it so
313
+that certain groups of metrics sum to 100%. A downside to sharing a
314
+group is that the group may require multiplexing and so accuracy for a
315
+small group that need not have multiplexing is lowered. This option
316
+forbids the event merging logic from sharing events between groups and
317
+may be used to increase accuracy in this case.
223318
224319 STAT RECORD
225320 -----------
....@@ -239,6 +334,9 @@
239334
240335 --per-socket::
241336 Aggregate counts per processor socket for system-wide mode measurements.
337
+
338
+--per-die::
339
+Aggregate counts per processor die for system-wide mode measurements.
242340
243341 --per-core::
244342 Aggregate counts per physical processor for system-wide mode measurements.
....@@ -270,6 +368,11 @@
270368 For best results it is usually a good idea to use it with interval
271369 mode like -I 1000, as the bottleneck of workloads can change often.
272370
371
+This enables --metric-only, unless overridden with --no-metric-only.
372
+
373
+The following restrictions only apply to older Intel CPUs and Atom;
374
+on newer CPUs (IceLake and later) TopDown can be collected for any thread:
375
+
273376 The top down metrics are collected per core instead of per
274377 CPU thread. Per core mode is automatically enabled
275378 and -a (global monitoring) is needed, requiring root rights or
....@@ -280,8 +383,6 @@
280383 echo 0 > /proc/sys/kernel/nmi_watchdog
281384 for best results. Otherwise the bottlenecks may be inconsistent
282385 on workload with changing phases.
283
-
284
-This enables --metric-only, unless overriden with --no-metric-only.
285386
286387 To interpret the results it is usually needed to know on which
287388 CPUs the workload runs on. If needed the CPUs can be forced using
....@@ -314,6 +415,24 @@
314415
315416 Users who want to get the actual value can apply --no-metric-only.
316417
418
+--all-kernel::
419
+Configure all used events to run in kernel space.
420
+
421
+--all-user::
422
+Configure all used events to run in user space.
423
+
424
+--percore-show-thread::
425
+The event modifier "percore" sums up the event counts
426
+for all hardware threads in a core and shows the counts per core.
427
+
428
+This option with event modifier "percore" enabled also sums up the event
429
+counts for all hardware threads in a core but shows the summed counts per
430
+hardware thread. This is essentially a replacement for the any bit and
431
+convenient for post processing.
432
+
433
+--summary::
434
+Print summary for interval mode (-I).
435
+
317436 EXAMPLES
318437 --------
319438