hc
2023-12-08 01573e231f18eb2d99162747186f59511f56b64d
kernel/Documentation/trace/ftrace.rst
....@@ -24,17 +24,17 @@
2424 performance issues that take place outside of user-space.
2525
2626 Although ftrace is typically considered the function tracer, it
27
-is really a frame work of several assorted tracing utilities.
27
+is really a framework of several assorted tracing utilities.
2828 There's latency tracing to examine what occurs between interrupts
2929 disabled and enabled, as well as for preemption and from a time
3030 a task is woken to the task is actually scheduled in.
3131
3232 One of the most common uses of ftrace is the event tracing.
33
-Through out the kernel is hundreds of static event points that
33
+Throughout the kernel is hundreds of static event points that
3434 can be enabled via the tracefs file system to see what is
3535 going on in certain parts of the kernel.
3636
37
-See events.txt for more information.
37
+See events.rst for more information.
3838
3939
4040 Implementation Details
....@@ -76,6 +76,9 @@
7676 All files located in the tracefs file system will be located in that
7777 debugfs file system directory as well.
7878
79
+ In order to not automount tracefs in the debugfs filesystem, enable the
80
+ defconfig option CONFIG_TRACEFS_DISABLE_AUTOMOUNT.
81
+
7982 .. attention::
8083
8184 Any selected ftrace option will also create the tracefs file system.
....@@ -95,7 +98,8 @@
9598 current_tracer:
9699
97100 This is used to set or display the current tracer
98
- that is configured.
101
+ that is configured. Changing the current tracer clears
102
+ the ring buffer content as well as the "snapshot" buffer.
99103
100104 available_tracers:
101105
....@@ -124,8 +128,13 @@
124128 trace:
125129
126130 This file holds the output of the trace in a human
127
- readable format (described below). Note, tracing is temporarily
128
- disabled while this file is being read (opened).
131
+ readable format (described below). Opening this file for
132
+ writing with the O_TRUNC flag clears the ring buffer content.
133
+ Note, this file is not a consumer. If tracing is off
134
+ (no tracer running, or tracing_on is zero), it will produce
135
+ the same output each time it is read. When tracing is on,
136
+ it may produce inconsistent results as it tries to read
137
+ the entire buffer without consuming it.
129138
130139 trace_pipe:
131140
....@@ -139,8 +148,7 @@
139148 will not be read again with a sequential read. The
140149 "trace" file is static, and if the tracer is not
141150 adding more data, it will display the same
142
- information every time it is read. This file will not
143
- disable tracing while being read.
151
+ information every time it is read.
144152
145153 trace_options:
146154
....@@ -183,7 +191,8 @@
183191 CPU buffer and not total size of all buffers. The
184192 trace buffers are allocated in pages (blocks of memory
185193 that the kernel uses for allocation, usually 4 KB in size).
186
- If the last page allocated has room for more bytes
194
+ A few extra pages may be allocated to accommodate buffer management
195
+ meta-data. If the last page allocated has room for more bytes
187196 than requested, the rest of the page will be used,
188197 making the actual allocation bigger than requested or shown.
189198 ( Note, the size may not be a multiple of the page size
....@@ -233,6 +242,12 @@
233242 This interface also allows for commands to be used. See the
234243 "Filter commands" section for more details.
235244
245
+ As a speed up, since processing strings can be quite expensive
246
+ and requires a check of all functions registered to tracing, instead
247
+ an index can be written into this file. A number (starting with "1")
248
+ written will instead select the same corresponding at the line position
249
+ of the "available_filter_functions" file.
250
+
236251 set_ftrace_notrace:
237252
238253 This has an effect opposite to that of
....@@ -251,11 +266,38 @@
251266 traced by the function tracer as well. This option will also
252267 cause PIDs of tasks that exit to be removed from the file.
253268
269
+ set_ftrace_notrace_pid:
270
+
271
+ Have the function tracer ignore threads whose PID are listed in
272
+ this file.
273
+
274
+ If the "function-fork" option is set, then when a task whose
275
+ PID is listed in this file forks, the child's PID will
276
+ automatically be added to this file, and the child will not be
277
+ traced by the function tracer as well. This option will also
278
+ cause PIDs of tasks that exit to be removed from the file.
279
+
280
+ If a PID is in both this file and "set_ftrace_pid", then this
281
+ file takes precedence, and the thread will not be traced.
282
+
254283 set_event_pid:
255284
256285 Have the events only trace a task with a PID listed in this file.
257286 Note, sched_switch and sched_wake_up will also trace events
258287 listed in this file.
288
+
289
+ To have the PIDs of children of tasks with their PID in this file
290
+ added on fork, enable the "event-fork" option. That option will also
291
+ cause the PIDs of tasks to be removed from this file when the task
292
+ exits.
293
+
294
+ set_event_notrace_pid:
295
+
296
+ Have the events not trace a task with a PID listed in this file.
297
+ Note, sched_switch and sched_wakeup will trace threads not listed
298
+ in this file, even if a thread's PID is in the file if the
299
+ sched_switch or sched_wakeup events also trace a thread that should
300
+ be traced.
259301
260302 To have the PIDs of children of tasks with their PID in this file
261303 added on fork, enable the "event-fork" option. That option will also
....@@ -337,11 +379,11 @@
337379
338380 kprobe_events:
339381
340
- Enable dynamic trace points. See kprobetrace.txt.
382
+ Enable dynamic trace points. See kprobetrace.rst.
341383
342384 kprobe_profile:
343385
344
- Dynamic trace points stats. See kprobetrace.txt.
386
+ Dynamic trace points stats. See kprobetrace.rst.
345387
346388 max_graph_depth:
347389
....@@ -374,7 +416,7 @@
374416
375417 By default, 128 comms are saved (see "saved_cmdlines" above). To
376418 increase or decrease the amount of comms that are cached, echo
377
- in a the number of comms to cache, into this file.
419
+ the number of comms to cache into this file.
378420
379421 saved_tgids:
380422
....@@ -462,7 +504,7 @@
462504
463505 mono_raw:
464506 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
465
- which is montonic but is not subject to any rate adjustments
507
+ which is monotonic but is not subject to any rate adjustments
466508 and ticks at the same rate as the hardware clocksource.
467509
468510 boot:
....@@ -481,6 +523,9 @@
481523 To set a clock, simply echo the clock name into this file::
482524
483525 # echo global > trace_clock
526
+
527
+ Setting a clock clears the ring buffer content as well as the
528
+ "snapshot" buffer.
484529
485530 trace_marker:
486531
....@@ -519,14 +564,14 @@
519564
520565 trace_marker_raw:
521566
522
- This is similar to trace_marker above, but is meant for for binary data
567
+ This is similar to trace_marker above, but is meant for binary data
523568 to be written to it, where a tool can be used to parse the data
524569 from trace_pipe_raw.
525570
526571 uprobe_events:
527572
528573 Add dynamic tracepoints in programs.
529
- See uprobetracer.txt
574
+ See uprobetracer.rst
530575
531576 uprobe_profile:
532577
....@@ -547,19 +592,19 @@
547592 files at various levels that can enable the tracepoints
548593 when a "1" is written to them.
549594
550
- See events.txt for more information.
595
+ See events.rst for more information.
551596
552597 set_event:
553598
554599 By echoing in the event into this file, will enable that event.
555600
556
- See events.txt for more information.
601
+ See events.rst for more information.
557602
558603 available_events:
559604
560605 A list of events that can be enabled in tracing.
561606
562
- See events.txt for more information.
607
+ See events.rst for more information.
563608
564609 timestamp_mode:
565610
....@@ -759,6 +804,37 @@
759804 tracers from tracing simply echo "nop" into
760805 current_tracer.
761806
807
+Error conditions
808
+----------------
809
+
810
+ For most ftrace commands, failure modes are obvious and communicated
811
+ using standard return codes.
812
+
813
+ For other more involved commands, extended error information may be
814
+ available via the tracing/error_log file. For the commands that
815
+ support it, reading the tracing/error_log file after an error will
816
+ display more detailed information about what went wrong, if
817
+ information is available. The tracing/error_log file is a circular
818
+ error log displaying a small number (currently, 8) of ftrace errors
819
+ for the last (8) failed commands.
820
+
821
+ The extended error information and usage takes the form shown in
822
+ this example::
823
+
824
+ # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
825
+ echo: write error: Invalid argument
826
+
827
+ # cat /sys/kernel/debug/tracing/error_log
828
+ [ 5348.887237] location: error: Couldn't yyy: zzz
829
+ Command: xxx
830
+ ^
831
+ [ 7517.023364] location: error: Bad rrr: sss
832
+ Command: ppp qqq
833
+ ^
834
+
835
+ To clear the error log, echo the empty string into it::
836
+
837
+ # echo > /sys/kernel/debug/tracing/error_log
762838
763839 Examples of using the tracer
764840 ----------------------------
....@@ -914,8 +990,8 @@
914990 current trace and the next trace.
915991
916992 - '$' - greater than 1 second
917
- - '@' - greater than 100 milisecond
918
- - '*' - greater than 10 milisecond
993
+ - '@' - greater than 100 millisecond
994
+ - '*' - greater than 10 millisecond
919995 - '#' - greater than 1000 microsecond
920996 - '!' - greater than 100 microsecond
921997 - '+' - greater than 10 microsecond
....@@ -1080,6 +1156,12 @@
10801156 the trace displays additional information about the
10811157 latency, as described in "Latency trace format".
10821158
1159
+ pause-on-trace
1160
+ When set, opening the trace file for read, will pause
1161
+ writing to the ring buffer (as if tracing_on was set to zero).
1162
+ This simulates the original behavior of the trace file.
1163
+ When the file is closed, tracing will be enabled again.
1164
+
10831165 record-cmd
10841166 When any event or tracer is enabled, a hook is enabled
10851167 in the sched_switch trace point to fill comm cache
....@@ -1131,6 +1213,8 @@
11311213 tasks fork. Also, when tasks with PIDs in set_event_pid exit,
11321214 their PIDs will be removed from the file.
11331215
1216
+ This affects PIDs listed in set_event_notrace_pid as well.
1217
+
11341218 function-trace
11351219 The latency tracers will enable function tracing
11361220 if this option is enabled (default it is). When
....@@ -1144,6 +1228,8 @@
11441228 when those tasks fork. Also, when tasks with PIDs in
11451229 set_ftrace_pid exit, their PIDs will be removed from the
11461230 file.
1231
+
1232
+ This affects PIDs in set_ftrace_notrace_pid as well.
11471233
11481234 display-graph
11491235 When set, the latency tracers (irqsoff, wakeup, etc) will
....@@ -1311,7 +1397,7 @@
13111397 => x86_64_start_reservations
13121398 => x86_64_start_kernel
13131399
1314
-Here we see that that we had a latency of 16 microseconds (which is
1400
+Here we see that we had a latency of 16 microseconds (which is
13151401 very good). The _raw_spin_lock_irq in run_timer_softirq disabled
13161402 interrupts. The difference between the 16 and the displayed
13171403 timestamp 25us occurred because the clock was incremented
....@@ -1370,7 +1456,7 @@
13701456 => __blk_run_queue_uncond
13711457 => __blk_run_queue
13721458 => blk_queue_bio
1373
- => generic_make_request
1459
+ => submit_bio_noacct
13741460 => submit_bio
13751461 => submit_bh
13761462 => __ext3_get_inode_loc
....@@ -1396,6 +1482,58 @@
13961482 overhead may extend the latency times. But nevertheless, this
13971483 trace has provided some very helpful debugging information.
13981484
1485
+If we prefer function graph output instead of function, we can set
1486
+display-graph option::
1487
+
1488
+ with echo 1 > options/display-graph
1489
+
1490
+ # tracer: irqsoff
1491
+ #
1492
+ # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1493
+ # --------------------------------------------------------------------
1494
+ # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1495
+ # -----------------
1496
+ # | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1497
+ # -----------------
1498
+ # => started at: free_debug_processing
1499
+ # => ended at: return_to_handler
1500
+ #
1501
+ #
1502
+ # _-----=> irqs-off
1503
+ # / _----=> need-resched
1504
+ # | / _---=> hardirq/softirq
1505
+ # || / _--=> preempt-depth
1506
+ # ||| /
1507
+ # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
1508
+ # | | | | |||| | | | | | |
1509
+ 0 us | 0) bash-1507 | d... | 0.000 us | _raw_spin_lock_irqsave();
1510
+ 0 us | 0) bash-1507 | d..1 | 0.378 us | do_raw_spin_trylock();
1511
+ 1 us | 0) bash-1507 | d..2 | | set_track() {
1512
+ 2 us | 0) bash-1507 | d..2 | | save_stack_trace() {
1513
+ 2 us | 0) bash-1507 | d..2 | | __save_stack_trace() {
1514
+ 3 us | 0) bash-1507 | d..2 | | __unwind_start() {
1515
+ 3 us | 0) bash-1507 | d..2 | | get_stack_info() {
1516
+ 3 us | 0) bash-1507 | d..2 | 0.351 us | in_task_stack();
1517
+ 4 us | 0) bash-1507 | d..2 | 1.107 us | }
1518
+ [...]
1519
+ 3750 us | 0) bash-1507 | d..1 | 0.516 us | do_raw_spin_unlock();
1520
+ 3750 us | 0) bash-1507 | d..1 | 0.000 us | _raw_spin_unlock_irqrestore();
1521
+ 3764 us | 0) bash-1507 | d..1 | 0.000 us | tracer_hardirqs_on();
1522
+ bash-1507 0d..1 3792us : <stack trace>
1523
+ => free_debug_processing
1524
+ => __slab_free
1525
+ => kmem_cache_free
1526
+ => vm_area_free
1527
+ => remove_vma
1528
+ => exit_mmap
1529
+ => mmput
1530
+ => begin_new_exec
1531
+ => load_elf_binary
1532
+ => search_binary_handler
1533
+ => __do_execve_file.isra.32
1534
+ => __x64_sys_execve
1535
+ => do_syscall_64
1536
+ => entry_SYSCALL_64_after_hwframe
13991537
14001538 preemptoff
14011539 ----------
....@@ -1603,7 +1741,7 @@
16031741 => __blk_run_queue_uncond
16041742 => __blk_run_queue
16051743 => blk_queue_bio
1606
- => generic_make_request
1744
+ => submit_bio_noacct
16071745 => submit_bio
16081746 => submit_bh
16091747 => ext3_bread
....@@ -2029,6 +2167,8 @@
20292167 # cat trace
20302168 # tracer: hwlat
20312169 #
2170
+ # entries-in-buffer/entries-written: 13/13 #P:8
2171
+ #
20322172 # _-----=> irqs-off
20332173 # / _----=> need-resched
20342174 # | / _---=> hardirq/softirq
....@@ -2036,12 +2176,18 @@
20362176 # ||| / delay
20372177 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
20382178 # | | | |||| | |
2039
- <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
2040
- <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
2041
- <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
2042
- <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
2043
- <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
2044
- <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
2179
+ <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6
2180
+ <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1
2181
+ <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5
2182
+ <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1
2183
+ <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1
2184
+ <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2
2185
+ <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1
2186
+ <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1
2187
+ <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1
2188
+ <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
2189
+ <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
2190
+ <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1
20452191
20462192
20472193 The above output is somewhat the same in the header. All events will have
....@@ -2051,7 +2197,7 @@
20512197 This is the count of events recorded that were greater than the
20522198 tracing_threshold (See below).
20532199
2054
- inner/outer(us): 12/14
2200
+ inner/outer(us): 11/11
20552201
20562202 This shows two numbers as "inner latency" and "outer latency". The test
20572203 runs in a loop checking a timestamp twice. The latency detected within
....@@ -2059,11 +2205,15 @@
20592205 after the previous timestamp and the next timestamp in the loop is
20602206 the "outer latency".
20612207
2062
- ts:1499801089.066141940
2208
+ ts:1581527483.343962693
20632209
2064
- The absolute timestamp that the event happened.
2210
+ The absolute timestamp that the first latency was recorded in the window.
20652211
2066
- nmi-total:4 nmi-count:1
2212
+ count:6
2213
+
2214
+ The number of times a latency was detected during the window.
2215
+
2216
+ nmi-total:7 nmi-count:1
20672217
20682218 On architectures that support it, if an NMI comes in during the
20692219 test, the time spent in NMI is reported in "nmi-total" (in
....@@ -2541,7 +2691,7 @@
25412691 recordmcount program (located in the scripts directory). This
25422692 program will parse the ELF headers in the C object to find all
25432693 the locations in the .text section that call mcount. Starting
2544
-with gcc verson 4.6, the -mfentry has been added for x86, which
2694
+with gcc version 4.6, the -mfentry has been added for x86, which
25452695 calls "__fentry__" instead of "mcount". Which is called before
25462696 the creation of the stack frame.
25472697
....@@ -2784,6 +2934,38 @@
27842934
27852935 We can see that there's no more lock or preempt tracing.
27862936
2937
+Selecting function filters via index
2938
+------------------------------------
2939
+
2940
+Because processing of strings is expensive (the address of the function
2941
+needs to be looked up before comparing to the string being passed in),
2942
+an index can be used as well to enable functions. This is useful in the
2943
+case of setting thousands of specific functions at a time. By passing
2944
+in a list of numbers, no string processing will occur. Instead, the function
2945
+at the specific location in the internal array (which corresponds to the
2946
+functions in the "available_filter_functions" file), is selected.
2947
+
2948
+::
2949
+
2950
+ # echo 1 > set_ftrace_filter
2951
+
2952
+Will select the first function listed in "available_filter_functions"
2953
+
2954
+::
2955
+
2956
+ # head -1 available_filter_functions
2957
+ trace_initcall_finish_cb
2958
+
2959
+ # cat set_ftrace_filter
2960
+ trace_initcall_finish_cb
2961
+
2962
+ # head -50 available_filter_functions | tail -1
2963
+ x86_pmu_commit_txn
2964
+
2965
+ # echo 1 50 > set_ftrace_filter
2966
+ # cat set_ftrace_filter
2967
+ trace_initcall_finish_cb
2968
+ x86_pmu_commit_txn
27872969
27882970 Dynamic ftrace with the function graph tracer
27892971 ---------------------------------------------
....@@ -2853,7 +3035,9 @@
28533035 function tracer. By default it is enabled (when function tracing is
28543036 enabled in the kernel). If it is disabled, all function tracing is
28553037 disabled. This includes not only the function tracers for ftrace, but
2856
-also for any other uses (perf, kprobes, stack tracing, profiling, etc).
3038
+also for any other uses (perf, kprobes, stack tracing, profiling, etc). It
3039
+cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
3040
+registered.
28573041
28583042 Please disable this with care.
28593043
....@@ -2978,7 +3162,7 @@
29783162 When the function is hit, it will dump the contents of the ftrace
29793163 ring buffer to the console. This is useful if you need to debug
29803164 something, and want to dump the trace when a certain function
2981
- is hit. Perhaps its a function that is called before a tripple
3165
+ is hit. Perhaps it's a function that is called before a triple
29823166 fault happens and does not allow you to get a regular dump.
29833167
29843168 - cpudump:
....@@ -2986,6 +3170,9 @@
29863170 ring buffer for the current CPU to the console. Unlike the "dump"
29873171 command, it only prints out the contents of the ring buffer for the
29883172 CPU that executed the function that triggered the dump.
3173
+
3174
+- stacktrace:
3175
+ When the function is hit, a stack trace is recorded.
29893176
29903177 trace_pipe
29913178 ----------
....@@ -3029,7 +3216,10 @@
30293216
30303217
30313218 Note, reading the trace_pipe file will block until more input is
3032
-added.
3219
+added. This is contrary to the trace file. If any process opened
3220
+the trace file for reading, it will actually disable tracing and
3221
+prevent new entries from being added. The trace_pipe file does
3222
+not have this limitation.
30333223
30343224 trace entries
30353225 -------------
....@@ -3193,7 +3383,7 @@
31933383
31943384 As you can see, the new directory looks similar to the tracing directory
31953385 itself. In fact, it is very similar, except that the buffer and
3196
-events are agnostic from the main director, or from any other
3386
+events are agnostic from the main directory, or from any other
31973387 instances that are created.
31983388
31993389 The files in the new directory work just like the files with the