hc
2023-12-11 d2ccde1c8e90d38cee87a1b0309ad2827f3fd30d
kernel/tools/perf/Documentation/intel-pt.txt
@@ -1,891 +1 @@
1
-Intel Processor Trace
2
-=====================
3
-
4
-Overview
5
-========
6
-
7
-Intel Processor Trace (Intel PT) is an extension of Intel Architecture that
8
-collects information about software execution such as control flow, execution
9
-modes and timings and formats it into highly compressed binary packets.
10
-Technical details are documented in the Intel 64 and IA-32 Architectures
11
-Software Developer Manuals, Chapter 36 Intel Processor Trace.
12
-
13
-Intel PT is first supported in Intel Core M and 5th generation Intel Core
14
-processors that are based on the Intel micro-architecture code name Broadwell.
15
-
16
-Trace data is collected by 'perf record' and stored within the perf.data file.
17
-See below for options to 'perf record'.
18
-
19
-Trace data must be 'decoded' which involves walking the object code and matching
20
-the trace data packets. For example a TNT packet only tells whether a
21
-conditional branch was taken or not taken, so to make use of that packet the
22
-decoder must know precisely which instruction was being executed.
23
-
24
-Decoding is done on-the-fly. The decoder outputs samples in the same format as
25
-samples output by perf hardware events, for example as though the "instructions"
26
-or "branches" events had been recorded. Presently 3 tools support this:
27
-'perf script', 'perf report' and 'perf inject'. See below for more information
28
-on using those tools.
29
-
30
-The main distinguishing feature of Intel PT is that the decoder can determine
31
-the exact flow of software execution. Intel PT can be used to understand why
32
-and how software got to a certain point, or behaved a certain way. The
33
-software does not have to be recompiled, so Intel PT works with debug or release
34
-builds, however the executed images are needed - which makes use in JIT-compiled
35
-environments, or with self-modified code, a challenge. Also symbols need to be
36
-provided to make sense of addresses.
37
-
38
-A limitation of Intel PT is that it produces huge amounts of trace data
39
-(hundreds of megabytes per second per core) which takes a long time to decode,
40
-for example two or three orders of magnitude longer than it took to collect.
41
-Another limitation is the performance impact of tracing, something that will
42
-vary depending on the use-case and architecture.
43
-
44
-
45
-Quickstart
46
-==========
47
-
48
-It is important to start small. That is because it is easy to capture vastly
49
-more data than can possibly be processed.
50
-
51
-The simplest thing to do with Intel PT is userspace profiling of small programs.
52
-Data is captured with 'perf record' e.g. to trace 'ls' userspace-only:
53
-
54
- perf record -e intel_pt//u ls
55
-
56
-And profiled with 'perf report' e.g.
57
-
58
- perf report
59
-
60
-Tracing kernel space as well presents a problem, namely kernel self-modifying
61
-code. A fairly good kernel image is available in /proc/kcore but to get an
62
-accurate image a copy of /proc/kcore needs to be made under the same conditions
63
-as the data capture. A script perf-with-kcore can do that, but beware that the
64
-script makes use of 'sudo' to copy /proc/kcore. If you have perf installed
65
-locally from the source tree you can do:
66
-
67
- ~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls
68
-
69
-which will create a directory named 'pt_ls' and put the perf.data file and
70
-copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. The
71
-'perf report' command then becomes:
72
-
73
- ~/libexec/perf-core/perf-with-kcore report pt_ls
74
-
75
-Because samples are synthesized after-the-fact, the sampling period can be
76
-selected for reporting, e.g. to sample every microsecond:
77
-
78
- ~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge
79
-
80
-See the sections below for more information about the --itrace option.
81
-
82
-Beware: the smaller the period, the more samples are produced, and the
83
-longer it takes to process them.
84
-
85
-Also note that the coarseness of Intel PT timing information will start to
86
-distort the statistical value of the sampling as the sampling period becomes
87
-smaller.
88
-
89
-To represent software control flow, "branches" samples are produced. By default
90
-a branch sample is synthesized for every single branch. To get an idea what
91
-data is available you can use the 'perf script' tool with no parameters, which
92
-will list all the samples.
93
-
94
- perf record -e intel_pt//u ls
95
- perf script
96
-
97
-An interesting field that is not printed by default is 'flags' which can be
98
-displayed as follows:
99
-
100
- perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags
101
-
102
-The flags are "bcrosyiABEx" which stand for branch, call, return, conditional,
103
-system, asynchronous, interrupt, transaction abort, trace begin, trace end, and
104
-in transaction, respectively.
105
-
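As a rough illustration of the flag letters listed above, the following Python sketch expands a flags string into the names given in this document. It assumes the 'flags' field is printed using those single letters. The names `FLAG_NAMES` and `expand_flags` are hypothetical helpers and are not part of perf.

```python
# Mapping copied from the flag list in this document ("bcrosyiABEx").
# FLAG_NAMES and expand_flags are illustrative names, not perf APIs.
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
}


def expand_flags(flags):
    """Return the descriptive name for each known flag letter in `flags`."""
    return [FLAG_NAMES[ch] for ch in flags if ch in FLAG_NAMES]


if __name__ == "__main__":
    print(expand_flags("bc"))
```

Unknown characters are skipped rather than rejected, which is a design choice of this sketch only.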
106
-While it is possible to create scripts to analyze the data, an alternative
107
-approach is available to export the data to a sqlite or postgresql database.
108
-Refer to script export-to-sqlite.py or export-to-postgresql.py for more details,
109
-and to script call-graph-from-sql.py for an example of using the database.
110
-
111
-There is also script intel-pt-events.py which provides an example of how to
112
-unpack the raw data for power events and PTWRITE.
113
-
114
-As mentioned above, it is easy to capture too much data. One way to limit the
115
-data captured is to use 'snapshot' mode which is explained further below.
116
-Refer to 'new snapshot option' and 'Intel PT modes of operation' further below.
117
-
118
-Another problem that will be experienced is decoder errors. They can be caused
119
-by inability to access the executed image, self-modified or JIT-ed code, or the
120
-inability to match side-band information (such as context switches and mmaps)
121
-which results in the decoder not knowing what code was executed.
122
-
123
-There is also the problem of perf not being able to copy the data fast enough,
124
-resulting in data lost because the buffer was full. See 'Buffer handling' below
125
-for more details.
126
-
127
-
128
-perf record
129
-===========
130
-
131
-new event
132
----------
133
-
134
-The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are
135
-selected by providing the PMU name followed by the "config" separated by slashes.
136
-An enhancement has been made to allow default "config" e.g. the option
137
-
138
- -e intel_pt//
139
-
140
-will use a default config value. Currently that is the same as
141
-
142
- -e intel_pt/tsc,noretcomp=0/
143
-
144
-which is the same as
145
-
146
- -e intel_pt/tsc=1,noretcomp=0/
147
-
148
-Note there are now new config terms - see section 'config terms' further below.
149
-
150
-The config terms are listed in /sys/devices/intel_pt/format. They are bit
151
-fields within the config member of the struct perf_event_attr which is
152
-passed to the kernel by the perf_event_open system call. They correspond to bit
153
-fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions:
154
-
155
- $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
156
- /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
157
- /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
158
- /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
159
- /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
160
- /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
161
- /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
162
- /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
163
-
164
-Note that the default config must be overridden for each term i.e.
165
-
166
- -e intel_pt/noretcomp=0/
167
-
168
-is the same as:
169
-
170
- -e intel_pt/tsc=1,noretcomp=0/
171
-
172
-So, to disable TSC packets use:
173
-
174
- -e intel_pt/tsc=0/
175
-
176
-It is also possible to specify the config value explicitly:
177
-
178
- -e intel_pt/config=0x400/
179
-
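The relationship between config terms and the explicit config value can be sketched in Python as below. The bit ranges are copied from the format listing above. The names `FORMAT` and `build_config` are hypothetical and only show how a term such as tsc=1 ends up as config=0x400; perf itself does this when parsing the event.

```python
# Bit ranges copied from the format listing (e.g. "tsc:config:10").
# FORMAT and build_config are illustrative names, not part of perf.
FORMAT = {
    "cyc": (1, 1),
    "cyc_thresh": (19, 22),
    "mtc": (9, 9),
    "mtc_period": (14, 17),
    "noretcomp": (11, 11),
    "psb_period": (24, 27),
    "tsc": (10, 10),
}


def build_config(**terms):
    """Pack term values into a config word using the bit ranges in FORMAT."""
    config = 0
    for name, value in terms.items():
        low, high = FORMAT[name]
        width = high - low + 1
        if value < 0 or value >= (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        config |= value << low
    return config


if __name__ == "__main__":
    print(hex(build_config(tsc=1, noretcomp=0)))  # prints 0x400
```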
180
-Note that, as with all events, the event can be suffixed with event modifiers:
181
-
182
- u userspace
183
- k kernel
184
- h hypervisor
185
- G guest
186
- H host
187
- p precise ip
188
-
189
-'h', 'G' and 'H' are for virtualization which is not supported by Intel PT.
190
-'p' is also not relevant to Intel PT. So only options 'u' and 'k' are
191
-meaningful for Intel PT.
192
-
193
-perf_event_attr is displayed if the -vv option is used e.g.
194
-
195
- ------------------------------------------------------------
196
- perf_event_attr:
197
- type 6
198
- size 112
199
- config 0x400
200
- { sample_period, sample_freq } 1
201
- sample_type IP|TID|TIME|CPU|IDENTIFIER
202
- read_format ID
203
- disabled 1
204
- inherit 1
205
- exclude_kernel 1
206
- exclude_hv 1
207
- enable_on_exec 1
208
- sample_id_all 1
209
- ------------------------------------------------------------
210
- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
211
- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
212
- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
213
- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
214
- ------------------------------------------------------------
215
-
216
-
217
-config terms
218
-------------
219
-
220
-The June 2015 version of Intel 64 and IA-32 Architectures Software Developer
221
-Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features.
222
-Some of the features are reflected in new config terms. All the config terms are
223
-described below.
224
-
225
-tsc Always supported. Produces TSC timestamp packets to provide
226
- timing information. In some cases it is possible to decode
227
- without timing information, for example a per-thread context
228
- that does not overlap executable memory maps.
229
-
230
- The default config selects tsc (i.e. tsc=1).
231
-
232
-noretcomp Always supported. Disables "return compression" so a TIP packet
233
- is produced when a function returns. Causes more packets to be
234
- produced but might make decoding more reliable.
235
-
236
- The default config does not select noretcomp (i.e. noretcomp=0).
237
-
238
-psb_period Allows the frequency of PSB packets to be specified.
239
-
240
- The PSB packet is a synchronization packet that provides a
241
- starting point for decoding or recovery from errors.
242
-
243
- Support for psb_period is indicated by:
244
-
245
- /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
246
-
247
- which contains "1" if the feature is supported and "0"
248
- otherwise.
249
-
250
- Valid values are given by:
251
-
252
- /sys/bus/event_source/devices/intel_pt/caps/psb_periods
253
-
254
- which contains a hexadecimal value, the bits of which represent
255
- valid values e.g. bit 2 set means value 2 is valid.
256
-
257
- The psb_period value is converted to the approximate number of
258
- trace bytes between PSB packets as:
259
-
260
- 2 ^ (value + 11)
261
-
262
- e.g. value 3 means 16KiB between PSBs
263
-
264
- If an invalid value is entered, the error message
265
- will give a list of valid values e.g.
266
-
267
- $ perf record -e intel_pt/psb_period=15/u uname
268
- Invalid psb_period for intel_pt. Valid values are: 0-5
269
-
270
- If MTC packets are selected, the default config selects a value
271
- of 3 (i.e. psb_period=3) or the nearest lower value that is
272
- supported (0 is always supported). Otherwise the default is 0.
273
-
274
- If decoding is expected to be reliable and the buffer is large
275
- then a large PSB period can be used.
276
-
277
- Because a TSC packet is produced with PSB, the PSB period can
278
- also affect the granularity of timing information in the absence
279
- of MTC or CYC.
280
-
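The psb_period arithmetic described above can be sketched in Python as follows. The function names are hypothetical, and the sample mask "3f" is an invented example rather than a value read from real hardware. The sketch assumes the caps file content is a hexadecimal string, with or without a "0x" prefix.

```python
def valid_values(caps_hex):
    """Decode a caps bitmask (hex string) into the list of valid values."""
    mask = int(caps_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def psb_bytes(value):
    """Approximate trace bytes between PSB packets: 2 ^ (value + 11)."""
    return 1 << (value + 11)


if __name__ == "__main__":
    # "3f" is an example mask with bits 0 to 5 set.
    print(valid_values("3f"))
    # psb_period=3 gives 2 ^ 14 bytes, i.e. 16KiB.
    print(psb_bytes(3))
```

The same bitmask decoding would apply to the mtc_periods and cycle_thresholds caps files, since they are described in the same way.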
281
-mtc Produces MTC timing packets.
282
-
283
- MTC packets provide finer grain timestamp information than TSC
284
- packets. MTC packets record time using the hardware crystal
285
- clock (CTC) which is related to TSC packets using a TMA packet.
286
-
287
- Support for this feature is indicated by:
288
-
289
- /sys/bus/event_source/devices/intel_pt/caps/mtc
290
-
291
- which contains "1" if the feature is supported and
292
- "0" otherwise.
293
-
294
- The frequency of MTC packets can also be specified - see
295
- mtc_period below.
296
-
297
-mtc_period Specifies how frequently MTC packets are produced - see mtc
298
- above for how to determine if MTC packets are supported.
299
-
300
- Valid values are given by:
301
-
302
- /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
303
-
304
- which contains a hexadecimal value, the bits of which represent
305
- valid values e.g. bit 2 set means value 2 is valid.
306
-
307
- The mtc_period value is converted to the MTC frequency as:
308
-
309
- CTC-frequency / (2 ^ value)
310
-
311
- e.g. value 3 means one eighth of CTC-frequency
312
-
313
- Where CTC is the hardware crystal clock, the frequency of which
314
- can be related to TSC via values provided in cpuid leaf 0x15.
315
-
316
- If an invalid value is entered, the error message
317
- will give a list of valid values e.g.
318
-
319
- $ perf record -e intel_pt/mtc_period=15/u uname
320
- Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
321
-
322
- The default value is 3 or the nearest lower value
323
- that is supported (0 is always supported).
324
-
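A minimal Python sketch of the mtc_period formula above. The function name is hypothetical, and the 24 MHz CTC frequency is only an example input; the real value would have to be derived from cpuid leaf 0x15 as noted above.

```python
def mtc_frequency(ctc_hz, mtc_period):
    """MTC packet frequency: CTC-frequency / (2 ^ mtc_period)."""
    return ctc_hz / (1 << mtc_period)


if __name__ == "__main__":
    # 24 MHz is an assumed example CTC frequency, not taken from this document.
    print(mtc_frequency(24_000_000, 3))
```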
325
-cyc Produces CYC timing packets.
326
-
327
- CYC packets provide even finer grain timestamp information than
328
- MTC and TSC packets. A CYC packet contains the number of CPU
329
- cycles since the last CYC packet. Unlike MTC and TSC packets,
330
- CYC packets are only sent when another packet is also sent.
331
-
332
- Support for this feature is indicated by:
333
-
334
- /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
335
-
336
- which contains "1" if the feature is supported and
337
- "0" otherwise.
338
-
339
- The number of CYC packets produced can be reduced by specifying
340
- a threshold - see cyc_thresh below.
341
-
342
-cyc_thresh Specifies how frequently CYC packets are produced - see cyc
343
- above for how to determine if CYC packets are supported.
344
-
345
- Valid cyc_thresh values are given by:
346
-
347
- /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
348
-
349
- which contains a hexadecimal value, the bits of which represent
350
- valid values e.g. bit 2 set means value 2 is valid.
351
-
352
- The cyc_thresh value represents the minimum number of CPU cycles
353
- that must have passed before a CYC packet can be sent. The
354
- number of CPU cycles is:
355
-
356
- 2 ^ (value - 1)
357
-
358
- e.g. value 4 means 8 CPU cycles must pass before a CYC packet
359
- can be sent. Note a CYC packet is still only sent when another
360
- packet is sent, not at, e.g. every 8 CPU cycles.
361
-
362
- If an invalid value is entered, the error message
363
- will give a list of valid values e.g.
364
-
365
- $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
366
- Invalid cyc_thresh for intel_pt. Valid values are: 0-12
367
-
368
- CYC packets are not requested by default.
369
-
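The cyc_thresh formula above can be written as a small Python sketch. The function name is hypothetical. The text above does not say how value 0 should be interpreted, so this sketch refuses it rather than guessing.

```python
def cyc_threshold_cycles(value):
    """Minimum CPU cycles before a CYC packet can be sent: 2 ^ (value - 1)."""
    if value < 1:
        # Value 0 is not covered by the formula as stated; left undefined here.
        raise ValueError("formula is only applied for value >= 1 in this sketch")
    return 1 << (value - 1)


if __name__ == "__main__":
    # cyc_thresh=4 means 8 CPU cycles must pass before a CYC packet can be sent.
    print(cyc_threshold_cycles(4))
```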
370
-pt Specifies pass-through which enables the 'branch' config term.
371
-
372
- The default config selects 'pt' if it is available, so a user will
373
- never need to specify this term.
374
-
375
-branch Enable branch tracing. Branch tracing is enabled by default so to
376
- disable branch tracing use 'branch=0'.
377
-
378
- The default config selects 'branch' if it is available.
379
-
380
-ptw Enable PTWRITE packets which are produced when a ptwrite instruction
381
- is executed.
382
-
383
- Support for this feature is indicated by:
384
-
385
- /sys/bus/event_source/devices/intel_pt/caps/ptwrite
386
-
387
- which contains "1" if the feature is supported and
388
- "0" otherwise.
389
-
390
-fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet
391
- provides the address of the ptwrite instruction. In the absence of
392
- fup_on_ptw, the decoder will use the address of the previous branch
393
- if branch tracing is enabled, otherwise the address will be zero.
394
- Note that fup_on_ptw will work even when branch tracing is disabled.
395
-
396
-pwr_evt Enable power events. The power events provide information about
397
- changes to the CPU C-state.
398
-
399
- Support for this feature is indicated by:
400
-
401
- /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
402
-
403
- which contains "1" if the feature is supported and
404
- "0" otherwise.
405
-
406
-
407
-new snapshot option
408
--------------------
409
-
410
-The difference between full trace and snapshot from the kernel's perspective is
411
-that in full trace we don't overwrite trace data that the user hasn't collected
412
-yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let
413
-the trace run and overwrite older data in the buffer so that whenever something
414
-interesting happens, we can stop it and grab a snapshot of what was going on
415
-around that interesting moment.
416
-
417
-To select snapshot mode a new option has been added:
418
-
419
- -S
420
-
421
-Optionally it can be followed by the snapshot size e.g.
422
-
423
- -S0x100000
424
-
425
-The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size
426
-nor snapshot size is specified, then the default is 4MiB for privileged users
427
-(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
428
-If an unprivileged user does not specify mmap pages, the mmap pages will be
429
-reduced as described in the 'new auxtrace mmap size option' section below.
430
-
431
-The snapshot size is displayed if the option -vv is used e.g.
432
-
433
- Intel PT snapshot size: %zu
434
-
435
-
436
-new auxtrace mmap size option
437
----------------------------
438
-
439
-Intel PT buffer size is specified by an addition to the -m option e.g.
440
-
441
- -m,16
442
-
443
-selects a buffer size of 16 pages i.e. 64KiB.
444
-
445
-Note that the existing functionality of -m is unchanged. The auxtrace mmap size
446
-is specified by the optional addition of a comma and the value.
447
-
448
-The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users
449
-(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
450
-If an unprivileged user does not specify mmap pages, the mmap pages will be
451
-reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the
452
-user is likely to get an error as they exceed their mlock limit (Max locked
453
-memory as shown in /proc/self/limits). Note that perf does not count the first
454
-512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu
455
-against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus
456
-their mlock limit (which defaults to 64KiB but is not multiplied by the number
457
-of cpus).
458
-
459
-In full-trace mode, powers of two are allowed for buffer size, with a minimum
460
-size of 2 pages. In snapshot mode, it is the same but the minimum size is
461
-1 page.
462
-
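The buffer size rules above can be sketched in Python as below. The function name and its parameters are hypothetical, and the 4096-byte page size is an assumption consistent with "16 pages i.e. 64KiB" above. This is not how perf itself validates the -m option, only an illustration of the stated constraints.

```python
def aux_buffer_bytes(pages, page_size=4096, snapshot=False):
    """Check an auxtrace mmap page count and return the buffer size in bytes.

    Powers of two are allowed, with a minimum of 2 pages in full-trace mode
    and 1 page in snapshot mode.
    """
    minimum = 1 if snapshot else 2
    is_power_of_two = pages > 0 and (pages & (pages - 1)) == 0
    if not is_power_of_two or pages < minimum:
        raise ValueError(f"invalid auxtrace mmap page count: {pages}")
    return pages * page_size


if __name__ == "__main__":
    # -m,16 selects 16 pages, i.e. 64KiB with 4KiB pages.
    print(aux_buffer_bytes(16))
```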
463
-The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g.
464
-
465
- mmap length 528384
466
- auxtrace mmap length 4198400
467
-
468
-
469
-Intel PT modes of operation
470
----------------------------
471
-
472
-Intel PT can be used in 2 modes:
473
- full-trace mode
474
- snapshot mode
475
-
476
-Full-trace mode traces continuously e.g.
477
-
478
- perf record -e intel_pt//u uname
479
-
480
-Snapshot mode captures the available data when a signal is sent e.g.
481
-
482
- perf record -v -e intel_pt//u -S ./loopy 1000000000 &
483
- [1] 11435
484
- kill -USR2 11435
485
- Recording AUX area tracing snapshot
486
-
487
-Note that the signal sent is SIGUSR2.
488
-Note that "Recording AUX area tracing snapshot" is displayed because the -v
489
-option is used.
490
-
491
-The 2 modes cannot be used together.
492
-
493
-
494
-Buffer handling
495
----------------
496
-
497
-There may be buffer limitations (i.e. single ToPa entry) which means that actual
498
-buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to
499
-provide other sizes, and in particular an arbitrarily large size, multiple
500
-buffers are logically concatenated. However an interrupt must be used to switch
501
-between buffers. That has two potential problems:
502
- a) the interrupt may not be handled in time so that the current buffer
503
- becomes full and some trace data is lost.
504
- b) the interrupts may slow the system and affect the performance
505
- results.
506
-
507
-If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event
508
-which the tools report as an error.
509
-
510
-In full-trace mode, the driver waits for data to be copied out before allowing
511
-the (logical) buffer to wrap-around. If data is not copied out quickly enough,
512
-again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to
513
-wait, the intel_pt event gets disabled. Because it is difficult to know when
514
-that happens, perf tools always re-enable the intel_pt event after copying out
515
-data.
516
-
517
-
518
-Intel PT and build ids
519
-----------------------
520
-
521
-By default "perf record" post-processes the event stream to find all build ids
522
-for executables for all addresses sampled. Deliberately, Intel PT is not
523
-decoded for that purpose (it would take too long). Instead the build ids for
524
-all executables encountered (due to mmap, comm or task events) are included
525
-in the perf.data file.
526
-
527
-To see buildids included in the perf.data file use the command:
528
-
529
- perf buildid-list
530
-
531
-If the perf.data file contains Intel PT data, that is the same as:
532
-
533
- perf buildid-list --with-hits
534
-
535
-
536
-Snapshot mode and event disabling
537
----------------------------------
538
-
539
-In order to make a snapshot, the intel_pt event is disabled using an IOCTL,
540
-namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the
541
-collection of side-band information. In order to prevent that, a dummy
542
-software event has been introduced that permits tracking events (like mmaps) to
543
-continue to be recorded while intel_pt is disabled. That is important to ensure
544
-there is complete side-band information to allow the decoding of subsequent
545
-snapshots.
546
-
547
-A test has been created for that. To find the test:
548
-
549
- perf test list
550
- ...
551
- 23: Test using a dummy software event to keep tracking
552
-
553
-To run the test:
554
-
555
- perf test 23
556
- 23: Test using a dummy software event to keep tracking : Ok
557
-
558
-
559
-perf record modes (nothing new here)
560
-------------------------------------
561
-
562
-perf record essentially operates in one of three modes:
563
- per thread
564
- per cpu
565
- workload only
566
-
567
-"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a
568
-workload).
569
-"per cpu" is selected by -C or -a.
570
-"workload only" mode is selected by not using the other options but providing a
571
-command to run (i.e. the workload).
572
-
573
-In per-thread mode an exact list of threads is traced. There is no inheritance.
574
-Each thread has its own event buffer.
575
-
576
-In per-cpu mode all processes (or processes from the selected cgroup i.e. -G
577
-option, or processes selected with -p or -u) are traced. Each cpu has its own
578
-buffer. Inheritance is allowed.
579
-
580
-In workload-only mode, the workload is traced but with per-cpu buffers.
581
-Inheritance is allowed. Note that you can now trace a workload in per-thread
582
-mode by using the --per-thread option.
583
-
584
-
585
-Privileged vs non-privileged users
586
-----------------------------------
587
-
588
-Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users
589
-have memory limits imposed upon them. That affects what buffer sizes they can
590
-have as outlined above.
591
-
592
-The v4.2 kernel introduced support for a context switch metadata event,
593
-PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes
594
-are scheduled out and in, just not by whom, which is left for the
595
-PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide context,
596
-which in turn requires CAP_SYS_ADMIN.
597
-
598
-Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate context
599
-switches") commit, that introduces these metadata events for further info.
600
-
601
-When working with kernels < v4.2, the following must be taken into account,
602
-as the sched:sched_switch tracepoints will be used to receive such information:
603
-
604
-Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are
605
-not permitted to use tracepoints which means there is insufficient side-band
606
-information to decode Intel PT in per-cpu mode, and potentially workload-only
607
-mode too if the workload creates new processes.
608
-
609
-Note also, that to use tracepoints, read-access to debugfs is required. So if
610
-debugfs is not mounted or the user does not have read-access, it will again not
611
-be possible to decode Intel PT in per-cpu mode.
612
-
613
-
614
-sched_switch tracepoint
615
------------------------
616
-
617
-The sched_switch tracepoint is used to provide side-band data for Intel PT
618
-decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't
619
-available.
620
-
621
-The sched_switch events are automatically added. e.g. the second event shown
622
-below:
623
-
624
- $ perf record -vv -e intel_pt//u uname
625
- ------------------------------------------------------------
626
- perf_event_attr:
627
- type 6
628
- size 112
629
- config 0x400
630
- { sample_period, sample_freq } 1
631
- sample_type IP|TID|TIME|CPU|IDENTIFIER
632
- read_format ID
633
- disabled 1
634
- inherit 1
635
- exclude_kernel 1
636
- exclude_hv 1
637
- enable_on_exec 1
638
- sample_id_all 1
639
- ------------------------------------------------------------
640
- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
641
- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
642
- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
643
- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
644
- ------------------------------------------------------------
645
- perf_event_attr:
646
- type 2
647
- size 112
648
- config 0x108
649
- { sample_period, sample_freq } 1
650
- sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
651
- read_format ID
652
- inherit 1
653
- sample_id_all 1
654
- exclude_guest 1
655
- ------------------------------------------------------------
656
- sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
657
- sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
658
- sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
659
- sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
660
- ------------------------------------------------------------
661
- perf_event_attr:
662
- type 1
663
- size 112
664
- config 0x9
665
- { sample_period, sample_freq } 1
666
- sample_type IP|TID|TIME|IDENTIFIER
667
- read_format ID
668
- disabled 1
669
- inherit 1
670
- exclude_kernel 1
671
- exclude_hv 1
672
- mmap 1
673
- comm 1
674
- enable_on_exec 1
675
- task 1
676
- sample_id_all 1
677
- mmap2 1
678
- comm_exec 1
679
- ------------------------------------------------------------
680
- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
681
- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
682
- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
683
- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
684
- mmap size 528384B
685
- AUX area mmap length 4194304
686
- perf event ring buffer mmapped per cpu
687
- Synthesizing auxtrace information
688
- Linux
689
- [ perf record: Woken up 1 times to write data ]
690
- [ perf record: Captured and wrote 0.042 MB perf.data ]
691
-
692
-Note, the sched_switch event is only added if the user is permitted to use it
693
-and only in per-cpu mode.
694
-
695
-Note also, the sched_switch event is only added if TSC packets are requested.
696
-That is because, in the absence of timing information, the sched_switch events
697
-cannot be matched against the Intel PT trace.
698
-
699
-
700
-perf script
701
-===========
702
-
703
-By default, perf script will decode trace data found in the perf.data file.
704
-This can be further controlled by the new option --itrace.
705
-
706
-
707
-New --itrace option
708
--------------------
709
-
710
-Having no option is the same as
711
-
712
- --itrace
713
-
714
-which, in turn, is the same as
715
-
716
- --itrace=ibxwpe
717
-
718
-The letters are:
719
-
720
- i synthesize "instructions" events
721
- b synthesize "branches" events
722
- x synthesize "transactions" events
723
- w synthesize "ptwrite" events
724
- p synthesize "power" events
725
- c synthesize branches events (calls only)
726
- r synthesize branches events (returns only)
727
- e synthesize tracing error events
728
- d create a debug log
729
- g synthesize a call chain (use with i or x)
730
- l synthesize last branch entries (use with i or x)
731
- s skip initial number of events
732
-
733
-"Instructions" events look like they were recorded by "perf record -e
734
-instructions".
735
-
736
-"Branches" events look like they were recorded by "perf record -e branches". "c"
737
-and "r" can be combined to get calls and returns.
738
-
739
-"Transactions" events correspond to the start or end of transactions. The
740
-'flags' field can be used in perf script to determine whether the event is a
741
-transaction start, commit or abort.
742
-
743
-Note that "instructions", "branches" and "transactions" events depend on code
744
-flow packets which can be disabled by using the config term "branch=0". Refer
745
-to the config terms section above.
746
-
747
-"ptwrite" events record the payload of the ptwrite instruction and whether
748
-"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are
749
-recorded only if the "ptw" config term was used. Refer to the config terms
750
-section above. perf script "synth" field displays "ptwrite" information like
751
-this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was
752
-used.
753
-
754
-"Power" events correspond to power event packets and CBR (core-to-bus ratio)
755
-packets. While CBR packets are always recorded when tracing is enabled, power
756
-event packets are recorded only if the "pwr_evt" config term was used. Refer to
757
-the config terms section above. The power events record information about
758
-C-state changes, whereas CBR is indicative of CPU frequency. perf script
759
-"event,synth" fields display information like this:
760
- cbr: cbr: 22 freq: 2189 MHz (200%)
761
- mwait: hints: 0x60 extensions: 0x1
762
- pwre: hw: 0 cstate: 2 sub-cstate: 0
763
- exstop: ip: 1
764
- pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4
765
-Where:
766
- "cbr" includes the frequency and the percentage of maximum non-turbo
767
- "mwait" shows mwait hints and extensions
768
- "pwre" shows C-state transitions (to a C-state deeper than C0) and
769
- whether initiated by hardware
770
- "exstop" indicates execution stopped and whether the IP was recorded
771
- exactly
772
- "pwrx" indicates return to C0
773
-For more details refer to the Intel 64 and IA-32 Architectures Software
774
-Developer Manuals.
775
-
-Error events show where the decoder lost the trace. Error events
-are quite important. Users must know if what they are seeing is a complete
-picture or not.
-
-The "d" option will cause the creation of a file "intel_pt.log" containing all
-decoded packets and instructions. Note that this option slows down the decoder
-and that the resulting file may be very large.
-
-In addition, the period of the "instructions" event can be specified. e.g.
-
- --itrace=i10us
-
-sets the period to 10us i.e. one instruction sample is synthesized for each 10
-microseconds of trace. Alternatives to "us" are "ms" (milliseconds),
-"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions).
-
-"ms", "us" and "ns" are converted to TSC ticks.
-
-The timing information included with Intel PT does not give the time of every
-instruction. Consequently, for the purpose of sampling, the decoder estimates
-the time since the last timing packet based on 1 tick per instruction. The time
-on the sample is *not* adjusted and reflects the last known value of TSC.
-
-For Intel PT, the default period is 100us.
-
-Setting it to a zero period means "as often as possible".
-
-In the case of Intel PT that is the same as a period of 1 and a unit of
-'instructions' (i.e. --itrace=i1i).
-
-Also the call chain size (default 16, max. 1024) for instructions or
-transactions events can be specified. e.g.
-
- --itrace=ig32
- --itrace=xg32
-
-Also the number of last branch entries (default 64, max. 1024) for instructions or
-transactions events can be specified. e.g.
-
- --itrace=il10
- --itrace=xl10
-
-Note that last branch entries are cleared for each sample, so there is no overlap
-from one sample to the next.
-
-To disable trace decoding entirely, use the option --no-itrace.
-
-It is also possible to skip events generated (instructions, branches, transactions)
-at the beginning. This is useful to ignore initialization code.
-
- --itrace=i0nss1000000
-
-skips the first million instructions.
-
-dump option
------------
-
-perf script has an option (-D) to "dump" the events i.e. display the binary
-data.
-
-When -D is used, Intel PT packets are displayed. The packet decoder does not
-pay attention to PSB packets, but just decodes the bytes - so the packets seen
-by the actual decoder may not be identical in places where the data is corrupt.
-One example of that would be when the buffer-switching interrupt has been too
-slow, and the buffer has been filled completely. In that case, the last packet
-in the buffer might be truncated and immediately followed by a PSB as the trace
-continues in the next buffer.
-
-To disable the display of Intel PT packets, combine the -D option with
---no-itrace.
-
-
-perf report
-===========
-
-By default, perf report will decode trace data found in the perf.data file.
-This can be further controlled by new option --itrace exactly the same as
-perf script, with the exception that the default is --itrace=igxe.
-
-
-perf inject
-===========
-
-perf inject also accepts the --itrace option in which case tracing data is
-removed and replaced with the synthesized events. e.g.
-
- perf inject --itrace -i perf.data -o perf.data.new
-
-Below is an example of using Intel PT with autofdo. It requires autofdo
-(https://github.com/google/autofdo) and gcc version 5. The bubble
-sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)
-amended to take the number of elements as a parameter.
-
- $ gcc-5 -O3 sort.c -o sort_optimized
- $ ./sort_optimized 30000
- Bubble sorting array of 30000 elements
- 2254 ms
-
- $ cat ~/.perfconfig
- [intel-pt]
- mispred-all = on
-
- $ perf record -e intel_pt//u ./sort 3000
- Bubble sorting array of 3000 elements
- 58 ms
- [ perf record: Woken up 2 times to write data ]
- [ perf record: Captured and wrote 3.939 MB perf.data ]
- $ perf inject -i perf.data -o inj --itrace=i100usle --strip
- $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
- $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
- $ ./sort_autofdo 30000
- Bubble sorting array of 30000 elements
- 2155 ms
-
-Note there is currently no advantage to using Intel PT instead of LBR, but
-that may change in the future if greater use is made of the data.
+Documentation for support for Intel Processor Trace within perf tools' has moved to file perf-intel-pt.txt
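
The removed text describes the period syntax for the "instructions" event (for example --itrace=i10us, with units "ms", "us", "ns", "t" or "i", and a zero period meaning the same as i1i). The rule can be sketched roughly as below. This is an illustration only, not perf's own parser: the function name `parse_instructions_period` is hypothetical, and it handles a single period term rather than a full --itrace string with other option letters appended.

```python
import re

# Units listed in the removed documentation text:
# ms, us, ns (time), t (TSC ticks), i (instructions).
_PERIOD_RE = re.compile(r"^i(\d+)(ms|us|ns|t|i)$")


def parse_instructions_period(term):
    """Split one 'i<count><unit>' term into (count, unit).

    Hypothetical helper for illustration; it does not mirror perf's code.
    """
    match = _PERIOD_RE.match(term)
    if match is None:
        raise ValueError(f"unrecognised period term: {term!r}")
    count = int(match.group(1))
    unit = match.group(2)
    if count == 0:
        # A zero period means "as often as possible", which the text
        # says is the same as a period of 1 instruction (i1i).
        return (1, "i")
    return (count, unit)
```

For instance, `parse_instructions_period("i10us")` yields the pair `(10, "us")`, and any zero period such as `"i0ns"` is normalised to `(1, "i")`.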