.. | .. |
---|
1 | | -Intel Processor Trace |
---|
2 | | -===================== |
---|
3 | | - |
---|
4 | | -Overview |
---|
5 | | -======== |
---|
6 | | - |
---|
7 | | -Intel Processor Trace (Intel PT) is an extension of Intel Architecture that |
---|
8 | | -collects information about software execution such as control flow, execution |
---|
9 | | -modes and timings and formats it into highly compressed binary packets. |
---|
10 | | -Technical details are documented in the Intel 64 and IA-32 Architectures |
---|
11 | | -Software Developer Manuals, Chapter 36 Intel Processor Trace. |
---|
12 | | - |
---|
13 | | -Intel PT is first supported in Intel Core M and 5th generation Intel Core |
---|
14 | | -processors that are based on the Intel micro-architecture code name Broadwell. |
---|
15 | | - |
---|
16 | | -Trace data is collected by 'perf record' and stored within the perf.data file. |
---|
17 | | -See below for options to 'perf record'. |
---|
18 | | - |
---|
19 | | -Trace data must be 'decoded' which involves walking the object code and matching |
---|
20 | | -the trace data packets. For example a TNT packet only tells whether a |
---|
21 | | -conditional branch was taken or not taken, so to make use of that packet the |
---|
22 | | -decoder must know precisely which instruction was being executed. |
---|
23 | | - |
---|
24 | | -Decoding is done on-the-fly. The decoder outputs samples in the same format as |
---|
25 | | -samples output by perf hardware events, for example as though the "instructions" |
---|
26 | | -or "branches" events had been recorded. Presently 3 tools support this: |
---|
27 | | -'perf script', 'perf report' and 'perf inject'. See below for more information |
---|
28 | | -on using those tools. |
---|
29 | | - |
---|
30 | | -The main distinguishing feature of Intel PT is that the decoder can determine |
---|
31 | | -the exact flow of software execution. Intel PT can be used to understand why |
---|
32 | | -and how software got to a certain point, or behaved a certain way. The |
---|
33 | | -software does not have to be recompiled, so Intel PT works with debug or release |
---|
34 | | -builds. However, the executed images are needed, which makes use in JIT-compiled |
---|
35 | | -environments, or with self-modified code, a challenge. Also symbols need to be |
---|
36 | | -provided to make sense of addresses. |
---|
37 | | - |
---|
38 | | -A limitation of Intel PT is that it produces huge amounts of trace data |
---|
39 | | -(hundreds of megabytes per second per core) which takes a long time to decode, |
---|
40 | | -for example two or three orders of magnitude longer than it took to collect. |
---|
41 | | -Another limitation is the performance impact of tracing, something that will |
---|
42 | | -vary depending on the use-case and architecture. |
---|
43 | | - |
---|
44 | | - |
---|
45 | | -Quickstart |
---|
46 | | -========== |
---|
47 | | - |
---|
48 | | -It is important to start small. That is because it is easy to capture vastly |
---|
49 | | -more data than can possibly be processed. |
---|
50 | | - |
---|
51 | | -The simplest thing to do with Intel PT is userspace profiling of small programs. |
---|
52 | | -Data is captured with 'perf record' e.g. to trace 'ls' userspace-only: |
---|
53 | | - |
---|
54 | | - perf record -e intel_pt//u ls |
---|
55 | | - |
---|
56 | | -And profiled with 'perf report' e.g. |
---|
57 | | - |
---|
58 | | - perf report |
---|
59 | | - |
---|
60 | | -Tracing kernel space as well presents a problem, namely kernel self-modifying |
---|
61 | | -code. A fairly good kernel image is available in /proc/kcore but to get an |
---|
62 | | -accurate image a copy of /proc/kcore needs to be made under the same conditions |
---|
63 | | -as the data capture. A script perf-with-kcore can do that, but beware that the |
---|
64 | | -script makes use of 'sudo' to copy /proc/kcore. If you have perf installed |
---|
65 | | -locally from the source tree you can do: |
---|
66 | | - |
---|
67 | | - ~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls |
---|
68 | | - |
---|
69 | | -which will create a directory named 'pt_ls' and put the perf.data file and |
---|
70 | | -copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. The |
---|
71 | | -'perf report' command then becomes: |
---|
72 | | - |
---|
73 | | - ~/libexec/perf-core/perf-with-kcore report pt_ls |
---|
74 | | - |
---|
75 | | -Because samples are synthesized after-the-fact, the sampling period can be |
---|
76 | | -selected for reporting, e.g. to sample every microsecond: |
---|
77 | | - |
---|
78 | | - ~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge |
---|
79 | | - |
---|
80 | | -See the sections below for more information about the --itrace option. |
---|
81 | | - |
---|
82 | | -Beware that the smaller the period, the more samples are produced, and the |
---|
83 | | -longer it takes to process them. |
---|
84 | | - |
---|
85 | | -Also note that the coarseness of Intel PT timing information will start to |
---|
86 | | -distort the statistical value of the sampling as the sampling period becomes |
---|
87 | | -smaller. |
---|
88 | | - |
---|
89 | | -To represent software control flow, "branches" samples are produced. By default |
---|
90 | | -a branch sample is synthesized for every single branch. To get an idea what |
---|
91 | | -data is available you can use the 'perf script' tool with no parameters, which |
---|
92 | | -will list all the samples. |
---|
93 | | - |
---|
94 | | - perf record -e intel_pt//u ls |
---|
95 | | - perf script |
---|
96 | | - |
---|
97 | | -An interesting field that is not printed by default is 'flags' which can be |
---|
98 | | -displayed as follows: |
---|
99 | | - |
---|
100 | | - perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags |
---|
101 | | - |
---|
102 | | -The flags are "bcrosyiABEx" which stand for branch, call, return, conditional, |
---|
103 | | -system, asynchronous, interrupt, transaction abort, trace begin, trace end, and |
---|
104 | | -in transaction, respectively. |
---|
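As a minimal sketch of reading the 'flags' field, the letters can be mapped to the meanings listed above, in the same order. FLAG_NAMES and decode_flags are hypothetical names used for illustration only; they are not part of perf, and the sketch assumes the field is available as a plain string of these letters.

```python
# Hypothetical helper (not part of perf): expand the 'flags' letters
# into the names given in the text above, in the same order.
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
}


def decode_flags(flags):
    """Return the meaning of each known flag letter, in input order."""
    return [FLAG_NAMES[ch] for ch in flags if ch in FLAG_NAMES]


if __name__ == "__main__":
    # A sample carrying both the branch and call letters.
    print(decode_flags("bc"))
```

Unknown characters (such as padding) are skipped rather than treated as errors, which keeps the helper tolerant of column alignment in the output.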
105 | | - |
---|
106 | | -While it is possible to create scripts to analyze the data, an alternative |
---|
107 | | -approach is available to export the data to a sqlite or postgresql database. |
---|
108 | | -Refer to script export-to-sqlite.py or export-to-postgresql.py for more details, |
---|
109 | | -and to script call-graph-from-sql.py for an example of using the database. |
---|
110 | | - |
---|
111 | | -There is also script intel-pt-events.py which provides an example of how to |
---|
112 | | -unpack the raw data for power events and PTWRITE. |
---|
113 | | - |
---|
114 | | -As mentioned above, it is easy to capture too much data. One way to limit the |
---|
115 | | -data captured is to use 'snapshot' mode. |
---|
116 | | -Refer to 'new snapshot option' and 'Intel PT modes of operation' further below. |
---|
117 | | - |
---|
118 | | -Another problem that will be experienced is decoder errors. They can be caused |
---|
119 | | -by inability to access the executed image, self-modified or JIT-ed code, or the |
---|
120 | | -inability to match side-band information (such as context switches and mmaps) |
---|
121 | | -which results in the decoder not knowing what code was executed. |
---|
122 | | - |
---|
123 | | -There is also the problem of perf not being able to copy the data fast enough, |
---|
124 | | -resulting in data being lost because the buffer was full. See 'Buffer handling' below |
---|
125 | | -for more details. |
---|
126 | | - |
---|
127 | | - |
---|
128 | | -perf record |
---|
129 | | -=========== |
---|
130 | | - |
---|
131 | | -new event |
---|
132 | | ---------- |
---|
133 | | - |
---|
134 | | -The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are |
---|
135 | | -selected by providing the PMU name followed by the "config" separated by slashes. |
---|
136 | | -An enhancement has been made to allow default "config" e.g. the option |
---|
137 | | - |
---|
138 | | - -e intel_pt// |
---|
139 | | - |
---|
140 | | -will use a default config value. Currently that is the same as |
---|
141 | | - |
---|
142 | | - -e intel_pt/tsc,noretcomp=0/ |
---|
143 | | - |
---|
144 | | -which is the same as |
---|
145 | | - |
---|
146 | | - -e intel_pt/tsc=1,noretcomp=0/ |
---|
147 | | - |
---|
148 | | -Note there are now new config terms - see section 'config terms' further below. |
---|
149 | | - |
---|
150 | | -The config terms are listed in /sys/devices/intel_pt/format. They are bit |
---|
151 | | -fields within the config member of the struct perf_event_attr which is |
---|
152 | | -passed to the kernel by the perf_event_open system call. They correspond to bit |
---|
153 | | -fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions: |
---|
154 | | - |
---|
155 | | - $ grep -H . /sys/bus/event_source/devices/intel_pt/format/* |
---|
156 | | - /sys/bus/event_source/devices/intel_pt/format/cyc:config:1 |
---|
157 | | - /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22 |
---|
158 | | - /sys/bus/event_source/devices/intel_pt/format/mtc:config:9 |
---|
159 | | - /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17 |
---|
160 | | - /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11 |
---|
161 | | - /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27 |
---|
162 | | - /sys/bus/event_source/devices/intel_pt/format/tsc:config:10 |
---|
163 | | - |
---|
164 | | -Note that the default config must be overridden for each term i.e. |
---|
165 | | - |
---|
166 | | - -e intel_pt/noretcomp=0/ |
---|
167 | | - |
---|
168 | | -is the same as: |
---|
169 | | - |
---|
170 | | - -e intel_pt/tsc=1,noretcomp=0/ |
---|
171 | | - |
---|
172 | | -So, to disable TSC packets use: |
---|
173 | | - |
---|
174 | | - -e intel_pt/tsc=0/ |
---|
175 | | - |
---|
176 | | -It is also possible to specify the config value explicitly: |
---|
177 | | - |
---|
178 | | - -e intel_pt/config=0x400/ |
---|
179 | | - |
---|
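The relationship between the config terms and the explicit config value can be sketched as follows. The bit ranges are copied from the sysfs listing above; FORMAT and encode_config are hypothetical names for illustration and are not a perf API. Actual bit positions should be read from the format directory on the machine in question.

```python
# Bit ranges copied from the sysfs listing above: name -> (low bit, high bit).
FORMAT = {
    "cyc": (1, 1),
    "cyc_thresh": (19, 22),
    "mtc": (9, 9),
    "mtc_period": (14, 17),
    "noretcomp": (11, 11),
    "psb_period": (24, 27),
    "tsc": (10, 10),
}


def encode_config(**terms):
    """Pack config terms into one integer (illustrative helper only)."""
    config = 0
    for name, value in terms.items():
        low, high = FORMAT[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        config |= value << low
    return config


if __name__ == "__main__":
    # tsc is bit 10, so tsc=1,noretcomp=0 packs to 0x400.
    print(hex(encode_config(tsc=1, noretcomp=0)))
```

With these bit positions, tsc=1,noretcomp=0 packs to 0x400, which matches the explicit config value shown above.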
180 | | -Note that, as with all events, the event is suffixed with event modifiers: |
---|
181 | | - |
---|
182 | | - u userspace |
---|
183 | | - k kernel |
---|
184 | | - h hypervisor |
---|
185 | | - G guest |
---|
186 | | - H host |
---|
187 | | - p precise ip |
---|
188 | | - |
---|
189 | | -'h', 'G' and 'H' are for virtualization which is not supported by Intel PT. |
---|
190 | | -'p' is also not relevant to Intel PT. So only options 'u' and 'k' are |
---|
191 | | -meaningful for Intel PT. |
---|
192 | | - |
---|
193 | | -perf_event_attr is displayed if the -vv option is used e.g. |
---|
194 | | - |
---|
195 | | - ------------------------------------------------------------ |
---|
196 | | - perf_event_attr: |
---|
197 | | - type 6 |
---|
198 | | - size 112 |
---|
199 | | - config 0x400 |
---|
200 | | - { sample_period, sample_freq } 1 |
---|
201 | | - sample_type IP|TID|TIME|CPU|IDENTIFIER |
---|
202 | | - read_format ID |
---|
203 | | - disabled 1 |
---|
204 | | - inherit 1 |
---|
205 | | - exclude_kernel 1 |
---|
206 | | - exclude_hv 1 |
---|
207 | | - enable_on_exec 1 |
---|
208 | | - sample_id_all 1 |
---|
209 | | - ------------------------------------------------------------ |
---|
210 | | - sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 |
---|
211 | | - sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 |
---|
212 | | - sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 |
---|
213 | | - sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 |
---|
214 | | - ------------------------------------------------------------ |
---|
215 | | - |
---|
216 | | - |
---|
217 | | -config terms |
---|
218 | | ------------- |
---|
219 | | - |
---|
220 | | -The June 2015 version of Intel 64 and IA-32 Architectures Software Developer |
---|
221 | | -Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features. |
---|
222 | | -Some of the features are reflected in new config terms. All the config terms are |
---|
223 | | -described below. |
---|
224 | | - |
---|
225 | | -tsc Always supported. Produces TSC timestamp packets to provide |
---|
226 | | - timing information. In some cases it is possible to decode |
---|
227 | | - without timing information, for example a per-thread context |
---|
228 | | - that does not overlap executable memory maps. |
---|
229 | | - |
---|
230 | | - The default config selects tsc (i.e. tsc=1). |
---|
231 | | - |
---|
232 | | -noretcomp Always supported. Disables "return compression" so a TIP packet |
---|
233 | | - is produced when a function returns. Causes more packets to be |
---|
234 | | - produced but might make decoding more reliable. |
---|
235 | | - |
---|
236 | | - The default config does not select noretcomp (i.e. noretcomp=0). |
---|
237 | | - |
---|
238 | | -psb_period Allows the frequency of PSB packets to be specified. |
---|
239 | | - |
---|
240 | | - The PSB packet is a synchronization packet that provides a |
---|
241 | | - starting point for decoding or recovery from errors. |
---|
242 | | - |
---|
243 | | - Support for psb_period is indicated by: |
---|
244 | | - |
---|
245 | | - /sys/bus/event_source/devices/intel_pt/caps/psb_cyc |
---|
246 | | - |
---|
247 | | - which contains "1" if the feature is supported and "0" |
---|
248 | | - otherwise. |
---|
249 | | - |
---|
250 | | - Valid values are given by: |
---|
251 | | - |
---|
252 | | - /sys/bus/event_source/devices/intel_pt/caps/psb_periods |
---|
253 | | - |
---|
254 | | - which contains a hexadecimal value, the bits of which represent |
---|
255 | | - valid values e.g. bit 2 set means value 2 is valid. |
---|
256 | | - |
---|
257 | | - The psb_period value is converted to the approximate number of |
---|
258 | | - trace bytes between PSB packets as: |
---|
259 | | - |
---|
260 | | - 2 ^ (value + 11) |
---|
261 | | - |
---|
262 | | - e.g. value 3 means 16KiB between PSBs |
---|
263 | | - |
---|
264 | | - If an invalid value is entered, the error message |
---|
265 | | - will give a list of valid values e.g. |
---|
266 | | - |
---|
267 | | - $ perf record -e intel_pt/psb_period=15/u uname |
---|
268 | | - Invalid psb_period for intel_pt. Valid values are: 0-5 |
---|
269 | | - |
---|
270 | | - If MTC packets are selected, the default config selects a value |
---|
271 | | - of 3 (i.e. psb_period=3) or the nearest lower value that is |
---|
272 | | - supported (0 is always supported). Otherwise the default is 0. |
---|
273 | | - |
---|
274 | | - If decoding is expected to be reliable and the buffer is large |
---|
275 | | - then a large PSB period can be used. |
---|
276 | | - |
---|
277 | | - Because a TSC packet is produced with PSB, the PSB period can |
---|
278 | | - also affect the granularity of timing information in the absence |
---|
279 | | - of MTC or CYC. |
---|
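The psb_period arithmetic above can be sketched as follows. The function names are hypothetical, and the mask 0x3f is an assumed example chosen to match the "Valid values are: 0-5" message; a real value would come from the caps/psb_periods file.

```python
def valid_values(caps_hex):
    """Decode a caps bitmask: bit n set means value n is valid."""
    mask = int(caps_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def psb_bytes(value):
    """Approximate trace bytes between PSB packets: 2 ^ (value + 11)."""
    return 1 << (value + 11)


def nearest_supported(wanted, supported):
    """Return 'wanted' if supported, else the nearest lower supported value."""
    return max(v for v in supported if v <= wanted)


if __name__ == "__main__":
    supported = valid_values("0x3f")  # assumed example mask
    for value in supported:
        print(value, psb_bytes(value))
    print("default:", nearest_supported(3, supported))
```

nearest_supported relies on 0 always being supported, as stated above, so the candidate list is never empty.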
280 | | - |
---|
281 | | -mtc Produces MTC timing packets. |
---|
282 | | - |
---|
283 | | - MTC packets provide finer grain timestamp information than TSC |
---|
284 | | - packets. MTC packets record time using the hardware crystal |
---|
285 | | - clock (CTC) which is related to TSC packets using a TMA packet. |
---|
286 | | - |
---|
287 | | - Support for this feature is indicated by: |
---|
288 | | - |
---|
289 | | - /sys/bus/event_source/devices/intel_pt/caps/mtc |
---|
290 | | - |
---|
291 | | - which contains "1" if the feature is supported and |
---|
292 | | - "0" otherwise. |
---|
293 | | - |
---|
294 | | - The frequency of MTC packets can also be specified - see |
---|
295 | | - mtc_period below. |
---|
296 | | - |
---|
297 | | -mtc_period Specifies how frequently MTC packets are produced - see mtc |
---|
298 | | - above for how to determine if MTC packets are supported. |
---|
299 | | - |
---|
300 | | - Valid values are given by: |
---|
301 | | - |
---|
302 | | - /sys/bus/event_source/devices/intel_pt/caps/mtc_periods |
---|
303 | | - |
---|
304 | | - which contains a hexadecimal value, the bits of which represent |
---|
305 | | - valid values e.g. bit 2 set means value 2 is valid. |
---|
306 | | - |
---|
307 | | - The mtc_period value is converted to the MTC frequency as: |
---|
308 | | - |
---|
309 | | - CTC-frequency / (2 ^ value) |
---|
310 | | - |
---|
311 | | - e.g. value 3 means one eighth of CTC-frequency |
---|
312 | | - |
---|
313 | | - Where CTC is the hardware crystal clock, the frequency of which |
---|
314 | | - can be related to TSC via values provided in cpuid leaf 0x15. |
---|
315 | | - |
---|
316 | | - If an invalid value is entered, the error message |
---|
317 | | - will give a list of valid values e.g. |
---|
318 | | - |
---|
319 | | - $ perf record -e intel_pt/mtc_period=15/u uname |
---|
320 | | - Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9 |
---|
321 | | - |
---|
322 | | - The default value is 3 or the nearest lower value |
---|
323 | | - that is supported (0 is always supported). |
---|
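As a small worked example of the mtc_period formula, the sketch below computes the MTC frequency for the values listed in the error message above. The 24 MHz CTC frequency is purely an assumed figure for illustration; the real value depends on the processor. mtc_frequency is a hypothetical helper name.

```python
def mtc_frequency(ctc_hz, value):
    """MTC packet frequency: CTC-frequency / (2 ^ value)."""
    return ctc_hz / (2 ** value)


if __name__ == "__main__":
    ASSUMED_CTC_HZ = 24_000_000  # assumption for illustration only
    for value in (0, 3, 6, 9):   # values from the example error message
        print(value, mtc_frequency(ASSUMED_CTC_HZ, value))
```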
324 | | - |
---|
325 | | -cyc Produces CYC timing packets. |
---|
326 | | - |
---|
327 | | - CYC packets provide even finer grain timestamp information than |
---|
328 | | - MTC and TSC packets. A CYC packet contains the number of CPU |
---|
329 | | - cycles since the last CYC packet. Unlike MTC and TSC packets, |
---|
330 | | - CYC packets are only sent when another packet is also sent. |
---|
331 | | - |
---|
332 | | - Support for this feature is indicated by: |
---|
333 | | - |
---|
334 | | - /sys/bus/event_source/devices/intel_pt/caps/psb_cyc |
---|
335 | | - |
---|
336 | | - which contains "1" if the feature is supported and |
---|
337 | | - "0" otherwise. |
---|
338 | | - |
---|
339 | | - The number of CYC packets produced can be reduced by specifying |
---|
340 | | - a threshold - see cyc_thresh below. |
---|
341 | | - |
---|
342 | | -cyc_thresh Specifies how frequently CYC packets are produced - see cyc |
---|
343 | | - above for how to determine if CYC packets are supported. |
---|
344 | | - |
---|
345 | | - Valid cyc_thresh values are given by: |
---|
346 | | - |
---|
347 | | - /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds |
---|
348 | | - |
---|
349 | | - which contains a hexadecimal value, the bits of which represent |
---|
350 | | - valid values e.g. bit 2 set means value 2 is valid. |
---|
351 | | - |
---|
352 | | - The cyc_thresh value represents the minimum number of CPU cycles |
---|
353 | | - that must have passed before a CYC packet can be sent. The |
---|
354 | | - number of CPU cycles is: |
---|
355 | | - |
---|
356 | | - 2 ^ (value - 1) |
---|
357 | | - |
---|
358 | | - e.g. value 4 means 8 CPU cycles must pass before a CYC packet |
---|
359 | | - can be sent. Note a CYC packet is still only sent when another |
---|
360 | | - packet is sent, not at, e.g. every 8 CPU cycles. |
---|
361 | | - |
---|
362 | | - If an invalid value is entered, the error message |
---|
363 | | - will give a list of valid values e.g. |
---|
364 | | - |
---|
365 | | - $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname |
---|
366 | | - Invalid cyc_thresh for intel_pt. Valid values are: 0-12 |
---|
367 | | - |
---|
368 | | - CYC packets are not requested by default. |
---|
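The cyc_thresh formula can be sketched as below. cyc_threshold_cycles is a hypothetical helper name. Treating value 0 as "no threshold" is an assumption, since the formula 2 ^ (value - 1) does not yield a whole number of cycles for 0.

```python
def cyc_threshold_cycles(value):
    """Minimum CPU cycles before a CYC packet can be sent: 2 ^ (value - 1)."""
    if value < 0:
        raise ValueError("cyc_thresh must not be negative")
    if value == 0:
        return 0  # assumption: 0 means no threshold is applied
    return 1 << (value - 1)


if __name__ == "__main__":
    for value in range(0, 13):  # 0-12, as in the example error message
        print(value, cyc_threshold_cycles(value))
```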
369 | | - |
---|
370 | | -pt Specifies pass-through which enables the 'branch' config term. |
---|
371 | | - |
---|
372 | | - The default config selects 'pt' if it is available, so a user will |
---|
373 | | - never need to specify this term. |
---|
374 | | - |
---|
375 | | -branch Enable branch tracing. Branch tracing is enabled by default so to |
---|
376 | | - disable branch tracing use 'branch=0'. |
---|
377 | | - |
---|
378 | | - The default config selects 'branch' if it is available. |
---|
379 | | - |
---|
380 | | -ptw Enable PTWRITE packets which are produced when a ptwrite instruction |
---|
381 | | - is executed. |
---|
382 | | - |
---|
383 | | - Support for this feature is indicated by: |
---|
384 | | - |
---|
385 | | - /sys/bus/event_source/devices/intel_pt/caps/ptwrite |
---|
386 | | - |
---|
387 | | - which contains "1" if the feature is supported and |
---|
388 | | - "0" otherwise. |
---|
389 | | - |
---|
390 | | -fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet |
---|
391 | | - provides the address of the ptwrite instruction. In the absence of |
---|
392 | | - fup_on_ptw, the decoder will use the address of the previous branch |
---|
393 | | - if branch tracing is enabled, otherwise the address will be zero. |
---|
394 | | - Note that fup_on_ptw will work even when branch tracing is disabled. |
---|
395 | | - |
---|
396 | | -pwr_evt Enable power events. The power events provide information about |
---|
397 | | - changes to the CPU C-state. |
---|
398 | | - |
---|
399 | | - Support for this feature is indicated by: |
---|
400 | | - |
---|
401 | | - /sys/bus/event_source/devices/intel_pt/caps/power_event_trace |
---|
402 | | - |
---|
403 | | - which contains "1" if the feature is supported and |
---|
404 | | - "0" otherwise. |
---|
405 | | - |
---|
406 | | - |
---|
407 | | -new snapshot option |
---|
408 | | -------------------- |
---|
409 | | - |
---|
410 | | -The difference between full trace and snapshot from the kernel's perspective is |
---|
411 | | -that in full trace we don't overwrite trace data that the user hasn't collected |
---|
412 | | -yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let |
---|
413 | | -the trace run and overwrite older data in the buffer so that whenever something |
---|
414 | | -interesting happens, we can stop it and grab a snapshot of what was going on |
---|
415 | | -around that interesting moment. |
---|
416 | | - |
---|
417 | | -To select snapshot mode a new option has been added: |
---|
418 | | - |
---|
419 | | - -S |
---|
420 | | - |
---|
421 | | -Optionally it can be followed by the snapshot size e.g. |
---|
422 | | - |
---|
423 | | - -S0x100000 |
---|
424 | | - |
---|
425 | | -The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size |
---|
426 | | -nor snapshot size is specified, then the default is 4MiB for privileged users |
---|
427 | | -(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. |
---|
428 | | -If an unprivileged user does not specify mmap pages, the mmap pages will be |
---|
429 | | -reduced as described in the 'new auxtrace mmap size option' section below. |
---|
430 | | - |
---|
431 | | -The snapshot size is displayed if the option -vv is used e.g. |
---|
432 | | - |
---|
433 | | - Intel PT snapshot size: %zu |
---|
434 | | - |
---|
435 | | - |
---|
436 | | -new auxtrace mmap size option |
---|
437 | | ---------------------------- |
---|
438 | | - |
---|
439 | | -Intel PT buffer size is specified by an addition to the -m option e.g. |
---|
440 | | - |
---|
441 | | - -m,16 |
---|
442 | | - |
---|
443 | | -selects a buffer size of 16 pages i.e. 64KiB. |
---|
444 | | - |
---|
445 | | -Note that the existing functionality of -m is unchanged. The auxtrace mmap size |
---|
446 | | -is specified by the optional addition of a comma and the value. |
---|
447 | | - |
---|
448 | | -The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users |
---|
449 | | -(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. |
---|
450 | | -If an unprivileged user does not specify mmap pages, the mmap pages will be |
---|
451 | | -reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the |
---|
452 | | -user is likely to get an error as they exceed their mlock limit (Max locked |
---|
453 | | -memory as shown in /proc/self/limits). Note that perf does not count the first |
---|
454 | | -512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu |
---|
455 | | -against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus |
---|
456 | | -their mlock limit (which defaults to 64KiB but is not multiplied by the number |
---|
457 | | -of cpus). |
---|
458 | | - |
---|
459 | | -In full-trace mode, powers of two are allowed for buffer size, with a minimum |
---|
460 | | -size of 2 pages. In snapshot mode, it is the same but the minimum size is |
---|
461 | | -1 page. |
---|
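The sizing rules above can be checked with a short sketch. The helper names are hypothetical, and the 4KiB page size is an assumption that is consistent with "16 pages i.e. 64KiB" but may differ on other systems.

```python
PAGE_SIZE = 4096  # assumption: 4KiB pages


def aux_buffer_bytes(pages, page_size=PAGE_SIZE):
    """Buffer size in bytes for a page count given after the comma in -m."""
    return pages * page_size


def is_valid_aux_pages(pages, snapshot=False):
    """Powers of two only; at least 2 pages, or 1 page in snapshot mode."""
    minimum = 1 if snapshot else 2
    is_power_of_two = pages > 0 and (pages & (pages - 1)) == 0
    return is_power_of_two and pages >= minimum


if __name__ == "__main__":
    print(aux_buffer_bytes(16))                    # -m,16
    print(is_valid_aux_pages(1))                   # full-trace mode
    print(is_valid_aux_pages(1, snapshot=True))    # snapshot mode
```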
462 | | - |
---|
463 | | -The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g. |
---|
464 | | - |
---|
465 | | - mmap length 528384 |
---|
466 | | - auxtrace mmap length 4198400 |
---|
467 | | - |
---|
468 | | - |
---|
469 | | -Intel PT modes of operation |
---|
470 | | ---------------------------- |
---|
471 | | - |
---|
472 | | -Intel PT can be used in 2 modes: |
---|
473 | | - full-trace mode |
---|
474 | | - snapshot mode |
---|
475 | | - |
---|
476 | | -Full-trace mode traces continuously e.g. |
---|
477 | | - |
---|
478 | | - perf record -e intel_pt//u uname |
---|
479 | | - |
---|
480 | | -Snapshot mode captures the available data when a signal is sent e.g. |
---|
481 | | - |
---|
482 | | - perf record -v -e intel_pt//u -S ./loopy 1000000000 & |
---|
483 | | - [1] 11435 |
---|
484 | | - kill -USR2 11435 |
---|
485 | | - Recording AUX area tracing snapshot |
---|
486 | | - |
---|
487 | | -Note that the signal sent is SIGUSR2. |
---|
488 | | -Note that "Recording AUX area tracing snapshot" is displayed because the -v |
---|
489 | | -option is used. |
---|
490 | | - |
---|
491 | | -The 2 modes cannot be used together. |
---|
492 | | - |
---|
493 | | - |
---|
494 | | -Buffer handling |
---|
495 | | ---------------- |
---|
496 | | - |
---|
497 | | -There may be buffer limitations (i.e. single ToPa entry) which means that actual |
---|
498 | | -buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to |
---|
499 | | -provide other sizes, and in particular an arbitrarily large size, multiple |
---|
500 | | -buffers are logically concatenated. However an interrupt must be used to switch |
---|
501 | | -between buffers. That has two potential problems: |
---|
502 | | - a) the interrupt may not be handled in time so that the current buffer |
---|
503 | | - becomes full and some trace data is lost. |
---|
504 | | - b) the interrupts may slow the system and affect the performance |
---|
505 | | - results. |
---|
506 | | - |
---|
507 | | -If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event |
---|
508 | | -which the tools report as an error. |
---|
509 | | - |
---|
510 | | -In full-trace mode, the driver waits for data to be copied out before allowing |
---|
511 | | -the (logical) buffer to wrap-around. If data is not copied out quickly enough, |
---|
512 | | -again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to |
---|
513 | | -wait, the intel_pt event gets disabled. Because it is difficult to know when |
---|
514 | | -that happens, perf tools always re-enable the intel_pt event after copying out |
---|
515 | | -data. |
---|
516 | | - |
---|
517 | | - |
---|
518 | | -Intel PT and build ids |
---|
519 | | ----------------------- |
---|
520 | | - |
---|
521 | | -By default "perf record" post-processes the event stream to find all build ids |
---|
522 | | -for executables for all addresses sampled. Deliberately, Intel PT is not |
---|
523 | | -decoded for that purpose (it would take too long). Instead the build ids for |
---|
524 | | -all executables encountered (due to mmap, comm or task events) are included |
---|
525 | | -in the perf.data file. |
---|
526 | | - |
---|
527 | | -To see buildids included in the perf.data file use the command: |
---|
528 | | - |
---|
529 | | - perf buildid-list |
---|
530 | | - |
---|
531 | | -If the perf.data file contains Intel PT data, that is the same as: |
---|
532 | | - |
---|
533 | | - perf buildid-list --with-hits |
---|
534 | | - |
---|
535 | | - |
---|
536 | | -Snapshot mode and event disabling |
---|
537 | | ---------------------------------- |
---|
538 | | - |
---|
539 | | -In order to make a snapshot, the intel_pt event is disabled using an IOCTL, |
---|
540 | | -namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the |
---|
541 | | -collection of side-band information. In order to prevent that, a dummy |
---|
542 | | -software event has been introduced that permits tracking events (like mmaps) to |
---|
543 | | -continue to be recorded while intel_pt is disabled. That is important to ensure |
---|
544 | | -there is complete side-band information to allow the decoding of subsequent |
---|
545 | | -snapshots. |
---|
546 | | - |
---|
547 | | -A test has been created for that. To find the test: |
---|
548 | | - |
---|
549 | | - perf test list |
---|
550 | | - ... |
---|
551 | | - 23: Test using a dummy software event to keep tracking |
---|
552 | | - |
---|
553 | | -To run the test: |
---|
554 | | - |
---|
555 | | - perf test 23 |
---|
556 | | - 23: Test using a dummy software event to keep tracking : Ok |
---|
557 | | - |
---|
558 | | - |
---|
559 | | -perf record modes (nothing new here) |
---|
560 | | ------------------------------------- |
---|
561 | | - |
---|
562 | | -perf record essentially operates in one of three modes: |
---|
563 | | - per thread |
---|
564 | | - per cpu |
---|
565 | | - workload only |
---|
566 | | - |
---|
567 | | -"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a |
---|
568 | | -workload). |
---|
569 | | -"per cpu" is selected by -C or -a. |
---|
570 | | -"workload only" mode is selected by not using the other options but providing a |
---|
571 | | -command to run (i.e. the workload). |
---|
572 | | - |
---|
573 | | -In per-thread mode an exact list of threads is traced. There is no inheritance. |
---|
574 | | -Each thread has its own event buffer. |
---|
575 | | - |
---|
576 | | -In per-cpu mode all processes (or processes from the selected cgroup i.e. -G |
---|
577 | | -option, or processes selected with -p or -u) are traced. Each cpu has its own |
---|
578 | | -buffer. Inheritance is allowed. |
---|
579 | | - |
---|
580 | | -In workload-only mode, the workload is traced but with per-cpu buffers. |
---|
581 | | -Inheritance is allowed. Note that you can now trace a workload in per-thread |
---|
582 | | -mode by using the --per-thread option. |
---|
583 | | - |
---|
584 | | - |
---|
585 | | -Privileged vs non-privileged users |
---|
586 | | ----------------------------------- |
---|
587 | | - |
---|
588 | | -Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users |
---|
589 | | -have memory limits imposed upon them. That affects what buffer sizes they can |
---|
590 | | -have as outlined above. |
---|
591 | | - |
---|
592 | | -The v4.2 kernel introduced support for a context switch metadata event, |
---|
593 | | -PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes |
---|
594 | | -are scheduled out and in, but not by whom. That information is left to |
---|
595 | | -PERF_RECORD_SWITCH_CPU_WIDE, which is only accessible in system-wide context, |
---|
596 | | -which in turn requires CAP_SYS_ADMIN. |
---|
597 | | - |
---|
598 | | -For further information, see commit 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH |
---|
599 | | -to indicate context switches"), which introduces these metadata events. |
---|
600 | | - |
---|
601 | | -When working with kernels < v4.2, the following considerations apply, |
---|
602 | | -as the sched:sched_switch tracepoints will be used to receive such information: |
---|
603 | | - |
---|
604 | | -Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are |
---|
605 | | -not permitted to use tracepoints which means there is insufficient side-band |
---|
606 | | -information to decode Intel PT in per-cpu mode, and potentially workload-only |
---|
607 | | -mode too if the workload creates new processes. |
---|
608 | | - |
---|
609 | | -Note also, that to use tracepoints, read-access to debugfs is required. So if |
---|
610 | | -debugfs is not mounted or the user does not have read-access, it will again not |
---|
611 | | -be possible to decode Intel PT in per-cpu mode. |
---|
612 | | - |
---|
613 | | - |
---|
614 | | -sched_switch tracepoint |
---|
615 | | ------------------------ |
---|
616 | | - |
---|
617 | | -The sched_switch tracepoint is used to provide side-band data for Intel PT |
---|
618 | | -decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't |
---|
619 | | -available. |
---|
620 | | - |
---|
621 | | -The sched_switch events are automatically added. e.g. the second event shown |
---|
622 | | -below: |
---|
623 | | - |
---|
624 | | - $ perf record -vv -e intel_pt//u uname |
---|
625 | | - ------------------------------------------------------------ |
---|
626 | | - perf_event_attr: |
---|
627 | | - type 6 |
---|
628 | | - size 112 |
---|
629 | | - config 0x400 |
---|
630 | | - { sample_period, sample_freq } 1 |
---|
631 | | - sample_type IP|TID|TIME|CPU|IDENTIFIER |
---|
632 | | - read_format ID |
---|
633 | | - disabled 1 |
---|
634 | | - inherit 1 |
---|
635 | | - exclude_kernel 1 |
---|
636 | | - exclude_hv 1 |
---|
637 | | - enable_on_exec 1 |
---|
638 | | - sample_id_all 1 |
---|
639 | | - ------------------------------------------------------------ |
---|
640 | | - sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 |
---|
641 | | - sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 |
---|
642 | | - sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 |
---|
643 | | - sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 |
---|
644 | | - ------------------------------------------------------------ |
---|
645 | | - perf_event_attr: |
---|
646 | | - type 2 |
---|
647 | | - size 112 |
---|
648 | | - config 0x108 |
---|
649 | | - { sample_period, sample_freq } 1 |
---|
650 | | - sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER |
---|
651 | | - read_format ID |
---|
652 | | - inherit 1 |
---|
653 | | - sample_id_all 1 |
---|
654 | | - exclude_guest 1 |
---|
655 | | - ------------------------------------------------------------ |
---|
656 | | - sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 |
---|
657 | | - sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 |
---|
658 | | - sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 |
---|
659 | | - sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 |
---|
660 | | - ------------------------------------------------------------ |
---|
661 | | - perf_event_attr: |
---|
662 | | - type 1 |
---|
663 | | - size 112 |
---|
664 | | - config 0x9 |
---|
665 | | - { sample_period, sample_freq } 1 |
---|
666 | | - sample_type IP|TID|TIME|IDENTIFIER |
---|
667 | | - read_format ID |
---|
668 | | - disabled 1 |
---|
669 | | - inherit 1 |
---|
670 | | - exclude_kernel 1 |
---|
671 | | - exclude_hv 1 |
---|
672 | | - mmap 1 |
---|
673 | | - comm 1 |
---|
674 | | - enable_on_exec 1 |
---|
675 | | - task 1 |
---|
676 | | - sample_id_all 1 |
---|
677 | | - mmap2 1 |
---|
678 | | - comm_exec 1 |
---|
679 | | - ------------------------------------------------------------ |
---|
680 | | - sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 |
---|
681 | | - sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 |
---|
682 | | - sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 |
---|
683 | | - sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 |
---|
684 | | - mmap size 528384B |
---|
685 | | - AUX area mmap length 4194304 |
---|
686 | | - perf event ring buffer mmapped per cpu |
---|
687 | | - Synthesizing auxtrace information |
---|
688 | | - Linux |
---|
689 | | - [ perf record: Woken up 1 times to write data ] |
---|
690 | | - [ perf record: Captured and wrote 0.042 MB perf.data ] |
---|
691 | | - |
---|
692 | | -Note, the sched_switch event is only added if the user is permitted to use it |
---|
693 | | -and only in per-cpu mode. |
---|
694 | | - |
---|
695 | | -Note also, the sched_switch event is only added if TSC packets are requested. |
---|
696 | | -That is because, in the absence of timing information, the sched_switch events |
---|
697 | | -cannot be matched against the Intel PT trace. |
---|
698 | | - |
---|
699 | | - |
---|
700 | | -perf script |
---|
701 | | -=========== |
---|
702 | | - |
---|
703 | | -By default, perf script will decode trace data found in the perf.data file. |
---|
704 | | -This can be further controlled by the new option --itrace. |
---|
705 | | - |
---|
706 | | - |
---|
707 | | -New --itrace option |
---|
708 | | -------------------- |
---|
709 | | - |
---|
710 | | -Having no option is the same as |
---|
711 | | - |
---|
712 | | - --itrace |
---|
713 | | - |
---|
714 | | -which, in turn, is the same as |
---|
715 | | - |
---|
716 | | - --itrace=ibxwpe |
---|
717 | | - |
---|
718 | | -The letters are: |
---|
719 | | - |
---|
720 | | - i synthesize "instructions" events |
---|
721 | | - b synthesize "branches" events |
---|
722 | | - x synthesize "transactions" events |
---|
723 | | - w synthesize "ptwrite" events |
---|
724 | | - p synthesize "power" events |
---|
725 | | - c synthesize branches events (calls only) |
---|
726 | | - r synthesize branches events (returns only) |
---|
727 | | - e synthesize tracing error events |
---|
728 | | - d create a debug log |
---|
729 | | - g synthesize a call chain (use with i or x) |
---|
730 | | - l synthesize last branch entries (use with i or x) |
---|
731 | | - s skip initial number of events |
---|
732 | | - |
---|
733 | | -"Instructions" events look like they were recorded by "perf record -e |
---|
734 | | -instructions". |
---|
735 | | - |
---|
736 | | -"Branches" events look like they were recorded by "perf record -e branches". "c" |
---|
737 | | -and "r" can be combined to get calls and returns. |
---|
738 | | - |
---|
739 | | -"Transactions" events correspond to the start or end of transactions. The |
---|
740 | | -'flags' field can be used in perf script to determine whether the event is a |
---|
741 | | -transaction start, commit or abort. |
---|
742 | | - |
---|
743 | | -Note that "instructions", "branches" and "transactions" events depend on code |
---|
744 | | -flow packets which can be disabled by using the config term "branch=0". Refer |
---|
745 | | -to the config terms section above. |
---|
746 | | - |
---|
747 | | -"ptwrite" events record the payload of the ptwrite instruction and whether |
---|
748 | | -"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are |
---|
749 | | -recorded only if the "ptw" config term was used. Refer to the config terms |
---|
750 | | -section above. perf script "synth" field displays "ptwrite" information like |
---|
751 | | -this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was |
---|
752 | | -used. |
---|
753 | | - |
---|
754 | | -"Power" events correspond to power event packets and CBR (core-to-bus ratio) |
---|
755 | | -packets. While CBR packets are always recorded when tracing is enabled, power |
---|
756 | | -event packets are recorded only if the "pwr_evt" config term was used. Refer to |
---|
757 | | -the config terms section above. The power events record information about |
---|
758 | | -C-state changes, whereas CBR is indicative of CPU frequency. perf script |
---|
759 | | -"event,synth" fields display information like this: |
---|
760 | | - cbr: cbr: 22 freq: 2189 MHz (200%) |
---|
761 | | - mwait: hints: 0x60 extensions: 0x1 |
---|
762 | | - pwre: hw: 0 cstate: 2 sub-cstate: 0 |
---|
763 | | - exstop: ip: 1 |
---|
764 | | - pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 |
---|
765 | | -Where: |
---|
766 | | - "cbr" includes the frequency and the percentage of maximum non-turbo |
---|
767 | | - "mwait" shows mwait hints and extensions |
---|
768 | | - "pwre" shows C-state transitions (to a C-state deeper than C0) and |
---|
769 | | - whether initiated by hardware |
---|
770 | | - "exstop" indicates execution stopped and whether the IP was recorded |
---|
771 | | - exactly |
---|
772 | | - "pwrx" indicates return to C0 |
---|
773 | | -For more details refer to the Intel 64 and IA-32 Architectures Software |
---|
774 | | -Developer Manuals. |
---|
775 | | - |
---|
776 | | -Error events show where the decoder lost the trace. Error events |
---|
777 | | -are quite important. Users must know if what they are seeing is a complete |
---|
778 | | -picture or not. |
---|
779 | | - |
---|
780 | | -The "d" option will cause the creation of a file "intel_pt.log" containing all |
---|
781 | | -decoded packets and instructions. Note that this option slows down the decoder |
---|
782 | | -and that the resulting file may be very large. |
---|
783 | | - |
---|
784 | | -In addition, the period of the "instructions" event can be specified. e.g. |
---|
785 | | - |
---|
786 | | - --itrace=i10us |
---|
787 | | - |
---|
788 | | -sets the period to 10us i.e. one instruction sample is synthesized for each 10 |
---|
789 | | -microseconds of trace. Alternatives to "us" are "ms" (milliseconds), |
---|
790 | | -"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions). |
---|
791 | | - |
---|
792 | | -"ms", "us" and "ns" are converted to TSC ticks. |
---|
793 | | - |
---|
794 | | -The timing information included with Intel PT does not give the time of every |
---|
795 | | -instruction. Consequently, for the purpose of sampling, the decoder estimates |
---|
796 | | -the time since the last timing packet based on 1 tick per instruction. The time |
---|
797 | | -on the sample is *not* adjusted and reflects the last known value of TSC. |
---|
798 | | - |
---|
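The conversion and the estimate described above can be sketched as follows. This is an illustration only, not the decoder's code: the TSC frequency parameter (tsc_hz) and both function names are assumptions made for the example, and the real decoder derives its conversion from the recorded trace rather than from a fixed frequency.

```python
# Nanoseconds per unit for the time-based period units named in the text.
NS_PER_UNIT = {"ms": 1_000_000, "us": 1_000, "ns": 1}


def period_to_ticks(value: int, unit: str, tsc_hz: int) -> int:
    """Convert a period in ms, us or ns to TSC ticks.

    tsc_hz is an assumed, fixed TSC frequency used only for illustration.
    """
    return value * NS_PER_UNIT[unit] * tsc_hz // 1_000_000_000


def estimated_ticks(last_tsc: int, instructions_since_packet: int) -> int:
    """Estimate the current tick count for sampling purposes.

    Follows the rule in the text: 1 tick per instruction since the last
    timing packet. The timestamp put on the sample is not adjusted.
    """
    return last_tsc + instructions_since_packet


# Example: a 10us period on a hypothetical 2 GHz TSC.
print(period_to_ticks(10, "us", 2_000_000_000))
print(estimated_ticks(1000, 250))
```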
799 | | -For Intel PT, the default period is 100us. |
---|
800 | | - |
---|
801 | | -Setting it to a zero period means "as often as possible". |
---|
802 | | - |
---|
803 | | -In the case of Intel PT that is the same as a period of 1 and a unit of |
---|
804 | | -'instructions' (i.e. --itrace=i1i). |
---|
805 | | - |
---|
806 | | -Also the call chain size (default 16, max. 1024) for instructions or |
---|
807 | | -transactions events can be specified. e.g. |
---|
808 | | - |
---|
809 | | - --itrace=ig32 |
---|
810 | | - --itrace=xg32 |
---|
811 | | - |
---|
812 | | -Also the number of last branch entries (default 64, max. 1024) for instructions or |
---|
813 | | -transactions events can be specified. e.g. |
---|
814 | | - |
---|
815 | | - --itrace=il10 |
---|
816 | | - --itrace=xl10 |
---|
817 | | - |
---|
818 | | -Note that last branch entries are cleared for each sample, so there is no overlap |
---|
819 | | -from one sample to the next. |
---|
820 | | - |
---|
821 | | -To disable trace decoding entirely, use the option --no-itrace. |
---|
822 | | - |
---|
823 | | -It is also possible to skip events generated (instructions, branches, transactions) |
---|
824 | | -at the beginning. This is useful to ignore initialization code. |
---|
825 | | - |
---|
826 | | - --itrace=i0nss1000000 |
---|
827 | | - |
---|
828 | | -skips the first million instructions. |
---|
829 | | - |
---|
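The --itrace syntax described in this section can be illustrated with a small parser. It is a sketch that covers only the forms shown above (plain letters, an "i" period with a unit, and a number after "g", "l" or "s"); it is not perf's own option parser, and the names TOKEN and parse_itrace are hypothetical.

```python
import re

# One token is: "i" with an optional period and unit, or "g"/"l"/"s" with an
# optional number, or one of the remaining plain letters from the table above.
TOKEN = re.compile(r"i(?:\d+(?:ms|us|ns|t|i))?|[gls]\d*|[bxwpcred]")


def parse_itrace(opts: str) -> list[tuple[str, str]]:
    """Split an --itrace value into (letter, argument) pairs.

    The argument is an empty string when the letter has none.
    Raises ValueError on text this sketch does not recognise.
    """
    parsed = []
    pos = 0
    while pos < len(opts):
        match = TOKEN.match(opts, pos)
        if match is None:
            raise ValueError(f"unrecognised --itrace text at {opts[pos:]!r}")
        token = match.group(0)
        parsed.append((token[0], token[1:]))
        pos = match.end()
    return parsed


for example in ("ibxwpe", "i10us", "ig32", "i100usle", "i0nss1000000"):
    print(example, parse_itrace(example))
```

For instance, "i0nss1000000" splits into an "i" with period "0ns" (as often as possible) and an "s" with the count 1000000, which matches the description of skipping the first million instructions.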
830 | | -dump option |
---|
831 | | ------------ |
---|
832 | | - |
---|
833 | | -perf script has an option (-D) to "dump" the events i.e. display the binary |
---|
834 | | -data. |
---|
835 | | - |
---|
836 | | -When -D is used, Intel PT packets are displayed. The packet decoder does not |
---|
837 | | -pay attention to PSB packets, but just decodes the bytes - so the packets seen |
---|
838 | | -by the actual decoder may not be identical in places where the data is corrupt. |
---|
839 | | -One example of that would be when the buffer-switching interrupt has been too |
---|
840 | | -slow, and the buffer has been filled completely. In that case, the last packet |
---|
841 | | -in the buffer might be truncated and immediately followed by a PSB as the trace |
---|
842 | | -continues in the next buffer. |
---|
843 | | - |
---|
844 | | -To disable the display of Intel PT packets, combine the -D option with |
---|
845 | | ---no-itrace. |
---|
846 | | - |
---|
847 | | - |
---|
848 | | -perf report |
---|
849 | | -=========== |
---|
850 | | - |
---|
851 | | -By default, perf report will decode trace data found in the perf.data file. |
---|
852 | | -This can be further controlled by the new option --itrace, exactly the same as |
---|
853 | | -perf script, with the exception that the default is --itrace=igxe. |
---|
854 | | - |
---|
855 | | - |
---|
856 | | -perf inject |
---|
857 | | -=========== |
---|
858 | | - |
---|
859 | | -perf inject also accepts the --itrace option in which case tracing data is |
---|
860 | | -removed and replaced with the synthesized events. e.g. |
---|
861 | | - |
---|
862 | | - perf inject --itrace -i perf.data -o perf.data.new |
---|
863 | | - |
---|
864 | | -Below is an example of using Intel PT with autofdo. It requires autofdo |
---|
865 | | -(https://github.com/google/autofdo) and gcc version 5. The bubble |
---|
866 | | -sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) |
---|
867 | | -amended to take the number of elements as a parameter. |
---|
868 | | - |
---|
869 | | - $ gcc-5 -O3 sort.c -o sort_optimized |
---|
870 | | - $ ./sort_optimized 30000 |
---|
871 | | - Bubble sorting array of 30000 elements |
---|
872 | | - 2254 ms |
---|
873 | | - |
---|
874 | | - $ cat ~/.perfconfig |
---|
875 | | - [intel-pt] |
---|
876 | | - mispred-all = on |
---|
877 | | - |
---|
878 | | - $ perf record -e intel_pt//u ./sort 3000 |
---|
879 | | - Bubble sorting array of 3000 elements |
---|
880 | | - 58 ms |
---|
881 | | - [ perf record: Woken up 2 times to write data ] |
---|
882 | | - [ perf record: Captured and wrote 3.939 MB perf.data ] |
---|
883 | | - $ perf inject -i perf.data -o inj --itrace=i100usle --strip |
---|
884 | | - $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1 |
---|
885 | | - $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo |
---|
886 | | - $ ./sort_autofdo 30000 |
---|
887 | | - Bubble sorting array of 30000 elements |
---|
888 | | - 2155 ms |
---|
889 | | - |
---|
890 | | -Note there is currently no advantage to using Intel PT instead of LBR, but |
---|
891 | | -that may change in the future if greater use is made of the data. |
---|
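As a worked check of the two timings in the transcript above (2254 ms for sort_optimized and 2155 ms for sort_autofdo), the difference can be computed directly. Both figures come from single runs, so this is arithmetic on the quoted numbers rather than a benchmark result.

```python
# Timings quoted in the example transcript, in milliseconds.
baseline_ms = 2254   # sort_optimized, 30000 elements
autofdo_ms = 2155    # sort_autofdo, 30000 elements

saved_ms = baseline_ms - autofdo_ms
improvement = saved_ms / baseline_ms

print(f"{saved_ms} ms saved, {improvement:.1%} less run time")
```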
| 1 | +Documentation for support for Intel Processor Trace within perf tools has moved to file perf-intel-pt.txt |
---|