..	..	@@ -1,891 +1 @@
1		-Intel Processor Trace
2		-=====================
3		-
4		-Overview
5		-========
6		-
7		-Intel Processor Trace (Intel PT) is an extension of Intel Architecture that
8		-collects information about software execution such as control flow, execution
9		-modes and timings and formats it into highly compressed binary packets.
10		-Technical details are documented in the Intel 64 and IA-32 Architectures
11		-Software Developer Manuals, Chapter 36 Intel Processor Trace.
12		-
13		-Intel PT is first supported in Intel Core M and 5th generation Intel Core
14		-processors that are based on the Intel micro-architecture code name Broadwell.
15		-
16		-Trace data is collected by 'perf record' and stored within the perf.data file.
17		-See below for options to 'perf record'.
18		-
19		-Trace data must be 'decoded' which involves walking the object code and matching
20		-the trace data packets. For example a TNT packet only tells whether a
21		-conditional branch was taken or not taken, so to make use of that packet the
22		-decoder must know precisely which instruction was being executed.
23		-
24		-Decoding is done on-the-fly. The decoder outputs samples in the same format as
25		-samples output by perf hardware events, for example as though the "instructions"
26		-or "branches" events had been recorded. Presently 3 tools support this:
27		-'perf script', 'perf report' and 'perf inject'. See below for more information
28		-on using those tools.
29		-
30		-The main distinguishing feature of Intel PT is that the decoder can determine
31		-the exact flow of software execution. Intel PT can be used to understand why
32		-and how software got to a certain point, or behaved a certain way. The
33		-software does not have to be recompiled, so Intel PT works with debug or release
34		-builds. However, the executed images are needed, which makes use in JIT-compiled
35		-environments, or with self-modified code, a challenge. Also, symbols need to be
36		-provided to make sense of addresses.
37		-
38		-A limitation of Intel PT is that it produces huge amounts of trace data
39		-(hundreds of megabytes per second per core) which takes a long time to decode,
40		-for example two or three orders of magnitude longer than it took to collect.
41		-Another limitation is the performance impact of tracing, something that will
42		-vary depending on the use-case and architecture.
43		-
44		-
45		-Quickstart
46		-==========
47		-
48		-It is important to start small. That is because it is easy to capture vastly
49		-more data than can possibly be processed.
50		-
51		-The simplest thing to do with Intel PT is userspace profiling of small programs.
52		-Data is captured with 'perf record' e.g. to trace 'ls' userspace-only:
53		-
54		- perf record -e intel_pt//u ls
55		-
56		-And profiled with 'perf report' e.g.
57		-
58		- perf report
59		-
60		-To also trace kernel space presents a problem, namely kernel self-modifying
61		-code. A fairly good kernel image is available in /proc/kcore but to get an
62		-accurate image a copy of /proc/kcore needs to be made under the same conditions
63		-as the data capture. A script perf-with-kcore can do that, but beware that the
64		-script makes use of 'sudo' to copy /proc/kcore. If you have perf installed
65		-locally from the source tree you can do:
66		-
67		- ~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls
68		-
69		-which will create a directory named 'pt_ls' and put the perf.data file and
70		-copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. Then to use
71		-'perf report' becomes:
72		-
73		- ~/libexec/perf-core/perf-with-kcore report pt_ls
74		-
75		-Because samples are synthesized after-the-fact, the sampling period can be
76		-selected for reporting. e.g. sample every microsecond
77		-
78		- ~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge
79		-
80		-See the sections below for more information about the --itrace option.
81		-
82		-Beware: the smaller the period, the more samples are produced, and the
83		-longer it takes to process them.
84		-
85		-Also note that the coarseness of Intel PT timing information will start to
86		-distort the statistical value of the sampling as the sampling period becomes
87		-smaller.
88		-
89		-To represent software control flow, "branches" samples are produced. By default
90		-a branch sample is synthesized for every single branch. To get an idea what
91		-data is available you can use the 'perf script' tool with no parameters, which
92		-will list all the samples.
93		-
94		- perf record -e intel_pt//u ls
95		- perf script
96		-
97		-An interesting field that is not printed by default is 'flags' which can be
98		-displayed as follows:
99		-
100		- perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags
101		-
102		-The flags are "bcrosyiABEx" which stand for branch, call, return, conditional,
103		-system, asynchronous, interrupt, transaction abort, trace begin, trace end, and
104		-in transaction, respectively.
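
As a rough illustration of the flag letters listed above, the following Python sketch expands a flags string into the corresponding names. The helper name `describe_flags` is hypothetical and not part of perf, and the exact way a given perf version prints the 'flags' field may differ from a bare letter string.

```python
# Letter-to-name mapping taken from the list above ("bcrosyiABEx").
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
}


def describe_flags(flags):
    """Return the names for each flag letter; unknown letters are marked as such."""
    return [FLAG_NAMES.get(ch, "unknown(%s)" % ch) for ch in flags.strip()]


if __name__ == "__main__":
    print(describe_flags("bc"))
```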
105		-
106		-While it is possible to create scripts to analyze the data, an alternative
107		-approach is available to export the data to a sqlite or postgresql database.
108		-Refer to script export-to-sqlite.py or export-to-postgresql.py for more details,
109		-and to script call-graph-from-sql.py for an example of using the database.
110		-
111		-There is also script intel-pt-events.py which provides an example of how to
112		-unpack the raw data for power events and PTWRITE.
113		-
114		-As mentioned above, it is easy to capture too much data. One way to limit the
115		-data captured is to use 'snapshot' mode. Refer to 'new snapshot option' and
116		-'Intel PT modes of operation' further below.
117		-
118		-Another problem that will be experienced is decoder errors. They can be caused
119		-by inability to access the executed image, self-modified or JIT-ed code, or the
120		-inability to match side-band information (such as context switches and mmaps)
121		-which results in the decoder not knowing what code was executed.
122		-
123		-There is also the problem of perf not being able to copy the data fast enough,
124		-resulting in data lost because the buffer was full. See 'Buffer handling' below
125		-for more details.
126		-
127		-
128		-perf record
129		-===========
130		-
131		-new event
132		----------
133		-
134		-The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are
135		-selected by providing the PMU name followed by the "config" separated by slashes.
136		-An enhancement has been made to allow default "config" e.g. the option
137		-
138		- -e intel_pt//
139		-
140		-will use a default config value. Currently that is the same as
141		-
142		- -e intel_pt/tsc,noretcomp=0/
143		-
144		-which is the same as
145		-
146		- -e intel_pt/tsc=1,noretcomp=0/
147		-
148		-Note there are now new config terms - see section 'config terms' further below.
149		-
150		-The config terms are listed in /sys/devices/intel_pt/format. They are bit
151		-fields within the config member of the struct perf_event_attr which is
152		-passed to the kernel by the perf_event_open system call. They correspond to bit
153		-fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions:
154		-
155		- $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
156		- /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
157		- /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
158		- /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
159		- /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
160		- /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
161		- /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
162		- /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
163		-
164		-Note that the default config must be overridden for each term i.e.
165		-
166		- -e intel_pt/noretcomp=0/
167		-
168		-is the same as:
169		-
170		- -e intel_pt/tsc=1,noretcomp=0/
171		-
172		-So, to disable TSC packets use:
173		-
174		- -e intel_pt/tsc=0/
175		-
176		-It is also possible to specify the config value explicitly:
177		-
178		- -e intel_pt/config=0x400/
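
To make the relationship between config terms and the raw config value concrete, here is a minimal Python sketch that packs terms into a config word. The bit ranges are copied from the example format listing above and may differ on a real system; the function name `encode_config` is hypothetical and is not how perf itself is implemented.

```python
# Bit ranges (low, high) copied from the example format listing above.
# On a real system, read /sys/bus/event_source/devices/intel_pt/format/*.
FORMAT = {
    "cyc": (1, 1),
    "cyc_thresh": (19, 22),
    "mtc": (9, 9),
    "mtc_period": (14, 17),
    "noretcomp": (11, 11),
    "psb_period": (24, 27),
    "tsc": (10, 10),
}


def encode_config(**terms):
    """Pack term values into a config word according to FORMAT."""
    config = 0
    for name, value in terms.items():
        low, high = FORMAT[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError("%s=%d does not fit in %d bit(s)" % (name, value, width))
        config |= value << low
    return config


if __name__ == "__main__":
    # tsc occupies bit 10, so tsc=1 alone yields the 0x400 shown above.
    print(hex(encode_config(tsc=1, noretcomp=0)))
```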
179		-
180		-Note that, as with all events, the event is suffixed with event modifiers:
181		-
182		- u userspace
183		- k kernel
184		- h hypervisor
185		- G guest
186		- H host
187		- p precise ip
188		-
189		-'h', 'G' and 'H' are for virtualization which is not supported by Intel PT.
190		-'p' is also not relevant to Intel PT. So only options 'u' and 'k' are
191		-meaningful for Intel PT.
192		-
193		-perf_event_attr is displayed if the -vv option is used e.g.
194		-
195		- ------------------------------------------------------------
196		- perf_event_attr:
197		- type 6
198		- size 112
199		- config 0x400
200		- { sample_period, sample_freq } 1
201		- sample_type IP\|TID\|TIME\|CPU\|IDENTIFIER
202		- read_format ID
203		- disabled 1
204		- inherit 1
205		- exclude_kernel 1
206		- exclude_hv 1
207		- enable_on_exec 1
208		- sample_id_all 1
209		- ------------------------------------------------------------
210		- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
211		- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
212		- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
213		- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
214		- ------------------------------------------------------------
215		-
216		-
217		-config terms
218		-------------
219		-
220		-The June 2015 version of Intel 64 and IA-32 Architectures Software Developer
221		-Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features.
222		-Some of the features are reflected in new config terms. All the config terms are
223		-described below.
224		-
225		-tsc Always supported. Produces TSC timestamp packets to provide
226		- timing information. In some cases it is possible to decode
227		- without timing information, for example a per-thread context
228		- that does not overlap executable memory maps.
229		-
230		- The default config selects tsc (i.e. tsc=1).
231		-
232		-noretcomp Always supported. Disables "return compression" so a TIP packet
233		- is produced when a function returns. Causes more packets to be
234		- produced but might make decoding more reliable.
235		-
236		- The default config does not select noretcomp (i.e. noretcomp=0).
237		-
238		-psb_period Allows the frequency of PSB packets to be specified.
239		-
240		- The PSB packet is a synchronization packet that provides a
241		- starting point for decoding or recovery from errors.
242		-
243		- Support for psb_period is indicated by:
244		-
245		- /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
246		-
247		- which contains "1" if the feature is supported and "0"
248		- otherwise.
249		-
250		- Valid values are given by:
251		-
252		- /sys/bus/event_source/devices/intel_pt/caps/psb_periods
253		-
254		- which contains a hexadecimal value, the bits of which represent
255		- valid values e.g. bit 2 set means value 2 is valid.
256		-
257		- The psb_period value is converted to the approximate number of
258		- trace bytes between PSB packets as:
259		-
260		- 2 ^ (value + 11)
261		-
262		- e.g. value 3 means 16KiB between PSBs
263		-
264		- If an invalid value is entered, the error message
265		- will give a list of valid values e.g.
266		-
267		- $ perf record -e intel_pt/psb_period=15/u uname
268		- Invalid psb_period for intel_pt. Valid values are: 0-5
269		-
270		- If MTC packets are selected, the default config selects a value
271		- of 3 (i.e. psb_period=3) or the nearest lower value that is
272		- supported (0 is always supported). Otherwise the default is 0.
273		-
274		- If decoding is expected to be reliable and the buffer is large
275		- then a large PSB period can be used.
276		-
277		- Because a TSC packet is produced with PSB, the PSB period can
278		- also affect the granularity of timing information in the absence
279		- of MTC or CYC.
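
The psb_period arithmetic described above can be sketched in Python as follows. The mask "3f" is a hypothetical example chosen to match the "0-5" error message shown above; real values come from the caps/psb_periods file. Both function names are illustrative only.

```python
def valid_values(caps_hex):
    """Decode a caps bitmask: bit n set means value n is valid."""
    mask = int(caps_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def psb_bytes(value):
    """Approximate number of trace bytes between PSB packets."""
    return 2 ** (value + 11)


if __name__ == "__main__":
    print(valid_values("3f"))      # hypothetical mask: values 0 to 5 are valid
    print(psb_bytes(3) // 1024)    # value 3 expressed in KiB
```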
280		-
281		-mtc Produces MTC timing packets.
282		-
283		- MTC packets provide finer grain timestamp information than TSC
284		- packets. MTC packets record time using the hardware crystal
285		- clock (CTC) which is related to TSC packets using a TMA packet.
286		-
287		- Support for this feature is indicated by:
288		-
289		- /sys/bus/event_source/devices/intel_pt/caps/mtc
290		-
291		- which contains "1" if the feature is supported and
292		- "0" otherwise.
293		-
294		- The frequency of MTC packets can also be specified - see
295		- mtc_period below.
296		-
297		-mtc_period Specifies how frequently MTC packets are produced - see mtc
298		- above for how to determine if MTC packets are supported.
299		-
300		- Valid values are given by:
301		-
302		- /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
303		-
304		- which contains a hexadecimal value, the bits of which represent
305		- valid values e.g. bit 2 set means value 2 is valid.
306		-
307		- The mtc_period value is converted to the MTC frequency as:
308		-
309		- CTC-frequency / (2 ^ value)
310		-
311		- e.g. value 3 means one eighth of CTC-frequency
312		-
313		- Where CTC is the hardware crystal clock, the frequency of which
314		- can be related to TSC via values provided in cpuid leaf 0x15.
315		-
316		- If an invalid value is entered, the error message
317		- will give a list of valid values e.g.
318		-
319		- $ perf record -e intel_pt/mtc_period=15/u uname
320		- Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
321		-
322		- The default value is 3 or the nearest lower value
323		- that is supported (0 is always supported).
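
A small Python sketch of the mtc_period rules above. The CTC frequency of 24 MHz is an assumed example value (the real one is derived from cpuid leaf 0x15), and the function names are hypothetical.

```python
def mtc_frequency(ctc_hz, value):
    """MTC packet frequency for a given mtc_period value."""
    return ctc_hz / (2 ** value)


def default_mtc_period(valid, preferred=3):
    """Pick the preferred value, or the nearest lower supported value."""
    return max(v for v in valid if v <= preferred)


if __name__ == "__main__":
    CTC_HZ = 24_000_000  # assumed crystal clock frequency, for illustration only
    print(mtc_frequency(CTC_HZ, 3))          # one eighth of the CTC frequency
    print(default_mtc_period([0, 3, 6, 9]))  # 3 is supported, so it is chosen
    print(default_mtc_period([0, 2, 5]))     # falls back to the nearest lower value
```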
324		-
325		-cyc Produces CYC timing packets.
326		-
327		- CYC packets provide even finer grain timestamp information than
328		- MTC and TSC packets. A CYC packet contains the number of CPU
329		- cycles since the last CYC packet. Unlike MTC and TSC packets,
330		- CYC packets are only sent when another packet is also sent.
331		-
332		- Support for this feature is indicated by:
333		-
334		- /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
335		-
336		- which contains "1" if the feature is supported and
337		- "0" otherwise.
338		-
339		- The number of CYC packets produced can be reduced by specifying
340		- a threshold - see cyc_thresh below.
341		-
342		-cyc_thresh Specifies how frequently CYC packets are produced - see cyc
343		- above for how to determine if CYC packets are supported.
344		-
345		- Valid cyc_thresh values are given by:
346		-
347		- /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
348		-
349		- which contains a hexadecimal value, the bits of which represent
350		- valid values e.g. bit 2 set means value 2 is valid.
351		-
352		- The cyc_thresh value represents the minimum number of CPU cycles
353		- that must have passed before a CYC packet can be sent. The
354		- number of CPU cycles is:
355		-
356		- 2 ^ (value - 1)
357		-
358		- e.g. value 4 means 8 CPU cycles must pass before a CYC packet
359		- can be sent. Note a CYC packet is still only sent when another
360		- packet is sent, not at, e.g. every 8 CPU cycles.
361		-
362		- If an invalid value is entered, the error message
363		- will give a list of valid values e.g.
364		-
365		- $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
366		- Invalid cyc_thresh for intel_pt. Valid values are: 0-12
367		-
368		- CYC packets are not requested by default.
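
The cyc_thresh formula above can be written as a short Python sketch. Treating value 0 as "no threshold" is an assumption made here, since the formula above does not cover that case; the function name is hypothetical.

```python
def cyc_threshold_cycles(value):
    """Minimum CPU cycles that must pass before a CYC packet can be sent."""
    if value == 0:
        # Assumption: value 0 means no minimum; the documented formula
        # 2 ^ (value - 1) is only applied to values of 1 and above.
        return 0
    return 2 ** (value - 1)


if __name__ == "__main__":
    print(cyc_threshold_cycles(4))  # the example from the text: 8 CPU cycles
```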
369		-
370		-pt Specifies pass-through which enables the 'branch' config term.
371		-
372		- The default config selects 'pt' if it is available, so a user will
373		- never need to specify this term.
374		-
375		-branch Enable branch tracing. Branch tracing is enabled by default so to
376		- disable branch tracing use 'branch=0'.
377		-
378		- The default config selects 'branch' if it is available.
379		-
380		-ptw Enable PTWRITE packets which are produced when a ptwrite instruction
381		- is executed.
382		-
383		- Support for this feature is indicated by:
384		-
385		- /sys/bus/event_source/devices/intel_pt/caps/ptwrite
386		-
387		- which contains "1" if the feature is supported and
388		- "0" otherwise.
389		-
390		-fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet
391		- provides the address of the ptwrite instruction. In the absence of
392		- fup_on_ptw, the decoder will use the address of the previous branch
393		- if branch tracing is enabled, otherwise the address will be zero.
394		- Note that fup_on_ptw will work even when branch tracing is disabled.
395		-
396		-pwr_evt Enable power events. The power events provide information about
397		- changes to the CPU C-state.
398		-
399		- Support for this feature is indicated by:
400		-
401		- /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
402		-
403		- which contains "1" if the feature is supported and
404		- "0" otherwise.
405		-
406		-
407		-new snapshot option
408		--------------------
409		-
410		-The difference between full trace and snapshot from the kernel's perspective is
411		-that in full trace we don't overwrite trace data that the user hasn't collected
412		-yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let
413		-the trace run and overwrite older data in the buffer so that whenever something
414		-interesting happens, we can stop it and grab a snapshot of what was going on
415		-around that interesting moment.
416		-
417		-To select snapshot mode a new option has been added:
418		-
419		- -S
420		-
421		-Optionally it can be followed by the snapshot size e.g.
422		-
423		- -S0x100000
424		-
425		-The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size
426		-nor snapshot size is specified, then the default is 4MiB for privileged users
427		-(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
428		-If an unprivileged user does not specify mmap pages, the mmap pages will be
429		-reduced as described in the 'new auxtrace mmap size option' section below.
430		-
431		-The snapshot size is displayed if the option -vv is used e.g.
432		-
433		- Intel PT snapshot size: %zu
434		-
435		-
436		-new auxtrace mmap size option
437		----------------------------
438		-
439		-Intel PT buffer size is specified by an addition to the -m option e.g.
440		-
441		- -m,16
442		-
443		-selects a buffer size of 16 pages i.e. 64KiB.
444		-
445		-Note that the existing functionality of -m is unchanged. The auxtrace mmap size
446		-is specified by the optional addition of a comma and the value.
447		-
448		-The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users
449		-(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
450		-If an unprivileged user does not specify mmap pages, the mmap pages will be
451		-reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the
452		-user is likely to get an error as they exceed their mlock limit (Max locked
453		-memory as shown in /proc/self/limits). Note that perf does not count the first
454		-512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu
455		-against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus
456		-their mlock limit (which defaults to 64KiB but is not multiplied by the number
457		-of cpus).
458		-
459		-In full-trace mode, powers of two are allowed for buffer size, with a minimum
460		-size of 2 pages. In snapshot mode, it is the same but the minimum size is
461		-1 page.
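
The buffer size rules above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the 4096-byte page size is an assumption that matches the "16 pages i.e. 64KiB" example above.

```python
def is_valid_aux_pages(pages, snapshot=False):
    """Power-of-two page count; minimum 2 pages (full-trace) or 1 page (snapshot)."""
    minimum = 1 if snapshot else 2
    is_power_of_two = pages > 0 and (pages & (pages - 1)) == 0
    return is_power_of_two and pages >= minimum


def aux_bytes(pages, page_size=4096):
    """Buffer size in bytes for a page count (page_size is an assumed default)."""
    return pages * page_size


if __name__ == "__main__":
    print(is_valid_aux_pages(16), aux_bytes(16))   # -m,16 with 4KiB pages
    print(is_valid_aux_pages(1))                   # too small for full-trace mode
    print(is_valid_aux_pages(1, snapshot=True))    # allowed in snapshot mode
```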
462		-
463		-The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g.
464		-
465		- mmap length 528384
466		- auxtrace mmap length 4198400
467		-
468		-
469		-Intel PT modes of operation
470		----------------------------
471		-
472		-Intel PT can be used in 2 modes:
473		- full-trace mode
474		- snapshot mode
475		-
476		-Full-trace mode traces continuously e.g.
477		-
478		- perf record -e intel_pt//u uname
479		-
480		-Snapshot mode captures the available data when a signal is sent e.g.
481		-
482		- perf record -v -e intel_pt//u -S ./loopy 1000000000 &
483		- [1] 11435
484		- kill -USR2 11435
485		- Recording AUX area tracing snapshot
486		-
487		-Note that the signal sent is SIGUSR2.
488		-Note that "Recording AUX area tracing snapshot" is displayed because the -v
489		-option is used.
490		-
491		-The 2 modes cannot be used together.
492		-
493		-
494		-Buffer handling
495		----------------
496		-
497		-There may be buffer limitations (i.e. single ToPa entry) which means that actual
498		-buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to
499		-provide other sizes, and in particular an arbitrarily large size, multiple
500		-buffers are logically concatenated. However an interrupt must be used to switch
501		-between buffers. That has two potential problems:
502		- a) the interrupt may not be handled in time so that the current buffer
503		- becomes full and some trace data is lost.
504		- b) the interrupts may slow the system and affect the performance
505		- results.
506		-
507		-If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event
508		-which the tools report as an error.
509		-
510		-In full-trace mode, the driver waits for data to be copied out before allowing
511		-the (logical) buffer to wrap-around. If data is not copied out quickly enough,
512		-again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to
513		-wait, the intel_pt event gets disabled. Because it is difficult to know when
514		-that happens, perf tools always re-enable the intel_pt event after copying out
515		-data.
516		-
517		-
518		-Intel PT and build ids
519		-----------------------
520		-
521		-By default "perf record" post-processes the event stream to find all build ids
522		-for executables for all addresses sampled. Deliberately, Intel PT is not
523		-decoded for that purpose (it would take too long). Instead the build ids for
524		-all executables encountered (due to mmap, comm or task events) are included
525		-in the perf.data file.
526		-
527		-To see buildids included in the perf.data file use the command:
528		-
529		- perf buildid-list
530		-
531		-If the perf.data file contains Intel PT data, that is the same as:
532		-
533		- perf buildid-list --with-hits
534		-
535		-
536		-Snapshot mode and event disabling
537		----------------------------------
538		-
539		-In order to make a snapshot, the intel_pt event is disabled using an IOCTL,
540		-namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the
541		-collection of side-band information. In order to prevent that, a dummy
542		-software event has been introduced that permits tracking events (like mmaps) to
543		-continue to be recorded while intel_pt is disabled. That is important to ensure
544		-there is complete side-band information to allow the decoding of subsequent
545		-snapshots.
546		-
547		-A test has been created for that. To find the test:
548		-
549		- perf test list
550		- ...
551		- 23: Test using a dummy software event to keep tracking
552		-
553		-To run the test:
554		-
555		- perf test 23
556		- 23: Test using a dummy software event to keep tracking : Ok
557		-
558		-
559		-perf record modes (nothing new here)
560		-------------------------------------
561		-
562		-perf record essentially operates in one of three modes:
563		- per thread
564		- per cpu
565		- workload only
566		-
567		-"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a
568		-workload).
569		-"per cpu" is selected by -C or -a.
570		-"workload only" mode is selected by not using the other options but providing a
571		-command to run (i.e. the workload).
572		-
573		-In per-thread mode an exact list of threads is traced. There is no inheritance.
574		-Each thread has its own event buffer.
575		-
576		-In per-cpu mode all processes (or processes from the selected cgroup i.e. -G
577		-option, or processes selected with -p or -u) are traced. Each cpu has its own
578		-buffer. Inheritance is allowed.
579		-
580		-In workload-only mode, the workload is traced but with per-cpu buffers.
581		-Inheritance is allowed. Note that you can now trace a workload in per-thread
582		-mode by using the --per-thread option.
583		-
584		-
585		-Privileged vs non-privileged users
586		-----------------------------------
587		-
588		-Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users
589		-have memory limits imposed upon them. That affects what buffer sizes they can
590		-have as outlined above.
591		-
592		-The v4.2 kernel introduced support for a context switch metadata event,
593		-PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes
594		-are scheduled out and in, but not by whom. That information is left to
595		-PERF_RECORD_SWITCH_CPU_WIDE, which is only accessible in system-wide context,
596		-which in turn requires CAP_SYS_ADMIN.
597		-
598		-For further information, see commit 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH
599		-to indicate context switches"), which introduces these metadata events.
600		-
601		-When working with kernels < v4.2, the following considerations apply,
602		-as the sched:sched_switch tracepoints will be used to receive such information:
603		-
604		-Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are
605		-not permitted to use tracepoints which means there is insufficient side-band
606		-information to decode Intel PT in per-cpu mode, and potentially workload-only
607		-mode too if the workload creates new processes.
608		-
609		-Note also, that to use tracepoints, read-access to debugfs is required. So if
610		-debugfs is not mounted or the user does not have read-access, it will again not
611		-be possible to decode Intel PT in per-cpu mode.
612		-
613		-
614		-sched_switch tracepoint
615		------------------------
616		-
617		-The sched_switch tracepoint is used to provide side-band data for Intel PT
618		-decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't
619		-available.
620		-
621		-The sched_switch events are automatically added. e.g. the second event shown
622		-below:
623		-
624		- $ perf record -vv -e intel_pt//u uname
625		- ------------------------------------------------------------
626		- perf_event_attr:
627		- type 6
628		- size 112
629		- config 0x400
630		- { sample_period, sample_freq } 1
631		- sample_type IP\|TID\|TIME\|CPU\|IDENTIFIER
632		- read_format ID
633		- disabled 1
634		- inherit 1
635		- exclude_kernel 1
636		- exclude_hv 1
637		- enable_on_exec 1
638		- sample_id_all 1
639		- ------------------------------------------------------------
640		- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
641		- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
642		- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
643		- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
644		- ------------------------------------------------------------
645		- perf_event_attr:
646		- type 2
647		- size 112
648		- config 0x108
649		- { sample_period, sample_freq } 1
650		- sample_type IP\|TID\|TIME\|CPU\|PERIOD\|RAW\|IDENTIFIER
651		- read_format ID
652		- inherit 1
653		- sample_id_all 1
654		- exclude_guest 1
655		- ------------------------------------------------------------
656		- sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
657		- sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
658		- sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
659		- sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
660		- ------------------------------------------------------------
661		- perf_event_attr:
662		- type 1
663		- size 112
664		- config 0x9
665		- { sample_period, sample_freq } 1
666		- sample_type IP\|TID\|TIME\|IDENTIFIER
667		- read_format ID
668		- disabled 1
669		- inherit 1
670		- exclude_kernel 1
671		- exclude_hv 1
672		- mmap 1
673		- comm 1
674		- enable_on_exec 1
675		- task 1
676		- sample_id_all 1
677		- mmap2 1
678		- comm_exec 1
679		- ------------------------------------------------------------
680		- sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
681		- sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
682		- sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
683		- sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
684		- mmap size 528384B
685		- AUX area mmap length 4194304
686		- perf event ring buffer mmapped per cpu
687		- Synthesizing auxtrace information
688		- Linux
689		- [ perf record: Woken up 1 times to write data ]
690		- [ perf record: Captured and wrote 0.042 MB perf.data ]
691		-
692		-Note, the sched_switch event is only added if the user is permitted to use it
693		-and only in per-cpu mode.
694		-
695		-Note also, the sched_switch event is only added if TSC packets are requested.
696		-That is because, in the absence of timing information, the sched_switch events
697		-cannot be matched against the Intel PT trace.
698		-
699		-
700		-perf script
701		-===========
702		-
703		-By default, perf script will decode trace data found in the perf.data file.
704		-This can be further controlled by new option --itrace.
705		-
706		-
707		-New --itrace option
708		--------------------
709		-
710		-Having no option is the same as
711		-
712		- --itrace
713		-
714		-which, in turn, is the same as
715		-
716		- --itrace=ibxwpe
717		-
718		-The letters are:
719		-
720		- i synthesize "instructions" events
721		- b synthesize "branches" events
722		- x synthesize "transactions" events
723		- w synthesize "ptwrite" events
724		- p synthesize "power" events
725		- c synthesize branches events (calls only)
726		- r synthesize branches events (returns only)
727		- e synthesize tracing error events
728		- d create a debug log
729		- g synthesize a call chain (use with i or x)
730		- l synthesize last branch entries (use with i or x)
731		- s skip initial number of events
732		-
733		-"Instructions" events look like they were recorded by "perf record -e
734		-instructions".
735		-
736		-"Branches" events look like they were recorded by "perf record -e branches". "c"
737		-and "r" can be combined to get calls and returns.
738		-
739		-"Transactions" events correspond to the start or end of transactions. The
740		-'flags' field can be used in perf script to determine whether the event is a
741		-transaction start, commit or abort.
742		-
743		-Note that "instructions", "branches" and "transactions" events depend on code
744		-flow packets which can be disabled by using the config term "branch=0". Refer
745		-to the config terms section above.
746		-
747		-"ptwrite" events record the payload of the ptwrite instruction and whether
748		-"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are
749		-recorded only if the "ptw" config term was used. Refer to the config terms
750		-section above. perf script "synth" field displays "ptwrite" information like
751		-this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was
752		-used.
753		-
754		-"Power" events correspond to power event packets and CBR (core-to-bus ratio)
755		-packets. While CBR packets are always recorded when tracing is enabled, power
756		-event packets are recorded only if the "pwr_evt" config term was used. Refer to
757		-the config terms section above. The power events record information about
758		-C-state changes, whereas CBR is indicative of CPU frequency. perf script
759		-"event,synth" fields display information like this:
760		- cbr: cbr: 22 freq: 2189 MHz (200%)
761		- mwait: hints: 0x60 extensions: 0x1
762		- pwre: hw: 0 cstate: 2 sub-cstate: 0
763		- exstop: ip: 1
764		- pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4
765		-Where:
766		- "cbr" includes the frequency and the percentage of maximum non-turbo
767		- "mwait" shows mwait hints and extensions
768		- "pwre" shows C-state transitions (to a C-state deeper than C0) and
769		- whether initiated by hardware
770		- "exstop" indicates execution stopped and whether the IP was recorded
771		- exactly
772		- "pwrx" indicates return to C0
773		-For more details refer to the Intel 64 and IA-32 Architectures Software
774		-Developer Manuals.
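The "cbr" line in the sample output above can be picked apart with a short script. This is only a sketch: the helper name `parse_cbr` is hypothetical, and the pattern assumes the text looks exactly like the "cbr: 22 freq: 2189 MHz (200%)" example, which may not hold for every perf version.

```python
import re

# Pattern derived from the single example line in the documentation;
# treat it as an assumption rather than a stable output format.
CBR_RE = re.compile(r"cbr:\s*(\d+)\s+freq:\s*(\d+)\s*MHz\s+\((\d+)%\)")


def parse_cbr(synth):
    """Return (ratio, freq_mhz, percent_of_max_non_turbo) or None."""
    match = CBR_RE.search(synth)
    if match is None:
        return None
    ratio, freq_mhz, percent = (int(g) for g in match.groups())
    return ratio, freq_mhz, percent
```

Applied to the example line, the three values are the core-to-bus ratio, the frequency in MHz and the percentage of maximum non-turbo.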
775		-
776		-Error events show where the decoder lost the trace. They matter because
777		-users need to know whether what they are seeing is the complete picture
778		-or not.
779		-
780		-The "d" option will cause the creation of a file "intel_pt.log" containing all
781		-decoded packets and instructions. Note that this option slows down the decoder
782		-and that the resulting file may be very large.
783		-
784		-In addition, the period of the "instructions" event can be specified. e.g.
785		-
786		- --itrace=i10us
787		-
788		-sets the period to 10us i.e. one instruction sample is synthesized for each 10
789		-microseconds of trace. Alternatives to "us" are "ms" (milliseconds),
790		-"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions).
791		-
792		-"ms", "us" and "ns" are converted to TSC ticks.
793		-
794		-The timing information included with Intel PT does not give the time of every
795		-instruction. Consequently, for the purpose of sampling, the decoder estimates
796		-the time since the last timing packet based on 1 tick per instruction. The time
797		-on the sample is not adjusted and reflects the last known value of TSC.
798		-
799		-For Intel PT, the default period is 100us.
800		-
801		-Setting it to a zero period means "as often as possible".
802		-
803		-In the case of Intel PT that is the same as a period of 1 and a unit of
804		-'instructions' (i.e. --itrace=i1i).
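The period syntax described above can be summarised with a small Python sketch. It handles only a lone "i" option with an explicit unit (for example "i10us" or "i1i"), not the full --itrace grammar. The function names and the `tsc_hz` parameter are hypothetical: perf derives its own TSC conversion from the recorded data, so the arithmetic here only illustrates the idea that "ms", "us" and "ns" are converted to TSC ticks.

```python
import re

# Nanoseconds per time unit, for the units listed in the text.
_TIME_UNITS_NS = {"ms": 1_000_000, "us": 1_000, "ns": 1}


def parse_i_period(spec):
    """Split a spec such as 'i10us' into (value, unit)."""
    match = re.fullmatch(r"i(\d+)(ms|us|ns|t|i)", spec)
    if match is None:
        raise ValueError("unsupported period spec: %r" % spec)
    return int(match.group(1)), match.group(2)


def period_to_tsc_ticks(value, unit, tsc_hz):
    """Convert a time or tick period to TSC ticks, assuming a fixed tsc_hz."""
    if unit in _TIME_UNITS_NS:
        return value * _TIME_UNITS_NS[unit] * tsc_hz // 1_000_000_000
    if unit == "t":
        return value  # already in TSC ticks
    # Unit "i" counts instructions, so there is no tick value to return.
    raise ValueError("instruction-count periods are not expressed in ticks")
```

With an assumed 2 GHz TSC, `period_to_tsc_ticks(*parse_i_period("i10us"), 2_000_000_000)` evaluates to 20000 ticks.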
805		-
806		-Also the call chain size (default 16, max. 1024) for instructions or
807		-transactions events can be specified. e.g.
808		-
809		- --itrace=ig32
810		- --itrace=xg32
811		-
812		-Also the number of last branch entries (default 64, max. 1024) for instructions or
813		-transactions events can be specified. e.g.
814		-
815		- --itrace=il10
816		- --itrace=xl10
817		-
818		-Note that last branch entries are cleared for each sample, so there is no overlap
819		-from one sample to the next.
820		-
821		-To disable trace decoding entirely, use the option --no-itrace.
822		-
823		-It is also possible to skip events generated (instructions, branches, transactions)
824		-at the beginning. This is useful to ignore initialization code.
825		-
826		- --itrace=i0nss1000000
827		-
828		-skips the first million instructions.
829		-
830		-dump option
831		------------
832		-
833		-perf script has an option (-D) to "dump" the events i.e. display the binary
834		-data.
835		-
836		-When -D is used, Intel PT packets are displayed. The packet decoder does not
837		-pay attention to PSB packets, but just decodes the bytes - so the packets seen
838		-by the actual decoder may not be identical in places where the data is corrupt.
839		-One example of that would be when the buffer-switching interrupt has been too
840		-slow, and the buffer has been filled completely. In that case, the last packet
841		-in the buffer might be truncated and immediately followed by a PSB as the trace
842		-continues in the next buffer.
843		-
844		-To disable the display of Intel PT packets, combine the -D option with
845		---no-itrace.
846		-
847		-
848		-perf report
849		-===========
850		-
851		-By default, perf report will decode trace data found in the perf.data file.
852		-This can be further controlled by the --itrace option, which works exactly as
853		-it does for perf script, except that the default is --itrace=igxe.
854		-
855		-
856		-perf inject
857		-===========
858		-
859		-perf inject also accepts the --itrace option in which case tracing data is
860		-removed and replaced with the synthesized events. e.g.
861		-
862		- perf inject --itrace -i perf.data -o perf.data.new
863		-
864		-Below is an example of using Intel PT with autofdo. It requires autofdo
865		-(https://github.com/google/autofdo) and gcc version 5. The bubble
866		-sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)
867		-amended to take the number of elements as a parameter.
868		-
869		- $ gcc-5 -O3 sort.c -o sort_optimized
870		- $ ./sort_optimized 30000
871		- Bubble sorting array of 30000 elements
872		- 2254 ms
873		-
874		- $ cat ~/.perfconfig
875		- [intel-pt]
876		- mispred-all = on
877		-
878		- $ perf record -e intel_pt//u ./sort 3000
879		- Bubble sorting array of 3000 elements
880		- 58 ms
881		- [ perf record: Woken up 2 times to write data ]
882		- [ perf record: Captured and wrote 3.939 MB perf.data ]
883		- $ perf inject -i perf.data -o inj --itrace=i100usle --strip
884		- $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
885		- $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
886		- $ ./sort_autofdo 30000
887		- Bubble sorting array of 30000 elements
888		- 2155 ms
889		-
890		-Note there is currently no advantage to using Intel PT instead of LBR, but
891		-that may change in the future if greater use is made of the data.
	1	+Documentation for Intel Processor Trace support within perf tools has moved to the file perf-intel-pt.txt