@@ -4,13 +4,13 @@
 Concepts overview
 =================
 
-The memory management in Linux is complex system that evolved over the
-years and included more and more functionality to support variety of
+The memory management in Linux is a complex system that evolved over the
+years and included more and more functionality to support a variety of
 systems from MMU-less microcontrollers to supercomputers. The memory
-management for systems without MMU is called ``nommu`` and it
+management for systems without an MMU is called ``nommu`` and it
 definitely deserves a dedicated document, which hopefully will be
 eventually written. Yet, although some of the concepts are the same,
-here we assume that MMU is available and CPU can translate a virtual
+here we assume that an MMU is available and a CPU can translate a virtual
 address to a physical address.
 
 .. contents:: :local:
@@ -21,10 +21,10 @@
 The physical memory in a computer system is a limited resource and
 even for systems that support memory hotplug there is a hard limit on
 the amount of memory that can be installed. The physical memory is not
-necessary contiguous, it might be accessible as a set of distinct
+necessarily contiguous; it might be accessible as a set of distinct
 address ranges. Besides, different CPU architectures, and even
-different implementations of the same architecture have different view
-how these address ranges defined.
+different implementations of the same architecture have different views
+of how these address ranges are defined.
 
 All this makes dealing directly with physical memory quite complex and
 to avoid this complexity a concept of virtual memory was developed.
@@ -35,7 +35,7 @@
 protection and controlled sharing of data between processes.
 
 With virtual memory, each and every memory access uses a virtual
-address. When the CPU decodes the an instruction that reads (or
+address. When the CPU decodes an instruction that reads (or
 writes) from (or to) the system memory, it translates the `virtual`
 address encoded in that instruction to a `physical` address that the
 memory controller can understand.
@@ -48,8 +48,8 @@
 
 Each physical memory page can be mapped as one or more virtual
 pages. These mappings are described by page tables that allow
-translation from virtual address used by programs to real address in
-the physical memory. The page tables organized hierarchically.
+translation from a virtual address used by programs to the physical
+memory address. The page tables are organized hierarchically.
 
 The tables at the lowest level of the hierarchy contain physical
 addresses of actual pages used by the software. The tables at higher
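
As a rough illustration of the hierarchy described in the hunk above, the
sketch below splits a virtual address into per-level table indices the way a
typical four-level, 4 KiB-page configuration (for example x86-64) does; the
shift and mask constants are illustrative, not the kernel's own macros::

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative four-level split for 4 KiB pages: 9 index bits per level,
   * 12 offset bits.  Same idea as the kernel's page tables, not its code. */
  #define PAGE_SHIFT  12
  #define LEVEL_BITS   9
  #define LEVEL_MASK  ((1UL << LEVEL_BITS) - 1)

  int main(void)
  {
      uint64_t vaddr = 0x00007f1234567abcULL;   /* an arbitrary user address */

      unsigned int pgd = (vaddr >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK;
      unsigned int pud = (vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
      unsigned int pmd = (vaddr >> (PAGE_SHIFT + 1 * LEVEL_BITS)) & LEVEL_MASK;
      unsigned int pte = (vaddr >> PAGE_SHIFT) & LEVEL_MASK;
      unsigned int off = vaddr & ((1UL << PAGE_SHIFT) - 1);

      /* Each index selects an entry in one table of the hierarchy; the last
       * level yields the physical page and 'off' the byte within it. */
      printf("pgd=%u pud=%u pmd=%u pte=%u offset=%u\n", pgd, pud, pmd, pte, off);
      return 0;
  }
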
@@ -121,8 +121,8 @@
 Many multi-processor machines are NUMA - Non-Uniform Memory Access -
 systems. In such systems the memory is arranged into banks that have
 different access latency depending on the "distance" from the
-processor. Each bank is referred as `node` and for each node Linux
-constructs an independent memory management subsystem. A node has it's
+processor. Each bank is referred to as a `node` and for each node Linux
+constructs an independent memory management subsystem. A node has its
 own set of zones, lists of free and used pages and various statistics
 counters. You can find more details about NUMA in
 :ref:`Documentation/vm/numa.rst <numa>` and in
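
To make the per-node structure tangible, here is a minimal userspace sketch
using libnuma (assuming the library is installed; link with ``-lnuma``) that
places an allocation on a chosen node::

  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this system\n");
          return 1;
      }

      int node = 0;                 /* first node; real code would pick one */
      size_t size = 1 << 20;        /* 1 MiB */
      void *buf = numa_alloc_onnode(size, node);
      if (!buf)
          return 1;

      memset(buf, 0, size);         /* touch it so pages are actually allocated */
      printf("allocated %zu bytes on node %d of %d\n",
             size, node, numa_max_node() + 1);

      numa_free(buf, size);
      return 0;
  }
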
@@ -149,9 +149,9 @@
 call. Usually, the anonymous mappings only define virtual memory areas
 that the program is allowed to access. The read accesses will result
 in creation of a page table entry that references a special physical
-page filled with zeroes. When the program performs a write, regular
+page filled with zeroes. When the program performs a write, a regular
 physical page will be allocated to hold the written data. The page
-will be marked dirty and if the kernel will decide to repurpose it,
+will be marked dirty and if the kernel decides to repurpose it,
 the dirty page will be swapped out.
 
 Reclaim
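
The behaviour of anonymous mappings described in the hunk above (reads are
backed by the shared zero page, the first write faults in a real page) can be
observed with a plain POSIX ``mmap``; a minimal sketch::

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
      size_t len = 4096;

      /* Anonymous, private mapping: only a virtual area is set up here. */
      unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
          return 1;

      /* A read is satisfied from the shared zero page... */
      printf("first byte before write: %d\n", p[0]);

      /* ...while the first write allocates a real, writable page. */
      p[0] = 42;
      printf("first byte after write:  %d\n", p[0]);

      munmap(p, len);
      return 0;
  }
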
@@ -181,8 +181,8 @@
 The process of freeing the reclaimable physical memory pages and
 repurposing them is called (surprise!) `reclaim`. Linux can reclaim
 pages either asynchronously or synchronously, depending on the state
-of the system. When system is not loaded, most of the memory is free
-and allocation request will be satisfied immediately from the free
+of the system. When the system is not loaded, most of the memory is free
+and allocation requests will be satisfied immediately from the free
 pages supply. As the load increases, the amount of the free pages goes
 down and when it reaches a certain threshold (high watermark), an
 allocation request will awaken the ``kswapd`` daemon. It will
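
The watermarks that drive this behaviour are visible per zone in
``/proc/zoneinfo``; the sketch below simply prints them (the ``min``, ``low``
and ``high`` field names follow what current kernels emit and are shown here
purely for illustration, not as a stable parsing recipe)::

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      FILE *f = fopen("/proc/zoneinfo", "r");
      char line[256];

      if (!f)
          return 1;

      while (fgets(line, sizeof(line), f)) {
          char key[32];
          long pages;

          /* Zone headers look like "Node 0, zone   Normal". */
          if (strncmp(line, "Node", 4) == 0)
              fputs(line, stdout);
          /* The watermarks are plain "min/low/high <pages>" lines. */
          else if (sscanf(line, " %31s %ld", key, &pages) == 2 &&
                   (!strcmp(key, "min") || !strcmp(key, "low") ||
                    !strcmp(key, "high")))
              printf("  %-4s %ld pages\n", key, pages);
      }

      fclose(f);
      return 0;
  }
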
@@ -190,7 +190,7 @@
 they contain is available elsewhere, or evict to the backing storage
 device (remember those dirty pages?). As memory usage increases even
 more and reaches another threshold - min watermark - an allocation
-will trigger the `direct reclaim`. In this case allocation is stalled
+will trigger `direct reclaim`. In this case allocation is stalled
 until enough memory pages are reclaimed to satisfy the request.
 
 Compaction
@@ -200,7 +200,7 @@
 fragmented. Although with virtual memory it is possible to present
 scattered physical pages as virtually contiguous range, sometimes it is
 necessary to allocate large physically contiguous memory areas. Such
-need may arise, for instance, when a device driver requires large
+need may arise, for instance, when a device driver requires a large
 buffer for DMA, or when THP allocates a huge page. Memory `compaction`
 addresses the fragmentation issue. This mechanism moves occupied pages
 from the lower part of a memory zone to free pages in the upper part
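
In addition to the reclaim-time and ``kcompactd`` paths covered by the
surrounding hunks, compaction can also be requested explicitly: with
``CONFIG_COMPACTION`` enabled, writing ``1`` to ``/proc/sys/vm/compact_memory``
(as root) asks the kernel to compact all zones. A trivial sketch::

  #include <stdio.h>

  int main(void)
  {
      /* Needs root and a kernel built with CONFIG_COMPACTION. */
      FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

      if (!f) {
          perror("compact_memory");
          return 1;
      }
      fputs("1\n", f);    /* writing 1 triggers compaction of all zones */
      fclose(f);
      return 0;
  }
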
@@ -208,15 +208,16 @@
 together at the beginning of the zone and allocations of large
 physically contiguous areas become possible.
 
-Like reclaim, the compaction may happen asynchronously in ``kcompactd``
-daemon or synchronously as a result of memory allocation request.
+Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
+daemon or synchronously as a result of a memory allocation request.
 
 OOM killer
 ==========
 
-It may happen, that on a loaded machine memory will be exhausted. When
-the kernel detects that the system runs out of memory (OOM) it invokes
-`OOM killer`. Its mission is simple: all it has to do is to select a
-task to sacrifice for the sake of the overall system health. The
-selected task is killed in a hope that after it exits enough memory
-will be freed to continue normal operation.
+It is possible that on a loaded machine memory will be exhausted and the
+kernel will be unable to reclaim enough memory to continue to operate. In
+order to save the rest of the system, it invokes the `OOM killer`.
+
+The `OOM killer` selects a task to sacrifice for the sake of the overall
+system health. The selected task is killed in a hope that after it exits
+enough memory will be freed to continue normal operation.
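
Which task gets sacrificed can be influenced from userspace through
``/proc/<pid>/oom_score_adj`` (range -1000 to 1000); a minimal sketch that
volunteers the calling process as the preferred victim::

  #include <stdio.h>

  int main(void)
  {
      /* oom_score_adj ranges from -1000 (never kill) to 1000 (kill first). */
      FILE *f = fopen("/proc/self/oom_score_adj", "w");

      if (!f) {
          perror("oom_score_adj");
          return 1;
      }
      fputs("1000\n", f);
      fclose(f);

      /* From here on, this process is the most likely OOM victim. */
      return 0;
  }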