Linux mem 1.1 User-Space Process Address Space Layout --- mmap() in Detail


Table of Contents

1. Principles
  1.1 Mapping
  1.2 VMA management
  1.3 mmap
  1.4 Page fault
  1.5 layout
  1.6 brk
2. Code walkthrough
  2.1 Key data structures
  2.2 mmap()
    2.2.1 mm->mmap_base (mmap base address)
    2.2.2 get_unmapped_area()
      2.2.2.1 arch_get_unmapped_area()
      2.2.2.2 arch_get_unmapped_area_topdown()
      2.2.2.3 file->f_op->get_unmapped_area
    2.2.3 mmap_region()
      2.2.3.1 vma_link()
    2.2.4 do_munmap()
    2.2.5 mm_populate()
  2.3 Page fault handling
    2.3.1 do_page_fault()
    2.3.2 faultin_page()
      2.3.2.1 do_anonymous_page()
      2.3.2.2 do_fault()
      2.3.2.3 do_read_fault()
      2.3.2.4 do_cow_fault()
      2.3.2.5 do_shared_fault()
      2.3.2.6 do_swap_page()
References

1. Principles

For privilege management, Linux divides the virtual address range into a user address space and a kernel address space. The two are usually the same size, each taking half of the total address space.

For example:

Address mode        Per-half size   User address space                            Kernel address space
32-bit              2 GB            0x00000000 - 0x7FFFFFFF                       0x80000000 - 0xFFFFFFFF
64-bit (48-bit VA)  128 TB          0x00000000 00000000 - 0x00007FFF FFFFFFFF     0xFFFF8000 00000000 - 0xFFFFFFFF FFFFFFFF
64-bit (57-bit VA)  64 PB           0x00000000 00000000 - 0x00FFFFFF FFFFFFFF     0xFF000000 00000000 - 0xFFFFFFFF FFFFFFFF

This article discusses the user address space first.

1.1 Mapping

In general, an address space is about establishing mappings between virtual addresses and physical addresses. In the user address space there is one more important participant in the mapping: the file. User-space mappings therefore fall into two categories: file-backed mappings (virtual address, physical address, file) and anonymous mappings (virtual address, physical address).

1. File-backed mapping

The detailed mapping relationship between virtual addresses, physical addresses and the file is shown in the figure below:

Why drag files into the address-space mapping at all? Because memory is usually a much scarcer resource than file storage, the file is treated as a slow backing store for memory:
1. At mapping time a "lazy" allocation strategy is used: when the virtual address range is allocated, no physical memory is allocated and no MMU mapping is set up. Only when the virtual address is actually accessed does a page fault occur, and the fault handler then allocates physical memory, reads in the data and establishes the MMU mapping.
2. When the system runs short of memory and has to reclaim, the physical pages of a file-backed mapping can simply be released, because a copy still exists in the file; when the pages are used again they are reloaded through another page fault.

Why doesn't kernel space use this strategy?
1. The kernel maps very few files (essentially just vmlinux), so it does not consume much memory.
2. Kernel code generally requires fast, predictable response; demand paging would make its latency unpredictable.
3. Kernel code may be running inside all kinds of complicated lock contexts, and taking a page fault there could trigger further exceptions.

2. Anonymous mapping

Ranges that have no slow backing store can only be mapped anonymously. Such ranges usually hold data: global data, the stack, the heap. They are handled as follows:
1. Allocation can still be "lazy": the first real access triggers a page fault, and the handler allocates a zero-filled physical page and sets up the MMU mapping.
2. During memory reclaim, the data in these ranges cannot simply be dropped; only after it has been swapped out to a swap file/zram can the mapped physical memory be freed.
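Both kinds of mapping, and the lazy behaviour just described, can be exercised directly from user space. Below is a minimal sketch (the file path /etc/passwd is only an example); the mmap() calls merely create VMAs, and the physical pages are faulted in on first access:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    /* File-backed mapping: the file acts as the slow backing store. */
    int fd = open("/etc/passwd", O_RDONLY);           /* example path, assumption */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    char *fmap = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (fmap == MAP_FAILED) { perror("mmap file"); return 1; }

    /* Anonymous mapping: no backing file, pages come zero-filled on first touch. */
    size_t len = 4 * 4096;
    char *amap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (amap == MAP_FAILED) { perror("mmap anon"); return 1; }

    /* The two mmap() calls above only created VMAs; the page faults
     * happen here, on the first access. */
    printf("first byte of file: %c\n", fmap[0]);
    amap[0] = 42;

    munmap(fmap, st.st_size);
    munmap(amap, len);
    close(fd);
    return 0;
}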

1.2 VMA management

The user address space is bound to a process; every process has its own independent user address space. A process manages it with the mm structure (a kernel thread has no user address space, so its mm is NULL).

active_mm: mm points to the memory descriptor the process owns, while active_mm points to the memory descriptor used while the process is running. For an ordinary process the two fields are identical. A kernel thread, however, has no memory descriptor, so its mm is always NULL; when a kernel thread gets to run, its active_mm is set to the active_mm of the previously running process (see the schedule() function).

A process uses a vma to manage each range of virtual addresses. The key members of the vma structure:

struct vm_area_struct {
    unsigned long vm_start;   // start of the virtual address range
    unsigned long vm_end;     // first address after the end of the range (not a length)
    unsigned long vm_pgoff;   // file offset (in pages) corresponding to vm_start
    struct file *vm_file;     // mapped file backing this range (NULL for anonymous)
};

The mm structure manages all the vmas with a red-black tree, and also strings all the vmas together on a linked list:

The common operations on vmas are lookup (find_vma()) and insertion (vma_link()).

find_vma(mm, addr) deserves special attention. Its name suggests that it returns a vma satisfying vma->vm_start <= addr < vma->vm_end, but what it actually returns is the first vma that satisfies addr < vma->vm_end. So even if find_vma() returned a vma, that does not mean addr falls inside that vma. This is a very easy mistake to make!
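A small user-space model of these semantics (not kernel code, just an illustration of the extra check a caller needs to make):

#include <stdio.h>

struct vma { unsigned long vm_start, vm_end; };

/* Model of find_vma(): return the first vma (lowest vm_end) with addr < vm_end,
 * or NULL if none. The real kernel walks an rbtree; a sorted array is enough here. */
static struct vma *find_vma_model(struct vma *v, int n, unsigned long addr)
{
    for (int i = 0; i < n; i++)
        if (addr < v[i].vm_end)
            return &v[i];
    return NULL;
}

int main(void)
{
    struct vma vmas[] = { { 0x1000, 0x2000 }, { 0x8000, 0x9000 } };
    unsigned long addr = 0x5000;          /* lies in the hole between the two VMAs */

    struct vma *vma = find_vma_model(vmas, 2, addr);
    /* vma is the [0x8000, 0x9000) area: found, but addr is NOT inside it. */
    if (vma && addr >= vma->vm_start)
        printf("addr is inside the vma\n");
    else
        printf("addr is not covered by any vma (it hit a hole)\n");
    return 0;
}

The in-kernel callers perform exactly this kind of follow-up check, for example the "if (end - len >= addr && (!vma || addr + len <= vm_start_gap(vma)))" test in arch_get_unmapped_area() later in this article.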

1.3 mmap

The sections above provided the background; the concrete work of building a user-space mapping starts with mmap(), and mmap operations are used heavily throughout a process's execution.

Normally mmap() only allocates a vma range and records the association between the vma and the file; it does not establish a real mapping right away. The follow-up steps (allocating physical pages, copying file contents, creating the MMU mapping) are performed later in the page-fault handler.

1.4 Page fault

When a user accesses an address mapped by mmap() as described in the previous section, there is no MMU mapping for it yet, so a page fault occurs.

The fault handler then allocates a physical page, copies in the file contents and creates the MMU mapping.

1.5 layout

Apart from the text and data segments mapped at execve() time, everything loaded later (shared libraries and so on) is mapped with mmap(). The start address of the mmap area is called mm->mmap_base.

Depending on how mmap_base is placed, the user address space has two layout modes.

1. Legacy layout (mmap grows from low to high addresses)

In the legacy layout, mmap_base starts at roughly 1/3 of the user address space. The maximum stack size is essentially fixed; what can grow flexibly are the heap and the mmap area, and both grow upwards.

What is the problem with this layout? The biggest one is that the heap and the mmap area cannot share space: the heap is limited to roughly 1/3 of task_size and the mmap area to roughly 2/3 of task_size.

2. Modern layout (mmap grows from high to low addresses)

The modern layout is what is generally used today. It lets the heap and the mmap area share space: the heap grows upwards while the mmap area grows downwards, which makes larger allocations possible for either of them.

For example:

$ cat /proc/3389/maps
00400000-00401000 r-xp 00000000 fd:00 104079935    /home/ipu/hook/build/output/bin/sample-target
00600000-00601000 r--p 00000000 fd:00 104079935    /home/ipu/hook/build/output/bin/sample-target
00601000-00602000 rw-p 00001000 fd:00 104079935    /home/ipu/hook/build/output/bin/sample-target
01b9f000-01bc0000 rw-p 00000000 00:00 0            [heap]
7f5c77ad9000-7f5c77c9c000 r-xp 00000000 fd:00 191477    /usr/lib64/libc-2.17.so
7f5c77c9c000-7f5c77e9c000 ---p 001c3000 fd:00 191477    /usr/lib64/libc-2.17.so
7f5c77e9c000-7f5c77ea0000 r--p 001c3000 fd:00 191477    /usr/lib64/libc-2.17.so
7f5c77ea0000-7f5c77ea2000 rw-p 001c7000 fd:00 191477    /usr/lib64/libc-2.17.so
7f5c77ea2000-7f5c77ea7000 rw-p 00000000 00:00 0
7f5c77ea7000-7f5c77ec9000 r-xp 00000000 fd:00 191470    /usr/lib64/ld-2.17.so
7f5c7809d000-7f5c780a0000 rw-p 00000000 00:00 0
7f5c780c6000-7f5c780c8000 rw-p 00000000 00:00 0
7f5c780c8000-7f5c780c9000 r--p 00021000 fd:00 191470    /usr/lib64/ld-2.17.so
7f5c780c9000-7f5c780ca000 rw-p 00022000 fd:00 191470    /usr/lib64/ld-2.17.so
7f5c780ca000-7f5c780cb000 rw-p 00000000 00:00 0
7ffe4a3ef000-7ffe4a410000 rw-p 00000000 00:00 0    [stack]
7ffe4a418000-7ffe4a41a000 r-xp 00000000 00:00 0    [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0    [vsyscall]

1.6 brk

As mentioned earlier, the heap is a special anonymous mmap region, and there is a dedicated system call, brk(), that manages the size of this region.

brk() works on the same principle as mmap(); it is effectively a special case of mmap() that adjusts mm->brk, the end pointer of the heap region.
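A minimal user-space sketch of moving the program break (normally malloc() does this behind the scenes, and glibc may also fall back to mmap() for large allocations):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sbrk(0) returns the current program break, i.e. mm->brk. */
    void *start = sbrk(0);
    printf("initial break:    %p\n", start);

    /* Grow the heap by 64 KB; the kernel only moves mm->brk,
     * physical pages are still allocated lazily on first touch. */
    if (sbrk(64 * 1024) == (void *)-1) { perror("sbrk"); return 1; }
    printf("break after +64K: %p\n", sbrk(0));

    /* Shrink it back. */
    if (brk(start) != 0) { perror("brk"); return 1; }
    printf("break restored:   %p\n", sbrk(0));
    return 0;
}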

2. Code walkthrough

2.1 Key data structures

mm:

/** * 内存描述符。task_struct的mm字段指向它。 * 它包含了进程地址空间有关的全部信息。 */ struct mm_struct { /** * 指向线性区对象的链表头。 */ struct vm_area_struct * mmap; /* list of VMAs */ /** * 指向线性区对象的红-黑树的根 */ struct rb_root mm_rb; /** * 指向最后一个引用的线性区对象。 */ struct vm_area_struct * mmap_cache; /* last find_vma result */ /** * 在进程地址空间中搜索有效线性地址区的方法。 */ unsigned long (*get_unmapped_area) (struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); /** * 释放线性地址区间时调用的方法。 */ void (*unmap_area) (struct vm_area_struct *area); /** * 标识第一个分配的匿名线性区或文件内存映射的线性地址。 */ unsigned long mmap_base; /* base of mmap area */ /** * 内核从这个地址开始搜索进程地址空间中线性地址的空间区间。 */ unsigned long free_area_cache; /* first hole */ /** * 指向页全局目录。 */ pgd_t * pgd; /** * 次使用计数器。存放共享mm_struct数据结构的轻量级进程的个数。 */ atomic_t mm_users; /* How many users with user space? */ /** * 主使用计数器。每当mm_count递减时,内核都要检查它是否变为0,如果是,就要解除这个内存描述符。 */ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ /** * 线性区的个数。 */ int map_count; /* number of VMAs */ /** * 内存描述符的读写信号量。 * 由于描述符可能在几个轻量级进程间共享,通过这个信号量可以避免竞争条件。 */ struct rw_semaphore mmap_sem; /** * 线性区和页表的自旋锁。 */ spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ /** * 指向内存描述符链表中的相邻元素。 */ struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung * together off init_mm.mmlist, and are protected * by mmlist_lock */ /** * start_code-可执行代码的起始地址。 * end_code-可执行代码的最后地址。 * start_data-已初始化数据的起始地址。 * end_data--已初始化数据的结束地址。 */ unsigned long start_code, end_code, start_data, end_data; /** * start_brk-堆的超始地址。 * brk-堆的当前最后地址。 * start_stack-用户态堆栈的起始地址。 */ unsigned long start_brk, brk, start_stack; /** * arg_start-命令行参数的起始地址。 * arg_end-命令行参数的结束地址。 * env_start-环境变量的起始地址。 * env_end-环境变量的结束地址。 */ unsigned long arg_start, arg_end, env_start, env_end; /** * rss-分配给进程的页框总数 * anon_rss-分配给匿名内存映射的页框数。s * total_vm-进程地址空间的大小(页框数) * locked_vm-锁住而不能换出的页的个数。 * shared_vm-共享文件内存映射中的页数。 */ unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; /** * exec_vm-可执行内存映射的页数。 * stack_vm-用户态堆栈中的页数。 * reserved_vm-在保留区中的页数或在特殊线性区中的页数。 * def_flags-线性区默认的访问标志。 * nr_ptes-this进程的页表数。 */ unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; /** * 开始执行elf程序时使用。 */ unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ /** * 表示是否可以产生内存信息转储的标志。 */ unsigned dumpable:1; /** * 懒惰TLB交换的位掩码。 */ cpumask_t cpu_vm_mask; /* Architecture-specific MM context */ /** * 特殊体系结构信息的表。 * 如80X86平台上的LDT地址。 */ mm_context_t context; /* Token based thrashing protection. */ /** * 进程有资格获得交换标记的时间。 */ unsigned long swap_token_time; /** * 如果最近发生了主缺页。则设置该标志。 */ char recent_pagein; /* coredumping support */ /** * 正在把进程地址空间的内容卸载到转储文件中的轻量级进程的数量。 */ int core_waiters; /** * core_startup_done-指向创建内存转储文件时的补充原语。 * core_done-创建内存转储文件时使用的补充原语。 */ struct completion *core_startup_done, core_done; /* aio bits */ /** * 用于保护异步IO上下文链表的锁。 */ rwlock_t ioctx_list_lock; /** * 异步IO上下文链表。 * 一个应用可以创建多个AIO环境,一个给定进程的所有的kioctx描述符存放在一个单向链表中,该链表位于ioctx_list字段 */ struct kioctx *ioctx_list; /** * 默认的异步IO上下文。 */ struct kioctx default_kioctx; /** * 进程所拥有的最大页框数。 */ unsigned long hiwater_rss; /* High-water RSS usage */ /** * 进程线性区中的最大页数。 */ unsigned long hiwater_vm; /* High-water virtual memory usage */ };

vma:

/* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory * space that has a special rule for the page-fault handlers (ie a shared * library, the executable area etc). */ /** * 线性区描述符。 */ struct vm_area_struct { /** * 指向线性区所在的内存描述符。 */ struct mm_struct * vm_mm; /* The address space we belong to. */ /** * 线性区内的第一个线性地址。 */ unsigned long vm_start; /* Our start address within vm_mm. */ /** * 线性区之后的第一个线性地址。 */ unsigned long vm_end; /* The first byte after our end address within vm_mm. */ /* linked list of VM areas per task, sorted by address */ /** * 进程链表中的下一个线性区。 */ struct vm_area_struct *vm_next; /** * 线性区中页框的访问许可权。 */ pgprot_t vm_page_prot; /* Access permissions of this VMA. */ /** * 线性区的标志。 */ unsigned long vm_flags; /* Flags, listed below. */ /** * 用于红黑树的数据。 */ struct rb_node vm_rb; /* * For areas with an address space and backing store, * linkage into the address_space->i_mmap prio tree, or * linkage to the list of like vmas hanging off its node, or * linkage of vma in the address_space->i_mmap_nonlinear list. */ /** * 链接到反映射所使用的数据结构。 */ union { /** * 如果在优先搜索树中,存在两个节点的基索引、堆索引、大小索引完全相同,那么这些相同的节点会被链接到一个链表,而vm_set就是这个链表的元素。 */ struct { struct list_head list; void *parent; /* aligns with prio_tree_node parent */ struct vm_area_struct *head; } vm_set; /** * 如果是文件映射,那么prio_tree_node用于将线性区插入到优先搜索树中。作为搜索树的一个节点。 */ struct raw_prio_tree_node prio_tree_node; } shared; /* * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma * list, after a COW of one of the file pages. A MAP_SHARED vma * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack * or brk vma (with NULL file) can only be in an anon_vma list. */ /** * 指向匿名线性区链表的指针(参见"映射页的反映射")。 * 页框结构有一个anon_vma指针,指向该页的第一个线性区,随后的线性区通过此字段链接起来。 * 通过此字段,可以将线性区链接到此链表中。 */ struct list_head anon_vma_node; /* Serialized by anon_vma->lock */ /** * 指向anon_vma数据结构的指针(参见"映射页的反映射")。此指针也存放在页结构的mapping字段中。 */ struct anon_vma *anon_vma; /* Serialized by page_table_lock */ /* Function pointers to deal with this struct. */ /** * 指向线性区的方法。 */ struct vm_operations_struct * vm_ops; /* Information about our backing store: */ /** * 在映射文件中的偏移量(以页为单位)。对匿名页,它等于0或vm_start/PAGE_SIZE */ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */ /** * 指向映射文件的文件对象(如果有的话) */ struct file * vm_file; /* File we map to (can be NULL). */ /** * 指向内存区的私有数据。 */ void * vm_private_data; /* was vm_pte (shared mem) */ /** * 释放非线性文件内存映射中的一个线性地址区间时使用。 */ unsigned long vm_truncate_count;/* truncate_count or restart_addr */ #ifndef CONFIG_MMU atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */ #endif #ifdef CONFIG_NUMA struct mempolicy *vm_policy; /* NUMA policy for the VMA */ #endif };

2.2 mmap()

arch\x86\kernel\sys_x86_64.c: SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, unsigned long, prot, unsigned long, flags, unsigned long, fd, unsigned long, off) { long error; error = -EINVAL; if (off & ~PAGE_MASK) goto out; error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); out: return error; } ↓ mm\mmap.c: SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len, unsigned long, prot, unsigned long, flags, unsigned long, fd, unsigned long, pgoff) { struct file *file = NULL; unsigned long retval; /* (1) 文件内存映射(非匿名内存映射) */ if (!(flags & MAP_ANONYMOUS)) { audit_mmap_fd(fd, flags); /* (1.1) 根据fd获取到file */ file = fget(fd); if (!file) return -EBADF; /* (1.2) 如果文件是hugepage,根据hugepage计算长度 */ if (is_file_hugepages(file)) len = ALIGN(len, huge_page_size(hstate_file(file))); retval = -EINVAL; /* (1.3) 如果指定了huge tlb映射,但是文件不是huge page,出错返回 */ if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) goto out_fput; /* (2) 匿名内存映射 且huge tlb */ } else if (flags & MAP_HUGETLB) { struct user_struct *user = NULL; struct hstate *hs; hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); if (!hs) return -EINVAL; len = ALIGN(len, huge_page_size(hs)); /* * VM_NORESERVE is used because the reservations will be * taken when vm_ops->mmap() is called * A dummy user value is used because we are not locking * memory so no accounting is necessary */ file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE, &user, HUGETLB_ANONHUGE_INODE, (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); if (IS_ERR(file)) return PTR_ERR(file); } /* (3) 清除用户设置的以下两个标志 */ flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE); /* (4) 进一步调用 */ retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); out_fput: if (file) fput(file); return retval; } ↓ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flag, unsigned long pgoff) { unsigned long ret; struct mm_struct *mm = current->mm; unsigned long populate; LIST_HEAD(uf); ret = security_mmap_file(file, prot, flag); if (!ret) { if (down_write_killable(&mm->mmap_sem)) return -EINTR; /* 进一步调用 */ ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate, &uf); up_write(&mm->mmap_sem); userfaultfd_unmap_complete(mm, &uf); /* 如果需要,立即填充vma对应的内存 */ if (populate) mm_populate(ret, populate); } return ret; } ↓ do_mmap_pgoff() ↓ unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf) { struct mm_struct *mm = current->mm; int pkey = 0; *populate = 0; if (!len) return -EINVAL; /* * Does the application expect PROT_READ to imply PROT_EXEC? * 应用程序是否期望PROT_READ暗示PROT_EXEC? * * (the exception is when the underlying filesystem is noexec * mounted, in which case we dont add PROT_EXEC.) * (例外是在底层文件系统未执行noexec挂载时,在这种情况下,我们不添加PROT_EXEC。) */ /* (4.1.1) PROT_READ暗示PROT_EXEC */ if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC)) if (!(file && path_noexec(&file->f_path))) prot |= PROT_EXEC; /* (4.1.2) 如果没有设置固定地址的标志,给地址按page取整,且使其不小于mmap_min_addr */ if (!(flags & MAP_FIXED)) addr = round_hint_to_min(addr); /* Careful about overflows.. */ /* (4.1.3) 给长度按page取整 */ len = PAGE_ALIGN(len); if (!len) return -ENOMEM; /* offset overflow? */ /* (4.1.4) 判断page offset + 长度,是否已经溢出 */ if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) return -EOVERFLOW; /* Too many mappings? 
*/ /* (4.1.5) 判断本进程mmap的区段个数已经超标 */ if (mm->map_count > sysctl_max_map_count) return -ENOMEM; /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. */ /* (4.2) 从本进程的线性地址红黑树中分配一块空白地址 */ addr = get_unmapped_area(file, addr, len, pgoff, flags); if (offset_in_page(addr)) return addr; /* (4.3.1) 如果prot只指定了exec */ if (prot == PROT_EXEC) { pkey = execute_only_pkey(mm); if (pkey < 0) pkey = 0; } /* Do simple checking here so the lower-level routines won't have * to. we assume access permissions have been handled by the open * of the memory object, so we don't do any here. * 在这里进行简单的检查,因此不必执行较低级别的例程。 我们假定访问权限已由内存对象的打开处理,因此在此不做任何操作。 */ /* (4.3.2) 计算初始vm flags: 根据prot计算相关的vm flag 根据flags计算相关的vm flag 再综合mm的默认flags等 */ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; /* (4.3.3) 如果指定了内存lock标志,但是当前进程不能lock,出错返回 */ if (flags & MAP_LOCKED) if (!can_do_mlock()) return -EPERM; /* (4.3.4) 如果指定了内存lock标志,但是lock的长度超标,出错返回 */ if (mlock_future_check(mm, vm_flags, len)) return -EAGAIN; /* (4.4) 文件内存映射的一系列判断和处理 */ if (file) { struct inode *inode = file_inode(file); unsigned long flags_mask; /* (4.4.1) 指定的page offset和len,需要在文件的合法长度内 */ if (!file_mmap_ok(file, inode, pgoff, len)) return -EOVERFLOW; /* (4.4.2) 本文件支持的mask,和mmap()传递下来的flags进行判断 */ flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags; switch (flags & MAP_TYPE) { /* (4.4.3) 共享映射 */ case MAP_SHARED: /* * Force use of MAP_SHARED_VALIDATE with non-legacy * flags. E.g. MAP_SYNC is dangerous to use with * MAP_SHARED as you don't know which consistency model * you will get. We silently ignore unsupported flags * with MAP_SHARED to preserve backward compatibility. */ flags &= LEGACY_MAP_MASK; /* fall through */ /* (4.4.4) 共享&校验映射 */ case MAP_SHARED_VALIDATE: if (flags & ~flags_mask) return -EOPNOTSUPP; if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) return -EACCES; /* * Make sure we don't allow writing to an append-only * file.. */ if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE)) return -EACCES; /* * Make sure there are no mandatory locks on the file. */ if (locks_verify_locked(file)) return -EAGAIN; vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); /* fall through */ /* (4.4.3) 私有映射 */ case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; if (path_noexec(&file->f_path)) { if (vm_flags & VM_EXEC) return -EPERM; vm_flags &= ~VM_MAYEXEC; } if (!file->f_op->mmap) return -ENODEV; if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) return -EINVAL; break; default: return -EINVAL; } /* (4.5) 匿名内存映射的一系列判断和处理 */ } else { switch (flags & MAP_TYPE) { case MAP_SHARED: if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) return -EINVAL; /* * Ignore pgoff. */ pgoff = 0; vm_flags |= VM_SHARED | VM_MAYSHARE; break; case MAP_PRIVATE: /* * Set pgoff according to addr for anon_vma. */ pgoff = addr >> PAGE_SHIFT; break; default: return -EINVAL; } } /* * Set 'VM_NORESERVE' if we should not account for the * memory use of this mapping. 
* 如果我们不应该考虑此映射的内存使用,则设置“ VM_NORESERVE”。 */ if (flags & MAP_NORESERVE) { /* We honor MAP_NORESERVE if allowed to overcommit */ if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) vm_flags |= VM_NORESERVE; /* hugetlb applies strict overcommit unless MAP_NORESERVE */ if (file && is_file_hugepages(file)) vm_flags |= VM_NORESERVE; } /* (4.6) 根据查找到的地址、flags,正式在线性地址红黑树中插入一个新的VMA */ addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); /* (4.7) 默认只是分配vma,不进行实际的内存分配和mmu映射,延迟到page_fault时才处理 如果设置了立即填充的标志,在分配vma时就分配好内存 */ if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) *populate = len; return addr; }

2.2.1 mm->mmap_base (mmap base address)

current->mm->get_unmapped_area is assigned arch_get_unmapped_area() or arch_get_unmapped_area_topdown() by default; the assignment happens while the new process image is being set up during execve() (see the call chain below). The same function does one more important thing: it assigns the mmap base:

sys_execve() ... → load_elf_binary() → setup_new_exec() → arch_pick_mmap_layout() void arch_pick_mmap_layout(struct mm_struct *mm) { /* (1) 给get_unmapped_area成员赋值 */ if (mmap_is_legacy()) mm->get_unmapped_area = arch_get_unmapped_area; else mm->get_unmapped_area = arch_get_unmapped_area_topdown; /* (2) 计算64bit模式下,mmap的基地址 传统layout模式:用户空间的1/3处,再加上随机偏移 现代layout模式:用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移 */ arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base, arch_rnd(mmap64_rnd_bits), task_size_64bit(0)); #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES /* * The mmap syscall mapping base decision depends solely on the * syscall type (64-bit or compat). This applies for 64bit * applications and 32bit applications. The 64bit syscall uses * mmap_base, the compat syscall uses mmap_compat_base. */ /* (3) 计算32bit兼容模式下,mmap的基地址 arch_pick_mmap_base(&mm->mmap_compat_base, &mm->mmap_compat_legacy_base, arch_rnd(mmap32_rnd_bits), task_size_32bit()); #endif } |→ // 用户地址空间的最大值:2^47-0x1000 = 0x7FFFFFFFF000 // 约128T #define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) #define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) /* This decides where the kernel will search for a free chunk of vm * space during mmap's. */ #define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \ 0xc0000000 : 0xFFFFe000) #define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \ IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW) #define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \ IA32_PAGE_OFFSET : TASK_SIZE_MAX) #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \ IA32_PAGE_OFFSET : TASK_SIZE_MAX) #define STACK_TOP TASK_SIZE_LOW #define STACK_TOP_MAX TASK_SIZE_MAX |→ // 根据定义的随机bit,来计算随机偏移多少个page static unsigned long arch_rnd(unsigned int rndbits) { if (!(current->flags & PF_RANDOMIZE)) return 0; return (get_random_long() & ((1UL << rndbits) - 1)) << PAGE_SHIFT; } unsigned long task_size_64bit(int full_addr_space) { return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW; } |→ static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base, unsigned long random_factor, unsigned long task_size) { /* (2.1) 传统layout模式下,mmap的基址: PAGE_ALIGN(task_size / 3) + rnd // 用户空间的1/3处,再加上随机偏移 */ *legacy_base = mmap_legacy_base(random_factor, task_size); if (mmap_is_legacy()) *base = *legacy_base; else /* (2.2) 现代layout模式下,mmap的基址: PAGE_ALIGN(task_size - stask_gap - rnd) // 用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移 */ *base = mmap_base(random_factor, task_size); } ||→ static unsigned long mmap_legacy_base(unsigned long rnd, unsigned long task_size) { return __TASK_UNMAPPED_BASE(task_size) + rnd; } #define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3)) ||→ static unsigned long mmap_base(unsigned long rnd, unsigned long task_size) { // 堆栈的最大值 unsigned long gap = rlimit(RLIMIT_STACK); // 堆栈的最大随机偏移 + 1M unsigned long pad = stack_maxrandom_size(task_size) + stack_guard_gap; unsigned long gap_min, gap_max; /* Values close to RLIM_INFINITY can overflow. */ if (gap + pad > gap) gap += pad; /* * Top of mmap area (just below the process stack). * Leave an at least ~128 MB hole with possible stack randomization. 
*/ // 最小不小于128M,最大不大于用户空间的5/6 gap_min = SIZE_128M; gap_max = (task_size / 6) * 5; if (gap < gap_min) gap = gap_min; else if (gap > gap_max) gap = gap_max; return PAGE_ALIGN(task_size - gap - rnd); } static unsigned long stack_maxrandom_size(unsigned long task_size) { unsigned long max = 0; if (current->flags & PF_RANDOMIZE) { max = (-1UL) & __STACK_RND_MASK(task_size == task_size_32bit()); max <<= PAGE_SHIFT; } return max; } /* 1GB for 64bit, 8MB for 32bit */ // 堆栈的最大随机偏移:64bit下16G,32bit下8M #define __STACK_RND_MASK(is32bit) ((is32bit) ? 0x7ff : 0x3fffff) /* enforced gap between the expanding stack and other mappings. */ // 1M 的stack gap空间 unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
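To make the two formulas above concrete, here is a simplified recomputation in user space. It is only a sketch under stated assumptions: x86_64 with a 47-bit user space, an 8 MB stack rlimit, randomization disabled; the real kernel also adds the stack's own random offset to the gap.

#include <stdio.h>

#define PAGE_SIZE       4096UL
#define PAGE_ALIGN(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long task_size = (1UL << 47) - PAGE_SIZE;   /* DEFAULT_MAP_WINDOW */
    unsigned long stack_rlimit = 8UL << 20;              /* assumed RLIMIT_STACK */
    unsigned long stack_guard_gap = 256UL * PAGE_SIZE;   /* 1 MB */
    unsigned long rnd = 0;                               /* randomization disabled */

    /* Legacy layout: about 1/3 of the user space, plus the random offset. */
    unsigned long legacy_base = PAGE_ALIGN(task_size / 3) + rnd;

    /* Modern layout: top of user space minus the (clamped) stack gap and rnd. */
    unsigned long gap = stack_rlimit + stack_guard_gap;
    unsigned long gap_min = 128UL << 20;                 /* at least ~128 MB */
    unsigned long gap_max = task_size / 6 * 5;           /* at most 5/6 of user space */
    if (gap < gap_min) gap = gap_min;
    if (gap > gap_max) gap = gap_max;
    unsigned long modern_base = PAGE_ALIGN(task_size - gap - rnd);

    printf("legacy mmap_base ~ %#lx\n", legacy_base);
    printf("modern mmap_base ~ %#lx\n", modern_base);
    return 0;
}

With these assumptions the modern base comes out just below the stack area, around 0x7ffff7fff000, which matches the 0x7f... addresses of the shared libraries in the /proc/PID/maps listing shown earlier.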

2.2.2 get_unmapped_area()

get_unmapped_area() finds a free range of the required size in the current process's user address space for the new vma.

unsigned long get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { unsigned long (*get_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); unsigned long error = arch_mmap_check(addr, len, flags); if (error) return error; /* Careful about overflows.. */ if (len > TASK_SIZE) return -ENOMEM; /* (1.1) 默认,使用本进程的mm->get_unmapped_area函数 */ get_area = current->mm->get_unmapped_area; if (file) { /* (1.2) 文件内存映射,且文件有自己的get_unmapped_area,则使用file->f_op->get_unmapped_area */ if (file->f_op->get_unmapped_area) get_area = file->f_op->get_unmapped_area; } else if (flags & MAP_SHARED) { /* * mmap_region() will call shmem_zero_setup() to create a file, * so use shmem's get_unmapped_area in case it can be huge. * do_mmap_pgoff() will clear pgoff, so match alignment. */ pgoff = 0; /* (1.3) 匿名共享内存映射,使用shmem_get_unmapped_area函数 */ get_area = shmem_get_unmapped_area; } /* (2) 实际的获取线性区域 */ addr = get_area(file, addr, len, pgoff, flags); if (IS_ERR_VALUE(addr)) return addr; if (addr > TASK_SIZE - len) return -ENOMEM; if (offset_in_page(addr)) return -EINVAL; error = security_mmap_addr(addr); return error ? error : addr; }

2.2.2.1 arch_get_unmapped_area()

Process address space, legacy/classical layout:

Drawback of the classical layout: on x86_32 the virtual address space runs from 0 to 0xc0000000, so each user process has 3 GB available. TASK_UNMAPPED_BASE usually starts at 0x40000000 (i.e. 1 GB), which means the heap has only about 1 GB to grow into before it runs into the mmap area; in this layout the mmap area grows bottom-up.

In the legacy layout, mmap allocation goes from low to high addresses, from mm->mmap_base up to task_size. The default path is current->mm->get_unmapped_area -> arch_get_unmapped_area():

arch\x86\kernel\sys_x86_64.c: unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; struct vm_unmapped_area_info info; unsigned long begin, end; /* (2.1) 地址和长度不能大于 47 bits */ addr = mpx_unmapped_area_check(addr, len, flags); if (IS_ERR_VALUE(addr)) return addr; /* (2.2) 如果是固定映射,则返回原地址 */ if (flags & MAP_FIXED) return addr; /* (2.3) 获取mmap区域的开始、结束地址: begin :mm->mmap_base end :task_size */ find_start_end(addr, flags, &begin, &end); /* (2.4) 超长出错返回 */ if (len > end) return -ENOMEM; /* (2.5) 按照用户给出的原始addr和len查看这里是否有空洞 */ if (addr) { addr = PAGE_ALIGN(addr); vma = find_vma(mm, addr); /* 这里容易有个误解: 开始以为find_vma()的作用是找到一个vma满足:vma->vm_start <= addr < vma->vm_end 实际find_vma()的作用是找到一个地址最小的vma满足: addr < vma->vm_end */ /* 在vma红黑树之外找到空间: 1、如果addr的值在vma红黑树之上:!vma ,且有足够的空间:end - len >= addr 2、如果addr的值在vma红黑树之下,且有足够的空间:addr + len <= vm_start_gap(vma) */ if (end - len >= addr && (!vma || addr + len <= vm_start_gap(vma))) return addr; } info.flags = 0; info.length = len; info.low_limit = begin; info.high_limit = end; info.align_mask = 0; info.align_offset = pgoff << PAGE_SHIFT; if (filp) { info.align_mask = get_align_mask(); info.align_offset += get_align_bits(); } /* (2.6) 否则只能废弃掉用户指定的地址,根据长度重新给他找一个合适的地址 优先在vma红黑树的空洞中找,其次在空白位置找 */ return vm_unmapped_area(&info); } |→ /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ /* 查找最小的VMA,满足addr < vma->vm_end */ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { struct rb_node *rb_node; struct vm_area_struct *vma; /* Check the cache first. */ /* (2.5.1) 查找vma cache,是否有vma的区域能包含addr地址 */ vma = vmacache_find(mm, addr); if (likely(vma)) return vma; /* (2.5.2) 查找vma红黑树,是否有vma的区域能包含addr地址 */ rb_node = mm->mm_rb.rb_node; while (rb_node) { struct vm_area_struct *tmp; tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb); if (tmp->vm_end > addr) { vma = tmp; if (tmp->vm_start <= addr) break; rb_node = rb_node->rb_left; } else rb_node = rb_node->rb_right; } /* (2.5.3) 利用查找到的vma来更新vma cache */ if (vma) vmacache_update(addr, vma); return vma; } |→ /* * Search for an unmapped address range. * * We are looking for a range that: * - does not intersect with any VMA; // 不与任何VMA相交; * - is contained within the [low_limit, high_limit) interval; // 包含在[low_limit,high_limit)间隔内; * - is at least the desired size. // 至少是所需的大小。 * - satisfies (begin_addr & align_mask) == (align_offset & align_mask) // 满足如下条件 */ static inline unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) { /* (2.6.1) 从高往低查找 */ if (info->flags & VM_UNMAPPED_AREA_TOPDOWN) return unmapped_area_topdown(info); /* (2.6.2) 默认从低往高查找 */ else return unmapped_area(info); } ||→ unsigned long unmapped_area(struct vm_unmapped_area_info *info) { /* * We implement the search by looking for an rbtree node that * immediately follows a suitable gap. 
That is, * 我们查找红黑树,找到一个合适的洞。需要满足以下条件: * - gap_start = vma->vm_prev->vm_end <= info->high_limit - length; * - gap_end = vma->vm_start >= info->low_limit + length; * - gap_end - gap_start >= length */ struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned long length, low_limit, high_limit, gap_start, gap_end; /* Adjust search length to account for worst case alignment overhead */ /* (2.6.2.1) 长度加上mask开销 */ length = info->length + info->align_mask; if (length < info->length) return -ENOMEM; /* Adjust search limits by the desired length */ /* (2.6.2.2) 计算high_limit,gap_start<=high_limit */ if (info->high_limit < length) return -ENOMEM; high_limit = info->high_limit - length; /* (2.6.2.3) 计算low_limit,gap_end>=low_limit */ if (info->low_limit > high_limit) return -ENOMEM; low_limit = info->low_limit + length; /* Check if rbtree root looks promising */ if (RB_EMPTY_ROOT(&mm->mm_rb)) goto check_highest; vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb); /* * rb_subtree_gap的定义: * Largest free memory gap in bytes to the left of this VMA. * 此VMA左侧的最大可用内存空白(以字节为单位)。 * Either between this VMA and vma->vm_prev, or between one of the * VMAs below us in the VMA rbtree and its ->vm_prev. This helps * get_unmapped_area find a free area of the right size. * 在此VMA和vma-> vm_prev之间,或在VMA rbtree中我们下面的VMA之一与其-> vm_prev之间。 这有助于get_unmapped_area找到合适大小的空闲区域。 */ if (vma->rb_subtree_gap < length) goto check_highest; /* (2.6.2.4) 查找红黑树根节点的左子树中是否有符合要求的空洞。 有个疑问: 根节点的右子树不需要搜索了吗?还是根节点没有右子树? */ while (true) { /* Visit left subtree if it looks promising */ /* (2.6.2.4.1) 一直往左找,找到最左边有合适大小的节点 因为最左边的地址最小 */ gap_end = vm_start_gap(vma); if (gap_end >= low_limit && vma->vm_rb.rb_left) { struct vm_area_struct *left = rb_entry(vma->vm_rb.rb_left, struct vm_area_struct, vm_rb); if (left->rb_subtree_gap >= length) { vma = left; continue; } } gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0; check_current: /* Check if current node has a suitable gap */ if (gap_start > high_limit) return -ENOMEM; /* (2.6.2.4.2) 如果已找到合适的洞,则跳出循环 */ if (gap_end >= low_limit && gap_end > gap_start && gap_end - gap_start >= length) goto found; /* Visit right subtree if it looks promising */ /* (2.6.2.4.3) 如果左子树查找失败,从当前vm的右子树查找 */ if (vma->vm_rb.rb_right) { struct vm_area_struct *right = rb_entry(vma->vm_rb.rb_right, struct vm_area_struct, vm_rb); if (right->rb_subtree_gap >= length) { vma = right; continue; } } /* Go back up the rbtree to find next candidate node */ /* (2.6.2.4.4) 如果左右子树都搜寻失败,向回搜寻父节点 */ while (true) { struct rb_node *prev = &vma->vm_rb; if (!rb_parent(prev)) goto check_highest; vma = rb_entry(rb_parent(prev), struct vm_area_struct, vm_rb); if (prev == vma->vm_rb.rb_left) { gap_start = vm_end_gap(vma->vm_prev); gap_end = vm_start_gap(vma); goto check_current; } } } /* (2.6.2.5) 如果红黑树中没有合适的空洞,从highest空间查找是否有合适的 highest空间是还没有vma分配的空白空间 但是优先查找已分配vma之间的空洞 */ check_highest: /* Check highest gap, which does not precede any rbtree node */ gap_start = mm->highest_vm_end; gap_end = ULONG_MAX; /* Only for VM_BUG_ON below */ if (gap_start > high_limit) return -ENOMEM; /* (2.6.2.6) 搜索到了合适的空间,返回开始地址 */ found: /* We found a suitable gap. Clip it with the original low_limit. */ if (gap_start < info->low_limit) gap_start = info->low_limit; /* Adjust gap address to the desired alignment */ gap_start += (info->align_offset - gap_start) & info->align_mask; VM_BUG_ON(gap_start + info->length > info->high_limit); VM_BUG_ON(gap_start + info->length > gap_end); return gap_start; }

2.2.2.2 arch_get_unmapped_area_topdown()

The new virtual address space layout:

The difference from the classical layout is that a fixed value is used to cap the maximum stack size. Because the stack is bounded, the region for memory mappings can start immediately below the end of the stack, and the mmap area grows top-down. Since the heap still sits in the lower part of the address space and grows upwards, the mmap area and the heap can grow towards each other until the remaining free space in the virtual address space is exhausted.

In the modern layout, mmap allocation goes from high to low addresses, via current->mm->get_unmapped_area -> arch_get_unmapped_area_topdown():

unsigned long arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, const unsigned long len, const unsigned long pgoff, const unsigned long flags) { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; unsigned long addr = addr0; struct vm_unmapped_area_info info; addr = mpx_unmapped_area_check(addr, len, flags); if (IS_ERR_VALUE(addr)) return addr; /* requested length too big for entire address space */ if (len > TASK_SIZE) return -ENOMEM; /* No address checking. See comment at mmap_address_hint_valid() */ if (flags & MAP_FIXED) return addr; /* for MAP_32BIT mappings we force the legacy mmap base */ if (!in_compat_syscall() && (flags & MAP_32BIT)) goto bottomup; /* requesting a specific address */ if (addr) { addr &= PAGE_MASK; if (!mmap_address_hint_valid(addr, len)) goto get_unmapped_area; vma = find_vma(mm, addr); if (!vma || addr + len <= vm_start_gap(vma)) return addr; } get_unmapped_area: info.flags = VM_UNMAPPED_AREA_TOPDOWN; info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = get_mmap_base(0); /* * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area * in the full address space. * * !in_compat_syscall() check to avoid high addresses for x32. */ if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall()) info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW; info.align_mask = 0; info.align_offset = pgoff << PAGE_SHIFT; if (filp) { info.align_mask = get_align_mask(); info.align_offset += get_align_bits(); } addr = vm_unmapped_area(&info); if (!(addr & ~PAGE_MASK)) return addr; VM_BUG_ON(addr != -ENOMEM); bottomup: /* * A failed mmap() very likely causes application failure, * so fall back to the bottom-up function here. This scenario * can happen with large stack limits and large mmap() * allocations. */ return arch_get_unmapped_area(filp, addr0, len, pgoff, flags); } ↓ unmapped_area_topdown()

unmapped_area_topdown() follows the same logic as unmapped_area(), except that it prefers higher addresses, so it searches the right subtree first.

2.2.2.3 file->f_op->get_unmapped_area

If the file's file_operations provides its own get_unmapped_area, that one is called instead. Taking ext4 as an example, it supplies thp_get_unmapped_area():

const struct file_operations ext4_file_operations = {
	.llseek		= ext4_llseek,
	.read_iter	= ext4_file_read_iter,
	.write_iter	= ext4_file_write_iter,
	.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl	= ext4_compat_ioctl,
#endif
	.mmap		= ext4_file_mmap,
	.mmap_supported_flags = MAP_SYNC,
	.open		= ext4_file_open,
	.release	= ext4_release_file,
	.fsync		= ext4_sync_file,
	.get_unmapped_area = thp_get_unmapped_area,
	.splice_read	= generic_file_splice_read,
	.splice_write	= iter_file_splice_write,
	.fallocate	= ext4_fallocate,
};

2.2.3 mmap_region()

unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; int error; struct rb_node **rb_link, *rb_parent; unsigned long charged = 0; /* Check against address space limit. */ /* (1) 判断地址空间大小是否已经超标 总的空间:mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT 数据空间:mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT */ if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) { unsigned long nr_pages; /* * MAP_FIXED may remove pages of mappings that intersects with * requested mapping. Account for the pages it would unmap. */ /* (1.1) 固定映射指定地址的情况下,地址空间可能和已有的VMA重叠,其他情况下不会重叠 需要先unmap移除掉和新地址交错的vma地址 所以可以先减去这部分空间,再判断大小是否超标 */ nr_pages = count_vma_pages_range(mm, addr, addr + len); if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages)) return -ENOMEM; } /* Clear old maps */ /* (2) 如果新地址和旧的vma有覆盖的情况: 把覆盖的地址范围的vma分割出来,先释放掉 */ while (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) { if (do_munmap(mm, addr, len, uf)) return -ENOMEM; } /* * Private writable mapping: check memory availability * 私有可写映射:检查内存可用性 */ if (accountable_mapping(file, vm_flags)) { charged = len >> PAGE_SHIFT; if (security_vm_enough_memory_mm(mm, charged)) return -ENOMEM; vm_flags |= VM_ACCOUNT; } /* * Can we just expand an old mapping? */ /* (3) 尝试和临近的vma进行merge */ vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX); if (vma) goto out; /* * Determine the object being mapped and call the appropriate * specific mapper. the address has already been validated, but * not unmapped, but the maps are removed from the list. */ /* (4.1) 分配新的vma结构体 */ vma = vm_area_alloc(mm); if (!vma) { error = -ENOMEM; goto unacct_error; } /* (4.2) 结构体相关成员 */ vma->vm_start = addr; vma->vm_end = addr + len; vma->vm_flags = vm_flags; vma->vm_page_prot = vm_get_page_prot(vm_flags); vma->vm_pgoff = pgoff; /* (4.3) 文件内存映射 */ if (file) { if (vm_flags & VM_DENYWRITE) { error = deny_write_access(file); if (error) goto free_vma; } if (vm_flags & VM_SHARED) { error = mapping_map_writable(file->f_mapping); if (error) goto allow_write_and_free_vma; } /* ->mmap() can change vma->vm_file, but must guarantee that * vma_link() below can deny write-access if VM_DENYWRITE is set * and map writably if VM_SHARED is set. This usually means the * new file must not have been exposed to user-space, yet. */ /* (4.3.1) 给vma->vm_file赋值 */ vma->vm_file = get_file(file); /* (4.3.2) 调用file->f_op->mmap,给vma->vm_ops赋值 例如ext4:vma->vm_ops = &ext4_file_vm_ops; */ error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; /* Can addr have changed?? * * Answer: Yes, several device drivers can do it in their * f_op->mmap method. -DaveM * Bug: If addr is changed, prev, rb_link, rb_parent should * be updated for vma_link() */ WARN_ON_ONCE(addr != vma->vm_start); addr = vma->vm_start; vm_flags = vma->vm_flags; /* (4.4) 匿名共享内存映射 */ } else if (vm_flags & VM_SHARED) { /* (4.4.1) 赋值: vma->vm_file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags); vma->vm_ops = &shmem_vm_ops; */ error = shmem_zero_setup(vma); if (error) goto free_vma; } /* (4.4) 匿名私有内存映射呢? vma->vm_file = NULL? vma->vm_ops = NULL? 
*/ /* (4.6) 将新的vma插入 */ vma_link(mm, vma, prev, rb_link, rb_parent); /* Once vma denies write, undo our temporary denial count */ if (file) { if (vm_flags & VM_SHARED) mapping_unmap_writable(file->f_mapping); if (vm_flags & VM_DENYWRITE) allow_write_access(file); } file = vma->vm_file; out: perf_event_mmap(vma); vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); if (vm_flags & VM_LOCKED) { if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))) mm->locked_vm += (len >> PAGE_SHIFT); else vma->vm_flags &= VM_LOCKED_CLEAR_MASK; } if (file) uprobe_mmap(vma); /* * New (or expanded) vma always get soft dirty status. * Otherwise user-space soft-dirty page tracker won't * be able to distinguish situation when vma area unmapped, * then new mapped in-place (which must be aimed as * a completely new data area). */ vma->vm_flags |= VM_SOFTDIRTY; vma_set_page_prot(vma); return addr; unmap_and_free_vma: vma_fput(vma); vma->vm_file = NULL; /* Undo any partial mapping done by a device driver. */ unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end); charged = 0; if (vm_flags & VM_SHARED) mapping_unmap_writable(file->f_mapping); allow_write_and_free_vma: if (vm_flags & VM_DENYWRITE) allow_write_access(file); free_vma: vm_area_free(vma); unacct_error: if (charged) vm_unacct_memory(charged); return error; } ↓ vma_link()

2.2.3.1 vma_link()

static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node **rb_link, struct rb_node *rb_parent) { struct address_space *mapping = NULL; if (vma->vm_file) { mapping = vma->vm_file->f_mapping; i_mmap_lock_write(mapping); } /* (4.6.1) 将新的vma插入到vma红黑树和vma链表中,并且更新树上的各种参数 */ __vma_link(mm, vma, prev, rb_link, rb_parent); /* (4.6.2) 将vma插入到文件的file->f_mapping->i_mmap缓存树中 */ __vma_link_file(vma); if (mapping) i_mmap_unlock_write(mapping); mm->map_count++; validate_mm(mm); }

After mmap, the relationship between the user virtual addresses and the file's page cache is as follows:

f_op->mmap() does not immediately establish the mapping between the user addresses described by the vma and the file's page cache; it only installs the vma->vm_ops operations. When the user later accesses a linear address in the vma, a page fault occurs and vma->vm_ops->fault() (->nopage() in older kernels) is called to establish the mapping.
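That demand-paging behaviour can be observed from user space by reading the process's fault counters around the first access to a file-backed mapping. A rough sketch (the file path is only an example; the exact counts may vary slightly):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/resource.h>

static long faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt + ru.ru_majflt;
}

int main(void)
{
    int fd = open("/etc/passwd", O_RDONLY);         /* example path, assumption */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    long before_map = faults();
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    long after_map = faults();

    char c = p[0];                                  /* first access -> vm_ops->fault() */
    long after_touch = faults();

    printf("faults added by mmap():       %ld\n", after_map - before_map);
    printf("faults added by first access: %ld (byte=%d)\n",
           after_touch - after_map, c);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}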

2.2.4 do_munmap()

/* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the * work. This now handles partial unmappings. * Jeremy Fitzhardinge <jeremy@goop.org> */ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf) { unsigned long end; struct vm_area_struct *vma, *prev, *last; if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) return -EINVAL; len = PAGE_ALIGN(len); if (len == 0) return -EINVAL; /* Find the first overlapping VMA */ /* (1) 找到第一个可能重叠的VMA */ vma = find_vma(mm, start); if (!vma) return 0; prev = vma->vm_prev; /* we have start < vma->vm_end */ /* if it doesn't overlap, we have nothing.. */ /* (2) 如果地址没有重叠,直接返回 */ end = start + len; if (vma->vm_start >= end) return 0; /* (3) 如果有unmap区域和vma有重叠,先尝试把unmap区域切分成独立的小块vma,再unmap掉 */ /* * If we need to split any vma, do it now to save pain later. * * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially * unmapped vm_area_struct will remain in use: so lower split_vma * places tmp vma above, and higher split_vma places tmp vma below. */ /* (3.1) 如果start和vma重叠,切一刀 */ if (start > vma->vm_start) { int error; /* * Make sure that map_count on return from munmap() will * not exceed its limit; but let map_count go just above * its limit temporarily, to help free resources as expected. */ if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) return -ENOMEM; error = __split_vma(mm, vma, start, 0); if (error) return error; prev = vma; } /* Does it split the last one? */ /* (3.2) 如果end和vma冲切,切一刀 */ last = find_vma(mm, end); if (last && end > last->vm_start) { int error = __split_vma(mm, last, end, 1); if (error) return error; } vma = prev ? prev->vm_next : mm->mmap; if (unlikely(uf)) { /* * If userfaultfd_unmap_prep returns an error the vmas * will remain splitted, but userland will get a * highly unexpected error anyway. This is no * different than the case where the first of the two * __split_vma fails, but we don't undo the first * split, despite we could. This is unlikely enough * failure that it's not worth optimizing it for. */ int error = userfaultfd_unmap_prep(vma, start, end, uf); if (error) return error; } /* * unlock any mlock()ed ranges before detaching vmas */ /* (4) 移除目标vma上的相关lock */ if (mm->locked_vm) { struct vm_area_struct *tmp = vma; while (tmp && tmp->vm_start < end) { if (tmp->vm_flags & VM_LOCKED) { mm->locked_vm -= vma_pages(tmp); munlock_vma_pages_all(tmp); } tmp = tmp->vm_next; } } /* (5) 移除目标vma */ /* * Remove the vma's, and unmap the actual pages */ /* (5.1) 从vma红黑树中移除vma */ detach_vmas_to_be_unmapped(mm, vma, prev, end); /* (5.2) 释放掉vma空间对应的mmu映射表以及内存 */ unmap_region(mm, vma, prev, start, end); /* (5.3) arch相关的vma释放 */ arch_unmap(mm, vma, start, end); /* (5.4) 移除掉vma的其他信息,最后释放掉vma结构体 */ /* Fix up all other VM information */ remove_vma_list(mm, vma); return 0; } |→ static void unmap_region(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, unsigned long start, unsigned long end) { struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; struct mmu_gather tlb; lru_add_drain(); tlb_gather_mmu(&tlb, mm, start, end); update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end); free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, next ? next->vm_start : USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, start, end); } ||→ free_pgtables()

2.2.5 mm_populate()

static inline void mm_populate(unsigned long addr, unsigned long len) { /* Ignore errors */ (void) __mm_populate(addr, len, 1); } ↓ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors) { struct mm_struct *mm = current->mm; unsigned long end, nstart, nend; struct vm_area_struct *vma = NULL; int locked = 0; long ret = 0; end = start + len; for (nstart = start; nstart < end; nstart = nend) { /* * We want to fault in pages for [nstart; end) address range. * Find first corresponding VMA. */ /* (1) 根据目标地址,查找vma */ if (!locked) { locked = 1; down_read(&mm->mmap_sem); vma = find_vma(mm, nstart); } else if (nstart >= vma->vm_end) vma = vma->vm_next; if (!vma || vma->vm_start >= end) break; /* * Set [nstart; nend) to intersection of desired address * range with the first VMA. Also, skip undesirable VMA types. */ /* (2) 计算目标地址和vma的重叠部分 */ nend = min(end, vma->vm_end); if (vma->vm_flags & (VM_IO | VM_PFNMAP)) continue; if (nstart < vma->vm_start) nstart = vma->vm_start; /* * Now fault in a range of pages. populate_vma_page_range() * double checks the vma flags, so that it won't mlock pages * if the vma was already munlocked. */ /* (3) 对重叠部分进行page内存填充 */ ret = populate_vma_page_range(vma, nstart, nend, &locked); if (ret < 0) { if (ignore_errors) { ret = 0; continue; /* continue at next VMA */ } break; } nend = nstart + ret * PAGE_SIZE; ret = 0; } if (locked) up_read(&mm->mmap_sem); return ret; /* 0 or negative error code */ } ↓ long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *nonblocking) { struct mm_struct *mm = vma->vm_mm; unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; VM_BUG_ON(start & ~PAGE_MASK); VM_BUG_ON(end & ~PAGE_MASK); VM_BUG_ON_VMA(start < vma->vm_start, vma); VM_BUG_ON_VMA(end > vma->vm_end, vma); VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK; if (vma->vm_flags & VM_LOCKONFAULT) gup_flags &= ~FOLL_POPULATE; /* * We want to touch writable mappings with a write fault in order * to break COW, except for shared mappings because these don't COW * and we would not want to dirty them for nothing. */ if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) gup_flags |= FOLL_WRITE; /* * We want mlock to succeed for regions that have any permissions * other than PROT_NONE. */ if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; /* * We made sure addr is within a VMA, so the following will * not result in a stack expansion that recurses back here. */ return __get_user_pages(current, mm, start, nr_pages, gup_flags, NULL, NULL, nonblocking); } ↓ /** * __get_user_pages() - pin user pages in memory // 将用户页面固定在内存中 * @tsk: task_struct of target task * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin * @gup_flags: flags modifying pin behaviour * @pages: array that receives pointers to the pages pinned. * Should be at least nr_pages long. Or NULL, if caller * only intends to ensure the pages are faulted in. * @vmas: array of pointers to vmas corresponding to each page. * Or NULL if the caller does not require them. * @nonblocking: whether waiting for disk IO or mmap_sem contention * * Returns number of pages pinned. This may be fewer than the number * requested. If nr_pages is 0 or negative, returns 0. If no pages * were pinned, returns -errno. Each page returned must be released * with a put_page() call when it is finished with. 
vmas will only * remain valid while mmap_sem is held. * 返回固定的页数。这可能少于请求的数量。如果nr_pages为0或负数,则返回0。如果没有固定页面,则返回-errno。完成后,返回的每个页面都必须使用put_page()调用释放。只有保留mmap_sem时,vmas才保持有效。 * * Must be called with mmap_sem held. It may be released. See below. * 必须在保持mmap_sem的情况下调用。它可能会被释放。见下文。 * * __get_user_pages walks a process's page tables and takes a reference to * each struct page that each user address corresponds to at a given * instant. That is, it takes the page that would be accessed if a user * thread accesses the given user virtual address at that instant. * __get_user_pages遍历进程的页表,并引用给定瞬间每个用户地址所对应的每个struct页。也就是说,如果用户线程在该时刻访问给定的用户虚拟地址,则它将占用要访问的页面。 * * This does not guarantee that the page exists in the user mappings when * __get_user_pages returns, and there may even be a completely different * page there in some cases (eg. if mmapped pagecache has been invalidated * and subsequently re faulted). However it does guarantee that the page * won't be freed completely. And mostly callers simply care that the page * contains data that was valid *at some point in time*. Typically, an IO * or similar operation cannot guarantee anything stronger anyway because * locks can't be held over the syscall boundary. * 这不能保证在__get_user_pages返回时该页面存在于用户映射中,并且在某些情况下甚至可能存在一个完全不同的页面(例如,如果映射的页面缓存已失效并随后发生故障)。但是,它可以确保不会完全释放该页面。而且大多数调用者只是在乎页面是否包含在某个时间点有效的数据。通常,由于无法在系统调用边界上保持锁,因此IO或类似操作无论如何都无法保证更强大。 * * If @gup_flags & FOLL_WRITE == 0, the page must not be written to. If * the page is written to, set_page_dirty (or set_page_dirty_lock, as * appropriate) must be called after the page is finished with, and * before put_page is called. * 如果@gup_flags和FOLL_WRITE == 0,则不得写入该页面。如果要写入页面,则必须在页面完成之后且在调用put_page之前调用set_page_dirty(或适当的set_page_dirty_lock)。 * * If @nonblocking != NULL, __get_user_pages will not wait for disk IO * or mmap_sem contention, and if waiting is needed to pin all pages, * *@nonblocking will be set to 0. Further, if @gup_flags does not * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in * this case. * 如果@nonblocking!= NULL,则__get_user_pages将不等待磁盘IO或mmap_sem争用,并且如果需要等待固定所有页面,则* @ nonblocking将设置为0。此外,如果@gup_flags不包含FOLL_NOWAIT,则mmap_sem在这种情况下将通过up_read()释放。 * * A caller using such a combination of @nonblocking and @gup_flags * must therefore hold the mmap_sem for reading only, and recognize * when it's been released. Otherwise, it must be held for either * reading or writing and will not be released. * 因此,使用@nonblocking和@gup_flags的组合的调用者必须将mmap_sem保留为只读,并识别它何时被释放。否则,必须保留它以进行读取或书写,并且不会被释放。 * * In most cases, get_user_pages or get_user_pages_fast should be used * instead of __get_user_pages. __get_user_pages should be used only if * you need some special @gup_flags. 
* 在大多数情况下,应使用get_user_pages或get_user_pages_fast而不是__get_user_pages。 __get_user_pages仅在需要一些特殊的@gup_flags时使用。 */ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *nonblocking) { long i = 0; unsigned int page_mask; struct vm_area_struct *vma = NULL; if (!nr_pages) return 0; VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET)); /* * If FOLL_FORCE is set then do not force a full fault as the hinting * fault information is unrelated to the reference behaviour of a task * using the address space */ if (!(gup_flags & FOLL_FORCE)) gup_flags |= FOLL_NUMA; do { struct page *page; unsigned int foll_flags = gup_flags; unsigned int page_increm; /* first iteration or cross vma bound */ /* (3.1) 地址跨区域的处理 */ if (!vma || start >= vma->vm_end) { vma = find_extend_vma(mm, start); if (!vma && in_gate_area(mm, start)) { int ret; ret = get_gate_page(mm, start & PAGE_MASK, gup_flags, &vma, pages ? &pages[i] : NULL); if (ret) return i ? : ret; page_mask = 0; goto next_page; } if (!vma || check_vma_flags(vma, gup_flags)) return i ? : -EFAULT; if (is_vm_hugetlb_page(vma)) { i = follow_hugetlb_page(mm, vma, pages, vmas, &start, &nr_pages, i, gup_flags, nonblocking); continue; } } retry: /* * If we have a pending SIGKILL, don't keep faulting pages and * potentially allocating memory. */ if (unlikely(fatal_signal_pending(current))) return i ? i : -ERESTARTSYS; cond_resched(); /* (3.2) 逐个查询vma中地址对应的page是否已经分配 */ page = follow_page_mask(vma, start, foll_flags, &page_mask); if (!page) { int ret; /* (3.3) 如果page没有分配,则使用缺页处理来分配page */ ret = faultin_page(tsk, vma, start, &foll_flags, nonblocking); switch (ret) { case 0: goto retry; case -EFAULT: case -ENOMEM: case -EHWPOISON: return i ? i : ret; case -EBUSY: return i; case -ENOENT: goto next_page; } BUG(); } else if (PTR_ERR(page) == -EEXIST) { /* * Proper page table entry exists, but no corresponding * struct page. */ goto next_page; } else if (IS_ERR(page)) { return i ? i : PTR_ERR(page); } if (pages) { pages[i] = page; flush_anon_page(vma, page, start); flush_dcache_page(page); page_mask = 0; } next_page: if (vmas) { vmas[i] = vma; page_mask = 0; } page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); if (page_increm > nr_pages) page_increm = nr_pages; i += page_increm; start += page_increm * PAGE_SIZE; nr_pages -= page_increm; } while (nr_pages); return i; } ↓

2.3 Page fault handling

2.3.1 do_page_fault()

The page-fault handling path is: do_page_fault() -> __do_page_fault() -> do_user_addr_fault() -> handle_mm_fault().

The core function is handle_mm_fault(), the same one used in the mm_populate() path; only the point at which it is invoked differs.

mm_populate() runs right after mmap() has mapped the vma's virtual addresses: if the flags asked for VM_LOCKED (or MAP_POPULATE), physical memory is allocated and the contents are read in immediately.

do_page_fault(), by contrast, allocates physical memory and reads in the contents only after an actual memory access has taken place.
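The contrast can be made visible with getrusage(): a lazily mapped anonymous region takes roughly one minor fault per page on first touch, while a MAP_POPULATE mapping has already been pre-faulted by mm_populate() at mmap() time. A rough sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

static long touch_and_count(int extra_flags, size_t len)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return -1; }
    long before = minor_faults();
    memset(p, 1, len);                  /* touch every page */
    long after = minor_faults();
    munmap(p, len);
    return after - before;
}

int main(void)
{
    size_t len = 1024 * 4096;           /* 1024 pages */
    printf("lazy mapping:  %ld faults while touching\n", touch_and_count(0, len));
    printf("MAP_POPULATE:  %ld faults while touching\n", touch_and_count(MAP_POPULATE, len));
    return 0;
}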

2.3.2 faultin_page()

There are several causes of a page fault: the page is not present yet (first access), copy-on-write (COW), or the page has been swapped out.
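The COW case can be demonstrated with fork(): after the fork both processes share the same read-only anonymous pages, and the first write faults so that the kernel can give the writer its own copy (for anonymous memory this copy is made in do_wp_page(); the do_cow_fault() path shown later handles the first write to a private file-backed page). A small sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    /* Private anonymous memory, written once so the page is present. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(p, "parent data");

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: both processes currently share the same write-protected page.
         * This write faults and the kernel copies the page (COW). */
        strcpy(p, "child data");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    /* The parent's page is untouched because the child wrote into its own copy. */
    printf("parent still sees: \"%s\"\n", p);
    munmap(p, 4096);
    return 0;
}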

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, unsigned long address, unsigned int *flags, int *nonblocking) { unsigned int fault_flags = 0; int ret; /* mlock all present pages, but do not fault in new pages */ if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK) return -ENOENT; if (*flags & FOLL_WRITE) fault_flags |= FAULT_FLAG_WRITE; if (*flags & FOLL_REMOTE) fault_flags |= FAULT_FLAG_REMOTE; if (nonblocking) fault_flags |= FAULT_FLAG_ALLOW_RETRY; if (*flags & FOLL_NOWAIT) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT; if (*flags & FOLL_TRIED) { VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY); fault_flags |= FAULT_FLAG_TRIED; } ret = handle_mm_fault(vma, address, fault_flags); if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, *flags); if (err) return err; BUG(); } if (tsk) { if (ret & VM_FAULT_MAJOR) tsk->maj_flt++; else tsk->min_flt++; } if (ret & VM_FAULT_RETRY) { if (nonblocking) *nonblocking = 0; return -EBUSY; } /* * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when * necessary, even if maybe_mkwrite decided not to set pte_write. We * can thus safely do subsequent page lookups as if they were reads. * But only do so when looping for pte_write is futile: in some cases * userspace may also be wanting to write to the gotten user page, * which a read fault here might prevent (a readonly page might get * reCOWed by userspace write). */ if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) *flags |= FOLL_COW; return 0; } ↓ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) { int ret; __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); count_memcg_event_mm(vma->vm_mm, PGFAULT); /* do counter updates before entering really critical section. */ check_sync_rss_stat(current); if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, flags & FAULT_FLAG_INSTRUCTION, flags & FAULT_FLAG_REMOTE)) return VM_FAULT_SIGSEGV; /* * Enable the memcg OOM handling for faults triggered in user * space. Kernel faults are handled more gracefully. */ if (flags & FAULT_FLAG_USER) mem_cgroup_oom_enable(); if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); else ret = __handle_mm_fault(vma, address, flags); if (flags & FAULT_FLAG_USER) { mem_cgroup_oom_disable(); /* * The task may have entered a memcg OOM situation but * if the allocation error was handled gracefully (no * VM_FAULT_OOM), there is no need to kill anything. * Just clean up the OOM state peacefully. 
*/ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)) mem_cgroup_oom_synchronize(false); } return ret; } ↓ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) { struct vm_fault vmf = { .vma = vma, .address = address & PAGE_MASK, .flags = flags, .pgoff = linear_page_index(vma, address), .gfp_mask = __get_fault_gfp_mask(vma), }; unsigned int dirty = flags & FAULT_FLAG_WRITE; struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; p4d_t *p4d; int ret; /* (3.3.1) 查找addr对应的pgd和p4d */ pgd = pgd_offset(mm, address); p4d = p4d_alloc(mm, pgd, address); if (!p4d) return VM_FAULT_OOM; /* (3.3.2) 查找addr对应的pud,如果还没分配空间 */ vmf.pud = pud_alloc(mm, p4d, address); if (!vmf.pud) return VM_FAULT_OOM; if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { pud_t orig_pud = *vmf.pud; barrier(); if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) { /* NUMA case for anonymous PUDs would go here */ if (dirty && !pud_write(orig_pud)) { ret = wp_huge_pud(&vmf, orig_pud); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { huge_pud_set_accessed(&vmf, orig_pud); return 0; } } } /* (3.3.3) 查找addr对应的pmd,如果还没分配空间 */ vmf.pmd = pmd_alloc(mm, vmf.pud, address); if (!vmf.pmd) return VM_FAULT_OOM; if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { pmd_t orig_pmd = *vmf.pmd; barrier(); if (unlikely(is_swap_pmd(orig_pmd))) { VM_BUG_ON(thp_migration_supported() && !is_pmd_migration_entry(orig_pmd)); if (is_pmd_migration_entry(orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); return 0; } if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf, orig_pmd); if (dirty && !pmd_write(orig_pmd)) { ret = wp_huge_pmd(&vmf, orig_pmd); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { huge_pmd_set_accessed(&vmf, orig_pmd); return 0; } } } /* (3.3.4) 处理PTE的缺页 */ return handle_pte_fault(&vmf); } ↓ static int handle_pte_fault(struct vm_fault *vmf) { pte_t entry; /* (3.3.4.1) pmd为空,说明pte肯定还没有分配 */ if (unlikely(pmd_none(*vmf->pmd))) { /* * Leave __pte_alloc() until later: because vm_ops->fault may * want to allocate huge page, and if we expose page table * for an instant, it will be difficult to retract from * concurrent faults and from rmap lookups. */ vmf->pte = NULL; /* (3.3.4.2) pmd不为空,说明: 1、pte可能分配了, 2、也可能没分配只是临近的PTE分配连带创建了pmd */ } else { /* See comment in pte_alloc_one_map() */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /* * A regular pmd is established and it can't morph into a huge * pmd from under us anymore at this point because we hold the * mmap_sem read mode and khugepaged takes it in write mode. * So now it's safe to run pte_offset_map(). */ vmf->pte = pte_offset_map(vmf->pmd, vmf->address); vmf->orig_pte = *vmf->pte; /* * some architectures can have larger ptes than wordsize, * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic * accesses. The code below just needs a consistent view * for the ifs and we later double check anyway with the * ptl lock held. So here a barrier will do. 
*/ barrier(); /* pte还没分配 */ if (pte_none(vmf->orig_pte)) { pte_unmap(vmf->pte); vmf->pte = NULL; } } /* (3.3.4.3) 如果PTE都没有创建过,那是新的缺页异常 */ if (!vmf->pte) { if (vma_is_anonymous(vmf->vma)) /* (3.3.4.3.1) 匿名映射 */ return do_anonymous_page(vmf); else /* (3.3.4.3.2) 文件映射 */ return do_fault(vmf); } /* (3.3.4.4) 如果PTE存在,但是page不存在,page是被swap出去了 */ if (!pte_present(vmf->orig_pte)) return do_swap_page(vmf); if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) return do_numa_page(vmf); vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); spin_lock(vmf->ptl); entry = vmf->orig_pte; if (unlikely(!pte_same(*vmf->pte, entry))) goto unlock; if (vmf->flags & FAULT_FLAG_WRITE) { if (!pte_write(entry)) return do_wp_page(vmf); entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE)) { update_mmu_cache(vmf->vma, vmf->address, vmf->pte); } else { /* * This is needed only for protection faults but the arch code * is not yet telling us if this is a protection fault or not. * This still avoids useless tlb flushes for .text page faults * with threads. */ if (vmf->flags & FAULT_FLAG_WRITE) flush_tlb_fix_spurious_fault(vmf->vma, vmf->address); } unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } |→

2.3.2.1 do_anonymous_page()

static int do_anonymous_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct mem_cgroup *memcg; struct page *page; int ret = 0; pte_t entry; /* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) return VM_FAULT_SIGBUS; /* * Use pte_alloc() instead of pte_alloc_map(). We can't run * pte_offset_map() on pmds where a huge pmd might be created * from a different thread. * * pte_alloc_map() is safe to use under down_write(mmap_sem) or when * parallel threads are excluded by other means. * * Here we only have down_read(mmap_sem). */ /* (3.3.4.3.1.1) 分配pte结构 */ if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address)) return VM_FAULT_OOM; /* See the comment in pte_alloc_one_map() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; /* Use the zero-page for reads */ if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm)) { entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), vma->vm_page_prot)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!pte_none(*vmf->pte)) goto unlock; ret = check_stable_address_space(vma->vm_mm); if (ret) goto unlock; /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_MISSING); } goto setpte; } /* Allocate our own private page. */ if (unlikely(anon_vma_prepare(vma))) goto oom; /* (3.3.4.3.1.2) 分配page结构 */ page = alloc_zeroed_user_highpage_movable(vma, vmf->address); if (!page) goto oom; if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false)) goto oom_free_page; /* * The memory barrier inside __SetPageUptodate makes sure that * preceeding stores to the page contents become visible before * the set_pte_at() write. */ __SetPageUptodate(page); /* (3.3.4.3.1.3) pte成员赋值 */ entry = mk_pte(page, vma->vm_page_prot); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!pte_none(*vmf->pte)) goto release; ret = check_stable_address_space(vma->vm_mm); if (ret) goto release; /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); mem_cgroup_cancel_charge(page, memcg, false); put_page(page); return handle_userfault(vmf, VM_UFFD_MISSING); } inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); mem_cgroup_commit_charge(page, memcg, false, false); lru_cache_add_active_or_unevictable(page, vma); /* (3.3.4.3.1.4) 设置pte */ setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address, vmf->pte); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; release: mem_cgroup_cancel_charge(page, memcg, false); put_page(page); goto unlock; oom_free_page: put_page(page); oom: return VM_FAULT_OOM; }
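The zero-page branch above can also be seen from user space: the first read of a never-written anonymous page is served by the shared zero page mapped read-only, and a later write still takes a second (write-protect) fault before a private zeroed page is installed. A hedged sketch of my own, assuming the zero-page optimization is enabled (mm_forbids_zeropage() is false):

// Hedged sketch: first read of anonymous memory hits the shared zero page,
// the first write then COWs away from it (two minor faults in total).
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minflt(void)
{
	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	volatile char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	long f0 = minflt();
	char c = p[0];		/* read fault: zero page mapped read-only, c == 0 */
	long f1 = minflt();
	p[0] = 42;		/* write fault: a private zeroed page replaces the zero page */
	long f2 = minflt();

	printf("read value %d, faults: read=+%ld write=+%ld\n",
	       c, f1 - f0, f2 - f1);
	munmap((void *)p, 4096);
	return 0;
}

On such a kernel the expected result is one minor fault for the read and one more for the write; writing first would take only a single fault.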

2.3.2.2 do_fault()

static int do_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct mm_struct *vm_mm = vma->vm_mm; int ret; /* * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */ if (!vma->vm_ops->fault) { /* * If we find a migration pmd entry or a none pmd entry, which * should never happen, return SIGBUS */ if (unlikely(!pmd_present(*vmf->pmd))) ret = VM_FAULT_SIGBUS; else { vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); /* * Make sure this is not a temporary clearing of pte * by holding ptl and checking again. A R/M/W update * of pte involves: take ptl, clearing the pte so that * we don't have concurrent modification by hardware * followed by an update. */ if (unlikely(pte_none(*vmf->pte))) ret = VM_FAULT_SIGBUS; else ret = VM_FAULT_NOPAGE; pte_unmap_unlock(vmf->pte, vmf->ptl); } } else if (!(vmf->flags & FAULT_FLAG_WRITE)) /* (1) 普通文件的nopage处理 */ ret = do_read_fault(vmf); else if (!(vma->vm_flags & VM_SHARED)) /* (2) copy on write的处理 */ ret = do_cow_fault(vmf); else /* (3) 共享文件的处理 */ ret = do_shared_fault(vmf); /* preallocated pagetable is unused: free it */ if (vmf->prealloc_pte) { pte_free(vm_mm, vmf->prealloc_pte); vmf->prealloc_pte = NULL; } return ret; }

2.3.2.3 do_read_fault()

static int do_read_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; int ret = 0; /* * Let's call ->map_pages() first and use ->fault() as fallback * if page by the offset is not ready to be mapped (cold cache or * something). */ /* (1.1) 首先尝试一次准备多个page,调用vm_ops->map_pages() */ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) { ret = do_fault_around(vmf); if (ret) return ret; } /* (1.2) 如果失败,尝试一次准备一个page,调用vm_ops->fault() */ ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; ret |= finish_fault(vmf); unlock_page(vmf->page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) put_page(vmf->page); return ret; } ↓ static int __do_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; int ret; /* * Preallocate pte before we take page_lock because this might lead to * deadlocks for memcg reclaim which waits for pages under writeback: * lock_page(A) * SetPageWriteback(A) * unlock_page(A) * lock_page(B) * lock_page(B) * pte_alloc_pne * shrink_page_list * wait_on_page_writeback(A) * SetPageWriteback(B) * unlock_page(B) * # flush A, B to clear the writeback */ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address); if (!vmf->prealloc_pte) return VM_FAULT_OOM; smp_wmb(); /* See comment in __pte_alloc() */ } /* (1.2.1) 调用vma对应的具体vm_ops */ ret = vma->vm_ops->fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY | VM_FAULT_DONE_COW))) return ret; if (unlikely(PageHWPoison(vmf->page))) { if (ret & VM_FAULT_LOCKED) unlock_page(vmf->page); put_page(vmf->page); vmf->page = NULL; return VM_FAULT_HWPOISON; } if (unlikely(!(ret & VM_FAULT_LOCKED))) lock_page(vmf->page); else VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page); return ret; }
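The effect of do_fault_around()/->map_pages() can be observed indirectly: when the neighbouring pages are already in the page cache, a single read access may install several PTEs at once, which shows up as a jump in the process RSS. The sketch below is my own and fairly environment-dependent (it assumes the default fault_around_bytes of 64KB and a warm page cache; demo.bin is an arbitrary scratch file name):

// Hedged sketch: watch fault-around map neighbouring pages of a file mapping.
// Numbers depend on fault_around_bytes and on what else the process is doing.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long resident_pages(void)
{
	long size = 0, resident = 0;
	FILE *f = fopen("/proc/self/statm", "r");
	if (f) {
		fscanf(f, "%ld %ld", &size, &resident);
		fclose(f);
	}
	return resident;
}

int main(void)
{
	const size_t len = 64 * 4096;
	char block[4096] = { 0 };
	int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);

	for (size_t i = 0; i < len / sizeof(block); i++)
		if (write(fd, block, sizeof(block)) != sizeof(block))
			return 1;	/* also leaves the pages warm in the page cache */

	volatile char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	long before = resident_pages();
	(void)p[0];			/* a single read fault at offset 0 */
	long after = resident_pages();

	/* with the default fault_around_bytes (64KB), ->map_pages() may have
	 * installed up to 16 PTEs from the page cache, so the delta can be > 1 */
	printf("pages mapped by one access: %ld\n", after - before);

	munmap((void *)p, len);
	close(fd);
	unlink("demo.bin");
	return 0;
}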

Taking ext4 as an example, its vm_ops are implemented as follows:

static const struct vm_operations_struct ext4_file_vm_ops = {
	.fault		= ext4_filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= ext4_page_mkwrite,
};

Let us briefly walk through the single-page fault handler, ext4_filemap_fault():

int ext4_filemap_fault(struct vm_fault *vmf) { struct inode *inode = file_inode(vmf->vma->vm_file); int err; down_read(&EXT4_I(inode)->i_mmap_sem); err = filemap_fault(vmf); up_read(&EXT4_I(inode)->i_mmap_sem); return err; } ↓ int filemap_fault(struct vm_fault *vmf) { int error; struct file *file = vmf->vma->vm_file; struct address_space *mapping = file->f_mapping; struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; pgoff_t max_off; struct page *page; int ret = 0; max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); if (unlikely(offset >= max_off)) return VM_FAULT_SIGBUS; /* * Do we have something in the page cache already? */ /* (1.2.1.1) 尝试从文件的cache缓存树中,找到offset对应的page */ page = find_get_page(mapping, offset); if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { /* * We found the page, so try async readahead before * waiting for the lock. */ do_async_mmap_readahead(vmf->vma, ra, file, page, offset); } else if (!page) { /* No page in the page cache at all */ /* 如果文件cache缓存中没有page,先读入文件到缓存 */ do_sync_mmap_readahead(vmf->vma, ra, file, offset); count_vm_event(PGMAJFAULT); count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; retry_find: /* 再重新查找page */ page = find_get_page(mapping, offset); if (!page) goto no_cached_page; } /* (1.2.1.2) 锁住page */ if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) { put_page(page); return ret | VM_FAULT_RETRY; } /* Did it get truncated? */ if (unlikely(page->mapping != mapping)) { unlock_page(page); put_page(page); goto retry_find; } VM_BUG_ON_PAGE(page->index != offset, page); /* * We have a locked page in the page cache, now we need to check * that it's up-to-date. If not, it is going to be due to an error. */ if (unlikely(!PageUptodate(page))) goto page_not_uptodate; /* * Found the page and have a reference on it. * We must recheck i_size under page lock. */ max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); if (unlikely(offset >= max_off)) { unlock_page(page); put_page(page); return VM_FAULT_SIGBUS; } /* (1.2.1.3) 返回获取到的page,已经正确填充了对应offset文件的内容 */ vmf->page = page; return ret | VM_FAULT_LOCKED; no_cached_page: /* * We're only likely to ever get here if MADV_RANDOM is in * effect. */ error = page_cache_read(file, offset, vmf->gfp_mask); /* * The page we want has now been added to the page cache. * In the unlikely event that someone removed it in the * meantime, we'll just come back here and read it again. */ if (error >= 0) goto retry_find; /* * An error return from page_cache_read can result if the * system is low on memory, or a problem occurs while trying * to schedule I/O. */ if (error == -ENOMEM) return VM_FAULT_OOM; return VM_FAULT_SIGBUS; page_not_uptodate: /* * Umm, take care of errors if the page isn't up-to-date. * Try to re-read it _once_. We do this synchronously, * because there really aren't any performance issues here * and we need to check for errors. */ ClearPageError(page); error = mapping->a_ops->readpage(file, page); if (!error) { wait_on_page_locked(page); if (!PageUptodate(page)) error = -EIO; } put_page(page); if (!error || error == AOP_TRUNCATED_PAGE) goto retry_find; /* Things didn't work out. Return zero to tell the mm layer so. */ shrink_readahead_size_eio(file, ra); return VM_FAULT_SIGBUS; }
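filemap_fault() only has to start real I/O when the page is missing from the page cache, and that is exactly the case accounted as a major fault (VM_FAULT_MAJOR / ru_majflt). A hedged sketch of my own: write a file, ask the kernel to drop its clean cached pages with posix_fadvise(POSIX_FADV_DONTNEED), then touch the mapping. demo.bin is an arbitrary file name; on tmpfs, or if the advice is ignored, the fault will stay minor.

// Hedged sketch: drop the page cache for a file, then fault it back in through mmap.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	struct rusage a, b;
	char block[4096];
	memset(block, 'x', sizeof(block));

	int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (write(fd, block, sizeof(block)) != sizeof(block))
		return 1;
	fsync(fd);					/* only clean pages can be dropped */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);	/* ask to evict the cached page */

	volatile char *p = mmap(NULL, sizeof(block), PROT_READ, MAP_PRIVATE, fd, 0);

	getrusage(RUSAGE_SELF, &a);
	char c = p[0];					/* likely a major fault: read from disk */
	getrusage(RUSAGE_SELF, &b);

	printf("read '%c': +%ld major, +%ld minor faults\n",
	       c, b.ru_majflt - a.ru_majflt, b.ru_minflt - a.ru_minflt);

	munmap((void *)p, sizeof(block));
	close(fd);
	unlink("demo.bin");
	return 0;
}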

2.3.2.4 do_cow_fault()

static int do_cow_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; int ret; if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; /* (2.1) 分配一个新page:cow_page */ vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); if (!vmf->cow_page) return VM_FAULT_OOM; if (mem_cgroup_try_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL, &vmf->memcg, false)) { put_page(vmf->cow_page); return VM_FAULT_OOM; } /* (2.2) 更新原page的内容 */ ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; if (ret & VM_FAULT_DONE_COW) return ret; /* (2.3) 拷贝page的内容到cow_page */ copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma); __SetPageUptodate(vmf->cow_page); /* (2.4) 提交PTE */ ret |= finish_fault(vmf); unlock_page(vmf->page); put_page(vmf->page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; return ret; uncharge_out: mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg, false); put_page(vmf->cow_page); return ret; }
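From user space, the effect of do_cow_fault() is that a write through a MAP_PRIVATE file mapping lands in a private anonymous copy and never reaches the file. A small sketch of my own (demo.bin is an arbitrary scratch file name):

// Hedged sketch: a write through MAP_PRIVATE is copied into an anonymous page
// and never reaches the file.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (write(fd, "old", 3) != 3)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	p[0] = 'N';			/* do_cow_fault(): copy into a private cow_page */

	char back = 0;
	if (pread(fd, &back, 1, 0) != 1)
		return 1;
	printf("mapping sees '%c', file still starts with '%c'\n", p[0], back);

	munmap(p, 4096);
	close(fd);
	unlink("demo.bin");
	return 0;
}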

2.3.2.5 do_shared_fault()

static int do_shared_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; int ret, tmp; /* (3.1) 更新共享page中的内容 */ ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; /* * Check if the backing address space wants to know that the page is * about to become writable */ if (vma->vm_ops->page_mkwrite) { unlock_page(vmf->page); tmp = do_page_mkwrite(vmf); if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { put_page(vmf->page); return tmp; } } ret |= finish_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) { unlock_page(vmf->page); put_page(vmf->page); return ret; } fault_dirty_shared_page(vma, vmf->page); return ret; }
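By contrast, do_shared_fault() dirties the page-cache page itself, so a write through a MAP_SHARED mapping becomes visible to every reader of the file and is eventually written back. A small sketch of my own (demo.bin is an arbitrary scratch file name):

// Hedged sketch: a write through MAP_SHARED dirties the page-cache page and,
// after writeback, changes the file itself.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (write(fd, "old", 3) != 3)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	p[0] = 'N';			/* do_shared_fault(): page_mkwrite + dirty the shared page */
	msync(p, 4096, MS_SYNC);	/* push the dirty page back to the file */

	char back = 0;
	if (pread(fd, &back, 1, 0) != 1)
		return 1;
	printf("file now starts with '%c'\n", back);	/* 'N' */

	munmap(p, 4096);
	close(fd);
	unlink("demo.bin");
	return 0;
}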

2.3.2.6 do_swap_page()

int do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page = NULL, *swapcache = NULL; struct mem_cgroup *memcg; struct vma_swap_readahead swap_ra; swp_entry_t entry; pte_t pte; int locked; int exclusive = 0; int ret = 0; bool vma_readahead = swap_use_vma_readahead(); if (vma_readahead) { page = swap_readahead_detect(vmf, &swap_ra); swapcache = page; } if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) { if (page) put_page(page); goto out; } /* (3.3.4.4.1) 从PTE中读出swap entry */ entry = pte_to_swp_entry(vmf->orig_pte); if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address); } else if (is_device_private_entry(entry)) { /* * For un-addressable device memory we call the pgmap * fault handler callback. The callback must migrate * the page back to some CPU accessible page. */ ret = device_private_entry_fault(vma, vmf->address, entry, vmf->flags, vmf->pmd); } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; } else { print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); ret = VM_FAULT_SIGBUS; } goto out; } delayacct_set_flag(DELAYACCT_PF_SWAPIN); if (!page) { page = lookup_swap_cache(entry, vma_readahead ? vma : NULL, vmf->address); swapcache = page; } /* (3.3.4.4.2) 分配新的page,并且读出swap中的内容到新page中 */ if (!page) { struct swap_info_struct *si = swp_swap_info(entry); if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(si, entry) == 1) { /* skip swapcache */ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); if (page) { __SetPageLocked(page); __SetPageSwapBacked(page); set_page_private(page, entry.val); lru_cache_add_anon(page); swap_readpage(page, true); } } else { if (vma_readahead) page = do_swap_page_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, &swap_ra); else page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma, vmf->address); swapcache = page; } if (!page) { /* * Back out if somebody else faulted in this pte * while we released the pte lock. */ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (likely(pte_same(*vmf->pte, vmf->orig_pte))) ret = VM_FAULT_OOM; delayacct_clear_flag(DELAYACCT_PF_SWAPIN); goto unlock; } /* Had to read the page from swap area: Major fault */ ret = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); } else if (PageHWPoison(page)) { /* * hwpoisoned dirty swapcache pages are kept for killing * owner processes (which may be unknown at hwpoison time) */ ret = VM_FAULT_HWPOISON; delayacct_clear_flag(DELAYACCT_PF_SWAPIN); swapcache = page; goto out_release; } locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags); delayacct_clear_flag(DELAYACCT_PF_SWAPIN); if (!locked) { ret |= VM_FAULT_RETRY; goto out_release; } /* * Make sure try_to_free_swap or reuse_swap_page or swapoff did not * release the swapcache from under us. The page pin, and pte_same * test below, are not enough to exclude that. Even if it is still * swapcache, we need to check that the page's swap has not changed. */ if (unlikely((!PageSwapCache(page) || page_private(page) != entry.val)) && swapcache) goto out_page; page = ksm_might_need_to_copy(page, vma, vmf->address); if (unlikely(!page)) { ret = VM_FAULT_OOM; page = swapcache; goto out_page; } if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false)) { ret = VM_FAULT_OOM; goto out_page; } /* * Back out if somebody else already faulted in this pte. 
*/ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) goto out_nomap; if (unlikely(!PageUptodate(page))) { ret = VM_FAULT_SIGBUS; goto out_nomap; } /* * The page isn't present yet, go ahead with the fault. * * Be careful about the sequence of operations here. * To get its accounting right, reuse_swap_page() must be called * while the page is counted on swap but not yet in mapcount i.e. * before page_add_anon_rmap() and swap_free(); try_to_free_swap() * must be called after the swap_free(), or it will never succeed. */ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS); pte = mk_pte(page, vma->vm_page_prot); if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); vmf->flags &= ~FAULT_FLAG_WRITE; ret |= VM_FAULT_WRITE; exclusive = RMAP_EXCLUSIVE; } flush_icache_page(vma, page); if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); vmf->orig_pte = pte; /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { page_add_new_anon_rmap(page, vma, vmf->address, false); mem_cgroup_commit_charge(page, memcg, false, false); lru_cache_add_active_or_unevictable(page, vma); } else { do_page_add_anon_rmap(page, vma, vmf->address, exclusive); mem_cgroup_commit_charge(page, memcg, true, false); activate_page(page); } swap_free(entry); if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)) try_to_free_swap(page); unlock_page(page); if (page != swapcache && swapcache) { /* * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check * (to avoid false positives from pte_same). For * further safety release the lock after the swap_free * so that the swap count won't change under a * parallel locked swapcache. */ unlock_page(swapcache); put_page(swapcache); } if (vmf->flags & FAULT_FLAG_WRITE) { ret |= do_wp_page(vmf); if (ret & VM_FAULT_ERROR) ret &= VM_FAULT_ERROR; goto out; } /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address, vmf->pte); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); out: return ret; out_nomap: mem_cgroup_cancel_charge(page, memcg, false); pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: unlock_page(page); out_release: put_page(page); if (page != swapcache && swapcache) { unlock_page(swapcache); put_page(swapcache); } return ret; }
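The swap-in path can be exercised deliberately on kernels that support MADV_PAGEOUT (Linux 5.4+, with swap configured): push an anonymous page out, then touch it again so that do_swap_page() has to bring it back. The sketch below is my own; if MADV_PAGEOUT is unavailable or there is no swap space, madvise() simply has no visible effect.

// Hedged sketch: push an anonymous page out with MADV_PAGEOUT, then touch it
// so do_swap_page() swaps it back in.
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* value from uapi asm-generic/mman-common.h */
#endif

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(p, 'A', len);			/* make it a real, dirty anonymous page */

	madvise(p, len, MADV_PAGEOUT);		/* ask the kernel to reclaim (swap out) it */

	unsigned char vec = 0;
	mincore(p, len, &vec);
	printf("resident after MADV_PAGEOUT: %d\n", vec & 1);	/* 0 if it went to swap */

	printf("value after touching again: %c\n", p[0]);	/* swap-in via do_swap_page() */
	mincore(p, len, &vec);
	printf("resident after touch: %d\n", vec & 1);		/* back to 1 */

	munmap(p, len);
	return 0;
}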

References:

1. Linux内核学习——内存管理之进程地址空间
2. 关于内核中PST的实现
3. 进程地址空间 get_unmmapped_area()
4. linux如何感知通过mmap进行的文件修改
5. linux内存布局和ASLR下的可分配地址空间
6. linux sbrk/brk函数使用整理
