This is the third article in the "Linux kernel memory management" series.
Part I: A simple overview of the key points of the kernel memory management process
Part II: An introduction to the kernel's data structures
Preface
Taking an Intel x86-64 CPU as an example, Linux initialization can be roughly divided into the following stages:
Real mode, after the boot loader jumps to the kernel
The jump from 32-bit protected mode to 64-bit long mode
Decompressing the kernel in 64-bit long mode
After decompression, creating a new page table mapping and jumping to the architecture-specific (arch) C code
Executing the platform-independent initialization code
Memory management plays an important role throughout this process. It includes memory layout planning, segmentation management, page table configuration, moving the kernel, and so on.
This article uses QEMU, with Linux v5.13.9, to walk through the memory management in these stages in order.
Real Mode
Start the compiled 64 bit kernel with the following command:
The kernel parameters "console=ttyS0 nokaslr" specify the kernel console and turn off KASLR. KASLR is turned off mainly for debugging convenience: with KASLR enabled, the address at which the kernel is decompressed is randomized on every boot.
The -s and -S options are for debugging the guest with GDB: -s starts a GDB server on TCP port 1234, and -S keeps the CPU stopped at startup until GDB tells it to continue.
After executing the above command, you can get the kernel address distribution as shown in the figure below.
According to the kernel document "The Linux/x86 Boot Protocol", any boot loader (GRUB/ILO/...) that loads an x86 kernel must comply with this protocol. So far, the protocol version has reached 2.15. In the figure, X is the offset at which the boot loader starts loading the kernel; on the QEMU platform this offset is 0x10000. After loading, the kernel boot sector starts executing at the entry point _start (see the linker script arch/x86/boot/setup.ld).
2:  # Now %dx should point to the end of our stack space
    andw    $~3, %dx        # dword align (might as well...)
    jnz     3f
    movw    $0xfffc, %dx    # Make sure we're not zero
3:  movw    %ax, %ss
    movzwl  %dx, %esp       # Clear upper half of %esp
    sti                     # Now we should have a working stack

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
    pushw   %ds
    pushw   $6f
    lretw
6:

# Check signature at end of setup
    cmpl    $0x5a5aaa55, setup_sig
    jne     setup_bad

# Zero the bss
    movw    $__bss_start, %di
    movw    $_end+3, %cx
    xorl    %eax, %eax
    subw    %di, %cx
    shrw    $2, %cx
    rep; stosl

# Jump to C code (should not return)
    calll   main
The startup code clears the direction flag for the real-mode code and sets up the stack that the C code needs. It then continues at label 6, which checks that the setup code was loaded correctly. A note on lretw: together with the two push instructions before it, it performs a far return. The two pushes place the return segment (%ds) and the return offset (the address of label 6) on the stack. See the Intel® 64 and IA-32 Architectures Software Developer's Manual. As the comment says, the purpose of lretw here is to reload the CS register so that it is consistent with the other segment registers. The Intel manual describes the far return as follows:
When executing a far return, the processor pops the return instruction pointer from the top of the stack into the EIP register, then pops the segment selector from the top of the stack into the CS register. The processor then begins program execution in the new code segment at the new instruction pointer.
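To see why the far return keeps executing at the same physical byte, it helps to write out the real-mode address arithmetic. The following is a small sketch, not kernel code: the helper names are made up for illustration, and the segment value 0x1000 in the note below is only an assumption that matches the QEMU load offset 0x10000 mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode linear address: segment shifted left by 4, plus the 16-bit offset. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Hypothetical helper: given an offset that was relative to the entry %cs
 * (assumed to be %ds + 0x20, as the comment in the listing says), return the
 * offset that names the same byte once %cs has been set equal to %ds.
 */
uint16_t offset_after_normalize(uint16_t off_rel_entry_cs)
{
    /* 0x20 paragraphs of 16 bytes each are 0x200 bytes. */
    return (uint16_t)(off_rel_entry_cs + (0x20u << 4));
}
```

With %ds = 0x1000, the byte reached as 0x1020:0x0000 is the same byte as 0x1000:0x0200, so after lretw the code keeps running from the same place, only with %cs equal to %ds.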
The code then zeroes the BSS segment and calls the main function.
    /* First, copy the boot header into the "zeropage" */
    copy_boot_params();

    console_init();
    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

    init_heap();

    if (validate_cpu()) {
        puts("Unable to boot - please use a kernel appropriate "
             "for your CPU.\n");
        die();
    }

    set_bios_mode();
    detect_memory();
    keyboard_init();
    query_ist();

#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif

#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
    query_edd();
#endif

    set_video();
    go_to_protected_mode();
The comments in the main function are fairly clear. Here we only discuss copy_boot_params, detect_memory and go_to_protected_mode:
copy_boot_params copies the boot parameter information in memory (see the figure "Real Mode Memory Distribution") into the global variable boot_params. boot_params holds the parameters defined by the Linux boot protocol. Some fields are filled in at build time, and the remaining ones are filled in by the boot loader. boot_params, including the kernel command line, is used throughout the later stages of kernel initialization.
detect_memory mainly uses the BIOS e820 interface to obtain the basic memory layout and stores it in the area reserved for it in boot_params (boot_params.e820_table and boot_params.e820_entries).
go_to_protected_mode enables the A20 gate (address line 20, which makes memory above 1 MB addressable), sets up the GDT and IDT, disables interrupts, turns on protected mode, and jumps to the 32-bit code. It is defined in arch/x86/boot/pm.c.
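The memory map that detect_memory stores in boot_params.e820_table can be pictured as an array of (address, size, type) records. The sketch below is a simplified user-space model, not the kernel's definition: the struct name, the helper and the sample table are assumptions made up for illustration. The only fact relied on is that e820 type 1 marks usable RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM      1u   /* usable RAM */
#define E820_TYPE_RESERVED 2u   /* reserved by firmware */

/* Simplified stand-in for one e820 table entry. */
struct e820_entry_sketch {
    uint64_t addr;   /* start of the region */
    uint64_t size;   /* length of the region in bytes */
    uint32_t type;   /* region type */
};

/* Hypothetical sample map, loosely shaped like a small 128 MB guest. */
const struct e820_entry_sketch sample_table[] = {
    { 0x00000000ULL, 0x0009fc00ULL, E820_TYPE_RAM },
    { 0x0009fc00ULL, 0x00000400ULL, E820_TYPE_RESERVED },
    { 0x00100000ULL, 0x07ee0000ULL, E820_TYPE_RAM },
};

/* Sum the sizes of all entries that describe usable RAM. */
uint64_t usable_ram_bytes(const struct e820_entry_sketch *tbl, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (tbl[i].type == E820_TYPE_RAM)
            total += tbl[i].size;
    return total;
}
```

Later stages of initialization walk this table in much the same way to decide which physical ranges may be handed to the memory allocators.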
protected_mode_jump is an assembly routine defined in arch/x86/boot/pmjump.S and is not analyzed in detail here. It sets the PE (Protection Enable) bit of the CR0 register and executes a jump to the 32-bit code at the .Lin_pm32 label.
SYM_FUNC_START_LOCAL_NOALIGN(.Lin_pm32)
    # Set up data segments for flat 32-bit mode
    movl    %ecx, %ds
    movl    %ecx, %es
    movl    %ecx, %fs
    movl    %ecx, %gs
    movl    %ecx, %ss
    # The 32-bit code sets up its own stack, but this way we do have
    # a valid stack if some debugging hack wants to use it.
    addl    %ebx, %esp

    # Set up TR to make Intel VT happy
    ltr     %di

    # Clear registers to allow for future extensions to the
    # 32-bit boot protocol
    xorl    %ecx, %ecx
    xorl    %edx, %edx
    xorl    %ebx, %ebx
    xorl    %ebp, %ebp
    xorl    %edi, %edi

    # Set up LDTR to make Intel VT happy
    lldt    %cx

    jmpl    *%eax           # Jump to the 32-bit entrypoint
SYM_FUNC_END(.Lin_pm32)
The 32-bit code starts by reloading every data segment register with BOOT_DS. A segment register holds a segment selector that refers to an entry in the GDT; BOOT_DS refers to the entry at index 3 (GDT_ENTRY_BOOT_DS). The GDT entries used at this point are defined in arch/x86/boot/pm.c. They describe a segment with base 0 and a size of 4 GB, which is enough to cover the region in which the kernel's 32-bit initialization code runs. For background on the GDT and segment selectors, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3, Chapter 3, "Protected-Mode Memory Management". The code then clears several registers and jumps to the start address of the 32-bit kernel.
This start address is the first argument of protected_mode_jump, boot_params.hdr.code32_start. In our QEMU environment its value is 0x100000.
Why is it in the eax register? This is a matter of calling convention. The real-mode setup code is built with GCC's -mregparm=3 option, so the first three integer arguments are passed in eax, edx and ecx instead of on the stack. This differs from the calling convention of the System V Application Binary Interface (AMD64), which the 64-bit kernel code follows. An ABI (Application Binary Interface) is defined per architecture.
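To make the "base 0, size 4 GB" description of the boot data segment concrete, here is a sketch that packs and unpacks an 8-byte segment descriptor. The packing follows the layout described in the Intel manual and mirrors, as far as I recall, the kernel's GDT_ENTRY() macro. The flags value 0xc093 for the data segment is my reading of arch/x86/boot/pm.c and should be treated as an assumption, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Pack flags, base and limit into a 64-bit segment descriptor. */
uint64_t gdt_entry(uint64_t flags, uint64_t base, uint64_t limit)
{
    return ((base  & 0xff000000ULL) << 32) |   /* base 31..24  -> bits 63..56 */
           ((flags & 0x0000f0ffULL) << 40) |   /* access byte and G/D flags   */
           ((limit & 0x000f0000ULL) << 32) |   /* limit 19..16 -> bits 51..48 */
           ((base  & 0x00ffffffULL) << 16) |   /* base 23..0   -> bits 39..16 */
            (limit & 0x0000ffffULL);           /* limit 15..0  -> bits 15..0  */
}

/* Recover the 32-bit base address from a descriptor. */
uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0x00ffffffULL) | ((d >> 32) & 0xff000000ULL));
}

/* Size of the segment in bytes, honouring the granularity (G) bit. */
uint64_t desc_size_bytes(uint64_t d)
{
    uint64_t limit = (d & 0xffffULL) | ((d >> 32) & 0xf0000ULL);
    int page_granular = (int)((d >> 55) & 1);

    return page_granular ? (limit + 1) << 12 : limit + 1;
}

/* A selector with RPL 0 and TI 0 is simply the GDT index times 8. */
uint16_t selector_for_index(unsigned int index)
{
    return (uint16_t)(index << 3);
}
```

A limit of 0xfffff with the granularity bit set means 0x100000 pages of 4 KB, which is exactly 4 GB, and index 3 gives the selector 0x18.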
Jump from 32-bit protected mode to 64-bit long mode
startup_32
The 32-bit code starts at address 0x100000. For the exact layout, refer to the linker script vmlinux.lds.
A linker script tells the linker how to combine object files into the final image. We usually do not specify one when compiling with GCC, because the linker has a default script.
After linking with ld and loading by QEMU, we get the memory layout shown on the left side of the figure below. Starting at address 0x100000 come the 32-bit protected mode entry code, the decompression code and so on, followed by the compressed kernel. After that come the code segment, read-only data segment, data segment and uninitialized data segment (BSS) of the decompression code, and the page table used by the 32-bit code.
From the linker script we can see that the entry point of the 32-bit code is startup_32. The code first disables interrupts, loads a new GDT, reloads the segment registers, and sets up a stack.
Note that the code defines a macro, rva, which computes the address of a symbol relative to startup_32, so that the same code works no matter where the kernel is loaded.
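The idea behind rva can be shown with plain arithmetic: the code only needs the distance from startup_32 to a symbol, and adds that distance to wherever startup_32 ended up at run time. The following is an illustrative sketch, with made-up function and parameter names, not the macro itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Run-time address of a symbol in a position-independent image.
 *   load_base       - where startup_32 actually sits in memory
 *   link_symbol     - link-time address of the symbol
 *   link_startup_32 - link-time address of startup_32
 */
uint32_t runtime_address(uint32_t load_base, uint32_t link_symbol,
                         uint32_t link_startup_32)
{
    uint32_t rva = link_symbol - link_startup_32;  /* relative address */

    return load_base + rva;
}
```

For example, a symbol linked 0x4000 bytes after startup_32 is found at 0x104000 when the image is loaded at 0x100000, and at 0x1004000 when it is loaded at 0x1000000.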
After loading the IDT, the code turns on PAE. It then calculates the address to which the compressed kernel will be moved for in-place decompression and stores it in ebx. In startup_32, BP_kernel_alignment(%esi) reads the corresponding field from the boot_params area. Let us look again at "The Linux/x86 Boot Protocol" and its description of the header fields involved:
Offset/Size   Field              Description
0230/4        kernel_alignment   Physical addr alignment required for kernel
0260/4        init_size          Linear memory required during initialization
01E4/4        scratch            Scratch field for the kernel setup code
init_size is the amount of memory the kernel needs for initialization and decompression. It reserves enough space for the compressed kernel to be decompressed in place. For how this size is calculated, see the comments in arch/x86/boot/header.S (I have not fully understood this part yet; it will be supplemented later). Next, the kernel builds a page table that maps the first 4 GB of memory using 2 MB pages (see the right side of the figure), loads the address of the top-level page table (pgtable) into the CR3 register, and turns on 64-bit long mode. Quoting Wikipedia:
In long mode, 64-bit applications (or operating systems) can use 64-bit instructions and registers, while 32-bit programs run in a compatibility sub-mode.
4 GB is large enough for decompressing the kernel and the other work done here. The kernel then pushes the 64-bit address of startup_64 onto the stack, enables paging, and executes lret to jump to startup_64.
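The shape of this early page table is easy to model in user space: one top-level table with a single entry, one second-level table with four entries, and four third-level tables holding 2048 entries of 2 MB each. The sketch below is only a model under assumptions: the struct, the function names and the flag values 0x7 and 0x183 are written from my memory of the boot code and are not quoted from it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PMD_PAGE_SIZE (2ULL << 20)   /* one 2 MB page */

/* Six 4 KB tables: 1 x level 4, 1 x level 3, 4 x level 2. */
struct boot_pgtable_sketch {
    uint64_t pml4[512];
    uint64_t pdpt[512];
    uint64_t pmd[4][512];
};

/* Fill the tables so that virtual address == physical address below 4 GB.
 * table_phys is the physical address at which the struct is assumed to live. */
void build_identity_4g(struct boot_pgtable_sketch *t, uint64_t table_phys)
{
    memset(t, 0, sizeof(*t));

    /* Level 4: one entry pointing at the level 3 table. */
    t->pml4[0] = (table_phys + 0x1000) | 0x7;

    /* Level 3: four entries, each covering 1 GB. */
    for (int i = 0; i < 4; i++)
        t->pdpt[i] = (table_phys + 0x2000 + (uint64_t)i * 0x1000) | 0x7;

    /* Level 2: 4 * 512 = 2048 large-page entries, each covering 2 MB. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 512; j++)
            t->pmd[i][j] = ((uint64_t)(i * 512 + j) * PMD_PAGE_SIZE) | 0x183;
}

/* Software walk of the model: translate a virtual address below 4 GB. */
uint64_t walk_identity(const struct boot_pgtable_sketch *t, uint64_t va)
{
    assert(va < (4ULL << 30));

    unsigned int pdpt_i = (unsigned int)((va >> 30) & 511);
    unsigned int pmd_i = (unsigned int)((va >> 21) & 511);
    uint64_t entry = t->pmd[pdpt_i][pmd_i];

    /* Drop the flag bits, keep the 2 MB frame, add the offset in the page. */
    return (entry & ~(PMD_PAGE_SIZE - 1)) + (va & (PMD_PAGE_SIZE - 1));
}
```

Six pages of tables are enough to cover the whole 4 GB, which is why large pages are attractive this early in boot.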
Here we skip the SEV check, which is a feature of AMD CPUs and is not analyzed in this article.
startup_64
startup_64 also begins by disabling interrupts and clearing the segment registers. It then calculates the address to which the compressed kernel will be moved, that is, LOAD_PHYSICAL_ADDR + INIT_SIZE - the length of the compressed image (rva(_end)). This is the same calculation as in startup_32.
You may wonder why this is done again when startup_32 has already done it. The reason is given in the code comments: the kernel may be loaded directly by a 64-bit boot loader and entered at startup_64.
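The target address calculation described above can be sketched in C. This assumes a relocatable kernel, where the load address is first rounded up to kernel_alignment and LOAD_PHYSICAL_ADDR acts as a lower bound. The function and parameter names are hypothetical and the numbers in the note are made-up examples, not values from a real build.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address to which the compressed image is moved before in-place
 * decompression. kernel_alignment is assumed to be a power of two.
 */
uint32_t decompress_move_target(uint32_t load_addr,
                                uint32_t kernel_alignment,
                                uint32_t load_physical_addr,
                                uint32_t init_size,
                                uint32_t image_len)
{
    assert(kernel_alignment != 0 &&
           (kernel_alignment & (kernel_alignment - 1)) == 0);

    /* Round the load address up to the required alignment. */
    uint32_t base = (load_addr + kernel_alignment - 1) & ~(kernel_alignment - 1);

    /* Never go below the configured physical load address. */
    if (base < load_physical_addr)
        base = load_physical_addr;

    /* Place the image so that it ends where the init_size window ends. */
    return base + init_size - image_len;
}
```

Putting the compressed image at the end of the init_size window lets the decompressed output grow from the start of the window without overwriting input that has not been consumed yet.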
The kernel then loads an empty IDT, checks whether 5-level paging needs to be enabled, and handles that accordingly. After clearing the EFLAGS register, it moves the compressed kernel to the in-place decompression location (LOAD_PHYSICAL_ADDR + INIT_SIZE - length of the compressed image), reloads the GDT at its relocated address, and jumps to the relocated .Lrelocated label.
.Lrelocated
The .Lrelocated code does three main things:
Load the IDT: at this point the IDT only handles the page fault exception. The handler is boot_page_fault, whose C implementation is in arch/x86/boot/compressed/ident_map_64.c. After some basic checks, it creates an identity mapping for the faulting address.
Create identity mappings: mainly for [_head, _end], boot_params and the boot command line.
Decompress the kernel: decompression itself is not analyzed in this article. Note that if KASLR is enabled, a random offset is computed before decompression to produce the actual address at which the kernel is decompressed.
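The page fault handler mentioned in the first item maps memory on demand. As a rough sketch of the idea, assuming the handler maps the whole 2 MB page that contains the faulting address, the range to identity-map could be computed as follows. The struct and function names are illustrative and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)          /* 2 MB */
#define PMD_MASK (~(PMD_SIZE - 1))

/* Half-open physical range [start, end) to be identity-mapped. */
struct map_range {
    uint64_t start;
    uint64_t end;
};

/* Range covering the 2 MB page that contains the faulting address. */
struct map_range fault_identity_range(uint64_t fault_addr)
{
    struct map_range r;

    r.start = fault_addr & PMD_MASK;   /* round down to a 2 MB boundary */
    r.end = r.start + PMD_SIZE;
    return r;
}
```

Because the mapping is an identity mapping, the virtual range and the physical range are the same numbers, so no address translation table is needed to decide what to map.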
After decompression, the code jumps to the entry point of the decompressed kernel, the startup_64 label in arch/x86/kernel/head_64.S.
After kernel decompression
The flow of startup_64 is as follows.
The segment descriptors in the startup GDT all describe 4 GB segments starting at address 0. The startup IDT (also called the bringup IDT) mainly handles the VMM Communication exception on AMD CPUs, which is related to virtual machines. Execution then reaches verify_cpu, an assembly routine defined in verify_cpu.S that uses the cpuid instruction to check that the CPU supports long mode and the SSE instruction set. After the check, the kernel calls __startup_64, which rebuilds the kernel's early 4-level or 5-level page table. The random offset generated by KASLR has to be taken into account here, which is why this function calls fixup_pointer many times to correct page table entries. The page tables are defined in head_64.S as follows:
SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
    .fill   512,8,0
    .fill   PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)

SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
    .fill   512*EARLY_DYNAMIC_PAGE_TABLES,8,0
SYM_DATA_END(early_dynamic_pgts)

SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
    .rept (FIXMAP_PMD_NUM)
    .fill   512,8,0
    .endr
SYM_DATA_END(level1_fixmap_pgt)
This is hard to follow, so let us use a figure to illustrate it:
The figure shows the early mapping set up for the kernel code, so that the kernel code can run happily. (Of course, it does not always run happily: as we will see later, the kernel has to register an IDT entry to handle page faults.)
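To relate the early mapping to concrete table slots, it helps to split a kernel virtual address into its per-level indices. The sketch below assumes 4-level paging with 9 index bits per level and uses 0xffffffff80000000 as the value of __START_KERNEL_map. The helper names follow common kernel naming but are defined here only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL   /* assumed __START_KERNEL_map */

/* Index into the top-level table (bits 47..39). */
unsigned int pgd_index(uint64_t va) { return (unsigned int)((va >> 39) & 511); }

/* Index into the level 3 table (bits 38..30). */
unsigned int pud_index(uint64_t va) { return (unsigned int)((va >> 30) & 511); }

/* Index into the level 2 table (bits 29..21), one entry per 2 MB. */
unsigned int pmd_index(uint64_t va) { return (unsigned int)((va >> 21) & 511); }
```

The base of the kernel mapping lands in the last top-level slot (511) and the second-to-last level 3 slot (510), which is why the kernel image occupies the top 2 GB of the virtual address space.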
    /* Switch to new page-table */
    movq    %rax, %cr3

    /* Ensure I am executing from virtual addresses */
    movq    $1f, %rax
    ANNOTATE_RETPOLINE_SAFE
    jmp     *%rax
After __startup_64 returns, we skip some SEV handling and start using the new kernel page table. Execution then jumps to a virtual address in the region starting at __START_KERNEL_map. After that the code reinitializes the GDT, sets the segment registers, sets up the stack for initialization, and sets up the IDT. In the middle there is the following piece of code:
    /* Set up %gs.
     *
     * The base of %gs always points to fixed_percpu_data. If the
     * stack protector canary is enabled, it is located at %gs:40.
     * Note that, on SMP, the boot cpu uses init data section until
     * the per cpu areas are set up.
     */
    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr
    ..................
    pushq   $.Lafter_lret   # put return address on stack for unwinder
    xorl    %ebp, %ebp      # clear frame pointer
    movq    initial_code(%rip), %rax
    pushq   $__KERNEL_CS    # set correct cs
    pushq   %rax            # target address in negative space
    lretq
This stores the address of the per-CPU data area in the GS base model-specific register (MSR), which matters on multiprocessor systems. The code then jumps to the C initialization code, x86_64_start_kernel.
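The two movl instructions before wrmsr exist because wrmsr takes the MSR number in ecx and the 64-bit value split across edx (high half) and eax (low half). A minimal sketch of that split, with illustrative names and a made-up sample value:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_GS_BASE 0xc0000101u   /* MSR number that holds the %gs base */

/* The two 32-bit halves that wrmsr expects in eax and edx. */
struct msr_halves {
    uint32_t eax;   /* low 32 bits  */
    uint32_t edx;   /* high 32 bits */
};

/* Split a 64-bit value the way the assembly loads initial_gs. */
struct msr_halves split_msr_value(uint64_t value)
{
    struct msr_halves h;

    h.eax = (uint32_t)(value & 0xffffffffULL);
    h.edx = (uint32_t)(value >> 32);
    return h;
}

/* Recombine the halves, as the CPU does when executing wrmsr. */
uint64_t join_msr_value(struct msr_halves h)
{
    return ((uint64_t)h.edx << 32) | h.eax;
}
```

This is why the assembly reads initial_gs and initial_gs+4 separately: on a little-endian machine the low half is stored first.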
Summary
This article focused on memory management from the moment the kernel is loaded by the boot loader up to the C code entry. The main steps are:
Turning on protected mode
Turning on long mode
Adding a random offset when decompressing the kernel
Creating the kernel page table and jumping to virtual addresses
Later in the series, we will analyze what happens after the C code entry.