
Linux kernel memory management - from the perspective of kernel startup process

This is the third article in the <Linux kernel memory management> series.

Part I: a brief overview of the kernel memory management process

Part II: an introduction to the kernel's core data structures

Preface

Taking an Intel x64 CPU as an example, Linux initialization can be roughly divided into the following stages:

  1. Real mode, after the boot loader jumps into the kernel
  2. The jump from 32-bit protected mode to 64-bit long mode
  3. Decompressing the kernel in 64-bit long mode
  4. After decompression, creating new page-table mappings and jumping to the arch (platform) specific C code
  5. Executing the platform-independent initialization code

Memory management plays an important role in all of the above: it covers memory layout planning, segmentation, page-table setup, kernel relocation, and so on.

This article uses QEMU emulation and Linux v5.13.9 to walk through the memory management in each of the above stages in order.

Real Mode

Start the compiled 64-bit kernel with the following command:

 qemu-system-x86_64 -kernel arch/x86/boot/bzImage -nographic  -append "console=ttyS0 nokaslr"  -s -S

In this command:

  • The kernel parameters "console=ttyS0 nokaslr" specify the kernel console and disable KASLR (with KASLR enabled, the kernel decompression address is randomized on every boot, which is inconvenient for debugging).
  • The -s and -S options are for debugging QEMU with GDB (-s starts a gdbserver, -S halts the CPU until a debugger attaches and continues).

After executing the above command, the kernel address layout is as shown in the figure below.
 Real mode memory distribution

According to the kernel documentation (Linux/x86 Boot Protocol), any boot loader (GRUB, LILO, ...) that loads an x86 kernel must comply with this protocol; as of this writing the protocol version is 2.15. In the figure, X is the offset at which the boot loader loads the kernel; on the QEMU platform this offset is 0x10000. After loading, the kernel boot sector begins executing at the entry point _start; see the linker script arch/x86/boot/setup.ld:

 OUTPUT_FORMAT("elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)

SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
....

From there it jumps to start_of_setup and continues execution:

#arch/x86/boot/header.S
	.globl	_start
_start:
	.byte	0xeb		# short (2-byte) jump
	.byte	start_of_setup-1f

	.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
	movw	%ds, %ax
	movw	%ax, %es
	cld

	movw	%ss, %dx
	cmpw	%ax, %dx	# %ds == %ss?
	movw	%sp, %dx
	je	2f		# -> assume %sp is reasonably set

	# Invalid %ss, make up a new stack
	movw	$_end, %dx
	testb	$CAN_USE_HEAP, loadflags
	jz	1f
	movw	heap_end_ptr, %dx
1:	addw	$STACK_SIZE, %dx
	jnc	2f
	xorw	%dx, %dx	# Prevent wraparound

2:	# Now %dx should point to the end of our stack space
	andw	$~3, %dx	# dword align (might as well...)
	jnz	3f
	movw	$0xfffc, %dx	# Make sure we're not zero
3:	movw	%ax, %ss
	movzwl	%dx, %esp	# Clear upper half of %esp
	sti			# Now we should have a working stack

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
	pushw	%ds
	pushw	$6f
	lretw
6:
# Check signature at end of setup
	cmpl	$0x5a5aaa55, setup_sig
	jne	setup_bad

# Zero the bss
	movw	$__bss_start, %di
	movw	$_end+3, %cx
	xorl	%eax, %eax
	subw	%di, %cx
	shrw	$2, %cx
	rep; stosl

# Jump to C code (should not return)
	calll	main

The code above clears the direction flag for the real-mode code and sets up the stack (and, when CAN_USE_HEAP is set, the heap) that the C code will use. It then jumps to label 6 to verify that the kernel setup code was loaded correctly. A note on lretw: together with the two preceding instructions it performs a far return; the two pushw instructions place the return segment and offset on the stack (see the <Intel 64 and IA-32 Architectures Software Developer's Manual>). As the comment says, the point of using lret here is to reload the CS register so that it matches the other segment registers. The Intel manual describes the far-return behaviour of the ret instruction as follows:

When executing a far return, the processor pops the return instruction pointer from the top of the stack into the EIP
register, then pops the segment selector from the top of the stack into the CS register. The processor then begins
program execution in the new code segment at the new instruction pointer.

Next the code zeroes the BSS section and then calls into the C function main.
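For intuition, here is a user-space C sketch of what the rep; stosl loop does: zero [__bss_start, _end) one dword at a time, with the byte count rounded up to whole dwords (the $_end+3 / shrw $2 arithmetic). The fake_bss array merely stands in for the linker-provided BSS region; this is an illustration, not kernel code.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustration only (not kernel code): the setup stub zeroes
 * [__bss_start, _end) one 32-bit word at a time, rounding the byte
 * count up to whole dwords ("$_end+3" then "shrw $2"). fake_bss
 * stands in for the real, linker-provided BSS region. */
static uint32_t fake_bss[32];	/* pretend BSS, dword-aligned */

static void zero_bss(uint8_t *start, uint8_t *end)
{
	size_t dwords = ((size_t)(end - start) + 3) >> 2;	/* round up */
	memset(start, 0, dwords * 4);				/* rep; stosl equivalent */
}

int main(void)
{
	zero_bss((uint8_t *)fake_bss, (uint8_t *)(fake_bss + 32));
	return 0;
}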

/* arch/x86/boot/main.c: main() (excerpt) */
	/* First, copy the boot header into the "zeropage" */
	copy_boot_params();
	console_init();
	if (cmdline_find_option_bool("debug"))
		puts("early console in setup code\n");
	init_heap();
	if (validate_cpu()) {
		puts("Unable to boot - please use a kernel appropriate "
		     "for your CPU.\n");
		die();
	}
	set_bios_mode();
	detect_memory();
	keyboard_init();
	query_ist();
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
	query_apm_bios();
#endif
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
	query_edd();
#endif
	set_video();
	go_to_protected_mode();

The comments in main are fairly clear, so here we only discuss copy_boot_params, detect_memory and go_to_protected_mode:

  • copy_boot_params copies the boot_params information in memory (see the figure "Real mode memory distribution") into the global variable boot_params. boot_params holds the parameters defined by the Linux boot protocol: some fields are filled in at build time, and the remaining fields are filled in by the boot loader. boot_params, including the kernel cmdline, threads through every sub-stage of kernel initialization.
  • detect_memory mainly uses the BIOS e820 interface to obtain the basic memory layout and store it in the designated boot_params fields (boot_params.e820_table and boot_params.e820_entries); see the sketch after this list.
  • go_to_protected_mode enables the A20 gate (address line 20), configures the GDT/IDT tables, masks interrupts, turns on protected mode, and jumps to the 32-bit code; its code is shown after the sketch below.
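To make the detect_memory bullet concrete, here is the E820 query loop, abridged from detect_memory_e820() in arch/x86/boot/memory.c (the error handling and the older e801/88h fallbacks are omitted):

/* Abridged from arch/x86/boot/memory.c: each BIOS int 0x15/eax=0xE820
 * call returns one memory range; the loop copies the ranges into
 * boot_params.e820_table and counts them. */
static void detect_memory_e820(void)
{
	int count = 0;
	struct biosregs ireg, oreg;
	struct boot_e820_entry *desc = boot_params.e820_table;
	static struct boot_e820_entry buf;	/* static so it is zeroed */

	initregs(&ireg);
	ireg.ax  = 0xe820;
	ireg.cx  = sizeof(buf);
	ireg.edx = SMAP;			/* ASCII "SMAP" signature */
	ireg.di  = (size_t)&buf;

	do {
		intcall(0x15, &ireg, &oreg);	/* one BIOS call per entry */
		ireg.ebx = oreg.ebx;		/* continuation value for the next call */

		if (oreg.eflags & X86_EFLAGS_CF || oreg.eax != SMAP)
			break;			/* end of chain or bad signature */

		*desc++ = buf;
		count++;
	} while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_table));

	boot_params.e820_entries = count;
}

The go_to_protected_mode code follows: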
//arch/x86/boot/pm.c
void go_to_protected_mode(void)
{
	realmode_switch_hook();

	/* Enable the A20 gate */
	if (enable_a20()) {
		puts("A20 gate not responding, unable to boot...\n");
		die();
	}

	reset_coprocessor();

	mask_all_interrupts();

	setup_idt();
	setup_gdt();
	protected_mode_jump(boot_params.hdr.code32_start,
			    (u32)&boot_params + (ds() << 4));
}

protected_mode_jump is an assembly routine defined in arch/x86/boot/pmjump.S. We will not analyze it in depth here; it mainly sets the PE (Protection Enable) bit of the CR0 register and executes a far jump into the 32-bit code (the .Lin_pm32 label).

#arch/x86/boot/pmjump.S
/*
 * void protected_mode_jump(u32 entrypoint, u32 bootparams);
 */
SYM_FUNC_START_NOALIGN(protected_mode_jump)
	........

	movl	%cr0, %edx
	orb	$X86_CR0_PE, %dl	# Protected mode
	movl	%edx, %cr0

	# Transition to 32-bit mode
	.byte	0x66, 0xea		# ljmpl opcode
2:	.long	.Lin_pm32		# offset
	.word	__BOOT_CS		# segment
SYM_FUNC_END(protected_mode_jump)

SYM_FUNC_START_LOCAL_NOALIGN(.Lin_pm32)
	# Set up data segments for flat 32-bit mode
	movl	%ecx, %ds
	movl	%ecx, %es
	movl	%ecx, %fs
	movl	%ecx, %gs
	movl	%ecx, %ss
	# The 32-bit code sets up its own stack, but this way we do have
	# a valid stack if some debugging hack wants to use it.
	addl	%ebx, %esp

	# Set up TR to make Intel VT happy
	ltr	%di

	# Clear registers to allow for future extensions to the
	# 32-bit boot protocol
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ebp, %ebp
	xorl	%edi, %edi

	# Set up LDTR to make Intel VT happy
	lldt	%cx

	jmpl	*%eax			# Jump to the 32-bit entrypoint
SYM_FUNC_END(.Lin_pm32)

The 32-bit code begins by reloading each data segment register with __BOOT_DS. A segment register holds a segment selector pointing at an entry in the GDT, and __BOOT_DS selects the boot data-segment descriptor in that GDT. The GDT at this point is defined in arch/x86/boot/pm.c; roughly speaking it describes segments with base 0 and a 4GB limit, which is more than enough to cover the region where the kernel's 32-bit initialization code executes. For GDT and segment-selector background, see <Intel 64 and IA-32 Architectures Software Developer's Manual> Volume 3, Chapter 3, "Protected-Mode Memory Management". After some register cleanup, the code jumps to the starting address of the 32-bit kernel.
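For reference, the boot GDT built by setup_gdt() in arch/x86/boot/pm.c looks roughly like this (abridged); GDT_ENTRY(flags, base, limit) packs one descriptor, and a 0xfffff limit with 4KB granularity yields the flat 4GB segments mentioned above:

/* arch/x86/boot/pm.c, setup_gdt() (abridged): flat 4GB code and data
 * segments with base 0, plus a small TSS to keep Intel VT happy. */
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
	/* CS: code, read/execute, 4GB, base 0 */
	[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
	/* DS: data, read/write, 4GB, base 0 */
	[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
	/* TSS: 32-bit tss, 104 bytes, base 4096 */
	[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};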

This starting address is the first parameter of protected_mode_jump, boot_params.hdr.code32_start; in our QEMU environment its value is 0x100000.

Why is it passed in the %eax register? The boot code under arch/x86/boot is compiled with GCC's regparm(3) calling convention, so the first three function arguments travel in %eax, %edx and %ecx instead of on the stack. (ABI, the Application Binary Interface, defines such calling conventions per architecture; the 64-bit kernel proper follows the System V AMD64 ABI.)
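A minimal user-space sketch of that convention, assuming a 32-bit GCC build (the function name and values are made up for illustration):

/* Illustration only: with regparm(3), the first three arguments are
 * passed in %eax, %edx and %ecx instead of on the stack.
 * Build with: gcc -m32 -O2 regparm_demo.c */
#include <stdio.h>

static int __attribute__((regparm(3)))
add3(int a, int b, int c)	/* a in %eax, b in %edx, c in %ecx */
{
	return a + b + c;
}

int main(void)
{
	printf("%d\n", add3(1, 2, 3));	/* prints 6 */
	return 0;
}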

Jump from 32-bit protected mode to 64-bit long mode

startup_32

The 32-bit code is laid out starting at address 0x100000. For the specific layout, refer to the linker script vmlinux.lds (arch/x86/boot/compressed/vmlinux.lds.S).

A linker script tells the linker how to combine object files into the output. Normally we do not pass a linker script to GCC, because the toolchain provides a default one.

 #ifdef CONFIG_X86_64
OUTPUT_ARCH(i386:x86-64)
ENTRY(startup_64)
#else
OUTPUT_ARCH(i386)
ENTRY(startup_32)
#endif

SECTIONS
{
/* Be careful parts of head_64.S assume startup_32 is at
* address 0.
*/
. = 0;
.head.text : {
_head = . ;
HEAD_TEXT
_ehead = . ;
}
.rodata..compressed : {
*(.rodata..compressed)
}
.text : {
_text = .; /* Text */
*(.text)
*(.text.*)
_etext = . ;
}
.rodata : {
_rodata = . ;
*(.rodata) /* read-only data */
*(.rodata.*)
_erodata = . ;
}
.data : {
_data = . ;
*(.data)
*(.data.*)
*(.bss.efistub)
_edata = . ;
}
. = ALIGN(L1_CACHE_BYTES);
.bss : {
_bss = . ;
*(.bss)
*(.bss.*)
*(COMMON)
. = ALIGN(8); /* For convenience during zeroing */
_ebss = .;
}
#ifdef CONFIG_X86_64
. = ALIGN(PAGE_SIZE);
.pgtable : {
_pgtable = . ;
*(.pgtable)
_epgtable = . ;
}
#endif
. = ALIGN(PAGE_SIZE); /* keep ZO size page aligned */
_end = .;

After ld linking and QEMU loading, we get the memory layout shown on the left of the figure below. Starting from address 0x100000 come the 32-bit protected-mode entry and setup code, then the compressed kernel payload, and after it the text, read-only data, data, BSS and page-table (.pgtable) sections of the decompression code.
 Memory distribution in 32-bit

From the linker script we can see that the 32-bit entry point is startup_32. The code first disables interrupts, loads a new GDT, reloads the segment registers, and sets up a stack.

Note that the code defines a macro rva, which computes a symbol's offset relative to startup_32, so the same code works no matter where the kernel image has been loaded.
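A toy user-space illustration of the rva idea follows; the addresses are made up, and in the real code the runtime load base ends up in %ebp via the call/pop trick at the top of startup_32:

#include <stdio.h>

/* Toy model of rva(): symbols are linked relative to startup_32 (link
 * address 0, per vmlinux.lds), so at run time the address of any symbol
 * is "load base + (symbol - startup_32)". All values here are made up. */
int main(void)
{
	unsigned long startup_32_link = 0x0;		/* link-time address of startup_32 */
	unsigned long gdt_link        = 0x1f0;		/* hypothetical link-time address of gdt */
	unsigned long load_base       = 0x100000;	/* where the image was actually loaded */

	unsigned long rva_gdt = gdt_link - startup_32_link;	/* what rva(gdt) expands to */
	printf("runtime address of gdt = %#lx\n", load_base + rva_gdt);
	return 0;
}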

#arch/x86/boot/compressed/head_64.S
#define rva(X) ((X) - startup_32)

	.code32
SYM_FUNC_START(startup_32)
	cld
	cli

	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:	popl	%ebp
	subl	$ rva(1b), %ebp

	leal	rva(gdt)(%ebp), %eax
	movl	%eax, 2(%eax)
	lgdt	(%eax)

	/* Load segment registers with our descriptors */
	movl	$__BOOT_DS, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %fs
	movl	%eax, %gs
	movl	%eax, %ss

	leal	rva(boot_stack_end)(%ebp), %esp

	pushl	$__KERNEL32_CS
	leal	rva(1f)(%ebp), %eax
	pushl	%eax
	lretl
1:
	call	startup32_load_idt
	call	verify_cpu
	testl	%eax, %eax
	jnz	.Lno_longmode

#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	......
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jae	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
1:

	addl	BP_init_size(%esi), %ebx
	subl	$ rva(_end), %ebx

	/* Enable PAE mode */
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4

After the IDT is loaded and the CPU is verified, the code computes in %ebx the address at which the compressed kernel will be placed for in-place decompression, and then enables PAE. In the code above, BP_kernel_alignment(%esi) fetches the corresponding field from the boot_params area. Consulting the Linux/x86 Boot Protocol again, these fields are described as follows:

Offset/size   Field              Description
0230/4        kernel_alignment   Physical address alignment required for the kernel
0260/4        init_size          Linear memory required during initialization
01E4/4        scratch            Scratch field for the kernel setup code

init_size records the amount of linear memory needed for kernel initialization and decompression; it exists so that enough space is reserved for decompressing the kernel in place. How this size is calculated is described in arch/x86/boot/header.S (I have not fully worked through it yet; to be supplemented).
Next, the kernel builds a page table that identity-maps the first 4GB of memory with 2MB pages (see the right-hand side of the figure above), loads the page-table root (pgtable) into the CR3 register, and enables 64-bit long mode. Quoting the Wikipedia description of long mode:

In long mode, 64-bit applications (or operating systems) can use 64-bit instructions and registers, while 32-bit programs run in a compatibility sub-mode.

4GB is large enough for kernel decompression and the related work. The kernel then pushes the 64-bit address of startup_64 onto the stack, enables paging, and executes lret to jump to startup_64.
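To make the page-table construction concrete, here is a user-space C sketch of the layout the 32-bit code builds inside the .pgtable area: one top-level entry, four PDPT entries, and 4 x 512 PMD entries that identity-map 0-4GB with 2MB pages. The flag values mirror those used in head_64.S, but this is an illustration, not the kernel code:

#include <stdint.h>
#include <string.h>

#define PMD_SIZE	(2ULL << 20)	/* 2MB large pages */
#define DIR_FLAGS	0x7ULL		/* present + write + user */
#define LARGE_FLAGS	0x183ULL	/* present + write + large page (PS) + global */

/* Stand-in for the 6-page .pgtable area reserved by the linker script:
 * page 0 = top level (PML4), page 1 = PDPT, pages 2-5 = four PMDs. */
static uint64_t pgtable[6 * 512];

static void build_identity_map_4g(void)
{
	uint64_t *pml4 = &pgtable[0 * 512];
	uint64_t *pdpt = &pgtable[1 * 512];
	uint64_t *pmd  = &pgtable[2 * 512];
	int i;

	memset(pgtable, 0, sizeof(pgtable));

	/* One top-level entry pointing at the PDPT */
	pml4[0] = (uint64_t)(uintptr_t)pdpt | DIR_FLAGS;

	/* Four PDPT entries, each covering 1GB via one PMD page */
	for (i = 0; i < 4; i++)
		pdpt[i] = (uint64_t)(uintptr_t)&pmd[i * 512] | DIR_FLAGS;

	/* 4 * 512 PMD entries, each mapping one 2MB page: 4GB in total */
	for (i = 0; i < 4 * 512; i++)
		pmd[i] = (uint64_t)i * PMD_SIZE | LARGE_FLAGS;

	/* The real code then loads the top-level table's address into CR3. */
}

int main(void) { build_identity_map_4g(); return 0; }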

We skip the SEV checks here; SEV is an AMD CPU feature and is not analyzed in this article.

startup_64

startup_64 also begins by disabling interrupts and clearing the segment registers. It again calculates the address to which the compressed kernel will be moved, namely LOAD_PHYSICAL_ADDR + init_size - rva(_end) (the size of the loaded image). This mirrors what startup_32 did.

You may wonder why this calculation is repeated when startup_32 already did it. As the code comments explain, a 64-bit boot loader may load the kernel and enter directly at startup_64, skipping startup_32 entirely.

The kernel then loads an empty IDT, checks whether 5-level paging needs to be enabled and handles it accordingly. After clearing EFLAGS, it copies the compressed kernel to the in-place decompression location (LOAD_PHYSICAL_ADDR + init_size - image length), reloads the relocated GDT, and jumps to the relocated .Lrelocated label to continue execution.

.Lrelocated

The code at .Lrelocated does three main things:

  • Load the IDT: at this point the IDT only installs a page-fault handler, boot_page_fault (see arch/x86/boot/compressed/ident_map_64.c). After some basic checks, its main job is to build an identity mapping for the faulting address; a simplified sketch follows this list.
  • Create identity mappings: it identity-maps [_head, _end], boot_params and the boot cmdline.
  • Decompress the kernel: decompression itself is not analyzed in this article. Note that if KASLR is enabled, a random offset is computed before decompression to produce the kernel's actual decompression address.
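A simplified sketch of the boot-time page-fault handler mentioned above, abridged from arch/x86/boot/compressed/ident_map_64.c (the SEV-ES handling and error-code sanity checks are elided):

/* Abridged: on a page fault during decompression, identity-map the 2MB
 * region around the faulting address and resume, so the faulting access
 * is retried and succeeds. */
void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = native_read_cr2();	/* faulting address */
	unsigned long end;

	address &= PMD_MASK;		/* round down to a 2MB boundary */
	end      = address + PMD_SIZE;

	/* ... error-code and SEV-ES sanity checks elided ... */

	/* Identity map (virtual address == physical address) the region */
	add_identity_map(address, end);
}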

After decompression, execution jumps to the entry of the decompressed kernel, the startup_64 label in arch/x86/kernel/head_64.S.

After kernel decompression

The code of startup_64 is as follows:

SYM_CODE_START_NOALIGN(startup_64)
	UNWIND_HINT_EMPTY
	leaq	(__end_init_task - SIZEOF_PTREGS)(%rip), %rsp

	leaq	_text(%rip), %rdi
	pushq	%rsi
	call	startup_64_setup_env
	popq	%rsi

	pushq	$__KERNEL_CS
	leaq	.Lon_kernel_cs(%rip), %rax
	pushq	%rax
	lretq

.Lon_kernel_cs:
	UNWIND_HINT_EMPTY

	/* Sanitize CPU configuration */
	call	verify_cpu

	leaq	_text(%rip), %rdi
	pushq	%rsi
	call	__startup_64
	popq	%rsi

	addq	$(early_top_pgt - __START_KERNEL_map), %rax
	jmp	1f
SYM_CODE_END(startup_64)

After setting up the stack, the code above calls startup_64_setup_env to configure the startup GDT and IDT. The startup GDT contents are as follows:

static struct desc_struct startup_gdt[GDT_ENTRIES] = {
	[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]   = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
};

The segment descriptors in the startup GDT all describe 4GB segments starting at address 0. The startup IDT (also called the bringup IDT) mainly handles the VMM communication exception used by AMD's SEV-ES, i.e. it is virtualization related.
The kernel then continues to verify_cpu, an assembly routine defined in verify_cpu.S that mainly uses the cpuid instruction to check that the CPU supports long mode and the SSE instruction set.
After the check, the kernel calls __startup_64, which rebuilds the kernel's early 4-level (or 5-level) page tables. At this point the random offset introduced by KASLR must be taken into account, which is why the function repeatedly calls fixup_pointer to correct pointers and page-table entries; see the sketch below.
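fixup_pointer itself is a small helper in arch/x86/kernel/head64.c; it rebases a link-time pointer to where the kernel physically sits right now (shown here roughly as in v5.13):

/* arch/x86/kernel/head64.c (roughly): 'ptr' is the link-time address of
 * a kernel symbol, 'physaddr' is where _text actually resides in
 * physical memory; the difference rebases the pointer so it can be
 * dereferenced before the final page tables are in place. */
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
	return ptr - (void *)_text + (void *)physaddr;
}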
The page tables themselves are defined in head_64.S, as follows:

SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
	.fill	512,8,0
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)

SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
SYM_DATA_END(early_dynamic_pgts)

SYM_DATA(early_recursion_flag, .long 0)

	.data

#if defined(CONFIG_XEN_PV) || defined(CONFIG_PVH)
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.org	init_top_pgt + L4_PAGE_OFFSET*8, 0
	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.org	init_top_pgt + L4_START_KERNEL*8, 0
	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)

SYM_DATA_START_PAGE_ALIGNED(level3_ident_pgt)
	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.fill	511, 8, 0
SYM_DATA_END(level3_ident_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_ident_pgt)
	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
SYM_DATA_END(level2_ident_pgt)
#else
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
	.fill	512,8,0
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)
#endif

#ifdef CONFIG_X86_5LEVEL
SYM_DATA_START_PAGE_ALIGNED(level4_kernel_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level4_kernel_pgt)
#endif

SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level3_kernel_pgt)

SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
SYM_DATA_END(level2_kernel_pgt)

SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
	.fill	(512 - 4 - FIXMAP_PMD_NUM),8,0
	pgtno = 0
	.rept (FIXMAP_PMD_NUM)
	.quad level1_fixmap_pgt + (pgtno << PAGE_SHIFT) - __START_KERNEL_map \
		+ _PAGE_TABLE_NOENC;
	pgtno = pgtno + 1
	.endr
	/* 6 MB reserved space + a 2MB hole */
	.fill	4,8,0
SYM_DATA_END(level2_fixmap_pgt)

SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
	.rept (FIXMAP_PMD_NUM)
	.fill	512,8,0
	.endr
SYM_DATA_END(level1_fixmap_pgt)

This is hard to read directly, so the figure below translates it into a picture:
 Kernel early page table

These tables establish an early mapping for the kernel code so that it can run. (Not entirely trouble-free: as we will see later, the kernel still registers an IDT entry to handle page faults.)

	/* Switch to new page-table */
	movq	%rax, %cr3

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax
	ANNOTATE_RETPOLINE_SAFE
	jmp	*%rax

After __startup_64 returns, we skip over some SEV handling and switch to the new kernel page tables. The code then jumps to virtual addresses starting at __START_KERNEL_map, reinitializes the GDT, sets the segment registers, sets up a stack for the initialization work, and installs the IDT. In the middle there is this piece of code:

	/* Set up %gs.
	 *
	 * The base of %gs always points to fixed_percpu_data. If the
	 * stack protector canary is enabled, it is located at %gs:40.
	 * Note that, on SMP, the boot cpu uses init data section until
	 * the per cpu areas are set up.
	 */
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
	..................
	pushq	$.Lafter_lret	# put return address on stack for unwinder
	xorl	%ebp, %ebp	# clear frame pointer
	movq	initial_code(%rip), %rax
	pushq	$__KERNEL_CS	# set correct cs
	pushq	%rax		# target address in negative space
	lretq

......
SYM_DATA(initial_code, .quad x86_64_start_kernel)

This stores the base address of the per-CPU data area into the 64-bit GS base model-specific register (MSR_GS_BASE), which multiprocessor systems use for per-CPU variables. The code then jumps to the C initialization entry point, x86_64_start_kernel.
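The wrmsr instruction takes the MSR index in %ecx and the 64-bit value split across %edx:%eax, which is why the snippet loads initial_gs in two 32-bit halves. A small user-space sketch of that split (the address value is made up; MSR_GS_BASE is 0xC0000101):

#include <stdint.h>
#include <stdio.h>

/* Illustration only: wrmsr expects the 64-bit value split into
 * %eax (low 32 bits) and %edx (high 32 bits), with the MSR number
 * (here MSR_GS_BASE = 0xC0000101) in %ecx. */
int main(void)
{
	uint64_t initial_gs = 0xffffffff82a14000ULL;	/* hypothetical fixed_percpu_data address */
	uint32_t eax = (uint32_t)(initial_gs & 0xffffffffULL);	/* low half  -> %eax */
	uint32_t edx = (uint32_t)(initial_gs >> 32);		/* high half -> %edx */

	printf("ecx=0xC0000101 eax=%#x edx=%#x\n", eax, edx);
	return 0;
}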

Summary

This article focused on memory management from the moment the kernel is loaded by the boot loader up to the C code entry point. The main steps are:

  • Turning on protected mode
  • Turning on long mode
  • Adding a random offset (KASLR) while decompressing the kernel
  • Creating the kernel page tables and jumping to virtual addresses for execution

Later in the series, we will analyze what happens after the C code entry point.