
Linux kernel memory management - from the perspective of kernel startup process

This is the third article in the <Linux kernel memory management> series.

Part I: a brief overview of the kernel memory management process

Part II: an introduction to the kernel's core data structures

Preface

Taking an Intel x64 CPU as an example, Linux initialization can be roughly divided into the following stages:

  1. Real mode, after the boot loader jumps into the kernel
  2. The jump from 32-bit protected mode to 64-bit long mode
  3. Decompressing the kernel in 64-bit long mode
  4. After decompression, creating new page-table mappings and jumping to the arch (platform) specific C code
  5. Executing the platform-independent initialization code

Memory management plays an important role in all of the above: it covers memory layout planning, segmentation, page-table setup, kernel relocation, and so on.

This article uses QEMU emulation and Linux v5.13.9 to walk through the memory management in each of the above stages in order.

Real Mode

Start the compiled 64-bit kernel with the following command:

 qemu-system-x86_64 -kernel arch/x86/boot/bzImage -nographic  -append "console=ttyS0 nokaslr"  -s -S

In this command:

  • The kernel parameters "console=ttyS0 nokaslr" specify the kernel console and disable KASLR (with KASLR enabled, the kernel decompression address is randomized on every boot, which is inconvenient for debugging).
  • The -s and -S options are for debugging QEMU with GDB (-s starts a gdbserver, -S halts the CPU until a debugger attaches and continues).

After executing the above command, the kernel address layout is as shown in the figure below.
 Real mode memory distribution

According to the kernel documentation (Linux/x86 Boot Protocol), any boot loader (GRUB, LILO, ...) that loads an x86 kernel must comply with this protocol; as of this writing the protocol version is 2.15. In the figure, X is the offset at which the boot loader loads the kernel; on the QEMU platform this offset is 0x10000. After loading, the kernel boot sector begins executing at the entry point _start; see the linker script arch/x86/boot/setup.ld:

 OUTPUT_FORMAT("elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)

SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
.bsdata : { *(.bsdata) }
....

From there it jumps to start_of_setup and continues execution:

#arch/x86/boot/header.S
	.globl	_start
_start:
	.byte	0xeb		# short (2-byte) jump
	.byte	start_of_setup-1f

	.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
	movw	%ds, %ax
	movw	%ax, %es
	cld

	movw	%ss, %dx
	cmpw	%ax, %dx	# %ds == %ss?
	movw	%sp, %dx
	je	2f		# -> assume %sp is reasonably set

	# Invalid %ss, make up a new stack
	movw	$_end, %dx
	testb	$CAN_USE_HEAP, loadflags
	jz	1f
	movw	heap_end_ptr, %dx
1:	addw	$STACK_SIZE, %dx
	jnc	2f
	xorw	%dx, %dx	# Prevent wraparound

2:	# Now %dx should point to the end of our stack space
	andw	$~3, %dx	# dword align (might as well...)
	jnz	3f
	movw	$0xfffc, %dx	# Make sure we're not zero
3:	movw	%ax, %ss
	movzwl	%dx, %esp	# Clear upper half of %esp
	sti			# Now we should have a working stack

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
	pushw	%ds
	pushw	$6f
	lretw
6:
# Check signature at end of setup
	cmpl	$0x5a5aaa55, setup_sig
	jne	setup_bad

# Zero the bss
	movw	$__bss_start, %di
	movw	$_end+3, %cx
	xorl	%eax, %eax
	subw	%di, %cx
	shrw	$2, %cx
	rep; stosl

# Jump to C code (should not return)
	calll	main

The code above clears the direction flag for the real-mode code and sets up the stack (and, when CAN_USE_HEAP is set, the heap) that the C code will use. It then jumps to label 6 to verify that the kernel setup code was loaded correctly. A note on lretw: together with the two preceding instructions it performs a far return; the two pushw instructions place the return segment and offset on the stack (see the <Intel 64 and IA-32 Architectures Software Developer's Manual>). As the comment says, the point of using lret here is to reload the CS register so that it matches the other segment registers. The Intel manual describes the far-return behaviour of the ret instruction as follows:

When executing a far return, the processor pops the return instruction pointer from the top of the stack into the EIP
register, then pops the segment selector from the top of the stack into the CS register. The processor then begins
program execution in the new code segment at the new instruction pointer.

Next the code zeroes the BSS section and then calls into the C function main.
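For intuition, here is a user-space C sketch of what the rep; stosl loop does: zero [__bss_start, _end) one dword at a time, with the byte count rounded up to whole dwords (the $_end+3 / shrw $2 arithmetic). The fake_bss array merely stands in for the linker-provided BSS region; this is an illustration, not kernel code.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustration only (not kernel code): the setup stub zeroes
 * [__bss_start, _end) one 32-bit word at a time, rounding the byte
 * count up to whole dwords ("$_end+3" then "shrw $2"). fake_bss
 * stands in for the real, linker-provided BSS region. */
static uint32_t fake_bss[32];	/* pretend BSS, dword-aligned */

static void zero_bss(uint8_t *start, uint8_t *end)
{
	size_t dwords = ((size_t)(end - start) + 3) >> 2;	/* round up */
	memset(start, 0, dwords * 4);				/* rep; stosl equivalent */
}

int main(void)
{
	zero_bss((uint8_t *)fake_bss, (uint8_t *)(fake_bss + 32));
	return 0;
}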

/* arch/x86/boot/main.c: main() (excerpt) */
	/* First, copy the boot header into the "zeropage" */
	copy_boot_params();
	console_init();
	if (cmdline_find_option_bool("debug"))
		puts("early console in setup code\n");
	init_heap();
	if (validate_cpu()) {
		puts("Unable to boot - please use a kernel appropriate "
		     "for your CPU.\n");
		die();
	}
	set_bios_mode();
	detect_memory();
	keyboard_init();
	query_ist();
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
	query_apm_bios();
#endif
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
	query_edd();
#endif
	set_video();
	go_to_protected_mode();

The comments in main are fairly clear, so here we only discuss copy_boot_params, detect_memory and go_to_protected_mode:

  • copy_boot_params copies the boot_params information in memory (see the figure "Real mode memory distribution") into the global variable boot_params. boot_params holds the parameters defined by the Linux boot protocol: some fields are filled in at build time, and the remaining fields are filled in by the boot loader. boot_params, including the kernel cmdline, threads through every sub-stage of kernel initialization.
  • detect_memory mainly uses the BIOS e820 interface to obtain the basic memory layout and store it in the designated boot_params fields (boot_params.e820_table and boot_params.e820_entries); see the sketch after this list.
  • go_to_protected_mode enables the A20 gate (address line 20), configures the GDT/IDT tables, masks interrupts, turns on protected mode, and jumps to the 32-bit code; its code is shown after the sketch below.
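To make the detect_memory bullet concrete, here is the E820 query loop, abridged from detect_memory_e820() in arch/x86/boot/memory.c (the error handling and the older e801/88h fallbacks are omitted):

/* Abridged from arch/x86/boot/memory.c: each BIOS int 0x15/eax=0xE820
 * call returns one memory range; the loop copies the ranges into
 * boot_params.e820_table and counts them. */
static void detect_memory_e820(void)
{
	int count = 0;
	struct biosregs ireg, oreg;
	struct boot_e820_entry *desc = boot_params.e820_table;
	static struct boot_e820_entry buf;	/* static so it is zeroed */

	initregs(&ireg);
	ireg.ax  = 0xe820;
	ireg.cx  = sizeof(buf);
	ireg.edx = SMAP;			/* ASCII "SMAP" signature */
	ireg.di  = (size_t)&buf;

	do {
		intcall(0x15, &ireg, &oreg);	/* one BIOS call per entry */
		ireg.ebx = oreg.ebx;		/* continuation value for the next call */

		if (oreg.eflags & X86_EFLAGS_CF || oreg.eax != SMAP)
			break;			/* end of chain or bad signature */

		*desc++ = buf;
		count++;
	} while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_table));

	boot_params.e820_entries = count;
}

The go_to_protected_mode code follows: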
//arch/x86/boot/pm.c
void go_to_protected_mode(void)
{
	realmode_switch_hook();

	/* Enable the A20 gate */
	if (enable_a20()) {
		puts("A20 gate not responding, unable to boot...\n");
		die();
	}

	reset_coprocessor();

	mask_all_interrupts();

	setup_idt();
	setup_gdt();
	protected_mode_jump(boot_params.hdr.code32_start,
			    (u32)&boot_params + (ds() << 4));
}

protected_mode_jump is an assembly routine defined in arch/x86/boot/pmjump.S. We will not analyze it in depth here; it mainly sets the PE (Protection Enable) bit of the CR0 register and executes a far jump into the 32-bit code (the .Lin_pm32 label).

#arch/x86/boot/pmjump.S
/*
 * void protected_mode_jump(u32 entrypoint, u32 bootparams);
 */
SYM_FUNC_START_NOALIGN(protected_mode_jump)
	........

	movl	%cr0, %edx
	orb	$X86_CR0_PE, %dl	# Protected mode
	movl	%edx, %cr0

	# Transition to 32-bit mode
	.byte	0x66, 0xea		# ljmpl opcode
2:	.long	.Lin_pm32		# offset
	.word	__BOOT_CS		# segment
SYM_FUNC_END(protected_mode_jump)

SYM_FUNC_START_LOCAL_NOALIGN(.Lin_pm32)
	# Set up data segments for flat 32-bit mode
	movl	%ecx, %ds
	movl	%ecx, %es
	movl	%ecx, %fs
	movl	%ecx, %gs
	movl	%ecx, %ss
	# The 32-bit code sets up its own stack, but this way we do have
	# a valid stack if some debugging hack wants to use it.
	addl	%ebx, %esp

	# Set up TR to make Intel VT happy
	ltr	%di

	# Clear registers to allow for future extensions to the
	# 32-bit boot protocol
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ebp, %ebp
	xorl	%edi, %edi

	# Set up LDTR to make Intel VT happy
	lldt	%cx

	jmpl	*%eax			# Jump to the 32-bit entrypoint
SYM_FUNC_END(.Lin_pm32)

The 32-bit code begins by reloading each data segment register with __BOOT_DS. A segment register holds a segment selector pointing at an entry in the GDT, and __BOOT_DS selects the boot data-segment descriptor in that GDT. The GDT at this point is defined in arch/x86/boot/pm.c; roughly speaking it describes segments with base 0 and a 4GB limit, which is more than enough to cover the region where the kernel's 32-bit initialization code executes. For GDT and segment-selector background, see <Intel 64 and IA-32 Architectures Software Developer's Manual> Volume 3, Chapter 3, "Protected-Mode Memory Management". After some register cleanup, the code jumps to the starting address of the 32-bit kernel.
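For reference, the boot GDT built by setup_gdt() in arch/x86/boot/pm.c looks roughly like this (abridged); GDT_ENTRY(flags, base, limit) packs one descriptor, and a 0xfffff limit with 4KB granularity yields the flat 4GB segments mentioned above:

/* arch/x86/boot/pm.c, setup_gdt() (abridged): flat 4GB code and data
 * segments with base 0, plus a small TSS to keep Intel VT happy. */
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
	/* CS: code, read/execute, 4GB, base 0 */
	[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
	/* DS: data, read/write, 4GB, base 0 */
	[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
	/* TSS: 32-bit tss, 104 bytes, base 4096 */
	[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};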

This starting address is the first parameter of protected_mode_jump, boot_params.hdr.code32_start; in our QEMU environment its value is 0x100000.

Why is it passed in the %eax register? The boot code under arch/x86/boot is compiled with GCC's regparm(3) calling convention, so the first three function arguments travel in %eax, %edx and %ecx instead of on the stack. (ABI, the Application Binary Interface, defines such calling conventions per architecture; the 64-bit kernel proper follows the System V AMD64 ABI.)
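A minimal user-space sketch of that convention, assuming a 32-bit GCC build (the function name and values are made up for illustration):

/* Illustration only: with regparm(3), the first three arguments are
 * passed in %eax, %edx and %ecx instead of on the stack.
 * Build with: gcc -m32 -O2 regparm_demo.c */
#include <stdio.h>

static int __attribute__((regparm(3)))
add3(int a, int b, int c)	/* a in %eax, b in %edx, c in %ecx */
{
	return a + b + c;
}

int main(void)
{
	printf("%d\n", add3(1, 2, 3));	/* prints 6 */
	return 0;
}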

Jump from 32-bit protected mode to 64-bit long mode

startup_32

The 32-bit code is laid out starting at address 0x100000. For the specific layout, refer to the linker script vmlinux.lds (arch/x86/boot/compressed/vmlinux.lds.S).

A linker script tells the linker how to combine object files into the output. Normally we do not pass a linker script to GCC, because the toolchain provides a default one.

 #ifdef CONFIG_X86_64
OUTPUT_ARCH(i386:x86-64)
ENTRY(startup_64)
#else
OUTPUT_ARCH(i386)
ENTRY(startup_32)
#endif

SECTIONS
{
/* Be careful parts of head_64.S assume startup_32 is at
* address 0.
*/
. = 0;
.head.text : {
_head = . ;
HEAD_TEXT
_ehead = . ;
}
.rodata..compressed : {
*(.rodata..compressed)
}
.text : {
_text = .; /* Text */
*(.text)
*(.text.*)
_etext = . ;
}
.rodata : {
_rodata = . ;
*(.rodata) /* read-only data */
*(.rodata.*)
_erodata = . ;
}
.data : {
_data = . ;
*(.data)
*(.data.*)
*(.bss.efistub)
_edata = . ;
}
. = ALIGN(L1_CACHE_BYTES);
.bss : {
_bss = . ;
*(.bss)
*(.bss.*)
*(COMMON)
. = ALIGN(8); /* For convenience during zeroing */
_ebss = .;
}
#ifdef CONFIG_X86_64
. = ALIGN(PAGE_SIZE);
.pgtable : {
_pgtable = . ;
*(.pgtable)
_epgtable = . ;
}
#endif
. = ALIGN(PAGE_SIZE); /* keep ZO size page aligned */
_end = .;

After ld linking and QEMU loading, we get the memory layout shown on the left of the figure below. Starting from address 0x100000 come the 32-bit protected-mode entry and setup code, then the compressed kernel payload, and after it the text, read-only data, data, BSS and page-table (.pgtable) sections of the decompression code.
 Memory distribution in 32-bit

From the linker script we can see that the 32-bit entry point is startup_32. The code first disables interrupts, loads a new GDT, reloads the segment registers, and sets up a stack.

Note that the code defines a macro rva, which computes a symbol's offset relative to startup_32, so the same code works no matter where the kernel image has been loaded.
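A toy user-space illustration of the rva idea follows; the addresses are made up, and in the real code the runtime load base ends up in %ebp via the call/pop trick at the top of startup_32:

#include <stdio.h>

/* Toy model of rva(): symbols are linked relative to startup_32 (link
 * address 0, per vmlinux.lds), so at run time the address of any symbol
 * is "load base + (symbol - startup_32)". All values here are made up. */
int main(void)
{
	unsigned long startup_32_link = 0x0;		/* link-time address of startup_32 */
	unsigned long gdt_link        = 0x1f0;		/* hypothetical link-time address of gdt */
	unsigned long load_base       = 0x100000;	/* where the image was actually loaded */

	unsigned long rva_gdt = gdt_link - startup_32_link;	/* what rva(gdt) expands to */
	printf("runtime address of gdt = %#lx\n", load_base + rva_gdt);
	return 0;
}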

#arch/x86/boot/compressed/head_64.S
#define rva(X) ((X) - startup_32)

	.code32
SYM_FUNC_START(startup_32)
	cld
	cli

	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:	popl	%ebp
	subl	$ rva(1b), %ebp

	leal	rva(gdt)(%ebp), %eax
	movl	%eax, 2(%eax)
	lgdt	(%eax)

	/* Load segment registers with our descriptors */
	movl	$__BOOT_DS, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %fs
	movl	%eax, %gs
	movl	%eax, %ss

	leal	rva(boot_stack_end)(%ebp), %esp

	pushl	$__KERNEL32_CS
	leal	rva(1f)(%ebp), %eax
	pushl	%eax
	lretl
1:
	call	startup32_load_idt
	call	verify_cpu
	testl	%eax, %eax
	jnz	.Lno_longmode

#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	......
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jae	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
1:

	addl	BP_init_size(%esi), %ebx
	subl	$ rva(_end), %ebx

	/* Enable PAE mode */
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4

After the IDT is loaded and the CPU is verified, the code computes in %ebx the address at which the compressed kernel will be placed for in-place decompression, and then enables PAE. In the code above, BP_kernel_alignment(%esi) fetches the corresponding field from the boot_params area. Consulting the Linux/x86 Boot Protocol again, these fields are described as follows:

Offset/size   Field              Description
0230/4        kernel_alignment   Physical address alignment required for the kernel
0260/4        init_size          Linear memory required during initialization
01E4/4        scratch            Scratch field for the kernel setup code

init_size records the amount of linear memory needed for kernel initialization and decompression; it exists so that enough space is reserved for decompressing the kernel in place. How this size is calculated is described in arch/x86/boot/header.S (I have not fully worked through it yet; to be supplemented).
Next, the kernel builds a page table that identity-maps the first 4GB of memory with 2MB pages (see the right-hand side of the figure above), loads the page-table root (pgtable) into the CR3 register, and enables 64-bit long mode. Quoting the Wikipedia description of long mode:

In long mode, 64-bit applications (or operating systems) can use 64-bit instructions and registers, while 32-bit programs run in a compatibility sub-mode.

4GB is large enough for kernel decompression and the related work. The kernel then pushes the 64-bit address of startup_64 onto the stack, enables paging, and executes lret to jump to startup_64.
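To make the page-table construction concrete, here is a user-space C sketch of the layout the 32-bit code builds inside the .pgtable area: one top-level entry, four PDPT entries, and 4 x 512 PMD entries that identity-map 0-4GB with 2MB pages. The flag values mirror those used in head_64.S, but this is an illustration, not the kernel code:

#include <stdint.h>
#include <string.h>

#define PMD_SIZE	(2ULL << 20)	/* 2MB large pages */
#define DIR_FLAGS	0x7ULL		/* present + write + user */
#define LARGE_FLAGS	0x183ULL	/* present + write + large page (PS) + global */

/* Stand-in for the 6-page .pgtable area reserved by the linker script:
 * page 0 = top level (PML4), page 1 = PDPT, pages 2-5 = four PMDs. */
static uint64_t pgtable[6 * 512];

static void build_identity_map_4g(void)
{
	uint64_t *pml4 = &pgtable[0 * 512];
	uint64_t *pdpt = &pgtable[1 * 512];
	uint64_t *pmd  = &pgtable[2 * 512];
	int i;

	memset(pgtable, 0, sizeof(pgtable));

	/* One top-level entry pointing at the PDPT */
	pml4[0] = (uint64_t)(uintptr_t)pdpt | DIR_FLAGS;

	/* Four PDPT entries, each covering 1GB via one PMD page */
	for (i = 0; i < 4; i++)
		pdpt[i] = (uint64_t)(uintptr_t)&pmd[i * 512] | DIR_FLAGS;

	/* 4 * 512 PMD entries, each mapping one 2MB page: 4GB in total */
	for (i = 0; i < 4 * 512; i++)
		pmd[i] = (uint64_t)i * PMD_SIZE | LARGE_FLAGS;

	/* The real code then loads the top-level table's address into CR3. */
}

int main(void) { build_identity_map_4g(); return 0; }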

We skip the SEV checks here; SEV is an AMD CPU feature and is not analyzed in this article.

startup_64

startup_64 also begins by disabling interrupts and clearing the segment registers. It again calculates the address to which the compressed kernel will be moved, namely LOAD_PHYSICAL_ADDR + init_size - rva(_end) (the size of the loaded image). This mirrors what startup_32 did.

You may wonder why this calculation is repeated when startup_32 already did it. As the code comments explain, a 64-bit boot loader may load the kernel and enter directly at startup_64, skipping startup_32 entirely.

The kernel then loads an empty IDT, checks whether 5-level paging needs to be enabled and handles it accordingly. After clearing EFLAGS, it copies the compressed kernel to the in-place decompression location (LOAD_PHYSICAL_ADDR + init_size - image length), reloads the relocated GDT, and jumps to the relocated .Lrelocated label to continue execution.

.Lrelocated

The code at .Lrelocated does three main things:

  • Load the IDT: at this point the IDT only installs a page-fault handler, boot_page_fault (see arch/x86/boot/compressed/ident_map_64.c). After some basic checks, its main job is to build an identity mapping for the faulting address; a simplified sketch follows this list.
  • Create identity mappings: it identity-maps [_head, _end], boot_params and the boot cmdline.
  • Decompress the kernel: decompression itself is not analyzed in this article. Note that if KASLR is enabled, a random offset is computed before decompression to produce the kernel's actual decompression address.
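A simplified sketch of the boot-time page-fault handler mentioned above, abridged from arch/x86/boot/compressed/ident_map_64.c (the SEV-ES handling and error-code sanity checks are elided):

/* Abridged: on a page fault during decompression, identity-map the 2MB
 * region around the faulting address and resume, so the faulting access
 * is retried and succeeds. */
void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = native_read_cr2();	/* faulting address */
	unsigned long end;

	address &= PMD_MASK;		/* round down to a 2MB boundary */
	end      = address + PMD_SIZE;

	/* ... error-code and SEV-ES sanity checks elided ... */

	/* Identity map (virtual address == physical address) the region */
	add_identity_map(address, end);
}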

After decompression, execution jumps to the entry of the decompressed kernel, the startup_64 label in arch/x86/kernel/head_64.S.

After kernel decompression

The code of startup_64 is as follows:

SYM_CODE_START_NOALIGN(startup_64)
	UNWIND_HINT_EMPTY
	leaq	(__end_init_task - SIZEOF_PTREGS)(%rip), %rsp

	leaq	_text(%rip), %rdi
	pushq	%rsi
	call	startup_64_setup_env
	popq	%rsi

	pushq	$__KERNEL_CS
	leaq	.Lon_kernel_cs(%rip), %rax
	pushq	%rax
	lretq

.Lon_kernel_cs:
	UNWIND_HINT_EMPTY

	/* Sanitize CPU configuration */
	call	verify_cpu

	leaq	_text(%rip), %rdi
	pushq	%rsi
	call	__startup_64
	popq	%rsi

	addq	$(early_top_pgt - __START_KERNEL_map), %rax
	jmp	1f
SYM_CODE_END(startup_64)

After setting up the stack, the code above calls startup_64_setup_env to configure the startup GDT and IDT. The startup GDT contents are as follows:

static struct desc_struct startup_gdt[GDT_ENTRIES] = {
	[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]   = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
};

The segment descriptors in the startup GDT all describe 4GB segments starting at address 0. The startup IDT (also called the bringup IDT) mainly handles the VMM communication exception used by AMD's SEV-ES, i.e. it is virtualization related.
The kernel then continues to verify_cpu, an assembly routine defined in verify_cpu.S that mainly uses the cpuid instruction to check that the CPU supports long mode and the SSE instruction set.
After the check, the kernel calls __startup_64, which rebuilds the kernel's early 4-level (or 5-level) page tables. At this point the random offset introduced by KASLR must be taken into account, which is why the function repeatedly calls fixup_pointer to correct pointers and page-table entries; see the sketch below.
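fixup_pointer itself is a small helper in arch/x86/kernel/head64.c; it rebases a link-time pointer to where the kernel physically sits right now (shown here roughly as in v5.13):

/* arch/x86/kernel/head64.c (roughly): 'ptr' is the link-time address of
 * a kernel symbol, 'physaddr' is where _text actually resides in
 * physical memory; the difference rebases the pointer so it can be
 * dereferenced before the final page tables are in place. */
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
	return ptr - (void *)_text + (void *)physaddr;
}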
The page tables themselves are defined in head_64.S, as follows:

SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
	.fill	512,8,0
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)

SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
SYM_DATA_END(early_dynamic_pgts)

SYM_DATA(early_recursion_flag, .long 0)

	.data

#if defined(CONFIG_XEN_PV) || defined(CONFIG_PVH)
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.org	init_top_pgt + L4_PAGE_OFFSET*8, 0
	.quad	level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.org	init_top_pgt + L4_START_KERNEL*8, 0
	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)

SYM_DATA_START_PAGE_ALIGNED(level3_ident_pgt)
	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.fill	511, 8, 0
SYM_DATA_END(level3_ident_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_ident_pgt)
	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
SYM_DATA_END(level2_ident_pgt)
#else
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
	.fill	512,8,0
	.fill	PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)
#endif

#ifdef CONFIG_X86_5LEVEL
SYM_DATA_START_PAGE_ALIGNED(level4_kernel_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level4_kernel_pgt)
#endif

SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level3_kernel_pgt)

SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
SYM_DATA_END(level2_kernel_pgt)

SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
	.fill	(512 - 4 - FIXMAP_PMD_NUM),8,0
	pgtno = 0
	.rept (FIXMAP_PMD_NUM)
	.quad level1_fixmap_pgt + (pgtno << PAGE_SHIFT) - __START_KERNEL_map \
		+ _PAGE_TABLE_NOENC;
	pgtno = pgtno + 1
	.endr
	/* 6 MB reserved space + a 2MB hole */
	.fill	4,8,0
SYM_DATA_END(level2_fixmap_pgt)

SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
	.rept (FIXMAP_PMD_NUM)
	.fill	512,8,0
	.endr
SYM_DATA_END(level1_fixmap_pgt)

This is hard to read directly, so the figure below translates it into a picture:
 Kernel early page table

These tables establish an early mapping for the kernel code so that it can run. (Not entirely trouble-free: as we will see later, the kernel still registers an IDT entry to handle page faults.)

	/* Switch to new page-table */
	movq	%rax, %cr3

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax
	ANNOTATE_RETPOLINE_SAFE
	jmp	*%rax

After __startup_64 returns, we skip over some SEV handling and switch to the new kernel page tables. The code then jumps to virtual addresses starting at __START_KERNEL_map, reinitializes the GDT, sets the segment registers, sets up a stack for the initialization work, and installs the IDT. In the middle there is this piece of code:

	/* Set up %gs.
	 *
	 * The base of %gs always points to fixed_percpu_data. If the
	 * stack protector canary is enabled, it is located at %gs:40.
	 * Note that, on SMP, the boot cpu uses init data section until
	 * the per cpu areas are set up.
	 */
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
	..................
	pushq	$.Lafter_lret	# put return address on stack for unwinder
	xorl	%ebp, %ebp	# clear frame pointer
	movq	initial_code(%rip), %rax
	pushq	$__KERNEL_CS	# set correct cs
	pushq	%rax		# target address in negative space
	lretq

......
SYM_DATA(initial_code, .quad x86_64_start_kernel)

This stores the base address of the per-CPU data area into the 64-bit GS base model-specific register (MSR_GS_BASE), which multiprocessor systems use for per-CPU variables. The code then jumps to the C initialization entry point, x86_64_start_kernel.
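The wrmsr instruction takes the MSR index in %ecx and the 64-bit value split across %edx:%eax, which is why the snippet loads initial_gs in two 32-bit halves. A small user-space sketch of that split (the address value is made up; MSR_GS_BASE is 0xC0000101):

#include <stdint.h>
#include <stdio.h>

/* Illustration only: wrmsr expects the 64-bit value split into
 * %eax (low 32 bits) and %edx (high 32 bits), with the MSR number
 * (here MSR_GS_BASE = 0xC0000101) in %ecx. */
int main(void)
{
	uint64_t initial_gs = 0xffffffff82a14000ULL;	/* hypothetical fixed_percpu_data address */
	uint32_t eax = (uint32_t)(initial_gs & 0xffffffffULL);	/* low half  -> %eax */
	uint32_t edx = (uint32_t)(initial_gs >> 32);		/* high half -> %edx */

	printf("ecx=0xC0000101 eax=%#x edx=%#x\n", eax, edx);
	return 0;
}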

Summary

This article focused on memory management from the moment the kernel is loaded by the boot loader up to the C code entry point. The main steps are:

  • Turning on protected mode
  • Turning on long mode
  • Adding a random offset (KASLR) while decompressing the kernel
  • Creating the kernel page tables and jumping to virtual addresses for execution

Later in the series, we will analyze what happens after the C code entry point.