Cache memory

The original meaning of cache is a RAM whose access speed is faster than that of ordinary random access memory (RAM). Generally it is built not with the DRAM technology used for the system's main memory but with the more expensive, faster SRAM technology, hence the name cache memory.
Cache memory consists of high-speed memory chips (SRAM) placed between main memory and the CPU. Its capacity is small, but its speed is much higher than that of main memory and approaches that of the CPU. In the hierarchical structure of a computer storage system, it is the high-speed, small-capacity memory between the central processor and main memory; together with main memory it forms the first storage level. The scheduling and transfer of information between cache memory and main memory is carried out automatically by hardware.
The most important technical indicator of a cache memory is its hit rate.
Chinese name: Cache memory
Foreign name: Cache
Discipline: Computer
Field: Hardware

Composition

Cache memory is made of high-speed memory chips (SRAM) placed between main memory and the CPU; its capacity is small, but its speed is much higher than that of main memory, close to the speed of the CPU.
It mainly consists of three parts:
Cache storage body: stores the instructions and data blocks transferred in from main memory.
Address translation part: maintains a directory table that realizes the conversion from main memory addresses to cache addresses.
Replacement part: when the cache is full, replaces data blocks according to a certain strategy and updates the address translation part accordingly. [1]

Working principle

[2] A cache memory usually consists of a high-speed memory, an associative memory, replacement logic circuits, and the corresponding control lines. In a computer system with cache memory, the address with which the central processor accesses main memory is divided into three fields: row number, column number, and intra-group address. Main memory is thus logically divided into rows; each row is divided into several storage-unit groups; each group contains several or dozens of words. The high-speed memory is correspondingly divided into rows and columns of storage-unit groups. Both have the same number of columns and the same group size, but the number of rows in the high-speed memory is much smaller than in main memory.
The associative memory is used for address association and has the same number of rows and columns of storage units as the high-speed memory. When a storage-unit group from some row of a column of main memory is loaded into an empty storage-unit group in the same column of the high-speed memory, the corresponding location in the associative memory records the row number that the loaded group occupies in main memory.
When the central processor accesses main memory, the hardware first automatically decodes the column-number field of the access address and compares all the row numbers stored in that column of the associative memory with the row-number field of the main memory address. If one matches, the main memory unit to be accessed is already in the high-speed memory; this is called a hit, and the hardware maps the main memory address into a high-speed memory address and performs the access there. If none matches, the unit is not in the high-speed memory; this is called a miss. The hardware then accesses main memory and automatically transfers the storage-unit group containing that unit into an empty storage-unit group in the same column of the high-speed memory, while storing the group's main memory row number into the corresponding location of the associative memory.
When a miss occurs and there is no empty position in the corresponding column of the high-speed memory, some group in that column must be evicted to make room for the incoming group; this is called replacement. The rule that determines which group to replace is the replacement algorithm. Common replacement algorithms are least recently used (LRU), first in first out (FIFO), and random (RAND); the replacement logic circuits perform this function. In addition, to keep main memory and the high-speed memory consistent, writes to main memory must be handled separately for hits and misses.
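To make this column-decode and row-compare flow concrete, the following is a minimal Python sketch of the lookup just described; the geometry, the class name, and the crude choice of which group to replace on a full column are illustrative assumptions rather than the actual hardware design.

```python
# Toy model of the row/column cache lookup described above.
# Geometry and the naive replacement choice are illustrative assumptions.
class ToyCache:
    def __init__(self, num_columns=4, rows_per_column=2):
        # Per column, the "associative memory": main-memory row numbers
        # currently held in that column (None marks an empty group).
        self.tags = [[None] * rows_per_column for _ in range(num_columns)]

    def access(self, row_number, column_number):
        slots = self.tags[column_number]        # decode the column field
        if row_number in slots:                 # compare all row numbers
            return "hit"
        if None in slots:                       # miss with an empty group: load it
            slots[slots.index(None)] = row_number
        else:                                   # miss, column full: replace a group
            slots[0] = row_number               # stand-in for LRU/FIFO/RAND
        return "miss"

cache = ToyCache()
for row, col in [(3, 1), (3, 1), (7, 1), (9, 1)]:
    print((row, col), cache.access(row, col))  # miss, hit, miss, miss(+replace)
```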

Storage hierarchy

Main memory - secondary storage hierarchy

The capacity of a computer's main storage is always too small compared with what programmers require, and the transfer of programs and data from secondary storage to main storage originally had to be arranged by the programmer. Programmers had to spend much energy and time dividing a large program into blocks in advance, determining where these blocks would reside in secondary storage and at which addresses they would be loaded into main memory, and also arranging ahead of time how and when the blocks would be transferred in and out at run time; hence there was a storage-space allocation problem. The formation and development of operating systems freed programmers as far as possible from managing address placement between main memory and auxiliary memory, and at the same time the "auxiliary hardware" supporting these functions took shape. Through the combination of software and hardware, main memory and auxiliary memory are unified into a whole, as shown in the figure: primary storage and secondary storage form a storage hierarchy, that is, a storage system. As a whole, its speed is close to that of main memory, its capacity is close to that of secondary memory, and its average price is close to that of the cheap, slow secondary storage. With the continuous development and improvement of this system, it gradually evolved into the widely used virtual storage system. In such a system the address code of a machine instruction lets the application programmer address the whole program uniformly, as if the programmer had the entire virtual memory space corresponding to the width of the address code. This space can be much larger than the actual main memory space, so the entire program can be stored in it. The instruction address code is called the virtual address (virtual memory address) or logical address, and the storage capacity corresponding to it is called the virtual storage capacity or virtual storage space. The address of the actual main memory is called the physical address or real (storage) address, and the storage capacity corresponding to it is called the main storage capacity, real memory capacity, or real (main) storage space.
Primary-secondary storage hierarchy

Cache - main memory storage hierarchy

When a virtual address is used to access main memory, the machine automatically converts it into a main memory real address through the auxiliary software and hardware and checks whether the contents corresponding to this address have already been loaded into main memory. If they are in main memory, the access proceeds; if not, they are first transferred from auxiliary memory into main memory by the auxiliary software and hardware, and then accessed. None of these operations needs to be arranged by the programmer; that is, they are transparent to the application programmer. The primary-secondary memory hierarchy resolves the contradiction between the demand for large storage capacity and the demand for low cost. In terms of speed, however, main memory and the CPU remain about an order of magnitude apart, and this gap clearly limits the potential of CPU speed. To bridge it, a single memory built with only one technology is not feasible; further improvement must come from the structure and organization of the computer system. Introducing a cache is an important way to solve the access-speed problem. A cache memory is placed between the CPU and main memory, forming a cache-main memory hierarchy, and the cache is required to keep up with the CPU in speed. The address mapping and scheduling between cache and main memory borrow from the technology of the primary-secondary storage hierarchy that appeared earlier; the difference is that, because the speed requirements are high, they are realized entirely by hardware rather than by a combination of software and hardware, as shown in the figure.
Cache - main memory hierarchy
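A back-of-the-envelope sketch, with assumed figures, of why the cache-main memory hierarchy approaches cache speed as a whole; the model that charges a miss the cache probe plus the main memory access is one common convention.

```python
# Average access time of a cache-main memory hierarchy.
# All numbers are assumed, illustrative values.
t_cache = 1.0     # cache access time in ns (assumed)
t_main = 10.0     # main memory access: about an order of magnitude slower
hit_rate = 0.95   # assumed cache hit rate

# On a hit only the cache is touched; on a miss the cache probe is wasted
# and main memory is accessed as well (one common accounting convention).
t_avg = hit_rate * t_cache + (1 - hit_rate) * (t_cache + t_main)
print(f"average access time: {t_avg:.2f} ns")   # 1.50 ns, close to cache speed
```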

Address mapping and transformation

Address mapping refers to the correspondence between the address of a datum in main memory and its address in the cache. The following are three address mapping methods.
1. Fully associative mapping
Address mapping rule: any block of main memory can be mapped to any block of the cache.
(1) Main memory and cache are divided into data blocks of the same size.
(2) A data block of main memory can be loaded into any space in the cache. If the number of cache blocks is Cb and the number of main memory blocks is Mb, there are Cb × Mb possible mapping relationships.
The directory table is stored in an associative memory and includes three parts: the block address of the data block in main memory, the cache block address where it is stored, and a valid bit (also called the load bit). Because the mapping is fully associative, the capacity of the directory table must equal the number of cache blocks.
Advantages: the hit rate is quite high, and the utilization of cache storage space is high.
Disadvantages: every access must be compared against the entire contents of the associative memory; the speed is low and the cost is high, so it is seldom used.
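A minimal sketch of the fully associative directory lookup described above; the directory contents and field layout are illustrative assumptions (real hardware compares all entries in parallel, whereas this loop compares them one by one).

```python
# Fully associative directory: (main memory block address,
# cache block address, valid bit). Entries are illustrative.
directory = [
    (0x12, 0, 1),
    (0x7F, 1, 1),
    (0x03, 2, 0),   # invalid entry: ignored even if the address matches
]

def lookup(main_block):
    for mem_blk, cache_blk, valid in directory:  # compare against every entry
        if valid and mem_blk == main_block:
            return cache_blk                     # hit: cache block address
    return None                                  # miss

print(lookup(0x7F))  # 1 (hit)
print(lookup(0x03))  # None (entry present but valid bit is 0)
```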
2. Direct mapping
Address mapping rule: a block of main memory can be mapped only to one specific block of the cache.
(1) Main memory and cache are divided into data blocks of the same size.
(2) The main memory capacity should be an integer multiple of the cache capacity; main memory is divided into areas, and the number of blocks in each area equals the total number of blocks in the cache.
(3) When a block from some area of main memory is stored in the cache, it can only be stored at the cache location with the same block number.
Data blocks with the same block number in the various areas of main memory are each loaded to the cache address with that block number, but at any moment the cache can hold the block from only one of those areas. Since the main memory and cache block numbers are identical, only the area code of the loaded block needs to be recorded when registering it in the directory; the block number and the intra-block address are the same in main memory and in the cache. The directory table is stored in a high-speed, small-capacity memory and includes two parts: the area code of the data block in main memory and a valid bit. The capacity of the directory table is the same as the number of cache blocks.
Advantages: the address mapping method is simple; on a data access only the area codes need to be checked for equality, so access is fast and the hardware is simple.
Disadvantages: replacement operations are frequent and the hit rate is low.
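A small sketch of the direct-mapping rule above: the cache block number is forced by the main memory block number, and only the area code would be kept in the directory table. The sizes are illustrative assumptions.

```python
# Direct mapping: each main memory block has exactly one possible cache slot.
CACHE_BLOCKS = 8   # blocks in the cache = blocks per main memory area (assumed)

def place(main_block):
    cache_block = main_block % CACHE_BLOCKS   # same block number within the area
    area_code = main_block // CACHE_BLOCKS    # recorded in the directory table
    return cache_block, area_code

for blk in (3, 11, 19):                       # block 3 of areas 0, 1 and 2
    print(blk, "->", place(blk))
# All three map to cache block 3, so they evict one another: this is why
# direct mapping replaces frequently and has a lower hit rate.
```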
3. Set-associative mapping
Set-associative mapping rules:
(1) Main memory and cache are divided into blocks of the same size.
(2) Main memory and cache are divided into sets (groups) of the same size.
(3) The main memory capacity is an integer multiple of the cache capacity; main memory is divided into areas the size of the cache, and the number of sets in each area is the same as the number of sets in the cache.
(4) When data from main memory is loaded into the cache, the set numbers in main memory and in the cache must agree; that is, a block from each area can only be stored in the cache set with the same set number, but within that set the block may be placed at any block address. In other words, direct mapping is used from a main memory set to a cache set, while fully associative mapping is used within the two corresponding sets.
The conversion from main memory address to cache address has two parts: the set address is accessed by address, as in direct mapping, while the block address within the set is accessed by content, as in fully associative mapping. Set-associative address translation is likewise implemented with associative memory.
Advantages: the probability of block collisions is relatively low, block utilization is greatly improved, and the block miss rate is significantly reduced.
Disadvantages: the implementation difficulty and cost are higher than those of direct mapping.
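The following sketch, with an assumed geometry, shows the two-level rule just described: the set is chosen by address as in direct mapping, and the block within the set is found by content as in fully associative mapping. The oldest-first eviction inside a set is only a placeholder.

```python
# Set-associative placement: direct mapping between sets,
# fully associative placement within a set. Geometry is assumed.
NUM_SETS, WAYS = 4, 2
sets = [[] for _ in range(NUM_SETS)]    # each set holds up to WAYS block tags

def access(main_block):
    s = main_block % NUM_SETS           # set number: selected by address
    tag = main_block // NUM_SETS
    if tag in sets[s]:                  # within the set: searched by content
        return "hit"
    if len(sets[s]) == WAYS:
        sets[s].pop(0)                  # evict the oldest tag (placeholder policy)
    sets[s].append(tag)
    return "miss"

for blk in (5, 13, 5, 21):
    print(blk, access(blk))
# Blocks 5 and 13 coexist in set 1; block 21 then evicts block 5's tag.
```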

Replacement policy

1. From the principle of program locality it can be seen that, while running, programs always use recently used instructions and data frequently. This provides the theoretical basis for replacement strategies. Weighing factors such as hit rate, implementation difficulty, and speed, the replacement strategies include the random method, the first-in first-out method, and the least recently used method, among others.
(1) Random method (RAND method)
The random method determines the block to be replaced at random: a random-number generator is provided, and the block to replace is chosen according to the random number it produces. This method is simple and easy to implement, but its hit rate is relatively low.
(2) First-in first-out (FIFO) method
The first-in first-out method selects the earliest loaded block for replacement. A block that was loaded first yet has been hit many times is likely to be replaced first, so the method does not conform to the locality rule. Its hit rate is better than that of the random method but still unsatisfactory. The first-in first-out method is easy to implement.
(3) Least recently used method (LRU method)
The LRU method is based on the usage of each block: it always selects the least recently used block for replacement. This method reflects the locality law of programs well. There are many ways to implement the LRU strategy; a software sketch follows.
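As one of the many possible implementations, here is a minimal software sketch of an LRU-replaced cache built on Python's OrderedDict; the capacity and the access trace are illustrative assumptions.

```python
# LRU replacement: on a hit, mark the block most recently used;
# on a miss with a full cache, evict the least recently used block.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # least recently used entry first

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # refresh recency on a hit
            return "hit"
        if len(self.blocks) == self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = True
        return "miss"

cache = LRUCache(capacity=2)
for blk in ("A", "B", "A", "C", "B"):
    print(blk, cache.access(blk))
# A: miss, B: miss, A: hit, C: miss (evicts B), B: miss (evicts A)
```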
2. In a multi-body parallel storage system, the I/O devices' requests to access main memory take priority over the CPU's, so the CPU can be forced to wait for the I/O devices' memory accesses, sometimes for several main memory cycles, which lowers the CPU's working efficiency. To avoid contention between the CPU and the I/O devices for memory access, a cache can be added, so that main memory can send the information the CPU will want to the cache in advance. Then, whenever main memory is exchanging data with an I/O device, the CPU can read the required information directly from the cache without waiting idly, and its efficiency is unaffected.
3. The algorithms proposed so far can be divided into the following three categories (the first category is the key one to master):
(1) Traditional replacement algorithms and their direct evolutions, represented by: ① the LRU (Least Recently Used) algorithm, which replaces the least recently used content out of the cache; ② the LFU (Least Frequently Used) algorithm, which replaces the content with the fewest accesses out of the cache; ③ an algorithm that, if everything in the cache was cached on the same day, replaces the largest document out of the cache, and otherwise replaces according to LRU; ④ FIFO (First In First Out), which follows the first-in first-out principle: if the cache is full, the item that entered the cache earliest is replaced.
(2) Replacement algorithms based on key features of the cached content, including: ① the Size replacement algorithm, which replaces the largest content out of the cache; ② the LRU-MIN replacement algorithm, which tries to minimize the number of documents replaced: letting S be the size of the document about to be cached, it replaces by LRU among the cached documents of size at least S; if there is no object of size at least S, it replaces by LRU among the documents of size at least S/2; ③ the LRU-Threshold replacement algorithm, which behaves like LRU except that documents whose size exceeds a certain threshold are never cached; ④ the Lowest Latency First replacement algorithm, which replaces the document with the lowest access latency out of the cache.
(3) Cost-based replacement algorithms, which use a cost function to evaluate the objects in the cache and finally determine the object to replace according to the resulting value. Representative algorithms are: ① the Hybrid algorithm, which assigns each object in the cache a utility function and replaces the object with the least utility out of the cache; ② the Lowest Relative Value algorithm, which replaces the object with the lowest utility value out of the cache; ③ the Least Normalized Cost Replacement (LCNR) algorithm, which uses an inference function of document access frequency, transmission time, and size to determine the document to replace; ④ the algorithm of Bolot et al., which uses a weighted inference function of document transmission time cost, size, and last access time to determine document replacement; ⑤ the Size-Adjusted LRU (SLRU) algorithm, which sorts cached objects by the ratio of cost to size and selects the object with the smallest ratio for replacement (see the sketch below).
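As a hedged illustration of the cost-to-size ranking behind SLRU as summarized in ⑤ above, the following sketch evicts the object that is cheapest to re-fetch per byte; the cost model and the object names are assumptions.

```python
# SLRU-style candidate selection: rank cached objects by cost/size and
# replace the one with the smallest ratio. Costs and sizes are invented.
cached = {
    # name: (cost to re-fetch, size in bytes)
    "a.html": (20.0, 4_000),
    "b.jpg":  (80.0, 64_000),
    "c.css":  (10.0, 1_000),
}

def evict_candidate(objects):
    return min(objects, key=lambda k: objects[k][0] / objects[k][1])

print(evict_candidate(cached))   # "b.jpg": smallest cost-per-byte ratio
```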

Function introduction

In the development of computer technology, the access speed of main memory has always been much slower than the operating speed of the central processor, so the processor's high-speed processing capacity cannot be brought fully into play and the working efficiency of the whole computer system suffers. There are many ways to ease the speed mismatch between the CPU and main memory, such as using multiple general registers or interleaved access across multiple memory banks; employing cache memory in the storage hierarchy is also one of the common methods. Many large and medium-sized computers, as well as recent minicomputers and microcomputers, use cache memories.
The capacity of a cache memory is generally only a small fraction (on the order of hundredths) of that of main memory, but its access speed can match that of the central processor. According to the principle of program locality, the cells adjacent to a main memory cell currently in use are very likely to be used soon. Therefore, when the central processor accesses a unit of main memory, the computer hardware automatically transfers the group of units containing it into the cache memory, and the main memory unit the processor is about to access is likely to be within the group just transferred in. The central processor can then access the cache memory directly. If, over the whole course of processing, most of the central processor's main memory accesses can be replaced by accesses to the cache memory, the processing speed of the computer system is significantly improved. [3]

Read hit rate

[4] When the CPU finds the data it needs in the cache, this is called a hit; when the data the CPU needs is not in the cache (called a miss), the CPU accesses main memory instead. Theoretically, in a CPU with a level-2 cache, the read hit rate of L1 is about 80%; that is, the useful data the CPU finds in the L1 cache is 80% of all the data, and the remaining 20% is read from the L2 cache. Because the data to be executed cannot be predicted exactly, the hit rate when reading L2 is also about 80%, so about 16% of the total data is read from L2. The remaining data must then be fetched from main memory, but this is a relatively small proportion. In some high-end CPUs one often hears of an L3 cache, which is designed for the data still missing after the L2 cache has been read; among CPUs with an L3 cache, only about 5% of data needs to be fetched from memory, which further improves CPU efficiency.
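The percentages quoted above compose as follows; this short check simply reproduces that arithmetic, with the two 80% figures taken as the theoretical values stated in the text.

```python
# Fraction of reads served by each level, given the quoted hit rates.
l1_rate = 0.80                        # L1 read hit rate (quoted figure)
l2_rate = 0.80                        # L2 hit rate on what L1 missed (quoted)

from_l1 = l1_rate                     # 80% of all data comes from L1
from_l2 = (1 - l1_rate) * l2_rate     # 20% * 80% = 16% comes from L2
from_mem = 1 - from_l1 - from_l2      # the remaining 4% comes from memory

print(f"L1: {from_l1:.0%}, L2: {from_l2:.0%}, memory: {from_mem:.0%}")
```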
To ensure a high hit rate during CPU accesses, the contents of the cache must be replaced according to a certain algorithm. A common one is the "least recently used" algorithm (the LRU algorithm), which evicts the rows that have been accessed least in the recent period. It therefore requires a counter for each row: the LRU algorithm clears the counter of a row when it is hit and adds 1 to the counters of all other rows; when replacement is needed, the data row with the largest counter value is evicted. This is an efficient and scientifically grounded algorithm: its counter-clearing process removes from the cache data that is no longer needed after a period of frequent use, improving the cache's utilization. A sketch of the counter scheme follows.
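A minimal sketch of that counter scheme, assuming a tiny number of rows; the trace and capacity are illustrative.

```python
# Counter-based LRU: clear the hit row's counter, increment all the others,
# and evict the row with the largest counter when space is needed.
rows = {}                                  # row tag -> counter value

def access(tag, capacity=3):
    if tag in rows:
        for t in rows:
            rows[t] += 1                   # every other row ages by 1
        rows[tag] = 0                      # the hit row's counter is cleared
        return "hit"
    if len(rows) == capacity:
        victim = max(rows, key=rows.get)   # largest count = least recently used
        del rows[victim]
    for t in rows:
        rows[t] += 1
    rows[tag] = 0                          # the new row starts at 0
    return "miss"

for tag in ("x", "y", "x", "z", "w"):
    print(tag, access(tag), rows)
```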
The cache replacement algorithm affects the hit rate. When a new main memory block must be loaded into the cache and all of its permitted locations are already occupied, cached data has to be replaced, which raises the problem of the replacement strategy (algorithm). From the principle of program locality it can be seen that programs always use recently used instructions and data frequently while running; this provides the theoretical basis for replacement strategies. The goal of a replacement algorithm is to make the cache hit rate as high as possible. The replacement algorithm also affects the performance of proxy cache systems; a good cache replacement algorithm yields a high hit rate. Common algorithms are as follows:
(1) Random method (RAND method). The random replacement algorithm uses a random-number generator to produce the number of the block to be replaced and replaces that block. The algorithm is simple and easy to implement, but it considers neither the past and present nor the future use of the cache blocks; it makes no use of "historical information" and is not based on the principle of locality of access, so it cannot improve the cache hit rate, and the hit rate is low.
(2) First In First Out (FIFO) algorithm. It replaces the information block that entered the cache earliest. The FIFO algorithm determines the eviction order by the order of arrival in the cache, selecting the word block loaded first for replacement; it does not need to record the usage of each word block, is relatively easy to implement, and has low system overhead. Its disadvantage is that program blocks that need to be used frequently (such as loop code) may be replaced merely because they entered the cache first; information loaded earliest may well be used again later, or even often, as in a loop. The method is simple and convenient and uses the "historical information" of arrival order, but entering first does not mean a block will not be used frequently. Because it does not follow the locality principle of memory access, it cannot correctly reflect the principle of program locality or improve the cache hit rate; the hit rate is not high, and anomalies may occur.
(3) LRU (Least Recently Used) algorithm. This method replaces the cache information blocks that have gone unused for the longest recent period, and it performs better than the first-in first-out algorithm; however, it cannot guarantee that what was not used in the past will not be used in the future. The LRU method is based on the usage of each block and always selects the least recently used block for replacement. Although it reflects the law of program locality well, this replacement method must record the usage of every block in the cache at all times in order to determine which block was least recently used. The LRU algorithm is relatively reasonable, but it is complex to implement and has a high system overhead: a hardware or software module, usually called a counter, has to be provided for each block to record its use.