Troubleshooting an online JVM memory overflow caused by the FileSystem class


Author: Ye Jidong from vivo Internet Big Data Team

This article walks through the full process of analyzing and fixing an online memory overflow that was caused by a memory leak in the use of Hadoop's FileSystem class.

Memory leak: an object or variable that the program no longer uses still occupies memory, and the JVM cannot reclaim it. A single leak may look harmless, but leaks accumulate, and the end result of accumulated leaks is a memory overflow.

Memory overflow: an error that prevents the program from continuing because the allocated memory space is insufficient or misused at runtime. The JVM reports an OOM (OutOfMemoryError), which is what we call a memory overflow.
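As a minimal illustration (not from this incident), a static collection that only ever grows behaves exactly like the leak analyzed below: every entry stays reachable from a GC root, so the collector can never reclaim it.

    import java.util.HashMap;
    import java.util.Map;

    public class LeakDemo {
        // Reachable from a GC root for the lifetime of the class, so never collected.
        private static final Map<Object, byte[]> CACHE = new HashMap<>();

        public static void handleRequest() {
            // A brand-new key on every call means entries are never reused or
            // evicted; the old generation fills up until OutOfMemoryError.
            CACHE.put(new Object(), new byte[1024 * 1024]);
        }
    }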

1. Background

Over the weekend, while Xiaoye was in the middle of a match of Honor of Kings, his phone suddenly received a flood of machine CPU alarms (triggered whenever CPU utilization exceeds 80%), followed by Full GC alarms for a service that is critical to his project team. Xiaoye put the game down and opened his computer to investigate.

[Figure 1.1: CPU alarm and Full GC alarm]

2. Problem discovery

2.1 Monitoring view

Since both CPU and Full GC alarms had fired, we opened the service monitoring to check the CPU and Full GC charts. Both showed an abnormal spike at the same point in time: when the CPU alarm fired, Full GC was running especially frequently, so we suspected that the frequent Full GC was what drove CPU utilization up and triggered the alarm.

[Figure 2.1: CPU usage]

[Figure 2.2: Full GC count]

2.2 Memory leak

The frequent Full GC made it clear that the service's memory reclamation had a problem, so we checked the monitoring of the service's heap memory, old-generation memory, and young-generation memory. The old-generation chart showed resident memory growing steadily: objects in the old generation could not be reclaimed, until eventually the old generation was completely occupied. This is a clear sign of a memory leak.

[Figure 2.3: Old-generation memory]

[Figure 2.4: JVM memory]

2.3 Memory overflow

The online error log clearly showed that the service eventually ran out of memory. So the root cause chain was: a memory leak led to a memory overflow (OOM), which finally made the service unavailable.

[Figure 2.5: OOM log]

3. Troubleshooting

3.1 Heap memory analysis

After identifying the cause as a memory leak, we immediately dumped a snapshot of the service's heap memory and imported the dump file into MAT (Eclipse Memory Analyzer) for analysis, opening the Leak Suspects view to see the suspected leak points.

[Figure 3.1: Memory object analysis]

[Figure 3.2: Object link diagram]

 

Opening the dump file shows the view in Figure 3.1: of the 2.3 GB heap, org.apache.hadoop.conf.Configuration objects account for 1.8 GB, or 78.63% of the total heap memory.

 

Expanding the object's associated objects and reference paths (Figure 3.2) shows that the memory is mainly occupied by a HashMap, which is held by a FileSystem.Cache object, which in turn is held by FileSystem. We can therefore guess that the memory leak is most likely related to FileSystem.

3.2 Source code analysis

Having found the leaking objects, the next step was to find the leaking code.

In the business code we found the snippet shown in Figure 3.3: every interaction with HDFS establishes a new connection and creates a new FileSystem object, but after use the close() method is never called to release the connection.

Yet the Configuration instance and the FileSystem instance here are both local variables; after the method finishes, both objects should be reclaimable by the JVM. How could they cause a memory leak?

[Figure 3.3: the business code interacting with HDFS]
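A hypothetical reconstruction of the pattern in Figure 3.3 (the method name, URI, and user are illustrative, not the actual business code):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public void listHdfsFiles(String dirPath) throws Exception {
        // A new Configuration and a new FileSystem connection on every call...
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://nameservice1"), conf, "hive");
        for (FileStatus status : fileSystem.listStatus(new Path(dirPath))) {
            System.out.println(status.getPath());
        }
        // ...but fileSystem.close() is never called.
    }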

(1) Conjecture 1: does FileSystem hold long-lived (constant) objects?

Next we look at the source of the FileSystem class. The init and get methods of FileSystem are as follows:

[Figure 3.4: FileSystem init and get source]

As the last line of get in Figure 3.4 shows, the FileSystem class has a CACHE, and the parameter named by disableCacheName controls whether objects are fetched from this cache. That parameter defaults to false, i.e. by default the FileSystem is returned from the CACHE object.
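Simplified from the Hadoop source (2.x; details vary slightly by version), the core of get(URI, Configuration) is:

    public static FileSystem get(URI uri, Configuration conf) throws IOException {
        String scheme = uri.getScheme();
        // e.g. "fs.hdfs.impl.disable.cache" for an hdfs:// URI
        String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
        if (conf.getBoolean(disableCacheName, false)) {
            return createFileSystem(uri, conf); // caching disabled: always a new instance
        }
        return CACHE.get(uri, conf);            // default path: go through the static cache
    }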

[Figure 3.5: the static CACHE field of FileSystem]

Figure 3.5 shows that CACHE is a static field of the FileSystem class: the CACHE object lives for the lifetime of the class and is never reclaimed. There is indeed a long-lived "constant" object, so Conjecture 1 is verified.

Let's take a look at the CACHE.get method:

[image: Cache.get source]

It can be seen from this code:

  1. The Cache class maintains a Map internally, used to cache FileSystem objects whose connections have been created. The key of the Map is a Cache.Key object; each lookup fetches the FileSystem by its Cache.Key, and if it is absent a new FileSystem is created.

  2. The Cache class also maintains a Set (toAutoClose) that records connections to be closed automatically; the connections in this set are closed when the client shuts down.

  3. Each newly created FileSystem is stored into the Cache's Map with its Cache.Key as the key and the FileSystem as the value. Whether the same HDFS URI can be cached multiple times therefore depends on the hashCode method of Cache.Key. (A simplified sketch of the Cache class follows this list.)
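A simplified sketch of FileSystem.Cache, paraphrased from the Hadoop source (error handling and some bookkeeping trimmed):

    static class Cache {
        private final Map<Key, FileSystem> map = new HashMap<>();
        private final Set<Key> toAutoClose = new HashSet<>(); // closed by a shutdown hook

        FileSystem get(URI uri, Configuration conf) throws IOException {
            Key key = new Key(uri, conf);   // the key's identity decides cache hits
            FileSystem fs;
            synchronized (this) {
                fs = map.get(key);
            }
            if (fs != null) {
                return fs;                  // cache hit: reuse the existing connection
            }
            fs = createFileSystem(uri, conf);
            synchronized (this) {
                map.put(key, fs);           // cache miss: create and remember
                if (conf.getBoolean("fs.automatic.close", true)) {
                    toAutoClose.add(key);
                }
            }
            return fs;
        }
    }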

The hashCode method of Cache.Key is as follows:

[image: Cache.Key.hashCode source]

The scheme and authority variables are Strings; for the same URI their hash codes are identical. The unique parameter is 0 on every call. So the hashCode of a Cache.Key is effectively determined by ugi.hashCode().
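That method, essentially as it appears in the Hadoop source:

    @Override
    public int hashCode() {
        // scheme and authority are stable for a given URI, and unique is 0 in
        // this code path, so the result is dominated by ugi.hashCode().
        return (scheme + authority).hashCode() + ugi.hashCode() + (int) unique;
    }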

From the above code analysis, it can be concluded that:

  1. The business code creates a new FileSystem for every interaction with HDFS, and never closes the connection when it is done.

  2. FileSystem has a static Cache containing a Map that caches FileSystem instances whose connections have been created.

  3. The parameter fs.hdfs.impl.disable.cache controls whether FileSystem caching is disabled. It defaults to false, i.e. caching is enabled.

  4. The keys of the Map inside Cache are Cache.Key objects; a Cache.Key is determined by four parameters: scheme, authority, ugi, and unique, as shown in its hashCode method.

(2) Conjecture 2: does FileSystem cache the same HDFS URI multiple times?

The constructor of FileSystem.Cache.Key is shown below; ugi is obtained from getCurrentUser() of UserGroupInformation.

[image: Cache.Key constructor source]

Next, look at the getCurrentUser() method of UserGroupInformation:

[image: UserGroupInformation.getCurrentUser source]
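A simplified version, paraphrased from the Hadoop source:

    public static synchronized UserGroupInformation getCurrentUser() throws IOException {
        AccessControlContext context = AccessController.getContext();
        Subject subject = Subject.getSubject(context);
        if (subject == null || subject.getPrincipals(User.class).isEmpty()) {
            return getLoginUser();                // fall back to the static login user
        }
        return new UserGroupInformation(subject); // wrap the Subject found on the stack
    }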

The key question is whether the same Subject object can be obtained from the AccessControlContext. In this case, when creating the FileSystem via get(final URI uri, final Configuration conf, final String user), debugging shows that a new Subject object is obtained on every call. In other words, a new FileSystem object is cached for the same HDFS path every time.

Conjecture 2 is verified: the same HDFS URI is cached multiple times, so the cache expands rapidly; and since the cache has no expiration time or eviction policy, this eventually leads to a memory overflow.

(3) Why does FileSystem cache the same URI repeatedly?

Why is a new Subject object obtained each time? Let's look at the code that obtains the AccessControlContext:

[image: AccessController.getContext source]

The key is the getStackAccessControlContext method, which is a native method:

[image: getStackAccessControlContext native method]

This method returns an AccessControlContext representing the protection-domain permissions of the current call stack.

Looking at the get(final URI uri, final Configuration conf, final String user) method in Figure 3.6, we can see the following:

  • First, a UserGroupInformation object is obtained via the UserGroupInformation.getBestUGI method.

  • Then the get(URI uri, Configuration conf) method is called through that UserGroupInformation object.

  • Figure 3.7 shows the UserGroupInformation.getBestUGI method. Note the two parameters passed in: ticketCachePath and user. ticketCachePath is the value of the configuration hadoop.security.kerberos.ticket.cache.path; in this case that parameter is not configured, so ticketCachePath is empty. The user parameter is the user name passed in.

  • Since ticketCachePath is empty and user is not, execution ends up at the createRemoteUser method shown in Figure 3.8.

[Figure 3.6: FileSystem.get(uri, conf, user) source]

[Figure 3.7: UserGroupInformation.getBestUGI source]

[Figure 3.8: UserGroupInformation.createRemoteUser source]

The code highlighted in red in Figure 3.8 shows that the createRemoteUser method creates a new Subject object and constructs a UserGroupInformation object with it. At this point the UserGroupInformation.getBestUGI method has completed.
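Paraphrased from the Hadoop source, createRemoteUser does roughly the following; the crucial point is the new Subject() on every call:

    public static UserGroupInformation createRemoteUser(String user) {
        if (user == null || user.isEmpty()) {
            throw new IllegalArgumentException("Null user");
        }
        Subject subject = new Subject();          // a brand-new Subject every time
        subject.getPrincipals().add(new User(user));
        UserGroupInformation result = new UserGroupInformation(subject);
        result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
        return result;
    }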

Next let's look at the UserGroupInformation.doAs method (the last step executed by FileSystem.get(final URI uri, final Configuration conf, final String user)):

[image: UserGroupInformation.doAs source]

It then calls the Subject.doAs method:

[image: Subject.doAs source]

Finally, the AccessController.doPrivileged method is called:

[image: AccessController.doPrivileged native method]

This is a native method that executes the PrivilegedExceptionAction under the specified AccessControlContext, i.e. it invokes the run method of that action, which in turn calls FileSystem.get(uri, conf).

At this point we can explain it: in this case, whenever a FileSystem is created through the get(final URI uri, final Configuration conf, final String user) method, the Cache.Key stored in the FileSystem's Cache has a different hashCode every time.
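Putting the chain together, a simplified view of the three-argument get (paraphrased from the Hadoop source):

    public static FileSystem get(final URI uri, final Configuration conf,
            final String user) throws IOException, InterruptedException {
        String ticketCachePath = conf.get("hadoop.security.kerberos.ticket.cache.path");
        // With no ticket cache configured, getBestUGI ends in createRemoteUser(user),
        // i.e. a fresh Subject and a fresh UserGroupInformation on every call.
        UserGroupInformation ugi = UserGroupInformation.getBestUGI(ticketCachePath, user);
        return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
            @Override
            public FileSystem run() throws IOException {
                return get(uri, conf);  // the Cache.Key is built from the fresh ugi
            }
        });
    }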

To summarize:

  1. Creating a FileSystem through get(final URI uri, final Configuration conf, final String user) creates a new UserGroupInformation and a new Subject object on every call.

  2. When a Cache.Key computes its hashCode, the result is determined by the UserGroupInformation.hashCode method.

  3. UserGroupInformation.hashCode is computed as System.identityHashCode(subject): only if the Subject is the same object will the same hashCode be returned. In this case the Subject differs on every call, so the computed hashCodes differ.

  4. In summary: the Cache.Key's hashCode differs on every call, so a new entry is written into the FileSystem's Cache every time. (A small demonstration follows.)
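The effect is easy to demonstrate (identity hash codes of two distinct objects almost never match):

    UserGroupInformation u1 = UserGroupInformation.createRemoteUser("hive");
    UserGroupInformation u2 = UserGroupInformation.createRemoteUser("hive");
    // Same user name, but two different Subject instances:
    System.out.println(u1.hashCode() == u2.hashCode()); // almost always false
    // Hence two different Cache.Key values, and two cached FileSystem instances.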

(4) Correct usage of FileSystem

Given the above analysis, the FileSystem cache was not serving its purpose at all, so why was the cache designed in? In fact, the cache is fine; our usage was simply incorrect.

In FileSystem, there are two overloaded get methods:

    public static FileSystem get(final URI uri, final Configuration conf, final String user)
    public static FileSystem get(URI uri, Configuration conf)

[image: get(uri, conf, user) delegating to get(uri, conf)]

We can see that FileSystem.get(final URI uri, final Configuration conf, final String user) ultimately calls FileSystem.get(URI uri, Configuration conf); the difference is that the two-argument method does not create a new Subject each time.

[Figure 3.9: obtaining the current user, the Subject == null branch]

With no new Subject created, the Subject in Figure 3.9 is null, and the getLoginUser method is used to obtain the loginUser. loginUser is a static variable, so once it has been initialized it is reused from then on, and UserGroupInformation.hashCode returns the same value every time. That is, the cache inside FileSystem works as intended.

[Figure 3.10: getLoginUser and the static loginUser field]

4. Solution

Given the analysis above, there are two ways to fix the FileSystem memory leak:

(1) Use public static FileSystem get(URI uri, Configuration conf):

  • This method uses the FileSystem Cache, so there is only one FileSystem connection object per HDFS URI.

  • Set the accessing user via System.setProperty("HADOOP_USER_NAME", "hive").

  • By default fs.automatic.close=true, i.e. all cached connections are closed through a ShutdownHook when the JVM exits. (A usage sketch follows.)
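A usage sketch of this first fix (the URI is illustrative):

    System.setProperty("HADOOP_USER_NAME", "hive"); // user read from the system property
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://nameservice1"), conf);
    // The same URI returns the same cached instance on later calls. Do not close
    // it per request: with fs.automatic.close=true, a shutdown hook closes it on exit.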

(2) Keep using public static FileSystem get(final URI uri, final Configuration conf, final String user):

  • As analyzed above, this method defeats the FileSystem Cache: a new entry is added to the Cache's Map on every call and is never reclaimed.

  • One fix is to make sure, in our own code, that only one FileSystem connection object exists per HDFS URI.

  • The other fix is to call the close method after every use of the FileSystem, which removes the FileSystem from the Cache.

[images: close() removing the FileSystem from the Cache]

Since it required minimal changes to our existing code, we chose the second fix: close the FileSystem object after every use.
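FileSystem implements Closeable, so the fix can be written with try-with-resources (a sketch; the URI and user are illustrative):

    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(new URI("hdfs://nameservice1"), conf, "hive")) {
        // ... interact with HDFS ...
    } // close() releases the connection and removes this entry from FileSystem.CACHE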

5. Optimization results

After the fix was released online, as shown in the figures below, the old-generation memory could be reclaimed normally, and the problem was finally resolved.

[images: old-generation memory after the fix]

6. Summary

Memory overflow is one of the most common problems in Java development, usually caused by a memory leak that prevents memory from being reclaimed. In this article we walked through a complete, real-world memory overflow incident and its resolution.

To summarize the usual steps for handling a memory overflow:

(1) Generate a heap dump file

Add the following to the service startup command:

 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/base

so that the service automatically dumps a heap file when OOM occurs; alternatively, use the jmap command (e.g. jmap -dump:format=b,file=heap.hprof <pid>) to dump one manually.

(2) Heap memory analysis: use a memory analysis tool to dig deeper into the overflow and find its cause. Some common tools:

  • Eclipse Memory Analyzer (MAT): an open-source Java memory analysis tool that helps quickly locate memory leaks.

  • VisualVM: a GUI-based tool that helps analyze the memory usage of Java applications.

(3) Locate the specific leaking code based on the heap analysis.

(4) Fix the leaking code, re-release, and verify the fix.

A memory leak is a common cause of memory overflow, but not the only one: oversized objects, a too-small heap allocation, and infinite recursive calls, among other things, can also lead to memory overflow.

When facing a memory overflow, think broadly and analyze the problem from different angles. The methods, tools, and monitoring described above can help us quickly locate and resolve problems and improve the stability and availability of our systems.
