Troubleshooting an online JVM memory overflow caused by the FileSystem class


Author: Ye Jidong from vivo Internet Big Data Team

This article walks through the full process of analyzing and fixing an online memory overflow that was caused by a memory leak in the use of Hadoop's FileSystem class.

Memory leak: an object or variable that the program no longer uses still occupies memory, and the JVM cannot reclaim it. A single leak may look harmless, but leaks accumulate, and the end result of accumulated leaks is a memory overflow.

Memory overflow: an error that prevents the program from continuing because the allocated memory space is insufficient or misused at runtime. The JVM reports an OOM (OutOfMemoryError), which is what we call a memory overflow.
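As a minimal illustration (not from this incident), a static collection that only ever grows behaves exactly like the leak analyzed below: every entry stays reachable from a GC root, so the collector can never reclaim it.

    import java.util.HashMap;
    import java.util.Map;

    public class LeakDemo {
        // Reachable from a GC root for the lifetime of the class, so never collected.
        private static final Map<Object, byte[]> CACHE = new HashMap<>();

        public static void handleRequest() {
            // A brand-new key on every call means entries are never reused or
            // evicted; the old generation fills up until OutOfMemoryError.
            CACHE.put(new Object(), new byte[1024 * 1024]);
        }
    }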

1. Background

Over the weekend, while Xiaoye was in the middle of a match of Honor of Kings, his phone suddenly received a flood of machine CPU alarms (triggered whenever CPU utilization exceeds 80%), followed by Full GC alarms for a service that is critical to his project team. Xiaoye put the game down and opened his computer to investigate.

[Figure 1.1: CPU alarm and Full GC alarm]

2. Problem discovery

2.1 Monitoring view

Since both CPU and Full GC alarms had fired, we opened the service monitoring to check the CPU and Full GC charts. Both showed an abnormal spike at the same point in time: when the CPU alarm fired, Full GC was running especially frequently, so we suspected that the frequent Full GC was what drove CPU utilization up and triggered the alarm.

[Figure 2.1: CPU usage]

[Figure 2.2: Full GC count]

2.2 Memory leak

The frequent Full GC made it clear that the service's memory reclamation had a problem, so we checked the monitoring of the service's heap memory, old-generation memory, and young-generation memory. The old-generation chart showed resident memory growing steadily: objects in the old generation could not be reclaimed, until eventually the old generation was completely occupied. This is a clear sign of a memory leak.

[Figure 2.3: Old-generation memory]

[Figure 2.4: JVM memory]

2.3 Memory overflow

The online error log clearly showed that the service eventually ran out of memory. So the root cause chain was: a memory leak led to a memory overflow (OOM), which finally made the service unavailable.

[Figure 2.5: OOM log]

3. Troubleshooting

3.1 Heap memory analysis

After identifying the cause as a memory leak, we immediately dumped a snapshot of the service's heap memory and imported the dump file into MAT (Eclipse Memory Analyzer) for analysis, opening the Leak Suspects view to see the suspected leak points.

[Figure 3.1: Memory object analysis]

[Figure 3.2: Object link diagram]

 

Opening the dump file shows the view in Figure 3.1: of the 2.3 GB heap, org.apache.hadoop.conf.Configuration objects account for 1.8 GB, or 78.63% of the total heap memory.

 

Expanding the object's associated objects and reference paths (Figure 3.2) shows that the memory is mainly occupied by a HashMap, which is held by a FileSystem.Cache object, which in turn is held by FileSystem. We can therefore guess that the memory leak is most likely related to FileSystem.

3.2 Source code analysis

Having found the leaking objects, the next step was to find the leaking code.

In the business code we found the snippet shown in Figure 3.3: every interaction with HDFS establishes a new connection and creates a new FileSystem object, but after use the close() method is never called to release the connection.

Yet the Configuration instance and the FileSystem instance here are both local variables; after the method finishes, both objects should be reclaimable by the JVM. How could they cause a memory leak?

[Figure 3.3: the business code interacting with HDFS]
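A hypothetical reconstruction of the pattern in Figure 3.3 (the method name, URI, and user are illustrative, not the actual business code):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public void listHdfsFiles(String dirPath) throws Exception {
        // A new Configuration and a new FileSystem connection on every call...
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://nameservice1"), conf, "hive");
        for (FileStatus status : fileSystem.listStatus(new Path(dirPath))) {
            System.out.println(status.getPath());
        }
        // ...but fileSystem.close() is never called.
    }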

(1) Conjecture 1: does FileSystem hold long-lived (constant) objects?

Next we look at the source of the FileSystem class. The init and get methods of FileSystem are as follows:

[Figure 3.4: FileSystem init and get source]

As the last line of get in Figure 3.4 shows, the FileSystem class has a CACHE, and the parameter named by disableCacheName controls whether objects are fetched from this cache. That parameter defaults to false, i.e. by default the FileSystem is returned from the CACHE object.
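Simplified from the Hadoop source (2.x; details vary slightly by version), the core of get(URI, Configuration) is:

    public static FileSystem get(URI uri, Configuration conf) throws IOException {
        String scheme = uri.getScheme();
        // e.g. "fs.hdfs.impl.disable.cache" for an hdfs:// URI
        String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
        if (conf.getBoolean(disableCacheName, false)) {
            return createFileSystem(uri, conf); // caching disabled: always a new instance
        }
        return CACHE.get(uri, conf);            // default path: go through the static cache
    }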

[Figure 3.5: the static CACHE field of FileSystem]

Figure 3.5 shows that CACHE is a static field of the FileSystem class: the CACHE object lives for the lifetime of the class and is never reclaimed. There is indeed a long-lived "constant" object, so Conjecture 1 is verified.

Let's take a look at the CACHE.get method:

[image: Cache.get source]

It can be seen from this code:

  1. The Cache class maintains a Map internally, used to cache FileSystem objects whose connections have been created. The key of the Map is a Cache.Key object; each lookup fetches the FileSystem by its Cache.Key, and if it is absent a new FileSystem is created.

  2. The Cache class also maintains a Set (toAutoClose) that records connections to be closed automatically; the connections in this set are closed when the client shuts down.

  3. Each newly created FileSystem is stored into the Cache's Map with its Cache.Key as the key and the FileSystem as the value. Whether the same HDFS URI can be cached multiple times therefore depends on the hashCode method of Cache.Key. (A simplified sketch of the Cache class follows this list.)
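A simplified sketch of FileSystem.Cache, paraphrased from the Hadoop source (error handling and some bookkeeping trimmed):

    static class Cache {
        private final Map<Key, FileSystem> map = new HashMap<>();
        private final Set<Key> toAutoClose = new HashSet<>(); // closed by a shutdown hook

        FileSystem get(URI uri, Configuration conf) throws IOException {
            Key key = new Key(uri, conf);   // the key's identity decides cache hits
            FileSystem fs;
            synchronized (this) {
                fs = map.get(key);
            }
            if (fs != null) {
                return fs;                  // cache hit: reuse the existing connection
            }
            fs = createFileSystem(uri, conf);
            synchronized (this) {
                map.put(key, fs);           // cache miss: create and remember
                if (conf.getBoolean("fs.automatic.close", true)) {
                    toAutoClose.add(key);
                }
            }
            return fs;
        }
    }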

The hashCode method of Cache.Key is as follows:

[image: Cache.Key.hashCode source]

The scheme and authority variables are Strings; for the same URI their hash codes are identical. The unique parameter is 0 on every call. So the hashCode of a Cache.Key is effectively determined by ugi.hashCode().
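That method, essentially as it appears in the Hadoop source:

    @Override
    public int hashCode() {
        // scheme and authority are stable for a given URI, and unique is 0 in
        // this code path, so the result is dominated by ugi.hashCode().
        return (scheme + authority).hashCode() + ugi.hashCode() + (int) unique;
    }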

From the above code analysis, it can be concluded that:

  1. The business code creates a new FileSystem for every interaction with HDFS, and never closes the connection when it is done.

  2. FileSystem has a static Cache containing a Map that caches FileSystem instances whose connections have been created.

  3. The parameter fs.hdfs.impl.disable.cache controls whether FileSystem caching is disabled. It defaults to false, i.e. caching is enabled.

  4. The keys of the Map inside Cache are Cache.Key objects; a Cache.Key is determined by four parameters: scheme, authority, ugi, and unique, as shown in its hashCode method.

(2) Conjecture 2: does FileSystem cache the same HDFS URI multiple times?

The constructor of FileSystem.Cache.Key is shown below; ugi is obtained from getCurrentUser() of UserGroupInformation.

[image: Cache.Key constructor source]

Next, look at the getCurrentUser() method of UserGroupInformation:

[image: UserGroupInformation.getCurrentUser source]
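A simplified version, paraphrased from the Hadoop source:

    public static synchronized UserGroupInformation getCurrentUser() throws IOException {
        AccessControlContext context = AccessController.getContext();
        Subject subject = Subject.getSubject(context);
        if (subject == null || subject.getPrincipals(User.class).isEmpty()) {
            return getLoginUser();                // fall back to the static login user
        }
        return new UserGroupInformation(subject); // wrap the Subject found on the stack
    }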

The key question is whether the same Subject object can be obtained from the AccessControlContext. In this case, when creating the FileSystem via get(final URI uri, final Configuration conf, final String user), debugging shows that a new Subject object is obtained on every call. In other words, a new FileSystem object is cached for the same HDFS path every time.

Conjecture 2 is verified: the same HDFS URI is cached multiple times, so the cache expands rapidly; and since the cache has no expiration time or eviction policy, this eventually leads to a memory overflow.

(3) Why does FileSystem cache the same URI repeatedly?

Why is a new Subject object obtained each time? Let's look at the code that obtains the AccessControlContext:

[image: AccessController.getContext source]

The key is the getStackAccessControlContext method, which is a native method:

[image: getStackAccessControlContext native method]

This method returns an AccessControlContext representing the protection-domain permissions of the current call stack.

Looking at the get(final URI uri, final Configuration conf, final String user) method in Figure 3.6, we can see the following:

  • First, a UserGroupInformation object is obtained via the UserGroupInformation.getBestUGI method.

  • Then the get(URI uri, Configuration conf) method is called through that UserGroupInformation object.

  • Figure 3.7 shows the UserGroupInformation.getBestUGI method. Note the two parameters passed in: ticketCachePath and user. ticketCachePath is the value of the configuration hadoop.security.kerberos.ticket.cache.path; in this case that parameter is not configured, so ticketCachePath is empty. The user parameter is the user name passed in.

  • Since ticketCachePath is empty and user is not, execution ends up at the createRemoteUser method shown in Figure 3.8.

[Figure 3.6: FileSystem.get(uri, conf, user) source]

[Figure 3.7: UserGroupInformation.getBestUGI source]

[Figure 3.8: UserGroupInformation.createRemoteUser source]

The code highlighted in red in Figure 3.8 shows that the createRemoteUser method creates a new Subject object and constructs a UserGroupInformation object with it. At this point the UserGroupInformation.getBestUGI method has completed.
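Paraphrased from the Hadoop source, createRemoteUser does roughly the following; the crucial point is the new Subject() on every call:

    public static UserGroupInformation createRemoteUser(String user) {
        if (user == null || user.isEmpty()) {
            throw new IllegalArgumentException("Null user");
        }
        Subject subject = new Subject();          // a brand-new Subject every time
        subject.getPrincipals().add(new User(user));
        UserGroupInformation result = new UserGroupInformation(subject);
        result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
        return result;
    }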

Next let's look at the UserGroupInformation.doAs method (the last step executed by FileSystem.get(final URI uri, final Configuration conf, final String user)):

[image: UserGroupInformation.doAs source]

It then calls the Subject.doAs method:

[image: Subject.doAs source]

Finally, the AccessController.doPrivileged method is called:

[image: AccessController.doPrivileged native method]

This is a native method that executes the PrivilegedExceptionAction under the specified AccessControlContext, i.e. it invokes the run method of that action, which in turn calls FileSystem.get(uri, conf).

At this point we can explain it: in this case, whenever a FileSystem is created through the get(final URI uri, final Configuration conf, final String user) method, the Cache.Key stored in the FileSystem's Cache has a different hashCode every time.
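Putting the chain together, a simplified view of the three-argument get (paraphrased from the Hadoop source):

    public static FileSystem get(final URI uri, final Configuration conf,
            final String user) throws IOException, InterruptedException {
        String ticketCachePath = conf.get("hadoop.security.kerberos.ticket.cache.path");
        // With no ticket cache configured, getBestUGI ends in createRemoteUser(user),
        // i.e. a fresh Subject and a fresh UserGroupInformation on every call.
        UserGroupInformation ugi = UserGroupInformation.getBestUGI(ticketCachePath, user);
        return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
            @Override
            public FileSystem run() throws IOException {
                return get(uri, conf);  // the Cache.Key is built from the fresh ugi
            }
        });
    }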

To summarize:

  1. Creating a FileSystem through get(final URI uri, final Configuration conf, final String user) creates a new UserGroupInformation and a new Subject object on every call.

  2. When a Cache.Key computes its hashCode, the result is determined by the UserGroupInformation.hashCode method.

  3. UserGroupInformation.hashCode is computed as System.identityHashCode(subject): only if the Subject is the same object will the same hashCode be returned. In this case the Subject differs on every call, so the computed hashCodes differ.

  4. In summary: the Cache.Key's hashCode differs on every call, so a new entry is written into the FileSystem's Cache every time. (A small demonstration follows.)
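The effect is easy to demonstrate (identity hash codes of two distinct objects almost never match):

    UserGroupInformation u1 = UserGroupInformation.createRemoteUser("hive");
    UserGroupInformation u2 = UserGroupInformation.createRemoteUser("hive");
    // Same user name, but two different Subject instances:
    System.out.println(u1.hashCode() == u2.hashCode()); // almost always false
    // Hence two different Cache.Key values, and two cached FileSystem instances.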

(4) Correct usage of FileSystem

Given the above analysis, the FileSystem cache was not serving its purpose at all, so why was the cache designed in? In fact, the cache is fine; our usage was simply incorrect.

In FileSystem, there are two overloaded get methods:

    public static FileSystem get(final URI uri, final Configuration conf, final String user)
    public static FileSystem get(URI uri, Configuration conf)

[image: get(uri, conf, user) delegating to get(uri, conf)]

We can see that FileSystem.get(final URI uri, final Configuration conf, final String user) ultimately calls FileSystem.get(URI uri, Configuration conf); the difference is that the two-argument method does not create a new Subject each time.

[Figure 3.9: obtaining the current user, the Subject == null branch]

With no new Subject created, the Subject in Figure 3.9 is null, and the getLoginUser method is used to obtain the loginUser. loginUser is a static variable, so once it has been initialized it is reused from then on, and UserGroupInformation.hashCode returns the same value every time. That is, the cache inside FileSystem works as intended.

[Figure 3.10: getLoginUser and the static loginUser field]

4. Solution

Given the analysis above, there are two ways to fix the FileSystem memory leak:

(1) Use public static FileSystem get(URI uri, Configuration conf):

  • This method uses the FileSystem Cache, so there is only one FileSystem connection object per HDFS URI.

  • Set the accessing user via System.setProperty("HADOOP_USER_NAME", "hive").

  • By default fs.automatic.close=true, i.e. all cached connections are closed through a ShutdownHook when the JVM exits. (A usage sketch follows.)
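A usage sketch of this first fix (the URI is illustrative):

    System.setProperty("HADOOP_USER_NAME", "hive"); // user read from the system property
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://nameservice1"), conf);
    // The same URI returns the same cached instance on later calls. Do not close
    // it per request: with fs.automatic.close=true, a shutdown hook closes it on exit.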

(2) Keep using public static FileSystem get(final URI uri, final Configuration conf, final String user):

  • As analyzed above, this method defeats the FileSystem Cache: a new entry is added to the Cache's Map on every call and is never reclaimed.

  • One fix is to make sure, in our own code, that only one FileSystem connection object exists per HDFS URI.

  • The other fix is to call the close method after every use of the FileSystem, which removes the FileSystem from the Cache.

[images: close() removing the FileSystem from the Cache]

Since it required minimal changes to our existing code, we chose the second fix: close the FileSystem object after every use.
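FileSystem implements Closeable, so the fix can be written with try-with-resources (a sketch; the URI and user are illustrative):

    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(new URI("hdfs://nameservice1"), conf, "hive")) {
        // ... interact with HDFS ...
    } // close() releases the connection and removes this entry from FileSystem.CACHE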

5. Optimization results

After the fix was released online, as shown in the figures below, the old-generation memory could be reclaimed normally, and the problem was finally resolved.

[images: old-generation memory after the fix]

6. Summary

Memory overflow is one of the most common problems in Java development, usually caused by a memory leak that prevents memory from being reclaimed. In this article we walked through a complete, real-world memory overflow incident and its resolution.

To summarize the usual steps for handling a memory overflow:

(1) Generate a heap dump file

Add the following to the service startup command:

 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/base

so that the service automatically dumps a heap file when OOM occurs; alternatively, use the jmap command (e.g. jmap -dump:format=b,file=heap.hprof <pid>) to dump one manually.

(2) Heap memory analysis: use a memory analysis tool to dig deeper into the overflow and find its cause. Some common tools:

  • Eclipse Memory Analyzer (MAT): an open-source Java memory analysis tool that helps quickly locate memory leaks.

  • VisualVM: a GUI-based tool that helps analyze the memory usage of Java applications.

(3) Locate the specific leaking code based on the heap analysis.

(4) Fix the leaking code, re-release, and verify the fix.

A memory leak is a common cause of memory overflow, but not the only one: oversized objects, a too-small heap allocation, and infinite recursive calls, among other things, can also lead to memory overflow.

When facing a memory overflow, think broadly and analyze the problem from different angles. The methods, tools, and monitoring described above can help us quickly locate and resolve problems and improve the stability and availability of our systems.
