This article was originally written in early May 2016, substantially updated and revised in mid-February 2020, and will continue to be updated from time to time.

What is dynamic tracing

I am very happy to share the topic of dynamic tracing with you here; it is an exciting topic for me personally. So, what is dynamic tracing technology?

Dynamic tracing is, in essence, a post-modern advanced debugging technology. It helps software engineers answer difficult questions about software systems at very low cost and in very little time, so that problems can be located and solved much faster. It has risen and flourished against the backdrop of a rapidly growing Internet era, in which engineers face two major challenges. The first is scale: the number of users, the number of data centers, and the number of machines are all growing rapidly. The second is complexity: our business logic keeps getting more complicated, and so do the software systems we run. They are divided into many layers, including the operating system kernel; various pieces of system software on top of it, such as databases and web servers; then virtual machines, interpreters and just-in-time (JIT) compilers for scripting languages or other high-level languages; and finally, at the top, the application-level abstractions of business logic and a great deal of complicated application code.

The most serious consequence of these huge challenges is that today's software engineers are rapidly losing insight into, and control over, the entire production system. In such complex and enormous systems, the probability that something goes wrong is greatly increased. Some problems can be fatal, such as 500 error pages, memory leaks, or wrong results being returned. Another major class of problems is performance: we may find that the software runs very slowly at certain times or on certain machines, without knowing why. Now that everyone is embracing cloud computing and big data, these large-scale production environments will only produce more and more weird problems, which can easily consume most of an engineer's time and energy. Most of them are online problems that are hard, or nearly impossible, to reproduce offline, and some occur at a very low rate: one in a hundred requests, one in a thousand, or even lower. Ideally, we would like to analyze and pinpoint such problems while the system is still running, without taking machines offline, modifying code or configuration, or restarting services, and then apply a targeted fix. If we could do that, it would be perfect, and we could sleep soundly every night.

Dynamic tracing technology can actually help us realize this vision and greatly liberate engineers' productivity. I still remember that when I worked at Yahoo China, I sometimes had to take a taxi to the office in the middle of the night to deal with online problems. That was obviously a helpless and frustrating way of living and working. Later I worked for a CDN company in the United States, whose customers had their own operations teams that would comb through the raw logs the CDN provided. A problem affecting one in a hundred or one in a thousand requests may seem minor to us, but it was important to them: it would be reported, and we had to investigate, find the real cause, and report back. These real-world problems stimulate the invention and development of new technology.

I think the most remarkable thing about dynamic tracing is that it is a "live analysis" technology. That is, while a program or an entire software system is still running, still serving online traffic, and still handling real requests, we can analyze it (whether it likes it or not), just as we would query a database. This is very interesting. What many engineers tend to overlook is that a running software system itself contains most of the valuable information, and can be treated directly as a real-time database to "query". Of course, this special "database" must be read-only; otherwise our analysis and debugging might affect the behavior of the system itself and endanger online services. With the help of the operating system kernel, we can launch a series of targeted queries from the outside and obtain first-hand, valuable details about the running software system, which then guide our problem analysis, performance analysis, and other work.

Dynamic tracing is usually based on the operating system kernel. The kernel controls the entire software world, because it sits in the position of a "creator". With its absolute privileges, it can ensure that the various "queries" we issue against the software system do not disturb the system's normal operation. In other words, our queries must be safe enough to be used freely on production systems. Treating the software system as a "database" raises the question of how to query it; obviously, we do not query this special "database" with SQL.

In dynamic tracing, queries are usually issued through probes. We place probes at one or several layers of the software system, and we define our own handlers associated with those probes. This is a bit like acupuncture in traditional Chinese medicine: if we regard the software system as a person, we can place "needles" at certain acupoints, and those needles carry "sensors" we define ourselves, so we can freely collect the key information we need from those acupoints and then combine it to produce a reliable diagnosis and a feasible treatment plan. Tracing usually involves two dimensions. One is time: the software keeps running, so it changes continuously along the timeline. The other is space: tracing may involve multiple different processes, including kernel threads, and each process usually has its own memory space. So across different layers, and within the memory space of a single layer, we can obtain a lot of valuable information, both vertically and horizontally. It is a bit like a spider hunting for prey on its web.

Spiders
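To make the "probe plus handler" idea above concrete, here is a minimal sketch written in the SystemTap scripting language that this article discusses further below. It simply counts, per process name, how often the kernel function vfs_read() is hit during a ten-second sampling window; the probe point is the "needle" and the handler body is the "sensor". Run it as root with the stap command line tool.

    # Count which processes hit the kernel's vfs_read() and how often.
    global reads

    probe kernel.function("vfs_read") {
        reads[execname()]++          # the "sensor": one tiny piece of data per hit
    }

    probe timer.s(10) {
        exit()                       # sample for 10 seconds, then detach cleanly
    }

    probe end {
        foreach (name in reads- limit 10)
            printf("%-20s %d\n", name, reads[name])
    }

This is only a toy, of course, but it already shows the shape of most dynamic tracing tools: attach, observe briefly, aggregate, print, and leave the target system untouched.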

We can collect information not only in the operating system kernel, but also at higher levels, such as in user-mode programs. This information can then be linked along the timeline to build a complete software landscape that effectively guides very complex analysis. The key point is that it is non-intrusive. If we compare the software system to a person, we obviously do not want to cut a living person open just to diagnose an illness. Instead, we take an X-ray, do an MRI, feel the pulse, or simply listen with a stethoscope. The same goes for diagnosing a production system. Dynamic tracing lets us obtain the first-hand information we need quickly and efficiently in a non-invasive way, without modifying the operating system kernel, the applications, the business code, or any configuration, and helps us locate whatever problem we are troubleshooting.

Most engineers are probably very familiar with the process of constructing software; it is our basic skill. We usually build up levels of abstraction layer by layer, whether bottom-up or top-down. There are many ways to establish abstraction layers, for example through object-oriented classes and methods, or directly through functions and subroutines. Debugging is, in a way, the opposite of software construction: we need to be able to easily "break through" the previously established abstraction layers and obtain whatever information we need at any level or levels, regardless of encapsulation, isolation, or any rules and conventions laid down when the software was built. This is because when debugging we always want as much information as possible; after all, the problem may lurk at any level.

Because dynamic tracing is generally based on the operating system kernel, and the kernel is the "creator" with absolute authority, this technology can easily penetrate the abstraction and encapsulation of every software layer, so the layers established during software construction never become an obstacle. On the contrary, well-designed abstraction and encapsulation actually help the debugging process, as we will see later. In my own work I often see engineers panic when a problem appears online: they guess at possible causes but lack any evidence to confirm or refute their guesses and assumptions. They may even resort to repeated trial and error directly on the production system, making a mess of things without getting a clue, causing themselves and their colleagues a great deal of pain and wasting valuable troubleshooting time. When we have dynamic tracing, troubleshooting itself can become a very interesting process: a weird online problem turns into an exciting chance to solve a fascinating puzzle. The premise, of course, is that we have powerful tools at hand to help us collect information and reason about it, and to quickly prove or disprove any hypothesis.

Advantages of dynamic tracing

Dynamic tracing generally does not require any cooperation from the target application. It is as if we could give a buddy a physical exam while he is still running on the playground: a "dynamic X-ray" taken while he keeps exercising, without him even noticing. On reflection, this is a great thing. Tools based on dynamic tracing operate in a "hot-plug" fashion: we can run them at any time, start sampling at any time, and stop at any time, regardless of the current state of the target system. Much of the statistics and analysis we need only occurs to us after the target system has gone online; we cannot predict, before launch, every problem we might run into, let alone all the information we would need to collect to troubleshoot those unknown problems. The strength of dynamic tracing is exactly that it enables "collection anytime, anywhere, on demand". Another advantage is that its own performance overhead is very small. The impact of a carefully written debugging tool on the system's peak performance is usually no more than 5%, and often lower, so it generally has no observable performance impact on end users. Moreover, even this small overhead only lasts for the few tens of seconds or minutes during which we are actually sampling. Once the debugging tool finishes running, the online system automatically returns to 100% of its original performance and keeps rushing forward.


DTrace and SystemTap

When it comes to dynamic tracing, we have to mention DTrace. DTrace is the ancestor of modern dynamic tracing technology. It was born in the Solaris operating system at the beginning of the 21st century and was written by engineers at the original Sun Microsystems. Many of you may have heard of Solaris and Sun.

There is a story about its creation. Several engineers working on the Solaris operating system once spent days and nights troubleshooting a seemingly bizarre online problem. At first they thought it must be something very sophisticated, so they worked very hard on it; after struggling for several days, they finally discovered that it was just a silly configuration mistake in an inconspicuous place. Having learned from that experience, the engineers created DTrace, a very advanced debugging tool, to save themselves from spending so much energy on silly problems in the future. After all, most so-called "weird problems" turn out to be low-level mistakes, the kind that leave you depressed while you cannot find them, and even more depressed once you do.

DTrace is a fairly general debugging platform. It provides a scripting language much like C, called D, in which DTrace-based debugging tools are written. The D language has special syntax for specifying "probes", each with a location description: you can place a probe at the entry or exit of a kernel function, at the entry or exit of a function in a user-mode process, or even on an arbitrary program statement or machine instruction. Writing D debugging programs requires some understanding of the system, and such programs are powerful tools for regaining insight into complex systems. Brendan Gregg, a former Sun engineer, was one of the earliest DTrace users, even before DTrace was open-sourced. Brendan wrote many reusable DTrace-based debugging tools, most notably collected in the open source DTrace Toolkit project. DTrace is the earliest and most famous dynamic tracing framework.

The advantage of DTrace is its tight integration with the operating system kernel. The D language is implemented as a virtual machine (VM), a bit like the Java virtual machine (JVM). One benefit is that the D runtime is kernel-resident and very compact, so every debugging tool starts and exits very quickly. But in my view DTrace also has obvious shortcomings. The one I find hardest to live with is that D lacks loop constructs, which makes it very difficult to write analysis tools for complex data structures in the target process. The official justification is that loops might run too long, but clearly DTrace could instead limit the number of iterations of each loop at the VM level. Another major disadvantage is that DTrace's support for tracing user-mode code is relatively weak: there is no automatic loading of user-mode debugging symbols, so you have to declare in your D code the user-mode C structures and other types you want to use. [1]

DTrace has been highly influential, and engineers have ported it to other operating systems. For example, Apple's Mac OS X has a DTrace port; in fact, every Apple laptop and desktop released in recent years ships with a ready-to-use dtrace command-line tool that you can try in a terminal. FreeBSD also has a DTrace port, although it is not enabled by default: you have to load FreeBSD's DTrace kernel module explicitly. Oracle has also worked on porting DTrace to the Linux kernel in its own Oracle Linux distribution, but that effort never seemed to make much headway. After all, the Linux kernel is not controlled by Oracle, and DTrace needs tight integration with the kernel. For similar reasons, the Linux ports of DTrace attempted by some brave independent engineers have always remained far from production quality.

Compared with the native DTrace on Solaris, these ports all lack some advanced features to varying degrees, so their capabilities fall short of the original DTrace.

DTrace's influence on the Linux world is also reflected in the SystemTap open source project, a relatively independent dynamic tracing framework created by engineers at Red Hat. SystemTap provides its own small scripting language, which is not the same as D. Red Hat serves a huge number of enterprise users, and its engineers have to deal with many online "weird problems" every day; this technology clearly grew out of real demand. I think SystemTap is the most powerful, and the most practical, dynamic tracing framework in the Linux world, and I have used it successfully in my own work for many years. Its authors, Frank Ch. Eigler, Josh Stone and others, are very enthusiastic and very smart engineers; questions I ask on IRC or on the mailing list are usually answered quickly and in great detail. It is worth mentioning that I contributed an important new feature to SystemTap, which makes it possible to access the value of user-mode global variables from any probe context. The C++ patch I merged into the SystemTap mainline was about 1000 lines, and was only possible thanks to the enthusiastic help of SystemTap's authors. [2] This feature plays a key role in the flame graph tools for dynamic scripting languages (such as Perl and Lua) that are built on SystemTap.
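As a rough illustration of that feature (the variable name, source file and binary path below are hypothetical placeholders, not from any real project): the two-argument form of SystemTap's @var construct can, given debugging symbols, read a global variable of a user-mode target process even from a probe that does not live inside that process's functions, such as a profiling timer probe. A minimal sketch, assuming the target process is passed with -x PID:

    # Sketch only: peek at the user-mode global "my_counter" (defined in
    # src/main.c of the hypothetical binary /usr/local/bin/myapp) from a
    # profiling timer tick that happens to land in the target process.
    probe timer.profile {
        if (pid() == target()) {
            printf("my_counter = %d\n",
                   @var("my_counter@src/main.c", "/usr/local/bin/myapp"))
            exit()
        }
    }

Being able to reach user-mode state from arbitrary probe contexts like this is exactly what the scripting-language flame graph tools mentioned above rely on.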

SystemTap's advantages are its very mature automatic loading of user-mode debugging symbols, and a real language with constructs such as loops (unlike D), which can be used to write complex probe handlers and supports many kinds of sophisticated analysis. Because SystemTap's implementation was immature in its early years, the Internet is still flooded with outdated criticism of it. SystemTap has made great progress in recent years, and OpenResty Inc., which I founded in 2017, has also contributed many improvements and optimizations to it.

Of course, SystemTap has disadvantages too. First, it is not part of the Linux kernel, that is, it is not tightly integrated with the kernel, so it has to keep chasing changes in the mainline kernel. Second, it usually compiles its "small language" scripts (which resemble D) into the C source code of a Linux kernel module, so you often need to deploy a C compiler toolchain and the Linux kernel headers on the production machine. For these reasons, SystemTap scripts start much more slowly than DTrace ones, somewhat like JVM startup. Despite these shortcomings, [3] SystemTap is, on the whole, a very mature dynamic tracing framework.

 Schematic diagram of SystemTap

In fact, neither DTrace nor SystemTap supports writing complete debugging tools by itself, because both lack convenient command-line interaction primitives. That is why so many real-world tools based on them are wrapped in Perl, Python or shell scripts. To write complete debugging tools in a cleaner language, I extended the SystemTap language into a higher-level "macro language" called stap++. [4] My stap++ interpreter, implemented in Perl, interprets and executes stap++ source directly and invokes the SystemTap command-line tools internally. Interested readers can look at my open source stapxx repository on GitHub, which also contains many complete debugging tools implemented directly in the stap++ macro language.

Application of SystemTap in production

DTrace's great influence today is inseparable from the famous DTrace evangelist Brendan Gregg, whose name we mentioned earlier. He first worked in the Solaris file system optimization team at Sun Microsystems and was among the earliest DTrace users. He has written several books on DTrace and on performance optimization, as well as many blog posts about dynamic tracing.

After I left Taobao in 2011, I lived in Fuzhou for a year of what I call "pastoral life". During the last few months of that time, I systematically studied DTrace and dynamic tracing through Brendan's public blog. I had actually first heard of DTrace from a comment by a friend on Weibo who only mentioned the name, so I wanted to find out what it was. And what a discovery it turned out to be: a completely new world that totally changed my view of the entire computing world. So I spent a great deal of time reading through Brendan's personal blog post by post, until one day I had a feeling of sudden enlightenment and finally grasped the essence of dynamic tracing.

In 2012 I ended my "pastoral life" in Fuzhou and came to the United States to join the CDN company mentioned earlier. I immediately started applying SystemTap, and the whole dynamic tracing methodology I had learned, to the company's global network, to solve very strange online problems. I noticed that many engineers there, when troubleshooting online problems, liked to plant their own instrumentation throughout the software: mainly in the business code, and even in the code of system software such as Nginx, adding counters or logging "points" of their own. In this way a huge volume of logs would be collected online in real time, fed into a dedicated database, and then analyzed offline. The cost of this approach is obviously enormous: not only the sharply increased cost of modifying and maintaining the business system itself, but also the online cost of collecting and storing the full volume of instrumentation data. Worse still, it happens all the time that Zhang San adds a collection point in the business code today, Li Si adds a similar one tomorrow, and these points are later forgotten in the code base with nobody paying attention to them any more; in the end they pile up and make the code base messier and messier. Such intrusive modifications make the corresponding software, whether system software or business code, harder and harder to maintain.

There are two major problems with this kind of manual instrumentation: one is "too much", the other is "too little". "Too much" means that we tend to collect information we do not actually need, just because it might come in handy some day, incurring unnecessary collection and storage costs. Much of the time we could answer our questions through sampling, yet we habitually collect full data across the whole network, which is obviously very expensive. "Too little" means that it is very hard to plan all the collection points we will ever need up front; after all, nobody is a prophet who can foresee the problems that will need investigating in the future. So whenever a new problem appears, the information produced by the existing collection points is almost never enough. That leads to constantly modifying the software and constantly redeploying it, greatly increasing the workload of development and operations engineers and raising the risk of even bigger failures in production.

Another brute-force debugging practice that some operations engineers resort to is to pull a machine out of service, set up a series of temporary firewall rules to block user traffic or their own monitoring traffic, and then poke around at will on that production machine. This is a cumbersome and disruptive process. It takes the machine out of service, reducing the overall throughput of the online system, and problems that only reproduce under real traffic can no longer be reproduced at all. You can imagine how painful these crude practices are.

In fact, SystemTap and dynamic tracing solve these problems very well, with an effect best described as "moistening everything silently": we do not need to modify the software stack itself at all, neither the system software nor the business software. I often write targeted tools that place carefully arranged probes on a few key "acupoints" of the system. These probes collect their own information, and the debugging tool aggregates it and prints it to the terminal. This way, on one machine or a few machines, I can quickly obtain the key information I want by sampling, quickly answer some very basic questions, and set a direction for subsequent debugging.
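Here is a sketch of what such a small targeted tool can look like: it times a single user-space function in a live process and prints a latency histogram on exit. The binary path and function name are hypothetical placeholders; a real tool would target, say, an Nginx request-processing function, and requires the target's debugging symbols.

    # Sketch: measure wall-clock latency of one user-space function in a
    # running process and print a log2 histogram after 10 seconds.
    global start_time, latency

    probe process("/usr/sbin/myserver").function("handle_request") {
        start_time[tid()] = gettimeofday_us()
    }

    probe process("/usr/sbin/myserver").function("handle_request").return {
        t = start_time[tid()]
        if (t) {
            latency <<< gettimeofday_us() - t     # feed the aggregate
            delete start_time[tid()]
        }
    }

    probe timer.s(10) { exit() }                  # sample for 10 seconds only

    probe end {
        if (@count(latency))
            print(@hist_log(latency))
    }

A few dozen lines like these, run for a few seconds against a live process, often answer the basic "where does the time go" question without touching the target software at all.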

As I said earlier, instead of manually instrumenting the production system to write logs and then building a pipeline to collect and store them, it is better to treat the production system itself as a "database" that can be queried directly. We get the information we want from this "database" directly, safely and quickly, without leaving any trace behind or collecting data we do not need. Following this idea, I have written many debugging tools, most of which are open source on GitHub. Many target system software such as Nginx, LuaJIT and the operating system kernel, and some target higher-level web frameworks such as OpenResty. Interested readers can take a look at the nginx-systemtap-toolkit, perl-systemtap-toolkit and stapxx repositories.

 My SystemTap tool cloud

Using these tools, I successfully located countless online problems, some of which I even found accidentally. Here are just a few examples.

In the first example, I used a SystemTap-based flame graph tool to analyze our online Nginx processes and found that a considerable portion of CPU time was spent on a very strange code path. It turned out to be temporary debugging code left behind by a colleague when debugging an old problem long ago, a bit like the "instrumentation code" we mentioned earlier. Although the original problem had long since been solved, the code was forgotten both online and in the company's code repository, and because this expensive "leftover instrumentation" was never removed, it kept incurring a large performance penalty that nobody noticed. So my discovery was accidental: I simply had the tool draw a flame graph from samples, and as soon as I looked at the graph I could see the problem and take action. That is a very, very effective way of working.

The second example concerned a small number of requests with very high latency, the so-called "long-tail requests". Their number was very small, but their latency could reach the order of seconds. A colleague guessed it was a bug in my OpenResty. Unconvinced, I immediately wrote a SystemTap tool to sample, online, the requests whose total latency exceeded one second and analyze them. The tool measured the internal time distribution of these problem requests, including the latency of each typical I/O operation during request processing as well as pure CPU computation time. It turned out that the slowness came from OpenResty's accesses to a DNS server written in Go. I then had the tool print the contents of those slow DNS queries and found that CNAME expansion was involved. Obviously this had nothing to do with OpenResty, and there was now a clear direction for further troubleshooting and optimization.

In the third example, we noticed that the proportion of timed-out requests in one data center was significantly higher than in the others, though still only about 1%. At first we naturally suspected some detail of the network protocol stack, but then, with a series of dedicated SystemTap tools, I analyzed the internals of the timed-out requests directly and found that it was actually a hard disk configuration problem. Going from the network all the way to the hard disk, this kind of debugging is very interesting; first-hand data quickly put us on the right track.

In another example, we observed from flame graphs that the Nginx processes were spending quite a lot of CPU time opening and closing files, so we naturally enabled Nginx's own open-file-handle cache, but the optimization had little visible effect. A new flame graph then showed that the "spin lock" protecting the metadata of Nginx's file-handle cache was now consuming a lot of CPU time. The reason was that we had enabled the cache but set its size too large, so the overhead of the metadata spin lock cancelled out the benefits of caching. All of this was clearly visible in the flame graphs. Without flame graphs, blindly experimenting, we might have concluded that Nginx's file-handle cache was useless, instead of thinking of tuning the cache parameters.

In the last example, after a routine deployment we noticed in the latest online flame graphs that the compilation of regular expressions was taking a lot of CPU time, even though we had enabled caching of compiled regexes online. Obviously, the number of regular expressions used by our business system had exceeded the cache size we originally configured, so the natural next step was to enlarge the online regex cache. After that, regex compilation no longer showed up in the online flame graphs.

These examples show that different data centers, different machines, and even the same machine at different times all develop their own new problems. What we need is to analyze and sample the problem itself directly, rather than guessing and trying at random. With powerful tools, troubleshooting becomes a matter of getting twice the result with half the effort.

Since founding OpenResty Inc., we have developed OpenResty XRay, a new generation dynamic tracing platform, and we no longer use open source solutions like SystemTap by hand.

Flame graphs

We have mentioned flame graphs several times already. So what is a flame graph? It is a great visualization method invented by Brendan Gregg, whom we have repeatedly mentioned.

A flame graph is like an X-ray image of a software system: it naturally fuses information from the time and space dimensions into a single picture, presented in a very intuitive form, and thereby reveals many quantitative, statistical regularities in the system's performance.

 Example of Nginx's C-level on CPU flame diagram

For example, the most classic kind of flame graph shows how a program's CPU time is distributed across all of its code paths. From this distribution we can see intuitively which code paths consume the most CPU time and which are irrelevant. Furthermore, we can generate flame graphs at different software levels: for example one at the C/C++ level of the system software, and another at a higher level, such as the dynamic scripting language level of Lua or Perl code. Flame graphs at different levels provide different perspectives and reveal the code hotspots at the corresponding level.
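On-CPU flame graphs like these are typically produced by sampling backtraces. Below is a minimal sketch, in SystemTap, of the collection side: it samples user-space backtraces of one target process (passed with -x PID) at each profiling tick and prints stack/count pairs, which, after address-to-symbol resolution, are the raw material a flame graph generator consumes. Real tools such as those in the nginx-systemtap-toolkit repository mentioned above are more elaborate (and in practice you also pass the target binary and its libraries to stap via -d/--ldd so the stacks can be unwound).

    # Sketch: sample user-space backtraces of the target process.
    global stacks

    probe timer.profile {
        if (pid() == target())
            stacks[ubacktrace()]++       # key: raw return-address chain
    }

    probe timer.s(10) { exit() }

    probe end {
        foreach (bt in stacks- limit 1000)
            printf("%s\t%d\n", bt, stacks[bt])
    }

The output is just "hottest stacks first"; the flame graph is then a matter of folding and rendering these samples with Brendan Gregg's flame graph tooling.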

Because I maintain open source software communities such as OpenResty, we have our own mailing lists, and I often encourage users who report problems to provide their own flame graphs, so that we can comfortably "let the picture do the talking" and quickly point out performance problems for them without repeated trial and error. This saves a lot of time on both sides and keeps everyone happy.

It is worth noting that even when faced with an unfamiliar program whose source code we have never read, we can usually still infer roughly where its performance problem lies just by looking at a flame graph. That is a wonderful thing, because most programs are reasonably well written: they build abstraction levels as they go, for example through functions, and the names of those functions usually carry semantic information and are displayed directly on the flame graph. From those names we can roughly guess what the corresponding functions, and even whole code paths, are doing, and thus infer the program's performance problems. So, once again, naming in program code really matters, not only for reading source code but also for debugging. Conversely, the flame graph also gives us a shortcut for learning an unfamiliar software system; after all, the important code paths are almost always the ones that take the most time and therefore deserve our attention first, or else something is badly wrong with the way the software is built.

In fact, the flame graph can be extended to other dimensions. The flame graphs we just discussed show how the time a program spends running on the CPU is distributed across all code paths; that is the on-CPU time dimension. The time a process spends not running on any CPU is also very interesting; we call it off-CPU time. Off-CPU time generally means the process is sleeping for some reason, for example waiting on a system-level lock, or having been preempted by a very busy process scheduler and deprived of its CPU time slice. In these situations the process cannot run on a CPU, yet a lot of wall-clock time still passes. A flame graph over this dimension gives a very different picture: it lets us analyze the overhead of system-level locks (system calls such as sem_wait), blocking I/O operations (system calls such as open and read), and CPU contention between processes or threads. All of this is visible at a glance in an off-CPU flame graph.
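A rough sketch of the idea, assuming a single-threaded target process passed with -x PID: watch the kernel scheduler's context switches and accumulate the time the target spends switched off the CPU. Real off-CPU flame graph tools also record the backtrace at the moment the process is switched out, which this sketch omits for brevity.

    # Sketch: accumulate total off-CPU (sleep/wait) time of the target process
    # by watching scheduler context switches.
    global off_since, total_off_us

    probe kernel.trace("sched_switch") {
        if (task_tid($prev) == target())
            off_since = gettimeofday_us()          # switched off a CPU
        if (task_tid($next) == target() && off_since) {
            total_off_us <<< gettimeofday_us() - off_since
            off_since = 0
        }
    }

    probe timer.s(10) { exit() }

    probe end {
        if (@count(total_off_us))
            printf("off-CPU intervals: %d, total: %d us\n",
                   @count(total_off_us), @sum(total_off_us))
    }

Attach a backtrace to each interval instead of just summing them, and you have the raw data for the off-CPU flame graphs discussed next.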

The off-CPU flame graph was a bold experiment of my own. I remember reading one of Brendan's blog posts about off-CPU time at Lake Tahoe, on the border between California and Nevada, and it occurred to me that off-CPU time might be substituted for on-CPU time to render a flame graph. When I got back I tried it on the company's production systems, using SystemTap to draw off-CPU flame graphs of Nginx processes. After I announced this successful experiment on Twitter, Brendan contacted me and said he had tried the same approach before, but the results had not been satisfactory. I think that was because he applied it to multi-threaded programs such as MySQL: because of thread synchronization, multi-threaded programs produce a lot of noise in the off-CPU graph, which easily drowns out the really interesting parts. The scenario in which I applied off-CPU flame graphs was a single-threaded program like Nginx, where the graph often immediately highlights the system calls that block the Nginx event loop, lock operations such as sem_wait, or forced intervention by the preemptive process scheduler, all of which help analyze a wide range of performance problems. In such an off-CPU flame graph the only "noise" is the event loop's own epoll_wait system call, which is easy to identify and ignore.

 Off CPU time

Similarly, we can extend the flame graph to other system metric dimensions, such as the number of bytes leaked. I once used a "memory leak flame graph" to quickly locate a very subtle leak in the Nginx core; because the leak occurred in Nginx's own memory pool, traditional tools such as Valgrind and AddressSanitizer could not catch it. On another occasion a "memory leak flame graph" easily located a leak in an Nginx C module written by a European developer; that leak was so subtle and slow that it had puzzled him for a long time, and I located it for him without reading a line of his source code. Thinking back, it still feels a bit magical. Of course, we can also extend flame graphs to other system metrics such as file I/O latency and data volume. This really is a great visualization method that can be applied to many completely different classes of problems.

Our OpenResty XRay product supports automatic sampling of various types of flame graphs, including C/C++-level flame graphs, Lua-level flame graphs, off-CPU and on-CPU flame graphs, dynamic memory allocation flame graphs, memory object reference flame graphs, file I/O flame graphs, and so on.

Methodology

So far we have introduced the flame graph, a sampling-based visualization method that is very general: no matter what system it is or what language it is written in, we can usually obtain a flame graph over some performance dimension and analyze it with ease. More often, though, we need to analyze and troubleshoot deeper and more specific problems, and then we need to write a series of specialized dynamic tracing tools that approach the real problem step by step.

The strategy we recommend in this process is "small steps, repeated questioning". That is, we do not expect to write one huge, complicated debugging tool that collects all the information that might ever be needed and solves the final problem in one shot. On the contrary, we decompose the hypothesis about the final problem into a series of smaller hypotheses, then explore and verify step by step, constantly confirming or correcting our direction and adjusting our trajectory and assumptions as we close in on the final problem. One advantage is that the tool at each step can be simple enough that the chance of making mistakes in the tool itself is greatly reduced. Brendan has also observed that multi-purpose, complex tools are far more likely to contain bugs, and a wrong tool gives wrong information, misleading us into wrong conclusions. That is very dangerous. Another advantage of simple tools is that the overhead on the production system during sampling stays small, since few probes are introduced and each probe handler does little computation. Each of these small debugging tools is targeted and can be used on its own, so the chance of reusing them later is much higher. Overall, this debugging strategy is very beneficial.

It is worth mentioning that we reject the so-called "big data" approach to debugging: we do not try to collect as much information and data as possible in one go. Instead, at each stage and each step, we collect only the information we actually need for the current step. At every step, the information gathered so far either supports or revises our original plan and direction, and guides the writing of the next, more detailed analysis tool.

In addition, for online events that occur at a very low frequency, we usually adopt a "lying in wait" approach: we set a threshold or some other filtering condition and wait for the interesting events to be caught by our probes. For example, when tracing rare, high-latency requests, the debugging tool first filters for requests whose latency exceeds a threshold, and then collects as much detail as possible about those requests only. This strategy is the exact opposite of the traditional practice of collecting as much full statistical data as possible; it is precisely because our sampling and analysis is targeted and specific that we can keep cost and overhead to a minimum and avoid wasting resources.
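A sketch of this threshold-filtering idea, reusing the entry/return timing pattern of the earlier histogram sketch (binary path and function name are again hypothetical placeholders): trace every request cheaply, but only emit details for the rare ones that cross the latency threshold.

    # Sketch: only report requests slower than 1 second (1,000,000 us),
    # silently ignoring the vast majority of normal requests.
    global begin_us

    probe process("/usr/sbin/myserver").function("handle_request") {
        begin_us[tid()] = gettimeofday_us()
    }

    probe process("/usr/sbin/myserver").function("handle_request").return {
        t = begin_us[tid()]
        delete begin_us[tid()]
        now = gettimeofday_us()
        if (t && now - t > 1000000)
            printf("%s: slow request took %d us\n",
                   ctime(gettimeofday_s()), now - t)
    }

A real tool would, at this point, also dump whatever per-request details are needed (the I/O breakdown, the DNS query, and so on), but only for these captured outliers.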

Through its knowledge base and reasoning engine, our OpenResty XRay product applies these dynamic tracing methodologies automatically: it systematically narrows down the scope of a problem step by step until the root cause is located, then reports it to the user together with suggested optimizations or fixes.

Knowledge is power

I think dynamic tracing is a good illustration of the old saying that "knowledge is power".

Through dynamic tracing tools, we can turn our existing knowledge of systems into practical tools that solve real problems. Concepts that were purely abstract in computer science textbooks, such as the virtual file system, the virtual memory system, or the process scheduler, become vivid and concrete: for the first time we can truly observe their behavior and their statistical patterns in a real production system, without modifying the source code of the operating system kernel or the system software at all. This non-invasive, real-time observability is what dynamic tracing gives us.

This technology is like the heavy black-iron sword of Yang Guo in Jin Yong's novels: someone who knows no martial arts at all cannot wield it, but with some foundation you can keep improving with it, until even a wooden sword can prevail over the world. If you have some systems knowledge, you can swing this "sword" and solve basic but previously unimaginable problems, and the more systems knowledge you accumulate, the better the "sword" serves you. What is more, every new thing you learn immediately lets you solve new problems. Conversely, because these debugging tools let us solve so many problems, and let us measure and learn many interesting micro- and macro-level statistical patterns of the production system, those visible results become a powerful motivation to learn more systems knowledge. It naturally becomes a "leveling-up artifact" for ambitious engineers.

I once said on Weibo that "tools that encourage engineers to keep learning in depth are good tools with a bright future". This is a virtuous cycle of mutual reinforcement.

Open source and debugging symbols

As mentioned earlier, dynamic tracing can turn a running software system into a read-only, real-time database that can be queried, but this usually requires the software to have reasonably complete debugging symbols. So what are debugging symbols? They are metadata generated by the compiler at build time for the purpose of debugging. This metadata maps many details of the compiled binary, such as the addresses of functions and variables and the memory layout of data structures, back to the names of the abstract entities in the source code: function names, variable names, type names, and so on. The debug symbol format most commonly seen in the Linux world is called DWARF (the same as the English word "dwarf"). It is these debugging symbols that give us a map and a lighthouse in the cold, dark binary world, making it possible to interpret and reconstruct the semantics of every detail of the low-level world and to rebuild the high-level abstractions and their relationships.

"Dwarves"

Generally only open source software readily provides debugging symbols, because most closed-source software withholds them for confidentiality, to make reverse engineering and cracking harder. One example is Intel's IPP library, which provides optimized implementations of many common algorithms for Intel chips. We once tried an IPP-based gzip compression library on our production systems, but unfortunately it would crash online from time to time. Closed-source software without debugging symbols is obviously very painful to debug: we communicated remotely with Intel engineers many times without being able to locate or solve the problem, and finally had to give up. If the source code, or at least the debugging symbols, had been available, the debugging process would very likely have been much simpler.

Brendan Gregg has also talked about the relationship between open source and dynamic tracing in his presentations. In particular, the power of dynamic tracing is maximized when our entire software stack is open source. The stack usually includes the operating system kernel, various pieces of system software, and higher-level programs written in high-level languages. When the whole stack is open source, we can easily extract the information we want from every software level and turn it into knowledge and action.


Because the more sophisticated forms of dynamic tracing rely on debugging symbols, debugging symbols generated incorrectly by some C compilers become a real problem: erroneous debug information greatly reduces the effectiveness of dynamic tracing and can even block our analysis outright. For example, the widely used C compiler GCC generated rather poor-quality debugging symbols before version 4.5, especially with compiler optimizations enabled, but has improved greatly since 4.5.

Our OpenResty XRay dynamic tracing platform continuously captures the debug symbol packages and binary packages of common open source software from the public Internet, then analyzes and indexes them. At present this database has indexed dozens of terabytes of data.

Linux kernel support

As mentioned earlier, dynamic tracing is generally based on the operating system kernel. For the Linux kernel, which is in extremely wide use, gaining dynamic tracing support has been a long and arduous process, perhaps mainly because Linus, Linux's leader, long felt that such technology was unnecessary.

At first, engineers at Red Hat prepared a kernel patch called utrace to support user-mode dynamic tracing; this is what frameworks like SystemTap originally relied on. For a long time, Linux distributions in the Red Hat family, such as RHEL, CentOS and Fedora, shipped this utrace patch by default. In that utrace-dominated era, SystemTap was only really meaningful on Red Hat-style operating systems. The utrace patch was never merged into the mainline Linux kernel, and was eventually superseded by another, compromise solution.

The Linux mainline has long had the kprobes mechanism, which can dynamically place probes at the entry and exit of a specified kernel function and run user-defined probe handlers there.

Dynamic tracing support for user mode came late, after countless discussions and repeated revisions. Starting from version 3.5 of the official Linux kernel, the inode-based uprobes mechanism can safely place dynamic probes at the entry of user-mode functions and execute probe handlers there. Later, starting from kernel 3.10, the uretprobes mechanism [5] makes it possible to place dynamic probes on the return addresses of user-mode functions as well. Together, uprobes and uretprobes can finally replace the main functionality of utrace, and the utrace patch has since completed its historical mission. SystemTap now automatically uses uprobes and uretprobes on newer kernels instead of relying on the utrace patch.
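For reference, this is roughly what those two kernel mechanisms look like from a SystemTap user's point of view: a process().function() probe is planted with uprobes, and its .return variant with uretprobes. The bash/readline pair below is just a commonly used illustration and assumes bash's debugging symbols are available and the binary lives at /bin/bash.

    # Sketch: a user-space probe pair on bash's readline() function.
    probe process("/bin/bash").function("readline") {
        printf("readline() called in pid %d\n", pid())        # via uprobes
    }

    probe process("/bin/bash").function("readline").return {
        # $return is the C char* returned by readline()       # via uretprobes
        printf("readline() returned: %s\n", user_string2($return, "<null>"))
    }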

In recent years, Linux mainline developers extended BPF, the bytecode engine originally used by netfilter for firewalls, into what is now called eBPF, which can serve as a more general in-kernel virtual machine. Through this mechanism we can actually build a kernel-resident dynamic tracing virtual machine in Linux similar to DTrace, and there have indeed been attempts in this direction, such as the BPF Compiler Collection (BCC), which uses the LLVM toolchain to compile C code into bytecode accepted by the eBPF virtual machine. Overall, Linux's dynamic tracing support is getting better and better; in particular, since kernel 3.15 the kernel mechanisms related to dynamic tracing have finally become reasonably robust and stable. Unfortunately, the design of eBPF imposes severe limitations, so dynamic tracing tools built on eBPF have stayed at a relatively simple level, still in the "Stone Age", as I like to put it. Although SystemTap has recently begun to support an eBPF runtime, the stap language features supported by that runtime are extremely limited, and even Frank, SystemTap's lead, has expressed concern in this regard.

Hardware tracing

We have seen that dynamic tracing can play a very important role in analyzing software systems, so it is natural to wonder whether similar methods and ideas could also be used to trace hardware.

We know that the operating system interacts directly with hardware, so by tracing device drivers or other parts of the operating system we can indirectly analyze the behavior and problems of the hardware attached to it. Modern hardware also has built-in registers for performance statistics; for example Intel's CPUs provide hardware performance counters, and by reading these special registers from software we can obtain a great deal of interesting information directly about the hardware. The perf tool in the Linux world was originally created for this purpose, and even virtual machine software such as VMware emulates such special hardware registers. On top of these registers, interesting debugging tools such as Mozilla rr have been built to record and replay process execution efficiently.

Placing dynamic probes directly inside hardware and tracing it dynamically probably still belongs to science fiction for now; readers interested in the subject are welcome to contribute ideas and information.

Analyzing the remains of dead processes

Everything we have looked at so far is the analysis of living processes, that is, of running programs. What about dead processes? For a dead process, the most common form it takes is the core dump file produced when the process crashes. We can actually perform a great deal of in-depth analysis on the "remains" a process leaves behind when it dies, making it possible to determine the cause of death. In this sense, we programmers play the role of "forensic examiners".

The most classic tool for analyzing the remains of dead processes is the famous GNU Debugger (GDB); the LLVM world has a similar tool called LLDB. GDB's native command language is obviously very limited: if we analyze a core dump manually, command by command, we can extract only a limited amount of information. In practice most engineers analyze a core dump only with the bt full command to look at the current C call stack, the info reg command to check the values of the CPU registers, or by examining the machine code at the crash site. Far more information actually lies hidden in the complex binary data structures allocated on the heap, and scanning and analyzing those structures clearly calls for automation: we need a programmable way to write complex core dump analysis tools.

To meet this need, GDB has built-in support for Python scripting. We can use Python to implement much more complex GDB commands and perform in-depth analysis of things like core dumps. I have written many advanced GDB-based debugging tools in Python, many of them counterparts of SystemTap tools for analyzing live processes. As with dynamic tracing, with the help of debugging symbols we can find a well-lit path through the dark "world of the dead".


One problem with this approach, however, is that developing and porting tools becomes a heavy burden. Traversing C-style data structures in a scripting language such as Python is not fun; this strange Python code can really drive people crazy. Moreover, the same tool has to be written once in SystemTap's scripting language and again in GDB's Python: both implementations need careful development and testing, and although they do similar things, the implementation code and the corresponding APIs are completely different. (It is worth mentioning that the LLDB tool in the LLVM world offers similar Python scripting support, but its Python API is incompatible with GDB's.)

Of course, GDB can also be used to analyze live programs, but compared with SystemTap its most obvious problem is performance. I once compared the SystemTap version of a fairly complex tool with the GDB Python version: their performance differed by an order of magnitude. GDB was clearly not designed for this kind of online analysis; it was built with interactive use in mind. It can run in batch mode, but its internal implementation imposes very serious performance limits. The thing that drives me craziest is GDB's internal abuse of longjmp for routine error handling, which causes serious performance loss and shows up very clearly on GDB flame graphs generated with SystemTap. Fortunately, the analysis of dead processes can always be done offline; there is no need to do it online, so timing is not that critical. Unfortunately, some of our very complicated GDB Python tools take several minutes to run, which is frustrating even offline.

I used SystemTap to analyze the performance of GDB plus Python and, based on the flame graphs, located the two biggest execution hotspots inside GDB. I then submitted two C patches to the GDB project, one for Python string operations and one for GDB's error-handling path; together they improved the overall running speed of our most complicated GDB Python tool by 100%. GDB has officially merged one of the patches. Using dynamic tracing to analyze and improve a traditional debugging tool is also great fun.

I have open-sourced on GitHub many of the GDB Python debugging tools I wrote for my own work, for interested readers to look at; they are mostly in the nginx-gdb-utils repository and mainly target Nginx and LuaJIT. I used these tools to help Mike Pall, the author of LuaJIT, locate more than ten internal bugs in LuaJIT. Most of these bugs had been hidden for years, being very subtle problems inside the just-in-time (JIT) compiler.

Since a dead process no longer changes over time, we might as well call this kind of core dump analysis "static tracing".

 GDB debug commands I wrote

Through its Y language compiler, our OpenResty XRay product lets analysis tools written in Y also target platforms such as GDB, thereby automating in-depth analysis of core dump files.

Traditional debugging technology

Speaking of GDB, we should also talk about the differences and connections between dynamic tracing and traditional debugging methods. Careful, experienced engineers will have noticed that the "predecessor" of dynamic tracing is setting breakpoints in GDB and then performing a series of checks at those breakpoints. The difference is that dynamic tracing always emphasizes non-interactive batch processing and the lowest possible performance overhead, whereas tools like GDB were built for interactive use, so their implementations do not aim for production safety or low overhead; in general GDB's overhead is large. GDB is also based on the very old ptrace system call, which has all kinds of pitfalls and problems; for example, ptrace changes the parent of the target process and does not allow multiple debuggers to attach to the same process at the same time. So, in a sense, GDB can only simulate a "poor man's dynamic tracing".

Many beginners like to use GDB for "single-stepping", which in real industrial development and production is often very inefficient. Single-stepping tends to change the timing of program execution, so many timing-sensitive problems can no longer be reproduced. In addition, in a complex software system, single-stepping easily leads people astray among the countless code paths, lost on the proverbial "garden path", seeing only the trees and not the forest.

Therefore, for everyday debugging during development, we recommend the simplest and seemingly dumbest method: printing output on the critical code paths. By reading the logs and other output we obtain a complete context within which the program's behavior can be analyzed effectively; this approach is especially efficient when combined with test-driven development. Obviously, adding logs and instrumentation in this way is impractical for online debugging, as discussed at length earlier. Traditional performance analysis tools, on the other hand, such as Perl's DProf, C's gprof, and the profilers of other languages and environments, usually require the program to be recompiled with special options or rerun in a special way. Performance analysis tools that require such special handling and cooperation are obviously not suitable for live, online analysis.

A messy debugging world

Today's debugging world is rather messy: as we have seen, there are DTrace, SystemTap, eBPF/BCC, GDB, LLDB, and many more that we have not mentioned and that you can find online. Perhaps this reflects, in its own way, the chaos of the real world we live in.

Many years ago it occurred to me that we could design and implement a unified debugging language. Later, at OpenResty Inc., I finally implemented such a language, called the Y language. Its compiler can automatically generate the input code accepted by the various debugging frameworks and technologies: D code accepted by DTrace, stap scripts accepted by SystemTap, Python scripts accepted by GDB, the API-incompatible Python scripts accepted by LLDB, bytecode accepted by eBPF, and even the mixture of C and Python code accepted by BCC.

If a debugging tool we design needs to be ported to several different debugging frameworks, then, as I noted earlier, the manual porting workload is very large. With a unified Y language, whose compiler automatically converts the same Y code into input code for the different debugging platforms and optimizes it for each of them, we only need to write each debugging tool once, in Y. That is a huge relief: the tool author no longer has to learn all the messy details of each specific debugging technology, or step into each technology's "pits" personally.

The Y language is now available to users as part of the OpenResty XRay product.

Some readers may ask why it is called Y. My given name is Yichun, and the first letter of its pinyin spelling is Y. More importantly, though, it is a language for answering questions that begin with "why", and "why" is homophonic with Y.

OpenResty XRay

OpenResty XRay is a commercial product from OpenResty Inc. It helps users gain deep insight into the behavior of various software systems, online or offline, without requiring any cooperation from the target programs, and effectively analyzes and locates all kinds of performance, reliability and security problems. Technically, its tracing capabilities are stronger than SystemTap's, its performance and ease of use are far better, and it also supports automatic analysis of program remains such as core dump files.

Interested readers are welcome to contact us and apply for a free trial.

 OpenResty XRay Console Dashboard

Learn more

If you would like to learn more about dynamic tracing technologies, tools, methodologies and case studies, you can follow the OpenResty Inc. blog site. You are also welcome to scan the QR code of our WeChat official account:


You are also welcome to try our OpenResty XRay Commercial products.

The blog of Brendan Gregg, a pioneer of dynamic tracing, also has a great deal of wonderful content.

Acknowledgement

Many friends and family members helped with this article. First, I thank Shi Rui for the hard work of transcription; this article actually began as an hour-long voice talk. I also thank the many friends who carefully reviewed it and gave feedback, as well as my father and my wife for their patience and help during the writing.

About the author

Zhang Yichun is the founder of the open source OpenResty® project and the CEO and founder of OpenResty Inc.

Zhang Yichun (GitHub ID: agentzh) was born in Jiangsu, China, and now lives in the San Francisco Bay Area in the United States. He was an advocate and leader of China's early open source technology and culture, and has worked for several internationally renowned high-tech companies, such as Cloudflare, Yahoo! and Alibaba. A pioneer of "edge computing", "dynamic tracing" and "machine programming", he has more than 22 years of programming experience and 16 years of open source experience. As the leader of the OpenResty® open source project, used by more than 40 million global domain names, he founded the high-tech company OpenResty Inc., located in the heart of Silicon Valley. Its two main products, OpenResty XRay (which uses dynamic tracing technology) and OpenResty Edge (an all-purpose gateway software well suited to microservices and distributed traffic), are widely favored by many listed and large enterprises worldwide. Outside OpenResty, Zhang Yichun has contributed more than a million lines of code to many open source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM and Perl, and has written more than 60 open source software libraries.

Translation

We provide an English translation and the original Chinese text. Readers are welcome to provide translations into other languages; as long as the full text is translated without omission, we will consider using it. Thank you very much!


  1. Neither SystemTap nor OpenResty XRay has these limitations and shortcomings.  

  2. After I founded OpenResty Inc., our team also contributed a large number of patches to the open source SystemTap project, from new features to bug fixes.  

  3. These shortcomings of SystemTap do not exist in the dynamic tracking platform of OpenResty XRay.  

  4. The stap++ project is no longer maintained and has been replaced by OpenResty XRay, our new-generation dynamic tracing platform and toolset.  

  5. In fact, uretprobes has a big problem in its implementation, because it directly modifies the target program's runtime stack, which breaks many important features such as stack unwinding. Our OpenResty XRay re-implements an effect similar to uretprobes without these shortcomings.