Lua code running inside OpenResty or Nginx servers is very common nowadays since people want both performance and flexibility out of their nonblocking web servers. Some people use Lua for very simple tasks like checking and modifying certain request headers and response bodies, while others use Lua to build very complicated web applications, CDN software, and API gateways. Lua is known for its simplicity, small memory footprint, and high execution efficiency, especially when using Just-in-Time (JIT) compilers like LuaJIT. But still, sometimes the Lua code running atop OpenResty or Nginx servers may consume too much CPU due to the programmer's coding mistakes, calls into expensive C/C++ library code, or other reasons.
The best way to quickly find all the CPU performance bottlenecks in an online OpenResty or Nginx instance is the Lua-land CPU flame graph sampling tool provided by the OpenResty XRay product. It does not require any changes to the target OpenResty or Nginx processes, nor does it have any noticeable impact on the processes in production.
This article will introduce the idea of Lua-land CPU flame graphs and use OpenResty XRay to produce real flame graphs for several small and standalone Lua examples. We choose small examples because it is much easier to predict and verify the profiling results. The same idea and tools apply equally well to the most complex Lua applications in the wild. We have had many successes in using this technique and visualization to help our enterprise customers with very busy sites and applications in the past few years.
What is a Flame Graph

Flame graphs are a visualization method invented by Brendan Gregg for showing how a system resource or a performance metric is quantitatively distributed across all the code paths in the target software.
The system resource or metric could be CPU time, off-CPU time, memory usage, disk usage, latency, or anything else you can imagine.
The code paths here are defined by the backtraces in the target software's code. A backtrace is usually a stack of function call frames, as in the output of GDB's bt command or in the exception error message of a Python or Java program. For example, below is a Lua program's backtrace sample:
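An illustrative sketch of such a backtrace (only the base and top frames are taken from the discussion that follows; the middle frames are elided here):

```text
C:ngx_http_lua_ngx_timer_at
... (intermediate Lua function call frames omitted) ...
access_by_lua.lua:130
```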
In this sample, the Lua stack grows from the base frame access_by_lua.lua:130 all the way to the top frame C:ngx_http_lua_ngx_timer_at. It clearly shows how various Lua or C functions call each other, forming an approximate representation of a "code path".
When we say "all code paths", we actually mean it in a statistical sense rather than literally iterating through every single code path in the program. Obviously the latter would be prohibitively expensive in reality due to combinatorial explosion. We just make sure that all code paths with nontrivial overhead show up in our graphs and that we can quantify their cost with good enough confidence.
In this article we focus on the type of flame graphs which show how the CPU time (or CPU resources) is quantitatively distributed across all the Lua code paths in a target OpenResty or Nginx process (or processes), hence the name "Lua-Land CPU Flame Graphs".
The header picture of this article shows a sample flame graph and we will seeseveral more in later parts of this post.
Why Flame Graphs
Flame Graphs are a great way to show the “big picture” of all the bottlenecks quantitativelyin a single small graph, no matter how complex the target software is.
Traditional profilers usually throw a ton of details and numbers at the user. The user may lose sight of the whole picture and go down rabbit holes for things that do not really matter. Another drawback of traditional profilers is that they just give you the latencies of all the functions, making it hard to see the contexts of these function calls, not to mention that the user also has to distinguish between the exclusive time and inclusive time of a function call.
On the other hand, Flame Graphs can squeeze a great deal of information very compactly into a limited-size graph (usually fitting in a single screen). Code paths which do not matter fade out naturally while the truly important code paths stand out. No more, no less, just the right amount of information for the user.
How to read a Flame Graph
Flame Graphs might be a bit tricky to read for a newcomer, but with a little guidance the user will find them quite intuitive to understand. A Flame Graph is a two-dimensional graph. The y-axis shows the code (or data) context, i.e., the backtraces of the target programming language, while the x-axis shows what percentage of the system resource a particular backtrace takes. The full x-axis usually means 100% of the system resource (like the CPU time) spent on the target software. The order of backtraces along the x-axis usually does not matter since they are simply sorted alphabetically by function frame names. There are exceptions, however: I invented a type of Time-Series Flame Graphs where the x-axis means the time axis instead and the backtraces appear in time order. For this article, we only care about the classic type of flame graphs where the order along the x-axis does not matter at all.
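The aggregation behind such a graph is easy to sketch. Below is a toy Python illustration (not OpenResty XRay's actual implementation) of how sampled backtraces are folded into the per-stack percentages that a flame graph renders; the frame names are made up to mirror the examples later in this article:

```python
from collections import Counter

# Each sample is one backtrace, listed base frame first.  These samples are
# made up for illustration; a real profiler captures thousands of them.
samples = (
    [["content_by_lua", "main", "foo", "heavy"]] * 2 +
    [["content_by_lua", "main", "bar", "heavy"]] * 1
)

# Fold each backtrace into a single "frame;frame;..." line and count how
# often each unique stack was sampled -- the classic "folded stacks" input
# consumed by flame graph renderers.
folded = Counter(";".join(stack) for stack in samples)

total = sum(folded.values())
for stack, count in folded.items():
    # The width of each stack in the final graph is count / total.
    print(f"{stack} {count} ({100.0 * count / total:.1f}%)")
```

The renderer then merges common frame prefixes bottom-up, which is why all stacks sharing the same entry point form one wide base frame.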
The best way to learn how to read a flame graph is to read sample flame graphs.We will see several examples below with detailed explanation for OpenResty andNginx’s Lua applications.
Simple Lua samples
In this section, we will look at several simple Lua samples with obvious performance characteristics, and we will use OpenResty XRay to profile the real nginx processes, show the Lua-land CPU Flame Graphs, and verify the performance behaviors in the graphs. We will check different cases, like JIT compilation enabled and disabled for the Lua code, as well as Lua code calling into external C library code.
JIT compiled Lua code
We will first investigate a Lua sample program with JIT compilation enabled(which is enabled by default in LuaJIT).
Let’s consider the following standalone OpenResty application. We will use thisexample throughout this section with minor modifications for different cases.
We first prepare the application's directory layout:
mkdir -p ~/work
cd ~/work
mkdir conf logs lua
And then we create the conf/nginx.conf configuration file as follows:
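The original listing did not survive here; a minimal sketch consistent with the description that follows (the exact directives, and hence the line numbers, are assumptions, so they may not match the nginx.conf:24 frame mentioned later):

```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    # make the lua/ directory under the server prefix searchable for Lua modules
    lua_package_path "$prefix/lua/?.lua;;";

    server {
        listen 8080;

        location = /t {
            content_by_lua_block {
                require("test").main()
            }
        }
    }
}
```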
Here we load the external Lua module named test and immediately call its main function in our Lua handler for the location /t. We use the lua_package_path directive to add the lua/ directory to the Lua module search paths, since we will shortly put the aforementioned test Lua module into lua/.
The test Lua module is defined in the lua/test.lua file as follows:
local _M = {}

local N = 1e7

local function heavy()
    local sum = 0
    for i = 1, N do
        sum = sum + i
    end
    return sum
end

local function foo()
    local a = heavy()
    a = a + heavy()
    return a
end

local function bar()
    return (heavy())
end

function _M.main()
    ngx.say(foo())
    ngx.say(bar())
end

return _M
Here we define a computation-heavy Lua function, heavy(), which computes the sum of the numbers from 1 to 10 million (1e7). Then we call this heavy() function twice in function foo() and just once in function bar(). Finally, the module entry function _M.main() calls foo and bar once each in turn and prints their return values to the HTTP response body via ngx.say.
Intuitively, for this Lua handler, the foo() function should take exactly twice the CPU time taken by the bar() function, because foo() calls heavy() twice while bar() only calls heavy() once. We can easily verify this prediction in the Lua-land CPU flame graphs sampled by OpenResty XRay below.
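Before looking at the graph, we can spell out the expected split with a quick back-of-the-envelope calculation (plain arithmetic, not part of the application):

```python
# heavy() dominates the CPU time and each call costs the same, so the split
# between foo() and bar() is simply the ratio of their heavy() call counts.
foo_calls = 2   # foo() invokes heavy() twice
bar_calls = 1   # bar() invokes heavy() once

total_calls = foo_calls + bar_calls
foo_share = foo_calls / total_calls   # fraction of total heavy() time under foo()
bar_share = bar_calls / total_calls   # fraction under bar()

print(f"foo: {100 * foo_share:.1f}%  bar: {100 * bar_share:.1f}%")
# foo: 66.7%  bar: 33.3%
```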
Because we did not touch LuaJIT's JIT compiler settings in this example, JIT compilation is turned on by default, since modern OpenResty platform versions always use LuaJIT anyway (support for the standard Lua 5.1 interpreter was removed long ago).
Now we can start this OpenResty web application like this:
cd ~/work/
/usr/local/openresty/bin/openresty -p $PWD/
assuming OpenResty is installed under /usr/local/openresty/ in the current system (which is the default installation location).
To make this OpenResty application busy, we can use tools like ab and weighttp to load the URI http://localhost:8080/t, or the load generator provided by the OpenResty XRay product. Either way, while the target OpenResty application's nginx worker process is busy, we can get the following Lua-land CPU flame graph in OpenResty XRay's web console:
From this graph we can make the following observations:
All Lua backtraces in this graph stem from the same entry point, content_by_lua (nginx.conf:24), which is expected.
There are mainly two code paths shown in the graph: one going through the foo() function and the other going through the bar() function. The only difference between these two code paths is foo vs bar. This is also expected.
The left-hand-side code path involving bar is just half as wide as the right-hand-side code path involving foo. In other words, their width ratio along the x-axis is 1:2, which means that the bar code path takes only 50% of the time taken by foo. By putting your mouse onto the test.lua:bar frame (or box) in the graph, we can see that it takes 33.3% of the total samples (or total CPU time) while the test.lua:foo frame shows 66.7%. Obviously this is very accurate as compared to our prediction, even though the tool takes a sampling and statistical approach.
We do not see other code paths like ngx.say() in the graph, since such code paths simply take too little CPU time compared to the two dominating Lua code paths involving heavy(). Trivial things are simply noise which won't catch our attention in the flame graph. We always focus on the really important things and don't get distracted.
The top frame for both code paths (or backtraces) is the same: trace#2:test.lua:8. This is not really a Lua function call frame, but rather a pseudo frame indicating that it is running a JIT-compiled Lua code path, which is called a "trace" in LuaJIT's terminology (because LuaJIT is a tracing JIT compiler). This "trace" has the ID number 2 and the compiled Lua code path starts from source line 8 of the test.lua file. test.lua:8 is this Lua code line:
sum = sum + i
It is wonderful to see that our noninvasive sampling tool can get such accurate flame graphs from a standard binary build of OpenResty without any extra modules, modifications, or special build flags. The tool does not use any special features or interfaces of the LuaJIT runtime at all, not even the LUAJIT_USE_PERFTOOLS feature or its built-in profiler. Instead it uses advanced dynamic tracing technologies which simply read the information already available in the target process itself. And we are able to get good enough information even from JIT-compiled Lua code.
Interpreted Lua code

Interpreted Lua code usually results in perfectly accurate backtraces and flame graphs. If the sampling tool can handle JIT-compiled Lua code just fine, then it can only do a better job when dealing with interpreted Lua code. One interesting thing about LuaJIT's interpreter is that it is written almost completely in hand-crafted assembly code (of course, LuaJIT introduces its own assembler tool named DynASM for this).
For our continuing Lua example, we simply add the following nginx.conf snippet inside the http {} configuration block to disable the JIT compiler:
init_by_lua_block {
    jit.off()
}
And then reload or restart the server processes and still keep the traffic load.
This time we get the following Lua-land CPU flame graph:
This graph is very similar to the previous one in that:
We still see only two main code paths, the bar one and the foo one.

The bar code path still takes approximately one third of the total CPU time and the foo one still takes almost all of the remainder (i.e., about two thirds).

The entry point for all the code paths shown in the graph is still content_by_lua.
This graph still has an important difference, however: the tips of the code paths (or backtraces) are no longer "traces". This is expected since no JIT-compiled Lua code paths are possible this time. The tips or top frames are now C function frames like lj_BC_IFORL and lj_BC_ADDVV. These C function frames, marked by the C: prefix, are actually not C functions per se. Instead they are assembly code frames corresponding to LuaJIT byte code interpretation handlers, specially annotated by symbols like lj_BC_IFORL. Naturally, lj_BC_IFORL is for the LuaJIT byte code instruction IFORL while lj_BC_ADDVV is for the byte code instruction ADDVV. IFORL is for interpreted Lua for loops while ADDVV is for arithmetic additions. All of these are expected given our Lua function heavy(). There are also some auxiliary assembly routines like lj_meta_arith and lj_vm_foldarith.
By looking at the percentage numbers for these function frames, we can also understand how the CPU time is distributed inside the LuaJIT virtual machine and interpreter, paving the way for optimizing the VM and the interpreter themselves.
Calling external C/C++ functions
It is very common for Lua code to invoke external C/C++ library functions. We also want to see their proportional parts in the Lua-land CPU flame graphs, because such C function calls are initiated from within the Lua code anyway. This is also where dynamic-tracing-based profiling really shines: such external C function calls never become blind spots for the profiler [1].
Let's modify the heavy() Lua function in our ongoing example as follows:
local ffi = require "ffi"
local C = ffi.C

ffi.cdef[[
    double sqrt(double x);
]]

local function heavy()
    local sum = 0
    for i = 1, N do
        -- sum = sum + i
        sum = sum + C.sqrt(i)
    end
    return sum
end
Here we first use LuaJIT's FFI API to declare the standard C library function sqrt(), and then invoke it directly from within the Lua function heavy(). It should show up in the corresponding Lua-land CPU flame graphs.
This time we got the following flame graph:
Interestingly, we indeed see the C function frame C:sqrt showing up as the tip of the two main Lua code paths. It's also worth noting that we still see the trace#N frames near the top, which means that our FFI calls into the C function can also be JIT compiled (this time we removed the jit.off() statement from the init_by_lua_block directive).
Line-Level Flame Graphs
The flame graphs we have seen so far are all function-level flame graphs, because almost all the function frames shown in them carry only function names, not the source lines initiating the calls.
Fortunately, OpenResty XRay's Lua-land profiling tools can also include Lua source file names and line numbers in its line-level flame graphs, from which we can easily tell exactly which Lua source lines are hot. Below is such a graph for our ongoing Lua example:
We can see that there is now one more source-line frame above each corresponding function-name frame. For example, inside function main, on line 32 of file test.lua, comes the call to the foo() function. And inside the foo() function, on line 22 of file test.lua, there is the call to the heavy() function, and so on.
Line-level flame graphs are very useful to pinpoint the hottest source lineand Lua statements. This can save a lot of time when the corresponding Lua functionbody is large.
Multiple processes
It is common to configure multiple nginx worker processes for a single OpenResty or Nginx server instance on a system with multiple CPU cores. OpenResty XRay's profiling tools support sampling all the processes in a specified process group at the same time. This is useful when the incoming traffic is moderate and spread across arbitrary nginx worker processes.
Complicated Lua applications
We can also get Lua-land CPU flame graphs from very complicated OpenResty/Lua applications in the wild. For example, below is a Lua-land CPU flame graph sampled on one of our mini-CDN servers running our OpenResty Edge product, which is a complex Lua application including a dynamic CDN gateway, a geo-sensitive DNS server, and a web application firewall (WAF):
From this graph, we can see that the WAF takes most of the CPU time while the built-in DNS server also takes a good portion. Our global mini-CDN network is also securing and speeding up our own web sites like openresty.org and openresty.com.
It can surely analyze OpenResty-based API gateway software like Kong as well.
Sampling overhead
Because we use a sampling-based approach instead of full instrumentation, the overhead involved in the sampling for generating Lua-land CPU flame graphs is usually negligible, which makes such tools usable in production or online environments. Both the data volume and the CPU overhead are minimal.
If we load the target nginx worker process with requests at a constant rate, the CPU usage of the target process while the Lua-land CPU flame graph sampling is performed frequently looks like this:
This CPU usage line graph is also generated and rendered by OpenResty XRay automatically.
When we stop sampling altogether, the CPU usage curve of the same nginx worker process looks very similar:
We cannot really see any difference between these two curves with the naked eye. So the profiling and sampling cost is indeed very small.
When the tools are not sampling, the performance impact is strictly zero sincewe never change anything in the target processes anyway.
Safety
Because we use dynamic tracing technologies, we do not change any state in the target processes, not even a single bit of information [2]. This ensures that the target process behaves (almost) exactly the same as when no sampling is performed. This guarantees that the target process's reliability (no unexpected behavior changes or process crashes) and behaviors won't get compromised by the profiling tool. They stay exactly the same as when no one is watching, just like taking an X-ray image of a live animal.
Traditional Application Performance Manager (APM) products may require loading special modules or plugins into the target software, or even crudely patching or injecting machine code or byte code into the target software's executable or process space, severely compromising the stability and correctness of the user's systems.
For these reasons, these tools are safe to use in production environments toanalyze those really hard problems which cannot be easily reproduced offline.
Compatibility
The Lua-land CPU flame graph sampling tool provided by the OpenResty XRay product supports any OpenResty or Nginx binaries, including those compiled by the users themselves with arbitrary build options, optimized or unoptimized, using the GC64 mode or non-GC64 mode in the LuaJIT library, and so on.
OpenResty and Nginx server processes running inside Docker or Kubernetes containers can also be analyzed transparently by OpenResty XRay, and perfect Lua-land CPU flame graphs can be rendered without problems.
Our tool can also analyze console-based user Lua programs run by the resty or luajit command-line utilities.
We also support old Linux operating systems and old kernels, like CentOS 6 with the kernel 2.6.32.
Other types of Lua-land Flame Graphs
As mentioned earlier in this post, flame graphs can be used to visualize any system resources or performance metrics, not just CPU time. Naturally, we have other types of Lua-land flame graphs in our OpenResty XRay product, like off-CPU flame graphs, garbage-collectable (GC) object size and data reference path flame graphs, new GC object allocation flame graphs, Lua coroutine yielding time flame graphs, file I/O latency flame graphs, and many more.
We will cover these different kinds of flame graphs in future articles on ourblog site.
Conclusion
In this article we introduced the very useful visualization, Flame Graphs, for profiling software performance. And we took a deep look at one particular type of flame graphs, Lua-land CPU Flame Graphs, for profiling Lua applications running atop OpenResty and Nginx. We investigated several small Lua sample programs using real flame graphs produced by OpenResty XRay to demonstrate the strength of our sampling tools based on dynamic tracing technologies. Finally, we looked at the performance overhead of sampling and the safety of online use.
Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.
Yichun is one of the earliest advocates and leaders of "open-source technology". He has worked at many internationally renowned tech companies, such as Cloudflare and Yahoo!. He is a pioneer of "edge computing", "dynamic tracing", and "machine coding", with over 22 years of programming and 16 years of open-source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.
OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. And its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.
As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, Perl, etc. He has also authored more than 60 open-source software libraries.
Translations
We provide a Chinese translation for this article on our blog.openresty.com.cn site. We also welcome interested readers to contribute translations in other natural languages, as long as the full article is translated without any omissions. We thank them in advance.
We are hiring
We always welcome talented and enthusiastic engineers to join our team at OpenResty Inc. to explore various open-source software internals and build powerful analyzers and visualizers for real-world applications built atop open-source software. If you are interested, please send your resume to talents@openresty.com. Thank you!
[1] Similarly, any primitive routines belonging to the VM itself won't be blind spots either. So we can profile the VM itself at the same time just fine.
[2] The Linux kernel's uprobes facility would still change some minor memory state inside the target process in a completely safe way (guaranteed by the kernel), and such modifications are completely transparent to the target processes.