To Fix Bad Performance, You Have To Profile
When everything is on fire, where should you throw the first bucket of water? Figuring out where to start can be overwhelming, especially when your system performs OK under normal daily loads but collapses when a traffic spike hits you.
Introducing New Relic
Profiling a complex set of online systems has historically been very challenging. The development tools available to most developers, such as the Visual Studio Performance Profiler and Java VisualVM include profilers suitable for measuring the performance of an application running in a single process. They typically do not scale well in the multi-tier and multi-server deployments typical of modern production web applications.
New Relic offers a suite of sampling profilers, server monitoring tools, and performance instrumentation suitable both for production and staging systems. When Healthcare.gov launched, and had to be fixed in a hurry, the team in the war room turned to New Relic to get things back on track. You can use the same tool kit, with a dash of free software, to improve your application performance.
The great advantage of systems like this is that you can enable them in production, either on your entire set of production systems, or on a subset to get a valid sample, because they have very low overhead. Problems with normal daily loads will stand out in the transaction, database, and error-related APM (Application Performance Monitoring) error pages.
However, problems dealing with periodic heavier loads won’t be soluble by looking at this. Think of what happens when you try to buy tickets for a popular concert or movie, or registering your preschooler for gymnastics classes when there is limited space available. These sorts of loads might come only once a year – but they are the key times when you have to be available or you will lose trust with your users and possibly lose money!
Simulating Load – JMeter to the Rescue
Jmeter is our choice of performance testing tool. You simulate your scenario by recording a script using its built in proxy and edit the script to handle dynamic content. We have some easy to follow tutorials which can help you get started.
Measure Twice, Cut Once
Part of the trick about performance measurement is that you have to devise a test that simulates a realistic load. An intense load test will often overwhelm a poorly performing system with tests played back at machine speeds. Just doing that much can help you get started, but it is not enough.
You might want to create several different load profiles, for example:
- Steady State: simulates the steady state load your system can sustain
- Thundering Herd: simulates a load spike that starts strong and trails off
- Ramp to Fail: simulates a load that builds up slowly to an unbearable level that makes your system fail
Each of these test types will ferret out different types of problems, and they require different strategies and test interpretations.
We’ll discuss test design more in Part 2 of this series.
The Usual Performance Suspects
Once you have a test that can simulate the load you expect to get, you can dive into the details of fixing the problems. So many things can drag you down, but we have found some areas that are frequently to blame. These areas all have distinctive features that New Relic APM reports can identify. We recommend that you look at remediating these issues in order:
- Slow Database Queries: fix the slowest database queries first, it may be enough to rescue your application performance. The New Relic reports to use here are the Transaction report and Database and Slow Queries report.
- Lack of Caching: cache frequently used information and session data, to reduce CPU and database load dramatically. Looking at the Transaction report helps here. Doing this well can help with item #3 also:
- Lack of Horizontal Scaling – deploy more servers to deal with the load. Most people think of doing this first but it won’t save you many times if you haven’t fixed your slow database queries and caching issues. Doing series of tests with more servers deployed can help you predict how large of a load you will be able to handle. The New Relic server reports come into play here.
- Synchronous Processing: find places where you can defer work in the critical path of your application (assembling an email confirmation, for example) to reduce the time people wait for requests to complete. This tends to be hard but it can sometimes yield massive improvements. The APM transaction report and detailed traces can help identify problems in this area.
We’ll discuss these specific remediation strategies and how to measure that they are working in Part 3 of this series.
Always Verify The Results
If you have full set of functional regression tests, you can run these to verify that your fixes have not broken other parts of your system. You can build manual regression tests using spreadsheets, but these are tedious to maintain and require lots of manpower to run. We often build such a suite using our Cucumber-Watir test harness or Nightwatch-Cucumber. That way, if you have made performance-related fixes, and you can run these functional tests at will, you will have higher assurance that the fixes have not caused other regressions in behavior.
Following this path to performance remediation has yielded good results for multiple customers of ours – and it can work for you too. If you have a good way to model and simulate the loads on your systems, and you can measure what happens inside the systems when they are under load, you can know where to make changes to remediate the problems. You can’t just try to overload your systems to see where they fail – test design is really important, we will cover that in Part 2. In part 3, we will cover some of the most common remediations and how to determine where to start fixing performance problems. If you follow the path we are outlining, you can avoid some naive approaches and reap the benefits of taking a quantitative approach.