Fixing Bad Performance with New Relic: 1 of 3

When everything is on fire, where should you throw the first bucket of water? To fix bad performance, you have to profile. Figuring out where to start can be overwhelming, especially when your system performs OK under normal daily loads but collapses when a traffic spike hits. This is part 1 of 3 in this series; see also part 2 and part 3. In this part, we focus on how to measure performance and where systems commonly face performance problems.

Introducing New Relic

Profiling a complex set of online systems has historically been very challenging. The tools available to most developers, such as the Visual Studio Performance Profiler and Java VisualVM, include profilers suitable for measuring the performance of an application running in a single process. They typically do not scale well to the multi-tier, multi-server deployments typical of modern production web applications.

New Relic offers a suite of sampling profilers, server monitoring tools, and performance instrumentation suitable for both production and staging systems. When Healthcare.gov launched and had to be fixed in a hurry, the team in the war room turned to New Relic to get things back on track. You can use the same tool kit, with a dash of free software, to improve your application performance.

The great advantage of systems like this is that they have very low overhead, so you can enable them in production, either on your entire set of production systems or on a subset that gives you a valid sample. Problems with normal daily loads will stand out in the transaction, database, and error-related pages of New Relic APM (Application Performance Monitoring).

However, you won’t be able to solve problems with periodic heavier loads just by looking at these pages under normal traffic. Think of what happens when you try to buy tickets for a popular concert or movie, or register your preschooler for gymnastics classes when space is limited. These sorts of loads might come only once a year, but they are the key times when you have to be available or you will lose the trust of your users and possibly lose money!

Simulating Load – JMeter to the Rescue

JMeter is our choice of performance testing tool. You simulate your scenario by recording a script using its built-in proxy, then edit the script to handle dynamic content. We have some easy-to-follow tutorials that can help you get started. To handle larger load tests economically, you can use AWS in conjunction with jmeter-ec2 or another service.
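Once a JMeter run finishes, a results file is often the first thing worth summarizing. The following is a minimal sketch, not part of our tooling: it assumes the results were saved as CSV with a header row containing the `label`, `elapsed` (milliseconds), and `success` columns, and the function name `summarize_jtl` is our own invention for illustration.

```python
import csv
import io
import math
import statistics
from collections import defaultdict


def summarize_jtl(csv_file):
    """Summarize latency and errors per sampler label from a JMeter CSV results file.

    Assumes a header row with 'label', 'elapsed' (ms) and 'success' columns.
    """
    elapsed_by_label = defaultdict(list)
    errors_by_label = defaultdict(int)
    for row in csv.DictReader(csv_file):
        label = row["label"]
        elapsed_by_label[label].append(int(row["elapsed"]))
        if row["success"].strip().lower() != "true":
            errors_by_label[label] += 1

    summary = {}
    for label, samples in elapsed_by_label.items():
        samples.sort()
        # Nearest-rank 95th percentile: the smallest sample that has at
        # least 95% of all samples at or below it.
        rank = max(1, math.ceil(0.95 * len(samples)))
        summary[label] = {
            "count": len(samples),
            "median_ms": statistics.median(samples),
            "p95_ms": samples[rank - 1],
            "error_rate": errors_by_label[label] / len(samples),
        }
    return summary


if __name__ == "__main__":
    # Tiny made-up sample, only to show the expected input shape.
    sample = io.StringIO(
        "timeStamp,elapsed,label,responseCode,success\n"
        "1,120,home,200,true\n"
        "2,480,home,500,false\n"
    )
    for label, stats in summarize_jtl(sample).items():
        print(label, stats)
```

Medians and high percentiles per request type tend to be more telling than averages, because a handful of very slow requests can hide behind a healthy-looking mean.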

Measure Twice, Cut Once

Part of the trick to performance measurement is devising a test that simulates a realistic load. An intense load test, with requests played back at machine speed, will often overwhelm a poorly performing system. Doing just that much can help you get started, but it is not enough.

You might want to create several different load profiles, for example:

Steady State (soak): simulates the steady state load your system can sustain
Thundering Herd (spike): simulates a load surge that starts quickly and trails off
Ramp to Fail (stress): simulates a load that builds up gradually to an unbearable level that forces your system to fail

Each of these test types will ferret out different types of problems, and they require different strategies and test interpretations.
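To make the three profiles concrete, here is a rough sketch of each one as a function that returns the target number of simulated users at second `t` of the test. The function names, parameters, and the linear shapes are assumptions chosen for illustration; real profiles should be modeled on your own traffic, and you would still need to translate the schedule into your load tool's configuration.

```python
def steady_state(t, duration, users):
    """Soak: hold a constant number of simulated users for the whole test."""
    return users if 0 <= t < duration else 0


def thundering_herd(t, duration, peak_users, rise_seconds):
    """Spike: climb to the peak quickly, then trail off linearly to zero.

    Requires 0 < rise_seconds < duration.
    """
    if t < 0 or t >= duration:
        return 0
    if t < rise_seconds:
        return round(peak_users * t / rise_seconds)
    remaining = (duration - t) / (duration - rise_seconds)
    return round(peak_users * remaining)


def ramp_to_fail(t, step_seconds, users_per_step):
    """Stress: add users in fixed steps with no ceiling.

    There is deliberately no end time: the test stops when the system fails.
    """
    if t < 0:
        return 0
    return (t // step_seconds + 1) * users_per_step


if __name__ == "__main__":
    # Print the spike profile once a minute for a ten-minute test.
    for second in range(0, 600, 60):
        print(second, thundering_herd(second, 600, 1000, 30))
```

Writing the shapes down this way makes it easier to agree on what each test is supposed to prove before anyone spends time scripting it.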

We’ll discuss test design more in Part 2 of this series.

The Usual Performance Suspects

Once you have a test that can simulate the load you expect to get, you can dive into the details of fixing the problems. So many things can drag you down, but we have found some areas that are frequently to blame. These areas all have distinctive features that New Relic APM reports can identify. We recommend that you look at remediating these issues in order:

1. Slow Database Queries: fix the slowest database queries first; that alone may be enough to rescue your application performance. The New Relic reports to use here are the Transaction report and the Database and Slow Queries report.
2. Lack of Caching: cache frequently used information and session data to reduce CPU and database load dramatically. The Transaction report helps here. Doing this well can also help with item 3.
3. Lack of Horizontal Scaling: deploy more servers to deal with the load. Most people think of doing this first, but it often won’t save you if you haven’t fixed your slow database queries and caching issues. Running a series of tests with more servers deployed can help you predict how large a load you will be able to handle. The New Relic server reports come into play here.
4. Synchronous Processing: find places where you can defer work in the critical path of your application (assembling an email confirmation, for example) to reduce the time people wait for requests to complete. This tends to be hard, but it can sometimes yield massive improvements. The APM transaction report and detailed traces can help identify problems in this area.
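As a small illustration of the caching item in the list above, the sketch below keeps recently loaded values in memory for a fixed time so that repeated requests skip the expensive lookup. The `TTLCache` class and its `loader` callback are hypothetical names, and an in-process dictionary is an assumption made for brevity; a multi-server deployment would more likely use a shared cache.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the expensive call is skipped
        value = loader(key)  # miss or expired: do the expensive work once
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


if __name__ == "__main__":
    def slow_lookup(product_id):
        # Stand-in for a slow database query.
        time.sleep(0.2)
        return {"id": product_id, "name": "example"}

    cache = TTLCache(ttl_seconds=30)
    cache.get_or_load(7, slow_lookup)  # slow: goes to the "database"
    cache.get_or_load(7, slow_lookup)  # fast: served from memory
```

The expiry time is the main design decision: a longer one removes more load but serves staler data, so it is worth choosing per kind of data rather than globally.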

We’ll discuss these specific remediation strategies and how to measure that they are working in Part 3 of this series.
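Ahead of Part 3, here is a rough sketch of the synchronous processing idea: the request handler does only the essential work and hands the rest to a background worker. Everything named here (`start_worker`, `place_order`, the task dictionary) is hypothetical, and the in-memory queue is an assumption for illustration only, since it loses pending tasks if the process restarts.

```python
import queue
import threading


def start_worker(task_queue, handler):
    """Run handler(task) on a background thread until a None sentinel arrives."""
    def run():
        while True:
            task = task_queue.get()
            try:
                if task is None:
                    return  # sentinel: shut the worker down
                handler(task)
            finally:
                task_queue.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def place_order(order_id, task_queue):
    """Hypothetical request handler: do the essential work, defer the rest."""
    # Saving the order would happen here; it stays in the critical path.
    # Assembling and sending the confirmation email is deferred instead.
    task_queue.put({"type": "confirmation_email", "order_id": order_id})
    return {"order_id": order_id, "status": "accepted"}


if __name__ == "__main__":
    tasks = queue.Queue()
    worker = start_worker(tasks, lambda task: print("sending email for", task["order_id"]))
    print(place_order(1001, tasks))  # returns without waiting for the email
    tasks.join()     # wait for deferred work (only needed in this demo)
    tasks.put(None)  # stop the worker
    worker.join()
```

The user gets a response as soon as the order is accepted, and the time spent on the email no longer counts against the request.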

Always Verify The Results

If you have a full set of functional regression tests, you can run them to verify that your fixes have not broken other parts of your system. You can build manual regression tests using spreadsheets, but these are tedious to maintain and require lots of manpower to run. We often build an automated suite using our Cucumber-Watir test harness. Depending on the application and technology, we might use another functional QA automation stack, such as Webdriver.io, Protractor, Nightwatch-Cucumber, or Pytest-BDD. See our Quality Assurance blog posts for more details. That way, when you make performance-related fixes, you can run these functional tests at will and have higher assurance that the fixes have not caused other regressions in behavior.

Conclusion

Following this path to performance remediation has yielded good results for multiple customers of ours, and it can work for you too. If you have a good way to model and simulate the loads on your systems, and you can measure what happens inside the systems when they are under load, you can know where to make changes to remediate the problems. You can’t just overload your systems to see where they fail: test design is really important, and we cover how to design valid load tests in part 2. In part 3, we will cover some of the most common ways to remediate problems and how to determine where to start fixing them. If you follow the path we are outlining, you can avoid some naive approaches and reap the benefits of taking a quantitative approach.

Editor’s Note

This article was revised in November 2019 to add forward links to part 2, to add links to more contemporary QA automation stack blog articles, and to clean up some minor copy issues.


Richard Bullington-McGuire

Richard Bullington-McGuire is a Director, Engineering. He is a serial entrepreneur and versatile technologist with experience in software development, system architecture, devops, agile processes, mobile computing, for-profit and non-profit start-up companies, and design. Richard holds both Professional-level AWS certifications. He has more than 25 years of consulting experience, and is a member of both IEEE and ACM.
