New Relic and JMeter Performance Remediation - Part 2

When everything is on fire, where should you throw the first bucket of water? To fix bad performance, you have to profile. Figuring out where to start can be overwhelming, especially when your system performs OK under normal daily loads but collapses when a traffic spike hits. This is part 2 of 3 in this series; see also part 1 and part 3. In this part, we focus on test design and interpreting test results.

Creating Realistic Load Tests

Measuring a realistic load can really help you out of big performance binds. Most simple JMeter load tests ramp up traffic and then hold it at a steady level. This might simulate a steady state traffic load, but often these tests just push the system under test into congestion collapse before the ramp up is even finished. While congestion collapse is usually thought of in terms of network traffic, queueing theory shows that the throughput of server applications is limited in the same way.
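To make the queueing point concrete, here is a minimal Python sketch. It assumes the simplest textbook model, a single-server M/M/1 queue, which is an illustrative assumption and not a description of any particular application; the 100 ms service time is a made-up value. In that model the mean response time is the service time divided by (1 - utilization), so latency climbs steeply as the arrival rate approaches capacity.

```python
import math


def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time (seconds) of an M/M/1 queue.

    arrival_rate is in requests per second, service_time in seconds.
    Returns infinity once the server is saturated (utilization >= 1).
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return math.inf
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    SERVICE_TIME_S = 0.1  # hypothetical: 100 ms of work per request
    for rate in (1, 5, 8, 9, 9.9, 12):
        latency = mm1_response_time(rate, SERVICE_TIME_S)
        print(f"{rate:>5} req/s -> mean response time {latency:.2f} s")
```

Real applications have many queues (threads, connection pools, databases), so the curve will differ in detail, but the shape is a useful mental model when reading load test graphs.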

To craft a test that matches your expected load profile, it is helpful to do some timing of real users going through the site and add delays to your tests. You might need to add delays using one of the more sophisticated timer types, for example, the Gaussian delay timer, and consider whether your load really loops or whether it just builds up to a peak and then dissipates.
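As a rough illustration of what a Gaussian delay does, the Python sketch below samples think times as a constant offset plus normally distributed jitter, which is the general shape of JMeter's Gaussian Random Timer. The function name, the 3 second offset, and the 1 second deviation are hypothetical, and clamping negative delays to zero is an assumption of this sketch.

```python
import random


def gaussian_think_time_ms(offset_ms: float, deviation_ms: float,
                           rng: random.Random) -> float:
    """Sample one think time: constant offset plus Gaussian jitter.

    Negative results are clamped to zero (an assumption of this sketch).
    """
    jitter = rng.gauss(0.0, 1.0) * deviation_ms
    return max(0.0, offset_ms + jitter)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    # Hypothetical timing: users pause about 3 s, give or take 1 s.
    samples = [gaussian_think_time_ms(3000, 1000, rng) for _ in range(5)]
    print([round(s) for s in samples])
```

In practice you would take the offset and deviation from your own timings of real users rather than from guesses like these.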

You also might need to model production traffic and build a statistical sample of the types of requests that happen, and use controllers that allow you to vary the proportion of requests, such as the Throughput Controller or the Switch Controller. For some tests, you should probably also parameterize the data submitted through individual requests, to pass in a more varied data set, using the CSV Data Set Config, for example. Varying your data may also require you to use JMeter’s facilities for extracting variables from responses with regular expressions or CSS selectors. The extracted variables can then be substituted into subsequent requests if required to model a user flow. All of these things contribute to making the load tests as similar to a real production load as possible.
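The idea of varying the proportion of requests can be sketched outside JMeter in a few lines of Python. This is only an illustration of weighted sampling, not how the Throughput Controller is implemented; the request names and weights are placeholders you would replace with a mix measured from your own production traffic.

```python
import random
from collections import Counter

# Hypothetical request mix: names and weights are placeholders, not
# measurements from any real system.
REQUEST_MIX = {"browse": 70, "search": 20, "checkout": 10}


def pick_request(rng: random.Random, mix: dict) -> str:
    """Choose one request type with probability proportional to its weight."""
    names = list(mix)
    weights = list(mix.values())
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so runs are repeatable
    counts = Counter(pick_request(rng, REQUEST_MIX) for _ in range(10_000))
    for name, count in counts.most_common():
        print(f"{name:<10} {count / 10_000:.1%}")
```

Comparing the sampled percentages with what New Relic reports for production is a quick sanity check that the test plan resembles real traffic.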

Putting a realistic load on your server may also necessitate increasing the number of threads, and once you get somewhere above 200 to 500 threads, JMeter on one node may not be able to handle it. Once you reach this point, you may need to distribute your test with BlazeMeter, or use scripts from the jmeter-ec2 project or one of its many derivatives, such as gee, to run a larger test.
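One way to estimate how many threads a test needs is Little's Law for a closed system: concurrent users = throughput x (response time + think time). The Python sketch below applies it; the function name and the sample figures are hypothetical, and the result is a starting estimate to refine with real runs, not a guarantee.

```python
import math


def threads_needed(target_rps: float, response_time_s: float,
                   think_time_s: float) -> int:
    """Estimate concurrent threads via Little's Law: N = X * (R + Z).

    target_rps is the desired throughput in requests per second,
    response_time_s the expected mean response time, and think_time_s
    the mean delay each simulated user waits between requests.
    """
    return math.ceil(target_rps * (response_time_s + think_time_s))


if __name__ == "__main__":
    # Hypothetical target: 50 requests/s, 0.5 s responses, 9.5 s think time.
    print(threads_needed(50, 0.5, 9.5), "threads")
```

If the estimate lands well above what one JMeter node handles comfortably, that is a hint to plan for a distributed run from the start.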

Load Profiles

If your current production load is very high on a regular basis, you might not need to create any JMeter tests to get meaningful analytics from New Relic, as you can just use information from the running systems to guide changes. That’s not the case for many systems, where you need to prepare in advance for a big product launch, a public event, a surge related to a news article, or seasonal load patterns. Different load profiles are good for different situations. Performance testing practitioners recognize a variety of load test types; this article focuses on three load test profiles, related to soak, spike, and stress tests, that we have found particularly useful.

Steady State (Soak)

Delays: Runs with or without delays between requests (running with delays is more realistic)
Looping: Loop threads in test – keep pounding away
Purpose: Drive throughput up to find the sustainable rate and expose weaknesses in code
Look for: The elbow in the scalability plot (the request rate at which throughput growth slows)

A steady-state test is a performance test that ramps up load to a stable point and sustains it. This is the simplest type of test to do with JMeter, as it can be done with just the built-in thread group. With a steady state test, you can see if your application performs under a fixed load without degrading. When done over a long time period at high rates, this is called a soak test.

Thundering Herd (Spike)

Delays: Runs with “real person” delays between requests
Looping: Do not loop threads in test – each thread / user runs once only
Purpose: Find max number of users the system can sustain
Look for: Maximum number of users the system will tolerate with reasonable delays

If your application experiences huge, quick surges in load (a “thundering herd”), you will probably want to run a series of load tests that realistically simulate this behavior: a spike load profile. Systems where a constrained number of items in high demand go on sale on a schedule (such as concert tickets, in-demand classes, or hotel reservations for a popular convention) have this load profile. Running a spike load test can help determine whether your system might have a problem similar to the classic computer science thundering herd problem or cache stampede, where contention for a limited resource limits throughput. If there are 2000 concert tickets expected to be sold in 15 minutes, you could configure 2000 threads, but do not have them loop. In a spike test, the load on the system rapidly increases, then typically falls off as demand is satisfied.
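The arithmetic behind the ticket example can be sketched in Python. The sketch assumes the 2000 single-pass threads are started evenly across a 15-minute ramp-up, which is how a JMeter thread group spaces thread starts; a real herd may arrive far more front-loaded, so treat the even spacing as a simplifying assumption. The function name is hypothetical.

```python
def thread_start_offsets(num_threads: int, ramp_up_s: float) -> list:
    """Start time (seconds from test start) of each thread when starts
    are spread evenly across the ramp-up period."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    interval = ramp_up_s / num_threads
    return [i * interval for i in range(num_threads)]


if __name__ == "__main__":
    users = 2000
    window_s = 15 * 60  # the 15-minute sale window from the example
    offsets = thread_start_offsets(users, window_s)
    print(f"one new user every {offsets[1] - offsets[0]:.2f} s")
    print(f"arrival rate: {users / window_s:.2f} users/s")
```

Shortening the ramp-up while keeping the thread count fixed is a simple way to model a sharper surge.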

Ramp to Fail (Stress)

Delays: Runs with or without delays between requests (running with delays is more realistic)
Looping: Loop threads in test – keep pounding away
Purpose: Drive throughput past the point of failure to find the maximum rate and expose weaknesses in code
Look for: The elbow in the scalability plot (the request rate at which throughput growth slows) and the throughput, in RPM, at which failures start

To survive either a heavy steady state load or a thundering herd load, you need to know where the failure point of your system lies. To find it, you perform a stress test: a test that ramps up load over the entire duration of the test until the system fails to keep up, as shown by a crash in throughput, an elevated error rate, or both. The key question for this sort of test is: what is the maximum throughput the system can take before throughput plateaus or requests start failing?
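If you export per-minute results from a ramp to fail run, the two numbers of interest can be pulled out with a short script. The Python sketch below is one possible approach; the Interval record, the 1% error threshold, and the sample figures are all made up for illustration and are not measurements from any real test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    minute: int         # minutes since the test started
    completed_rpm: int  # responses received in that minute, errors included
    errors: int         # how many of those responses were errors


def peak_interval(intervals: List[Interval]) -> Interval:
    """Return the interval with the highest completed throughput."""
    return max(intervals, key=lambda i: i.completed_rpm)


def first_failing_interval(intervals: List[Interval],
                           max_error_rate: float = 0.01) -> Optional[Interval]:
    """Return the first interval whose error rate exceeds the threshold."""
    for interval in intervals:
        if interval.completed_rpm == 0:
            continue  # nothing completed, so no rate can be computed
        if interval.errors / interval.completed_rpm > max_error_rate:
            return interval
    return None


# Made-up illustrative numbers, not measurements from any real test.
SAMPLE_RUN = [
    Interval(1, 200, 0), Interval(2, 400, 0), Interval(3, 600, 1),
    Interval(4, 800, 2), Interval(5, 900, 5), Interval(6, 850, 60),
    Interval(7, 600, 150),
]

if __name__ == "__main__":
    peak = peak_interval(SAMPLE_RUN)
    failing = first_failing_interval(SAMPLE_RUN)
    print(f"peak throughput: {peak.completed_rpm} rpm at minute {peak.minute}")
    if failing is not None:
        print(f"errors exceed 1% from minute {failing.minute}")
```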

When to Use Each Type of Test

If you have to meet a service level agreement for concurrency, a steady state test can help prove that you can do it, by sustaining a transaction rate lower than the system’s maximum capacity. This test can be run over a short period of time to induce enough load to measure the obvious weak spots, or over a longer time frame as a soak test, where over many hours or even days you can verify that the performance of the application does not degrade over time. However, when people first start creating tests, they often run a stress test unintentionally by exceeding the system’s maximum capacity during the initial ramp up. Results collected after the system suffers congestion collapse are not valid.

When you expect crushing load to suddenly appear, you should apply a rapidly increasing load profile that matches the real-world load curve of expected demand – the thundering herd or spike profile. In systems that have a history of failure under load, the initial spike load tests will probably be stress tests – they will cause system failure. The goal of running a series of thundering herd load tests is typically to get to a spike load test that is not a stress test, one where your system stays up and continues to serve traffic under a crushing load.

To increase the reliability of either a steady state test or a thundering herd test, you should run a “ramp to fail” stress test. This might require several runs to calibrate initially. If you don’t see a failure or throughput decline, you are not pushing enough load into the system to make it fail. If the throughput peak or system failure happens too soon in the test (in the first 25% of the duration) you are pushing too much load into the system and the stress test won’t be valid.
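The calibration rule above can be written down as a small check. This Python sketch looks only at a series of throughput samples taken at even intervals and ignores error rates, which is a simplification; the function name and return strings are hypothetical.

```python
from typing import Sequence


def calibration_verdict(throughput: Sequence[float],
                        early_fraction: float = 0.25) -> str:
    """Classify a ramp to fail run from evenly spaced throughput samples.

    Returns "increase load" if throughput never declined, "reduce load"
    if the peak fell in the first early_fraction of the run, else "ok".
    """
    if not throughput:
        raise ValueError("need at least one throughput sample")
    # Index of the first occurrence of the maximum throughput.
    peak_index = max(range(len(throughput)), key=lambda i: throughput[i])
    if peak_index == len(throughput) - 1:
        return "increase load"  # still climbing when the test ended
    if peak_index / len(throughput) < early_fraction:
        return "reduce load"    # collapsed too early to be useful
    return "ok"


if __name__ == "__main__":
    # Made-up sample series for illustration only.
    print(calibration_verdict([100, 200, 300, 400, 500]))
    print(calibration_verdict([100, 500, 300, 200, 100, 90, 80, 70]))
    print(calibration_verdict([100, 200, 300, 400, 500, 450, 300, 200]))
```

A fuller version would also treat a rising error rate as a failure signal, since a system can fail without its throughput dropping.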

Things to Watch For

Congestion Collapse

Congestion collapse is the enemy of all systems that have to sustain load that exceeds their capacity. Most of the theory and studies of congestion collapse come from computer networking, but the same principles apply to application performance. The graph below shows a production system hitting capacity and failing as a thundering herd load hits it just after 1 PM; note the massive latency spike and red error rate indicators in the New Relic graph. Note also the mesa or volcano shape in the throughput graph: this system failed after the throughput exceeded 400 rpm.

See below for comparable New Relic graphs taken from a JMeter-driven Ramp To Fail stress test on a staging environment of the same system, replicating the congestion collapse first seen in production:

You can see that around 2 PM, the latency spikes and the throughput levels off, even as the total load on the system increases. The graph assumes a mesa-like appearance, similar to the production system graph, and shows the system struggling when it hits about 500 rpm, with the throughput falling even as the input load increases. This set of patterns is indicative of congestion collapse. Stress test loads and real production loads that exceed system capacity often look like this. If the system were able to handle the load, the throughput would have increased linearly along with load until the end of this test.

System Failure

The second most significant failure mode to look for is increased error rates. Even if the throughput and latency stay within acceptable bounds, elevated error rates can show that your system is failing to service requests. The following graphs show a load test from a staging deployment of the same system shown above where a firewall router failure driven by the intense load made the error rate of the system spike. Even if the throughput and latencies of a system are in an acceptable range, if clients are experiencing errors, you may have some remediation ahead of you.

Conclusion

Once you have measured the performance of your system with a variety of test strategies, the next step is remediation, covered in part 3 of this series. As you apply remediations, you should continue testing performance with both the load profile that fits your production system and with profiles designed to stress the boundaries of the system. Applying the steady state, thundering herd, and ramp to fail variants of the soak, spike, and stress tests can really help with this. When the performance of a system is mission critical, you might end up performing dozens or even hundreds of these tests along the way, so understanding which tests to apply is critical.

Richard Bullington-McGuire

Richard Bullington-McGuire is a Director, Engineering. He is a serial entrepreneur and versatile technologist with experience in software development, system architecture, devops, agile processes, mobile computing, for-profit and non-profit start-up companies, and design. Richard holds both Professional-level AWS certifications. He has more than 25 years of consulting experience, and is a member of both IEEE and ACM.
