When everything is on fire, where should you throw the first bucket of water? To fix performance, you have to profile. Figuring out where to start can be overwhelming, especially when your system performs fine under normal daily loads but collapses when a traffic spike hits. This is part 3 of 3 in this series; see part 1 and part 2.
Where To Look First
Once you have a test that can simulate the load you expect to get, you can dive into the details of fixing the problems. So many things can drag you down, but we have found some areas that are frequently to blame. Typically these issues cause increased latency (time to complete a request) and thus limit the throughput of the system according to Little’s Law.
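Little's Law makes the latency-throughput connection concrete. A quick back-of-the-envelope calculation (the worker-pool numbers here are made up for illustration) shows why shaving latency directly raises the traffic you can sustain:

```python
# Little's Law: L = lambda * W, where L is the average number of requests
# in flight (concurrency), lambda is throughput (requests/second), and
# W is average latency (seconds). Rearranged: throughput = concurrency / latency.

def max_throughput(concurrency, avg_latency_s):
    """Throughput a fixed pool of in-flight requests can sustain at a given latency."""
    return concurrency / avg_latency_s

# 100 requests in flight at 500 ms average latency:
print(max_throughput(100, 0.5))    # 200.0 requests/second
# Halve the latency to 250 ms and the same pool sustains double the rate:
print(max_throughput(100, 0.25))   # 400.0 requests/second
```

This is why the remediations below focus on latency first: every millisecond trimmed from the slowest transactions buys throughput for free.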
When you need to handle lots of traffic, you need high throughput. This is where New Relic’s most time-consuming transactions report really shines. Looking at the most time-consuming transactions and database operations is the first place to start. Because the throughput of the system is limited by the slowest transactions, it’s almost always correct to focus on the worst ones in order. That way you maximize the improvement yielded by your remediation efforts.
The system you have may not adhere to well-known system design principles – but by examining and remedying trouble spots, you can re-shape your application to fix the worst issues. We recommend that you look at remediating these four issues in order:
1. Slow Database Queries
Slow database queries are often at the root of performance problems. New Relic’s APM database and slow queries page can help you identify what queries are slow, particularly with respect to queries that run frequently. Another thing to watch for is high network traffic from your database server, which you might be able to identify using the server monitor if you run your database server, or with a database plugin if one is available for your database.
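Once New Relic points you at a slow query, your database's EXPLAIN facility tells you whether it is scanning whole tables. As a minimal, self-contained sketch, here is that workflow using SQLite's `EXPLAIN QUERY PLAN` as a stand-in for your production database's EXPLAIN output (the table and index names are invented for the example):

```python
# Sketch: verifying that an index fixes a slow query. SQLite's EXPLAIN
# QUERY PLAN stands in for your production database's EXPLAIN output.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")

def plan(sql):
    """Return the query plan as a single string (detail is column 3 of each row)."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)   # e.g. "SCAN orders" -- a full table scan, slow under load
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # now a SEARCH using idx_orders_customer
print(before)
print(after)
```

Re-running EXPLAIN after adding the index confirms the fix before you deploy it, rather than waiting for the next load test to find out.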
High CPU Usage as Danger Sign
Seeing high CPU utilization on both the database server and the app server is another sign that you may have a serious query performance problem. If queries are returning multiple megabytes of data to build one page, this is going to burn huge amounts of processor time and invalidate CPU caches on both the database server and the application servers.
2. Lack of Caching
In systems that rely on SQL or object databases, retrieving the same information from the database under load is often very slow. Many e-commerce systems rely on retrieving a list of products or events, and these lists can often be cached. Sometimes you can even cache entire pages using a CDN (content delivery network). Introducing a fast in-memory cache such as Redis or Memcached to buffer frequently retrieved content can have a huge payoff. As you introduce caching, look for a reduction in the total time spent in related transactions. You can also use more advanced caching strategies, such as keeping an estimate of the inventory of frequently ordered items in the cache to speed up an e-commerce shopping cart.
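The usual shape for this is the cache-aside pattern: check the cache first, and only fall through to the database on a miss. A minimal sketch, with a plain dict and TTL standing in for Redis (with redis-py you would use `r.get(key)` / `r.setex(key, ttl, value)` instead; the function and key names are invented for the example):

```python
# Cache-aside sketch: a dict with expiry times stands in for Redis/Memcached.
import time

cache = {}            # key -> (expires_at, value)
CACHE_TTL_S = 60

def get_product_list(category, fetch_from_db):
    key = f"products:{category}"
    entry = cache.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                      # cache hit: skip the database
    value = fetch_from_db(category)          # cache miss: do the slow query
    cache[key] = (time.monotonic() + CACHE_TTL_S, value)
    return value

calls = []
def slow_db_query(category):
    calls.append(category)                   # track how often the DB is hit
    return ["widget", "gadget"]

get_product_list("toys", slow_db_query)      # miss: hits the database once
get_product_list("toys", slow_db_query)      # hit: served from cache, no DB call
```

The key design choice is the TTL: it bounds how stale the cached list can get, trading a little freshness for a large reduction in database load.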
Session State Caching
Another area where the lack of caching can really hurt is session state. Most server-side programming environments now have modules that let you store session state in Redis or Memcached. These stores allow you to scale out horizontally more evenly, because you no longer need to pin requests via session cookie or source IP address affinity in your load balancers. They are also often faster than file-based or SQL database session stores.
3. Inadequate Horizontal Scaling
If you have fixed the worst of the problems related to database queries and caching, but your application is bound by CPU utilization on the app servers or database servers, you may be able to add more servers or virtual machines to spread the load across more application instances.
Scaling Compute Resources
To distribute the traffic, put a load balancer in front of all the instances. This is dramatically easier if you can use cloud computing to provision extra capacity on demand. For example, you could use AWS Auto Scaling Groups with rules that scale out the cluster when average CPU utilization exceeds 40%, and use the AWS Application Load Balancer to spread the traffic among your healthy Auto Scaling group members. Another excellent path is to use containers and a container orchestration system such as Kubernetes or Mesos, or a serverless computing approach that uses containers under the hood, such as AWS Lambda, OpenFaaS, or Apache OpenWhisk, in conjunction with the load balancing features baked into those platforms.
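On AWS, the CPU-based scale-out rule can be expressed as a target tracking policy, which keeps average CPU near a target by adding and removing instances. A hedged sketch with the AWS CLI (group, template, and subnet names are placeholders; your VPC, launch template, and target group must already exist):

```shell
# Sketch (names are placeholders): an Auto Scaling group registered with an
# ALB target group, scaling on average CPU via a target tracking policy.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-template \
  --min-size 2 --max-size 20 \
  --target-group-arns "$TARGET_GROUP_ARN" \
  --vpc-zone-identifier "subnet-aaa,subnet-bbb"

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name keep-cpu-near-40 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
      "TargetValue": 40.0
    }'
```

Target tracking is usually simpler than hand-written step scaling rules: you state the CPU level you want, and AWS computes when to add or remove instances.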
Scaling Database Resources
Sometimes scaling compute resources is not enough – you also have to scale out your databases. Many database architectures support read replicas. Once you have read replicas available, you must alter your application to take advantage of them, using the Read-Write Split pattern. Use the reader database connection for retrieving information that does not need to be written back to the database, such as for reporting or creating lists of information in a user interface. Then when you must write, use the writer connection. One thing to watch for here is that replication lag can make readers see stale information, and you may need to work harder to avoid updating records with stale data. From the user's viewpoint, your system will have become eventually consistent in some cases.
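The Read-Write Split pattern can be as simple as routing queries through two connections. A self-contained sketch, using two SQLite connections to a shared in-memory database as stand-ins for the primary and replica endpoints (in production these would be separate connection strings, e.g. primary vs. replica hostnames, and reads could lag writes):

```python
# Read-Write Split sketch: writes go to the primary, reads go to a replica.
import sqlite3

class SplitDB:
    def __init__(self, writer_dsn, reader_dsn):
        self.writer = sqlite3.connect(writer_dsn, uri=True)
        self.reader = sqlite3.connect(reader_dsn, uri=True)

    def execute_write(self, sql, params=()):
        with self.writer:                    # writes always hit the primary
            return self.writer.execute(sql, params)

    def execute_read(self, sql, params=()):
        # Reads hit the replica; with real replication, results may lag.
        return self.reader.execute(sql, params).fetchall()

# A shared in-memory database lets both "endpoints" see the same data,
# keeping this sketch runnable without a real replica.
dsn = "file:demo?mode=memory&cache=shared"
db = SplitDB(dsn, dsn)
db.execute_write("CREATE TABLE products (name TEXT)")
db.execute_write("INSERT INTO products VALUES (?)", ("widget",))
print(db.execute_read("SELECT name FROM products"))
```

The value of funneling all queries through one routing object is that the read/write decision lives in one place, instead of being scattered across the codebase.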
If you use MySQL or PostgreSQL, a simple and performant way to scale out your databases horizontally is Amazon Aurora with Aurora Auto Scaling. Aurora has the advantage that it minimizes replication lag, usually to a sub-second range, and has super-fast failover in practice, usually under 20 seconds.
4. Over-reliance on Synchronous Processing
Imagine that you are writing an e-commerce web application, and you have a customer with a full shopping cart who is ready to check out and give you their money. Your system might start out fully committing all the database transactions for the order and related inventory in your SQL database, compose and send email to the customer, and do other housekeeping chores.
In many systems, all these operations happen in order, synchronously, before the customer sees the results from hitting the Submit Order button. However, completing all these items could take many seconds, even with a single user checking out. Remember – latency is the enemy of throughput due to Little’s Law.
Introduce transactional asynchronous messaging
If you introduce transactional asynchronous messaging and the Publish / Subscribe pattern, you can pare down the work you do at checkout time to a much smaller set; you might be able to get away with just some level of checking on the credit card before you tell the customer the order is submitted. The rest of the work can happen asynchronously: queue workers can perform it without the customer waiting, at a rate that will not overwhelm your busy database servers, even when a load spike hits.
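The shape of this split can be sketched in a few lines. Here `queue.Queue` and a thread stand in for a durable message broker and its consumers (RabbitMQ, SQS, and so on; the order fields and function names are invented for the example):

```python
# Sketch: the checkout request does only the essential synchronous check,
# then hands the rest of the work to a queue consumed by a background worker.
import queue
import threading

order_events = queue.Queue()
emails_sent = []

def worker():
    while True:
        order = order_events.get()
        if order is None:
            break                            # shutdown signal
        # Slow housekeeping happens here, off the request path:
        emails_sent.append(f"receipt for order {order['id']}")
        order_events.task_done()

threading.Thread(target=worker, daemon=True).start()

def checkout(order):
    assert order["card_ok"]                  # the one check we do synchronously
    order_events.put(order)                  # everything else happens async
    return "order submitted"

print(checkout({"id": 1, "card_ok": True}))  # customer gets a response immediately
order_events.join()                          # (demo only) wait for the worker
```

A real broker adds what `queue.Queue` lacks: the message survives a process crash, and workers can be scaled independently of the web tier and throttled so the database never sees the full spike at once.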
Choosing software and libraries that can support this pattern is its own topic, one I will revisit in a future blog post, but doing the simplest thing possible that matches the message durability and performance constraints you face is important.
Monitor and Test All The Things All The Time
As you introduce changes in each of the four areas above, repeat the performance tests you’ve built with JMeter. You might need to run them hundreds of times over the course of months to verify the improvements. As you do this, monitor the results with New Relic. You should run load tests regularly, before and after you introduce new features (even seemingly unrelated ones), after you make infrastructure changes, before every major release, and on a periodic scheduled basis to guard against regressions.
Use the New Relic Notes feature to record how your system behaves while you load test it, and once you get to the peak day for your system’s performance, record all the relevant metrics, including at the very least throughput, Apdex, average latency and error rate. Tracking and saving the latency and throughput of APIs that your application calls is also really important, as external factors could be causing problems for your application. Using Notes will preserve the snapshot of your critical performance data at high resolution for years to come.
Conclusion
By examining trouble spots in the four key areas – database query performance, caching, horizontal scaling, and synchronous processing that can move to message queues – you can eliminate some of the thorniest performance problems in your systems and survive peak day. Measuring the results every step of the way with New Relic and JMeter tests is essential. If you do this with valid tests that simulate load at or above real peak production traffic with a representative spread of traffic, you can survive what would otherwise be crushing loads.
Richard Bullington-McGuire