Investigating Performance Issues for a SaaS Platform

A popular B2B SaaS organization witnessed a drop in their CSAT score due to performance issues in their compliance management platform. Customers were increasingly complaining about slow application speed and page load times. The company wanted to take action before the issue could spiral into a major problem causing customers to churn. Modus Create’s team of consultants, engineers, and UX professionals ran a 4-week assessment to uncover the root cause of the problem and suggest actionable outcomes.

Our Work Involved

Extensive team interviews
Reviewing source code
Conducting experiments to measure performance
Real-time observation of customer support calls
Reviewing monitoring tools such as New Relic and Splunk
Creating an action plan to implement recommendations

Impact

Setting up a dedicated performance engineering team
Recommendations for improvements in observability
A collection of purpose-built performance dashboards
Action plan to address the root cause of scrolling performance issues
Improvements in UX/UI of the application
Recommendations for better testing
New benchmarks for application performance

“I am experiencing performance issues during scrolling.”

“Sometimes, the data doesn’t get saved.”

“Loading complex data can take up to 10-15 seconds.”

When SaaS companies hit big, they sometimes get on a feature-building spree to delight their customers. Ambitions grow bigger, and so does the desire to explore new product segments.

More often than not, this shifts the focus away from the basics and leads to performance issues in the application. Our client, a global B2B Software as a Service organization, faced a similar challenge. User feedback on their enterprise application was turning increasingly critical. Many users complained about slow loading times across the app, latency in navigation, and other performance issues.

The client wished to uncover the root cause of the problem. They approached Modus Create for a 4-week engagement to investigate the performance issues and suggest actionable outcomes.

Understanding the Problem

The client had identified a few issues before the beginning of the assessment:

Slow loading time of certain features throughout the application
Customers' networks and environments, including SSL inspection, impacting performance
Excessive calls to and from the backend
Gaps in the observability of the application’s performance
High client-side memory usage and known memory leaks

The team designed a custom engagement to investigate and come up with actionable recommendations. The goal was not simply to build on the client’s observations but to test their hypotheses. The focus of the investigation was on:

Diving deep into the customer complaints around the application’s performance
Reviewing how/if customer systems, such as VPNs, firewalls, SSL inspection, and endpoint security agents were impacting their perception of performance
Conducting code reviews and identifying issues that may impact application latency and responsiveness
Interviewing team members and reviewing infrastructure architecture to understand if it was affecting the application performance
Assessing current monitoring, Site Reliability Engineering (SRE), and product optimization processes to understand gaps in observability and testing

Investigation and Recommendations

The team performed an extensive investigation and divided their findings into four key areas - Observability, Performance, UI/UX, and Testing. Note that these aren’t discrete areas but have some overlap with each other. For example, latency errors are related to both observability and performance.

Here are the major findings of the engagement and the recommendations suggested by Modus Create to the client:

1. Existence of Information Silos

Different groups in the client’s organization had their own priorities. As a result, the bigger product picture was getting lost. Despite the client’s efforts at making data-driven decisions, siloing of information was leading to inaccurate strategies.

Recommendation: After reviewing the client’s existing monitoring tools, the team discovered no clear way to aggregate information from different sources. The client was using three different tools - Splunk, New Relic, and Insight. There were blind spots around the product due to the lack of a single source of truth. The team recommended building traceability, logging, and metrics integration into a custom dashboard on New Relic.

[Figure: Example of a custom application deployed to New Relic One]

Consolidation into one toolset for better visibility will help the client drill down into individual components while gaining a 360-degree view of what is happening.

2. Recurring Performance Issues

The client hadn’t turned a blind eye to the performance issues in the application. They had launched several performance-focused initiatives in the past five years. But the client measured performance based on each team’s metrics and there was no active ownership of company-wide performance indicators. Thus, performance issues continued popping up.

Recommendation: Recurrent performance problems in the software are primarily an organizational issue rather than a product deficiency or technical debt. Organizations typically ignore recurrent issues and overlook the root cause in favor of short-term gains. This can lead to team frustration, customer mistrust, and low CSAT scores. The team suggested creating a dedicated performance engineering team to eliminate recurring performance problems. In this new team, everyone will have direct accountability for performance issues.

3. Difficulty in Replicating Customer Environments

Quick bug fixes demand successful replication of customer environments. However, as the client’s customers were of different sizes and complexities, this was a challenge. Limited technical capabilities for capturing metrics such as user flows, size of data/files, and behavior made it extremely difficult to understand each customer's context. This hindered replicating issues in a lab environment.

Recommendation: The team suggested that the client adopt a few tools to help narrow down customer issues and find the right solution. One example is Steps Recorder (from Microsoft), which records the actions a user takes on a computer and assists with troubleshooting.

4. Lack of Customer Nurture Campaigns

A customer issue isn’t resolved the moment the fix is made. You need to update the customer to close the loop. The client lacked automated customer nurture campaigns to demonstrate how the issues were being resolved. This was having a direct impact on the client’s CSAT score.

Recommendation: The client already has a Standard Operating Procedure (SOP) for capturing customer feedback and interviews. To complete the process, the team suggested extending the scope of the SOP to communicate resolved issues to customers in the form of demos from LAN environments.

5. Lack of Tracing

Although the client had multiple tools for tracking and logging, they lacked end-to-end tracing of their application. There was a limited correlation between traces, logs, and metrics to help identify the customer problems. The client also had limited integration with infrastructure metrics and lacked a framework for data collection, tagging, context, logs, etc.

Recommendation: The team suggested standardizing tools for instrumentation and simplifying data collection. It also recommended leveraging tracing data to discover app flow and dependencies. Infrastructure metrics (K8s, AWS, etc.) would give more visibility to the client and help build dashboards based on KPIs and personas.
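As an illustration of what correlating logs with traces can look like in practice, here is a minimal sketch using only the Python standard library. It is not the client's implementation or any vendor's API: the names trace_id_var, TraceIdFilter, build_logger, and handle_request are hypothetical, and a real setup would more likely rely on the instrumentation provided by the chosen monitoring tool. The idea is simply that every log line emitted while handling a request carries the same trace ID, so logs can later be joined with traces and metrics.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical: holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    """Create a logger whose lines all start with the active trace ID."""
    logger = logging.getLogger("app.tracing_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("trace_id=%(trace_id)s level=%(levelname)s msg=%(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace ID if one was passed in, else create one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("document load started")
        logger.info("document load finished")
    finally:
        # Restore the previous value so IDs never leak between requests.
        trace_id_var.reset(token)
    return trace_id


if __name__ == "__main__":
    out = io.StringIO()
    handle_request(build_logger(out), "abc123")
    print(out.getvalue(), end="")
```

With a shared field like this in every log line, a dashboard query can filter on one trace ID and see the logs for a single customer request together.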

Why Was The Application Slow?

The previous section's insights indicate high-level factors leading to performance issues and customer dissatisfaction. The team also highlighted specific factors slowing down the application’s real-time performance.

Page scrolling performance - The documents within the client’s data management platform were rendered as a combination of layered SVGs and DOM elements. This rendering approach is expensive, and combined with the rapid DOM updates that occur as users scroll pages into and out of view, it was the fundamental root cause of the performance problems.
No benchmarks - The client lacked baseline metrics and benchmarks that customer-reported performance issues could be measured against. There was no custom benchmark tool that a customer could run from their own system to collect results and send them to a central dashboard.
Memory leak - To help understand the memory usage of the client’s product, the team compared the same document in Google Docs and in the client’s application. The client’s application held 10x more array objects than the equivalent document in Google Docs.
Ineffective UX feedback - Users are more accepting of delays in the app if they are told that something is happening than if they get no feedback. For example, when uploading a picture, if you see a message “Uploading,” you’ll be willing to wait longer than if there was no message at all. The client’s application did not provide this kind of feedback effectively.
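To make the benchmark gap more concrete, here is a minimal sketch of how timings collected on a customer's system might be compared against an agreed baseline. It is an illustration under stated assumptions, not the tool the team proposed: the function names, the choice of the 95th percentile, and the 10% tolerance are all hypothetical.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_to_baseline(samples_ms, baseline_p95_ms, tolerance=0.10):
    """Summarize load timings and flag them if p95 exceeds the baseline.

    baseline_p95_ms and tolerance are assumed values that a team would
    agree on; they are not figures from the engagement.
    """
    p95 = percentile(samples_ms, 95)
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": p95,
        "baseline_p95_ms": baseline_p95_ms,
        "regressed": p95 > baseline_p95_ms * (1 + tolerance),
    }


if __name__ == "__main__":
    # Made-up timings for one customer session, in milliseconds.
    timings = [100, 120, 110, 130, 900, 105, 115, 125, 135, 140]
    print(compare_to_baseline(timings, baseline_p95_ms=200))
```

A summary like this could be sent to a central dashboard, so that a complaint such as "loading is slow" becomes a number that can be compared against the baseline.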

Impact of the Engagement

Although this was an investigative engagement, the client received not just a list of underlying causes but also specific measures to address them. They now have a roadmap to do quick prototyping to help improve the application’s performance.

Before the engagement, the client believed that the slow performance was a result of network issues. They’re now aware that it is largely due to the rendering of the application’s front end. The investigation also highlighted the need for a dashboard to act as a single source of truth and create a foundation for measuring strategic KPIs.

This client story is a classic example of how seemingly tech-related issues sometimes have their solutions in the organizational structure. With a performance engineering team, the client will now have a dedicated set of professionals to own the entire application's performance.

If your application is facing scalability or performance issues, talk to Modus. For over ten years, our consultants have solved similar problems for both startups and the Fortune 500.
