I have a confession to make: as a growth-oriented product manager, I tend to gravitate towards A/B testing whenever I can. I’m admittedly biased. Being able to measure the impact of our work, and the thousands of dollars in increased revenue (and bonuses!) that flow from it, is hard to beat. What’s not to love? A/B testing lets us see what customers actually do rather than what they say – which can be a pitfall when using other research methods. And of course the thrill of watching the test gains reach significance on the screen is as great a dopamine rush as any social network! And yet, sometimes A/B testing can be an inefficient waste of resources, or worse, lead us down a wrong path away from a better product. How can we know when it’s the right tool to use?
When to Use A/B Testing
Funnel optimization is perfectly suited for A/B testing, especially around short-term metrics such as conversion or immediate engagement (which is often highly correlated with Lifetime Value, or LTV). It fits well for A/B testing because we are able to measure our impact numerically, rather than relying solely on intuition, or on customers’ potentially biased responses in user interviews or surveys. We can’t determine exactly why customers convert more on the credit card page after five steps, rather than three (one of the surprising learnings in some recent growth tests we’ve done for a client), but we do know it would bring double-digit conversion gains and hundreds of thousands of dollars in the bank. Not too shabby.
We can also use A/B testing for metrics that require a greater time lag (such as retention), or those that are harder to reach significance on (such as revenue), as long as we can find a reliable proxy for those metrics. For example, we might believe there is a strong correlation between completing an exercise in the app and becoming a paying subscriber – so we can optimize for exercise completion and assume that revenue will follow. In this case, we will want to come back and sanity-check our test results after the data ages, just to make sure that these data points align as expected: exercise completion leads to paying for a subscription. So while the test only had to run two weeks to reach significance on the proxy, we’d revisit the test several weeks later to make sure it was a winner before bringing the new experience to the rest of the site.
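As a sketch of that sanity check, suppose (hypothetically) we log, per user, whether they completed the exercise and whether they later subscribed; once the data has aged, we can compare subscription rates for completers versus non-completers to confirm the proxy still tracks revenue. All names and numbers below are illustrative, not from any actual test:

```python
# Hypothetical aged event log: (user_id, completed_exercise, subscribed_later)
events = [
    ("u1", True, True), ("u2", True, False), ("u3", True, True),
    ("u4", False, False), ("u5", False, False), ("u6", False, True),
]

def proxy_check(events):
    """Subscription rate among exercise completers vs. non-completers."""
    completers = [sub for _, done, sub in events if done]
    others = [sub for _, done, sub in events if not done]
    rate = lambda group: sum(group) / len(group) if group else 0.0
    return rate(completers), rate(others)

completer_rate, other_rate = proxy_check(events)
# If the completer rate isn't clearly higher, the proxy assumption
# needs a rethink before rolling the winning variation out sitewide.
```

The point is not the arithmetic but the habit: the proxy only buys you a faster test if you later verify that it still predicts the metric you actually care about.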
Typically with A/B testing, we prefer to have only one metric in our hypothesis, but secondary metrics and even additional qualitative data can help us form a more complete picture of customers’ motivations as they interact with our site, and maintain confidence in our results and conclusions. A/B testing focuses on outcomes, but understanding the “why” helps us advance with a more nuanced understanding.
Another great case for A/B testing is to test interest in a new feature before investing heavily in engineering. Customer interviews can definitely help here as well, but nothing speaks more of customer demand for a potential feature than customers actually trying to engage with it. So, short of a crystal ball, a ‘smoke’ A/B test with simple ‘coming soon’ content pages behind the links will help us gauge customers’ desire for these potential features, as well as the potential effects of a new feature on other page metrics. For example, although we won’t know for sure the page Click Through Rate (CTR) benefits resulting from a feature until we implement it, the resulting numbers might indicate that more customers are interested in feature A than feature B, and that the presence of that feature would potentially help CTR as well. It’s a good sign. We’ve now prioritized our feature backlog with evidence.
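As a rough sketch of how to read such a smoke test, a simple two-proportion z-test (normal approximation) can tell us whether the CTR gap between the two ‘coming soon’ links is likely real. The click and view counts below are made up for illustration:

```python
from statistics import NormalDist

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """One-sided z-test: is the CTR of link A genuinely higher than link B's?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = (pooled * (1 - pooled) * (1 / views_a + 1 / views_b)) ** 0.5
    z = (p_a - p_b) / se
    return z, 1 - NormalDist().cdf(z)  # z statistic, one-sided p-value

# Hypothetical smoke-test counts: 10,000 views of each 'coming soon' link
z, p = two_proportion_z(420, 10_000, 310, 10_000)
# A small p-value (say, below 0.05) would suggest interest in
# feature A really is higher than interest in feature B.
```

For small click counts the normal approximation gets shaky, so with thin smoke-test traffic an exact test is the safer choice.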
When to NOT Use A/B Testing: Traffic Considerations
So when would A/B testing not make sense? One obvious, and often overlooked, case is when we simply don’t yet have enough traffic to detect the difference with confidence. There are many online sample size calculators (such as Optimizely’s) that can help determine the sample needed for a given baseline conversion rate, expected gain, and desired confidence/risk level. For example, if my current conversion rate is 10%, and I’m looking to detect a 5% relative difference with 95% confidence, I will need >60k people per variation. If I run the experiment with less traffic, I might draw conclusions from statistically invalid results, which could decrease the quality of the product if implemented sitewide. A key piece of the statistics: the larger the percentage change, the smaller the sample needed to validate it. A creative solution, then, is to test larger changes or a combination of several promising improvements. The downside of testing multiple changes together is that you won’t know which one caused the gains. Sometimes you can live with lower than 95% confidence, especially when there isn’t much to lose. For example, if we are testing several button colors against each other, the cost of implementing a non-winning variation is unlikely to be high; however, if we are planning to build a brand new feature based on the test, we might want to keep our confidence threshold at 90-95%.
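The back-of-the-envelope math behind those calculators can be sketched with the standard two-proportion sample-size formula (normal approximation). The function below assumes 80% statistical power, a common default; online calculators may use different defaults, so treat this as an approximation rather than a replacement for them:

```python
from statistics import NormalDist

def sample_size_per_variation(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per variation to detect a relative lift in conversion.

    p_base:   baseline conversion rate (e.g. 0.10 for 10%)
    rel_lift: minimum relative lift to detect (e.g. 0.05 for 5%)
    """
    p_new = p_base * (1 + rel_lift)
    p_avg = (p_base + p_new) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * (2 * p_avg * (1 - p_avg)) ** 0.5
                 + z_power * (p_base * (1 - p_base)
                              + p_new * (1 - p_new)) ** 0.5) ** 2
    # The (p_new - p_base)**2 denominator is why bigger changes need far
    # smaller samples: doubling the detectable lift roughly quarters n.
    return int(numerator / (p_new - p_base) ** 2) + 1

# 10% baseline, 5% relative lift, 95% confidence:
# tens of thousands of visitors per variation.
n = sample_size_per_variation(0.10, 0.05)
```

At 80% power this lands in the high tens of thousands per variation; raising the power pushes it well past 60k, roughly in line with the figure above.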
In reality, traffic often becomes the scarcest resource, as many experiments, or even teams, compete for it. Since we want to run tests for a predetermined amount of time to reach the significance threshold, we also want to make sure we don’t waste traffic on experiments that won’t reach significance in a reasonable amount of time: ideally one to two full weeks, though tests can run longer if needed. Coordination between teams is critical in this situation, and while some tools provide the ability to run many tests at once, we need to make sure the tests don’t interfere with each other in undesirable ways.
When to NOT Use A/B Testing: Diminishing Returns
Another case for not testing is when we’ve been optimizing the same funnel for a long time and our gains are dwindling. Maybe it’s time to focus our attention elsewhere or look for brand new, bolder ideas. We do expect most tests (around 7 in 10) to fail, but if we’ve been running experiment after experiment for several weeks or months and getting mostly neutral or insignificant results, we might simply have an optimized funnel. Congratulations!
Some significant revamps or brand new ideation sessions with broader teams might bring bolder and fresher ideas, but otherwise we wouldn’t recommend continuing to experiment in the same way after dozens of similar results. Keep in mind that, just like psychologists and consultants, our job is to work ourselves out of the optimization job, and a fully optimized funnel might just mean there are product growth opportunities elsewhere to explore. Trust us: sometimes it’s best to step away from iterating without results on the same funnel. Go find growth opportunities in SEO or a new product feature. You can revisit A/B experimentation once the market or your product changes enough to warrant it.
When to NOT Use A/B Testing: Resource Constraints
If your team lacks sufficient resources to invest in this process, don’t do it. In addition to tool costs, which can range from free for Google Optimize to many thousands of dollars for enterprise licenses of more sophisticated and user-friendly tools, there are many other unseen costs. You need to create and test robust hypotheses, then analyze and socialize the results to arrive at the next hypothesis. Some hypotheses won’t take much time to create and test, but some will, and unless we can ensure proper attention to the entire testing process, our experiments won’t bring us the learnings and results we want.
More specifically, you need a dedicated and consistent team to ensure that your ideas are thought through and based on insights, that your hypotheses are well formed, that experiments are designed and prioritized efficiently, and that learnings are complete and thorough.
It’s important to note that while some tools let you conduct experiments without the help of engineering, it can be a risky process, and a simple human mistake can result in your site going down (I’m speaking from experience…). Because of this, I would recommend sufficient quality assurance processes to make sure that anything your customers see has at the very least been tested on a variety of browsers, and preferably goes through the whole QA (Quality Assurance) cycle.
Dedicating resources and time ensures that tests get enough attention at every stage of the process and that we can be confident in our results.
A/B testing is a powerful way to optimize your product and make sure that you invest in the features and modifications with the highest potential. Yet we want to make sure that we have enough traffic and that we are continuously finding wins, so knowing when and when not to use A/B testing is an important product skill. To do it correctly, make sure you have enough dedicated resources to invest in the process of identifying and testing the best hypotheses, as well as implementing the results to continue improving your product.