What You’re Getting Wrong About A/B Tests (And How to Fix Them)
A/B tests are a powerful statistical tool, commonly used (or abused) for making decisions about everything from button colors to machine learning models. But we've seen A/B tests frequently used incorrectly:
Data analysts can’t reproduce test results because they’re missing data on how users were allocated between the control and test variants.
Without a clear winning condition, the team spends weeks in post-test analysis, arguing about whether or not to productionize an ML model.
False tribal knowledge: “we know a shorter onboarding survey actually hurts conversion”.
This blog post describes a lightweight framework for planning A/B tests, influenced by best practices at places like Square and Stitch Fix. Many data scientists will already know much of this content, so it's written for founders, software engineers, and product managers who do not have strong statistical backgrounds.
How to Write an A/B Test Planning Document
Before you run your test, write down the answers to three key pre-test planning questions:
What metric do I want to move?
How much of a win do I need?
How long will this test need to run for?
What is the exact metric I want to move, and what is its current value?
We’ve talked with multiple teams who cannot measure their primary test metric split out by A/B test variant. Any number of reasons can cause this, from inconsistent metric definitions, to missing allocation assignments, to “was it revenue per user or revenue per session?” We’ve also seen misalignment between PMs and leadership, or across teams, where two groups disagree on which metric a team should target. This lack of clear goals usually leads to post-test analysis paralysis: unable to launch, yet unable to fix and iterate.
Go into your database and calculate your control metric and your test metric, or build a dashboard in your analytics tool of choice BEFORE you launch a test.
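To make that concrete, here's a minimal pandas sketch of the check we're describing. The file and column names (ab_test_events.csv, variant, user_id, converted) are placeholders for whatever your warehouse actually calls them.

```python
# Minimal sketch: can you compute your primary metric split by variant?
# File and column names are hypothetical placeholders.
import pandas as pd

# A hypothetical event-level export from your warehouse
events = pd.read_csv("ab_test_events.csv")

metric_by_variant = (
    events.groupby("variant")
          .agg(users=("user_id", "nunique"),
               conversion_rate=("converted", "mean"))
)
print(metric_by_variant)
```

If you can't produce this table (or the equivalent dashboard) before the test launches, you won't be able to produce it afterwards either.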
What’s the minimum win I need with this metric to make it worth investing in launching this experiment to production? 1%? 5%? 20%?
From our experience in ML, we’ve seen teams launch A/B tests on prototypes that will require a significant productionization effort. When the test doesn’t achieve a big enough win to get leadership buy-in on actually doing the production work, the team either has to maintain the shitty prototype infrastructure or watch the project die on the backlog.
In larger cross-company efforts, we’ve seen changes to recommender systems boost engagement, but lower revenue ever so slightly. You don’t want to deal with an angry ads executive after finishing a test – and you don’t have to!
Negotiate with your team and the teams around you BEFORE launching a test. Ask: “Are we OK with incurring X cost for Y win?”
How long will I need to run this test for?
Plug the values from the first two questions into an A/B test calculator, like the one in Evan’s Awesome A/B Test Tools (leave alpha and beta at their default values). This gives you a sample size per variation.
The definition of your metric will tell you what a “sample” is: a user, a session, a search query. Assuming you’re running a standard A/B test with one control and one test variant, you need to observe twice that sample size in total, across both variants, before you declare victory.
So, go into your analytics tool and see how many calendar days it has historically taken you to gather two times the sample size in users / sessions / queries. That’s how many days you should plan to run your test. Peeking at your results before collecting the appropriate number of samples can lead you to draw false conclusions, discussed in more detail below.
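If you prefer to script this instead of (or as a sanity check on) the web calculator, the same numbers can be reproduced with statsmodels' power analysis. The baseline rate, minimum lift, and daily traffic below are illustrative assumptions; substitute your own answers to the first two questions.

```python
# Sketch of the sample size / duration calculation for a conversion-rate metric.
# Baseline, lift, and traffic numbers are illustrative placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                    # current conversion rate (question 1)
min_lift = 0.20                    # minimum relative win worth shipping (question 2)
target = baseline * (1 + min_lift)

effect_size = proportion_effectsize(baseline, target)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)

daily_sessions = 3_000             # assumed total traffic per day, both variants combined
total_samples = 2 * n_per_variant  # one control variant + one test variant
print(f"Samples per variant: {n_per_variant:,.0f}")
print(f"Planned test length: {total_samples / daily_sessions:.0f} days")
```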
I Don’t Like My Plan
Did you realize that you might have to run your test for four months to draw statistically valid conclusions? Of course you’re upset!
A/B tests can run faster when you have more samples. An easy way to get more samples is to pick a test metric that is measured more frequently. For example, rather than measuring the percentage of sessions that convert to purchase, move up the conversion funnel and measure the percentage of sessions that like at least one item.
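One way to see why this helps: for the same relative lift, a more common event needs fewer samples to reach significance, and those samples accrue from the same session traffic you already have. The 2% and 30% baselines in the sketch below are made-up numbers purely to illustrate the effect.

```python
# Why a higher-in-the-funnel metric reaches significance sooner: for the same
# 20% relative lift, a more common event needs far fewer samples per variant.
# Baseline rates here are illustrative, not real benchmarks.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def samples_per_variant(baseline, relative_lift=0.20, alpha=0.05, power=0.8):
    """Samples needed in each variant to detect the given relative lift."""
    effect = proportion_effectsize(baseline, baseline * (1 + relative_lift))
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha, power=power)

print(f"Sessions converting to purchase (2% baseline): {samples_per_variant(0.02):,.0f} per variant")
print(f"Sessions liking at least one item (30% baseline): {samples_per_variant(0.30):,.0f} per variant")
```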
A/B tests can run faster if you are making bigger swings. If you don’t have enough users to reach significance on a 20% lift in your primary metric within a month, then consider an idea that’s bigger and bolder, one that you think might double your metric.
Also consider that A/B tests might not be the best choice for your launch. For example, if you’re already committed to re-branding your logo, use your A/B testing tools to run a staged rollout and look for bugs, but skip the full-blown statistical analysis.
As a rule of thumb, design tests that can easily reach significance within two weeks. Otherwise, a single test can block your ability to iterate for months.
The Do’s and Don’ts of Looking at Test Results Early
People love looking at A/B test results. We’ve seen many teams who look at test results every day while tests are running, stopping the test and launching at the first sign of victory.
From a practical standpoint, looking at A/B tests early is a great way to find bugs! We recently ran a test where the week-one results looked bad, we investigated, and we found a few critical bugs in a section of untested integration code. Once those were fixed, we restarted the test.
From a math standpoint, peeking at your test results early increases the risk of false positives. This means that the “win” you saw was due to random fluctuations in the data, and your team is vulnerable to drawing false conclusions – like “longer onboarding surveys increase conversion” – that will stay with the company for years.
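You can see the effect with a small simulation. The sketch below (traffic volumes and test length are arbitrary assumptions) runs many A/A experiments where control and test behave identically, so every "significant" result is a false positive; stopping at the first daily peek that looks like a win triggers far more of them than testing once at the planned end date.

```python
# A/A simulation: both variants have the SAME conversion rate, so any
# "significant" result is a false positive. Traffic numbers are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_rate = 0.05        # both control and test convert at 5%
daily_samples = 2_000   # users per variant per day (assumed)
days = 14
n_experiments = 1_000

peeking_fp = 0
final_only_fp = 0
for _ in range(n_experiments):
    control = rng.binomial(1, true_rate, size=(days, daily_samples))
    test = rng.binomial(1, true_rate, size=(days, daily_samples))

    # Team A peeks every day and ships at the first p < 0.05
    stopped_early = False
    for day in range(1, days + 1):
        _, p = stats.ttest_ind(control[:day].ravel(), test[:day].ravel())
        if p < 0.05:
            stopped_early = True
            break
    peeking_fp += stopped_early

    # Team B waits for the planned end date and tests exactly once
    _, p_final = stats.ttest_ind(control.ravel(), test.ravel())
    final_only_fp += p_final < 0.05

print(f"False positive rate with daily peeking: {peeking_fp / n_experiments:.1%}")
print(f"False positive rate testing once at the end: {final_only_fp / n_experiments:.1%}")
```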
Do use early, directional test results to find bugs. Don’t use them to declare victory before the planned end date.
Conclusion
A/B tests are powerful statistical tools designed to help you decide whether or not to launch new features. However, misusing A/B tests erodes their usefulness, leading your team to waste time maintaining features that don’t truly improve the customer experience.
In this blog post, we discussed:
Writing a pre-test planning document that answers three questions:
What metric do I want to move?
How much of a win do I need?
How long will this test need to run for?
When a test can’t reach statistical significance in a month, consider changing the metric, or scrapping the test for a bolder bet
Instead of an A/B test, use a staged rollout for launches like website redesigns, where you want to check results every day to find bugs
Peeking at test results every day leaves you more likely to find false positives, creating false tribal knowledge
Feel free to contact us to provide feedback or ask questions. Thanks for reading!
At Rubber Ducky Labs, we know recommender systems. We’re currently a two-person team building recommender systems for consumer companies, with a decade of experience in both MLOps and recommender systems for fashion tech. Get in touch to talk about getting your in-house AI / ML recommender systems program started – no prior data science experience required!