Talking about A/B testing — an in-depth analysis and understanding
Coming in fresh from Peep Laja’s masterclass on running growth experiments, I was really excited to hear Ton Wesseling’s insights on A/B testing. It feels like everyone in the world A/B tests everything. It is an overused term in marketing, and many teams make mistakes in planning, implementation or data connection.
Ton starts unlike anyone else, with a timeline of A/B testing that lays the foundation. The history takes us back to 1995, and Ton traces the path A/B testing has taken up to 2019.
It was fascinating to see where we came from: log file analysis that compared one week with another, then redirect scripts without cookies, then cookies. After that came a real tool to run experiments on, Google Website Optimizer, followed by drag-and-drop interfaces, frameworks, personalization and AI, and now server-side testing.
Ton then explains the value of A/B testing, which comes down to putting a measurable value on every experiment. A/B testing builds on the Agile way of working toward a success metric. He then explains when we should use A/B testing:
- Deploy: You want to understand the impact of any deployment.
- Research: One thing you can use it for is a conversion signal map, where you leave out certain elements. Here we are looking for elements that are not required rather than for winners. We can also test fly-ins, which probe user motivation. This is research for optimization.
- Optimization: This is like lean deployment. If the change works we deploy it; if not, we don’t.
We then moved on to the starting question for every marketer: do you have enough data to conduct A/B tests?
Ton introduces the ROAR model he created some years back. It not only helps answer this question but also explains the connection between statistical power and significance. ROAR stands for Risk, Optimization, Automation and Re-think.
It was fascinating to understand the logic behind whether to run an A/B test or not. Understanding when and why to run A/B tests adds to the overall analysis. Ton explains that until you have 1,000 conversions, an A/B test is of little value.
The reason is twofold: the time needed to run an A/B test and the uplift required to prove a real winner.
Calculators designed by the experts can do the heavy lifting for you. Some rules of thumb that will help in making a decision:
- Conduct A/B tests once you have more than 1,000 conversions. At 1,000 conversions you need a 15% uplift (the challenger has to outperform the control by 15%).
- A/B tests at 10,000 conversions need a 5% uplift. At this stage, look to hire a complete team.
- Tests shouldn’t run for more than 4 weeks. Run them for whole weeks (2, 3 or 4), not for 2 weeks and 3 days. You need to cover the complete business cycle, and tests beyond the 4-week mark can pollute the data (a concept Peep Laja explained too). This also keeps deployment lean.
- Statistical power is the likelihood that an experiment will detect an effect when there is an effect to be detected.
- We need to understand the relation between power and significance to avoid the cases of false positives and false negatives.
- As a rule of thumb, start with a high power of more than 80% (to avoid false negatives) and a high significance of more than 90% (to avoid false positives).
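To make the rules of thumb above more concrete, here is a minimal Python sketch of the kind of calculation such calculators perform. It is not Ton's calculator: the function name `visitors_per_variant`, the one-sided test and the standard two-proportion normal approximation are my own assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_uplift,
                         power=0.80, significance=0.90):
    """Approximate visitors needed in EACH variant to detect the uplift.

    Assumes a one-sided two-proportion z-test (normal approximation).
    """
    p_control = baseline_rate
    p_challenger = baseline_rate * (1 + relative_uplift)

    # z-scores for the chosen significance (1 - alpha) and power (1 - beta)
    z_alpha = NormalDist().inv_cdf(significance)
    z_beta = NormalDist().inv_cdf(power)

    variance = p_control * (1 - p_control) + p_challenger * (1 - p_challenger)
    difference = p_challenger - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Hypothetical 5% baseline conversion rate, comparing a 15% and a 5% uplift.
print(visitors_per_variant(0.05, 0.15))
print(visitors_per_variant(0.05, 0.05))
```

The smaller the uplift you want to prove, the more visitors (and therefore conversions) you need, which is the logic behind the 1,000 and 10,000 conversion thresholds.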
After the when, let’s cover what we should A/B test. This is more commonly known as goal metrics, and Ton explains a hierarchy that one can keep in mind.
He also explained that goal metrics can conflict within a larger company. One product team’s metric might pull against another team’s, which is where an OEC (Overall Evaluation Criterion) comes in handy. Defining one is a task for very mature companies and needs buy-in from every department. It aims at predicting long-term value or customer lifetime value for the company rather than creating short-term wins.
There is a lot of similarity between science and the way A/B tests are done. One method for this is FACT & ACT: Find something that needs to be fixed, Analyze the findings, Create a challenger and Test the hypothesis. Then we Analyse the results of the control versus the challenger, Combine them with the results from other tests and Tell the whole company. This applies not to a single test but to the overall testing programme.
To derive a hypothesis, we should always know what needs to be optimized and why it needs to be optimized. Hence, consumer behaviour research is very important.
You also cannot A/B test in isolation. To help with this, Ton explains the 6V canvas:
- Value: Determine the overall company values and the focus that delivers the most impact. This is mostly done by CRO specialists and data wizards.
- Versus: Determine who the competitors are and where each one is positioned. What best practices have our competitors implemented? You need to become their user and check their environment changes, their A/B tests and their winners. A lot of tools are available to help with this.
- View: The insights that can be derived from web analytics and web behaviour. This mostly deals with the data and psychology areas. Where do visitors start on the site? What is the flow of those visitors? What’s the behavior on the most important pages? For landing pages, think about where visitors start their journey and how new and existing customers differ. Deep dive into traffic sources: do visitors already have a product in mind? Do they already know the brand? You can then analyse the customer journey and create customer segments.
- Voice: It is really important to understand the voice of the customer. This is collected through surveys and forms, and by speaking with the customer service team or reading chat logs. You can also conduct user research, recruit testers and pre-test your assumptions.
- Verified: What scientific models, research and behaviours are already available? We need to research our way into A/B testing. A lot of scientific research is already available on a particular industry or user segment.
- Validated: What insights were validated in previous experiments?
Further into the planning pillar, Ton talks about hypothesis setting. A hypothesis helps us understand what the problem is, what the proposed solution is and what the predicted outcome can be. This gets everyone on the same page. To set up a concrete hypothesis we follow this guideline:
If I APPLY THIS, then THIS BEHAVIORAL CHANGE will happen (among THIS GROUP), because of THIS REASON.
Finally in the planning stage, we talk about the prioritization of A/B tests. Many models like PIE and ICE help with this. Ton, however, explains a much more robust model, PIPE, which is based on hypothesis x location. Potential, Impact, Power and Ease cover the hypothesis x location.
It’s about coming up with analysed and verified hypotheses and understanding the potential of each hypothesis with the help of the 6V model, which gives us the hypothesis strength. Next we prioritize by page, since some hypotheses will be strong on one page and others on another. Finally, we filter through the PIPE model to assign ranks to the hypotheses.
Ton shares multiple sheets to better understand this, and also introduces the concept of the Minimum Detectable Effect (MDE).
The key to prioritizing hypotheses is to combine hypothesis x location x time. Ton then moves into the execution pillar and starts with some dos and don’ts:
- Have one challenger, not 2 or more
- Always consider implementation costs, but do not feel limited by them.
- You can have more than one change (optimization vs research)
- Mirror design with the hypothesis
- Consider the MDE
He stresses the importance of the ROAR model covered earlier. He also advises against using the WYSIWYG code editor and suggests injecting client-side code into the default as well. You should always QA your A/B test on multiple browsers and devices, check that other page interactions still work, and remove main elements to see if the test still holds.
We then moved to the configuration of the A/B test: pre-testing and post-testing. Ton explains this through Google Optimize, which is a free tool, though many other tools are available.
We should not change the traffic distribution between variants during an experiment based on the conversion rates. This avoids Simpson’s paradox. Also, don’t start a new event while a testing cycle is running. Pre-testing selection will also consider the experiments for really new users. In post-testing, only consider users with a cookie value.
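Simpson's paradox is easier to see with numbers. The small Python sketch below uses figures I invented purely for illustration (they are not from the course): the traffic split changes between two weeks, the challenger B wins in each week, yet it loses once the weeks are pooled.

```python
# Hypothetical (users, conversions) per variant per week, invented for illustration.
weeks = {
    "week 1 (quiet week, B gets 90% of traffic)": {"A": (1000, 20), "B": (9000, 198)},
    "week 2 (sale week, A gets 90% of traffic)": {"A": (9000, 900), "B": (1000, 110)},
}


def rate(users, conversions):
    """Conversion rate for one variant in one week."""
    return conversions / users


def pooled_rate(variant):
    """Conversion rate for one variant over all weeks combined."""
    users = sum(week[variant][0] for week in weeks.values())
    conversions = sum(week[variant][1] for week in weeks.values())
    return conversions / users


for name, week in weeks.items():
    print(name, {v: f"{rate(*week[v]):.1%}" for v in ("A", "B")})
print("pooled", {v: f"{pooled_rate(v):.1%}" for v in ("A", "B")})
```

B looks worse in the pooled numbers only because most of its traffic came from the low-converting week. Keeping the split fixed for the whole test removes this distortion.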
But how do we calculate the length of an A/B test, and can we shorten it?
Ton explains that to calculate the length, we have to consider unique visitors, conversions, power and significance. The length is counted in whole weeks so that it covers a complete business cycle, weekdays and weekends included. You also need to cover the full decision cycle of your users. The rule of thumb is to run the experiment for 1, 2, 3 or 4 weeks. To avoid false positives, we should not try to shorten the experiment.
Avoid peeking at the results while the experiment is running.
Sequential testing plans for the fact that you will peek during the experiment, for instance four times, and that at each peek you need to reach a certain significance level to call a winner. With sequential testing we can peek and complete the experiment earlier.
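The idea of planning your peeks can be sketched with the simplest and most conservative correction: splitting the allowed false-positive rate evenly across the planned looks (a Bonferroni correction). This is my own assumption for illustration, not the method from the course; real sequential designs use less conservative boundaries, and the function names here are hypothetical.

```python
def alpha_per_look(significance=0.90, planned_looks=4):
    """Split the total allowed false-positive rate evenly across the peeks."""
    return (1 - significance) / planned_looks


def can_call_winner(p_value, significance=0.90, planned_looks=4):
    """At any planned peek, only stop early if the stricter threshold is met."""
    return p_value < alpha_per_look(significance, planned_looks)


# With 90% significance and 4 planned peeks, each peek must reach p < 0.025.
print(alpha_per_look())
print(can_call_winner(0.01))  # True: strong enough to stop early
print(can_call_winner(0.04))  # False: keep the test running
```

The price of being allowed to peek is a stricter bar at every look, which is why unplanned peeking at the normal threshold inflates false positives.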
We then move on to the basics of monitoring your A/B tests, where I learned about the multiple factors we should consider while the experiment is running. Important points include the number of visitors (traffic) and sample ratio mismatch (SRM) errors.
We need to listen to users, as that helps in detecting any broken setups. If the setup is faulty or broken, there is an SRM error, or we are losing money because of the test, we should stop the test immediately and make the necessary changes.
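An SRM check can be sketched as a chi-square goodness-of-fit test on the visitor counts per variant. The snippet below is a minimal version for two variants using only the Python standard library; the function name `srm_p_value` and the alert threshold mentioned afterwards are my own assumptions.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """p-value of a chi-square test (1 degree of freedom) on the traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a test configured as a 50/50 split.
print(srm_p_value(5000, 5000))  # 1.0: perfectly balanced
print(srm_p_value(5000, 5500))  # tiny p-value: the split looks broken
```

A very small p-value (say below 0.01, a threshold I am assuming here) means the observed split is unlikely under the configured allocation, so the setup should be investigated before trusting any result.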
Once the test has run and completed, we need to measure the results. Before we start analysing the hypothesis, we need to know the test duration, determine the users and isolate them.
Some of the basic points while analysing: analyse in the analytics tool, avoid sampled data, analyse users and not sessions, and analyse users who have converted rather than total conversions. We can get the data from Google Analytics and analyse it, or we can find the real winner through the calculator that Ton shares.
This will also tell us of any SRM errors. A failed test doesn’t necessarily mean that the hypothesis was incorrect, and if the negative impact is quite small and the variation has a higher measured conversion rate than the default, you can apply the changes. We also need to avoid over-analysing the data to find a winner.
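For readers who want to see what such an analysis does under the hood, here is a minimal sketch of a one-sided two-proportion z-test on users. It is a standard textbook test, not Ton's calculator, and the function name `one_sided_z_test` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(users_a, conv_a, users_b, conv_b):
    """Return (relative uplift of B over A, p-value that B is not better)."""
    rate_a = conv_a / users_a
    rate_b = conv_b / users_b
    # Pooled rate under the assumption that both variants convert equally.
    pooled = (conv_a + conv_b) / (users_a + users_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    p_value = 1 - NormalDist().cdf(z)
    return (rate_b - rate_a) / rate_a, p_value


# Hypothetical result: 10,000 users per variant, 500 vs 600 converting users.
uplift, p_value = one_sided_z_test(10000, 500, 10000, 600)
print(f"uplift {uplift:.1%}, p-value {p_value:.4f}")
```

With a 90% significance level, a p-value below 0.10 would count as a winner. Note that the inputs are converting users, in line with the advice to analyse users rather than sessions or total conversions.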
If you do find a winner, implement it as soon as possible. However, an uplift percentage in the test doesn’t necessarily mean the exact same uplift for the overall traffic after implementation. You also need to understand what caused the win and which segment is the largest contributor. You can use this for new experiments.
After the outcome, how should we present the results? It depends on the outcome you are looking for: if you want to add direct business value, use the Bayesian method, but if you want to learn about user behaviour, use the frequentist model to present a business case.
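The Bayesian presentation usually boils down to one number a business audience understands: the chance that the challenger beats the control. Below is a minimal sketch that estimates it by sampling from Beta posteriors. The uniform prior, the number of draws and the function name `probability_b_beats_a` are my own assumptions, not something from the course.

```python
import random


def probability_b_beats_a(users_a, conv_a, users_b, conv_b,
                          draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions)
        sample_a = rng.betavariate(1 + conv_a, 1 + users_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + users_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


# Hypothetical result: 1,000 users per variant, 100 vs 150 converting users.
print(f"{probability_b_beats_a(1000, 100, 1000, 150):.1%}")
```

Saying "there is an X% chance the challenger is better" is often an easier conversation than explaining a p-value, which is why this framing suits business decisions.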
Ton explains 5 main tagging labels for all the data stored in our databases: 1. Product/service, 2. Customer journey phase, 3. User segments, 4. Which template, 5. Which persuasion template. All this helped in understanding how to present a business case to the user.
Ton explained that optimization is more to do with combining efficiency and effectiveness. He explains that CRO teams need to become a Centre of Excellence. As a company, we need to see the ROI of optimization programmes, and on that basis we can hire more people, increase budgets and create more robust teams that influence not just the results but the behavioural aspect too.
This course was definitely the most intensive so far. It was clearly divided and went into the origins of every topic, making it robust and knowledge-heavy. I will definitely be referring back to this one, but in a fun and collaborative manner. I thoroughly enjoyed the approach and I am looking forward to the next track, covering statistics fundamentals for optimization.