A/B testing is a widely used statistical method for comparing the performance of two variations of a product or system, designated version A and version B. To determine which version performs better, one version is shown to a randomly selected subset of users while the other is shown to a different subset. For example, an organization may wish to evaluate whether a redesigned website increases sales compared to the previous design, and conduct an A/B test to measure the conversion rate of each group.

One important statistical measure used in A/B testing is the "p-value". The p-value is the probability of observing a difference between the two groups at least as large as the one measured, assuming there is no real difference in performance and only random chance is at work. A low p-value (conventionally below 0.05) means that the observed difference would be unlikely under chance alone, and the result is called statistically significant. Conversely, a high p-value (above 0.05) means that the difference is not statistically significant and could plausibly have occurred by chance. The 0.05 threshold is a convention in many fields, but other thresholds such as 0.01 or 0.1 may be used depending on the context and the desired level of confidence.
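To make the definition concrete, here is a minimal sketch of how a p-value could be computed for two conversion rates. It assumes a pooled two-proportion z-test, which is one common choice and not necessarily the test any particular tool uses. The function name and the visitor and conversion counts are hypothetical, chosen only for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Assumption: samples are large enough for the normal approximation,
    and the pooled rate is strictly between 0 and 1.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: 200 of 5000 visitors convert on A, 250 of 5000 on B.
p = two_proportion_p_value(200, 5000, 250, 5000)
print(f"p-value: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

With these made-up counts the p-value falls below 0.05, so the difference would be called statistically significant under that convention.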

For instance, let's consider a scenario where an organization wants to evaluate whether a new marketing campaign is more effective at increasing sales than the current campaign. The organization conducts an A/B test by exposing half of its customers to the new campaign and the other half to the current campaign. It measures six different metrics: the number of clicks on the campaign's landing page, the number of purchases made, the average order value, customer satisfaction score, repeat purchase rate, and overall revenue. Using a p-value threshold of 0.05 to determine statistical significance, the organization finds that the new campaign performed significantly better than the current campaign in terms of repeat purchase rate, but that there was no significant difference in the other five metrics.

It's crucial to recognize that when an A/B test measures multiple variables, the likelihood that at least one of them shows a statistically significant difference due to random noise rather than a real difference increases. In the example above, with a p-value threshold of 0.05 and six measured variables, there is a 26.5% (1 - 0.95^6) chance of measuring at least one "significant" difference due to random noise alone in any given experiment, assuming the variables are independent. This probability increases to 51.2% when measuring 14 variables. Given that roughly one out of every four experiments can yield a "significant" difference from random noise alone in this example, can we truly be certain that the new campaign is superior? I am not so sure. To address this pitfall, it is essential to choose and measure only a limited number of key variables that are directly related to the hypothesis of the experiment, as well as to consider a different (typically stricter) p-value threshold depending on the context and desired level of confidence.
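The arithmetic above can be reproduced with a short sketch. It assumes the measured variables are independent, which real metrics such as purchases and revenue usually are not, so the figures are best read as rough estimates. The function names are hypothetical. The Bonferroni correction shown is one standard way to pick a stricter threshold; it is an illustration added here, not something the experiment described above used.

```python
def family_wise_error_rate(alpha, num_metrics):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(alpha, num_metrics):
    """Per-metric threshold that keeps the overall error rate near alpha."""
    return alpha / num_metrics


for k in (1, 6, 14):
    rate = family_wise_error_rate(0.05, k)
    print(f"{k:2d} metrics: {rate:.1%} chance of a false positive")
# Output:
#  1 metrics: 5.0% chance of a false positive
#  6 metrics: 26.5% chance of a false positive
# 14 metrics: 51.2% chance of a false positive

# With six metrics, each one would need p < 0.05 / 6 to count as significant.
print(f"Bonferroni threshold for 6 metrics: {bonferroni_threshold(0.05, 6):.4f}")
```

Under this correction, the repeat purchase rate result in the example would only count as significant if its p-value were below roughly 0.0083 rather than 0.05.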

In conclusion, A/B testing is a powerful tool for evaluating the performance of two variations of a product or system. However, it is essential to understand the limitations of the p-value, the increasing probability of false positive results when measuring multiple variables, and the importance of selecting only relevant variables and choosing an appropriate p-value threshold. By doing so, organizations can make their experiment results more reliable and the actions they base on them more sound.

Great write-up, Alan 👏👏 It's so important to understand the nuances of A/B testing. In my experience it very rarely gives a straight answer, especially as the distance between the A/B variants and the measured metric (purchase, re-purchase, etc.) increases. But testing in a systematic way like this has many other benefits: for example, running many A/B tests around a set of landing pages usually yields many higher-level customer insights, and such tests are an excellent way to introduce large-scale changes on critical pages.

Great insights

The book "How Not to Be Wrong: The Power of Mathematical Thinking" is also a must-have for statistical contextualization and the shenanigans in studies.

Thanks for continuing to get the word out.